DeepGini
Deep neural network (DNN) based systems have been deployed to assist various tasks, including many safety-critical scenarios such as autonomous driving and medical image diagnostics. Despite the fantastic accuracy of DNN-based systems on well-defined tasks, these systems can also exhibit incorrect behaviors that lead to severe accidents and losses. Therefore, beyond conventional accuracy-based evaluation, testing methods that help developers detect incorrect behaviors at an early stage are critical for the quality assurance of these systems. However, because an automated oracle is often unavailable, testing a DNN-based system usually requires prohibitively expensive human effort to label the test data. In this paper, to reduce the effort of labeling test data for DNN-based systems, we propose DeepGini, a test prioritization technique that assists developers in identifying the tests that can reveal incorrect behaviors. DeepGini is designed from a statistical perspective of DNNs, which allows us to transform the problem of measuring the likelihood of misclassification into the problem of measuring the impurity of a data set. To validate our technique, we conduct an extensive empirical study on four popular datasets. The experimental results show that DeepGini outperforms neuron-coverage-based test prioritization in terms of both efficacy and efficiency. …
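The impurity idea behind DeepGini can be sketched in a few lines: score each test input by the Gini impurity of the model's softmax output (1 minus the sum of squared class probabilities), then label tests in descending score order, since near-uniform outputs suggest likely misclassification. A minimal sketch (function names are ours, not from the paper):

```python
import numpy as np

def gini_score(probs):
    """Gini impurity of a softmax output: 1 - sum_i p_i^2.
    Higher scores indicate less confident predictions."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - np.sum(probs ** 2, axis=-1)

def prioritize(softmax_outputs):
    """Return test indices sorted by descending Gini score,
    i.e., most-likely-misclassified tests first."""
    return np.argsort(-gini_score(softmax_outputs))

# Toy example: softmax outputs of three test inputs over 3 classes.
outputs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction -> low impurity
    [0.34, 0.33, 0.33],  # near-uniform -> high impurity
    [0.70, 0.20, 0.10],
])
order = prioritize(outputs)  # the near-uniform input comes first
```

A confident one-hot output scores 0, while a uniform output over C classes scores 1 - 1/C, the maximum.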

Graph-Guided Fused LASSO (GFLASSO)
Let X be a matrix of size n × p, with n observations and p predictors, and Y a matrix of size n × k, with the same n observations and k responses (say, 1390 distinct electronics purchase records in 73 countries, used to predict the ratings of 50 Netflix productions over all 73 countries). Models well suited to pairs of high-dimensional datasets include orthogonal two-way Partial Least Squares (O2PLS), Canonical Correlation Analysis (CCA) and Co-Inertia Analysis (CIA), all of which involve matrix decomposition. Moreover, because these models are based on latent variables (that is, projections of the original predictors), their computational efficiency comes at the cost of interpretability. This trade-off does not always pay off, however, and it can be avoided by directly predicting the k individual responses from selected features in X, in a unified regression framework that accounts for the relationships among the responses. Mathematically, the GFLASSO borrows the regularization of the LASSO discussed above and builds the model on the graph dependency structure underlying Y, as quantified by the k × k correlation matrix (that is, the ‘strength of association’ you read about earlier). As a result, similar (or dissimilar) responses will be explained by a similar (or dissimilar) subset of selected predictors. …

Stochastic Continuous Greedy++ (SCG++)
In this paper, we develop Stochastic Continuous Greedy++ (SCG++), the first efficient variant of a conditional gradient method for maximizing a continuous submodular function subject to a convex constraint. Concretely, for a monotone and continuous DR-submodular function, SCG++ achieves a tight $[(1-1/e)\text{OPT} -\epsilon]$ solution while using $O(1/\epsilon^2)$ stochastic oracle queries and $O(1/\epsilon)$ calls to the linear optimization oracle. The best previously known algorithms either achieve a suboptimal $[(1/2)\text{OPT} -\epsilon]$ solution with $O(1/\epsilon^2)$ stochastic gradients or the tight $[(1-1/e)\text{OPT} -\epsilon]$ solution with suboptimal $O(1/\epsilon^3)$ stochastic gradients. SCG++ thus enjoys optimality in terms of both the approximation guarantee and the number of stochastic oracle queries. Our novel variance reduction method naturally extends to stochastic convex minimization. More precisely, we develop Stochastic Frank-Wolfe++ (SFW++), which achieves an $\epsilon$-approximate optimum with only $O(1/\epsilon)$ calls to the linear optimization oracle while using $O(1/\epsilon^2)$ stochastic oracle queries in total. Therefore, SFW++ is the first efficient projection-free algorithm that achieves the optimum complexity $O(1/\epsilon^2)$ in terms of stochastic oracle queries. …
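The conditional-gradient template underlying these methods is easy to sketch: starting from zero, repeatedly build a (noise-damped) gradient estimate, ask a linear optimization oracle for the best feasible direction, and move a 1/T step along it. The sketch below is the plain stochastic continuous greedy loop with simple momentum averaging, not the SCG++ variance-reduction scheme, on a toy monotone DR-submodular quadratic over the box $[0,1]^n$ (all names and the toy function are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy monotone DR-submodular function f(x) = w.x - 0.5 x'Ax with A >= 0
# entrywise: the Hessian -A is entrywise nonpositive, so f is continuous
# DR-submodular; w_i >= sum_j A_ij keeps the gradient nonnegative on [0,1]^2.
A = np.array([[0.2, 0.1], [0.1, 0.3]])
w = np.array([1.0, 1.0])

def stochastic_grad(x):
    """Unbiased gradient estimate: exact gradient plus zero-mean noise."""
    return w - A @ x + rng.normal(scale=0.01, size=x.shape)

def linear_oracle(g):
    """argmax_{v in [0,1]^n} <g, v> for the box constraint."""
    return (g > 0).astype(float)

def continuous_greedy(T=100):
    """Plain stochastic continuous greedy sketch (step size 1/T)."""
    x = np.zeros_like(w)
    g_avg = np.zeros_like(w)
    for t in range(1, T + 1):
        rho = 1.0 / t  # momentum averaging damps the stochastic noise
        g_avg = (1 - rho) * g_avg + rho * stochastic_grad(x)
        x = x + linear_oracle(g_avg) / T
    return x

x_hat = continuous_greedy()  # converges toward the all-ones maximizer here
```

SCG++ replaces the simple averaging above with a variance-reduced gradient estimator, which is what cuts the stochastic query complexity from $O(1/\epsilon^3)$ to $O(1/\epsilon^2)$.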

Datasheets for Datasets
Currently there is no standard way to identify how a dataset was created, and what characteristics, motivations, and potential skews it represents. To begin to address this issue, we propose the concept of a datasheet for datasets, a short document to accompany public datasets, commercial APIs, and pretrained models. The goal of this proposal is to enable better communication between dataset creators and users, and to help the AI community move toward greater transparency and accountability. By analogy, in computer hardware, it has become industry standard to accompany everything from the simplest components (e.g., resistors) to the most complex microprocessor chips with datasheets detailing standard operating characteristics, test results, recommended usage, and other information. We outline some of the questions a datasheet for datasets should answer. These questions focus on when, where, and how the training data was gathered, its recommended use cases, and, in the case of human-centric datasets, information regarding the subjects’ demographics and consent as applicable. We develop prototypes of datasheets for two well-known datasets: Labeled Faces in the Wild~\cite{lfw} and the Pang \& Lee Polarity Dataset~\cite{polarity}. …