We introduce Generative Neural Machine Translation (GNMT), a latent variable architecture which is designed to model the semantics of the source and target sentences. We modify an encoder-decoder translation model by adding a latent variable as a language agnostic representation which is encouraged to learn the meaning of the sentence. GNMT achieves competitive BLEU scores on pure translation tasks, and is superior when there are missing words in the source sentence. We augment the model to facilitate multilingual translation and semi-supervised learning without adding parameters. This framework significantly reduces overfitting when there is limited paired data available, and is effective for translating between pairs of languages not seen during training.
Fairness in algorithmic decision-making processes is attracting increasing concern. When an algorithm is applied to human-related decision-making, an estimator that solely optimizes its predictive power can learn biases present in the existing data, which motivates the notion of fairness in machine learning. While several different notions are studied in the literature, few studies have examined how these notions affect individuals. We present such a comparison between several policies induced by well-known fairness criteria, including color-blind (CB), demographic parity (DP), and equalized odds (EO). We show that EO is the only criterion among them that removes group-level disparity. Empirical studies on the social welfare and disparity of these policies are conducted.
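As an illustration of two of the criteria named above, the following minimal sketch measures how far a set of binary decisions is from demographic parity and from equalized odds. It uses the standard textbook definitions of DP and EO rather than the abstract's own policy constructions; the function names and the toy data are assumptions made for this example.

```python
from typing import List, Sequence


def positive_rate(decisions: Sequence[int], mask: Sequence[bool]) -> float:
    """Fraction of positive decisions among the individuals selected by mask.

    Assumes the mask selects at least one individual.
    """
    selected: List[int] = [d for d, m in zip(decisions, mask) if m]
    return sum(selected) / len(selected)


def demographic_parity_gap(decisions: Sequence[int], groups: Sequence[int]) -> float:
    """|P(decision=1 | group=0) - P(decision=1 | group=1)|."""
    r0 = positive_rate(decisions, [g == 0 for g in groups])
    r1 = positive_rate(decisions, [g == 1 for g in groups])
    return abs(r0 - r1)


def equalized_odds_gap(
    decisions: Sequence[int], labels: Sequence[int], groups: Sequence[int]
) -> float:
    """Largest between-group gap in P(decision=1 | label=y, group) over y in {0, 1}."""
    gaps = []
    for y in (0, 1):
        r0 = positive_rate(decisions, [g == 0 and l == y for g, l in zip(groups, labels)])
        r1 = positive_rate(decisions, [g == 1 and l == y for g, l in zip(groups, labels)])
        gaps.append(abs(r0 - r1))
    return max(gaps)


# Toy data (hypothetical): a perfect predictor on groups with different base rates.
groups = [0, 0, 0, 0, 1, 1, 1, 1]
labels = [1, 1, 0, 0, 1, 0, 0, 0]
decisions = list(labels)

dp_gap = demographic_parity_gap(decisions, groups)
eo_gap = equalized_odds_gap(decisions, labels, groups)
```

On this toy data the perfect predictor has zero equalized-odds gap but a nonzero demographic-parity gap, because the two groups have different base rates of the positive label.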
We present group equivariant capsule networks, a framework to introduce guaranteed equivariance and invariance properties to the capsule network idea. We restrict pose vectors and learned transformations to be elements of a group, which allows us to prove equivariance of pose vectors and invariance of activations under application of the group law. Requirements are a modified spatial aggregation method for capsules and a generic routing by agreement algorithm with abstract rules, both of which we present in this work. Further, we connect our equivariant capsule networks with work from the field of group convolutional networks, which consist of convolutions that are equivariant under applications of the group law. Through this connection, we are able to provide intuitions of how both methods relate and are able to combine both approaches in one deep neural network architecture, combining the strengths from both fields. The resulting framework allows sparse evaluation of feature maps defined over groups, provides control over specific equivariance and invariance properties, and can use routing by agreement instead of pooling operations. It provides interpretable and equivariant representation vectors as output capsules, which disentangle evidence of object existence from its pose.
We consider the estimation and inference in a system of high-dimensional regression equations allowing for temporal and cross-sectional dependency in covariates and error processes, covering rather general forms of weak dependence. A sequence of large-scale regressions with LASSO is applied to reduce the dimensionality, and an overall penalty level is carefully chosen by a block multiplier bootstrap procedure to account for multiplicity of the equations and dependencies in the data. Correspondingly, oracle properties with a jointly selected tuning parameter are derived. We further provide high-quality de-biased simultaneous inference on the many target parameters of the system. We provide bootstrap consistency results of the test procedure, which are based on a general Bahadur representation for the $Z$-estimators with dependent data. Simulations demonstrate good performance of the proposed inference procedure. Finally, we apply the method to quantify spillover effects of textual sentiment indices in a financial market and to test the connectedness among sectors.
Many data mining tasks cannot be completely addressed by automated processes, such as sentiment analysis and image classification. Crowdsourcing is an effective way to harness the human cognitive ability to process these machine-hard tasks. Thanks to public crowdsourcing platforms, e.g., Amazon Mechanical Turk and CrowdFlower, we can easily involve hundreds of thousands of ordinary workers (i.e., the crowd) to address these machine-hard tasks. In this tutorial, we will survey and synthesize a wide spectrum of existing studies on crowd-powered data mining. We first give an overview of crowdsourcing, and then summarize the fundamental techniques, including quality control, cost control, and latency control, which must be considered in crowdsourced data mining. Next we review crowd-powered data mining operations, including classification, clustering, pattern mining, outlier detection, knowledge base construction and enrichment. Finally, we provide the emerging challenges in crowdsourced data mining.
Deep neural networks, and in particular recurrent networks, are promising candidates to control autonomous agents that interact in real-time with the physical world. However, this requires a seamless integration of temporal features into the network’s architecture. For the training of and inference with recurrent neural networks, they are usually rolled out over time, and different rollouts exist. Conventionally, during inference the layers of a network are computed in a sequential manner resulting in sparse temporal integration of information and long response times. In this study, we present a theoretical framework to describe the set of all rollouts and demonstrate their differences in solving specific tasks. We prove that certain rollouts, also with only skip and no recurrent connections, enable earlier and more frequent responses, and show empirically that these early responses have better performance. The streaming rollout maximizes these properties and, in addition, enables a fully parallel execution of the network reducing the runtime on massively parallel devices. Additionally, we provide an open-source toolbox to design, train, evaluate, and online-interact with streaming rollouts.
We draw attention to an important, yet largely overlooked aspect of evaluating fairness for automated decision making systems—namely risk and welfare considerations. Our proposed family of measures corresponds to the long-established formulations of cardinal social welfare in economics. We come to this proposal by taking the perspective of a rational, risk-averse individual who is going to be subject to algorithmic decision making and is faced with the task of choosing between several algorithmic alternatives behind a Rawlsian veil of ignorance. The convex formulation of our measures allows us to integrate them as a constraint into any convex loss minimization pipeline. Our empirical analysis reveals interesting trade-offs between our proposal and (a) prediction accuracy, (b) group discrimination, and (c) Dwork et al.’s notion of individual fairness. Furthermore and perhaps most importantly, our work provides both theoretical and empirical evidence suggesting that a lower-bound on our measures often leads to bounded inequality in algorithmic outcomes; hence presenting the first computationally feasible mechanism for bounding individual-level (un)fairness.
In the field of machine learning, data understanding is the practice of getting initial insights into unknown datasets. Such knowledge-intensive tasks require a lot of documentation, which is necessary for data scientists to grasp the meaning of the data. Usually, documentation is kept separate from the data in various external documents, diagrams, spreadsheets and tools, which causes considerable lookup overhead. Moreover, other supporting applications are not able to consume and utilize such unstructured data. That is why we propose a methodology that uses a single semantic model that interlinks data with its documentation. Hence, data scientists are able to directly look up the connected information about the data by simply following links. Equally, they can browse the documentation, which always refers to the data. Furthermore, the model can be used by other approaches providing additional support, like searching, comparing, integrating or visualizing data. To showcase our approach we also demonstrate an early prototype.
All it takes to identify the computer programs which are Artificial Intelligence is to give them a test and award AI to those that pass the test. Let us say that the scores they earn at the test will be called IQ. We cannot pinpoint a minimum IQ threshold that a program has to cover in order to be AI; however, we will choose a certain value. Thus, our definition for AI will be any program the IQ of which is above the chosen value. While this idea has already been implemented in [3], here we will revisit this construct in order to introduce certain improvements.
Generative adversarial networks (GANs) aim to generate realistic data from some prior distribution (e.g., Gaussian noise). However, such a prior distribution is often independent of real data and thus may lose semantic information (e.g., geometric structure or content in images) of the data. In practice, the semantic information might be represented by some latent distribution learned from data, which, however, is hard to use for sampling in GANs. In this paper, rather than sampling from the pre-defined prior distribution, we propose a Local Coordinate Coding (LCC) based sampling method to improve GANs. We derive a generalization bound for LCC based GANs and prove that a small dimensional input is sufficient to achieve good generalization. Extensive experiments on various real-world datasets demonstrate the effectiveness of the proposed method.
This paper introduces multivariate Poisson autoregressive models with exogenous covariates (PoARX) for modelling multivariate time series of counts. We obtain conditions for the PoARX process to be stationary and ergodic before proposing a computationally efficient procedure for estimation of parameters by the method of inference functions (IFM) and obtaining asymptotic normality of these estimators. Lastly, we demonstrate an application to count data for the number of people entering and exiting a building, and show how the different aspects of the model combine to produce a strong predictive model. We conclude by suggesting some further areas of application and by listing directions for future work.
In this paper, we propose a new weight initialization method called even initialization for wide and deep nonlinear neural networks with the ReLU activation function. We prove that no poor local minimum exists in the initial loss landscape in the wide and deep nonlinear neural network initialized by the even initialization method that we propose. Specifically, in the initial loss landscape of such a wide and deep ReLU neural network model, the following four statements hold true: 1) the loss function is non-convex and non-concave; 2) every local minimum is a global minimum; 3) every critical point that is not a global minimum is a saddle point; and 4) bad saddle points exist. We also show that the weight values initialized by the even initialization method are contained in those initialized by both of the (often used) standard initialization and He initialization methods.
Visual question answering (VQA) requires joint comprehension of images and natural language questions, where many questions cannot be directly or clearly answered from visual content but require reasoning from structured human knowledge with confirmation from visual content. This paper proposes the visual knowledge memory network (VKMN) to address this issue, which seamlessly incorporates structured human knowledge and deep visual features into memory networks in an end-to-end learning framework. Compared to existing methods that leverage external knowledge to support VQA, this paper places more emphasis on two missing mechanisms. The first is the mechanism for integrating visual contents with knowledge facts. VKMN handles this issue by embedding knowledge triples (subject, relation, target) and deep visual features jointly into the visual knowledge features. The second is the mechanism for handling multiple knowledge facts expanding from question and answer pairs. VKMN stores the joint embedding using a key-value pair structure in the memory networks so that it is easy to handle multiple facts. Experiments show that the proposed method achieves promising results on both VQA v1.0 and v2.0 benchmarks, while outperforming state-of-the-art methods on the knowledge-reasoning related questions.
Encoder-decoder based Sequence to Sequence learning (S2S) has made remarkable progress in recent years. Different network architectures have been used in the encoder/decoder. Among them, Convolutional Neural Networks (CNN) and Self Attention Networks (SAN) are the prominent ones. The two architectures achieve similar performance but use very different ways to encode and decode context: CNNs use convolutional layers to focus on the local connectivity of the sequence, while SANs use self-attention layers to focus on global semantics. In this work we propose Double Path Networks for Sequence to Sequence learning (DPN-S2S), which leverage the advantages of both models by using double path information fusion. During the encoding step, we develop a double path architecture to maintain the information coming from different paths with convolutional layers and self-attention layers separately. To effectively use the encoded context, we develop a cross attention module with gating and use it to automatically pick up the information needed during the decoding step. By deeply integrating the two paths with cross attention, both types of information are combined and well exploited. Experiments show that our proposed method can significantly improve the performance of sequence to sequence learning over state-of-the-art systems.
We propose a method for maximizing a partial area under a receiver operating characteristic (ROC) curve (pAUC) for binary classification tasks. In binary classification tasks, accuracy is the most commonly used measure of classifier performance. In some applications, such as anomaly detection and diagnostic testing, accuracy is not an appropriate measure since prior probabilities are often greatly biased. Although in such cases the pAUC has been utilized as a performance measure, few methods have been proposed for directly maximizing the pAUC. This optimization is achieved by using a scoring function. The conventional approach utilizes a linear function as the scoring function. In contrast, we introduce nonlinear scoring functions for this purpose. Specifically, we present two types of nonlinear scoring functions based on generative models and deep neural networks. We show experimentally that nonlinear scoring functions improve on the conventional methods through an application to the binary classification of real and bogus objects obtained with the Hyper Suprime-Cam on the Subaru telescope.
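To make the performance measure concrete, the sketch below evaluates an empirical pAUC over the false-positive-rate range [0, max_fpr] from scores and binary labels, using the trapezoid rule on the ROC curve. It only computes the measure; it is not the maximization method described above. The function names, the choice of the range starting at zero, and the toy data are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """ROC points (fpr, tpr) from sweeping a decision threshold from high to low.

    Assumes labels are 0/1 and that both classes are present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Items with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def partial_auc(scores: Sequence[float], labels: Sequence[int], max_fpr: float) -> float:
    """Area under the ROC curve restricted to fpr in [0, max_fpr] (unnormalized)."""
    area = 0.0
    pts = roc_points(scores, labels)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (hypothetical): three positives, three negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

full_auc = partial_auc(scores, labels, max_fpr=1.0)
pauc_half = partial_auc(scores, labels, max_fpr=0.5)
```

Setting max_fpr to 1.0 recovers the ordinary AUC, which gives a quick sanity check against a pairwise count of correctly ordered positive-negative pairs.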
The objective of this paper is to design novel multi-layer neural network architectures for multiscale simulations of flows taking into account the observed data and physical modeling concepts. Our approaches use deep learning concepts combined with local multiscale model reduction methodologies to predict flow dynamics. Using reduced-order model concepts is important for constructing robust deep learning architectures since the reduced-order models provide fewer degrees of freedom. Flow dynamics can be thought of as multi-layer networks. More precisely, the solution (e.g., pressures and saturations) at the time instant $n+1$ depends on the solution at the time instant $n$ and input parameters, such as permeability fields, forcing terms, and initial conditions. One can regard the solution as a multi-layer network, where each layer, in general, is a nonlinear forward map and the number of layers relates to the internal time steps. We will rely on rigorous model reduction concepts to define unknowns and connections for each layer. In each layer, our reduced-order models will provide a forward map, which will be modified (‘trained’) using available data. It is critical to use reduced-order models for this purpose, which will identify the regions of influence and the appropriate number of variables. Because of the lack of available data, the training will be supplemented with computational data as needed and with interpolation between data-rich and data-deficient models. We will also use deep learning algorithms to train the elements of the reduced model discrete system. We will present the main ingredients of our approach and numerical results. Numerical results show that using deep learning and multiscale models, we can improve the forward models, which are conditioned to the available data.
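The view of time stepping as a multi-layer network can be sketched with a toy scalar problem: each layer applies the same nonlinear forward map to the state at time instant $n$ to obtain the state at $n+1$, and the number of layers equals the number of time steps. The logistic forward map, the parameter names, and the values below are assumptions chosen only for illustration; they are not the reduced-order flow models discussed above.

```python
from typing import Callable, List


def logistic_step(u: float, rate: float, dt: float) -> float:
    """Toy nonlinear forward map: one explicit Euler step of du/dt = rate * u * (1 - u)."""
    return u + dt * rate * u * (1.0 - u)


def rollout(
    u0: float,
    n_layers: int,
    rate: float,
    dt: float,
    step: Callable[[float, float, float], float] = logistic_step,
) -> List[float]:
    """Compose the forward map n_layers times; layer n maps the state at step n to step n + 1."""
    states = [u0]
    for _ in range(n_layers):
        states.append(step(states[-1], rate, dt))
    return states


# Three time steps correspond to a three-layer network with shared parameters.
states = rollout(u0=0.5, n_layers=3, rate=1.0, dt=0.1)
```

In a trainable version, the parameters of the forward map (here the hypothetical `rate`) would be adjusted so that the rolled-out states match the available data.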
Active learning (AL) aims to enable training high-performance classifiers with low annotation cost by predicting which subset of unlabelled instances would be most beneficial to label. The importance of AL has motivated extensive research, proposing a wide variety of manually designed AL algorithms with diverse theoretical and intuitive motivations. In contrast to this body of research, we propose to treat active learning algorithm design as a meta-learning problem and learn the best criterion from data. We model an active learning algorithm as a deep neural network that takes the base learner state and the unlabelled point set as input and predicts the best point to annotate next. Training this active query policy network with reinforcement learning produces the best non-myopic policy for a given dataset. The key challenge in achieving a general solution to AL then becomes that of learner generalisation, particularly across heterogeneous datasets. We propose a multi-task dataset-embedding approach that allows dataset-agnostic active learners to be trained. Our evaluation shows that AL algorithms trained in this way can directly generalise across diverse problems.