Machine Learning Methods Economists Should Know About

We discuss the relevance of the recent Machine Learning (ML) literature for economics and econometrics. First we discuss the differences in goals, methods and settings between the ML literature and the traditional econometrics and statistics literatures. Then we discuss some specific methods from the machine learning literature that we view as important for empirical researchers in economics. These include supervised learning methods for regression and classification, unsupervised learning methods, as well as matrix completion methods. Finally, we highlight newly developed methods at the intersection of ML and econometrics, methods that typically perform better than either off-the-shelf ML or more traditional econometric methods when applied to particular classes of problems, problems that include causal inference for average treatment effects, optimal policy estimation, and estimation of the counterfactual effect of price changes in consumer choice models.

Truly Batch Apprenticeship Learning with Deep Successor Features

We introduce a novel apprenticeship learning algorithm to learn an expert’s underlying reward structure in off-policy model-free batch settings. Unlike existing methods that require a dynamics model or additional data acquisition for on-policy evaluation, our algorithm requires only the batch data of observed expert behavior. Such settings are common in real-world tasks such as health care, finance, or industrial processes, where accurate simulators do not exist or data acquisition is costly. To address challenges in batch settings, we introduce Deep Successor Feature Networks (DSFN) that estimate feature expectations in an off-policy setting and a transition-regularized imitation network that produces a near-expert initial policy and an efficient feature representation. Our algorithm achieves superior results in batch settings on both control benchmarks and a vital clinical task of sepsis management in the Intensive Care Unit.

Ensemble Methods for Causal Effects in Panel Data Settings

This paper studies a panel data setting where the goal is to estimate causal effects of an intervention by predicting the counterfactual values of outcomes for treated units, had they not received the treatment. Several approaches have been proposed for this problem, including regression methods, synthetic control methods and matrix completion methods. This paper considers an ensemble approach, and shows that it performs better than any of the individual methods in several economic datasets. Matrix completion methods are often given the most weight by the ensemble, but this clearly depends on the setting. We argue that ensemble methods present a fruitful direction for further research in the causal panel data setting.

Knowledge-driven Encode, Retrieve, Paraphrase for Medical Image Report Generation

Generating long and semantically coherent reports to describe medical images poses great challenges: bridging visual and linguistic modalities, incorporating medical domain knowledge, and generating realistic and accurate descriptions. We propose a novel Knowledge-driven Encode, Retrieve, Paraphrase (KERP) approach which reconciles traditional knowledge- and retrieval-based methods with modern learning-based methods for accurate and robust medical report generation. Specifically, KERP decomposes medical report generation into explicit medical abnormality graph learning and subsequent natural language modeling. KERP first employs an Encode module that transforms visual features into a structured abnormality graph by incorporating prior medical knowledge; then a Retrieve module that retrieves text templates based on the detected abnormalities; and lastly, a Paraphrase module that rewrites the templates according to specific cases. The core of KERP is a proposed generic implementation unit, the Graph Transformer (GTR), which dynamically transforms high-level semantics between graph-structured data of multiple domains such as knowledge graphs, images and sequences. Experiments show that the proposed approach generates structured and robust reports supported by accurate abnormality descriptions and explainable attentive regions, achieving state-of-the-art results on two medical report benchmarks, with the best medical abnormality and disease classification accuracy and improved human evaluation performance.

Connecting Language and Knowledge with Heterogeneous Representations for Neural Relation Extraction

Knowledge Bases (KBs) require constant updating to reflect changes to the world they represent. For general purpose KBs, this is often done through Relation Extraction (RE), the task of predicting KB relations expressed in text mentioning entities known to the KB. One way to improve RE is to use KB Embeddings (KBE) for link prediction. However, despite clear connections between RE and KBE, little has been done toward properly unifying these models systematically. We help close the gap with a framework that unifies the learning of RE and KBE models, leading to significant improvements over the state-of-the-art in RE. The code is available at https://github.com/billy-inn/HRERE.

f-VAEGAN-D2: A Feature Generating Framework for Any-Shot Learning

When labeled training data is scarce, a promising data augmentation approach is to generate visual features of unknown classes using their attributes. To learn the class conditional distribution of CNN features, these models rely on pairs of image features and class attributes. Hence, they cannot make use of the abundance of unlabeled data samples. In this paper, we tackle any-shot learning problems, i.e. zero-shot and few-shot, in a unified feature generating framework that operates in both inductive and transductive learning settings. We develop a conditional generative model that combines the strengths of VAEs and GANs and, in addition, learns the marginal feature distribution of unlabeled images via an unconditional discriminator. We empirically show that our model learns highly discriminative CNN features for five datasets, i.e. CUB, SUN, AWA and ImageNet, and establish a new state-of-the-art in any-shot learning, i.e. inductive and transductive (generalized) zero- and few-shot learning settings. We also demonstrate that our learned features are interpretable: we visualize them by inverting them back to the pixel space and we explain them by generating textual arguments of why they are associated with a certain label.

Combining Transfer Learning And Segmentation Information with GANs for Training Data Independent Image Registration

Registration is an important task in automated medical image analysis. Although deep learning (DL) based image registration methods outperform time-consuming conventional approaches, they are heavily dependent on training data and do not generalize well to new image types. We present a DL based approach that can register an image pair which is different from the training images. This is achieved by training generative adversarial networks (GANs) in combination with segmentation information and transfer learning. Experiments on chest X-ray and brain MR images show that our method gives better registration performance than conventional methods.

Capturing the symptoms of malicious code in electronic documents by file’s entropy signal combined with Machine learning

Email cyber-attacks based on malicious documents have become a popular technique in today’s sophisticated attacks. Persistent efforts have been made to detect such attacks, but existing methods still share some common defects: they cannot capture unknown attacks, they incur high resource and time overhead, and they can only be used to detect specific document formats. In this study, a new framework named ESRMD (Entropy signal Reflects the Malicious document) is proposed, which detects malicious documents based on the entropy distribution of the file. In essence, ESRMD is a machine learning classifier. What makes it distinctive is that it extracts global and structural entropy features from the entropy of the documents rather than from the structural data or metadata of the file, giving it the ability to deal with various document formats and to resist parser-confusion and obfuscation attacks. To assess the validity of the model, we conducted extensive experiments on a collected dataset of 10381 samples, which contains malware (51.47%) and benign (48.53%) samples. The results show that our model achieves good performance on the true positive rate, precision and ROC, with values of 96.00%, 96.69% and 99.2% respectively. We also compared ESRMD with some leading antivirus engines and prevalent tools. The results showed that our framework achieves better performance than these engines and tools.
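
As a rough illustration of what an "entropy signal" of a file can look like, the sketch below computes the Shannon entropy of consecutive byte windows and a few summary statistics over that signal. It is only a minimal sketch: the window size, the function names and the choice of summary statistics are assumptions made here for illustration, not the features ESRMD actually uses.

```python
import math
from collections import Counter


def shannon_entropy(chunk):
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    counts = Counter(chunk)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_signal(data, window=256):
    """Entropy of consecutive non-overlapping windows (window size is an assumption)."""
    return [shannon_entropy(data[i:i + window]) for i in range(0, len(data), window)]


def summary_features(signal):
    """Hypothetical global features of the signal: mean, minimum, maximum, variance."""
    if not signal:
        return {"mean": 0.0, "min": 0.0, "max": 0.0, "var": 0.0}
    mean = sum(signal) / len(signal)
    var = sum((s - mean) ** 2 for s in signal) / len(signal)
    return {"mean": mean, "min": min(signal), "max": max(signal), "var": var}


if __name__ == "__main__":
    # One constant window (low entropy) followed by one window with all 256 byte values.
    sample = b"a" * 256 + bytes(range(256))
    signal = entropy_signal(sample)
    print(signal)
    print(summary_features(signal))
```

Features of this kind depend only on the byte distribution, which is why an entropy-based approach does not need to parse a particular document format.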

Posterior-based proposals for speeding up Markov chain Monte Carlo

Markov chain Monte Carlo (MCMC) is widely used for Bayesian inference in models of complex systems. Performance, however, is often unsatisfactory in models with many latent variables due to so-called poor mixing, necessitating development of application-specific implementations. This limits rigorous use of real-world data to inform development and testing of models in applications ranging from statistical genetics to finance. This paper introduces ‘posterior-based proposals’ (PBPs), a new type of MCMC update applicable to a huge class of statistical models (whose conditional dependence structures are represented by directed acyclic graphs). PBPs generate large joint updates in parameter and latent variable space, whilst retaining good acceptance rates (typically 33 percent). Evaluation against standard approaches (Gibbs or Metropolis-Hastings updates) shows performance improvements by a factor of 2 to over 100 for widely varying model types: an individual-based model for disease diagnostic test data, a financial stochastic volatility model and mixed and generalised linear mixed models used in statistical genetics. PBPs are competitive with similarly targeted state-of-the-art approaches such as Hamiltonian MCMC and particle MCMC, and importantly work under scenarios where these approaches do not. PBPs therefore represent an additional general purpose technique that can be usefully applied in a wide variety of contexts.

Knowledge Aware Conversation Generation with Reasoning on Augmented Graph

Two types of knowledge, factoid knowledge from graphs and non-factoid knowledge from unstructured documents, have been studied for knowledge aware open-domain conversation generation, in which edge information in graphs can help generalization of knowledge selectors, and text sentences of non-factoid knowledge can provide rich information for response generation. Fusion of knowledge triples and sentences might yield mutually reinforcing advantages for conversation generation, but this has received little study. To address this challenge, we propose a knowledge aware chatting machine with three components: an augmented knowledge graph containing both factoid and non-factoid knowledge, a knowledge selector, and a response generator. We formulate knowledge selection on the graph as a problem of multi-hop graph reasoning, which is more flexible than previous one-hop knowledge selection models. To fully leverage the long text information that differentiates our graph from others, we improve a state-of-the-art reasoning algorithm with machine reading comprehension technology. We demonstrate that, supported by such unified knowledge and knowledge selection method, our system can generate more appropriate and informative responses than baselines.

MetaPruning: Meta Learning for Automatic Neural Network Channel Pruning

In this paper, we propose a novel meta learning approach for automatic channel pruning of very deep neural networks. We first train a PruningNet, a kind of meta network, which is able to generate weight parameters for any pruned structure given the target network. We use a simple stochastic structure sampling method for training the PruningNet. Then, we apply an evolutionary procedure to search for good-performing pruned networks. The search is highly efficient because the weights are directly generated by the trained PruningNet and we do not need any finetuning. With a single PruningNet trained for the target network, we can search for various Pruned Networks under different constraints with little human participation. We have demonstrated competitive performances on MobileNet V1/V2 networks, up to 9.0/9.9 higher ImageNet accuracy than V1/V2. Compared to the previous state-of-the-art AutoML-based pruning methods, like AMC and NetAdapt, we achieve higher or comparable accuracy under various conditions.

Scalable Model-Based Management of Correlated Dimensional Time Series in ModelarDB

To monitor critical infrastructure, high quality sensors sampled at a high frequency are increasingly installed. However, due to the large amounts of data produced, only simple aggregates are stored. This removes outliers and hides fluctuations that could indicate problems. As a solution we propose compressing time series with dimensions using a model-based method we name Multi-model Group Compression (MMGC). MMGC adaptively compresses groups of correlated time series with dimensions using an extensible set of models within a user-defined error bound (possibly zero). To partition time series into groups, we propose a set of primitives for efficiently describing correlation for data sets of varying sizes. We also propose efficient query processing algorithms for executing multi-dimensional aggregate queries on models instead of data points. Last, we provide an open-source implementation of our methods as extensions to the model-based Time Series Management System (TSMS) ModelarDB. ModelarDB interfaces with the stock versions of Apache Spark and Apache Cassandra and thus can reuse existing infrastructure. Through an evaluation we show that, compared to widely used systems, our extended ModelarDB provides up to 11 times faster ingestion due to high compression, 65 times better compression due to the adaptivity of MMGC, 92 times faster aggregate queries as they are executed on models, and close to linear scalability while also being extensible and supporting online query processing.

Learning-to-Learn Stochastic Gradient Descent with Biased Regularization

We study the problem of learning-to-learn: inferring a learning algorithm that works well on tasks sampled from an unknown distribution. As a class of algorithms we consider Stochastic Gradient Descent on the true risk regularized by the squared Euclidean distance to a bias vector. We present an average excess risk bound for such a learning algorithm. This result quantifies the potential benefit of using a bias vector with respect to the unbiased case. We then address the problem of estimating the bias from a sequence of tasks. We propose a meta-algorithm which incrementally updates the bias as new tasks are observed. The low space and time complexity of this approach makes it appealing in practice. We provide guarantees on the learning ability of the meta-algorithm. A key feature of our results is that, when the number of tasks grows and their variance is relatively small, our learning-to-learn approach has a significant advantage over learning each task in isolation by Stochastic Gradient Descent without a bias term. We report on numerical experiments which demonstrate the effectiveness of our approach.
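
To make the within-task algorithm concrete, the sketch below runs SGD on a squared loss plus a penalty on the squared Euclidean distance to a bias vector. It is a minimal sketch under stated assumptions: the squared loss, the lam/2 scaling of the penalty, the 1/(lam * t) step size and the name `biased_sgd` are illustrative choices, not taken from the paper, and the meta-algorithm that estimates the bias across tasks is not shown.

```python
import random


def biased_sgd(data, bias, lam=1.0, steps=2000, seed=0):
    """SGD on 0.5 * (<w, x> - y)^2 + (lam / 2) * ||w - bias||^2.

    data: list of (x, y) pairs, with x a list of floats and y a float.
    bias: list of floats (the bias vector); also used as the starting point.
    """
    rng = random.Random(seed)
    w = list(bias)
    for t in range(1, steps + 1):
        x, y = data[rng.randrange(len(data))]        # sample one example
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        eta = 1.0 / (lam * t)                        # assumed step-size schedule
        # Gradient = loss gradient + pull toward the bias vector.
        w = [wi - eta * (err * xi + lam * (wi - hi))
             for wi, xi, hi in zip(w, x, bias)]
    return w


if __name__ == "__main__":
    task = [([1.0], 2.0)]
    print(biased_sgd(task, bias=[0.0]))  # bias far from the task solution
    print(biased_sgd(task, bias=[2.0]))  # bias equal to the task solution
```

When the bias already sits at the task's solution, both gradient terms vanish and the iterate does not move, which is the intuition behind the benefit of a well-chosen bias when tasks have small variance.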

On the use of Deep Autoencoders for Efficient Embedded Reinforcement Learning

In autonomous embedded systems, it is often vital to reduce the number of actions taken in the real world and the energy required to learn a policy. Training reinforcement learning agents from high dimensional image representations can be very expensive and time consuming. Autoencoders are deep neural networks used to compress high dimensional data such as pixelated images into small latent representations. This compression model is vital to efficiently learn policies, especially when learning on embedded systems. We have implemented this model on the NVIDIA Jetson TX2 embedded GPU, and evaluated the power consumption, throughput, and energy consumption of the autoencoders for various CPU/GPU core combinations, frequencies, and model parameters. Additionally, we have shown the reconstructions generated by the autoencoder to analyze the quality of the generated compressed representation and also the performance of the reinforcement learning agent. Finally, we have presented an assessment of the viability of training these models on embedded systems and their usefulness in developing autonomous policies. Using autoencoders, we were able to achieve 4-5× improved performance compared to a baseline RL agent with a convolutional feature extractor, while using less than 2W of power.

Towards a framework for the evolution of artificial general intelligence

In this work, a novel framework for the emergence of general intelligence is proposed, where agents evolve through environmental rewards and learn throughout their lifetime without supervision, i.e., self-supervised learning through embodiment. The chosen control mechanism for agents is a biologically plausible neuron model based on spiking neural networks. Network topologies become more complex through evolution, i.e., the topology is not fixed, while the synaptic weights of the networks cannot be inherited, i.e., newborn brains are not trained and have no innate knowledge of the environment. What is subject to the evolutionary process is the network topology, the type of neurons, and the type of learning. This process ensures that controllers that are passed through the generations have the intrinsic ability to learn and adapt during their lifetime in mutable environments. We envision that the described approach may lead to the emergence of the simplest form of artificial general intelligence.

MaxSkew and MultiSkew: Two R Packages for Detecting, Measuring and Removing Multivariate Skewness

Skewness plays a relevant role in several multivariate statistical techniques. Sometimes it is used to recover data features, as in cluster analysis. In other circumstances, skewness impairs the performance of statistical methods, as in Hotelling’s one-sample test. In both cases, there is the need to check the symmetry of the underlying distribution, either by visual inspection or by formal testing. The R packages MaxSkew and MultiSkew address these issues by measuring, testing and removing skewness from multivariate data. Skewness is assessed by the third multivariate cumulant and its functions. The hypothesis of symmetry is tested either nonparametrically, with the bootstrap, or parametrically, under the normality assumption. Skewness is removed or at least alleviated by projecting the data onto appropriate linear subspaces. The use of MaxSkew and MultiSkew is illustrated with the Iris dataset.

Scale-Adaptive Neural Dense Features: Learning via Hierarchical Context Aggregation

How do computers and intelligent agents view the world around them? Feature extraction and representation constitutes one of the basic building blocks towards answering this question. Traditionally, this has been done with carefully engineered hand-crafted techniques such as HOG, SIFT or ORB. However, there is no “one size fits all” approach that satisfies all requirements. In recent years, the rising popularity of deep learning has resulted in a myriad of end-to-end solutions to many computer vision problems. These approaches, while successful, tend to lack scalability and can’t easily exploit information learned by other systems. Instead, we propose SAND features, a dedicated deep learning solution to feature extraction capable of providing hierarchical context information. This is achieved by employing sparse relative labels indicating relationships of similarity/dissimilarity between image locations. The nature of these labels results in an almost infinite set of dissimilar examples to choose from. We demonstrate how the selection of negative examples during training can be used to modify the feature space and vary its properties. To demonstrate the generality of this approach, we apply the proposed features to a multitude of tasks, each requiring different properties. These include disparity estimation, semantic segmentation, self-localisation and SLAM. In all cases, we show how incorporating SAND features results in better or comparable results to the baseline, whilst requiring little to no additional training. Code can be found at: https://…/SAND_features

Dual Graph Attention Networks for Deep Latent Representation of Multifaceted Social Effects in Recommender Systems

Social recommendation leverages social information to solve data sparsity and cold-start problems in traditional collaborative filtering methods. However, most existing models assume that social effects from friend users are static and take the form of constant weights or fixed constraints. To relax this strong assumption, in this paper, we propose dual graph attention networks to collaboratively learn representations for two-fold social effects, where one is modeled by a user-specific attention weight and the other is modeled by a dynamic and context-aware attention weight. We also extend the social effects in the user domain to the item domain, so that information from related items can be leveraged to further alleviate the data sparsity problem. Furthermore, considering that different social effects in the two domains could interact with each other and jointly influence user preferences for items, we propose a new policy-based fusion strategy based on contextual multi-armed bandits to weigh interactions of various social effects. Experiments on one benchmark dataset and a commercial dataset verify the efficacy of the key components in our model. The results show that our model achieves great improvement in recommendation accuracy compared with other state-of-the-art social recommendation methods.

Real-Time Robotic Search using Hierarchical Spatial Point Processes

Aerial robots hold great potential for aiding Search and Rescue (SAR) efforts over large areas. Traditional approaches typically search an area exhaustively, thereby ignoring that the density of victims varies based on predictable factors, such as the terrain, population density and the type of disaster. We present a probabilistic model to automate SAR planning, with explicit minimization of the expected time to discovery. The proposed model is a hierarchical spatial point process with three interacting spatial fields for i) the point patterns of persons in the area, ii) the probability of detecting persons and iii) the probability of injury. This structure allows inclusion of informative priors from e.g. geographic or cell phone traffic data, while falling back to latent Gaussian processes when priors are missing or inaccurate. To solve this problem in real-time, we propose a combination of fast approximate inference using Integrated Nested Laplace Approximation (INLA), and a novel Monte Carlo tree search tailored to the problem. Experiments using data simulated from real world GIS maps show that the framework outperforms traditional search strategies, and finds up to ten times more injured in the crucial first hours.

dpUGC: Learn Differentially Private Representation for User Generated Contents

This paper first proposes a simple yet efficient generalized approach to apply differential privacy to text representation (i.e., word embedding). Based on it, we propose a user-level approach to learn a personalized differentially private word embedding model on user generated contents (UGC). To the best of our knowledge, this is the first work on learning a user-level differentially private word embedding model from text for sharing. The proposed approaches protect the privacy of the individual from re-identification and, in particular, provide a better trade-off between privacy and data utility on UGC data for sharing. The experimental results show that the trained embedding models are applicable to classic text analysis tasks (e.g., regression). Moreover, the proposed approaches of learning differentially private embedding models are both framework- and data-independent, which facilitates deployment and sharing. The source code is available at https://…/dpText.

Estimating the sample mean and standard deviation from commonly reported quantiles in meta-analysis

Researchers increasingly use meta-analysis to synthesize the results of several studies in order to estimate a common effect. When the outcome variable is continuous, standard meta-analytic approaches assume that the primary studies report the sample mean and standard deviation of the outcome. However, when the outcome is skewed, authors sometimes summarize the data by reporting the sample median and one or both of (i) the minimum and maximum values and (ii) the first and third quartiles, but do not report the mean or standard deviation. To include these studies in meta-analysis, several methods have been developed to estimate the sample mean and standard deviation from the reported summary data. A major limitation of these widely used methods is that they assume that the outcome distribution is normal, which is unlikely to be tenable for studies reporting medians. We propose two novel approaches to estimate the sample mean and standard deviation when data are suspected to be non-normal. Our simulation results and empirical assessments show that the proposed methods often perform better than the existing methods when applied to non-normal data.
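
For context, the sketch below shows what a normal-based estimator of the kind criticized above can look like when a study reports the first quartile, median, third quartile and sample size. It is a minimal sketch, not the methods proposed in the paper: the mean is taken as the average of the three quartiles and the standard deviation rescales the interquartile range by a normal quantile. These formulas are written from memory of the normal-based literature and should be treated as an assumption to check against the original sources; the function name is hypothetical.

```python
from statistics import NormalDist


def mean_sd_from_quartiles(q1, median, q3, n):
    """Normal-based estimates of the sample mean and SD from quartiles and sample size n.

    Assumed formulas (verify before use):
      mean ~ (q1 + median + q3) / 3
      sd   ~ (q3 - q1) / (2 * Phi^{-1}((0.75 * n - 0.125) / (n + 0.25)))
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    mean = (q1 + median + q3) / 3.0
    z = NormalDist().inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    sd = (q3 - q1) / (2.0 * z)
    return mean, sd


if __name__ == "__main__":
    # Quartiles of a skewed outcome: the median is not centered between q1 and q3.
    print(mean_sd_from_quartiles(q1=2.0, median=3.0, q3=7.0, n=50))
```

Because the rescaling constant comes from the normal distribution, the standard deviation estimate can be poor for skewed outcomes, which is the limitation the abstract points to.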