The ability to learn disentangled representations that split underlying sources of variation in high-dimensional, unstructured data is of central importance for data-efficient and robust use of neural networks. Various approaches aiming towards this goal have been proposed recently; validating existing work is hence a crucial task to guide further development. Previous validation methods focused on shared information between generative factors and learned features. The effects of rare events or cumulative influences from multiple factors on encodings, however, remain uncaptured. Our experiments show that this already becomes noticeable in a simple, noise-free dataset. This is why we introduce the interventional robustness score, which provides a quantitative evaluation of robustness in learned representations with respect to interventions on generative factors and changing nuisance factors. We show how this score can be estimated from labeled observational data that may be confounded, and further provide an efficient algorithm that scales linearly in the dataset size. The benefits of our causally motivated framework are illustrated in extensive experiments.
While convolutional neural networks (CNNs) have recently made great strides in supervised classification of data structured on a grid (e.g. images composed of pixel grids), in several interesting datasets, the relations between features can be better represented as a general graph instead of a regular grid. Although recent algorithms that adapt CNNs to graphs have shown promising results, they mostly neglect learning explicit operations for edge features while focusing on vertex features alone. We propose new formulations for convolutional, pooling, and fully connected layers for neural networks that make more comprehensive use of the information available in multi-dimensional graphs. Using these layers led to an improvement in classification accuracy over the state-of-the-art methods on benchmark graph datasets.
The ability to customize a trained Deep Neural Network (DNN) locally using user-specific data may greatly enhance user experiences, reduce development costs, and protect users’ privacy. In this work, we propose a novel Mixture of Experts (MOE) approach to accomplish this goal. This architecture comprises a Global Expert (GE), a Local Expert (LE) and a Gating Network (GN). The GE is a trained DNN developed on a large training dataset representative of many potential users. After deployment on an embedded edge device, the GE will be subject to customized, user-specific data (e.g., accent in speech) and its performance may suffer. This problem may be alleviated by training a local DNN (the local expert, LE) on a small customized training dataset to correct the errors made by the GE. A gating network is then trained to determine whether an incoming data sample should be handled by the GE or the LE. Since the customized dataset is in general very small, the cost of training the LE and GN would be much lower than that of re-training the GE. The training of the LE and GN can thus be performed on the local device, properly protecting the privacy of the customized training data. In this work, we developed a prototype MOE architecture for a handwritten alphanumeric character recognition task. We use EMNIST as the generic dataset, LeNet5 as the GE, and the handwriting of 10 users as the customized dataset. We show that with the LE and GN, the classification accuracy is significantly enhanced over the customized dataset with almost no degradation of accuracy over the generic dataset. In terms of energy and network size, the overhead of the LE and GN is around 2.5% compared to those of the GE.
Data of a sequential nature arise in many application domains in the form of, e.g., textual data, DNA sequences, and software execution traces. Different research disciplines have developed methods to learn sequence models from such datasets: (i) in the machine learning field, methods such as (hidden) Markov models and recurrent neural networks have been developed and successfully applied to a wide range of tasks, (ii) in process mining, process discovery techniques aim to generate human-interpretable descriptive models, and (iii) in the grammar inference field, the focus is on finding descriptive models in the form of formal grammars. Despite their different focuses, these fields share a common goal: learning a model that accurately describes the behavior in the underlying data. Those sequence models are generative, i.e., they can predict what elements are likely to occur after a given unfinished sequence. So far, these fields have developed mainly in isolation from each other and no comparison exists. This paper presents an interdisciplinary experimental evaluation that compares sequence modeling techniques on the task of next-element prediction on four real-life sequence datasets. The results indicate that machine learning techniques, which generally do not aim at interpretability, outperform in terms of accuracy the techniques from the process mining and grammar inference fields that aim to yield interpretable models.
In 2002, the UCR time series classification archive was first released with sixteen datasets. It gradually expanded, until 2015 when it increased in size from 45 datasets to 85 datasets. In October 2018 more datasets were added, bringing the total to 128. The new archive contains a wide range of problems, including variable length series, but it still only contains univariate time series classification problems. One of the motivations for introducing the archive was to encourage researchers to perform a more rigorous evaluation of newly proposed time series classification (TSC) algorithms. It has worked: most recent research into TSC uses all 85 datasets to evaluate algorithmic advances. Research into multivariate time series classification, where more than one series are associated with each class label, is in a position where univariate TSC research was a decade ago. Algorithms are evaluated using very few datasets and claims of improvement are not based on statistical comparisons. We aim to address this problem by forming the first iteration of the MTSC archive, to be hosted at the website http://www.timeseriesclassification.com. Like the univariate archive, this formulation was a collaborative effort between researchers at the University of East Anglia (UEA) and the University of California, Riverside (UCR). The 2018 vintage consists of 30 datasets with a wide range of cases, dimensions and series lengths. For this first iteration of the archive we format all data to be of equal length, include no series with missing data and provide train/test splits.
An important paradigm in smart health is developing diagnosis tools, and monitoring a patient’s heart activity by processing electrocardiogram (ECG) signals is a key example, due to the high mortality rate of heart-related disease. However, current heart monitoring devices suffer from two important drawbacks: i) failure to capture inter-patient variability, and ii) inability to identify heart abnormalities ahead of time so that effective preventive and therapeutic interventions can be taken. This paper proposes a novel predictive signal processing method to address these issues. We propose a two-step classification framework for ECG signals, where a global classifier recognizes severe abnormalities by comparing the signal against a universal reference model. The seemingly normal signals are then passed through a personalized classifier to recognize mild but informative signal morphology distortions. The key idea is to develop a novel deviation analysis based on a controlled nonlinear transformation to capture significant deviations of the signal towards any of the predefined abnormality classes. Here, we embrace the proven but overlooked fact that certain features of ECG signals reflect underlying cardiac abnormalities before the occurrence of cardiac disease. The proposed method achieves a classification accuracy of 96.6% and provides a unique feature of predictive analysis by issuing warnings before critical heart conditions. In particular, the chance of observing a severe problem (a red alarm) is raised by about 5% to 10% after observing a yellow alarm of the same type. Although we used this methodology to provide early precaution messages to elderly and high-risk heart patients, the proposed method is general and applicable to similar biomedical signal processing applications.
Deep reinforcement learning (DRL) has achieved great success by learning directly from high-dimensional sensory inputs, yet is notorious for its lack of interpretability. Interpretability of the subtasks is critical in hierarchical decision-making, as it increases the transparency of black-box-style DRL approaches and helps RL practitioners to better understand the high-level behavior of the system. In this paper, we introduce symbolic planning into DRL and propose a framework of Symbolic Deep Reinforcement Learning (SDRL) that can handle both high-dimensional sensory inputs and symbolic planning. The task-level interpretability is enabled by relating symbolic actions to options. This framework features a planner-controller-meta-controller architecture, which takes charge of subtask scheduling, data-driven subtask learning, and subtask evaluation, respectively. The three components cross-fertilize each other and eventually converge to an optimal symbolic plan along with the learned subtasks, bringing together the advantages of long-term planning capability with symbolic knowledge and end-to-end reinforcement learning directly from a high-dimensional sensory input. Experimental results validate the interpretability of subtasks, along with improved data efficiency compared with state-of-the-art approaches.
The expectation-maximization (EM) algorithm is almost ubiquitous for parameter estimation in model-based clustering problems; however, it can become stuck at local maxima, due to its single path, monotonic nature. Rather than using an EM algorithm, an evolutionary algorithm (EA) is developed. This EA facilitates a different search of the fitness landscape, i.e., the likelihood surface, utilizing both crossover and mutation. Furthermore, this EA represents an efficient approach to ‘hard’ model-based clustering and so it can be viewed as a sort of generalization of the k-means algorithm, which is itself equivalent to a classification EM algorithm for a Gaussian mixture model with spherical component covariances. The EA is illustrated on several data sets, and its performance is compared to k-means clustering as well as model-based clustering with an EM algorithm.
One of the main challenges in cluster analysis is estimating the true number of clusters in a dataset. This paper quantifies a notion of persistence of a clustering solution over a range of resolution scales, which is used to characterize the natural clusters and estimate the true number of clusters in a dataset. We show that this quantification of persistence is associated with evaluating the largest eigenvalue of the underlying cluster covariance matrix. Detailed experiments on a variety of standard and synthetic datasets demonstrate that the proposed persistence-based indicator outperforms existing approaches, such as the gap-statistic method, the $X$-means, $G$-means, $PG$-means, and dip-means algorithms, and the information-theoretic method, in accurately predicting the true number of clusters. Interestingly, our method can be explained in terms of the phase-transition phenomenon in the deterministic annealing algorithm, where the number of cluster centers changes (bifurcates) with respect to an annealing parameter. However, the approach suggested in this paper is independent of the choice of clustering algorithm and can be used in conjunction with any suitable clustering algorithm.
We investigate whether the standard dimensionality reduction technique of PCA inadvertently produces data representations with different fidelity for two different populations. We show that on several real-world data sets, PCA has higher reconstruction error on population A than on B (for example, women versus men or lower- versus higher-educated individuals). This can happen even when the data set has a similar number of samples from A and B. This motivates our study of dimensionality reduction techniques which maintain similar fidelity for A and B. We define the notion of Fair PCA and give a polynomial-time algorithm for finding a low-dimensional representation of the data which is nearly optimal with respect to this measure. Finally, we show on real-world data sets that our algorithm can be used to efficiently generate a fair low-dimensional representation of the data.
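The per-population reconstruction error discussed in this abstract can be made concrete with a small sketch. It fits ordinary PCA on pooled data and averages the squared reconstruction error separately for each group. It does not implement the Fair PCA algorithm itself. The function name `group_reconstruction_errors` and the toy data are assumptions made for illustration only.

```python
import numpy as np


def group_reconstruction_errors(X, groups, k):
    """Return the mean squared PCA reconstruction error for each group label.

    PCA is fit on the pooled, mean-centred data; the error is then averaged
    separately over the rows belonging to each group.
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    centred = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:k]
    # Project onto the top-k directions and map back to the input space.
    reconstructed = centred @ components.T @ components
    per_row = np.sum((centred - reconstructed) ** 2, axis=1)
    return {label.item(): float(per_row[groups == label].mean())
            for label in np.unique(groups)}


# Toy data (hypothetical): group "A" varies mostly along the x-axis,
# group "B" varies mostly along the y-axis with smaller variance.
rng = np.random.default_rng(0)
a = np.column_stack([rng.normal(0, 3.0, 200), rng.normal(0, 0.1, 200)])
b = np.column_stack([rng.normal(0, 0.1, 200), rng.normal(0, 1.0, 200)])
X = np.vstack([a, b])
groups = np.array(["A"] * 200 + ["B"] * 200)

errors = group_reconstruction_errors(X, groups, k=1)
```

In this toy setup both groups have the same number of samples, yet the single retained direction follows the high-variance group, so `errors["B"]` comes out larger than `errors["A"]`.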
In this paper, we investigate dimensionality reduction (DR) maps in an information retrieval setting from a quantitative topology point of view. In particular, we show that no DR map can achieve perfect precision and perfect recall simultaneously. Thus a continuous DR map must have imperfect precision. We further prove an upper bound on the precision of Lipschitz continuous DR maps. While precision is a natural measure in an information retrieval setting, it does not measure ‘how’ wrong the retrieved data is. We therefore propose a new measure based on Wasserstein distance that comes with a similar theoretical guarantee. A key technical step in our proofs is a particular optimization problem of the $L_2$-Wasserstein distance over a constrained set of distributions. We provide a complete solution to this optimization problem, which can be of independent interest on the technical side.
Naive Bayes spam filters are highly susceptible to data poisoning attacks. Here, known spam sources/blacklisted IPs exploit the fact that their received emails will be treated as (ground truth) labeled spam examples, and used for classifier training (or re-training). The attacking source thus generates emails that will skew the spam model, potentially resulting in great degradation in classifier accuracy. Such attacks are successful mainly because of the poor representation power of the naive Bayes (NB) model, with only a single (component) density to represent spam (plus a possible attack). We propose a defense based on the use of a mixture of NB models. We demonstrate that the learned mixture almost completely isolates the attack in a second NB component, with the original spam component essentially unchanged by the attack. Our approach addresses both the scenario where the classifier is being re-trained in light of new data and, significantly, the more challenging scenario where the attack is embedded in the original spam training set. Even for weak attack strengths, BIC-based model order selection chooses a two-component solution, which invokes the mixture-based defense. Promising results are presented on the TREC 2005 spam corpus.
When environmental interaction is expensive, model-based reinforcement learning offers a solution by planning ahead and avoiding costly mistakes. Model-based agents typically learn a single-step transition model. In this paper, we propose a multi-step model that predicts the outcome of an action sequence with variable length. We show that this model is easy to learn, and that the model can make policy-conditional predictions. We report preliminary results that show a clear advantage for the multi-step model compared to its one-step counterpart.
The interest and demand for training deep neural networks have been experiencing rapid growth, spanning a wide range of applications in both academia and industry. However, training them distributed and at scale remains difficult due to the complex ecosystem of tools and hardware involved. One consequence is that the responsibility of orchestrating these complex components is often left to one-off scripts and glue code customized for specific problems. To address these restrictions, we introduce \emph{Alchemist} – an internal service built at Apple from the ground up for \emph{easy}, \emph{fast}, and \emph{scalable} distributed training. We discuss its design, implementation, and examples of running different flavors of distributed training. We also present case studies of its internal adoption in the development of autonomous systems, where training times have been reduced by 10x to keep up with the ever-growing data collection.
We present ATOMIC, an atlas of everyday commonsense reasoning, organized through 300k textual descriptions. Compared to existing resources that center around taxonomic knowledge, ATOMIC focuses on inferential knowledge organized as typed if-then relations with variables (e.g., ‘if X pays Y a compliment, then Y will likely return the compliment’). We propose nine if-then relation types to distinguish causes vs. effects, agents vs. themes, voluntary vs. involuntary events, and actions vs. mental states. By generatively training on the rich inferential knowledge described in ATOMIC, we show that neural models can acquire simple commonsense capabilities and reason about previously unseen events. Experimental results demonstrate that multitask models that incorporate the hierarchical structure of if-then relation types lead to more accurate inference compared to models trained in isolation, as measured by both automatic and human evaluation.
We introduce a new method, DOLORES, for learning knowledge graph embeddings that effectively captures contextual cues and dependencies among entities and relations. First, we note that short paths on knowledge graphs comprising chains of entities and relations can encode valuable information regarding their contextual usage. We operationalize this notion by representing knowledge graphs not as a collection of triples but as a collection of entity-relation chains, and learn embeddings for entities and relations using deep neural models that capture such contextual usage. In particular, our model is based on Bi-Directional LSTMs and learns deep representations of entities and relations from constructed entity-relation chains. We show that these representations can very easily be incorporated into existing models to significantly advance the state of the art on several knowledge graph prediction tasks like link prediction, triple classification, and missing relation type prediction (in some cases by at least 9.5%).
Generative Adversarial Networks have a surprising ability for generating sharp and realistic images, though they are known to suffer from the so-called mode collapse problem. In this paper, we propose a new GAN variant called Mixture Density GAN that, while being capable of generating high-quality images, overcomes this problem by encouraging the Discriminator to form clusters in its embedding space, which in turn leads the Generator to exploit these and discover different modes in the data. This is achieved by positioning Gaussian density functions in the corners of a simplex, using the resulting Gaussian mixture as a likelihood function over discriminator embeddings, and formulating an objective function for GAN training that is based on these likelihoods. We show that the optimum of our training objective is attained if and only if the generated and the real distribution match exactly. We further support our theoretical results with empirical evaluations on one synthetic and several real image datasets (CIFAR-10, CelebA, MNIST, and FashionMNIST). We demonstrate empirically (1) the quality of the generated images in Mixture Density GAN and their strong similarity to real images, as measured by the Fréchet Inception Distance (FID), which compares very favourably with state-of-the-art methods, and (2) the ability to avoid mode collapse and discover all data modes.
OpenCL is an attractive model for heterogeneous high-performance computing systems, with wide support from hardware vendors and significant performance portability. To support efficient scheduling on HPC systems it is necessary to perform accurate performance predictions for OpenCL workloads on varied compute devices, which is challenging due to diverse computation, communication and memory access characteristics which result in varying performance between devices. The Architecture Independent Workload Characterization (AIWC) tool can be used to characterize OpenCL kernels according to a set of architecture-independent features. This work presents a methodology where AIWC features are used to form a model capable of predicting accelerator execution times. We used this methodology to predict execution times for a set of 37 computational kernels running on 15 different devices representing a broad range of CPU, GPU and MIC architectures. The predictions are highly accurate, differing from the measured experimental run-times by an average of only 1.2%, and correspond to actual execution time mispredictions of 9 μs to 1 sec according to problem size. A previously unencountered code can be instrumented once and the AIWC metrics embedded in the kernel, to allow performance prediction across the full range of modelled devices. The results suggest that this methodology supports correct selection of the most appropriate device for a previously unencountered code, which is highly relevant to the HPC scheduling setting.
Exploiting low-rank structure of the user-item rating matrix has been the crux of many recommendation engines. However, existing recommendation engines force raters with heterogeneous behavior profiles to map their intrinsic rating scales to a common rating scale (e.g. 1-5). This non-linear transformation of the rating scale shatters the low-rank structure of the rating matrix, therefore resulting in a poor fit and consequentially, poor recommendations. In this paper, we propose Clustered Monotone Transforms for Rating Factorization (CMTRF), a novel approach to perform regression up to unknown monotonic transforms over unknown population segments. Essentially, for recommendation systems, the technique searches for monotonic transformations of the rating scales resulting in a better fit. This is combined with an underlying matrix factorization regression model that couples the user-wise ratings to exploit shared low dimensional structure. The rating scale transformations can be generated for each user, for a cluster of users, or for all the users at once, forming the basis of three simple and efficient algorithms proposed in this paper, all of which alternate between transformation of the rating scales and matrix factorization regression. Despite the non-convexity, CMTRF is theoretically shown to recover a unique solution under mild conditions. Experimental results on two synthetic and seven real-world datasets show that CMTRF outperforms other state-of-the-art baselines.
In this paper we analyze the convolutional layers of the VGG16 model pre-trained on ILSVRC2012. We base our analysis on the responses of neurons to the images of all classes in the ImageNet database. In our analysis, we first propose a visualization method to illustrate the learned content of each neuron. Next, we investigate single- and multi-faceted neurons based on the diversity of neurons’ responses to different classes. Finally, we compute the neuronal similarity at each layer and make a comparison between them. Our results demonstrate that the neurons in lower layers exhibit a multi-faceted behavior, whereas the majority of neurons in higher layers exhibit a single-faceted property and tend to respond to a smaller number of classes.
Counterfactual Regret Minimization (CFR) is the leading algorithm for solving large imperfect-information games. It iteratively traverses the game tree in order to converge to a Nash equilibrium. In order to deal with extremely large games, CFR typically uses domain-specific heuristics to simplify the target game in a process known as abstraction. This simplified game is solved with tabular CFR, and its solution is mapped back to the full game. This paper introduces Deep Counterfactual Regret Minimization (Deep CFR), a form of CFR that obviates the need for abstraction by instead using deep neural networks to approximate the behavior of CFR in the full game. We show that Deep CFR is principled and achieves strong performance in the benchmark game of heads-up no-limit Texas hold’em poker. This is the first successful use of function approximation in CFR for large games.
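As background for this abstract, the sketch below shows regret matching, the update rule commonly used at each decision point in tabular CFR. The abstract itself does not spell out the update rule, so treating regret matching as the building block is an assumption here, and the code is neither full CFR nor Deep CFR. The rock-paper-scissors setup, the fixed opponent, and the function names are hypothetical choices made for illustration.

```python
def regret_matching(cumulative_regret):
    """Turn cumulative regrets into a strategy proportional to positive regret."""
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    # No action has positive regret: fall back to the uniform strategy.
    return [1.0 / len(cumulative_regret)] * len(cumulative_regret)


# Payoff to the row player in rock-paper-scissors (rows/cols: R, P, S).
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]


def train_against(opponent, iterations):
    """Run regret matching against a fixed opponent mixed strategy.

    Expected payoffs are used instead of sampled ones, so the run is
    deterministic. Returns the average strategy over all iterations.
    """
    n = len(PAYOFF)
    regret = [0.0] * n
    strategy_sum = [0.0] * n
    for _ in range(iterations):
        strategy = regret_matching(regret)
        # Expected payoff of each pure action against the opponent.
        action_value = [sum(PAYOFF[a][b] * opponent[b] for b in range(n))
                        for a in range(n)]
        expected = sum(strategy[a] * action_value[a] for a in range(n))
        for a in range(n):
            # Regret: how much better action a would have done than the mix.
            regret[a] += action_value[a] - expected
            strategy_sum[a] += strategy[a]
    return [s / iterations for s in strategy_sum]


# Hypothetical opponent that plays rock 60% of the time.
average_strategy = train_against([0.6, 0.2, 0.2], iterations=1000)
```

Against this rock-heavy opponent the average strategy concentrates almost entirely on paper, the best response. In CFR the same rule is applied at every information set using counterfactual values rather than the plain expected payoffs used here.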
Human Activity Recognition (HAR) based on motion sensors has drawn a lot of attention over the last few years, since perceiving the human status enables context-aware applications to adapt their services to users’ needs. However, motion sensor fusion and feature extraction have not reached their full potential and remain an open issue. In this paper, we introduce PerceptionNet, a deep Convolutional Neural Network (CNN) that applies a late 2D convolution to multimodal time-series sensor data, in order to automatically extract efficient features for HAR. We evaluate our approach on two publicly available HAR datasets to demonstrate that the proposed model effectively fuses multimodal sensors and improves the performance of HAR. In particular, PerceptionNet surpasses the performance of state-of-the-art HAR methods based on: (i) features extracted from humans, (ii) deep CNNs exploiting early fusion approaches, and (iii) Long Short-Term Memory (LSTM), by an average accuracy of more than 3%.
We propose a new task named Dialogue Description (Dial2Desc). Unlike existing dialogue summarization tasks such as meeting summarization, we do not maintain the natural flow of a conversation but describe the object or action that people are talking about. The Dial2Desc system takes a dialogue text as input, then outputs a concise description of the object or the action involved in this conversation. After reading this short description, one can quickly extract the main topic of a conversation and build a clear picture in their mind, without reading or listening to the whole conversation. Based on an existing dialogue dataset, we build a new dataset, which has more than one hundred thousand dialogue-description pairs. As a step forward, we demonstrate that one can get more accurate and descriptive results using a new neural attentive model that exploits the interaction between utterances from different speakers, compared with other baselines.
Building explainable systems is a critical problem in the field of Natural Language Processing (NLP), since most machine learning models provide no explanations for the predictions. Existing approaches for explainable machine learning systems tend to focus on interpreting the outputs or the connections between inputs and outputs. However, the fine-grained information is often ignored, and the systems do not explicitly generate the human-readable explanations. To better alleviate this problem, we propose a novel generative explanation framework that learns to make classification decisions and generate fine-grained explanations at the same time. More specifically, we introduce the explainable factor and the minimum risk training approach that learn to generate more reasonable explanations. We construct two new datasets that contain summaries, rating scores, and fine-grained reasons. We conduct experiments on both datasets, comparing with several strong neural network baseline systems. Experimental results show that our method surpasses all baselines on both datasets, and is able to generate concise explanations at the same time.
Many knowledge graph embedding methods operate on triples and are therefore implicitly limited by a very local view of the entire knowledge graph. We present a new framework, MOHONE, to effectively model higher-order network effects in knowledge graphs, thus enabling one to capture varying degrees of network connectivity (from the local to the global). Our framework is generic, explicitly models the network scale, and captures two different aspects of similarity in networks: (a) shared local neighborhood and (b) structural role-based similarity. First, we introduce methods that learn network representations of entities in the knowledge graph capturing these varied aspects of similarity. We then propose a fast, efficient method to incorporate the information captured by these network representations into existing knowledge graph embeddings. We show that our method consistently and significantly improves the performance on link prediction of several different knowledge graph embedding methods including TRANSE, TRANSD, DISTMULT, and COMPLEX (by at least 4 points or 17% in some cases).
This paper develops theory and methods for the copula modeling of stationary count time series. The techniques use a latent Gaussian process and a distributional transformation to construct stationary series with very flexible correlation features that can have any pre-specified marginal distribution, including the classical Poisson, generalized Poisson, negative binomial, and binomial count structures. A Gaussian pseudo-likelihood estimation paradigm, based only on the mean and autocovariance function of the count series, is developed via some new Hermite expansions. Particle filtering methods are studied to approximate the true likelihood of the count series. Here, connections to hidden Markov models and other copula likelihood approximations are made. The efficacy of the approach is demonstrated and the methods are used to analyze a count series containing the annual number of no-hitter baseball games pitched in major league baseball since 1893.
Beyond understanding what is being discussed, human communication requires an awareness of what someone is feeling. One challenge for dialogue agents is being able to recognize feelings in the conversation partner and reply accordingly, a key communicative skill that is trivial for humans. Research in this area is made difficult by the paucity of large-scale publicly available datasets both for emotion and relevant dialogues. This work proposes a new task for empathetic dialogue generation and EmpatheticDialogues, a dataset of 25k conversations grounded in emotional contexts to facilitate training and evaluating dialogue systems. Our experiments indicate that models explicitly leveraging emotion predictions from previous utterances are perceived to be more empathetic by human evaluators, while improving on other metrics as well (e.g. perceived relevance of responses, BLEU scores).
Automated planning is one of the foundational areas of AI. Since a single planner is unlikely to work well for all tasks and domains, portfolio-based techniques have become increasingly popular. In particular, deep learning has emerged as a promising methodology for online planner selection. Owing to the recent development of structural graph representations of planning tasks, we propose a graph neural network (GNN) approach to selecting candidate planners. GNNs are advantageous over a straightforward alternative, convolutional neural networks, in that they are invariant to node permutations and that they incorporate node labels for better inference. Additionally, for cost-optimal planning, we propose a two-stage adaptive scheduling method to further improve the likelihood that a given task is solved in time. The scheduler may switch at halftime to a different planner, conditioned on the observed performance of the first one. Experimental results validate the effectiveness of the proposed method against strong baselines, both deep learning and non-deep learning based.
Markov decision processes (MDPs) are a common approach used to model dynamic optimization problems. MDPs are specified by a set of states, a set of actions, a transition probability kernel and the rewards associated with transitions. The goal is to find a policy that maximizes the expected cumulative reward. However, in most real-world problems, the model parameters are estimated from noisy observations and are uncertain. The optimal policy for the nominal parameters might be highly sensitive to even small perturbations in the parameters, leading to significantly suboptimal outcomes. To address this issue, we consider a robust approach where the uncertainty in probability transitions is modeled as an adversarial selection from an uncertainty set. Most prior works consider the case where uncertainty on transitions related to different states is uncoupled. However, the case of general uncertainty sets is known to be intractable. We consider a factor model where the transition probability is a linear function of a factor matrix that is uncertain and belongs to a factor matrix uncertainty set. This allows us to model dependence between probability transitions across different states, and it is significantly less conservative than prior approaches. We show that under a certain assumption, we can efficiently compute an optimal robust policy under the factor matrix uncertainty model. We show that an optimal robust policy can be chosen to be deterministic and in particular is an optimal policy for some transition kernel in the uncertainty set. This implies strong min-max duality. We introduce the robust counterpart of important structural results of classical MDPs and we provide a computational study to demonstrate the usefulness of our approach, where we present two examples in which robustness improves the worst-case and the empirical performances while maintaining a reasonable performance on the nominal parameters.
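To illustrate the adversarial view of transition uncertainty described in this abstract, here is a minimal sketch of robust value iteration for the simpler uncoupled case that the abstract attributes to prior work. Each state-action pair has its own finite list of candidate next-state distributions and the adversary picks the worst one. This is not the factor matrix model proposed in the paper. The function name, the finite uncertainty sets, and the two-state example are assumptions made for illustration.

```python
def robust_value_iteration(rewards, kernels, gamma=0.9, tol=1e-8):
    """Robust value iteration with finite, per-(state, action) uncertainty sets.

    rewards[s][a] : immediate reward for action a in state s
    kernels[s][a] : list of candidate next-state distributions; the
                    adversary picks the worst one for each (s, a) pair
    """
    n_states = len(rewards)
    values = [0.0] * n_states
    while True:
        new_values = []
        for s in range(n_states):
            best = float("-inf")
            for a, reward in enumerate(rewards[s]):
                # Adversary minimises the expected continuation value.
                worst = min(sum(p * v for p, v in zip(dist, values))
                            for dist in kernels[s][a])
                best = max(best, reward + gamma * worst)
            new_values.append(best)
        gap = max(abs(x - y) for x, y in zip(new_values, values))
        values = new_values
        if gap < tol:
            return values


# Hypothetical two-state example. State 1 is absorbing with zero reward.
# In state 0, "safe" pays 1 and always stays; "risky" pays 2 and nominally
# stays, but the adversary may instead send it to state 1.
rewards = [[1.0, 2.0], [0.0]]
nominal_kernels = [[[[1.0, 0.0]], [[1.0, 0.0]]], [[[0.0, 1.0]]]]
robust_kernels = [[[[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]], [[[0.0, 1.0]]]]

nominal_values = robust_value_iteration(rewards, nominal_kernels)
robust_values = robust_value_iteration(rewards, robust_kernels)
```

With the nominal kernel the risky action is optimal in state 0 and its value approaches 2 / (1 - 0.9) = 20. Under the uncertainty set the risky action is only worth 2, so the robust policy switches to the safe action with value 1 / (1 - 0.9) = 10.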
The key issue in Dynamic Ensemble Selection (DES) is defining a suitable criterion for calculating the classifiers’ competence. There are several criteria available to measure the level of competence of base classifiers, such as local accuracy estimates and ranking. However, using only one criterion may lead to a poor estimation of the classifier’s competence. In order to deal with this issue, we have proposed a novel dynamic ensemble selection framework using meta-learning, called META-DES. An important aspect of the META-DES framework is that multiple criteria can be embedded in the system encoded as different sets of meta-features. However, some DES criteria are not suitable for every classification problem. For instance, local accuracy estimates may produce poor results when there is a high degree of overlap between the classes. Moreover, a higher classification accuracy can be obtained if the performance of the meta-classifier is optimized for the corresponding data. In this paper, we propose a novel version of the META-DES framework based on the formal definition of the Oracle, called META-DES.Oracle. The Oracle is an abstract method that represents an ideal classifier selection scheme. A meta-feature selection scheme using an overfitting cautious Binary Particle Swarm Optimization (BPSO) is proposed for improving the performance of the meta-classifier. The difference between the outputs obtained by the meta-classifier and those presented by the Oracle is minimized. Thus, the meta-classifier is expected to obtain results that are similar to the Oracle. Experiments carried out using 30 classification problems demonstrate that the optimization procedure based on the Oracle definition leads to a significant improvement in classification accuracy when compared to previous versions of the META-DES framework and other state-of-the-art DES techniques.
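As a small illustration of the Oracle, the sketch below uses the usual reading of the term in the DES literature: the Oracle is correct on a sample whenever at least one base classifier is correct on it. The function names and the data layout (`predictions[c][i]` is the label predicted by base classifier `c` for sample `i`) are assumptions for illustration, not part of META-DES.Oracle itself.

```python
def oracle_targets(predictions, labels):
    """For each base classifier, mark with 1 the samples it classifies
    correctly and with 0 the others. An ideal selection scheme would pick
    a classifier marked 1 whenever one exists."""
    return [
        [1 if pred == label else 0 for pred, label in zip(clf_preds, labels)]
        for clf_preds in predictions
    ]


def oracle_accuracy(predictions, labels):
    """Fraction of samples for which at least one base classifier is correct."""
    targets = oracle_targets(predictions, labels)
    n = len(labels)
    solved = sum(1 for i in range(n) if any(t[i] for t in targets))
    return solved / n


# Two base classifiers, four samples.
predictions = [[0, 1, 1, 0],
               [1, 1, 0, 0]]
labels = [0, 1, 0, 1]
print(oracle_accuracy(predictions, labels))  # 0.75
```

The per-classifier 0/1 targets are the kind of signal a meta-classifier can be trained to reproduce; the Oracle accuracy is the upper bound that any selection scheme over this pool can reach.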
In this work, we introduce a novel algorithm for solving the textbook question answering (TQA) task, which describes more realistic QA problems than other recent tasks. We focus on two related issues identified through an analysis of the TQA dataset. First, the task requires comprehending long lessons to extract knowledge. To tackle this issue of extracting knowledge features from long lessons, we build a knowledge graph from the text and incorporate a graph convolutional network (GCN). Second, scientific terms are not spread across the chapters and data splits in the TQA dataset. To overcome this so-called ‘out-of-domain’ issue, we add a novel unsupervised text learning process, which requires no annotations, before learning QA problems. The experimental results show that our model significantly outperforms prior state-of-the-art methods. Moreover, ablation studies validate that both incorporating a GCN to extract knowledge from long lessons and our newly proposed unsupervised learning process contribute to solving this problem.
Machine learning models are used extensively in decision making, especially for prediction tasks. These models can be biased or unfair towards a specific sensitive group defined by race, gender or age. Researchers have put effort into characterizing particular definitions of fairness and enforcing them in models. In this work, we are mainly concerned with the following three definitions: Disparate Impact, Demographic Parity and Equalized Odds. Researchers have shown that Equalized Odds cannot be satisfied by calibrated classifiers unless the classifier is perfect. Hence the primary challenge is to ensure a degree of fairness while guaranteeing as much accuracy as possible. Fairness constraints are complex and need not be convex, so incorporating them into a machine learning algorithm is a significant challenge. Hence, many researchers have tried to come up with a convex surrogate loss in order to build fair classifiers. In addition, certain papers try to build fair representations by preprocessing the data, irrespective of the classifier used. Such methods not only require many unrealistic assumptions but also require human-engineered analytical solutions to build a machine learning model. We instead propose an automated solution that generalizes to any fairness constraint. We use a neural network that is trained on batches and directly enforces the fairness constraint as the loss function without modifying it further. We have also experimented with other complex performance measures, such as H-mean loss, Q-mean loss and F-measure, without the need for any surrogate loss functions. Our experiments show that the network achieves performance similar to the state of the art. Thus, one can simply plug in the appropriate loss function for the required fairness constraint and performance measure of the classifier and train a neural network to achieve it.
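To make the three fairness notions concrete, the sketch below computes them on a batch of binary predictions with a binary sensitive attribute. The formulas are the commonly used ones (difference in positive rates, ratio of positive rates, and the largest gap in true- or false-positive rates); the paper's exact formulations may differ, and all function names are illustrative.

```python
def positive_rate(y_pred, mask):
    """Share of positive predictions among the samples selected by mask.
    Assumes the mask selects at least one sample."""
    selected = [p for p, keep in zip(y_pred, mask) if keep]
    return sum(selected) / len(selected)


def group_rates(y_pred, group):
    r0 = positive_rate(y_pred, [g == 0 for g in group])
    r1 = positive_rate(y_pred, [g == 1 for g in group])
    return r0, r1


def demographic_parity_gap(y_pred, group):
    """|P(pred=1 | group 0) - P(pred=1 | group 1)|; 0 means parity."""
    r0, r1 = group_rates(y_pred, group)
    return abs(r0 - r1)


def disparate_impact_ratio(y_pred, group):
    """Ratio of the smaller positive rate to the larger one; 1 means parity.
    Assumes at least one group receives some positive predictions."""
    r0, r1 = group_rates(y_pred, group)
    return min(r0, r1) / max(r0, r1)


def equalized_odds_gap(y_true, y_pred, group):
    """Largest between-group gap in false-positive rate (y=0) or
    true-positive rate (y=1)."""
    gaps = []
    for y in (0, 1):
        r0 = positive_rate(y_pred, [g == 0 and t == y for g, t in zip(group, y_true)])
        r1 = positive_rate(y_pred, [g == 1 and t == y for g, t in zip(group, y_true)])
        gaps.append(abs(r0 - r1))
    return max(gaps)


y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
print(demographic_parity_gap(y_pred, group))      # 0.5
print(equalized_odds_gap(y_true, y_pred, group))  # 0.5
```

Because these quantities are computed from counts over a batch, they are not convex in the model parameters, which is the difficulty the abstract refers to.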
Sparse regression such as Lasso has achieved great success in dealing with high dimensional data for several decades. However, there are few methods applicable to missing data, which often occurs in high dimensional data. Recently, CoCoLasso was proposed to deal with high dimensional missing data, but it still performs poorly on highly missing data. In this paper, we propose a novel Lasso-type regression technique for Highly Missing data, called ‘HMLasso’. We use the mean imputed covariance matrix, which is notorious in general due to its estimation bias for missing data. However, we effectively incorporate it into Lasso by using a useful connection with the pairwise covariance matrix. The resulting optimization problem can be seen as a weighted modification of CoCoLasso with the missing ratios, and is quite effective for highly missing data. To the best of our knowledge, this is the first method that can efficiently deal with both high dimensional and highly missing data. We show that the proposed method is beneficial with regard to non-asymptotic properties of the covariance matrix. Numerical experiments show that the proposed method is highly advantageous in terms of estimation error and generalization error.
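One way to see a connection between the two covariance estimates is the following sketch. It assumes each column is centered by the mean of its observed entries, so that mean imputation amounts to filling missing entries with zero. Under that assumption, each entry of the mean imputed covariance equals the corresponding pairwise entry scaled by the fraction of rows in which both variables are observed. The function names are illustrative, and this is not the HMLasso estimator itself.

```python
def column_means(X):
    """Mean of each column over its observed (non-None) entries."""
    means = []
    for col in zip(*X):
        observed = [v for v in col if v is not None]
        means.append(sum(observed) / len(observed))
    return means


def center_and_impute(X):
    """Center each column, then fill missing entries with 0 (the column mean)."""
    mu = column_means(X)
    return [[0.0 if v is None else v - m for v, m in zip(row, mu)] for row in X]


def mean_imputed_cov(X):
    """Covariance of the centered, zero-filled data, averaged over all n rows."""
    Z = center_and_impute(X)
    n, p = len(Z), len(Z[0])
    return [[sum(row[j] * row[k] for row in Z) / n for k in range(p)]
            for j in range(p)]


def pairwise_cov(X):
    """Return (S_pair, R). S_pair[j][k] averages over rows where both j and k
    are observed; R[j][k] is the fraction of such rows (assumed nonzero)."""
    Z = center_and_impute(X)
    n, p = len(X), len(X[0])
    S = [[0.0] * p for _ in range(p)]
    R = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(p):
            rows = [i for i in range(n)
                    if X[i][j] is not None and X[i][k] is not None]
            S[j][k] = sum(Z[i][j] * Z[i][k] for i in rows) / len(rows)
            R[j][k] = len(rows) / n
    return S, R


X = [[1.0, 2.0, None],
     [2.0, None, 1.0],
     [3.0, 6.0, 5.0],
     [None, 4.0, 3.0]]
S_imp = mean_imputed_cov(X)
S_pair, R = pairwise_cov(X)
# Entrywise, S_imp[j][k] matches R[j][k] * S_pair[j][k] up to rounding.
```

The scaling by the observation ratio shows why the mean imputed matrix shrinks entries that are rarely observed together, which is consistent with the abstract's description of a weighting by the missing ratios.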
Topic models are widely used for thematic structure discovery in text. However, traditional topic models often require dedicated inference procedures for the specific task at hand. Also, they are not designed to generate word-level semantic representations. To address these limitations, we propose a topic modeling approach based on Generative Adversarial Nets (GANs), called Adversarial-neural Topic Model (ATM). The proposed ATM models topics with a Dirichlet prior and employs a generator network to capture the semantic patterns among latent topics. The generator can also produce word-level semantic representations. To illustrate the feasibility of porting ATM to tasks other than topic modeling, we apply ATM to open domain event extraction. Our experimental results on two public corpora show that ATM generates more coherent topics, outperforming a number of competitive baselines. Moreover, ATM is able to extract meaningful events from news articles.
In this paper, we aim to address the problem of human interaction recognition in videos by exploring the long-term inter-related dynamics among multiple persons. Recently, Long Short-Term Memory (LSTM) has become a popular choice for modeling individual dynamics in single-person action recognition due to its ability to capture temporal motion information over a range. However, existing RNN models capture the dynamics of human interaction only by simply combining the dynamics of all individuals or by modeling them as a whole. Such models neglect the inter-related dynamics of how human interactions change over time. To this end, we propose a novel Hierarchical Long Short-Term Concurrent Memory (H-LSTCM) to model the long-term inter-related dynamics among a group of persons for recognizing human interactions. Specifically, we first feed each person’s static features into a Single-Person LSTM to learn the single-person dynamics. Subsequently, the outputs of all Single-Person LSTM units are fed into a novel Concurrent LSTM (Co-LSTM) unit, which mainly consists of multiple sub-memory units, a new cell gate and a new co-memory cell. In a Co-LSTM unit, each sub-memory unit stores individual motion information, while the Co-LSTM unit selectively integrates and stores inter-related motion information between multiple interacting persons from the sub-memory units via the cell gate and co-memory cell, respectively. Extensive experiments on four public datasets validate the effectiveness of the proposed H-LSTCM by comparing against baseline and state-of-the-art methods.
In this paper, we propose multi-domain dictionary learning (MDDL) to make dictionary learning-based classification more robust to data represented in different domains. We use adversarial neural networks to generate data in different styles, and collect all the generated data into a miscellaneous dictionary. To handle dictionary learning with many samples, we compute a weighting matrix that compresses the miscellaneous dictionary from multiple samples per class to a single sample per class. We show that the time complexity of solving the proposed MDDL with the weighting matrix is the same as that of solving the dictionary with a single sample per class. Moreover, since the weighting matrix can help the solver rely more on the training data that possibly lie in the same domain as the testing data, the classification can be more accurate.
Suggestion mining is increasingly becoming an important task along with sentiment analysis. In today’s cyberspace world, people not only express their sentiments and dispositions towards entities or services, but also spend considerable time sharing their experiences and advice with fellow customers and the product/service providers, with a two-fold agenda: helping fellow customers who are likely to share a similar experience, and motivating the producer to make specific changes to their offerings that customers would appreciate more. In our current work, we propose a hybrid deep learning model to identify whether a review text contains any suggestion. The model employs semi-supervised learning to leverage useful information from the large amount of unlabeled data. We evaluate the performance of our proposed model on a benchmark customer review dataset comprising reviews from the Hotel and Electronics domains. Our proposed approach achieves F-scores of 65.6% and 65.5% on the Hotel and Electronics review datasets, respectively. These results are significantly better than those of the existing state-of-the-art system.
Feature selection (FS) is a key research area in the machine learning and data mining fields; removing irrelevant and redundant features usually helps to reduce the effort required to process a dataset while maintaining or even improving the processing algorithm’s accuracy. However, traditional algorithms designed to run on a single machine lack the scalability to deal with the increasing amount of data that has become available in the current Big Data era. ReliefF is one of the most important algorithms successfully implemented in many FS applications. In this paper, we present a completely redesigned distributed version of the popular ReliefF algorithm based on the novel Spark cluster computing model, which we have called DiReliefF. Spark is growing in popularity due to its much faster processing times compared with Hadoop’s MapReduce model implementation. The effectiveness of our proposal is tested on four publicly available datasets, all of them with a large number of instances and two of them also with a large number of features. Subsets of these datasets were also used to compare the results to a non-distributed implementation of the algorithm. The results show that the non-distributed implementation is unable to handle such large volumes of data without specialized hardware, while our design can process them in a scalable way with much better processing times and memory usage.
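For readers unfamiliar with the Relief family, the sketch below shows the basic single-machine Relief update with one nearest hit and one nearest miss per instance. It is a simplified relative of ReliefF (which averages over k neighbours and handles multiple classes) and is not the distributed DiReliefF design. The function name and the toy data are illustrative, and the code assumes numeric features and at least two instances per class.

```python
def relief_weights(X, y):
    """Basic Relief: reward features that differ on the nearest instance of
    another class (miss) and penalise features that differ on the nearest
    instance of the same class (hit)."""
    n, p = len(X), len(X[0])
    # Value range of each feature, used to scale differences into [0, 1].
    spans = [(max(col) - min(col)) or 1.0 for col in zip(*X)]

    def diff(a, b, f):
        return abs(a[f] - b[f]) / spans[f]

    def dist(a, b):
        return sum(diff(a, b, f) for f in range(p))

    weights = [0.0] * p
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(p):
            weights[f] += (diff(X[i], X[miss], f) - diff(X[i], X[hit], f)) / n
    return weights


# Feature 0 separates the classes; feature 1 does not.
X = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
y = [0, 0, 1, 1]
w = relief_weights(X, y)
# w[0] comes out positive and w[1] negative for this data.
```

The nearest-neighbour search over all instance pairs is the part that grows quadratically with the number of instances, which is why a single-machine implementation struggles on large datasets.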
In this paper, we derive the exact parametric expressions of natural direct and indirect effects, on the odds-ratio scale, in settings with a binary mediator. The effect decomposition we propose does not require the outcome to be rare and generalizes the existing one, allowing for interactions between both the exposure and the mediator and confounding covariates. Further, it outlines a more interpretable relationship between the causal effects and the corresponding pathway-specific logistic regression parameters. Our findings are applied to data from a microfinance experiment performed in Bosnia and Herzegovina. A simulation study comparing our approach with estimators relying on the rare outcome assumption is also carried out.
Calling multi-threaded C++ code from R has its perils. Since the R interpreter is single-threaded, one must not check for user interruptions or print to the R console from multiple threads. One can, however, synchronize with R from the main thread. The R package RcppThread (current version 0.5.0) contains a header-only C++ library for thread-safe communication with R that exploits this fact. It includes C++ classes for threads, a thread pool, and parallel loops that routinely synchronize with R. This article explains the package’s functionality and gives examples of its usage. The synchronization mechanism may also apply to other threading frameworks. Benchmarks suggest that, although synchronization causes overhead, the parallel abstractions of RcppThread are competitive with other popular libraries in typical scenarios encountered in statistical computing.
Marketers often use A/B testing as a tactical tool to compare marketing treatments in a test stage and then deploy the better-performing treatment to the remainder of the consumer population. While these tests have traditionally been analyzed using hypothesis testing, we re-frame such tactical tests as an explicit trade-off between the opportunity cost of the test (where some customers receive a sub-optimal treatment) and the potential losses associated with deploying a sub-optimal treatment to the remainder of the population. We derive a closed-form expression for the profit-maximizing test size and show that it is substantially smaller than that typically recommended for a hypothesis test, particularly when the response is noisy or when the total population is small. The common practice of using small holdout groups can be rationalized by asymmetric priors. The proposed test design achieves nearly the same expected regret as the flexible, yet harder-to-implement multi-armed bandit. We demonstrate the benefits of the method in three different marketing contexts — website design, display advertising and catalog tests — in which we estimate priors from past data. In all three cases, the optimal sample sizes are substantially smaller than for a traditional hypothesis test, resulting in higher profit.
Citizen science projects are successful at gathering rich datasets for various applications. Nevertheless, the data collected by citizen scientists are often biased, aligned more with the citizens’ preferences than with scientific objectives. We propose the Shift Compensation Network (SCN), an end-to-end learning scheme that learns the shift from the scientific objectives to the biased data while compensating for the shift by re-weighting the training data. Applied to bird observational data from the citizen science project eBird, we demonstrate how SCN quantifies the data distribution shift and outperforms supervised learning models that do not address the data bias. Compared with other competing models in the context of covariate shift, we further demonstrate the advantage of SCN in both effectiveness and the capability of handling massive high-dimensional data.
A feature learning task involves training models that are capable of inferring good representations (transformations of the original space) from input data alone. When working with limited or unlabelled data, and also when multiple visual domains are considered, methods that rely on large annotated datasets, such as Convolutional Neural Networks (CNNs), cannot be employed. In this paper we investigate different auto-encoder (AE) architectures, which require no labels, and explore training strategies to learn representations from images. The models are evaluated considering both the reconstruction error of the images and the feature spaces in terms of their discriminative power. We study the role of dense and convolutional layers on the results, as well as the depth and capacity of the networks, since those are shown to affect both the dimensionality reduction and the capability of generalising for different visual domains. Classification results with AE features were as discriminative as pre-trained CNN features. Our findings can be used as guidelines for the design of unsupervised representation learning methods within and across domains.