Structure learning methods for covariance and concentration graphs are often validated on synthetic models, usually obtained by randomly generating: (i) an undirected graph, and (ii) a compatible symmetric positive definite (SPD) matrix. In order to ensure positive definiteness in (ii), a dominant diagonal is usually imposed. In this work we investigate different methods to generate random symmetric positive definite matrices with undirected graphical constraints. We show that if the graph is chordal is possible to sample uniformly from the set of correlation matrices compatible with the graph, while for general undirected graphs we rely on a partial orthogonalization method.
Experimental data bases are typically very large and high dimensional. To learn from them requires to recognize important features (a pattern), often present at scales different to that of the recorded data. Following the experience collected in statistical mechanics and thermodynamics, the process of recognizing the pattern (the learning process) can be seen as a dissipative time evolution driven by entropy. This is the way thermodynamics enters machine learning. Learning to handle free surface liquids serves as an illustration.
Integrating information from heterogeneous data sources is one of the fundamental problems facing any enterprise. Recently, it has been shown that deep learning based techniques such as embeddings are a promising approach for data integration problems. Prior efforts directly use pre-trained embeddings or simplistically adapt techniques from natural language processing to obtain relational embeddings. In this work, we propose algorithms for obtaining local embeddings that are effective for data integration tasks on relational data. We make three major contributions. First, we describe a compact graph-based representation that allows the specification of a rich set of relationships inherent in relational world. Second, we propose how to derive sentences from such graph that effectively describe the similarity across elements (tokens, attributes, rows) across the two datasets. The embeddings are learned based on such sentences. Finally, we propose a diverse collection of criteria to evaluate relational embeddings and perform extensive set of experiments validating them. Our experiments show that our system, EmbDI, produces meaningful results for data integration tasks and our embeddings improve the result quality for existing state of the art methods.
Sufficient conditions for the existence of efficient algorithms are established by introducing the concept of contractility for continuous optimization. Then all the possible continuous problems are divided into three categories: contractile in logarithmic time, contractile in polynomial time, or noncontractile. For the first two, we propose an efficient contracting algorithm to find the set of all global minimizers with a theoretical guarantee of linear convergence; for the last one, we discuss possible troubles caused by using the proposed algorithm.
Allen’s Interval Algebra constitutes a framework for reasoning about temporal information in a qualitative manner. In particular, it uses intervals, i.e., pairs of endpoints, on the timeline to represent entities corresponding to actions, events, or tasks, and binary relations such as precedes and overlaps to encode the possible configurations between those entities. Allen’s calculus has found its way in many academic and industrial applications that involve, most commonly, planning and scheduling, temporal databases, and healthcare. In this paper, we present a novel encoding of Interval Algebra using answer-set programming (ASP) extended by difference constraints, i.e., the fragment abbreviated as ASP(DL), and demonstrate its performance via a preliminary experimental evaluation. Although our ASP encoding is presented in the case of Allen’s calculus for the sake of clarity, we suggest that analogous encodings can be devised for other point-based calculi, too.
Machine learning and data mining techniques have been used extensively in order to detect credit card frauds. However, most studies consider credit card transactions as isolated events and not as a sequence of transactions. In this framework, we model a sequence of credit card transactions from three different perspectives, namely (i) The sequence contains or doesn’t contain a fraud (ii) The sequence is obtained by fixing the card-holder or the payment terminal (iii) It is a sequence of spent amount or of elapsed time between the current and previous transactions. Combinations of the three binary perspectives give eight sets of sequences from the (training) set of transactions. Each one of these sequences is modelled with a Hidden Markov Model (HMM). Each HMM associates a likelihood to a transaction given its sequence of previous transactions. These likelihoods are used as additional features in a Random Forest classifier for fraud detection. Our multiple perspectives HMM-based approach offers automated feature engineering to model temporal correlations so as to improve the effectiveness of the classification task and allows for an increase in the detection of fraudulent transactions when combined with the state of the art expert based feature engineering strategy for credit card fraud detection. In extension to previous works, we show that this approach goes beyond ecommerce transactions and provides a robust feature engineering over different datasets, hyperparameters and classifiers. Moreover, we compare strategies to deal with structural missing values.
We propose LaserTagger – a sequence tagging approach that casts text generation as a text editing task. Target texts are reconstructed from the inputs using three main edit operations: keeping a token, deleting it, and adding a phrase before the token. To predict the edit operations, we propose a novel model, which combines a BERT encoder with an autoregressive Transformer decoder. This approach is evaluated on English text on four tasks: sentence fusion, sentence splitting, abstractive summarization, and grammar correction. LaserTagger achieves new state-of-the-art results on three of these tasks, performs comparably to a set of strong seq2seq baselines with a large number of training examples, and outperforms them when the number of examples is limited. Furthermore, we show that at inference time tagging can be more than two orders of magnitude faster than comparable seq2seq models, making it more attractive for running in a live environment.
Several fundamental tasks in data science rely on computing an extremal eigenspace of size $r \ll n$, where $n$ is the underlying problem dimension. For example, spectral clustering and PCA both require the computation of the leading $r$-dimensional subspace. Often, this process is repeated over time due to the possible temporal nature of the data; e.g., graphs representing relations in a social network may change over time, and feature vectors may be added, removed or updated in a dataset. Therefore, it is important to efficiently carry out the computations involved to keep up with frequent changes in the underlying data and also to dynamically determine a reasonable size for the subspace of interest. We present a complete computational pipeline for efficiently updating spectral embeddings in a variety of contexts. Our basic approach is to ‘seed’ iterative methods for eigenproblems with the most recent subspace estimate to significantly reduce the computations involved, in contrast with a na\’ive approach which recomputes the subspace of interest from scratch at every step. In this setting, we provide various bounds on the number of iterations common eigensolvers need to perform in order to update the extremal eigenspace to a sufficient tolerance. We also incorporate a criterion for determining the size of the subspace based on successive eigenvalue ratios. We demonstrate the merits of our approach on the tasks of spectral clustering of temporally evolving graphs and PCA of an incrementally updated data matrix.
Online Analytical Processing (OLAP) comprises tools and algorithms that allow querying multidimensional databases. It is based on the multidimensional model, where data can be seen as a cube such that each cell contains one or more measures that can be aggregated along dimensions. In a Big Data scenario, traditional data warehousing and OLAP operations are clearly not sufficient to address current data analysis requirements, for example, social network analysis. Furthermore, OLAP operations and models can expand the possibilities of graph analysis beyond the traditional graph-based computation. Nevertheless, there is not much work on the problem of taking OLAP analysis to the graph data model. This paper proposes a formal multidimensional model for graph analysis, that considers the basic graph data, and also background information in the form of dimension hierarchies. The graphs in this model are node- and edge-labelled directed multi-hypergraphs, called graphoids, which can be defined at several different levels of granularity using the dimensions associated with them. Operations analogous to the ones used in typical OLAP over cubes are defined over graphoids. The paper presents a formal definition of the graphoid model for OLAP, proves that the typical OLAP operations on cubes can be expressed over the graphoid model, and shows that the classic data cube model is a particular case of the graphoid data model. Finally, a case study supports the claim that, for many kinds of OLAP-like analysis on graphs, the graphoid model works better than the typical relational OLAP alternative, and for the classic OLAP queries, it remains competitive.
A Natural Language Interface (NLI) facilitates users to pose queries to retrieve information from a database without using any artificial language such as the Structured Query Language (SQL). Several applications in various domains including healthcare, customer support and search engines, require elaborating structured data having information on text. Moreover, many issues have been explored including configuration complexity, processing of intensive algorithms, and popularity of relational databases, due to which translating natural language to database query has become a secondary area of investigation. The emerging trend of querying systems and speech-enabled interfaces revived natural language to database queries research area., The last survey published on this topic was six years ago in 2013. To best of our knowledge, there is no recent study found which discusses the current state of the art translations frameworks for natural language for structured and non-structured query languages. In this paper, we have reviewed 47 frameworks from 2008 to 2018. Out of 47, 35 were closely relevant to our work. SQL based frameworks have been categorized as statistical, symbolic and connectionist approaches. Whereas, NoSQL based frameworks have been categorized as semantic matching and pattern matching. These frameworks are then reviewed based on their supporting language, scheme of their heuristic rule, interoperability support, dataset scope and their overall performance score. The findings stated that 70% of the work in natural language to database querying has been carried out for SQL, and NoSQL share 15%, 10% and 5% of languages like SPAROL, CYPHER and GREMLIN respectively. It has also been observed that most of the frameworks support English language only.
This study proposes a Neural Attentive Bag-of-Entities model, which is a neural network model that performs text classification using entities in a knowledge base. Entities provide unambiguous and relevant semantic signals that are beneficial for capturing semantics in texts. We combine simple high-recall entity detection based on a dictionary, to detect entities in a document, with a novel neural attention mechanism that enables the model to focus on a small number of unambiguous and relevant entities. We tested the effectiveness of our model using two standard text classification datasets (i.e., the 20 Newsgroups and R8 datasets) and a popular factoid question answering dataset based on a trivia quiz game. As a result, our model achieved state-of-the-art results on all datasets. The source code of the proposed model will be available online at https://…/wikipedia2vec.
The uncertainties in future Bitcoin price make it difficult to accurately predict the price of Bitcoin. Accurately predicting the price for Bitcoin is therefore important for decision-making process of investors and market players in the cryptocurrency market. Using historical data from 01/01/2012 to 16/08/2019, machine learning techniques (Generalized linear model via penalized maximum likelihood, random forest, support vector regression with linear kernel, and stacking ensemble) were used to forecast the price of Bitcoin. The prediction models employed key and high dimensional technical indicators as the predictors. The performance of these techniques were evaluated using mean absolute percentage error (MAPE), root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R-squared). The performance metrics revealed that the stacking ensemble model with two base learner (random forest and generalized linear model via penalized maximum likelihood) and support vector regression with linear kernel as meta-learner was the optimal model for forecasting Bitcoin price. The MAPE, RMSE, MAE, and R-squared values for the stacking ensemble model were 0.0191%, 15.5331 USD, 124.5508 USD, and 0.9967 respectively. These values show a high degree of reliability in predicting the price of Bitcoin using the stacking ensemble model. Accurately predicting the future price of Bitcoin will yield significant returns for investors and market players in the cryptocurrency market.
Deep learning (DL) research yields accuracy and product improvements from both model architecture changes and scale: larger data sets and models, and more computation. For hardware design, it is difficult to predict DL model changes. However, recent prior work shows that as dataset sizes grow, DL model accuracy and model size grow predictably. This paper leverages the prior work to project the dataset and model size growth required to advance DL accuracy beyond human-level, to frontier targets defined by machine learning experts. Datasets will need to grow $33$$971 \times$, while models will need to grow $6.6$$456\times$ to achieve target accuracies. We further characterize and project the computational requirements to train these applications at scale. Our characterization reveals an important segmentation of DL training challenges for recurrent neural networks (RNNs) that contrasts with prior studies of deep convolutional networks. RNNs will have comparatively moderate operational intensities and very large memory footprint requirements. In contrast to emerging accelerator designs, large-scale RNN training characteristics suggest designs with significantly larger memory capacity and on-chip caches.
Accelerating research in the emerging field of deep graph learning requires new tools. Such systems should support graph as the core abstraction and take care to maintain both forward (i.e. supporting new research ideas) and backward (i.e. integration with existing components) compatibility. In this paper, we present Deep Graph Library (DGL). DGL enables arbitrary message handling and mutation operators, flexible propagation rules, and is framework agnostic so as to leverage high-performance tensor, autograd operations, and other feature extraction modules already available in existing frameworks. DGL carefully handles the sparse and irregular graph structure, deals with graphs big and small which may change dynamically, fuses operations, and performs auto-batching, all to take advantages of modern hardware. DGL has been tested on a variety of models, including but not limited to the popular Graph Neural Networks (GNN) and its variants, with promising speed, memory footprint and scalability.
Recently, consistency-based methods have achieved state-of-the-art results in semi-supervised learning (SSL). These methods always involve two roles, an explicit or implicit teacher model and a student model, and penalize predictions under different perturbations by a consistency constraint. However, the weights of these two roles are tightly coupled since the teacher is essentially an exponential moving average (EMA) of the student. In this work, we show that the coupled EMA teacher causes a performance bottleneck. To address this problem, we introduce Dual Student, which replaces the teacher with another student. We also define a novel concept, stable sample, following which a stabilization constraint is designed for our structure to be trainable. Further, we discuss two variants of our method, which produce even higher performance. Extensive experiments show that our method improves the classification performance significantly on several main SSL benchmarks. Specifically, it reduces the error rate of the 13-layer CNN from 16.84% to 12.39% on CIFAR-10 with 1k labels and from 34.10% to 31.56% on CIFAR-100 with 10k labels. In addition, our method also achieves a clear improvement in domain adaptation.
Model extraction allows an adversary to steal a copy of a remotely deployed machine learning model given access to its predictions. Adversaries are motivated to mount such attacks for a variety of reasons, ranging from reducing their computational costs, to eliminating the need to collect expensive training data, to obtaining a copy of a model in order to find adversarial examples, perform membership inference, or model inversion attacks. In this paper, we taxonomize the space of model extraction attacks around two objectives: \emph{accuracy}, i.e., performing well on the underlying learning task, and \emph{fidelity}, i.e., matching the predictions of the remote victim classifier on any input. To extract a high-accuracy model, we develop a learning-based attack which exploits the victim to supervise the training of an extracted model. Through analytical and empirical arguments, we then explain the inherent limitations that prevent any learning-based strategy from extracting a truly high-fidelity model—i.e., extracting a functionally-equivalent model whose predictions are identical to those of the victim model on all possible inputs. Addressing these limitations, we expand on prior work to develop the first practical functionally-equivalent extraction attack for direct extraction (i.e., without training) of a model’s weights. We perform experiments both on academic datasets and a state-of-the-art image classifier trained with 1 billion proprietary images. In addition to broadening the scope of model extraction research, our work demonstrates the practicality of model extraction attacks against production-grade systems.
Agents trained with deep reinforcement learning algorithms are capable of performing highly complex tasks including locomotion in continuous environments. In order to attain a human-level performance, the next step of research should be to investigate the ability to transfer the learning acquired in one task to a different set of tasks. Concerns on generalization and overfitting in deep reinforcement learning are not usually addressed in current transfer learning research. This issue results in underperforming benchmarks and inaccurate algorithm comparisons due to rudimentary assessments. In this study, we primarily propose regularization techniques in deep reinforcement learning for continuous control through the application of sample elimination and early stopping. First, the importance of the inclusion of training iteration to the hyperparameters in deep transfer learning problems will be emphasized. Because source task performance is not indicative of the generalization capacity of the algorithm, we start by proposing various transfer learning evaluation methods that acknowledge the training iteration as a hyperparameter. In line with this, we introduce an additional step of resorting to earlier snapshots of policy parameters depending on the target task due to overfitting to the source task. Then, in order to generate robust policies,we discard the samples that lead to overfitting via strict clipping. Furthermore, we increase the generalization capacity in widely used transfer learning benchmarks by using entropy bonus, different critic methods and curriculum learning in an adversarial setup. Finally, we evaluate the robustness of these techniques and algorithms on simulated robots in target environments where the morphology of the robot, gravity and tangential friction of the environment are altered from the source environment.
We present a new approach to modeling sequential data: the deep equilibrium model (DEQ). Motivated by an observation that the hidden layers of many existing deep sequence models converge towards some fixed point, we propose the DEQ approach that directly finds these equilibrium points via root-finding. Such a method is equivalent to running an infinite depth (weight-tied) feedforward network, but has the notable advantage that we can analytically backpropagate through the equilibrium point using implicit differentiation. Using this approach, training and prediction in these networks require only constant memory, regardless of the effective ‘depth’ of the network. We demonstrate how DEQs can be applied to two state-of-the-art deep sequence models: self-attention transformers and trellis networks. On large-scale language modeling tasks, such as the WikiText-103 benchmark, we show that DEQs 1) often improve performance over these state-of-the-art models (for similar parameter counts); 2) have similar computational requirements as existing models; and 3) vastly reduce memory consumption (often the bottleneck for training large sequence models), demonstrating an up-to 88% memory reduction in our experiments. The code is available at https://github. com/locuslab/deq .
We study the limit computability of finding a global optimum of a continuous function. We give a short proof to show that the problem of checking whether a point is a global minimum is not limit computable. Thereby showing the same for the problem of finding a global minimum. In the next part, we give an algorithm that converges to the global minima when a lower bound on the size of the basin of attraction of the global minima is known. We prove the convergence of this algorithm and provide some numerical experiments.
Dimensionality reduction on Riemannian manifolds is challenging due to the complex nonlinear data structures. While probabilistic principal geodesic analysis~(PPGA) has been proposed to generalize conventional principal component analysis (PCA) onto manifolds, its effectiveness is limited to data with a single modality. In this paper, we present a novel Gaussian latent variable model that provides a unique way to integrate multiple PGA models into a maximum-likelihood framework. This leads to a well-defined mixture model of probabilistic principal geodesic analysis (MPPGA) on sub-populations, where parameters of the principal subspaces are automatically estimated by employing an Expectation Maximization algorithm. We further develop a mixture Bayesian PGA (MBPGA) model that automatically reduces data dimensionality by suppressing irrelevant principal geodesics. We demonstrate the advantages of our model in the contexts of clustering and statistical shape analysis, using synthetic sphere data, real corpus callosum, and mandible data from human brain magnetic resonance~(MR) and CT images.
This thesis focuses on process mining on event data where such a normative specification is absent and, as a result, the event data is less structured. The thesis puts special emphasis on one application domain that fits this description: the analysis of smart home data where sequences of daily activities are recorded. In this thesis we propose a set of techniques to analyze such data, which can be grouped into two categories of techniques. The first category of methods focuses on preprocessing event logs in order to enable process discovery techniques to extract insights into unstructured event data. In this category we have developed the following techniques: – An unsupervised approach to refine event labels based on the time at which the event took place, allowing for example to distinguish recorded eating events into breakfast, lunch, and dinner. – An approach to detect and filter from event logs so-called chaotic activities, which are activities that cause process discovery methods to overgeneralize. – A supervised approach to abstract low-level events into more high-level events, where we show that there exist situations where process discovery approaches overgeneralize on the low-level event data but are able to find precise models on the high-level event data. The second category focuses on mining local process models, i.e., collections of process model patterns that each describe some frequent pattern, in contrast to the single global process model that is obtained with existing process discovery techniques. Several techniques are introduced in the area of local process model mining, including a basic method, fast but approximate heuristic methods, and constraint-based techniques.
Link prediction is one of the fundamental problems in social network analysis. A common set of techniques for link prediction rely on similarity metrics which use the topology of the observed subnetwork to quantify the likelihood of unobserved links. Recently, similarity metrics for link prediction have been shown to be vulnerable to attacks whereby observations about the network are adversarially modified to hide target links. We propose a novel approach for increasing robustness of similarity-based link prediction by endowing the analyst with a restricted set of reliable queries which accurately measure the existence of queried links. The analyst aims to robustly predict a collection of possible links by optimally allocating the reliable queries. We formalize the analyst problem as a Bayesian Stackelberg game in which they first choose the reliable queries, followed by an adversary who deletes a subset of links among the remaining (unreliable) queries by the analyst. The analyst in our model is uncertain about the particular target link the adversary attempts to hide, whereas the adversary has full information about the analyst and the network. Focusing on similarity metrics using only local information, we show that the problem is NP-Hard for both players, and devise two principled and efficient approaches for solving it approximately. Extensive experiments with real and synthetic networks demonstrate the effectiveness of our approach.
Neural networks enjoy widespread use, but many aspects of their training, representation, and operation are poorly understood. In particular, our view into the training process is limited, with a single scalar loss being the most common viewport into this high-dimensional, dynamic process. We propose a new window into training called Loss Change Allocation (LCA), in which credit for changes to the network loss is conservatively partitioned to the parameters. This measurement is accomplished by decomposing the components of an approximate path integral along the training trajectory using a Runge-Kutta integrator. This rich view shows which parameters are responsible for decreasing or increasing the loss during training, or which parameters ‘help’ or ‘hurt’ the network’s learning, respectively. LCA may be summed over training iterations and/or over neurons, channels, or layers for increasingly coarse views. This new measurement device produces several insights into training. (1) We find that barely over 50% of parameters help during any given iteration. (2) Some entire layers hurt overall, moving on average against the training gradient, a phenomenon we hypothesize may be due to phase lag in an oscillatory training process. (3) Finally, increments in learning proceed in a synchronized manner across layers, often peaking on identical iterations.
Everyone makes mistakes. So do human annotators when curating labels for named entity recognition (NER). Such label mistakes might hurt model training and interfere model comparison. In this study, we dive deep into one of the widely-adopted NER benchmark datasets, CoNLL03 NER. We are able to identify label mistakes in about 5.38% test sentences, which is a significant ratio considering that the state-of-the-art test F1 score is already around 93%. Therefore, we manually correct these label mistakes and form a cleaner test set. Our re-evaluation of popular models on this corrected test set leads to more accurate assessments, compared to those on the original test set. More importantly, we propose a simple yet effective framework, CrossWeigh, to handle label mistakes during NER model training. Specifically, it partitions the training data into several folds and train independent NER models to identify potential mistakes in each fold. Then it adjusts the weights of training data accordingly to train the final NER model. Extensive experiments demonstrate significant improvements of plugging various NER models into our proposed framework on three datasets. All implementations and corrected test set are available at our Github repo: https://…/CrossWeigh.
Word embeddings have demonstrated strong performance on NLP tasks. However, lack of interpretability and the unsupervised nature of word embeddings have limited their use within computational social science and digital humanities. We propose the use of informative priors to create interpretable and domain-informed dimensions for probabilistic word embeddings. Experimental results show that sensible priors can capture latent semantic concepts better than or on-par with the current state of the art, while retaining the simplicity and generalizability of using priors.
A key ingredient in branch and bound (B&B) solvers for mixed-integer programming (MIP) is the selection of branching variables since poor or arbitrary selection can affect the size of the resulting search trees by orders of magnitude. A recent article by Le Bodic and Nemhauser [Mathematical Programming, (2017)] investigated variable selection rules by developing a theoretical model of B&B trees from which they developed some new, effective scoring functions for MIP solvers. In their work, Le Bodic and Nemhauser left several open theoretical problems, solutions to which could guide the future design of variable selection rules. In this article, we first solve many of these open theoretical problems. We then implement an improved version of the model-based branching rules in SCIP 6.0, a modern open-source MIP solver, in which we observe an 11% geometric average time and node reduction on instances of the MIPLIB 2017 Benchmark Set that require large B&B trees.
Since the recent advent of deep reinforcement learning for game play and simulated robotic control, a multitude of new algorithms have flourished. Most are model-free algorithms which can be categorized into three families: deep Q-learning, policy gradients, and Q-value policy gradients. These have developed along separate lines of research, such that few, if any, code bases incorporate all three kinds. Yet these algorithms share a great depth of common deep reinforcement learning machinery. We are pleased to share rlpyt, which implements all three algorithm families on top of a shared, optimized infrastructure, in a single repository. It contains modular implementations of many common deep RL algorithms in Python using PyTorch, a leading deep learning library. rlpyt is designed as a high-throughput code base for small- to medium-scale research in deep RL. This white paper summarizes its features, algorithms implemented, and relation to prior work, and concludes with detailed implementation and usage notes. rlpyt is available at https://…/rlpyt.
Advertisements