Book Memo: “Machine Learning in Python”

Essential Techniques for Predictive Analysis
Machine Learning in Python shows you how to successfully analyze data using only two core machine learning algorithms, and how to apply them using Python. By focusing on two algorithm families that effectively predict outcomes, this book is able to provide full descriptions of the mechanisms at work, and the examples that illustrate the machinery with specific, hackable code. The algorithms are explained in simple terms with no complex math and applied using Python, with guidance on algorithm selection, data preparation, and using the trained models in practice. You will learn a core set of Python programming techniques, various methods of building predictive models, and how to measure the performance of each model to ensure that the right one is used. The chapters on penalized linear regression and ensemble methods dive deep into each of the algorithms, and you can use the sample code in the book to develop your own data analysis solutions. Machine learning algorithms are at the core of data analytics and visualization. In the past, these methods required a deep background in math and statistics, often in combination with the specialized R programming language. This book demonstrates how machine learning can be implemented using the more widely used and accessible Python programming language.

R Packages worth a look

Fit Sparse Linear Regression Models via Nonconvex Optimization (sparsenet)
Efficient procedure for fitting regularization paths between L1 and L0, using the MC+ penalty of Zhang, C.H. (2010)<doi:10.1214/09-AOS729>. Imple …

Spatio-Temporal Estimation and Prediction for Censored/Missing Responses (StempCens)
It estimates the parameters of a censored or missing data in spatio-temporal models using the SAEM algorithm (Delyon et al., 1999 <doi:10.1214/aos/1 …

Parallel Use of Statistical Packages in Teaching (rosetta)
When teaching statistics, it can often be desirable to uncouple the content from specific software packages. To easy such efforts, the Rosetta Stats we …

Book Memo: “Artificial Intelligence Brings Positive or Negative Impaction to”

Influence Human Job Nature
Advances in artificial intelligence (AI) technology is for the progress in critical areas, such as health, education, energy, economy inclusion, social welfare and the environment. Whether AI can bring positive or negative impaction to influence human job nature change.Thus, it brings this question: Whether (AI) robotic workers can be instead of traditional human workers in these different new markets to bring positive or negative impaction to change human job nature change? In recent years, machines had been used to be human’s tasks in the performance of certain tasks related to intelligence , such as aspects of image recognition. Experts also forecast that rapid progress in the field of specialized artificial intelligence will continue. Then, it also brings this question: Does (AI) exceed that of human performance on more and more tasks to replace human jobs? If it is truth, will some of human jobs to be disappeared? (AI) will be instead of human some simple jobs, then unemployment rate to the low skillful and low educated workers will be increased.Whether (AI) will be raised either production or performance or unemployment to bring human job market more advantages or more disadvantages? In my this book, I shall explain whether (AI) will bring benefits or disadvantages to human job market. I shall give example to let my readers to think how to support my final view point.

Whats new on arXiv

Probably the Best Itemsets

One of the main current challenges in itemset mining is to discover a small set of high-quality itemsets. In this paper we propose a new and general approach for measuring the quality of itemsets. The method is solidly founded in Bayesian statistics and decreases monotonically, allowing for efficient discovery of all interesting itemsets. The measure is defined by connecting statistical models and collections of itemsets. This allows us to score individual itemsets with the probability of them occuring in random models built on the data. As a concrete example of this framework we use exponential models. This class of models possesses many desirable properties. Most importantly, Occam’s razor in Bayesian model selection provides a defence for the pattern explosion. As general exponential models are infeasible in practice, we use decomposable models; a large sub-class for which the measure is solvable. For the actual computation of the score we sample models from the posterior distribution using an MCMC approach. Experimentation on our method demonstrates the measure works in practice and results in interpretable and insightful itemsets for both synthetic and real-world data.

Pipeline Leak Detection Systems and Data Fusion: A Survey

The pipeline leakage problem is very challenging and critical issue. Solving this problem will save the nation a lot of money, resources and more importantly, it will save the environment. This paper discusses the state-of-the-art of leak detection systems (LDSs) and data fusion approaches that are applicable to pipeline monitoring. A comparison of LDSs is performed based on well-defined criteria. We have classified and critically reviewed these techniques. A thorough analysis and comparison of all the recent works have been tabulated focusing on the capabilities and limitations of each technique.

Ask Not What AI Can Do, But What AI Should Do: Towards a Framework of Task Delegability

Although artificial intelligence holds promise for addressing societal challenges, issues of exactly which tasks to automate and the extent to do so remain understudied. We approach the problem of task delegability from a human-centered perspective by developing a framework on human perception of task delegation to artificial intelligence. We consider four high-level factors that can contribute to a delegation decision: motivation, difficulty, risk, and trust. To obtain an empirical understanding of human preferences in different tasks, we build a dataset of 100 tasks from academic papers, popular media portrayal of AI, and everyday life. For each task, we administer a survey to collect judgments of each factor and ask subjects to pick the extent to which they prefer AI involvement. We find little preference for full AI control and a strong preference for machine-in-the-loop designs, in which humans play the leading role. Our framework can effectively predict human preferences in degrees of AI assistance. Among the four factors, trust is the most predictive of human preferences of optimal human-machine delegation. This framework represents a first step towards characterizing human preferences of automation across tasks. We hope this work may encourage and aid in future efforts towards understanding such individual attitudes; our goal is to inform the public and the AI research community rather than dictating any direction in technology development.

Insertion Transformer: Flexible Sequence Generation via Insertion Operations

We present the Insertion Transformer, an iterative, partially autoregressive model for sequence generation based on insertion operations. Unlike typical autoregressive models which rely on a fixed, often left-to-right ordering of the output, our approach accommodates arbitrary orderings by allowing for tokens to be inserted anywhere in the sequence during decoding. This flexibility confers a number of advantages: for instance, not only can our model be trained to follow specific orderings such as left-to-right generation or a binary tree traversal, but it can also be trained to maximize entropy over all valid insertions for robustness. In addition, our model seamlessly accommodates both fully autoregressive generation (one insertion at a time) and partially autoregressive generation (simultaneous insertions at multiple locations). We validate our approach by analyzing its performance on the WMT 2014 English-German machine translation task under various settings for training and decoding. We find that the Insertion Transformer outperforms many prior non-autoregressive approaches to translation at comparable or better levels of parallelism, and successfully recovers the performance of the original Transformer while requiring only logarithmically many iterations during decoding.

Invariant-equivariant representation learning for multi-class data

Representations learnt through deep neural networks tend to be highly informative, but opaque in terms of what information they learn to encode. We introduce an approach to probabilistic modelling that learns to represent data with two separate deep representations: an invariant representation that encodes the information of the class from which the data belongs, and an equivariant representation that encodes the symmetry transformation defining the particular data point within the class manifold (equivariant in the sense that the representation varies naturally with symmetry transformations). This approach is based primarily on the strategic routing of data through the two latent variables, and thus is conceptually transparent, easy to implement, and in-principle generally applicable to any data comprised of discrete classes of continuous distributions (e.g. objects in images, topics in language, individuals in behavioural data). We demonstrate qualitatively compelling representation learning and competitive quantitative performance, in both supervised and semi-supervised settings, versus comparable modelling approaches in the literature with little fine tuning.

FSNet: Compression of Deep Convolutional Neural Networks by Filter Summary

We present a novel method of compression of deep Convolutional Neural Networks (CNNs). The proposed method reduces the number of parameters of each convolutional layer by learning a 3D tensor termed Filter Summary (FS). The convolutional filters are extracted from FS as overlapping 3D blocks, and nearby filters in FS share weights in their overlapping regions in a natural way. The resultant neural network based on such weight sharing scheme, termed Filter Summary CNNs or FSNet, has a FS in each convolution layer instead of a set of independent filters in the conventional convolution layer. FSNet has the same architecture as that of the baseline CNN to be compressed, and each convolution layer of FSNet generates the same number of filters from FS as that of the basline CNN in the forward process. Without hurting the inference speed, the parameter space of FSNet is much smaller than that of the baseline CNN. In addition, FSNet is compatible with weight quantization, leading to even higher compression ratio when combined with weight quantization. Experiments demonstrate the effectiveness of FSNet in compression of CNNs for computer vision tasks including image classification and object detection. For classification task, FSNet of 0.22M effective parameters has prediction accuracy of 93.91% on the CIFAR-10 dataset with less than 0.3% accuracy drop, using ResNet-18 of 11.18M parameters as baseline. Furthermore, FSNet version of ResNet-50 with 2.75M effective parameters achieves the top-1 and top-5 accuracy of 63.80% and 85.72% respectively on ILSVRC-12 benchmark. For object detection task, FSNet is used to compress the Single Shot MultiBox Detector (SSD300) of 26.32M parameters. FSNet of 0.45M effective parameters achieves mAP of 67.63% on the VOC2007 test data with weight quantization, and FSNet of 0.68M parameters achieves mAP of 70.00% with weight quantization on the same test data.

Discovering Context Effects from Raw Choice Data

Many applications in preference learning assume that decisions come from the maximization of a stable utility function. Yet a large experimental literature shows that individual choices and judgements can be affected by ‘irrelevant’ aspects of the context in which they are made. An important class of such contexts is the composition of the choice set. In this work, our goal is to discover such choice set effects from raw choice data. We introduce an extension of the Multinomial Logit (MNL) model, called the context dependent random utility model (CDM), which allows for a particular class of choice set effects. We show that the CDM can be thought of as a second-order approximation to a general choice system, can be inferred optimally using maximum likelihood and, importantly, is easily interpretable. We apply the CDM to both real and simulated choice data to perform principled exploratory analyses for the presence of choice set effects.

Accounting for Significance and Multicollinearity in Building Linear Regression Models

We derive explicit Mixed Integer Optimization (MIO) constraints, as opposed to iteratively imposing them in a cutting plane framework, that impose significance and avoid multicollinearity for building linear regression models. In this way we extend and improve the research program initiated in Bertsimas and King (2016) that imposes sparsity, robustness, pairwise collinearity and group sparsity explicitly and significance and avoiding multicollinearity iteratively. We present a variety of computational results on real and synthetic datasets that suggest that the proposed MIO has a significant computational edge compared to Bertsimas and King (2016) in accuracy, false detection rate and computational time in accounting for significance and multicollinearity as well as providing a holistic framework to produce regression models with desirable properties a priori.

Learning Ontologies with Epistemic Reasoning: The EL Case

We investigate the problem of learning description logic ontologies from entailments via queries, using epistemic reasoning. We introduce a new learning model consisting of epistemic membership and example queries and show that polynomial learnability in this model coincides with polynomial learnability in Angluin’s exact learning model with membership and equivalence queries. We then instantiate our learning framework to EL and show some complexity results for an epistemic extension of EL where epistemic operators can be applied over the axioms. Finally, we transfer known results for EL ontologies and its fragments to our learning model based on epistemic reasoning.

Automatic dimensionality selection for principal component analysis models with the ignorance score

Principal component analysis (PCA) is by far the most widespread tool for unsupervised learning with high-dimensional data sets. Its application is popularly studied for the purpose of exploratory data analysis and online process monitoring. Unfortunately, fine-tuning PCA models and particularly the number of components remains a challenging task. Today, this selection is often based on a combination of guiding principles, experience, and process understanding. Unlike the case of regression, where cross-validation of the prediction error is a widespread and trusted approach for model selection, there are no tools for PCA model selection which reach this level of acceptance. In this work, we address this challenge and evaluate the utility of the cross-validated ignorance score with both simulated and experimental data sets. Application of this method is based on the interpretation of PCA as a density model, as in probabilistic principal component analysis, and is shown to be a valuable tool to identify an optimal number of principal components.

The Influence of Time Series Distance Functions on Climate Networks

Network theory has established itself as an important tool for the analysis of complex systems such as the climate. In this context, climate networks are constructed using a spatiotemporal climate dataset and a time series distance function. It consists of representing each spatial area by a node and connecting nodes that present similar time series. One fundamental concern when building climate network is the definition of a metric that captures similarity between time series. The majority of papers in the literature use Pearson correlation with or without lag. Here we study the influence of 29 time series distance functions on climate network construction using global temperature data. We observed that the distance functions used in the literature generate similar networks in general while alternative ones generate distinct networks and exhibit different long-distance connection patterns (teleconnections). These patterns are highly important for the study of climate dynamics since they generally represent long-distance transportation of energy and can be used to forecast climatological events. Therefore, we consider that the measures here studied represent an alternative for the analysis of climate systems due to their capability of capturing different underlying dynamics, what may provide a better understanding of global climate.

PASTA: A Parallel Sparse Tensor Algorithm Benchmark Suite

Tensor methods have gained increasingly attention from various applications, including machine learning, quantum chemistry, healthcare analytics, social network analysis, data mining, and signal processing, to name a few. Sparse tensors and their algorithms become critical to further improve the performance of these methods and enhance the interpretability of their output. This work presents a sparse tensor algorithm benchmark suite (PASTA) for single- and multi-core CPUs. To the best of our knowledge, this is the first benchmark suite for sparse tensor world. PASTA targets on: 1) helping application users to evaluate different computer systems using its representative computational workloads; 2) providing insights to better utilize existed computer architecture and systems and inspiration for the future design. This benchmark suite is publicly released https://…/pasta.

Architecture Compression

In this paper we propose a novel approach to model compression termed Architecture Compression. Instead of operating on the weight or filter space of the network like classical model compression methods, our approach operates on the architecture space. A 1-D CNN encoder-decoder is trained to learn a mapping from discrete architecture space to a continuous embedding and back. Additionally, this embedding is jointly trained to regress accuracy and parameter count in order to incorporate information about the architecture’s effectiveness on the dataset. During the compression phase, we first encode the network and then perform gradient descent in continuous space to optimize a compression objective function that maximizes accuracy and minimizes parameter count. The final continuous feature is then mapped to a discrete architecture using the decoder. We demonstrate the merits of this approach on visual recognition tasks such as CIFAR-10, CIFAR-100, Fashion-MNIST and SVHN and achieve a greater than 20x compression on CIFAR-10.

Censored Quantile Regression Forests

Random forests are powerful non-parametric regression method but are severely limited in their usage in the presence of randomly censored observations, and naively applied can exhibit poor predictive performance due to the incurred biases. Based on a local adaptive representation of random forests, we develop its regression adjustment for randomly censored regression quantile models. Regression adjustment is based on new estimating equations that adapt to censoring and lead to quantile score whenever the data do not exhibit censoring. The proposed procedure named censored quantile regression forest, allows us to estimate quantiles of time-to-event without any parametric modeling assumption. We establish its consistency under mild model specifications. Numerical studies showcase a clear advantage of the proposed procedure.

Mini-batch learning of exponential family finite mixture models

Mini-batch algorithms have become increasingly popular due to the requirement for solving optimization problems, based on large-scale data sets. Using an existing online expectation–maximization (EM) algorithm framework, we demonstrate how mini-batch (EM) algorithms may be constructed, and propose a scheme for the stochastic stabilization of the constructed mini-batch algorithms. Theoretical results regarding the convergence of these mini-batch EM algorithms are presented. We then demonstrate how the mini-batch framework may be applied to conduct maximum likelihood (ML) estimation of mixtures of exponential family distributions, with emphasis on ML estimation for mixtures of normal distributions. Via a simulation study, we demonstrate that the mini-batch algorithm for mixtures of normal distributions can outperform the standard EM algorithm. Further evidence of the performance of the mini-batch framework is provided via an application to the famous MNIST data set.

Bayesian Nonparametric Adaptive Spectral Density Estimation for Financial Time Series

Discrimination between non-stationarity and long-range dependency is a difficult and long-standing issue in modelling financial time series. This paper uses an adaptive spectral technique which jointly models the non-stationarity and dependency of financial time series in a non-parametric fashion assuming that the time series consists of a finite, but unknown number, of locally stationary processes, the locations of which are also unknown. The model allows a non-parametric estimate of the dependency structure by modelling the auto-covariance function in the spectral domain. All our estimates are made within a Bayesian framework where we use aReversible Jump Markov Chain Monte Carlo algorithm for inference. We study the frequentist properties of our estimates via a simulation study, and present a novel way of generating time series data from a nonparametric spectrum. Results indicate that our techniques perform well across a range of data generating processes. We apply our method to a number of real examples and our results indicate that several financial time series exhibit both long-range dependency and non-stationarity.


We propose to learn curvature information for better generalization and fast model adaptation, called meta-curvature. Based on the model-agnostic meta-learner (MAML), we learn to transform the gradients in the inner optimization such that the transformed gradients achieve better generalization performance to a new task. For training large scale neural networks, we decompose the curvature matrix into smaller matrices and capture the dependencies of the model’s parameters with a series of tensor products. We demonstrate the effects of our proposed method on both few-shot image classification and few-shot reinforcement learning tasks. Experimental results show consistent improvements on classification tasks and promising results on reinforcement learning tasks. Furthermore, we observe faster convergence rates of the meta-training process. Finally, we present an analysis that explains better generalization performance with the meta-trained curvature.

Improved Knowledge Distillation via Teacher Assistant: Bridging the Gap Between Student and Teacher

Despite the fact that deep neural networks are powerful models and achieve appealing results on many tasks, they are too gigantic to be deployed on edge devices like smart-phones or embedded sensor nodes. There has been efforts to compress these networks, and a popular method is knowledge distillation, where a large (a.k.a. teacher) pre-trained network is used to train a smaller (a.k.a. student) network. However, in this paper, we show that the student network performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher, or in other words, a teacher can effectively transfer its knowledge to students up to a certain size, not smaller. To alleviate this shortcoming, we introduce multi-step knowledge distillation which employs an intermediate-sized network (a.k.a. teacher assistant) to bridge the gap between the student and the teacher. We study the effect of teacher assistant size and extend the framework to multi-step distillation. Moreover, empirical and theoretical analysis are conducted to analyze the teacher assistant knowledge distillation framework. Extensive experiments on CIFAR-10 and CIFAR-100 datasets and plain CNN and ResNet architectures substantiate the effectiveness of our proposed approach.

A new simple and effective measure for bag-of-word inter-document similarity measurement

To measure the similarity of two documents in the bag-of-words (BoW) vector representation, different term weighting schemes are used to improve the performance of cosine similarity—the most widely used inter-document similarity measure in text mining. In this paper, we identify the shortcomings of the underlying assumptions of term weighting in the inter-document similarity measurement task; and provide a more fit-to-the-purpose alternative. Based on this new assumption, we introduce a new simple but effective similarity measure which does not require explicit term weighting. The proposed measure employs a more nuanced probabilistic approach than those used in term weighting to measure the similarity of two documents w.r.t each term occurring in the two documents. Our empirical comparison with the existing similarity measures using different term weighting schemes shows that the new measure produces (i) better results in the binary BoW representation; and (ii) competitive and more consistent results in the term-frequency-based BoW representation.

Low-pass filtering as Bayesian inference

We propose a Bayesian nonparametric method for low-pass filtering that can naturally handle unevenly-sampled and noise-corrupted observations. The proposed model is constructed as a latent-factor model for time series, where the latent factors are Gaussian processes with non-overlapping spectra. With this construction, the low-pass version of the time series can be identified as the low-frequency latent component, and therefore it can be found by means of Bayesian inference. We show that the model admits exact training and can be implemented with minimal numerical approximations. Finally, the proposed model is validated against standard linear filters on synthetic and real-world time series.

Contextual Recurrent Neural Networks

There is an implicit assumption that by unfolding recurrent neural networks (RNN) in finite time, the misspecification of choosing a zero value for the initial hidden state is mitigated by later time steps. This assumption has been shown to work in practice and alternative initialization may be suggested but often overlooked. In this paper, we propose a method of parameterizing the initial hidden state of an RNN. The resulting architecture, referred to as a Contextual RNN, can be trained end-to-end. The performance on an associative retrieval task is found to improve by conditioning the RNN initial hidden state on contextual information from the input sequence. Furthermore, we propose a novel method of conditionally generating sequences using the hidden state parameterization of Contextual RNN.

Equivalence of regression curves sharing common parameters

In clinical trials the comparison of two different populations is a frequently addressed problem. Non-linear (parametric) regression models are commonly used to describe the relationship between covariates as the dose and a response variable in the two groups. In some situations it is reasonable to assume some model parameters to be the same, for instance the placebo effect or the maximum treatment effect. In this paper we develop a (parametric) bootstrap test to establish the similarity of two regression curves sharing some common parameters. We show by theoretical arguments and by means of a simulation study that the new test controls its level and achieves a reasonable power. Moreover, it is demonstrated that under the assumption of common parameters a considerable more powerful test can be constructed compared to the test which does not use this assumption. Finally, we illustrate potential applications of the new methodology by a clinical trial example.

Quickest Change Detection in the Presence of a Nuisance Change

In the quickest change detection problem in which both nuisance and critical changes may occur, the objective is to detect the critical change as quickly as possible without raising an alarm when either there is no change or a nuisance change has occurred. A window-limited sequential change detection procedure based on the generalized likelihood ratio test statistic is proposed. A recursive update scheme for the proposed test statistic is developed and is shown to be asymptotically optimal under mild technical conditions. In the scenario where the post-change distribution belongs to a parametrized family, a generalized stopping time and a lower bound on its average run length are derived. The proposed stopping rule is compared with the naive 2-stage procedure that detects the nuisance or critical change using separate CuSum stopping procedures for the nuisance and critical changes. Simulations demonstrate that the proposed rule outperforms the 2-stage procedure, and experiments on a real dataset on bearing failure verify the performance of the proposed stopping time.

Passing Tests without Memorizing: Two Models for Fooling Discriminators

We introduce two mathematical frameworks for foolability in the context of generative distribution learning. In a nuthsell, fooling is an algorithmic task in which the input sample is drawn from some target distribution and the goal is to output a synthetic distribution that is indistinguishable from the target w.r.t to some fixed class of tests. This framework received considerable attention in the context of Generative Adversarial Networks (GANs), a recently proposed approach which achieves impressive empirical results. From a theoretical viewpoint this problem seems difficult to model. This is due to the fact that in its basic form, the notion of foolability is susceptible to a type of overfitting called memorizing. This raises a challenge of devising notions and definitions that separate between fooling algorithms that generate new synthetic data vs. algorithms that merely memorize or copy the training set. The first model we consider is called GAM–Foolability and is inspired by GANs. Here the learner has only an indirect access to the target distribution via a discriminator. The second model, called DP–Foolability, exploits the notion of differential privacy as a candidate criterion for non-memorization. We proceed to characterize foolability within these two models and study their interrelations. We show that DP–Foolability implies GAM–Foolability and prove partial results with respect to the converse. It remains, though, an open question whether GAM–Foolability implies DP–Foolability. We also present an application in the context of differentially private PAC learning. We show that from a statistical perspective, for any class H, learnability by a private proper learner is equivalent to the existence of a private sanitizer for H. This can be seen as an analogue of the equivalence between uniform convergence and learnability in classical PAC learning.

A combinatorial identity

We prove an interesting combinatorial identity, which we came across in counting contributions from forest graphs to a cluster expansion for classical gas correlation functions, but may be of more general interest.

The Perils of Detecting Measurement Faults in Environmental Monitoring Networks

Scientists deploy environmental monitoring net-works to discover previously unobservable phenomena and quantify subtle spatial and temporal differences in the physical quantities they measure. Our experience, shared by others, has shown that measurements gathered by such networks are perturbed by sensor faults. In response, multiple fault detection techniques have been proposed in the literature. However, in this paper we argue that these techniques may misclassify events (e.g. rain events for soil moisture measurements) as faults, potentially discarding the most interesting measurements. We support this argument by applying two commonly used fault detection techniques on data collected from a soil monitoring network. Our results show that in this case, up to 45% of the event measurements are misclassified as faults. Furthermore, tuning the fault detection algorithms to avoid event misclassification, causes them to miss the majority of actual faults. In addition to exposing the tension between fault and event detection, our findings motivate the need to develop novel fault detection mechanisms which incorporate knowledge of the underlying events and are customized to the sensing modality they monitor.

Assessing the Local Interpretability of Machine Learning Models

The increasing adoption of machine learning tools has led to calls for accountability via model interpretability. But what does it mean for a machine learning model to be interpretable by humans, and how can this be assessed? We focus on two definitions of interpretability that have been introduced in the machine learning literature: simulatability (a user’s ability to run a model on a given input) and ‘what if’ local explainability (a user’s ability to correctly indicate the outcome to a model under local changes to the input). Through a user study with 1000 participants, we test whether humans perform well on tasks that mimic the definitions of simulatability and ‘what if’ local explainability on models that are typically considered locally interpretable. We find evidence consistent with the common intuition that decision trees and logistic regression models are interpretable and are more interpretable than neural networks. We propose a metric – the runtime operation count on the simulatability task – to indicate the relative interpretability of models and show that as the number of operations increases the users’ accuracy on the local interpretability tasks decreases.

Multi-Domain Translation by Learning Uncoupled Autoencoders

Multi-domain translation seeks to learn a probabilistic coupling between marginal distributions that reflects the correspondence between different domains. We assume that data from different domains are generated from a shared latent representation based on a structural equation model. Under this assumption, we show that the problem of computing a probabilistic coupling between marginals is equivalent to learning multiple uncoupled autoencoders that embed to a given shared latent distribution. In addition, we propose a new framework and algorithm for multi-domain translation based on learning the shared latent distribution and training autoencoders under distributional constraints. A key practical advantage of our framework is that new autoencoders (i.e., new domains) can be added sequentially to the model without retraining on the other domains, which we demonstrate experimentally on image as well as genomics datasets.

Scalable Fair Clustering

We study the fair variant of the classic k-median problem introduced by Chierichetti et al. [2017]. In the standard k-median problem, given an input pointset P, the goal is to find k centers C and assign each input point to one of the centers in C such that the average distance of points to their cluster center is minimized. In the fair variant of k-median, the points are colored, and the goal is to minimize the same average distance objective while ensuring that all clusters have an ‘approximately equal’ number of points of each color. Chierichetti et al. proposed a two-phase algorithm for fair k-clustering. In the first step, the pointset is partitioned into subsets called fairlets that satisfy the fairness requirement and approximately preserve the k-median objective. In the second step, fairlets are merged into k clusters by one of the existing k-median algorithms. The running time of this algorithm is dominated by the first step, which takes super-quadratic time. In this paper, we present a practical approximate fairlet decomposition algorithm that runs in nearly linear time. Our algorithm additionally allows for finer control over the balance of resulting clusters than the original work. We complement our theoretical bounds with empirical evaluation.

Semantically Enhanced Time Series Databases in IoT-Edge-Cloud Infrastructure

Many IoT systems are data intensive and are for the purpose of monitoring for fault detection and diagnosis of critical systems. A large volume of data steadily come out of a large number of sensors in the monitoring system. Thus, we need to consider how to store and manage these data. Existing time series databases (TSDBs) can be used for monitoring data storage, but they do not have good models for describing the data streams stored in the database. In this paper, we develop a semantic model for the specification of the monitoring data streams (time series data) in terms of which sensor generated the data stream, which metric of which entity the sensor is monitoring, what is the relation of the entity to other entities in the system, which measurement unit is used for the data stream, etc. We have also developed a tool suite, SE-TSDB, that can run on top of existing TSDBs to help establish semantic specifications for data streams and enable semantic-based data retrievals. With our semantic model for monitoring data and our SE-TSDB tool suite, users can retrieve non-existing data streams that can be automatically derived from the semantics. Users can also retrieve data streams without knowing where they are. Semantic based retrieval is especially important in a large-scale integrated IoT-Edge-Cloud system, because of its sheer quantity of data, its huge number of computing and IoT devices that may store the data, and the dynamics in data migration and evolution. With better data semantics, data streams can be more effectively tracked and flexibly retrieved to help with timely data analysis and control decision making anywhere and anytime.

Feature Selection for multi-labeled variables via Dependency Maximization

Feature selection and reducing the dimensionality of data is an essential step in data analysis. In this work, we propose a new criterion for feature selection that is formulated as conditional theoretical information between features given the labeled variable. Instead of using the standard mutual information measure based on Kullback-Leibler divergence, we use our proposed criterion to filter out redundant features. This approach results in an efficient and fast non-parametric implementation of feature selection as it can be directly estimated using a geometric measure of dependency, the global Friedman-Rafsky (FR) multivariate run test statistic constructed by a global minimal spanning tree (MST). We demonstrate the advantages of our proposed feature selection approach through simulation. In addition, the proposed feature selection method is applied to the MNIST data set and compared with [1].

Distilled News

Introduction to Augmented Random Search.

This article is based on a paper published on March, 2018 by Horia Mania, Aurelia Guy, and Benjamin Recht from the University of California, Berkeley. The authors assert that they have built an algorithm that is at least 15 times more efficient than the fastest competing model-free methods on MuJoCo locomotion benchmarks. They baptised the algorithm as Augmented Random Search or ARS for short.

Pump up the Volumes: Data in Docker

This article is about using data with Docker. In it, we’ll focus on Docker volumes. Check out the previous articles in the series if you haven’t yet. We covered Docker concepts, the ecosystem, Dockerfiles, slimming down images, and popular commands.

Great Books for Data Science

There is no shortage of books that promise to teach data science. Most of these books read like college textbooks with a wealth of technical material prefaced by a short conceptual introduction. While there is plenty of demand for these technical skills, really good data scientists need a mix of conceptual understanding and curiosity too before they can make the most of their tools. Each of these books approaches the concept of data science differently: some are about what data science is, some are about how professional data scientists think, some introduce perspectives that may change how you view the world, and one even contains the full history of information.
1. Weapons of Math Destruction
2. The Information
3. Everybody Lies
4. Dataclysm
5. Big Data
6. Algorithms to Live By
7. The Signal and the Noise
8. Complex Adaptive Systems
9. Antifragile/Incerto
10. Superforecasting

Learning from Graph data using Keras and Tensorflow

There is a lot of data out there that can be represented in the form of a graph in real-world applications like in Citation Networks, Social Networks (Followers graph, Friends network, … ), Biological Networks or Telecommunications. Using Graph extracted features can boost the performance of predictive models by relying of information flow between neighboring nodes. However, representing graph data is not straightforward especially if we don’t intend to implement hand-crafted features since most ML models expect a fixed sized or linear input which is not the case for graph data. In this post we will explore some ways to deal with generic graphs to do node classification based on graph representations learned directly from data.

5-Steps and 10-Steps, to Learn Machine Learning.

There is a 5-Step shortcut that you can do to be able to solve machine learning problems right away, As a beginner, you can take this path at first if you want to get something done with machine learning. And then you can take the 10-steps path to be a data scientist or a more advanced machine learning engineer.

Test Time Augmentation (TTA) and how to perform it with Keras

There exists a lot of ways to improve the results of a neural network by changing the way we train it. In particular, Data Augmentation is a common practice to virtually increase the size of training dataset, and is also used as a regularization technique, making the model more robust to slight changes in the input data. Data Augmentation is the process of randomly applying some operations (rotation, zoom, shift, flips,…) to the input data. By this mean, the model is never shown twice the exact same example and has to learn more general features about the classes he has to recognize.

Learning generic sentence representation by various NLP tasks

There are a lot of papers introduced different way to compute a better text representation. From character embeddings, word embeddings to sentence embeddings, I introduced different way to deal with it. However, most of them are training by single data set or problem domain. Subramania et al. explored a simple, effective multi-task learning approach for sentence embeddings in this time. The hypothesize is that good sentence representations can be learnt from a vast amount of weakly related tasks.

Text Encoding: A Review

The key to perform any text mining operation, such as topic detection or sentiment analysis, is to transform words into numbers, sequences of words into sequences of numbers. Once we have numbers, we are back in the well-known game of data analytics, where machine learning algorithms can help us with classifying and clustering.

The goal of this post is to explain a powerful algorithm in statistical analysis: the Expectation-Maximization (EM) algorithm. It is powerful in the sense that it has the ability to deal with missing data and unobserved features, the use-cases for which come up frequently in many real-world applications.

So, what is AI really?

So, what is AI really? This post tries to give some guidance. The traditional way coding a computer program worked was through carefully analyzing the problem, trying to separate its different parts into simpler sub-problems, put the solution into an algorithmic form (kind of a recipe) and finally code it. Let’s have a look at an example!

RFM Analysis in R

RFM (recency, frequency, monetary) analysis is a behavior based technique used to segment customers by examining their transaction history such as
• how recently a customer has purchased (recency)
• how often they purchase (frequency)
• how much the customer spends (monetary)
It is based on the marketing axiom that 80% of your business comes from 20% of your customers. RFM helps to identify customers who are more likely to respond to promotions by segmenting them into various categories.

Introducing PlaNet: A Deep Planning Network for Reinforcement Learning

Research into how artificial agents can improve their decisions over time is progressing rapidly via reinforcement learning (RL). For this technique, an agent observes a stream of sensory inputs (e.g. camera images) while choosing actions (e.g. motor commands), and sometimes receives a reward for achieving a specified goal. Model-free approaches to RL aim to directly predict good actions from the sensory observations, enabling DeepMind’s DQN to play Atari and other agents to control robots. However, this blackbox approach often requires several weeks of simulated interaction to learn through trial and error, limiting its usefulness in practice. Model-based RL, in contrast, attempts to have agents learn how the world behaves in general. Instead of directly mapping observations to actions, this allows an agent to explicitly plan ahead, to more carefully select actions by ‘imagining’ their long-term outcomes. Model-based approaches have achieved substantial successes, including AlphaGo, which imagines taking sequences of moves on a fictitious board with the known rules of the game. However, to leverage planning in unknown environments (such as controlling a robot given only pixels as input), the agent must learn the rules or dynamics from experience. Because such dynamics models in principle allow for higher efficiency and natural multi-task learning, creating models that are accurate enough for successful planning is a long-standing goal of RL.


Rant is an all-purpose procedural text engine that is most simply described as the opposite of Regex. It has been refined to include a dizzying array of features for handling everything from the most basic of string generation tasks to advanced dialogue generation, code templating, automatic formatting, and more. The goal of the project is to enable developers of all kinds to automate repetitive writing tasks with a high degree of creative freedom.

If you did not already know

Argument Harvesting google
Much research in computational argumentation assumes that arguments and counterarguments can be obtained in some way. Yet, to improve and apply models of argument, we need methods for acquiring them. Current approaches include argument mining from text, hand coding of arguments by researchers, or generating arguments from knowledge bases. In this paper, we propose a new approach, which we call argument harvesting, that uses a chatbot to enter into a dialogue with a participant to get arguments and counterarguments from him or her. Because it is automated, the chatbot can be used repeatedly in many dialogues, and thereby it can generate a large corpus. We describe the architecture of the chatbot, provide methods for managing a corpus of arguments and counterarguments, and an evaluation of our approach in a case study concerning attitudes of women to participation in sport. …

Bounded-Abstention Method With two Constraints of Reject Rates (BA2) google
Abstaining classificaiton aims to reject to classify the easily misclassified examples, so it is an effective approach to increase the clasificaiton reliability and reduce the misclassification risk in the cost-sensitive applications. In such applications, different types of errors (false positive or false negative) usaully have unequal costs. And the error costs, which depend on specific applications, are usually unknown. However, current abstaining classification methods either do not distinguish the error types, or they need the cost information of misclassification and rejection, which are realized in the framework of cost-sensitive learning. In this paper, we propose a bounded-abstention method with two constraints of reject rates (BA2), which performs abstaining classification when error costs are unequal and unknown. BA2 aims to obtain the optimal area under the ROC curve (AUC) by constraining the reject rates of the positive and negative classes respectively. Specifically, we construct the receiver operating characteristic (ROC) curve, and stepwise search the optimal reject thresholds from both ends of the curve, untill the two constraints are satisfied. Experimental results show that BA2 obtains higher AUC and lower total cost than the state-of-the-art abstaining classification methods. Meanwhile, BA2 achieves controllable reject rates of the positive and negative classes. …

Lifted Neural Network google
We describe a novel family of models of multi- layer feedforward neural networks in which the activation functions are encoded via penalties in the training problem. Our approach is based on representing a non-decreasing activation function as the argmin of an appropriate convex optimization problem. The new framework allows for algorithms such as block-coordinate descent methods to be applied, in which each step is composed of a simple (no hidden layer) supervised learning problem that is parallelizable across data points and/or layers. Experiments indicate that the proposed models provide excellent initial guesses for weights for standard neural networks. In addition, the model provides avenues for interesting extensions, such as robustness against noisy inputs and optimizing over parameters in activation functions. …

Document worth reading: “Complex Contagions: A Decade in Review”

Since the publication of ‘Complex Contagions and the Weakness of Long Ties’ in 2007, complex contagions have been studied across an enormous variety of social domains. In reviewing this decade of research, we discuss recent advancements in applied studies of complex contagions, particularly in the domains of health, innovation diffusion, social media, and politics. We also discuss how these empirical studies have spurred complementary advancements in the theoretical modeling of contagions, which concern the effects of network topology on diffusion, as well as the effects of individual-level attributes and thresholds. In synthesizing these developments, we suggest three main directions for future research. The first concerns the study of how multiple contagions interact within the same network and across networks, in what may be called an ecology of contagions. The second concerns the study of how the structure of thresholds and their behavioral consequences can vary by individual and social context. The third area concerns the roles of diversity and homophily in the dynamics of complex contagion, including both diversity of demographic profiles among local peers, and the broader notion of structural diversity within a network. Throughout this discussion, we make an effort to highlight the theoretical and empirical opportunities that lie ahead. Complex Contagions: A Decade in Review

Distilled News

An Introduction to Bayesian Reasoning

You might be using Bayesian techniques in your data science without knowing it! And if you’re not, then it could enhance the power of your analysis. This blog post, part 1 of 2, will demonstrate how Bayesians employ probability distributions to add information when fitting models, and reason about uncertainty of the model’s fit.

What the heck is Word Embedding

Word Embedding is really all about improving the ability of networks to learn from text data. By representing that data as lower dimensional vectors. These vectors are called Embedding. This technique is used to reduce the dimensionality of text data but these models can also learn some interesting traits about words in a vocabulary.

Better Language Models and Their Implications

We’ve trained a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization – all without task-specific training.

Sustainable Industry: Rinse Over Run – Benchmark

We’re really excited to launch our latest competition! In addition to an interesting, new prize structure, the subject matter is at the intersection of sustainability and industry. Improvements to these kinds of processes can have upside for both a business and the planet. The presence of particles, bacteria, allergens, or other foreign material in a food or beverage product can put consumers at risk. Manufacturers put extra care into ensuring that equipment is properly cleaned between uses to avoid any contamination. At the same time, the cleaning processes require substantial resources in the form of time and cleaning supplies, which are often water and chemical mixtures (e.g. caustic soda, acid, etc.). Given these concerns, the cleaning stations measure turbidity during the cleaning process. Turbidity quanitifies the suspended solids in the liquids that are coming out of the cleaning tank. The goal is to have those liquids be turbidity free, indicating that the equipment is fully clean. Depending on the expected level of turbidity, a cleaning station operator can either extend the final rinse (to eliminate remaining turbidity) or shorten it (saving time and water consumption). The goal of this competition is to predict turbidity in the last rinsing phase in order to help minimize the use of water, energy and time, while ensuring high cleaning standards.

An Introduction to Scala

Welcome to the first article in my multi-part series on Scala. This tutorial is designed to teach introductory Scala as a precursor to learning Apache Spark.

Introducing Uber’s Ludwig

Uber continues its spree of deep learning technology releases. Since last year, the Uber AI Labs team has open sourced different frameworks that enable many of the fundamental building blocks of deep learning solutions. The productivity of the Uber engineering team is nothing short of impressive: Pyro is a framework for probabilistic programming built on top of PyTorch, Horovod is a Tensor-Flow based framework for distributed learning, Manifold focused on visual debugging and interpretability and, of course, Michelangelo is a reference architecture for large scale machine learning solutions. The latest creation of Uber AI Labs is Ludwig, a toolbox for training deep learning models without writing any code. Training is one of the most developer intensive aspects of deep learning applications. Typically, data scientists spend numerous hours experimenting with different deep learning models to better perform about a specific training datasets. This process involves more than just training including several other aspects such as model comparison, evaluation, workload distribution and many others. Given its highly technical nature, the training of deep learning models is an activity typically constrained to data scientists and machine learning experts and includes a significant volume of code. While this problem can be generalized for any machine learning solution it has gotten way worse in deep learning architectures as they typically involve many layers and levels. Simplifying the training processes is the number one factor that can streamline the experimentation phase in deep learning solutions.

The Most Amount of Rain over a 10 Day Period on Record

Townsville, Qld, has been inundated with torrential rain and has broken the record of the largest rainfall over a 10 day period. It has been devastating for the farmers and residents of Townsville. I looked at Townsville’s weather data to understand how significant this event was and if there have been comparable events in the past.

ExFaKT: a framework for explaining facts over knowledge graphs and text

Today’s paper choice focuses on the topical area of fact-checking : how do we know whether a candidate fact, which might for example be harvested from a news article or social media post, is likely to be true? For the first generation of knowledge graphs, fact checking was performed manually by human reviewers, but this clearly doesn’t scale to the volume of information published daily. Automated fact checking methods typically produce a numerical score (probability the fact is true), but these scores are hard to understand and justify without a corresponding explanation.

Introduction to gradient boosting on decision trees with Catboost

Today I would like to share my experience with open source machine learning library, based on gradient boosting on decision trees, developed by Russian search engine company?-?Yandex.

Fondations of Machine Learning, part 5

This post is the nineth (and probably last) one of our series on the history and foundations of econometric and machine learning models. The first fours were on econometrics techniques.

Fondations of Machine Learning, part 4

This post is the eighth one of our series on the history and foundations of econometric and machine learning models. The first fours were on econometrics techniques.

Data-driven Introspection of my Android Mobile usage in R

This is an attempt to see how the data that are collected from us, can also be used for the betterment of us – one’s self. When companies are so interested in collecting our personal data to show a push in Quarterly revenues, Why not use our own Data Science skills and get some useful insight that can help our life.

Accelerating Time Series Analysis with Automated Machine Learning

This IDC Solution Spotlight examines how automated machine learning tools can augment the analysis, modeling, and prediction of time series data to deliver easily understood and actionable insights for businesses in a simple and agile fashion. Get the report now.

Do you know what I mean?

This deceptively simple game demonstrates one of the hardest problems in artificial intelligence. How should a person reason about other people’s reasoning? How should an agent model other agents? The way a person reasons about another person’s thinking is called the theory of mind (ToM). It’s an awareness that other people can also reason and act according to their reasoning. It’s the realization that others have goals and beliefs about the world and that these are dependent on the other agent’s mental state about the world. This mental state is often sufficient to predict behavior and thus is a nice thing to have. Note that these mental states can be formed via false beliefs about the world.

Book Memo: “Artificial Intelligence Brings Positive or Negative Impaction to”

Influence Human Job Nature
Advances in artificial intelligence (AI) technology is for the progress in critical areas, such as health, education, energy, economy inclusion, social welfare and the environment. Whether AI can bring positive or negative impaction to influence human job nature change.Thus, it brings this question: Whether (AI) robotic workers can be instead of traditional human workers in these different new markets to bring positive or negative impaction to change human job nature change? In recent years, machines had been used to be human’s tasks in the performance of certain tasks related to intelligence , such as aspects of image recognition. Experts also forecast that rapid progress in the field of specialized artificial intelligence will continue. Then, it also brings this question: Does (AI) exceed that of human performance on more and more tasks to replace human jobs? If it is truth, will some of human jobs to be disappeared? (AI) will be instead of human some simple jobs, then unemployment rate to the low skillful and low educated workers will be increased.Whether (AI) will be raised either production or performance or unemployment to bring human job market more advantages or more disadvantages? In my this book, I shall explain whether (AI) will bring benefits or disadvantages to human job market. I shall give example to let my readers to think how to support my final view point.