The Essential Data Science Venn Diagram

A deeper examination of the interdisciplinary interplay involved in data science, focusing on automation, validity and intuition.


20 Handbooks on Modern Statistical Methods

1. Handbook of Environmental and Ecological Statistics
2. Handbook of Mixture Analysis
3. Handbook of Graphical Models
4. Handbook of Approximate Bayesian Computation
5. Handbook of Statistical Methods for Case-Control Studies
6. Handbook of Quantile Regression
7. Handbook of Methods for Designing, Monitoring, and Analyzing Dose-Finding Trials
8. Handbook of Statistical Methods and Analyses in Sports
9. Handbook of Neuroimaging Data Analysis
10. Handbook of Spatial Epidemiology
11. Handbook of Big Data
12. Handbook of Discrete-Valued Time Series
13. Handbook of Cluster Analysis
14. Handbook of Design and Analysis of Experiments
15. Handbook of Missing Data Methodology
16. Handbook of Mixed Membership Models and Their Applications
17. Handbook of Survival Analysis
18. Handbook of Markov Chain Monte Carlo
19. Handbook of Spatial Statistics
20. Longitudinal Data Analysis


Evolution of Chatbots & their Performance

Chatbots are the hot thing right now. Many surveys claim that XX% of companies plan to deploy a chatbot within Y years. Consider the infographic below, which shows the chatbot market value in the US in millions of dollars. But where did it all start? And where are we now? In this article, we try to answer these questions.


Learning and statistical model checking of system response times

As computers have become increasingly powerful, users are less willing to accept slow system responses. Performance testing is therefore important for interactive systems. However, it is still challenging to test whether a system provides acceptable performance or satisfies given response-time limits, especially across different usage scenarios. On the one hand, performance-testing techniques require numerous costly tests of the system. On the other hand, model-based performance analysis methods suffer from doubtful model quality. Hence, we propose a combined method to mitigate these issues. We learn response-time distributions from test data in order to augment existing behavioral models with timing aspects. Then, we perform statistical model checking with the resulting model to predict performance. Finally, we test the accuracy of our prediction with hypothesis testing on the real system. Our method is implemented in a property-based testing tool with integrated statistical model-checking algorithms. We demonstrate the feasibility of our techniques in an industrial case study with a web-service application.
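
To make the learn-then-check idea concrete, here is a minimal Python sketch (not the authors' property-based-testing tool) of two of the steps: fitting a response-time distribution to measured samples, and checking a response-time requirement against the real system with a one-sided binomial test. The sample values, the 0.8-second limit, the 90% target, the lognormal choice, and SciPy ≥ 1.7 (for `binomtest`) are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical response times (seconds) measured by testing the real system.
samples = np.array([0.21, 0.35, 0.48, 0.30, 0.95, 0.27, 0.40, 0.33, 0.52, 0.61])

# Learning step: fit a response-time distribution (here a lognormal) that could
# be used to annotate a behavioral model with timing aspects.
shape, loc, scale = stats.lognorm.fit(samples, floc=0)
print("predicted P(response <= 0.8s):",
      stats.lognorm.cdf(0.8, shape, loc=loc, scale=scale))

# Checking step: test the requirement "at least 90% of responses finish within
# 0.8s" against the measurements; a small p-value is evidence of a violation.
within_limit = int((samples <= 0.8).sum())
result = stats.binomtest(within_limit, len(samples), p=0.9, alternative="less")
print("p-value:", result.pvalue)
```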


Foundations of Machine Learning, part 2

This post is the sixth in our series on the history and foundations of econometric and machine learning models. The first four were on econometric techniques. Part 5 is online here.


Foundations of Machine Learning, part 3

This post is the seventh in our series on the history and foundations of econometric and machine learning models. The first four were on econometric techniques. Part 6 is online here.


JMLR Volume 20

• Adaptation Based on Generalized Discrepancy
• Transport Analysis of Infinitely Deep Neural Network
• Parsimonious Online Learning with Kernels via Sparse Projections in Function Space
• Convergence Rate of a Simulated Annealing Algorithm with Noisy Observations
• Non-Convex Projected Gradient Descent for Generalized Low-Rank Tensor Regression
• scikit-multilearn: A Python library for Multi-Label Classification
• Scalable Approximations for Generalized Linear Problems
• Forward-Backward Selection with Early Dropping
• Dynamic Pricing in High-dimensions
• Graphical Lasso and Thresholding: Equivalence and Closed-form Solutions
• An Approach to One-Bit Compressed Sensing Based on Probably Approximately Correct Learning Theory
• Scalable Kernel K-Means Clustering with Nyström Approximation: Relative-Error Bounds
• Train and Test Tightness of LP Relaxations in Structured Prediction
• Approximations of the Restless Bandit Problem
• Automated Scalable Bayesian Inference via Hilbert Coresets
• Smooth neighborhood recommender systems
• Delay and Cooperation in Nonstochastic Bandits
• Multiplicative local linear hazard estimation and best one-sided cross-validation
• spark-crowd: A Spark Package for Learning from Crowdsourced Big Data
• Accelerated Alternating Projections for Robust Principal Component Analysis
• Spectrum Estimation from a Few Entries
• Random Feature-based Online Multi-kernel Learning in Environments with Unknown Dynamics
• Determining the Number of Latent Factors in Statistical Multi-Relational Learning
• Joint PLDA for Simultaneous Modeling of Two Factors
• Group Invariance, Stability to Deformations, and Complexity of Deep Convolutional Representations
• TensorLy: Tensor Learning in Python
• Monotone Learning with Rectified Wire Networks
• Pyro: Deep Universal Probabilistic Programming
• Iterated Learning in Dynamic Social Networks


Monotone Learning with Rectified Wire Networks

We introduce a new neural network model, together with a tractable and monotone online learning algorithm. Our model describes feed-forward networks for classification, with one output node for each class. The only nonlinear operation is rectification using a ReLU function with a bias. However, there is a rectifier on every edge rather than at the nodes of the network. There are also weights, but these are positive, static, and associated with the nodes. Our rectified wire networks are able to represent arbitrary Boolean functions. Only the bias parameters, on the edges of the network, are learned. Another departure in our approach, from standard neural networks, is that the loss function is replaced by a constraint. This constraint is simply that the value of the output node associated with the correct class should be zero. Our model has the property that the exact norm-minimizing parameter update, required to correctly classify a training item, is the solution to a quadratic program that can be computed with a few passes through the network. We demonstrate a training algorithm using this update, called sequential deactivation (SDA), on MNIST and some synthetic datasets. Upon adopting a natural choice for the nodal weights, SDA has no hyperparameters other than those describing the network structure. Our experiments explore behavior with respect to network size and depth in a family of sparse expander networks.


Accelerated Alternating Projections for Robust Principal Component Analysis

We study robust PCA in the fully observed setting, which is about separating a low-rank matrix L and a sparse matrix S from their sum D = L + S. In this paper, a new algorithm, dubbed accelerated alternating projections, is introduced for robust PCA; it significantly improves the computational efficiency of the existing alternating projections approach of Netrapalli et al. (2014) when updating the low-rank factor. The acceleration is achieved by first projecting a matrix onto a low-dimensional subspace before obtaining a new estimate of the low-rank matrix via truncated SVD. An exact recovery guarantee is established, showing linear convergence of the proposed algorithm. Empirical evaluations demonstrate the advantage of our algorithm over other state-of-the-art algorithms for robust PCA.
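
For intuition, here is a minimal NumPy sketch of the basic (non-accelerated) alternating projections scheme the paper builds on: alternate a hard-thresholding step for the sparse part with a truncated SVD for the low-rank part. The rank r, threshold tau, and iteration count are illustrative choices rather than the paper's adaptive schedule, and the accelerated variant additionally projects onto a low-dimensional subspace before each SVD.

```python
import numpy as np

def robust_pca_altproj(D, r, tau, n_iter=50):
    """Schematic alternating projections for D = L + S (low rank + sparse)."""
    L = np.zeros_like(D)
    S = np.zeros_like(D)
    for _ in range(n_iter):
        # Project the residual onto sparse matrices via hard thresholding.
        R = D - L
        S = np.where(np.abs(R) > tau, R, 0.0)
        # Project the remaining part onto rank-r matrices via truncated SVD.
        U, sigma, Vt = np.linalg.svd(D - S, full_matrices=False)
        L = (U[:, :r] * sigma[:r]) @ Vt[:r, :]
    return L, S
```

On a synthetic D built from a random rank-r matrix plus sparse corruptions, this loop recovers both components up to the quality of the threshold choice; the paper's contribution is making the low-rank update much cheaper while keeping linear convergence.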


Pyro: Deep Universal Probabilistic Programming

Pyro is a probabilistic programming language built on Python as a platform for developing advanced probabilistic models in AI research. To scale to large data sets and high-dimensional models, Pyro uses stochastic variational inference algorithms and probability distributions built on top of PyTorch, a modern GPU-accelerated deep learning framework. To accommodate complex or model-specific algorithmic behavior, Pyro leverages Poutine, a library of composable building blocks for modifying the behavior of probabilistic programs.
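
As a flavor of the API described above, here is a minimal sketch, assuming a recent Pyro release, that fits a Beta-Bernoulli coin model with stochastic variational inference; the data, the AutoNormal guide, and the learning rate are illustrative choices.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal

def model(data):
    # Prior over the unknown coin bias.
    p = pyro.sample("p", dist.Beta(1.0, 1.0))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Bernoulli(p), obs=data)

data = torch.tensor([1., 1., 0., 1., 0., 1., 1., 1.])
guide = AutoNormal(model)                                  # automatic variational family
svi = SVI(model, guide, pyro.optim.Adam({"lr": 0.05}), loss=Trace_ELBO())

for step in range(1000):                                   # stochastic variational inference
    svi.step(data)
```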


TensorLy: Tensor Learning in Python

Tensors are higher-order extensions of matrices. While matrix methods form the cornerstone of traditional machine learning and data analysis, tensor methods have been gaining increasing traction. However, software support for tensor operations is not on the same footing. In order to bridge this gap, we have developed TensorLy, a Python library that provides a high-level API for tensor methods and deep tensorized neural networks. TensorLy aims to follow the same standards adopted by the main projects of the Python scientific community, and to integrate seamlessly with them. Its BSD license makes it suitable for both academic and commercial applications. TensorLy's backend system allows users to perform computations with several libraries, such as NumPy or PyTorch, to name but a few, and these computations can be scaled across multiple CPU or GPU machines. In addition, using a deep-learning framework as the backend makes it easy to design and train deep tensorized neural networks. TensorLy is available at https://…/tensorly
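
As a small taste of the high-level API, here is a sketch of a CP (PARAFAC) decomposition with the NumPy backend; it assumes a reasonably recent TensorLy release, where the reconstruction helper is named cp_to_tensor (older versions call it kruskal_to_tensor).

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

tl.set_backend('numpy')               # 'pytorch' could be swapped in here

# A random 3rd-order tensor standing in for real data.
X = tl.tensor(np.random.rand(10, 10, 10))

# Rank-3 CP (PARAFAC) decomposition.
cp = parafac(X, rank=3)

# Reconstruct and report the relative approximation error.
X_hat = tl.cp_to_tensor(cp)
print(tl.norm(X - X_hat) / tl.norm(X))
```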


Machine Translation – From the Cold War to Deep Learning

I open Google Translate twice as often as Facebook, and instant translation of price tags no longer feels like cyberpunk to me. That's what we call reality. It is hard to imagine that this is the result of a century-long struggle to build machine translation algorithms, with no visible success during half of that period. Those developments laid the foundation of all modern language processing systems, from search engines to voice-controlled microwaves. This article covers the evolution and structure of online translation today.


How Ethereum and Smart Contracts Work – Distributed Turing Machine with Blockchain Protection

This is the continuation of Blockchain Inside Out: How Bitcoin Works, where the basic ideas of blockchain technology (the transaction pool, chains of blocks, and mining) were explained. I'd recommend that everyone not familiar with these terms read that post first. The following article is more technical, using programming terminology and referring back to the first article.


Blockchain Inside Out: How Bitcoin Works – Once and for all in simple words

Two years ago I wrote a post called 'Brave Decentralized World', mainly focused on decentralized networks. It was packed with a brief history of Fidonet, Napster, Gnutella, TOR, and BitTorrent with its DHT extension, and it touched only briefly on the blockchain concept. Nowadays, when half of my friends are trading on crypto exchanges and the other half are preparing for ICOs, surprisingly few of them understand blockchain's internal mechanisms. This article sheds some light on the topic.


Are you leaking h2o? Call plumber!

H2O is a fantastic open-source machine learning platform with many different algorithms. There is a graphical user interface, a Python interface, and an R interface. Suppose you want to create a predictive model but are feeling lazy: just run AutoML.
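
The original post works through the R interface; purely as a sketch of the same idea via the Python interface, here is roughly what an AutoML run looks like. The training file, the "churn" target column, and the model/time budgets are hypothetical.

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Hypothetical training data with a binary target column named "churn".
df = h2o.import_file("train.csv")
df["churn"] = df["churn"].asfactor()

# Let AutoML try a bounded number of models within a time budget.
aml = H2OAutoML(max_models=10, max_runtime_secs=300, seed=1)
aml.train(y="churn", training_frame=df)

print(aml.leaderboard.head())
best_model = aml.leader   # best model, ready for prediction or export
```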


Investigating words distribution with R – Zipf’s law

Have you heard of Zipf's law? I hadn't until recently. Zipf's law is an empirical law stating that many different datasets found in nature can be described by Zipf's distribution. Most notably, word frequencies in books, documents, and even whole languages can be described in this way. Simplified, Zipf's law says that if we take a document, book, or any collection of words and count how many times each word occurs, the resulting frequencies will closely follow Zipf's distribution: the most frequent word occurs roughly twice as often as the second most frequent, three times as often as the third, and so on.
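
The original post explores this in R; here is the same check sketched in Python, with a hypothetical book.txt standing in for whatever corpus you have at hand. Under Zipf's law, rank times frequency should stay roughly constant across the top-ranked words.

```python
import re
from collections import Counter

text = open("book.txt", encoding="utf-8").read().lower()   # hypothetical corpus
words = re.findall(r"[a-z']+", text)
freqs = Counter(words).most_common()

# If frequency is roughly proportional to 1 / rank, then rank * count
# should be approximately constant for the most frequent words.
for rank, (word, count) in enumerate(freqs[:10], start=1):
    print(rank, word, count, rank * count)
```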


spark-crowd: A Spark Package for Learning from Crowdsourced Big Data

As data sets grow in size, manual labeling by small groups of experts becomes infeasible. Thus, it is common to rely on crowdsourcing platforms, which provide inexpensive but noisy labels. Although implementations of algorithms to tackle this problem exist, none of them focus on scalability, limiting their applicability to relatively small data sets. In this paper, we present spark-crowd, an Apache Spark package for learning from crowdsourced data with scalability in mind.
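
spark-crowd itself is a Scala/Spark package, so the sketch below is not its API; it merely illustrates the underlying problem, many noisy annotators and one unknown true label per item, using the simplest aggregation baseline (majority voting) in pandas, which more sophisticated crowdsourcing models (e.g. Dawid-Skene-style annotator models) improve upon. The annotation table is hypothetical.

```python
import pandas as pd

# Hypothetical annotation table: one row per (item, annotator) pair.
annotations = pd.DataFrame({
    "item":      [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "annotator": ["a", "b", "c", "a", "b", "c", "a", "b", "c"],
    "label":     [1, 1, 0, 0, 0, 0, 1, 0, 1],
})

# Majority vote: the most common label per item is the estimated ground truth.
majority = (annotations.groupby("item")["label"]
            .agg(lambda s: s.mode().iloc[0]))
print(majority)
```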


Debugging your tensorflow code right (without so many painful mistakes)

When it comes to writing code in TensorFlow, the discussion is always about comparing it to PyTorch, how complex the framework is, and why some parts of tf.contrib work so badly. Moreover, I know a lot of data scientists who use TensorFlow only as part of a pre-written GitHub repo that can be cloned and then used successfully. The reasons for this attitude toward the framework vary and definitely deserve their own long-read, but today let's focus on more pragmatic problems: debugging code written in TensorFlow and understanding its main peculiarities.
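
As one example of the pragmatic, shape-related bugs that come up in practice, here is a tiny sketch, assuming TensorFlow 2.x with eager execution, of two habits that help: printing through tf.print (which also sees values inside graph-mode tf.function code, unlike Python's print) and failing fast with tf.debugging.assert_shapes. The tensors and shapes are illustrative.

```python
import tensorflow as tf   # assuming TensorFlow 2.x, eager by default

x = tf.random.normal([32, 10])
w = tf.Variable(tf.random.normal([10, 1]))
y = tf.matmul(x, w)

# tf.print shows actual values even in graph mode, where Python's print
# would only show a symbolic tensor.
tf.print("batch output shape:", tf.shape(y))

# Fail fast on silent shape mismatches instead of chasing NaNs later.
tf.debugging.assert_shapes([(x, ("N", 10)), (y, ("N", 1))])
```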


Identifying and Correcting Label Bias in Machine Learning

As machine learning (ML) becomes more effective and widespread, it is becoming more prevalent in systems with real-life impact, from loan recommendations to job application decisions. With the growing usage comes the risk of bias: biased training data could lead to biased ML algorithms, which in turn could perpetuate discrimination and bias in society. In a new paper from Google, researchers propose a novel technique to train machine learning algorithms fairly even with a biased dataset. At the heart of the technique is the idea that a biased dataset can be viewed as an unbiased dataset that has been manipulated by a biased agent. Using this framework, the biased dataset is re-weighted to fit the (theoretical) unbiased dataset, and only then fed into a machine learning algorithm as training data. The technique achieves state-of-the-art results on several common fairness benchmarks while maintaining relatively low error rates.
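
To make the "re-weight, then train" idea concrete, here is a schematic Python sketch, not the paper's exact update rule: examples from groups whose positive-prediction rate falls below the overall rate get their positive labels up-weighted, and the classifier is retrained with those weights. The use of scikit-learn, the binary labels, the demographic-parity-style criterion, and the step size eta are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def reweight_and_train(X, y, group, n_rounds=10, eta=0.5):
    """Schematic bias correction: iteratively reweight training examples so that
    positive-prediction rates move closer together across groups, then retrain.

    X: (n, d) features, y: (n,) binary labels, group: (n,) protected attribute.
    """
    w = np.ones(len(y), dtype=float)
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        clf.fit(X, y, sample_weight=w)
        pred = clf.predict(X)
        overall_rate = pred.mean()
        for g in np.unique(group):
            mask = group == g
            # Positive if this group is selected less often than average.
            violation = overall_rate - pred[mask].mean()
            # Up-weight the positive examples of under-selected groups
            # (and down-weight those of over-selected groups).
            w[mask & (y == 1)] *= np.exp(eta * violation)
        w *= len(w) / w.sum()   # keep the weights on a comparable scale
    return clf, w
```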