Design and Analysis of Statistical Learning Algorithms which Control False Discoveries.

In this thesis, general theoretical tools are constructed which can be applied to develop ma- chine learning algorithms which are consistent, with fast convergence and which minimize the generalization error by asymptotically controlling the rate of false discoveries (FDR) of features, especially for high dimensional datasets. Even though the main inspiration of this work comes from biological applications, where the data is extremely high dimensional and often hard to obtain, the developed methods are applicable to any general statistical learning problem. In this work, the various machine learning tasks like hypothesis testing, classification, regression, etc are formulated as risk minimization algorithms. This allows such learning tasks to be viewed as optimization problems, which can be solved using first order optimization techniques in case of large data scenarios, while one could use faster converging second order techniques for small to moderately sized data sets. Further, such a formulation allows us to estimate the first order convergence rates of an empirical risk estimator for any arbitrary learning problem, using techniques from large deviation theory. In many scientific applications, robust discovery of factors affecting an outcome or a phe- notype, is more important than the accuracy of predictions. Hence, it is essential to find an appropriate approach to regularize an under-determined estimation problem and thereby control the generalization error. In this work, the use of local probability of false discovery is explored as such a regularization parameter, which forces the optimized solution towards functions with a lower probability to be a false discovery. Again, techniques from large devi- ation theory and the Gibbs principle allow the derivation of an appropriately regularized cost function. These two theoretical results are then used to develop concrete applications. First, the problem of multi-classification is analyzed, which classifies a sample from an arbitrary proba- bility measure into a finite number of categories, based on a given training data set. A general risk functional is derived, which can be used to learn Bayes optimal classifiers controlling the false discovery rate. Secondly, the problem of model selection in the regression context is considered, aiming to select a subset of given regressors which explains most of the observed variation i.e. perform ANOVA. Again, using techniques mentioned above, a risk function is derived which when optimized, controls the rate of false discoveries. This technique is shown to outperform the popular LASSO algorithm, which can be proven to not control the FDR, but only the FWER. Finally, the problem of inferring under-sampled and partially observed non-negative dis- crete random variables is addressed, which has applications to analyzing RNA sequencing data. By assuming infinite divisibility of the underlying random variable, its characterization as being a discrete Compound Poisson Measure (DCP), is derived. This allows construction of a non-parametric Bayesian model of DCPs with a Pitman-Yor Mixture process prior, which is shown to allow for consistent inference under Kullback-Liebler and Renyi divergences even in the under-sampled regime.


The Future Of Work Is An Adaptive Workforce

The future of work isn’t something that happens to you – it’s something you create for your company and your own career. Unfortunately, C-level technology and business leaders are often uncertain on how to do it. We’ve just released a major new report, ‘The Adaptive Workforce Will Drive The Future Of Work,’ to establish a North Star for your aspirations – and a blueprint for how to get there.


Scaling Quality Training Data

If you need people to process some portion of the big data that feeds your artificial intelligence, you need a reliable workforce. You’re not alone: more businesses are using in-house staff, contractors, and crowdsourcing to get this kind of work done, and industry analysts expect that trend to increase significantly over the next two years.
In this Whitepaper you will learn:
• How to determine what kind of AI data work to outsource
• Why anonymous crowdsourcing adds cost to your data operations
• Best practices to accelerate and scale high-quality data for AI


Underwriting by Prediction Machines

Credit decisioning has always been at the forefront of adopting innovative tools and technology. As a result, the error rates and overall cost of prediction have reduced significantly since the industry adopted scorecards and machine-based prediction. From credit scoring to analytics and now to machine learning models, the fundamental problem statement in a credit decisioning model is of a prediction. Prediction is defined as the process of filling in the missing information. Prediction takes the information (data) one has and uses it to generate information one doesn’t have.


Optimization Algorithms for Deep Learning

Deep learning is a highly iterative process. We have to try out various permutations of the hyperparameters to figure out which combination works best. Therefore it is crucial that our deep learning model trains in a shorter time without penalizing the cost. In this post, I’ll explain the math behind the most commonly used optimization algorithms used in deep learning.


When Bayes, Ockham, and Shannon come together to define machine learning

A beautiful idea, which binds together concepts from statistics, information theory, and philosophy to lay down the foundation of machine learning.


The 5 Feature Selection Algorithms every Data Scientist should know

Data Science is the study of algorithms. I grapple through with many algorithms on a day to day basis, so I thought of listing some of the most common and most used algorithms one will end up using in this new DS Algorithm series. How many times it has happened when you create a lot of features and then you need to come up with ways to reduce the number of features. We sometimes end up using correlation or tree-based methods to find out the important features. Can we add some structure to it? This post is about some of the most common feature selection techniques one can use while working with data.


Automated Keyword Extraction from Articles using NLP

In research & news articles, keywords form an important component since they provide a concise representation of the article’s content. Keywords also play a crucial role in locating the article from information retrieval systems, bibliographic databases and for search engine optimization. Keywords also help to categorize the article into the relevant subject or discipline. Conventional approaches of extracting keywords involve manual assignment of keywords based on the article content and the authors’ judgment. This involves a lot of time & effort and also may not be accurate in terms of selecting the appropriate keywords. With the emergence of Natural Language Processing (NLP), keyword extraction has evolved into being effective as well as efficient. And in this article, we will combine the two – we’ll be applying NLP on a collection of articles (more on this below) to extract keywords.


A Comment on Data Science Integrated Development Environments

A point that differs from our experience struck us in the recent note regarding doing data science in Python: A development environment [for Python] specifically tailored to the data science sector on the level of RStudio, for example, does not (yet) exist. ‘What’s the Best Statistical Software? A Comparison of R, Python, SAS, SPSS and STATA’ Amit Ghosh


Convolutional Neural Networks: Training an Image Classifier With Keras

Convolutional Neural Networks are a part of what made Deep Learning reach the headlines so often in the last decade. Today we’ll train an image classifier to tell us whether an image contains a dog or a cat, using TensorFlow’s eager API.


Explaining Predictions: Random Forest Post-hoc Analysis (permutation & impurity variable importance)

In the next few posts, we will look at model specific post-hoc analysis which involves ranking the variables according to importance to the model. Though these interpretation can be applied on both white box and black box models, we will be explaining black box models in these posts as white box models can be explained by the models themselves. Specifically, we will explain random forest in this post and gradient boosting in future posts. Similar to the previous posts, the Cleveland heart dataset will be used as well as principles of tidymodels.


The theory you need to know before you start an NLP project

An overview of the most common natural language processing and machine learning techniques needed to start tackling any project involving text.


Neural Semantic Parsing with Type Constraints for Semi-Structured Tables

Natural Language Processing (NLP) that doesn’t pay attention to the semantics of the language, just results in basic statistics of what words are more/less common and what might correlate with those words; so true Natural Language Understanding (NLU) cannot be achieved, because the structure (essential to understand meaning) of language is not conveyed to the model.


How to use data version control (dvc) in a machine learning project

When working in a productive machine learning project you probably deal with a tone of data and several models. To keep track of which models were trained with which data, you should use a system to version the data, similar to versioning and tracking your code. One way to solve this problem is dvc (Data Version Control, https://dvc.org ), which approaches data versioning in a similar way to Git.


Foundations of ML: Parameterized Functions

A soft introduction to parameterized functions, a foundational topic in both machine learning and statistics, explained through small programming examples.


Reinforcement Learning — Tile Coding Implementation

We have come so far and extended our reinforcement learning theories into continuous space. If you would like to go further, you need to know tile coding, which is probably the most practical and computationally efficient tools being used in continuous space, reinforcement learning problems. Essentially, tile coding is a representation of features in continuous state space, and in this post you will:
• Learn the idea of tile coding
• Implement a tile coding step by step
• Integrate and apply it with a Q function


Introducing Distython. New Python package implementing novel distance metrics

Have you ever had a problem choosing a correct distance metric for a mixed-type dataset? In most cases, you could try to perform feature preprocessing and then use some popular distance metrics such as Euclidian or Manhattan. In fact, this approach might be inaccurate and lead to worse performance of your ML model. Have a look at Distython – Python package which implements novel mixed-type distance metrics from research papers!