Democratizing Deep Learning – The Stanford DAWN Project

How about we develop an ML platform that any domain expert can use to build a deep learning model without help from specialist data scientists, in a fraction of the time and cost? The good news is that the folks at the Stanford DAWN project are hard at work on just such a platform, and the initial results are extraordinary.

Visualize Missing Data with VIM Package

Missing data pose a problem in every data scientist’s daily work. Should we impute them? If so, which method is appropriate? Or can observations with missing data points simply be dropped? To answer these questions, one needs to know the mechanism behind the missing data. Detecting it with statistical tests is complex and sometimes leads only to vague statements. Visualization tools, by contrast, are easy to use and help not only to detect missing data mechanisms but also to gain insights into other aspects of data quality. This tutorial presents a set of plotting methods available in the VIM package and shows how they can help one get a solid grasp of the patterns in the way data are missing.
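Before plotting anything, it can help to tabulate which combinations of variables are missing together. The following is a minimal sketch in plain Python, not the VIM package (which is an R library); the function names, the column names, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def missing_pattern_counts(rows, columns):
    """Count how often each combination of missing columns occurs."""
    patterns = Counter()
    for row in rows:
        # A pattern is the tuple of column names whose value is None.
        missing = tuple(col for col in columns if row.get(col) is None)
        patterns[missing] += 1
    return patterns


def missing_share(rows, columns):
    """Return the fraction of rows in which each column is missing."""
    n = len(rows)
    return {col: sum(row.get(col) is None for row in rows) / n for col in columns}


# Hypothetical toy data; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "bmi": 22.1},
    {"age": 51, "income": None, "bmi": 27.4},
    {"age": None, "income": None, "bmi": 24.0},
    {"age": 29, "income": 61000, "bmi": None},
    {"age": 45, "income": None, "bmi": 30.2},
]
columns = ["age", "income", "bmi"]

patterns = missing_pattern_counts(rows, columns)
shares = missing_share(rows, columns)

for pattern, count in patterns.most_common():
    print(pattern or "(complete)", count)
print(shares)
```

If one variable is missing far more often than the others, or two variables tend to be missing together, that is a hint that the data may not be missing completely at random and that a closer visual inspection is worthwhile.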

Machine Learning Basics – The Norms

Learn linear algebra through code and visualization.
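As a small taste of the topic, here is a minimal sketch of the most common vector norms in plain Python. The function names are hypothetical and the example vector is chosen only so that the results are easy to check by hand.

```python
def lp_norm(vector, p):
    """Return the L^p norm of a sequence of numbers, for p >= 1."""
    if p < 1:
        raise ValueError("p must be >= 1 for this to be a valid norm")
    return sum(abs(x) ** p for x in vector) ** (1.0 / p)


def max_norm(vector):
    """Return the max (L-infinity) norm: the largest absolute component."""
    return max(abs(x) for x in vector)


v = [3.0, -4.0]
l1 = lp_norm(v, 1)    # sum of absolute values: 3 + 4
l2 = lp_norm(v, 2)    # Euclidean length: sqrt(9 + 16)
linf = max_norm(v)    # largest absolute component

print(l1, l2, linf)
```

For this vector the L1 norm is 7, the L2 norm is 5, and the max norm is 4, which illustrates that different norms measure the "size" of the same vector differently.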

Text Analytics for Beginners using NLTK

Learn how to analyze text using NLTK. Analyze people’s sentiments and classify movie reviews.

Scaling Featuretools with Dask

When a computation is prohibitively slow, the most important question to ask is: ‘What is the bottleneck?’ Once you know the answer, the logical next step is to figure out how to get around that bottleneck. Often, as we’ll see, the bottleneck is that we aren’t taking full advantage of our hardware resources, for example, running a calculation on only one core when our computer has eight. Simply getting a bigger machine – in terms of RAM or cores – will not solve the problem if our code isn’t written to use all our resources. The solution therefore is to rewrite the code to utilize whatever hardware we do have as efficiently as possible. In this article, we’ll see how to refactor our automated feature engineering code to run in parallel on all our laptop’s cores, in the process reducing computation time by over 8x. We’ll make use of two open-source libraries – Featuretools for automated feature engineering and Dask for parallel processing – and solve a problem with a real-world dataset.
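The general idea of splitting the data into independent partitions, processing each partition separately, and merging the results can be sketched as follows. This is not the article's code and it uses neither Featuretools nor Dask: it is a stand-in built on the standard library's `concurrent.futures`, with hypothetical function names and toy data. Note the assumption in the comments: a thread pool is used here for portability, but pure-Python CPU-bound work only runs on several cores with a process pool or a scheduler such as Dask.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def make_partitions(records, n_partitions):
    """Assign every record of a customer to the same partition."""
    partitions = [[] for _ in range(n_partitions)]
    for record in records:
        partitions[record["customer_id"] % n_partitions].append(record)
    return partitions


def total_spend_per_customer(partition):
    """A stand-in 'feature': total amount spent by each customer."""
    totals = {}
    for record in partition:
        key = record["customer_id"]
        totals[key] = totals.get(key, 0.0) + record["amount"]
    return totals


def run_partitioned(records, n_workers=None, executor_cls=ThreadPoolExecutor):
    """Compute the feature per partition and merge the partial results.

    Assumption: ThreadPoolExecutor is the default only for portability.
    For CPU-bound work, pass ProcessPoolExecutor (or use Dask) instead.
    """
    n_workers = n_workers or os.cpu_count() or 1
    partitions = make_partitions(records, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        partial_results = list(pool.map(total_spend_per_customer, partitions))
    merged = {}
    for partial in partial_results:
        # Safe to merge directly: each customer lives in exactly one partition.
        merged.update(partial)
    return merged


# Hypothetical toy data: 12 transactions spread over 4 customers.
records = [{"customer_id": i % 4, "amount": float(i)} for i in range(12)]
features = run_partitioned(records, n_workers=3)
print(features)
```

The key design choice is that partitions are independent: because all records for a customer end up in the same partition, no worker needs to communicate with another, and the merge step is trivial.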

10 Roadblocks to Becoming a Data-Driven Enterprise

1. Separate Systems for Real-Time and Historical Data
2. Lack of Data Sharing
3. Data Ownership
4. Data Silos
5. Search
6. Too Much Data
7. Separate Systems for Storage and Analysis
8. Lack of Centralized Data Storage
9. Data Stored in Incorrect Systems
10. Improperly Labelled Data

Deep Learning for NLP: An Overview of Recent Trends

In a timely new paper, Young and colleagues discuss some of the recent trends in deep learning based natural language processing (NLP) systems and applications. The focus of the paper is on the review and comparison of models and methods that have achieved state-of-the-art (SOTA) results on various NLP tasks such as visual question answering (QA) and machine translation. In this comprehensive review, the reader will get a detailed understanding of the past, present, and future of deep learning in NLP. In addition, readers will also learn some of the current best practices for applying deep learning in NLP.

What on earth is data science?

The quest for a useful definition

Financial data analysis

Financial institutions and companies have been using predictive analytics for quite a long time. Recently, the availability of computational resources and extensive research in machine learning have made better data analysis, and hence better prediction, possible. In this series of articles, I explain how to create a predictive loan model that identifies bad applicants who are more likely to be charged off. Step by step, I show how to process raw data, remove unnecessary parts of it, select relevant features, perform exploratory data analysis, and finally build a model.

How to Build a Shiny “Truck”!

Last month, at the R/Pharma conference that took place on the Harvard campus, I presented bioWARP, a large Shiny application containing more than 500,000 lines of code. Although several other Shiny apps were presented at the conference, I noticed that none of them came close to being as big as bioWARP. And I asked myself, why? I concluded that most people just don’t need to build them that big! So now, I would like to explain why we needed such a large app and how we went about building it. To give you an idea of the scale I am talking about, an automotive metaphor might be useful. A typical Shiny app I see in my daily work has about 50 or even fewer interaction items. Let’s imagine this as a car. With fewer than 50 interactions, think of a small car like a Mini Cooper. Compared to these applications, bioWARP, with more than 500 interactions, is a truck, maybe even a ‘monster’ truck. So why do my customers want to drive trucks when everyone else is driving cars?

Put Shiny Web Apps Online

RStudio lets you put Shiny web applications and interactive documents online in the way that works best for you. For Shiny applications, consider Shiny Server or Shiny Server Pro, which adds enterprise-grade scaling, security, and admin features to the basic open source edition. If you prefer that we host your Shiny applications, one of our plans is sure to work for you. When you’re ready, RStudio Connect is a new publishing platform for all the work your teams create in R. Share Shiny applications, R Markdown reports, dashboards, plots, APIs, and more in one convenient place. Use push-button publishing from the RStudio IDE, scheduled execution of reports, and flexible security policies to bring the power of data science to your entire enterprise.

Organizing Your First Text Analytics Project

Text analytics or text mining is the analysis of ‘unstructured’ data contained in natural language text using various methods, tools and techniques. The popularity of text mining today is driven by statistics and the availability of unstructured data. With the growing popularity of social media and with the internet as a central location for all sorts of important conversations, text mining offers a low-cost method to gauge public opinion.

[ Paper Summary ] An Introduction to Independent Component Analysis: InfoMax and FastICA algorithms

In this paper, the authors give an introduction to independent component analysis (ICA). Unlike principal component analysis (PCA), ICA optimizes for the statistical independence of the components recovered from the given data.

Deploying Machine Learning Models with Docker

There are a lot of articles out there explaining how to wrap Flask around your machine learning models to serve them as a RESTful API. This article extends that by explaining how to productionize your Flask API and get it ready for deployment using Docker.

Challenges & Solutions for Production Recommendation Systems

How to deal with unseen data, optimise response times, and update models frequently.

The Future with Reinforcement Learning – Part 2: Comparisons and Applications

If you haven’t yet read the reinforcement learning primer, go back and check it out first. That article will provide you with the key concepts in reinforcement learning. Then you will be ready to fully compare the different types of machine learning.

What’s WRONG with Metrics?

For any kind of machine learning problem, you must know how you are going to evaluate your results, or what the evaluation metric is. In this post, we will be viewing the most common metrics out there and discussing the usefulness of each metric depending on the objective and the problem we are trying to solve.
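To see why the choice of metric depends on the problem, consider a minimal sketch in plain Python that computes accuracy, precision, recall, and F1 for a binary classifier. The function names and the toy labels are hypothetical; the example assumes an imbalanced dataset with 5% positives and a model that always predicts the negative class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute common metrics; undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A useless model that always predicts the majority class.
y_pred = [0] * 100

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

The always-negative model reaches 95% accuracy while its recall and F1 are zero, so accuracy alone would hide the fact that it never finds a single positive case.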