Most impactful AI trends of 2018: the rise of ML Engineering

The field of Machine Learning (ML) has been consistently evolving since Data Science started gaining traction in 2012. However, I believe 2018 was a critical inflection point in the ML industry. After helping Insight Fellows build dozens of ML products to get roles on applied ML teams, and reading through both corporate and academic published research and , I’ve seen more need for engineering skills than ever before. As a field that has consistently toed the line between its origins in academic research and the need to serve customer needs, it has often been hard to reconcile engineering standards with ML models. As both research and applied teams are doubling down on their engineering and infrastructure needs, the nascent field of ML Engineering will build upon 2018’s foundation and truly blossom in 2019.

[Webinar] Managing the Complete Machine Learning Lifecycle

Join Databricks Mar 7, 2019, to learn how using MLflow can help you keep track of experiment runs and results across frameworks, execute projects remotely on to a Databricks cluster, and quickly reproduce your runs, and more. Sign up for this webinar now.

Preparing for the Unexpected

Some of the problems we tackle using machine learning involve categorical features that represent real world objects, such as words, items and categories. So what happens when at inference time we get new object values that have never been seen before? How can we prepare ourselves in advance so we can still make sense out of the input?

Some R Packages for ROC Curves

In a recent post, I presented some of the theory underlying ROC curves, and outlined the history leading up to their present popularity for characterizing the performance of machine learning models. In this post, I describe how to search CRAN for packages to plot ROC curves, and highlight six useful packages. Although I began with a few ideas about packages that I wanted to talk about, like ROCR and pROC, which I have found useful in the past, I decided to use Gábor Csárdi’s relatively new package pkgsearch to search through CRAN and see what’s out there. The package_search() function takes a text string as input and uses basic text mining techniques to search all of CRAN. The algorithm searches through package text fields, and produces a score for each package it finds that is weighted by the number of reverse dependencies and downloads.

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) add an interesting twist to basic neural networks. A vanilla neural network takes in a fixed size vector as input which limits its usage in situations that involve a ‘series’ type input with no predetermined size.

How does Apache Spark run on a cluster?

Whether you are a Data Engineer or a Data Scientist, getting up and running with Apache Spark is a relatively easy process from a development perspective. It does require a slight change in paradigm thinking and understanding how Spark executes code and how it functions on our clusters is an important part of being efficient when using Spark. What actually happens when we execute code through Spark? How does Apache Spark run on our cluster? This aim of this blog post is to cover that and go into depth into how Spark code runs.

The Ultimate Guide to Data Cleaning

I spent the last couple of months analyzing data from sensors, surveys, and logs. No matter how many charts I created, how well sophisticated the algorithms are, the results are always misleading. Throwing a random forest at the data is the same as injecting it with a virus. A virus that has no intention other than hurting your insights as if your data is spewing garbage. Even worse, when you show your new findings to the CEO, and Oops guess what? He/she found a flaw, something that doesn’t smell right, your discoveries don’t match their understanding about the domain?-?After all, they are domain experts who know better than you, you as an analyst or a developer. Right away, the blood rushed into your face, your hands are shaken, a moment of silence, followed by, probably, an apology. That’s not bad at all. What if your findings were taken as a guarantee, and your company ended up making a decision based on them?.

AI For Everyone: What Andrew Ng want to convey with this Non Technical Course in 30 points.

AI for everyone is a non technical course taking which you will have greater knowledge than most CEO’s in the world. At least this is what Andrew Ng claims. So let’s find out in short what he wants to convey.

The Python Dreamteam

As a Data Scientist, I code almost entirely in Python. I also get easily scared by configuring stuff. I don’t really know what a PATH is. I have no clue what lies within the /bin directory on my laptop. These are all things that you seemingly have to get familiar with to not have Python implode on your system when you try to change anything. Over some years of struggle I stumbled upon the pipenv/pyenv combo, which seems to have largely solved my Python setup woes in a way that mostly makes sense to me.

A step by step explanation of Principal Component Analysis

The purpose of this post is to provide a complete and simplified explanation of Principal Component Analysis, and especially to answer how it works step by step, so that everyone can understand it and make use of it, without necessarily having a strong mathematical background. PCA is actually a widely covered method on the web, and there are some great articles about it, but only few of them go straight to the point and explain how it works without diving too much into the technicalities and the ‘why’ of things. That’s the reason why i decided to make my own post to present it in a simplified way. Before getting to the explanation, this post provides logical explanations of what PCA is doing in each step and simplifies the mathematical concepts behind it, as standardization, covariance, eigenvectors and eigenvalues without focusing on how to compute them.

State of Data Science & Machine Learning

Data Science, Machine Learning, Deep Learning, Artificial Intelligence are emerging disciplines of and are often entangled with fiction. When I first started the transition into the realm of Data Science from traditional Software Development I faced many challenges and I had a lot of questions: should I go back to school? Is MOOC better or Boot Camp? What new programming languages should I learn? What frameworks and tools should I be familiar with? How can I prove my expertise? so on and so forth. As a burgeoning field, there is very little empirical evidence collected to answer these common questions. During my research to bridge the knowledge gap, I stumbled upon Kaggle Data Scientist & Machine Learning Survey.

Data Scientist to Data Strategist

The last six months were, in many ways, some of the best of my professional life. My company was able to do work for huge businesses, my book proposal was accepted by a publisher and, thanks to the new methodologies I’d developed for doing my work, I was achieving better results for my clients than ever before. I had been trying to reach that point for years. I’d been aiming towards building important things for clients with big budgets and sharpening my technical chops along the way. But when I got there I found out that something wasn’t right.

You created a machine learning application. Now make sure it’s secure.

In a recent post, we described what it would take to build a sustainable machine learning practice. By ‘sustainable,’ we mean projects that aren’t just proofs of concepts or experiments. A sustainable practice means projects that are integral to an organization’s mission: projects by which an organization lives or dies. These projects are built and supported by a stable team of engineers, and supported by a management team that understands what machine learning is, why it’s important, and what it’s capable of accomplishing. Finally, sustainable machine learning means that as many aspects of product development as possible are automated: not just building models, but cleaning data, building and managing data pipelines, testing, and much more. Machine learning will penetrate our organizations so deeply that it won’t be possible for humans to manage them unassisted.

How to calculate the correlation coefficients for more than two variables

When I wanted to calculate the correlation coefficients for 25 variables it became tricky. As I aimed to export results in a table, the function cor was not helpful.

Zulip 2.0: Open source team chat

We’re excited to announce the release of Zulip Server 2.0, containing hundreds of new features and bug fixes. Zulip is the world’s most productive team chat software, used by thousands of teams as an alternative to Slack, HipChat, Mattermost and IRC. Zulip’s unique topic-based threading combines the immediacy of chat with the asynchronous efficiency of email-style threading, and is 100% free and open source software.

Python Data Science Handbook

This repository contains the entire Python Data Science Handbook, in the form of (free!) Jupyter notebooks.

Text Mining: Techniques, Applications and Issues

Rapid progress in digital data acquisition techniques have led to huge volume of data. More than 80 percent of today’s data is composed of unstructured or semi-structured data. The discovery of appropriate patterns and trends to analyze the text documents from massive volume of data is a big issue. Text mining is a process of extracting interesting and nontrivial patterns from huge amount of text documents. There exist different techniques and tools to mine the text and discover valuable information for future prediction and decision making process. The selection of right and appropriate text mining technique helps to enhance the speed and decreases the time and effort required to extract valuable information. This paper briefly discuss and analyze the text mining techniques and their applications in diverse fields of life. Moreover, the issues in the field of text mining that affect the accuracy and relevance of results are identified.