Data Science Simplified Part 1: Principles and Process

In 2006, Clive Humbly, UK Mathematician, and architect of Tesco’s Clubcard coined the phrase “Data is the new oil. He said the following: ”Data is the new oil. It’s valuable, but if unrefined it cannot be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so, must data be broken down, analyzed for it to have value.” The iPhone revolution, growth of the mobile economy, advancements in Big Data technology has created a perfect storm. In 2012, HBR published an article that put Data Scientists on the radar. The article Data Scientist: The Sexiest Job of the 21st Century labeled this “new breed” of people; a hybrid of data hacker, analyst, communicator, and trusted adviser. Every organization is now making attempts to be more data-driven. Machine learning techniques have helped them in this endeavor. I realize that a lot of the material out there is too technical and difficult to understand. In this series of articles, my aim is to simplify Data Science. I will take a cue from the Stanford course/book (An Introduction to Statistical Learning). This attempt is to make Data Science easy to understand for everyone. In this article, I will begin by covering fundamental principles, general process and types of problems in Data Science.

The best metric to measure accuracy of classification models

Unlike evaluating the accuracy of models that predict a continuous or discrete dependent variable like Linear Regression models, evaluating the accuracy of a classification model could be more complex and time-consuming. Before measuring the accuracy of classification models, an analyst would first measure its robustness with the help of metrics such as AIC-BIC, AUC-ROC, AUC- PR, Kolmogorov-Smirnov chart, etc. The next logical step is to measure its accuracy. To understand the complexity behind measuring the accuracy, we need to know few basic concepts.

Deep Learning Hyperparameter Optimization with Competing Objectives

In this post we’ll show how to use SigOpt’s Bayesian optimization platform to jointly optimize competing objectives in deep learning pipelines on NVIDIA GPUs more than ten times faster than traditional approaches like random search.

Building TensorFlow 1.3.0-rc2 as a standalone project (Raspberry pi 3 included)

Here you’ll learn how to build Tensorflow either for your x86_64 machine or for the raspberry pi 3 as a standalone shared library which can be interfaced from the C++ API.

Data visualization with googleVis exercises part 10

This is part 10 of our series and we are going to explore the features of some interesting types of charts that googleVis provides like Timeline, Flash and learn how to merge two googleVis charts to one. Read the examples below to understand the logic of what we are going to do and then test yous skills with the exercise set we prepared for you. Lets begin!

Can we believe in the imputations?

A popular approach to deal with missing values is to impute the data to get a complete dataset on which any statistical method can be applied. Many imputation methods are available and provide a completed dataset in any cases, whatever the number of individuals and/or variables, the percentage of missing values, the pattern of missing values, the relationships between variables, etc. However, can we believe in these imputations and in the analyses performed on these imputed datasets?

Multiple imputation for continuous and categorical data

“The idea of imputation is both seductive and dangerous” (R.J.A Little & D.B. Rubin). Indeed, a predicted value is considered as an observed one and the uncertainty of prediction is ignored, conducting to bad inferences with missing values. That is why Multiple Imputation is recommended.

Why continuous learning is key to AI

As more companies begin to experiment with and deploy machine learning in different settings, it’s good to look ahead at what future systems might look like. Today, the typical sequence is to gather data, learn some underlying structure, and deploy an algorithm that systematically captures what you’ve learned. Gathering, preparing, and enriching the right data—particularly training data—is essential and remains a key bottleneck among companies wanting to use machine learning. I take for granted that future AI systems will rely on continuous learning as opposed to algorithms that are trained offline. Humans learn this way, and AI systems will increasingly have the capacity to do the same. Imagine visiting an office for the first time and tripping over an obstacle. The very next time you visit that scene—perhaps just a few minutes later—you’ll most likely know to look out for the object that tripped you. There are many applications and scenarios where learning takes on a similar exploratory nature. Think of an agent interacting with an environment while trying to learn what actions to take and which ones to avoid in order to complete some preassigned task. We’ve already seen glimpses of this with recent applications of reinforcement learning (RL). In RL, the goal is to learn how to map observations and measurements to a set of actions, while trying to maximize some long-term reward. (The term RL is frequently used to describe both a class of problems and a set of algorithms.) While deep learning gets more media attention, there are many interesting recent developments in RL that are well known within AI circles. Researchers have recently applied RL to game play, robotics, autonomous vehicles, dialog systems, text summarization, education and training, and energy utilization.

Radial kernel Support Vector Classifier

Support vector machines are a famous and a very strong classification technique which does not uses any sort of probabilistic model like any other classifier but simply generates hyperplanes or simply putting lines ,to separate and classify the data in some feature space into different regions.

Your ‘Anonymous’ Browsing Data Isn’t Actually Anonymous

In August 2016, a data broker received a phone call from a woman named Anna Rosenberg, who worked for a small startup in Tel Aviv. Rosenberg claimed she was training a neural network, a type of computing architecture inspired by the human brain, and needed a large set of browsing data to do so. The startup she was working for was well-funded and purchasing the data wouldn’t be a problem. But given the number of brokers out there, Rosenberg wasn’t going to purchase the browsing data from just anyone. She wanted a free trial. A day after originally soliciting the data broker, Rosenberg received a phone call. A salesperson representing the broker gave Rosenberg the credentials she’d need to access the browsing data that was part of her free trial. The broker agreed to allow Rosenberg access to the complete browsing history of 3 million German users for one month, with the stipulation that for a part of this period, some of the browsing data would be collected live (that is, refreshed every day or so). There was only one problem: Neither Anna Rosenberg nor the startup she claimed to represent existed.