Making data science data-centric
The source of many common problems in data science projects is placing more emphasis on modelling than on data. Data science project success is defined largely by data and the complex process of getting it ready for analysis. However, project progress is commonly defined purely in terms of implementing models and analysis outcomes. The issue is that ‘getting data ready for analysis’ is an ill-defined concept, which makes it hard to plan this phase and allocate sufficient resources to it. In our experience, defining clear data readiness goals around which the project timeline is built is key for project planning and success.
6 Useful Programming Languages for Data Science You Should Learn (that are not R and Python)
• Which programming language should you pick for data science? Here’s a list of 6 powerful ones that are not Python or R
• These languages are vast in their scope and are commonly used in the data science field
• We have also provided open-source libraries for each language to help you get started with various stages of a data science project, such as data cleaning, model building, etc.
AI Business Transformation Playbook for Executives
‘AI, IoT & 5G’ – the confluence is a ‘perfect storm’ of business opportunities that will appear in the next few years. This is an article on how executives at enterprises can get ready to thrive in this milieu. My focus will be on value-creating business opportunities and how to grasp them, written in an easily understandable and logical manner and incorporating best practices that I have learned myself or seen over the past 15+ years. We have a few years of AI applications in business under our belts by now (in 2019). The executive lament has been that while there are many PoCs, most do not mature into significant revenue generators. This should not be surprising – I believe that the best way to look at PoCs is as ‘early startups’. Much like startups, very few PoCs go on to become unicorns!
29 Statistical Concepts Explained in Simple English – Part 16
This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, decision trees, ensembles, correlation, Python, R, Tensorflow, SVM, data reduction, feature selection, experimental design, cross-validation, model fitting, and many more.
Variable Selection Methods: Lasso and Ridge Regression in Python
One of the most in-demand machine learning skills is regression analysis. In this article, you will learn how to apply two variable selection methods, Lasso and Ridge regression, in Python.
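To make the difference between the two penalties concrete, here is a minimal sketch in plain Python (it is not the article's code). It assumes an orthonormal design matrix, a special case in which both estimators have closed forms in terms of the ordinary least squares coefficients: lasso soft-thresholds them, while ridge shrinks them proportionally. The function names, the penalty scaling (half the squared error plus lam times the L1 norm for lasso, plus lam/2 times the squared L2 norm for ridge) and the coefficient values are all assumptions made for illustration.

```python
def lasso_orthonormal(ols_coefs, lam):
    """Lasso solution when X^T X = I: soft-threshold each OLS coefficient."""
    out = []
    for b in ols_coefs:
        magnitude = abs(b) - lam
        if magnitude <= 0:
            out.append(0.0)  # coefficient is dropped from the model
        else:
            out.append(magnitude if b > 0 else -magnitude)
    return out


def ridge_orthonormal(ols_coefs, lam):
    """Ridge solution when X^T X = I: scale every OLS coefficient by 1 / (1 + lam)."""
    return [b / (1.0 + lam) for b in ols_coefs]


# Hypothetical OLS coefficients for four predictors.
ols = [3.0, -0.4, 0.1, -2.0]

lasso = lasso_orthonormal(ols, lam=0.5)
ridge = ridge_orthonormal(ols, lam=0.5)

# Indices of the predictors that lasso keeps in the model.
selected = [i for i, b in enumerate(lasso) if b != 0.0]

print("lasso:", lasso)
print("ridge:", ridge)
print("selected by lasso:", selected)
```

In this special case lasso sets the small coefficients exactly to zero, which is what makes it usable for variable selection, whereas ridge only shrinks coefficients towards zero without removing any of them.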
The Future of Database Management Systems is Cloud!
On Friday, 20-June-2019, Gartner published new research titled ‘The Future of the DBMS Market Is Cloud’ by Donald Feinberg, Merv Adrian and Adam Ronthal. The thesis: cloud is now the default platform for managing data. On-premises is the past, and only legacy compatibility or special requirements should keep you there. Some might think we are early; others might think we are late, and some may even think us crazy! Well, in a word, NOT. Here is the evidence: …
Shiny application (with modules) – Saving and Restoring from RDS
I am working on a Shiny application which allows the user to upload data, do some analysis and processing on each variable in the data, and finally use the processed variables to build a statistical model. As there may be hundreds of variables in the data, the user may want to process only a few variables in one sitting and later continue the work from where they left off. They may also want to pass on their work to another user who can then continue processing the variables. The Shiny application is also divided into modules – so there are different modules for data upload and exploration, processing each variable, building the statistical model and running further analysis on the model.
Building a Machine Learning Recommendation Model from Scratch
In this tutorial, we build a regression model using the cruise_ship_info.csv dataset for recommending the crew size for potential cruise ship buyers. This tutorial will highlight important data science and machine learning concepts such as:
a) data preprocessing and variable selection
b) basic regression model building
c) hyperparameter tuning
d) model evaluation
e) techniques for dimensionality reduction
Mobility Data, Feature Engineering and Hierarchical Clustering
The United States has one of the world’s largest automobile markets, second only to China. With 270.4 million registered vehicles on American roads as of 2017, there are millions of crashes every year. According to the National Highway Traffic Safety Administration, there were an estimated 7 million police-reported motor vehicle crashes in the US in 2016. This led to about 207 million dollars in collision loss in 2016. Being able to predict the likelihood of a driver filing a claim in the coming months gives the insurer the ability to adjust premiums and plan provisions ahead of time. Applying predictive analytics to insurance claims is nothing new; however, we are witnessing a transition from classical models based on static, general data (driver age, driver license age, car type, etc.) to models based on actual driving behavior (sudden braking and other indicators of unusual driving). This transition is mainly driven by the emergence of big data frameworks and their ability to manipulate and analyze larger and less structured data sets. It has led some insurance companies to start collecting data on driving patterns using devices that they install in the insured person’s car.
Quickly Navigating Python Libraries With ctags
A tutorial for using ctags to efficiently navigate Python libraries for data scientists.
An overview of different unsupervised learning techniques
In this article, I want to walk you through the different unsupervised learning methods in machine learning, with relevant code. We will take a look at the k-means clustering algorithm, Latent Dirichlet Allocation (LDA) for text data, hierarchical and density-based clustering, Gaussian Mixture Models, dimensionality reduction techniques like PCA, Random Projections and Independent Component Analysis, and finally cluster validation.
Maximum Likelihood Estimation from Bayes’ Theorem
ML estimation is probably one of the most popular and easiest methods of parameter estimation, and Bayes’ theorem, a stand-alone piece of genius, has lots of applications. But is it possible to view ML (not Machine Learning) as an application of Bayes’ theorem? Let’s see.
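One common way to make this connection is sketched below. It assumes a flat (uniform) prior over the parameter; the symbols \theta (parameter) and D (observed data) are notation introduced here for illustration, and the article itself may argue differently.

```latex
\begin{align}
p(\theta \mid D) &= \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \\
\hat{\theta}_{\mathrm{MAP}} &= \arg\max_{\theta}\; p(D \mid \theta)\, p(\theta) \\
p(\theta) \propto 1 \;\Rightarrow\; \hat{\theta}_{\mathrm{MAP}} &= \arg\max_{\theta}\; p(D \mid \theta) = \hat{\theta}_{\mathrm{ML}}
\end{align}
```

The evidence p(D) does not depend on \theta, so it drops out of the maximisation; with a constant prior, the mode of the posterior coincides with the maximum likelihood estimate.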
Introducing End-to-End Interpolation of Time Series Data in Apache PySpark
Anyone working with data knows that real-world data is often patchy and that cleaning it takes up a considerable amount of your time (80/20 rule, anyone?). Having recently moved from Pandas to PySpark, I was used to the conveniences that Pandas offers and that PySpark sometimes lacks due to its distributed nature. One of the features I have been particularly missing is a straightforward way of interpolating (or in-filling) time series data. While the problem of in-filling missing values has been covered a few times (e.g. [here]), I was not able to find a source that detailed the end-to-end process of generating the underlying time grid and then filling in the missing values. This post tries to close this gap. Starting from a time series with missing entries, I will show how we can leverage PySpark to first generate the missing timestamps and then fill in the missing values using three different interpolation methods (forward filling, backward filling and interpolation). This is demonstrated using the example of sensor read data collected in a set of houses. The full code for this post can be found [here in my github].
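The two steps described above (build the time grid, then fill the gaps) can be sketched in plain Python. This is not the post's PySpark implementation, only an illustration of the logic on a single in-memory series; the function names, the 10-minute grid and the sensor readings are hypothetical.

```python
def build_time_grid(start, end, step):
    """Generate every expected timestamp between start and end (inclusive)."""
    return list(range(start, end + 1, step))


def forward_fill(values):
    """Replace each missing value with the last observed value before it."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def backward_fill(values):
    """Replace each missing value with the next observed value after it."""
    return forward_fill(values[::-1])[::-1]


def linear_interpolate(times, values):
    """Fill gaps that have an observed value on both sides by linear interpolation."""
    known = [(t, v) for t, v in zip(times, values) if v is not None]
    out = list(values)
    # Walk over consecutive pairs of observed points and fill what lies between them.
    for (t0, v0), (t1, v1) in zip(known, known[1:]):
        for i, t in enumerate(times):
            if t0 < t < t1:
                out[i] = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return out


# Hypothetical sensor readings keyed by timestamp (minutes); 10 and 20 are missing.
observed = {0: 10.0, 30: 16.0, 40: 18.0}

grid = build_time_grid(0, 40, 10)
readings = [observed.get(t) for t in grid]  # None marks a missing reading

print("grid:         ", grid)
print("raw:          ", readings)
print("forward fill: ", forward_fill(readings))
print("backward fill:", backward_fill(readings))
print("interpolated: ", linear_interpolate(grid, readings))
```

In a distributed setting the same fills would have to be expressed per sensor and over ordered partitions rather than with Python loops, which is the part the post addresses.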
Basics of Independent Component Analysis
We are constantly on the search for patterns and a deeper understanding of data. What’s the first thing that stands out to you about this dataset? If your first thought was – ‘those points look too nice’ – then you caught me. This isn’t a real dataset, and I spent more time than I should have generating these points. That’s not the point, however. What do you see? From a visual perspective, it feels pretty clear that there are two populations with two linear trends. The two groups are mixed together into one undistinguished set of points. Here’s another one: Can you separate the two components in this masterpiece, and identify the person in the picture? Again, this shouldn’t be a problem for us. We can mentally separate the two images, but if someone asked for the two separated images, how would you actually do it? We want a mathematical framework for this process of separating a mixed dataset, and an algorithm to do so.
Machines that learn by doing
In my mid-twenties I learned to play tennis for the first time. The thing about tennis is that, when you start off, it is really hard to make the ball land on the opposite side of the court (and not in the trees behind the court). The trick is to hold the racket roughly vertically the moment you hit the ball, and to give the ball sufficient topspin. Over the course of many hours of training sessions and practice with friends, I slowly learned how to position the racket at just the right angle when I hit the ball. My brain was able to learn a new task, playing the game of tennis, essentially by frequent practice. How did it do that? And could a machine do the same? Can machines learn by doing?
Understanding Gradient Boosting Machines – using XGBoost and LightGBM parameters
Psst.. A Confession: I have, in the past, used and tuned models without really knowing what they do. I tried to do the same with Gradient Boosting Machines – LightGBM and XGBoost – and it was.. frustrating! This technique (or rather laziness) works fine for simpler models like linear regression, decision trees, etc. They have only a few hyperparameters – learning_rate, no_of_iterations, alpha, lambda – and it’s easy to know what they mean. But GBMs are a different world:
• They have a huge number of hyperparameters – ones that can make or break your model.
• And to top it off, unlike Random Forest, their default settings are often not the optimal ones!
So, if you want to use GBMs for modelling your data, I believe that you have to get at least a high-level understanding of what happens on the inside. You can’t get away with using them as a complete black box. And that is what I want to help you with in this article! What I will do is sew a very simple explanation of Gradient Boosting Machines around the parameters of two of their most popular implementations – LightGBM and XGBoost. This way you will be able to tell what’s happening in the algorithm and what parameters you should tweak to make it better. This practical explanation, I believe, will allow you to go directly to implementing them in your own analysis! As you might have guessed already, I am not going to dive into the math in this article. But if you are interested, I will post some good links that you could follow if you want to make the jump. Let’s get to it then..
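As a rough picture of what happens on the inside, here is a minimal sketch of gradient boosting for squared error in plain Python, using one-split trees (stumps) on a single feature. It is not how LightGBM or XGBoost are implemented; it only shows the core loop of fitting each new tree to the current residuals. The names learning_rate and n_estimators follow common parameter names in those libraries' interfaces, while every function name and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_estimators=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict_gbm(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical toy data: a noisy step from about 1 to about 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

baseline_mse = mse(fit_gbm(xs, ys, n_estimators=0), xs, ys)
one_round_mse = mse(fit_gbm(xs, ys, n_estimators=1), xs, ys)
boosted_mse = mse(fit_gbm(xs, ys, n_estimators=50), xs, ys)

print("mean only:", baseline_mse)
print("1 round:  ", one_round_mse)
print("50 rounds:", boosted_mse)
```

A smaller learning_rate makes each tree contribute less, so more rounds are needed to reach the same training error; that trade-off is one reason these two parameters are usually tuned together.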
Cross-entropy: From an Information theory point of view
Cross-entropy is a widely used loss function in machine learning for classification problems. It comes from information theory, but the rationale behind using it is often not explained in classes. In this blog, we build an intuitive understanding of information theory and then connect it with the cross-entropy function.
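For reference, the standard definitions can be written in a few lines of Python. This is a minimal sketch using base-2 logarithms (so values are in bits); the function names and the example distributions are made up for illustration.

```python
import math


def entropy(p):
    """H(p) = -sum_i p_i * log2(p_i): average bits needed to encode outcomes of p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log2(q_i): average bits when outcomes follow p
    but the code is built for q. Requires q_i > 0 wherever p_i > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


# A one-hot "true" label and two hypothetical classifier outputs.
true_label = [0.0, 1.0, 0.0]
confident = [0.05, 0.90, 0.05]
unsure = [0.30, 0.40, 0.30]

print("loss, confident prediction:", cross_entropy(true_label, confident))
print("loss, unsure prediction:   ", cross_entropy(true_label, unsure))

# Cross-entropy is never smaller than the entropy of the true distribution.
fair_coin = [0.5, 0.5]
biased_guess = [0.25, 0.75]
print("entropy of fair coin:      ", entropy(fair_coin))
print("cross-entropy with guess:  ", cross_entropy(fair_coin, biased_guess))
```

With a one-hot label the sum collapses to the negative log of the probability assigned to the correct class, which is why a confident correct prediction gives a lower loss than an unsure one.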