Monte Carlo Analysis and Simulation

The Monte Carlo method is a simple way to solve very difficult probabilistic problems. This text is a very simple, didactic introduction to the subject, a mixture of history, mathematics and mythology. The method has its origins in World War II and was proposed by the Polish-American mathematician Stanislaw Ulam and the Hungarian-American mathematician John von Neumann.
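As a small illustration of the idea (not taken from the article itself), the sketch below estimates pi by drawing random points in the unit square and counting how many land inside the quarter circle of radius 1. The function name `estimate_pi` and the fixed seed are assumptions made for this example.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Estimate pi by sampling points uniformly in the unit square.

    The fraction of points with x^2 + y^2 <= 1 approximates the area of a
    quarter circle of radius 1, which is pi / 4.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    for n in (1_000, 100_000):
        print(n, estimate_pi(n))
```

The estimate tends to improve as the number of samples grows, with an error that shrinks roughly in proportion to one over the square root of the sample count.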


Introductory guide on Linear Programming for (aspiring) data scientists

Optimization is a way of life. We all have finite resources and time, and we want to make the most of them. From using your time productively to solving supply chain problems for your company, everything uses optimization. It is also a very interesting topic: it starts with simple problems but can get very complex. For example, sharing a chocolate between siblings is a simple optimization problem, and we don’t think in mathematical terms while solving it. On the other hand, devising an inventory and warehousing strategy for an e-tailer can be very complex. Millions of SKUs with different popularity in different regions, to be delivered within defined time and resources: you see what I mean! Linear programming (LP) is one of the simplest ways to perform optimization. It helps you solve some very complex optimization problems by making a few simplifying assumptions. As an analyst, you are bound to come across applications and problems to be solved by linear programming. For some reason, LP doesn’t get as much attention as it deserves while learning data science. So I decided to do justice to this awesome technique and write an article that explains linear programming in simple English. I have kept the content as simple as possible. The idea is to get you started and excited about linear programming.
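To make the idea concrete, here is a minimal sketch of a two-variable linear program solved by brute-force corner-point enumeration. This is not the method used in the article and not how production solvers work; it relies on the fact that a bounded, feasible LP attains its optimum at a vertex of the feasible region. The function name `solve_lp_2d` and the example numbers are hypothetical.

```python
from itertools import combinations


def solve_lp_2d(objective, constraints, tol=1e-9):
    """Maximize cx*x + cy*y subject to a*x + b*y <= c for each (a, b, c).

    Every pair of constraint boundaries is intersected; feasible
    intersections are the vertices, and the best vertex is returned.
    Assumes the feasible region is non-empty and bounded.
    """
    cx, cy = objective
    best_point, best_value = None, float("-inf")
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < tol:
            continue  # parallel boundaries never intersect in a single point
        # Cramer's rule for the 2x2 system formed by the two boundary lines
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y <= c + tol for a, b, c in constraints):
            value = cx * x + cy * y
            if value > best_value:
                best_point, best_value = (x, y), value
    return best_point, best_value


if __name__ == "__main__":
    # Maximize 5x + 4y subject to 6x + 4y <= 24, x + 2y <= 6, x >= 0, y >= 0.
    constraints = [
        (6, 4, 24),
        (1, 2, 6),
        (-1, 0, 0),  # x >= 0 written as -x <= 0
        (0, -1, 0),  # y >= 0 written as -y <= 0
    ]
    print(solve_lp_2d((5, 4), constraints))
```

Enumeration only scales to toy problems, since the number of candidate vertices grows quickly with the number of constraints; real applications would use a dedicated LP solver.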


Mathematicians becoming data scientists: Should you? How to?

I was talking the other day with a former student at UW, Sarah Rich, who’s done degrees in both math and CS and then went off to Twitter. I asked her: so what would you say to a math Ph.D. student who was wondering whether they would like being a data scientist in the tech industry? How would you know whether you might find that kind of work enjoyable? And if you did decide to pursue it, what’s the strategy for making yourself a good job candidate? Sarah exceeded my expectations by miles and wrote the following extremely informative and thorough tip sheet, which she’s given me permission to share. Take it away, Sarah!


Allstate Claims Severity Competition, 2nd Place Winner’s Interview: Alexey Noskov

The Allstate Claims Severity recruiting competition ran on Kaggle from October to December 2016. As Kaggle’s most popular recruiting competition to date, it attracted over 3,000 entrants who competed to predict the loss value associated with Allstate insurance claims. In this interview, Alexey Noskov walks us through how he came in second place by creating features based on distance from cluster centroids and applying newfound intuitions for (hyper)parameter tuning. Along the way, he provides details on his favorite tips and tricks, including lots of feature engineering and implementing a custom objective function for XGBoost.


Python Deep Learning Frameworks Reviewed

I recently stumbled across an old Data Science Stack Exchange answer of mine on the topic of the “Best Python library for neural networks”, and it struck me how much the Python deep learning ecosystem has evolved over the course of the past 2.5 years. The library I recommended in July 2014, pylearn2, is no longer actively developed or maintained, but a whole host of deep learning libraries have sprung up to take its place. Each has its own strengths and weaknesses. We’ve used most of the technologies on this list in production or development at indico, but for the few that we haven’t, I’ll pull from the experiences of others to help give a clear, comprehensive picture of the Python deep learning ecosystem of 2017.


6 testing methods for binary classification

Once a predictive model has been trained, its predictive power must be evaluated on new data that it has not seen before: the testing instances subset. This process determines whether the predictive model is good enough to be moved into the production phase. The purpose of testing analysis is to compare the responses of the trained predictive model against the correct predictions for each instance of the testing set. As these cases have not been used to train the predictive model, the results of this process can be used as a simulation of what would happen in a real-world situation.
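The comparison described above can be sketched with a confusion matrix and a few metrics derived from it. The blurb does not list the six methods the article covers, so the metrics below are common choices picked for illustration, and the function names and sample labels are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(actual, predicted):
    """Return accuracy, precision and recall on a held-out testing set."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when a class is never predicted or present
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Made-up testing labels and model responses, for illustration only
    actual = [1, 0, 1, 1, 0, 0, 1, 0]
    predicted = [1, 0, 0, 1, 0, 1, 1, 0]
    print(confusion_counts(actual, predicted))
    print(classification_metrics(actual, predicted))
```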


Machine Learning From Scratch

Python implementations of some of the foundational machine learning models and algorithms from scratch. While some of the matrix operations implemented by hand (such as calculating the covariance matrix) are available in NumPy, I have decided to implement these as well to make sure that I understand how the linear algebra is applied. The purpose of this project is purely self-educational.
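As an illustration of what a hand-written matrix operation might look like, here is a minimal pure-Python sketch of the sample covariance matrix. It is not the project's own code; the function name `covariance_matrix` and the choice of the unbiased estimator (dividing by n - 1) are assumptions made for this example.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of a dataset given as a list of rows.

    Each row is one observation and each column is one feature.
    Uses the unbiased estimator, so at least two rows are required.
    """
    n = len(rows)
    d = len(rows[0])
    # Mean of every feature (column)
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for j in range(d):
        for k in range(d):
            cov[j][k] = sum(
                (row[j] - means[j]) * (row[k] - means[k]) for row in rows
            ) / (n - 1)
    return cov


if __name__ == "__main__":
    data = [[1, 2], [2, 4], [3, 6]]
    for line in covariance_matrix(data):
        print(line)
```

The result should be symmetric, with the per-feature variances on the diagonal.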


Text preparation before sentiment analysis

Before starting sentiment analysis, we need to prepare our text data. The following steps need to be executed to clean the corpus and prepare it for further analysis.