Suppose you see two drunks (i.e., two random walks) wandering around. The drunks don’t know each other (they’re independent), so there’s no meaningful relationship between their paths. But suppose instead you have a drunk walking with her dog. This time there is a connection. What’s the nature of this connection? Notice that although each path individually is still an unpredictable random walk, given the location of either the drunk or the dog, we have a pretty good idea of where the other is; that is, the distance between the two is fairly predictable. (For example, if the dog wanders too far away from his owner, she’ll tend to move in his direction to avoid losing him, so the two stay close together despite a tendency to wander around on their own.) We describe this relationship by saying that the drunk and her dog form a cointegrating pair.
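To make the drunk-and-dog picture concrete, here is a minimal simulation sketch. It assumes one particular model that the text does not spell out: the dog follows a pure random walk, and at each step the drunk closes a fixed fraction of the gap to the dog before taking her own random step. The function name, the `pull` parameter, and the Gaussian steps are all illustrative assumptions.

```python
import numpy as np

def simulate_drunk_and_dog(n_steps=5000, pull=0.2, seed=0):
    """Simulate a dog on a random walk and an owner who drifts toward him.

    `pull` is a hypothetical parameter: the fraction of the current gap
    that the drunk closes at each step.
    """
    rng = np.random.default_rng(seed)
    drunk = np.zeros(n_steps)
    dog = np.zeros(n_steps)
    for t in range(1, n_steps):
        gap = drunk[t - 1] - dog[t - 1]
        # The dog wanders freely: a plain random walk.
        dog[t] = dog[t - 1] + rng.normal()
        # The drunk also wanders, but is pulled back toward the dog.
        drunk[t] = drunk[t - 1] - pull * gap + rng.normal()
    return drunk, dog

drunk, dog = simulate_drunk_and_dog()
gap = drunk - dog
print(f"std of dog's path: {dog.std():.2f}")
print(f"std of the gap:    {gap.std():.2f}")
```

Under these assumptions each path keeps drifting, while the gap between them behaves like a stationary series that stays near zero. That is the sense in which the pair is cointegrated.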
I was an engineer working with an MNC in a pretty cushy job. For most people it would have been a pretty happy life, but I had other dreams. I wanted to pursue an MBA outside India. Unfortunately, the plan didn’t work out – there were issues on the financial and personal fronts – and eventually I figured that maybe my ambitions were somewhat unrealistic.
For any machine learning problem, say a classifier in this case, it’s always handy to quickly create a baseline classifier against which we can compare our new models. You don’t want to spend a lot of time creating these baseline classifiers; you would rather spend that time building and validating new features for your final model. In this post we will see how we can rapidly create a baseline classifier for any dataset using the scikit-learn package.
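As a rough illustration of the idea, the sketch below scores a few trivial baselines with scikit-learn's `DummyClassifier`. This is one plausible approach and not necessarily the one the post itself uses; the helper name `baseline_scores` and the choice of the iris dataset are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

def baseline_scores(X, y, cv=5):
    """Return mean cross-validated accuracy for a few trivial strategies."""
    scores = {}
    for strategy in ("most_frequent", "stratified", "uniform"):
        # DummyClassifier ignores the features and predicts from the
        # label distribution alone, so it trains almost instantly.
        clf = DummyClassifier(strategy=strategy, random_state=0)
        scores[strategy] = cross_val_score(clf, X, y, cv=cv).mean()
    return scores

X, y = load_iris(return_X_y=True)
scores = baseline_scores(X, y)
for strategy, accuracy in scores.items():
    print(f"{strategy:>14}: {accuracy:.3f}")
```

Any real model should comfortably beat these numbers. If it does not, the features or the evaluation setup probably need another look.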
Over the last few years of working with data I’ve collected a toolbox of essential tools that I think all Data Scientists should know and use. All of these tools will not only help you be more efficient as a data scientist or data analyst, but will also help you work better within a team, be more organized and flexible and produce analyses that will be more easily reproducible by others. These probably aren’t the only tools you’ll end up needing to use day-to-day, but this list of tools will be universally useful.
• Git and GitHub
The previous post on this blog sought to expose the statistical underpinnings of several machine learning models you know and love. Therein, we made the analogy of a swimming pool: you start on the surface — you know what these models do and how to use them for fun and profit — dive to the bottom — you deconstruct these models into their elementary assumptions and intentions — then finally, work your way back to the surface — reconstructing their functional forms, optimization exigencies and loss functions one step at a time. In this post, we’re going to stay on the surface: instead of deconstructing common models, we’re going to further explore the relationships between them — swimming to different corners of the pool itself. Keeping us afloat will be Bayes’ theorem — a balanced, dependable yet at times fragile pool ring, so to speak — which we’ll take with us wherever we go.
Gradient boosting is a technique attracting attention for its prediction speed and accuracy, especially with large and complex data, as evidenced in the chart below showing the rapid growth of Google searches for xgboost (the best gradient boosting R package). From data science competitions to machine learning solutions for business, gradient boosting has produced best-in-class results. In this blog post I describe what it is and how to use it in Displayr.
According to Yann LeCun, “adversarial training is the coolest thing since sliced bread.” Sliced bread certainly never created this much excitement within the deep learning community. Generative adversarial networks—or GANs, for short—have dramatically sharpened the possibility of AI-generated content, and have drawn active research efforts since they were first described by Ian Goodfellow et al. in 2014. GANs are neural networks that learn to create synthetic data similar to some known input data. For instance, researchers have generated convincing images from photographs of everything from bedrooms to album covers, and they display a remarkable ability to reflect higher-order semantic logic.
There are several ways to model seasonality in a time series. Traditionally, trend-cycle decomposition such as the Holt-Winters procedure has been very popular. Even today, applied researchers often try to account for seasonality by using seasonal dummy variables. But of course, in a stochastic process it seems unreasonable to assume that seasonal effects are purely deterministic. Therefore, in a time series context, seasonal extensions of the classical ARMA model are very popular. One of these extensions is the seasonal unit root model …
Propelled by a fast evolving landscape of techniques and datasets, data science is growing rapidly. Against this background, topological data analysis (TDA) has carved itself a niche for the analysis of datasets that present complex interactions and rich structures. Its distinctive feature, topology, allows TDA to detect, quantify and compare the mesoscopic structures of data, while also providing a language able to encode interactions beyond networks. Here we briefly present the TDA paradigm and some applications, in order to highlight its relevance to the data science community.
The second post in this series of tutorials for implementing machine learning workflows in Python from scratch covers implementing the k-means clustering algorithm.
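For a sense of what a from-scratch implementation involves, here is a minimal NumPy sketch of the standard k-means loop (often called Lloyd's algorithm). It is not the tutorial's own code: the function name, the random initialisation from data points, and the stopping rule are assumptions chosen for brevity.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Cluster the rows of X into k groups; return (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Assumed initialisation: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    def assign(points, centers):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(X, centroids)
        # Move each centroid to the mean of its points; keep it in place
        # if no points were assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment against the final centroids.
    return centroids, assign(X, centroids)
```

Calling `k_means(X, k=3)` on an array of shape `(n_samples, n_features)` returns the centroid coordinates and an integer cluster label for each row. The result depends on the random initialisation, so running it with several seeds and keeping the best clustering is a common precaution.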