**Aspiring Data Scientists! Start to learn Statistics with these 6 books!**

1. You Are Not So Smart — by David McRaney

2. Think Like a Freak — by Dubner & Levitt

3. Innumeracy — by John Allen Paulos

4. Naked Statistics — by Charles Wheelan

5. Practical Statistics for Data Scientists — by Andrew & Peter Bruce

6. Think Stats — by Allen B. Downey

**Scalable Select of Random Rows in SQL**

If you’re new to the big data world and are migrating from tools like Google Analytics or Mixpanel for your web analytics, you have probably noticed performance differences. Google Analytics can show you predefined reports in seconds, while the same query on the same data in your data warehouse can take several minutes or even longer. Such performance gains are achieved by sampling: running the query on a random subset of rows rather than on the full data set. Let’s learn how to select random rows in SQL.
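As a rough illustration of the idea, the sketch below contrasts two ways of drawing random rows, using Python's built-in `sqlite3` module. The `events` table, its columns, and the `sample_rows` helper are hypothetical, and this is not necessarily the approach the linked article takes. Sorting by `RANDOM()` has to order the whole table, whereas filtering each row against a random threshold needs only a single pass and returns approximately (not exactly) the requested fraction.

```python
import sqlite3

# Hypothetical in-memory table standing in for a web-analytics event log.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.executemany(
    "INSERT INTO events (user_id) VALUES (?)",
    [(i % 500,) for i in range(10_000)],
)

# Naive approach: exact sample size, but the engine must sort every row.
naive = conn.execute(
    "SELECT id FROM events ORDER BY RANDOM() LIMIT 100"
).fetchall()


def sample_rows(connection, fraction):
    """Return roughly `fraction` of the rows using a per-row random filter.

    SQLite's random() yields a signed 64-bit integer; masking with 2^31 - 1
    maps it to a uniform value in [0, 2^31), which is compared to a threshold.
    """
    threshold = int(fraction * 2**31)
    return connection.execute(
        "SELECT id FROM events WHERE (random() & 2147483647) < ?",
        (threshold,),
    ).fetchall()


# Scalable approach: one pass over the table, about 10% of rows kept.
sample = sample_rows(conn, 0.10)
print(len(naive), len(sample))
```

The second query trades an exact sample size for speed, which is usually acceptable when the goal is an approximate report.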

**Human Involvement Helps Researchers Perfect New Algorithms to Train Robots**

Many underestimate the role of humans in the successful deployment of AI solutions. The Alegion engine produces AI training data and enables content moderation, sentiment analysis, data enrichment, tagging, categorization, and more.

**CatBoost vs. Light GBM vs. XGBoost**

I recently participated in this Kaggle competition (WIDS Datathon by Stanford), where I was able to finish in the top 10 using various boosting algorithms. Since then, I have been very curious about the fine workings of each model, including parameter tuning and pros and cons, and hence decided to write this blog. Despite the recent re-emergence and popularity of neural networks, I am focusing on boosting algorithms because they are still more useful in the regime of limited training data, little training time and little expertise for parameter tuning.

**The most prolific package maintainers on CRAN**

During a discussion with some other members of the R Consortium, the question came up: who maintains the most packages on CRAN? DataCamp maintains a list of the most active maintainers by downloads, but in this case we were interested in the total number of packages by maintainer. Fortunately, this is pretty easy to figure out thanks to the CRAN repository tools now included in R, and a little dplyr (see the code below) gives the answer quickly.

**Comparing additive and multiplicative regressions using AIC in R**

One of the basic things students are taught in statistics classes is that models can be compared using information criteria only when they have the same response variable. This means, for example, that when you model $\log(y_t)$ and calculate AIC, the value is not comparable with the AIC from a model of $y_t$. The reason is that the scales of the variables are different. But there is a way to make the criteria comparable in these two cases: both variables need to be transformed to the original scale, and we need to understand what the distributions of these variables are in that scale.
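One way to see how the scales can be reconciled is sketched below in Python with NumPy (the linked article works in R, so this is an illustration rather than its code). It assumes Gaussian errors in both models, so the model fitted to the logarithm implies a log-normal distribution for the original variable. Under that assumption the log-likelihood on the original scale equals the log-scale log-likelihood minus the sum of the log responses (the Jacobian of the transformation). The synthetic data and all function names are made up for the example.

```python
import numpy as np


def ols_residuals(X, y):
    """Residuals of an ordinary least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


def gaussian_loglik(residuals):
    """Gaussian log-likelihood evaluated at the ML variance estimate RSS / n."""
    n = residuals.size
    sigma2 = np.sum(residuals**2) / n
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)


def aic(loglik, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * loglik


# Synthetic data generated by a multiplicative (log-linear) process.
rng = np.random.default_rng(42)
n = 200
x = rng.uniform(1, 10, n)
y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.2, n))

X = np.column_stack([np.ones(n), x])
k = X.shape[1] + 1  # regression coefficients plus the error variance

# Additive model: y is Gaussian on the original scale.
ll_additive = gaussian_loglik(ols_residuals(X, y))

# Multiplicative model: log(y) is Gaussian, so y is log-normal.
ll_log_scale = gaussian_loglik(ols_residuals(X, np.log(y)))
ll_multiplicative = ll_log_scale - np.sum(np.log(y))  # Jacobian correction

aic_additive = aic(ll_additive, k)
aic_log_scale = aic(ll_log_scale, k)            # NOT comparable with aic_additive
aic_multiplicative = aic(ll_multiplicative, k)  # comparable with aic_additive

print(aic_additive, aic_log_scale, aic_multiplicative)
```

Only `aic_additive` and `aic_multiplicative` refer to the same variable, so only that pair should be compared; the uncorrected log-scale value differs from the corrected one by twice the sum of the log responses.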

**Machine Learning Modelling in R :: Cheat Sheet**

I recently came across this excellent article, “Machine learning at central banks”, which I decided to use as the basis for a new cheat sheet called Machine Learning Modelling in R. The cheat sheet can be downloaded from the RStudio cheat sheets repository.

There was a recent blog post on mental models for deep learning that draws parallels from optics. We all have intuitions for a few models, but they are hard to put into words; I believe we need to work collectively on this mental model.
