**Machine Learning Libraries by Programming Language**

Machine Learning Periodic Table

**Introduction To Data Science**

Use the R programming language to execute data science projects and become a data scientist (with sources on GitHub).

**Using Hadoop Streaming API to Perform a Word Count job in R and C++**

In this tutorial we showed how to submit a simple Map/Reduce job via the Hadoop Streaming API. Interestingly, we used an R script as the mapper and a C++ program as the reducer. In an upcoming blog post we’ll explain how to run a job using the rmr2 package.
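The tutorial itself uses an R mapper and a C++ reducer; the sketch below is a Python stand-in (not the tutorial's code) that illustrates the contract Hadoop Streaming imposes on both sides: the mapper emits tab-separated key/value lines, and the reducer receives those lines sorted by key. The function names `mapper`, `reducer`, and `run_local` are hypothetical, and `sorted()` is only a local substitute for Hadoop's shuffle-and-sort phase.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print to stdout."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts per word.

    Assumes records arrive sorted by key, which is what the streaming
    framework provides between the map and reduce phases.
    """
    parsed = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


def run_local(lines):
    """Simulate the job locally; sorted() stands in for shuffle-and-sort."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for word, count in sorted(run_local(["the cat the dog", "The end"]).items()):
        print(f"{word}\t{count}")
```

In a real streaming job the two functions would live in separate scripts that read from standard input and write to standard output, which is what makes it possible to mix languages the way the tutorial does.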

**What are 20 (11) questions to detect fake data scientists?**

1. Explain what regularization is and why it is useful. What are the benefits and drawbacks of specific methods, such as ridge regression and LASSO?

2. Explain what a local optimum is and why it is important in a specific method, such as k-means clustering. What are specific methods for determining if you have a local optimum problem? What methods can be used to avoid local optima?

3. Assume you need to generate a predictive model of a quantitative outcome variable using multiple regression. Explain how you intend to validate this model.

4. Explain what precision and recall are. How do they relate to the ROC curve?

5. Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and prediction problems?

6. What is latent semantic indexing? What is it used for? What are the specific limitations of the method?

7. What is the Central Limit Theorem? Explain it. Why is it important? When does it fail to hold?

8. What is statistical power?

9. Explain what resampling methods are and why they are useful. Also explain their limitations.

10. Explain the differences between artificial neural networks with softmax activation, logistic regression, and the maximum entropy classifier.

11. Explain selection bias. Why is it important? How can data management procedures such as missing data handling make it worse?
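As a small reference point for question 4, here is a minimal sketch of how precision, recall, and the false positive rate fall out of the same confusion counts at one decision threshold. Recall is the true positive rate, so each threshold gives one (false positive rate, recall) point on the ROC curve. The helper name `rates_at_threshold` and the toy labels and scores are made up for illustration, and returning 0.0 when a denominator is zero is just one possible convention.

```python
def rates_at_threshold(y_true, y_score, threshold):
    """Return (precision, recall, false_positive_rate) at one decision threshold."""
    tp = fp = fn = tn = 0
    for truth, score in zip(y_true, y_score):
        predicted_positive = score >= threshold
        if predicted_positive and truth:
            tp += 1
        elif predicted_positive and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    # Assumption: report 0.0 when a rate is undefined (zero denominator).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, false_positive_rate


# Toy data, invented for this example.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

if __name__ == "__main__":
    for threshold in (0.2, 0.5, 0.85):
        print(threshold, rates_at_threshold(labels, scores, threshold))
```

Sweeping the threshold from high to low traces the ROC curve from the lower left to the upper right. Precision has no fixed place on that curve because it also depends on how many positives there are in the data.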

**Managing the Data Pipeline with Git + Luigi**

One of the common pains of managing data, especially for larger companies, is that a lot of data gets dirty (which you may or may not even notice!) and becomes scattered everywhere. Many ad hoc scripts run in different places, and these scripts silently generate dirty data. Further, when a script fails (often during the night, as luck would have it), it is very difficult to determine how to recover from the failure unless the maintainer of the script is available. The more the company grows, the more messy data is generated and scattered, and the harder the data pipeline becomes to maintain.

**Why Mixed Models are Harder in Repeated Measures Designs: G-Side and R-Side Modeling**

That Subtle, Elusive, Important Difference between Repeated and Random Effects:

There are two different ways of dealing with repeated measures in a mixed model.

These two ways have both theoretical and analytical implications, so you need to make deliberate choices about which to use.

One is through modeling the random effects. This is called G-side modeling because it’s about estimating parts of the G matrix: the covariance matrix of the random effects.

(Warning: not every software manual and textbook calls it G. D is also common.)

The other is through modeling the multiple residuals for each subject. This is called R-side modeling because it estimates the R matrix: the covariance matrix of residuals for each subject (warning: it is also often called the Sigma matrix).
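To see how the two sides relate, it can help to write out the marginal covariance of one subject's repeated measures, V = Z G Zᵀ + R, for the simplest case. The sketch below uses plain nested lists and made-up variance values. It compares a G-side random intercept with independent residuals against an R-side compound-symmetric residual matrix with no random effects. The helper `marginal_covariance` and all the numbers are assumptions for illustration, not taken from the article.

```python
def marginal_covariance(Z, G, R):
    """Return V = Z G Z^T + R for one subject, with matrices as nested lists."""
    n, q = len(Z), len(G)
    V = [[R[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            V[i][j] += sum(
                Z[i][a] * G[a][b] * Z[j][b] for a in range(q) for b in range(q)
            )
    return V


n_measures = 3          # repeated measures per subject (illustrative)
var_subject = 4.0       # between-subject variance (illustrative)
var_residual = 1.0      # within-subject residual variance (illustrative)

# G-side: a random intercept per subject, residuals independent.
Z_intercept = [[1.0] for _ in range(n_measures)]
G_intercept = [[var_subject]]
R_independent = [
    [var_residual if i == j else 0.0 for j in range(n_measures)]
    for i in range(n_measures)
]
V_g_side = marginal_covariance(Z_intercept, G_intercept, R_independent)

# R-side: no random effects; the correlation lives in a compound-symmetric R.
Z_none = [[] for _ in range(n_measures)]
G_none = []
R_compound_symmetric = [
    [var_subject + var_residual if i == j else var_subject for j in range(n_measures)]
    for i in range(n_measures)
]
V_r_side = marginal_covariance(Z_none, G_none, R_compound_symmetric)

if __name__ == "__main__":
    for row in V_g_side:
        print(row)
    print("Same marginal covariance:", V_g_side == V_r_side)
```

In this special case both specifications imply the same V, which is part of why the distinction is easy to miss. They stop being interchangeable once the structure gets richer: a random intercept can only produce a non-negative covariance between measures, while an R-side structure such as an autoregressive or unstructured matrix has no G-side equivalent of this simple form.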