A comprehensive beginners guide to Linear Algebra for Data Scientists

One of the most common question we get on Analytics Vidhya is: ‘How much maths do I need to learn to be a data scientist?’ Even though the question sounds simple, there is no simple answer to the the question. Usually, we say that you need to know basic descriptive and inferential statistics to start. That is good to start. But, once you have covered the basic concepts in machine learning, you will need to learn some more math. You need it to understand how these algorithms work. What are their limitations and in case they make any underlying assumptions. Now, there could be a lot of areas to study including algebra, calculus, statistics, 3-D geometry etc. If you get confused (like I did) and ask experts what should you learn at this stage, most of them would suggest / agree that you go ahead with Linear Algebra. But, the problem does not stop there. The next challenge is to figure out how to learn Linear Algebra. You can get lost in the detailed mathematics and derivation and learning them would not help as much! I went through that journey myself and hence decided to write this comprehensive guide. If you have faced this question about how to learn & what to learn in Linear Algebra – you are at the right place. Just follow this guide.

Two Sigma Financial Modeling Challenge, Winner’s Interview: 2nd Place, Nima Shahbazi, Chahhou Mohamed

Our Two Sigma Financial Modeling Challenge ran from December 2016 to March 2017 this year. Asked to search for signal in financial markets data with limited hardware and computational time, this competition attracted over 2000 competitors. In this winners’ interview, 2nd place winners’ Nima and Chahhou describe how paying close attention to unreliable engineered features was important to building a successful model.

Regular Expression & Treemaps to Visualize Emergency Department Visits

It’s been a while since my last post on some TB WHO data. A lot has happened since then, including the opportunity to attend the Open Data Science Conference (ODSC) East held in Boston, MA. Over a two day period I had the opportunity to listen to a number of leaders in various industries and fields. It was inspiring to learn about the wide variety of data science applications ranging from finance and marketing to genomics and even the refugee crisis. One of the workshops at ODSC was text analytics, which includes basic text processing, dendrograms, natural language processing and sentiment analysis. This gave me the thought of applying some text analytics to visualize some data I was working on last summer. In this post I’m going to walk through how I used regular expression to label classification codes in a large dataset (NHAMCS) representing emergency department visits in the United States and eventually visualize the data.

A recipe for dynamic programming in R: Solving a “quadruel” case from PuzzlOR

As also described in Cormen, et al (2009) p. 65, in algorithm design, divide-and-conquer paradigm incorporates a recursive approach in which the main problem is:
• Divided into smaller sub-problems (divide),
• The sub-problems are solved (conquer),
• And the solutions to sub-problems are combined to solve the original and “bigger” problem (combine).
Instead of constructing indefinite number of nested loops destroying the readability of the code and the performance of execution, the “recursive” way utilizes just one block of code which calls itself (hence the term “recursive”) for the smaller problem. The main point is to define a “stop” rule, so that the function does not sink into an infinite recursion depth. While nested loops modify the same object (or address space in the low level sense), recursion moves the “stack pointer”, so each recursion depth uses a different part of the stack (a copy of the objects will be created for each recursion). This illustrates a well-known trade-off in algorithm design: Memory versus performance; recursion enhances performance at the expense of using more memory.

Microsoft R Open 3.4.0 now available

Microsoft R Open (MRO), Microsoft’s enhanced distribution of open source R, has been upgraded to version 3.4.0 and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to R 3.4.0, reduces the size of the installer image, and updates the bundled packages. R 3.4.0 (upon which MRO 3.4.0 is based) is a major update to the R language, with many fixes and improvements. Most notably, R 3.4.0 introduces a just-in-time (JIT) compiler to improve performance of the scripts and functions that you write. There have been a few minor tweaks to the language itself, but in general functions and packages written for R 3.3.x should work the same in R 3.4.0. As usual, MRO points to a fixed CRAN snapshot from May 1 2017, but you can use the built-in checkpoint package to access packages from an earlier date (for compatibility) or a later date (to access new and updated packages). MRO is supported on Windows, Mac and Linux (Ubuntu, RedHat/CentOS, and SUSE). MRO 3.4.0 is 100% compatible with R 3.4.0, and you can use it with any of the 10,000+ packages available on CRAN. Here are some highlights of new packages released since the last MRO update. We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below.

What Kaggle has learned from almost a million data scientists

This is a keynote highlight from the Strata Data Conference in London 2017.

Data science and deep learning in retail

In this episode of the Data Show, I spoke with Jeremy Stanley, VP of data science at Instacart, a popular grocery delivery service that is expanding rapidly. As Stanley describes it, Instacart operates a four-sided marketplace comprised of retail stores, products within the stores, shoppers assigned to the stores, and customers who order from Instacart. The objective is to get fresh groceries from popular retailers delivered to customers in a timely fashion. Instacart’s goals land them in the center of the many opportunities and challenges involved in building high-impact data products.

Running a word count application using Spark

This is a highlight from Ted Malaska’s Introduction to Apache Spark for Java and Scala developers.

DataScience.com Releases Python Package for Interpreting the Decision-Making Processes of Predictive Models

DataScience.com new Python library, Skater, uses a combination of model interpretation algorithms to identify how models leverage data to make predictions.

How A Data Scientist Can Improve His Productivity

Data science and machine learning are iterative processes. It is never possible to successfully complete a data science project in a single pass. A data scientist constantly tries new ideas and changes steps of his pipeline:
1. extract new features and accidentally find noise in the data
2. clean up the noise, find one more promising feature
3. extract the new feature
4. rebuild and validate the model, realize that the learning algorithm parameters are not perfect for the new feature set
5. change machine learning algorithm parameters and retrain the model
6. find the ineffective feature subset and remove it from the feature set
7. try a few more new features
8. try another ML algorithm. And then a data format change is required.
This is only a small episode in a data scientist’s daily life and it is what makes our job different from a regular engineering job.

Deep learning for natural language processing, Part 1

The machine learning revolution leaves no stone unturned. Natural language processing is yet another field that underwent a small revolution thanks to the second coming of artificial neural networks. Let’s just briefly discuss two advances in the natural language processing toolbox made thanks to artificial neural networks and deep learning techniques.

Python vs R. Which language should you choose?

Data science is an interdisciplinary field where scientific techniques from statistics, mathematics, and computer science are used to analyze data and solve problems more accurately and effectively. It is no wonder, then, that languages such as R and Python, with their extensive packages and libraries that support statistical methods and machine learning algorithms are cornerstones of the data science revolution. Often times, beginners find it hard to decide which language to learn first. This guide will help you make that decision.

A practical explanation of a Naive Bayes classifier

The simplest solutions are usually the most powerful ones, and Naive Bayes is a good proof of that. In spite of the great advances of the Machine Learning in the last years, it has proven to not only be simple but also fast, accurate and reliable. It has been successfully used for many purposes, but it works particularly well with natural language processing (NLP) problems. Naive Bayes is a family of probabilistic algorithms that take advantage of probability theory and Bayes’ Theorem to predict the category of a sample (like a piece of news or a customer review). They are probabilistic, which means that they calculate the probability of each category for a given sample, and then output the category with the highest one. The way they get these probabilities is by using Bayes’ Theorem, which describes the probability of a feature, based on prior knowledge of conditions that might be related to that feature. We’re going to be working with an algorithm called Multinomial Naive Bayes. We’ll walk through the algorithm applied to NLP with an example, so by the end not only will you know how this method works, but also why it works. Then, we’ll lay out a few advanced techniques that can make Naive Bayes competitive with more complex Machine Learning algorithms, such as SVM and neural networks.

Dimensionality Reduction Algorithms: Strengths and Weaknesses

Welcome to Part 2 of our tour through modern machine learning algorithms. In this part, we’ll cover methods for Dimensionality Reduction, further broken into Feature Selection and Feature Extraction. In general, these tasks are rarely performed in isolation. Instead, they’re often preprocessing steps to support other tasks.

Exponential Change and Unsupervised Learning

Brian Hopkins of Forrester Research recently penned an excellent blog post about why companies are getting disrupted and why they realize it so late. The post draws from Ray Kurzweil’s Law Of Accelerating Returns and speaks to the fact that the human brain doesn’t do well with exponential growth.

An Introduction to the MXNet Python API?

This post outlines an entire 6-part tutorial series on the MXNet deep learning library and its Python API. In-depth and descriptive, this is a great guide for anyone looking to start leveraging this powerful neural network library.

Versioning R model objects in SQL Server

If you build a model and never update it you’re missing a trick. Behaviours change so your model will tend to perform worse over time. You’ve got to regularly refresh it, whether that’s adjusting the existing model to fit the latest data (recalibration) or building a whole new model (retraining), but this means you’ve got new versions of your model that you have to handle. You need to think about your methodology for versioning R model objects, ideally before you lose any versions. You could store models with ye olde YYYYMMDD style of versioning but that means regularly changing your code to use the latest model version. I’m too lazy for that! If we’re storing our R model objects in SQL Server then we can utilise another SQL Server capability, temporal tables, to take the pain out of versioning and make it super simple. Temporal tables will track changes automatically so you would overwrite the previous model with the new one and it would keep a copy of the old one automagically in a history table. You get to always use the latest version via the main table but you can then write temporal queries to extract any version of the model that’s ever been implemented. Super neat! For some of you, if you’re not interested in the technical details you can drop off now with the knowledge that you can store your models in a non-destructive but easy to use way in SQL Server if you need to. If you want to see how it’s done, read on!

What’s Next for Big Data? Thinking Beyond Hadoop with Elastic Object Storage

In this special guest feature, Irshad Raihan, Product Marketing Manager at Red Hat Storage, discusses how organizations can save money and realize greater flexibility by moving data with lower business value to a more affordable storage solution. Irshad Raihan is a product manager at Red Hat Storage, responsible for product strategy, messaging, and go to market activities. Previously, he held senior product marketing and product management positions at HP and IBM responsible for big data and data management products. Irshad holds a Masters in Computer Science from Clemson University, and an MBA from Carnegie Mellon University.

What is an Ontology?

This is a short blog post to introduce the concept of an ontology for those who are unfamiliar with the term, or who have previously encountered explanations that make little or no sense, as I have. I’m aiming to “democratise knowledge of this topic” as one of my colleagues put it.