Machine Learning Algorithms In Layman’s Terms, Part 1

As a recent graduate of the Flatiron School’s Data Science Bootcamp, I’ve been inundated with advice on how to ace technical interviews. A soft skill that keeps coming to the forefront is the ability to explain complex machine learning algorithms to a non-technical person.

Why You Should Learn About Streaming Data Science

Traditional machine learning trains models on historical data. This approach assumes that the world essentially ‘stays the same’: that the patterns, anomalies, and mechanisms observed in the past will recur in the future. So ‘predictive’ analytics is really looking to the past rather than the future. Streaming business intelligence is an innovative technology that allows business users to ‘query the future’ based on real-time data from any streaming source, including IoT sensors, web interactions or transactions, GPS position information, and social media content. At the same time, we can now apply data science models to streaming data. With analysis no longer bound to look only at the past, the implications of streaming data science are profound.

10 Myths About Data Scientists

Myth #1: It’s a male dominated field.
Myth #2: You have to know how to code
Myth #3: You have to be an egghead to become a data scientist
Myth #4: A Master’s degree in Data Science = Data Scientist
Myth #5: ‘Data Scientist’ and ‘Business Analyst’ are the same thing
Myth #6: There’s a Shortage of Data Scientists.
Myth #7: Data Scientists Earn the Big Bucks.
Myth #8: AI Will Replace the Data Scientist
Myth #9: It’s all about the tools
Myth #10: Data Science is a lifetime career.

Understanding Decision Trees (once and for all!)

This article is for complete beginners in Machine Learning who want to understand one of the simplest algorithms, yet one of the most important because of its interpretability, its predictive power, and its use in variants such as Random Forest and Gradient Boosting Trees. It is also for all the Machine Learners like me who rushed towards the children of Decision Trees (Random Forest or Gradient Boosting Trees) because they usually performed better in Kaggle competitions, without first getting familiar with Decision Trees and unveiling all their mystery. The first part of the article is about setting up the dataset and model; the second part is about understanding the model: the Decision Tree.
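
To make the idea of a split concrete, here is a minimal sketch of how a tree might choose one threshold on one numeric feature. It assumes Gini impurity as the split criterion, which is a common choice but not necessarily the one the article uses, and the toy measurements and the names `gini` and `best_split` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split.

    Candidate thresholds are midpoints between consecutive distinct values.
    If no split improves on the unsplit impurity, the threshold is None.
    """
    best = (None, gini(labels))
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Weight each child's impurity by the share of samples it receives.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: one feature, two classes.
petal_lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa",
           "versicolor", "versicolor", "versicolor"]

threshold, impurity = best_split(petal_lengths, species)
print(threshold, impurity)
```

A full decision tree would repeat this search over every feature and then recurse into the two resulting groups; the sketch covers only a single node.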

Efficient MCMC with Caching

This post is part of a running series on Bayesian MCMC tutorials.

Probability Theory for Deep Learning

Why do we need a foundation in Probability Theory in order to understand machine/deep learning algorithms? The answer to that question is the main motivation behind this article. Machine/deep learning often deals with stochastic or random quantities, which can be thought of as non-deterministic (something which cannot be predicted beforehand or which exhibits random behaviour). The study of these quantities is quite different from that of the deterministic quantities arising in a range of computer science fields. It is therefore desirable to be able to reason in an environment of uncertainty, and probability theory is the tool that helps us do so. Because I do not want to cloud your thoughts with mathematical jargon at the very beginning, I have added a section on applications of all these ideas at the very end of the article. That should be your prime motivation for understanding this material. Let’s begin, then. So what makes any system prone to these uncertainties?

Building Blocks: Text Pre-Processing

In the last article of our series, we introduced the concept of Natural Language Processing (you can read it here), and now you probably want to try it yourself, right? Great! Without further ado, let’s dive into the building blocks of statistical natural language processing. In this article, we’ll introduce the key concepts, along with a practical implementation in Python and the challenges to keep in mind when applying them.
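
As a rough illustration of what such building blocks can look like, the sketch below lowercases text, splits it into tokens, and drops stop words. The steps chosen, the tiny stop-word list, and the `preprocess` name are assumptions made for this example, not the article’s own pipeline.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "to"}


def preprocess(text):
    """Lowercase, tokenize on runs of letters and digits, remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("The cat is in the hat, and the hat is RED!"))
```

One challenge shows up immediately: this simple pattern splits contractions such as "don't" into "don" and "t", so a real pipeline usually needs a more careful tokenizer.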

Probabilistic Graphical Models: Bayesian Networks

In this article, I will give a detailed overview of Bayesian Networks, which form a class of Directed Graphical Models (DGM). I will also recap the probability theory that forms the basis of this approach.

Understanding the ROC and AUC metrics.

The ROC curve and the AUC are important evaluation metrics for measuring the performance of any classification model. These terms are common in the Machine Learning community, and each of us encounters them when we start to learn about classification models. However, most of the time they are not completely understood, or are misunderstood, so their real value goes unused. Under the hood, they are very simple calculations that just need a little demystification. The concepts of ROC and AUC build upon the Confusion Matrix, Specificity and Sensitivity. The example that I will use in this article is based on the Logistic Regression algorithm, but it is important to keep in mind that ROC and AUC apply to more than just Logistic Regression. Consider a hypothetical example containing a group of people. The y-axis has two categories: ‘will have Heart Disease’, represented by red circles, and ‘will not have Heart Disease’, represented by green circles. Along the x-axis, we have cholesterol levels. The classifier tries to classify people into the two categories depending on their cholesterol levels.
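
The calculation can be sketched in a few lines. The example below assumes each person has a true label (1 for heart disease, 0 for none) and a classifier score such as a predicted probability; the six labels and scores are invented for illustration. Each threshold yields one point, with Sensitivity (the true positive rate) on the y-axis and 1 minus Specificity (the false positive rate) on the x-axis, and the AUC is the area under those points by the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, strictest first.

    Assumes labels are 0 or 1 and that both classes are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nobody is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical data: 1 = heart disease, 0 = no heart disease.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

curve = roc_points(labels, scores)
print(curve)
print(auc(curve))
```

On this toy data the area equals the fraction of (diseased, healthy) pairs in which the diseased person receives the higher score, which is one common way to interpret the AUC.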

Using LDA Topic Models as a Classification Model Input

Topic Modeling in NLP seeks to find hidden semantic structure in documents. Topic models are probabilistic models that can help you comb through massive amounts of raw text and cluster similar groups of documents together in an unsupervised way. This post specifically focuses on Latent Dirichlet Allocation (LDA), a technique proposed in 2000 for population genetics and re-discovered independently by ML-hero Andrew Ng et al. in 2003. LDA states that each document in a corpus is a combination of a fixed number of topics. A topic has a probability of generating various words, where the words are all the observed words in the corpus. These ‘hidden’ topics are then surfaced based on the likelihood of word co-occurrence. Formally, this is a Bayesian inference problem.
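
The ‘combination of topics’ idea is easiest to see by running the model forwards. The sketch below follows LDA’s generative story for a single document: draw the document’s topic mixture from a Dirichlet distribution, then for each word pick a topic from that mixture and a word from that topic. The two topics, their word probabilities, and the `alpha` values are invented for illustration, and the code only generates text; it does not perform the inference step that recovers topics from a real corpus.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, rng):
    """Generate one document following LDA's generative story."""
    doc_topics = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(len(topic_word)), weights=doc_topics)[0]
        distribution = topic_word[topic]
        word = rng.choices(list(distribution),
                           weights=list(distribution.values()))[0]
        words.append(word)
    return doc_topics, words


# Hypothetical topics: each assigns a probability to every word in the vocabulary.
topic_word = [
    {"goal": 0.45, "match": 0.45, "vote": 0.05, "senate": 0.05},  # sports-like
    {"goal": 0.05, "match": 0.05, "vote": 0.45, "senate": 0.45},  # politics-like
]

rng = random.Random(0)
doc_topics, words = generate_document(topic_word, alpha=[0.5, 0.5],
                                      length=10, rng=rng)
print(doc_topics)
print(words)
```

Fitting LDA reverses this process: given only the words, it estimates both the per-document topic mixtures and the per-topic word probabilities.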


A straightforward library that allows you to crawl, clean up, and deduplicate webpages to create massive monolingual datasets. Using this library, you should be able to create datasets larger than the one used by OpenAI for GPT-2.

How to build a simple flowchart with R: DiagrammeR package

Since I learned Markdown and started using R Notebooks in RStudio to produce high-quality PDF reports, I had hoped that I would never need Microsoft Office again. In fact, R Markdown let me accomplish everything, until I needed to make a flowchart. A flowchart is a diagram that represents a workflow. In medical research, flowcharts are used to show study populations and exclusions.

Explaining data science, AI, ML and deep learning to management – a presentation and a script

This series of three posts is meant to serve as an accompanying script for the Prezi presentation ‘Data science, AI, ML, DL and all that jazz’. In the first part of the series, we delved into the concept of data science. This part focuses on the presentation’s second chapter, which aims to give a non-technical explanation of what AI is, offer lots of examples, and also explain what AI isn’t (yet). As you might have noticed, AI is nowadays the buzzword of choice for almost every company out there that sells data analytics and ML-related services, such as Yuxi Global :).

On Building Effective Data Science Teams

As Data Science and AI make their way into almost every industry under the sun, so do the challenges of building a team capable of delivering successful AI projects. The demand for that archetypal ‘Data Scientist’ who is the perfect blend of statistician, programmer and communicator has never been greater. But as the dust settles, we have started hearing stories of failed projects and disenchanted professionals.

The Difference Between Data Scientists and Data Engineers

As the field of machine intelligence continues to expand, new roles are being created and existing ones are expanding. Many people don’t have a clear understanding of the difference between data scientists and data engineers. The article addresses the specific skill sets required for these two distinct career paths. Here are some of the core competencies of data scientists and data engineers, along with the overlapping areas:
Data scientists: mathematics & statistics, computer science, machine learning plus AI/deep learning, advanced analytics, and data storytelling
Data engineers: production-level programming, distributed systems, data transformation, data analytics, and data pipelines
Overlapping: data analytics and programming

How to Build a Deep Neural Network Without a Framework

Learn how to build an extensible deep neural network with NumPy and use it for image classification.

Introducing GPipe, an Open Source Library for Efficiently Training Large-scale Neural Network Models

Deep neural networks (DNNs) have advanced many machine learning tasks, including speech recognition, visual recognition, and language processing. Recent advances by BigGAN, BERT, and GPT-2 have shown that ever-larger DNN models lead to better task performance, and past progress in visual recognition tasks has also shown a strong correlation between model size and classification accuracy. For example, the winner of the 2014 ImageNet visual recognition challenge was GoogLeNet, which achieved 74.8% top-1 accuracy with 4 million parameters, while just three years later, the winner of the 2017 ImageNet challenge was Squeeze-and-Excitation Networks, which achieved 82.7% top-1 accuracy with 145.8 million (36x more) parameters. However, in the same period, GPU memory has only increased by a factor of ~3, and the current state-of-the-art image models have already reached the available memory found on Cloud TPUv2s. Hence, there is a strong and pressing need for an efficient, scalable infrastructure that enables large-scale deep learning and overcomes the memory limitation on current accelerators.