Storytelling: The Power to Influence in Data Science

Data scientists need to share results, which is different from talking shop with other data scientists. Read about influencing people and telling stories as a data scientist.

Business Intuition in Data Science

Often when we think of a data science assignment, the main thing that comes to mind is the algorithmic technique that needs to be applied. While that is crucially important, there are many other steps in a typical data science assignment that require equal attention.

Introduction to Machine Learning with Python’s Scikit-learn

In this post, we’ll be doing a step-by-step walkthrough of a basic machine learning project, geared toward people with some knowledge of programming (preferably Python), but who don’t have much experience with machine learning. By the end of this post, you’ll understand what machine learning is, how it can help you, and be able to build your own machine learning classifiers for any dataset you want. We’ll teach a computer how to distinguish between ‘clickbait’ headlines and ‘normal’ headlines, where the former are those irritating ‘You won’t believe what X does to Y’ type headlines that deliberately withhold information to try to make people click on the article.
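As a rough illustration of the kind of classifier described above, here is a minimal sketch using scikit-learn. The headlines and labels are invented for this example, and the choice of a bag-of-words representation with naive Bayes is an assumption, not necessarily the approach taken in the linked post.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy headlines (1 = clickbait, 0 = normal); not taken from the post.
headlines = [
    "You won't believe what this cat does to its owner",
    "10 tricks doctors don't want you to know",
    "What happens next will shock you",
    "You won't believe these 7 weird tricks",
    "Central bank raises interest rates by a quarter point",
    "Parliament passes annual budget after long debate",
    "Scientists publish study on ocean temperatures",
    "City council approves new bus routes",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Turn each headline into word counts, then fit a naive Bayes classifier on them.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(headlines, labels)

# Classify two unseen headlines.
predictions = model.predict([
    "You won't believe what happens next",
    "Council approves budget for new bus routes",
])
print(predictions)
```

With a real dataset you would hold out part of the data and measure accuracy on it, since eight examples are far too few to say anything about how well the model generalizes.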

Strength, Wits and Sympy

This is a document about another exercise in decision theory applied to video games. The video game in this edition is Divinity: Original Sin 2. It’s actually very good and it gave me a small nerd snipe.

Word Embeddings: A Natural Language Processing Crash Course

The field of natural language processing (NLP) makes it possible to understand patterns in large amounts of language data, from online reviews to audio recordings. But before a data scientist can really dig into an NLP problem, he or she must lay the groundwork that helps a model make sense of the different units of language it will encounter. Word embeddings are a set of feature engineering techniques widely used in predictive NLP modeling, particularly in deep learning applications. Word embeddings transform sparse vector representations of words into a dense, continuous vector space, enabling you to identify similarities between words and phrases, on a large scale, based on their context. In this piece, I’ll explain the reasoning behind word embeddings and demonstrate how to use these techniques to create clusters of similar words using data from 500,000 Amazon reviews of food. You can download the dataset to follow along.
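To make the sparse versus dense distinction concrete, here is a small sketch in plain Python. The three-word vocabulary and the two-dimensional dense vectors are hand-picked for illustration; real embeddings are learned from data and have many more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vocab = ["coffee", "tea", "dog"]

# Sparse one-hot vectors: one dimension per word, a single 1 in each.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hypothetical dense vectors, invented here to mimic what an embedding might learn.
dense = {
    "coffee": [0.9, 0.1],
    "tea": [0.8, 0.2],
    "dog": [0.1, 0.9],
}

# Distinct one-hot vectors are always orthogonal, so they carry no similarity signal.
print(cosine(one_hot["coffee"], one_hot["tea"]))

# Dense vectors can place related words close together.
print(cosine(dense["coffee"], dense["tea"]))
print(cosine(dense["coffee"], dense["dog"]))
```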

Anomaly detection for writing styles

Applying NLP feature engineering and stylometry principles to identify authorship similarity

It Only Takes One Line of Code to Run Regression

I learned how important it is to understand data before running algorithms, how important it is to know the context and the industry before jumping into getting insights, how it is very easy to make models but tough to get them to work for you, and finally, how it only takes one line of code to run linear regression on your dataset.

Why an (interactive) picture is worth a thousand numbers

Miriah Meyer explores how interactive visualizations can help us find meaning in mounds of data.

Qualitative Research in R

In the last two posts, I’ve focused purely on statistical topics: one-way ANOVA and dealing with multicollinearity in R. In this post, I’ll deviate from the pure statistical topics and will try to highlight some aspects of qualitative research. More specifically, I’ll show you the procedure for text mining and for visualizing the results of the text analysis with a word cloud.

7 types of Artificial Neural Networks for Natural Language Processing

What is an artificial neural network? How does it work? What types of artificial neural networks exist? How are different types of artificial neural networks used in natural language processing? We will discuss all these questions in the following article.

The R manuals in bookdown format

While there are hundreds of excellent books and websites devoted to R, the canonical source of truth regarding the R system remains the R manuals. You can find the manuals at your local CRAN mirror and on your laptop as part of the R distribution (try Help > Manuals in RGui, or Help > R Help in RStudio to find them). Unlike books, the R manuals are updated by the R Core Team with every new release, so if you’re not sure how the base R system is supposed to work, this is the place to check.

First steps with MRF smooths

One of the specialist smoother types in the mgcv package is the Markov Random Field (MRF) smooth. This smoother essentially allows you to model spatial data with an intrinsic Gaussian Markov random field (GMRF). GMRFs are often used for spatial data measured over discrete spatial regions. MRFs are quite flexible as you can think about them as representing an undirected graph whose nodes are your samples and the connections between the nodes are specified via a neighbourhood structure. I’ve become interested in using these MRF smooths to include information about relationships between species. However, these smooths are not widely documented in the smoothing literature so working out how best to use them to do what we want has been a little tricky once you move beyond the typical spatial examples. As a result I’ve been fiddling with these smooths, fitting them to some spatial data I came across in a tutorial Regional Smoothing in R from The Pudding. In this post I take a quick look at how to use the MRF smooth in mgcv to model a discrete spatial data set from the US Census Bureau.

Ensemble learning for time series forecasting in R

Ensemble learning methods are widely used nowadays because they improve predictive performance. Ensemble learning combines multiple predictions (forecasts) from one or multiple methods to improve on the accuracy of a single prediction and to avoid possible overfitting. In the domain of time series forecasting, the situation is somewhat complicated by dynamic changes in incoming data. However, when a single regression model is used for forecasting, time dependency is not an obstacle: we can tune the model at the current position of a sliding window. For this reason, in this post, I will describe two simple ensemble learning methods: Bagging and Random Forest. Bagging will be used in combination with the two simple regression tree methods from the previous post (RPART and CTREE). I will not repeat most of the things mentioned there, so read it first if you haven’t already: Using regression trees for forecasting double-seasonal time series with trend.
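The idea behind bagging can be sketched in a few lines of plain Python. This is not the post's R code: a one-split regression tree (a stump) stands in for RPART and CTREE, the data is an invented step-shaped series, and the function names `fit_stump`, `bagging_fit` and `bagging_predict` are made up for this example.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing the sum of squared errors."""
    best = None
    # Candidate thresholds skip the smallest x so both sides are never empty.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values are identical, so fall back to the overall mean.
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def bagging_fit(xs, ys, n_models=50, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Average the predictions of all base models."""
    return mean(model(x) for model in models)

# Hypothetical series: a level shift at x = 10 plus a small repeating pattern.
xs = list(range(20))
ys = [10.0 + (i % 3) for i in range(10)] + [20.0 + (i % 3) for i in range(10)]

models = bagging_fit(xs, ys)
low = bagging_predict(models, 2)
high = bagging_predict(models, 17)
print(low, high)
```

Because each stump sees a different resample, the split points vary slightly, and averaging them smooths the prediction compared with any single tree.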