A Comprehensive Guide to Data Exploration

There are no shortcuts for data exploration. If you believe that machine learning can sail you away from every data storm, trust me, it won’t. At some point, you’ll realize that you are struggling to improve your model’s accuracy. In such situations, data exploration techniques will come to your rescue. I can say this confidently because I’ve been through such situations a lot. I have been a Business Analytics professional for close to three years now. In my initial days, one of my mentors suggested that I spend significant time on exploring and analyzing data. Following his advice has served me well. I’ve created this tutorial to help you understand the underlying techniques of data exploration. As always, I’ve tried my best to explain these concepts in the simplest manner. For better understanding, I’ve taken up a few examples to demonstrate the complicated concepts.

Is 2016 the Year of AI?

If 2016 is to be “the year of AI”, as some folks are speculating, then we ought to take a look at what that might actually mean. For starters, is AI sufficiently mature, and will it matter in the everyday world of consumers? I’ll stipulate that AI is already relevant to a sizable minority of data scientists, especially those directly involved in AI projects. But as with the balance of data science, real economic importance comes when our knowledge is put to use in the broad economy. Hence the emphasis on whether consumers will give a hoot. As with a lot of other DS disciplines, this doesn’t mean that Jane and Joe Consumer even need to know that DS is at work. It does mean that Jane and Joe would recognize that their lives are less convenient or efficient if the DS were suddenly removed.

Sentiment analysis with machine learning in R

Machine learning makes sentiment analysis more convenient. This post introduces how to do sentiment analysis with machine learning using R. In the R landscape, the sentiment package and a more general text mining package have been developed by Timothy P. Jurka. You can check out the sentiment package and the fantastic RTextTools package. Timothy also wrote the maxent package for low-memory multinomial logistic regression (also known as maximum entropy).

Predicting links in ego-networks using temporal information

Link prediction appears as a central problem of network science, as it calls for unfolding the mechanisms that govern the micro-dynamics of the network. In this work, we are interested in ego-networks, that is, only the information about a node’s interactions with its neighbors, in the context of social relationships. As the structural information is very poor, we rely on another source of information to predict links among an ego’s neighbors: the timing of interactions. We define several features to capture different kinds of temporal information and apply machine learning methods to combine these various features and improve the quality of the prediction. We demonstrate the efficiency of this temporal approach on a cellphone interaction dataset, pointing out features which prove to perform well in this context, in particular the temporal profile of interactions and the elapsed time between contacts.
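To make the idea of a temporal feature concrete, here is a minimal Python sketch of one possible "elapsed time between contacts" feature. The feature definition, the function name min_elapsed_time, the toy log, and the working hypothesis that a short gap hints at a link are all illustrative assumptions; the excerpt does not say how the paper actually defines its features.

```python
from itertools import combinations

def min_elapsed_time(contacts):
    """For every pair of the ego's neighbors, return the smallest time gap
    between a contact with one and a contact with the other.

    `contacts` is a list of (timestamp, neighbor) tuples seen from the ego.
    This definition is a hypothetical stand-in, not the paper's feature.
    """
    times = {}
    for timestamp, neighbor in contacts:
        times.setdefault(neighbor, []).append(timestamp)

    features = {}
    for u, v in combinations(sorted(times), 2):
        features[(u, v)] = min(abs(a - b) for a in times[u] for b in times[v])
    return features

# Hypothetical ego log: (minutes since start of observation, neighbor).
ego_log = [
    (0, "alice"), (5, "bob"), (60, "carol"),
    (62, "alice"), (300, "bob"), (301, "carol"),
]

features = min_elapsed_time(ego_log)

# Assumed heuristic: pairs contacted close together in time are ranked
# as more likely to know each other.
ranked_pairs = sorted(features, key=features.get)
print(ranked_pairs)
```

In a fuller pipeline, a value like this would be one column among several temporal features passed to a classifier, which is the combination step the abstract describes.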

When Should I Check the Mail?

Mail is delivered by the USPS mailman at a regular but not observed time; what is observed is whether the mail has been delivered at a time, yielding somewhat-unusual “interval-censored data”. I describe the problem of estimating when the mailman delivers, write a simulation of the data-generating process, and demonstrate analysis of interval-censored data in R using maximum-likelihood (survival analysis with Gaussian regression using the survival library), MCMC (Bayesian model in JAGS), and likelihood-free Bayesian inference (custom ABC, using the simulation). This allows estimation of the distribution of mail delivery times. I compare those estimates from the interval-censored data with estimates from a (smaller) set of exact delivery-times provided by USPS tracking & personal observation, using a multilevel model to deal with heterogeneity apparently due to a change in USPS routes/postmen. Finally, I define a loss function on mail checks, enabling: a choice of optimal time to check the mailbox to minimize loss (exploitation); optimal time to check to maximize information gain (exploration); Thompson sampling (balancing exploration & exploitation indefinitely); and estimates of the value-of-information of another datapoint (to estimate when to stop exploration and start exploitation after a finite amount of data). Consider a question of burning importance: what time does the mailman come, bearing gifts, and when should I check my mail, considering that my mailbox is ~200m away?
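For readers unfamiliar with interval-censored data, here is a minimal Python sketch of the maximum-likelihood idea. It is not the author's R code: it assumes a Normal delivery-time model, uses made-up intervals (in minutes after midnight), and replaces a proper optimizer with a coarse grid search. Each observation only says that delivery happened somewhere in (lo, hi], so its likelihood contribution is F(hi) - F(lo), where F is the Normal CDF.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of Normal(mu, sigma), with explicit handling of infinite bounds."""
    if x == math.inf:
        return 1.0
    if x == -math.inf:
        return 0.0
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def log_likelihood(intervals, mu, sigma):
    """Sum of log(F(hi) - F(lo)) over all interval-censored observations."""
    total = 0.0
    for lo, hi in intervals:
        p = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
        total += math.log(max(p, 1e-300))  # guard against log(0)
    return total

def fit_grid(intervals, mus, sigmas):
    """Return the (mu, sigma) grid point with the highest log-likelihood."""
    return max(
        ((mu, sigma) for mu in mus for sigma in sigmas),
        key=lambda params: log_likelihood(intervals, *params),
    )

# Hypothetical data, minutes after midnight. (lo, hi) means "not there at lo,
# there at hi"; -inf / inf mean only one check was made that day.
intervals = [
    (630, 680), (640, 700), (650, 675),
    (-math.inf, 655), (680, math.inf), (660, math.inf),
    (-math.inf, 690), (645, 665), (670, 695),
]

mu_hat, sigma_hat = fit_grid(intervals, range(600, 721), range(5, 61))
print(f"estimated mean delivery: {mu_hat // 60}:{mu_hat % 60:02d}, sd: {sigma_hat} min")
```

The grid search keeps the sketch dependency-free. The maximum-likelihood route the post describes, a Gaussian survival regression with R's survival library, fits the same kind of model with a real optimizer.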

7 Steps to Understanding Deep Learning

Step 1: Introducing Deep Learning
Step 2: Getting Technical
Step 3: Backpropagation and Gradient Descent
Step 4: Getting Practical
Step 5: Convolutional Neural Nets and Computer Vision
Step 6: Recurrent Nets and Language Processing
Step 7: Further Topics

Understanding Rare Events and Anomalies: Why streak patterns change

As we head into 2016, we often look back at the past year and the overall history of rare events, and then try to extrapolate the future odds of the same rare event from it. See, for example, this recent, confusing CNBC segment, where two Wall Street strategists (Tobias Levkovich and David Bianco) both mishandled their understanding of recent market data (facts) and how to think about their odds going forward (probability theory). What we will illustrate here is that recherché past events have no usefulness in understanding the rarity of the same events in the future! So the flat or worse 2015 for the markets would be rare to some degree, but that is just a trivial understanding that can’t be used to guess the future.
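The point about past streaks can be checked with a small exact calculation. This Python sketch assumes that yearly outcomes are independent with a fixed probability of a "down" year. That independence and the value 1/4 are assumptions of the sketch, not claims taken from the article. Under them, a long streak is rare, yet the chance that the next year is down does not depend on the streak at all.

```python
from fractions import Fraction
from itertools import product

def conditional_down_probability(p_down, streak_length):
    """P(next year is down | the previous `streak_length` years were all down),
    computed by enumerating every outcome sequence of independent years."""
    p_streak = Fraction(0)           # P(first `streak_length` years all down)
    p_streak_then_down = Fraction(0)  # P(streak AND next year down)
    for outcome in product([True, False], repeat=streak_length + 1):
        prob = Fraction(1)
        for is_down in outcome:
            prob *= p_down if is_down else (1 - p_down)
        if all(outcome[:streak_length]):
            p_streak += prob
            if outcome[-1]:
                p_streak_then_down += prob
    return p_streak_then_down / p_streak

p = Fraction(1, 4)  # hypothetical probability of a down year

for k in range(6):
    streak_prob = p ** k  # how rare a k-year down streak is
    next_prob = conditional_down_probability(p, k)
    print(f"streak of {k}: P(streak) = {streak_prob}, P(next down | streak) = {next_prob}")

conditional_results = [conditional_down_probability(p, k) for k in range(6)]
```

If real market years are not independent, the conditional probability can differ from p, so the sketch only illustrates the independent case.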

The R Journal, Volume 7/2, December 2015 – is online!

The new issue of The R Journal is now available!

A video tutorial on R programming – The essentials

Here is my video tutorial on R programming – The essentials. This tutorial is meant for those who would like to learn R, for R beginners, or for those who would like to get a quick start on R. It focuses on the main functions that any R programmer is likely to use often, rather than trying to cover every aspect of R with all its subtleties. You can clone the R tutorial used in this video, along with the PowerPoint presentation, from GitHub. For this you will have to install Git on your laptop. After you have installed Git, you should be able to clone this repository and then try the functions out yourself in RStudio. Make your own variations to the functions as you familiarize yourself with R.

A Data Science Solution to the Question “What is Data Science?”

As this flowchart from Wikipedia illustrates, data science is about collecting, cleaning, analyzing and reporting data. But is it data science, or just a ‘sexed up term’ for Statistics (see embedded quote by Nate Silver)? It’s difficult to separate the two at this level of generality, so perhaps we need to define our terms.

Repel overlapping text labels in ggplot2

A while back I showed you how to make volcano plots in base R for visualizing gene expression results. This is just one of many genome-scale plots where you might want to show all individual results but highlight or call out important results by labeling them, for example, with a gene name. But if you want to annotate lots of points, the annotations usually get so crowded that they overlap one another and become illegible. There are ways around this, such as reducing the font size or adjusting the position or angle of the text, but these usually don’t completely solve the problem, and can even make the visualization worse. Here’s the plot again, reading the results directly from GitHub, and drawing the plot with ggplot2 and geom_text out of the box.