R sucks

R is great. It is because R is so great that I am bothered by its flaws and I want it to be even better.

Bayesian Item Response Theory in JAGS: A Hierarchical Two Parameter Logistic Model

I recently created a hierarchical two-parameter logistic model for item response theory (IRT). The JAGS script is now in the folder of scripts that accompany the book (available at the book’s web site). Below are slides that accompany my presentation of the material. I hope the slides are self-explanatory for those of you who are already familiar with IRT, and maybe even for those of you who are not. Maybe one day I’ll record a narration over the slides and post a video. Meanwhile, I hope the slides below are useful.
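For readers less familiar with IRT, the two-parameter logistic (2PL) model, written here in its conventional notation (the slides may use different symbols), gives the probability that person i answers item j correctly as:

```latex
P(y_{ij} = 1 \mid \theta_i, a_j, b_j)
  = \frac{1}{1 + \exp\!\bigl(-a_j(\theta_i - b_j)\bigr)}
```

where theta_i is the person’s ability, b_j the item’s difficulty, and a_j the item’s discrimination; the hierarchical version places group-level priors on the item parameters so that items inform each other’s estimates.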

Kaggle Solution: What’s Cooking? (Text Mining Competition)

Tutorial on Text Mining, XGBoost and Ensemble Modeling in R

How to create a Twitter Sentiment Analysis app using R and Shiny

I will show you how to create a simple application in R and Shiny to perform Twitter sentiment analysis in real time. Working in RStudio, I first create a Shiny project and then build the interface in the ui.R file.

Predicting the Future

If you’re looking for a high-growth, high-demand area of data science that offers very high value and has few expert practitioners, look at time series forecasting, especially as it applies to demand forecasting in Supply Chain Management.

It’s time for businesses to use IT Operations Analytics!

IT Operations Analytics (ITOA), also known as Advanced Operational Analytics or IT Data Analytics, encapsulates technologies that are primarily used to discover complex patterns in high volumes of ‘noisy’ IT system availability and performance data.

SVM in Practice

Many machine learning articles and papers describe the wonders of the Support Vector Machine (SVM) algorithm. Nevertheless, when I used it on real data to obtain a high-accuracy classification, I stumbled upon several issues. I will try to describe the steps I took to make the algorithm work in practice.
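The specific issues the author hit aren’t listed in this excerpt, but two steps that commonly make or break SVMs on real data are scaling the features and tuning C and gamma. A sketch with scikit-learn (the toy data and parameter grid are my own choices, not the article’s):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-in for "real data": 300 samples, 2 informative features.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Step 1: scale the features. RBF kernels are distance-based, so an
# unscaled feature with a large range dominates the kernel.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Step 2: tune C and gamma -- the defaults are rarely optimal on real data.
grid = GridSearchCV(pipe,
                    {"svc__C": [0.1, 1, 10],
                     "svc__gamma": ["scale", 0.01, 0.1]},
                    cv=5)
grid.fit(X_train, y_train)
accuracy = grid.score(X_test, y_test)
```

Skipping either step is a frequent cause of the disappointing accuracy the author alludes to.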

Workflows in Python: Curating Features and Thinking Scientifically about Algorithms

This is the second post in a series about end-to-end data analysis in Python using scikit-learn’s Pipeline and GridSearchCV. In the first post, I got my data formatted for machine learning by encoding string features as integers, and then used the data to build several different models. I got things running really fast, which is great, but at the cost of being a little quick-and-dirty about some details. First, I encoded the features as integers, but they really should be dummy variables. Second, it’s worth going through the models a little more thoughtfully, to try to understand their performance and whether there’s any more juice I can get out of them.
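The dummy-variable point amounts to one-hot encoding: integer codes impose a spurious ordering on categories that many models will happily exploit. In practice you would reach for pandas.get_dummies or scikit-learn’s OneHotEncoder; this pure-Python sketch just shows the idea:

```python
def one_hot(values):
    """One-hot encode a list of category labels.

    Returns (categories, rows), where each row is a 0/1 indicator
    vector. Unlike integer encoding, no order is imposed on the labels.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]
    return categories, rows
```

Each string feature becomes len(categories) binary columns instead of one integer column.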

Analyzing Customer Churn – Time-Dependent Covariates

When you’re using Cox regression to model customer churn, you’re often interested in the effects of variables that change over a customer’s lifetime. For instance, you might be interested in how many times the customer has contacted support, how many times they’ve logged in during the last 30 days, or which web browser(s) they use. If you have, say, three years of historical customer data and you fit a Cox regression on that data using the covariate values that apply to customers right now, you’ll essentially be regressing customers’ churn hazards from months or years ago on their current characteristics. Your model will be allowing the future to predict the past. Not terribly defensible.
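The usual remedy (I’m assuming the standard counting-process approach here; the article may implement it differently) is to split each customer’s history into (start, stop] intervals so that every row carries the covariate value actually in force during that interval, with the event flag set only on the final interval. A pure-Python sketch of the reshaping:

```python
def to_counting_process(obs_end, churned, covariate_changes):
    """Expand one customer's history into (start, stop, value, event) rows.

    covariate_changes: list of (time, value) pairs, sorted by time,
    with the first entry at time 0. The event flag is 1 only on the
    final interval, and only if the customer actually churned.
    """
    rows = []
    for i, (start, value) in enumerate(covariate_changes):
        stop = (covariate_changes[i + 1][0]
                if i + 1 < len(covariate_changes) else obs_end)
        event = 1 if (churned and stop == obs_end) else 0
        rows.append((start, stop, value, event))
    return rows
```

In R this table would feed a `Surv(start, stop, event)` outcome, so each hazard is regressed only on information available at that time.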

Prejudiced Minds & Prejudiced Data Scientists: Boon or Curse?

A conversation between an accused, a lawyer, a judge, and an objective and a subjective data scientist in a court of law:

Set up Sublime Text as a lightweight, all-in-one data science IDE

Good IDEs are everywhere: RStudio for R, PyCharm for Python, IntelliJ for Scala. But each is specialized for one language, and what I wanted was a tool for broad, multi-language data science tasks with autocompletion, FTP, syntax highlighting, formatting, split editing, local/remote evaluation, and a REPL. I especially wanted REPL functionality that takes a single input, evaluates it, and returns the result for quick prototyping. Sublime Text is a popular text editor with a massive number of plugins that provide quite a lot of what you need. By adding a custom REPL, Sublime Text becomes an all-in-one tool for data science tasks, from Hadoop ETL and Spark machine learning to HTML/CSS editing and Markdown reporting.

Fraud Detection with R and Azure

Detecting fraudulent transactions is a key application of statistical modeling, especially in an age of online transactions. R, of course, has many functions and packages suited to this purpose, including binary-classification techniques such as logistic regression.
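The article works in R with Azure, but the underlying binary-classification idea is easy to show in miniature. Here is a logistic-regression scorer fit by batch gradient descent (the toy data and all names are illustrative, not taken from the article):

```python
import math

def train_logistic(X, y, lr=0.5, epochs=3000):
    """Fit w, b for P(fraud) = sigmoid(w.x + b) by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted fraud probability
            err = p - yi                      # gradient of log-loss wrt z
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gwj / n for wj, gwj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(w, b, x):
    """Score a transaction: probability it is fraudulent."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: one scaled feature (e.g. transaction amount), 1 = fraud.
w, b = train_logistic([[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]],
                      [0, 0, 0, 1, 1, 1])
```

In practice you would use `glm(..., family = binomial)` in R or a packaged learner rather than hand-rolled gradient descent, but the model being fit is the same.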