At the heart of most data mining, we are trying to represent complex things in a simple way. The more simply you can explain a phenomenon, the better you understand it. It’s a little zen – compression is the same as understanding. Warning: some math ahead, but stick with it – it’s worth it.
Python/scikit-learn: Detecting Which Sentences in a Transcript Contain a Speaker
Over the past couple of months I’ve been playing around with How I Met Your Mother transcripts, and the most recent thing I’ve been working on is how to extract the speaker for a particular sentence. This initially seemed like a really simple problem, as most of the initial sentences I looked at were structured like this: <speaker>: <sentence>. If they were all in that format then we could write a simple regular expression and move on, but unfortunately they aren’t. We could probably write a more complex regex to pull out the speaker, but I thought it’d be fun to see if I could train a model to work it out instead. The approach I’ve taken is derived from an example in the NLTK book.
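The simple colon-delimited case can be sketched with a regex like this. The pattern and the example line are illustrative assumptions, not taken from the actual transcripts or the post’s code:

```python
import re

# Sketch of the simple case: lines in the form "<speaker>: <sentence>".
# The pattern name and example line are made up for illustration.
SPEAKER_RE = re.compile(r"^(?P<speaker>[A-Z][A-Za-z ]*?):\s*(?P<sentence>.+)$")

def extract_speaker(line):
    """Return (speaker, sentence) if the line matches, else (None, line)."""
    match = SPEAKER_RE.match(line)
    if match:
        return match.group("speaker"), match.group("sentence")
    return None, line

print(extract_speaker("Ted: And that, kids, is how I met your mother."))
```

A regex like this also illustrates why it isn’t enough: any capitalised phrase followed by a colon (e.g. a stage direction) would be mistaken for a speaker, which is exactly the motivation for training a classifier instead.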
Python library for Probabilistic Graphical Models
pgmpy is a Python library for the creation, manipulation and implementation of Probabilistic Graphical Models (PGMs).
• Uses SciPy stack and NetworkX for mathematical and graph operations respectively.
• Provides interface to existing PGM algorithms.
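The core idea a library like pgmpy implements can be shown in a few lines of plain Python: a Bayesian network factorises a joint distribution into one conditional table per node. This is a minimal sketch of that factorisation, not pgmpy’s API, and the rain/sprinkler numbers are made up:

```python
# P(Rain)
p_rain = {True: 0.2, False: 0.8}
# P(Sprinkler | Rain)
p_sprinkler = {True: {True: 0.01, False: 0.99},
               False: {True: 0.4, False: 0.6}}
# P(WetGrass | Sprinkler, Rain)
p_wet = {(True, True): {True: 0.99, False: 0.01},
         (True, False): {True: 0.9, False: 0.1},
         (False, True): {True: 0.8, False: 0.2},
         (False, False): {True: 0.0, False: 1.0}}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) = P(rain) * P(sprinkler|rain) * P(wet|sprinkler,rain)."""
    return p_rain[rain] * p_sprinkler[rain][sprinkler] * p_wet[(sprinkler, rain)][wet]

# Summing the factorised joint over all 8 states recovers total probability 1.
total = sum(joint(r, s, w) for r in (True, False)
                           for s in (True, False)
                           for w in (True, False))
print(round(total, 10))
```

In pgmpy, the same structure would be declared as a graph with conditional probability tables attached, with the SciPy stack and NetworkX handling the numerics and graph operations underneath.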
Python: scikit-learn – Training a classifier with non numeric features
Following on from my previous posts on training a classifier to pick out the speaker in sentences of HIMYM transcripts the next thing to do was train a random forest of decision trees to see how that fared. I’ve used scikit-learn for this before so I decided to use that. However, before building a random forest I wanted to check that I could build an equivalent decision tree. I initially thought that scikit-learn’s DecisionTree classifier would take in data in the same format as nltk’s so I started out with the following code:
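The mismatch is that scikit-learn estimators expect numeric feature matrices, while nltk classifiers accept dicts of named features. A hand-rolled sketch of the one-hot conversion (in real code you would use scikit-learn’s DictVectorizer); the feature names here are invented for illustration:

```python
def fit_vocabulary(feature_dicts):
    """Map every (feature, value) pair seen in training to a column index."""
    vocab = {}
    for features in feature_dicts:
        for name, value in sorted(features.items()):
            key = (name, value)
            if key not in vocab:
                vocab[key] = len(vocab)
    return vocab

def transform(feature_dicts, vocab):
    """One-hot encode each feature dict into a fixed-width numeric row."""
    rows = []
    for features in feature_dicts:
        row = [0.0] * len(vocab)
        for name, value in features.items():
            col = vocab.get((name, value))
            if col is not None:  # feature/value pairs unseen in training are dropped
                row[col] = 1.0
        rows.append(row)
    return rows

train = [{"word-before": "ted", "starts-sentence": True},
         {"word-before": "marshall", "starts-sentence": False}]
vocab = fit_vocabulary(train)
X = transform(train, vocab)
print(X)
```

The resulting numeric matrix is what DecisionTreeClassifier (and a random forest of such trees) can actually consume.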
How to Make a Histogram with Basic R
A histogram is a visual representation of the distribution of a dataset. As such, the shape of a histogram is its most obvious and informative characteristic: it allows you to easily see where a relatively large amount of the data is situated and where there is very little data to be found (Verzani 2004). In other words, you can see where the middle of your data distribution is, how closely the data lie around this middle and where possible outliers are to be found. For all these reasons, histograms are a great way to get to know your data!
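The post itself uses R’s hist(), but the binning idea behind any histogram can be sketched in a few lines of Python. Equal-width bins are assumed here purely for illustration:

```python
def histogram(data, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Clamp so the maximum value lands in the last bin instead of overflowing.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 9]
print(histogram(data, 4))
```

The tall bins show where the data cluster and the near-empty trailing bins hint at an outlier, which is exactly what the shape of a histogram conveys.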
Creating composite figures with ggplot2 for reproducible research
So far, I have been preparing composite figures by plotting the data using ggplot2, and then putting the panels together in OmniGraffle or Adobe Illustrator. Of course, every time the data is updated, I would need to go back to the vector editing program. After moving my manuscript from Word to knitr, I figured I should also try to cut out the image editing step. ggplot2 does not make it easy to put different panels together in a seamless fashion and without any margins. However, by piecing together different StackOverflow answers, I found a way to extract different parts of the figures, and glue them back together with the gtable package.
Deriving Value with Data Visualization Tools
Steps for big data visualization success:
• Aim for context.
• Plan for speed and scale.
• Assure data quality.
• Display meaningful results.
• Deal with outliers.
R: Basics of Lists
Lists are a data type in R that are perhaps a bit daunting at first, but soon become amazingly useful. They are especially wonderful once you combine them with the powers of the apply() functions. This post will be part 1 of a two-part series on the uses of lists. In this post, we will discuss the basics – how to create lists, manipulate them, describe them, and convert them. In part 2, we’ll see how using lapply() and sapply() on lists can really improve your R coding.