Doing Naive Bayes Classification
This is an example of going from labeled text to machine classification, first with NLTK and then with the Python machine learning library scikit-learn. The examples are updated from my OpenVis Conf talk, which is more entertaining: https://www.youtube.com/watch?v=f41U936WqPM and slides: http://www.slideshare.net/arnicas/the-bones-of-a-bestseller
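For a quick feel of the fit/predict step (the talk itself uses NLTK and scikit-learn, so this is only a rough R analogue with the e1071 package, not the talk's code, and the word features below are invented):

# toy training data: does a document contain the word "free" / "meeting"?
library(e1071)
train <- data.frame(
  has_free = factor(c("yes", "yes", "no", "no"), levels = c("no", "yes")),
  has_meet = factor(c("no", "no", "yes", "yes"), levels = c("no", "yes")),
  label    = factor(c("spam", "spam", "ham", "ham"))
)
model <- naiveBayes(label ~ ., data = train)

# classify a new document that mentions "free" but not "meeting"
newdoc <- data.frame(has_free = factor("yes", levels = c("no", "yes")),
                     has_meet = factor("no",  levels = c("no", "yes")))
predict(model, newdoc)                 # predicted class
predict(model, newdoc, type = "raw")   # class probabilities

The post's real examples start from raw labeled text and build the features automatically; the sketch above only shows the final classification step.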
Business Model Innovation Needs to Be More of a (Data) Science
There’s a problem with how we commonly approach business model innovation. We tend to treat it as though it were an art. We try to innovate with more creative ideation, stronger phase gates or smaller trial-and-error cycles – and it’s not working. This post is a companion to my O’Reilly Strata talk on making business model innovation more of a science than an art. I’m studying ways that data science can improve business model innovation.
Developing an Open Source Software Library for Big Data
XDATA is developing an open source software library for big data to help overcome the challenges of effectively scaling to modern data volumes and characteristics. The program develops tools and techniques to process and analyze large sets of imperfect, incomplete data. Its programs and publications focus on analytics, visualization, and infrastructure to efficiently fuse, analyze, and disseminate these large volumes of data.
Comparing Supervised Learning Algorithms
In the data science course that I instruct, we cover most of the data science pipeline but focus especially on machine learning. Besides teaching model evaluation procedures and metrics, we obviously teach the algorithms themselves, primarily for supervised learning. Near the end of this 11-week course, we spend a few hours reviewing the material that has been covered throughout the course, with the hope that students will start to construct mental connections between all of the different things they have learned. One of the skills that I want students to take away from this course is the ability to intelligently choose between supervised learning algorithms when working on a machine learning problem. Although there is some value in the “brute force” approach (try everything and see what works best), there is far more value in understanding the trade-offs you’re making when choosing one algorithm over another. I decided to create a game for the students, in which I gave them a blank table listing the supervised learning algorithms we covered and asked them to compare the algorithms across a dozen different dimensions. I couldn’t find a table like this on the Internet, so I decided to construct one myself! Here’s what I came up with:
Combinatorial optimization, deterministic and stochastic approaches
What is the minimum cost required to connect all neighborhoods to electricity?
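One natural reading of that question is as a minimum spanning tree: neighborhoods are vertices, candidate power lines are weighted edges, and the cheapest network that connects everything is the MST. A hypothetical sketch with the igraph R package (the neighborhoods and edge costs below are invented):

library(igraph)
# candidate lines between neighborhoods, with construction costs
lines <- data.frame(from = c("A", "A", "B", "B", "C"),
                    to   = c("B", "C", "C", "D", "D"),
                    cost = c(4, 2, 1, 5, 8))
g    <- graph_from_data_frame(lines, directed = FALSE)
tree <- mst(g, weights = E(g)$cost)   # deterministic minimum spanning tree
sum(E(tree)$cost)                     # minimum total cost to connect all neighborhoods

The deterministic MST is only the baseline answer; the post goes on to the stochastic approaches named in its title.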
Streaming Big Data: Storm, Spark and Samza
There are a number of distributed computation systems that can process Big Data in real time or near-real time. This article will start with a short description of three Apache frameworks, and attempt to provide a quick, high-level overview of some of their similarities and differences.
Ireland calls for a ‘Magna Carta’ for data ethics in Europe
As the war over trust and privacy online intensifies, Irish data scientists will next week call on EU chiefs to create a ‘Magna Carta’ on data ethics for Europe. In Brussels next Wednesday an Irish delegation from the Government-supported Insight Centre will deliver a proposal to create a ‘Magna Carta for Data’ to protect individual privacy and support EU-wide data innovation. The document is designed to contribute to the policy discussion around data ethics, ownership and use in Europe. The EU is currently developing its Data Protection Regulation and Directive.
R: Aggregation
Aggregation splits data into subsets, computes summary statistics on each subset, and reports the results in a conveniently summarized form. The aggregate function is one of the most capable functions in the scidb package. The package overloads R’s standard aggregate function for SciDB arrays, using reasonably standard R syntax to cover most SciDB aggregation operators including aggregate, window, and variable_window. (The regrid and cumulate functions separately implement additional SciDB aggregation operators.) The aggregate function extends the default capabilities of many SciDB aggregation operators to allow grouping by SciDB array dimensions, attributes, other SciDB arrays, and combinations of all three.
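As a point of reference, the standard R syntax that the package mirrors looks like this on an ordinary data frame (base R on toy data, not a SciDB array, so this only illustrates the calling convention):

# mean of value within each group
df <- data.frame(group = c("a", "a", "b", "b", "b"),
                 value = c(1, 2, 3, 4, 5))
aggregate(value ~ group, data = df, FUN = mean)
#   group value
# 1     a   1.5
# 2     b   4.0

With a SciDB array in place of the data frame, the overloaded aggregate accepts the richer grouping terms described above.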
reshape: from long to wide format
This continues the topic of using the melt/cast functions in reshape to convert a data frame between long and wide formats. Here is an example I found helpful for generating the covariate table required for a PEER (or Matrix_eQTL) analysis:
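A small, hypothetical illustration of the long-to-wide step (invented column names, not the actual eQTL covariates), using cast from reshape (reshape2 users would call dcast instead):

library(reshape)
# long format: one row per (sample, covariate) pair
long <- data.frame(sample    = rep(c("s1", "s2", "s3"), each = 2),
                   covariate = rep(c("age", "batch"), times = 3),
                   value     = c(34, 1, 29, 2, 41, 1))
# wide format: one row per covariate, one column per sample
cast(long, covariate ~ sample)
#   covariate s1 s2 s3
# 1       age 34 29 41
# 2     batch  1  2  1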
Does Balancing Classes Improve Classifier Performance?
It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer one. But I have always been skeptical of the claim that artificially balancing the classes (through resampling, for instance) always helps, when the model is to be run on a population with the native class prevalences.
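A hedged sketch of the kind of comparison at issue (not the post's actual code or data): fit the same model on the native prevalence and on an artificially balanced resample, then look at both on a test set that keeps the native prevalence.

set.seed(1)
n <- 10000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-3 + x))              # rare positive class (~5%)
train <- data.frame(x = x[1:5000], y = y[1:5000])
test  <- data.frame(x = x[5001:n], y = y[5001:n])

# balanced training set: keep all positives, downsample the negatives
pos <- train[train$y == 1, ]
neg <- train[train$y == 0, ]
balanced <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])

m_native   <- glm(y ~ x, data = train,    family = binomial)
m_balanced <- glm(y ~ x, data = balanced, family = binomial)

# the balanced model's predicted probabilities are shifted upward relative
# to the true prevalence in the test population
summary(predict(m_native,   test, type = "response"))
summary(predict(m_balanced, test, type = "response"))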
John Snow, and OpenStreetMap
While preparing a training session on data visualization, I wanted to get a nice visual of John Snow’s cholera dataset. This dataset can actually be found in a great package of famous historical datasets.
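A minimal sketch, assuming the package in question is HistData (which ships Snow.deaths and Snow.pumps) and plotting with base graphics rather than the post's OpenStreetMap background:

library(HistData)
data(Snow.deaths)   # one point per cholera death
data(Snow.pumps)    # water pump locations
plot(y ~ x, data = Snow.deaths, pch = 16, cex = 0.5, col = "grey40",
     asp = 1, xlab = "", ylab = "", main = "Snow's 1854 cholera deaths")
points(y ~ x, data = Snow.pumps, pch = 17, col = "red", cex = 1.2)
legend("topleft", pch = c(16, 17), col = c("grey40", "red"),
       legend = c("death", "pump"))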
John Snow, and Google Maps
In my previous post, I discussed how to use OpenStreetMap (and R’s standard plotting functions) to visualize John Snow’s dataset. But it is also possible to use Google Maps (and ggplot2-style graphics).
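A hedged sketch of the ggmap route (deaths_lonlat is a hypothetical data frame with lon/lat columns; the HistData coordinates sit on a local grid and must be converted to longitude/latitude first, and Google basemaps now require an API key via register_google):

library(ggmap)
# fetch a Google roadmap tile around Broad Street, Soho (approximate coordinates)
soho <- get_map(location = c(lon = -0.136, lat = 51.513),
                zoom = 16, maptype = "roadmap", source = "google")
ggmap(soho) +
  geom_point(aes(x = lon, y = lat), data = deaths_lonlat,
             colour = "red", alpha = 0.5, size = 1)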