Transforming Text and Data Into a True Knowledge Base
Extensive Google searches to locate current statistics on the size and growth rates of big data failed me today. Then I realized why: growth rates are so large and so dynamic that few, if any, are attempting to predict how much data is really out there. In our multichannel world, there is simply too much data to digest. The variety of data, the growth of unstructured data, and the challenge of deciphering it all prevent the transformation of noise into meaning, and no industry is immune from this monumental shift in data complexity. There are some rays of hope. Newer sources of data, including Linked Open Data (LOD), are available. It is free of charge, used by few, understood by fewer, and powerful enough to distinguish you from the pack. The growth in unstructured data is running at breakneck speed and organizations are scrambling to keep up, but with the proper technology some are succeeding. These forward-thinking organizations have found a way to transform, connect, organize, query and analyze information to achieve big data enlightenment. They have 'knowledge bases' of integrated data, which provide a roadmap to discovery, decision support, better research, improved customer service, personalized patient care, higher advertising conversion rates, customer retention and more.

If we relate it to the human mind, each of us has a knowledge base developed from experience, study, training, interactions, relationships and events. We use it every day to make decisions. And our brain, the engine behind that knowledge, has the ability to reason, extend knowledge, learn and draw conclusions. Imagine if your business leveraged a knowledge base replete with all of its dark data. Imagine the types of questions you could answer given instant recall, powerful reasoning and billions of related facts.

Classifying text with bag-of-words: a tutorial
There is a Kaggle training competition where you attempt to classify text, specifically movie reviews. There is no other data, which makes this a perfect opportunity for some experiments with text classification. Kaggle has a tutorial for this contest that takes you through the popular bag-of-words approach and a take on word2vec. The tutorial hardly represents best practices, most probably so that competitors can improve on it easily. And that's what we'll do.
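The tutorial's own code isn't reproduced here, but the following is a minimal sketch of a bag-of-words baseline in scikit-learn. The file and column names (labeledTrainData.tsv, review, sentiment) follow the Kaggle competition data, and the logistic regression classifier is just one reasonable choice for count features, not necessarily what the post settles on.

```python
# Minimal bag-of-words baseline for the movie-review data (sketch).
# Assumes a tab-separated file with a 'review' text column and a binary
# 'sentiment' label, as in the Kaggle training set.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

train = pd.read_csv("labeledTrainData.tsv", sep="\t", quoting=3)

# Bag of words: each review becomes a sparse vector of token counts.
model = make_pipeline(
    CountVectorizer(max_features=5000, stop_words="english"),
    LogisticRegression(max_iter=1000),
)

# Cross-validated check instead of a single train/test split.
scores = cross_val_score(model, train["review"], train["sentiment"],
                         cv=5, scoring="roc_auc")
print("mean AUC: %.3f" % scores.mean())
```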

Tuning the parameters of your Random Forest model
Why tune machine learning algorithms?
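The article's own code isn't included here; as an illustration of the idea, the sketch below grid-searches a few of the random forest hyperparameters that typically matter most (n_estimators, max_features, min_samples_leaf) on synthetic data.

```python
# Illustrative sketch (not from the article): grid-searching a few of the
# random forest hyperparameters that usually have the most impact.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [100, 300],        # more trees: better but slower
    "max_features": ["sqrt", 0.5],     # features considered at each split
    "min_samples_leaf": [1, 5, 25],    # larger leaves regularize the trees
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```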

Using pandas and scikit-learn for classification tasks
pandas_sklearn_rendered.ipynb
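The notebook itself is linked above; as a rough sketch of the general pattern it covers, with a hypothetical data.csv file and label column standing in for the real dataset: pandas handles loading and cleaning, scikit-learn handles the modelling.

```python
# Rough sketch of the pandas -> scikit-learn workflow (hypothetical
# file name 'data.csv' and target column 'label').
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")

# pandas side: basic cleaning and feature/target split.
df = df.dropna()
X = pd.get_dummies(df.drop(columns="label"))  # one-hot encode categoricals
y = df["label"]

# scikit-learn side: fit and evaluate a classifier.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```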

State of Hyperparameter Selection
Historically, hyperparameter determination has been a woefully neglected aspect of machine learning. With the rise of neural nets, which require more hyperparameters, tuned more precisely, than many other models, there has been a recent surge of interest in intelligent methods for selection; however, the average practitioner still seems to rely on default hyperparameters, grid search, random search, or (believe it or not) manual search. For readers who don't know, hyperparameter selection boils down to a conceptually simple problem: you have a set of variables (your hyperparameters) and an objective function (a measure of how good your model is). As you add hyperparameters, the search space of this problem explodes.
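To make that contrast concrete: grid search enumerates a fixed lattice of values, while random search samples the space, which holds up better as hyperparameters are added. A small illustrative sketch (not drawn from any particular post) using scikit-learn's RandomizedSearchCV:

```python
# Illustrative sketch: random search samples the hyperparameter space instead
# of enumerating a grid, which copes better as the search space explodes.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

param_distributions = {
    "n_estimators": randint(50, 500),
    "learning_rate": uniform(0.01, 0.3),   # samples from [0.01, 0.31)
    "max_depth": randint(2, 8),
    "subsample": uniform(0.5, 0.5),        # samples from [0.5, 1.0)
}

search = RandomizedSearchCV(GradientBoostingClassifier(random_state=0),
                            param_distributions, n_iter=30, cv=3,
                            scoring="roc_auc", random_state=0, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```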

Great list of resources: data science, visualization, machine learning, big data
Fantastic resource created by Andrea Motosi. I've only included the five categories most relevant to our audience, though the full list has 31 categories in total, including a few on distributed systems and Hadoop.

How Much Did It Rain? Winner’s Interview: 2nd place, No Rain No Gain
Kagglers Sudalai (aka SRK) and Marios (aka Kazanova) came together to form team 'No Rain No Gain!' and take second place in the How Much Did It Rain? competition. Sudalai had two goals in competing: to earn a Master's badge and to finish in the top 100. In the blog below, Sudalai shares how he managed to accomplish both (and make a new friend) by being part of a great team.

advanced.procD.lm for pairwise tests and model comparisons
In geomorph 2.1.5, we decided to deprecate the functions pairwiseD.test and pairwise.slope.test. Our reasons were two-fold. First, recent updates rendered these functions unstable: they depended on the model.frame family of base functions, which had been updated. We tried to find solutions, but the updated pairwise functions were worse than non-functioning ones, as they sometimes provided incorrect results (owing to strange sorting of rows in design matrices). We realized that a complete overhaul of these functions would be required if we wanted to maintain them. Second, because advanced.procD.lm was already capable of pairwise tests and did not suffer from the same issues, we realized we did not have to update the other functions, but could instead help users understand how to use advanced.procD.lm. Basically, this blog post is a much better use of time than trying again and again to fix broken functions. Before reading on, if you have not already read the blog post on ANOVA in geomorph, it would probably be worth your time to read that post first.

List of user-installed R packages and their versions
This R command lists all the packages installed by the user (ignoring packages that come with R such as base and foreign) and the package versions.

Mortgages Are About Math: Open-Source Loan-Level Analysis of Fannie and Freddie
The so-called government-sponsored enterprises went through a nearly $200 billion government bailout during the financial crisis, motivated in large part by losses on loans that they guaranteed, so I figured there must be something interesting in the loan-level data. I decided to dig in with some geographic analysis, an attempt to identify the loan-level characteristics most predictive of default rates, and more. As part of my efforts, I wrote code to transform the raw data into a more useful PostgreSQL database format, and some R scripts for analysis. The code for processing and analyzing the data is all available on GitHub.