The Awesome Parrondo’s Paradox
“A technique succeeds in mathematical physics, not by a clever trick, or a happy accident, but because it expresses some aspect of physical truth”

Introducing V8: An Embedded JavaScript Engine for R
JavaScript is a fantastic language for building applications. It runs on browsers, servers and databases, making it possible to design an entire web stack in a single language. The OpenCPU JavaScript client already allows for calling R functions from JavaScript (see jsfiddles and apps). With the new V8 package we can now do the reverse as well: run JavaScript inside R!

When the Art Is Watching You
Museums are mining detailed information from visitors, raising questions about the use of Big Data in the arts

Get Projects to Done with Monte Carlo Simulation and More…
Losing sleep over a delivery date? If you want to better understand the impact of risk and uncertainty in your projects and quantify progress to plan, check out the latest VersionOne® Analytics enhancements to Monte Carlo Simulation, Project Burndown Forecast, Epic Progress, and Epic Burn Up Charts. VersionOne is the first agile lifecycle management platform to offer advanced project forecasting through Monte Carlo Simulation.

Peter Norvig’s Spell Checker in Two Lines of Base R
Peter Norvig, the director of research at Google, wrote a nice essay on How to Write a Spelling Corrector a couple of years ago. That essay explains and implements a simple but effective spelling correction function in just 21 lines of Python. Highly recommended reading! I was wondering how many lines it would take to write something similar in base R. Turns out you can do it in (at least) two pretty obfuscated lines:

Open Source Tools for Machine Learning
Open source machine learning software makes it easier to implement machine learning solutions on single computers and at scale, and the diversity of packages provides more options for implementers.

Sketching Scatterplots to Demonstrate Different Correlations
Looking just now for an openly licensed graphic showing a set of scatterplots that demonstrate different correlations between X and Y values, I couldn’t find one.

Preferring a preference index
I’ve been reading about preference indexes lately, specifically for characterising pollinator preferences for plants, so here is what I learnt. Preference is defined as using an item (e.g. plant) more than expected given the item abundance.
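As a rough illustration of "used more than expected given abundance", the sketch below computes a simple use-to-availability ratio (often called a forage ratio). This is only one of several possible preference indexes and is not necessarily the one the post settles on; the function name and the visit and abundance counts are hypothetical.

```python
def forage_ratios(visits, abundance):
    """Return, per item, the share of use divided by the share of availability.

    A ratio above 1 suggests the item is used more than its abundance
    alone would predict; a ratio below 1 suggests it is under-used.
    """
    total_visits = sum(visits.values())
    total_abundance = sum(abundance.values())
    ratios = {}
    for item, count in abundance.items():
        used = visits.get(item, 0) / total_visits   # proportion of all visits
        available = count / total_abundance         # proportion of all plants
        ratios[item] = used / available
    return ratios


# Hypothetical pollinator visits and plant counts (illustration only).
visits = {"clover": 60, "thistle": 30, "daisy": 10}
abundance = {"clover": 40, "thistle": 20, "daisy": 40}

ratios = forage_ratios(visits, abundance)
for plant, ratio in ratios.items():
    print(f"{plant}: {ratio:.2f}")
```

With these made-up counts, clover and thistle come out above 1 (visited more than their abundance predicts) and daisy well below 1.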

Best solution to a problem: data science versus statistical paradigm
The definition of ‘best’ depends on which school you follow. Data science and classic statistical science are at the opposite ends of the spectrum. So let’s clarify what ‘best solution’ means in these two opposite contexts:

Using convolutional neural nets to detect facial keypoints tutorial
This is a hands-on tutorial on deep learning. Step by step, we’ll go about building a solution for the Facial Keypoint Detection Kaggle challenge. The tutorial introduces Lasagne, a new library for building neural networks with Python and Theano. We’ll use Lasagne to implement a couple of network architectures, talk about data augmentation, dropout, the importance of momentum, and pre-training. Some of these methods will help us improve our results quite a bit.

Reducing your R memory footprint by 7000x
R is notoriously a memory-heavy language. I don’t necessarily think this is a bad thing: R wasn’t built to be super performant, it was built for analyzing data! That said, some implementation patterns are quite…redundant. As an example, I’m going to show you how you can prune a 330 MB glm to 45 KB without losing significant functionality.

Sentiment analysis on web scraped data with kimono and MonkeyLearn
New tools have enabled businesses of all sizes to understand how their customers are reacting to them – do customers like the location, hate the menu, would they come back? This increased volume of data is incredibly valuable but larger than any mere mortal can assess, understand and turn into action. Several technologies have emerged to help businesses unlock the meaning behind this data.
This blog looks at how Kimono, which scrapes and structures data at scale, and MonkeyLearn, which provides machine learning capabilities, can be used together to translate data into insight.

Big Data Analytics: Time For New Tools
So you’re considering Hadoop as a big data platform. You’ll probably need some new analytics and business intelligence tools if you’re going to wring fresh insights out of your data.

The Geometry of Classifiers
As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado et al., “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, mostly from the UCI Machine Learning Repository. For fun, we decided to do a follow-up study, using their data and several classifier implementations from scikit-learn, the Python machine learning library. We were interested not just in classifier accuracy, but also in seeing if there is a “geometry” of classifiers: which classifiers produce prediction patterns that look similar to each other, and which classifiers produce predictions that are quite different? To examine these questions, we put together a Shiny app to interactively explore how the relative behavior of classifiers changes for different types of data sets.
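One simple way to think about a "geometry" of classifiers is to treat each classifier's predictions on a shared test set as a vector and measure how often two classifiers disagree. The sketch below does this with a plain disagreement rate; it is an assumption for illustration, not the distance the follow-up study actually used, and the classifier names and predictions are invented.

```python
from itertools import combinations


def disagreement(a, b):
    """Fraction of test cases on which two prediction vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical predicted labels from three classifiers on the same six cases.
predictions = {
    "random_forest":     [1, 0, 1, 1, 0, 1],
    "gradient_boosting": [1, 0, 1, 1, 0, 0],
    "naive_bayes":       [0, 1, 1, 0, 0, 1],
}

# Pairwise distances: small values mean similar prediction patterns.
distances = {
    (m, n): disagreement(predictions[m], predictions[n])
    for m, n in combinations(predictions, 2)
}

for (m, n), d in distances.items():
    print(f"{m} vs {n}: {d:.2f}")
```

A matrix of such distances could then be fed to clustering or multidimensional scaling to see which classifiers sit close together.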

htmlwidgets: JavaScript data visualization for R
Today we’re excited to announce htmlwidgets, a new framework that brings the best of JavaScript data visualization libraries to R. There are already several packages that take advantage of the framework (leaflet, dygraphs, networkD3, DataTables, and rthreejs) with hopefully many more to come.

10 Ways Big Data Is Revolutionizing Manufacturing
McKinsey & Company recently published How Big Data Can Improve Manufacturing which provides insightful analysis of how big data and advanced analytics can streamline biopharmaceutical, chemical and discrete manufacturing.

Data Savvy Business Roles for Better Analytics Outcomes
To put data and analytics to work usually requires multiple roles, not just one, to achieve ongoing value and to align with company functions. No one person can provide all the skills and expertise that are needed. A variety of viewpoints, knowledge and experience need to come into play to get the most from processing and analyzing data – and then to correlate analysis with other information sources and knowledge – to derive the best overall insight and recommendations.

Curling – exploring web request options
Underlying almost all of our packages are requests to web resources served over the HTTP protocol via curl. curl is a command line tool and library for transferring data with URL syntax, supporting lots of protocols. curl has many options that you may not know about.

Ask a Data Scientist: Ensemble Methods
What are ensemble methods? How do you use them?

My Commonly Done ggplot2 graphs: Part 2
In my last post I described some of my commonly done ggplot2 graphs. It seems as though some people are interested in these, so I was going to follow this up with other plots I make frequently.

Functional Principal Component Analysis
In mathematics, a general principle is to move from studying an object itself to studying the relationships between objects. In functional data analysis, the most important tool for studying the object itself, i.e. a single functional data set, is functional principal component analysis (FPCA). For studying the relationship between two functional data sets, one popular approach is some form of regression analysis. In this post, I focus only on FPCA.

Subjective Ways of Cutting a Continuous Variable
You have probably seen @coulmont’s maps. If you haven’t, you should probably go and spend some time on his blog (but please, come back afterwards, I still have my story to tell you). Consider, for instance, the maps we obtained for a post published in Monkey Cage, a few months ago:

A small introduction to the ROCR package
I’ve been doing some classification with logistic regression in brain imaging recently. I have been using the ROCR package, which is helpful at estimating performance measures and plotting these measures over a range of cutoffs.
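To make "performance measures over a range of cutoffs" concrete, here is a minimal sketch in plain Python (not the ROCR API, which is an R package) that computes the true positive rate and false positive rate at several cutoffs. The function name, scores and labels are hypothetical.

```python
def roc_points(scores, labels, cutoffs):
    """Return (cutoff, true positive rate, false positive rate) per cutoff.

    A case is predicted positive when its score is >= the cutoff.
    Labels are 1 for positive cases and 0 for negative cases.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for cutoff in cutoffs:
        predicted = [score >= cutoff for score in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((cutoff, tp / positives, fp / negatives))
    return points


# Hypothetical predicted probabilities from a logistic regression.
scores = [0.9, 0.8, 0.55, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]

points = roc_points(scores, labels, [0.0, 0.5, 1.0])
for cutoff, tpr, fpr in points:
    print(f"cutoff={cutoff:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Plotting TPR against FPR across a finer grid of cutoffs gives the ROC curve.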

The data science project lifecycle
This post looks at practical aspects of implementing data science projects. It also assumes a certain level of maturity in big data (more on big data maturity models in the next post) and data science management within the organization. Therefore the life cycle presented here differs, sometimes significantly, from purist definitions of ‘science’ which emphasize the hypothesis-testing approach. In practice, the typical data science project life-cycle resembles more of an engineering view imposed due to constraints of resources (budget, data and skills availability) and time-to-market considerations.

Pulling Insights from Unstructured Data – Nine Key Steps
In the era of “Big Data”, companies are flooded with information from a variety of sources. Most of this information is structured, meaning it can be easily categorized, sorted, and filtered. However, significant insights can be found in what’s known as “unstructured data”, for example by reviewing content within social media posts, mobile devices, even customer phone calls.
1. Narrow down the data.
2. Consider the intended result.
3. Pick the stack.
4. Throw it in a lake.
5. Do some cleaning.
6. Pull the useful stuff.
7. Build the ontology.
8. Modeling and execution.
9. Take action based on the results.