Tutorial on scikit-learn and IPython for parallel machine learning
Scope of this tutorial:
• Learn common machine learning concepts and how they match the scikit-learn Estimator API.
• Learn about scalable feature extraction for text classification and clustering.
• Learn how to perform cross-validation and hyperparameter grid search in parallel with IPython.
• Learn to analyze the kinds of common errors predictive models are subject to and how to refine your modeling to take this analysis into account.
• Learn to optimize memory allocation on your computing nodes with NumPy memory mapping features.
• Learn how to run a cheap IPython cluster for interactive predictive modeling on Amazon EC2 spot instances using StarCluster.
Association Rules and Market Basket Analysis with R
In today’s data-oriented world, just about every retailer has amassed a huge database of purchase transactions. Each transaction consists of a number of products that have been purchased together. A natural question that you could answer from this database is: What products are typically purchased together? This is called Market Basket Analysis (or Affinity Analysis). A closely related question is: Can we find relationships between certain products that indicate the purchase of other products? For example, if someone purchases avocados and salsa, it’s likely they’ll purchase tortilla chips and limes as well. This is called association rule learning, a data mining technique used by retailers to improve product placement, marketing, and new product development.
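To make the avocados-and-salsa example concrete, here is a minimal sketch of how such a rule could be scored. It is written in plain Python rather than R, the transactions are made up for illustration, and the helper names (support, confidence, lift) are hypothetical. They compute the standard measures usually reported for association rules, not the code from the post itself.

```python
# Hypothetical toy transactions, invented for illustration only.
transactions = [
    {"avocados", "salsa", "tortilla chips", "limes"},
    {"avocados", "salsa", "tortilla chips"},
    {"avocados", "salsa"},
    {"salsa", "tortilla chips"},
    {"bread", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    antecedent = frozenset(antecedent)
    antecedent_support = support(antecedent, transactions)
    if antecedent_support == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = antecedent | frozenset(consequent)
    return support(both, transactions) / antecedent_support


def lift(antecedent, consequent, transactions):
    """Confidence divided by the baseline frequency of the consequent."""
    consequent_support = support(consequent, transactions)
    if consequent_support == 0:
        raise ValueError("consequent never occurs in the transactions")
    return confidence(antecedent, consequent, transactions) / consequent_support


# Score the rule {avocados, salsa} -> {tortilla chips}.
lhs = {"avocados", "salsa"}
rhs = {"tortilla chips"}
rule_support = support(lhs | rhs, transactions)    # 2 of 5 baskets = 0.4
rule_confidence = confidence(lhs, rhs, transactions)  # 0.4 / 0.6, about 0.67
rule_lift = lift(lhs, rhs, transactions)           # about 1.11

print(f"support={rule_support:.2f} "
      f"confidence={rule_confidence:.2f} lift={rule_lift:.2f}")
```

A lift above 1 suggests the products on the right-hand side show up more often alongside the left-hand side than they do in general. On a real transaction database, an algorithm such as Apriori would search for high-scoring rules instead of checking a single hand-picked one.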
From time to time, I discover some of my experiments translated into Shiny Apps, like this one. Some days ago, I discovered one of these translations and contacted the author, Vu Anh, from Vietnam. I asked him to build a Shiny App from this experiment, and Vu was enthusiastic about the idea. We defined some parameters to play with the shape, number, width and alpha of the lines, as well as the background color, and I received a perfect release of the application in just a few hours.
Mixing Numbers and Symbols in Time Series Charts
One of the things I’ve been trying to explore with my #f1datajunkie projects is ways of representing information that both work at a glance and repay deeper reading. I’ve also been looking at various ways of using text labels rather than markers to provide additional information around particular data points. For example, in a race battlemap, with lap number on the horizontal x-axis and gap time on the vertical y-axis, I use a text label to indicate which driver is ahead of (or behind) a particular target driver.
Plot.ly: Six Ways You Can Make Beautiful Graphs (Like Your Favorite Journalists)
This post shows how to make graphs like those of The Economist, New York Times, Vox, 538, Pew, and Quartz. You can also share and embed your beautiful, interactive graphs in apps, blog posts, and websites. Read on to learn how.