An Interaction or Not? How a few ML Models Generalize to New Data
This post examines how a few statistical and machine learning models respond to a simple toy example in which they’re asked to make predictions in new regions of feature space. The key question the models answer differently is whether there’s an “interaction” between two features: does the influence of one feature differ depending on the value of the other? In this case, the data won’t provide information about whether there’s an interaction or not. Interactions are often real and important, but in many contexts we treat interaction effects as likely to be small unless there’s evidence otherwise. I’ll walk through why decision trees and bagged ensembles of decision trees (random forests) can make the opposite assumption: they can strongly prefer an interaction, even when the evidence is equally consistent with including or excluding one.
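The core idea can be sketched with a made-up toy dataset (the numbers below are illustrative, not from the post): two binary features where the combination (x1=1, x2=1) is never observed in training, so the data cannot distinguish an interaction from no interaction.

```python
# Training observations: mean outcome for three of the four feature
# combinations; (1, 1) is the unseen region we must extrapolate to.
train = {(0, 0): 10.0, (0, 1): 15.0, (1, 0): 20.0}

# An additive (no-interaction) model extrapolates by stacking the two
# main effects on top of the baseline.
additive_pred = (train[(0, 0)]
                 + (train[(1, 0)] - train[(0, 0)])   # effect of x1
                 + (train[(0, 1)] - train[(0, 0)]))  # effect of x2

# A decision tree that splits on x1 first lumps the unseen point
# (1, 1) in with (1, 0), implicitly assuming x2 stops mattering once
# x1 = 1 -- a strong interaction.
tree_pred = train[(1, 0)]

print(additive_pred)  # 25.0
print(tree_pred)      # 20.0
```

Both predictions are equally consistent with the training data; the models simply encode different default assumptions about the unseen region.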
Why are YOU learning to code?
Top 50 Data Science Resources: The Best Blogs, Forums, Videos and Tutorials to Learn All about Data Science
The field of data science is constantly evolving and ever-advancing, with new technologies placing more valuable insights in the hands of modern enterprises. More data-driven organizations are hiring data scientists to drive their efforts to gather, analyze, and make use of Big Data in valuable ways. Because the field of data science is so broad and sometimes challenging to navigate, we’ve compiled a list of 50 of the most helpful data science resources on the web. Whether you’re a student or new professional working in the field of data science, these resources are valuable for discovering the latest employment opportunities, finding tutorials for the processes and systems you’re using on a daily basis, learning hacks and tricks to boost your performance, and connecting with other professionals in your field. Note: The following 50 resources are not ranked or rated in order of importance or value; rather, they are categorized to make it easy for you to locate the resources you need most. Click through to a specific category using the links in the Table of Contents below.
Analytical Zen: Keeping it Simple yet Powerful
What makes one visualisation more impactful than another? Why do some chart types struggle to tell a story? In this session we explore how the Zen of Analysis is affected by the types of visualisation we use, and the ways in which we can harness human perception to better tell our data story. We explore the design principles that any analyst should be aware of when designing worksheets and dashboards, and tie the concepts of visual best practice to issues of speed and performance.
Hadoop and the Open Data Platform
Pivotal, IBM and Hortonworks recently announced the “Open Data Platform” (ODP) – an attempt to standardize Hadoop. The move also appears to be backed by Teradata and others listed as sponsors on the initiative’s site. It has a lot of potential and a few possible downsides. ODP promises standardization, though Cloudera’s Mike Olson downplays its importance: “Every vendor shipping a Hadoop distribution builds off the Hadoop trunk. The APIs, data formats and semantics of trunk are stable. The project is a decade old, now, and the global Hadoop community exercises its governance obligations responsibly. There’s simply no fundamental incompatibility among the core Hadoop components shipped by the various vendors.”
Wiki PageRank with Hadoop
In this tutorial we are going to compute PageRank for Wikipedia using Hadoop. This was a good hands-on exercise to get started with Hadoop. Page ranking is not a new thing, but it is a suitable use case and way cooler than a word counter! The English Wikipedia has 3.7M articles at the moment and is still growing. Each article has many links to other articles. With those incoming and outgoing links we can determine which pages are more important than others, which is basically what PageRank does.
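Before scaling this out with Hadoop, the underlying computation fits in a few lines. Here is a minimal power-iteration sketch on a hypothetical four-page link graph (the page names and the 0.85 damping factor are illustrative; the tutorial runs the equivalent computation as MapReduce jobs over the Wikipedia link dump):

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
damping = 0.85
n = len(links)
rank = {page: 1.0 / n for page in links}  # uniform starting rank

for _ in range(50):
    # Every page keeps a baseline (1 - d) / n, then receives a damped
    # share of rank from each page that links to it.
    new_rank = {page: (1 - damping) / n for page in links}
    for page, outgoing in links.items():
        share = rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += damping * share
    rank = new_rank

# "C" has the most incoming links, so it ends up with the highest rank.
best = max(rank, key=rank.get)
print(best)  # C
```

In the Hadoop version, each iteration becomes one MapReduce pass: mappers emit each page’s rank share along its outgoing links, and reducers sum the shares arriving at each page.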
Creating an Analytics Ecosystem by integrating ModSpace and RStudio Server Professional
As the importance of using analytics to drive decision making continues to grow at pace, so too does the need to make data science more efficient and “deployable”. This drives many changes to the way in which analytics is performed, including:
• Allowing data scientists to collaborate on analytic projects
• Enabling discovery and re-use of code to avoid duplication of effort
• Enforcing rigour (versioning, audit) without adding admin overhead
• Centralising and standardising analytic code so it can be deployed via applications
Working with customers and the RStudio team, Mango have been tasked with creating data science environments, leading to the development of the Data Science Workbench.
While the Data Science Workbench won’t be officially released for a few months, we couldn’t resist giving everyone a sneak peek at the way in which ModSpace (Mango’s collaborative analytics platform) and RStudio Server Professional can be integrated to create an effective R ecosystem.
101 new external resources and articles about data science, big data, ML – March 2
Starred articles were potential candidates for our picture of the week published in our weekly digest. Enjoy our new selection of articles and resources (R, data science, Python, machine learning etc.) Comments are from Vincent Granville.
Color extraction with R
Given all the attention the internet has paid to the colors of this dress, I thought it would be interesting to look at R’s capabilities for extracting colors. R has a number of packages for importing images in various file formats, including PNG, JPG, TIFF, and BMP. (The readbitmap package works with all of these.) In each case, the result is a 3-dimensional R array containing a 2-D image layer for each of the color channels (for example red, green and blue for color images). You can then manipulate the array as ordinary data to extract color information. For example, Derek Jones used the readPNG function to extract data from published heatmaps when the source data had been lost.
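The key point is that once an image is loaded as an array of channel values, “extracting colors” is just ordinary data manipulation. A language-agnostic sketch of that idea (the tiny 2×2 “image” below is made up for illustration; the post itself works with R arrays from readbitmap):

```python
from collections import Counter

# A 2x2 "image" as a grid of (R, G, B) pixel values.
image = [
    [(255, 255, 255), (0, 0, 255)],   # white, blue
    [(0, 0, 255),     (0, 0, 255)],   # blue,  blue
]

# Flatten the pixel grid and tally each RGB triple to find the
# dominant color.
pixels = [px for row in image for px in row]
dominant, count = Counter(pixels).most_common(1)[0]
print(dominant, count)  # (0, 0, 255) 3
```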
Part 3a: Plotting with ggplot2
We will start off this first section of Part 3 with a brief introduction to the plotting system ggplot2. Then, with attention focused mainly on the syntax, we will create a few graphs based on the weather data we prepared previously. Next, in Part 3b, where we will be doing actual EDA, specific ggplot2 visualisations will be developed to address the following question: are there any good predictors in our data for the occurrence of rain on a given day? Lastly, in Part 4, we will use machine learning algorithms to check to what extent rain can be predicted from the variables available to us. But for now, let’s see what this ggplot2 system is all about and create a few nice-looking figures.
Book review: About Time Series Databases and a New look at Anomaly detection
This blog is a review of two books, both written by Ted Dunning and Ellen Friedman (published by O’Reilly) and available for free from the MapR site: Time Series Databases: New Ways to Store and Access Data and A New Look at Anomaly Detection. The MapR platform is a key part of the Data Science for the Internet of Things (IoT) course – University o… and I shall be covering these issues in my course. In this post, I discuss the significance of time series databases from an IoT perspective, based on my review of these books. Specifically, we discuss classification and anomaly detection, which often go together in typical IoT applications. The books are easy to read, with analogies like HAL (2001: A Space Odyssey), and I recommend them.
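To make “anomaly detection on a time series” concrete, here is a deliberately simple baseline: flag points far from the series mean. This z-score rule is a generic illustration of the problem, not a method from the books, which cover richer, adaptive approaches.

```python
import statistics

# A hypothetical sensor reading series with one obvious spike.
series = [10.1, 9.8, 10.0, 10.2, 9.9, 25.0, 10.1, 10.0]

mean = statistics.mean(series)
stdev = statistics.stdev(series)

# Flag points more than 2 sample standard deviations from the mean.
anomalies = [i for i, x in enumerate(series)
             if abs(x - mean) > 2 * stdev]
print(anomalies)  # [5] -- the spike
```

Real IoT pipelines have to handle drift, seasonality, and streaming arrival, which is exactly where time series databases and the techniques in these books come in.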