Confounding
‘Correlation is not causation’ is one of the most important lessons you should take from this or any other data analysis course. A common reason this statement so often holds is confounding. Simply stated, confounding occurs when we observe a correlation or association between X and Y, but this is strictly the result of both X and Y depending on an extraneous variable Z. Here we describe Simpson’s paradox, perhaps the most famous case of confounding, and then show an example of confounding in high-throughput biology.
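A minimal sketch of the definition above, not taken from the original post: X and Y are simulated so that each depends only on a shared variable Z plus independent noise. The coefficients, sample size, and helper names (`pearson`, `residuals`) are illustrative assumptions. The raw correlation between X and Y is clearly non-zero, but it roughly vanishes once the linear effect of Z is removed from both.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, zs):
    """Residuals from a least-squares regression of ys on zs."""
    n = len(ys)
    mz = sum(zs) / n
    my = sum(ys) / n
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - slope * (z - mz) for y, z in zip(ys, zs)]


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 5000

# Z is the confounder; X and Y each depend on Z but not on each other.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

raw_corr = pearson(x, y)  # driven entirely by the shared dependence on Z
partial_corr = pearson(residuals(x, z), residuals(y, z))  # Z adjusted out

print(f"corr(X, Y)             = {raw_corr:.3f}")
print(f"corr(X, Y | Z removed) = {partial_corr:.3f}")
```

Under these assumptions the theoretical raw correlation is 0.5, while the correlation after adjusting for Z is zero up to sampling noise.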
HANA, Hadoop Help SAP Connect IoT And Big Data
Engineers at Sapphire Now discussed the potential and limitations for SAP S/4HANA as a platform for managing an enterprise Internet of Things and the big data that strategy requires.
Eliminating data pipeline glue code with GraphLab Create’s SDK
In this post, I demonstrate how to use Dato’s open source code to quickly unify an awesome C++ machine learning library with SFrame’s scalable feature engineering capabilities. The result exposes a Kaggle-winning, open source C++ machine learning library as a Python package so that data scientists can quickly apply it to their data. It also provides seamless out-of-core feature engineering with Dato’s GraphLab Create, thereby avoiding data pipeline headaches.
First Steps with Structural Equation Modeling
Last Friday at the Davis R Users’ Group, Grace Charles gave a presentation on structural equation modeling in R using the lavaan package. Here’s the video and her slides. We’ve also posted Grace’s script from the presentation as a gist here. More resources that Grace mentioned in her talk are listed below.
Clusters Powerful Enough to Generate Their Own Subspaces
Clusters are groupings that have no external label. We start with entities described by a set of measurements but no rule for sorting them by type. Mixture modeling makes this point explicit with its equation showing how each measurement is an independent draw from one of K possible distributions.
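The generative view described above can be sketched as follows. This is an illustration rather than the post's own code: it assumes Gaussian components, and the weights, means, standard deviations, and the function name `sample_mixture` are all hypothetical. Each measurement is produced by first picking one of K components at random and then drawing from that component's distribution.

```python
import random


def sample_mixture(n, weights, means, sds, rng):
    """Draw n independent (component, value) pairs from a Gaussian mixture."""
    components = range(len(weights))
    draws = []
    for _ in range(n):
        # Pick the component k with probability weights[k] ...
        k = rng.choices(components, weights=weights)[0]
        # ... then draw the measurement from that component's distribution.
        draws.append((k, rng.gauss(means[k], sds[k])))
    return draws


rng = random.Random(42)  # fixed seed so the run is reproducible

# Hypothetical K = 3 mixture; these parameters are made up for illustration.
weights = [0.5, 0.3, 0.2]
means = [-4.0, 0.0, 5.0]
sds = [1.0, 0.5, 1.5]

draws = sample_mixture(10_000, weights, means, sds, rng)

# Count how many draws came from each component.
counts = [sum(1 for k, _ in draws if k == j) for j in range(len(weights))]
for j, c in enumerate(counts):
    print(f"component {j}: {c / len(draws):.3f} of draws (weight {weights[j]})")
```

In a clustering problem only the values are observed; the component index `k` is the missing label that the mixture model has to infer.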
Teaching R course? Use analogsea to run your customized RStudio in Digital Ocean!
In this post I will show how a few lines of R code can start a customized RStudio docklet in a cloud and email login credentials to course participants. The participants therefore do not need to install R and the required packages. Moreover, it is guaranteed that they all run exactly the same software. All they need is a decent web browser to access RStudio Server.
Benchmarking Random Forest Implementations
I currently need machine learning tools that can handle on the order of 10 million observations in the context of binary classification. Data of that size takes up a few GB and nowadays fits comfortably in the RAM of a decent single machine. It is a trivial task for linear models, as there are plenty of open source tools that can train a logistic regression with this amount of data on a single machine in a few seconds, even while using only one processor core (many of these tools are single-threaded). Linear models are also the gold standard of large-scale machine learning that can run on clusters, processing very large distributed datasets.
Self-learning Machines & Deep Convolutional Neural Networks Classify Scenes & Identify Objects
Recent research using deep convolutional neural networks and new system architectures has demonstrated the ability of smart machines to autonomously learn to classify image scenes and identify objects.
An Introduction to Deep Learning and its role for IoT/future cities
This article is part of an evolving theme. Here, I explain the basics of Deep Learning and how Deep Learning algorithms could apply to IoT and Smart City domains. Specifically, as I discuss below, I am interested in complementing Deep Learning algorithms using IoT datasets. I elaborate on these ideas in the Data Science for Internet of Things program, which enables you to work towards being a Data Scientist for the Internet of Things (modelled on the course I teach at Oxford University and UPM – Madrid). I will also present these ideas at the International Conference on City Sciences at Tongji University in Shanghai and the Data Science for IoT workshop at the Iotworld event in San Francisco.