Introduction to Pandas
Joris Van den Bossche

A Sankey Plot with Uniform Coloured Edges
Following up on my previous post about generating Sankey plots with the riverplot package: it’s also possible to generate plots whose edges have a constant colour.

The Hype Around Graph Databases And Why It Matters
Organizations are struggling with a fundamental challenge: there’s far more data than they can handle. Sure, there’s a shared vision to analyze structured and unstructured data in support of better decision making, but is this a reality for most companies? The big data tidal wave is transforming the database management industry, employee skill sets, and business strategy as organizations race to unlock meaningful connections between disparate sources of data. Graph databases are rapidly gaining traction in the market as an effective method for deciphering meaning, but many people outside the space are unsure of what exactly this entails. Generally speaking, graph databases store data in a graph structure in which entities are connected to adjacent elements through relationships. The Web is a graph; so are your friend-of-a-friend network and the road network.
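
As a rough illustration of the graph structure described here, the following Python sketch stores entities and their relationships as an adjacency list and answers a friend-of-a-friend query. It is not how any particular graph database is implemented; the `Graph` class, its method names, and the sample people are all hypothetical and chosen only for this example.

```python
from collections import defaultdict


class Graph:
    """Tiny undirected graph stored as an adjacency list (illustrative only)."""

    def __init__(self):
        # Maps each entity to the set of entities it is directly related to.
        self.adjacent = defaultdict(set)

    def relate(self, a, b):
        """Record a symmetric relationship between two entities."""
        self.adjacent[a].add(b)
        self.adjacent[b].add(a)

    def friends_of_friends(self, person):
        """Entities exactly two hops away: friends of friends who are
        neither the person nor one of their direct friends."""
        direct = self.adjacent[person]
        two_hops = set()
        for friend in direct:
            two_hops |= self.adjacent[friend]
        return two_hops - direct - {person}


# Hypothetical sample data: a chain alice - bob - carol - dave.
g = Graph()
g.relate("alice", "bob")
g.relate("bob", "carol")
g.relate("carol", "dave")

fof = g.friends_of_friends("alice")
print(fof)
```

The point of the sketch is that the query follows relationships from one entity to the adjacent ones, rather than joining tables; a real graph database adds persistence, indexing, and a query language on top of this idea.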

Avoiding Extensive Feature Engineering for IoT
IoT holds a lot of promise for making the physical world more responsive to our needs. Building hardware is not as frightening as it once was, now that Intel Edison and other inexpensive SoC products deliver WiFi, Bluetooth, 1GB RAM and Linux for under $100. However, hardware alone is not going to make the promise of IoT come true. Effective application of machine learning is what will turn noisy sensors into ‘delightful data products.’ The challenge we face is how to turn the noisy, unreliable, extremely large and distributed data streams produced by a plethora of sensors (such as biometric sensors, cameras and Lidar) into useful applications.

Exploratory Data Analysis – Kernel Density Estimation and Rug Plots in R
This post follows the recent introduction of the conceptual foundations of kernel density estimation. It uses the ‘Ozone’ data from the built-in ‘airquality’ data set in R and the previously simulated ozone data for the fictitious city of ‘Ozonopolis’ to illustrate how to construct kernel density plots in R. It also introduces rug plots, shows how they can complement kernel density plots, and shows how to construct them in R.
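
The post itself works in R, where the base functions `density()` and `rug()` cover both tasks. To make the underlying idea concrete, here is a minimal sketch in plain Python of a Gaussian kernel density estimate and a text-only rug. It is not the post's code: the function names, the fixed bandwidth of 8, and the sample values are assumptions made for illustration, and the values are not taken from the ‘airquality’ data set.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x by averaging
    Gaussian kernels centred on each sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def rug_line(samples, lo, hi, width=40):
    """Text rug: mark with '|' every column that contains at least one
    sample. Assumes lo < hi; samples outside [lo, hi] are clamped."""
    cells = [" "] * width
    for s in samples:
        col = int((s - lo) / (hi - lo) * width)
        col = max(0, min(width - 1, col))
        cells[col] = "|"
    return "".join(cells)


# Hypothetical measurements, for illustration only.
values = [12.0, 18.0, 20.0, 23.0, 30.0, 45.0, 64.0]

kde = gaussian_kde(values, bandwidth=8.0)  # bandwidth chosen by hand

# A density should integrate to about 1; check with a Riemann sum.
step = 0.5
grid = [i * step for i in range(-200, 401)]  # covers -100 to 200
area = sum(kde(x) for x in grid) * step

rug = rug_line(values, lo=0.0, hi=80.0)
print(round(area, 3))
print(rug)
```

The rug shows where the individual observations fall, which is the information the smooth density curve hides; drawing both together lets a reader judge whether a bump in the curve is backed by many points or only a few.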

Comparison of Bayesian predictive methods for model selection
The results show that optimizing a utility estimate such as the cross-validation score is liable to find overfitted models, because the utility estimates have relatively high variance when data is scarce. Better and much less variable results are obtained by incorporating all the uncertainties into a full encompassing model and projecting this information onto the submodels. The reference model projection also appears to outperform the maximum a posteriori model and the selection of the most probable variables. The study also demonstrates that model selection can greatly benefit from using cross-validation outside the search process, both for guiding the model size selection and for assessing the predictive performance of the finally selected model.