R: How to Layout and Design an Infographic
As promised in my recent article, here's my tutorial on how to lay out and design an infographic in R. This article will serve as a template for more infographic designs that I plan to share in future posts. We will go through the following sections:
1. Layout – mainly handled by the grid package (see the sketch after this list).
2. Design – the style of the elements in the layout.
• Text – use the extrafont package for custom fonts;
• Shapes (lines and point characters) – use grid. Although a standalone version of this package was removed from CRAN (as of February 26, 2015), with only the compressed source archive still available, grid is included with base R by default, so check whether you already have it before installing anything.
• Plots – several choices for plotting data in R: base graphics, lattice, or the ggplot2 package.
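As a quick illustration of the layout step, here is a minimal grid sketch that divides a page into title, subtitle, and body regions. The colours, fonts, and proportions are placeholder choices of mine, not values from the original article.

```r
library(grid)

# Start a fresh page with a dark background rectangle
grid.newpage()
grid.rect(gp = gpar(fill = "#2C3E50", col = NA))

# Define a 3-row layout: title, subtitle, and the main body
pushViewport(viewport(layout = grid.layout(3, 1,
  heights = unit(c(0.12, 0.06, 0.82), "npc"))))

# Place text elements into the layout cells
grid.text("MY INFOGRAPHIC",
          vp = viewport(layout.pos.row = 1, layout.pos.col = 1),
          gp = gpar(fontsize = 28, col = "white"))
grid.text("A subtitle goes here",
          vp = viewport(layout.pos.row = 2, layout.pos.col = 1),
          gp = gpar(fontsize = 12, col = "grey80"))

# Row 3 is where the plots and shapes would be drawn
popViewport()
```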
The Imminent Future of Predictive Modeling
Over the past two to three years there has been a small explosion of companies offering cloud-based Machine Learning as a Service (MLaaS) and Predictive Analytics as a Service (PAaaS). IBM and Microsoft both have major freemium offerings in the form of Watson Analytics and Azure Machine Learning respectively, with companies like BigML, Ayasdi, LogicalGlue and ErsatzLabs occupying the smaller end of the spectrum. These are services which allow a data owner to upload data and rapidly build predictive or descriptive models, on the cloud, with a minimum of data science expertise.
Topic models: Past, present, and future
I don’t remember when I first came across topic models, but I do remember being an early proponent of them in industry. I came to appreciate how useful they were for exploring and navigating large amounts of unstructured text, and was able to use them, with some success, in consulting projects. When an MCMC algorithm for topic models came out, I even cooked up a Java program that I came to rely on (up until Mallet came along). I recently sat down with David Blei, co-author of the seminal paper on topic models and still one of the leading researchers in the field. We talked about the origins of topic models, their applications, improvements to the underlying algorithms, and his new role in training data scientists at Columbia University.
Announcing shinyapps.io General Availability
RStudio is excited to announce the general availability (GA) of shinyapps.io. Shinyapps.io is an easy-to-use, secure, and scalable hosted service already being used by thousands of professionals and students to deploy Shiny applications on the web. Effective today, shinyapps.io has completed beta testing and is generally available as a commercial service for anyone.
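For readers who haven't deployed before, the flow looks roughly like the sketch below. It assumes a shinyapps.io account; the account name, token, and secret are placeholders, and the client shown is the shinyapps package that RStudio distributed via GitHub at the time.

```r
# Client package, installed from GitHub at the time of this announcement:
# devtools::install_github("rstudio/shinyapps")
library(shinyapps)

# One-time account setup; copy the token and secret from your
# shinyapps.io dashboard (placeholders shown here)
setAccountInfo(name = "youraccount", token = "TOKEN", secret = "SECRET")

# Deploy the Shiny app in the current working directory
deployApp()
```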
Big data and the retail industry
Deep Learning: Doubly Easy and Doubly Powerful with GraphLab Create
One of machine learning’s core goals is classification of input data. This is the task of taking novel data and assigning it to one of a pre-determined number of labels, based on what the classifier learns from a training set. For instance, a classifier could take an image and predict whether it is a cat or a dog.
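GraphLab Create itself is a Python library, so the snippet below is not its API; it is just a generic R sketch of the classification task described above, using the built-in iris data and a k-nearest-neighbour classifier in place of the cat-versus-dog image example.

```r
library(class)  # provides a simple k-nearest-neighbour classifier

# Split the built-in iris data into training and test sets
set.seed(1)
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Learn from the training set, then assign labels to novel data
pred <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 5)

# Fraction of held-out flowers assigned the correct label
mean(pred == test$Species)
```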
Python Tutorial: Multivariate Linear Regression to Predict House Prices
This tutorial uses multivariate regression to predict house prices. The high-level goal is to use multiple features (size, number of bedrooms, number of bathrooms, etc.) to predict the price of a house. The tutorial is self-paced, and the language used throughout is Python, together with the libraries available in Python for scientific and machine learning applications. One of those tools, the IPython notebook (interactive Python rendered as HTML), is what you are reading right now. We'll go over other practical tools widely used in the data science industry below.
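The tutorial itself works in Python and IPython; purely to illustrate the model it builds, here is the same multivariate regression sketched in R. The file houses.csv and its column names are hypothetical.

```r
# Hypothetical data file with columns: price, size, bedrooms, bathrooms
houses <- read.csv("houses.csv")

# Fit a linear model of price on several features
fit <- lm(price ~ size + bedrooms + bathrooms, data = houses)
summary(fit)

# Predict the price of a new listing from its features
predict(fit, newdata = data.frame(size = 2000, bedrooms = 3, bathrooms = 2))
```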
Young Startup Creates a New Network Analysis Platform
The Fastest way to Analyse Small and Large pcap Files
Compiling CoffeeScript in R with the js package
A new release of the js package has made its way to CRAN. This version adds support for compiling CoffeeScript. Along with the uglify and jshint tools already included, the package now provides a fairly complete suite for compiling, validating, reformatting, optimizing, and analyzing JavaScript code in R.
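A minimal example of the new functionality, assuming the CRAN release described above (the CoffeeScript snippet itself is made up):

```r
library(js)

# Compile a CoffeeScript one-liner down to JavaScript
code <- coffee_compile("square = (x) -> x * x")
cat(code)

# The tools already in the package still apply to the result:
# lint the generated code, then minify it
jshint(code)
uglify_optimize(code)
```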
Using and Abusing Data Visualization: Anscombe’s Quartet and Cheating Bonferroni
Anscombe’s quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties.
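Conveniently, the quartet ships with R as the built-in anscombe data frame, so the point is easy to verify:

```r
# Summary statistics are nearly identical across the four datasets
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), var_x = var(x),
    mean_y = mean(y), cor_xy = cor(x, y))
})

# ...yet the graphs tell four very different stories
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i), pch = 19)
}
par(op)
```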
Collaborative Computing with distcomp
Distcomp, a new R package available on GitHub from a group of Stanford researchers, has the potential to significantly advance the practice of collaborative computing with large data sets distributed over separate sites that may be unwilling to explicitly share data. The fundamental idea is to be able to rapidly set up a web service, based on Shiny and OpenCPU technology, that manages and performs a series of master/slave computations which require sharing only intermediate results. The particular target application for distcomp is any group of medical researchers who would like to fit a statistical model using data from several sites, but face daunting difficulties with data aggregation or are constrained by privacy concerns. Distcomp and its methodology, however, ought to be of interest to any organization with data spread across multiple heterogeneous database environments.
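The snippet below is not distcomp's actual API; it is just a toy sketch of the underlying idea, in which each site shares only a summary (here, a count and a sum) and the master combines those intermediate results without ever seeing the raw data.

```r
# Each "site" reduces its private data to sufficient statistics
site_summary <- function(x) list(n = length(x), sum = sum(x))

# Pretend these three vectors live at three separate institutions
set.seed(42)
site_data <- list(rnorm(100, mean = 5),
                  rnorm(250, mean = 5.2),
                  rnorm(80,  mean = 4.9))

# Only the summaries travel over the wire, never the raw values
summaries <- lapply(site_data, site_summary)

# The master pools the intermediate results into a global estimate
total_n   <- sum(sapply(summaries, `[[`, "n"))
total_sum <- sum(sapply(summaries, `[[`, "sum"))
total_sum / total_n  # pooled mean across all sites
```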
Fuzzy String Matching – a survival skill to tackle unstructured information
“The amount of information available on the internet grows every day.” Thank you, Captain Obvious! By now even my grandma is aware of that. Still, the internet has increasingly become the first address for data people looking for good, up-to-date data. But this is not, and never has been, an easy task. Even though the Semantic Web was pushed very hard in academic environments, and in spite of all the efforts driven by the community of internet visionaries like Sir Tim Berners-Lee, the vast majority of existing sites don't speak RDF, don't expose their data via microformats, and keep giving a hard time to anyone trying to consume their data programmatically.
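When scraped data refuses to line up exactly, approximate string matching is the survival skill in question, and base R can already do it. The company names below are made up for illustration:

```r
# Messy names scraped from the web, plus the canonical names we want
scraped <- c("Micro Soft", "Microsfot Corp", "Aplle", "Apple Inc.")
canon   <- c("Microsoft", "Apple")

# adist() computes generalized Levenshtein (edit) distances
d <- adist(scraped, canon, ignore.case = TRUE)

# Match each scraped name to its closest canonical name
data.frame(scraped, match = canon[apply(d, 1, which.min)])
```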