List of useful packages (libraries) for Data Analysis in R

R offers a multitude of packages for performing data analysis. Apart from providing an awesome interface for statistical analysis, the next best thing about R is the endless support it gets from developers and data science maestros all over the world. The current count of downloadable packages on CRAN stands close to 7,000! Beyond popular packages such as caret, ggplot, dplyr and lattice, there are many more libraries that go largely unnoticed but prove very handy at certain stages of analysis. So, we created a comprehensive list of useful packages in R.

Step-by-Step Guidelines to Optimize Big Data Transfers

Transferring large datasets, especially ones with heterogeneous file sizes (i.e. many small and large files together), leads to inefficient utilization of the available network bandwidth. Small file transfers may prevent the underlying transfer protocol from reaching full network utilization because of short transfer durations and connection start-up/tear-down overhead, while large file transfers may suffer from protocol inefficiency and end-system limitations.
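The start-up/tear-down point is easy to see with a toy throughput model; the bandwidth and per-connection overhead below are assumed numbers for illustration, not figures from the article.

```python
# Toy model (assumed numbers): effective throughput when each file transfer
# pays a fixed connection start-up/tear-down cost.

def effective_throughput_gbps(total_gb, n_files, bandwidth_gbps=10.0, per_file_overhead_s=0.1):
    """Achieved throughput when total_gb is split into n_files equal files."""
    transfer_time_s = total_gb * 8 / bandwidth_gbps   # time actually moving bits
    overhead_s = n_files * per_file_overhead_s        # fixed cost per connection
    return total_gb * 8 / (transfer_time_s + overhead_s)

if __name__ == "__main__":
    for n_files in (1, 1_000, 100_000):
        print(f"{n_files:>7} files -> {effective_throughput_gbps(100, n_files):.2f} Gbps")
```

With these assumed numbers, the same 100 GB drops from nearly 10 Gbps as one file to well under 1 Gbps as 100,000 small files, which is the inefficiency the guidelines set out to address.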

Faster deep learning with GPUs and Theano

I will show you how to:
• Configure the Python library Theano to use the GPU for computation.
• Build and train neural networks in Python.
Using the GPU, I’ll show that we can train deep belief networks up to 15x faster than using just the CPU, cutting training time down from hours to minutes.
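For orientation, one common way to point Theano at the GPU is via the THEANO_FLAGS environment variable; this is a minimal sketch assuming the older CUDA backend (newer releases use device=cuda, and a .theanorc file works as well), not necessarily the exact setup used in the post.

```python
# Minimal sketch: select the GPU before Theano is imported.
import os
os.environ.setdefault("THEANO_FLAGS", "device=gpu,floatX=float32")

import theano
import theano.tensor as T

print("Theano is computing on:", theano.config.device)

# Tiny sanity check: a matrix multiply compiled by Theano.
x = T.matrix("x")
y = T.matrix("y")
dot = theano.function([x, y], T.dot(x, y), allow_input_downcast=True)
print(dot([[1.0, 2.0]], [[3.0], [4.0]]))  # [[11.]]
```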

K-nearest neighbors in Python

In this post, we’ll be using the K-nearest neighbors algorithm to predict how many points NBA players scored in the 2013-2014 season. Along the way, we’ll learn about Euclidean distance and figure out which NBA players are the most similar to LeBron James.
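The core of the similarity step is just Euclidean distance over per-game stats; here is a minimal sketch of that idea in Python. The stat values and player selection are illustrative, not the post's actual NBA dataset.

```python
# Euclidean distance between players described by (points, rebounds, assists).
import numpy as np

players = {
    "LeBron James": np.array([27.1, 6.9, 6.3]),   # illustrative per-game stats
    "Kevin Durant": np.array([32.0, 7.4, 5.5]),
    "Player C":     np.array([8.2, 3.1, 1.9]),
}

target = players["LeBron James"]
distances = {
    name: np.linalg.norm(stats - target)          # Euclidean distance
    for name, stats in players.items()
    if name != "LeBron James"
}
print(min(distances, key=distances.get))          # most similar player by these stats
```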

Nonparametric Latent Dirichlet Allocation

Latent Dirichlet Allocation is a generative model for topic modeling. Given a collection of documents, an LDA inference algorithm attempts to determine (in an unsupervised manner) the topics discussed in the documents. It makes the assumption that each document is generated by a probability model, and, when doing inference, we try to find the parameters that best fit the model (as well as unseen/latent variables generated by the model). If you are unfamiliar with LDA, Edwin Chen has a friendly introduction you should read.
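For readers who want to see plain (parametric) LDA in action before the nonparametric extension, here is a minimal sketch with scikit-learn; the tiny corpus and the choice of two topics are arbitrary, and the whole point of the nonparametric variant discussed in the post is that the number of topics does not have to be fixed like this.

```python
# Minimal parametric LDA sketch (the post covers the nonparametric variant).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the goalkeeper saved the penalty in the final match",
    "the striker scored twice in the league match",
    "the central bank raised interest rates again",
    "inflation and interest rates worry the markets",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Rows of lda.components_ are word weights per topic;
# transform() gives each document's inferred topic mixture.
print(lda.transform(counts).round(2))
```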

Identifying Trends in SQL with Linear Regression

One of the best ways to learn how a statistical model really works is to code the underlying math for it yourself. Today, we’re going to do that with simple linear regression.
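The post works the math out in SQL; as a language-agnostic reference for the underlying formulas, here is the closed-form slope and intercept in a short Python sketch (the example data is made up).

```python
# Simple linear regression from the closed-form formulas:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
#   intercept = mean(y) - slope * mean(x)
# These are the same quantities a SQL version would compute with AVG/SUM.

def simple_linear_regression(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up example data with a roughly increasing trend.
print(simple_linear_regression([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8]))
```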

Possible Future Roles in Big Data with Descriptions

• Data Trader
• Data Hound
• Data Plumber
• Data Butcher
• Data Miner
• Data Canary
• Data Janitor
• Data Cleaner
• Data Pharmacist
• Data Chef
• Data Taster
• Data Server
• Data Whisperer
• Data Czar
• Data Shouterer

Getting Started: Adobe Analytics Clickstream Data Feed

In this blog post, I will show the structure of the Adobe Analytics Clickstream Data Feed and how to work with a day’s worth of data within R. Clickstream data isn’t as raw as pure server logs, but the only limit to what we can calculate from clickstream data is what we can accomplish with a bit of programming and imagination. In later posts, I’ll show how to store a year’s worth of data in a relational database, store the same data in Hadoop, and do analysis using modern tools such as Apache Spark.

Generalised Linear Models in R

Linear models are the bread and butter of statistics, but there is a lot more to them than taking a ruler and drawing a line through a couple of points. Some time ago Rasmus Bååth published an insightful blog article about how such models can be described from a distribution-centric point of view, instead of the classic error-terms convention. I think the distribution-centric view makes generalised linear models (GLMs) much easier to understand as well. That’s the purpose of this post. Using data on ice cream sales I will set out to illustrate different models, starting with traditional linear least-squares regression, moving on to a linear model, a log-transformed linear model and then on to generalised linear models, namely a Poisson (log) GLM and a Binomial (logistic) GLM. Additionally, I will run a simulation with each model. Along the way I aim to reveal the intuition behind a GLM using Rasmus’ distribution-centric description. I hope to clarify why one would want to use different distributions and link functions.
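As a rough companion to the post (which works entirely in R), here is an analogous sketch in Python with statsmodels showing how the same relationship can be fitted as a Gaussian GLM and as a Poisson (log-link) GLM; the temperature/units numbers are illustrative stand-ins in the spirit of the ice-cream example, not the post’s data.

```python
# Analogous GLM sketch in Python (the post itself uses R).
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "temp":  [11.9, 14.2, 15.2, 16.4, 17.2, 18.1, 19.4, 22.1, 22.6, 23.4],
    "units": [185, 215, 332, 325, 408, 421, 406, 522, 445, 544],
})

# Gaussian GLM with identity link: equivalent to ordinary least squares.
lm = smf.glm("units ~ temp", data=df, family=sm.families.Gaussian()).fit()

# Poisson GLM with the default log link: counts, multiplicative effect of temp.
poisson = smf.glm("units ~ temp", data=df, family=sm.families.Poisson()).fit()

print(lm.params, poisson.params, sep="\n")
```

The only thing that changes between the two fits is the assumed distribution (and its link function), which is exactly the distribution-centric point the post builds on.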

Visualizing population density

The graph at the top of this post represents a square kilometer and draws a dot for every person in various counties. This representation is deceptive at high densities. It would look like a black square long before it got to 1,000,000 people (1,000 people by 1,000 people, each taking up a square meter). We just can’t show 1,000 by 1,000 dots on a graph that size.
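A quick way to convince yourself of the saturation effect is to plot random dots in a unit square and crank up the count; this sketch assumes matplotlib, and the density value is arbitrary rather than taken from the post.

```python
# Why a dot-per-person picture saturates: even a modest density fills the square.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
people_per_km2 = 50_000          # try 1_000 vs 50_000 to watch the square go black
xy = rng.random((people_per_km2, 2))

plt.figure(figsize=(4, 4))
plt.scatter(xy[:, 0], xy[:, 1], s=0.2, color="black")
plt.title(f"{people_per_km2:,} people in one square km")
plt.axis("off")
plt.show()
```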

Sensemaking in R: A Plenitude of Models Makes for Good Storytelling

‘Sensemaking is a motivated, continuous effort to understand connections (which can be among people, places, and events) in order to anticipate their trajectories and act effectively.’

Feature Engineering versus Feature Extraction: Game On!

‘Feature engineering’ is a fancy term for making sure that your predictors are encoded in the model in a manner that makes it as easy as possible for the model to achieve good performance. For example, if you have a date field as a predictor and there are larger differences in response for weekends versus weekdays, then encoding the date as a weekend/weekday indicator makes it easier to achieve good results.
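As a minimal sketch of that weekend/weekday encoding in Python with pandas (the post’s own examples use R and caret, and the column name here is made up):

```python
# Derive a weekend indicator feature from a raw date column.
import pandas as pd

df = pd.DataFrame({"order_date": pd.to_datetime(
    ["2015-06-05", "2015-06-06", "2015-06-07", "2015-06-08"])})

# dayofweek: Monday=0 ... Sunday=6, so values 5 and 6 are the weekend.
df["is_weekend"] = (df["order_date"].dt.dayofweek >= 5).astype(int)
print(df)
```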