Parse and process XML (and HTML) with xml2
I’m pleased to announce that the first version of xml2 is now available on CRAN. xml2 is a wrapper around the comprehensive libxml2 C library that makes it easier to work with XML and HTML in R.
Top 10 R Packages to be a Kaggle Champion
Kaggle top ranker Xavier Conort shares insights on the ’10 R Packages to Win Kaggle Competitions’.
SAP Embraces Hadoop in the Enterprise
At the European Hadoop Summit in Brussels last week, SAP underlined its support for the fast-growing number of enterprise Hadoop deployments.
Introducing ghrr: GitHub Hosted R Repository
R relies on package repositories for initial installation of a package via install.packages(). A crucial second step is update.packages(): for all currently installed packages, a list of available updates is constructed and offered for either one-by-one or bulk updates. This keeps the local packages in sync with upstream, and provides a very convenient way to obtain new features, bug fixes and other improvements. So by installing from a repository, we automatically have the ability to track the repository for updates.
Accelerating R with multi-node parallelism – Rmpi, BatchJobs and OpenLava
R users frequently need to parallelize workloads, and while approaches like multicore and socket-level parallelism are good for some problems, when it comes to large problems there is nothing like a distributed cluster.
Machine Learning for (Smart) Dummies
Drawing on my background in theoretical machine learning research, I recently taught a seven-week course at Yahoo with the aim of providing the theoretical foundation on which the aforementioned algorithms are based. Why does a large margin guarantee good generalization? How does one avoid overfitting? What are the ‘no free lunch’ results in learning? What is the best learning rate one could hope for? Using rigorous mathematical tools, the course provides answers to these questions.
Plotting Factor Analysis Results
A recent factor analysis project (as discussed previously here, here, and here) gave me an opportunity to experiment with some different ways of visualizing highly multidimensional data sets. Factor analysis results are often presented in tables of factor loadings, which are good when you want the numerical details, but bad when you want to convey larger-scale patterns – loadings of 0.91 and 0.19 look similar in a table but very different in a graph. The detailed code is posted on RPubs because embedding the code, output, and figures in a webpage is much, much easier using RStudio’s markdown functions. That version shows how to get these example data and how to format them correctly for these plots. Here I will just post the key plot commands and figures those commands produce.
Application of PageRank algorithm to analyze packages in R
In the previous article, we talked about a crucial algorithm named PageRank, used by most search engines to figure out the popular or helpful pages on the web. We learnt that although counting the number of occurrences of a keyword can help us find the most relevant page for a query, it remains a weak recommender system. In this article, we will take up some practice problems which will help you understand this algorithm better. We will build a dependency structure between R packages and then try to solve a few interesting puzzles using the PageRank algorithm. But before we do that, we should brush up on our knowledge of packages in R for better understanding.
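As a rough illustration of the idea, here is a minimal power-iteration sketch of PageRank on a tiny dependency graph. It is written in Python as a language-neutral example and is not the article's own code. The package names, the damping factor of 0.85, and the choice to spread the rank of packages with no dependencies evenly across all nodes are assumptions made for this sketch.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank.

    links maps each node to the list of nodes it points to.
    """
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by dangling nodes (no outgoing links) is spread evenly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical graph: each package points at the packages it depends on.
deps = {
    "pkgA": ["core"],
    "pkgB": ["core", "utils"],
    "utils": ["core"],
    "core": [],
}
ranks = pagerank(deps)
for name in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{name}: {ranks[name]:.3f}")
```

In this toy graph the package that everything else depends on ends up with the highest score, which is the intuition behind ranking packages by their incoming dependencies.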
The Big Data Challenge
Big Data Analytics comprises:
1. Data Collection – collect unstructured and structured data from a variety of conventional and non-conventional sources, including machine sensors.
2. Data Storage – store data in robust, distributed, scalable storage based on commodity hardware, with replicated copies.
3. Descriptive Analytics – summarize data and develop data visualizations.
4. Predictive Analytics – develop models from the available data using supervised learning algorithms.
5. Prescriptive Analytics – develop a story for leveraging predictions.
Bayesian Priors for Parameter Estimation
In the last post we looked at how we could estimate the conversion rate for visitors that subscribe to my email list. In order to understand how the Beta distribution changes as we gain information, let’s look at another conversion rate. This time we’ll look at email subscribers and try to figure out how likely they are to click on the link to a post given that they open the email I sent them. I use MailChimp, which is great/horrible because it gives you real-time info about how many people who have opened an email have clicked the link to your post, so you can obsessively watch each person click/not click.
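To make "how the Beta distribution changes as we gain information" concrete, here is a small sketch of the standard conjugate update: a Beta(α, β) prior combined with k clicks and m non-clicks gives a Beta(α + k, β + m) posterior. The example is in Python for illustration only. The uniform Beta(1, 1) prior and the click counts are made-up assumptions, not figures from the post.

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate update of a Beta prior with binomial data."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))


# Assumed starting point: a uniform Beta(1, 1) prior.
posterior = (1.0, 1.0)

# Made-up batches of (clicked, opened but did not click).
batches = [(3, 7), (12, 28), (40, 110)]

for clicked, not_clicked in batches:
    posterior = update_beta(*posterior, clicked, not_clicked)
    a, b = posterior
    print(f"Beta({a:.0f}, {b:.0f}): mean={beta_mean(a, b):.3f}, "
          f"variance={beta_variance(a, b):.5f}")
```

Each batch shifts the mean toward the observed click rate and shrinks the variance, which is the narrowing of the distribution as information accumulates.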
R for more powerful clustering
R offers several useful clustering tools, but one that seems particularly powerful is the marriage of hierarchical clustering with a visual display of its results in a heatmap. The term ‘heatmap’ is often confusing, making most wonder – which is it? A ‘colorful visual representation of data in a matrix’ or ‘a (thematic) map in which areas are represented in patterns (‘heat’ colors) that are proportionate to the measurement of some information being displayed on the map’? For clustering purposes, the former meaning of a heatmap is the appropriate one, while the latter describes a choropleth.
Linear Models in R: Diagnosing Our Regression Model
Today we learn how to obtain useful diagnostic information about a regression model and then how to draw residuals on a plot. As before, we perform the regression.
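As a language-neutral illustration of what the basic diagnostics are computed from, the sketch below fits a simple least-squares line by hand and derives fitted values, residuals, and R². It is written in Python, is not the article's R code, and uses made-up data.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# R^2 compares residual variation with total variation around the mean.
mean_y = sum(y) / len(y)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

for xi, ri in zip(x, residuals):
    print(f"x={xi:.0f} residual={ri:+.3f}")
print(f"R^2 = {r_squared:.4f}")
```

With an intercept in the model, the residuals sum to zero and are uncorrelated with the predictor, so a residual plot that shows a trend or a funnel shape points to a problem with the model rather than with the arithmetic.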