Adding a CRAN Search Engine to Chrome

Riffing off of the previous post, here’s a way to quickly search CRAN (the @RStudio flavor) from the Chrome search bar.

What is the role of analytics in E-Commerce industry?

If you are preparing to interview for an analytics role, you need to do your groundwork and get a basic understanding of the domain. You should also know how analytics helps companies in that domain do business more intelligently. Such information is rarely available publicly, nor does it appear in job descriptions. Usually, at least one interview round will assess your ability to analyse a problem in the company's domain, so knowing the domain beforehand is a big advantage. In this article, I will introduce you to a few roles that analytics plays in the E-Commerce industry.

Book Review: Statistical Analysis of Network Data with R

Statistical Analysis of Network Data with R is a recent addition to the growing UseR! series of computational statistics monographs using the R programming language (R Core Team 2015). It gives a practical introduction to the visualization, modeling and analysis of network data, a topic which has enjoyed a recent surge in popularity. The book brings together a partnership of two established researchers in the field: Eric Kolaczyk, author of a number of papers and a recent text on statistical network analysis, and Gábor Csárdi, researcher of network data arising in biological applications and lead developer of a popular network analysis software suite. I was thus curious to see what this book has to offer, especially since such data is becoming more available and of interest in a wide range of scientific fields. In the preface, the authors state the aims of the book as “to provide an easily accessible introduction to the statistical analysis of network data”, but flag to the reader that the book is “not a detailed manual for using the various R packages encountered […] nor […] provide exhaustive coverage of the conceptual and technical foundations of the topic area”, but instead aims to “strike a balance between the two”. I will discuss both the theoretical and computing aspects of the book below.

Estimating Elasticities, All Over Again

I had some interesting email from Andrew a while back to do with computing elasticities from log-log regression models, and some related issues.
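
As a reminder of the textbook case (not necessarily the specific issues the post goes on to discuss): in a log-log model, log(y) = a + b·log(x) + error, the slope b is the elasticity of y with respect to x. The sketch below fits that model by ordinary least squares on synthetic data; the function name and the simulated numbers are made up for illustration.

```python
import math
import random


def loglog_elasticity(x, y):
    """Fit log(y) = a + b*log(x) by OLS and return (a, b); b is the elasticity."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Closed-form simple-regression estimates on the logged data
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic data (an assumption for illustration): y = 3 * x^(-1.5) * noise
random.seed(0)
x = [random.uniform(1.0, 10.0) for _ in range(500)]
y = [3.0 * xi ** -1.5 * math.exp(random.gauss(0.0, 0.05)) for xi in x]

intercept, elasticity = loglog_elasticity(x, y)
print(f"estimated elasticity: {elasticity:.3f}")
```

With multiplicative noise this small, the estimate should land close to the true value of -1.5 used to simulate the data.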

Machine Learning in Javascript- A compilation of Resources

One of the beauties of running JavaScript applications is that you don’t need to install any client-side software, optimize servers, or spend tons of time on core infrastructure. JavaScript just works out of the box in the browser. In that spirit, there is growing momentum behind building machine learning in JavaScript. We have collected a list of resources that will be helpful if you are building machine learning applications in JavaScript.

Exercise to detect Algorithmically Generated Domain Names

In this notebook we’re going to use some great Python modules to explore, understand and classify domains as being ‘legit’ or having a high probability of being generated by a DGA (Domain Generation Algorithm). We have ‘legit’ in quotes as we’re using the domains in Alexa as the ‘legit’ set. The primary motivation is to explore the nexus of IPython, Pandas and scikit-learn, with DGA classification as a vehicle for that exploration. The exercise intentionally shows common missteps, warts in the data, paths that didn’t work out that well and results that could definitely be improved upon. In general, capturing what worked and what didn’t is not only more realistic but often much more informative. 🙂
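
To give a flavor of what classifying domains can involve, here is a minimal sketch of two simple per-domain features, label length and character entropy, of the kind that could feed a classifier. This is an illustrative assumption, not necessarily the feature set the notebook uses; `shannon_entropy` and `domain_features` are hypothetical helper names, and taking the first dot-separated label is a simplification.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy (in bits) of the character distribution of text."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def domain_features(domain):
    """Return simple features for the first label of a domain name."""
    # Simplification: use the text before the first dot, lower-cased.
    label = domain.lower().split(".")[0]
    return {"length": len(label), "entropy": shannon_entropy(label)}


# Example strings chosen for illustration only
for name in ["google.com", "xkqjzvwpbhtr.com"]:
    print(name, domain_features(name))
```

Strings with many distinct, evenly spread characters score higher on entropy than strings that repeat a few letters, which is why this kind of feature is a plausible starting point.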

Is Regression Trustworthy? Or how to use Metrics to Trust the Prediction of Regression

Regression analysis is, without a doubt, one of the most widely used machine-learning models for prediction and forecasting. There are many reasons for the popularity of regression analysis. The most significant of these are the following:
• It is simple, and who doesn’t love simplicity?
• It is usually very accurate, if you have the right features and a large amount of data.
• It is easy to interpret. For some applications, such as medical ones, interpretability is more important than accuracy!

Matrix Factorization Comes in Many Flavors: Components, Clusters, Building Blocks and Id

Unsupervised learning is covered in Chapter 14 of The Elements of Statistical Learning. Here we learn about several data reduction techniques including principal component analysis (PCA), K-means clustering, nonnegative matrix factorization (NMF) and archetypal analysis (AA). Although on the surface they seem so different, each is a data approximation technique using matrix factorization with different constraints. We can learn a great deal if we compare and contrast these four major forms of matrix factorization.
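
One way to write down the shared structure is sketched below. The notation is my own, not the post's: X is the n × p data matrix, Z is n × k, B is k × p, and each method minimizes the squared reconstruction error of X ≈ ZB under its own constraints.

```latex
\begin{align*}
\text{All four:}\quad & \min_{Z,\,B}\ \lVert X - Z B \rVert_F^2,
  \qquad X \in \mathbb{R}^{n \times p},\ Z \in \mathbb{R}^{n \times k},\ B \in \mathbb{R}^{k \times p} \\
\text{PCA:}\quad & B B^{\top} = I_k
  \quad \text{(orthonormal components, with $X$ column-centered)} \\
\text{K-means:}\quad & Z_{ij} \in \{0, 1\},\ \textstyle\sum_{j} Z_{ij} = 1
  \quad \text{(rows of $B$ are the cluster centroids)} \\
\text{NMF:}\quad & Z \ge 0,\ B \ge 0
  \quad \text{(for nonnegative $X$)} \\
\text{AA:}\quad & Z \ge 0,\ Z \mathbf{1} = \mathbf{1},\quad
  B = W X,\ W \ge 0,\ W \mathbf{1} = \mathbf{1}
\end{align*}
```

Read this way, K-means assigns each row of X to exactly one row of B, NMF allows additive mixtures of nonnegative parts, and archetypal analysis requires both the archetypes and the reconstructions to be convex combinations.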

A simple statnet model of CRAN

In a recent post on creating JavaScript network graphs directly from R, my colleague and fellow blogger, Andrie de Vries, included a link to a saved graph of CRAN. Here, I will use that same graph (network) to build a simple exponential random graph model using functions from the igraph package, and the network and ergm packages included in the statnet suite of R packages. Each node (vertex) in the saved graph represents a package on CRAN, and a directed link or edge between two nodes A -> B indicates that package A depends on package B. Since Andrie’s CRAN graph does not have any external attributes associated with either the nodes or edges, the idea is to see if we can develop a model using only structural aspects of the network itself as predictors. In general, this is not an easy thing to do and we are only going to have limited success here. However, the process will illustrate some basic concepts.
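
To make "structural aspects of the network" concrete, here is a minimal sketch that computes a few purely structural summaries (edge count, mutual dyads, density, in-degree) from a directed edge list. It is not the ergm workflow from the post, and the package names are placeholders rather than real CRAN data.

```python
from collections import defaultdict

# Toy edges (A, B) meaning "package A depends on package B".
# Placeholder names for illustration; not taken from the CRAN graph.
edges = [
    ("pkgA", "pkgB"),
    ("pkgA", "pkgC"),
    ("pkgB", "pkgC"),
    ("pkgD", "pkgC"),
    ("pkgC", "pkgD"),
]


def structural_stats(edges):
    """Summaries of a directed graph that use only its structure."""
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    in_degree = defaultdict(int)
    for _, target in edge_set:
        in_degree[target] += 1
    # A mutual dyad has edges in both directions; count each pair once.
    mutual = sum(1 for a, b in edge_set if a < b and (b, a) in edge_set)
    n = len(nodes)
    density = len(edge_set) / (n * (n - 1)) if n > 1 else 0.0
    return {
        "nodes": n,
        "edges": len(edge_set),
        "mutual": mutual,
        "density": density,
        "in_degree": dict(in_degree),
    }


stats = structural_stats(edges)
print(stats)
```

Counts like these are the sort of quantity a structure-only model has to work with when no node or edge attributes are available.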