Beginner’s guide to Web Scraping in Python (using BeautifulSoup)

The need for and importance of extracting data from the web are becoming increasingly clear. Every few weeks, I find myself in a situation where we need to extract data from the web. For example, last week we were thinking of creating an index of hotness and sentiment about various data science courses available on the internet. This would require not only finding new courses, but also scraping the web for their reviews and then summarizing them in a few metrics! This is one of those problems/products whose efficacy depends more on web scraping and information extraction (data collection) than on the techniques used to summarize the data.
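The guide builds on requests and BeautifulSoup; as a minimal sketch of the basic fetch-and-parse pattern (the URL below is a placeholder, not a page from the article):

```python
# A minimal scraping sketch with requests + BeautifulSoup.
# The URL is an illustrative placeholder.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/courses"  # hypothetical page of course listings
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Print the target and text of every link on the page as a starting point.
for link in soup.find_all("a"):
    print(link.get("href"), link.get_text(strip=True))
```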

Deep learning for sentiment analysis: successful approaches and future challenges

Sentiment analysis (also known as opinion mining) is an active research area in natural language processing. It aims at identifying, extracting and organizing sentiments from user-generated texts in social networks, blogs or product reviews. Over the past 15 years, many studies in the literature have exploited machine learning approaches to solve sentiment analysis tasks from different perspectives. Since the performance of a machine learner heavily depends on the choice of data representation, many studies have been devoted to building powerful feature extractors with domain expertise and careful engineering. Recently, deep learning approaches have emerged as powerful computational models that discover intricate semantic representations of texts automatically from data, without feature engineering. These approaches have improved the state of the art in many sentiment analysis tasks, including sentiment classification of sentences/documents, sentiment extraction and sentiment lexicon learning. In this paper, we provide an overview of the successful deep learning approaches for sentiment analysis tasks, lay out the remaining challenges and offer some suggestions for addressing them. WIREs Data Mining Knowl Discov 2015, 5:292-303. doi: 10.1002/widm.1171
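The survey contrasts hand-engineered features with learned representations. For context, here is a minimal sketch of the feature-engineered style of pipeline that the deep approaches improve on, using scikit-learn's TF-IDF features and logistic regression (the texts and labels are made up for illustration):

```python
# A bag-of-words sentiment baseline: a hand-specified feature pipeline
# of the kind that learned deep representations aim to replace.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great course, loved it", "terrible pacing, would not recommend"]
labels = [1, 0]  # 1 = positive, 0 = negative (made-up toy data)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["loved the pacing"]))
```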

How To Use Multivariate Time Series Techniques For Capacity Planning on VMs

Capacity planning is an arduous, ongoing task for many operations teams, especially those who rely on Virtual Machines (VMs) to power their business. At Pivotal, we have developed a data science pipeline capable of fitting hundreds of thousands of forecasting models to automate this task, using a multivariate time series approach. Open to reuse in other areas, such as industrial equipment or vehicle engines, this technique can be applied broadly to anything for which regular monitoring data can be collected.
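The post does not include code; as a minimal sketch of multivariate forecasting in Python with a vector autoregression (VAR) from statsmodels, where the two monitored metrics and the random-walk data are illustrative assumptions, not the post's actual pipeline:

```python
# Multivariate time series sketch: fit a VAR over two monitoring
# metrics and forecast the next 24 observations. Data is synthetic.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "cpu": rng.normal(0, 1, 200).cumsum() + 50,  # hypothetical CPU %
    "mem": rng.normal(0, 1, 200).cumsum() + 60,  # hypothetical memory %
})

model = VAR(data)
results = model.fit(maxlags=4)

# Forecast from the last fitted lags of the series.
forecast = results.forecast(data.values[-results.k_ar:], steps=24)
print(forecast[:3])
```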

Building Barplots with Error Bars

Bar charts are a pretty common way to represent data visually, but constructing them isn’t always the most intuitive thing in the world. One way that we can construct these graphs is using R’s default packages.
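The post itself uses base R; as an analogous sketch in Python with matplotlib, where the bar heights are group means and the error bars are standard deviations (all numbers are made up):

```python
# Barplot with error bars: heights are group means, yerr gives the
# symmetric error bars, capsize draws the caps. Data is made up.
import matplotlib.pyplot as plt

groups = ["A", "B", "C"]
means = [4.2, 5.1, 3.8]
stds = [0.5, 0.8, 0.3]

fig, ax = plt.subplots()
ax.bar(groups, means, yerr=stds, capsize=4)
ax.set_ylabel("Mean value")
plt.show()
```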

Dimensionality Reduction for Binary Data through the Projection of Natural Parameters

Principal component analysis (PCA) for binary data, known as logistic PCA, has become a popular alternative for dimensionality reduction of binary data. It is motivated as an extension of ordinary PCA by means of a matrix factorization, akin to the singular value decomposition, that maximizes the Bernoulli log-likelihood. We propose a new formulation of logistic PCA which extends Pearson’s formulation of a low dimensional data representation with minimum error to binary data. Our formulation does not require a matrix factorization, as previous methods do, but instead looks for projections of the natural parameters from the saturated model. Due to this difference, the number of parameters does not grow with the number of observations, and the principal component scores on new data can be computed with simple matrix multiplication. We derive explicit solutions for data matrices of special structure and provide computationally efficient algorithms for solving for the principal component loadings. Through simulation experiments and an analysis of medical diagnoses data, we compare our formulation of logistic PCA to the previous formulation as well as ordinary PCA to demonstrate its benefits.
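A rough sketch of that last claim in Python: under this formulation, scoring new binary data needs only a projection of the saturated-model natural parameters, with no per-observation optimization. The loadings U, main effects mu, and the scale m are assumed to come from an already-fitted model, and the approximation of the natural parameters is my reading of the formulation, not code from the paper:

```python
# Rough sketch: principal component scores for new binary data via
# matrix multiplication. U (loadings), mu (main effects), and m are
# assumed given by a fitted projection-based logistic PCA model.
import numpy as np

def logistic_pca_scores(X_new, U, mu, m=4.0):
    # Natural parameters of the saturated model for binary data,
    # approximated on a scale of +/- m.
    theta_tilde = m * (2.0 * X_new - 1.0)
    # Scores: project the centered natural parameters onto the loadings.
    return (theta_tilde - mu) @ U

# Toy usage with made-up dimensions: 5 observations, 10 binary
# variables, 2 components.
rng = np.random.default_rng(0)
X_new = rng.integers(0, 2, size=(5, 10)).astype(float)
U = rng.normal(size=(10, 2))
mu = np.zeros(10)
print(logistic_pca_scores(X_new, U, mu).shape)  # (5, 2)
```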

JSOC 2015 project – Interactive Visualizations in Julia with GLVisualize.jl

GLVisualize is an interactive visualization library that supports 2D and 3D rendering as well as building basic GUIs. It’s written entirely in Julia and OpenGL. I’m really glad that I could continue working on this project with the support of the Julia Summer of Code.

Party with the First Tribe

The de facto standard for decision trees, or “recursive partitioning” trees as they are known in the literature, is the CART algorithm by Breiman et al. (1984), implemented in R’s rpart package. Stripped down to its essential structure, CART is a two-stage algorithm. In the first stage, the algorithm conducts an exhaustive search over each variable to find the best split by maximizing an information criterion that will result in cells that are as pure as possible for one or the other class of the response variable. In the second stage, a constant model is fit to each cell of the resulting partition. The algorithm then proceeds in a recursive “greedy” fashion, making splits and not looking back to see how things might have been before making the next split. Although hugely successful in practice, the algorithm has two vexing problems: (1) overfitting and (2) selection bias, in that the algorithm favors features with many possible splits [1]. Overfitting occurs because the algorithm has “no concept of statistical significance” [2]. While overfitting is usually handled with cross-validation and pruning, there doesn’t seem to be an easy way to deal with selection bias in the CART/rpart framework.
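The article works in R’s rpart/party ecosystem; as an analogous illustration in Python, scikit-learn’s CART-style trees expose cost-complexity pruning, the standard remedy for the overfitting problem above (it does nothing for selection bias). The dataset and alpha value are arbitrary choices for the sketch:

```python
# CART-style greedy trees in scikit-learn: an unpruned tree vs. one
# with cost-complexity pruning, compared by cross-validated accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

unpruned = DecisionTreeClassifier(random_state=0)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)  # arbitrary alpha

for name, tree in [("unpruned", unpruned), ("pruned", pruned)]:
    scores = cross_val_score(tree, X, y, cv=5)
    print(name, round(scores.mean(), 3))
```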

Installing R on OS X – “100% Homebrew Edition”

In a previous post I provided “mouse-heavy” instructions for getting R running on your Mac. A few of the comments suggested that an “all Homebrew” solution may be preferable for some folks. Now, there are caveats to this: getting “support” for what may be R issues will be very difficult on the official mailing lists, as you’ll immediately be told to “use the official distribution” by some stalwart R folks (this happens on StackOverflow and other forums as well). However, if you have a thick skin and can be somewhat self-sustaining, Homebrew is a superb alternative for setting up your R environment (and other things) on your OS X system.

JSOC 2015 project – Automatic Differentiation in Julia with ForwardDiff.jl

This summer, I’ve had the good fortune to be able to participate in the first ever Julia Summer of Code (JSoC), generously sponsored by the Gordon and Betty Moore Foundation. My JSoC project was to explore the use of Julia for automatic differentiation (AD), a topic with a wide array of applications in the field of optimization. Under the mentorship of Miles Lubin and Theodore Papamarkou, I completed a major overhaul of ForwardDiff.jl, a Julia package for calculating derivatives, gradients, Jacobians, Hessians, and higher-order derivatives of native Julia functions (or any callable Julia type, really). By the end of this post, you’ll hopefully know a little bit about how ForwardDiff.jl works, why it’s useful, and why Julia is uniquely well-suited for AD compared to other languages.
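ForwardDiff.jl itself is written in Julia; as a language-neutral illustration of the dual-number idea behind forward-mode AD, here is a toy sketch in Python (emphatically not the package’s implementation):

```python
# Toy forward-mode AD via dual numbers: each value carries its
# derivative, and arithmetic propagates both together.
class Dual:
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def derivative(f, x):
    # Seed the input with derivative 1 and read off the output's derivative.
    return f(Dual(x, 1.0)).deriv

f = lambda x: x * x * x   # f(x) = x^3
print(derivative(f, 2.0)) # 12.0, since f'(x) = 3x^2
```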