Launching learning path to master D3.js

The aim of these learning paths is to remove confusion for newcomers. If you are learning a tool or technique and haven’t checked them out yet, you definitely should.

How Airlines Measure Loyalty

An ongoing debate in the loyalty world is which type of customer is more important: one who spends big on airline tickets, or one who funnels millions of credit-card miles through the banks. Recently at the Loyalty Event in San Diego, Mike Hecht from Delta noted that they don’t mind whether their customers are loyal due to credit-card spend or due to bum-in-seat flying; either way they’re being loyal and should be rewarded with status.

Introduction to data quality

How many times have you heard managers and colleagues complain about the quality of the data in a particular report, system, or database? People often describe poor-quality data as unreliable or untrustworthy. Defining exactly what high- or low-quality data is, why it is at a certain quality level, and how to manage and improve it is often a trickier task.

A tale about LDA2vec: when LDA meets word2vec

A few days ago I discovered lda2vec (by Chris Moody), a hybrid algorithm combining the best ideas from the well-known LDA (Latent Dirichlet Allocation) topic-modeling algorithm and from word2vec, a somewhat less well-known tool for language modeling. Now I’m going to tell you a tale about lda2vec and my attempts to try it and compare it with a plain LDA implementation (I used the gensim package for this). So, once upon a time…
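For readers unfamiliar with plain LDA, here is a toy collapsed Gibbs sampler in pure Python, a heavily simplified sketch of what gensim does at scale. The function name, hyperparameters, and the tiny corpus below are my own illustrations, not from the post:

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    w2id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]        # doc-topic counts
    nkw = [[0] * V for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                         # tokens per topic
    z = []                                      # topic assignment per token
    for d, doc in enumerate(docs):              # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w2id[w]] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wi = z[d][i], w2id[w]
                # remove this token's current assignment, then resample
                ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][wi] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                r = rng.random() * sum(weights)
                for t, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1
    # smoothed, normalized topic-word distributions
    phi = [[(nkw[k][v] + beta) / (nk[k] + V * beta) for v in range(V)]
           for k in range(n_topics)]
    return vocab, phi
```

On a real corpus you would of course reach for gensim rather than this; the sketch only shows the mechanics lda2vec builds on.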

Writing fast asynchronous SGD/AdaGrad with RcppParallel

After Tomas Mikolov et al. released the word2vec tool, there was a boom of articles about word vector representations. One of the most notable is GloVe, which advanced the field by explaining how such algorithms work; it also reformulates word2vec optimization as a special kind of factorization of the word co-occurrence matrix.
This post is divided into two main parts:
1. A very brief introduction to the GloVe algorithm.
2. Implementation details: how to write fast asynchronous parallel SGD with an adaptive learning rate (AdaGrad) in C++, using Intel TBB and RcppParallel.
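The post’s implementation is parallel C++; as a minimal single-threaded sketch of the AdaGrad update rule it builds on (function and variable names are my own, not from the post), in Python:

```python
import math

def adagrad_step(w, g, cache, lr=0.5, eps=1e-8):
    """One AdaGrad update: each coordinate's step shrinks with the
    accumulated squared gradient seen so far for that coordinate."""
    for i, gi in enumerate(g):
        cache[i] += gi * gi
        w[i] -= lr * gi / (math.sqrt(cache[i]) + eps)
    return w

# toy problem: minimize f(w) = (w0 - 3)^2 + (w1 + 1)^2,
# whose gradient is [2*(w0 - 3), 2*(w1 + 1)]
w, cache = [0.0, 0.0], [0.0, 0.0]
for _ in range(500):
    g = [2 * (w[0] - 3), 2 * (w[1] + 1)]
    adagrad_step(w, g, cache)
```

In the asynchronous ("Hogwild"-style) variant the post describes, multiple threads run this same update concurrently on shared `w` and `cache` without locking; because updates to word co-occurrence models are sparse, the races rarely collide.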

A menagerie of messed up data analyses and how to avoid them

Any introductory statistics or data analysis class might teach you the basics: how to load a data set, how to munge it, how to do t-tests, maybe how to write a report. But there are a whole bunch of ways a data analysis can be screwed up that often get skipped over. Here is my first crack at a ‘menagerie’ of messed-up data analyses and how you can avoid them. Depending on interest I could probably list a ton more, but as always this is a non-comprehensive list :).

Using PostgreSQL in R: A quick how-to

The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a serverless SQL database that gives you the power of SQL for data manipulation, while maintaining a lightweight infrastructure.
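The post works in R with PostgreSQL; as a language-agnostic illustration of the serverless-SQL idea it mentions, here is a sketch using Python’s built-in sqlite3 module (the table and column names below are made up):

```python
import sqlite3

# An in-memory SQLite database: full SQL for data manipulation,
# zero server infrastructure to stand up or maintain.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 50.0), ("west", 75.0)])

# Push the aggregation into SQL instead of doing it in application code.
totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
conn.close()
```

The same pattern carries over to R via RSQLite, or to a real PostgreSQL server once the data outgrows a single file.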

GPU Computing for Data Science

When working with big data or complex algorithms, we often look to parallelize our code to optimize runtime. By taking advantage of a GPU’s 1,000+ cores, a data scientist can quickly scale out solutions inexpensively, and sometimes more quickly than with traditional CPU cluster computing. In this webinar, we will present ways to incorporate GPU computing to complete computationally intensive tasks in both Python and R.

Multidimensional algorithms and iteration

Starting with release 0.4, Julia makes it easy to write elegant and efficient multidimensional algorithms. The new capabilities rest on two foundations: a new type of iterator, called CartesianRange, and sophisticated array indexing mechanisms. Before I explain, let me emphasize that developing these capabilities was a collaborative effort, with the bulk of the work done by Matt Bauman (@mbauman), Jutho Haegeman (@Jutho), and myself (@timholy). These new iterators are deceptively simple, so much so that I’ve never been entirely convinced that this blog post is necessary: once you learn a few principles, there’s almost nothing to it. However, like many simple concepts, the implications can take a while to sink in. There also seems to be some widespread confusion about the relationship between these iterators and Base.Cartesian, which is a completely different (and much more painful) approach to solving the same problem. There are still a few occasions where Base.Cartesian is necessary, but for many problems these new capabilities represent a vastly simplified approach.
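The mechanics are Julia-specific, but the core idea behind CartesianRange, iterating over every index tuple of an array without hardcoding the number of dimensions, can be sketched in Python with itertools.product (the helper name is mine):

```python
from itertools import product

def cartesian_range(shape):
    """Yield every index tuple for an array of the given shape,
    dimension-agnostic, loosely analogous to Julia's CartesianRange."""
    return product(*(range(n) for n in shape))

# Sum a virtual array a[i][j] = i + j over a 2x3 shape: the same code
# would work unchanged for a 4-D or 5-D shape, with no nested loops.
total = sum(i + j for i, j in cartesian_range((2, 3)))
```

What Julia adds on top of this idea is that the compiler specializes the loop for each dimensionality, so the generic code runs as fast as hand-written nested loops.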

DataCamp Interactive R Tutorial: Data Exploration with Kaggle Scripts

Ever wonder where to begin your data analysis? Exploratory Data Analysis (EDA) is often the best starting point. Take the new hands-on course from Kaggle & DataCamp, “Data Exploration with Kaggle Scripts”, to learn the essentials of data exploration and begin navigating the world of data. By the end of the course you will know how to combine various R packages and tools to get the most out of them when exploring your data. You will also be guided through submitting your first Kaggle Script to your profile, and will publish analyses on Kaggle Scripts personalized with information from your own life. (Tip: share your profile link with hiring managers and peers to easily show off and discuss your work.)

The Top A.I. Breakthroughs of 2015

Progress in artificial intelligence and machine learning has been impressive this year. Those in the field acknowledge that progress is accelerating year by year, though it is still at a manageable pace for us. The vast majority of work in the field these days builds on work done by other teams earlier the same year, in contrast to most other fields, where references span decades. Summarizing a wide range of developments in this field almost invariably leads to descriptions that sound heavily anthropomorphic, and this summary is no exception. Such metaphors, however, are only convenient shorthand for talking about these functionalities. It’s important to remember that even though many of these capabilities sound very thought-like, they’re usually not very similar to how human cognition works. The systems are all, of course, functional and mechanistic, and, though increasingly less so, each is still quite narrow in what it does. Be warned, though: in reading this article, these functionalities may seem to go from fanciful to prosaic.

Microsoft Deep Learning Brings Innovative Features – CNTK Shows Promise

Microsoft has released CNTK, a deep-learning toolkit that shows promise. While a few innovative features set it apart from its competitors, a major drawback may hurt its adoption.

Commonmark: Super Fast Markdown Rendering in R

Markdown is used in many places these days; however, the original spec leaves some ambiguity, which makes it difficult to optimize and leads to inconsistencies between implementations. CommonMark is an initiative led by John MacFarlane at UC Berkeley (also the author of pandoc) to standardize the markdown syntax. Besides a specification, the CommonMark team provides reference implementations for C (cmark) and JavaScript (commonmark.js). The commonmark R package wraps cmark, which converts markdown text into various formats, including HTML, LaTeX, and groff man. This makes commonmark very suitable for, e.g., writing manual pages, which are often stored in exactly these formats. In addition, the package exposes the markdown parse tree in XML format to support customized output handling.

Unemployment in Europe

A couple of years ago I started making plots of unemployment and its change over the years. At first this was a bigger and more complex piece of code. As things have progressed, the code can now be pretty concise; there are plenty of packages to do the heavy lifting. So, this year I tried to make the code easy to read and reasonably documented.

New in V8: Calling R, from JavaScript, from R, from JavaScript…

The V8 package provides an R interface to Google’s open-source JavaScript engine. The package is completely self-contained and requires no runtime dependencies, making it very easy to execute JavaScript code from R. A handful of CRAN packages use V8 to provide R bindings to useful JavaScript libraries. Have a look at the v8 vignette to get started.

Tracking ggplot2 Extensions

The purpose of this blog post is to inform R users of a website that I created to track and list ggplot2 extensions. The site is available at: Its goal is to help other R users easily find ggplot2 extensions, which are coming in “fast and furious” from the R community.

Video: Applied Predictive Modeling with R

There’s more to Iowa than just today’s presidential caucuses. Last month, the Central Iowa R User Group hosted Dr. Max Kuhn, Director of Non-Clinical Statistics at Pfizer Global R&D, via video-chat to present on Applied Predictive Modeling with R. Max is the co-author of the excellent book Applied Predictive Modeling (read our review here), and in the presentation he covers many of the topics from the book in a brisk 75 minutes.

Voronoi Diagrams in Plotly and R

Here’s a function that uses Plotly’s R library to overlay a Voronoi diagram on top of a 2-D k-means visualization.
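The figure itself is built with Plotly in R; the underlying connection, that each k-means cluster region is exactly the Voronoi cell of its centroid, can be sketched in a few lines of Python (the points and centroids below are made up):

```python
def nearest_centroid(point, centroids):
    """Index of the closest centroid by squared Euclidean distance.
    The set of points mapping to a given centroid is its Voronoi cell,
    which is why a k-means partition and a Voronoi diagram of the
    centroids draw the same boundaries."""
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[k])))

centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
assert nearest_centroid((1.0, 1.0), centroids) == 0   # cell of centroid 0
assert nearest_centroid((3.5, 0.5), centroids) == 1   # cell of centroid 1
```

Coloring a fine grid of points by `nearest_centroid` is one simple way to render the Voronoi regions behind a scatter plot, in any plotting library.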