Data frames and tables in Scala

To statisticians and data scientists used to working in R, the concept of a data frame is one of the most natural and basic starting points for statistical computing and data analysis. It always surprises me that data frames aren’t a core concept in most programming languages’ standard libraries, since they are essentially a representation of a relational database table, and relational databases are pretty ubiquitous in data processing and related computing. For statistical modelling and data science, having functions designed for data frames is much more elegant than using functions designed to work directly on vectors and matrices, for example. Trivial things like being able to refer to columns by a readable name rather than a numeric index make a huge difference, before we even get into issues like columns of heterogeneous types, coherent handling of missing data, etc. This is why modelling in R is typically nicer than in certain other languages I could mention, where libraries for scientific and numerical computing existed for a long time before libraries for data frames were added to the language ecosystem.

DL4J: Word2Vec

Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand.

An Introduction to Distributed Machine Learning

Building, managing, and even using distributed systems can be hard. For over 50 years, distributed systems experts have been working hard to achieve the vision of making many machines work harmoniously together as though they were one. With an increase in the volume of data being collected today, the need to efficiently distribute computation has greatly increased. Today, distributed computation is ubiquitous. For some problems, there are many existing implementations of distributed systems that can scale out computation efficiently, but there are many other problems where significant roadblocks prevent efficient distribution. In this blog post, I will provide perspective on the challenges and benefits of using distributed systems for modern day machine learning needs.

The Power of Indexing

We’re living in an age of data overload. It takes one second to generate a Google search and one more second to possess endless data on almost any topic imaginable. The reams of information, statistics and measurements returned can be overwhelming. How do companies begin to navigate the enormous amounts of customer information, like profiles, spending habits and decision-making processes? If that’s not enough, what seems like unlimited amounts of data can be made even more vast by slicing and dicing it to create new measures and data points. All this big data, combined with the complexity of modern analytical needs, makes it difficult to identify and understand the most impactful insights.

Avito Winner’s Interview: 2nd place, Changsheng Gu (aka Gzs_iceberg)

The Avito Context Ad Click competition asked Kagglers to predict if users of Russia’s largest general classified website would click on context ads while they browsed the site. The competition provided a truly robust dataset with eight comprehensive relational tables of data on historical user browsing and search behavior, location, and more. Changsheng Gu (aka Gzs_iceberg) finished in second place by using a combination of custom and public tools.

Examining Email Addresses in R

I don’t normally work with personally identifiable information such as emails. However, the recent data dump from Ashley Madison got me thinking about how I’d examine a data set composed of email addresses. What are the characteristics of an email that I’d look to extract? How would I perform that task in R? Here’s some quick R code to extract the host, address type, and other information from a set of email strings. From there, we can obviously summarize the data according to a number of desired email characteristics. I’d love to dive into the Ashley Madison email dump to find which companies and industries had the highest ratio of executives on that site, but that’s a little beyond my technical skills given the sheer size of the data set. Hopefully someone will complete that analysis soon enough.
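The kind of extraction described above can be sketched in a few lines. This is a rough illustration in Python rather than the R code the post refers to. The helper name parse_email, the sample addresses, and the use of the top-level domain as a stand-in for "address type" are all assumptions made for the example.

```python
from collections import Counter


def parse_email(address):
    """Split an address into local part, host, and top-level domain.

    Returns None when the string does not look like an email address.
    This is a deliberately loose check, not a full RFC 5322 validator.
    """
    local, sep, host = address.strip().lower().rpartition("@")
    if not sep or not local or "." not in host:
        return None
    tld = host.rsplit(".", 1)[1]
    if not tld:
        return None
    return {"local": local, "host": host, "tld": tld}


# Hypothetical sample data; the last entry is intentionally malformed.
emails = [
    "alice@example.com",
    "bob@example.org",
    "carol@mail.example.com",
    "not-an-email",
]

parsed = [p for p in map(parse_email, emails) if p is not None]

# Summarize by host and by top-level domain (a rough proxy for address type).
host_counts = Counter(p["host"] for p in parsed)
tld_counts = Counter(p["tld"] for p in parsed)
```

Once each address is reduced to a small record like this, summaries by host or domain type are just counts over the records.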

Where Does the S&P 500 Stand?

Last week was brutal for pretty much all markets. Surprisingly, it was bad even for the US dollar. The sharp and straight downward move was reminiscent of the descent of 2011. It’s time to review where the major index stands from a technical point of view.

Analysing longitudinal data: Multilevel growth models (I)

Last time we discussed the conversion of longitudinal data between wide and long formats and visualised individual growth trajectories using a sample randomised controlled trial dataset. But could we take this a step further and predict the trajectory of the outcomes over time?

Function Argument Lists and missing()

Sometimes it is useful to write a wrapper function for an existing function. In this short example we demonstrate how to grab the list of arguments passed to a function and use it to call another function, taking care of optional arguments with or without default values.

Is Bayesian A/B Testing Immune to Peeking? Not Exactly

Since I joined Stack Exchange as a Data Scientist in June, one of my first projects has been reconsidering the A/B testing system used to evaluate new features and changes to the site. Our current approach relies on computing a p-value to measure our confidence in a new feature. Unfortunately, this leads to a common pitfall in performing A/B testing, which is the habit of looking at a test while it’s running, then stopping the test as soon as the p-value reaches a particular threshold (say, .05). This seems reasonable, but in doing so, you’re making the p-value no longer trustworthy, and making it substantially more likely you’ll implement features that offer no improvement. How Not To Run an A/B Test gives a good explanation of this problem. One solution is to pre-commit to running your experiment for a particular amount of time, never stopping early or extending it further. But this is impractical in a business setting, where you might want to stop a test early once you see a positive change, or keep a not-yet-significant test running longer than you’d planned. (For more on this, see A/B Testing Rigorously (without losing your job)). An often-proposed alternative is to rely on a Bayesian procedure rather than frequentist hypothesis testing. It is often claimed that Bayesian methods, unlike frequentist tests, are immune to this problem, and allow you to peek at your test while it’s running and stop once you’ve collected enough evidence.
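The frequentist peeking problem described above can be illustrated with a small simulation. The sketch below is not the post's own analysis. It assumes an A/A test (both arms share the same true conversion rate, so every "significant" result is a false positive), a two-proportion z-test, and made-up values for the batch size, number of peeks, and conversion rate.

```python
import math
import random


def two_sided_p_value(successes_a, successes_b, n):
    """Two-proportion z-test p-value, with n observations in each arm."""
    pooled = (successes_a + successes_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation observed, so no evidence of a difference
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = (successes_a - successes_b) / n / se
    # Two-sided tail probability of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_tests=1000, n_peeks=20, batch=100, rate=0.1, alpha=0.05, seed=1):
    """Return (false positive rate with peeking, rate with one final test)."""
    rng = random.Random(seed)
    fp_peeking = 0
    fp_fixed = 0
    for _ in range(n_tests):
        a = b = 0
        crossed = False
        for peek in range(1, n_peeks + 1):
            # Both arms draw from the same rate: there is no real effect.
            a += sum(rng.random() < rate for _ in range(batch))
            b += sum(rng.random() < rate for _ in range(batch))
            p = two_sided_p_value(a, b, peek * batch)
            if p < alpha:
                # A peeker would have stopped here and declared a winner.
                # The loop continues only to obtain the fixed-horizon result.
                crossed = True
        fp_peeking += crossed
        fp_fixed += p < alpha  # p is the value at the planned end of the test
    return fp_peeking / n_tests, fp_fixed / n_tests


if __name__ == "__main__":
    peeking_rate, fixed_rate = simulate()
    print(f"false positive rate when stopping at first p < alpha: {peeking_rate:.3f}")
    print(f"false positive rate with a single test at the end:    {fixed_rate:.3f}")
```

Under these assumptions the single end-of-test check should stay near the nominal alpha, while stopping at the first significant peek produces a noticeably higher false positive rate, which is the sense in which the p-value is no longer trustworthy.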