R: Cohort Analysis of Neo4j Meetup Members
In the customer retention example we track customer purchases on a month-by-month basis, and each customer is put into a cohort or bucket based on the month in which they first made a purchase. We then calculate how many of them made purchases in subsequent months and compare that with the behaviour of people in other cohorts. In our case we aren’t selling anything, so our equivalent will be a person attending a meetup. We’ll put people into cohorts based on the month of the first meetup they attended. This can act as a proxy for when people become interested in a technology and could perhaps allow us to see how the behaviour of innovators, early adopters and the early majority differs, if at all.
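The cohort bucketing described above can be sketched in a few lines. This is only an illustrative sketch in plain Python, not the R code from the post: the function names, the (member, date) input shape and the sample attendance records are all assumptions made up for the example.

```python
from collections import defaultdict
from datetime import date


def month_index(d):
    """Map a date to a running month number so that months can be subtracted."""
    return d.year * 12 + (d.month - 1)


def cohort_table(attendances):
    """Count distinct active members per cohort and per month offset.

    attendances: iterable of (member_id, date) pairs (hypothetical input shape).
    Returns {cohort label "YYYY-MM": {months since first meetup: member count}}.
    """
    attendances = list(attendances)  # the data is walked twice below

    # Pass 1: each member's cohort is the month of their first meetup.
    first_month = {}
    for member, day in attendances:
        m = month_index(day)
        if member not in first_month or m < first_month[member]:
            first_month[member] = m

    # Pass 2: record who was active in each month relative to their cohort.
    active = defaultdict(lambda: defaultdict(set))
    for member, day in attendances:
        cohort = first_month[member]
        active[cohort][month_index(day) - cohort].add(member)

    table = {}
    for cohort in sorted(active):
        label = f"{cohort // 12:04d}-{cohort % 12 + 1:02d}"
        table[label] = {
            offset: len(members)
            for offset, members in sorted(active[cohort].items())
        }
    return table


# Made-up attendance records, purely for illustration.
events = [
    ("alice", date(2014, 1, 15)),
    ("alice", date(2014, 2, 12)),
    ("bob", date(2014, 1, 15)),
    ("bob", date(2014, 3, 19)),
    ("carol", date(2014, 2, 12)),
    ("carol", date(2014, 3, 19)),
]
table = cohort_table(events)
print(table)
```

Each row of the resulting table is one cohort, and each column is the number of months since that cohort's first meetup, which is the shape needed to compare retention across cohorts.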
Learn how each ML classifier works: decision boundary vs. assumed true boundary
In the latest post on my own blog, I discussed how to learn visually how each machine learning classifier works. My idea is to first prepare training samples and show their assumed true boundary, and then compare that boundary with the decision boundary the classifier estimates on a test dataset made of a dense grid covering the space. In the case below, the assumed true boundary of the space is a set of 3 parallel lines; I think everybody will guess so intuitively, but the most important point here is whether a given machine learning classifier actually behaves that way. For example, when multinomial logit, one of the linear classifiers, is trained on the samples below, it gives a decision boundary for a grid dataset covering the whole space. It looks almost the same as the assumed boundary.
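The grid-based comparison can be sketched as follows. This is a rough illustration, not the author's code: the thresholds defining the three parallel lines, the grid sizes and all function names are assumptions, and a nearest-centroid rule is used as a simple stand-in linear classifier in place of multinomial logit.

```python
def true_band(x, y, thresholds=(2.0, 4.0, 6.0)):
    """Assumed true label: the band of x + y that the point falls in.

    The thresholds define three parallel lines x + y = c (hypothetical values).
    """
    return sum(x + y >= c for c in thresholds)


def make_grid(lo, hi, steps):
    """Evenly spaced points covering the square [lo, hi] x [lo, hi]."""
    ticks = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return [(x, y) for x in ticks for y in ticks]


def fit_nearest_centroid(points, labels):
    """Stand-in linear classifier: predict the class whose mean point is closest."""
    sums = {}
    for (x, y), k in zip(points, labels):
        sx, sy, n = sums.get(k, (0.0, 0.0, 0))
        sums[k] = (sx + x, sy + y, n + 1)
    centroids = {k: (sx / n, sy / n) for k, (sx, sy, n) in sums.items()}

    def predict(x, y):
        return min(
            centroids,
            key=lambda k: (x - centroids[k][0]) ** 2 + (y - centroids[k][1]) ** 2,
        )

    return predict


# Training samples: a coarse grid labelled by the assumed true boundary.
train = make_grid(0.0, 4.0, 9)
predict = fit_nearest_centroid(train, [true_band(x, y) for x, y in train])

# Test dataset: a dense grid covering the same space.
test_grid = make_grid(0.0, 4.0, 81)
matches = sum(predict(x, y) == true_band(x, y) for x, y in test_grid)
agreement = matches / len(test_grid)
print(f"grid points where the classifier matches the assumed boundary: {agreement:.1%}")
```

Any classifier exposing a predict(x, y) function could be dropped in here. Plotting the predicted label of each grid point, rather than just counting matches, gives the visual comparison the post describes.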
The Value of Data, Part 3: Data Business Models
Data is incredibly valuable. It helps create superior products, it forms a barrier to entry, and it can be directly monetized. This post is the third in a 3-part series about making data a core part of a startup’s business plan.
Artificial Neurons and Single-Layer Neural Networks
This article offers a brief glimpse of the history and basic concepts of machine learning. We will take a look at the first algorithmically described neural network and the gradient descent algorithm in the context of adaptive linear neurons, which will not only introduce the principles of machine learning but also serve as the basis for modern multilayer neural networks in future articles.
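As a rough illustration of gradient descent for an adaptive linear neuron (Adaline), here is a minimal sketch in plain Python. It is not taken from the article: the function names, the learning rate, the epoch count and the toy data are all assumptions. It minimises the sum-of-squared-errors cost with a linear activation and applies a threshold only when predicting.

```python
def net_input(w, b, x):
    """Linear activation: weighted sum of the inputs plus the bias."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_adaline(X, y, eta=0.01, epochs=50):
    """Batch gradient descent on J(w, b) = 1/2 * sum_i (y_i - (w . x_i + b))^2."""
    w = [0.0] * len(X[0])
    b = 0.0
    costs = []
    for _ in range(epochs):
        errors = [yi - net_input(w, b, xi) for xi, yi in zip(X, y)]
        # The negative gradient w.r.t. w_j is sum_i error_i * x_ij,
        # so each weight takes a step of size eta in that direction.
        for j in range(len(w)):
            w[j] += eta * sum(e * xi[j] for e, xi in zip(errors, X))
        b += eta * sum(errors)
        costs.append(0.5 * sum(e * e for e in errors))
    return w, b, costs


def predict(w, b, x):
    """Threshold the linear output to get a class label of -1 or 1."""
    return 1 if net_input(w, b, x) >= 0.0 else -1


# Made-up, linearly separable toy data, purely for illustration.
X = [[-2.0, -1.0], [-1.0, -1.5], [-1.5, -0.5],
     [1.0, 1.5], [2.0, 1.0], [1.5, 0.5]]
y = [-1, -1, -1, 1, 1, 1]

w, b, costs = train_adaline(X, y, eta=0.01, epochs=50)
print("cost in first and last epoch:", costs[0], costs[-1])
print("predictions:", [predict(w, b, x) for x in X])
```

The learning rate matters: if eta is too large for the scale of the inputs, the updates overshoot and the cost grows instead of shrinking, which is why features are commonly standardised before training.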
Adopting R for experienced developers
More and more frequently I come across people who express an interest in R, and I thought I would share some advice to help people decide if R is something they should use, as well as some high-level advice on getting started.
Announcing Spark 1.3!
Today I’m excited to announce the general availability of Spark 1.3! Spark 1.3 introduces the widely anticipated DataFrame API, an evolution of Spark’s RDD abstraction designed to make crunching large datasets simple and fast. Spark 1.3 also boasts a large number of improvements across the stack, from Streaming, to ML, to SQL. The release has been posted today on the Apache Spark website.
• A new DataFrame API
• Spark SQL Graduates from Alpha
• Built-in Support for Spark Packages
• Lower Level Kafka Support in Spark Streaming
• New Algorithms in MLlib