On Mean Shift and K-Means Clustering
Mean shift and K-means are two similar clustering algorithms; both extract information from data through some kind of mean-vector operation. Whereas the K-means algorithm has been widely popular, the mean shift algorithm has found only limited applications (e.g., image segmentation). In this note I’ll briefly compare these two algorithms and show a way, with VisuMap software, to combine them to get much better clustering tools.

The Mean Shift Clustering Algorithm
Mean shift clustering is a general non-parametric cluster-finding procedure, introduced by Fukunaga and Hostetler [1] and popular within the computer vision field. Nicely, in contrast to the better-known K-means clustering algorithm, the output of mean shift does not depend on any explicit assumptions about the shape of the point distribution, the number of clusters, or any form of random initialization.
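To make the idea concrete, here is a minimal sketch of mean shift in plain Python. It is an illustration under stated assumptions, not the implementation from the linked article: the Gaussian kernel, the `bandwidth` parameter, the convergence tolerance, and the rule for merging nearby modes are all choices made for this example. Note that, while the number of clusters is not specified, a kernel bandwidth still has to be chosen.

```python
import math

def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Sketch of mean shift with a Gaussian kernel (an assumed kernel choice).

    Every point is moved repeatedly to the kernel-weighted mean of the
    data until it stops moving; points that end up at (nearly) the same
    location are assigned to the same cluster.
    """
    modes = []
    for start in points:
        x = list(start)
        for _ in range(max_iter):
            # Gaussian weight of every data point relative to the current position
            weights = [
                math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * bandwidth ** 2))
                for p in points
            ]
            total = sum(weights)
            # Weighted mean of the data: the "mean shift" update
            new_x = [
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(len(x))
            ]
            shift = math.dist(x, new_x)
            x = new_x
            if shift < tol:
                break
        modes.append(x)

    # Merge modes that lie within bandwidth / 2 of each other
    # (an illustrative threshold, not a canonical one).
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) < bandwidth / 2:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels
```

Calling `mean_shift(points, bandwidth=1.0)` on a list of coordinate tuples returns the cluster centers it found and one label per input point. No cluster count and no random starting centers are passed in, which is the contrast with K-means drawn above.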

Announcing New Podcast: Talk Python to Me
A new podcast for Python developers called Talk Python to Me.

Python 101 for Aspiring Data Nerds
Here, we’ll build an app in Python from A to Z, iterate on it to make it more robust, and finally add application event logging with Fluentd and Treasure Data. We chose Python because it’s quickly becoming the language of choice among aspiring data scientists. In our examples, we’ll use Python version 2.7.

Prerequisites for Data Science
• Thinking creatively, but constructively, about data.
• Facility with data sets of varying sizes, and some understanding of scalability issues when working with data.
• Statistical computing skills in a command-driven environment (e.g., R, Python, or Julia).
• Experience wrestling with large, messy, complex, challenging data sets, for which there is no obvious goal or specially curated statistical method.
• An ethos of reproducibility.

Random Data Sets Quickly
This post will discuss a recent GitHub package I’m working on, wakefield, which generates random data sets.

Back to basics: High quality plots using base R graphics
Today at the Davis R Users’ Group, Michael Koontz gave a tour de force lesson in using R’s base graphics capabilities to plot data.

Dashboards in R with Shiny & Plotly
Shiny is an R package that allows users to build interactive web applications easily in R! Using Shiny and Plotly together, you can deploy an interactive dashboard. That means your team can create graphs in Shiny, then export and share them. Shiny apps involve two main components: a ui (user interface) script and a server script. The user interface script controls the layout of the app and the server script controls what the app does. In other words, the ui script creates what the user sees and controls and the server script completes calculations and creates the plots.

Spherical Trigonometry, Circle Packing, and Lead Generation – A Journey
Google, Twitter, and Instagram APIs allow for fairly large search radii, but return few results. For instance, querying Google for ‘nearby businesses’ in a 3 km radius would only return 20 results (due to result limiting). If we reduce the radius to something much smaller, like 50 meters, we end up with a more believable output (say ~12 results). Most APIs have some sort of hard result limit, which tends to be problematic when trying to reason about data at scale. We get better resolution on queries with smaller radii (since they are unlikely to exceed the result limit), so our implementation will probably need to incorporate this somehow.

A Benchmark Dataset for Time Series Anomaly Detection
We’re in the middle of an anomaly. It’s March Madness, and that means that we’re seeing an abnormally large amount of traffic to Yahoo Fantasy Sports. We anticipated this anomaly (and you probably could have guessed it too), but what about far more serious abnormalities that no one could anticipate? What about those anomalies that indicate potential security threats to our users’ data?