Event Detection for News using Tensor Decomposition

The project I worked on this summer was to develop a method that algorithmically generates timelines around a given news subject. A “subject” can be any topic or event, such as the Sony hacks, the FIFA corruption scandal, or the ongoing news coverage of Hillary Clinton or Donald Trump, or even a specific issue such as the tax policies of presidential candidates. The goal is to determine key events over a specified time window that indicate new or significant developments in the story. The result is a retrospective look at how these events unfolded within a particular topic’s lifetime.

Building Web Data Products with R & Shiny

The purpose of many data science projects is to end up with a model that can be used within an organisation to solve a particular problem. If that is the case, we need to determine the right representation of that model so it can be shared in the easiest, cheapest, and most effective way. Web data products are an ideal vehicle for delivering machine learning models. The Web can be accessed almost everywhere and by multiple users. Moreover, the typical web application deployment cycle makes updates easy. In this tutorial, we will introduce Shiny, a web development framework and application server for the R language. In simple terms, Shiny turns data analyses into interactive web apps.

Render Google Maps Tiles with Mapnik and Python

If you want to take a bunch of GIS data and rasterize it as a tiled image map for public consumption, the folks at ESRI would be happy to sell you an expensive solution. Of course, as with oh-so-many projects, you can accomplish the same thing for free with open-source software. In this case, we’ll use Python and a library called Mapnik to render beautiful map layers, then display them on Google Maps, just like this demo rendering of my home county!

Taxi & Ride Sharing Optimization

Working with taxi or geospatial data? Have an eye on a data science gig at a hot new ride sharing service? Check out these top scripts for visualization inspiration and code that gets you started training taxi optimization models. Earlier this year, we ran two competitions with ECML/PKDD 2015 using a shared dataset of geospatial data from taxis in Porto, Portugal. The goal of the competitions was to optimize taxi services by predicting total trip time and projected drop-off points. The training set contained one year of trip trajectories for all 442 taxis running in the city of Porto.

Gartner 2015 Hype Cycle: Big Data is Out, Machine Learning is in

Which technologies are the most hyped today? Check out Gartner’s latest 2015 Hype Cycle Report. Autonomous cars & IoT stay at the peak while big data is losing its prominence. Smart Dust is a cool new technology for the next decade!

Building Wordclouds in R

In this article, I will show you how to use text data to build word clouds in R. We will use a dataset containing around 200k Jeopardy questions. The dataset can be downloaded here (thanks to reddit user trexmatt for providing the dataset).

Legoplots in R (3D barplots in R)

I previously mentioned these, but I thought about it some more and went ahead and wrote an R implementation. With very little work you could rejig this code to display all kinds of 3D bar plots; I’ve commented the code thoroughly so it should be easy to follow. Immediately below is an example plot and below that is a gallery comparing an original plot by the Broad Institute with two versions produced by my code. I wouldn’t want to slavishly copy, so I’ve added an optional transparency effect that is both useful and attractive.

Coloring (and Drawing) Outside the Lines in ggplot

Coloring the legend to match the lines.