November 2015: Scripts of the Week

November’s scripts of the week feature Jupyter Notebook (newly supported on Kaggle Scripts), explore fundamental aspects of the American experience, and illuminate why sentiment analysis is ‘not a trivial affair’. Both USA Census scripts in this post are great starting points to share your own work on Kaggle. We encourage you to fork them and publish another perspective.


Seasonal Unit Root Testing in EViews

When we’re dealing with seasonal data – e.g., quarterly data – we need to distinguish between ‘deterministic seasonality’ and ‘stochastic seasonality’. The first type of seasonality is what we try to remove when we ‘seasonally adjust’ the series. It’s also what we’re trying to account for when we include seasonal dummy variables in a regression model.
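As a rough illustration of deterministic seasonality, the sketch below uses the fact that regressing a series on a full set of seasonal dummies (with no intercept) gives coefficients equal to each season's mean. The data are hypothetical, the function names are invented for this example, and the sketch assumes no trend, complete years, and a series that starts in the first quarter. It does not test for stochastic seasonality (seasonal unit roots).

```python
from statistics import mean

def seasonal_means(series, period=4):
    """Mean of each season; equals the coefficients from a regression
    on a full set of seasonal dummies with no intercept."""
    return [mean(series[s::period]) for s in range(period)]

def seasonally_adjust(series, period=4):
    """Subtract each season's deviation from the overall mean."""
    means = seasonal_means(series, period)
    overall = mean(series)
    return [x - (means[i % period] - overall) for i, x in enumerate(series)]

# Hypothetical quarterly data: two years with a fixed seasonal pattern.
series = [10.0, 12.0, 15.0, 11.0, 10.0, 12.0, 15.0, 11.0]

effects = seasonal_means(series)      # one estimated level per quarter
adjusted = seasonally_adjust(series)  # series with the seasonal pattern removed

print(effects)
print(adjusted)
```

Because the pattern in this toy series is purely deterministic, the adjusted series is flat. With real data a trend term would usually be included as well.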


Comparing Python Clustering Algorithms

There are a lot of clustering algorithms to choose from. The standard sklearn clustering suite alone has thirteen different clustering classes. So what clustering algorithms should you be using? As with every question in data science and machine learning, it depends on your data. A number of those thirteen classes in sklearn are specialized for certain tasks (such as co-clustering and bi-clustering, or clustering features instead of data points). Obviously an algorithm specializing in text clustering is going to be the right choice for clustering text data, and other algorithms specialize in other specific kinds of data. Thus, if you know enough about your data, you can narrow down to the clustering algorithm that best suits that kind of data, or the sorts of important properties your data has, or the sorts of clustering you need done. All well and good, but what if you don’t know much about your data? If, for example, you are ‘just looking’ and doing some exploratory data analysis (EDA), it is not so easy to choose a specialized algorithm. So, what algorithm is good for exploratory data analysis?
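To make the "it depends on your data" point concrete, here is a minimal sketch that runs two of sklearn's clustering classes, KMeans and DBSCAN, on the same synthetic two-moons dataset. The dataset, the parameter values (`eps`, `min_samples`, `n_clusters`) and the `summarize` helper are assumptions chosen for illustration. They are not taken from the linked article and are not a recommendation for real data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Synthetic, non-globular data: two interleaved half-circles.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# KMeans needs the number of clusters up front and assigns every point.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN infers the number of clusters from density and may label
# some points as noise (label -1).
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

def summarize(labels):
    """Return (number of clusters, number of noise points)."""
    n_clusters = len(set(labels) - {-1})
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise

print("KMeans:", summarize(kmeans_labels))
print("DBSCAN:", summarize(dbscan_labels))
```

Plotting the two label sets against `X` is a quick way to see how differently the algorithms partition the same points.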


Mining open datasets for transparency in taxi transport in metropolitan environments

Uber has recently been introducing novel practices in urban taxi transport. Journey prices can change dynamically in almost real time and also vary geographically from one area to another in a city, a strategy known as surge pricing. In this paper, we explore the power of the new generation of open datasets towards understanding the impact of the new disruptive technologies that emerge in the area of public transport. With our primary goal being a more transparent economic landscape for urban commuters, we provide a direct price comparison between Uber and the Yellow Cab company in New York. We discover that Uber, despite its lower standard pricing rates, effectively charges higher fares on average, especially for taxi journeys that are short in length but frequent in occurrence. Building on this insight, we develop a smartphone application, OpenStreetCab, that offers a personalized consultation to mobile users on which taxi provider is cheaper for their journey. Almost five months after its launch, the app has attracted more than three thousand users in a single city. Their journey queries have provided additional insights on the potential savings similar technologies can have for urban commuters, with a highlight being that on average, a user in New York saves 6 U.S. dollars per taxi journey if they pick the cheapest taxi provider. We run extensive experiments to show how Uber’s surge pricing is the driving factor of higher journey prices and therefore higher potential savings for our application’s users. Finally, motivated by the observation that Uber’s surge pricing is occurring more frequently than intuitively expected, we formulate a prediction task where the aim is to predict a geographic area’s tendency to surge. Using data exogenous to Uber, in particular Yellow Cab and Foursquare data, we show how it is possible to estimate customer demand within an area, and by extension surge pricing, with high accuracy.


Anomaly Detection in Predictive Maintenance with Time Series Analysis

The newest challenge now lies in predicting the “unknown”, i.e. an anomaly. An anomaly is an event that is not part of the system’s past; an event that cannot be found in the system’s historical data. In the case of network data, an anomaly can be an intrusion; in medicine, a sudden pathological status; in sales or credit card businesses, a fraudulent payment; and, finally, in machinery, the breakdown of a mechanical part.
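One simple way to operationalize "an event that cannot be found in the system's historical data" is to flag new readings that fall far outside the range seen so far. The sketch below uses a plain z-score threshold against a historical baseline. This is a stand-in technique chosen for brevity, not the method of the linked article, and the sensor readings, the threshold of 3, and the `find_anomalies` helper are all hypothetical.

```python
from statistics import mean, stdev

def find_anomalies(history, new_readings, threshold=3.0):
    """Return (index, value) pairs for readings that lie more than
    `threshold` standard deviations from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)
    return [
        (i, x)
        for i, x in enumerate(new_readings)
        if abs(x - mu) > threshold * sigma
    ]

# Hypothetical sensor readings recorded during normal operation.
history = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]

# Hypothetical new readings; the third is far outside the past range.
new_readings = [20.0, 20.2, 27.5, 19.9]

anomalies = find_anomalies(history, new_readings)
print(anomalies)
```

A fixed threshold like this ignores trend and seasonality, so with real time series the baseline would normally come from a fitted model rather than a raw mean.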


How to Learn R

There are tons of resources to help you learn the different aspects of R, and as a beginner this can be overwhelming. It’s also a dynamic, rapidly changing language, so it’s important to keep up with the latest tools and technologies. That’s why R-bloggers and DataCamp have worked together to bring you a learning path for R. Each section points you to relevant resources and tools to get you started and keep you engaged to continue learning. It’s a mix of materials, including documentation, online courses, books, and more. Just like R, this learning path is a dynamic resource. We want to continually evolve and improve the resources to provide the best possible learning experience.


analyze survey data for free

obsessively-detailed instructions to analyze publicly-available survey data with free tools – the r language, the survey package, and (for big data) sqlsurvey + monetdb.


Download and plot Google Trends data with R

Google Trends is a useful way to compare changes in popularity of certain search terms over time, and Google Trends data can be used as a proxy for all sorts of difficult-to-measure quantities like economic activity and disease propagation. If you’d like to use Google Trends data in your own analyses, the gtrendsR package for R is now available on CRAN. This package by Philippe Massicotte and Dirk Eddelbuettel adds functions to connect with your Google account and download Trends data for one or more search terms at daily or weekly resolution over a specified period of time.