Python’s Hidden Regular Expression Gems

There are many terrible modules in the Python standard library, but the Python re module is not one of them. While it’s old and has not been updated in many years, I would argue it’s one of the best regular expression modules of all dynamic languages. What I always found interesting about that module is that Python is one of the few dynamic languages which does not have language-integrated regular expression support. However, while it lacks syntax and interpreter support for it, it makes up for that with one of the better designed core systems from a pure API point of view. At the same time it’s very bizarre. For instance, the parser is written in pure Python, which has some bizarre consequences if you ever try to trace Python while importing: you will discover that 90% of your time is probably spent in one of re’s support modules.
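
To illustrate what "no language-integrated support" looks like in practice, here is a minimal sketch using only documented re calls. The pattern, the sample strings and the variable names are made up for illustration and are not taken from the article.

```python
import re

# There is no regex literal syntax in Python: a pattern is an ordinary
# (raw) string that is compiled through the module's API.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Matching returns a match object; named groups come back as a dict.
match = DATE.search("released on 2015-11-18, patched later")
parts = match.groupdict()

# Substitution can refer to the same named groups.
reordered = DATE.sub(r"\g<day>/\g<month>/\g<year>", "2015-11-18")

print(parts)
print(reordered)
```

Everything goes through plain functions and objects (compile, search, sub, match objects), which is the API-level design the paragraph above refers to.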

21 data science systems used by Amazon to operate its business

Most of these systems are or should be used by most large organisations, for business optimization.

Comparing Data Warehouse Solutions

We have collected the best resources from the net comparing the top data warehouse solutions such as Vertica, Aster Data, Greenplum, Netezza, Teradata, HANA, Hbase etc…


The 12K INDEX allows you to quantify the attention for any given topic in tech media over time. Our database already includes more than 1 million articles from the world’s leading tech media sources and it’s growing daily. Our research shows that expert media coverage is a leading indicator for the economic potential of a trend.

The Well Dressed Recommendation Engine

Well Dressed is an app that suggests clothes to wear or buy based on how you look, as well as on the weather, occasion and your budget. It debuted at WebSummit 2015 in Ireland, and it’s a one-man operation (for now!).

Simplified interface for TensorFlow (mimicking Scikit Learn)

This is a simplified interface for TensorFlow, to get people started on predictive analytics and data mining.
Why TensorFlow?
• TensorFlow provides a good backbone for building different shapes of machine learning applications.
• It will continue to evolve both in the distributed direction and as general pipelining machinery.
Why Scikit Flow?
• To smooth the transition from the Scikit Learn world of one-liner machine learning into the more open world of building different shapes of ML models. You can start by using fit/predict and slide into TensorFlow APIs as you are getting comfortable.
• To provide a set of reference models that would be easy to integrate with existing code.

Language Understanding for Text-based Games using Deep Reinforcement Learning

In this paper, we consider the task of learning control policies for text-based games. In these games, all interactions in the virtual world are through text and the underlying state is not observed. The resulting language barrier makes such environments challenging for automatic game players.

Conditional Computation in Neural Networks for faster models

Our goal is to use reinforcement learning in order to design better, more informed dropout policies, which are data-dependent. We cast the problem of learning activation-dependent dropout policies for blocks of units as a reinforcement learning problem. We propose a learning scheme motivated by computation speed, capturing the idea of wanting to have parsimonious activation while maintaining prediction accuracy. We apply a policy gradient algorithm for learning policies that optimize this loss.

A data mining framework to analyze road accident data

One of the key objectives in accident data analysis is to identify the main factors associated with road and traffic accidents. However, the heterogeneous nature of road accident data makes the analysis task difficult. Data segmentation has been used widely to overcome this heterogeneity. In this paper, we propose a framework that uses the K-modes clustering technique as a preliminary step to segment 11,574 road accidents on the road network of Dehradun (India) between 2009 and 2014 (both included). Next, association rule mining is used to identify the various circumstances associated with the occurrence of an accident, both for the entire data set (EDS) and for the clusters identified by the K-modes clustering algorithm. The findings of the cluster-based analysis and the entire data set analysis are then compared. The results reveal that the combination of K-modes clustering and association rule mining is very promising, as it produces important information that would remain hidden if no segmentation had been performed before generating association rules. Further, a trend analysis has also been performed for each cluster and for the EDS accidents; it finds different trends in different clusters, whereas the EDS shows a positive trend. The trend analysis also shows that prior segmentation of accident data is very important before analysis.

What is the importance of Dark Data in Big Data world?

Dark data is a subset of big data, but it constitutes the biggest portion of the total volume of big data collected by organizations in a year. We will discuss what opportunities this holds for an organization.

Fun with Simpson’s Paradox: Simulating Confounders

Wikipedia describes Simpson’s paradox as “a trend that appears in different groups of data but disappears or reverses when these groups are combined.” Here is the figure from the top of that article (you can click on the image in Wikipedia then follow th
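
As a small worked illustration of the reversal described in that definition, here is a minimal sketch in Python. It uses hand-picked numbers rather than a random simulation; the two groups and the helper name `slope` are assumptions made for this example and do not come from the article.

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Two hypothetical groups; group membership plays the role of the confounder.
# Within each group, y rises by one unit for each unit of x.
group_a_x, group_a_y = [1, 2, 3, 4], [11, 12, 13, 14]
group_b_x, group_b_y = [6, 7, 8, 9], [1, 2, 3, 4]

slope_a = slope(group_a_x, group_a_y)
slope_b = slope(group_b_x, group_b_y)

# Pooling the groups ignores the confounder and flips the sign of the trend.
slope_pooled = slope(group_a_x + group_b_x, group_a_y + group_b_y)

print(slope_a, slope_b, slope_pooled)
```

Both within-group slopes are positive, while the pooled slope is negative, because group B has larger x values but a much lower baseline for y.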

An OS X R Task Runner for – and a Mini-R-centric review of – Microsoft’s Visual Studio Code Editor

Microsoft’s newfound desire to make themselves desirable to the hipster development community has caused them to make many things open and/or free of late. One of these manifestations is Visual Studio Code, an Atom-ish editor for us code jockeys. I have friends at Microsoft and the Revolution R folks are there now, so I try to give things from Redmond a shot more than I previously would, especially when they make things for Mac.

Data Science Radar

Here at Mango Solutions, we do a lot of data science consultancy work. This means we face the challenges that any organisation looking to build and maintain a data science facility faces. The principal challenge is acquiring the right skill-sets and understanding our team. To help us, we’ve developed a conceptual framework called the Data Science Radar (DSRadar for short). As discussed in my colleague Hannah’s blog last week, the DSRadar measures against the six key areas of data science and produces a profile of someone’s skill-set:
• Communicator: convey complex information to others
• Visualiser: produce informative and comprehensible visualisations
• Data Wrangler: manipulate data into the required format for analysis
• Modeller: apply statistical models to data to gain insight
• Programmer: develop code that facilitates analytics
• Technologist: build and maintain the infrastructure for analytics

How long does it take to get to the airport from NYC?

Todd W Schneider analyzed a database of 1.1 billion taxi rides in New York City from 2009-2015, and discovered some interesting insights on how New Yorkers use cabs.

Create ARIMA time series from bottom up

Creating a range of ARIMA models by hand by manipulating white noise, instead of using arima.sim(), to make clear exactly how they work, and an animation to see several of them unfold.
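
The idea of building these series from white noise can be sketched as follows. This is a rough illustration in Python rather than R, under stated assumptions: the function names, the coefficients (0.8 and 0.5) and the seed are chosen here for illustration and are not taken from the linked post.

```python
import random

def white_noise(n, seed=0):
    """Independent standard normal draws: the raw material for every model."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def ar1(noise, phi):
    """AR(1): x[t] = phi * x[t-1] + e[t], starting from x = 0."""
    series, prev = [], 0.0
    for e in noise:
        prev = phi * prev + e
        series.append(prev)
    return series

def ma1(noise, theta):
    """MA(1): x[t] = e[t] + theta * e[t-1], with the first lagged shock set to 0."""
    series, prev_e = [], 0.0
    for e in noise:
        series.append(e + theta * prev_e)
        prev_e = e
    return series

def integrate(series):
    """The 'I' step: a cumulative sum, which undoes one round of differencing."""
    out, total = [], 0.0
    for x in series:
        total += x
        out.append(total)
    return out

noise = white_noise(200)
ar_series = ar1(noise, phi=0.8)               # ARIMA(1,0,0)
ma_series = ma1(noise, theta=0.5)             # ARIMA(0,0,1)
arima_011 = integrate(ma1(noise, theta=0.5))  # ARIMA(0,1,1)
```

Each model is just a different recipe applied to the same shocks, so plotting the three series side by side shows how the autoregressive, moving average and integration steps each change the shape of the noise.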

Rated R: Recommended Reading

Below are my recommendations for good R reads. Some of these books go back a few years, but they continue to hold their value. With the possible exception of books that were based primarily on the S language, good R books don’t become obsolete. Unlike some other computer languages, R evolves mostly through new capabilities added by contributed packages, not through changes to the R core. The fact that the dplyr family of packages may make data wrangling more convenient in many circumstances doesn’t make a book that teaches data manipulation through base R functions any less relevant. In fact, some might argue that new students should be taught the basic functionality first. I am not a militant traditionalist, but it does seem to me that familiarity with the bare bones basics of the language will help newcomers to gain intuition about how R works.

Benford lays down the Law

A few months ago I received in the mail a book called An Introduction to Benford’s Law by Arno Berger and Theodore Hill. I eagerly opened it but I lost interest once I realized it was essentially a pure math book. Not that there’s anything wrong with math, it just wasn’t what I wanted to read.

Free gradient boosting lecture

We have always regretted that we didn’t get to cover gradient boosting in Practical Data Science with R (Manning 2014). To try to make up for that, we are sharing (for free) our GBM lecture from our (paid) video course Introduction to Data Science.