Python: Towards Out-of-core ND-Arrays
We propose a system for task-centered computation, show an example with out-of-core nd-arrays, and ask for comments.

Building Language Detector via Scikit-Learn
The share of English across all languages is decreasing and will likely continue to do so in the coming years. Nearly all tweets were written in English in 2006, whereas in 2013 only half were, as Japanese, Spanish, Portuguese and other languages increased their share. This will likely continue if we look at population: excluding China (which has its own networks), India, Russia, Brazil (Portuguese) and Germany do not yet contribute in proportion to their populations. (This is based on the assumption that smartphones will be ubiquitous and that most people who have internet access will have access to smartphones as well.)

The Future of Big Data? Three Use Cases of Prescriptive Analytics
In 2014, Gartner placed prescriptive analytics at the beginning of the Peak of Inflated Expectations in their Hype Cycle of Emerging Technologies. According to Gartner, it will take another 5-10 years before prescriptive analytics will be common in boardrooms around the world. But what is prescriptive analytics, how can we use it and how can it help organizations in their decision-making process?

Reading Excel Spreadsheets with Python and xlrd
Previously, we looked at how to create Microsoft Excel (i.e. *.xls) files using the xlwt package. Today we will be looking at how we can read an *.xls/*.xlsx file using a package called xlrd. The xlrd package can be run on Linux and Mac as well as Windows. This is great when you need to process an Excel file on a Linux server.

Multivariate Medians
However, unless you took a stats. course in Multivariate Analysis, most of you probably didn’t get to meet the median in a multivariate setting. Did you ever wonder why not?

Deep learning Reading List
Following is a growing list of some of the materials I found on the web for Deep Learning beginners.

Deep Learning in a Nutshell
Deep learning. Neural networks. Backpropagation. Over the past year or two, I’ve heard these buzz words being tossed around a lot, and it’s something that has definitely seized my curiosity recently. Deep learning is an area of active research these days, and if you’ve kept up with the field of computer science, I’m sure you’ve come across at least some of these terms at least once.

Python: Pandas Pivot Table Explained
Most people likely have experience with pivot tables in Excel. Pandas provides a similar function called (appropriately enough) pivot_table. While it is exceedingly useful, I frequently find myself struggling to remember how to use the syntax to format the output for my needs. This article will focus on explaining the pandas pivot_table function and how to use it for your data analysis.

Videos: The Machine Learning Summer School: 26 August to 6 September 2013
at the Max Planck Institute for Intelligent Systems, Tübingen, Germany

R: String to Date or NA
I’ve been trying to clean up a CSV file which contains some rows with dates and some not – I only want to keep the cells which do have dates so I’ve been trying to work out how to do that.
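The post itself works in R. As a rough Python analogue of the same idea, the sketch below parses each cell and returns None where R would give NA. The function name, the sample cells and the %Y-%m-%d format are assumptions made for illustration, not details taken from the post.

```python
from datetime import datetime


def to_date_or_none(text, fmt="%Y-%m-%d"):
    """Return a date if `text` matches `fmt`, otherwise None."""
    try:
        return datetime.strptime(text, fmt).date()
    except (ValueError, TypeError):
        # ValueError: the string does not match the format.
        # TypeError: the cell is not a string at all (e.g. None).
        return None


# Hypothetical cells from one CSV column: some hold dates, some do not.
cells = ["2014-12-25", "n/a", "", "2015-01-03"]
parsed = [to_date_or_none(cell) for cell in cells]

# Keep only the cells that really contain dates.
dated = [d for d in parsed if d is not None]
```

Filtering on `is not None` plays the role that dropping NA values plays in R.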

R: Dplyr – Mutate with strptime
Having worked out how to translate a string into a date, or NA if it wasn’t in the appropriate format, the next thing I wanted to do was store the result of the transformation in my data frame.

The Rise of Data Science in the Age of Big Data Analytics
Big Data is important because we want to use it to make sense of our world. It’s tempting to think there’s some “magic bullet” for analyzing big data, but simple “data distillation” often isn’t enough, and unsupervised machine-learning systems can be dangerous (like, bringing-down-the-entire-financial-system dangerous). Data Science is the key to unlocking insight from Big Data: by combining computer science skills with statistical analysis and a deep understanding of the data and problem we can not only make better predictions, but also fill in gaps in our knowledge, and even find answers to questions we hadn’t even thought of yet.

Are Enterprises Ready for Billions of Devices to Join the Internet?
There are currently more devices connected to the Internet than there are people in the world. The Internet now connects a staggering 10 billion devices. And this number will continue to grow, as more devices gain the ability to directly interface with the Internet or become physical representations of data accessible via Internet systems. This trend toward interactive device independence is collectively described as the Internet of Things (IoT).

Machine Learning Books Suggested by Michael I. Jordan from Berkeley
There has been a Machine Learning (ML) reading list on Hacker News for a while, in which Professor Michael I. Jordan recommends some books for getting started in ML, aimed at people who are going to devote many decades of their lives to the field and who want to get to the research frontier fairly quickly. In his recent reddit AMA he articulated the relationship between CS and Stats amazingly well, and he also added some books that dig still further into foundational topics. I just list them here for people’s convenience and my own reference.

Widgets For Christmas
As a quick example, we can look at the DiagrammeR package released yesterday by Richard Iannone. DiagrammeR launched in non-htmlwidgets form severely hampering its ability to be easily used in multiple contexts. Converting it to htmlwidgets seemed like a great opportunity to illustrate both the ease of htmlwidgets creation and the powerful infrastructure offered by htmlwidgets. So, in a couple hours—easy to create, check—yesterday (most of the time spent on examples, documentation, and testing) with only a couple of lines of JavaScript—easy to create, check again—I was able to transform the DiagrammeR package into htmlwidgets.

Graphing Non-Linear Mathematical Expressions in R
In Part 20 of this series, let’s see how to create mathematical expressions for your graph in R. We’ll use an example of graphing a cosine curve, along with relevant Greek letters as the axis label, and printing the equation right on the graph.

Graphical Models 3
This is Christopher Bishop’s third talk on Graphical Models, given at the Machine Learning Summer School 2013, held at the Max Planck Institute for Intelligent Systems, in Tübingen, Germany, from 26 August to 6 September 2013.

Graphical Models 2
This is Christopher Bishop’s second talk on Graphical Models, given at the Machine Learning Summer School 2013, held at the Max Planck Institute for Intelligent Systems, in Tübingen, Germany, from 26 August to 6 September 2013.

Graphical Models 1
This is Christopher Bishop’s first talk on Graphical Models, given at the Machine Learning Summer School 2013, held at the Max Planck Institute for Intelligent Systems, in Tübingen, Germany, from 26 August to 6 September 2013.

Too large datasets for regression ? What about subsampling….
Recently, a classmate working at an insurance company told me that his datasets were too large to run simple regressions (GLMs, which involve optimization issues), and that they were thinking of offering a reward to whoever writes the best (or at least the fastest) R code. My first idea was to use subsampling techniques, arguing that 10 regressions on 100,000 observations can take less time than one regression on 1,000,000 observations. And perhaps also provide better results…
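A minimal sketch of the subsampling idea, under loud assumptions: it is in Python rather than R, it uses ordinary least squares as a stand-in for the GLM mentioned in the post, the data are simulated, and 20,000 points stand in for the 1,000,000 observations. It only shows the mechanics (split, fit each part, average the coefficients) and makes no claim about speed or accuracy on real insurance data.

```python
import random


def mean(values):
    return sum(values) / len(values)


def ols_fit(xs, ys):
    """Closed-form simple linear regression; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


random.seed(0)
n = 20_000  # small stand-in for the 1,000,000 observations (assumption)
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [2.0 + 3.0 * x + random.gauss(0, 1) for x in xs]  # simulated data

# Shuffle the row indices and deal them into k disjoint subsamples.
k = 10
indices = list(range(n))
random.shuffle(indices)
fits = []
for j in range(k):
    part = indices[j::k]
    fits.append(ols_fit([xs[i] for i in part], [ys[i] for i in part]))

# Average the k coefficient estimates and compare with the full-data fit.
avg_slope = mean([slope for slope, _ in fits])
avg_intercept = mean([intercept for _, intercept in fits])
full_slope, full_intercept = ols_fit(xs, ys)
print(avg_slope, full_slope)
```

For a GLM the per-subsample fit would be an iterative optimization rather than a closed form, which is where any time saving would have to come from.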

Generating functions
I wanted to publish a post on generating functions, based on discussions I have had with Jean-Francois over coffee after lunch a couple of times already. The other reason is that I am publishing this post just as my students have finished their Probability exam (and there were a few questions on generating functions).

One Page R: A Survival Guide to Data Science with R
Many of the documents have been developed and tested whilst visiting the Shenzhen Institutes of Technology as an International Visiting Professor of the Chinese Academy of Sciences.

Programming tools: Adventures with R
A guide to the popular, free statistics and visualization software that gives scientists control of their own data analysis.

Visualizing overdispersion (with trees)
We started to discuss overdispersion when modeling claims frequency. In my previous post, I discussed computations of empirical variances with different exposure. But I used only one factor to compute classes. Of course, it is possible to use many more factors. For instance, using cartesian products of factors,

Big Data Means Big Benefits For Retailers And Consumers
Personal data is a prickly topic. Businesses want to get their hands on it, consumers want to keep a hold of it, and many of us – businesses and consumers – are still a bit unsure as to all the rules and regulations surrounding it. Or at least that’s how it seems.

The Future for Consumer Goods in the Data Economy
Data is fundamentally changing the nature of our relationships. The use of social media is now underpinning the way in which we talk to each other, while e-commerce and advertising platforms are changing the way we communicate with brands. At times, however, the consumer packaged goods (CPG) sector appears hesitant about fully engaging with the emerging data economy. While some, such as Coca-Cola, may have embraced social media, the number of CPG brands where customer relationships are primarily mediated by data, rather than via retailers, appears limited.

Python: A YouTube video history metadata crawler (ytcrawl)
ytcrawl is a crawler for the view-count history of YouTube videos (it can also crawl subscribers, shares and watch times when available). In mid-2013, YouTube began publishing videos’ daily view-count history (when uploaders make it public). These can be precious data for computational social science research. This crawler aims to help researchers efficiently download the data.

Analytics: Five Rules to Cut Through the Hype
Cut through the analytics hype by asking the right questions, discerning between value-add analytics, considering in-house and out-of-house solutions, forming an iterative analytics process, and making sure your organization uses it. Here are five rules to guide you through the analytics maze:

Interactive Simple Networks
This post isn’t anything new in terms of analysis, but just a cooler look at a previous post. I looked at board members of large companies in a previous blog post and showed via a simple network how they share board members and provided some commentary about that in terms of how board membership represents desired influence in markets.

Change Point Detection in Time Series with R and Tableau
This blog post will show how to apply such algorithms to univariate time series representing customer activity and present the results graphically. Visualizing the identified breaks provides an additional benefit for understanding customer behavior and also how those algorithms work.

Common Problems with Data
– Apostrophes
– Misspellings or multiple spellings
– Not converting currency
– Different currency formats
– Different date formats
– Using zero for null values
– Assuming a number is really a number
– Analytics software that only accepts numbers

What’s Hot & What’s Not in Data Science 2015
Interesting infographics from CrowdFlower. In the hot category, I would add data plumbing, sensor data to better predict earthquakes, weather or solar flares, predictive analytics for flu and other health or environmental issues, automating data science and man-made statistical analyses, pricing optimization for medical procedures, customized drugs, car traffic optimization via sensor data, properly trained data scientists involved in decisions and replacing business analysts, and the death of the data silo.

Interactive Visualization enabled Feature Selection and Model Creation
“A picture is worth a thousand words” or in the case of Data Science, we could say “A picture is worth a thousand statistics”. Interactive Data Visualization or Visual Analytics has become one of the top trends in transforming business intelligence (BI) as technologies based on Visual Analytics have moved into widespread use.

Starting data analysis/wrangling with R: Things I wish I’d been told
R is a very powerful open source environment for data analysis, statistics and graphing, with thousands of packages available. After my previous blog post about likert-scales and metadata in R, a few of my colleagues mentioned that they were learning R through a Coursera course on data analysis. I have been working quite intensively with R for the last half year, and thought I’d try to document and share a few tricks, and things I wish I’d known when I started out.

11 Clever Methods of Overfitting and how to avoid them
Overfitting is the bane of Data Science in the age of Big Data. John Langford reviews “clever” methods of overfitting, including traditional, parameter tweak, brittle measures, bad statistics, human-loop overfitting, and gives suggestions and directions for avoiding overfitting.
1. Traditional overfitting
2. Parameter tweak overfitting
3. Brittle measure
4. Bad statistics
5. Choice of measure
6. Incomplete Prediction
7. Human-loop overfitting
8. Data set selection
9. Reprobleming
10. Old datasets
11. Overfitting by review

Calling Scala code from R using jvmr
In previous posts I have explained why I think that Scala is a good language to use for statistical computing and data science. Despite this, R is very convenient for simple exploratory data analysis and visualisation – currently more convenient than Scala. I explained in my recent talk at the RSS what (relatively straightforward) things would need to be developed for Scala in order to make R completely redundant, but for the short term at least, it seems likely that I will need to use both R and Scala for my day-to-day work.

Big Data Enters 2015
As we enter the new and exciting year 2015, the big data industry is well poised to achieve some truly great things. Our friends at DataRPM put together the compelling infographic below, “The Dawn of Smart Enterprise,” to highlight their predictions for how big data will advance in the coming year.

Talking Machines
Human conversations about machine learning.

Querying the Bitcoin blockchain with R
The crypto-currency Bitcoin and the way it generates “trustless trust” is one of the hottest topics in technological innovation right now. The way Bitcoin transactions always trace back through the whole transaction list to the first discovered block (the Genesis block) works not only for finance. The first startups, such as Blockstream, are already working on ways to use this mechanism of “trustless trust” (i.e. you can trust the system without having to trust the participants) in related fields such as corporate equity.

The Data Science Skills Network
As a data scientist, I am usually heads down in numbers, patterns, and code, but as crazy as it sounds, one of the hardest parts of my job is actually describing what I do. There are plenty of resources that offer descriptions and guides on the career of a data scientist. I’ve heard them described as those at the intersection of statistics, hacking abilities, and domain expertise. Or, as data analysts who live in San Francisco. Rather than add a new definition to the collection, I thought I’d take a data-centric approach towards defining the role. I looked at what skills people with the title “Data Scientist” have listed on their LinkedIn profiles and aggregated the top ten by occurrence.

Get started with Hadoop and Spark in 10 minutes
This is where using a tool like Vagrant can be very useful. Vagrant uses an easily configurable workflow to automate the setup of development environments. With simple commands like vagrant up and a single file that describes the type of machine you want, the software that needs to be installed and the way the machine can be accessed, configuring and setting up multiple VMs for a cluster is extremely easy. A list of available configurations for development boxes can be accessed at

Debugging Parallel Code with dbs()
You can use dbs() on any Unix-family platform, such as Macs, Linux or Cygwin.

Canonical Correlation Analysis on Imaging
In imaging, we deal with multivariate data, such as arrays with several spectral bands. Trying to interpret the correlations across these dimensions is very challenging, if not impossible. For example, recall the number of spectral bands in the AVIRIS data we used in the previous post. There are 152 bands, so in total there are 152 · 152 = 23104 correlations of pairs of random variables. How would you interpret that huge number of correlations? To tackle this, it might be better to group these variables into two sets and study the relationship between the sets. Such a statistical procedure can be carried out using canonical correlation analysis (CCA).
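To make the counting concrete, here is a small Python sketch. The 152 · 152 figure counts every cell of the correlation matrix, including each band paired with itself and each pair counted twice; the number of distinct pairs of different bands is smaller. The 76/76 split is purely hypothetical, chosen only to illustrate grouping the variables into two sets.

```python
from math import comb

bands = 152

all_cells = bands * bands        # every cell of the 152 x 152 correlation matrix
distinct_pairs = comb(bands, 2)  # unordered pairs of two different bands

# Hypothetical split into two sets of variables (an assumption, not from the post).
p, q = 76, 76
cross_correlations = p * q       # correlations between one band from each set
max_canonical = min(p, q)        # CCA yields at most min(p, q) canonical correlations

print(all_cells, distinct_pairs, cross_correlations, max_canonical)
# prints: 23104 11476 5776 76
```

Whichever split is used, CCA condenses the cross-correlations into a much shorter list of canonical correlations, which is what makes the relationship between the two sets interpretable.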

It’s Not About Big Data, It’s About More Data Sources
Big Data is a big term that’s having a moment right now. But while everyone is getting more and more eager to embrace the wonders of lots and lots of information, businesses should be aware that getting ready for Big Data is about more than just preparing for an onslaught of terabytes. What’s actually going to be the most distinctive part of this trend will be not the size but rather the sources of data. Businesses, take note: gaining a competitive edge will be about understanding and utilizing a multitude of new data sources to glean novel insights.

Causation vs Correlation: Visualization, Statistics, and Intuition
Visualizations of correlation vs. causation and some common pitfalls and insights involving the statistics are explored in this case study involving stock price time series.

pomegranate: Graphical models for Python, implemented in Cython for speed.
pomegranate is a package for graphical models and Bayesian statistics for Python, implemented in Cython. It grew out of the YAHMM package, where many of the components used could be rearranged to do other cool things. It currently supports:
• Probability Distributions
• Finite State Machines
• Hidden Markov Models
• Discrete Bayesian Networks

Estimation in sequential A/B-tests
In this post I’m going to explain some problems regarding estimation in sequential analysis tests in general, and how they can be solved.

Why CIOs Should Turn To Cloud Based Data Analysis in 2015
Here are the 5 biggest reported advantages of choosing the cloud for data analysis:
1. Speed – Faster Time to Market
2. Extensibility – Adjusting to Change
3. Cost – Lower, Cheaper
4. Risk Mitigation – Changing Technological Landscape
5. IT as the Enabler

Automatic Detection of the Language of a Tweet
Two days ago, in my post on automatically extracting my own tweets and generating an HTML list, I mentioned that it would be great if there were a function that could distinguish tweets in English from tweets in French (usually, I tweet in one of those two languages).

Customer Data Quality: What Is the Value-at-Risk?
If you work in financial services or, more specifically, Capital Markets, then you are likely to be familiar with the concept of VaR or “value at risk.” VaR is a statistical calculation used in finance to provide a quantifiable measure of the financial risk that an asset (or a portfolio of assets) will decline in value. The details of VaR can be complicated, but at a high level it estimates the loss on an investment that will not be exceeded over a given time period at a given confidence level; larger losses are expected only with the remaining small probability. VaR is not a perfect tool – after all, what form of predictive analytics is?
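As a rough illustration of the high-level idea, the Python sketch below computes VaR by historical simulation with a nearest-rank quantile. This is only one of several ways to compute VaR; the function name and the return series are made up for the example and do not come from the article.

```python
import math


def historical_var(returns, confidence_pct=95):
    """Loss that the historical returns did not exceed in `confidence_pct`% of periods."""
    # Turn returns into losses (a return of -0.02 is a loss of 0.02) and sort ascending.
    losses = sorted(-r for r in returns)
    # Nearest-rank quantile: the smallest loss covering confidence_pct% of the sample.
    rank = math.ceil(confidence_pct * len(losses) / 100)
    return losses[rank - 1]


# Twenty hypothetical daily returns (invented for illustration).
daily_returns = [
    0.01, -0.02, 0.003, -0.015, 0.007, -0.03, 0.012, 0.004, -0.008, 0.02,
    -0.001, 0.006, -0.025, 0.009, 0.0, -0.012, 0.015, -0.004, 0.011, -0.018,
]

var_95 = historical_var(daily_returns, 95)
print(var_95)  # prints: 0.025
```

With twenty observations, the 95% figure is the second-largest historical loss: on 19 of the 20 days the loss was no worse than 2.5%.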