Some thoughts on Economics, Mathematics, Econometrics, Statistics, Machine Learning, etc

There have been a lot of posts on these topics recently, starting with Noah Smith’s piece “Economics has a Math Problem” and, more recently, “Econometrics, Math, and Machine Learning…what?” by Matt Bogard. I don’t (yet) have a clear view on these issues, but there are a few thoughts that I wanted to share. I did not really want to, but I was asked on Twitter, and I thought it might be good to write them down, to clarify some of my ideas and also (probably, hopefully) to get some interesting feedback.

Nest Thermostat and R – Creating a Shiny dashboard

In this post, we’re going to use the Nest Home Simulator, the statistical programming language R, and the Shiny package, which lets us create awesome web interfaces for managing data. We’re also going to need the rjson package.

Roll Your Own Stats and Geoms in ggplot2 (Part 1: Splines!)

A huge change is coming to ggplot2, and you can get a preview of it over at Hadley’s GitHub repo. I’ve been keenly interested in this, as I will be fixing, finishing & porting coord_proj to it once it’s done.

5 Things Your Boss Wants to Know About DaaS

1. What is Data-as-a-Service anyway?
2. Who is using DaaS?
3. What are the components?
4. Show me the numbers!
5. How do we get started?

Straightening Loops: How to Vectorize Data Aggregation with pandas and NumPy

Consider the for loop. You’ve probably been writing them since the first day of your programming career. It works pretty well for pulling things out of arrays, aside from the occasional indexing error at the end of the list. Python even gives you sweet syntactic sugar such as for ... in and list comprehensions to make loops easier. But there’s a problem with for loops, especially when iterating through the large data structures typical of data science applications: they’re often very slow. This notebook demonstrates alternatives to loops that offer performance and readability improvements of multiple orders of magnitude. It compares native Python loop performance to NumPy and pandas vectorized operations and provides recipes for performing efficient aggregation and transformation with pandas.
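
As a rough illustration of the idea (not code taken from the notebook itself), here is a minimal sketch of one aggregation, a per-key sum, written three ways: as a plain Python loop, with a pandas groupby, and with NumPy alone. The function names and the sample data are made up for this example; the relative speed of each version depends on data size and can be measured with the standard timeit module.

```python
import numpy as np
import pandas as pd


def loop_group_sum(keys, values):
    """Per-key sum using a plain Python loop (the baseline)."""
    totals = {}
    for k, v in zip(keys, values):
        totals[k] = totals.get(k, 0.0) + v
    return totals


def pandas_group_sum(keys, values):
    """Per-key sum using a vectorized pandas groupby."""
    df = pd.DataFrame({"key": keys, "value": values})
    return df.groupby("key")["value"].sum()


def numpy_group_sum(keys, values):
    """Per-key sum using NumPy only.

    np.unique maps each key to an integer code, and np.bincount
    adds up the weights (values) that share the same code.
    """
    uniq, codes = np.unique(keys, return_inverse=True)
    sums = np.bincount(codes, weights=values)
    return uniq, sums


# Hypothetical sample data, chosen only for illustration.
sample_keys = np.array([0, 1, 0, 2, 1])
sample_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

loop_result = loop_group_sum(sample_keys, sample_values)
pandas_result = pandas_group_sum(sample_keys, sample_values)
numpy_keys, numpy_sums = numpy_group_sum(sample_keys, sample_values)
```

All three versions compute the same totals; the vectorized ones push the iteration into compiled code instead of running it in the Python interpreter, which is where the speedup on large inputs comes from.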

Data Science data architecture

This article describes the data architecture that allows data scientists to do what they do best: “drive the widespread use of data in decision-making”. It is intended for various audiences: for IT admins to better understand the needs of data scientists, for data scientists to better articulate their needs, and in general for companies that are looking to set up a data science work stream. Data scientists are kind of a rare breed. Apart from data science, they need to understand business and they need to have IT hacking skills (i.e. the ability to get things working in an IT landscape; not to be confused with a penetration/exploit type of hacker). The data scientist understands more business than an IT person and more IT than a business person. The flip side: the data scientist understands less IT than an IT person and less business than a business person. With this set of skills comes the request for a specific workflow and data architecture.