Pandas Tutorial 2: Aggregation and Grouping

Let's continue with the pandas tutorial series. This is the second episode, where I'll introduce aggregation (such as min, max, sum, count, etc.) and grouping. Both are very commonly used methods in analytics and data science projects – so make sure you go through every detail in this article!

Defining data science in 2018

I got my first data science job in 2012, the year Harvard Business Review announced data scientist to be the sexiest job of the 21st century. Two years later, I published a post on my then-favourite definition of data science, as the intersection between software engineering and statistics. Unfortunately, that definition became somewhat irrelevant as more and more people jumped on the data science bandwagon – possibly to the point of making data scientist useless as a job title. However, I still call myself a data scientist. Even better – I still get paid for being a data scientist. But what does it mean? What do I actually do here? This article is a short summary of my understanding of the definition of data science in 2018.

Data science is science’s second chance to get causal inference right.

Causal inference from observational data is the goal of many data analyses in the health and social sciences. However, academic statistics has often frowned upon data analyses with a causal objective. The introduction of the term ‘data science’ provides a historical opportunity to redefine data analysis in such a way that it naturally accommodates causal inference from observational data. Like others before, we organize the scientific contributions of data science into three classes of tasks: description, prediction, and causal inference. An explicit classification of data science tasks is necessary to discuss the data, assumptions, and analytics required to successfully accomplish each task. We argue that a failure to adequately describe the role of subject-matter expert knowledge in data analysis is a source of widespread misunderstandings about data science. Specifically, causal analyses typically require not only good data and algorithms, but also domain expert knowledge. We discuss the implications for the use of data science to guide decision-making in the real world and to train data scientists.

Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects

If you work with big data sets, you probably remember the ‘aha’ moment along your Python journey when you discovered the Pandas library. Pandas is a game-changer for data science and analytics, particularly if you came to Python because you were searching for something more powerful than Excel and VBA.

A Scikit-learn pipeline in Wallaroo

While it would seem that machine learning is taking over the world, a lot of the attention has been focused towards researching new methods and applications, and how to make a single model faster. At Wallaroo Labs we believe that, to make the benefits of machine learning ubiquitous, there needs to be a significant improvement in how we put those impressive models into production. This is where the stream computing paradigm becomes useful: as for any other type of computation, we can use streaming to apply machine learning models to a large quantity of incoming data, using available techniques in distributed computing. Nowadays, many applications with streaming data are either applying machine learning or have a use case for it. In this example, we will explore how we can build a machine learning pipeline inside Wallaroo, our high-performance stream processing engine, to classify images from the MNIST dataset, using a basic two-stage model in Python. While recognizing hand-written digits is a practically solved problem, even a simple example like the one we are presenting provides a real use case (imagine automated cheque reading in a large bank), and the same setup can be used as a starting point for virtually any machine learning application – just replace the model.

AI and the Emerging Crisis of Trust

Earlier this month, a newspaper in Ohio invited its Facebook followers to read the Declaration of Independence, which it posted in 12 bite-sized chunks in the days leading up to July 4. The first nine snippets posted fine, but the 10th was held up after Facebook flagged the post as ‘hate speech.’ Apparently, the company's algorithms didn't appreciate Thomas Jefferson's use of the term ‘Indian Savages.’ It's a small incident, but it highlights a larger point about the use of artificial intelligence and machine learning. Besides being used to filter content, these technologies are making their way into all aspects of life, from self-driving cars to medical diagnoses and even prison sentencing. It doesn't matter how well the technology works on paper: if people don't have confidence that AI is trustworthy and effective, it will not be able to flourish.

Building A Data Science Product in 10 Days

At startups, we often have the chance to create products from scratch. In this article, I'll share how to quickly build valuable data science products, using my first project at Instacart as an example. Here is the problem. After adding items to the shopping cart on Instacart, a customer can select a delivery window during checkout (illustrated in Figure 1). Then, an Instacart shopper would try to deliver the groceries to the customer within the window. During peak times, our system often accepted more orders than our shoppers could handle, and some orders would be delivered late. We decided to leverage data science to address the lateness issue. The idea was to use data science models to estimate the delivery capacity for each window, and a window would be closed when the number of orders placed reached its capacity.
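
The window-closing rule described above can be sketched in a few lines of Python. This is only an illustration, not Instacart's implementation: the `DeliveryWindow` class, its field names, and the hard-coded capacity are assumptions made for the example (in the article the capacity comes from a model estimate).

```python
from dataclasses import dataclass


@dataclass
class DeliveryWindow:
    """Hypothetical delivery window with a fixed order capacity."""
    label: str
    capacity: int          # assumed to come from a capacity model; hard-coded here
    orders_placed: int = 0

    @property
    def is_open(self) -> bool:
        # The window stays open only while it has spare capacity.
        return self.orders_placed < self.capacity

    def place_order(self) -> bool:
        """Accept the order if the window is open; otherwise reject it."""
        if not self.is_open:
            return False
        self.orders_placed += 1
        return True


window = DeliveryWindow(label="5pm-6pm", capacity=2)
results = [window.place_order() for _ in range(3)]
print(results, window.is_open)
```

With a capacity of two, the third order is rejected and the window reports itself as closed, which is the behaviour the excerpt describes at checkout.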

Antipredator behavior with R (or why wildebeest should stay together)

In 2010, when I was studying for my biology degree at Universidad Complutense in Madrid, I fell in love with a documentary miniseries called Great Migrations (National Geographic). Its episodes cover awesome migrations of animals around the globe. One of these chapters is ‘Great Migrations: Science Of Migrations’. It shows how scientists study the patterns and processes of animal migration. One of these researchers is John Fryxell, from the University of Guelph in Canada. John explains how mathematical models can help to understand movement patterns. Through simulations, he shows how wildebeests maintaining a clustered movement pattern can more effectively avoid being hunted by a virtual lion (here is the Spanish version on YouTube). I have to confess that each time I see those images, I think that I would be completely fulfilled with this kind of job.

REST APIs and Plumber

Moving R resources from development to production can be a challenge, especially when the resource isn't something like a shiny application or rmarkdown document that can be easily published and consumed. Consider, as an example, a customer success model created in R. This model is responsible for taking customer data and returning a predicted outcome, like the likelihood the customer will churn. Once this model is developed and validated, there needs to be some way for the model output to be leveraged by other systems and individuals within the company.

New TDWI Research Report Explores How Organizations Using Predictive Analytics Are Making It Work

TDWI Research has released its newest Best Practices Report, Practical Predictive Analytics. This original, survey-based report looks at how organizations using the technology are making it work and how those exploring the technology are planning to implement it. It looks at the organizational, technology, process, and deployment challenges enterprises face and offers best practices and recommendations for success. According to the TDWI survey, predictive analytics is on the cusp of widespread adoption, but it remains elusive. Previous research, together with these survey results, indicates that had users stuck to their plans, three-fourths of them would have already adopted predictive analytics. In reality, slightly more than a third have done so. Report author Fern Halper, vice president and senior director of TDWI Research for advanced analytics, points to three practical considerations for making predictive analytics efforts successful: skills for predictive modeling, planning for model deployment, and infrastructure. Lack of skills ranked as the biggest barrier to adoption. She explains that to address this challenge many enterprises are looking to increase the skills of their employees as well as use some of the new breed of automated, easy-to-use predictive analytics tools that contain embedded intelligence. For those enterprises that use predictive analytics, model development and deployment remain challenging. The report also delves into the use cases for predictive analytics and the new technologies (including automation and open source) that assist in predictive analytics and machine learning. Best practices, including understanding business problems and maintaining high data quality, are also explored.

An Introduction to Mathematical Optimal Control Theory Version 0.2

In the words of one HN commenter, machine learning and OCT are attempting to solve the same problem: choose the optimal action to take at the current time for a given process. Control theorists normally start out with a model, or a family of potential models that describe the behavior of the process, and work from there to determine the optimal action. This is very much an area of applied mathematics, and academics take rigorous approaches, but, in industry, many engineers just use a PID or LQR controller and call it a day, regardless of how applicable they are to the actual system theoretically. Meanwhile, the reinforcement learning folk typically work on problems where the models are too complicated to work with computationally or often even to write down, so a more tractable approach is to learn a model and control policy from data.
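
For readers who have not met the PID controller mentioned above, here is a minimal discrete-time sketch in Python. The gains, the time step, and the toy first-order plant (dx/dt = -x + u) are assumptions chosen purely for illustration; they do not come from the text being introduced.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID controller (illustrative gains, no anti-windup)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, setpoint: float, measurement: float, dt: float) -> float:
        error = setpoint - measurement
        self.integral += error * dt                  # accumulate past error
        derivative = (error - self.prev_error) / dt  # rate of change of error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Assumed toy plant: dx/dt = -x + u, integrated with forward Euler.
controller = PID(kp=2.0, ki=1.0, kd=0.05)
setpoint, x, dt = 1.0, 0.0, 0.01
for _ in range(2000):  # simulate 20 seconds
    u = controller.step(setpoint, x, dt)
    x += (-x + u) * dt

print(f"final state: {x:.4f}")
```

The controller needs no model of the plant, only the measured error, which is part of why it is applied so widely even where theory says little about its optimality.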

Data Transfer Project

The Data Transfer Project was formed in 2017 to create an open-source, service-to-service data portability platform so that all individuals across the web could easily move their data between online service providers whenever they want. The contributors to the Data Transfer Project believe portability and interoperability are central to innovation. Making it easier for individuals to choose among services facilitates competition, empowers individuals to try new services and enables them to choose the offering that best suits their needs.
Data Transfer Project (DTP) is a collaboration of organizations committed to building a common framework with open-source code that can connect any two online service providers, enabling a seamless, direct, user-initiated portability of data between the two platforms.
The Data Transfer Project uses services' existing APIs and authorization mechanisms to access data. It then uses service-specific adapters to transfer that data into a common format, and then back into the new service's API.
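
The adapter idea in the last paragraph can be sketched as follows. Everything here is hypothetical: the `Photo` common format, the `ServiceAExporter` and `ServiceBImporter` classes, and the record layouts are invented for the example and are not the Data Transfer Project's actual interfaces.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Photo:
    """Hypothetical common format shared by all adapters."""
    title: str
    url: str


class Exporter(Protocol):
    def export(self) -> List[Photo]: ...


class Importer(Protocol):
    def import_items(self, items: List[Photo]) -> int: ...


class ServiceAExporter:
    """Adapter that converts service A's records into the common format."""

    def __init__(self, records: List[Dict[str, str]]) -> None:
        self.records = records  # stands in for data fetched from service A's API

    def export(self) -> List[Photo]:
        return [Photo(title=r["name"], url=r["link"]) for r in self.records]


class ServiceBImporter:
    """Adapter that converts the common format into service B's records."""

    def __init__(self) -> None:
        self.stored: List[Dict[str, str]] = []  # stands in for service B's API

    def import_items(self, items: List[Photo]) -> int:
        for item in items:
            self.stored.append({"caption": item.title, "src": item.url})
        return len(items)


def transfer(exporter: Exporter, importer: Importer) -> int:
    """Move data from one service to another via the common format."""
    return importer.import_items(exporter.export())


exporter = ServiceAExporter([
    {"name": "Beach", "link": "https://example.com/1.jpg"},
    {"name": "Forest", "link": "https://example.com/2.jpg"},
])
importer = ServiceBImporter()
count = transfer(exporter, importer)
print(count, importer.stored)
```

Because each service only needs one adapter to and from the common format, adding a new service does not require writing a converter for every existing pair.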


Veneur

Veneur is a distributed, fault-tolerant pipeline for runtime data. It provides a server implementation of the DogStatsD protocol or SSF for aggregating metrics and sending them downstream to one or more supported sinks. It can also act as a global aggregator for histograms, sets and counters. More generically, Veneur is a convenient sink for various observability primitives with lots of outputs! See also:
• A unified, standard format for observability primitives, the SSF
• A proxy for resilient distributed aggregation, veneur-proxy
• A command line tool for emitting metrics, veneur-emit
• A poller for scraping Prometheus metrics, veneur-prometheus
• The sinks supported by Veneur
We wanted percentiles, histograms and sets to be global. We wanted to unify our observability clients, be vendor agnostic and build automatic features like SLI measurement. Veneur helps us do all this and more!
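
To make the aggregation idea concrete, the sketch below formats metrics in the DogStatsD text format (`name:value|type|#tags`) and sums counters locally. It is a toy illustration rather than Veneur's code: the helper names `format_metric` and `aggregate_counters` are made up, and sample rates, sets and histograms are deliberately ignored.

```python
from collections import Counter
from typing import Iterable, List, Optional


def format_metric(name: str, value: int, metric_type: str,
                  tags: Optional[List[str]] = None) -> str:
    """Build one DogStatsD-style line, e.g. 'requests:1|c|#env:prod'."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        line += "|#" + ",".join(tags)
    return line


def aggregate_counters(lines: Iterable[str]) -> Counter:
    """Sum counter ('c') metrics by name; other metric types are skipped."""
    totals: Counter = Counter()
    for line in lines:
        head, metric_type, *_ = line.split("|")
        if metric_type != "c":
            continue
        name, value = head.split(":", 1)
        totals[name] += int(value)
    return totals


lines = [
    format_metric("checkout.requests", 1, "c", ["env:prod"]),
    format_metric("checkout.requests", 2, "c", ["env:prod"]),
    format_metric("checkout.latency_ms", 120, "h"),
]
totals = aggregate_counters(lines)
print(lines[0], dict(totals))
```

A real aggregator would receive such lines over the network from many hosts and flush the combined totals to its sinks at a fixed interval.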

The five Cs

What does it take to build a good data product or service? Not just a product or service that's useful, or one that's commercially viable, but one that uses data ethically and responsibly. We often talk about a product's technology or its user experience, but we rarely talk about how to build a data product in a responsible way that puts the user in the center of the conversation. Those products are badly needed. News that people ‘don't trust’ the data products they use – or that use them – is common. While Facebook has received the most coverage, lack of trust isn't limited to a single platform. Lack of trust extends to nearly every consumer internet company, to large traditional retailers, and to data collectors and brokers in industry and government.
The five Cs:
• Consent
• Clarity
• Consistency and Trust
• Control and Transparency
• Consequences

American Statistical Association Statement on The Role of Statistics in Data Science

The rise of data science, including big data and data analytics, has recently attracted enormous attention in the popular press for its spectacular contributions in a wide range of scholarly disciplines and commercial endeavors. These successes are largely the fruit of the innovative and entrepreneurial spirit that characterizes this burgeoning field. Nonetheless, its interdisciplinary nature means that a substantial collaborative effort is needed for it to realize its full potential for productivity and innovation. While there is not yet a consensus on what precisely constitutes data science, three professional communities, all within computer science and/or statistics, are emerging as foundational to data science: (i) Database Management enables transformation, conglomeration, and organization of data resources; (ii) Statistics and Machine Learning convert data into knowledge; and (iii) Distributed and Parallel Systems provide the computational infrastructure to carry out data analysis. Certainly, data science intersects with numerous other disciplines and areas of research. Indeed it is difficult to think of an area of science, industry, commerce, or government that is not in some way involved in the data revolution. But it is databases, statistics, and distributed systems that provide the core pipeline. At its most fundamental level, we view data science as a mutually beneficial collaboration among these three professional communities, complemented with significant interactions with numerous related disciplines. For data science to fully realize its potential requires maximum and multifaceted collaboration among these groups. Statistics and machine learning play a central role in data science. Framing questions statistically allows us to leverage data resources to extract knowledge and obtain better answers.
The central dogma of statistical inference, that there is a component of randomness in data, enables researchers to formulate questions in terms of underlying processes and to quantify uncertainty in their answers. A statistical framework allows researchers to distinguish between causation and correlation and thus to identify interventions that will cause changes in outcomes. It also allows them to establish methods for prediction and estimation, to quantify their degree of certainty, and to do all of this using algorithms that exhibit predictable and reproducible behavior. In this way, statistical methods aim to focus attention on findings that can be reproduced by other researchers with different data resources. Simply put, statistical methods allow researchers to accumulate knowledge.

Theoretical Impediments to Machine Learning

Current machine learning systems operate, almost exclusively, in a statistical, or model-free mode, which entails severe theoretical limits on their power and performance. Such systems cannot reason about interventions and retrospection and, therefore, cannot serve as the basis for strong AI. To achieve human level intelligence, learning machines need the guidance of a model of reality, similar to the ones used in causal inference tasks. To demonstrate the essential role of such models, I will present a summary of seven tasks which are beyond reach of current machine learning systems and which have been accomplished using the tools of causal modeling.