Using neural networks to detect car crashes in dashcam footage

Tasks that humans take for granted are often difficult for machines. That's why CAPTCHA tests ask you ridiculously simple questions to prove you're human, e.g., whether an image contains a road sign, or which images in a grid contain food (see Moravec's Paradox). These tests work precisely because image recognition in context is difficult for machines, and training computers to answer such questions accurately, automatically, and at scale is complicated. To get around this, companies like Facebook and Amazon spend a lot of money on manual image and video classification. TechRepublic suggests that manual labeling of data may be the "future blue-collar job", something that companies like Facebook already do to curate newsfeed stories. Reviewing millions of images and videos by hand to identify certain types of content is extremely tedious and expensive, yet few techniques exist to analyze image and video content efficiently in an automated way. In this post, I describe how, as a Fellow for Insight Data Science, I built a machine learning classifier (Crash Catcher!) that employs a hierarchical recurrent neural network to isolate specific, relevant content from millions of hours of video: it reviews dashboard camera footage and identifies whether or not a car crash occurs. For businesses that may have millions of hours of video to sift through (for instance, an auto insurance company), a tool like this automatically extracts the important, relevant content.
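The post names the architecture only as a "hierarchical recurrent neural network", without giving its details. As a minimal sketch of the two-level idea — a lower RNN summarizes frames within a chunk, an upper RNN summarizes the sequence of chunk summaries, and a final layer scores the video — here is a NumPy forward pass with untrained random weights and made-up dimensions, for structure only:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_encode(seq, W_in, W_rec):
    """Run a plain tanh RNN over seq; return the final hidden state."""
    h = np.zeros(W_rec.shape[0])
    for x in seq:
        h = np.tanh(W_in @ x + W_rec @ h)
    return h

# Hypothetical sizes: 16-dim frame features, 8-dim hidden states.
FRAME_DIM, HID = 16, 8
W_in1, W_rec1 = rng.normal(size=(HID, FRAME_DIM)), rng.normal(size=(HID, HID))
W_in2, W_rec2 = rng.normal(size=(HID, HID)), rng.normal(size=(HID, HID))
w_out = rng.normal(size=HID)

def classify_video(video):
    """video: list of chunks, each chunk a list of frame feature vectors.
    Lower level encodes frames within each chunk; upper level encodes
    the sequence of chunk codes; a sigmoid scores 'crash'."""
    chunk_codes = [rnn_encode(chunk, W_in1, W_rec1) for chunk in video]
    video_code = rnn_encode(chunk_codes, W_in2, W_rec2)
    return 1.0 / (1.0 + np.exp(-(w_out @ video_code)))  # P(crash)

# Fake "video": 3 chunks of 5 frames of random features.
video = [[rng.normal(size=FRAME_DIM) for _ in range(5)] for _ in range(3)]
p_crash = classify_video(video)
```

In a real system the frame features would come from a pretrained CNN and the weights would be learned from labeled footage; this only shows how the hierarchy composes.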

You might want to hire a Data Engineer instead of a Data Scientist

In this post, we’ll look at
• which aspects are required to develop a data-driven software product,
• how data scientists and data engineers fit these aspects,
• how to detect if your team needs a data engineer, and
• how to find a data engineer for your team, if you need one.

Drawing 10 Million Points With ggplot: Clifford Attractors

From a technical point of view, the challenge is creating a data frame with all locations, since it must have 10 million rows and must be populated sequentially. A very fast way to do this is with the Rcpp package. To render the plot I use ggplot, which works quite well. Here is the code to play with Clifford attractors if you want …
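The post does the heavy lifting in Rcpp precisely because the iteration is sequential; as a language-neutral sketch of what that iteration computes, here is the Clifford map in NumPy (parameter values are illustrative, and far fewer points are generated than the post's 10 million):

```python
import numpy as np

def clifford(n, a=-1.24458, b=-1.25191, c=-1.81590, d=-1.90866):
    """Iterate the Clifford attractor map, returning n (x, y) points:
       x' = sin(a*y) + c*cos(a*x),  y' = sin(b*x) + d*cos(b*y)."""
    xs, ys = np.empty(n), np.empty(n)
    x = y = 0.0
    for i in range(n):
        x, y = (np.sin(a * y) + c * np.cos(a * x),
                np.sin(b * x) + d * np.cos(b * y))
        xs[i], ys[i] = x, y
    return xs, ys

xs, ys = clifford(100_000)
# By construction |x| <= 1 + |c| and |y| <= 1 + |d|,
# so the cloud of points stays inside a bounded box.
```

Each point depends on the previous one, which is why a vectorized fill is impossible and a compiled loop (Rcpp in the post) pays off at 10 million rows.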

The Practical Guide to Managing Data Science at Scale

Lessons from the field on managing data science projects and portfolios. The ability to manage, scale, and accelerate an entire data science discipline increasingly separates successful organizations from those falling victim to hype and disillusionment. Data science managers have the most important and least understood job of the 21st century. This paper demystifies and elevates the current state of data science management. It identifies best practices to address common struggles around stakeholder alignment, the pace of model delivery, and the measurement of impact.

Automated root cause analysis for Spark application failures

Spark’s simple programming constructs and powerful execution engine have brought a diverse set of users to its platform. Many new big data applications are being built with Spark in fields like health care, genomics, financial services, self-driving technology, government, and media. Things are not so rosy, however, when a Spark application fails.

Excel vs R: A Brief Introduction to R

Quantitative research often begins with the humble process of counting. Historical documents are never as plentiful as a historian would wish, but counting words, material objects, court cases, etc. can lead to a better understanding of the sources and the subject under study. When beginning the process of counting, the first instinct is to open a spreadsheet. The end result might be the production of tables and charts created in the very same spreadsheet document. In this post, I want to show why this spreadsheet-centric workflow is problematic and recommend the use of a programming language such as R as an alternative for both analyzing and visualizing data. There is no doubt that the learning curve for R is much steeper than that for producing one or two charts in a spreadsheet. However, there are real long-term advantages to learning a dedicated data analysis tool like R. Such advice to learn a programming language can seem both daunting and vague, especially if you do not really understand what it means to code. For this reason, after discussing why it is preferable to analyze data with R instead of a spreadsheet program, this post provides a brief introduction to R, as well as an example of analysis and visualization of historical data with R.

Yield gap analysis of US rice production systems shows opportunities for improvement

Data were processed using the R statistics program (R Core Team, 2015). Spatial aggregation and visualization were accomplished using the following packages for R: ‘raster’ (Hijmans, 2015), ‘sp’ (Bivand et al., 2013; Pebesma and Bivand, 2005), ‘rgeos’ (Bivand and Rundel, 2016), ‘rgdal’ (Bivand et al., 2015), ‘RColorBrewer’ (Neuwirth, 2014), ‘maps’ (Becker et al., 2016), ‘mapproj’ (McIlroy et al., 2015), and ‘maptools’ (Bivand and Lewin-Koh, 2016). Data analyses and regressions utilized the stan_glm() function in the ‘rstanarm’ package (Gabry and Goodrich, 2016), an interface to the Stan probabilistic programming language (Stan Development Team, 2016). All regressions followed standard recommendations and used weakly informative normal priors to regulate estimates (Gelman et al., 2013). In cases where effects were estimated at the state or zone level, multi-level models were utilized (Gelman and Hill, 2007), with states and zones representing two levels of hierarchy in the effects. Model assumptions and fit were assessed using diagnostic statistics and posterior predictive checks. Credible intervals were calculated as the central 95% quantile interval of the posterior distributions. All data, model files, and R code are publicly available through the Open Science Framework (https://…/; Espe, 2016).
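The credible-interval step at the end is simple to illustrate: once a Bayesian fit yields posterior draws for a coefficient, the central 95% interval is just the 2.5% and 97.5% quantiles of those draws. A small sketch, using synthetic normal draws in place of a real stan_glm() posterior (the location 0.5 and scale 0.1 are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for 4000 posterior draws of a regression coefficient;
# in the paper these would come from the rstanarm stan_glm() fit.
posterior = rng.normal(loc=0.5, scale=0.1, size=4000)

# Central 95% credible interval: 2.5% and 97.5% posterior quantiles.
lo, hi = np.quantile(posterior, [0.025, 0.975])
```

Unlike a frequentist confidence interval, this interval is a direct probability statement about the parameter given the data and priors.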

Predict Employee Turnover With Python

This post presents a reference implementation of an employee turnover analysis project built with Python's scikit-learn library. We introduce Logistic Regression, Random Forest, and Support Vector Machine models, measure their accuracy, and assess directions for further development. And we do all of the above in Python. Let's get started!
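The post builds its models with scikit-learn; as a dependency-light sketch of the same pipeline shape — synthetic data, train/test split, fit, accuracy — here is logistic regression trained by gradient descent. Every number below (feature means, sample sizes, learning rate) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for HR data: two features (say, satisfaction
# score and years of tenure); label 1 means the employee left.
n = 200
X_stay = rng.normal([0.7, 3.0], 0.15, size=(n, 2))
X_left = rng.normal([0.3, 1.0], 0.15, size=(n, 2))
X = np.vstack([X_stay, X_left])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Shuffled train/test split: 300 train rows, 100 test rows.
idx = rng.permutation(2 * n)
train, test = idx[:300], idx[300:]

# Logistic regression fit by plain gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X[train] @ w + b)))
    w -= 0.5 * (X[train].T @ (p - y[train]) / len(train))
    b -= 0.5 * np.mean(p - y[train])

# Evaluate on held-out rows.
pred = (1.0 / (1.0 + np.exp(-(X[test] @ w + b))) > 0.5).astype(float)
accuracy = np.mean(pred == y[test])
```

With scikit-learn the middle section collapses to `LogisticRegression().fit(...)`, and swapping in a random forest or SVM is a one-line change — which is exactly why the post compares those three models.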

Interpreting Machine Learning Models: An Overview

This post summarizes the contents of a recent O’Reilly article outlining a number of methods for interpreting machine learning models, beyond the usual go-to measures.
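As one concrete example of going beyond a single accuracy number, here is a sketch of permutation importance, a common model-agnostic interpretation technique: shuffle one feature column and see how much the model's accuracy drops. The "model" below is a hand-made stand-in, not a fitted learner:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: the label depends on feature 0 only; feature 1 is noise.
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(float)

def model(X):
    """Stand-in for a fitted classifier that uses only feature 0."""
    return (X[:, 0] > 0).astype(float)

def permutation_importance(model, X, y, col, rng):
    """Accuracy drop when one feature column is shuffled."""
    base = np.mean(model(X) == y)
    Xp = X.copy()
    Xp[:, col] = rng.permutation(Xp[:, col])
    return base - np.mean(model(Xp) == y)

imp0 = permutation_importance(model, X, y, 0, rng)
imp1 = permutation_importance(model, X, y, 1, rng)
# Shuffling the informative feature hurts accuracy badly;
# shuffling the noise feature changes nothing.
```

Because the method only needs predictions, it works identically on a linear model, a gradient-boosted ensemble, or a neural network, which is what makes it useful for the black-box models the article discusses.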

Practical Machine Learning with R and Python – Part 5

This is the 5th and probably penultimate part of my series on ‘Practical Machine Learning with R and Python’.