How Vertically Integrated AI Stacks Will Affect IT Organizations

You might be asking yourself, ‘What is a vertically integrated AI stack?’ The short answer is that there is no perfect definition, but there are a few good starting points that add clarity to the discussion. Some of the popular raw ingredients of a vertically integrated AI stack are data, hardware, the machine learning framework, and the cloud platform.

Turbocharging Analytics at Uber with our Data Science Workbench

Before deciding to build our data science workbench, we evaluated multiple third-party solutions and determined that they could not easily scale to the number of users or the volume of data we anticipated on the platform, nor would they integrate well with Uber’s internal data tools and platforms. We also realized that building our own platform would enable us to target specific use cases, such as geospatial analytics, custom visualization, integration with Michelangelo (our machine learning framework), and deep learning. We concluded that our best option was to build an in-house solution. Before we began development, however, we needed to better understand user needs.

The Problem with Percentiles – Aggregation brings Aggravation

Percentiles have become one of the primary service level indicators used to represent the performance of real systems in monitoring. When used correctly, they provide a robust metric that can serve as the basis of mission-critical service level objectives. However, there’s a reason for the ‘when used correctly’ above.
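One classic example of using percentiles incorrectly is aggregating them: the average of two services’ p99 latencies is not the p99 of their combined traffic. A minimal sketch on synthetic data (the latency distributions and variable names are my own, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical latency samples (ms) from two services with different profiles.
svc_fast = rng.exponential(scale=20, size=10_000)   # fast service
svc_slow = rng.exponential(scale=200, size=10_000)  # slow service

# Wrong: average the per-service p99 values.
naive_p99 = (np.percentile(svc_fast, 99) + np.percentile(svc_slow, 99)) / 2

# Right: compute p99 over the pooled raw samples.
true_p99 = np.percentile(np.concatenate([svc_fast, svc_slow]), 99)

print(f"averaged p99: {naive_p99:.1f} ms")
print(f"pooled   p99: {true_p99:.1f} ms")
```

The two numbers disagree badly, because a percentile is an order statistic of the whole sample, not a quantity you can combine after the fact; to aggregate correctly you need the raw data (or a mergeable summary such as a histogram).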

Tesseract 4 is here! State of the art OCR in R!

Last week Google and friends released the new major version of their OCR system: Tesseract 4. This release builds upon 2+ years of hard work and completely overhauls the internal OCR engine. From the Tesseract wiki: Tesseract 4.0 includes a new neural network-based recognition engine that delivers significantly higher accuracy (on document images) than the previous versions, in return for a significant increase in required compute power. On complex languages, however, it may actually be faster than base Tesseract. We have now also updated the R package tesseract to ship with the new Tesseract 4 on macOS and Windows. It uses the new engine by default, and the results are extremely impressive! Recognition is much more accurate than before, even without manually enhancing the image quality.

Explore Your Dataset in R

As a person who works with data, one of the most exciting activities is exploring a fresh new dataset. You want to understand which variables you have, how many records the dataset contains, how many values are missing, what the structure of each variable is, how the variables relate to one another, and more. While there is a ton you can do to get up and running, I want to show you a few simple commands that give you a fast overview of the dataset you are working with.

Suppressed data (left-censored counts)

I experiment with several different ways of handling counts in tables that have been suppressed for confidentiality, and come down in favour of multiple imputation. The mice R package helpfully lets you define your own imputation algorithm.

Coding Gradient boosted machines in 100 lines of code

There are dozens of machine learning algorithms out there. It is impossible to learn all their mechanics; however, many algorithms sprout from the most established ones, e.g. ordinary least squares, gradient boosting, support vector machines, tree-based algorithms, and neural networks. At STATWORX we discuss algorithms daily to evaluate their usefulness for a specific project or problem. In any case, understanding these core algorithms is key to most machine learning algorithms in the literature. While I like reading machine learning research papers, the maths is sometimes hard to follow. That is why I am a fan of implementing the algorithms in R by myself. Of course this means digging through the maths as well, but it lets you challenge your understanding of the algorithm directly. In my two subsequent blog posts I will introduce two machine learning algorithms in under 150 lines of R code. The algorithms will cover all core mechanics while staying very generic. You can find all code on my GitHub.
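The post itself works in R; as a language-agnostic sketch of the core idea it implements, here is a minimal gradient boosting loop in Python — squared loss, depth-one regression stumps, and shrinkage. The function names (`fit_stump`, `gbm_fit`, `gbm_predict`) and the toy data are my own, not from the post:

```python
import numpy as np

def fit_stump(x, residual):
    """Find the split on x that best fits the residual with two constants."""
    best = None
    for s in np.unique(x):
        left, right = residual[x <= s], residual[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= s, left.mean(), right.mean())
        sse = ((residual - pred) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    return best[1:]  # (split point, left value, right value)

def gbm_fit(x, y, n_trees=100, lr=0.1):
    f0 = y.mean()                            # initial constant prediction
    pred = np.full_like(y, f0, dtype=float)
    stumps = []
    for _ in range(n_trees):
        # For squared loss the negative gradient is just the residual.
        s, lv, rv = fit_stump(x, y - pred)
        pred += lr * np.where(x <= s, lv, rv)
        stumps.append((s, lv, rv))
    return f0, stumps

def gbm_predict(x, f0, stumps, lr=0.1):
    pred = np.full_like(x, f0, dtype=float)
    for s, lv, rv in stumps:
        pred += lr * np.where(x <= s, lv, rv)
    return pred

# Toy regression problem: a noisy sine curve.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, 200))
y = np.sin(x) + rng.normal(0, 0.1, 200)
f0, stumps = gbm_fit(x, y)
mse = ((y - gbm_predict(x, f0, stumps)) ** 2).mean()
```

Each round fits a weak learner to the current residuals and adds a shrunken copy of it to the ensemble; that residual-fitting loop is the mechanic the "100 lines" posts build on.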

RStudio 1.2 Preview – New Features in RStudio Pro

Today, we’re continuing our blog series on new features in RStudio 1.2. If you’d like to try these features out for yourself, you can download a preview release of RStudio Pro 1.2.

Simulated Annealing For Clustering Problems: Part 2

Hey everyone, this is the second and final part of this series. In this post, we will convert this paper into Python code and thereby gain a practical understanding of what Simulated Annealing is and how it can be used for clustering. Part 1 of this series covers the theoretical explanation of Simulated Annealing (SA) with some examples, and I recommend reading it first. If you are already familiar with SA, you should be able to follow most of this post, but you can always go back to Part 1 if anything is unclear. I’m not going to explain the complete paper, only the practical part. So, let us start with what clustering means. Clustering is an unsupervised learning technique whose goal is to attach labels to unlabeled objects based on the similarities between the objects. Let’s quickly jump into a real-world example where clustering can be used, so that you have a better understanding. Suppose there is a shopping website where you can view clothes and accessories and buy them if you like.
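To make the idea concrete before diving into the post: the skeleton of SA for clustering is an energy function (here, within-cluster sum of squares), a perturbation move (reassign one random point), Metropolis acceptance, and a cooling schedule. The sketch below is my own minimal version of that loop, not the paper’s exact algorithm, and all names and parameters are illustrative:

```python
import numpy as np

def wcss(X, labels, k):
    """Within-cluster sum of squares: the 'energy' SA minimizes."""
    total = 0.0
    for c in range(k):
        pts = X[labels == c]
        if len(pts):
            total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def sa_cluster(X, k=2, T=1.0, cooling=0.995, steps=8000, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))   # random initial assignment
    energy = wcss(X, labels, k)
    for _ in range(steps):
        i = rng.integers(len(X))               # perturb: move one point
        new_labels = labels.copy()
        new_labels[i] = rng.integers(k)
        new_energy = wcss(X, new_labels, k)
        # Metropolis rule: always accept improvements; accept worse
        # moves with probability exp(-dE / T), which shrinks as T cools.
        if new_energy < energy or rng.random() < np.exp((energy - new_energy) / T):
            labels, energy = new_labels, new_energy
        T *= cooling                           # geometric cooling schedule
    return labels, energy

# Two well-separated Gaussian blobs; SA should recover them as clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])
labels, energy = sa_cluster(X, k=2)
```

Early on, the high temperature lets the search escape bad assignments; as T decays the accept rule becomes effectively greedy and the labeling freezes into a low-energy clustering.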

Fluid concepts and creative probabilities

Concept learning through probabilistic programming.


The open-source, functional database with Complex Event Processing in JavaScript.


HarvardX is a University-wide strategic initiative, overseen by the Office of the Vice Provost for Advances in Learning, to enable faculty to build and create open online learning experiences for residential and online use, and to enable groundbreaking research in online pedagogies.