The MIT Probabilistic Computing Project

The MIT Probabilistic Computing Project aims to build software and hardware systems that augment human and machine intelligence. We are currently focused on probabilistic programming. Probabilistic programming is an emerging field that draws on probability theory, programming languages, and systems programming to provide concise, expressive languages for modeling and general-purpose inference engines that both humans and machines can use. Our research projects include BayesDB and Picture, domain-specific probabilistic programming platforms aimed at augmenting intelligence in the fields of data science and computer vision, respectively. BayesDB, which is open source and in use by organizations like the Bill & Melinda Gates Foundation and JPMorgan, lets users who lack statistics training understand the probable implications of data by writing queries in a simple, SQL-like language. Picture, a probabilistic language being developed in collaboration with Microsoft, lets users solve hard computer vision problems such as inferring 3D models of faces, human bodies and novel generic objects from single images by writing short (<50 line) computer graphics programs that generate and render random scenes. Unlike bottom-up vision algorithms, Picture programs build on prior knowledge about scene structure and produce complete 3D wireframes that people can manipulate using ordinary graphics software. The core platform for our research is Venture, an interactive platform suitable for teaching and applications in fields ranging from statistics to robotics.


Is it possible to make statistical inference broadly accessible to non-statisticians without sacrificing mathematical rigor or inference quality? E.g. ‘INFER orbit_type FROM satellites WITH CONFIDENCE 0.7’ BayesDB is a probabilistic programming platform that enables users to query the probable implications of their data as directly as SQL databases enable them to query the data itself. The default modeling assumptions that BayesDB makes are suitable for a broad class of problems, but statisticians can customize these assumptions when necessary. BayesDB also enables domain experts that lack statistical expertise to perform qualitative model checking and encode simple forms of qualitative prior knowledge.

Intuition vs Unsupervised Learning – Agglomerative Clustering in practice

Clustering is a hugely important step of exploratory data analysis and finds plenty of great applications. Typically, clustering technique will identify different groups of observations among your data. For example, if you need to perform market segmentation, cluster analysis will help you with labeling each segment so that you can evaluate each segment’s potential and target some attractive segments. Therefore, your marketing program and positioning strategy rely heavily on the very fundamental step – grouping of your observations and creation of meaningful segments. We may also find many more use cases in computer science, biology, medicine or social science. However, it often turns out to be quite difficult to define properly how a well-separated cluster looks like.

R Plot Function – The Options

R’s plot function is probably the most used visualization function in R. It’s simple, easy and gets the job done. It’s also highly customizable. Adding unnecessary styling and information on a visualization/plot is not really recommended because it can take away from what’s being portrayed, but there are times when you have just have to. Whether it’s for pure aesthetics, to convey multiple things in one plot, or any other reason, here are the options you can use in R’s base plot() function.

Forecasting: Time Series Exploration Exercises (Part-1)

R provides powerful tools for forecasting time series data such as sales volumes, population sizes, and earthquake frequencies. A number of those tools are also simple enough to be used without mastering sophisticated underlying theories. This set of exercises is the first in a series offering a possibility to practice in the use of such tools, which include the ARIMA model, exponential smoothing models, and others.

Test driving Python integration in R, using the ‘reticulate’ package

Not so long ago RStudio released the R package ‘reticulate‘, it is an R interface to Python. Of course, it was already possible to execute python scripts from within R, but this integration takes it one step further. Imported Python modules, classes and functions can be called inside an R session as if it were just native R functions. Below you’ll find some screen shot code snippets of using certain Python modules within R with the reticulate package. On my GitHub page you’ll find the R files from which these snippets were taken from.

Measurement Units in R

We briefly review SI units, and discuss R packages that deal with measurement units, their compatibility and conversion. Built upon udunits2 and the UNIDATA udunits library, we introduce the package units that provides a class for maintaining unit metadata. When used in expression, it automatically converts units, and simplifies units of results when possible; in case of incompatible units, errors are raised. The class flexibly allows expansion beyond predefined units. Using units may eliminate a whole class of potential scientific programming mistakes. We discuss the potential and limitations of computing with explicit units.