Recidivism and logistic regression

In my previous article, I presented the problem of estimating a criminal’s risk of recidivism, focusing on the philosophical problems of attributing a probability to an individual. In this article I turn to the more practical problem of doing the estimation.


Correlation and Linear Regression

Before building a complex model, it is sensible to look at the relationships in the data to understand how your variables interact. Correlation measures the trend shared between two variables, and regression models how a response (dependent) variable changes with a predictor (independent) variable.
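
As a rough illustration of the distinction, the sketch below computes a Pearson correlation coefficient and an ordinary least-squares line for the same data. The function names and the sample values are made up for this example and do not come from the linked article.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: strength of the linear trend shared by xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical sample data (predictor xs, response ys).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(xs, ys)
slope, intercept = fit_line(xs, ys)
print(f"r = {r:.4f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Correlation is symmetric in the two variables, while the fitted line treats one variable as the predictor and the other as the response.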


A Python tutorial on Bayesian modeling techniques (PyMC3)

Welcome to ‘Bayesian Modelling in Python’, a tutorial for those interested in learning how to apply Bayesian modelling techniques in Python (PyMC3). This tutorial doesn’t aim to be a Bayesian statistics tutorial, but rather a programming cookbook for those who understand the fundamentals of Bayesian statistics and want to learn how to build Bayesian models using Python. The tutorial sections and topics can be seen below.


Digging Into the Pronto Data Release

In October Seattle’s bike sharing service, Pronto, turned one year old and released a treasure-trove of data on the 140,000 individual trips during the first year. Here I want to dig into this data and answer a few questions:
•Many naysayers insist that Seattle is too cold, too wet, and too hilly for bicycling to take off. How do these elements actually affect users of the Pronto system?
•What is the difference in Pronto usage by annual members and short-term users? How might Pronto evolve to be more useful to these groups?
•How do Pronto trips compare to trips by other cyclists in the city? Can characteristics of Pronto use give us insight into deeper trends within the city?
•Can we cleverly de-anonymize the data and learn about the usage patterns of individual members?


Advanced Jupyter Notebook Tricks — Part I

I love Jupyter notebooks! They’re great for experimenting with new ideas or data sets, and although my notebook ‘playgrounds’ start out as a mess, I use them to crystallize a clear idea for building my final projects. Jupyter is so great for interactive exploratory analysis that it’s easy to overlook some of its other powerful features and use cases. I wanted to write a blog post on some of the lesser known ways of using Jupyter — but there are so many that I broke the post into two parts. In Part 1, today, I describe how to use Jupyter to create pipelines and reports. In the next post, I will describe how to use Jupyter to create interactive dashboards.


Does mean centering or feature scaling affect a Principal Component Analysis?

Let us think about whether it matters or not if the variables are centered for applications such as Principal Component Analysis (PCA) if the PCA is calculated from the covariance matrix (i.e., the k principal components are the eigenvectors of the covariance matrix that correspond to the k largest eigenvalues).
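
One way to see why centering cannot matter in this setting: the covariance matrix already subtracts each variable's mean, so shifting the data leaves it unchanged, and the eigenvectors are therefore unchanged too. Rescaling a variable does change the matrix. The following is a minimal sketch with made-up data and a hypothetical helper, not code from the linked article.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of rows (one observation per row)."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]

def matrices_close(a, b, tol=1e-9):
    """True if two equally sized matrices agree element-wise within tol."""
    return all(
        abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb)
    )

# Hypothetical two-feature data set.
data = [[2.0, 1.0], [4.0, 3.0], [6.0, 2.0], [8.0, 6.0]]
shifted = [[x + 100.0, y - 50.0] for x, y in data]  # move the means only
scaled = [[x * 10.0, y] for x, y in data]           # change the units of feature 0

cov = covariance_matrix(data)
cov_shifted = covariance_matrix(shifted)
cov_scaled = covariance_matrix(scaled)

print(matrices_close(cov, cov_shifted))  # shifting leaves the covariance unchanged
print(matrices_close(cov, cov_scaled))   # scaling does not
```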


Data Visualization in Python: Advanced Functionality in Seaborn

Seaborn is a Python data visualization library with an emphasis on statistical plots. The library is an excellent resource for common regression and distribution plots, but where Seaborn really shines is in its ability to visualize many different features at once. In this post, we’ll cover three of Seaborn’s most useful functions: factorplot, pairplot, and jointplot. Going a step further, we’ll show how we can get even more mileage out of these functions by stepping up to their even-more-powerful forms: FacetGrid, PairGrid, and JointGrid.


Unsupervised Learning of Video Representations using LSTMs

We use multilayer Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed length representation. This representation is decoded using single or multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence, or predicting the future sequence.


Document Classification by Inversion of Distributed Language Representations

The goal of this note is to point out that any distributed representation can be turned into a classifier through inversion via Bayes rule. The approach is simple and modular, in that it will work with any language representation whose training can be formulated as optimizing a probability model.
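
The inversion step itself is just Bayes rule: p(class | document) is proportional to p(document | class) times p(class). The sketch below assumes that a separate probability model per class has already produced a log-likelihood for the document; the function name, class labels, and numbers are hypothetical and only illustrate the arithmetic, not the note's actual models.

```python
from math import exp, log

def class_posteriors(log_likelihoods, priors):
    """Apply Bayes rule to per-class document log-likelihoods.

    log_likelihoods: {class: log p(document | class)}
    priors:          {class: p(class)}
    Returns {class: p(class | document)}.
    """
    scores = {c: ll + log(priors[c]) for c, ll in log_likelihoods.items()}
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    unnormalized = {c: exp(s - top) for c, s in scores.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}

# Hypothetical log-likelihoods of one document under two class-specific models.
log_likelihoods = {"positive": -42.0, "negative": -45.0}
priors = {"positive": 0.5, "negative": 0.5}

posteriors = class_posteriors(log_likelihoods, priors)
predicted = max(posteriors, key=posteriors.get)
print(predicted, posteriors)
```

Working in log space avoids underflow, since document likelihoods are products of many small probabilities.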


Statistics Refresher

Let’s face it, a good statistics refresher is always worthwhile. There are times we all forget basic concepts and calculations. Therefore, I put together a document that could act as a statistics refresher and thought that I’d share it with the world. This is part one of a two-part document that is still being completed. This refresher is based on Principles of Statistics by Balmer and Statistics in Plain English by Brightman.