Bayesian probabilistic models provide a nimble and expressive framework for modeling ‘small-world’ data. In contrast, deep learning offers a more rigid yet much more powerful framework for modeling data of massive size. Edward is a probabilistic programming library that bridges this gap: ‘black-box’ variational inference enables us to fit extremely flexible Bayesian models to large-scale data. Furthermore, these models themselves may take advantage of classic deep-learning architectures of arbitrary complexity. Edward uses TensorFlow for symbolic gradients and data flow graphs. As such, it interfaces cleanly with other libraries that do the same, namely TF-Slim, PrettyTensor and Keras. Personally, I’ve been working often with the latter, and am consistently delighted by the ease with which it allows me to specify complex neural architectures. The aim of this post is to lay a practical foundation for Bayesian modeling in Edward, then explore how, and how easily, we can extend these models in the direction of classical deep learning via Keras. It will give both a conceptual overview of the models below, as well as notes on the practical considerations of their implementation — what worked and what didn’t. Finally, this post will conclude with concrete ways in which to extend these models further, of which there are many. If you’re just getting started with Edward or Keras, I recommend first perusing the Edward tutorials and Keras documentation respectively. To ‘pull us down the path,’ we build three models in additive fashion: a Bayesian linear regression model, a Bayesian linear regression model with random effects, and a neural network with random effects. We fit them on the Zillow Prize dataset, which asks us to predict logerror (in house-price estimate, i.e. the ‘Zestimate’) given metadata for a list of homes. These models are intended to be demonstrative, not performant: they will not win you the prize in their current form.
At Google, we develop flexible state-of-the-art machine learning (ML) systems for computer vision that not only can be used to improve our products and services, but also spur progress in the research community. Creating accurate ML models capable of localizing and identifying multiple objects in a single image remains a core challenge in the field, and we invest a significant amount of time training and experimenting with these systems.
You’ve probably heard the adage “two heads are better than one.” Well, it applies just as well to machine learning where the combination of diverse approaches leads to better results. And if you’ve followed Kaggle competitions, you probably also know that this approach, called stacking, has become a staple technique among top Kagglers. In this interview, Marios Michailidis (AKA Competitions Grandmaster KazAnova on Kaggle) gives an intuitive overview of stacking, including its rise in use on Kaggle, and how the resurgence of neural networks led to the genesis of his stacking library introduced here, StackNet. He shares how to make StackNet-a computational, scalable and analytical, meta-modeling framework-part of your toolkit and explains why machine learning practitioners shouldn’t always shy away from complex solutions in their work.
This paper has a wonderful combination of properties: the results are easy to understand, somewhat surprising, and then leave you pondering over what it all might mean for a long while afterwards! By “generalize well,” the authors simply mean “what causes a network that performs well on training data to also perform well on the (held out) test data?” (As opposed to transfer learning, which involves applying the trained network to a related but different problem). If you think about that for a moment, the question pretty much boils down to “why do neural networks work as well as they do?” Generalisation is the difference between just memorising portions of the training data and parroting it back, and actually developing some meaningful intuition about the dataset that can be used to make predictions. So it would be somewhat troubling, would it not, if the answer to the question “why do neural networks work (generalize) as well as they do?” turned out to be “we don’t really know!”
Bots are going to disrupt the software industry in the same way the web and mobile revolutions did. History has taught us that great opportunities arise in these revolutions: we’ve seen how successful companies like Uber, Airbnb, and Salesforce were created as a result of new technology, user experience, and distribution channels. At the end of this book, I hope you will be better prepared to grab these opportunities and design a great product for this bot revolution. Our lives have become full of bots in 2017 —I wake up in the morning and ask Amazon’s Alexa (a voice bot by Amazon) to play my favorite bossa nova, Amy (an email bot by x.ai) emails me about today’s meetings, and Slackbot (a bot powered by Slack) sends me a notification to remind me to buy airline tickets to NYC today. Bots are everywhere!
Partial Least Squares (PLS) is a popular method for relative importance analysis in fields where the data typically includes more predictors than observations. Relative importance analysis is a general term applied to any technique used for estimating the importance of predictor variables in a regression model. The output is a set of scores which enable the predictor variables to be ranked based upon how strongly each influences the outcome variable. There are a number of different approaches to calculating relative importance analysis including Relative Weights and Shapley Regression as described here and here. In this blog post I briefly describe an alternative method – Partial Least Squares. Because it effectively compresses the data before regression, PLS is particularly useful when the number of predictor variables is more than the number of observations.
Here is an absolutely horrible way to confuse yourself and get an inflated reported R-squared on a simple linear regression model in R. We have written about this before, but we found a new twist on the problem (interactions with categorical variable encoding) which we would like to call out here.
In my early post (https://…/monotonic-binning-with-smbinning-package ), I wrote a monobin() function based on the smbinning package by Herman Jopia to improve the monotonic binning algorithm. The function works well and provides robust binning outcomes. However, there are a couple potential drawbacks due to the coarse binning. First of all, the derived Information Value for each binned variable might tend to be low. Secondly, the binned variable might not be granular enough to reflect the data nature.
Part 2 of 2 in the series Set Theory
At the R/Finance conference last month, I demonstrated how to operationalize models developed in Microsoft R Server as web services using the mrsdeploy package. Then, I used that deployed model to generate predictions for loan delinquency, using a Python script as the client. (You can see slides here, and a video of the presentation below.)
Neural network have become a corner stone of machine learning in the last decade. Created in the late 1940s with the intention to create computer programs who mimics the way neurons process information, those kinds of algorithm have long been believe to be only an academic curiosity, deprived of practical use since they require a lot of processing power and other machine learning algorithm outperform them. However since the mid 2000s, the creation of new neural network types and techniques, couple with the increase availability of fast computers made the neural network a powerful tool that every data analysts or programmer must know. In this series of articles, we’ll see how to fit a neural network with R, we’ll learn the core concepts we need to know to well apply those algorithms and how to evaluate if our model is appropriate to use in production. Today, we’ll practice how to use the nnet and neuralnet packages to create a feedforward neural networks, which we introduce in the last set of exercises. In this type of neural network, all the neuron from the input layer are linked to the neuron from the hidden layer and all of those neuron are linked to the output layer, like seen on this image. Since there’s no cycle in this network, the information flow in one direction from the input layer to the hidden layers to the output layer. For more information about those types of neural network you can read this page.
So many things have been said about weighting, but on my personal view of statistical inference processes, you do have to weight. From a single statistic until a complex model, you have to weight, because of the probability measure that induces the variation of the sample comes from an (almost always) complex sampling design that you should not ignore. Weighting is a complex issue that has been discussed by several authors in recent years. The social researchers have no found consensus about the appropriateness of the use of weighting when it comes to the fit of statistical models. Angrist and Pischke (2009, p. 91) claim that few things are as confusing to applied researchers as the role of sample weights. Even now, 20 years post-Ph.D., we read the section of the Stata manual on weighting with some dismay. Anyway, despite the fact that researchers do not have consensus on when to weight, the reality is that you have to be careful when doing so. For example, when it comes to estimating totals, means or proportions, you can use the inverse probability as a way for weighting, and it looks like every social researcher agrees to weight in order to estimate this kind of descriptive statistics. The rationale behind this practice is that you suppose that every unit belonging to the sample represents itself and many others that were not selected in the sample. When using weights to estimate parameter models, you have to keep in mind the nature of the sampling design. For example, when it comes to estimates multilevel parameters, you have to take into account not only the final sampling unit weights but also the first sampling unit weights. For example, let’s assume that you have a sample of students, selected from a national frame of schools. Then, we have two sets of weights, the first one regarding schools (notice that one selected school represents itself as well as others not in the sample) and the second one regarding students.