Building a Random Forest from Scratch & Understanding Real-World Data Products (ML for Programmers – Part 3)

As data scientists and machine learning practitioners, we come across and learn a plethora of algorithms. Have you ever wondered where each algorithm’s true usefulness lies? Are most machine learning techniques learned with the primary aim of scaling a hackathon’s leaderboard? Not necessarily. It’s important to examine and understand where and how machine learning is used in real-world industry scenarios. That’s where most of us are working (or will eventually work). And that’s what I aim to show in this part 3 of our popular series covering the introduction to machine learning course!

Python Multi-Threading vs Multi-Processing

There is a library called threading in Python and it uses threads (rather than just processes) to implement parallelism. This may be surprising news if you know about the Python’s Global Interpreter Lock, or GIL, but it actually works well for certain instances without violating the GIL. And this is all done without any overhead — simply define functions that make I/O requests and the system will handle the rest.

Nonparametric Relative Survival Analysis with the R Package relsurv

Relative survival methods are crucial with data in which the cause of death information is either not given or inaccurate, but cause-specific information is nevertheless required. This methodology is standard in cancer registry data analysis and can also be found in other areas. The idea of relative survival is to join the observed data with the general mortality population data and thus extract the information on the disease-specific hazard. While this idea is clear and easy to understand, the practical implementation of the estimators is rather complex since the population hazard for each individual depends on demographic variables and changes in time. A considerable advance in the methodology of this field has been observed in the past decade and while some methods represent only a modification of existing estimators, others require newly programmed functions. The package relsurv covers all the steps of the analysis, from importing the general population tables to estimating and plotting the results. The syntax mimics closely that of the classical survival packages like survival and cmprsk, thus enabling the users to directly use its functions without any further familiarization. In this paper we focus on the nonparametric relative survival analysis, and in particular, on the two key estimators for net survival and crude probability of death. Both estimators were first presented in our package and are still missing in many other software packages, a fact which greatly hampers their frequency of use. The paper offers guidelines for the actual use of the software by means of a detailed nonparametric analysis of the data describing the survival of patients with colon cancer. The data have been provided by the Cancer Registry of Slovenia.

December Reading for Econometricians

• Askanazi, R., F. X. Diebold, F. Schorfheide, & M. Shin, 2018. On the comparison of interval forecasts.
• Meintanis, S. G., J. Ngatchou-Wandji, & J. Allison, 2018. Testing for serial independence in vector autoregressive modesl.
• Milas, C., T. Panagiotidis, & T. Dergiades, 2018. Twitter versus traditional news media: Evidence for the sovereign bond markets
• Ollech, D., 2018. Seasonal adjustment of daily time series.
• Sickles, R. C., W. Song, & V. Zelenyuk, 2018. Econometric analysis of productivity: Theory and implementation in .
• Skeels, C. L. & F. Windmeijer, 2018. On the Stock-Yogo tables.

Digit Recognizer – Introduction to Kaggle Competitions with Image Classification Task (0.995)

In today’s article, I am going to show you how to master machine learning skills by participating in Kaggle data science competitions. We will solve a simple written digits recognition task (MNIST) and compare our results with others’. You will be challenged to master your analytical and coding skills in order to succeed in a highly competitive Kaggle community.

Facebook´s Open-Source Reinforcement Learning Platform – A Deep Dive

Facebook decided to open-source the platform that they created to solve end-to-end Reinforcement Learning problems at the scale they are working on. So of course I just had to try this 😉 Let’s go through this together on how they installed it and what you should do to get this working yourself.

How Well My Time Series Models Performed? Actual Vs. Prediction – R

Time series analysis and forecasting for the monthly accident and emergency attendances to National Health Services (NHS) in England was an interesting project. There I had built four different forecasting models to predict the monthly Total Attendances to NHS organizations in the period between Aug-2018 till July-2019. To read more about it (click here). The four models were: Seasonal Naïve (SN), Linear Regression (LR), ARIMA and Auto-ARIMA model. However, I built those models in August 2018. After that, I have kept monitoring the actual statistics release of Total Attendances by the end of every month via the NHS website?-?A&E Attendances and Emergency Admissions. It has been almost three months now and in general, I can say that those models behave quite excellent. But here we will dig deep more on how they performed by comparing the predicted data of each model in three months ago with actual ones. This will be accomplished simply by calculating the Absolute Error (AE), Root Mean Square Error (RMSE), and Root Mean Square Log Error (RMSLE).

A short tutorial on Fuzzy Time Series – Part II (with an case study on Solar Energy)

In the first part of this tutorial I briefly explained time series analysis, fuzzy sets and what are the Fuzzy Time Series?-?FTS, with a short introduction to the pyFTS library. It was awesome! I got feedback from people around the world and I heard very good ideas coming from them. But, as occurs in every introductory study, we used very simple methods and data with the aim of to facilitate the general understanding of this approach. But now, with the assumption that the readers already know the background needed, we can advance a little bit and play with more useful things. To start we going to see an intuitive introduction to the concept of bias/variance, then we will go learn some of the most import FTS hyperparameters and their impact on model’s accuracy. To finish we will employ some FTS models to model and predict solar radiation time series, useful on photovoltaic energy prediction.

Why ensembling works: The intuition behind Opendoor’s home pricing

At Opendoor, we leverage machine learning to price homes that customers sell to us. Because we make real offers on homes worth hundreds of thousands of dollars, it is critical that our models make accurate price estimates. If our estimates are too low, we are unfair to our customers. If our estimates are too high, we might pay more for the home than the market would allow us to sell it for, and as a result we might lose money. Neither of these outcomes are good, and that’s why we always work hard to improve our models. A popular way to improve model accuracy is to build multiple models and then compute a weighted average of their estimates. For example, if one model says that a home is worth $295K and another model says the home is worth $305K, we might average the two estimates together to make a final estimate of $300K. This approach is called ensembling. When it comes to improving accuracy, ensembling is effective. What is probably most surprising is that adding a new model to the ensemble usually helps even if the new model is inferior to all of the original models.

A different kind of (deep) learning: part 1

In this and the following posts, I will try to discuss some of the most interesting approaches: some of them are doing things, and dub it ‘Different kinds of (deep) learning’. I by no means try to predict the future developments in deep learning, but merely to describe recent interesting works, that perhaps doesn´t get the spotlight. This may serve the readers for a few purposes:
• You may be interested in learning about works you didn´t know of.
• You may get new ideas for your own work.
• You may learn of relation between logical parts and tasks in deep learning that you were not aware of.
The first part of this series will be about self supervised learning that was one of the main drivers for me to write this series.

Machine learning in Clojure with XGBoost and clj-boost

You probably never heard about Clojure at all, let alone for data science and machine learning. So why would you be interested in using it for doing these things? I’ll tell you why: to make what matters (data) first class! In Clojure we don’t deal with classes, objects and so on, everything is just data. And that data is immutable. This means that if you mess up with your transformations, the data will be ok and you won’t have to start all over again. The previous is just one of the reasons popping out of my head, another one might be the JVM. Yeah, we all hate it to some extent, but make no mistake: it is developed since 1991 and has always been production grade. This is why businesses all around the world still run and develop software on the JVM. Considering this, Clojure can be accepted even by the most conservative corporations out there because down below it is fundamentally something they already know. The second point I would like to make is the REPL. This isn’t like the usual language shell (for instance the Python REPL) which are usually very basic and annoying, but it has superpowers! Would it be nice to go seeing live what’s happening with that service in production? Done! Would you like to have a plotting service that can be used for both production and experimenting? Done! Do you happen to have nested data structures and it would be nice to explore them visually to better understand them? Done! This means you don’t really need something like a Jupyter notebook, though if you feel more comfortable in such environment there are options out there that work seamlessly: clojupyter is a Clojure kernel for Jupyter notebooks, and Gorilla REPL is a native Clojure notebook solution.

Curiosity in Deep Reinforcement Learning

Understanding Random Network Distillation

What’s New in Deep Learning Research: Microsoft Also Wants to Use Neural Networks to Design Neural Networks

Choosing the right network architecture is a highly subjective aspects of deep learning systems. How many layers should we use? What’s the right size for a convolution operation? What nodes should we use as an input of a recurrent neural network(RNN)? Deep learning practitioners agonize over those types of questions every single time they are faced with building a new neural architecture. The most ambitious circles within the deep learning community believe that neural network architecture can be modeled itself as a deep learning problem. The idea of relying on deep learning systems to design deep learning systems is a fascinating area of research. The deep learning space has baptized this approach as neural architecture search(NAS) and is counts with two main schools of thought: reinforcement learning(RL) and evolutionary algorithms(EA). Both approaches have the limitation that they conduct architecture searches in a discrete space. In RL based methods, the choice of a component of the architecture is regarded as an action. A sequence of actions defines an architecture of a neural network, whose dev set accuracy is used as the reward. In EA based method, search is performed through mutations and re-combinations of architectural components, where those architectures with better performances will be picked to continue evolution. Discrete space searches make sense in some constrained scenarios but result inefficient when comes to create deep learning agents that operate in real world environments which are continuous by definition. Recently, researchers from the Microsoft Research Lab in Beijing proposed a new method for discovering neural architectures in a continuous space.

De(Coding) Random Forests

Motivation: Random Forest Ensembles are widely used for real-world machine learning problems, Classification as well as Regression. Their popularity can be attributed to the fact that practitioners often get optimal results using a Random Forest algorithm with minimal data cleaning, and no feature scaling. For a better grasp of the underlying principles, I decided to code a Random Forest from scratch, and perform a few visualizations of different features in the data, and their role in deciding the final result. Widely believed to be a black box, a quick walk-through of the algorithm will prove it is actually quite interpretable, apart from being a powerful technique leveraging the ‘power of the majority vote’.

Training a Goal-Oriented Chatbot with Deep Reinforcement Learning – Part I

In this series we are going to be learning about goal-oriented chatbots and training one with deep reinforcement learning in python! All from scratch!