Internet of Things and data mining: From applications to techniques and systems

The Internet of Things (IoT) is the result of the convergence of sensing, computing, and networking technologies, allowing devices of varying sizes and computational capabilities (things) to intercommunicate. This communication can be achieved locally, enabling what is known as edge and fog computing, or through the well-established Internet infrastructure, exploiting the computational resources in the cloud. The IoT paradigm enables a new breed of applications in various areas including health care, energy management and smart cities. This paper starts by reviewing these applications and their potential benefits. Challenges facing the realization of such applications are then discussed. The sheer amount of data stemming from the devices that form the IoT requires new data mining systems and techniques, which are discussed and categorized later in this paper. Finally, the paper concludes with future research directions.

Introduction to Monte Carlo Method

In this tutorial, the reader will learn the Monte Carlo methodology and its applications in data science, such as integral approximation and parameter estimation.
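As a rough illustration of the integral approximation mentioned above, here is a minimal sketch that is not taken from the tutorial itself. The function name `mc_integrate`, the sample count, and the choice of integrand (sin on [0, pi], whose exact integral is 2) are all assumptions made for the example.

```python
import math
import random


def mc_integrate(f, a, b, n_samples=100_000, seed=0):
    """Estimate the integral of f over [a, b] by averaging f at uniform random points."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    total = 0.0
    for _ in range(n_samples):
        total += f(rng.uniform(a, b))
    # Mean value of f on [a, b] times the interval length approximates the integral.
    return (b - a) * total / n_samples


# Hypothetical test integrand: the exact integral of sin on [0, pi] is 2.
estimate = mc_integrate(math.sin, 0.0, math.pi)
print(f"Monte Carlo estimate: {estimate:.4f} (exact value: 2)")
```

The error of such an estimate shrinks roughly with the square root of the number of samples, so quadrupling `n_samples` should about halve the typical error.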

Image clustering with Keras and k-Means

A while ago, I wrote two blogposts about image classification with Keras: one on how to use your own models or pretrained models for predictions, and one on using LIME to explain those predictions.

Visualize your CV’s timeline with R (Gantt chart style)

I have been improving my curriculum vitae (CV) these last days and a nice idea came up: we can use R to generate a neat, organized plot showing our different roles and jobs throughout the years. Not only will it draw recruiters’ attention, but it will also show some of your proficiency with visualizations. In my case, as a data scientist, that is one of the most important skills to have when showing your results to C-level guys or explaining a model’s performance to non-technical people.

‘How do neural nets learn?’ A step by step explanation using the H2O Deep Learning algorithm.

In my last blogpost about Random Forests I introduced the Bootcamp. The next part I published was about Neural Networks and Deep Learning. Every video of our bootcamp will have example code and tasks to promote hands-on learning. While the practical parts of the bootcamp will be using Python, below you will find the English R version of this Neural Nets Practical Example, where I explain how neural nets learn and how the concepts and techniques translate to training neural nets in R with the H2O Deep Learning function.

On the evolution of Data Engineering

A few years ago, being a data engineer meant managing data in and out of a database, creating pipelines in SQL or procedural SQL, doing some form of ETL to load data into a data warehouse, and creating data structures to unify, standardize and (de)normalize datasets for analytical purposes in a non-realtime manner. Some companies added more front-facing business components that involved building analytic cubes and dashboards for business users. In 2018 and beyond, the role and scope of data engineers have changed quite drastically. The emergence of data products has created a gap that requires a mix of skills not traditionally embedded within typical development teams; the more software-development-oriented data engineers and the more data-oriented backend engineers were in a prime position to fill this gap. This evolution was facilitated by a growing number of technologies that helped bridge the gap, both for those from a data engineering background and for those from a more backend engineering background.

Interpreting Linear Prediction Models

Although linear models are one of the simplest machine learning techniques, they are still a powerful tool for predictions. This is largely because linear models are especially easy to interpret. Here, I discuss the most important aspects of interpreting linear models, using ordinary least-squares regression on the airquality data set as an example.
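To make the interpretation point concrete, here is a minimal sketch of ordinary least squares with a single predictor. It is not the article's code, and the numbers are made up for illustration rather than taken from the airquality data set; the helper name `fit_simple_ols` is also an assumption.

```python
def fit_simple_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form OLS solution for a single predictor.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up illustrative data (NOT the airquality data set).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [5.1, 7.9, 11.2, 13.8, 17.0]

intercept, slope = fit_simple_ols(xs, ys)
print(f"intercept = {intercept:.2f}: predicted y when x is 0")
print(f"slope     = {slope:.2f}: change in predicted y per one-unit increase in x")
```

The two fitted numbers can be read directly: the slope is the expected change in the outcome for a one-unit change in the predictor, which is what makes a model of this kind easy to explain.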


Deep learning operations rethinked (supports tf, pytorch, chainer, gluon and others)

Data over Substance – For Now

There was an interesting article in today’s US print edition of the Wall Street Journal. It was titled, Wall Street Analysts Are Selling More Data. The article explains how investment firms are increasingly collecting more data streams, many from new and even unconventional sources, packaging the data up and selling it on to their clients. What was included in the article, inside the column and not implied by the article title, was the point that those same investment firms were selling much less research. Under the previous model (if that’s the word to use), such firms would buy data and financial analysts would pore over the data, think hard, draw conclusions and insights, and sell those insights to clients. Now the value of financial research is falling and the value of raw data is rising. Yet another point, hidden further inside the article, is that we are not only talking of raw data increasing in value. Some of that raw data is what we might call packaged insight. In other words, an analytic or algorithm is applied to the data so that the client licenses the data and an additional analytic to go with it. So it’s not just raw data. Some of the repeatable human-driven analysis is now packaged and applied (and sold) as an analytic.

Best Practices for Using Notebooks for Data Science

Are you interested in implementing notebooks for data science? Check out these 5 things to consider as you begin the process.
#1 One notebook, one focus.
#2 State is explicit.
#3 Push code in modules
#4 Use speaking variables and tidy up your code
#5 Label diagrams

Spinning Up in Deep RL

We’re releasing Spinning Up in Deep RL, an educational resource designed to let anyone learn to become a skilled practitioner in deep reinforcement learning. Spinning Up consists of crystal-clear examples of RL code, educational exercises, documentation, and tutorials.

How to Fit Large Neural Networks on the Edge

Deploying memory-hungry deep learning algorithms is a challenge for anyone who wants to create a scalable service. Cloud services are expensive in the long run. Deploying models offline on edge devices is cheaper, and has other benefits as well. The only disadvantage is that they have a paucity of memory and compute power. This blog explores a few techniques that can be used to fit neural networks in memory-constrained settings. Different techniques are used for the ‘training’ and ‘inference’ stages, and hence they are discussed separately.

4 ways to be more efficient using RStudio’s Code Snippets, with 11 ready to use examples

In this post we will look at yet another productivity-increasing feature of the RStudio IDE: Code Snippets. Code Snippets let us easily insert and potentially execute predefined pieces of code, and they work not just for R code but for many other languages as well.

Voronoi diagram with ggvoronoi package with Train Station data

I’ve always wanted to make a Voronoi diagram; I just think they are beautiful! When I came across a data set with train stations in Japan, I instantly thought it would be a great data set for making a Voronoi diagram. I got the data from the Ekidata site (http://…/). I’m amazed at how many train stations we have in Japan, as well as at the coverage of the train systems. There are a couple of packages I could have used to make a Voronoi diagram, but I used the ggvoronoi package. I really like using ‘outline’ inside the geom_voronoi function to mask out the shape (which I wasn’t sure how to do before with the deldir package).

Exploring Models with lime

Recently at work I’ve been asked to help some clinicians understand why my risk model classifies specific patients as high risk. Just prior to this work I stumbled across lime, the work of some data scientists at the University of Washington. LIME stands for ‘Local Interpretable Model-Agnostic Explanations’. The idea is that I can answer the questions I’m getting from clinicians about a specific patient by locally fitting a linear (aka ‘interpretable’) model in the parameter space just around my data point. I decided to pursue lime as a solution, and for the last few months I’ve been focusing on implementing this explainer for my risk model. Happily, I also discovered an R package that implements this solution, which originated in Python.
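The idea of fitting a linear model locally around one data point can be sketched in a few lines. This is only a one-feature toy under stated assumptions, not the lime package or its API: `black_box` is a hypothetical stand-in for an opaque risk model, and the sampling width and Gaussian proximity kernel are illustrative choices.

```python
import math
import random


def black_box(x):
    """Hypothetical opaque risk model: a logistic curve in a single feature."""
    return 1.0 / (1.0 + math.exp(-(x - 5.0)))


def local_linear_explanation(model, x0, width=0.5, n_samples=500, seed=0):
    """Fit a proximity-weighted straight line to the model's output around x0."""
    rng = random.Random(seed)
    # 1. Perturb the point of interest.
    xs = [rng.gauss(x0, width) for _ in range(n_samples)]
    # 2. Query the black-box model on the perturbed points.
    ys = [model(x) for x in xs]
    # 3. Weight each sample by its closeness to x0 (Gaussian kernel).
    ws = [math.exp(-((x - x0) ** 2) / (2 * width ** 2)) for x in xs]
    # 4. Weighted least squares for y = intercept + slope * x.
    w_sum = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / w_sum
    mean_y = sum(w * y for w, y in zip(ws, ys)) / w_sum
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


_, slope_mid = local_linear_explanation(black_box, x0=5.0)
_, slope_tail = local_linear_explanation(black_box, x0=9.0)
print(f"local slope near x=5: {slope_mid:.3f}")
print(f"local slope near x=9: {slope_tail:.3f}")
```

The local slope is the explanation: near x = 5 the feature moves the predicted risk strongly, while near x = 9 the model is almost saturated and the same feature barely matters for that particular point.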

The Four Stages of Mastering Statistical Analysis

At The Analysis Factor, we are on a mission to help researchers improve their statistical skills so they can do amazing research.
We all tend to think of ‘Statistical Analysis’ as one big skill, but it’s not.
Stage 1: The Fundamentals
Stage 2: Linear Models
Stage 3: Extensions of Linear Models
Stage 4: Advanced Models

The xtensor vision

Here we’re laying out a vision for the xtensor project, the n-dimensional array in the C++ language – that makes it easy to write high-performance code and bind it to the languages of data science (Python, Julia and R).

Prototyping a Recommender System Step by Step Part 1: KNN Item-Based Collaborative Filtering

Most internet products we use today are powered by recommender systems. YouTube, Netflix, Amazon, Pinterest, and a long list of other internet products all rely on recommender systems to filter millions of items and make personalized recommendations to their users. Recommender systems are well studied and proven to provide tremendous value to internet businesses and their consumers. In fact, I was shocked at the news that Netflix awarded a $1 million prize to a developer team in 2009 for an algorithm that increased the accuracy of the company’s recommendation system by 10%.
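The technique named in the title, KNN item-based collaborative filtering, can be sketched roughly as follows. This is not the article's implementation: the ratings are hypothetical, cosine similarity is one common choice of distance, and the helper names are assumptions made for the example.

```python
import math

# Hypothetical ratings: user -> {item: rating}. Missing entries mean "not rated".
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 2},
    "cat": {"C": 5, "D": 4},
    "dan": {"A": 5, "B": 5, "C": 2, "D": 1},
}


def item_vectors(ratings):
    """Turn user -> item ratings into item -> {user: rating} vectors."""
    items = {}
    for user, user_ratings in ratings.items():
        for item, rating in user_ratings.items():
            items.setdefault(item, {})[user] = rating
    return items


def cosine(u, v):
    """Cosine similarity between two sparse vectors (missing entries count as 0)."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_items(ratings, target, k=2):
    """Return the k items whose rating vectors are most similar to the target item."""
    vectors = item_vectors(ratings)
    scored = [
        (cosine(vectors[target], vec), other)
        for other, vec in vectors.items()
        if other != target
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(other, score) for score, other in scored[:k]]


for item, score in nearest_items(ratings, "A"):
    print(f"{item}: similarity {score:.3f}")
```

A recommender built on this idea would suggest, to someone who liked item A, the neighbouring items they have not rated yet.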

A brief introduction to Intent Classification

Recently I learned about something called ‘intent classification’ for a project, so I thought I would share it with all of you, along with how I created a classifier for it. Intent classification is an important component of Natural Language Understanding (NLU) systems in any chatbot platform.

Interpretable ML with Additive Models

Earlier this year I went to an excellent talk by Rich Caruana from Microsoft Research about his approach to building interpretable ML models for medical applications. As a proponent of interpretable ML, Rich Caruana was featured in a heated debate with Yann LeCun at NIPS last year. Here I’ll summarize his idea and give it my own interpretation. The main idea is that we can build a humanly interpretable ML model by making it additive with explicitly specified interactions. So what is an additive model? Think good old regression models. In a simple regression model, the effect of one variable does not depend on other variables and their effects add up to the total effect. If we can understand each variable individually or marginally, we would be able to interpret the entire model. However, this will no longer be true in the presence of interactions. Interaction between two variables means that the effects of two variables are dependent on each other. Interactions can make a model more difficult to interpret. Even if we understand how each variable changes the model prediction, we still could not work out the model prediction when several variables change. It’s like having several beneficial drugs that turn into poison when combined. When there are a lot of variables and many interactions, it would become impossible to untangle the effects of variables. This is exactly the problem with most ML models. In a decision tree, we can split branches by one variable first then split again by another variable; this is essentially creating interactions. We can interpret a small decision tree by following the splits but this would easily become a pain when the trees get bigger or form a random forest. In a multi-layer neural network, interactions are created implicitly when propagating through the hidden layers as each hidden unit is a non-linear combination of the input.
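The difference between an additive model and one with an interaction can be shown with a small sketch. The two functions and their coefficients are made up purely for illustration and do not come from the talk.

```python
def additive(x1, x2):
    """Purely additive model: each variable contributes independently."""
    return 1.0 + 2.0 * x1 + 3.0 * x2


def with_interaction(x1, x2):
    """Same model plus an x1 * x2 interaction term (hypothetical coefficient)."""
    return 1.0 + 2.0 * x1 + 3.0 * x2 + 4.0 * x1 * x2


def effect_of_x1(model, x2):
    """Change in prediction when x1 goes from 0 to 1 with x2 held fixed."""
    return model(1.0, x2) - model(0.0, x2)


for x2 in (0.0, 1.0, 2.0):
    print(
        f"x2 = {x2}: additive effect of x1 = {effect_of_x1(additive, x2)}, "
        f"effect with interaction = {effect_of_x1(with_interaction, x2)}"
    )
```

In the additive model the effect of x1 is the same whatever x2 is, so it can be reported as a single number. With the interaction term the effect depends on x2, which is exactly what makes such a model harder to summarize variable by variable.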