An overview of feature selection strategies

Feature selection and engineering are the most important factors affecting the success of predictive modeling. This remains true even today, despite the success of deep learning, which comes with automatic feature engineering. Parsimonious and interpretable models provide simple insights into business problems and are therefore considered very valuable. Furthermore, on many occasions the size and structure of the data being analyzed may not allow the use of complex models that have many parameters to tune. For example, in clinical settings where the number of samples is usually much lower than the number of features one could extract (e.g. gene expression studies, patient-specific physiologic data), simpler models are preferable. These high-dimensional problems pose significant challenges, and numerous techniques have been developed over the years to address them. In this post, I will review a few of the common methods and illustrate their use in some sample cases. I will focus on binary classification problems for simplicity.


Comparing Machine Learning as a Service: Amazon, Microsoft Azure, Google Cloud AI, IBM Watson

For most businesses, machine learning seems close to rocket science, appearing expensive and talent-demanding. And if you're aiming at building another Netflix recommendation system, it really is. But the trend of making everything-as-a-service has affected this sophisticated sphere, too. You can jump-start an ML initiative without much investment, which would be the right move if you are new to data science and just want to grab the low-hanging fruit. One of ML's most inspiring stories is the one about a Japanese farmer who decided to sort cucumbers automatically to help his parents with this painstaking operation. Unlike the stories that abound about large enterprises, the guy had neither expertise in machine learning nor a big budget. But he did manage to get familiar with TensorFlow and employed deep learning to recognize different classes of cucumbers. By using machine learning cloud services, you can start building your first working models, yielding valuable insights from predictions with a relatively small team. We've already discussed machine learning strategy. Now let's have a look at the best machine learning platforms on the market and consider some of the infrastructural decisions to be made.


6 Steps To Write Any Machine Learning Algorithm From Scratch: Perceptron Case Study



Deequ – Unit Tests for Data

Deequ is a library built on top of Apache Spark for defining ‘unit tests for data’, which measure data quality in large datasets.
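
To give a feel for what a 'unit test for data' means, here is a minimal sketch of the idea in plain Python. It is an illustration only: Deequ itself runs on Apache Spark, and the names used here (`Check`, `is_complete`, `is_unique`, `is_non_negative`, `run_checks`) and the sample rows are hypothetical and are not Deequ's API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class Check:
    """A named data-quality constraint evaluated over a whole dataset."""
    name: str
    predicate: Callable[[List[Row]], bool]


def is_complete(column: str) -> Check:
    # Passes when no row has a missing (None) value in the column.
    return Check(
        f"{column} is complete",
        lambda rows: all(r.get(column) is not None for r in rows),
    )


def is_unique(column: str) -> Check:
    # Passes when no value in the column appears more than once.
    def predicate(rows: List[Row]) -> bool:
        values = [r.get(column) for r in rows]
        return len(values) == len(set(values))

    return Check(f"{column} is unique", predicate)


def is_non_negative(column: str) -> Check:
    # Passes when every present value is >= 0; missing values are
    # left to a separate completeness check.
    return Check(
        f"{column} is non-negative",
        lambda rows: all(
            r.get(column) is None or r[column] >= 0 for r in rows
        ),
    )


def run_checks(rows: List[Row], checks: List[Check]) -> Dict[str, bool]:
    # Evaluate every check and report pass/fail by check name.
    return {check.name: check.predicate(rows) for check in checks}


# Hypothetical sample data: the second row has a missing name.
rows = [
    {"id": 1, "name": "Thingy A", "views": 0},
    {"id": 2, "name": None, "views": 12},
    {"id": 3, "name": "Thingy C", "views": 5},
]

results = run_checks(
    rows,
    [
        is_complete("id"),
        is_unique("id"),
        is_complete("name"),
        is_non_negative("views"),
    ],
)

for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```

The point of the sketch is the shape of the approach: constraints are declared up front, evaluated against the data, and reported as pass or fail, much like assertions in a unit test suite.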


A Performance Benchmark of Different AutoML Frameworks

In a recent blog post, our CEO Sebastian Heinz wrote about Google’s newest stroke of genius: AutoML Vision, a cloud service ‘that is able to build deep learning models for image recognition completely fully automated and from scratch’. AutoML Vision is part of the current trend towards the automation of machine learning tasks. This trend started with the automation of hyperparameter optimization for single models (including tools and services such as SigOpt, Hyperopt and SMAC), continued with automated feature engineering and selection (see my colleague Lukas’ blog post about our bounceR package), and is moving towards full automation of complete data pipelines, including automated model stacking (a common model ensembling technique). One company at the frontier of this development is certainly h2o.ai. They developed both a free Python/R library (H2O AutoML) and an enterprise-ready software solution called Driverless AI. But H2O is far from the only player in the field. This blog post gives a short comparison of two freely available AutoML solutions in terms of predictive performance and general usability.


Shiny application in production with ShinyProxy, Docker and Debian

You have created some great Shiny applications, perhaps following our advice on Shiny packaging, and you want to put them into production on a self-hosted setup so that others can use them without limitations, on the Internet or on an internal company server? ShinyProxy is for you! ShinyProxy v2.0 has recently been released, which is a great opportunity to talk about its implementation. A major benefit of ShinyProxy is its ability to create an independent Docker container for each user who connects to your Shiny application. This overcomes the limits on the number of users encountered with ShinyServer. Are you interested? Follow the installation procedure…


A short tutorial on Fuzzy Time Series

There are several methods of time series analysis and forecasting, from the traditional and well-established statistical tools (ARMA, ARIMA, SARIMA, Holt-Winters, etc.) to the newer computational intelligence tools (recurrent neural networks, LSTM, GRU, etc.). No method is perfect, and the one I am going to present here is no exception. But some key features distinguish Fuzzy Time Series and make it an attractive option:
• Readability
• Manageability
• Simplicity
• Scalability
From here on, I am going to assume that you have no background in machine learning (with a focus on fuzzy systems) or time series, and I will present the key concepts of these fields. Then the Fuzzy Time Series methods will be introduced with the help of the pyFTS library. Let's go!


Illustrated Guide to Recurrent Neural Networks

Hi, and welcome to an Illustrated Guide to Recurrent Neural Networks. I’m Michael, also known as LearnedVector. I’m a machine learning engineer in the A.I. voice assistant space. If you are just getting started in ML and want to get some intuition behind recurrent neural networks, this post is for you.