Data Science Simplified Part 7: Log-Log Regression Models

In the last few blog posts of this series, we discussed the simple linear regression model, the multivariate regression model, and methods for selecting the right model. Fernando has now created a better model.
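The log-log model in the title fits a power-law relationship y = a·x^b by taking logarithms of both sides, so that ordinary least squares applies. A minimal sketch in Python with synthetic data (the numbers are invented for illustration, not Fernando's actual model):

```python
import numpy as np

# Log-log regression sketch: fit the power law y = a * x^b by ordinary
# least squares on log-transformed data. Synthetic data for illustration.
rng = np.random.default_rng(0)
x = np.linspace(1, 100, 200)
y = 3.0 * x ** 0.5 * rng.lognormal(0, 0.05, size=x.size)  # noisy power law

# Fit log(y) = log(a) + b * log(x); polyfit returns [slope, intercept]
b, log_a = np.polyfit(np.log(x), np.log(y), 1)
a = np.exp(log_a)
# b estimates the elasticity: a 1% change in x gives roughly a b% change in y
```

The slope b of the log-log fit is directly interpretable as an elasticity, which is the usual reason for choosing this model form.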

How to build an image recognizer in R using just a few images

Microsoft Cognitive Services provides several APIs for image recognition, but if you want to build your own recognizer (or create one that works offline), you can use the new Image Featurizer capabilities of Microsoft R Server. The process of training an image recognition system requires LOTS of images — millions and millions of them. The process involves feeding those images into a deep neural network, and during that process the network generates ‘features’ from the image. These features might be versions of the image including just the outlines, or maybe the image with only the green parts. You could further boil those features down into a single number, say the length of the outline or the percentage of the image that is green. With enough of these ‘features’, you could use them in a traditional machine learning model to classify the images, or perform other recognition tasks.
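The idea of boiling an image down to a handful of numbers and handing those to a conventional classifier can be sketched in a few lines. This is a toy Python illustration of the concept only; the function names and features are invented, and the actual Microsoft R Server featurizer uses a pretrained deep network rather than hand-crafted features like these. The "images" here are random arrays so the sketch is self-contained:

```python
import numpy as np

# Toy featurizer: reduce an H x W x 3 image to a few numbers, as described
# above (e.g. "percentage of the image that is green", a crude outline
# measure). These features could then feed any traditional ML model.

def featurize(img):
    """img: H x W x 3 uint8 array -> small feature vector in [0, 1]."""
    # Fraction of pixels where green dominates red and blue
    green_fraction = np.mean(img[:, :, 1] > img[:, :, [0, 2]].max(axis=2))
    brightness = img.mean() / 255.0
    # Crude 'outline length': fraction of strong horizontal intensity jumps
    gray = img.mean(axis=2)
    edges = np.mean(np.abs(np.diff(gray, axis=1)) > 30)
    return np.array([green_fraction, brightness, edges])

rng = np.random.default_rng(1)
images = [rng.integers(0, 256, (32, 32, 3), dtype=np.uint8) for _ in range(5)]
features = np.stack([featurize(img) for img in images])
# 'features' is now a 5 x 3 matrix ready for a traditional classifier
```

The deep-network version works the same way in outline: the network's intermediate activations play the role of these hand-crafted numbers, just with far more of them.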

Data wrangling: Cleansing – Regular expressions (1/3)

Data wrangling is the process of importing, cleaning, and transforming raw data into actionable information for analysis. It is a time-consuming process, estimated to take about 60-80% of an analyst’s time. In this series we will go through that process. It will be a brief series, with the goal of sharpening the reader’s data wrangling skills. This is the fourth part of the series, and it covers the cleansing of the data used. In previous parts we learned how to import, reshape, and transform data; the rest of the series will be dedicated to the data cleansing process. In this post we will go through regular expressions: sequences of characters that define a search pattern, mainly used for pattern matching in text strings. In particular, we will cover the foundations of regular expression syntax.
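A small taste of regular expressions applied to a cleansing task, here using Python's standard `re` module (the messy strings below are invented for the example):

```python
import re

# Regular-expression cleansing sketch: pull numeric amounts out of
# inconsistently formatted strings, then normalize them to floats.
raw = ["  $1,299.00 ", "USD 87.5", "price: 3,450"]

# \d matches a digit, [\d,]* matches more digits or commas,
# (?:\.\d+)? optionally matches a decimal part
amounts = [re.search(r"\d[\d,]*(?:\.\d+)?", s).group() for s in raw]
cleaned = [float(a.replace(",", "")) for a in amounts]
# cleaned -> [1299.0, 87.5, 3450.0]
```

The same pattern syntax (character classes, quantifiers, optional groups) carries over to R, grep, and most other tools covered in series like this one.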

Understanding overfitting: an inaccurate meme in supervised learning

Aim: In this post, we will give an intuition for why model validation, understood as approximating the generalization error of a fitted model, and the detection of overfitting cannot be resolved simultaneously on a single model. After a conceptual introduction, we will work through a concrete example workflow covering overfitting, overtraining, and a typical final model-building stage. We will avoid Bayesian interpretations and regularisation, restricting the post to regression and cross-validation: regularisation has different ramifications due to its mathematical properties, and prior distributions have different implications in Bayesian statistics. We assume an introductory background in machine learning, so this is not a beginner’s tutorial.
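The classic overfitting picture referred to above can be reproduced in a few lines: as model capacity grows, training error keeps falling while held-out error eventually stops improving. A self-contained Python sketch on synthetic data (the split and degrees are arbitrary choices for illustration, not the post's actual workflow):

```python
import numpy as np

# Overfitting sketch: polynomial regression of increasing degree on noisy
# sine data. Training error decreases with degree; held-out error does not.
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(-3, 3, 60))
y = np.sin(x) + rng.normal(0, 0.3, x.size)

train, test = np.arange(0, 60, 2), np.arange(1, 60, 2)  # alternating split

def mse(degree, fit_idx, eval_idx):
    coefs = np.polyfit(x[fit_idx], y[fit_idx], degree)
    pred = np.polyval(coefs, x[eval_idx])
    return np.mean((pred - y[eval_idx]) ** 2)

degrees = (1, 3, 12)
train_err = [mse(d, train, train) for d in degrees]
test_err = [mse(d, train, test) for d in degrees]
# Training error shrinks monotonically with degree, while the held-out
# error of the high-degree model reflects fitting the noise.
```

This is exactly why a single train/validation split cannot both select a model and honestly estimate its generalization error: the validation score of the chosen model is itself optimistically biased by the selection.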

Shiny 1.0.4

Shiny 1.0.4 is now available on CRAN. For most Shiny users, the most exciting news is that file inputs now support dragging and dropping. It is now possible to add and remove tabs from a tabPanel with the new functions insertTab(), appendTab(), prependTab(), and removeTab(), and to hide and show tabs with hideTab() and showTab(). Shiny also has a new function, onStop(), which registers a callback that will execute when the application exits. (Note that this is different from the existing onSessionEnded(), which registers a callback that executes when a user’s session ends; an application can serve multiple sessions.) This can be useful for cleaning up resources, such as database connections, when an application exits. This release of Shiny also has many minor new features and bug fixes. For the full set of changes, see the changelog.

Contouring learning rate to optimize neural nets

The learning rate controls how much a neural network’s weights are adjusted on each update, and it determines how quickly (and whether at all) the network converges to a good minimum of the loss for the desired output. In plain Stochastic Gradient Descent (SGD), the learning rate is unrelated to the shape of the error gradient: a single global learning rate is used, independent of the gradient. However, there are many modifications to the original SGD update rule that relate the learning rate to the magnitude and orientation of the error gradient.
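The contrast between a fixed global rate and a gradient-adapted rate can be shown on a toy problem. Below, a Python sketch compares the plain SGD update with Adagrad, one well-known modification that shrinks the effective step as squared gradients accumulate; the quadratic loss f(w) = w² stands in for a real network's loss:

```python
import numpy as np

# Plain SGD vs. Adagrad on f(w) = w^2, whose gradient is 2w.

def grad(w):
    return 2.0 * w

# Plain SGD: one global learning rate, independent of the gradient history
w, lr = 5.0, 0.1
for _ in range(50):
    w -= lr * grad(w)

# Adagrad: the step for each parameter shrinks as squared gradients
# accumulate, so the rate adapts to the gradient's magnitude over time
v, cache, lr0, eps = 5.0, 0.0, 1.0, 1e-8
for _ in range(50):
    g = grad(v)
    cache += g * g
    v -= lr0 * g / (np.sqrt(cache) + eps)
# Both runs drive the parameter toward the optimum at 0; Adagrad picks
# its step sizes from the accumulated gradient statistics.
```

Other modifications in the same family (RMSProp, Adam) replace the raw accumulation with a decaying average, which avoids Adagrad's ever-shrinking steps.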

QVC: Real-Time Data is the Future of ECommerce

Visit most malls today and you’ll witness an industry under siege. Retailers with physical stores have been struggling to compete with online competition as customers equipped with mobile phones check prices, read product reviews, and do other research to aid their shopping. Customers are moving at top speed, and physical stores have a tough time keeping up.

Generative Adversarial Networks (GANs): Engine and Applications

Generative adversarial networks (GANs) are a class of neural networks used in unsupervised machine learning. They help solve tasks such as generating images from descriptions, producing high-resolution images from low-resolution ones, predicting which drug could treat a certain disease, and retrieving images that contain a given pattern. The Statsbot team asked data scientist Anton Karazeev to introduce the GAN engine and its applications in everyday life.

Next Generation, Artificial Intelligence and Machine Learning

Artificial Intelligence (A.I.) will soon be at the heart of every major technological system in the world, including cyber and homeland security, payments, financial markets, biotech, healthcare, marketing, natural language processing, computer vision, electrical grids, nuclear power plants, air traffic control, and the Internet of Things (IoT). While A.I. seems to have only recently captured the attention of humanity, the reality is that it has been around for over 60 years as a technological discipline. In the late 1950s, Arthur Samuel wrote a checkers-playing program that could learn from its mistakes and thus, over time, became better at playing the game. MYCIN, the first rule-based expert system, was developed in the early 1970s and was capable of diagnosing blood infections based on the results of various medical tests; it was able to perform better than non-specialist doctors. While Artificial Intelligence is becoming a major staple of technology, few people understand the benefits and shortcomings of A.I. and Machine Learning technologies.

A New Beginning to Deep Learning

The first AI winter occurred in the 1970s, followed by another in the 1980s, for various reasons but largely due to limited resources. I agree that there have been many major breakthroughs, but here’s my attempt to illustrate the timeline of major events…

The Rise of GPU Databases

The recent but noticeable shift from CPUs to GPUs is mainly due to the unique benefits GPUs bring to sectors like AdTech, finance, telco, retail, and security/IT. We examine where GPU databases shine.