Two years ago, Google unveiled its Tensor Processing Units, or TPUs – specialized chips that live in the company's data centers and make light work of AI tasks. Now, the company is moving its AI expertise down from the cloud, and has taken the wraps off its new Edge TPU; a tiny AI accelerator that will carry out machine learning jobs in IoT devices. The Edge TPU is designed to do what's known as 'inference.' This is the part of machine learning where an algorithm actually carries out the task it was trained to do; like, for example, recognizing an object in a picture. Google's server-based TPUs are optimized for the training part of this process, while these new Edge TPUs will do the inference. These new chips are destined to be used in enterprise jobs, not your next smartphone. That means tasks like automating quality control checks in factories. Doing this sort of job on-device has a number of advantages over using hardware that has to send data over the internet for analysis. On-device machine learning is generally more secure, experiences less downtime, and delivers faster results. That's the sales pitch, anyway.
To turn data science from a scam into a source of value, enterprises need to turn their data science programs from research endeavors into integral parts of their business and processes. At the same time, they need to lay down a true information architecture foundation. We frame this as the AI ladder: data foundation, analytics, machine learning, AI/cognitive.
Clustering functional data is mostly based on projecting the curves onto an adequate basis and building random-effects models of the basis coefficients. The parameters can be fitted with an EM algorithm. Alternatively, distance models based on the coefficients are used in the literature. As in the case of clustering multidimensional data, a variety of derivations of different models have been published. Although their calculation procedures are similar, their implementations differ considerably, including distinct hyperparameters and input data formats. This makes it difficult for the user to apply them and, in particular, to compare them. Furthermore, they are mostly limited to specific basis functions. This paper aims to show the common elements between existing models in highly cited articles, first on a theoretical basis. Their implementations are then analyzed, and it is illustrated how they could be improved and extended to a more general level. Special consideration is given to those models designed for sparse measurements. The work resulted in the R package funcy, which was built to integrate the modified and extended algorithms into a unique framework.
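The basis-projection idea is easy to sketch outside R. The Python below is a toy illustration only (it is not funcy's interface): each sampled curve is reduced to its least-squares coefficients in a small polynomial basis, and curves are then clustered by distance in coefficient space.

```python
import math

def basis_coeffs(t, y, degree=2):
    """Least-squares fit of a sampled curve y(t) to the polynomial basis
    {1, t, ..., t^degree} via the normal equations; returns coefficients."""
    k = degree + 1
    X = [[ti ** j for j in range(k)] for ti in t]          # design matrix
    A = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    rhs = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    # Solve A b = rhs by Gaussian elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            rhs[r] -= f * rhs[col]
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (rhs[r] - sum(A[r][c] * b[c] for c in range(r + 1, k))) / A[r][r]
    return b

def cluster_coeffs(coeffs, centers):
    """Assign each coefficient vector to its nearest centre in coefficient space."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    return [min(range(len(centers)), key=lambda c: dist(vec, centers[c]))
            for vec in coeffs]
```

Random-effects model fitting via EM, as in the papers discussed, replaces the naive distance step; the shared element is that all approaches operate on the basis coefficients rather than the raw curves.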
Judging by the list of countries putting out policy papers on AI and automation technologies, there is very strong interest in AI across the globe. In order to assess the current state of readiness across regions, we recently conducted a survey (full report forthcoming) on the state of adoption of machine learning tools and technologies (a lot of what is currently described as AI is really ML). The survey yielded 11,400+ respondents, including 2,000 from Europe.
After reading this tutorial you'll know how to embed R and Python scripts in T-SQL statements, and which data types are used to pass data between SQL and Python/R.
Comparing multiple models is one of the core but also one of the trickiest elements of data analysis. Under a Bayesian framework, the loo package in R allows you to derive (among other things) leave-one-out cross-validation metrics to compare the predictive abilities of different models.
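Note that loo computes fast approximations (PSIS-LOO) for Bayesian models rather than literally refitting; but the underlying idea is brute-force leave-one-out scoring, which the stdlib Python sketch below illustrates with two hypothetical toy models (this is not the loo API).

```python
def loo_scores(data, fit, predict):
    """Leave-one-out cross-validation: for each point, fit on all the others
    and record the squared error on the held-out point."""
    errors = []
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        model = fit(train)
        errors.append((predict(model, x) - y) ** 2)
    return errors

# Two toy "models": predict the training mean, or always predict zero.
fit_mean = lambda train: sum(y for _, y in train) / len(train)
fit_zero = lambda train: 0.0
predict_const = lambda model, x: model

data = [(i, 5.0) for i in range(10)]      # constant response
mean_err = sum(loo_scores(data, fit_mean, predict_const))
zero_err = sum(loo_scores(data, fit_zero, predict_const))
```

The model with the lower total held-out error has the better estimated predictive ability; loo's elpd-based comparison plays the same role for posterior predictive densities.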
In an earlier post, I focused on an in-depth visit with CHAID (Chi-square automatic interaction detection). Quoting myself, I said ‘As the name implies it is fundamentally based on the venerable Chi-square test – and while not the most powerful (in terms of detecting the smallest possible differences) or the fastest, it really is easy to manage and more importantly to tell the story after using it’. In this post I’ll spend a little time comparing CHAID with a random forest algorithm in the ranger library and with a gradient boosting algorithm via the xgboost library. I’ll use the exact same data set for all three so we can draw some easy comparisons about their speed and their accuracy.
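The shape of such a comparison — same data for every model, record fit time and accuracy — is worth pinning down. The Python below is a generic harness with two deliberately trivial stand-in classifiers (not CHAID, ranger, or xgboost; purely illustrative):

```python
import time

def majority_fit(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def majority_predict(model, x):
    return model

def centroid_fit(train):
    """Nearest-centroid: remember each class's mean feature value."""
    groups = {}
    for x, y in train:
        groups.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in groups.items()}

def centroid_predict(model, x):
    return min(model, key=lambda label: abs(x - model[label]))

def benchmark(train, test, fit, predict):
    """Fit on train, score on test; return (accuracy, elapsed seconds)."""
    t0 = time.perf_counter()
    model = fit(train)
    acc = sum(predict(model, x) == y for x, y in test) / len(test)
    return acc, time.perf_counter() - t0
```

Swapping in real learners while keeping the data split and the benchmark call fixed is what makes the speed/accuracy comparison fair.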
Humans have the magical ability to plan for future events, for future gain. It's not quite a uniquely human trait: apparently ravens can match a 4-year-old. An abundance of data, and some very nice R packages, make our ability to plan all the more powerful. A couple of months ago we looked at sales from an historical perspective in Digital Marketplace. In this post, six months on, we'll use the sales data to March 31st to model a time-series forecast for the next two years. The techniques apply to any time series with characteristics of trend, seasonality or longer-term cycles. Why forecast sales? Business plans require a budget, e.g. for resources, marketing and office space. A good projection of revenue provides the foundation for the budget. And, for an established business with historical data, time-series forecasting is one way to deliver a robust projection. The forecast assumes one continues to do what one's doing, so it provides a good starting point. Then one might, for example, add assumptions about new products or services.
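The post's modeling is done with R's time-series tooling; purely to illustrate the trend-plus-seasonality decomposition it relies on, here is a stdlib Python sketch that fits a linear trend by ordinary least squares and adds seasonal means of the detrended series:

```python
def forecast(series, period, horizon):
    """Forecast by OLS linear trend plus additive seasonal means
    estimated from the detrended series."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
             / sum((t - t_mean) ** 2 for t in range(n)))
    intercept = y_mean - slope * t_mean
    # Remove the fitted trend, then average each seasonal position.
    detrended = [y - (intercept + slope * t) for t, y in enumerate(series)]
    seasonal = [sum(detrended[s::period]) / len(detrended[s::period])
                for s in range(period)]
    return [intercept + slope * t + seasonal[t % period]
            for t in range(n, n + horizon)]
```

Real forecasting packages (ETS, ARIMA and friends) estimate these components jointly and give prediction intervals, but the decomposition into trend, seasonality and remainder is the same starting point.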
A while ago I was running good old sample and comparing its performance to my lpm2_kdtree function in the BalancedSampling package (Grafström and Lisic, 2016). In this comparison I noticed that sample was in some cases slower than my balanced sampling method when using sampling weights.
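The usual remedy when repeated weighted draws are slow is to precompute cumulative weights once and binary-search them per draw — essentially what CPython's random.choices does internally. A stdlib sketch of that pattern (illustrative, not the BalancedSampling algorithm):

```python
import bisect
import itertools
import random

def weighted_sample(population, weights, k, rng=random):
    """Draw k items with replacement, with probability proportional to weights.
    Cumulative weights are computed once; each draw is then a binary search."""
    cum = list(itertools.accumulate(weights))
    total = cum[-1]
    return [population[bisect.bisect(cum, rng.random() * total)]
            for _ in range(k)]
```

For without-replacement designs with inclusion probabilities — the setting of lpm2_kdtree — the algorithms are quite different, which is why a spatially balanced method can end up beating a general-purpose weighted sampler.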
Data-driven machine learning for predictive modeling problems (classification, regression, or survival analysis) typically involves a number of steps, beginning with data preprocessing and ending with performance evaluation. A large number of packages providing tools for the individual steps are available for R, but there is a lack of tools for facilitating rigorous performance evaluation of the complete procedures assembled from them by means of cross-validation, bootstrap, or similar methods. Such a tool should strictly prevent test set observations from influencing model training and meta-parameter tuning, so-called information leakage, in order not to produce overly optimistic performance estimates. Here we present a new package for R denoted emil (evaluation of modeling without information leakage) that offers this form of performance evaluation. It provides a transparent and highly customizable framework for facilitating the assembly, execution, performance evaluation, and interpretation of complete procedures for classification, regression, and survival analysis. The components of package emil have been designed to be as modular and general as possible to allow users to combine, replace, and extend them if needed. Package emil was also developed with scalability in mind and has a small computational overhead, which is a key requirement for analyzing the very big data sets now available in fields like medicine, physics, and finance. First, package emil's functionality and usage are explained. Then three specific application examples are presented to show its potential in terms of parallelization, customization for survival analysis, and development of ensemble models. Finally, a brief comparison to similar software is provided.
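The leakage point is easy to demonstrate: any preprocessing statistic (a mean, a scaling factor, a selected feature set) must be computed from the training folds only. A stdlib Python sketch of the pattern (the names here are hypothetical, not emil's API):

```python
import statistics

def cross_validate(xs, ys, k, fit, predict):
    """k-fold CV where preprocessing (here: standardization) is fitted on the
    training folds only, so no test-fold information leaks into training."""
    n = len(xs)
    folds = [list(range(i, n, k)) for i in range(k)]
    errors = []
    for test_idx in folds:
        train_idx = [i for i in range(n) if i not in test_idx]
        # Scaling parameters come from the training folds alone.
        mu = statistics.mean(xs[i] for i in train_idx)
        sd = statistics.stdev(xs[i] for i in train_idx) or 1.0
        scale = lambda v: (v - mu) / sd
        model = fit([(scale(xs[i]), ys[i]) for i in train_idx])
        errors += [(predict(model, scale(xs[i])) - ys[i]) ** 2
                   for i in test_idx]
    return sum(errors) / len(errors)
```

Computing mu and sd on the full data before splitting — a common mistake — is exactly the leakage such a framework is built to rule out.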
We describe the R package rmcfs, which implements an algorithm for ranking features from high-dimensional data according to their importance for a given supervised classification task. The ranking is performed prior to addressing the classification task per se. The package is a new and extended version of the MCFS (Monte Carlo feature selection) algorithm, an early version of which was published in 2005. It provides an easy R interface, a set of tools to review results, and the new ID (interdependency discovery) component. The algorithm can be used on continuous and/or categorical features (e.g., gene expression and phenotypic data) to produce an objective ranking of features, with a statistically well-defined cutoff between informative and non-informative ones. Moreover, a directed ID graph presenting the interdependencies between informative features is provided.
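The Monte Carlo idea itself is simple to sketch: train many weak models on random resamples and credit each feature with the accuracy it achieves. The Python below is a toy illustration using threshold stumps (rmcfs's actual procedure uses ensembles of decision trees over random feature subsets and a formal significance cutoff):

```python
import random

def stump_accuracy(vals, y):
    """Best accuracy achievable by thresholding one feature (either polarity)."""
    best = 0.0
    for thr in sorted(set(vals)):
        acc = sum((v > thr) == bool(t) for v, t in zip(vals, y)) / len(y)
        best = max(best, acc, 1 - acc)
    return best

def mc_feature_ranking(X, y, n_trials=300, rng=None):
    """Score each feature by its average stump accuracy over many
    bootstrap resamples of the rows."""
    rng = rng or random.Random(1)
    n, p = len(X), len(X[0])
    score, used = [0.0] * p, [0] * p
    for _ in range(n_trials):
        f = rng.randrange(p)
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        score[f] += stump_accuracy([X[i][f] for i in rows],
                                   [y[i] for i in rows])
        used[f] += 1
    return [score[f] / max(used[f], 1) for f in range(p)]
```

Averaging over many random resamples is what makes the ranking stable and lets a cutoff between informative and uninformative features be calibrated statistically.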
Supervised neural networks have been applied as a machine learning technique to identify and predict emergent patterns among multiple variables. A common criticism of these methods is the inability to characterize relationships among variables from a fitted model. Although several techniques have been proposed to ‘illuminate the black box’, they have not been made available in an open-source programming environment. This article describes the NeuralNetTools package that can be used for the interpretation of supervised neural network models created in R. Functions in the package can be used to visualize a model using a neural network interpretation diagram, evaluate variable importance by disaggregating the model weights, and perform a sensitivity analysis of the response variables to changes in the input variables. Methods are provided for objects from many of the common neural network packages in R, including caret, neuralnet, nnet, and RSNNS. The article provides a brief overview of the theoretical foundation of neural networks, a description of the package structure and functions, and an applied example to provide a context for model development with NeuralNetTools. Overall, the package provides a toolset for neural networks that complements existing quantitative techniques for data-intensive exploration.
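The weight-disaggregation method used for variable importance is Garson's algorithm; for a single-hidden-layer, single-output network it reduces to a few lines. A Python sketch of the calculation (NeuralNetTools' garson function implements this for R model objects; bias weights are ignored here for brevity):

```python
def garson_importance(w_ih, w_ho):
    """Garson's algorithm: relative importance of each input to a single-output
    network, from input-hidden weights w_ih[i][h] and hidden-output weights
    w_ho[h]. All weights enter as absolute values."""
    n_in, n_hid = len(w_ih), len(w_ho)
    contrib = [0.0] * n_in
    for h in range(n_hid):
        denom = sum(abs(w_ih[i][h]) for i in range(n_in))
        if denom == 0:
            continue  # hidden node receives no input signal
        for i in range(n_in):
            contrib[i] += (abs(w_ih[i][h]) / denom) * abs(w_ho[h])
    total = sum(contrib)
    return [c / total for c in contrib]
```

Because the absolute values discard sign, Garson importance says how strongly an input acts through the network but not in which direction; the package's sensitivity analysis complements it on that point.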
A single machine learning model may not always fit the data, and optimizing its parameters may not help. One solution is to combine multiple models to fit the data better. This tutorial discusses the importance of ensemble learning, with gradient boosting as a case study.
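The core mechanic of gradient boosting — each new weak learner fits the residuals (the negative gradients of squared loss) of the ensemble so far — fits in a short sketch. Illustrative Python with 1-D regression stumps (a minimal version; production libraries add regularization, second-order terms, and much more):

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on 1-D inputs, minimizing SSE."""
    best = None
    for thr in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        lmean = sum(left) / len(left) if left else 0.0
        rmean = sum(right) / len(right) if right else 0.0
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda v: lmean if v <= thr else rmean

def gradient_boost(x, y, n_rounds=50, lr=0.3):
    """Gradient boosting for squared loss: each stump is fitted to the
    residuals of the current ensemble, then added with a small learning rate."""
    pred = [0.0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda v: sum(lr * s(v) for s in stumps)
```

Each round shrinks the remaining residual, so many weak stumps combine into a strong fit — which is exactly the ensemble-learning argument the tutorial makes.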