# Magister Dixit

“The future is undoubtedly attached to uncertainty, and this uncertainty can be estimated.” Data Science Heroes

# If you did not already know

Compressed Randomized UTV (CoR-UTV)
Low-rank matrix approximations play a fundamental role in numerical linear algebra and signal processing applications. This paper introduces a novel rank-revealing matrix decomposition algorithm termed Compressed Randomized UTV (CoR-UTV) decomposition along with a CoR-UTV variant aided by the power method technique. CoR-UTV is primarily developed to compute an approximation to a low-rank input matrix by making use of random sampling schemes. Given a large and dense matrix of size $m\times n$ with numerical rank $k$, where $k \ll \text{min}$, CoR-UTV requires a few passes over the data, and runs in $O(mnk)$ floating-point operations. Furthermore, CoR-UTV can exploit modern computational platforms and, consequently, can be optimized for maximum efficiency. CoR-UTV is simple and accurate, and outperforms reported alternative methods in terms of efficiency and accuracy. Simulations with synthetic data as well as real data in image reconstruction and robust principal component analysis applications support our claims. …

Ordinal Monte Carlo Tree Search (Ordinal MCTS)
In many problem settings, most notably in game playing, an agent receives a possibly delayed reward for its actions. Often, those rewards are handcrafted and not naturally given. Even simple terminal-only rewards, like winning equals 1 and losing equals -1, can not be seen as an unbiased statement, since these values are chosen arbitrarily, and the behavior of the learner may change with different encodings, such as setting the value of a loss to -0:5, which is often done in practice to encourage learning. It is hard to argue about good rewards and the performance of an agent often depends on the design of the reward signal. In particular, in domains where states by nature only have an ordinal ranking and where meaningful distance information between game state values are not available, a numerical reward signal is necessarily biased. In this paper, we take a look at Monte Carlo Tree Search (MCTS), a popular algorithm to solve MDPs, highlight a reoccurring problem concerning its use of rewards, and show that an ordinal treatment of the rewards overcomes this problem. Using the General Video Game Playing framework we show a dominance of our newly proposed ordinal MCTS algorithm over preference-based MCTS, vanilla MCTS and various other MCTS variants. …

Scaled Exponentially-Regularized Linear Unit (SERLU)
Recently, self-normalizing neural networks (SNNs) have been proposed with the intention to avoid batch or weight normalization. The key step in SNNs is to properly scale the exponential linear unit (referred to as SELU) to inherently incorporate normalization based on central limit theory. SELU is a monotonically increasing function, where it has an approximately constant negative output for large negative input. In this work, we propose a new activation function to break the monotonicity property of SELU while still preserving the self-normalizing property. Differently from SELU, the new function introduces a bump-shaped function in the region of negative input by regularizing a linear function with a scaled exponential function, which is referred to as a scaled exponentially-regularized linear unit (SERLU). The bump-shaped function has approximately zero response to large negative input while being able to push the output of SERLU towards zero mean statistically. To effectively combat over-fitting, we develop a so-called shift-dropout for SERLU, which includes standard dropout as a special case. Experimental results on MNIST, CIFAR10 and CIFAR100 show that SERLU-based neural networks provide consistently promising results in comparison to other 5 activation functions including ELU, SELU, Swish, Leakly ReLU and ReLU. …

# R Packages worth a look

Comprehensive, User-Friendly Toolkit for Probing Interactions (interactions)
A suite of functions for conducting and interpreting analysis of statistical interaction in regression models that was formerly part of the ‘jtools’ pa …

Quick Serialization of R Objects (qs)
Provides functions for quickly writing and reading any R object to and from disk. This package makes use of the ‘zstd’ library for compression and deco …

Simple ‘htmlwidgets’ Image Viewer with WebGL Brightness/Contrast (imageviewer)
Display a 2D-matrix data as a interactive zoomable gray-scale image viewer, providing tools for manual data inspection. The viewer window shows cursor …

# Document worth reading: “A Survey of Neuromorphic Computing and Neural Networks in Hardware”

Neuromorphic computing has come to refer to a variety of brain-inspired computers, devices, and models that contrast the pervasive von Neumann computer architecture. This biologically inspired approach has created highly connected synthetic neurons and synapses that can be used to model neuroscience theories as well as solve challenging machine learning problems. The promise of the technology is to create a brain-like ability to learn and adapt, but the technical challenges are significant, starting with an accurate neuroscience model of how the brain works, to finding materials and engineering breakthroughs to build devices to support these models, to creating a programming framework so the systems can learn, to creating applications with brain-like capabilities. In this work, we provide a comprehensive survey of the research and motivations for neuromorphic computing over its history. We begin with a 35-year review of the motivations and drivers of neuromorphic computing, then look at the major research areas of the field, which we define as neuro-inspired models, algorithms and learning approaches, hardware and devices, supporting systems, and finally applications. We conclude with a broad discussion on the major research topics that need to be addressed in the coming years to see the promise of neuromorphic computing fulfilled. The goals of this work are to provide an exhaustive review of the research conducted in neuromorphic computing since the inception of the term, and to motivate further work by illuminating gaps in the field where new research is needed. A Survey of Neuromorphic Computing and Neural Networks in Hardware

# Whats new on arXiv

Variational autoencoders are powerful algorithms for identifying dominant latent structure in a single dataset. In many applications, however, we are interested in modeling latent structure and variation that are enriched in a target dataset compared to some background—e.g. enriched in patients compared to the general population. Contrastive learning is a principled framework to capture such enriched variation between the target and background, but state-of-the-art contrastive methods are limited to linear models. In this paper, we introduce the contrastive variational autoencoder (cVAE), which combines the benefits of contrastive learning with the power of deep generative models. The cVAE is designed to identify and enhance salient latent features. The cVAE is trained on two related but unpaired datasets, one of which has minimal contribution from the salient latent features. The cVAE explicitly models latent features that are shared between the datasets, as well as those that are enriched in one dataset relative to the other, which allows the algorithm to isolate and enhance the salient latent features. The algorithm is straightforward to implement, has a similar run-time to the standard VAE, and is robust to noise and dataset purity. We conduct experiments across diverse types of data, including gene expression and facial images, showing that the cVAE effectively uncovers latent structure that is salient in a particular analysis.
Stochastic multiplayer games (SMGs) have gained attention in the field of strategy synthesis for multi-agent reactive systems. However, standard SMGs are limited to modeling systems where all agents have full knowledge of the state of the game. In this paper, we introduce delayed-action games (DAGs) formalism that simulates hidden-information games (HIGs) as SMGs, by eliminating hidden information by delaying a player’s actions. The elimination of hidden information enables the usage of SMG off-the-shelf model checkers to implement HIGs. Furthermore, we demonstrate how a DAG can be decomposed into a number of independent subgames. Since each subgame can be independently explored, parallel computation can be utilized to reduce the model checking time, while alleviating the state space explosion problem that SMGs are notorious for. In addition, we propose a DAG-based framework for strategy synthesis and analysis. Finally, we demonstrate applicability of the DAG-based synthesis framework on a case study of a human-on-the-loop unmanned-aerial vehicle system that may be under stealthy attack, where the proposed framework is used to formally model, analyze and synthesize security-aware strategies for the system.
State-of-the-art models are now trained with billions of parameters, reaching hardware limits in terms of memory consumption. This has created a recent demand for memory-efficient optimizers. To this end, we investigate the limits and performance tradeoffs of memory-efficient adaptively preconditioned gradient methods. We propose extreme tensoring for high-dimensional stochastic optimization, showing that an optimizer needs very little memory to benefit from adaptive preconditioning. Our technique applies to arbitrary models (not necessarily with tensor-shaped parameters), and is accompanied by regret and convergence guarantees, which shed light on the tradeoffs between preconditioner quality and expressivity. On a large-scale NLP model, we reduce the optimizer memory overhead by three orders of magnitude, without degrading performance.
The main goal of statistical learning theory is to provide a fundamental framework for the problem of decision making and model construction based on sets of data. Here, we present a brief introduction to the fundamentals of statistical learning theory, in particular the difference between empirical and structural risk minimization, including one of its most prominent implementations, i.e. the Support Vector Machine.
We present $\alpha$-loss, $\alpha \in [1,\infty]$, a tunable loss function for binary classification that bridges log-loss ($\alpha=1$) and $0$$1$ loss ($\alpha = \infty$). We prove that $\alpha$-loss has an equivalent margin-based form and is classification-calibrated, two desirable properties for a good surrogate loss function for the ideal yet intractable $0$$1$ loss. For logistic regression-based classification, we provide an upper bound on the difference between the empirical and expected risk for $\alpha$-loss by exploiting its Lipschitzianity along with recent results on the landscape features of empirical risk functions. Finally, we show that $\alpha$-loss with $\alpha = 2$ performs better than log-loss on MNIST for logistic regression.
Marginal Structural Models (MSM)~\cite{Robins00} are the most popular models for causal inference from time-series observational data. However, they have two main drawbacks: (a) they do not capture subject heterogeneity, and (b) they only consider fixed time intervals and do not scale gracefully with longer intervals. In this work, we propose a new family of MSMs to address these two concerns. We model the potential outcomes as a three-dimensional tensor of low rank, where the three dimensions correspond to the agents, time periods and the set of possible histories. Unlike the traditional MSM, we allow the dimensions of the tensor to increase with the number of agents and time periods. We set up a weighted tensor completion problem as our estimation procedure, and show that the solution to this problem converges to the true model in an appropriate sense. Then we show how to solve the estimation problem, providing conditions under which we can approximately and efficiently solve the estimation problem. Finally we propose an algorithm based on projected gradient descent, which is easy to implement, and evaluate its performance on a simulated dataset.
We consider the problem of estimating the probability distribution of a discrete random variable in the setting where the observations are corrupted by outliers. Assuming that the discrete variable takes k values, the unknown parameter p is a k-dimensional vector belonging to the probability simplex. We first describe various settings of contamination and discuss the relation between these settings. We then establish minimax rates when the quality of estimation is measured by the total-variation distance, the Hellinger distance, or the L2-distance between two probability measures. Our analysis reveals that the minimax rates associated to these three distances are all different, but they are all attained by the maximum likelihood estimator. Note that the latter is efficiently computable even when the dimension is large. Some numerical experiments illustrating our theoretical findings are reported.
Recently, Generative Adversarial Networks (GANs) have emerged as a popular alternative for modeling complex high dimensional distributions. Most of the existing works implicitly assume that the clean samples from the target distribution are easily available. However, in many applications, this assumption is violated. In this paper, we consider the observation setting when the samples from target distribution are given by the superposition of two structured components and leverage GANs for learning the structure of the components. We propose two novel frameworks: denoising-GAN and demixing-GAN. The denoising-GAN assumes access to clean samples from the second component and try to learn the other distribution, whereas demixing-GAN learns the distribution of the components at the same time. Through extensive numerical experiments, we demonstrate that proposed frameworks can generate clean samples from unknown distributions, and provide competitive performance in tasks such as denoising, demixing, and compressive sensing.
Many modern neural network architectures are trained in an overparameterized regime where the parameters of the model exceed the size of the training dataset. Sufficiently overparameterized neural network architectures in principle have the capacity to fit any set of labels including random noise. However, given the highly nonconvex nature of the training landscape it is not clear what level and kind of overparameterization is required for first order methods to converge to a global optima that perfectly interpolate any labels. A number of recent theoretical works have shown that for very wide neural networks where the number of hidden units is polynomially large in the size of the training data gradient descent starting from a random initialization does indeed converge to a global optima. However, in practice much more moderate levels of overparameterization seems to be sufficient and in many cases overparameterized models seem to perfectly interpolate the training data as soon as the number of parameters exceed the size of the training data by a constant factor. Thus there is a huge gap between the existing theoretical literature and practical experiments. In this paper we take a step towards closing this gap. Focusing on shallow neural nets and smooth activations, we show that (stochastic) gradient descent when initialized at random converges at a geometric rate to a nearby global optima as soon as the square-root of the number of network parameters exceeds the size of the training data. Our results also benefit from a fast convergence rate and continue to hold for non-differentiable activations such as Rectified Linear Units (ReLUs).
The advent of modern technology, permitting the measurement of thousands of characteristics simultaneously, has given rise to floods of data characterized by many large or even huge datasets. This new paradigm presents extraordinary challenges to data analysis and the question arises: how can conventional data analysis methods, devised for moderate or small datasets, cope with the complexities of modern data? The case of high dimensional data is particularly revealing of some of the drawbacks. We look at the case where the number of characteristics measured in an object is at least the number of observed objects and conclude that this configuration leads to geometrical and mathematical oddities and is an insurmountable barrier for the direct application of traditional methodologies. If scientists are going to ignore fundamental mathematical results arrived at in this paper and blindly use software to analyze data, the results of their analyses may not be trustful, and the findings of their experiments may never be validated. That is why new methods together with the wise use of traditional approaches are essential to progress safely through the present reality.
We study the interplay between memorization and generalization of overparametrized networks in the extreme case of a single training example. The learning task is to predict an output which is as similar as possible to the input. We examine both fully-connected and convolutional networks that are initialized randomly and then trained to minimize the reconstruction error. The trained networks take one of the two forms: the constant function (‘memorization’) and the identity function (‘generalization’). We show that different architectures exhibit vastly different inductive bias towards memorization and generalization. An important consequence of our study is that even in extreme cases of overparameterization, deep learning can result in proper generalization.
This paper introduces a new method for model selection and more generally hyperparameter selection in machine learning. The paper first proves a relationship between generalization error and a difference of description lengths of the training data; we call this difference differential description length (DDL). This allows prediction of generalization error from the training data \emph{alone} by performing encoding of the training data. This can now be used for model selection by choosing the model that has the smallest predicted generalization error. We show how this encoding can be done for linear regression and neural networks. We provide experiments showing that this leads to smaller generalization error than cross-validation and traditional MDL and Bayes methods.
Originally inspired by neurobiology, deep neural network models have become a powerful tool of machine learning and artificial intelligence, where they are used to approximate functions and dynamics by learning from examples. Here we give a brief introduction to neural network models and deep learning for biologists. We introduce feedforward and recurrent networks and explain the expressive power of this modeling framework and the backpropagation algorithm for setting the parameters. Finally, we consider how deep neural networks might help us understand the brain’s computations.
If we assume that earthquakes are chaotic, and influenced locally then chaos theory suggests that there should be a temporal association between earthquakes in a local region that should be revealed with statistical examination. To date no strong relationship has been shown (refs not prediction). However, earthquakes are basically failures of structured material systems, and when multiple failure mechanisms are present, prediction of failure is strongly inhibited without first separating the mechanisms. Here we show that by separating earthquakes statistically, based on their central tensor moment structure, along lines first suggested by a separation into mechanisms according to depth of the earthquake, a strong indication of temporal association appears. We show this in earthquakes above 200 Km along the pacific ring of fire, with a positive association in time between earthquakes of the same statistical type and a negative association in time between earthquakes of different types. Whether this can reveal either useful mechanistic information to seismologists, or can result in useful forecasts remains to be seen.
We study a classic algorithmic problem through the lens of statistical learning. That is, we consider a matching problem where the input graph is sampled from some distribution. This distribution is unknown to the algorithm; however, an additional graph which is sampled from the same distribution is given during a training phase (preprocessing). More specifically, the algorithmic problem is to match $k$ out of $n$ items that arrive online to $d$ categories ($d\ll k \ll n$). Our goal is to design a two-stage online algorithm that retains a small subset of items in the first stage which contains an offline matching of maximum weight. We then compute this optimal matching in a second stage. The added statistical component is that before the online matching process begins, our algorithms learn from a training set consisting of another matching instance drawn from the same unknown distribution. Using this training set, we learn a policy that we apply during the online matching process. We consider a class of online policies that we term \emph{thresholds policies}. For this class, we derive uniform convergence results both for the number of retained items and the value of the optimal matching. We show that the number of retained items and the value of the offline optimal matching deviate from their expectation by $O(\sqrt{k})$. This requires usage of less-standard concentration inequalities (standard ones give deviations of $O(\sqrt{n})$). Furthermore, we design an algorithm that outputs the optimal offline solution with high probability while retaining only $O(k\log \log n)$ items in expectation.
We study online linear regression problems in a distributed setting, where the data is spread over a network. In each round, each network node proposes a linear predictor, with the objective of fitting the \emph{network-wide} data. It then updates its predictor for the next round according to the received local feedback and information received from neighboring nodes. The predictions made at a given node are assessed through the notion of regret, defined as the difference between their cumulative network-wide square errors and those of the best off-line network-wide linear predictor. Various scenarios are investigated, depending on the nature of the local feedback (full information or bandit feedback), on the set of available predictors (the decision set), and the way data is generated (by an oblivious or adaptive adversary). We propose simple and natural distributed regression algorithms, involving, at each node and in each round, a local gradient descent step and a communication and averaging step where nodes aim at aligning their predictors to those of their neighbors. We establish regret upper bounds typically in ${\cal O}(T^{3/4})$ when the decision set is unbounded and in ${\cal O}(\sqrt{T})$ in case of bounded decision set.
We study the expressive power of kernel methods and the algorithmic feasibility of multiple kernel learning for a special rich class of kernels. Specifically, we define \emph{Euclidean kernels}, a diverse class that includes most, if not all, families of kernels studied in literature such as polynomial kernels and radial basis functions. We then describe the geometric and spectral structure of this family of kernels over the hypercube (and to some extent for any compact domain). Our structural results allow us to prove meaningful limitations on the expressive power of the class as well as derive several efficient algorithms for learning kernels over different domains.
When searching for information, a human reader first glances over a document, spots relevant sections and then focuses on a few sentences for resolving her intention. However, the high variance of document structure complicates to identify the salient topic of a given section at a glance. To tackle this challenge, we present SECTOR, a model to support machine reading systems by segmenting documents into coherent sections and assigning topic labels to each section. Our deep neural network architecture learns a latent topic embedding over the course of a document. This can be leveraged to classify local topics from plain text and segment a document at topic shifts. In addition, we contribute WikiSection, a publicly available dataset with 242k labeled sections in English and German from two distinct domains: diseases and cities. From our extensive evaluation of 20 architectures, we report a highest score of 71.6% F1 for the segmentation and classification of 30 topics from the English city domain, scored by our SECTOR LSTM model with bloom filter embeddings and bidirectional segmentation. This is a significant improvement of 29.5 points F1 compared to state-of-the-art CNN classifiers with baseline segmentation.
We investigate conditions under which test statistics exist that can reliably detect examples, which have been adversarially manipulated in a white-box attack. These statistics can be easily computed and calibrated by randomly corrupting inputs. They exploit certain anomalies that adversarial attacks introduce, in particular if they follow the paradigm of choosing perturbations optimally under p-norm constraints. Access to the log-odds is the only requirement to defend models. We justify our approach empirically, but also provide conditions under which detectability via the suggested test statistics is guaranteed to be effective. In our experiments, we show that it is even possible to correct test time predictions for adversarial attacks with high accuracy.
Rational decision making in its linguistic description means making logical decisions. In essence, a rational agent optimally processes all relevant information to achieve its goal. Rationality has two elements and these are the use of relevant information and the efficient processing of such information. In reality, relevant information is incomplete, imperfect and the processing engine, which is a brain for humans, is suboptimal. Humans are risk averse rather than utility maximizers. In the real world, problems are predominantly non-convex and this makes the idea of rational decision-making fundamentally unachievable and Herbert Simon called this bounded rationality. There is a trade-off between the amount of information used for decision-making and the complexity of the decision model used. This explores whether machine rationality is subjective and concludes that indeed it is.
In this paper, we present a dynamic non-diagonal regularization for interior point methods. The non-diagonal aspect of this regularization is implicit, since all the off-diagonal elements of the regularization matrices are cancelled out by those elements present in the Newton system, which do not contribute important information in the computation of the Newton direction. Such a regularization has multiple goals. The obvious one is to improve the spectral properties of the Newton system solved at each iteration of the interior point method. On the other hand, the regularization matrices introduce sparsity to the aforementioned linear system, allowing for more efficient factorizations. We also propose a rule for tuning the regularization dynamically based on the properties of the problem, such that sufficiently large eigenvalues of the non-regularized system are perturbed insignificantly. This alleviates the need of finding specific regularization values through experimentation, which is the most common approach in literature. We provide perturbation bounds for the eigenvalues of the non-regularized system matrix and then discuss the spectral properties of the regularized matrix. Finally, we demonstrate the efficiency of the method applied to solve standard small and medium-scale linear and convex quadratic programming test problems.
We present a novel and hierarchical approach for supervised classification of signals spanning over a fixed graph, reflecting shared properties of the dataset. To this end, we introduce a Convolutional Cluster Pooling layer exploiting a multi-scale clustering in order to highlight, at different resolutions, locally connected regions on the input graph. Our proposal generalises well-established neural models such as Convolutional Neural Networks (CNNs) on irregular and complex domains, by means of the exploitation of the weight sharing property in a graph-oriented architecture. In this work, such property is based on the centrality of each vertex within its soft-assigned cluster. Extensive experiments on NTU RGB+D, CIFAR-10 and 20NEWS demonstrate the effectiveness of the proposed technique in capturing both local and global patterns in graph-structured data out of different domains.
Session-based recommender systems (SBRS) are an emerging topic in the recommendation domain and have attracted much attention from both academia and industry in recent years. Most of existing works only work on modelling the general item-level dependency for recommendation tasks. However, there are many more other challenges at different levels, e.g., item feature level and session level, and from various perspectives, e.g., item heterogeneity and intra- and inter-item feature coupling relations, associated with SBRS. In this paper, we provide a systematic and comprehensive review on SBRS and create a hierarchical and in-depth understanding of a variety of challenges in SBRS. To be specific, we first illustrate the value and significance of SBRS, followed by a hierarchical framework to categorize the related research issues and methods of SBRS and to reveal its intrinsic challenges and complexities. Further, a summary together with a detailed introduction of the research progress is provided. Lastly, we share some prospects in this research area.
Today’s AI still faces two major challenges. One is that in most industries, data exists in the form of isolated islands. The other is the strengthening of data privacy and security. We propose a possible solution to these challenges: secure federated learning. Beyond the federated learning framework first proposed by Google in 2016, we introduce a comprehensive secure federated learning framework, which includes horizontal federated learning, vertical federated learning and federated transfer learning. We provide definitions, architectures and applications for the federated learning framework, and provide a comprehensive survey of existing works on this subject. In addition, we propose building data networks among organizations based on federated mechanisms as an effective solution to allow knowledge to be shared without compromising user privacy.
Generating informative responses in end-to-end neural dialogue systems attracts a lot of attention in recent years. Various previous work leverages external knowledge and the dialogue contexts to generate such responses. Nevertheless, few has demonstrated their capability on incorporating the appropriate knowledge in response generation. Motivated by this, we propose a novel open-domain conversation generation model in this paper, which employs the posterior knowledge distribution to guide knowledge selection, therefore generating more appropriate and informative responses in conversations. To the best of our knowledge, we are the first one who utilize the posterior knowledge distribution to facilitate conversation generation. Our experiments on both automatic and human evaluation clearly verify the superior performance of our model over the state-of-the-art baselines.
Before training a neural net, a classic rule of thumb is to randomly initialize the weights so that the variance of the preactivation is preserved across all layers. This is traditionally interpreted using the total variance due to randomness in both networks (weights) and samples. Alternatively, one can interpret the rule of thumb as preservation of the \emph{sample} mean and variance for a fixed network, i.e., preactivation statistics computed over the random sample of training samples. The two interpretations differ little for a shallow net, but the difference is shown to be large for a deep ReLU net by decomposing the total variance into the network-averaged sum of the sample variance and square of the sample mean. We demonstrate that the latter term dominates in the later layers through an analytical calculation in the limit of infinite network width, and numerical simulations for finite width. Our experimental results from training neural nets support the idea that preserving sample statistics can be better than preserving total variance. We discuss the implications for the alternative rule of thumb that a network should be initialized to be at the ‘edge of chaos.’
Learning to solve diagrammatic reasoning (DR) can be a challenging but interesting problem to the computer vision research community. It is believed that next generation pattern recognition applications should be able to simulate human brain to understand and analyze reasoning of images. However, due to the lack of benchmarks of diagrammatic reasoning, the present research primarily focuses on visual reasoning that can be applied to real-world objects. In this paper, we present a diagrammatic reasoning dataset that provides a large variety of DR problems. In addition, we also propose a Knowledge-based Long Short Term Memory (KLSTM) to solve diagrammatic reasoning problems. Our proposed analysis is arguably the first work in this research area. Several state-of-the-art learning frameworks have been used to compare with the proposed KLSTM framework in the present context. Preliminary results indicate that the domain is highly related to computer vision and pattern recognition research with several challenging avenues.
We estimate model parameters of L\’evy-driven causal CARMA random fields by fitting the empirical variogram to the theoretical counterpart using a weighted least squares (WLS) approach. Subsequent to deriving asymptotic results for the variogram estimator, we show strong consistency and asymptotic normality of the parameter estimator. Furthermore, we conduct a simulation study to assess the quality of the WLS estimator for finite samples. For the simulation we utilize numerical approximation schemes based on truncation and discretization of stochastic integrals and we analyze the associated simulation errors in detail. Finally, we apply our results to real data of the cosmic microwave background.
This paper presents a novel, high-performance, graphical processing unit-based algorithm for efficiently solving two-dimensional linear programs in batches. The domain of two-dimensional linear programs is particularly useful due to the prevalence of relevant geometric problems. Batch linear programming refers to solving numerous different linear programs within one operation. By solving many linear programs simultaneously and distributing workload evenly across threads, graphical processing unit utilization can be maximized. Speedups of over 22 times and 63 times are obtained against state-of-the-art graphics processing unit and CPU linear program solvers, respectively.
In this paper we propose to perform model ensembling in a multiclass or a multilabel learning setting using Wasserstein (W.) barycenters. Optimal transport metrics, such as the Wasserstein distance, allow incorporating semantic side information such as word embeddings. Using W. barycenters to find the consensus between models allows us to balance confidence and semantics in finding the agreement between the models. We show applications of Wasserstein ensembling in attribute-based classification, multilabel learning and image captioning generation. These results show that the W. ensembling is a viable alternative to the basic geometric or arithmetic mean ensembling.
To relieve the pain of manually selecting machine learning algorithms and tuning hyperparameters, automated machine learning (AutoML) methods have been developed to automatically search for good models. Due to the huge model search space, it is impossible to try all models. Users tend to distrust automatic results and increase the search budget as much as they can, thereby undermining the efficiency of AutoML. To address these issues, we design and implement ATMSeer, an interactive visualization tool that supports users in refining the search space of AutoML and analyzing the results. To guide the design of ATMSeer, we derive a workflow of using AutoML based on interviews with machine learning experts. A multi-granularity visualization is proposed to enable users to monitor the AutoML process, analyze the searched models, and refine the search space in real time. We demonstrate the utility and usability of ATMSeer through two case studies, expert interviews, and a user study with 13 end users.

# Book Memo: “Keras to Kubernetes”

 The Journey of a Machine Learning Model to Production We have seen an exponential growth in the use of Artificial Intelligence (AI) over last few years. AI is becoming the new electricity and is touching every industry from retail to manufacturing to healthcare to entertainment. Within AI, we’re seeing a particular growth in Machine Learning (ML) and Deep Learning (DL) applications. ML is all about learning relationships from labeled (Supervised) or unlabeled data (Unsupervised). DL has many layers of learning and can extract patterns from unstructured data like images, video, audio, etc. Machine Learning with Keras and Kubernetes takes you through real-world examples of building a Keras model for detecting logos in images. You will then take that trained model and package it as a web application container before learning how to deploy this model at scale on a Kubernetes cluster.

# Distilled News

If you haven’t heard of Docker, it is a system that allows projects to be split into discrete units (i.e. containers) that each operate within their own virtual environment. Each container has a blueprint written in its Dockerfile that describes all of the operating parameters including operating system and package dependencies/requirements. Docker images are easily distributed and, because they are self-contained, will operate on any other system that has Docker installed, include servers. When multiple instances/users attempt to start a Shiny App at the same time, only a single R session is initiated on the serving machine. This is problematic. For example, if one user starts a process that takes 10 seconds to complete, all other users will need to wait until that process has completed before any other tasks can be processed.
IBM is leveraging Kubernetes to enable its Watson AI to run on public clouds AWS, Google, and Microsoft Azure. The move signals a shift in strategy for IBM.
I recently came across one of the most well-intended, and most unnerving, applications of AI in recruiting; a talking robot head pitched as a solution to avoid bias in interviewing. Picture a robot the size of an Alexa with an actual human face painted to the top. The face changes, tries to show expression and and non verbal cues. It wasn’t a joke either. The face was meant to be there. Think interviewing with Chucky if you need to visualize this.
I have worked on the problem of open-sourcing Machine Learning versus sensitivity for a long time, especially in disaster response contexts: when is it right/wrong to release data or a model publicly? This article is a list of frequently asked questions, the answers that are best practice today, and some examples of where I have encountered them.
I have been working with Tensorflow during last months and I realized that, although there is a large number of Github repositories with many different and complex models, is hard to find a simple example that shows you how to obtain your own dataset from the web and apply some Deep Learning on it. In this post I pretended to provide an example of this task but being keeping it as simple as possible. I will show you how to obtain online unlabeled data, how to create a simple convolutional network, train it with some supervised data and use it later to classify the data we have gathered from the web.
I am going to give intuitive understanding of Regularization method in as simple words as possible. Firstly, I will discuss some basic ideas, so if you think you are already families with those, feel free to move ahead.
Have you heard of RStudio Connect, but do not know where to start? Maybe you are trying to show your manager how Shiny applications can be deployed in production, or convince a DevOps engineer that R can fit into her existing tooling. Perhaps you want to explore the functionality of RStudio’s Professional products to see if they fit the needs you have in your work. Today, we are excited to announce the RStudio QuickStart, which allows you to try out RStudio Connect for free from your desktop.
For the past year, my team and I have been working on a personalized user experience in the Taboola feed. We used Multi-Task Learning (MTL) to predict multiple Key Performance Indicators (KPIs) on the same set of input features, and implemented a Deep Learning (DL) model in TensorFlow to do so. Back when we started, MTL seemed way more complicated to us than it does now, so I wanted to share some of the lessons learned.
For background to this post, please see Learn #MachineLearning Coding Basics in a weekend. Here,we present the glossary that we use for the coding and the mindmap attached to these classes and upcoming book.
Feature engineering plays a key role in machine learning, data mining, and data analytics. This article provides a general definition for feature engineering, together with an overview of the major issues, approaches, and challenges of the field.
Emotion is a fundamental element of human society. If you think about it, everything worth analyzing is influenced by human behavior. Cyber attacks are highly impacted by disgruntled employees who may either ignore due diligence or engage in insider misuse. The stock market depends on the effect of the economic climate, which itself is dependent on the aggregate behavior of the masses. In the field of communication, it is common knowledge that what we say account for only 7% of the message while the rest 93% is encoded in facial expressions and other non-verbal cues. Entire fields of psychology and behavioral economics are dedicated to this field. That being said, the ability to measure and analyze emotions effectively will enable us to improve society in remarkable ways. For example, a psychology professor at the University of California, San Francisco, Paul Ekman, describes in his book, Telling Lies: Clues to Deceit in the Marketplace, Politics, and Marriage, how reading facial expression can help psychologists find signs of potential suicide attempts while the patient lies about such intentions. Sounds like a job for facial recognition models? What about neural mapping? Can we effectively map emotional states from neural impulses? What about improving cognitive abilities? Or even emotional intelligence and effective communication? There are plenty of problems in the world to solve using the vast array of unstructured data that is available to us.
Dropout is commonly used to regularize deep neural networks; however, applying dropout on fully-connected layers and applying dropout on convolutional layers are fundamentally different operations. While it is known in the deep learning community that dropout has limited benefits when applied to convolutional layers, I wanted to show a simple mathematical example of why the two are different. To do so, I’ll define how dropout operates on fully-connected layers, define how dropout operates on convolutional layers and contrast the two operations.
Discussions on the interplay of humans and Artificial Intelligence tend to pose the issue in the language of opposition. However, according to the thinking of evolutionary biologist Richard Dawkins, tools such as AI can be better thought of as part of our extended phenotype. A phenotype refers to the observable characteristic of an organism, and the idea of the extended phenotype is that this should not be limited to biological processes, but include all of the effects that the genes have upon their environment, both internally and externally.
Deep Neural Networks are highly expressive machine learning networks that have been around for many decades. In 2012, with gains in computing power and improved tooling, a family of these machine learning models called ConvNets started achieving state of the art performance on visual recognition tasks. Up to this point, machine learning algorithms simply didn’t work well enough for anyone to be surprised when it failed to do the right thing.
Deep learning a subset of machine learning, has delivered super-human accuracy in a variety of practical uses in the past decade. From revolutionizing customer experience, machine translation, language recognition, autonomous vehicles, computer vision, text generation, speech understanding, and a multitude of other AI applications. In contrast to machine learning where an AI agent learns from data based on machine learning algorithms, deep learning is based on a neural network architecture which acts similarly to the human brain, and allows the AI agent to analyze data fed in?-?in a structure similar to the way humans do. Deep learning models do not require algorithms to specify what to do with the data, which is made possible thanks to the extraordinary amount of data we as humans, collect and consume?-?which in turn is fed to deep learning models .
Consensus is growing that model operationalization – rather than model development – is today’s biggest hurdle for data science. Production deployment techniques are generally one-offs, and data scientists and data engineers often lack the skills to operationalize models. Application integration, model monitoring and tuning, and workflow automation are often afterthoughts. Sometimes called the ‘last mile’ for analytics, this is where data science meets production IT. And it’s where business value is (or is not) created. Achieving the vision of becoming a model-driven business that deploys and iterates models at scale requires something that only a handful of companies have: ModelOps.

# R Packages worth a look

Similarity-Based Segmentation of Multidimensional Signals (segmenTier)
A dynamic programming solution to segmentation based on maximization of arbitrary similarity measures within segments. The general idea, theory and thi …

Analysis of Longitudinal Data with Irregular Observation Times (IrregLong)
Analysis of longitudinal data for which the times of observation are random variables that are potentially associated with the outcome process. The pac …

Shiny Matrix Input Field (shinyMatrix)
Implements a custom matrix input field.

# Distilled News

It isn’t intuitive, creative, inspired, generalized, or conscious. Will it ever be like us? Will it ever think like us? As I study data science I learn a little more about artificial intelligence each day. I practice wielding the tools in my machine learning tool box, and I read articles?-?and the more I learn, the more annoyed I get by what I read. Piece after piece of journalism adapts the same breathless tone toward AI. An article will begin by describing the algorithms behind a real achievement but will always take a leap toward a vision of the future. Some day it will do more, they say: more than play Go; more than flip a burger; more than guide a missile. Some day it will do everything that you can do. I don’t want to hear another vision of the future. I want to know the steps that will take us to that moment when our machines’ intelligence matches ours. Start by thinking about our own thinking. We know in a broad way that intelligence means more than just mastery of a set of skills or a system of knowledge. We grow and adapt, we dream and create, we delight each other and surprise ourselves. We cannot quantify the entirety of our own intelligence, and indeed we are only in the infancy of our study of the brain and the gut. But we can quantify the intelligence of the machines that we build. We know how to do this because we have painstakingly constructed each model, framework, and algorithm.
It is common secret that Python programming language has a solid claim to being the fastest-growing major programming language witnessing an extraordinary growth in the last five years, as seen by Stack Overflow traffic. Based on data describing the Stack Overflow question views which go to late 2011, the growth of Python relative to five other major programming languages is plotted.
In this post, I walk through the code I used to make a nice diagram illustrating the parameters in a logistic growth curve. I made this figure for a conference submission. I had a tight word limit (600 words) and a complicated statistical method (Bayesian nonlinear mixed effects beta regression), so I wanted to use a diagram to carry some of the expository load. Also, figures didn’t count towards the word limit, so that was a bonus.
In this article, we will see how to do web scraping with R while doing so, we’ll leverage functional programming in R to scale it up. The nature of the article is more like a cookbook-format rather than a documentation/tutorial-type, because the objective here is to explain how effectively web scraping can be coupled with Functional Programming
In this article you’ll learn the basics steps to performing time-series analysis and concepts like trend, stationarity, moving averages, etc. You’ll also explore exponential smoothing methods, and learn how to fit an ARIMA model on non-stationary data.
In Thinking, Fast and Slow, Daniel Kahneman discusses at length the importance of stereotypes in understanding many decision-making processes. A so-called System 1 is used for quick decision-making: it allows us to recognize people and objects, helps us focus our attention, and encourages us to fear spiders. It is based on knowledge stored in memory and accessible without intention, and without effort. It can be contrasted with System 2, which allows for more complex decision-making, requiring discipline and sequential reflection. In the first case, our brain uses the stereotypes that govern judgments of representativeness, and uses this heuristic to make decisions. If I cook a fish for friends who have come to eat, I open a bottle of white wine. The cliché ‘fish goes well with white wine’ allows me to make a decision quickly, without having to think about it. Stereotypes are statements about a group that are accepted (at least provisionally) as facts about each member. Whether correct or not, stereotypes are the basic tools for thinking about categories in System 1. But in many cases, a more in-depth, more sophisticated reflection – corresponding to System 2 – will make it possible to make a more judicious, even optimal decision. Without choosing any red wine, a pinot noir could perhaps also be perfectly suitable for roasted red mullets.
In the previous post (https://…rm-random-in-hyper-parameter-optimization ), it is shown how to identify the optimal hyper-parameter in a General Regression Neural Network by using the Sobol sequence and the uniform random generator respectively through the N-fold cross validation. While the Sobol sequence yields a slightly better performance, outcomes from both approaches are very similar, as shown below based upon five trials with 20 samples in each. Both approaches can be generalized from one-dimensional to multi-dimensional domains, e.g. boosting or deep learning.
In the first part, you learned about trends and seasonality, smoothing models and ARIMA processes. In this part, you’ll learn how to deal with seasonal models and how to implement Seasonal Holt-Winters and Seasonal ARIMA (SARIMA).
In part 1, we looked at the theory behind Q-learning using a very simple dungeon game with two strategies: the accountant and the gambler. This second part takes these examples, turns them into Python code and trains them in the cloud, using the Valohai deep learning management platform. Due to the simplicity of our example, we will not use any libraries like TensorFlow or simulators like OpenAI Gym on purpose. Instead we will code everything ourselves from scratch to provide the full picture.
In one of his books, Isaac Asimov envisions a future where computers have become so intelligent and powerful, that they are able to answer any question. In that future, Asimov postulates, scientists don’t become unnecessary. Instead, they’re left with a difficult task: figuring out how to ask the computers the right questions: those that yield an insightful, useful answer. We’re not quite there yet, but in some sense we are.
In many (business) cases it is equally important to not only have an accurate, but also an interpretable model. Oftentimes, apart from wanting to know what our model’s house price prediction is, we also wonder why it is this high/low and which features are most important in determining the forecast. Another example might be predicting customer churn?-?it is very nice to have a model that is successfully predicting which customers are prone to churn, but identifying which variables are important can help us in early detection and maybe even improving the product/service!
In a previous post, popular time series analysis techniques were introduced. Here, we will apply those techniques in Python for stock prediction. Specifically, we will use the historical stock price of the New Germany Fund (GF) to try to predict the closing price in the next five trading days.
Imbalanced data problems in binary prediction models and a simple but effective way to take care of them with Python and R.
Image restoration refers to the task of recovery of an unknown true image from its degraded image. The degradation of image may occur during image formation, transmission, and storage. This task has a wide scope of usage for satellite imaging , low-light photography and due to advancement in digital technology, computational and communication technology restoration of clean image from the degraded image is very important and hence, has evolved into a field of research which intersects with image processing, computer vision, and computational imaging.
If you’ve heard of different kinds of convolutions in Deep Learning (e.g. 2D / 3D / 1×1 / Transposed / Dilated (Atrous) / Spatially Separable / Depthwise Separable / Flattened / Grouped / Shuffled Grouped Convolution), and got confused what they actually mean, this article is written for you to understand how they actually work. Here in this article, I summarize several types of convolution commonly used in Deep Learning, and try to explain them in a way that is accessible for everyone. Besides this article, there are several good articles from others on this topic. Please check them out (listed in the Reference).
If you’re going to do Machine Learning in Python, Scikit Learn is the gold standard. Scikit-learn provides a wide selection of supervised and unsupervised learning algorithms. Best of all, it’s by far the easiest and cleanest ML library. Scikit learn was created with a software engineering mindset. It’s core API design revolves around being easy to use, yet powerful, and still maintaining flexibility for research endeavours. This robustness makes it perfect for use in any end-to-end ML project, from the research phase right down to production deployments.

# If you did not already know

Kriging Models
In statistics, originally in geostatistics, Kriging or Gaussian process regression is a method of interpolation for which the interpolated values are modeled by a Gaussian process governed by prior covariances, as opposed to a piecewise-polynomial spline chosen to optimize smoothness of the fitted values. Under suitable assumptions on the priors, Kriging gives the best linear unbiased prediction of the intermediate values. Interpolating methods based on other criteria such as smoothness need not yield the most likely intermediate values. The method is widely used in the domain of spatial analysis and computer experiments. The technique is also known as Kolmogorov Wiener prediction. …

Maximum Correntropy Criterion Kalman Filter (MCC-KF)
We present robust dynamic resource allocation mechanisms to allocate application resources meeting Service Level Objectives (SLOs) agreed between cloud providers and customers. In fact, two filter-based robust controllers, i.e. H-infinity filter and Maximum Correntropy Criterion Kalman filter (MCC-KF), are proposed. The controllers are self-adaptive, with process noise variances and covariances calculated using previous measurements within a time window. In the allocation process, a bounded client mean response time (mRT) is maintained. Both controllers are deployed and evaluated on an experimental testbed hosting the RUBiS (Rice University Bidding System) auction benchmark web site. The proposed controllers offer improved performance under abrupt workload changes, shown via rigorous comparison with current state-of-the-art. On our experimental setup, the Single-Input-Single-Output (SISO) controllers can operate on the same server where the resource allocation is performed; while Multi-Input-Multi-Output (MIMO) controllers are on a separate server where all the data are collected for decision making. SISO controllers take decisions not dependent to other system states (servers), albeit MIMO controllers are characterized by increased communication overhead and potential delays. While SISO controllers offer improved performance over MIMO ones, the latter enable a more informed decision making framework for resource allocation problem of multi-tier applications. …

Scale Aware Feature Encoder (SAFE)
In this paper, we address the problem of having characters with different scales in scene text recognition. We propose a novel scale aware feature encoder (SAFE) that is designed specifically for encoding characters with different scales. SAFE is composed of a multi-scale convolutional encoder and a scale attention network. The multi-scale convolutional encoder targets at extracting character features under multiple scales, and the scale attention network is responsible for selecting features from the most relevant scale(s). SAFE has two main advantages over the traditional single-CNN encoder used in current state-of-the-art text recognizers. First, it explicitly tackles the scale problem by extracting scale-invariant features from the characters. This allows the recognizer to put more effort in handling other challenges in scene text recognition, like those caused by view distortion and poor image quality. Second, it can transfer the learning of feature encoding across different character scales. This is particularly important when the training set has a very unbalanced distribution of character scales, as training with such a dataset will make the encoder biased towards extracting features from the predominant scale. To evaluate the effectiveness of SAFE, we design a simple text recognizer named scale-spatial attention network (S-SAN) that employs SAFE as its feature encoder, and carry out experiments on six public benchmarks. Experimental results demonstrate that S-SAN can achieve state-of-the-art (or, in some cases, extremely competitive) performance without any post-processing. …