# Magister Dixit

“Feature engineering is another topic which doesn’t seem to merit any review papers or books, or even chapters in books, but it is absolutely vital to ML success. […] Much of the success of machine learning is actually success in engineering features that a learner can understand.” Scott Locklin (2014)

# Finding out why

Visual objects are composed of a recursive hierarchy of perceptual wholes and parts, whose properties, such as shape, reflectance, and color, constitute a hierarchy of intrinsic causal factors of object appearance. However, object appearance is the compositional consequence of both an object’s intrinsic and extrinsic causal factors, where the extrinsic causal factors are related to illumination and imaging conditions. Therefore, this paper proposes a unified tensor model of wholes and parts, and introduces a compositional hierarchical tensor factorization that disentangles the hierarchical causal structure of object image formation, and subsumes multilinear block tensor decomposition as a special case. The resulting object representation is an interpretable combinatorial choice of wholes’ and parts’ representations that renders object recognition robust to occlusion and reduces training data requirements. We demonstrate our approach in the context of face recognition by training on an extremely reduced dataset of synthetic images, and report encouraging face verification results on two datasets: the Freiburg dataset and the Labeled Faces in the Wild (LFW) dataset, which consists of real-world images, thus substantiating the suitability of our approach for data-starved domains.
Multiple imputation is widely used to handle confounders missing at random in causal inference. Although Rubin’s combining rule is simple, it is not clear whether or not the standard multiple imputation inference is consistent when coupled with the commonly used average causal effect (ACE) estimators. This article establishes a unified martingale representation for the ACE estimators after multiple imputation. This representation invokes the wild bootstrap inference to provide consistent variance estimation. Our framework applies to asymptotically normal ACE estimators, including the regression imputation, weighting, and matching estimators. We extend to the scenarios when both outcome and confounders are subject to missingness and when the data are missing not at random.
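
For readers who have not met the combining rule mentioned in the abstract above, here is a minimal sketch of the standard Rubin rule for pooling estimates across imputed datasets. It is not the martingale or wild-bootstrap procedure the article proposes; the function name `rubin_combine` and the numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Pool m point estimates and their within-imputation variances."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance W
    between = variance(estimates)   # between-imputation variance B (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical ACE estimates and variances from m = 4 imputed datasets.
ace_hat, ace_var = rubin_combine([1.0, 1.2, 0.8, 1.0], [0.04, 0.05, 0.03, 0.04])
print(ace_hat, ace_var)
```

The total variance adds the average within-imputation variance to an inflated between-imputation variance; whether this total is consistent for a given ACE estimator is exactly the question the article raises.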
Machine learning practice is often impacted by confounders. Confounding can be particularly severe in remote digital health studies where the participants self-select to enter the study. While many different confounding adjustment approaches have been proposed in the literature, most of these methods rely on modeling assumptions, and it is unclear how robust they are to violations of these assumptions. This realization has recently motivated the development of restricted permutation methods to quantify the influence of observed confounders on the predictive performance of machine learning models and evaluate if confounding adjustment methods are working as expected. In this paper we show, nonetheless, that restricted permutations can generate biased estimates of the contribution of the confounders to the predictive performance of a learner, and we propose an alternative approach to tackle this problem. By viewing a classification task from a causality perspective, we are able to leverage conditional independence tests between predictions and test set labels and confounders in order to detect confounding on the predictive performance of a classifier. We illustrate the application of our causality-based approach to data collected from an mHealth study in Parkinson’s disease.
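
As a rough illustration of the restricted permutation idea the abstract above refers to, the sketch below shuffles labels only among samples that share the same confounder value. This reflects our reading of the general technique, not the paper's code or its proposed alternative; `restricted_permutation` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def restricted_permutation(labels, confounder, rng):
    """Shuffle labels only within groups that share a confounder value."""
    groups = defaultdict(list)
    for idx, level in enumerate(confounder):
        groups[level].append(idx)
    permuted = list(labels)
    for indices in groups.values():
        shuffled = [labels[i] for i in indices]
        rng.shuffle(shuffled)
        for i, value in zip(indices, shuffled):
            permuted[i] = value
    return permuted


# Toy data: a binary label and a hypothetical confounder (age group).
labels = [1, 0, 1, 1, 0, 0, 1, 0]
age_group = ["young", "young", "young", "old", "old", "old", "old", "young"]
print(restricted_permutation(labels, age_group, random.Random(0)))
```

Because labels never cross strata, the association between label and confounder is preserved while any direct link between the features and the label is broken.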
Large-scale trends in urban crime and global terrorism are well-predicted by socio-economic drivers, but focused, event-level predictions have had limited success. Standard machine learning approaches are promising, but lack interpretability, are generally interpolative, and are ineffective for precise future interventions with costly and wasteful false positives. Here, we introduce Granger Network inference as a new forecasting approach for individual infractions with demonstrated performance far surpassing past results, yet transparent enough to validate and extend social theory. Considering the problem of predicting crime in the City of Chicago, we achieve an average AUC of ~90% for events predicted a week in advance within spatial tiles approximately 1000 ft across. Instead of pre-supposing that crimes unfold across contiguous spaces akin to diffusive systems, we learn the local transport rules from data. As our key insights, we uncover indications of suburban bias (how law-enforcement response is modulated by socio-economic contexts, with disproportionately negative impacts in the inner city) and show how the dynamics of violent and property crimes co-evolve and constrain each other, lending quantitative support to controversial pro-active policing policies. To demonstrate broad applicability to spatio-temporal phenomena, we analyze terror attacks in the Middle East in the recent past, and achieve an AUC of ~80% for predictions made a week in advance, and within spatial tiles measuring approximately 120 miles across. We conclude that while crime operates near an equilibrium, quickly dissipating perturbations, terrorism does not. Indeed, terrorism aims to destabilize social order, as shown by its dynamics being susceptible to run-away increases in event rates under small perturbations.
Pragmatic randomized trials are designed to provide evidence for clinical decision-making rather than regulatory approval. Common features of these trials include the inclusion of heterogeneous or diverse patient populations in a wide range of care settings, the use of active treatment strategies as comparators, unblinded treatment assignment, and the study of long-term, clinically relevant outcomes. These features can greatly increase the usefulness of the trial results for patients, clinicians, and other stakeholders. However, these features also introduce an increased risk of non-adherence, which reduces the value of the intention-to-treat effect as a patient-centered measure of causal effect. In these settings, the per-protocol effect provides useful complementary information for decision making. Unfortunately, there is little guidance for valid estimation of the per-protocol effect. Here, we present our full guidelines for analyses of pragmatic trials that will result in more informative causal inferences for both the intention-to-treat effect and the per-protocol effect.

Python Library: cause-ml

Causal ML benchmarking and development tools
In the end, what does causality have to do with machine learning? Machine learning is about prediction and causality is about real effects; do these two themes have something in common? Yes, a lot, and this series of posts tries to bridge these two sub-areas of data science. I like to think of machine learning as just a data grinder: if you put in good-quality data, you get good-quality predictions, but if you put in garbage, it will keep grinding; just don’t expect good predictions to come out. It’s just ground garbage, and that’s what we’ll talk about in this post.
Convolutional neural networks (CNNs) with dilated filters such as the Wavenet or the Temporal Convolutional Network (TCN) have shown good results in a variety of sequence modelling tasks. However, efficiently modelling long-term dependencies in these sequences is still challenging. Although the receptive field of these models grows exponentially with the number of layers, computing the convolutions over very long sequences of features in each layer is time and memory-intensive, prohibiting the use of longer receptive fields in practice. To increase efficiency, we make use of the ‘slow feature’ hypothesis stating that many features of interest are slowly varying over time. For this, we use a U-Net architecture that computes features at multiple time-scales and adapt it to our auto-regressive scenario by making convolutions causal. We apply our model (‘Seq-U-Net’) to a variety of tasks including language and audio generation. In comparison to TCN and Wavenet, our network consistently saves memory and computation time, with speed-ups for training and inference of over 4x in the audio generation experiment in particular, while achieving a comparable performance in all tasks.
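
To make the two ingredients named in the abstract above concrete, causal convolutions and dilation, here is a small pure-Python sketch. It is not the Seq-U-Net architecture; `causal_dilated_conv` and `receptive_field` are hypothetical helper names, and the receptive-field formula assumes a plain stack of layers sharing one kernel size.

```python
def causal_dilated_conv(x, weights, dilation=1):
    """y[t] = sum_k weights[k] * x[t - k * dilation], with zeros before t = 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            src = t - k * dilation
            if src >= 0:  # only the current and past samples contribute
                acc += w * x[src]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input steps visible to one output of a stack of dilated layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


signal = [1, 2, 3, 4]
print(causal_dilated_conv(signal, [1, 1], dilation=2))
print(receptive_field(2, [1, 2, 4, 8]))
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth, but every layer still processes the full-length sequence, which is the cost the abstract points to.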

# Distilled News

In this post, I’m going to explore machine learning algorithms for time-series analysis and explain why they don’t work for day trading. If you’re a novice in this field, you might get fooled by authors with amazing results where test data match predictions almost perfectly. A common trick is to show a plot of predicted values over a long period of data, which creates the illusion that the lag is insignificant, or hides it altogether. Lag is what makes predictions useless, and I’ll show you an example later in this post. There are other ways to make predictions look legit, some of which I’m sure are made by mistake. But don’t get discouraged, and keep in mind that a model can only be as good as your data, and a lack of data is the main stumbling block on your way to getting solid results.
Making ML Proofs-of-Concept (POC) is easy, but maintaining them in Production is frustrating and expensive. It comes down to chaotic and blindsided monitoring and triage processes. To fix this, we want to introduce a concept called AI Performance Management (AiPM), explain how teams can adopt it with 5 tactical steps, and show you a tool to accelerate the process.
In a recent collaboration with experts from natural and medical sciences, we show how Invertible Neural Networks can help us deal with the ill-posed inverse problems that often arise in these fields. This page aims to provide an intuitive introduction to the idea.
Depth estimation is a computer vision task designed to estimate depth from a 2D image. The task requires an input RGB image and outputs a depth image. The depth image includes information about the distance of the objects in the image from the viewpoint, which is usually the camera taking the image. Some of the applications of depth estimation include smoothing blurred parts of an image, better rendering of 3D scenes, self-driving cars, grasping in robotics, robot-assisted surgery, automatic 2D-to-3D conversion in film, and shadow mapping in 3D computer graphics, just to mention a few. In this guide, we’ll look at papers aimed at solving these problems using deep learning. The two images below provide a clear illustration of depth estimation in practice.
Any deep learning model learns from data, and that data must be collected or uploaded to a server (a single machine or a data center). The most realistic and meaningful deep learning models can learn from personal data. Personal data is extremely private and sensitive, and no one would like to send or upload it to a server. Federated learning is a collaborative machine learning approach in which a model is trained without centralizing data on the server, and this is its main revolution.
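
One common way to train without centralizing data is for each client to train locally and send only model parameters, which the server then averages. The sketch below shows that aggregation step in the style of federated averaging; the post does not say which scheme it uses, so treat `federated_average` and the toy numbers as assumptions for illustration.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one dataset size per client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(n_params)
    ]


# Two hypothetical clients holding 1 and 3 local examples respectively.
global_weights = federated_average([[1.0, 0.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

Only the parameter vectors leave the clients; the raw personal data stays on each device.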
Today’s virtual assistants help users to accomplish a wide variety of tasks, including finding flights, searching for nearby events and movies, making reservations, sourcing information from the web and more. They provide this functionality by offering a unified natural language interface to a wide variety of services across the web. Large-scale virtual assistants, like Google Assistant, need to integrate with a large and constantly increasing number of services, each with potentially overlapping functionality, over a wide variety of domains. Supporting new services with ease, without collection of additional data or retraining the model, and reducing maintenance workload are necessary to accommodate future growth. Despite tremendous progress, however, these challenges have often been overlooked in state-of-the-art models. This is due, in part, to the absence of suitable datasets that match the scale and complexity confronted by such virtual assistants.
Some major updates to AzureR packages this week! As well as last week’s AzureRMR update, there are changes to AzureStor, AzureVM, AzureGraph and AzureContainers. All of these are live on CRAN.
At last week’s Microsoft Ignite conference in Orlando, our team delivered a series of 6 talks about AI and machine learning applications with Azure. The videos from each talk are linked below, and you can watch every talk from the conference online (no registration necessary). Each of our talks also comes with a companion GitHub repository, where you can find all of the code and scripts behind the demonstrations, so you can deploy and run them yourself.
Explainable AI (XAI) is a sub-field of AI which has been gaining ground in the recent past. And as a machine learning practitioner dealing with customers day in and day out, I can see why. I’ve been an analytics practitioner for more than 5 years and I swear, the hardest part of a machine learning project is not creating the perfect model which beats all the benchmarks. It’s the part where you convince the customer why and how it works.
Let us consider the following task: we have a bunch of evenly distributed time series of different lengths. The goal is to cluster the time series by identifying general patterns that are present in the data. Here I’d like to present one approach to solving this task. We will use hierarchical clustering, with the DTW (dynamic time warping) algorithm as the comparison metric between time series. The solution worked well on HR data (employee historical scores). For other types of time series, the DTW function may work worse than other metrics like CID (Complexity Invariant Distance), MAE or correlation.
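
The approach described above can be sketched in a few lines of plain Python: a textbook DTW distance, followed by a naive average-linkage agglomerative clustering on the pairwise DTW distances. This is not the post's code; the function names and the toy series are hypothetical, and a real project would more likely use an optimized library.

```python
def dtw_distance(a, b):
    """Classic dynamic-time-warping cost with absolute difference as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def agglomerative_clusters(series, n_clusters):
    """Merge the closest clusters (average linkage) until n_clusters remain."""
    dist = {
        (i, j): dtw_distance(series[i], series[j])
        for i in range(len(series))
        for j in range(i + 1, len(series))
    }

    def linkage(c1, c2):
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    clusters = [[i] for i in range(len(series))]
    while len(clusters) > n_clusters:
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda xy: linkage(clusters[xy[0]], clusters[xy[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy series of different lengths: two rising, two falling.
series = [[1, 2, 3, 4, 5], [1, 1, 2, 3, 4, 5], [5, 4, 3, 2, 1], [5, 5, 4, 3, 2, 1, 1]]
print(agglomerative_clusters(series, 2))
```

DTW tolerates the different lengths because it aligns points by warping the time axis rather than comparing index by index.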
This guide will be useful if you are a bit familiar with pretrained models but want to know how to use them in Keras. Keras contains 10 pretrained models for image classification. These models are trained on ImageNet data.
OpenML is an online Machine Learning (ML) experiments database accessible to everyone for free. The core idea is to have a single repository of datasets and the results of ML experiments on them. Although ML has gained a lot of popularity in recent years, with a plethora of tools now available, ML experiments continue to happen in silos rather than as one shared community effort.

# What’s new on arXiv

Normative expert systems have not become commonplace because they have been difficult to build and use. Over the past decade, however, researchers have developed the influence diagram, a graphical representation of a decision maker’s beliefs, alternatives, and preferences that serves as the knowledge base of a normative expert system. Most people who have seen the representation find it intuitive and easy to use. Consequently, the influence diagram has overcome significantly the barriers to constructing normative expert systems. Nevertheless, building influence diagrams is not practical for extremely large and complex domains. In this book, I address the difficulties associated with the construction of the probabilistic portion of an influence diagram, called a knowledge map, belief network, or Bayesian network. I introduce two representations that facilitate the generation of large knowledge maps. In particular, I introduce the similarity network, a tool for building the network structure of a knowledge map, and the partition, a tool for assessing the probabilities associated with a knowledge map. I then use these representations to build Pathfinder, a large normative expert system for the diagnosis of lymph-node diseases (the domain contains over 60 diseases and over 100 disease findings). In an early version of the system, I encoded the knowledge of the expert using an erroneous assumption that all disease findings were independent, given each disease. When the expert and I attempted to build a more accurate knowledge map for the domain that would capture the dependencies among the disease findings, we failed. Using a similarity network, however, we built the knowledge-map structure for the entire domain in approximately 40 hours. Furthermore, the partition representation reduced the number of probability assessments required by the expert from 75,000 to 14,000.
The impressive performance of neural networks on natural language processing tasks is attributed to their ability to model complicated word and phrase interactions. Existing flat, word-level explanations of predictions hardly unveil how neural networks handle compositional semantics to reach predictions. To tackle the challenge, we study hierarchical explanation of neural network predictions. We identify non-additivity and independent importance attributions within hierarchies as two desirable properties for highlighting word and phrase interactions. We show, however, that prior efforts on hierarchical explanations, e.g. contextual decomposition, do not satisfy the desired properties mathematically. In this paper, we propose a formal way to quantify the importance of each word or phrase for hierarchical explanations. Following the formulation, we propose the Sampling and Contextual Decomposition (SCD) algorithm and the Sampling and Occlusion (SOC) algorithm. Human and metrics evaluation on both LSTM models and BERT Transformer models on multiple datasets show that our algorithms outperform prior hierarchical explanation algorithms. Our algorithms apply to hierarchical visualization of compositional semantics, extraction of classification rules and improving human trust of models.
Perhaps the simplest type of multilingual transfer learning is instance-based transfer learning, in which data from the target language and the auxiliary languages are pooled, and a single model is learned from the pooled data. It is not immediately obvious when instance-based transfer learning will improve performance in this multilingual setting: for instance, a plausible conjecture is this kind of transfer learning would help only if the auxiliary languages were very similar to the target. Here we show that at large scale, this method is surprisingly effective, leading to positive transfer on all of 35 target languages we tested. We analyze this improvement and argue that the most natural explanation, namely direct vocabulary overlap between languages, only partially explains the performance gains: in fact, we demonstrate target-language improvement can occur after adding data from an auxiliary language with no vocabulary in common with the target. This surprising result is due to the effect of transitive vocabulary overlaps between pairs of auxiliary and target languages.
Semantic text matching, which matches a target text to a source text, is a general problem in many domains like information retrieval, question answering, and recommendation. There are several challenges for this problem, such as semantic gaps between words, implicit matching, and mismatch due to out-of-vocabulary or low-frequency words. Most existing studies have made great efforts to overcome these challenges by learning good representations for different text pieces or operating on global matching signals to get the matching score. However, they did not learn the local fine-grained interactive information for a specific source and target pair. In this paper, we propose a novel interactive attention model for semantic text matching, which learns new representations for source and target texts through interactive attention via a global matching matrix and updates local fine-grained relevance between source and target. Our model can enrich the representations of source and target objects by adopting global relevance and learned local fine-grained relevance. The enriched representations of source and target encode the global and local relevance of each other and can therefore strengthen the semantic matching of texts. We conduct empirical evaluations of our model with three applications including biomedical literature retrieval, tweet and news linking, and factoid question answering. Experimental results on three data sets demonstrate that our model significantly outperforms competitive baseline methods.
Pre-trained language representation models (PLMs) learn effective language representations from large-scale unlabeled corpora. Knowledge embedding (KE) algorithms encode the entities and relations in knowledge graphs into informative embeddings to do knowledge graph completion and provide external knowledge for various NLP applications. In this paper, we propose a unified model for Knowledge Embedding and Pre-trained LanguagE Representation (KEPLER), which not only better integrates factual knowledge into PLMs but also effectively learns knowledge graph embeddings. Our KEPLER utilizes a PLM to encode textual descriptions of entities as their entity embeddings, and then jointly learns the knowledge embeddings and language representations. Experimental results on various NLP tasks such as relation extraction and entity typing show that our KEPLER can achieve comparable results to the state-of-the-art knowledge-enhanced PLMs without any additional inference overhead. Furthermore, we construct Wikidata5m, a new large-scale knowledge graph dataset with aligned text descriptions, to evaluate KE embedding methods in both the traditional transductive setting and the challenging inductive setting, which needs the models to predict entity embeddings for unseen entities. Experiments demonstrate our KEPLER can achieve good results in both settings.
Contextualized word embeddings, i.e. vector representations for words in context, are naturally seen as an extension of previous noncontextual distributional semantic models. In this work, we focus on BERT, a deep neural network that produces contextualized embeddings and has set the state-of-the-art in several semantic tasks, and study the semantic coherence of its embedding space. While showing a tendency towards coherence, BERT does not fully live up to the natural expectations for a semantic vector space. In particular, we find that the position of the sentence in which a word occurs, while having no meaning correlates, leaves a noticeable trace on the word embeddings and disturbs similarity relationships.
Probably the most important problem in machine learning is the preliminary biasing of a learner’s hypothesis space so that it is small enough to ensure good generalisation from reasonable training sets, yet large enough that it contains a good solution to the problem being learnt. In this paper a mechanism for *automatically* learning or biasing the learner’s hypothesis space is introduced. It works by first learning an appropriate *internal representation* for a learning environment and then using that representation to bias the learner’s hypothesis space for the learning of future tasks drawn from the same environment. An internal representation must be learnt by sampling from *many similar tasks*, not just a single task as occurs in ordinary machine learning. It is proved that the number of examples $m$ *per task* required to ensure good generalisation from a representation learner obeys $m = O(a+b/n)$ where $n$ is the number of tasks being learnt and $a$ and $b$ are constants. If the tasks are learnt independently (*i.e.* without a common representation) then $m=O(a+b)$. It is argued that for learning environments such as speech and character recognition $b\gg a$ and hence representation learning in these environments can potentially yield a drastic reduction in the number of examples required per task. It is also proved that if $n = O(b)$ (with $m=O(a+b/n)$) then the representation learnt will be good for learning novel tasks from the same environment, and that the number of examples required to generalise well on a novel task will be reduced to $O(a)$ (as opposed to $O(a+b)$ if no representation is used). It is shown that gradient descent can be used to train neural network representations and experimental results are reported providing strong qualitative support for the theoretical results.
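
To get a feel for the scaling in the abstract above, the following sketch plugs numbers into $a + b/n$ as if the bound held with unit constants. The values of `a` and `b` are purely hypothetical (chosen so that $b \gg a$), and big-O bounds hide constants, so only the trend is meaningful.

```python
def examples_per_task(a, b, n_tasks):
    """Evaluate a + b / n, the per-task scaling, ignoring hidden constants."""
    return a + b / n_tasks


a, b = 10, 1000  # hypothetical constants with b much larger than a
for n_tasks in (1, 10, 100, 1000):
    print(n_tasks, examples_per_task(a, b, n_tasks))
```

With a single task the cost is dominated by $b$; once $n$ is on the order of $b$, the per-task requirement approaches $a$, matching the $O(a)$ claim for novel tasks.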
The Data Clustering (DC) problem is of central importance for the area of Machine Learning (ML), given its usefulness to represent data structural similarities from input spaces. Differently from Supervised Machine Learning (SML), which relies on the theoretical frameworks of the Statistical Learning Theory (SLT) and the Algorithm Stability (AS), DC has scarce literature on general-purpose learning guarantees, affecting conclusive remarks on how those algorithms should be designed as well as on the validity of their results. In this context, this manuscript introduces a new concept, based on multidimensional persistent homology, to analyze the conditions under which a clustering model is capable of generalizing data. As a first step, we propose a more general definition of the DC problem by relying on Topological Spaces, instead of metric ones as typically approached in the literature. From that, we show that the DC problem presents an analogous dilemma to the Bias-Variance one, which is here referred to as the Coarse-Refinement (CR) dilemma. CR is intended to clarify the contrast between: (i) highly-refined partitions and the clustering instability (overfitting); and (ii) over-coarse partitions and the lack of representativeness (underfitting); consequently, the CR dilemma suggests the need for a relaxation of Kleinberg’s richness axiom. Experimental results were used to illustrate that multidimensional persistent homology supports the measurement of divergences among DC models, leading to a consistency criterion.
Probabilistic machine learning enabled by the Bayesian formulation has recently gained significant attention in the domain of automated reasoning and decision-making. While impressive strides have been made recently to scale up the performance of deep Bayesian neural networks, they have been primarily standalone software efforts without any regard to the underlying hardware implementation. In this paper, we propose an ‘All-Spin’ Bayesian Neural Network where the underlying spintronic hardware provides a better match to the Bayesian computing models. To the best of our knowledge, this is the first exploration of a Bayesian neural hardware accelerator enabled by emerging post-CMOS technologies. We develop an experimentally calibrated device-circuit-algorithm co-simulation framework and demonstrate $23.6\times$ reduction in energy consumption against an iso-network CMOS baseline implementation.
We present a reduction from reinforcement learning (RL) to no-regret online learning based on the saddle-point formulation of RL, by which ‘any’ online algorithm with sublinear regret can generate policies with provable performance guarantees. This new perspective decouples the RL problem into two parts: regret minimization and function approximation. The first part admits a standard online-learning analysis, and the second part can be quantified independently of the learning algorithm. Therefore, the proposed reduction can be used as a tool to systematically design new RL algorithms. We demonstrate this idea by devising a simple RL algorithm based on mirror descent and the generative-model oracle. For any $\gamma$-discounted tabular RL problem, with probability at least $1-\delta$, it learns an $\epsilon$-optimal policy using at most $\tilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\log(\frac{1}{\delta})}{(1-\gamma)^4\epsilon^2}\right)$ samples. Furthermore, this algorithm admits a direct extension to linearly parameterized function approximators for large-scale applications, with computation and sample complexities independent of $|\mathcal{S}|$, $|\mathcal{A}|$, though at the cost of potential approximation bias.
The tremendous recent success of deep neural networks (DNNs) has sparked a surge of interest in understanding their predictive ability. Unlike the human visual system, which is able to generalize robustly and learn with little supervision, DNNs normally require a massive amount of data to learn new concepts. In addition, research also shows that DNNs are vulnerable to adversarial examples, i.e. maliciously generated images that seem perceptually similar to natural ones but are crafted to fool learning models, which means the models have problems generalizing to unseen data with certain types of distortions. In this paper, we analyze the generalization ability of DNNs comprehensively and attempt to improve it from a geometric point of view. We propose adversarial margin maximization (AMM), a learning-based regularization which exploits an adversarial perturbation as a proxy. It encourages a large margin in the input space, just like support vector machines. With a differentiable formulation of the perturbation, we train the regularized DNNs simply through back-propagation in an end-to-end manner. Experimental results on various datasets (including MNIST, CIFAR-10/100, SVHN and ImageNet) and different DNN architectures demonstrate the superiority of our method over previous state-of-the-art methods. Code and models for reproducing our results will be made publicly available.
Question answering (QA) aims to understand user questions and find appropriate answers. In real-world QA systems, Frequently Asked Question (FAQ) based QA is usually a practical and effective solution, especially for some complicated questions (e.g., How and Why). Recent years have witnessed the great successes of knowledge graphs (KGs) utilized in KBQA systems, while there are still few works focusing on making full use of KGs in FAQ-based QA. In this paper, we propose a novel Knowledge Anchor based Question Answering (KAQA) framework for FAQ-based QA to better understand questions and retrieve more appropriate answers. More specifically, KAQA mainly consists of three parts: knowledge graph construction, query anchoring and query-document matching. We consider entities and triples of KGs in texts as knowledge anchors to precisely capture the core semantics, which brings in higher precision and better interpretability. The multi-channel matching strategy also enables most sentence matching models to be flexibly plugged into our KAQA framework to fit different real-world computation costs. In experiments, we evaluate our models on a query-document matching task over a real-world FAQ-based QA dataset, with detailed analysis over different settings and cases. The results confirm the effectiveness and robustness of the KAQA framework in real-world FAQ-based QA.
This paper proposes a hardware-oriented dropout algorithm, which is efficient for field programmable gate array (FPGA) implementation. In deep neural networks (DNNs), overfitting occurs when networks are overtrained and adapt too well to training data. Consequently, they fail in predicting unseen data used as test data. Dropout is a common technique that is often applied in DNNs to overcome this problem. In general, implementing such training algorithms of DNNs in embedded systems is difficult due to power and memory constraints. Training DNNs is power-, time-, and memory-intensive; however, embedded systems require low power consumption and real-time processing. An FPGA is suitable for embedded systems because of its parallel processing and low operating power; however, due to its limited memory and different architecture, it is difficult to apply general neural network algorithms. Therefore, we propose a hardware-oriented dropout algorithm that can effectively utilize the characteristics of an FPGA with less memory required. Software program verification demonstrates that the performance of the proposed method is identical to that of conventional dropout, and hardware synthesis demonstrates that it results in significant resource reduction.
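
For reference, the conventional dropout that the abstract above uses as its baseline can be sketched as follows. This is the common "inverted" software formulation, not the paper's hardware-oriented variant; the function name `dropout` and its parameters are hypothetical.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero units with probability drop_prob, rescale the rest."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)  # inference: activations pass through unchanged
    keep = 1.0 - drop_prob
    # Survivors are divided by the keep probability so the expected value is preserved.
    return [x / keep if rng.random() < keep else 0.0 for x in activations]


rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], 0.5, rng))
print(dropout([1.0, 2.0, 3.0, 4.0], 0.5, rng, training=False))
```

Each training step needs a fresh random mask per unit, which hints at why a memory-constrained FPGA implementation might want a different scheme.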
Multilayer networks allow for modeling complex relationships, where individuals are embedded in multiple social networks at the same time. Given the ubiquity of such relationships, these networks have been increasingly gaining attention in the literature. This paper presents the first analysis of the robustness of centrality measures against strategic manipulation in multilayer networks. More specifically, we consider an ‘evader’ who strategically chooses which connections to form in a multilayer network in order to obtain a low centrality-based ranking (thereby reducing the chance of being highlighted as a key figure in the network) while ensuring that she remains connected to a certain group of people. We prove that determining an optimal way to ‘hide’ is NP-complete and hard to approximate for most centrality measures considered in our study. Moreover, we empirically evaluate a number of heuristics that the evader can use. Our results suggest that the centrality measures that are functions of the entire network topology are more robust to such a strategic evader than their counterparts which consider each layer separately.
There is a recent surge of interest in cross-modal representation learning corresponding to images and text. The main challenge lies in mapping images and text to a shared latent space where the embeddings corresponding to a similar semantic concept lie closer to each other than the embeddings corresponding to different semantic concepts, irrespective of the modality. Ranking losses are commonly used to create such a shared latent space; however, they do not impose any constraints on inter-class relationships, so neighboring clusters may be completely unrelated. Work in the domain of visual semantic embeddings addresses this problem by first constructing a semantic embedding space based on some external knowledge and then projecting image embeddings onto this fixed semantic embedding space. These works are confined to the image domain, and constraining the embeddings to a fixed space places an additional burden on learning. This paper proposes a novel method, HUSE, to learn cross-modal representations with semantic information. HUSE learns a shared latent space where the distance between any two universal embeddings is similar to the distance between their corresponding class embeddings in the semantic embedding space. HUSE also uses a classification objective with a shared classification layer to make sure that the image and text embeddings are in the same shared latent space. Experiments on UPMC Food-101 show that our method outperforms the previous state of the art on retrieval, hierarchical precision, and classification results.
This paper proposes a probabilistic neural network developed on the basis of time-series discriminant component analysis (TSDCA) that can be used to classify high-dimensional time-series patterns. TSDCA involves the compression of high-dimensional time series into a lower-dimensional space using a set of orthogonal transformations and the calculation of posterior probabilities based on a continuous-density hidden Markov model with a Gaussian mixture model expressed in the reduced-dimensional space. The analysis can be incorporated into a neural network, which is named a time-series discriminant component network (TSDCN), so that parameters of dimensionality reduction and classification can be obtained simultaneously as network coefficients according to a backpropagation through time-based learning algorithm with the Lagrange multiplier method. The TSDCN is considered to enable high-accuracy classification of high-dimensional time-series patterns and to reduce the computation time taken for network training. The validity of the TSDCN is demonstrated for high-dimensional artificial data and EEG signals in the experiments conducted during the study.
The in-depth analysis of time series has gained a lot of research interest in recent years, with the identification of periodic patterns being one important aspect. Many of the methods for identifying periodic patterns require the time series’ season length as an input parameter. There exist only a few algorithms for automatic season length approximation. Many of these rely on simplifications such as data discretization and user-defined parameters. This paper presents an algorithm for season length detection that is designed to be sufficiently reliable to be used in practical applications and does not require any input other than the time series to be analyzed. The algorithm estimates a time series’ season length by interpolating, filtering and detrending the data. This is followed by analyzing the distances between zeros in the directly corresponding autocorrelation function. Our algorithm was tested against a comparable algorithm and outperformed it by passing 122 out of 165 tests, while the existing algorithm passed 83 tests. The robustness of our method can be attributed both to the algorithmic approach and to design decisions taken at the implementation level.
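
The idea of reading a season length off the zeros of the autocorrelation function can be illustrated with a small sketch. This is not the paper's algorithm: it skips the interpolation and filtering steps, uses a plain least-squares line for detrending, and assumes that the gap between neighbouring zeros of the autocorrelation is about half a season. All function names are hypothetical.

```python
import math
from statistics import median


def detrend(series):
    """Subtract the least-squares straight line from the series."""
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(series))
    slope = sxy / sxx
    return [y - (mean_y + slope * (i - mean_x)) for i, y in enumerate(series)]


def autocorrelation(series, max_lag):
    """Biased sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [y - mean for y in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        return []  # constant series: nothing to analyze
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / denom
        for lag in range(max_lag + 1)
    ]


def estimate_season_length(series):
    """Estimate the season length from gaps between autocorrelation zeros.

    Assumption (not from the paper): neighbouring zeros of the
    autocorrelation are roughly half a season apart, so the estimate is
    twice the median gap. Returns None if fewer than two zeros are found.
    """
    acf = autocorrelation(detrend(series), len(series) // 2)
    zeros = []
    for lag in range(1, len(acf)):
        prev, curr = acf[lag - 1], acf[lag]
        if curr == 0:
            zeros.append(float(lag))
        elif prev * curr < 0:
            # Linear interpolation of the sign change between lag-1 and lag.
            zeros.append(lag - 1 + prev / (prev - curr))
    if len(zeros) < 2:
        return None
    gaps = [b - a for a, b in zip(zeros, zeros[1:])]
    return round(2 * median(gaps))


# Synthetic example: a sine wave with period 12 plus a linear trend.
series = [math.sin(2 * math.pi * t / 12) + 0.05 * t for t in range(120)]
print(estimate_season_length(series))
```

Using the median gap rather than the mean is a design choice made here so that a few spurious zero crossings have less influence on the estimate.
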
Deep metric learning applied to various applications has shown promising results in identification, retrieval and recognition. Existing methods often do not consider different granularity in visual similarity. However, in many domain applications, images exhibit similarity at multiple granularities with visual semantic concepts, e.g. fashion demonstrates similarity ranging from clothing of the exact same instance to similar looks/design or a common category. Therefore, training image triplets/pairs used for metric learning inherently possess different degrees of information. However, existing methods often treat them with equal importance during training. This hinders capturing the underlying granularities in feature similarity required for effective visual search. In view of this, we propose a new deep semantic granularity metric learning (SGML) method that develops a novel idea of leveraging the attribute semantic space to capture different granularities of similarity, and then integrates this information into deep metric learning. The proposed method simultaneously learns image attributes and embeddings using multitask CNNs. The two tasks are not only jointly optimized but are further linked by the semantic granularity similarity mappings to leverage the correlations between the tasks. To this end, we propose a new soft-binomial deviance loss that effectively integrates the degree of information in training samples, which helps to capture visual similarity at multiple granularities. Compared to recent ensemble-based methods, our framework is conceptually elegant, computationally simple and provides better performance. We perform extensive experiments on benchmark metric learning datasets and demonstrate that our method outperforms recent state-of-the-art methods, e.g., 1-4.5% improvement in Recall@1 over the previous state of the art [1],[2] on the DeepFashion In-Shop dataset.
In this paper the problem of learning appropriate bias for an environment of related tasks is examined from a Bayesian perspective. The environment of related tasks is shown to be naturally modelled by the concept of an *objective* prior distribution. Sampling from the objective prior corresponds to sampling different learning tasks from the environment. It is argued that for many common machine learning problems, although we don’t know the true (objective) prior for the problem, we do have some idea of a set of possible priors to which the true prior belongs. It is shown that under these circumstances a learner can use Bayesian inference to learn the true prior by sampling from the objective prior. Bounds are given on the amount of information required to learn a task when it is simultaneously learnt with several other tasks. The bounds show that if the learner has little knowledge of the true prior, and the dimensionality of the true prior is small, then sampling multiple tasks is highly advantageous.
In this paper the problem of *learning* appropriate domain-specific bias is addressed. It is shown that this can be achieved by learning many related tasks from the same domain, and a theorem is given bounding the number of tasks that must be learnt. A corollary of the theorem is that if the tasks are known to possess a common *internal representation* or *preprocessing* then the number of examples required per task for good generalisation when learning $n$ tasks simultaneously scales like $O(a + \frac{b}{n})$, where $O(a)$ is a bound on the minimum number of examples required to learn a single task, and $O(a + b)$ is a bound on the number of examples required to learn each task independently. An experiment providing strong qualitative support for the theoretical results is reported.
We use a novel modification of Multi-Armed Bandits to create a new model for recommendation systems. We model the recommendation system as a bandit seeking to maximize reward by pulling on arms with unknown rewards. The catch however is that this bandit can only access these arms through an unreliable intermediate that has some level of autonomy while choosing its arms. For example, in a streaming website the user has a lot of autonomy while choosing content they want to watch. The streaming sites can use targeted advertising as a means to bias opinions of these users. Here the streaming site is the bandit aiming to maximize reward and the user is the unreliable intermediate. We model the intermediate as accessing states via a Markov chain. The bandit is allowed to perturb this Markov chain. We prove fundamental theorems for this setting after which we show a close-to-optimal Explore-Commit algorithm.
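
For readers unfamiliar with the explore-then-commit idea mentioned above, here is a minimal sketch for a plain stochastic bandit. It does not model the unreliable intermediary or the perturbed Markov chain from the paper; the function names, the Bernoulli arms and the parameter `pulls_per_arm` are assumptions chosen for illustration.

```python
import random


def bernoulli_arm(p):
    """Return a hypothetical arm that pays 1.0 with probability p, else 0.0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


def explore_then_commit(arms, horizon, pulls_per_arm, rng):
    """Pull every arm a fixed number of times, then commit to the best one.

    arms is a list of callables taking a random.Random and returning a reward.
    Returns the index of the committed arm and the (arm, reward) history.
    """
    exploration = len(arms) * pulls_per_arm
    if exploration > horizon:
        raise ValueError("horizon is too short for the exploration phase")

    totals = [0.0] * len(arms)
    history = []

    # Exploration phase: sample each arm equally often.
    for index, arm in enumerate(arms):
        for _ in range(pulls_per_arm):
            reward = arm(rng)
            totals[index] += reward
            history.append((index, reward))

    # Pull counts are equal, so comparing totals compares empirical means.
    best = max(range(len(arms)), key=lambda i: totals[i])

    # Commit phase: play the empirically best arm for the remaining rounds.
    for _ in range(horizon - exploration):
        history.append((best, arms[best](rng)))
    return best, history


rng = random.Random(0)
arms = [bernoulli_arm(0.2), bernoulli_arm(0.5), bernoulli_arm(0.8)]
best, history = explore_then_commit(arms, horizon=1000, pulls_per_arm=50, rng=rng)
print(best, sum(reward for _, reward in history))
```

The length of the exploration phase trades off the risk of committing to a suboptimal arm against the reward lost while exploring.
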
The scattering transform is a multilayered wavelet-based deep learning architecture that acts as a model of convolutional neural networks. Recently, several works have introduced generalizations of the scattering transform for non-Euclidean settings such as graphs. Our work builds upon these constructions by introducing windowed and non-windowed graph scattering transforms based upon a very general class of asymmetric wavelets. We show that these asymmetric graph scattering transforms have many of the same theoretical guarantees as their symmetric counterparts. This work helps bridge the gap between scattering and other graph neural networks by introducing a large family of networks with provable stability and invariance guarantees. This lays the groundwork for future deep learning architectures for graph-structured data that have learned filters and also provably have desirable theoretical properties.
Detecting the semantic types of data columns in relational tables is important for various data preparation and information retrieval tasks such as data cleaning, schema matching, data discovery, and semantic search. However, existing detection approaches either perform poorly with dirty data, support only a limited number of semantic types, fail to incorporate the table context of columns or rely on large sample sizes in the training data. We introduce Sato, a hybrid machine learning model to automatically detect the semantic types of columns in tables, exploiting the signals from the context as well as the column values. Sato combines a deep learning model trained on a large-scale table corpus with topic modeling and structured prediction to achieve support-weighted and macro average F1 scores of 0.901 and 0.973, respectively, exceeding the state-of-the-art performance by a significant margin. We extensively analyze the overall and per-type performance of Sato, discussing how individual modeling components, as well as feature categories, contribute to its performance.
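
The two averages quoted above weight classes differently, which matters when some semantic types are rare. The following sketch shows how the two are commonly computed; it is an illustration written for this digest, not Sato's evaluation code, and the toy labels are invented.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} for every label seen in either list."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        scores[label] = (f1, tp + fn)  # support = number of true instances
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def support_weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true instances."""
    scores = per_class_f1(y_true, y_pred)
    total = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total


# Invented toy labels: the rare class "b" is always missed.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a"]
print(macro_f1(y_true, y_pred), support_weighted_f1(y_true, y_pred))
```

In this toy case the macro average is pulled down by the missed rare class, while the support-weighted average is dominated by the frequent class.
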
Ensuring secure and reliable operations of the power grid is a primary concern of system operators. Phasor measurement units (PMUs) are rapidly being deployed in the grid to provide fast-sampled operational data that should enable quicker decision-making. This work presents a general interpretable framework for analyzing real-time PMU data, and thus enabling grid operators to understand the current state and to identify anomalies on the fly. Applying statistical learning tools on the streaming data, we first learn an effective dynamical model to describe the current behavior of the system. Next, we use the probabilistic predictions of our learned model to define in a principled way an efficient anomaly detection tool. Finally, the last module of our framework produces on-the-fly classification of the detected anomalies into common occurrence classes using features that grid operators are familiar with. We demonstrate the efficacy of our interpretable approach through extensive numerical experiments on real PMU data collected from a transmission operator in the USA.

# Document worth reading: “Adversarial Examples in Modern Machine Learning: A Review”

Recent research has found that many families of machine learning models are vulnerable to adversarial examples: inputs that are specifically designed to cause the target model to produce erroneous outputs. In this survey, we focus on machine learning models in the visual domain, where methods for generating and detecting such examples have been most extensively studied. We explore a variety of adversarial attack methods that apply to image-space content, real world adversarial attacks, adversarial defenses, and the transferability property of adversarial examples. We also discuss strengths and weaknesses of various methods of adversarial attack and defense. Our aim is to provide an extensive coverage of the field, furnishing the reader with an intuitive understanding of the mechanics of adversarial attack and defense mechanisms and enlarging the community of researchers studying this fundamental set of problems.

# If you did not already know

Autoencoding Binary Classifier (ABC)
We propose the Autoencoding Binary Classifier (ABC), a novel supervised anomaly detector based on the Autoencoder (AE). There are two main approaches in anomaly detection: supervised and unsupervised. The supervised approach accurately detects the known anomalies included in training data, but it cannot detect the unknown anomalies. Meanwhile, the unsupervised approach can detect both known and unknown anomalies that are located away from normal data points. However, it does not detect known anomalies as accurately as the supervised approach. Furthermore, even if we have labeled normal data points and anomalies, the unsupervised approach cannot utilize these labels. The ABC is a probabilistic binary classifier that effectively exploits the label information, where normal data points are modeled using the AE as a component. By maximizing the likelihood, the AE in the proposed ABC is trained to minimize the reconstruction error for normal data points, and to maximize it for known anomalies. Because the trained model reconstructs normal data points accurately and fails to reconstruct known and unknown anomalies, it can accurately discriminate both known and unknown anomalies from normal data points. Experimental results show that the ABC achieves higher detection performance than existing supervised and unsupervised methods. …

Parikh Matrix
Parikh Matrices are a newly developed tool for studying numerical properties of words in terms of their (scattered) subwords. They were introduced by Mateescu et al. in 2000 and continuously received attention from the research community ever since.
Mateescu et al. (2000) introduced an interesting new tool, called the Parikh matrix, to study the numerical properties of words over an alphabet in terms of subwords. The Parikh matrix gives more information than the well-known Parikh vector of a word, which counts only occurrences of symbols in a word. …
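
A small sketch may make the definition concrete. For an ordered alphabet of k symbols, the Parikh matrix of a word is a (k+1)×(k+1) upper triangular matrix obtained by multiplying one elementary matrix per letter; the entry in row i and column j+1 (0-based) counts the occurrences of the scattered subword made of symbols i through j. The function name and the list-of-lists representation are choices made for this illustration.

```python
def parikh_matrix(word, alphabet):
    """Return the Parikh matrix of word over the ordered alphabet.

    The letter with index q in the alphabet corresponds to the identity
    matrix with an extra 1 at position (q, q + 1). Right-multiplying by
    that matrix adds column q to column q + 1, which is done in place here.
    """
    size = len(alphabet) + 1
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    matrix = [[1 if r == c else 0 for c in range(size)] for r in range(size)]
    for symbol in word:
        q = index[symbol]
        for row in matrix:
            row[q + 1] += row[q]
    return matrix


for row in parikh_matrix("abab", "ab"):
    print(row)
# [1, 2, 3]   two a's, three scattered occurrences of "ab"
# [0, 1, 2]   two b's
# [0, 0, 1]
```

The first superdiagonal reproduces the Parikh vector (the symbol counts), and the entries above it carry the additional subword information.
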

Graph Node-Feature Convolution
The graph convolutional network (GCN) is an emerging neural network approach. It learns a new representation of a node by aggregating feature vectors of all neighbors in the aggregation process without considering whether the neighbors or features are useful or not. Recent methods have improved solutions by sampling a fixed-size set of neighbors, or assigning different weights to different neighbors in the aggregation process, but features within a feature vector are still treated equally in the aggregation process. In this paper, we introduce a new convolution operation on regular-size feature maps constructed from features of a fixed node bandwidth via sampling to get the first-level node representation, which is then passed to a standard GCN to learn the second-level node representation. Experiments show that our method outperforms competing methods in semi-supervised node classification tasks. Furthermore, our method opens new doors for exploring new GCN architectures, particularly deeper GCN models. …

Higher-Order Kolmogorov-Smirnov Test
We present an extension of the Kolmogorov-Smirnov (KS) two-sample test, which can be more sensitive to differences in the tails. Our test statistic is an integral probability metric (IPM) defined over a higher-order total variation ball, recovering the original KS test as its simplest case. We give an exact representer result for our IPM, which generalizes the fact that the original KS test statistic can be expressed in equivalent variational and CDF forms. For small enough orders ($k \leq 5$), we develop a linear-time algorithm for computing our higher-order KS test statistic; for all others ($k \geq 6$), we give a nearly linear-time approximation. We derive the asymptotic null distribution for our test, and show that our nearly linear-time approximation shares the same asymptotic null. Lastly, we complement our theory with numerical studies. …
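
As background for the extension described above, here is a minimal sketch of the classic two-sample KS statistic in its CDF form, that is, the largest absolute gap between the two empirical distribution functions. It does not implement the higher-order test; the function name is hypothetical.

```python
def ks_two_sample_statistic(xs, ys):
    """Return the maximum absolute difference between the empirical CDFs."""
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    n, m = len(xs_sorted), len(ys_sorted)
    i = j = 0
    largest_gap = 0.0
    while i < n and j < m:
        value = min(xs_sorted[i], ys_sorted[j])
        # Advance both samples past every observation <= value, so that
        # i / n and j / m are the empirical CDFs evaluated at value.
        while i < n and xs_sorted[i] <= value:
            i += 1
        while j < m and ys_sorted[j] <= value:
            j += 1
        largest_gap = max(largest_gap, abs(i / n - j / m))
    # Once one sample is exhausted its CDF equals 1 and the gap can only shrink.
    return largest_gap


print(ks_two_sample_statistic([1, 3], [2, 4]))        # 0.5
print(ks_two_sample_statistic([1, 2, 3], [4, 5, 6]))  # 1.0
```

Sorting dominates the cost; the merge-style scan afterwards is linear in the combined sample size.
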

# Document worth reading: “Optimization Models for Machine Learning: A Survey”

This paper surveys the machine learning literature and presents machine learning as optimization models. Such models can benefit from the advancement of numerical optimization techniques which have already played a distinctive role in several machine learning settings. Particularly, mathematical optimization models are presented for commonly used machine learning approaches for regression, classification, clustering, and deep neural networks as well as new emerging applications in machine teaching and empirical model learning. The strengths and the shortcomings of these models are discussed and potential research directions are highlighted.

# Distilled News

Achieve state-of-the-art multi-label and multi-class text classification with XLNet. At the time of its publication on 19 June 2019, XLNet achieved state-of-the-art results on 18 tasks including text classification, question-answering, natural language inference, sentiment analysis, and document ranking. It even outperformed BERT on 20 tasks! Developed by Carnegie Mellon University and Google Brain, XLNet is a permutation-based auto-regressive language model. We will not delve too much into the inner workings of the model as there are a lot of great resources out there for this purpose. Rather, this article will focus on the application of XLNet to the problem of multi-label and multi-class text classification.
From Uber to Facebook: the architectures used to power the machine learning workloads of the internet giants. Despite the hype surrounding machine learning and artificial intelligence (AI), most efforts in the enterprise remain in a pilot stage. Part of the reason for this phenomenon is the natural experimentation associated with machine learning projects, but there is also a significant component related to the lack of maturity of machine learning architectures. This problem is particularly visible in enterprise environments in which the new application lifecycle management practices of modern machine learning solutions conflict with corporate practices and regulatory requirements. What are the key architecture building blocks that organizations should put in place when adopting machine learning solutions? The answer is not trivial, but recently we have seen some efforts from research labs and AI data science that are starting to lay down the path of what can become reference architectures for large scale machine learning solutions.
Airflow is becoming the industry standard for authoring data engineering and model pipeline workflows. This chapter of my book explores the process of taking a simple pipeline that runs on a single EC2 instance to a fully-managed Kubernetes ecosystem responsible for scheduling tasks. This post omits the sections on the fully-managed solutions with GKE and Cloud Composer.
Artificial Intelligence – Cloud and Edge implementations takes an engineering-led approach for the deployment of AI to Edge devices within the framework of the cloud. We often use the word ‘engineering’ in casual conversation. However, in this context, we attach a specific meaning to Engineering. Engineering is the use of scientific principles to design and build machines, structures, and other items, including bridges, tunnels, roads, vehicles, and buildings. The American Engineers’ Council for Professional Development defines engineering as: (specific emphasis of interest highlighted)
Statistical estimation usually has the following setup. There is a sample (an observed, usually randomly chosen, set of values of measurable quantities) from some general population (the whole set of values of the same measurable quantities). We need to draw conclusions about the general population based on the sample. This is done by computing summary values (called statistics) of the sample, and making reasonable assumptions (a process usually called inference) about how close these values are to the values that could be computed from the whole general population. Thus, a summary value based on a sample (a sample statistic) is an estimate of the corresponding summary value based on the general population (the true value). How can we make inferences about the quality of this estimate? This question itself describes statistical uncertainty and can be unfolded into a deep philosophical question about probability, nature, and life in general. Basically, the answer depends on assumptions about the relation between sample, general population, and statistic.
In today’s world, being a Data Scientist is not limited to those with technical knowledge. While it is recommended and sometimes important to know a little bit of code, you can get by with just intuitive knowledge, especially if you’re on H2O’s Driverless AI platform. If you haven’t heard of H2O.ai, it is the company that created the open-source machine learning platform, H2O, which is used by many in the Fortune 500. H2O aims at creating efficiency-driven machine learning environments by leveraging its user-friendly interface and modular capabilities.
How do the hyperparameters for a decision tree affect your model and how do you choose which ones to tune?
This project was created in an attempt to learn and understand how various classification algorithms work within a Natural Language Processing Model. Natural Language Processing, which I will now refer to as NLP, is a branch of machine learning that focuses on enabling computers to interpret and process human languages in both speech and text forms.
We have intellectual property (IP) protection watermarks on media content such as images and music. What about deep neural networks (DNNs)?
Decision trees are one of many supervised learning algorithms available to anyone looking to make predictions of future events based on some historical data and, although there is no one generic tool optimal for all problems, decision trees are hugely popular and turn out to be very effective in many machine learning applications. To understand the intuition behind the decision tree, consider the problem of designing an algorithm to automatically differentiate between apples and pears (class labels) given only their width and height measurements (features).
A Convolutional Neural Network (CNN) is a special type of deep neural network that performs impressively in computer vision problems such as image classification, object detection, etc. In this article, we are going to create an image classifier with Tensorflow by implementing a CNN to classify cats & dogs. With traditional programming it is not possible to build scalable solutions for problems like computer vision, since it is not feasible to write an algorithm that is generalized enough to identify the nature of images. With machine learning, we can build an approximation that is sufficient for practical use cases by training a model on given examples and predicting on unseen data.
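
The apples-versus-pears intuition can be sketched as a single split (a depth-one decision tree) chosen by Gini impurity. The measurements below are invented purely for illustration, and the helper names are hypothetical; a real decision tree would apply the same search recursively to each side of the split.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def best_split(rows, labels):
    """Find the (feature_index, threshold, weighted_gini) with lowest impurity.

    Candidate thresholds are midpoints between consecutive distinct values.
    """
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [l for row, l in zip(rows, labels) if row[feature] <= threshold]
            right = [l for row, l in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Invented (width, height) measurements in centimetres.
rows = [(7.5, 7.0), (8.0, 7.2), (7.8, 6.9), (6.0, 9.0), (6.2, 9.5), (5.9, 8.8)]
labels = ["apple", "apple", "apple", "pear", "pear", "pear"]

feature, threshold, score = best_split(rows, labels)
print(feature, threshold, score)
```

On this toy data a single width threshold separates the two classes perfectly, so the weighted impurity of the best split is zero.
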