# Distilled News

This article is the second in a series of articles aimed at demystifying the theory behind neural networks and how to design and implement them for solving practical problems. In this article, I will cover the design and optimization aspects of neural networks in detail.
• Anatomy of a neural network
• Activation functions
• Loss functions
• Output units
• Architecture
These tutorials are largely based on the notes and examples from multiple classes taught at Harvard and Stanford in the computer science and data science departments.
• Use the statsmodels Python module to implement a Kalman Filter model with external control inputs,
• Use Maximum Likelihood to estimate unknown parameters in the Kalman Filter model matrices,
• See how cumulative impact can be modeled via the Kalman Filter. (This article uses the fitness-fatigue model of athletic performance as an example and doubles as Modeling Cumulative Impact Part IV.)
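As a rough illustration of what a Kalman filter with an external control input computes, here is a minimal scalar sketch in plain Python. It does not use statsmodels and does not estimate parameters by maximum likelihood; the function name `kalman_filter_1d`, the model form, and every default value are assumptions made up for this example rather than anything taken from the article.

```python
def kalman_filter_1d(observations, controls, a=0.9, b=1.0, h=1.0,
                     q=0.01, r=1.0, x0=0.0, p0=1.0):
    """Scalar Kalman filter for the (assumed) model
        x_t = a * x_{t-1} + b * u_t + w_t,   w_t ~ N(0, q)
        y_t = h * x_t + v_t,                 v_t ~ N(0, r)
    Returns the filtered state estimate after each observation.
    """
    x, p = x0, p0
    estimates = []
    for y, u in zip(observations, controls):
        # Predict: propagate the state and its variance, adding the control input.
        x_pred = a * x + b * u
        p_pred = a * p * a + q
        # Update: blend prediction and observation using the Kalman gain.
        gain = p_pred * h / (h * p_pred * h + r)
        x = x_pred + gain * (y - h * x_pred)
        p = (1.0 - gain * h) * p_pred
        estimates.append(x)
    return estimates
```

With `q=0` and `p0=0` the filter trusts the model completely and ignores the observations; with a very small `r` it tracks the observations almost exactly. In a cumulative-impact setting, the control sequence could plausibly hold the external inputs (for example, training loads) whose decaying effect accumulates in the state.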
Google Colaboratory is a research tool for data science and machine learning. It’s a Jupyter notebook environment that requires no setup to use. It is one of the top tools for data scientists because you don’t have to manually install all the packages and libraries; you just import them directly, whereas in a normal IDE you have to install the libraries manually. Moreover, notebooks are meant for code and explanation, so a notebook should often read like a blog post. I have been using Google Colab for the past two months and it has been the best tool for me. In this blog, I will give you some tips and tricks for mastering Google Colab. Read through all the points: these are features that even I struggled to implement at first and have now mastered. Let’s look at the best features of the Google Colab notebook.
What is Source-Based-Language-Learning in this context? Very simple – it is my way of describing the process of learning a language to literally understand a source (i.e. book, speech, etc.). In the specific case of what I will be sharing, it translates to learning Classical Arabic to be able to read/comprehend the Quran (in its native language, without translation). So why all this drama of using a Genetic Algorithm (GA) to learn a language? To understand this, it will require a better sense of the problem statement.
The 1990s saw the emergence of cognitive models that depend on very high dimensionality and randomness. They include Holographic Reduced Representations, Spatter Code, Semantic Vectors, Latent Semantic Analysis, Context-Dependent Thinning, and Vector-Symbolic Architecture. They represent things in high-dimensional vectors that are manipulated by operations that produce new high-dimensional vectors in the style of traditional computing, in what is called here hyperdimensional computing on account of the very high dimensionality. The paper presents the main ideas behind these models, written as a tutorial essay in hopes of making the ideas accessible and even provocative. A sketch of how we have arrived at these models, with references and pointers to further reading, is given at the end. The thesis of the paper is that hyperdimensional representation has much to offer to students of cognitive science, theoretical neuroscience, computer science and engineering, and mathematics.
Since being open sourced in 2015, TensorFlow has had a significant impact on many industries. With TensorFlow 2.0’s eager execution, intuitive high-level APIs, and flexible model building on any platform, it’s cementing its place as the production-ready, end-to-end platform driving the machine learning revolution. At TensorFlow World you’ll see TensorFlow 2.0 in action, discover new ways to use it, and learn how to successfully implement it in your enterprise.
I recently published a paper in SPIE 2019 on a system that estimates the age of a person using Chest X-Rays (CXR) and deep learning. Such a system can be utilized in scenarios where the age information of the patient is missing. Forensics is an example of an area that could benefit. More interestingly though, by using deep network activation maps we can visualize which anatomical areas of CXRs age affects most, offering insight into what the network ‘sees’ to estimate age. It might be too early to tell how age estimation and visualization on CXRs can have clinical implications. Nevertheless, age discrepancy between the network’s prediction and the real patient age can be useful for preventative counseling of patient health status. Excerpts from the paper as well as new experiments are provided in this post.
The output of your fraud detection model is the probability [0.0-1.0] that a transaction is fraudulent. If this probability is below 0.5, you classify the transaction as non-fraudulent; otherwise, you classify the transaction as fraudulent.
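The decision rule described above can be sketched in a few lines of Python. The function name `classify_transaction` and the label strings are hypothetical; only the 0.5 cut-off and the "below means non-fraudulent" convention come from the text.

```python
def classify_transaction(fraud_probability, threshold=0.5):
    """Map a model's fraud probability to a class label.

    Probabilities below the threshold are labelled non-fraudulent;
    probabilities at or above it are labelled fraudulent.
    """
    if not 0.0 <= fraud_probability <= 1.0:
        raise ValueError("fraud_probability must lie in [0.0, 1.0]")
    return "fraudulent" if fraud_probability >= threshold else "non-fraudulent"
```

Keeping the threshold as a parameter makes it easy to see how the trade-off between missed fraud and false alarms shifts when the cut-off moves away from 0.5.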
In this article, I will explain how to use active learning to iteratively improve the performance of a machine learning model. This technique is applicable to any model but for the purpose of this article, I will illustrate how it’s done to improve a binary text classifier. All the materials covered in this article are based on the 2018 Strata Data Conference Tutorial titled ‘Using R and Python for scalable data science, machine learning and AI’ from Microsoft. I assume the reader is familiar with the concept of active learning in the context of machine learning. If not, then the lead section of this Wikipedia article serves as a good introduction.
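One common active learning strategy for a binary classifier is uncertainty sampling: ask a human to label the examples the current model is least sure about. The blurb does not say which strategy the tutorial uses, so the sketch below is an assumption for illustration, and `select_most_uncertain` is a hypothetical helper name.

```python
def select_most_uncertain(probabilities, k):
    """Return the indices of the k predictions closest to 0.5.

    `probabilities` holds the model's predicted probability of the
    positive class for each unlabeled example.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]
```

In each round, the selected examples would be labeled, added to the training set, and the classifier retrained before the next selection.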
The Knowledge Graph is Google’s semantic database. This is where entities are placed in relation to one another, assigned attributes and set in a thematic context or an ontology. But what is an entity? And how does the Knowledge Graph actually work? Find the answers to these questions in our latest Unwrapping the Secrets of SEO, the last in part three in Olaf Kopp’s series looking at Google’s semantics and machine learning.
PyRobot is a framework and ecosystem that enables AI researchers and students to get up and running with a robot in just a few hours, without specialized knowledge of the hardware or of details such as device drivers, control, and planning.
122 slides, very readable, about learning from images, from video, and from video with sound.
Probabilistic programming aims to help users make decisions under uncertainty. The user writes code representing a probabilistic model, and receives outcomes as distributions or summary statistics. We consider probabilistic programming for end-users, in particular spreadsheet users, estimated to number in tens to hundreds of millions. We examine the sources of uncertainty actually encountered by spreadsheet users, and their coping mechanisms, via an interview study. We examine spreadsheet-based interfaces and technology to help reason under uncertainty, via probabilistic and other means. We show how uncertain values can propagate uncertainty through spreadsheets, and how sheet-defined functions can be applied to handle uncertainty. Hence, we draw conclusions about the promise and limitations of probabilistic programming for end-users.

# R Packages worth a look

Automatic Estimation of Number of Principal Components in PCA (pesel)
Automatic estimation of number of principal components in PCA with PEnalized SEmi-integrated Likelihood (PESEL). See Piotr Sobczyk, Malgorzata Bogdan, …

Fast Implementation of Dijkstra Algorithm (cppRouting)
Calculation of distances, shortest paths and isochrones on weighted graphs using several variants of Dijkstra algorithm. Proposed algorithms are unidir …

Robust P-Value Combination Methods (metapro)
The meta-analysis is performed to increase the statistical power by integrating the results from several experiments. The p-values are often combined i …

Estimation in Nonprobability Sampling (NonProbEst)
Different inference procedures are proposed in the literature to correct for selection bias that might be introduced with non-random selection mechanis …

# Whats new on arXiv

Group convolution works well with many deep convolutional neural networks (CNNs) and can effectively compress the model by reducing the number of parameters and computational cost. With this operation, however, feature maps of different groups cannot communicate, which restricts their representation capability. To address this issue, in this work, we propose a novel operation named Hierarchical Group Convolution (HGC) for creating computationally efficient neural networks. Unlike standard group convolution, which blocks inter-group information exchange and induces severe performance degradation, HGC can hierarchically fuse the feature maps from each group and leverage the inter-group information effectively. Taking advantage of the proposed method, we introduce a family of compact networks called HGCNets. Compared to networks using standard group convolution, HGCNets have a huge improvement in accuracy at the same model size and complexity level. Extensive experimental results on the CIFAR dataset demonstrate that HGCNets obtain significant reduction of parameters and computational cost to achieve comparable performance over the prior CNN architectures designed for mobile devices such as MobileNet and ShuffleNet.
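To make the parameter saving of standard group convolution concrete, the sketch below counts convolution weights (ignoring biases) for a given number of groups. It illustrates plain group convolution only, not the HGC operation proposed in the paper, and `conv_weight_count` is a hypothetical helper name.

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights in a 2D convolution with square kernels, no bias.

    Each output channel only sees in_channels / groups input channels,
    so the weight count shrinks by a factor of `groups`.
    """
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (in_channels // groups) * out_channels * kernel_size * kernel_size
```

For example, a 3x3 layer from 64 to 128 channels has 73,728 weights as a standard convolution and 18,432 with 4 groups. The price is exactly the restriction the abstract describes: each group sees only its own slice of the input channels.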
Identifying statistically significant dependency between variables is a key step in scientific discoveries. Many recent methods, such as distance and kernel tests, have been proposed for valid and consistent independence testing and can be applied to data in Euclidean and non-Euclidean spaces. However, in those works, $n$ pairs of points in $\mathcal{X} \times \mathcal{Y}$ are observed. Here, we consider the setting where a pair of $n \times n$ graphs are observed, and the corresponding adjacency matrices are treated as kernel matrices. Under a $\rho$-correlated stochastic block model, we demonstrate that a naïve test (permutation and Pearson’s) for a conditional dependency graph model is invalid. Instead, we propose a block-permutation procedure. We prove that our procedure is valid and consistent (even when the two graphs have different marginal distributions, are weighted or unweighted, and the latent vertex assignments are unknown) and provide sufficient conditions for the tests to estimate $\rho$. Simulations corroborate these results on both binary and weighted graphs. Applying these tests to the whole-organism, single-cell-resolution structural connectomes of C. elegans, we identify strong statistical dependency between the chemical synapse connectome and the gap junction connectome.
In this paper, we present a novel approach for incorporating external knowledge in Recurrent Neural Networks (RNNs). We propose the integration of lexicon features into the self-attention mechanism of RNN-based architectures. This form of conditioning on the attention distribution, enforces the contribution of the most salient words for the task at hand. We introduce three methods, namely attentional concatenation, feature-based gating and affine transformation. Experiments on six benchmark datasets show the effectiveness of our methods. Attentional feature-based gating yields consistent performance improvement across tasks. Our approach is implemented as a simple add-on module for RNN-based models with minimal computational overhead and can be adapted to any deep neural architecture.
Graph Neural Networks (GNNs) are based on repeated aggregations of information across nodes’ neighbors in a graph. However, because common neighbors are shared between different nodes, this leads to repeated and inefficient computations. We propose Hierarchically Aggregated computation Graphs (HAGs), a new GNN graph representation that explicitly avoids redundancy by managing intermediate aggregation results hierarchically, eliminating repeated computations and unnecessary data transfers in GNN training and inference. We introduce an accurate cost function to quantitatively evaluate the runtime performance of different HAGs and use a novel HAG search algorithm to find optimized HAGs. Experiments show that the HAG representation significantly outperforms the standard GNN graph representation by increasing the end-to-end training throughput by up to 2.8x and reducing the aggregations and data transfers in GNN training by up to 6.3x and 5.6x, while maintaining the original model accuracy.
Unrolled neural networks emerged recently as an effective model for learning inverse maps appearing in image restoration tasks. However, their generalization risk (i.e., test mean-squared-error) and its link to network design and training sample size remain mysterious. Leveraging Stein’s Unbiased Risk Estimator (SURE), this paper analyzes the generalization risk with its bias and variance components for recurrent unrolled networks. We particularly investigate the degrees-of-freedom (DOF) component of SURE, the trace of the end-to-end network Jacobian, to quantify the prediction variance. We prove that DOF is well-approximated by the weighted *path sparsity* of the network under incoherence conditions on the trained weights. Empirically, we examine the SURE components as a function of training sample size for both recurrent and non-recurrent (with many more parameters) unrolled networks. Our key observations indicate that: 1) DOF increases with training sample size and converges to the generalization risk for both recurrent and non-recurrent schemes; 2) the recurrent network converges significantly faster (with fewer training samples) than the non-recurrent scheme, hence recurrence serves as a regularization for low sample size regimes.
After learning a concept, humans are also able to continually generalize their learned concepts to new domains by observing only a few labeled instances without any interference with the past learned knowledge. In contrast, learning concepts efficiently in a continual learning setting remains an open challenge for current Artificial Intelligence algorithms as persistent model retraining is necessary. Inspired by the Parallel Distributed Processing learning and the Complementary Learning Systems theories, we develop a computational model that is able to expand its previously learned concepts efficiently to new domains using a few labeled samples. We couple the new form of a concept to its past learned forms in an embedding space for effective continual learning. Doing so, a generative distribution is learned such that it is shared across the tasks in the embedding space and models the abstract concepts. This procedure enables the model to generate pseudo-data points to replay the past experience to tackle catastrophic forgetting.
Graph Neural Networks (GNNs) have boosted the performance of many graph-related tasks such as node classification and graph classification. Recent research shows that graph neural networks are vulnerable to adversarial attacks, which deliberately add carefully created unnoticeable perturbations to the graph structure. The perturbation is usually created by adding/deleting a few edges, which might be noticeable even when the number of edges modified is small. In this paper, we propose a graph rewiring operation which affects the graph in a less noticeable way compared to adding/deleting edges. We then use reinforcement learning to learn the attack strategy based on the proposed rewiring operation. Experiments on real-world graphs demonstrate the effectiveness of the proposed framework. To understand the proposed framework, we further analyze how its generated perturbation to the graph structure affects the output of the target model.
Cyberspace has gradually replaced physical reality, its role evolving from a simple enabler of daily life processes to a necessity for modern existence. As a result of this convergence of physical and virtual realities, with all processes critically dependent on networked communications, information representative of our physical, logical and social thoughts is constantly being generated in cyberspace. The interconnection and integration of links between our physical and virtual realities creates a new hyperspace as a source of data and information. Additionally, significant studies in cyber analysis have predominantly revolved around a single linear analysis of information from a single source of evidence (The Network). These studies are limited in their ability to understand the dynamics of relationships across the multiple dimensions of cyberspace. This paper introduces a multi-dimensional perspective for data identification in cyberspace. It provides critical discussions for identifying entangled relationships amongst entities across cyberspace.
The 2016 Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report emphasized six recommendations to teach introductory courses in statistics. Among them: use of real data with context and purpose. Many educators have created databases consisting of multiple data sets for use in class, sometimes making hundreds of data sets available. Yet the ‘context and purpose’ component of the data may remain elusive if just a generic database is made available. We describe the use of open data in introductory courses. Countries and cities continue to share data through open data portals. Hence, educators can find regional data that engages their students more effectively. We present excerpts from case studies that show the application of statistical methods to data on crime, housing, rainfall, tourist travel, and others. Data wrangling and discussion of results are recognized as important case study components. Thus the open data based case studies attend to most GAISE College Report recommendations. Reproducible R code is made available for each case study. Example uses of open data in more advanced courses in statistics are also described.
Smartphones have become the ultimate ‘personal’ computer, yet despite this, general-purpose data-mining and knowledge discovery tools for mobile devices are surprisingly rare. DataLearner is a new data-mining application designed specifically for Android devices that imports the Weka data-mining engine and augments it with algorithms developed by Charles Sturt University. Moreover, DataLearner can be expanded with additional algorithms. Combined, DataLearner delivers 40 classification, clustering and association rule mining algorithms for model training and evaluation without need for cloud computing resources or network connectivity. It provides the same classification accuracy as PCs and laptops, while doing so with acceptable processing speed and consuming negligible battery life. With its ability to provide easy-to-use data-mining on a phone-size screen, DataLearner is a new portable, self-contained data-mining tool for remote, personalised and learning applications alike. DataLearner features four elements – this paper, the app available on Google Play, the GPL3-licensed source code on GitHub and a short video on YouTube.
Financial institutions are currently looking into technologies for permissioned blockchains. A major effort in this direction is Hyperledger, an open source project hosted by the Linux Foundation and backed by a consortium of over a hundred companies. A key component in permissioned blockchain protocols is a byzantine fault tolerant (BFT) consensus engine that orders transactions. However, currently available BFT solutions in Hyperledger (as well as in the literature at large) are inadequate for financial settings; they are not designed to ensure fairness or to tolerate selfish behavior that arises when financial institutions strive to maximize their own profit. We present FairLedger, a permissioned blockchain BFT protocol, which is fair, designed to deal with rational behavior, and, no less important, easy to understand and implement. The secret sauce of our protocol is a new communication abstraction, called detectable all-to-all (DA2A), which allows us to detect participants (byzantine or rational) that deviate from the protocol, and punish them. We implement FairLedger in the Hyperledger open source project, using the Iroha framework, one of the biggest projects therein. To evaluate FairLedger’s performance, we also implement it in the PBFT framework and compare the two protocols. Our results show that in failure-free scenarios FairLedger achieves better throughput than both Iroha’s implementation and PBFT in wide-area settings.
Open-domain targeted sentiment analysis aims to detect opinion targets along with their sentiment polarities from a sentence. Prior work typically formulates this task as a sequence tagging problem. However, such formulation suffers from problems such as huge search space and sentiment inconsistency. To address these problems, we propose a span-based extract-then-classify framework, where multiple opinion targets are directly extracted from the sentence under the supervision of target span boundaries, and corresponding polarities are then classified using their span representations. We further investigate three approaches under this framework, namely the pipeline, joint, and collapsed models. Experiments on three benchmark datasets show that our approach consistently outperforms the sequence tagging baseline. Moreover, we find that the pipeline model achieves the best performance compared with the other two models.
Large companies need to monitor various metrics (for example, Page Views and Revenue) of their applications and services in real time. At Microsoft, we develop a time-series anomaly detection service which helps customers to monitor the time-series continuously and alert for potential incidents on time. In this paper, we introduce the pipeline and algorithm of our anomaly detection service, which is designed to be accurate, efficient and general. The pipeline consists of three major modules, including data ingestion, experimentation platform and online compute. To tackle the problem of time-series anomaly detection, we propose a novel algorithm based on Spectral Residual (SR) and Convolutional Neural Network (CNN). Our work is the first attempt to borrow the SR model from visual saliency detection domain to time-series anomaly detection. Moreover, we innovatively combine SR and CNN together to improve the performance of SR model. Our approach achieves superior experimental results compared with state-of-the-art baselines on both public datasets and Microsoft production data.
Classical Machine Learning (ML) pipelines often comprise multiple ML models, where the models within a pipeline are trained in isolation. Conversely, when training neural network models, the layers composing the neural models are trained simultaneously using backpropagation. We argue that the isolated training scheme of ML pipelines is sub-optimal, since it cannot jointly optimize multiple components. To this end, we propose a framework that translates a pre-trained ML pipeline into a neural network and fine-tunes the ML models within the pipeline jointly using backpropagation. Our experiments show that fine-tuning of the translated pipelines is a promising technique able to increase the final accuracy.
Enabling a machine to read and comprehend natural language documents so that it can answer questions remains an elusive challenge. In recent years, the popularity of deep learning and the establishment of large-scale datasets have both promoted the prosperity of Machine Reading Comprehension. This paper aims to present how to utilize neural networks to build a Reader, introduce some classic models, and analyze what improvements they make. Further, we also point out the defects of existing models and future research directions.
Residual Networks with convolutional layers are widely used in the field of machine learning. Since they effectively extract features from input data by stacking multiple layers, they can achieve high accuracy in many applications. However, the stacking of many layers raises their computation costs. To address this problem, we propose Network Implosion, which erases multiple layers from Residual Networks without degrading accuracy. Our key idea is to introduce a priority term that identifies the importance of a layer; we can select unimportant layers according to the priority and erase them after the training. In addition, we retrain the networks to avoid critical drops in accuracy after layer erasure. A theoretical assessment reveals that our erasure and retraining scheme can erase layers without accuracy drop, and achieve higher accuracy than is possible with training from scratch. Our experiments show that Network Implosion can, for classification on Cifar-10/100 and ImageNet, reduce the number of layers by 24.00 to 42.86 percent without any drop in accuracy.
As machine learning is increasingly used to make real-world decisions, recent research efforts aim to define and ensure fairness in algorithmic decision making. Existing methods often assume a fixed set of observable features to define individuals, but lack a discussion of certain features not being observed at test time. In this paper, we study fairness of naive Bayes classifiers, which allow partial observations. In particular, we introduce the notion of a discrimination pattern, which refers to an individual receiving different classifications depending on whether some sensitive attributes were observed. Then a model is considered fair if it has no such pattern. We propose an algorithm to discover and mine for discrimination patterns in a naive Bayes classifier, and show how to learn maximum-likelihood parameters subject to these fairness constraints. Our approach iteratively discovers and eliminates discrimination patterns until a fair model is learned. An empirical evaluation on three real-world datasets demonstrates that we can remove exponentially many discrimination patterns by only adding a small fraction of them as constraints.
A huge volume of user-generated content is produced daily on social media. To facilitate automatic language understanding, we study keyphrase prediction, distilling salient information from massive posts. While most existing methods extract words from source posts to form keyphrases, we propose a sequence-to-sequence (seq2seq) based neural keyphrase generation framework, enabling absent keyphrases to be created. Moreover, our model, being topic-aware, allows joint modeling of corpus-level latent topic representations, which helps alleviate the data sparsity that is widely exhibited in social media language. Experiments on three datasets collected from English and Chinese social media platforms show that our model significantly outperforms both extraction and generation models that do not exploit latent topics. Further discussions show that our model learns meaningful topics, which explains its superiority in social media keyphrase generation.
Complaining is a basic speech act regularly used in human and computer mediated communication to express a negative mismatch between reality and expectations in a particular situation. Automatically identifying complaints in social media is of utmost importance for organizations or brands to improve the customer experience or in developing dialogue systems for handling and responding to complaints. In this paper, we introduce the first systematic analysis of complaints in computational linguistics. We collect a new annotated data set of written complaints expressed in English on Twitter. (Data and code are available here: https://…/complaints-social-media) We present an extensive linguistic analysis of complaining as a speech act in social media and train strong feature-based and neural models of complaints across nine domains achieving a predictive performance of up to 79 F1 using distant supervision.
To be successful in real-world tasks, Reinforcement Learning (RL) needs to exploit the compositional, relational, and hierarchical structure of the world, and learn to transfer it to the task at hand. Recent advances in representation learning for language make it possible to build models that acquire world knowledge from text corpora and integrate this knowledge into downstream decision making problems. We thus argue that the time is right to investigate a tight integration of natural language understanding into RL in particular. We survey the state of the field, including work on instruction following, text games, and learning from textual domain knowledge. Finally, we call for the development of new environments as well as further investigation into the potential uses of recent Natural Language Processing (NLP) techniques for such tasks.
We study the problem of computing the minimum adversarial perturbation of the Nearest Neighbor (NN) classifiers. Previous attempts either conduct attacks on continuous approximations of NN models or search for the perturbation by some heuristic methods. In this paper, we propose the first algorithm that is able to compute the minimum adversarial perturbation. The main idea is to formulate the problem as a list of convex quadratic programming (QP) problems that can be efficiently solved by the proposed algorithms for 1-NN models. Furthermore, we show that dual solutions for these QP problems could give us a valid lower bound of the adversarial perturbation that can be used for formal robustness verification, giving us a nice view of attack/verification for NN models. For $K$-NN models with larger $K$, we show that the same formulation can help us efficiently compute the upper and lower bounds of the minimum adversarial perturbation, which can be used for attack and verification.

# If you did not already know

Ensemble Actor-Critic (EAC)
We propose a new policy iteration theory as an important extension of soft policy iteration and Soft Actor-Critic (SAC), one of the most efficient model free algorithms for deep reinforcement learning. Supported by the new theory, arbitrary entropy measures that generalize Shannon entropy, such as Tsallis entropy and Renyi entropy, can be utilized to properly randomize action selection while fulfilling the goal of maximizing expected long-term rewards. Our theory gives birth to two new algorithms, i.e., Tsallis entropy Actor-Critic (TAC) and Renyi entropy Actor-Critic (RAC). Theoretical analysis shows that these algorithms can be more effective than SAC. Moreover, they pave the way for us to develop a new Ensemble Actor-Critic (EAC) algorithm in this paper that features the use of a bootstrap mechanism for deep environment exploration as well as a new value-function based mechanism for high-level action selection. Empirically we show that TAC, RAC and EAC can achieve state-of-the-art performance on a range of benchmark control tasks, outperforming SAC and several cutting-edge learning algorithms in terms of both sample efficiency and effectiveness. …
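For readers unfamiliar with the entropy measures named above, the sketch below computes Shannon, Tsallis and Rényi entropy for a discrete distribution, using the standard textbook definitions with natural logarithms. It only illustrates the measures themselves, not the TAC, RAC or EAC algorithms; the function names are chosen for this example.

```python
import math

def shannon_entropy(p):
    """H(p) = -sum_i p_i * ln(p_i), skipping zero-probability outcomes."""
    return -sum(x * math.log(x) for x in p if x > 0)

def tsallis_entropy(p, q):
    """S_q(p) = (1 - sum_i p_i^q) / (q - 1); reduces to Shannon as q -> 1."""
    if q == 1:
        return shannon_entropy(p)
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)

def renyi_entropy(p, alpha):
    """H_alpha(p) = ln(sum_i p_i^alpha) / (1 - alpha); Shannon as alpha -> 1."""
    if alpha == 1:
        return shannon_entropy(p)
    return math.log(sum(x ** alpha for x in p if x > 0)) / (1.0 - alpha)
```

For a uniform distribution over four actions, Shannon and Rényi entropy both equal ln 4, while Tsallis entropy with q = 2 equals 0.75. In an entropy-regularized policy, a term of this kind would be added to the reward to keep action selection suitably random.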

Robust Regression Extended with Ensemble Loss Function (RELF)
Ensemble techniques are powerful approaches that combine several weak learners to build a stronger one. As a meta-learning framework, ensemble techniques can easily be applied to many machine learning methods. Inspired by ensemble techniques, in this paper we propose an ensemble loss function applied to a simple regressor. We then propose a half-quadratic learning algorithm in order to find the parameter of the regressor and the optimal weights associated with each loss function. Moreover, we show that our proposed loss function is robust in noisy environments. For a particular class of loss functions, we show that our proposed ensemble loss function is Bayes consistent and robust. Experimental evaluations on several datasets demonstrate that our proposed ensemble loss function significantly improves the performance of a simple regressor in comparison with state-of-the-art methods. …

StartNet
We propose StartNet to address Online Detection of Action Start (ODAS) where action starts and their associated categories are detected in untrimmed, streaming videos. Previous methods aim to localize action starts by learning feature representations that can directly separate the start point from its preceding background. It is challenging due to the subtle appearance difference near the action starts and the lack of training data. Instead, StartNet decomposes ODAS into two stages: action classification (using ClsNet) and start point localization (using LocNet). ClsNet focuses on per-frame labeling and predicts action score distributions online. Based on the predicted action scores of the past and current frames, LocNet conducts class-agnostic start detection by optimizing long-term localization rewards using policy gradient methods. The proposed framework is validated on two large-scale datasets, THUMOS’14 and ActivityNet. The experimental results show that StartNet significantly outperforms the state-of-the-art by 15%-30% p-mAP under the offset tolerance of 1-10 seconds on THUMOS’14, and achieves comparable performance on ActivityNet with 10 times smaller time offset. …

Sub-LInear Deep Learning Engine (SLIDE)
Deep Learning (DL) algorithms are the central focus of modern machine learning systems. As data volumes keep growing, it has become customary to train large neural networks with hundreds of millions of parameters with enough capacity to memorize these volumes and obtain state-of-the-art accuracy. To get around the costly computations associated with large models and data, the community is increasingly investing in specialized hardware for model training. However, with the end of Moore’s law, there is a limit to such scaling. The progress on the algorithmic front has failed to demonstrate a direct advantage over powerful hardware such as NVIDIA-V100 GPUs. This paper provides an exception. We propose SLIDE (Sub-LInear Deep learning Engine) that uniquely blends smart randomized algorithms, which drastically reduce the computation during both training and inference, with simple multi-core parallelism on a modest CPU. SLIDE is an auspicious illustration of the power of smart randomized algorithms over CPUs in outperforming the best available GPU with an optimized implementation. Our evaluations on large industry-scale datasets, with some large fully connected architectures, show that training with SLIDE on a 44 core CPU is more than 2.7 times (2 hours vs. 5.5 hours) faster than the same network trained using Tensorflow on Tesla V100 at any given accuracy level. We provide codes and benchmark scripts for reproducibility. …

# Document worth reading: “Shannon’s entropy and its Generalizations towards Statistics, Reliability and Information Science during 1948-2018”

Starting from the pioneering works of Shannon and Wiener in 1948, a plethora of works have been reported on entropy in different directions. Entropy-related review work in the direction of statistics, reliability and information science, to the best of our knowledge, has not been reported so far. Here we have tried to collect all possible works in this direction during the period 1948-2018 so that people interested in entropy, especially new researchers, can benefit. Shannon’s entropy and its Generalizations towards Statistics, Reliability and Information Science during 1948-2018

# What’s new on arXiv

Although timing and synchronization of a dynamically-changing set of elements and their related power considerations are essential to many cyber-physical systems (CPS), they are absent from today’s programming languages, forcing programmers to handle these matters outside of the language and on a case-by-case basis. This paper proposes a framework for adding time-related concepts to languages. Complementing prior work in this area, this paper develops the notion of dynamically federated islands of variable-precision synchronization and coordinated entities through synergistic activities at the language, system, network, and device levels. At the language level, we explore constructs that capture key timing and synchronization concepts and, at the system level, we propose a flexible intermediate language that represents both program logic and timing constraints together with run-time mechanisms. At the network level, we argue for architectural extensions that permit the network to act as a combined computing, communication, storage, and synchronization platform and at the device level, we explore architectural concepts that can lead to greater interoperability, easy establishment of timing constraints, and more power-efficient designs.
We consider the stochastic multi-armed bandit problem and the contextual bandit problem with historical observations and pre-clustered arms. The historical observations can contain any number of instances for each arm, and the pre-clustering information is a fixed clustering of arms provided as part of the input. We develop a variety of algorithms which incorporate this offline information effectively during the online exploration phase and derive their regret bounds. In particular, we develop the META algorithm which effectively hedges between two other algorithms: one which uses both historical observations and clustering, and another which uses only the historical observations. The former outperforms the latter when the clustering quality is good, and vice-versa. Extensive experiments on synthetic and real world datasets on Warfarin drug dosage and web server selection for latency minimization validate our theoretical insights and demonstrate that META is a robust strategy for optimally exploiting the pre-clustering information.
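
To make the notion of using historical observations concrete, the sketch below warm-starts a standard UCB1 index with offline samples for each arm. It is not the META algorithm from the abstract and ignores clustering entirely; the function name `ucb_with_history` and the `pull` callback are hypothetical names chosen for this example.

```python
import math

def ucb_with_history(history, pull, rounds):
    """Run UCB1 for `rounds` steps, seeding each arm with its historical rewards.

    history: list of lists, history[a] holds past rewards observed for arm a.
    pull:    callable taking an arm index and returning a reward (assumed in [0, 1]).
    """
    counts = [len(h) for h in history]
    sums = [float(sum(h)) for h in history]
    chosen = []
    for _ in range(rounds):
        # Arms with no data at all (neither historical nor online) are tried first.
        untried = [a for a, c in enumerate(counts) if c == 0]
        if untried:
            arm = untried[0]
        else:
            total = sum(counts)
            arm = max(
                range(len(counts)),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(total) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        chosen.append(arm)
    return chosen

# Toy example: arm 1 always pays 1.0, arm 0 always pays 0.0.
chosen = ucb_with_history(
    history=[[0.0] * 5, [1.0] * 5],
    pull=lambda arm: [0.0, 1.0][arm],
    rounds=20,
)
```

Because the historical samples already shrink the confidence intervals, the online phase spends fewer pulls exploring arms that the offline data shows to be poor.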
In real-world networks the interactions between network elements are inherently time-delayed. These time-delays can not only slow the network but can have a destabilizing effect on the network’s dynamics leading to poor performance. The same is true in computational networks used for machine learning etc. where time-delays increase the network’s memory but can degrade the network’s ability to be trained. However, not all networks can be destabilized by time-delays. Previously, it has been shown that if a network or high-dimensional dynamical system is intrinsically stable, which is a stronger form of the standard notion of global stability, then it maintains its stability when constant time-delays are introduced into the system. Here we show that intrinsically stable systems, including intrinsically stable networks and a broad class of switched systems, i.e. systems whose mapping is time-dependent, remain stable in the presence of any type of time-varying time-delays whether these delays are periodic, stochastic, or otherwise. We apply these results to a number of well-studied systems to demonstrate that the notion of intrinsic stability is both computationally inexpensive, relative to other methods, and can be used to improve on some of the best known stability results. We also show that the asymptotic state of an intrinsically stable switched system is exponentially independent of the system’s initial conditions.
Reducing the latency variance in machine learning inference is a key requirement in many applications. Variance is harder to control in a cloud deployment in the presence of stragglers. In spite of this challenge, inference is increasingly being done in the cloud, due to the advent of affordable machine learning as a service (MLaaS) platforms. Existing approaches to reduce variance rely on replication which is expensive and partially negates the affordability of MLaaS. In this work, we argue that MLaaS platforms also provide unique opportunities to cut the cost of redundancy. In MLaaS platforms, multiple inference requests are concurrently received by a load balancer which can then create a more cost-efficient redundancy coding across a larger collection of images. We propose a novel convolutional neural network model, Collage-CNN, to provide a low-cost redundancy framework. A Collage-CNN model takes a collage formed by combining multiple images and performs multi-image classification in one shot, albeit at slightly lower accuracy. We then augment a collection of traditional single image classifiers with a single Collage-CNN classifier which acts as a low-cost redundant backup. Collage-CNN then provides backup classification results if a single image classification straggles. Deploying the Collage-CNN models in the cloud, we demonstrate that the 99th percentile tail latency of inference can be reduced by 1.47X compared to replication based approaches while providing high accuracy. Also, variation in inference latency can be reduced by 9X with a slight increase in average inference latency.
The power of neural networks lies in their ability to generalize to unseen data, yet the underlying reasons for this phenomenon remain elusive. Numerous rigorous attempts have been made to explain generalization, but available bounds are still quite loose, and analysis does not always lead to true understanding. The goal of this work is to make generalization more intuitive. Using visualization methods, we discuss the mystery of generalization, the geometry of loss landscapes, and how the curse (or, rather, the blessing) of dimensionality causes optimizers to settle into minima that generalize well.
The proliferation of automated inference algorithms in Bayesian statistics has provided practitioners newfound access to fast, reproducible data analysis and powerful statistical models. Designing automated methods that are also both computationally scalable and theoretically sound, however, remains a significant challenge. Recent work on Bayesian coresets takes the approach of compressing the dataset before running a standard inference algorithm, providing both scalability and guarantees on posterior approximation error. But the automation of past coreset methods is limited because they depend on the availability of a reasonable coarse posterior approximation, which is difficult to specify in practice. In the present work we remove this requirement by formulating coreset construction as sparsity-constrained variational inference within an exponential family. This perspective leads to a novel construction via greedy optimization, and also provides a unifying information-geometric view of present and past methods. The proposed Riemannian coreset construction algorithm is fully automated, requiring no inputs aside from the dataset, probabilistic model, desired coreset size, and sample size used for Monte Carlo estimates. In addition to being easier to use than past methods, experiments demonstrate that the proposed algorithm achieves state-of-the-art Bayesian dataset summarization.
To understand causal relationships between events in the world, it is useful to pinpoint when actions occur in videos and to examine the state of the world at and around that time point. For example, one must accurately detect the start of an audience response — laughter in a movie, cheering at a sporting event — to understand the cause of the reaction. In this work, we focus on the problem of accurately detecting action starts rather than isolated events or action ends. We introduce a novel structured loss function based on matching predictions to true action starts that is tailored to this problem; it more heavily penalizes extra and missed action start detections over small misalignments. Recurrent neural networks are used to minimize a differentiable approximation of this loss. To evaluate these methods, we introduce the Mouse Reach Dataset, a large, annotated video dataset of mice performing a sequence of actions. The dataset was labeled by experts for the purpose of neuroscience research on causally relating neural activity to behavior. On this dataset, we demonstrate that the structured loss leads to significantly higher accuracy than a baseline of mean-squared error loss.
Recent advances in generative modeling of text have demonstrated remarkable improvements in terms of fluency and coherency. In this work we investigate to what extent a machine can discriminate real from machine-generated text. This is important in itself for automatic detection of computer-generated stories, but can also serve as a tool for further improving text generation. We show that learning a dedicated scoring function to discriminate between real and fake text achieves higher precision than employing the likelihood of a generative model. The scoring functions generalize to generators other than those used for training as long as these generators have comparable model complexity and are trained on similar datasets.
Medical image analysis using supervised deep learning methods remains problematic because of the reliance of deep learning methods on large amounts of labelled training data. Although medical imaging data repositories continue to expand there has not been a commensurate increase in the amount of annotated data. Hence, we propose a new unsupervised feature learning method that learns feature representations to then differentiate dissimilar medical images using an ensemble of different convolutional neural networks (CNNs) and K-means clustering. It jointly learns feature representations and clustering assignments in an end-to-end fashion. We tested our approach on a public medical dataset and show its accuracy was better than state-of-the-art unsupervised feature learning methods and comparable to state-of-the-art supervised CNNs. Our findings suggest that our method could be used to tackle the issue of the large volume of unlabelled data in medical imaging repositories.
In this paper, we introduce a novel semantic description approach inspired by Prototype Theory foundations. We propose a Computational Prototype Model (CPM) that encodes and stores the central semantic meaning of an object category: the semantic prototype. Also, we introduce a Prototype-based Description Model that encodes the semantic meaning of an object while describing its features using our CPM model. Our description method uses semantic prototypes computed by CNN classification models to create discriminative signatures that describe an object, highlighting its most distinctive features within the category. Our experiments show that: i) our CPM model (semantic prototype + distance metric) is able to describe the internal semantic structure of object categories; ii) our semantic distance metric can be understood as the object’s visual typicality score within a category; iii) our descriptor encoding is semantically interpretable and significantly outperforms other global image encodings in clustering and classification tasks.
The impact of designing for security of AI is critical for humanity in the AI era. With humans increasingly becoming dependent upon AI, there is a need for neural networks that work reliably in spite of adversarial attacks. The vision of safe and secure AI for popular use is achievable. To achieve safety of AI, this paper explores strategies and a novel deep learning architecture. To guard AI from adversaries, the paper explores a combination of 3 strategies: 1. Introduce randomness at inference time to hide the representation learning from adversaries. 2. Detect the presence of adversaries by analyzing the sequence of inferences. 3. Exploit visual similarity. To realize these strategies, this paper designs a novel architecture, Dynamic Neural Defense (DND). This defense has 3 deep learning architectural features: 1. By hiding the way a neural network learns from exploratory attacks using a random computation graph, DND evades attack. 2. By analyzing the input sequence to a cloud AI inference engine with an LSTM, DND detects attack sequences. 3. By inferring with visually similar inputs generated by a VAE, any AI defended by the DND approach does not succumb to hackers. Thus, a roadmap to develop reliable, safe and secure AI is presented.
With convenient access to observational data, learning individual causal effects from such data draws more attention in many influential research areas such as economics, healthcare, and education. For example, we aim to study how a medicine (treatment) would affect the health condition (outcome) of a certain patient. To validate causal inference from observational data, we need to control the influence of confounders: the variables which causally influence both the treatment and the outcome. Along this line, existing work for learning individual treatment effect overwhelmingly relies on the assumption that there are no hidden confounders. However, in real-world observational data, this assumption can be unrealistic and untenable. In fact, an important fact ignored by existing work is that observational data can come with network information that can be utilized to infer hidden confounders. For example, in an observational study of the individual treatment effect of a medicine, instead of randomized experiments, the medicine is assigned to individuals based on a series of factors. Some factors (e.g., socioeconomic status) are hard to measure directly and therefore become hidden confounders of observational datasets. Fortunately, the socioeconomic status of an individual can be reflected by whom she is connected to in social networks. With this fact in mind, we aim to exploit the network structure to recognize patterns of hidden confounders in the task of learning individual treatment effects from observational data. In this work, we propose a novel causal inference framework, the network deconfounder, which learns representations of confounders by unraveling patterns of hidden confounders from the network structure between instances of observational data. Empirically, we perform extensive experiments to validate the effectiveness of the network deconfounder on various datasets.
Given the overwhelming number of emails, an effective subject line becomes essential to better inform the recipient of the email’s content. In this paper, we propose and study the task of email subject line generation: automatically generating an email subject line from the email body. We create the first dataset for this task and find that email subject line generation favors extremely abstractive summaries, which differentiates it from news headline generation or news single document summarization. We then develop a novel deep learning method and compare it to several baselines as well as recent state-of-the-art text summarization systems. We also investigate the efficacy of several automatic metrics based on correlations with human judgments and propose a new automatic evaluation metric. Our system outperforms competitive baselines given both automatic and human evaluations. To our knowledge, this is the first work to tackle the problem of effective email subject line generation.
This paper studies the problem of learning a sequence of sentiment classification tasks. The learned knowledge from each task is retained and used to help future or subsequent task learning. This learning paradigm is called Lifelong Learning (LL). However, existing LL methods either only transfer knowledge forward to help future learning and do not go back to improve the model of a previous task or require the training data of the previous task to retrain its model to exploit backward/reverse knowledge transfer. This paper studies reverse knowledge transfer of LL in the context of naive Bayesian (NB) classification. It aims to improve the model of a previous task by leveraging future knowledge without retraining using its training data. This is done by exploiting a key characteristic of the generative model of NB. That is, it is possible to improve the NB classifier for a task by improving its model parameters directly by using the retained knowledge from other tasks. Experimental results show that the proposed method markedly outperforms existing LL baselines.
Deep neural networks have achieved great success in classification tasks in recent years. However, one major obstacle on the path towards artificial intelligence is the inability of neural networks to accurately detect novel class distributions; as a result, most of the classification algorithms proposed assume that all classes are known prior to the training stage. In this work, we propose a methodology for training a neural network that allows it to efficiently detect novel class distributions without compromising much of its classification accuracy on the test examples of known classes. Experimental results on the CIFAR 100 and MiniImagenet data sets demonstrate the effectiveness of the proposed algorithm. The way this method was constructed also makes it suitable for training any classification algorithm that is based on Maximum Likelihood methods.
Explainable machine learning (ML) has been implemented in numerous open source and proprietary software packages and explainable ML is an important aspect of commercial predictive modeling. However, explainable ML can be misused, particularly as a faulty safeguard for harmful black-boxes, e.g. fairwashing, and for other malevolent purposes like model stealing. This text discusses definitions, examples, and guidelines that promote a holistic and human-centered approach to ML which includes interpretable (i.e. white-box) models and explanatory, debugging, and disparate impact analysis techniques.
A key component of most neural network architectures is the use of normalization layers, such as Batch Normalization. Despite its common use and large utility in optimizing deep architectures that are otherwise intractable, it has been challenging both to generically improve upon Batch Normalization and to understand specific circumstances that lend themselves to other enhancements. In this paper, we identify four improvements to the generic form of Batch Normalization and the circumstances under which they work, yielding performance gains across all batch sizes while requiring no additional computation during training. These contributions include proposing a method for reasoning about the current example in inference normalization statistics which fixes a training vs. inference discrepancy; recognizing and validating the powerful regularization effect of Ghost Batch Normalization for small and medium batch sizes; examining the effect of weight decay regularization on the scaling and shifting parameters; and identifying a new normalization algorithm for very small batch sizes by combining the strengths of Batch and Group Normalization. We validate our results empirically on four datasets: CIFAR-100, SVHN, Caltech-256, and ImageNet.
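
For readers unfamiliar with Ghost Batch Normalization, the core idea is to split a batch into smaller "ghost" batches and normalize each one with its own statistics. The sketch below shows only that training-time normalization step in NumPy; the function name is hypothetical, and the learnable scale and shift parameters, running statistics for inference, and the paper's other three improvements are deliberately left out.

```python
import numpy as np

def ghost_batch_norm(x, ghost_size, eps=1e-5):
    """Normalize each consecutive group of `ghost_size` rows independently.

    x: array of shape (batch, features); batch must be a multiple of ghost_size.
    """
    n, d = x.shape
    if n % ghost_size != 0:
        raise ValueError("batch size must be a multiple of ghost_size")
    # Shape (num_ghost_batches, ghost_size, features).
    groups = x.reshape(n // ghost_size, ghost_size, d)
    mean = groups.mean(axis=1, keepdims=True)
    var = groups.var(axis=1, keepdims=True)
    normed = (groups - mean) / np.sqrt(var + eps)
    return normed.reshape(n, d)

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=2.0, size=(8, 3))
out = ghost_batch_norm(batch, ghost_size=4)
```

Each ghost batch sees noisier mean and variance estimates than the full batch would give, which is the source of the regularization effect mentioned in the abstract.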
As a core component of Natural Language Processing (NLP) systems, a Language Model (LM) can provide word representations and probability indications of word sequences. Neural Network Language Models (NNLMs) overcome the curse of dimensionality and improve the performance of traditional LMs. A survey on NNLMs is performed in this paper. The structure of classic NNLMs is described first, and then some major improvements are introduced and analyzed. We summarize and compare corpora and toolkits of NNLMs. Further, some research directions of NNLMs are discussed.

# If you did not already know

Two-Step Importance Weighting IL (2IWIL)
Imitation learning (IL) aims to learn an optimal policy from demonstrations. However, such demonstrations are often imperfect since collecting optimal ones is costly. To effectively learn from imperfect demonstrations, we propose a novel approach that utilizes confidence scores, which describe the quality of demonstrations. More specifically, we propose two confidence-based IL methods, namely two-step importance weighting IL (2IWIL) and generative adversarial IL with imperfect demonstration and confidence (IC-GAIL). We show that confidence scores given only to a small portion of sub-optimal demonstrations significantly improve the performance of IL both theoretically and empirically. …