# If you did not already know

SetExpander
We present SetExpander, a corpus-based system for expanding a seed set of terms into a more complete set of terms that belong to the same semantic class. SetExpander implements an iterative end-to-end workflow. It enables users to easily select a seed set of terms, expand it, view the expanded set, validate it, re-expand the validated set and store it, thus simplifying the extraction of domain-specific fine-grained semantic classes. SetExpander has been used successfully in real-life use cases including integration into an automated recruitment system and an issues and defects resolution system. A video demo of SetExpander is available at https://…open?id=1e545bB87Autsch36DjnJHmq3HWfSd1Rv (some images were blurred for privacy reasons) …

Information Value (IV)
In statistical data mining, we sometimes need to determine which variables out of a set are best at capturing a desired behavior. For example, let’s say you have a pool of customers for your credit card company, and you want to determine which of them are about to default (i.e. refuse to pay up after possibly making a huge expense). You then need to identify which of the attributes you hold on the customer can potentially identify such behavior and alert you to it. One of the popular ways in which analysts do this is by looking at something called ‘Information Value’. In the context of data mining, it is also sometimes referred to by the short form InfoVal.
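
As a rough illustration of how such a score can be computed, the sketch below uses the commonly quoted weight-of-evidence form, IV = Σ (%good − %bad) · ln(%good / %bad), summed over the bins of one attribute. The function name, the income bands and the counts are hypothetical, and skipping empty cells is a simplification chosen for this sketch rather than a rule from the text above.

```python
import math

def information_value(bins):
    """Compute Information Value from per-bin (n_good, n_bad) counts."""
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        # Share of all goods and of all bads that fall into this bin.
        pct_good = good / total_good
        pct_bad = bad / total_bad
        if pct_good == 0 or pct_bad == 0:
            continue  # weight of evidence is undefined here; skipped in this sketch
        weight_of_evidence = math.log(pct_good / pct_bad)
        iv += (pct_good - pct_bad) * weight_of_evidence
    return iv

# Hypothetical (non-defaulters, defaulters) counts for three income bands.
income_bins = [(100, 60), (300, 30), (600, 10)]
iv_income = information_value(income_bins)

# An attribute whose bins split goods and bads identically carries no information.
iv_uninformative = information_value([(500, 50), (500, 50)])
```

An attribute with a higher value separates the two groups more strongly, which is why analysts rank candidate attributes by it.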

Parallel Random Forest (PRF)
With the emergence of the big data age, the issue of how to obtain valuable knowledge from a dataset efficiently and accurately has attracted increasing attention from both academia and industry. This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized based on a hybrid approach combining data-parallel and task-parallel optimization. From the perspective of data-parallel optimization, a vertical data-partitioning method is performed to reduce the data communication cost effectively, and a data-multiplexing method is performed to allow the training dataset to be reused and diminish the volume of data. From the perspective of task-parallel optimization, a dual parallel approach is carried out in the training process of RF, and a task Directed Acyclic Graph (DAG) is created according to the parallel training process of PRF and the dependence of the Resilient Distributed Datasets (RDD) objects. Then, different task schedulers are invoked for the tasks in the DAG. Moreover, to improve the algorithm’s accuracy for large, high-dimensional, and noisy data, we perform a dimension-reduction approach in the training process and a weighted voting approach in the prediction process prior to parallelization. Extensive experimental results indicate the superiority and notable advantages of the PRF algorithm over the relevant algorithms implemented by Spark MLlib and other studies in terms of the classification accuracy, performance, and scalability. …

MILABOT
We present MILABOT: a deep reinforcement learning chatbot developed by the Montreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize competition. MILABOT is capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language generation and retrieval models, including neural network and template-based models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than other systems. The results highlight the potential of coupling ensemble systems with deep reinforcement learning as a fruitful path for developing real-world, open-domain conversational agents. …

# If you did not already know

Constraint-based causal Discovery from NOnstationary/heterogeneous Data (CD-NOD)
It is commonplace to encounter nonstationary or heterogeneous data. Such a distribution shift feature presents both challenges and opportunities for causal discovery, of which the underlying generating process changes over time or across domains. In this paper, we develop a principled framework for causal discovery from such data, called Constraint-based causal Discovery from NOnstationary/heterogeneous Data (CD-NOD), which addresses two important questions. First, we propose an enhanced constraint-based procedure to detect variables whose local mechanisms change and recover the skeleton of the causal structure over observed variables. Second, we present a way to determine causal orientations by making use of independent changes in the data distribution implied by the underlying causal model, benefiting from information carried by changing distributions. After learning the causal structure, next, we investigate how to efficiently estimate the ‘driving force’ of the nonstationarity of a causal mechanism. That is, we aim to extract from data a low-dimensional and interpretable representation of changes. The proposed methods are totally nonparametric, with no restrictions on data distributions and causal mechanisms, and do not rely on window segmentation. Furthermore, we find that nonstationarity benefits causal structure identification with particular types of confounders. Finally, we show the tight connection between nonstationarity/heterogeneity and soft intervention in causal discovery. Experimental results on various synthetic and real-world data sets (task-fMRI and stock data) are presented to demonstrate the efficacy of the proposed methods. …

Path-by-Path
The doctrinal paradox is analysed from a probabilistic point of view assuming a simple parametric model for the committee’s behaviour. The well known issue-by-issue and case-by-case majority rules are compared in this model, by means of the concepts of false positive rate (FPR), false negative rate (FNR) and Receiver Operating Characteristics (ROC) space. We introduce also a new rule that we call path-by-path, which is somehow halfway between the other two. Under our model assumptions, the issue-by-issue rule is shown to be the best of the three according to an optimality criterion based on ROC maps, for all values of the model parameters (committee size and competence of its members), when equal weight is given to FPR and FNR. For unequal weights, the relative goodness of the rules depends on the values of the competence and the weights, in a way which is precisely described. The results are illustrated with some numerical examples. …
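
To make the two classical rules concrete, here is a minimal sketch. It assumes the textbook setting in which the conclusion is the conjunction of two premises and each committee member votes on both. The function names and the three-member committee are illustrative. The paper's parametric model and its path-by-path rule are not reproduced here.

```python
def majority(votes):
    """True if strictly more than half of the boolean votes are True."""
    return sum(votes) > len(votes) / 2

def issue_by_issue(profiles):
    """Take a majority vote on each premise, then conjoin the collective premises."""
    n_issues = len(profiles[0])
    return all(majority([p[i] for p in profiles]) for i in range(n_issues))

def case_by_case(profiles):
    """Let each member reach a conclusion first, then take a majority of conclusions."""
    return majority([all(p) for p in profiles])

# Each tuple is one member's vote on (premise 1, premise 2).
committee = [(True, True), (True, False), (False, True)]

verdict_issue = issue_by_issue(committee)  # both premises carry a 2-to-1 majority
verdict_case = case_by_case(committee)     # only one member accepts both premises
```

On this committee the two rules disagree, which is exactly the situation the doctrinal paradox describes.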

Generalization Error Analysis
Domain generalization is the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distribution-free, kernel-based approach that predicts a classifier from the marginal distribution of features, by leveraging the trends present in related classification tasks. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on synthetic data and three real data applications demonstrate the superiority of the method with respect to a pooling strategy. …

Dragonfly
Bayesian Optimisation (BO) refers to a suite of techniques for global optimisation of expensive black box functions, which use introspective Bayesian models of the function to efficiently find the optimum. While BO has been applied successfully in many applications, modern optimisation tasks usher in new challenges where conventional methods fail spectacularly. In this work, we present Dragonfly, an open source Python library for scalable and robust BO. Dragonfly incorporates multiple recently developed methods that allow BO to be applied in challenging real world settings; these include better methods for handling higher dimensional domains, methods for handling multi-fidelity evaluations when cheap approximations of an expensive function are available, methods for optimising over structured combinatorial spaces, such as the space of neural network architectures, and methods for handling parallel evaluations. Additionally, we develop new methodological improvements in BO for selecting the Bayesian model, selecting the acquisition function, and optimising over complex domains with different variable types and additional constraints. We compare Dragonfly to a suite of other packages and algorithms for global optimisation and demonstrate that when the above methods are integrated, they enable significant improvements in the performance of BO. The Dragonfly library is available at dragonfly.github.io. …

# If you did not already know

Super Characters
We propose a method named Super Characters for sentiment classification. This method converts the sentiment classification problem into an image classification problem by projecting texts into images and then applying CNN models for classification. Text features are extracted automatically from the generated Super Characters images, hence there is no need for any explicit step of embedding the words or characters into numerical vector representations. Experimental results show that the Super Characters method consistently outperforms other methods for sentiment classification and topic classification tasks on ten large social media datasets of millions of contents in four different languages, including Chinese, Japanese, Korean and English. …

Factor-Augmented Markov Switching (FAMS)
This paper investigates the role of high-dimensional information sets in the context of Markov switching models with time varying transition probabilities. Markov switching models are commonly employed in empirical macroeconomic research and policy work. However, the information used to model the switching process is usually limited drastically to ensure stability of the model. Increasing the number of included variables to enlarge the information set might even result in decreasing precision of the model. Moreover, it is often not clear a priori which variables are actually relevant when it comes to informing the switching behavior. Building strongly on recent contributions in the field of dynamic factor analysis, we introduce a general type of Markov switching autoregressive models for non-linear time series analysis. Large numbers of time series are allowed to inform the switching process through a factor structure. This factor-augmented Markov switching (FAMS) model overcomes estimation issues that are likely to arise in previous assessments of the modeling framework. More accurate estimates of the switching behavior as well as improved model fit result. The performance of the FAMS model is illustrated in a simulated data example as well as in a US business cycle application. …

Greedy Randomized Adaptive Search Procedures (GRASP)
The greedy randomized adaptive search procedure (also known as GRASP) is a metaheuristic algorithm commonly applied to combinatorial optimization problems. GRASP typically consists of iterations made up from successive constructions of a greedy randomized solution and subsequent iterative improvements of it through a local search. The greedy randomized solutions are generated by adding elements to the problem’s solution set from a list of elements ranked by a greedy function according to the quality of the solution they will achieve. To obtain variability in the candidate set of greedy solutions, well-ranked candidate elements are often placed in a restricted candidate list (also known as RCL), and chosen at random when building up the solution. This kind of greedy randomized construction method is also known as a semi-greedy heuristic, first described in Hart and Shogan (1987). GRASP was first introduced in Feo and Resende (1989). Survey papers on GRASP include Feo and Resende (1995), Pitsoulis and Resende (2002), and Resende and Ribeiro (2003). An annotated bibliography of GRASP can be found in Festa and Resende (2002). …
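
The two phases described above can be sketched on a small 0/1 knapsack instance. Everything specific here is an assumption made for illustration: the value-to-weight ratio as the greedy function, the `alpha` parameter that controls how wide the RCL is, the add-and-swap neighbourhood used by the local search, and the item data. Weights are assumed to be positive.

```python
import random
from itertools import product

def construct(items, capacity, alpha, rng):
    """Greedy randomized construction: pick at random from the RCL until nothing fits."""
    chosen, remaining = set(), capacity
    while True:
        fits = [i for i in range(len(items))
                if i not in chosen and items[i][1] <= remaining]
        if not fits:
            return chosen
        ratios = {i: items[i][0] / items[i][1] for i in fits}
        best, worst = max(ratios.values()), min(ratios.values())
        # The RCL keeps candidates whose greedy score is close enough to the best.
        threshold = best - alpha * (best - worst)
        pick = rng.choice([i for i in fits if ratios[i] >= threshold])
        chosen.add(pick)
        remaining -= items[pick][1]

def local_search(items, capacity, chosen):
    """First-improvement search: add an item that fits, else swap in a more valuable one."""
    chosen = set(chosen)
    improved = True
    while improved:
        improved = False
        weight = sum(items[i][1] for i in chosen)
        outside = [i for i in range(len(items)) if i not in chosen]
        for inn in outside:
            if weight + items[inn][1] <= capacity:
                chosen.add(inn)
                improved = True
                break
        if improved:
            continue
        for out, inn in product(sorted(chosen), outside):
            if (items[inn][0] > items[out][0]
                    and weight - items[out][1] + items[inn][1] <= capacity):
                chosen.remove(out)
                chosen.add(inn)
                improved = True
                break
    return chosen

def grasp(items, capacity, iterations=50, alpha=0.3, seed=0):
    """Repeat construction plus local search and keep the best solution seen."""
    rng = random.Random(seed)
    best, best_value = set(), 0
    for _ in range(iterations):
        candidate = local_search(items, capacity, construct(items, capacity, alpha, rng))
        value = sum(items[i][0] for i in candidate)
        if value > best_value:
            best, best_value = candidate, value
    return best, best_value

# Hypothetical (value, weight) pairs.
items = [(60, 10), (100, 20), (120, 30), (30, 5), (70, 25)]
best_set, best_value = grasp(items, capacity=50)
```

Setting `alpha` to 0 makes the construction purely greedy and setting it to 1 makes it purely random. Like any metaheuristic, this sketch returns a good feasible solution, not a provably optimal one.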

Bio7
The application Bio7 is an integrated development environment for ecological modelling and contains powerful tools for model creation, scientific image analysis and statistical analysis. The application itself is based on an RCP-Eclipse-Environment (Rich-Client-Platform) which offers a huge flexibility in configuration and extensibility because of its plug-in structure and the possibility of customization. Features:
· Creation and analysis of simulation models.
· Statistical analysis.
· Advanced R Graphical User Interface with editor, spreadsheet, ImageJ plot device and debugging interface.
· Spatial statistics (possibility to send values from a specialized panel to R).
· Image Analysis (embedded ImageJ).
· Fast transfer of image data from ImageJ to R and vice versa.
· Fast communication between R and Java (with RServe) and the possibility to use R methods inside Java.
· Interpretation of Java and script creation (BeanShell, Groovy, Jython).
· Dynamic compilation of Java.
· Creation of methods for Java, BeanShell, Groovy, Jython and R (integrated editors for Java, R, BeanShell, Groovy, Jython).
· Sensitivity analysis with an embedded flowchart editor in which scripts, macros and compiled code can be dragged and executed.
· Creation of 3d OpenGL (Jogl) models.
· Visualizations and simulations on an embedded 3d globe (World Wind Java SDK).
· Creation of Graphical User Interfaces with the embedded JavaFX SceneBuilder. …

# If you did not already know

DataOps
DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics. While DataOps began as a set of best practices, it has now matured to become a new and independent approach to data analytics. DataOps applies to the entire data lifecycle from data preparation to reporting, and recognizes the interconnected nature of the data analytics team and information technology operations. From a process and methodology perspective, DataOps applies Agile software development, DevOps software development practices and the statistical process control used in lean manufacturing, to data analytics. In DataOps, development of new analytics is streamlined using Agile software development, an iterative project management methodology that replaces the traditional Waterfall sequential methodology. Studies show that software development projects complete significantly faster and with far fewer defects when Agile Development is used. The Agile methodology is particularly effective in environments where requirements are quickly evolving – a situation well known to data analytics professionals. DevOps focuses on continuous delivery by leveraging on-demand IT resources and by automating test and deployment of analytics. This merging of software development and IT operations has improved velocity, quality, predictability and scale of software engineering and deployment. Borrowing methods from DevOps, DataOps seeks to bring these same improvements to data analytics. Like lean manufacturing, DataOps utilizes statistical process control (SPC) to monitor and control the data analytics pipeline. With SPC in place, the data flowing through an operational system is constantly monitored and verified to be working. If an anomaly occurs, the data analytics team can be notified through an automated alert. DataOps is not tied to a particular technology, architecture, tool, language or framework. 
Tools that support DataOps promote collaboration, orchestration, agility, quality, security, access and ease of use. …
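
The SPC idea mentioned above can be sketched with simple control limits. The example assumes a Shewhart-style rule (mean plus or minus three sample standard deviations, estimated from a baseline period) applied to a hypothetical pipeline metric, the daily row count. The function names and the numbers are illustrative and not taken from any particular DataOps tool.

```python
from statistics import mean, stdev

def control_limits(baseline, k=3.0):
    """Return (lower, upper) limits at k sample standard deviations around the mean."""
    centre, spread = mean(baseline), stdev(baseline)
    return centre - k * spread, centre + k * spread

def out_of_control(values, limits):
    """Return (index, value) pairs that fall outside the control limits."""
    lower, upper = limits
    return [(i, v) for i, v in enumerate(values) if not lower <= v <= upper]

# Hypothetical daily row counts from a period when the pipeline was known to be healthy.
baseline_rows = [1000, 1012, 995, 1003, 990, 1008, 997, 1005]
limits = control_limits(baseline_rows)

# New observations to monitor; the third one simulates a partial load.
new_rows = [1001, 998, 640, 1006]
alerts = out_of_control(new_rows, limits)
```

In a real pipeline the contents of `alerts` would feed whatever notification channel the team uses. Only the detection step is shown here.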

CoSegNet
We introduce CoSegNet, a deep neural network architecture for co-segmentation of a set of 3D shapes represented as point clouds. CoSegNet takes as input a set of unsegmented shapes, proposes per-shape parts, and then jointly optimizes the part labelings across the set subject to a novel group consistency loss expressed via matrix rank estimates. The proposals are refined in each iteration by an auxiliary network that acts as a weak regularizing prior, pre-trained to denoise noisy, unlabeled parts from a large collection of segmented 3D shapes, where the part compositions within the same object category can be highly inconsistent. The output is a consistent part labeling for the input set, with each shape segmented into up to K (a user-specified hyperparameter) parts. The overall pipeline is thus weakly supervised, producing consistent segmentations tailored to the test set, without consistent ground-truth segmentations. We show qualitative and quantitative results from CoSegNet and evaluate it via ablation studies and comparisons to state-of-the-art co-segmentation methods. …

Stochastic Computation Graph (SCG)
Stochastic computation graphs are directed acyclic graphs that encode the dependency structure of computation to be performed. The graphical notation generalizes directed graphical models. …

Smooth Density Spatial Quantile Regression
We derive the properties and demonstrate the desirability of a model-based method for estimating the spatially-varying effects of covariates on the quantile function. By modeling the quantile function as a combination of I-spline basis functions and Pareto tail distributions, we allow for flexible parametric modeling of the extremes while preserving non-parametric flexibility in the center of the distribution. We further establish that the model guarantees the desired degree of differentiability in the density function and enables the estimation of non-stationary covariance functions dependent on the predictors. We demonstrate through a simulation study that the proposed method produces more efficient estimates of the effects of predictors than other methods, particularly in distributions with heavy tails. To illustrate the utility of the model we apply it to measurements of benzene collected around an oil refinery to determine the effect of an emission source within the refinery on the distribution of the fence line measurements. …

# If you did not already know

Correntropy
Correntropy is a nonlinear similarity measure between two random variables.
Learning with the Maximum Correntropy Criterion Induced Losses for Regression
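
For intuition, a sample estimate of correntropy can be written as the average of a kernel applied to the paired differences. The sketch below assumes a Gaussian kernel with bandwidth `sigma` and drops the kernel's normalising constant, so that identical samples score exactly 1. Both choices are assumptions of this sketch, not details stated above.

```python
import math

def correntropy(xs, ys, sigma=1.0):
    """Average Gaussian-kernel similarity between paired samples (unnormalised kernel)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    return sum(
        math.exp(-((x - y) ** 2) / (2 * sigma ** 2)) for x, y in zip(xs, ys)
    ) / len(xs)

same = correntropy([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])

# One gross outlier lowers the score by at most 1/n, however large the error is.
with_outlier = correntropy([1.0, 2.0, 3.0], [1.0, 2.0, 103.0])
```

Because each pair contributes a bounded amount, a single large error cannot dominate the estimate the way it dominates a mean squared error, which is part of the appeal of correntropy-based losses in regression.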

Patient Event Graph (PatientEG)
Medical activities, such as diagnoses, medicine treatments, and laboratory tests, as well as temporal relations between these activities are the basic concepts in clinical research. However, the existing relational data model on electronic medical records (EMRs) lacks explicit and accurate semantic definitions of these concepts. This makes query construction inconvenient and query execution inefficient, since multi-table join queries are frequently required. In this paper, we propose a patient event graph (PatientEG) model to capture the characteristics of EMRs. We respectively define five types of medical entities, five types of medical events and five types of temporal relations. Based on the proposed model, we also construct a PatientEG dataset with 191,294 events, 3,429 distinct entities, and 545,993 temporal relations using EMRs from Shanghai Shuguang hospital. To help normalize entity values which contain synonyms, hyponyms, and abbreviations, we link them with the Chinese biomedical knowledge graph. With the help of the PatientEG dataset, we are able to conveniently perform complex queries for clinical research such as auxiliary diagnosis and therapeutic effectiveness analysis. In addition, we provide a SPARQL endpoint to access the PatientEG dataset and the dataset is also publicly available online. Also, we list several illustrative SPARQL queries on our website. …

LogitBoost Autoregressive Networks
Multivariate binary distributions can be decomposed into products of univariate conditional distributions. Recently popular approaches have modeled these conditionals through neural networks with sophisticated weight-sharing structures. It is shown that state-of-the-art performance on several standard benchmark datasets can actually be achieved by training separate probability estimators for each dimension. In that case, model training can be trivially parallelized over data dimensions. On the other hand, complexity control has to be performed for each learned conditional distribution. Three possible methods are considered and experimentally compared. The estimator that is employed for each conditional is LogitBoost. Similarities and differences between the proposed approach and autoregressive models based on neural networks are discussed in detail. …

Discretification
‘Discretification’ is the mechanism of making continuous data discrete. If you really grasp the concept, you may be thinking ‘Wait a minute, the type of data we are collecting is discrete in and of itself! Data can EITHER be discrete OR continuous, it can’t be both!’ You would be correct. But what if we manually selected values along that continuous measurement, and declared them to be in a specific category? For instance, if we declare 72.0 degrees and greater to be ‘Hot’, 35.0-71.9 degrees to be ‘Moderate’, and anything lower than 35.0 degrees to be ‘Cold’, we have ‘discretified’ temperature! Our readings that were once continuous now fit into distinct categories. So, where do we draw the boundaries for these categories? What makes 34.9 degrees ‘Cold’ and 35.0 degrees ‘Moderate’? It is at this juncture that the TRUE decision is being made. The beauty of approaching the challenge in this manner is that it is data-centric, not concept-centric. Let’s walk through our marketing example first without using discretification, then with it. …
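
The temperature rule stated above translates directly into code. This is a minimal sketch: the function name and the sample readings are made up, while the cut points of 72.0 and 35.0 degrees come from the text.

```python
def discretify(temperature):
    """Map a continuous temperature reading to one of three categories."""
    if temperature >= 72.0:
        return "Hot"
    if temperature >= 35.0:
        return "Moderate"
    return "Cold"

# Hypothetical readings, chosen to sit on either side of each boundary.
readings = [71.9, 72.0, 35.0, 34.9, 88.4]
labels = [discretify(t) for t in readings]
```

Moving either cut point changes which readings land in which category, and that is the decision the paragraph above draws attention to.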

# If you did not already know

Data Oriented Design
In computing, data-oriented design is a program optimization approach motivated by efficient usage of the CPU cache, used in video game development. The approach is to focus on the data layout, separating and sorting fields according to when they are needed, and to think about transformations of data. Proponents include Mike Acton and Scott Meyers. …

Neural Tangent Kernel (NTK)
At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function $f_\theta$ (which maps input vectors to output vectors) follows the kernel gradient of the functional cost (which is convex, in contrast to the parameter cost) w.r.t. a new kernel: the Neural Tangent Kernel (NTK). This kernel is central to describe the generalization features of ANNs. While the NTK is random at initialization and varies during training, in the infinite-width limit it converges to an explicit limiting kernel and it stays constant during training. This makes it possible to study the training of ANNs in function space instead of parameter space. Convergence of the training can then be related to the positive-definiteness of the limiting NTK. We prove the positive-definiteness of the limiting NTK when the data is supported on the sphere and the non-linearity is non-polynomial. We then focus on the setting of least-squares regression and show that in the infinite-width limit, the network function $f_\theta$ follows a linear differential equation during training. The convergence is fastest along the largest kernel principal components of the input data with respect to the NTK, hence suggesting a theoretical motivation for early stopping. Finally we study the NTK numerically, observe its behavior for wide networks, and compare it to the infinite-width limit.
On Exact Computation with an Infinitely Wide Neural Net

Deep Distance Metric Learning (DDML)
Deep distance metric learning (DDML) is proposed to learn image similarity metrics in an end-to-end manner based on convolutional neural networks. …

Bayesian Causal Inference (BCI)
We address the problem of two-variable causal inference. This task is to infer an existing causal relation between two random variables, i.e. $X \rightarrow Y$ or $Y \rightarrow X$, from purely observational data. We briefly review a number of state-of-the-art methods for this, including very recent ones. A novel inference method is introduced, Bayesian Causal Inference (BCI), which assumes a generative Bayesian hierarchical model to pursue the strategy of Bayesian model selection. In the model the distribution of the cause variable is given by a Poisson lognormal distribution, which allows discretization effects to be taken into account explicitly. We assume Fourier diagonal field covariance operators. The generative model assumed provides synthetic causal data for benchmarking our model in comparison to existing state-of-the-art models, namely LiNGAM, ANM-HSIC, ANM-MML, IGCI and CGNN. We explore how well the above methods perform in case of high noise settings, strongly discretized data and very sparse data. BCI performs generally reliably with synthetic data as well as with the real world TCEP benchmark set, with an accuracy comparable to state-of-the-art algorithms. …

# If you did not already know

Elastic Gossip
Distributing Neural Network training is of particular interest for several reasons including scaling using computing clusters, training at data sources such as IOT devices and edge servers, utilizing underutilized resources across heterogeneous environments, and so on. Most contemporary approaches primarily address scaling using computing clusters and require high network bandwidth and frequent communication. This thesis presents an overview of standard approaches to distribute training and proposes a novel technique involving pairwise-communication using Gossip-like protocols, called Elastic Gossip. This approach builds upon an existing technique known as Elastic Averaging SGD (EASGD), and is similar to another technique called Gossiping SGD which also uses Gossip-like protocols. Elastic Gossip is empirically evaluated against Gossiping SGD using the MNIST digit recognition and CIFAR-10 classification tasks, using commonly used Neural Network architectures spanning Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs). It is found that Elastic Gossip, Gossiping SGD, and All-reduce SGD perform quite comparably, even though the latter entails a substantially higher communication cost. While Elastic Gossip performs better than Gossiping SGD in these experiments, it is possible that a more thorough search over hyper-parameter space, specific to a given application, may yield configurations of Gossiping SGD that work better than Elastic Gossip. …

Cross-Entropy Guided Policy (CGP)
Off-policy reinforcement learning (RL) is an important class of methods for many problem domains, such as robotics, where the cost of collecting data is high and on-policy methods are consequently intractable. Standard methods for applying Q-learning to continuous-valued action domains involve iteratively sampling the Q-function to find a good action (e.g. via hill-climbing), or learning a policy network at the same time as the Q-function (e.g. DDPG). Both approaches make tradeoffs between stability, speed, and accuracy. We propose a novel approach, called Cross-Entropy Guided Policies, or CGP, that draws inspiration from both classes of techniques. CGP aims to combine the stability and performance of iterative sampling policies with the low computational cost of a policy network. Our approach trains the Q-function using iterative sampling with the Cross-Entropy Method (CEM), while training a policy network to imitate CEM’s sampling behavior. We demonstrate that our method is more stable to train than state-of-the-art policy network methods, while preserving equivalent inference time compute costs, and achieving competitive total reward on standard benchmarks. …

Switchable Temporal Propagation Network
Videos contain highly redundant information between frames. Such redundancy has been extensively studied in video compression and encoding, but is less explored for more advanced video processing. In this paper, we propose a learnable unified framework for propagating a variety of visual properties of video images, including but not limited to color, high dynamic range (HDR), and segmentation information, where the properties are available for only a few key-frames. Our approach is based on a temporal propagation network (TPN), which models the transition-related affinity between a pair of frames in a purely data-driven manner. We theoretically prove two essential factors for TPN: (a) by regularizing the global transformation matrix as orthogonal, the ‘style energy’ of the property can be well preserved during propagation; (b) such regularization can be achieved by the proposed switchable TPN with bi-directional training on pairs of frames. We apply the switchable TPN to three tasks: colorizing a gray-scale video based on a few color key-frames, generating an HDR video from a low dynamic range (LDR) video and a few HDR frames, and propagating a segmentation mask from the first frame in videos. Experimental results show that our approach is significantly more accurate and efficient than the state-of-the-art methods. …

OHIE
Blockchain protocols, originating from Bitcoin, have established a new model of trust through decentralization. However, the low transaction throughput of the first generation of blockchain consensus protocols has been a serious concern. Many new protocols have been proposed recently that scale the throughput of the blockchain with available bandwidth. However, these scalable consensus protocols are becoming increasingly complex, making it more and more difficult to verify their end safety and liveness guarantees. This encumbers adoption since blockchain protocols are difficult to upgrade, once deployed. We propose a new consensus protocol for permissionless blockchains, called OHIE, with an explicit goal of aiming for simplicity. OHIE composes as many parallel instances of Bitcoin’s original (and simple) backbone protocol as needed to achieve near-optimal throughput (i.e., utilizing within a constant factor of the available bandwidth). OHIE tolerates a Byzantine adversary with fraction f < 1/2 of the computation power. We formally prove safety and liveness properties of OHIE. Our proof invokes previously established properties of Bitcoin’s backbone protocol as a black-box, given the modular design of OHIE. In our experimental evaluation with up to 50,000 nodes, OHIE achieves near-optimal throughput, and provides better decentralization of at least about 20x over prior works. …

# If you did not already know

Satisficing
Satisficing is a decision-making strategy or cognitive heuristic that entails searching through the available alternatives until an acceptability threshold is met. The term satisficing, a portmanteau of satisfy and suffice, was introduced by Herbert A. Simon in 1956, although the concept was first posited in his 1947 book Administrative Behavior. Simon used satisficing to explain the behavior of decision makers under circumstances in which an optimal solution cannot be determined. He maintained that many natural problems are characterized by computational intractability or a lack of information, both of which preclude the use of mathematical optimization procedures. He observed in his Nobel Prize in Economics speech that ‘decision makers can satisfice either by finding optimum solutions for a simplified world, or by finding satisfactory solutions for a more realistic world. Neither approach, in general, dominates the other, and both have continued to co-exist in the world of management science’. …
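
The contrast with optimization can be shown in a few lines. This is a minimal sketch under assumed names: `satisfice` stops at the first alternative whose score meets the threshold, while `optimise` has to score every alternative. The offers and the threshold are hypothetical.

```python
def satisfice(alternatives, score, threshold):
    """Return the first alternative whose score meets the threshold, or None."""
    for option in alternatives:
        if score(option) >= threshold:
            return option  # stop searching as soon as something is good enough
    return None

def optimise(alternatives, score):
    """Return the best alternative, which requires scoring every one of them."""
    return max(alternatives, key=score)

# Hypothetical (name, rating) offers, examined in the order they arrive.
offers = [("A", 6.1), ("B", 7.4), ("C", 9.0), ("D", 8.2)]

def rating(offer):
    return offer[1]

good_enough = satisfice(offers, rating, threshold=7.0)
best = optimise(offers, rating)
```

The satisficer never looks at offers C and D here, which is the point when evaluating each alternative is costly or the full set of alternatives is not known in advance.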

Parameter Hub (PHub)
Most work in the deep learning systems community has focused on faster inference, but arriving at a trained model requires lengthy experiments. Accelerating training lets developers iterate faster and come up with better models. DNN training is often seen as a compute-bound problem, best done in a single large compute node with many GPUs. As DNNs get bigger, training requires going distributed. Distributed deep neural network (DDNN) training constitutes an important workload on the cloud. Larger DNN models and faster compute engines shift the training performance bottleneck from computation to communication. Our experiments show existing DNN training frameworks do not scale in a typical cloud environment due to insufficient bandwidth and inefficient parameter server software stacks. We propose PHub, a high performance parameter server (PS) software design that provides an optimized network stack and a streamlined gradient processing pipeline to benefit common PS setups, and PBox, a balanced, scalable central PS hardware that fully utilizes PHub capabilities. We show that in a typical cloud environment, PHub can achieve up to 3.8x speedup over state-of-the-art designs when training ImageNet. We discuss future directions of integrating PHub with programmable switches for in-network aggregation during training, leveraging the datacenter network topology to reduce bandwidth usage and localize data movement. …

Convolutional Spiking Neural Network
Spiking neural networks are motivated by principles of neural systems and may possess unexplored advantages in the context of machine learning. A class of convolutional spiking neural networks is introduced, trained to detect image features with an unsupervised, competitive learning mechanism. Image features can be shared within subpopulations of neurons, or each may evolve independently to capture different features in different regions of input space. We analyze the time and memory requirements of learning with and operating such networks. The MNIST dataset is used as an experimental testbed, and comparisons are made with the performance and convergence speed of a baseline spiking neural network. …

ContextNet
Modern deep learning architectures produce highly accurate results on many challenging semantic segmentation datasets. State-of-the-art methods are, however, not directly transferable to real-time applications or embedded devices, since naive adaptation of such systems to reduce computational cost (speed, memory and energy) causes a significant drop in accuracy. We propose ContextNet, a new deep neural network architecture which builds on factorized convolution, network compression and pyramid representations to produce competitive semantic segmentation in real-time with low memory requirements. ContextNet combines a deep branch at low resolution that captures global context information efficiently with a shallow branch that focuses on high-resolution segmentation details. We analyze our network in a thorough ablation study and present results on the Cityscapes dataset, achieving 66.1% accuracy at 18.2 frames per second at full (1024×2048) resolution. …

# If you did not already know

Cortex Neural Network (CrtxNN)
Neural networks have been successfully applied to many real-world problems, such as image recognition and machine translation. However, with the current architecture of neural networks, it is hard to perform complex cognitive tasks, for example, processing image and audio inputs together. The cortex, as an important architecture in the brain, is important for animals to perform complex cognitive tasks. We view the architecture of the cortex in the brain as a missing part in the design of current artificial neural networks. In this paper, we propose the Cortex Neural Network (CrtxNN). The Cortex Neural Network is an upper architecture of neural networks which is motivated by the cerebral cortex in the brain and handles different tasks in the same learning system. It is able to identify different tasks and solve them with different methods. In our implementation, the Cortex Neural Network is able to process different cognitive tasks and perform reflection to get a higher accuracy. We provide a series of experiments to examine the capability of the cortex architecture on traditional neural networks. Our experiments show that the Cortex Neural Network can reach an accuracy of 98.32% on MNIST and 62% on CIFAR10 at the same time, which can promisingly reduce the loss by 40%. …

Neighbor-Encoder
Since its introduction, unsupervised representation learning has attracted a lot of attention from the research community, as it has been demonstrated to be highly effective and easy to apply in tasks such as dimension reduction, clustering, visualization, information retrieval, and semi-supervised learning. In this work, we propose a novel unsupervised representation learning framework called neighbor-encoder, in which domain knowledge can be easily incorporated into the learning process without modifying the general encoder-decoder architecture of the classic autoencoder. In contrast to the autoencoder, which reconstructs the input data itself, the neighbor-encoder reconstructs the input data’s neighbors. As the proposed representation learning problem is essentially a neighbor reconstruction problem, domain knowledge can be easily incorporated in the form of an appropriate definition of similarity between objects. Based on that observation, our framework can leverage any off-the-shelf similarity search algorithms or side information to find the neighbor of an input object. Applications of other algorithms (e.g., association rule mining) in our framework are also possible, given that the appropriate definition of neighbor can vary in different contexts. We have demonstrated the effectiveness of our framework in many diverse domains, including images, text, and time series, and for various data mining tasks including classification, clustering, and visualization. Experimental results show that neighbor-encoder not only outperforms autoencoder in most of the scenarios we consider, but also achieves the state-of-the-art performance on text document clustering. …
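
The difference between the two training objectives can be illustrated by how the (input, target) pairs are built. The sketch below is not the authors' implementation: it assumes a brute-force nearest-neighbor search under Euclidean distance, and the helper names (`autoencoder_pairs`, `neighbor_encoder_pairs`, `nearest_neighbor_index`) are hypothetical. The network that would be trained on these pairs is omitted.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Plain Euclidean distance; any domain-specific similarity could replace it."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor_index(data: List[Vector], i: int,
                           distance: Callable[[Vector, Vector], float]) -> int:
    """Index of the object closest to data[i], excluding data[i] itself."""
    candidates = (j for j in range(len(data)) if j != i)
    return min(candidates, key=lambda j: distance(data[i], data[j]))


def autoencoder_pairs(data: List[Vector]) -> List[Tuple[Vector, Vector]]:
    """An autoencoder is trained to reconstruct each input from itself."""
    return [(x, x) for x in data]


def neighbor_encoder_pairs(
        data: List[Vector],
        distance: Callable[[Vector, Vector], float] = euclidean,
) -> List[Tuple[Vector, Vector]]:
    """A neighbor-encoder is trained to reconstruct each input's neighbor.

    Requires at least two objects so that every object has a neighbor.
    """
    return [(data[i], data[nearest_neighbor_index(data, i, distance)])
            for i in range(len(data))]


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
pairs = neighbor_encoder_pairs(points)
```

Swapping `euclidean` for another distance function is where domain knowledge would enter in this sketch.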

Synaptic Pruning
Convolutional Neural Networks (CNNs) are both computation and memory intensive, which hinders their deployment on mobile devices. Inspired by the relevant concept in the neuroscience literature, we propose Synaptic Pruning: a data-driven method to prune connections between input and output feature maps with a newly proposed class of parameters called Synaptic Strength. Synaptic Strength is designed to capture the importance of a connection based on the amount of information it transports. Experimental results show the effectiveness of our approach. On CIFAR-10, we prune up to 96% of the connections for various CNN models, which results in significant size reduction and computation saving. Further evaluation on ImageNet demonstrates that synaptic pruning is able to discover efficient models which are competitive with state-of-the-art compact CNNs such as MobileNet-V2 and NasNet-Mobile. Our contributions are summarized as follows: (1) We introduce Synaptic Strength, a new class of parameters for CNNs to indicate the importance of each connection. (2) Our approach can prune various CNNs with high compression without compromising accuracy. (3) Further investigation shows that the proposed Synaptic Strength is a better indicator for kernel pruning than the previous approach in both empirical results and theoretical analysis. …

Sequential Embedding induced Dirichlet Process Mixture Model (SiDPMM)
Current state-of-the-art nonparametric Bayesian text clustering methods model documents through a multinomial distribution on bags of words. Although these methods can effectively utilize the word burstiness representation of documents and achieve decent performance, they do not explore the sequential information of text and relationships among synonyms. In this paper, the documents are modeled as the joint of bags of words, sequential features and word embeddings. We propose the Sequential Embedding induced Dirichlet Process Mixture Model (SiDPMM) to effectively exploit this joint document representation in text clustering. The sequential features are extracted by the encoder-decoder component. Word embeddings produced by the continuous-bag-of-words (CBOW) model are introduced to handle synonyms. Experimental results demonstrate the benefits of our model in two major aspects: 1) improved performance across multiple diverse text datasets in terms of the normalized mutual information (NMI); 2) more accurate inference of ground truth cluster numbers with regularization effect on tiny outlier clusters. …

# If you did not already know

ACAMP
The matrix profile has recently been proposed as a promising technique for the problem of all-pairs-similarity search on time series. Efficient algorithms have been proposed for computing it, e.g., STAMP, STOMP and SCRIMP++. All these algorithms use the z-normalized Euclidean distance to measure the distance between subsequences. However, as we observed, for some datasets other Euclidean measurements are more useful for knowledge discovery from time series. In this paper, we propose efficient algorithms for computing the matrix profile for a general class of Euclidean distances. We first propose a simple but efficient algorithm called AAMP for computing the matrix profile with the ‘pure’ (non-normalized) Euclidean distance. Then, we extend our algorithm to the p-norm distance. We also propose an algorithm, called ACAMP, that uses the same principle as AAMP, but for the case of z-normalized Euclidean distance. We implemented our algorithms, and evaluated their performance through experimentation. The experiments show excellent performance results. For example, they show that AAMP is very efficient for computing the matrix profile for non-normalized Euclidean distances. The results also show that the ACAMP algorithm is significantly faster than SCRIMP++ (the state-of-the-art matrix profile algorithm) for the case of z-normalized Euclidean distance. …
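
To make the object being computed concrete, here is a brute-force reference sketch of a matrix profile under the non-normalized Euclidean distance. It is not AAMP or ACAMP, whose incremental update rules are not given in the abstract; it simply compares every pair of subsequences. The function name `matrix_profile_bruteforce` and the default exclusion zone of `m // 2` around each subsequence are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def matrix_profile_bruteforce(
        series: Sequence[float],
        m: int,
        exclusion: Optional[int] = None,
) -> Tuple[List[float], List[int]]:
    """Naive matrix profile with non-normalized Euclidean distance.

    For every length-m subsequence, return the distance to its nearest
    non-trivial match and the start index of that match. Subsequences whose
    start indices differ by at most `exclusion` are skipped as trivial
    matches (assumed default: m // 2).
    """
    n = len(series) - m + 1  # number of subsequences
    if m < 1 or n < 1:
        raise ValueError("m must be between 1 and len(series)")
    if exclusion is None:
        exclusion = m // 2

    profile = [math.inf] * n
    indices = [-1] * n
    for i in range(n):
        for j in range(n):
            if abs(i - j) <= exclusion:
                continue  # skip the subsequence itself and near-overlaps
            dist = math.sqrt(sum((series[i + k] - series[j + k]) ** 2
                                 for k in range(m)))
            if dist < profile[i]:
                profile[i] = dist
                indices[i] = j
    return profile, indices


# Hypothetical toy series: the pattern [0, 1, 0] appears at positions 0 and 4.
series = [0, 1, 0, 5, 0, 1, 0]
profile, indices = matrix_profile_bruteforce(series, m=3)
```

Low values in `profile` mark repeated patterns (motifs) and high values mark unusual subsequences. The quadratic number of distance computations in this baseline is the cost that faster algorithms aim to reduce.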

Bidirectional Recurrent Imputation for Time Series (BRITS)
Time series are widely used as signals in many classification/regression tasks. It is ubiquitous that time series contain many missing values. Given multiple correlated time series, how can we fill in missing values and predict their class labels? Existing imputation methods often impose strong assumptions on the underlying data generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing value imputation in time series data. Our proposed method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of the RNN graph and can be effectively updated during backpropagation. BRITS has three advantages: (a) it can handle multiple correlated missing values in time series; (b) it generalizes to time series with underlying nonlinear dynamics; (c) it provides a data-driven imputation procedure and applies to general settings with missing data. We evaluate our model on three real-world datasets, including an air quality dataset, a health-care dataset, and a localization dataset for human activity. Experiments show that our model outperforms the state-of-the-art methods in both imputation and classification/regression accuracies. …

Generative Predecessor Models for Imitation Learning (GPRIL)
We propose Generative Predecessor Models for Imitation Learning (GPRIL), a novel imitation learning algorithm that matches the state-action distribution to the distribution observed in expert demonstrations, using generative models to reason probabilistically about alternative histories of demonstrated states. We show that this approach allows an agent to learn robust policies using only a small number of expert demonstrations and self-supervised interactions with the environment. We derive this approach from first principles and compare it empirically to a state-of-the-art imitation learning method, showing that it outperforms or matches its performance on two simulated robot manipulation tasks and demonstrate significantly higher sample efficiency by applying the algorithm on a real robot. …

Ising Model
The Ising model, named after the physicist Ernst Ising, is a mathematical model of ferromagnetism in statistical mechanics. The model consists of discrete variables that represent magnetic dipole moments of atomic spins that can be in one of two states (+1 or -1). The spins are arranged in a graph, usually a lattice, allowing each spin to interact with its neighbors. The model allows the identification of phase transitions, as a simplified model of reality. The two-dimensional square-lattice Ising model is one of the simplest statistical models to show a phase transition. The Ising model was invented by the physicist Wilhelm Lenz (1920), who gave it as a problem to his student Ernst Ising. The one-dimensional Ising model has no phase transition and was solved by Ising (1925) himself in his 1924 thesis. The two-dimensional square lattice Ising model is much harder, and was given an analytic description much later, by Lars Onsager (1944). It is usually solved by a transfer-matrix method, although there exist different approaches, more related to quantum field theory. In dimensions greater than four, the phase transition of the Ising model is described by mean field theory.
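
As a concrete companion to the description above, the following sketch computes the energy of a two-dimensional square-lattice configuration and performs single-spin-flip updates. Several choices here are assumptions not stated in the text: the energy is taken as E = -J · Σ s_i s_j over nearest-neighbor pairs with no external field, the lattice has periodic boundaries, and the update rule is the standard Metropolis algorithm (a simulation technique, not the transfer-matrix solution mentioned above). The function names are hypothetical.

```python
import math
import random
from typing import List

Lattice = List[List[int]]  # square grid of spins, each +1 or -1


def energy(spins: Lattice, J: float = 1.0) -> float:
    """Total energy E = -J * sum of s_i * s_j over nearest-neighbor pairs.

    Periodic boundaries are assumed; only the right and down neighbors are
    visited so that each bond is counted once (lattice side of at least 3).
    """
    n = len(spins)
    total = 0.0
    for i in range(n):
        for j in range(n):
            right = spins[i][(j + 1) % n]
            down = spins[(i + 1) % n][j]
            total -= J * spins[i][j] * (right + down)
    return total


def metropolis_sweep(spins: Lattice, temperature: float,
                     rng: random.Random, J: float = 1.0) -> None:
    """Attempt n*n random single-spin flips in place (Metropolis rule)."""
    n = len(spins)
    for _ in range(n * n):
        i, j = rng.randrange(n), rng.randrange(n)
        neighbors = (spins[(i - 1) % n][j] + spins[(i + 1) % n][j]
                     + spins[i][(j - 1) % n] + spins[i][(j + 1) % n])
        delta = 2.0 * J * spins[i][j] * neighbors  # energy change if flipped
        # Always accept moves that lower the energy; otherwise accept with
        # probability exp(-delta / T), in units where Boltzmann's constant is 1.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            spins[i][j] = -spins[i][j]


# Hypothetical usage: a 4x4 lattice with all spins up, then one sweep at T = 2.5.
lattice = [[1] * 4 for _ in range(4)]
ground_state_energy = energy(lattice)
metropolis_sweep(lattice, temperature=2.5, rng=random.Random(0))
```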
Interpreting the Ising Model: The Input Matters