PyText: A Seamless Path from NLP research to production

We introduce PyText, a deep-learning-based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapid experimentation and of serving models at scale. It achieves this by providing simple and extensible interfaces for model components, and by using PyTorch’s capabilities of exporting models for inference via the optimized Caffe2 execution engine. We report our own experience of migrating experimentation and production workflows to PyText, which enabled us to iterate faster on novel modeling ideas and then seamlessly ship them at industrial scale.

Interaction Design for Explainable AI: Workshop Proceedings

As artificial intelligence (AI) systems become increasingly complex and ubiquitous, they will be responsible for making decisions that directly affect individuals and society as a whole. Such decisions will need to be justified, both for ethical reasons and to build trust, but achieving this has become difficult due to the ‘black-box’ nature of many AI models. Explainable AI (XAI) can potentially address this problem by explaining the actions, decisions and behaviours of the system to users. However, much research in XAI is done in a vacuum, using only the researchers’ intuition of what constitutes a ‘good’ explanation while ignoring the interaction and the human aspect. This workshop invites researchers in the HCI community and related fields to have a discourse about human-centred approaches to XAI rooted in interaction and to shed light and spark discussion on interaction design challenges in XAI.

Safety and Trustworthiness of Deep Neural Networks: A Survey

In the past few years, significant progress has been made on deep neural networks (DNNs) in achieving human-level intelligence on several long-standing tasks such as image classification, natural language processing, the ancient game of Go, etc. With the broader deployment of DNNs in various applications, concerns about their safety and trustworthiness have been raised, particularly after fatal incidents involving self-driving cars. Research to address these concerns is very active, with many papers released in the past few years. It is therefore infeasible, if not impossible, to cover all the research activities. This survey reviews current research efforts on making DNNs safe and trustworthy, focusing on works that are aligned with our humble visions about the safety and trustworthiness of DNNs. In total, we surveyed 178 papers, most of which were published in the most recent two years, i.e., 2017 and 2018.

PROVEN: Certifying Robustness of Neural Networks with a Probabilistic Approach

With deep neural networks providing state-of-the-art machine learning models for numerous machine learning tasks, quantifying the robustness of these models has become an important area of research. However, most of the research literature focuses merely on the worst-case setting, where the input of the neural network is perturbed with noise constrained within an ℓ_p ball, and several algorithms have been proposed to compute certified lower bounds of the minimum adversarial distortion based on such worst-case analysis. In this paper, we address these limitations and extend the approach to a probabilistic setting where the additive noise can follow a given distributional characterization. We propose a novel probabilistic framework, PROVEN, to PRObabilistically VErify Neural networks with statistical guarantees; i.e., PROVEN certifies the probability that the classifier’s top-1 prediction cannot be altered under any constrained ℓ_p norm perturbation to a given input. Importantly, we show that it is possible to derive closed-form probabilistic certificates based on current state-of-the-art neural network robustness verification frameworks. Hence, the probabilistic certificates provided by PROVEN come naturally and with almost no overhead when obtaining the worst-case certified lower bounds from existing methods such as Fast-Lin, CROWN and CNN-Cert. Experiments on small and large MNIST and CIFAR neural network models demonstrate that our probabilistic approach can achieve up to around 75% improvement in the robustness certification with at least 99.99% confidence, compared with the worst-case robustness certificate delivered by CROWN.

Invariance, Causality and Robustness

We discuss recent work on causal inference and predictive robustness in a unifying way. The key idea relies on a notion of probabilistic invariance or stability: it opens up new insights for formulating causality as a certain risk minimization problem with a corresponding notion of robustness. The invariance itself can be estimated from general heterogeneous or perturbation data, which frequently arise in present-day data collection. The novel methodology is potentially useful in many applications, offering more robustness and better ‘causal-oriented’ interpretation than machine learning or estimation in standard regression or classification frameworks.

A Novel Large-scale Ordinal Regression Model

Ordinal regression (OR) is a special multiclass classification problem where an order relation exists among the labels. In recent years, people have been sharing their opinions and sentiment judgments conveniently through social networks and e-commerce, so plentiful large-scale OR problems have arisen. However, few studies have focused on this kind of problem. Nonparallel Support Vector Ordinal Regression (NPSVOR) is an SVM-based OR model, which learns a hyperplane for each rank by solving a series of independent sub-optimization problems and then ensembles the learned hyperplanes to predict. Previous studies focused on its nonlinear case and achieved competitive test performance, but its training is time-consuming, particularly for large-scale data. In this paper, we consider NPSVOR’s linear case and design an efficient training method based on the dual coordinate descent (DCD) method. To utilize the order information among labels in prediction, a new prediction function is also proposed. Extensive comparative experiments on text OR datasets indicate that the carefully implemented DCD is very suitable for training on large data.

Factorization Machines for Data with Implicit Feedback

In this work, we propose FM-Pair, an adaptation of Factorization Machines with a pairwise loss function, making them effective for datasets with implicit feedback. The optimization model in FM-Pair is based on the BPR (Bayesian Personalized Ranking) criterion, which is a well-established pairwise optimization model. FM-Pair retains the advantages of FMs on generality, expressiveness and performance and yet it can be used for datasets with implicit feedback. We also propose how to apply FM-Pair effectively on two collaborative filtering problems, namely, context-aware recommendation and cross-domain collaborative filtering. By performing experiments on different datasets with explicit or implicit feedback we empirically show that in most of the tested datasets, FM-Pair beats state-of-the-art learning-to-rank methods such as BPR-MF (BPR with Matrix Factorization model). We also show that FM-Pair is significantly more effective for ranking, compared to the standard FMs model. Moreover, we show that FM-Pair can utilize context or cross-domain information effectively as the accuracy of recommendations would always improve with the right auxiliary features. Finally we show that FM-Pair has a linear time complexity and scales linearly by exploiting additional features.
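
The combination described above, a factorization machine scorer trained with a BPR-style pairwise loss, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fm_score` and `bpr_loss`, the sparse-dict feature encoding, and the toy parameters are all hypothetical.

```python
import math

def fm_score(x, w0, w, V):
    """Second-order factorization machine score for a sparse input.

    x  : dict {feature_index: value} (hypothetical encoding)
    w0 : global bias, w : list of linear weights
    V  : list of latent vectors, one per feature
    Uses the standard identity
      sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2].
    """
    k = len(V[0])
    linear = sum(w[i] * xi for i, xi in x.items())
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * xi for i, xi in x.items())
        s_sq = sum((V[i][f] * xi) ** 2 for i, xi in x.items())
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

def bpr_loss(score_pos, score_neg):
    """BPR criterion for one (positive, negative) pair: -log sigmoid(margin)."""
    margin = score_pos - score_neg
    # log(1 + exp(-margin)), arranged to avoid overflow for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Toy setup: feature 0 is a user, features 1 and 2 are items.
w0, w = 0.0, [0.0, 0.0, 0.0]
V = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
pos = fm_score({0: 1.0, 1: 1.0}, w0, w, V)  # user with an interacted item
neg = fm_score({0: 1.0, 2: 1.0}, w0, w, V)  # user with a sampled negative item
loss = bpr_loss(pos, neg)
```

Because the input is a generic feature vector, context or cross-domain features would simply be extra entries in `x`; the loss only ever sees the difference between two scores.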

On the Role of Age-of-Information in Internet of Things

The success of many Internet of Things (IoT) applications relies on the ability of the network to deliver sensing measurements from the IoT devices to the destination nodes while they are still fresh. In this article, we provide an accessible introduction to the emerging idea of Age-of-Information (AoI) that quantifies freshness of information and explore its possible role in the efficient design of freshness-aware IoT. We start by summarizing the concept of AoI and its variants with emphasis on the differences between AoI and other well-known performance metrics in the literature, such as throughput and delay. Building on this, we explore freshness-aware IoT design for a network in which IoT devices sense potentially different physical processes and are supposed to frequently update the status of these processes at a destination node (such as a cellular base station). Inspired by the recent interest, we also assume that these IoT devices are powered by wireless energy transfer by the destination node. For this setting, we investigate the optimal sampling policy for IoT devices that minimizes long-term weighted sum-AoI. This policy jointly optimizes wireless energy transfer and scheduling of update packet transmissions from IoT devices. Using this, we also characterize the achievable AoI region and demonstrate a fundamental trade-off between achieving fairness among different processes and achieving the minimum sum-AoI. Multiple promising directions for future research and extensions for our proposed system setup are presented.
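
To make the AoI metric concrete, the sketch below computes the age at a destination in discrete time slots: the age at slot t is t minus the generation time of the freshest update received so far. The slotted model, the assumption that the destination starts with a sample generated at slot 0, and the function names are illustrative assumptions, not taken from the article.

```python
def age_trajectory(deliveries, horizon):
    """Age of information at slots 0..horizon-1.

    deliveries: iterable of (delivery_slot, generation_slot) pairs.
    Assumes (hypothetically) the destination holds a sample from slot 0 at start.
    """
    newest_by_slot = {}
    for delivered, generated in deliveries:
        previous = newest_by_slot.get(delivered, generated)
        newest_by_slot[delivered] = max(previous, generated)
    latest_generation = 0
    ages = []
    for t in range(horizon):
        if t in newest_by_slot:
            # A stale packet never makes the destination's information older.
            latest_generation = max(latest_generation, newest_by_slot[t])
        ages.append(t - latest_generation)
    return ages

def weighted_sum_aoi(trajectories, weights):
    """Time-averaged weighted sum of the ages of several processes."""
    horizon = len(trajectories[0])
    return sum(w * sum(traj) / horizon for traj, w in zip(trajectories, weights))

# One update generated at slot 2 arrives at slot 3; another is generated
# and delivered within slot 6.
ages = age_trajectory([(3, 2), (6, 6)], horizon=8)
```

Note how the first delivery resets the age to 1 rather than 0: delay in the network is part of the age, which is one way AoI differs from a pure throughput or delay metric.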

SQuantizer: Simultaneous Learning for Both Sparse and Low-precision Neural Networks

Deep neural networks have achieved state-of-the-art accuracies in a wide range of computer vision, speech recognition, and machine translation tasks. However, the limits of memory bandwidth and computational power constrain the range of devices capable of deploying these modern networks. To address this problem, we propose SQuantizer, a new training method that jointly optimizes for both sparse and low-precision neural networks while maintaining high accuracy and providing a high compression rate. This approach brings sparsification and low-bit quantization into a single training pass, employing these techniques in an order demonstrated to be optimal. Our method achieves state-of-the-art accuracies using 4-bit and 2-bit precision for ResNet18, MobileNet-v2 and ResNet50, even with a high degree of sparsity. Compression rates of 18x for ResNet18, 17x for ResNet50, and 9x for MobileNet-v2 are obtained when SQuantizing both weights and activations, within 1% and 2% loss in accuracy for ResNets and MobileNet-v2 respectively. An extension of these techniques to object detection also demonstrates high accuracy on YOLO-v2. Additionally, our method allows for fast single-pass training, which is important for rapid prototyping and neural architecture search techniques. Finally, extensive results from this simultaneous training approach allow us to draw some useful insights into the relative merits of sparsity and quantization.

Recommendation System based on Semantic Scholar Mining and Topic modeling: A behavioral analysis of researchers from six conferences

Recommendation systems play an important role in helping online users in the internet society. They are of great practical use in various kinds of Internet portals, such as social networks and library websites. There are several approaches to implementing recommendation systems, and Latent Dirichlet Allocation (LDA) is one of the popular techniques in topic modeling. Recently, researchers have proposed many approaches based on recommendation systems and LDA. Given the importance of the subject, in this paper we discover the trends of the topics and find relationships between LDA topics and Scholar-Context-documents. Specifically, we apply probabilistic topic modeling based on Gibbs sampling algorithms for semantic mining of publications from six computer science conferences in the DBLP dataset. According to our experimental results, our semantic framework can help organizations better organize these conferences and cover future research topics.

NeuralWarp: Time-Series Similarity with Warping Networks

Research on time-series similarity measures has emphasized the need for elastic methods which align the indices of pairs of time series, and a plethora of non-parametric measures have been proposed for the task. On the other hand, deep learning approaches are dominant in closely related domains, such as learning image and text sentence similarity. In this paper, we propose NeuralWarp, a novel measure that models the alignment of time-series indices in a deep representation space, by modeling a warping function as an upper-level neural network between deeply-encoded time series values. Experimental results demonstrate that NeuralWarp outperforms both non-parametric and un-warped deep models on a range of diverse real-life datasets.
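
As a point of reference for the elastic, non-parametric measures mentioned above, here is a minimal sketch of dynamic time warping (DTW), a well-known measure of that kind. It is not NeuralWarp itself; the function name and the absolute-difference local cost are illustrative choices.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    cost[i][j] holds the cheapest alignment of a[:i] with b[:j]; each step
    may advance in a, in b, or in both, which is what lets indices "warp".
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# The second series is the first with one value held for an extra step,
# so an elastic alignment matches them at zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

A learned approach in the spirit of the abstract would replace the fixed local cost and the hand-written recursion with trained components; the sketch only shows what "aligning indices" means.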

DAC: Data-free Automatic Acceleration of Convolutional Networks

Deploying a deep learning model on mobile/IoT devices is a challenging task. The difficulty lies in the trade-off between computation speed and accuracy. A complex deep learning model with high accuracy runs slowly on resource-limited devices, while a light-weight model that runs much faster loses accuracy. In this paper, we propose a novel decomposition method, namely DAC, that is capable of factorizing an ordinary convolutional layer into two layers with far fewer parameters. DAC computes the corresponding weights for the newly generated layers directly from the weights of the original convolutional layer. Thus, neither training (or fine-tuning) nor any data is needed. The experimental results show that DAC removes a large number of floating-point operations (FLOPs) while maintaining the high accuracy of a pre-trained model. If a 2% accuracy drop is acceptable, DAC saves 53% of the FLOPs of the VGG16 image classification model on the ImageNet dataset, 29% of the FLOPs of the SSD300 object detection model on the PASCAL VOC2007 dataset, and 46% of the FLOPs of a multi-person pose estimation model on the Microsoft COCO dataset. Compared to other existing decomposition methods, DAC achieves better performance.

Energy-aware virtual machine selection method for cloud data center resource allocation

Saving energy is an important way for cloud providers to reduce the energy cost of a data center. With the increasing popularity of cloud computing, it is time to examine energy reduction methods that can lower energy consumption and lead to green cloud computing. In this paper, we propose a virtual machine selection algorithm to improve the energy efficiency of a cloud data center, and we present experimental results for the algorithm in a cloud-computing-based simulation environment. The proposed algorithm dynamically allocates, deallocates, and reallocates virtual machines on physical servers. These actions depend on the load and on heuristics derived from an analysis of virtual machine placement, which is decided over time. In simulation, the proposed virtual machine selection algorithm reduces total energy consumption by 19% compared to the existing one, which lowers both the energy cost and the carbon footprint of a cloud data center. The simulation results also show that the proposed heuristics, which are based on resource provisioning algorithms, reduce the energy consumption of the cloud data center and decrease the virtual machine migration rate.

Forward Neural Network for Time Series Anomaly Detection

Time series anomaly detection is usually formulated as finding outlier data points relative to some usual data, and it is an important problem in both industry and academia. To keep their systems running stably, internet companies, banks and other companies need to monitor time series known as KPIs (Key Performance Indicators), such as CPU usage, number of orders, number of online users and so on. However, millions of time series take several shapes (e.g. seasonal KPIs, KPIs of timed tasks and KPIs of CPU usage), so it is very difficult to use a simple statistical model to detect anomalies for all kinds of time series. Although some anomaly detectors have been developed over many years and some supervised models are also available in this field, we find that many methods have their own disadvantages. In this paper, we present our system, which is based on a deep forward neural network and detects anomalous points in time series. The main difference between our system and other systems based on supervised models is that we do not need feature engineering of time series to train the deep forward neural network, so the system is essentially end-to-end.

A Survey of Hierarchy Identification in Social Networks

Humans are social by nature. Throughout history, people have formed communities and built relationships. Most relationships with coworkers, friends, and family are developed during face-to-face interactions. These relationships are established through explicit means of communication, such as words, and implicit ones, such as intonation and body language. By analyzing human interactions we can derive information about the relationships and influence among conversation participants. However, with the development of the Internet, people started to communicate through text in online social networks. Interestingly, they brought their communication habits to the Internet. Many social network users form relationships with each other and establish communities with leaders and followers. Recognizing these hierarchical relationships is an important task because it will help to understand social networks and predict future trends, improve recommendations, better target advertisements, and improve national security by identifying leaders of anonymous terror groups. In this work, I provide an overview of current research in this area and present the state-of-the-art approaches to the problem of identifying hierarchical relationships in social networks.

Soft Realization: a Bio-inspired Implementation Paradigm

Researchers traditionally solve computational problems through rigorous and deterministic algorithms, known as Hard Computing. These precise algorithms have widely been realized using digital technology as an inherently reliable and accurate implementation platform, in either hardware or software form. This rigid form of implementation, which we refer to as Hard Realization, relies on strict algorithmic accuracy constraints dictated to digital design engineers. Hard realization accepts whatever implementation cost is necessary to preserve computational precision and determinism throughout all the design and implementation steps. Despite its prior accomplishments, this conventional paradigm has encountered serious challenges with today’s emerging applications and implementation technologies. Unlike traditional hard computing, the emerging soft and bio-inspired algorithms do not rely on fully precise and deterministic computation. Moreover, emerging nanotechnologies face increasing reliability issues that prevent them from being efficiently exploited in hard realization of applications. This article examines Soft Realization, a novel bio-inspired approach to the design and implementation of an important category of applications, drawing on the internal structure of the brain. The proposed paradigm mitigates major weaknesses of hard realization by (1) alleviating incompatibilities with today’s soft and bio-inspired algorithms such as artificial neural networks, fuzzy systems, and human sense signal processing applications, and (2) resolving the destructive inconsistency with unreliable nanotechnologies. Our experimental results on a set of well-known soft applications implemented using the proposed soft realization paradigm in both reliable and unreliable technologies indicate that significant energy, delay, and area savings can be obtained compared to the conventional implementation.

Graph Neural Networks: A Review of Methods and Applications

Many learning tasks require dealing with graph data, which contains rich relation information among elements. Modeling physics systems, learning molecular fingerprints, predicting protein interfaces, and classifying diseases all require a model to learn from graph inputs. In other domains, such as learning from non-structural data like texts and images, reasoning on extracted structures, like the dependency tree of sentences and the scene graph of images, is an important research topic which also needs graph reasoning models. Graph neural networks (GNNs) are connectionist models that capture the dependence of graphs via message passing between the nodes of graphs. Unlike standard neural networks, graph neural networks retain a state that can represent information from a node’s neighborhood with arbitrary depth. Although the primitive graph neural networks have been found difficult to train for a fixed point, recent advances in network architectures, optimization techniques, and parallel computation have enabled successful learning with them. In recent years, systems based on graph convolutional networks (GCN) and gated graph neural networks (GGNN) have demonstrated ground-breaking performance on many of the tasks mentioned above. In this survey, we provide a detailed review of existing graph neural network models, systematically categorize the applications, and propose four open problems for future research.
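
The message-passing idea can be illustrated with a deliberately stripped-down sketch: each node repeatedly replaces its state with the mean of its own state and its neighbours' states. Real GNN layers add learned weights and nonlinearities; the mean aggregator, the dict-based graph encoding, and the function names here are simplifying assumptions for illustration only.

```python
def message_passing_step(adjacency, states):
    """One round of mean-aggregation message passing.

    adjacency: {node: list of neighbour nodes}
    states   : {node: list of floats}, all of the same length
    """
    new_states = {}
    for node, state in states.items():
        # Messages are the node's own state plus every neighbour's state.
        messages = [state] + [states[n] for n in adjacency.get(node, ())]
        new_states[node] = [sum(vals) / len(messages) for vals in zip(*messages)]
    return new_states

def propagate(adjacency, states, rounds):
    """After r rounds, a node's state depends on nodes up to r hops away."""
    for _ in range(rounds):
        states = message_passing_step(adjacency, states)
    return states

# Path graph 0 - 1 - 2 with scalar node features.
graph = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0], 1: [2.0], 2: [3.0]}
one_round = propagate(graph, features, rounds=1)
two_rounds = propagate(graph, features, rounds=2)
```

After one round node 0 has only seen node 1; after two rounds its state also reflects node 2, which is the sense in which depth controls the size of the neighbourhood a state can represent.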

cuPC: CUDA-based Parallel PC Algorithm for Causal Structure Learning on GPU

The main goal in many fields of the empirical sciences is to discover causal relationships among a set of variables from observational data. The PC algorithm is one of the promising solutions for learning the underlying causal structure by performing a number of conditional independence tests. In this paper, we propose a novel GPU-based parallel algorithm, called cuPC, to accelerate an order-independent version of PC. The cuPC algorithm has two variants, cuPC-E and cuPC-S, which parallelize conditional independence tests over the pairs of variables under the tests, and over the conditional sets, respectively. In particular, cuPC-E offers two degrees of parallelization by performing tests of multiple pairs of variables and also the tests of each pair in parallel. On the other hand, cuPC-S reuses the results of computations of a test for a given conditional set in other tests on the same conditional set. Experimental results on a GTX 1080 GPU show two to three orders of magnitude speedup. For instance, in one of the most challenging benchmarks, cuPC-S reduces the runtime from about 73 hours to about one minute and achieves a significant speedup factor of about 4000x.
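
The basic unit of work that a PC-style algorithm repeats many times is a single conditional independence test. The sequential sketch below shows one common choice for Gaussian data, a Fisher z-test on the partial correlation; whether cuPC uses exactly this test is an assumption here, and the function names and the 1.96 threshold (two-sided 5% level) are illustrative.

```python
import math

def partial_corr(corr, i, j, cond):
    """Partial correlation of variables i and j given the variables in cond.

    corr is a full correlation matrix (list of lists). Uses the standard
    recursion that removes one conditioning variable at a time.
    """
    if not cond:
        return corr[i][j]
    k, rest = cond[0], cond[1:]
    r_ij = partial_corr(corr, i, j, rest)
    r_ik = partial_corr(corr, i, k, rest)
    r_jk = partial_corr(corr, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

def looks_independent(corr, n_samples, i, j, cond, z_crit=1.96):
    """Fisher z-test: True when independence of i and j given cond is not rejected."""
    r = partial_corr(corr, i, j, cond)
    r = max(min(r, 0.999999), -0.999999)  # keep the log finite
    z = 0.5 * math.log((1 + r) / (1 - r))
    statistic = math.sqrt(n_samples - len(cond) - 3) * abs(z)
    return statistic <= z_crit

# Correlations consistent with a chain X0 -> X1 -> X2.
corr = [[1.0, 0.5, 0.25],
        [0.5, 1.0, 0.5],
        [0.25, 0.5, 1.0]]
marginal = looks_independent(corr, 1000, 0, 2, [])      # dependent marginally
conditional = looks_independent(corr, 1000, 0, 2, [1])  # independent given X1
```

Each call is independent of the others for different pairs, and calls sharing a conditioning set share intermediate quantities, which is roughly where the two parallelization strategies in the abstract find their leverage.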

How Much Does Tokenization Affect in Neural Machine Translation?

Tokenization or segmentation is a broad concept that covers simple processes such as separating punctuation from words, as well as more sophisticated processes such as applying morphological knowledge. Neural Machine Translation (NMT) requires a limited-size vocabulary for reasons of computational cost, and enough examples to estimate word embeddings well. Separating punctuation and splitting tokens into words or subwords has been shown to help reduce the vocabulary and increase the number of examples of each word, improving translation quality. Tokenization is more challenging when dealing with languages with no separator between words. To assess the impact of tokenization on the quality of the final translation in NMT, we experimented with five tokenizers over ten language pairs. We conclude that tokenization significantly affects the final translation quality and that the best tokenizer differs across language pairs.

Robust Estimation of Causal Effects via High-Dimensional Covariate Balancing Propensity Score

In this paper, we propose a robust method to estimate the average treatment effects in observational studies when the number of potential confounders is possibly much greater than the sample size. We first use a class of penalized M-estimators for the propensity score and outcome models. We then calibrate the initial estimate of the propensity score by balancing a carefully selected subset of covariates that are predictive of the outcome. Finally, the estimated propensity score is used to construct the inverse probability weighting estimator. We prove that the proposed estimator, which has the sample boundedness property, is root-n consistent, asymptotically normal, and semiparametrically efficient when the propensity score model is correctly specified and the outcome model is linear in covariates. More importantly, we show that our estimator remains root-n consistent and asymptotically normal so long as either the propensity score model or the outcome model is correctly specified. We provide valid confidence intervals in both cases and further extend these results to the case where the outcome model is a generalized linear model. In simulation studies, we find that the proposed methodology often estimates the average treatment effect more accurately than the existing methods. We also present an empirical application, in which we estimate the average causal effect of college attendance on adulthood political participation. Open-source software is available for implementing the proposed methodology.
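
The final step, plugging an estimated propensity score into an inverse probability weighting estimator, can be sketched in a few lines. The version below uses normalized weights, which keep each weighted mean inside the range of the observed outcomes; reading this as the "sample boundedness" property is our interpretation. The penalized estimation and covariate-balancing calibration steps are not shown, and the function name and toy data are hypothetical.

```python
def ipw_ate(outcomes, treatments, propensities):
    """Normalized inverse-probability-weighted estimate of the average treatment effect.

    outcomes     : observed outcome per unit
    treatments   : 1 if treated, 0 if control
    propensities : estimated P(treated | covariates), strictly between 0 and 1
    """
    w_treated = [t / e for t, e in zip(treatments, propensities)]
    w_control = [(1 - t) / (1 - e) for t, e in zip(treatments, propensities)]
    # Dividing by the sum of weights makes each term a weighted average.
    mean_treated = sum(w * y for w, y in zip(w_treated, outcomes)) / sum(w_treated)
    mean_control = sum(w * y for w, y in zip(w_control, outcomes)) / sum(w_control)
    return mean_treated - mean_control

# Toy data with a constant propensity of 0.5 (as in a fair randomized trial),
# where the estimate reduces to a difference of group means.
ate = ipw_ate(outcomes=[3.0, 1.0, 4.0, 2.0],
              treatments=[1, 0, 1, 0],
              propensities=[0.5, 0.5, 0.5, 0.5])
```

With non-constant propensities, units that were unlikely to receive the treatment they got are up-weighted, which is how the estimator corrects for confounding when the propensity model is right.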

Core Decomposition in Multilayer Networks: Theory, Algorithms, and Applications

Multilayer networks are a powerful paradigm to model complex systems, where various relations might occur among the same set of entities. Despite the keen interest in a variety of problems, algorithms, and analysis methods in this type of network, the problem of extracting dense subgraphs has remained largely unexplored. As a first step in this direction, we study the problem of core decomposition of a multilayer network. Unlike the single-layer counterpart, in which cores are all nested into one another, in the multilayer context no total order exists among multilayer cores: they form a lattice whose size is exponential in the number of layers. In this setting we devise three algorithms which differ in the way they visit the core lattice and in their pruning techniques. We assess the time and space efficiency of the three algorithms on a large variety of real-world multilayer networks. We then study the problem of extracting only the inner-most cores, i.e., the cores that are not dominated by any other core in terms of their index on all the layers. As inner-most cores are orders of magnitude fewer than all the cores, it is desirable to develop algorithms that effectively exploit the maximality property and extract inner-most cores directly, without first computing a complete decomposition. Moreover, we showcase an application of the multilayer core-decomposition tool to the problem of densest-subgraph extraction from multilayer networks. We introduce a definition of multilayer densest subgraph that trades off between high density and the number of layers in which the high density holds, and show how multilayer core decomposition can be exploited to approximate this problem with quality guarantees. We also exploit multilayer core decomposition to speed up the extraction of frequent cross-graph quasi-cliques and to generalize the community-search problem to the multilayer setting.
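
To make the object of study concrete, the sketch below computes a single multilayer core by peeling. It assumes the natural definition in which a core is indexed by one threshold per layer and is the maximal node set whose members have at least that many neighbours inside the set in every layer; this definition and the names used are our assumptions, and the lattice-visiting and pruning strategies of the three algorithms are not shown.

```python
def multilayer_core(layers, thresholds):
    """Maximal node set meeting a per-layer minimum-degree requirement.

    layers     : list of adjacency dicts {node: set of neighbours}, one per layer
    thresholds : list of ints, the required degree in each layer
    """
    alive = set()
    for adjacency in layers:
        alive |= set(adjacency)
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            for adjacency, k in zip(layers, thresholds):
                # Only neighbours that are still alive count towards the degree.
                if len(adjacency.get(node, set()) & alive) < k:
                    alive.discard(node)
                    changed = True
                    break
    return alive

# Layer A: triangle 0-1-2 with a pendant node 3. Layer B: path 0-1-2-3.
layer_a = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
layer_b = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
core_2_1 = multilayer_core([layer_a, layer_b], [2, 1])
core_2_2 = multilayer_core([layer_a, layer_b], [2, 2])
```

Since every combination of thresholds names a potentially different core, enumerating them naively means one such peeling per threshold vector, which is why the number of candidate cores grows exponentially with the number of layers.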

RNNs Implicitly Implement Tensor Product Representations

Recurrent neural networks (RNNs) can learn continuous vector representations of symbolic structures such as sequences and sentences; these representations often exhibit linear regularities (analogies). Such regularities motivate our hypothesis that RNNs that show such regularities implicitly compile symbolic structures into tensor product representations (TPRs; Smolensky, 1990), which additively combine tensor products of vectors representing roles (e.g., sequence positions) and vectors representing fillers (e.g., particular words). To test this hypothesis, we introduce Tensor Product Decomposition Networks (TPDNs), which use TPRs to approximate existing vector representations. We demonstrate using synthetic data that TPDNs can successfully approximate linear and tree-based RNN autoencoder representations, suggesting that these representations exhibit interpretable compositional structure; we explore the settings that lead RNNs to induce such structure-sensitive representations. By contrast, further TPDN experiments show that the representations of four models trained to encode naturally-occurring sentences can be largely approximated with a bag-of-words, with only marginal improvements from more sophisticated structures. We conclude that TPDNs provide a powerful method for interpreting vector representations, and that standard RNNs can induce compositional sequence representations that are remarkably well approximated by TPRs; at the same time, existing training tasks for sentence representation learning may not be sufficient for inducing robust structural representations.
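
The tensor product representation itself is simple to write down: each filler vector is bound to its role vector by an outer product, and the bindings are summed. The sketch below builds such a representation and recovers a filler again; the helper names are hypothetical, and the exact recovery shown relies on the roles being orthonormal, which is an assumption of this example rather than a claim about learned RNN representations.

```python
def bind(fillers, roles):
    """Tensor product representation: sum over i of outer(filler_i, role_i)."""
    rows, cols = len(fillers[0]), len(roles[0])
    tpr = [[0.0] * cols for _ in range(rows)]
    for filler, role in zip(fillers, roles):
        for a in range(rows):
            for b in range(cols):
                tpr[a][b] += filler[a] * role[b]
    return tpr

def unbind(tpr, role):
    """Multiply the TPR by a role vector; exact when roles are orthonormal."""
    return [sum(row[b] * role[b] for b in range(len(role))) for row in tpr]

# Two "words" (fillers) placed in sequence positions 1 and 2 (roles).
fillers = [[1.0, 2.0], [3.0, 4.0]]
roles = [[1.0, 0.0], [0.0, 1.0]]
tpr = bind(fillers, roles)
first = unbind(tpr, roles[0])
second = unbind(tpr, roles[1])
```

A TPDN-style analysis goes in the other direction: given vectors produced by a trained network, it asks how well some choice of roles and fillers combined in this additive way can approximate them.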

What are the biases in my word embedding?

This paper presents an algorithm for enumerating biases in word embeddings. The algorithm exposes a large number of offensive associations related to sensitive features such as race and gender on publicly available embeddings, including a supposedly ‘debiased’ embedding. These embedded biases are concerning in light of the widespread use of word embeddings. The associations are identified by geometric patterns in word embeddings that run parallel between people’s names and common lower-case words and phrases. The algorithm is highly unsupervised: it does not even require the sensitive groups (such as gender or race) to be pre-specified. This is desirable because it may not always be easy to identify all vulnerable groups a priori, and because it makes it easier to identify biases against intersectional groups, which depend on combinations of sensitive features. The inputs to our algorithm are a list of target tokens, e.g. names, and a word embedding, and the outputs are a number of Word Embedding Association Tests (WEATs) that capture various biases present in the data. We illustrate the utility of our approach on publicly available word embeddings and lists of names, and evaluate its output using crowdsourcing. We also show how removing names may not remove potential proxy bias.
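
For readers unfamiliar with the output format, a Word Embedding Association Test compares two sets of target words against two sets of attribute words using cosine similarity. The sketch below computes the standard WEAT test statistic on toy two-dimensional vectors; it does not implement the enumeration algorithm of the paper, and the function names and vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(w, attrs_a, attrs_b):
    """Mean similarity of w to attribute set A minus its mean similarity to set B."""
    mean_a = sum(cosine(w, a) for a in attrs_a) / len(attrs_a)
    mean_b = sum(cosine(w, b) for b in attrs_b) / len(attrs_b)
    return mean_a - mean_b

def weat_statistic(targets_x, targets_y, attrs_a, attrs_b):
    """Differential association of target sets X and Y with attributes A and B."""
    return (sum(association(x, attrs_a, attrs_b) for x in targets_x)
            - sum(association(y, attrs_a, attrs_b) for y in targets_y))

# Toy embedding: target set X lines up with attribute A, target set Y with B.
X, Y = [[1.0, 0.0]], [[0.0, 1.0]]
A, B = [[1.0, 0.0]], [[0.0, 1.0]]
stat = weat_statistic(X, Y, A, B)
```

A value near zero would indicate no differential association; in practice the statistic is usually accompanied by a permutation test or an effect size, which this sketch omits.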

Deep Metric Transfer for Label Propagation with Limited Annotated Data

We study object recognition under the constraint that each object class is only represented by very few observations. In such cases, naive supervised learning would lead to severe over-fitting in deep neural networks due to limited training data. We tackle this problem by creating much more training data through label propagation from the few labeled examples to a vast collection of unannotated images. Our main insight is that such a label propagation scheme can be highly effective when the similarity metric used for propagation is learned and transferred from other related domains with lots of data. We test our approach on semi-supervised learning, transfer learning and few-shot recognition, where we learn our similarity metric using various supervised/unsupervised pretraining methods, and transfer it to unlabeled data across different data distributions. By taking advantage of unlabeled data in this way, we achieve significant improvements on all three tasks. Notably, our approach outperforms current state-of-the-art techniques by an absolute 20% for semi-supervised learning on CIFAR10, 10% for transfer learning from ImageNet to CIFAR10, and 6% for few-shot recognition on mini-ImageNet, when labeled examples are limited.