If you did not already know

Delta Embedding Learning google
Learning from corpus and learning from supervised NLP tasks both give useful semantics that can be incorporated into a good word representation. We propose an embedding learning method called Delta Embedding Learning, to learn semantic information from high-level supervised tasks like reading comprehension, and combine it with an unsupervised word embedding. The simple technique not only improved the performance of various supervised NLP tasks, but also simultaneously learns improved universal word embeddings out of these tasks. …

Association for Uncertainty in Artificial Intelligence (AUAI) google
The Association for Uncertainty in Artificial Intelligence is a non-profit organization focused on organizing the annual Conference on Uncertainty in Artificial Intelligence (UAI) and, more generally, on promoting research in pursuit of advances in knowledge representation, learning and reasoning under uncertainty. Principles and applications developed within the UAI community have been at the forefront of research in Artificial Intelligence. The UAI community and annual meeting have been primary sources of advances in graphical models for representing and reasoning with uncertainty. …

Deep Product Quantization (DPQ) google
Despite their widespread adoption, Product Quantization techniques were recently shown to be inferior to other hashing techniques. In this work, we present an improved Deep Product Quantization (DPQ) technique that leads to more accurate retrieval and classification than the latest state of the art methods, while having similar computational complexity and memory footprint as the Product Quantization method. To our knowledge, this is the first work to introduce a representation that is inspired by Product Quantization and which is learned end-to-end, and thus benefits from the supervised signal. DPQ explicitly learns soft and hard representations to enable an efficient and accurate asymmetric search, by using a straight-through estimator. A novel loss function, Joint Central Loss, is introduced, which both improves the retrieval performance, and decreases the discrepancy between the soft and the hard representations. Finally, by using a normalization technique, we improve the results for cross-domain category retrieval. …

Data Blending google
Data blending is the process of combining data from multiple sources to reveal deeper intelligence that drives better business decision-making. Data blending differs from data integration and data warehousing in that its primary use is not to create the single, unified version of the truth that is stored in systems of record. Rather, business and data analysts use data blending to build an analytic dataset to assist in answering a specific business questions and driving a particular business process. …

If you did not already know

Uneven Group Convolution google
In this paper, we are interested in boosting the representation capability of convolution neural networks which utilizing the inverted residual structure. Based on the success of Inverted Residual structure[Sandler et al. 2018] and Interleaved Low-Rank Group Convolutions[Sun et al. 2018], we rethink this two pattern of neural network structure, rather than NAS(Neural architecture search) method[Zoph and Le 2017; Pham et al. 2018; Liu et al. 2018b], we introduce uneven point-wise group convolution, which provide a novel search space for designing basic blocks to obtain better trade-off between representation capability and computational cost. Meanwhile, we propose two novel information flow patterns that will enable cross-group information flow for multiple group convolution layers with and without any channel permute/shuffle operation. Dense experiments on image classification task show that our proposed model, named Seesaw-Net, achieves state-of-the-art (SOTA) performance with limited computation and memory cost. Our code will be open-source and available together with pre-trained models. …

MQGrad google
One of the most significant bottleneck in training large scale machine learning models on parameter server (PS) is the communication overhead, because it needs to frequently exchange the model gradients between the workers and servers during the training iterations. Gradient quantization has been proposed as an effective approach to reducing the communication volume. One key issue in gradient quantization is setting the number of bits for quantizing the gradients. Small number of bits can significantly reduce the communication overhead while hurts the gradient accuracies, and vise versa. An ideal quantization method would dynamically balance the communication overhead and model accuracy, through adjusting the number bits according to the knowledge learned from the immediate past training iterations. Existing methods, however, quantize the gradients either with fixed number of bits, or with predefined heuristic rules. In this paper we propose a novel adaptive quantization method within the framework of reinforcement learning. The method, referred to as MQGrad, formalizes the selection of quantization bits as actions in a Markov decision process (MDP) where the MDP states records the information collected from the past optimization iterations (e.g., the sequence of the loss function values). During the training iterations of a machine learning algorithm, MQGrad continuously updates the MDP state according to the changes of the loss function. Based on the information, MDP learns to select the optimal actions (number of bits) to quantize the gradients. Experimental results based on a benchmark dataset showed that MQGrad can accelerate the learning of a large scale deep neural network while keeping its prediction accuracies. …

Huber Loss google
In statistics, the Huber loss is a loss function used in robust regression, that is less sensitive to outliers in data than the squared error loss. A variant for classification is also sometimes used. …

Dynamic Generative Memory (DGM) google
Models trained in the context of continual learning (CL) should be able to learn from a stream of data over an undefined period of time. The main challenges herein are: 1) maintaining old knowledge while simultaneously benefiting from it when learning new tasks, and 2) guaranteeing model scalability with a growing amount of data to learn from. In order to tackle these challenges, we introduce Dynamic Generative Memory (DGM) – a synaptic plasticity driven framework for continual learning. DGM relies on conditional generative adversarial networks with learnable connection plasticity realized with neural masking. Specifically, we evaluate two variants of neural masking: applied to (i) layer activations and (ii) to connection weights directly. Furthermore, we propose a dynamic network expansion mechanism that ensures sufficient model capacity to accommodate for continually incoming tasks. The amount of added capacity is determined dynamically from the learned binary mask. We evaluate DGM in the continual class-incremental setup on visual classification tasks. …

If you did not already know

Noise Engineered Mode-matching GAN (NEMGAN) google
Conditional generation refers to the process of sampling from an unknown distribution conditioned on semantics of the data. This can be achieved by augmenting the generative model with the desired semantic labels, albeit it is not straightforward in an unsupervised setting where the semantic label of every data sample is unknown. In this paper, we address this issue by proposing a method that can generate samples conditioned on the properties of a latent distribution engineered in accordance with a certain data prior. In particular, a latent space inversion network is trained in tandem with a generative adversarial network such that the modal properties of the latent space distribution are induced in the data generating distribution. We demonstrate that our model despite being fully unsupervised, is effective in learning meaningful representations through its mode matching property. We validate our method on multiple unsupervised tasks such as conditional generation, attribute discovery and inference using three real world image datasets namely MNIST, CIFAR-10 and CelebA and show that the results are comparable to the state-of-the-art methods. …

Holt-Winters Method (HW) google
Holt (1957) and Winters (1960) extended Holt’s method to capture seasonality. The Holt-Winters seasonal method comprises the forecast equation and three smoothing equations – one for the level ℓ t , one for trend b t , and one for the seasonal component denoted by s t, with smoothing parameters α , β ∗ and γ. We use m to denote the period of the seasonality, i.e., the number of seasons in a year. For example, for quarterly data m=4 , and for monthly data m=12. There are two variations to this method that differ in the nature of the seasonal component. The additive method is preferred when the seasonal variations are roughly constant through the series, while the multiplicative method is preferred when the seasonal variations are changing proportional to the level of the series. With the additive method, the seasonal component is expressed in absolute terms in the scale of the observed series, and in the level equation the series is seasonally adjusted by subtracting the seasonal component. Within each year the seasonal component will add up to approximately zero. With the multiplicative method, the seasonal component is expressed in relative terms (percentages) and the series is seasonally adjusted by dividing through by the seasonal component. Within each year, the seasonal component will sum up to approximately m. …

Annotation Query Language (AQL) google
Annotation Query Language (AQL) is the language for developing text analytics extractors in the InfoSphere BigInsights Text Analytics system. An extractor is a program written in AQL that extracts structured information from unstructured or semistructured text. AQL is a declarative language. The syntax of AQL is similar to that of Structured Query Language (SQL), but with several important differences. …

Data2Vis google
Rapidly creating effective visualizations using expressive grammars is challenging for users who have limited time and limited skills in statistics and data visualization. Even high-level, dedicated visualization tools often require users to manually select among data attributes, decide which transformations to apply, and specify mappings between visual encoding variables and raw or transformed attributes. In this paper, we introduce Data2Vis, a neural translation model, for automatically generating visualizations from given datasets. We formulate visualization generation as a sequence to sequence translation problem where data specification is mapped to a visualization specification in a declarative language (Vega-Lite). To this end, we train a multilayered Long Short-Term Memory (LSTM) model with attention on a corpus of visualization specifications. Qualitative results show that our model learns the vocabulary and syntax for a valid visualization specification, appropriate transformations (count, bins, mean) and how to use common data selection patterns that occur within data visualizations. Our model generates visualizations that are comparable to manually-created visualizations in a fraction of the time, with potential to learn more complex visualization strategies at scale. …

If you did not already know

Continuous Adaptive Blending (CAB) google
The ability to perform offline A/B-testing and off-policy learning using logged contextual bandit feedback is highly desirable in a broad range of applications, including recommender systems, search engines, ad placement, and personalized health care. Both offline A/B-testing and off-policy learning require a counterfactual estimator that evaluates how some new policy would have performed, if it had been used instead of the logging policy. This paper proposes a new counterfactual estimator – called Continuous Adaptive Blending (CAB) – for this policy evaluation problem that combines regression and weighting approaches for an effective bias/variance trade-off. It can be substantially less biased than clipped Inverse Propensity Score weighting and the Direct Method, and it can have less variance compared with Doubly Robust and IPS estimators. Experimental results show that CAB provides excellent and reliable estimation accuracy compared to other blended estimators, and – unlike the SWITCH estimator – is sub-differentiable such that it can be used for learning. …

Refinery google
Refinery is an open source platform for the massive analysis of large unstructured document collections using the latest state of the art topic models. The goal of Refinery is to simplify this process within an intuitive web-based interface. What makes Refinery unique is that its meant to be run locally, thus bypassing the need for securing document collections over the internet. Refinery was developed by myself and Ben Swanson at MIT Media Lab. It was also the recipient of the Knight Prototype Award in 2014. …

Convolutional Poisson Gamma Belief Network (CPGBN) google
For text analysis, one often resorts to a lossy representation that either completely ignores word order or embeds each word as a low-dimensional dense feature vector. In this paper, we propose convolutional Poisson factor analysis (CPFA) that directly operates on a lossless representation that processes the words in each document as a sequence of high-dimensional one-hot vectors. To boost its performance, we further propose the convolutional Poisson gamma belief network (CPGBN) that couples CPFA with the gamma belief network via a novel probabilistic pooling layer. CPFA forms words into phrases and captures very specific phrase-level topics, and CPGBN further builds a hierarchy of increasingly more general phrase-level topics. For efficient inference, we develop both a Gibbs sampler and a Weibull distribution based convolutional variational auto-encoder. Experimental results demonstrate that CPGBN can extract high-quality text latent representations that capture the word order information, and hence can be leveraged as a building block to enrich a wide variety of existing latent variable models that ignore word order. …

Congruence Distance google
A time series is a sequence of data items; typical examples are videos, stock ticker data, or streams of temperature measurements. Quite some research has been devoted to comparing and indexing simple time series, i.e., time series where the data items are real numbers or integers. However, for many application scenarios, the data items of a time series are not simple, but high-dimensional data points. Motivated by an application scenario dealing with motion gesture recognition, we develop a distance measure (which we call congruence distance) that serves as a model for the approximate congruency of two multi-dimensional time series. This distance measure generalizes the classical notion of congruence from point sets to multi-dimensional time series. We show that, given two input time series $S$ and $T$, computing the congruence distance of $S$ and $T$ is NP-hard. Afterwards, we present two algorithms that compute an approximation of the congruence distance. We provide theoretical bounds that relate these approximations with the exact congruence distance. …

If you did not already know

Deep Semantic Multimodal Hashing Network (DSMHN) google
Hashing has been widely applied to multimodal retrieval on large-scale multimedia data due to its efficiency in computation and storage. Particularly, deep hashing has received unprecedented research attention in recent years, owing to its perfect retrieval performance. However, most of existing deep hashing methods learn binary hash codes by preserving the similarity relationship while without exploiting the semantic labels, which result in suboptimal binary codes. In this work, we propose a novel Deep Semantic Multimodal Hashing Network (DSMHN) for scalable multimodal retrieval. In DSMHN, two sets of modality-specific hash functions are jointly learned by explicitly preserving both the inter-modality similarities and the intra-modality semantic labels. Specifically, with the assumption that the learned hash codes should be optimal for task-specific classification, two stream networks are jointly trained to learn the hash functions by embedding the semantic labels on the resultant hash codes. Different from previous deep hashing methods, which are tied to some particular forms of loss functions, our deep hashing framework can be flexibly integrated with different types of loss functions. In addition, the bit balance property is investigated to generate binary codes with each bit having $50\%$ probability to be $1$ or $-1$. Moreover, a unified deep multimodal hashing framework is proposed to learn compact and high-quality hash codes by exploiting the feature representation learning, inter-modality similarity preserving learning, semantic label preserving learning and hash functions learning with bit balanced constraint simultaneously. We conduct extensive experiments for both unimodal and cross-modal retrieval tasks on three widely-used multimodal retrieval datasets. The experimental result demonstrates that DSMHN significantly outperforms state-of-the-art methods. …

Cause-Emotion-Action Corpus google
Many Natural Language Processing works on emotion analysis only focus on simple emotion classification without exploring the potentials of putting emotion into ‘event context’, and ignore the analysis of emotion-related events. One main reason is the lack of this kind of corpus. Here we present Cause-Emotion-Action Corpus, which manually annotates not only emotion, but also cause events and action events. We propose two new tasks based on the data-set: emotion causality and emotion inference. The first task is to extract a triple (cause, emotion, action). The second task is to infer the probable emotion. We are currently releasing the data-set with 10,603 samples and 15,892 events, basic statistic analysis and baseline on both emotion causality and emotion inference tasks. Baseline performance demonstrates that there is much room for both tasks to be improved. …

DT-LET google
Previous transfer learning methods based on deep network assume the knowledge should be transferred between the same hidden layers of the source domain and the target domains. This assumption doesn’t always hold true, especially when the data from the two domains are heterogeneous with different resolutions. In such case, the most suitable numbers of layers for the source domain data and the target domain data would differ. As a result, the high level knowledge from the source domain would be transferred to the wrong layer of target domain. Based on this observation, ‘where to transfer’ proposed in this paper should be a novel research frontier. We propose a new mathematic model named DT-LET to solve this heterogeneous transfer learning problem. In order to select the best matching of layers to transfer knowledge, we define specific loss function to estimate the corresponding relationship between high-level features of data in the source domain and the target domain. To verify this proposed cross-layer model, experiments for two cross-domain recognition/classification tasks are conducted, and the achieved superior results demonstrate the necessity of layer correspondence searching. …

Modulated Policy Hierarchies (MPH) google
Solving tasks with sparse rewards is a main challenge in reinforcement learning. While hierarchical controllers are an intuitive approach to this problem, current methods often require manual reward shaping, alternating training phases, or manually defined sub tasks. We introduce modulated policy hierarchies (MPH), that can learn end-to-end to solve tasks from sparse rewards. To achieve this, we study different modulation signals and exploration for hierarchical controllers. Specifically, we find that communicating via bit-vectors is more efficient than selecting one out of multiple skills, as it enables mixing between them. To facilitate exploration, MPH uses its different time scales for temporally extended intrinsic motivation at each level of the hierarchy. We evaluate MPH on the robotics tasks of pushing and sparse block stacking, where it outperforms recent baselines. …

If you did not already know

Semantic View Selection google
An understanding of the nature of objects could help robots to solve both high-level abstract tasks and improve performance at lower-level concrete tasks. Although deep learning has facilitated progress in image understanding, a robot’s performance in problems like object recognition often depends on the angle from which the object is observed. Traditionally, robot sorting tasks rely on a fixed top-down view of an object. By changing its viewing angle, a robot can select a more semantically informative view leading to better performance for object recognition. In this paper, we introduce the problem of semantic view selection, which seeks to find good camera poses to gain semantic knowledge about an observed object. We propose a conceptual formulation of the problem, together with a solvable relaxation based on clustering. We then present a new image dataset consisting of around 10k images representing various views of 144 objects under different poses. Finally we use this dataset to propose a first solution to the problem by training a neural network to predict a ‘semantic score’ from a top view image and camera pose. The views predicted to have higher scores are then shown to provide better clustering results than fixed top-down views. …

Missingness-Aware Temporal Convolutional Hitting-time Network (MATCH-Net) google
Accurate prediction of disease trajectories is critical for early identification and timely treatment of patients at risk. Conventional methods in survival analysis are often constrained by strong parametric assumptions and limited in their ability to learn from high-dimensional data, while existing neural network models are not readily-adapted to the longitudinal setting. This paper develops a novel convolutional approach that addresses these drawbacks. We present MATCH-Net: a Missingness-Aware Temporal Convolutional Hitting-time Network, designed to capture temporal dependencies and heterogeneous interactions in covariate trajectories and patterns of missingness. To the best of our knowledge, this is the first investigation of temporal convolutions in the context of dynamic prediction for personalized risk prognosis. Using real-world data from the Alzheimer’s Disease Neuroimaging Initiative, we demonstrate state-of-the-art performance without making any assumptions regarding underlying longitudinal or time-to-event processes attesting to the model’s potential utility in clinical decision support. …

Windowed Fourier Filtering google
Interferometric phase (InPhase) imaging is an important part of many present-day coherent imaging technologies. Often in such imaging techniques, the acquired images, known as interferograms, suffer from two major degradations: 1) phase wrapping caused by the fact that the sensing mechanism can only measure sinusoidal $2\pi$-periodic functions of the actual phase, and 2) noise introduced by the acquisition process or the system. This work focusses on InPhase denoising which is a fundamental restoration step to many posterior applications of InPhase, namely to phase unwrapping. The presence of sharp fringes that arises from phase wrapping makes InPhase denoising a hard-inverse problem. Motivated by the fact that the InPhase images are often locally sparse in Fourier domain, we propose a multi-resolution windowed Fourier filtering (WFF) analysis that fuses WFF estimates with different resolutions, thus overcoming the WFF fixed resolution limitation. The proposed fusion relies on an unbiased estimate of the mean square error derived using the Stein’s lemma adapted to complex-valued signals. This estimate, known as SURE, is minimized using an optimization framework to obtain the fusion weights. Strong experimental evidence, using synthetic and real (InSAR & MRI) data, that the developed algorithm, termed as SURE-fuse WFF, outperforms the best hand-tuned fixed resolution WFF as well as other state-of-the-art InPhase denoising algorithms, is provided. …

Reinforced Encoder-Decoder (RED) google
Action anticipation aims to detect an action before it happens. Many real world applications in robotics and surveillance are related to this predictive capability. Current methods address this problem by first anticipating visual representations of future frames and then categorizing the anticipated representations to actions. However, anticipation is based on a single past frame’s representation, which ignores the history trend. Besides, it can only anticipate a fixed future time. We propose a Reinforced Encoder-Decoder (RED) network for action anticipation. RED takes multiple history representations as input and learns to anticipate a sequence of future representations. One salient aspect of RED is that a reinforcement module is adopted to provide sequence-level supervision; the reward function is designed to encourage the system to make correct predictions as early as possible. We test RED on TVSeries, THUMOS-14 and TV-Human-Interaction datasets for action anticipation and achieve state-of-the-art performance on all datasets. …

If you did not already know

Graph Wavelet Neural Network (GWNN) google
We present graph wavelet neural network (GWNN), a novel graph convolutional neural network (CNN), leveraging graph wavelet transform to address the shortcomings of previous spectral graph CNN methods that depend on graph Fourier transform. Different from graph Fourier transform, graph wavelet transform can be obtained via a fast algorithm without requiring matrix eigendecomposition with high computational cost. Moreover, graph wavelets are sparse and localized in vertex domain, offering high efficiency and good interpretability for graph convolution. The proposed GWNN significantly outperforms previous spectral graph CNNs in the task of graph-based semi-supervised classification on three benchmark datasets: Cora, Citeseer and Pubmed. …

Guided Dropout google
Dropout is often used in deep neural networks to prevent over-fitting. Conventionally, dropout training invokes \textit{random drop} of nodes from the hidden layers of a Neural Network. It is our hypothesis that a guided selection of nodes for intelligent dropout can lead to better generalization as compared to the traditional dropout. In this research, we propose ‘guided dropout’ for training deep neural network which drop nodes by measuring the strength of each node. We also demonstrate that conventional dropout is a specific case of the proposed guided dropout. Experimental evaluation on multiple datasets including MNIST, CIFAR10, CIFAR100, SVHN, and Tiny ImageNet demonstrate the efficacy of the proposed guided dropout. …

Principal Component-Guided Sparse Regression (pcLasso) google
We propose a new method for supervised learning, especially suited to wide data where the number of features is much greater than the number of observations. The method combines the lasso (l 1 ) sparsity penalty with a quadratic penalty that shrinks the coefficient vector toward the leading principal components of the feature matrix. We call the proposed method the ‘principal components lasso’ (‘pcLasso’). The method can be especially powerful if the features are pre-assigned to groups (such as cell-pathways, assays or protein interaction networks). In that case, pcLasso shrinks each group-wise component of the solution toward the leading principal components of that group. In the process, it also carries out selection of the feature groups. We provide some theory for this method and illustrate it on a number of simulated and real data examples. …

Kolmogorov-Smirnov Test (KS) google
In statistics, the Kolmogorov-Smirnov test (K-S test or KS test) is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K-S test), or to compare two samples (two-sample K-S test). The Kolmogorov-Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution, or between the empirical distribution functions of two samples. The null distribution of this statistic is calculated under the null hypothesis that the samples are drawn from the same distribution (in the two-sample case) or that the sample is drawn from the reference distribution (in the one-sample case). In each case, the distributions considered under the null hypothesis are continuous distributions but are otherwise unrestricted. The two-sample K-S test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples. The Kolmogorov-Smirnov test can be modified to serve as a goodness of fit test. In the special case of testing for normality of the distribution, samples are standardized and compared with a standard normal distribution. This is equivalent to setting the mean and variance of the reference distribution equal to the sample estimates, and it is known that using these to define the specific reference distribution changes the null distribution of the test statistic: see below. Various studies have found that, even in this corrected form, the test is less powerful for testing normality than the Shapiro-Wilk test or Anderson-Darling test. However, other tests have their own disadvantages. For instance the Shapiro-Wilk test is known not to work well with many ties (many identical values). …

If you did not already know

TEST google
Tracking developments in the highly dynamic data-technology landscape are vital to keeping up with novel technologies and tools, in the various areas of Artificial Intelligence (AI). However, It is difficult to keep track of all the relevant technology keywords. In this paper, we propose a novel system that addresses this problem. This tool is used to automatically detect the existence of new technologies and tools in text, and extract terms used to describe these new technologies. The extracted new terms can be logged as new AI technologies as they are found on-the-fly in the web. It can be subsequently classified into the relevant semantic labels and AI domains. Our proposed tool is based on a two-stage cascading model — the first stage classifies if the sentence contains a technology term or not; and the second stage identifies the technology keyword in the sentence. We obtain a competitive accuracy for both tasks of sentence classification and text identification. …

Self-Attention Aligner (SAA) google
Self-attention network, an attention-based feedforward neural network, has recently shown the potential to replace recurrent neural networks (RNNs) in a variety of NLP tasks. However, it is not clear if the self-attention network could be a good alternative of RNNs in automatic speech recognition (ASR), which processes the longer speech sequences and may have online recognition requirements. In this paper, we present a RNN-free end-to-end model: self-attention aligner (SAA), which applies the self-attention networks to a simplified recurrent neural aligner (RNA) framework. We also propose a chunk-hopping mechanism, which enables the SAA model to encode on segmented frame chunks one after another to support online recognition. Experiments on two Mandarin ASR datasets show the replacement of RNNs by the self-attention networks yields a 8.4%-10.2% relative character error rate (CER) reduction. In addition, the chunk-hopping mechanism allows the SAA to have only a 2.5% relative CER degradation with a 320ms latency. After jointly training with a self-attention network language model, our SAA model obtains further error rate reduction on multiple datasets. Especially, it achieves 24.12% CER on the Mandarin ASR benchmark (HKUST), exceeding the best end-to-end model by over 2% absolute CER. …

WarpLDA google
Developing efficient and scalable algorithms for Latent Dirichlet Allocation (LDA) is of wide interest for many applications. Previous work has developed an $O(1)$ Metropolis-Hastings sampling method for each token. However, the performance is far from being optimal due to random accesses to the parameter matrices and frequent cache misses. In this paper, we propose WarpLDA, a novel $O(1)$ sampling algorithm for LDA. WarpLDA is a Metropolis-Hastings based algorithm which is designed to optimize the cache hit rate. Advantages of WarpLDA include 1) Efficiency and scalability: WarpLDA has good locality and carefully designed partition method, and can be scaled to hundreds of machines; 2) Simplicity: WarpLDA does not have any complicated modules such as alias tables, hybrid data structures, or parameter servers, making it easy to understand and implement; 3) Robustness: WarpLDA is consistently faster than other algorithms, under various settings from small-scale to massive-scale dataset and model. WarpLDA is 5-15x faster than state-of-the-art LDA samplers, implying less cost of time and money. With WarpLDA users can learn up to one million topics from hundreds of millions of documents in a few hours, at the speed of 2G tokens per second, or learn topics from small-scale datasets in seconds. …

Physics-Informed Gaussian Process Regression (GPR) google
We present a physics-informed Gaussian Process Regression (GPR) model to predict the phase angle, angular speed, and wind mechanical power from a limited number of measurements. In the traditional data-driven GPR method, the form of the Gaussian Process covariance matrix is assumed and its parameters are found from measurements. In the physics-informed GPR, we treat unknown variables (including wind speed and mechanical power) as a random process and compute the covariance matrix from the resulting stochastic power grid equations. We demonstrate that the physics-informed GPR method is significantly more accurate than the standard data-driven one for immediate forecasting of generators’ angular velocity and phase angle. We also show that the physics-informed GPR provides accurate predictions of the unobserved wind mechanical power, phase angle, or angular velocity when measurements from only one of these variables are available. The immediate forecast of observed variables and predictions of unobserved variables can be used for effectively managing power grids (electricity market clearing, regulation actions) and early detection of abnormal behavior and faults. The physics-based GPR forecast time horizon depends on the combination of input (wind power, load, etc.) correlation time and characteristic (relaxation) time of the power grid and can be extended to short and medium-range times. …

If you did not already know

Approximate Random Dropout google
The training phases of Deep neural network (DNN) consume enormous processing time and energy. Compression techniques for inference acceleration leveraging the sparsity of DNNs, however, can be hardly used in the training phase. Because the training involves dense matrix-multiplication using GPGPU, which endorse regular and structural data layout. In this paper, we exploit the sparsity of DNN resulting from the random dropout technique to eliminate the unnecessary computation and data access for those dropped neurons or synapses in the training phase. Experiments results on MLP and LSTM on standard benchmarks show that the proposed Approximate Random Dropout can reduce the training time by half on average with ignorable accuracy loss. …

Factor Adjusted Robust Multiple Testing google
Large-scale multiple testing with correlated and heavy-tailed data arises in a wide range of research areas from genomics, medical imaging to finance. Conventional methods for estimating the false discovery proportion (FDP) often ignore the effect of heavy-tailedness and the dependence structure among test statistics, and thus may lead to inefficient or even inconsistent estimation. Also, the assumption of joint normality is often imposed, which is too stringent for many applications. To address these challenges, in this paper we propose a factoradjusted robust procedure for large-scale simultaneous inference with control of the false discovery proportion. We demonstrate that robust factor adjustments are extremely important in both improving the power of the tests and controlling FDP. We identify general conditions under which the proposed method produces consistent estimate of the FDP. As a byproduct that is of independent interest, we establish an exponential-type deviation inequality for a robust U-type covariance estimator under the spectral norm. Extensive numerical experiments demonstrate the advantage of the proposed method over several state-of-the-art methods especially when the data are generated from heavy-tailed distributions. Our proposed procedures are implemented in the R-package farmtest. …

eXplaining Aggregates for eXploratory Analytics (XAXA) google
Analysts wishing to explore multivariate data spaces, typically pose queries involving selection operators, i.e., range or radius queries, which define data subspaces of possible interest and then use aggregation functions, the results of which determine their exploratory analytics interests. However, such aggregate query (AQ) results are simple scalars and as such, convey limited information about the queried subspaces for exploratory analysis. We address this shortcoming aiding analysts to explore and understand data subspaces by contributing a novel explanation mechanism coined XAXA: eXplaining Aggregates for eXploratory Analytics. XAXA’s novel AQ explanations are represented using functions obtained by a three-fold joint optimization problem. Explanations assume the form of a set of parametric piecewise-linear functions acquired through a statistical learning model. A key feature of the proposed solution is that model training is performed by only monitoring AQs and their answers on-line. In XAXA, explanations for future AQs can be computed without any database (DB) access and can be used to further explore the queried data subspaces, without issuing any more queries to the DB. We evaluate the explanation accuracy and efficiency of XAXA through theoretically grounded metrics over real-world and synthetic datasets and query workloads. …

AdaCos google
The cosine-based softmax losses and their variants achieve great success in deep learning based face recognition. However, hyperparameter settings in these losses have significant influences on the optimization path as well as the final recognition performance. Manually tuning those hyperparameters heavily relies on user experience and requires many training tricks. In this paper, we investigate in depth the effects of two important hyperparameters of cosine-based softmax losses, the scale parameter and angular margin parameter, by analyzing how they modulate the predicted classification probability. Based on these analysis, we propose a novel cosine-based softmax loss, AdaCos, which is hyperparameter-free and leverages an adaptive scale parameter to automatically strengthen the training supervisions during the training process. We apply the proposed AdaCos loss to large-scale face verification and identification datasets, including LFW, MegaFace, and IJB-C 1:1 Verification. Our results show that training deep neural networks with the AdaCos loss is stable and able to achieve high face recognition accuracy. Our method outperforms state-of-the-art softmax losses on all the three datasets. …

If you did not already know

Internet of NanoThing (IoNT) google
This chapter focuses on Internet of Things from the nanoscale point of view. The chapter starts with section 1 which provides an introduction of nanothings and nanotechnologies. The nanoscale communication paradigms and the different approaches are discussed for nanodevices development. Nanodevice characteristics are discussed and the architecture of wireless nanodevices are outlined. Section 2 describes Internet of NanoThing(IoNT), its network architecture, and the challenges of nanoscale communication which is essential for enabling IoNT. Section 3 gives some practical applications of IoNT. The internet of Bio-NanoThing (IoBNT) and relevant biomedical applications are discussed. Other Applications such as military, industrial, and environmental applications are also outlined. …

Additive Latent Effect Model (ALE) google
The past decade has seen a growth in the development and deployment of educational technologies for assisting college-going students in choosing majors, selecting courses and acquiring feedback based on past academic performance. Grade prediction methods seek to estimate a grade that a student may achieve in a course that she may take in the future (e.g., next term). Accurate and timely prediction of students’ academic grades is important for developing effective degree planners and early warning systems, and ultimately improving educational outcomes. Existing grade pre- diction methods mostly focus on modeling the knowledge components associated with each course and student, and often overlook other factors such as the difficulty of each knowledge component, course instructors, student interest, capabilities and effort. In this paper, we propose additive latent effect models that incorporate these factors to predict the student next-term grades. Specifically, the proposed models take into account four factors: (i) student’s academic level, (ii) course instructors, (iii) student global latent factor, and (iv) latent knowledge factors. We compared the new models with several state-of-the-art methods on students of various characteristics (e.g., whether a student transferred in or not). The experimental results demonstrate that the proposed methods significantly outperform the baselines on grade prediction problem. Moreover, we perform a thorough analysis on the importance of different factors and how these factors can practically assist students in course selection, and finally improve their academic performance. …

Transductive Propagation Network (TPN) google
Few-shot learning aims to build a learner that quickly generalizes to novel classes even when a limited number of labeled examples (so-called low-data problem) are available. Meta-learning is commonly deployed to mimic the test environment in a training phase for good generalization, where episodes (i.e., learning problems) are manually constructed from the training set. This framework gains a lot of attention to few-shot learning with impressive performance, though the low-data problem is not fully addressed. In this paper, we propose Transductive Propagation Network (TPN), a transductive method that classifies the entire test set at once to alleviate the low-data problem. Specifically, our proposed network explicitly learns an underlying manifold space that is appropriate to propagate labels from few-shot examples, where all parameters of feature embedding, manifold structure, and label propagation are estimated in an end-to-end way on episodes. We evaluate the proposed method on the commonly used miniImageNet and tieredImageNet benchmarks and achieve the state-of-the-art or promising results on these datasets. …

Permutation Invariant Gaussian Matrix Model google
Permutation invariant Gaussian matrix models were recently developed for applications in computational linguistics. A 5-parameter family of models was solved. In this paper, we use a representation theoretic approach to solve the general 13-parameter Gaussian model, which can be viewed as a zero-dimensional quantum field theory. We express the two linear and eleven quadratic terms in the action in terms of representation theoretic parameters. These parameters are coefficients of simple quadratic expressions in terms of appropriate linear combinations of the matrix variables transforming in specific irreducible representations of the symmetric group $S_D$ where $D$ is the size of the matrices. They allow the identification of constraints which ensure a convergent Gaussian measure and well-defined expectation values for polynomial functions of the random matrix at all orders. A graph-theoretic interpretation is known to allow the enumeration of permutation invariants of matrices at linear, quadratic and higher orders. We express the expectation values of all the quadratic graph-basis invariants and a selection of cubic and quartic invariants in terms of the representation theoretic parameters of the model. …