We propose a Fourier-based approach for optimizing several clustering algorithms. Mathematically, clustered data can be described by a density function represented as a Dirac mixture distribution. This density function can be smoothed by applying the Fourier transform and a Gaussian filter. The optimal standard deviation of the Gaussian filter is determined by a convergence criterion related to the correlation between the smoothed and the original density functions. In principle, the optimal smoothed density function exhibits local maxima that correspond to the cluster centroids. Thus, the complex task of finding the cluster centroids reduces to detecting the peaks of the smoothed density function. A multiple sliding windows procedure is used to detect the peaks. The accuracy of the proposed algorithm demonstrates its capability as a reliable general method for enhancing clustering performance, achieving global optimization, and removing the initialization problem in many clustering methods.
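
As a rough one-dimensional illustration of the smoothing and peak-detection idea (not the authors' implementation), the sketch below bins the data into a histogram as a discretised Dirac mixture, multiplies its Fourier transform by a Gaussian filter, and reads off local maxima. The function names, the fixed `sigma`, the grid size, and the simple neighbour comparison used in place of the multiple sliding windows procedure are all assumptions made for this example; the correlation-based choice of the standard deviation is not reproduced.

```python
import numpy as np

def smoothed_density(points, grid_size=512, sigma=0.05, lo=0.0, hi=1.0):
    """Histogram the points and low-pass filter the histogram in Fourier space."""
    hist, edges = np.histogram(points, bins=grid_size, range=(lo, hi))
    dx = (hi - lo) / grid_size
    freqs = np.fft.rfftfreq(grid_size, d=dx)  # frequencies in cycles per unit length
    # Fourier transform of a unit-area Gaussian with standard deviation sigma
    kernel_hat = np.exp(-2.0 * (np.pi * sigma * freqs) ** 2)
    smooth = np.fft.irfft(np.fft.rfft(hist) * kernel_hat, n=grid_size)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, smooth

def find_peaks(centers, density, min_rel_height=0.1):
    """Return grid positions that are strict local maxima above a relative height.

    np.roll treats the grid as periodic, matching the circular convolution
    implied by the FFT.
    """
    left = np.roll(density, 1)
    right = np.roll(density, -1)
    is_peak = (density > left) & (density > right)
    is_peak &= density > min_rel_height * density.max()
    return centers[is_peak]

# Two synthetic clusters (illustrative data only)
rng = np.random.default_rng(0)
points = np.concatenate([rng.normal(0.25, 0.03, 500), rng.normal(0.75, 0.03, 500)])
centers, density = smoothed_density(points)
peaks = find_peaks(centers, density)  # candidate centroids
```

The detected peaks could then seed a method such as k-means, which is one way to read the abstract's remark about removing the initialization problem.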

Feature crossing captures interactions among categorical features and is useful for enhancing learning from tabular data in real-world businesses. In this paper, we present AutoCross, an automatic feature crossing tool provided by 4Paradigm to its customers, ranging from banks and hospitals to Internet corporations. By performing beam search in a tree-structured space, AutoCross enables efficient generation of high-order cross features, which existing works have not yet explored. Additionally, we propose successive mini-batch gradient descent and multi-granularity discretization to further improve efficiency and effectiveness, while ensuring simplicity so that no machine learning expertise or tedious hyper-parameter tuning is required. Furthermore, the algorithms are designed to reduce the computational, transmission, and storage costs involved in distributed computing. Experimental results on both benchmark and real-world business datasets demonstrate the effectiveness and efficiency of AutoCross. It is shown that AutoCross can significantly enhance the performance of both linear and deep models.
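
To make the notion of a cross feature concrete, here is a minimal sketch of second-order crossing of categorical fields. It only illustrates what a cross feature is; the beam search, successive mini-batch gradient descent, and multi-granularity discretization of AutoCross are not shown, and the function name, field names, and the `|` separator are assumptions for this example.

```python
from itertools import combinations

def cross_features(row, fields):
    """Return one second-order cross feature per pair of categorical fields."""
    return {
        f"{a}_x_{b}": f"{row[a]}|{row[b]}"  # joint value of the two fields
        for a, b in combinations(fields, 2)
    }

# Hypothetical record with three categorical fields
row = {"job": "nurse", "city": "Lyon", "device": "ios"}
crossed = cross_features(row, ["job", "city", "device"])
```

Higher-order crosses follow by crossing an existing cross feature with a further field, which is where a search over a tree-structured space becomes useful.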

We present techniques for automatically inferring invariant properties of feed-forward neural networks. Our insight is that feed-forward networks should be able to learn a decision logic that is captured in the activation patterns of their neurons. We propose to extract such decision patterns, which can be considered invariants of the network with respect to a certain output behavior. We present techniques to extract input invariants as convex predicates on the input space, and layer invariants that represent features captured in the hidden layers. We apply the techniques to networks for the MNIST and ACASXU applications. Our experiments highlight the use of invariants in a variety of applications, such as explainability, providing robustness guarantees, detecting adversaries, simplifying proofs and network distillation.

For any financial organization, computing accurate quarterly forecasts for various products is one of the most critical operations. As the granularity at which forecasts are needed increases, traditional statistical time series models may not scale well. We apply deep neural networks in the forecasting domain by experimenting with techniques from Natural Language Processing (Encoder-Decoder LSTMs) and Computer Vision (Dilated CNNs), as well as incorporating transfer learning. A novel contribution of this paper is the application of curriculum learning to neural network models built for time series forecasting. We illustrate the performance of our models using Microsoft’s revenue data corresponding to Enterprise, and Small, Medium & Corporate products, spanning approximately 60 regions across the globe for 8 different business segments, and totaling in the order of tens of billions of USD. We compare our models’ performance to the ensemble model of traditional statistics and machine learning techniques currently used by Microsoft Finance. With this in-production model as a baseline, our experiments yield an approximately 30% improvement in overall accuracy on test data. We find that our curriculum learning LSTM-based model performs best, showing that it is reasonable to implement our proposed methods without overfitting on medium-sized data.

The ability to predict individualized treatment effects (ITEs) based on a given patient’s profile is essential for personalized medicine. The prediction of ITEs enables the comparison of the effectiveness of two treatment procedures for a specific individual. We propose a hypothesis testing approach to choosing between two available treatments for a given individual in the framework of high-dimensional linear models. The methodological novelty is the development of a testing procedure with the type-I error uniformly controlled for any future high-dimensional observation, while the existing methods can only handle certain specific forms of covariate observations. The procedure is based on a debiased estimator of the ITEs and its asymptotic normality. The asymptotic power of the proposed test is established and the finite sample performance is demonstrated in simulation studies. We introduce the optimality framework of hypothesis testing in high dimensions from both minimaxity and adaptivity perspectives and establish the optimality of the proposed procedure. The proposed method can be extended to conduct statistical inference for general linear contrasts, including both average treatment effect and the prediction problem. The procedure is further illustrated through an analysis of electronic health records data from patients with rheumatoid arthritis.

Reinforcement learning (RL) has proven its worth in a series of artificial domains, and is beginning to show some successes in real-world scenarios. However, many of the research advances in RL are hard to leverage in real-world systems because they rest on assumptions that are rarely satisfied in practice. We present a set of nine unique challenges that must be addressed to productionize RL for real-world problems. For each of these challenges, we specify the exact meaning of the challenge, present some approaches from the literature, and specify some metrics for evaluating that challenge. An approach that addresses all nine challenges would be applicable to a large number of real-world problems. We also present an example domain that has been modified to present these challenges as a testbed for practical RL research.

To understand the fundamental trade-offs between training stability, temporal dynamics and architectural complexity of recurrent neural networks (RNNs), we directly analyze RNN architectures using numerical methods of ordinary differential equations (ODEs). We define a general family of RNNs, the ODERNNs, by relating the composition rules of RNNs to integration methods of ODEs at discrete time steps. We show that the degree of an RNN’s functional nonlinearity and the range of its temporal memory can be mapped to the corresponding stage of Runge-Kutta recursion and the order of time-derivative of the ODEs. We prove that popular RNN architectures, such as LSTM and URNN, fit into different orders of ODERNNs. This exact correspondence between RNN and ODE helps us to establish the sufficient conditions for RNN training stability and facilitates more flexible top-down designs of new RNN architectures using large varieties of toolboxes from numerical integration of ODEs. We provide such an example: Quantum-inspired Universal computing Neural Network (QUNN), which reduces the required number of training parameters from polynomial in both data length and temporal memory length to only linear in temporal memory length.
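
The link between RNN update rules and ODE integrators can be illustrated with a small sketch. Assuming a hidden-state ODE dh/dt = tanh(W h + U x + b) (this vector field, the function names, and the step size `dt` are assumptions for illustration, not the paper's ODERNN definition), a residual-style recurrent update is a forward Euler step, and a two-stage Runge-Kutta (midpoint) step gives a more deeply composed cell.

```python
import numpy as np

def vector_field(h, x, W, U, b):
    """Right-hand side of the assumed ODE dh/dt = tanh(W h + U x + b)."""
    return np.tanh(W @ h + U @ x + b)

def euler_rnn_step(h, x, W, U, b, dt=1.0):
    """One-stage update: forward Euler, i.e. a residual recurrent cell."""
    return h + dt * vector_field(h, x, W, U, b)

def midpoint_rnn_step(h, x, W, U, b, dt=1.0):
    """Two-stage update: explicit midpoint rule (a second-order Runge-Kutta method)."""
    h_half = h + 0.5 * dt * vector_field(h, x, W, U, b)
    return h + dt * vector_field(h_half, x, W, U, b)

def run(step, xs, h0, W, U, b, dt=1.0):
    """Unroll a step function over an input sequence xs of shape (T, input_dim)."""
    h = h0
    for x in xs:
        h = step(h, x, W, U, b, dt)
    return h
```

In the midpoint cell the nonlinearity is applied twice per time step, which is one informal way to see how the number of Runge-Kutta stages relates to functional nonlinearity.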

In this paper we present a new type of optimization algorithm adapted for neural network training. These algorithms are based upon a sequential operator splitting technique for some associated dynamical systems. Furthermore, we investigate through numerical simulations the empirical rate of convergence of these iterative schemes toward a local minimum of the loss function, with some suitable choices of the underlying hyper-parameters. We validate the convergence of these optimizers using the accuracy and loss values obtained on the MNIST, MNIST-Fashion and CIFAR-10 classification datasets.

Applied Data Scientists throughout various industries are commonly faced with the challenging task of encoding high-cardinality categorical features into digestible inputs for machine learning algorithms. This paper describes a Bayesian encoding technique developed for WeWork’s lead scoring engine which outputs the probability of a person touring one of our office spaces based on interaction, enrichment, and geospatial data. We present a paradigm for ensemble modeling which mitigates the need to build complicated preprocessing and encoding schemes for categorical variables. In particular, domain-specific conjugate Bayesian models are employed as base learners for features in a stacked ensemble model. For each column of a categorical feature matrix we fit a problem-specific prior distribution, for example, the Beta distribution for a binary classification problem. In order to analytically derive the moments of the posterior distribution, we update the prior with the conjugate likelihood of the corresponding target variable for each unique value of the given categorical feature. This function of column and value encodes the categorical feature matrix so that the final learner in the ensemble model ingests low-dimensional numerical input. Experimental results on both curated and real world datasets demonstrate impressive accuracy and computational efficiency on a variety of problem archetypes. Particularly, for the lead scoring engine at WeWork — where some categorical features have as many as 300,000 levels — we have seen an AUC improvement from 0.87 to 0.97 through implementing conjugate Bayesian model encoding.
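
For the binary classification case mentioned above, the conjugate update can be sketched as follows: each level of a categorical feature starts from a Beta(alpha0, beta0) prior, which is updated with the observed positives and negatives, and the posterior mean and variance become the numerical encoding. This is a minimal sketch under stated assumptions: the function name, the uniform Beta(1, 1) default prior, and the choice to encode unseen levels with the prior moments are illustrative and not taken from the paper.

```python
from collections import defaultdict

def fit_beta_encoder(categories, targets, alpha0=1.0, beta0=1.0):
    """Map each category level to the (mean, variance) of its Beta posterior."""
    counts = defaultdict(lambda: [0, 0])  # level -> [positives, negatives]
    for level, y in zip(categories, targets):
        counts[level][0] += int(y == 1)
        counts[level][1] += int(y == 0)

    def moments(a, b):
        mean = a / (a + b)
        var = a * b / ((a + b) ** 2 * (a + b + 1))
        return mean, var

    encoder = {
        level: moments(alpha0 + pos, beta0 + neg)
        for level, (pos, neg) in counts.items()
    }
    prior = moments(alpha0, beta0)  # fallback for levels unseen during fitting
    return encoder, prior

encoder, prior = fit_beta_encoder(["a", "a", "a", "b"], [1, 1, 0, 0])
encoded_unseen = encoder.get("z", prior)
```

Each categorical column thus contributes two numerical columns regardless of how many levels it has, which keeps the final learner's input low-dimensional.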

In this paper, the parameters of the ARMA(p,q) model are estimated with an approximate Bayesian computation (ABC) algorithm. To improve sampling efficiency, ABC should use low-dimensional summary statistics that carry as much information about the parameters as possible. First, we use the first p+q sample autocorrelation coefficients as the statistics and obtain an approximate Bayesian estimate of the AR coefficients, transforming the ARMA(p,q) model into an MA(q) model. Then, using the first q sample autocorrelations and the sample variance as the statistics, we obtain approximate Bayesian estimates of the MA coefficients and the white noise variance. Numerical simulations and an experimental study verify that this method is more accurate and powerful than maximum likelihood estimation.

While the SLIM approach obtained high ranking-accuracy in many experiments in the literature, it is also known for its high computational cost of learning its parameters from data. For this reason, we focus in this paper on variants of high-dimensional regression problems that have closed-form solutions. Moreover, we motivate a re-scaling rather than a re-weighting approach for dealing with biases regarding item-popularities in the data. We also discuss properties of the sparse solution, and outline a computationally efficient approximation. In experiments on three publicly available data sets, we observed not only extremely reduced training times, but also significantly improved ranking accuracy compared to SLIM. Surprisingly, various state-of-the-art models, including deep non-linear autoencoders, were also outperformed on two of the three data sets in our experiments, in particular for recommendations with highly personalized relevance.
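
To illustrate what a closed-form solution looks like for an item-item regression of this kind, the sketch below solves a ridge-regularised problem with a zero-diagonal constraint, so that an item cannot be used to predict itself. It is one well-known closed form of this type and not necessarily the variant studied in the paper; the function name, the regularisation strength `lam`, and the toy matrix are assumptions for this example.

```python
import numpy as np

def closed_form_item_weights(X, lam=1.0):
    """Minimise ||X - X B||_F^2 + lam * ||B||_F^2 subject to diag(B) = 0.

    X is a (users x items) interaction matrix. With P = (X^T X + lam I)^{-1},
    the solution is B[i, j] = -P[i, j] / P[j, j] for i != j.
    """
    gram = X.T @ X
    P = np.linalg.inv(gram + lam * np.eye(gram.shape[0]))
    B = -P / np.diag(P)        # broadcasting divides column j by P[j, j]
    np.fill_diagonal(B, 0.0)   # enforce the zero-diagonal constraint
    return B

# Toy binary interaction matrix: 5 users, 4 items (illustrative data only)
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
], dtype=float)
B = closed_form_item_weights(X, lam=0.5)
scores = X @ B  # predicted item scores per user
```

A single matrix inversion replaces the per-item iterative optimisation, which is where the large reduction in training time would come from in such an approach.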

Convolutional Neural Networks have achieved impressive results in various tasks, but interpreting their internal mechanism is a challenging problem. To tackle this problem, we exploit a multi-channel attention mechanism in feature space. Our network architecture allows us to obtain an attention mask for each feature, while existing CNN visualization methods provide only a common attention mask for all features. We apply the proposed multi-channel attention mechanism to a multi-attribute recognition task. We can obtain a different attention mask for each feature and for each attribute. These analyses give us deeper insight into the feature space of CNNs. The experimental results on the benchmark dataset show that the proposed method gives high interpretability to humans while accurately grasping the attributes of the data.

Graph neural networks, which generalize deep neural network models to graph structured data, have attracted increasing attention in recent years. They usually learn node representations by transforming, propagating and aggregating node features and have been proven to improve the performance of many graph related tasks such as node classification and link prediction. To apply graph neural networks to the graph classification task, approaches that generate the graph representation from node representations are needed. A common way is to globally combine the node representations. However, rich structural information is overlooked. Thus a hierarchical pooling procedure is desired to preserve the graph structure during the graph representation learning. There are some recent works on hierarchically learning graph representation analogous to the pooling step in conventional convolutional neural networks (CNNs). However, the local structural information is still largely neglected during the pooling process. In this paper, we introduce a pooling operator based on the graph Fourier transform, which can utilize the node features and local structures during the pooling process. We then design pooling layers based on the pooling operator, which are further combined with traditional GCN convolutional layers to form a graph neural network framework for graph classification. Theoretical analysis is provided to understand the pooling operator from both local and global perspectives. Experimental results of the graph classification task on commonly used benchmarks demonstrate the effectiveness of the proposed framework.
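
For readers unfamiliar with the graph Fourier transform that the pooling operator builds on, a minimal sketch follows. It uses the combinatorial Laplacian L = D - A and its eigenvectors as the Fourier basis; the function names and the path-graph example are assumptions for illustration, and the paper's pooling operator itself is not reproduced here.

```python
import numpy as np

def graph_fourier_basis(A):
    """Eigen-decompose the combinatorial Laplacian L = D - A of an undirected graph."""
    L = np.diag(A.sum(axis=1)) - A
    eigvals, U = np.linalg.eigh(L)  # eigenvalues ascending, orthonormal eigenvectors
    return eigvals, U

def gft(U, x):
    """Graph Fourier transform: project a node signal onto the Laplacian eigenvectors."""
    return U.T @ x

def igft(U, x_hat):
    """Inverse graph Fourier transform."""
    return U @ x_hat

# Path graph on four nodes: 0 - 1 - 2 - 3
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
eigvals, U = graph_fourier_basis(A)
x = np.array([1.0, 2.0, 0.5, -1.0])  # one scalar feature per node
x_hat = gft(U, x)
```

Small eigenvalues correspond to smooth variation over the graph, so a constant signal has all of its energy in the zero-eigenvalue component.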

Clustering methods have recently attracted ever-increasing attention in learning and vision. Deep clustering combines embedding and clustering to obtain an optimal embedding subspace for clustering, which can be more effective than conventional clustering methods. In this paper, we propose a joint learning framework for discriminative embedding and spectral clustering. We first devise a dual autoencoder network, which enforces the reconstruction constraint for the latent representations and their noisy versions, to embed the inputs into a latent space for clustering. As such, the learned latent representations can be more robust to noise. Then mutual information estimation is utilized to provide more discriminative information from the inputs. Furthermore, a deep spectral clustering method is applied to embed the latent representations into the eigenspace and subsequently cluster them, which can fully exploit the relationship between inputs to achieve optimal clustering results. Experimental results on benchmark datasets show that our method can significantly outperform state-of-the-art clustering approaches.

In this paper, we analyze the inner product of the weight vector and the input vector in neural networks from the perspective of vector orthogonal decomposition and prove that the local direction gradient of the weight vector decreases as the angle between them gets closer to 0 or π. We propose the PR Product, a substitute for the inner product, which makes the local direction gradient of the weight vector independent of the angle and consistently larger than the one in the conventional inner product while keeping the forward propagation identical. As the basic operation in neural networks, the PR Product can be applied to many existing deep learning modules, so we develop the PR Product version of the fully connected layer, convolutional layer, and LSTM layer. In static image classification, the experiments on the CIFAR10 and CIFAR100 datasets demonstrate that the PR Product can robustly enhance the ability of various state-of-the-art classification networks. On the task of image captioning, even without any bells and whistles, our PR Product version of the captioning model can compete with or outperform the state-of-the-art models on the MS COCO dataset.
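
The claim about the conventional inner product can be checked numerically. The gradient of the inner product w·x with respect to w is x; splitting it into a component along w (which changes the length of w) and a component orthogonal to w (which changes its direction) gives a direction component of norm ‖x‖·|sin θ|, which vanishes as θ approaches 0 or π. The sketch below only demonstrates this property of the standard inner product; the PR Product itself is not implemented, and the function name is an assumption for this example.

```python
import numpy as np

def direction_gradient_norm(w, x):
    """Norm of the part of d(w.x)/dw that is orthogonal to w."""
    w_unit = w / np.linalg.norm(w)
    grad = x                              # gradient of the inner product w.r.t. w
    radial = (grad @ w_unit) * w_unit     # component along w (length update)
    tangential = grad - radial            # component orthogonal to w (direction update)
    return np.linalg.norm(tangential)

w = np.array([1.0, 0.0])
angles = [0.1, np.pi / 2, np.pi - 0.1]
norms = [
    direction_gradient_norm(w, 2.0 * np.array([np.cos(t), np.sin(t)]))
    for t in angles
]
```

Here `norms` is largest at a right angle and equally small for the two angles near 0 and π, matching the behaviour described in the abstract.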

In conventional domain adaptation, a critical assumption is that there exists a fully labeled domain (source) that contains the same label space as another unlabeled or scarcely labeled domain (target). However, in the real world, there often exist application scenarios in which both domains are partially labeled and not all classes are shared between these two domains. Thus, it is meaningful to let partially labeled domains learn from each other to classify all the unlabeled samples in each domain under an open-set setting. We consider this problem as weakly supervised open-set domain adaptation. To address this practical setting, we propose the Collaborative Distribution Alignment (CDA) method, which performs knowledge transfer bilaterally and works collaboratively to classify unlabeled data and identify outlier samples. Extensive experiments on the Office benchmark and an application on person reidentification show that our method achieves state-of-the-art performance.

This paper investigates the role of high-dimensional information sets in the context of Markov switching models with time varying transition probabilities. Markov switching models are commonly employed in empirical macroeconomic research and policy work. However, the information used to model the switching process is usually limited drastically to ensure stability of the model. Increasing the number of included variables to enlarge the information set might even result in decreasing precision of the model. Moreover, it is often not clear a priori which variables are actually relevant when it comes to informing the switching behavior. Building strongly on recent contributions in the field of dynamic factor analysis, we introduce a general type of Markov switching autoregressive models for non-linear time series analysis. Large numbers of time series are allowed to inform the switching process through a factor structure. This factor-augmented Markov switching (FAMS) model overcomes estimation issues that are likely to arise in previous assessments of the modeling framework. This results in more accurate estimates of the switching behavior as well as improved model fit. The performance of the FAMS model is illustrated in a simulated data example as well as in a US business cycle application.

Testing of deep learning models is challenging due to the excessive number and complexity of computations involved. As a result, test data selection is performed manually and in an ad hoc way. This raises the question of how we can automatically select candidate test data to test deep learning models. Recent research has focused on adapting test selection metrics from code-based software testing (such as coverage) to deep learning. However, deep learning models have different attributes from code, such as the spread of computations across the entire network reflecting training data properties, balance of neuron weights and redundancy (use of many more neurons than needed). Such differences make code-based metrics inappropriate for selecting data that can challenge the models (that is, trigger misclassification). We thus propose a set of test selection metrics based on the notion of model uncertainty (model confidence in specific inputs). Intuitively, the more uncertain we are about a candidate sample, the more likely it is that this sample triggers a misclassification. Similarly, the samples for which we are the most uncertain are the most informative and should be used to improve the model by retraining. We evaluate these metrics on two widely used image classification problems involving real and artificial (adversarial) data. We show that uncertainty-based metrics have a strong ability to select data that are misclassified and lead to major improvement in classification accuracy during retraining: up to 80% more gain than random selection and other state-of-the-art metrics on one dataset and up to 29% on the other.
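
A minimal sketch of uncertainty-based test selection is given below. It ranks candidate inputs by the entropy of the model's predicted class probabilities and keeps the top k. Predictive entropy is used here as one common uncertainty measure and is an assumption for this example; the abstract does not specify which metrics the paper proposes, and the function names are illustrative.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Entropy of each row of class probabilities (higher means more uncertain)."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_most_uncertain(probs, k):
    """Indices of the k candidate inputs with the highest predictive entropy."""
    scores = predictive_entropy(probs)
    return np.argsort(-scores)[:k]

# Softmax outputs for three hypothetical candidate inputs
probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # nearly uniform, most uncertain
    [0.60, 0.30, 0.10],  # moderately uncertain
]
selected = select_most_uncertain(probs, k=2)
```

The selected inputs would then be labelled and used either as test cases likely to expose misclassifications or as additional data for retraining.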

Understanding why machine learning algorithms may fail is usually the task of a human expert who uses domain knowledge and contextual information to discover systematic shortcomings in either the data or the algorithm. In this paper, we propose a semantic referee, which is able to extract qualitative features of the errors emerging from deep machine learning frameworks and suggest corrections. The semantic referee relies on ontological reasoning about spatial knowledge in order to characterize errors in terms of their spatial relations with the environment. Using semantics, the reasoner interacts with the learning algorithm as a supervisor. We show how the proposed interaction between a neural network classifier and a semantic referee improves the performance of semantic segmentation for satellite imagery data.

Recent literature suggests that averaged word vectors followed by simple post-processing outperform many deep learning methods on semantic textual similarity tasks. Furthermore, when averaged word vectors are trained supervised on large corpora of paraphrases, they achieve state-of-the-art results on standard STS benchmarks. Inspired by these insights, we push the limits of word embeddings even further. We propose a novel fuzzy bag-of-words (FBoW) representation for text that contains all the words in the vocabulary simultaneously but with different degrees of membership, which are derived from similarities between word vectors. We show that max-pooled word vectors are only a special case of fuzzy BoW and should be compared via fuzzy Jaccard index rather than cosine similarity. Finally, we propose DynaMax, a completely unsupervised and non-parametric similarity measure that dynamically extracts and max-pools good features depending on the sentence pair. This method is both efficient and easy to implement, yet outperforms current baselines on STS tasks by a large margin and is even competitive with supervised word vectors trained to directly optimise cosine similarity.
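
The fuzzy Jaccard index and the dynamic max-pooling idea can be sketched as follows. The sketch assumes that the feature universe for a sentence pair is formed by stacking the word vectors of both sentences, that memberships are dot-product similarities clipped at zero, and that each sentence is max-pooled over its words; these details, and the function names, are assumptions for this example rather than specifics stated in the abstract.

```python
import numpy as np

def fuzzy_jaccard(x, y):
    """Fuzzy Jaccard index of two non-negative membership vectors."""
    union = np.maximum(x, y).sum()
    return float(np.minimum(x, y).sum() / union) if union > 0 else 0.0

def dynamax_jaccard(W1, W2):
    """Similarity of two sentences given as (num_words x dim) word-vector matrices."""
    universe = np.vstack([W1, W2])  # features depend on the sentence pair
    # Membership of each sentence in each feature: max over its words, clipped at 0
    x = np.maximum((W1 @ universe.T).max(axis=0), 0.0)
    y = np.maximum((W2 @ universe.T).max(axis=0), 0.0)
    return fuzzy_jaccard(x, y)

# Toy two-dimensional word vectors (illustrative data only)
sent_a = np.array([[1.0, 0.0], [0.5, 0.5]])
sent_b = np.array([[0.0, 1.0]])
sim_same = dynamax_jaccard(sent_a, sent_a)
sim_diff = dynamax_jaccard(np.array([[1.0, 0.0]]), sent_b)
```

No parameters are trained here: the only inputs are pre-trained word vectors, which is consistent with the description of the measure as unsupervised and non-parametric.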

Learning causal effects from observational data greatly benefits a variety of domains such as healthcare, education and sociology. For instance, one could estimate the impact of a policy to decrease the unemployment rate. The central problem for causal effect inference is dealing with the unobserved counterfactuals and treatment selection bias. The state-of-the-art approaches focus on solving these problems by balancing the treatment and control groups. However, during the learning and balancing process, highly predictive information from the original covariate space might be lost. In order to build more robust estimators, we tackle this information loss problem by presenting a method called Adversarial Balancing-based representation learning for Causal Effect Inference (ABCEI), based on recent advances in deep learning. ABCEI uses adversarial learning to balance the distributions of the treatment and control groups in the latent representation space, without any assumption on the form of the treatment selection/assignment function. ABCEI preserves useful information for predicting causal effects under the regularization of a mutual information estimator. We conduct various experiments on several synthetic and real-world datasets. The experimental results show that ABCEI is robust against treatment selection bias, and matches or outperforms the state-of-the-art approaches.