Identification of Markov Jump Autoregressive Processes from Large Noisy Data Sets

This paper introduces a novel methodology for identifying the switching dynamics of switched autoregressive linear models. Switching behavior is assumed to follow a Markov model. The system’s outputs are contaminated by possibly large values of measurement noise. Although the procedure can handle other noise distributions, for simplicity the noise is assumed to be Normal with unknown variance. Given noisy input-output data, we aim to identify the switched system coefficients, the parameters of the noise distribution, the switching dynamics, and the probability transition matrix of the Markovian model. System dynamics are estimated using previous results that exploit algebraic constraints which system trajectories have to satisfy. Switching dynamics are computed by solving a maximum likelihood estimation problem. The efficiency of the proposed approach is shown with several academic examples. Although the noise-to-output ratio can be high, the method is shown to be extremely effective when a large number of measurements is available.
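
As a rough illustration of the setting (not the authors' procedure), the sketch below simulates a first-order switched autoregressive process whose mode follows a Markov chain, adds Normal measurement noise, and then estimates the transition matrix by normalized transition counts. The count-based estimator is the maximum likelihood estimate only under the simplifying assumption that the mode sequence is observed, which the paper does not assume. The function names, coefficients, and transition probabilities are all hypothetical.

```python
import random

def simulate_markov_jump_ar(coeffs, trans, noise_std, n, seed=0):
    """Simulate a first-order switched AR process with Markov mode switching.

    coeffs[m] is the AR(1) coefficient of mode m (illustrative values);
    trans[i][j] is the probability of jumping from mode i to mode j.
    """
    rng = random.Random(seed)
    mode, x = 0, 0.0
    modes, outputs = [], []
    for _ in range(n):
        # Draw the next mode from the current row of the transition matrix.
        mode = rng.choices(range(len(trans)), weights=trans[mode])[0]
        # Noise-free state driven by a random input.
        x = coeffs[mode] * x + rng.gauss(0.0, 1.0)
        modes.append(mode)
        # The measured output is the state plus Normal measurement noise.
        outputs.append(x + rng.gauss(0.0, noise_std))
    return modes, outputs

def estimate_transition_matrix(modes, n_modes):
    """Normalized transition counts (the MLE when the modes are observed)."""
    counts = [[0] * n_modes for _ in range(n_modes)]
    for a, b in zip(modes, modes[1:]):
        counts[a][b] += 1
    return [[c / max(sum(row), 1) for c in row] for row in counts]

true_trans = [[0.9, 0.1], [0.2, 0.8]]
modes, outputs = simulate_markov_jump_ar([0.5, -0.5], true_trans, 0.5, 20000)
P_hat = estimate_transition_matrix(modes, 2)
```

With many samples the rows of `P_hat` approach those of `true_trans`, which mirrors the abstract's point that a large number of measurements helps; recovering the modes from the noisy outputs is the hard part that this sketch sidesteps.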

Eigenvalue and Generalized Eigenvalue Problems: Tutorial

This paper is a tutorial on eigenvalue and generalized eigenvalue problems. We first introduce the eigenvalue problem, eigen-decomposition (spectral decomposition), and the generalized eigenvalue problem. Then, we describe the optimization problems that lead to eigenvalue and generalized eigenvalue problems. We also provide examples from machine learning, including principal component analysis, kernel supervised principal component analysis, and Fisher discriminant analysis, which result in eigenvalue and generalized eigenvalue problems. Finally, we introduce the solutions to both eigenvalue and generalized eigenvalue problems.
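
To make the two problems concrete, here is a minimal sketch, not taken from the tutorial, that approximates the dominant eigenpair of a small symmetric matrix by power iteration and then solves a generalized problem A v = λ B v. It assumes B is diagonal with positive entries, so the generalized problem reduces to an ordinary one for B^(-1/2) A B^(-1/2). The matrices and function names are illustrative choices.

```python
import math

def matvec(A, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def power_iteration(A, iters=200):
    """Approximate the dominant eigenpair of a symmetric matrix."""
    v = [1.0] + [0.0] * (len(A) - 1)
    for _ in range(iters):
        w = matvec(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v; v has unit norm here.
    lam = sum(x * y for x, y in zip(v, matvec(A, v)))
    return lam, v

# Ordinary eigenvalue problem: A v = lam v.
A = [[2.0, 1.0], [1.0, 2.0]]
lam, v = power_iteration(A)

# Generalized problem A v = lam B v with B = diag(b), b > 0 (an assumption).
# Substituting v = B^(-1/2) u gives the ordinary problem C u = lam u.
b = [1.0, 4.0]
C = [[A[i][j] / math.sqrt(b[i] * b[j]) for j in range(2)] for i in range(2)]
gen_lam, u = power_iteration(C)
gen_v = [ui / math.sqrt(bi) for ui, bi in zip(u, b)]
```

Power iteration finds only the largest eigenvalue and needs a gap between the top two; a full decomposition would normally come from a linear algebra library.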

Data Science and Digital Systems: The 3Ds of Machine Learning Systems Design

Machine learning solutions, in particular those based on deep learning methods, form an underpinning of the current revolution in ‘artificial intelligence’ that has dominated popular press headlines and is having a significant influence on the wider tech agenda. Here we give an overview of the 3Ds of ML systems design: Data, Design and Deployment. By considering the 3Ds we can move towards ‘data first’ design.

Cross-Modal Data Programming Enables Rapid Medical Machine Learning

Labeling training datasets has become a key barrier to building medical machine learning models. One strategy is to generate training labels programmatically, for example by applying natural language processing pipelines to text reports associated with imaging studies. We propose cross-modal data programming, which generalizes this intuitive strategy in a theoretically-grounded way that enables simpler, clinician-driven input, reduces required labeling time, and improves with additional unlabeled data. In this approach, clinicians generate training labels for models defined over a target modality (e.g. images or time series) by writing rules over an auxiliary modality (e.g. text reports). The resulting technical challenge consists of estimating the accuracies and correlations of these rules; we extend a recent unsupervised generative modeling technique to handle this cross-modal setting in a provably consistent way. Across four applications in radiography, computed tomography, and electroencephalography, and using only several hours of clinician time, our approach matches or exceeds the efficacy of physician-months of hand-labeling with statistical significance, demonstrating a fundamentally faster and more flexible way of building machine learning models in medicine.

Privacy-preserving Active Learning on Sensitive Data for User Intent Classification

Active learning holds the promise of significantly reducing data annotation costs while maintaining reasonable model performance. However, it requires sending data to annotators for labeling. This presents a possible privacy leak when the training set includes sensitive user data. In this paper, we describe an approach for carrying out privacy-preserving active learning with quantifiable guarantees. We evaluate our approach by showing the tradeoff between privacy, utility and annotation budget on a binary classification task in an active learning setting.

SUSI: Supervised Self-Organizing Maps for Regression and Classification in Python

In many research fields, the sizes of the existing datasets vary widely. Hence, there is a need for machine learning techniques which are well-suited for these different datasets. One possible technique is the self-organizing map (SOM), a type of artificial neural network which is, so far, weakly represented in the field of machine learning. The SOM’s unique characteristic is the neighborhood relationship of the output neurons. This relationship improves the ability of generalization on small datasets. SOMs are mostly applied in unsupervised learning, and few studies focus on using SOMs as a supervised learning approach. Furthermore, no appropriate SOM package is available with respect to machine learning standards and in the widely used programming language Python. In this paper, we introduce the freely available SUpervised Self-organIzing maps (SUSI) Python package which performs supervised regression and classification. The implementation of SUSI is described with respect to the underlying mathematics. Then, we present first evaluations of the SOM for regression and classification datasets from two different domains of geospatial image analysis. Despite the early stage of its development, the SUSI framework performs well and is characterized by only small performance differences between the training and the test datasets. A comparison of the SUSI framework with existing Python and R packages demonstrates the importance of the SUSI framework. In future work, the SUSI framework will be extended, optimized and upgraded, e.g., with tools to better understand and visualize the input data as well as the handling of missing and incomplete data.
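
The neighborhood relationship mentioned above can be sketched with a single update step of a generic, unsupervised Kohonen-style SOM. This is not the SUSI API or its supervised extension; the grid layout, Gaussian neighborhood, learning rate, and function names are assumptions made for illustration.

```python
import math

def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda pos: sum((w - xi) ** 2 for w, xi in zip(weights[pos], x)),
    )

def som_update(weights, x, learning_rate=0.5, radius=1.0):
    """Pull every node towards x, scaled by its grid distance to the BMU."""
    bmu = best_matching_unit(weights, x)
    for pos, w in weights.items():
        grid_dist2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
        # Gaussian neighborhood: 1 at the BMU, decaying with grid distance.
        h = math.exp(-grid_dist2 / (2 * radius ** 2))
        weights[pos] = [wi + learning_rate * h * (xi - wi) for wi, xi in zip(w, x)]
    return bmu

# A 2x2 grid of output neurons with two-dimensional weight vectors.
weights = {
    (0, 0): [0.0, 0.0], (0, 1): [1.0, 0.0],
    (1, 0): [0.0, 1.0], (1, 1): [1.0, 1.0],
}
bmu = som_update(weights, [0.9, 0.1])
```

Because neighbors of the best matching unit also move, nearby neurons end up representing similar inputs, which is the property the abstract credits for generalization on small datasets.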

Weighted Multisource Tradaboost

In this paper we propose an improved method for transfer learning that takes into account the balance between target and source data. This method builds on the state-of-the-art Multisource Tradaboost, but weights the importance of each datapoint according to the amount of target and source data available. A comparative study is then presented, comparing the performance of four transfer learning methods with that of the proposed Weighted Multisource Tradaboost. The experimental results show that the proposed method is able to outperform the base method as the number of target samples increases. These results are promising in the sense that source-target ratio weighting may be a path to improving current methods of transfer learning. However, against the asymptotic conjecture, all transfer learning methods tested in this work are outperformed by a no-transfer SVM for a large number of target samples.

An Example-Driven Introduction to Data Analytics on Graphs

Graphs are irregular structures which naturally account for data integrity; however, traditional approaches have been established outside Signal Processing, and largely focus on analyzing the underlying graphs rather than signals on graphs. Given the rapidly increasing availability of multisensor and multinode measurements, likely recorded on irregular or ad-hoc grids, it would be extremely advantageous to analyze such structured data as graph signals and thus benefit from the ability of graphs to incorporate spatial awareness of the sensing locations, sensor importance, and local versus global sensor association. The aim of this lecture note is therefore to establish a common language between graph signals, defined on irregular signal domains, and some of the most fundamental paradigms in DSP, such as spectral analysis of multichannel signals, system transfer function, digital filter design, parameter estimation, and optimal filter design. This is achieved through a physically meaningful and intuitive real-world example of geographically distributed multisensor temperature estimation. A similar spatial multisensor arrangement is already widely used in Signal Processing curricula to introduce minimum variance estimators and Kalman filters \cite{HM}, and by adopting this framework we facilitate a seamless integration of graph theory into the curriculum of existing DSP courses. By bridging the gap between standard approaches and graph signal processing, we also show that standard methods can be thought of as special cases of their graph counterparts, evaluated on linear graphs. It is hoped that our approach will not only help to demystify graph theoretic approaches in education and research but also empower practitioners to explore a whole host of otherwise prohibitive modern applications.
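
One way to see the claim about linear graphs is the following sketch, which is an illustration rather than an example from the lecture note. It builds the Laplacian L = D - A of an undirected path graph and applies it to a signal; at interior vertices the result is the negated second difference familiar from classical DSP. The function names and the test signal are assumptions.

```python
def path_graph_laplacian(n):
    """Laplacian L = D - A of an undirected path graph with n vertices."""
    L = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        # Each edge (i, i+1) adds 1 to both degrees and -1 off the diagonal.
        L[i][i] += 1
        L[i + 1][i + 1] += 1
        L[i][i + 1] -= 1
        L[i + 1][i] -= 1
    return L

def apply_operator(L, x):
    """Apply the graph operator L to the graph signal x."""
    return [sum(l * xi for l, xi in zip(row, x)) for row in L]

# Signal x[t] = t^2, whose second difference is constant and equal to 2.
signal = [t * t for t in range(6)]
Lx = apply_operator(path_graph_laplacian(6), signal)
# Interior entries: 2*x[t] - x[t-1] - x[t+1] = -2.
```

On an irregular graph the same operator still applies, only with a different adjacency pattern, which is the sense in which the classical case is a special case.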

ner and pos when nothing is capitalized

For those languages which use it, capitalization is an important signal for the fundamental NLP tasks of Named Entity Recognition (NER) and Part of Speech (POS) tagging. In fact, it is such a strong signal that model performance on these tasks drops sharply in common lowercased scenarios, such as noisy web text or machine translation outputs. In this work, we perform a systematic analysis of solutions to this problem, modifying only the casing of the train or test data using lowercasing and truecasing methods. While prior work and first impressions might suggest training a caseless model, or using a truecaser at test time, we show that the most effective strategy is a concatenation of cased and lowercased training data, producing a single model with high performance on both cased and uncased text. As shown in our experiments, this result holds across tasks and input representations. Finally, we show that our proposed solution gives an 8% F1 improvement in mention detection on noisy out-of-domain Twitter data.
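
The strategy of concatenating cased and lowercased training data can be sketched as a simple augmentation step. This is a minimal illustration assuming training data stored as (tokens, tags) pairs; the data layout, the function name, and the example sentence are hypothetical and not taken from the paper.

```python
def augment_with_lowercase(sentences):
    """Return the cased data followed by a lowercased copy with the same tags."""
    lowered = [
        ([tok.lower() for tok in tokens], list(tags))
        for tokens, tags in sentences
    ]
    return list(sentences) + lowered

train = [(["Alice", "visited", "Paris"], ["B-PER", "O", "B-LOC"])]
augmented = augment_with_lowercase(train)
```

A single model trained on `augmented` sees each sentence in both forms, so the labels are unchanged while the casing signal becomes optional rather than required.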

Feature Selection for Data Integration with Mixed Multi-view Data

Data integration methods that analyze multiple sources of data simultaneously can often provide more holistic insights than can separate inquiries of each data source. Motivated by the advantages of data integration in the era of ‘big data’, we investigate feature selection for high-dimensional multi-view data with mixed data types (e.g. continuous, binary, count-valued). This heterogeneity of multi-view data poses numerous challenges for existing feature selection methods. However, after critically examining these issues through empirical and theoretically-guided lenses, we develop a practical solution, the Block Randomized Adaptive Iterative Lasso (B-RAIL), which combines the strengths of the randomized Lasso, adaptive weighting schemes, and stability selection. B-RAIL serves as a versatile data integration method for sparse regression and graph selection, and we demonstrate the effectiveness of B-RAIL through extensive simulations and a case study to infer the ovarian cancer gene regulatory network. In this case study, B-RAIL successfully identifies well-known biomarkers associated with ovarian cancer and hints at novel candidates for future ovarian cancer research.

Small Data Challenges in Big Data Era: A Survey of Recent Progress on Unsupervised and Semi-Supervised Methods

Small data challenges have emerged in many learning problems, since the success of deep neural networks often relies on the availability of a huge amount of labeled data that is expensive to collect. To address this, many efforts have been made to train complex models with small data in an unsupervised and semi-supervised fashion. In this paper, we will review the recent progress on these two major categories of methods. A wide spectrum of small data models will be categorized in a big picture, where we will show how they interplay with each other to motivate explorations of new ideas. We will review the criteria of learning the transformation equivariant, disentangled, self-supervised and semi-supervised representations, which underpin the foundations of recent developments. Many instantiations of unsupervised and semi-supervised generative models have been developed on the basis of these criteria, greatly expanding the territory of existing autoencoders, generative adversarial nets (GANs) and other deep networks by exploring the distribution of unlabeled data for more powerful representations. While we focus on the unsupervised and semi-supervised methods, we will also provide a broader review of other emerging topics, from unsupervised and semi-supervised domain adaptation to the fundamental roles of transformation equivariance and invariance in training a wide spectrum of deep networks. It is impossible for us to write an exhaustive encyclopedia to include all related works. Instead, we aim at exploring the main ideas, principles and methods in this area to reveal where we are heading on the journey towards addressing the small data challenges in this big data era.

Graph Convolution for Multimodal Information Extraction from Visually Rich Documents

Visually rich documents (VRDs) are ubiquitous in daily business and life. Examples are purchase receipts, insurance policy documents, customs declaration forms and so on. In VRDs, visual and layout information is critical for document understanding, and texts in such documents cannot be serialized into a one-dimensional sequence without losing information. Classic information extraction models such as BiLSTM-CRF typically operate on text sequences and do not incorporate visual features. In this paper, we introduce a graph convolution based model to combine textual and visual information presented in VRDs. Graph embeddings are trained to summarize the context of a text segment in the document, and further combined with text embeddings for entity extraction. Extensive experiments have been conducted to show that our method outperforms BiLSTM-CRF baselines by significant margins on two real-world datasets. Ablation studies are also performed to evaluate the effectiveness of each component of our model.

Introduction to Dynamic Linear Models for Time Series Analysis

Dynamic linear models (DLM) offer a very generic framework to analyse time series data. Many classical time series models can be formulated as DLMs, including ARMA models and standard multiple linear regression models. The models can be seen as general regression models where the coefficients can vary in time. In addition, they allow for a state space representation and a formulation as hierarchical statistical models, which in turn is the key for efficient estimation by Kalman formulas and by Markov chain Monte Carlo (MCMC) methods. A dynamic linear model can handle non-stationary processes, missing values and non-uniform sampling as well as observations with varying accuracies. This chapter gives an introduction to DLM and shows how to build various useful models for analysing trends and other sources of variability in geodetic time series.
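
A minimal sketch of the simplest DLM, the local level model y_t = x_t + v_t with x_t = x_(t-1) + w_t, filtered with the scalar Kalman recursions, may help fix ideas. It is not the chapter's code; the variances, the diffuse prior, the function name, and the use of `None` to mark a missing value are assumptions chosen for illustration.

```python
def local_level_filter(ys, obs_var, state_var, m0=0.0, p0=1e6):
    """Kalman filter for the local level model.

    ys may contain None for missing observations; m0 and p0 are the prior
    mean and variance of the level (p0 large, i.e. nearly uninformative).
    """
    m, p = m0, p0
    means, variances = [], []
    for y in ys:
        # Prediction step: the level follows a random walk.
        p = p + state_var
        if y is not None:
            # Update step: blend prediction and observation via the gain.
            k = p / (p + obs_var)
            m = m + k * (y - m)
            p = (1.0 - k) * p
        # For a missing value the prediction is kept and uncertainty grows.
        means.append(m)
        variances.append(p)
    return means, variances

ys = [1.0, None, 1.0, 1.0]
means, variances = local_level_filter(ys, obs_var=1.0, state_var=0.1)
```

Passing a different `obs_var` at each step would be one way to represent observations with varying accuracies; this sketch keeps it constant for brevity.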

Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools

Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-art results in various domains such as image recognition and natural language processing. One of the reasons for this success is the increasing size of DL models and the availability of vast amounts of training data. To keep improving the performance of DL, increasing the scalability of DL systems is necessary. In this survey, we perform a broad and thorough investigation of challenges, techniques and tools for scalable DL on distributed infrastructures. This incorporates infrastructures for DL, methods for parallel DL training, multi-tenant resource scheduling and the management of training and model data. Further, we analyze and compare 11 current open-source DL frameworks and tools and investigate which of the techniques are commonly implemented in practice. Finally, we highlight trends in DL systems that deserve further research.

Multilevel Text Normalization with Sequence-to-Sequence Networks and Multisource Learning

We define multilevel text normalization as sequence-to-sequence processing that transforms naturally noisy text into a sequence of normalized units of meaning (morphemes) in three steps: 1) writing normalization, 2) lemmatization, 3) canonical segmentation. These steps are traditionally considered separate NLP tasks, with diverse solutions, evaluation schemes and data sources. We exploit the fact that all these tasks involve sub-word sequence-to-sequence transformation to propose a systematic solution for all of them using neural encoder-decoder technology. The specific challenge that we tackle in this paper is integrating the traditional know-how on separate tasks into the neural sequence-to-sequence framework to improve the state of the art. We address this challenge by enriching the general framework with mechanisms that allow processing the information on multiple levels of text organization (characters, morphemes, words, sentences) in combination with structural information (multilevel language model, part-of-speech) and heterogeneous sources (text, dictionaries). We show that our solution consistently improves on the current methods in all three steps. In addition, we analyze the performance of our system to show the specific contribution of the integrating components to the overall improvement.

Diversity with Cooperation: Ensemble Methods for Few-Shot Classification

Few-shot classification consists of learning a predictive model that is able to effectively adapt to a new class, given only a few annotated samples. To solve this challenging problem, meta-learning has become a popular paradigm that advocates the ability to ‘learn to adapt’. Recent works have shown, however, that simple learning strategies without meta-learning could be competitive. In this paper, we go a step further and show that by addressing the fundamental high-variance issue of few-shot learning classifiers, it is possible to significantly outperform current meta-learning techniques. Our approach consists of designing an ensemble of deep networks to leverage the variance of the classifiers, and introducing new strategies to encourage the networks to cooperate, while encouraging prediction diversity. Evaluation is conducted on the mini-ImageNet and CUB datasets, where we show that even a single network obtained by distillation yields state-of-the-art results.

Multi-agent Gradient Descent with A Protocol

This essay gives a short introduction to the multi-agent gradient descent method with a protocol. Compared with most existing literature on gradient-based methods, this essay explores a new way to do global optimization, i.e., multiple agents with a certain communication protocol are used in the descent process.

Analyzing Knowledge Graph Embedding Methods from a Multi-Embedding Interaction Perspective

The knowledge graph is a popular format for representing knowledge, with many applications to semantic search engines, question-answering systems, and recommender systems. Real-world knowledge graphs are usually incomplete, so knowledge graph embedding methods, such as Canonical decomposition/Parallel factorization (CP), DistMult, and ComplEx, have been proposed to address this issue. These methods represent entities and relations as embedding vectors in semantic space and predict the links between them. The embedding vectors themselves contain rich semantic information and can be used in other applications such as data analysis. However, the mechanisms in these models and the embedding vectors themselves vary greatly, making it difficult to understand and compare them. Given this lack of understanding, we risk using them ineffectively or incorrectly, particularly for complicated models, such as CP, with two role-based embedding vectors, or the state-of-the-art ComplEx model, with complex-valued embedding vectors. In this paper, we propose a multi-embedding interaction mechanism as a new approach to uniting and generalizing these models. We derive them theoretically via this mechanism and provide empirical analyses and comparisons between them. We also propose a new multi-embedding model based on quaternion algebra and show that it achieves promising results using popular benchmarks.
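
For orientation, the commonly stated score functions of the three named models can be sketched as below. This is a sketch of the standard definitions, not of the paper's multi-embedding mechanism or its quaternion model; the function names and the toy vectors are hypothetical.

```python
def distmult_score(h, r, t):
    """DistMult: trilinear product of real head, relation and tail vectors."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

def cp_score(h_as_head, r, t_as_tail):
    """CP: the same trilinear product, but each entity has two role-based
    vectors (one used as head, one used as tail)."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h_as_head, r, t_as_tail))

def complex_score(h, r, t):
    """ComplEx: real part of the trilinear product with the tail conjugated."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real

# DistMult is symmetric in head and tail.
h, r, t = [1.0, 2.0], [0.5, -1.0], [3.0, 1.0]
sym_forward = distmult_score(h, r, t)
sym_backward = distmult_score(t, r, h)

# ComplEx can score (h, r, t) and (t, r, h) differently.
hc, rc, tc = [1 + 1j], [1j], [1 - 1j]
asym_forward = complex_score(hc, rc, tc)
asym_backward = complex_score(tc, rc, hc)
```

The contrast between the symmetric and asymmetric cases is one concrete way in which the models' mechanisms differ, which is the kind of difference the abstract sets out to unify.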

Towards causally interpretable meta-analysis: transporting inferences from multiple studies to a target population

We take steps towards causally interpretable meta-analysis by describing conditions under which we can transport causal inferences from a collection of randomized trials to a new target population. We discuss the conditions that allow the identification of causal quantities in the target population and provide identification results for potential (counterfactual) outcome means and average treatment effects. Our results highlight the importance of accounting for variation in the treatment assignment mechanisms across the randomized trials when transporting inferences. Last, we propose estimators of the potential outcome means that rely on different working models and provide code for their implementation in statistical software.

Text Processing Like Humans Do: Visually Attacking and Shielding NLP Systems

Visual modifications to text are often used to obfuscate offensive comments in social media (e.g., ‘!d10t’) or as a writing style (‘1337’ in ‘leet speak’), among other scenarios. We consider this as a new type of adversarial attack in NLP, a setting to which humans are very robust, as our experiments with both simple and more difficult visual input perturbations demonstrate. We then investigate the impact of visual adversarial attacks on current NLP systems on character-, word-, and sentence-level tasks, showing that both neural and non-neural models are, in contrast to humans, extremely sensitive to such attacks, suffering performance decreases of up to 82%. We then explore three shielding methods (visual character embeddings, adversarial training, and rule-based recovery), which substantially improve the robustness of the models. However, the shielding methods still fall behind performances achieved in non-attack scenarios, which demonstrates the difficulty of dealing with visual attacks.
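
A toy version of a visual perturbation and a rule-based recovery might look like the sketch below. The character table is a hypothetical four-entry mapping chosen so that ‘leet’ becomes ‘1337’; the paper's actual perturbation and recovery procedures are not described in the abstract and are not reproduced here.

```python
# Hypothetical look-alike table; real attacks draw on far larger character sets.
LEET = {"l": "1", "e": "3", "t": "7", "o": "0"}
RECOVER = {v: k for k, v in LEET.items()}

def perturb(text):
    """Replace each mapped character with its visually similar substitute."""
    return "".join(LEET.get(c, c) for c in text)

def recover(text):
    """Rule-based recovery: invert the table character by character.

    This also rewrites digits that were genuinely part of the input, which is
    one reason such rules cannot fully restore clean-text performance.
    """
    return "".join(RECOVER.get(c, c) for c in text)

attacked = perturb("leet")
restored = recover(attacked)
```

The ambiguity noted in the comment is a simple instance of why the shielding methods lag behind the non-attack setting.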

An Alternating Manifold Proximal Gradient Method for Sparse PCA and Sparse CCA

Sparse principal component analysis (PCA) and sparse canonical correlation analysis (CCA) are two essential techniques from high-dimensional statistics and machine learning for analyzing large-scale data. Both problems can be formulated as an optimization problem with a nonsmooth objective and nonconvex constraints. Since nonsmoothness and nonconvexity bring numerical difficulties, most algorithms suggested in the literature either solve some relaxations or are heuristic and lack convergence guarantees. In this paper, we propose a new alternating manifold proximal gradient method to solve these two high-dimensional problems and provide a unified convergence analysis. Numerical results are reported to demonstrate the advantages of our algorithm.