Neural networks have been shown to improve performance across a range of natural-language tasks. However, designing and training them can be complicated. Frequently, researchers resort to repeated experimentation to pick optimal settings. In this paper, we address the issue of choosing the correct number of units in hidden layers. We introduce a method for automatically adjusting network size by pruning out hidden units through $\ell_{\infty,1}$ and $\ell_{2,1}$ regularization. We apply this method to language modeling and demonstrate its ability to correctly choose the number of hidden units while maintaining perplexity. We also include these models in a machine translation decoder and show that these smaller neural models maintain the significant improvements of their unpruned versions.
We address the problem of compressed sensing with Multiple Measurement Vectors (MMVs) when the structure of sparse vectors in different channels depend on each other. ‘The sparse vectors are not necessarily joint sparse’. We capture this dependency by computing the conditional probability of each entry of each sparse vector to be non-zero given ‘residuals’ of all previous sparse vectors. To compute these probabilities, we propose to use Long Short-Term Memory (LSTM) [1], a bottom up data driven model for sequence modelling. To compute model parameters we minimize a cross entropy cost function. We propose a greedy solver that uses above probabilities at the decoder. By performing extensive experiments on two real world datasets, we show that the proposed method significantly outperforms general MMV solver Simultaneous Orthogonal Matching Pursuit (SOMP) and model based Bayesian methods including Multitask Compressive Sensing [2] and Sparse Bayesian Learning for Temporally Correlated Sources [3]. Nevertheless, we emphasize that the proposed method is a data driven method where availability of training data is important. However, in many applications, train data is indeed available, e.g., recorded images or video.
Regularisation of deep neural networks (DNN) during training is critical to performance. By far the most popular method is known as dropout. Here, cast through the prism of signal processing theory, we compare and contrast the regularisation effects of dropout with those of dither. We illustrate some serious inherent limitations of dropout and demonstrate that dither provides a far more effective regulariser which does not suffer from the same limitations.
Analysis of sequential event data has been recognized as one of the essential tools in data modeling and analysis field. In this paper, after the examination of its technical requirements and issues to model complex but practical situation, we propose a new sequential data model, dubbed Duration and Interval Hidden Markov Model (DI-HMM), that efficiently represents ‘state duration’ and ‘state interval’ of data events. This has significant implications to play an important role in representing practical time-series sequential data. This eventually provides an efficient and flexible sequential data retrieval. Numerical experiments on synthetic and real data demonstrate the efficiency and accuracy of the proposed DI-HMM.
Deploying, configuring, and managing large clusters is very a demanding and cumbersome task due to the complexity of such systems and the variety of skills needed. One needs to perform low-level configuration of the cluster nodes to ensure their interoperability and connectivity, as well as install, configure and provision the needed services. In this paper we address this problem and demonstrate how to build a Big Data analytic platform on Amazon EC2 in a matter of minutes. Moreover, to use our tool, embedded into a public Amazon Machine Image, the user does not need to be an expert in system administration or Big Data service configuration. Our tool dramatically reduces the time needed to provision clusters, as well as the cost of the infrastructure. Researchers enjoy an additional benefit of having a simple way to specify the experimental environments they use, so that their experiments can be easily reproduced by anyone using our tool.
Graphical models use the intuitive and well-studied methods of graph theory to implicitly represent dependencies between variables in large systems. They can model the global behaviour of a complex system by specifying only local factors. This thesis studies inference in discrete graphical models from an algebraic perspective and the ways inference can be used to express and approximate NP-hard combinatorial problems. We investigate the complexity and reducibility of various inference problems, in part by organizing them in an inference hierarchy. We then investigate tractable approximations for a subset of these problems using distributive law in the form of message passing. The quality of the resulting message passing procedure, called Belief Propagation (BP), depends on the influence of loops in the graphical model. We contribute to three classes of approximations that improve BP for loopy graphs A) loop correction techniques; B) survey propagation, another message passing technique that surpasses BP in some settings; and C) hybrid methods that interpolate between deterministic message passing and Markov Chain Monte Carlo inference. We then review the existing message passing solutions and provide novel graphical models and inference techniques for combinatorial problems under three broad classes: A) constraint satisfaction problems such as satisfiability, coloring, packing, set / clique-cover and dominating / independent set and their optimization counterparts; B) clustering problems such as hierarchical clustering, K-median, K-clustering, K-center and modularity optimization; C) problems over permutations including assignment, graph morphisms and alignment, finding symmetries and traveling salesman problem. In many cases we show that message passing is able to find solutions that are either near optimal or favourably compare with today’s state-of-the-art approaches.
We consider the problem of identifying patterns in a data set that exhibit anomalous behavior, often referred to as anomaly detection. Similarity-based anomaly detection algorithms detect abnormally large amounts of similarity or dissimilarity, e.g.~as measured by nearest neighbor Euclidean distances between a test sample and the training samples. In many application domains there may not exist a single dissimilarity measure that captures all possible anomalous patterns. In such cases, multiple dissimilarity measures can be defined, including non-metric measures, and one can test for anomalies by scalarizing using a non-negative linear combination of them. If the relative importance of the different dissimilarity measures are not known in advance, as in many anomaly detection applications, the anomaly detection algorithm may need to be executed multiple times with different choices of weights in the linear combination. In this paper, we propose a method for similarity-based anomaly detection using a novel multi-criteria dissimilarity measure, the Pareto depth. The proposed Pareto depth analysis (PDA) anomaly detection algorithm uses the concept of Pareto optimality to detect anomalies under multiple criteria without having to run an algorithm multiple times with different choices of weights. The proposed PDA approach is provably better than using linear combinations of the criteria and shows superior performance on experiments with synthetic and real data sets.
We present the parallel and interacting stochastic approximation annealing (PISAA) algorithm, a stochastic simulation procedure for global optimisation, that extends and improves the stochastic approximation annealing (SAA) by using population Monte Carlo ideas. The standard SAA algorithm guarantees convergence to the global minimum when a square-root cooling schedule is used; however the efficiency of its performance depends crucially on its self-adjusting mechanism. Because its mechanism is based on information obtained from only a single chain, SAA may present slow convergence in complex optimisation problems. The proposed algorithm involves simulating a population of SAA chains that interact each other in a manner that ensures significant improvement of the self-adjusting mechanism and better exploration of the sampling space. Central to the proposed algorithm are the ideas of (i) recycling information from the whole population of Markov chains to design a more accurate/stable self-adjusting mechanism and (ii) incorporating more advanced proposals, such as crossover operations, for the exploration of the sampling space. PISAA presents a significantly improved performance in terms of convergence. PISAA can be implemented in parallel computing environments if available. We demonstrate the good performance of the proposed algorithm on challenging applications including Bayesian network learning and protein folding. Our numerical comparisons suggest that PISAA outperforms the simulated annealing, stochastic approximation annealing, and annealing evolutionary stochastic approximation Monte Carlo especially in high dimensional or rugged scenarios.
Fault-tolerance techniques for stream processing engines can be categorized into passive and active approaches. A typical passive approach periodically checkpoints a processing task’s runtime states and can recover a failed task by restoring its runtime state using its latest checkpoint. On the other hand, an active approach usually employs backup nodes to run replicated tasks. Upon failure, the active replica can take over the processing of the failed task with minimal latency. However, both approaches have their own inadequacies in Massively Parallel Stream Processing Engines (MPSPE). The passive approach incurs a long recovery latency especially when a number of correlated nodes fail simultaneously, while the active approach requires extra replication resources. In this paper, we propose a new fault-tolerance framework, which is Passive and Partially Active (PPA). In a PPA scheme, the passive approach is applied to all tasks while only a selected set of tasks will be actively replicated. The number of actively replicated tasks depends on the available resources. If tasks without active replicas fail, tentative outputs will be generated before the completion of the recovery process. We also propose effective and efficient algorithms to optimize a partially active replication plan to maximize the quality of tentative outputs. We implemented PPA on top of Storm, an open-source MPSPE and conducted extensive experiments using both real and synthetic datasets to verify the effectiveness of our approach.
Stream mining poses unique challenges to machine learning: predictive models are required to be scalable, incrementally trainable, must remain bounded in size (even when the data stream is arbitrarily long), and be nonparametric in order to achieve high accuracy even in complex and dynamic environments. Moreover, the learning system must be parameterless —traditional tuning methods are problematic in streaming settings— and avoid requiring prior knowledge of the number of distinct class labels occurring in the stream. In this paper, we introduce a new algorithmic approach for nonparametric learning in data streams. Our approach addresses all above mentioned challenges by learning a model that covers the input space using simple local classifiers. The distribution of these classifiers dynamically adapts to the local (unknown) complexity of the classification problem, thus achieving a good balance between model complexity and predictive accuracy. We design four variants of our approach of increasing adaptivity. By means of an extensive empirical evaluation against standard nonparametric baselines, we show state-of-the-art results in terms of accuracy versus model size. For the variant that imposes a strict bound on the model size, we show better performance against all other methods measured at the same model size value. Our empirical analysis is complemented by a theoretical performance guarantee which does not rely on any stochastic assumption on the source generating the stream.
A limitation of many clustering algorithms is the requirement to tune adjustable parameters for each application or even for each dataset. Some algorithms require an \emph{a priori} estimate of the number of clusters while density-based techniques usually require a scale parameter. Other parametric methods, such as mixture modeling, make assumptions about the underlying cluster distributions. Here we introduce a non-parametric clustering method that does not involve tunable parameters and only assumes that clusters are unimodal, in the sense that they have a single point of maximal density when projected onto any line, and that clusters are separated from one another by a separating hyperplane of relatively lower density. The technique uses a non-parametric algorithm—isotonic regression—as the kernel operation repeated at every iteration. We carry out a rigorous hypothesis test for whether pairs of clusters should be merged based upon Monte Carlo sampling of a statistic. We compare the method against k-means++, DBSCAN, and Gaussian mixture algorithms and show in simulations that it performs better than these standard methods in many situations. The algorithm’s utility is also demonstrated in the context of ‘spike sorting’ of neural electrical recordings. The source code for the algorithm is freely available.
Warehouse is one of the important aspects of a company. Therefore, it is necessary to improve Warehouse Management System (WMS) to have a simple function that can determine the layout of the storage goods. In this paper we propose an improved warehouse layout method based on ant colony algorithm and backtracking algorithm. The method works on two steps. First, it generates a solutions parameter tree from backtracking algorithm. Then second, it deducts the solutions parameter by using a combination of ant colony algorithm and backtracking algorithm. This method was tested by measuring the time needed to build the tree and to fill up the space using two scenarios. The method needs 0.294 to 33.15 seconds to construct the tree and 3.23 seconds (best case) to 61.41 minutes (worst case) to fill up the warehouse. This method is proved to be an attractive alternative solution for warehouse layout system.