LSTM (Long Short-Term Memory) recurrent neural networks have been highly successful in a number of application areas. This technical report describes the use of the MNIST and UW3 databases for benchmarking LSTM networks and explores the effect of different architectural and hyperparameter choices on performance. Significant findings include: (1) LSTM performance depends smoothly on learning rates, (2) batching and momentum has no significant effect on performance, (3) softmax training outperforms least square training, (4) peephole units are not useful, (5) the standard non-linearities (tanh and sigmoid) perform best, (6) bidirectional training combined with CTC performs better than other methods.
The scan statistic is by far the most popular method for anomaly detection, being popular in syndromic surveillance, signal and image processing, and target detection based on sensor networks, among other applications. The use of the scan statistics in such settings yields an hypotheses testing procedure, where the null hypothesis corresponds to the absence of anomalous behavior. If the null distribution is known, then calibration of a scan-based test is relatively easy, as it can be done by Monte-Carlo simulation. When the null distribution is unknown, it is not clear what the best way to proceed is. We propose two procedures. One is a calibration by permutation and the other is a rank-based scan test, which is distribution-free and less sensitive to outliers. Furthermore, the rank-scan test requires only a one-time calibration for a given data size making it computationally more appealing. In both cases, we quantify the performance loss with respect to an oracle scan test that knows the null distribution, and show one incurs only a very small loss in the context of a natural exponential family. These results include the classical normal location model, as well as Poisson model popular in syndromic surveillance. We perform numerical experiments on simulated data further supporting our theory, and also experiments with a real dataset from genomics.
Crowdsourcing and human computation has been employed in increasingly sophisticated projects that require the solution of a heterogeneous set of tasks. We explore the challenge of building or hiring an effective team, for performing tasks required for such projects on an ongoing basis, from an available pool of applicants or workers who have bid for the tasks. The recruiter needs to learn workers’ skills and expertise by performing online tests and interviews, and would like to minimize the amount of budget or time spent in this process before committing to hiring the team. How can one optimally spend budget to learn the expertise of workers as part of recruiting a team? How can one exploit the similarities among tasks as well as underlying social ties or commonalities among the workers for faster learning? We tackle these decision-theoretic challenges by casting them as an instance of online learning for best action selection. We present algorithms with PAC bounds on the required budget to hire a near-optimal team with high confidence. Furthermore, we consider an embedding of the tasks and workers in an underlying graph that may arise from task similarities or social ties, and that can provide additional side-observations for faster learning. We then quantify the improvement in the bounds that we can achieve depending on the characteristic properties of this graph structure. We evaluate our methodology on simulated problem instances as well as on real-world crowdsourcing data collected from the oDesk platform. Our methodology and results present an interesting direction of research to tackle the challenges faced by a recruiter for contract-based crowdsourcing.
Most of contemporary software systems are implemented using an object-oriented approach. Modeling phases — during which software engineers analyze requirements to the future system using some modeling language — are an important part of the development process, since modeling errors are often hard to recognize and correct. In this paper we present a framework which allows the integration of Answer Set Programming into the object-oriented software development process. OOASP supports reasoning about object-oriented software models and their instantiations. Preliminary results of the OOASP application in CSL Studio, which is a Siemens internal modeling environment for product configurators, show that it can be used as a lightweight approach to verify, create and transform instantiations of object models at runtime and to support the software development process during design and testing.
The errors-in-variables (EIV) regression model, being more realistic by accounting for measurement errors in both the dependent and the independent variables, is widely adopted in applied sciences. The traditional EIV model estimators, however, can be highly biased by outliers and other departures from the underlying assumptions. In this paper, we develop a novel nonparametric regression approach – the robust compound regression (RCR) analysis method for the robust estimation of EIV models. We first introduce a robust and efficient estimator called least sine squares (LSS). Taking full advantage of both the new LSS method and the compound regression analysis method developed in our own group, we subsequently propose the RCR approach as a generalization of those two, which provides a robust counterpart of the entire class of the maximum likelihood estimation (MLE) solutions of the EIV model, in a 1-1 mapping. Technically, our approach gives users the flexibility to select from a class of RCR estimates the optimal one with a predefined regression efficiency criterion satisfied. Simulation studies and real-life examples are provided to illustrate the effectiveness of the RCR approach.
Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $O(\sigma^2\log^2 n)$ bits, where $n$ is the length of the string and $\sigma$ is the size of the alphabet. The size of the stack is $o(n)$ except for very large values of $\sigma$. We further improve the algorithm by removing its time dependency on $\sigma$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that $\textit{do not occur}$ in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.
The Goulden-Jackson cluster method is a powerful tool for obtaining generating functions for words in free monoids by occurrences of a set of subwords. We introduce a generalization of the cluster method for monoid networks, which generalize the combinatorial framework of free monoids. As a sample application of the monoid network cluster method, we compute bivariate and multivariate generating functions for Motzkin paths—both with height bounded and unbounded—by statistics corresponding to the number of occurrences of various subwords, yielding both closed-form and continued fraction formulae.