The top word list, i.e., the top-M words with highest marginal probability in a given topic, is the standard topic representation in topic models. Most of recent automatical topic labeling algorithms and popular topic quality metrics are based on it. However, we find, empirically, words in this type of top word list are not always representative. The objective of this paper is to find more representative top word lists for topics. To achieve this, we rerank the words in a given topic by further considering marginal probability on words over every other topic. The reranking list of top-M words is used to be a novel topic representation for topic models. We investigate three reranking methodologies, using (1) standard deviation weight, (2) standard deviation weight with topic size and (3) Chi Square \c{hi}2statistic selection. Experimental results on real world collections indicate that our representations can extract more representative words for topics, agreeing with human judgements.
Since the inception of Deep Reinforcement Learning (DRL) algorithms, there has been a growing interest in both research and industrial communities in the promising potentials of this paradigm. The list of current and envisioned applications of deep RL ranges from autonomous navigation and robotics to control applications in the critical infrastructure, air traffic control, defense technologies, and cybersecurity. While the landscape of opportunities and the advantages of deep RL algorithms are justifiably vast, the security risks and issues in such algorithms remain largely unexplored. To facilitate and motivate further research on these critical challenges, this paper presents a foundational treatment of the security problem in DRL. We formulate the security requirements of DRL, and provide a high-level threat model through the classification and identification of vulnerabilities, attack vectors, and adversarial capabilities. Furthermore, we present a review of current literature on security of deep RL from both offensive and defensive perspectives. Lastly, we enumerate critical research venues and open problems in mitigation and prevention of intentional attacks against deep RL as a roadmap for further research in this area.
Entity relatedness has emerged as an important feature in a plethora of applications such as information retrieval, entity recommendation and entity linking. Given an entity, for instance a person or an organization, entity relatedness measures can be exploited for generating a list of highly-related entities. However, the relation of an entity to some other entity depends on several factors, with time and context being two of the most important ones (where, in our case, context is determined by a particular corpus). For example, the entities related to the International Monetary Fund are different now compared to some years ago, while these entities also may highly differ in the context of a USA news portal compared to a Greek news portal. In this paper, we propose a simple but flexible model for entity relatedness which considers time and entity aware word embeddings by exploiting the underlying corpus. The proposed model does not require external knowledge and is language independent, which makes it widely useful in a variety of applications.
We demonstrate some negative results for approximation of functions with neural networks.
Recommender systems are able to estimate the user’s interest for resource given from some relative information to others similar users and to propriety of the resource. In this Memory, we introduced a new contextual recommendation approach based on the AHP Process Hierarchical Analysis method. This work consisted in making a bibliographic study on the works having proposed systems of recommendation based on the context of the users in the field of films. The goal is to design and develop a new approach to recommending movies based on user context. And we relied on methods of multi-criteria decision making (MCDM) and more precisely the method of Hierarchical Process Analysis (AHP) for context integration in the recommendation process.
Graph learning methods have recently been receiving increasing interest as means to infer structure in datasets. Most of the recent approaches focus on different relationships between a graph and data sample distributions, mostly in settings where all available relate to the same graph. This is, however, not always the case, as data is often available in mixed form, yielding the need for methods that are able to cope with mixture data and learn multiple graphs. We propose a novel generative model that explains a collection of distinct data naturally living on different graphs. We assume the mapping of data to graphs is not known and investigate the problem of jointly clustering a set of data and learning a graph for each of the clusters. Experiments in both synthetic and real-world datasets demonstrate promising performance both in terms of data clustering, as well as multiple graph inference from mixture data.
Mobile vision systems such as smartphones, drones, and augmented-reality headsets are revolutionizing our lives. These systems usually run multiple applications concurrently and their available resources at runtime are dynamic due to events such as starting new applications, closing existing applications, and application priority changes. In this paper, we present NestDNN, a framework that takes the dynamics of runtime resources into account to enable resource-aware multi-tenant on-device deep learning for mobile vision systems. NestDNN enables each deep learning model to offer flexible resource-accuracy trade-offs. At runtime, it dynamically selects the optimal resource-accuracy trade-off for each deep learning model to fit the model’s resource demand to the system’s available runtime resources. In doing so, NestDNN efficiently utilizes the limited resources in mobile vision systems to jointly maximize the performance of all the concurrently running applications. Our experiments show that compared to the resource-agnostic status quo approach, NestDNN achieves as much as 4.2% increase in inference accuracy, 2.0x increase in video frame processing rate and 1.7x reduction on energy consumption.
Common approaches to Reinforcement Learning (RL) are seriously challenged by large-scale applications involving huge state spaces and sparse delayed reward feedback. Hierarchical Reinforcement Learning (HRL) methods attempt to address this scalability issue by learning action selection policies at multiple levels of temporal abstraction. Abstraction can be had by identifying a relatively small set of states that are likely to be useful as subgoals, in concert with the learning of corresponding skill policies to achieve those subgoals. Many approaches to subgoal discovery in HRL depend on the analysis of a model of the environment, but the need to learn such a model introduces its own problems of scale. Once subgoals are identified, skills may be learned through intrinsic motivation, introducing an internal reward signal marking subgoal attainment. In this paper, we present a novel model-free method for subgoal discovery using incremental unsupervised learning over a small memory of the most recent experiences of the agent. When combined with an intrinsic motivation learning mechanism, this method learns subgoals and skills together, based on experiences in the environment. Thus, we offer an original approach to HRL that does not require the acquisition of a model of the environment, suitable for large-scale applications. We demonstrate the efficiency of our method on two RL problems with sparse delayed feedback: a variant of the rooms environment and the ATARI 2600 game called Montezuma’s Revenge.
Measuring similarities between unlabeled time series trajectories is an important problem in domains as diverse as medicine, astronomy, finance, and computer vision. It is often unclear what is the appropriate metric to use because of the complex nature of noise in the trajectories (e.g. different sampling rates or outliers). Domain experts typically hand-craft or manually select a specific metric, such as dynamic time warping (DTW), to apply on their data. In this paper, we propose Autowarp, an end-to-end algorithm that optimizes and learns a good metric given unlabeled trajectories. We define a flexible and differentiable family of warping metrics, which encompasses common metrics such as DTW, Euclidean, and edit distance. Autowarp then leverages the representation power of sequence autoencoders to optimize for a member of this warping distance family. The output is a metric which is easy to interpret and can be robustly learned from relatively few trajectories. In systematic experiments across different domains, we show that Autowarp often outperforms hand-crafted trajectory similarity metrics.
PoPPy is a Point Process toolbox based on PyTorch, which achieves flexible designing and efficient learning of point process models. It can be used for interpretable sequential data modeling and analysis, e.g., Granger causality analysis of multi-variate point processes, point process-based simulation and prediction of event sequences. In practice, the key points of point process-based sequential data modeling include: 1) How to design intensity functions to describe the mechanism behind observed data? 2) How to learn the proposed intensity functions from observed data? The goal of PoPPy is providing a user-friendly solution to the key points above and achieving large-scale point process-based sequential data analysis, simulation and prediction.
Existing attention mechanisms, are mostly item-based in that a model is designed to attend to a single item in a collection of items (the memory). Intuitively, an area in the memory that may contain multiple items can be worth attending to as a whole. We propose area attention: a way to attend to an area of the memory, where each area contains a group of items that are either spatially adjacent when the memory has a 2-dimensional structure, such as images, or temporally adjacent for 1-dimensional memory, such as natural language sentences. Importantly, the size of an area, i.e., the number of items in an area, can vary depending on the learned coherence of the adjacent items. By giving the model the option to attend to an area of items, instead of only a single item, we hope attention mechanisms can better capture the nature of the task. Area attention can work along multi-head attention for attending to multiple areas in the memory. We evaluate area attention on two tasks: neural machine translation and image captioning, and improve upon strong (state-of-the-art) baselines in both cases. These improvements are obtainable with a basic form of area attention that is parameter free. In addition to proposing the novel concept of area attention, we contribute an efficient way for computing it by leveraging the technique of summed area tables.
We present Unit-B, a formal method inspired by Event-B and UNITY. Unit-B aims at the stepwise design of software systems satisfying safety and liveness properties. The method features the novel notion of coarse and fine schedules, a generalisation of weak and strong fairness for specifying events’ scheduling assumptions. Based on events schedules, we propose proof rules to reason about progress properties and a refinement order preserving both liveness and safety properties. We illustrate our approach by an example to show that systems development can be driven by not only safety but also liveness requirements.
We present a Few-Shot Relation Classification Dataset (FewRel), consisting of 70, 000 sentences on 100 relations derived from Wikipedia and annotated by crowdworkers. The relation of each sentence is first recognized by distant supervision methods, and then filtered by crowdworkers. We adapt the most recent state-of-the-art few-shot learning methods for relation classification and conduct a thorough evaluation of these methods. Empirical results show that even the most competitive few-shot learning models struggle on this task, especially as compared with humans. We also show that a range of different reasoning skills are needed to solve our task. These results indicate that few-shot relation classification remains an open problem and still requires further research. Our detailed analysis points multiple directions for future research. All details and resources about the dataset and baselines are released on http://…/fewrel.
Gradient Boosting Machine (GBM) introduced by Friedman is an extremely powerful supervised learning algorithm that is widely used in practice — it routinely features as a leading algorithm in machine learning competitions such as Kaggle and the KDDCup. In spite of the usefulness of GBM in practice, there is a big gap between its theoretical understanding and its success in practice. In this work, we propose Randomized Gradient Boosting Machine (RGBM) which leads to significant computational gains compared to GBM, by using a randomization scheme to reduce the search in the space of weak learners. Our analysis provides a formal justification of commonly used ad hoc heuristics employed by GBM implementations such as XGBoost, and suggests alternatives. In particular, we also provide a principled guideline towards better step-size selection in RGBM that does not require a line search. The analysis of RGBM is inspired by a special variant of coordinate descent that combines the benefits of randomized coordinate descent and greedy coordinate descent; and may be of independent interest as an optimization algorithm. As a special case, our results for RGBM lead to superior computational guarantees for GBM. Our computational guarantees depend upon a curious geometric quantity that we call Minimal Cosine Angle, which relates to the density of weak learners in the prediction space. We demonstrate the effectiveness of RGBM over GBM in terms of obtaining a model with good training/test data fidelity with a fraction of the computational cost, via numerical experiments on several real datasets.
Time series prediction can be generalized as a process that extracts useful information from historical records and then determines future values. Learning long-range dependencies that are embedded in time series is often an obstacle for most algorithms, whereas Long Short-Term Memory (LSTM) solutions, as a specific kind of scheme in deep learning, promise to effectively overcome the problem. In this article, we first give a brief introduction to the structure and forward propagation mechanism of the LSTM model. Then, aiming at reducing the considerable computing cost of LSTM, we put forward the Random Connectivity LSTM (RCLSTM) model and test it by predicting traffic and user mobility in telecommunication networks. Compared to LSTM, RCLSTM is formed via stochastic connectivity between neurons, which achieves a significant breakthrough in the architecture formation of neural networks. In this way, the RCLSTM model exhibits a certain level of sparsity, which leads to an appealing decrease in the computational complexity and makes the RCLSTM model become more applicable in latency-stringent application scenarios. In the field of telecommunication networks, the prediction of traffic series and mobility traces could directly benefit from this improvement as we further demonstrate that the prediction accuracy of RCLSTM is comparable to that of the conventional LSTM no matter how we change the number of training samples or the length of input sequences.
In this paper we propose a new template for empirical studies intended to assess causal effects: the outcome-wide longitudinal design. The approach is an extension of what is often done to assess the causal effects of a treatment or exposure using confounding control, but now, over numerous outcomes. We discuss the temporal and confounding control principles for such outcome-wide studies, metrics to evaluate robustness or sensitivity to potential unmeasured confounding for each outcome, and approaches to handle multiple testing. We argue that the outcome-wide longitudinal design has numerous advantages over more traditional studies of single exposure-outcome relationships including results that are less subject to investigator bias, greater potential to report null effects, greater capacity to compare effect sizes, a tremendous gain in the efficiency for the research community, a greater policy relevance, and a more rapid advancement of knowledge. We discuss both the practical and theoretical justification for the outcome-wide longitudinal design and also the pragmatic details of its implementation.
In this paper, we identify partial correlation information structures that allow for simpler reformulations in evaluating the maximum expected value of mixed integer linear programs with random objective coefficients. To this end, assuming only the knowledge of the mean and the covariance matrix entries restricted to block-diagonal patterns, we develop a reduced semidefinite programming formulation, the complexity of solving which is related to characterizing a suitable projection of the convex hull of the set $\{(\bold{x}, \bold{x}\bold{x}'): \bold{x} \in \mathcal{X}\}$ where $\mathcal{X}$ is the feasible region. In some cases, this lends itself to efficient representations that result in polynomial-time solvable instances, most notably for the distributionally robust appointment scheduling problem with random job durations as well as for computing tight bounds in Project Evaluation and Review Technique (PERT) networks and linear assignment problems. To the best of our knowledge, this is the first example of a distributionally robust optimization formulation for appointment scheduling that permits a tight polynomial-time solvable semidefinite programming reformulation which explicitly captures partially known correlation information between uncertain processing times of the jobs to be scheduled.
Multidimensional scaling is an important dimension reduction tool in statistics and machine learning. Yet few theoretical results characterizing its statistical performance exist, not to mention any in high dimensions. By considering a unified framework that includes low, moderate and high dimensions, we study multidimensional scaling in the setting of clustering noisy data. Our results suggest that, in order to achieve consistent estimation of the embedding scheme, the classical multidimensional scaling needs to be modified, especially when the noise level increases. To this end, we propose {\it modified multidimensional scaling} which applies a nonlinear transformation to the sample eigenvalues. The nonlinear transformation depends on the dimensionality, sample size and unknown moment. We show that modified multidimensional scaling followed by various clustering algorithms can achieve exact recovery, i.e., all the cluster labels can be recovered correctly with probability tending to one. Numerical simulations and two real data applications lend strong support to our proposed methodology. As a byproduct, we unify and improve existing results on the $\ell_{\infty}$ bound for eigenvectors under only low bounded moment conditions. This can be of independent interest.
In this paper, the defining properties of a valid measure of the dependence between two random variables are reviewed and complemented with two original ones, shown to be more fundamental than other usual postulates. While other popular choices are proved to violate some of these requirements, a class of dependence measures satisfying all of them is identified. One particular measure, that we call the Hellinger correlation, appears as a natural choice within that class due to both its theoretical and intuitive appeal. A simple and efficient nonparametric estimator for that quantity is proposed. Synthetic and real-data examples finally illustrate the descriptive ability of the measure, which can also be used as test statistic for exact independence testing.
Learning with Label Proportions (LLP) is the problem of recovering the underlying true labels given a dataset when the data is presented in the form of bags. This paradigm is particularly suitable in contexts where providing individual labels is expensive and label aggregates are more easily obtained. In the healthcare domain, it is a burden for a patient to keep a detailed diary of their daily routines, but often they will be amenable to provide higher level summaries of daily behavior. We present a novel and efficient graph-based algorithm that encourages local smoothness and exploits the global structure of the data, while preserving the `mass’ of each bag.
This article provides a comprehensive study of different ways to make speed benchmarks of gradient boosted decision trees algorithm. We show main problems of several straight forward ways to make benchmarks, explain, why a speed benchmarking is a challenging task and provide a set of reasonable requirements for a benchmark to be fair and useful.
Despite recent developments, deploying deep neural networks on resource constrained general purpose hardware remains a significant challenge. There has been much work in developing methods for reshaping neural networks, usually with a focus on minimising total parameter count. These methods are typically developed in a hardware-agnostic manner and do not exploit hardware behaviour. In this paper we propose a new approach, Hardware Aware Knowledge Distillation (HAKD) which uses empirical observations of hardware behaviour to design efficient student networks which are then trained with knowledge distillation. This allows the trade-off between accuracy and performance to be managed explicitly. We have applied this approach across three platforms and evaluated it on two networks, MobileNet and DenseNet, on CIFAR-10. We show that HAKD outperforms Deep Compression and Fisher pruning in terms of size, accuracy and performance.
Knowledge bases (KBs) are paramount in NLP. We employ multiview learning for increasing accuracy and coverage of entity type information in KBs. We rely on two metaviews: language and representation. For language, we consider high-resource and low-resource languages from Wikipedia. For representation, we consider representations based on the context distribution of the entity (i.e., on its embedding), on the entity’s name (i.e., on its surface form) and on its description in Wikipedia. The two metaviews language and representation can be freely combined: each pair of language and representation (e.g., German embedding, English description, Spanish name) is a distinct view. Our experiments on entity typing with fine-grained classes demonstrate the effectiveness of multiview learning. We release MVET, a large multiview – and, in particular, multilingual – entity typing dataset we created. Mono- and multilingual fine-grained entity typing systems can be evaluated on this dataset.