Differential privacy has emerged as a formal framework for protecting sensitive information in networked systems. One key feature is that it is immune to post-processing, which means that arbitrary post-hoc computations can be performed on privatized data without weakening differential privacy. It is therefore common to filter private data streams. To characterize this setup, in this paper we present error and entropy bounds for Kalman filtering differentially private state trajectories generated by agents with linear dynamics. We consider input perturbation privacy, in which privacy guarantees are provided on an individual basis. We bound network-level a priori and a posteriori error and entropy of a Kalman filter processing agents’ trajectories. Using the error bounds we develop, we further provide guidelines to calibrate individuals’ privacy levels in order to keep filter error within pre-specified bounds. Simulation results are presented to demonstrate these developments.
In the present paper we develop a Bayesian analysis of radar target detection that uses the parameters of conventional radar analysis to provide a valid prediction of target presence or absence when received signals cross or fail to cross chosen threshold values. A Positive Predictive Value parameter is added to the normal Receiver Operating Characteristic to provide information that allows the radar operator to make an informed decision in the choice of threshold.
We propose an approach of open-ended evolution via the simulation of swarm dynamics. In nature, swarms possess remarkable properties, which allow many organisms, from swarming bacteria to ants and flocking birds, to form higher-order structures that enhance their behavior as a group. Swarm simulations highlight three important factors to create novelty and diversity: (a) communication generates combinatorial cooperative dynamics, (b) concurrency allows for separation of timescales, and (c) complexity and size increases push the system towards transitions in innovation. We illustrate these three components in a model computing the continuous evolution of a swarm of agents. The results, divided in three distinct applications, show how emergent structures are capable of filtering information through the bottleneck of their memory, to produce meaningful novelty and diversity within their simulated environment.
In this age of data abundance, there is a growing need for algorithms and techniques for clustering big data in an accurate and efficient manner. Well-known clustering methods of the past are computationally expensive, especially when employed to cluster massive datasets into a relatively large number of groups. The particular task of clustering millions (billions) of data points into thousands (millions) of clusters is referred to as extreme clustering. We have devised a distributed method, capable of being powered by a quantum processor, to tackle this clustering problem.
Online reviews have become a vital source of information in purchasing a service (product). Opinion spammers manipulate reviews, affecting the overall perception of the service. A key challenge in detecting opinion spam is obtaining ground truth. Though there exists a large set of reviews online, only a few of them have been labeled spam or non-spam. In this paper, we propose spamGAN, a generative adversarial network which relies on limited set of labeled data as well as unlabeled data for opinion spam detection. spamGAN improves the state-of-the-art GAN based techniques for text classification. Experiments on TripAdvisor dataset show that spamGAN outperforms existing spam detection techniques when limited labeled data is used. Apart from detecting spam reviews, spamGAN can also generate reviews with reasonable perplexity.
The Web provides many data in user-friendly tabular formats that are encoded using HTML. Information extractors are intended to extract those data as datasets that can feed business applications. There exist many proposals to implement them, which has motivated several previous surveys. Unfortunately, they are outdated and we do not think that it suffices to update them because they do not provide a good conceptual framework, they do not provide a taxonomy of web tables, they do not analyse the exact tasks involved, and they do not provide a good comparison framework. This article presents a review of the literature that does not have any of the previous problems, which we hope will be useful to both researchers and practitioners.
High-level human instructions often correspond to behaviors with multiple implicit steps. In order for robots to be useful in the real world, they must be able to to reason over both motions and intermediate goals implied by human instructions. In this work, we propose a framework for learning representations that convert from a natural-language command to a sequence of intermediate goals for execution on a robot. A key feature of this framework is prospection, training an agent not just to correctly execute the prescribed command, but to predict a horizon of consequences of an action before taking it. We demonstrate the fidelity of plans generated by our framework when interpreting real, crowd-sourced natural language commands for a robot in simulated scenes.
The past few years have seen several works establishing PAC frameworks for solving various problems in economic domains; these include optimal auction design, approximate optima of submodular functions, stable partitions and payoff divisions in cooperative games and more. In this work, we provide a unified learning-theoretic methodology for modeling these problems, and establish some useful tools for determining whether a given economic solution concept can be learned from data. Our learning theoretic framework generalizes a notion of function space dimension — the graph dimension — adapting it to the solution concept learning domain. We identify sufficient conditions for the PAC learnability of solution concepts, and show that results in existing works can be immediately derived using our general methodology. Finally, we apply our methods in other economic domains, yielding a novel notion of PAC competitive equilibrium and PAC Condorcet winners.
Indexes are the best apposite choice for quickly retrieving the records. This is nothing but cutting down the number of Disk IO. Instead of scanning the complete table for the results, we can decrease the number of IO’s or page fetches using index structures such as B-Trees or Hash Indexes to retrieve the data faster. The most convenient way to consider an index is to think like a dictionary. It has words and its corresponding definitions against those words. The dictionary will have an index on ‘word’ because when we open a dictionary and we want to fetch its corresponding word quickly, then find its definition. The dictionary generally contains just a single index – an index ordered by word. When we modify any record and change the corresponding value of an indexed column in a clustered index, the database might require moving the entire row into a separately new position to maintain the rows in the sorted order. This action is essentially turned into an update query into a DELETE followed by an INSERT, and it decreases the performance of the query. The clustered index in the table can often be available on the primary key or a foreign key column because key values usually do not modify once a record is injected into the database.
The rise of non-linear and interactive media such as video games has increased the need for automatic movement animation generation. In this survey, we review and analyze different aspects of building automatic movement generation systems using machine learning techniques and motion capture data. We cover topics such as high-level movement characterization, training data, features representation, machine learning models, and evaluation methods. We conclude by presenting a discussion of the reviewed literature and outlining the research gaps and remaining challenges for future work.
Lifelong learning, the problem of continual learning where tasks arrive in sequence, has been lately attracting more attention in the computer vision community. The aim of lifelong learning is to develop a system that can learn new tasks while maintaining the performance on the previously learned tasks. However, there are two obstacles for lifelong learning of deep neural networks: catastrophic forgetting and capacity limitation. To solve the above issues, inspired by the recent breakthroughs in automatically learning good neural network architectures, we develop a Multi-task based lifelong learning via nonexpansive AutoML framework termed Regularize, Expand and Compress (REC). REC is composed of three stages: 1) continually learns the sequential tasks without the learned tasks’ data via a newly proposed multi-task weight consolidation (MWC) algorithm; 2) expands the network to help the lifelong learning with potentially improved model capability and performance by network-transformation based AutoML; 3) compresses the expanded model after learning every new task to maintain model efficiency and performance. The proposed MWC and REC algorithms achieve superior performance over other lifelong learning algorithms on four different datasets.
This paper outlines some of the possible advancements for the technosignatures searches using the new methods currently rapidly developing in computer science, such as machine learning and deep learning. It also showcases a couple of case studies of large research programs where such methods have been already successfully implemented with notable results. We consider that the availability of data from all sky, all the time observations paired with the latest developments in computational capabilities and algorithms currently used in artificial intelligence, including automation, will spur an unprecedented development of the technosignatures search efforts.
When the meaning of a phrase cannot be inferred from the individual meanings of its words (e.g., hot dog), that phrase is said to be non-compositional. Automatic compositionality detection in multi-word phrases is critical in any application of semantic processing, such as search engines; failing to detect non-compositional phrases can hurt system effectiveness notably. Existing research treats phrases as either compositional or non-compositional in a deterministic manner. In this paper, we operationalize the viewpoint that compositionality is contextual rather than deterministic, i.e., that whether a phrase is compositional or non-compositional depends on its context. For example, the phrase `green card’ is compositional when referring to a green colored card, whereas it is non-compositional when meaning permanent residence authorization. We address the challenge of detecting this type of contextual compositionality as follows: given a multi-word phrase, we enrich the word embedding representing its semantics with evidence about its global context (terms it often collocates with) as well as its local context (narratives where that phrase is used, which we call usage scenarios). We further extend this representation with information extracted from external knowledge bases. The resulting representation incorporates both localized context and more general usage of the phrase and allows to detect its compositionality in a non-deterministic and contextual way. Empirical evaluation of our model on a dataset of phrase compositionality, manually collected by crowdsourcing contextual compositionality assessments, shows that our model outperforms state-of-the-art baselines notably on detecting phrase compositionality.
Automatic fact-checking systems detect misinformation, such as fake news, by (i) selecting check-worthy sentences for fact-checking, (ii) gathering related information to the sentences, and (iii) inferring the factuality of the sentences. Most prior research on (i) uses hand-crafted features to select check-worthy sentences, and does not explicitly account for the recent finding that the top weighted terms in both check-worthy and non-check-worthy sentences are actually overlapping [15]. Motivated by this, we present a neural check-worthiness sentence ranking model that represents each word in a sentence by \textit{both} its embedding (aiming to capture its semantics) and its syntactic dependencies (aiming to capture its role in modifying the semantics of other terms in the sentence). Our model is an end-to-end trainable neural network for check-worthiness ranking, which is trained on large amounts of unlabelled data through weak supervision. Thorough experimental evaluation against state of the art baselines, with and without weak supervision, shows our model to be superior at all times (+13% in MAP and +28% at various Precision cut-offs from the best baseline with statistical significance). Empirical analysis of the use of weak supervision, word embedding pretraining on domain-specific data, and the use of syntactic dependencies of our model reveals that check-worthy sentences contain notably more identical syntactic dependencies than non-check-worthy sentences.
While data science is battling to extract information from the enormous explosion of data, many estimators and algorithms are being developed for better prediction. Researchers and data scientists often introduce new methods and evaluate them based on various aspects of data. However, studies on the impact of/on a model with multiple response variables are limited. This study compares some newly-developed (envelope) and well-established (PLS, PCR) prediction methods based on real data and simulated data specifically designed by varying properties such as multicollinearity, the correlation between multiple responses and position of relevant principal components of predictors. This study aims to give some insight into these methods and help the researcher to understand and use them in further studies.
Context: Data mining techniques have demonstrated to be a powerful technique for discovering insights hidden in data from a domain. However, these techniques demand very specialised skills. People willing to analyse data often lack these skills, so they must rely on data scientists, which hinders data mining democratisation. Different approaches have appeared in the last years to address this issue. Objective: Analyse the state of the art to know how far are we from an effective data mining democratisation, what has already been accomplished, and what should be done in the upcoming years. Method: We performed a state-of-the-art review following a systematic and objective procedure, which included works both from the academia and the industry. The reviewed works were grouped in four categories. Each category was then evaluated in detail using a well-defined evaluation criteria to identify its strengths and weaknesses. Results: Around 700 works were initially considered, from which 43 were finally selected for a more in-depth analysis. Only two out of the four identified categories provide effective solutions to data mining democratisation. From these two categories, one always requires a minimum intervention of a data scientist, whereas the other one does not provide support for all the stages of the data mining process, and might exhibit accuracy problems in some contexts. Conclusion: In all analysed approaches, a data scientist is still required to perform some steps of the analysis process. Moreover, automated approaches that do not require data scientists for some steps expose some problems in other quality attributes, such as accuracy. Therefore, although existent work shows some promising initial steps, we are still far from data mining democratisation.
One of the main drawbacks of the practical use of neural networks is the long time needed in the training process. Such training process consists in an iterative change of parameters trying to minimize a loss function. These changes are driven by a dataset, which can be seen as a set of labeled points in an n-dimensional space. In this paper, we explore the concept of it representative dataset which is smaller than the original dataset and satisfies a nearness condition independent of isometric transformations. The representativeness is measured using persistence diagrams due to its computational efficiency. We also prove that the accuracy of the learning process of a neural network on a representative dataset is comparable with the accuracy on the original dataset when the neural network architecture is a perceptron and the loss function is the mean squared error. These theoretical results accompanied with experimentation open a door to the size reduction of the dataset in order to gain time in the training process of any neural network.
We present a machine-readable movement writing for sleight-of-hand moves with cards — a ‘Labanotation of card magic.’ This scheme of movement writing contains 440 categories of motion, and appears to taxonomize all card sleights that have appeared in over 1500 publications. The movement writing is axiomatized in $\mathcal{SROIQ}$(D) Description Logic, and collected formally as an Ontology of Card Sleights, a computational ontology that extends the Basic Formal Ontology and the Information Artifact Ontology. The Ontology of Card Sleights is implemented in OWL DL, a Description Logic fragment of the Web Ontology Language. While ontologies have historically been used to classify at a less granular level, the algorithmic nature of card tricks allows us to transcribe a performer’s actions step by step. We conclude by discussing design criteria we have used to ensure the ontology can be accessed and modified with a simple click-and-drag interface. This may allow database searches and performance transcriptions by users with card magic knowledge, but no ontology background.
We present a novel model called OCGAN for the classical problem of one-class novelty detection, where, given a set of examples from a particular class, the goal is to determine if a query example is from the same class. Our solution is based on learning latent representations of in-class examples using a denoising auto-encoder network. The key contribution of our work is our proposal to explicitly constrain the latent space to exclusively represent the given class. In order to accomplish this goal, firstly, we force the latent space to have bounded support by introducing a tanh activation in the encoder’s output layer. Secondly, using a discriminator in the latent space that is trained adversarially, we ensure that encoded representations of in-class examples resemble uniform random samples drawn from the same bounded space. Thirdly, using a second adversarial discriminator in the input space, we ensure all randomly drawn latent samples generate examples that look real. Finally, we introduce a gradient-descent based sampling technique that explores points in the latent space that generate potential out-of-class examples, which are fed back to the network to further train it to generate in-class examples from those points. The effectiveness of the proposed method is measured across four publicly available datasets using two one-class novelty detection protocols where we achieve state-of-the-art results.
Wikipedia is a rich and invaluable source of information. Its central place on the Web makes it a particularly interesting object of study for scientists. Researchers from different domains used various complex datasets related to Wikipedia to study language, social behavior, knowledge organization, and network theory. While being a scientific treasure, the large size of the dataset hinders pre-processing and may be a challenging obstacle for potential new studies. This issue is particularly acute in scientific domains where researchers may not be technically and data processing savvy. On one hand, the size of Wikipedia dumps is large. It makes the parsing and extraction of relevant information cumbersome. On the other hand, the API is straightforward to use but restricted to a relatively small number of requests. The middle ground is at the mesoscopic scale when researchers need a subset of Wikipedia ranging from thousands to hundreds of thousands of pages but there exists no efficient solution at this scale. In this work, we propose an efficient data structure to make requests and access subnetworks of Wikipedia pages and categories. We provide convenient tools for accessing and filtering viewership statistics or ‘pagecounts’ of Wikipedia web pages. The dataset organization leverages principles of graph databases that allows rapid and intuitive access to subgraphs of Wikipedia articles and categories. The dataset and deployment guidelines are available on the LTS2 website \url{https://…/}.
Contextual bandits with linear payoffs, which are also known as linear bandits, provide a powerful alternative for solving practical problems of sequential decisions, e.g., online advertisements. In the era of big data, contextual data usually tend to be high-dimensional, which leads to new challenges for traditional linear bandits mostly designed for the setting of low-dimensional contextual data. Due to the curse of dimensionality, there are two challenges in most of the current bandit algorithms: the first is high time-complexity; and the second is extreme large upper regret bounds with high-dimensional data. In this paper, in order to attack the above two challenges effectively, we develop an algorithm of Contextual Bandits via RAndom Projection (\texttt{CBRAP}) in the setting of linear payoffs, which works especially for high-dimensional contextual data. The proposed \texttt{CBRAP} algorithm is time-efficient and flexible, because it enables players to choose an arm in a low-dimensional space, and relaxes the sparsity assumption of constant number of non-zero components in previous work. Besides, we provide a linear upper regret bound for the proposed algorithm, which is associated with reduced dimensions.
This paper is concerned with numerically finding a global solution of constrained optimal control problems with many local minima. The focus is on the optimal decentralized control (ODC) problem, whose feasible set is recently shown to have an exponential number of connected components and consequently an exponential number of local minima. The rich literature of numerical algorithms for nonlinear optimization suggests that if a local search algorithm is initialized in an arbitrary connected component of the feasible set, it would search only within that component and find a stationary point there. This is based on the fact that numerical algorithms are designed to generate a sequence of points (via searching for descent directions and adjusting the step size), whose corresponding continuous path is trapped in a single connected component. In contrast with this perception rooted in convex optimization, we numerically illustrate that local search methods for non-convex constrained optimization can obliviously jump between different connected components to converge to a global minimum, via an aggressive step size adjustment using backtracking and the Armijio rule. To support the observations, we prove that from almost every arbitrary point in any connected component of the feasible set, it is possible to generate a sequence of points using local search to jump to different components and converge to a global solution. However, due to the NP-hardness of the problem, such fine-tuning of the parameters of a local search algorithm may need prior knowledge or be time consuming. This paper offers the first result on escaping non-global local solutions of constrained optimal control problems with complicated feasible sets.
We describe a TensorFlow-based library for posterior sampling and exploration in machine learning applications. TATi, the Thermodynamic Analytics ToolkIt, implements algorithms for 2nd order (underdamped) Langevin dynamics and Hamiltonian Monte Carlo (HMC). It also allows for rapid prototyping of new sampling methods in pure Python and supports an ensemble framework for generating multiple trajectories in parallel, a capability that is demonstrated by the implementation of a recently proposed ensemble preconditioning sampling procedure. In addition to explaining the architecture of TATi and its connections with the TensorFlow framework, this article contains preliminary numerical experiments to explore the efficiency of posterior sampling strategies in ML applications, in comparison to standard training strategies. We provide a glimpse of the potential of the new toolkit by studying (and visualizing) the loss landscape of a neural network applied to the MNIST hand-written digits data set.