In this paper, we develop a privacy implementation for symbolic control systems. Such systems generate sequences of non-numerical data, and these sequences can be represented by words or strings over a finite alphabet. This work uses the framework of differential privacy, which is a statistical notion of privacy that makes it unlikely that privatized data will reveal anything meaningful about underlying sensitive data. To bring differential privacy to symbolic control systems, we develop an exponential mechanism that approximates a sensitive word using a randomly chosen word that is likely to be near it. The notion of ‘near’ is given by the Levenshtein distance, which counts the number of operations required to change one string into another. We then develop a Levenshtein automaton implementation of our exponential mechanism that efficiently generates privatized output words. This automaton has letters as its states, and this work develops transition probabilities among these states that give overall output words obeying the distribution required by the exponential mechanism. Numerical results are provided to demonstrate this technique for both strings of English words and runs of a deterministic transition system, demonstrating in both cases that privacy can be provided in this setting while maintaining a reasonable degree of accuracy.
Toxic online content has become a major issue in today’s world due to an exponential increase in the use of internet by people of different cultures and educational background. Differentiating hate speech and offensive language is a key challenge in automatic detection of toxic text content. In this paper, we propose an approach to automatically classify tweets on Twitter into three classes: hateful, offensive and clean. Using Twitter dataset, we perform experiments considering n-grams as features and passing their term frequency-inverse document frequency (TFIDF) values to multiple machine learning models. We perform comparative analysis of the models considering several values of n in n-grams and TFIDF normalization methods. After tuning the model giving the best results, we achieve 95.6% accuracy upon evaluating it on test data. We also create a module which serves as an intermediate between user and Twitter.
The aim of this research is to introduce a novel structural design process that allows architects and engineers to extend their typical design space horizon and thereby promoting the idea of creativity in structural design. The theoretical base of this work builds on the combination of structural form-finding and state-of-the-art machine learning algorithms. In the first step of the process, Combinatorial Equilibrium Modelling (CEM) is used to generate a large variety of spatial networks in equilibrium for given input parameters. In the second step, these networks are clustered and represented in a form-map through the implementation of a Self Organizing Map (SOM) algorithm. In the third step, the solution space is interpreted with the help of a Uniform Manifold Approximation and Projection algorithm (UMAP). This allows gaining important insights in the structure of the solution space. A specific case study is used to illustrate how the infinite equilibrium states of a given topology can be defined and represented by clusters. Furthermore, three classes, related to the non-linear interaction between the input parameters and the form space, are verified and a statement about the entire manifold of the solution space of the case study is made. To conclude, this work presents an innovative approach on how the manifold of a solution space can be grasped with a minimum amount of data and how to operate within the manifold in order to increase the diversity of solutions.
This work presents a method for visual text recognition without using any paired supervisory data. We formulate the text recognition task as one of aligning the conditional distribution of strings predicted from given text images, with lexically valid strings sampled from target corpora. This enables fully automated, and unsupervised learning from just line-level text-images, and unpaired text-string samples, obviating the need for large aligned datasets. We present detailed analysis for various aspects of the proposed method, namely – (1) the impact of the length of training sequences on convergence, (2) relation between character frequencies and the order in which they are learnt, and (3) demonstrate the generalisation ability of our recognition network to inputs of arbitrary lengths. Finally, we demonstrate excellent text recognition accuracy on both synthetically generated text images, and scanned images of real printed books, using no labelled training examples.
In classic fair division problems such as cake cutting and rent division, envy-freeness requires that each individual (weakly) prefer his allocation to anyone else’s. On a conceptual level, we argue that envy-freeness also provides a compelling notion of fairness for classification tasks. Our technical focus is the generalizability of envy-free classification, i.e., understanding whether a classifier that is envy free on a sample would be almost envy free with respect to the underlying distribution with high probability. Our main result establishes that a small sample is sufficient to achieve such guarantees, when the classifier in question is a mixture of deterministic classifiers that belong to a family of low Natarajan dimension.
Neural architecture for named entity recognition has achieved great success in the field of natural language processing. Currently, the dominating architecture consists of a bi-directional recurrent neural network (RNN) as the encoder and a conditional random field (CRF) as the decoder. In this paper, we propose a deformable stacked structure for named entity recognition, in which the connections between two adjacent layers are dynamically established. We evaluate the deformable stacked structure by adapting it to different layers. Our model achieves the state-of-the-art performances on the OntoNotes dataset.
Control of large-scale networked systems often necessitates the availability of complex models for the interactions amongst the agents. However in many applications, building accurate models of agents or interactions amongst them might be infeasible or computationally prohibitive due to the curse of dimensionality or the complexity of these interactions. In the meantime, data-guided control methods can circumvent model complexity by directly synthesizing the controller from the observed data. In this paper, we propose a distributed $Q$-learning algorithm to design a feedback mechanism based on a given underlying graph structure parameterizing the agents’ interaction network. We assume that the distributed nature of the system arises from the cost function of the corresponding control problem and show that for the specific case of identical dynamically decoupled systems, the learned controller converges to the optimal Linear Quadratic Regulator (LQR) controller for each subsystem. We provide a convergence analysis and verify the result with an example.
We propose a novel linear discriminant analysis approach for the classification of high-dimensional matrix-valued data that commonly arises from imaging studies. Motivated by the equivalence of the conventional linear discriminant analysis and the ordinary least squares, we consider an efficient nuclear norm penalized regression that encourages a low-rank structure. Theoretical properties including a non-asymptotic risk bound and a rank consistency result are established. Simulation studies and an application to electroencephalography data show the superior performance of the proposed method over the existing approaches.
5th generation networks are envisioned to provide seamless and ubiquitous connection to 1000-fold more devices and is believed to provide ultra-low latency and higher data rates up to tens of Gbps. Different technologies enabling these requirements are being developed including mmWave communications, Massive MIMO and beamforming, Device to Device (D2D) communications and Heterogeneous Networks. D2D communication is a promising technology to enable applications requiring high bandwidth such as online streaming and online gaming etc. It can also provide ultra- low latencies required for applications like vehicle to vehicle communication for autonomous driving. D2D communication can provide higher data rates with high energy efficiency and spectral efficiency compared to conventional communication. The performance benefits of D2D communication can be best achieved when D2D users reuses the spectrum being utilized by the conventional cellular users. This spectrum sharing in a multi-tier heterogeneous network will introduce complex interference among D2D users and cellular users which needs to be resolved. Motivated by limited number of surveys for interference mitigation and resource allocation in D2D enabled heterogeneous networks, we have surveyed different conventional and artificial intelligence based interference mitigation and resource allocation schemes developed in recent years. Our contribution lies in the analysis of conventional interference mitigation techniques and their shortcomings. Finally, the strengths of AI based techniques are determined and open research challenges deduced from the recent research are presented.
We introduce a novel type of text representation that preserves the 2D layout of a document. This is achieved by encoding each document page as a two-dimensional grid of characters. Based on this representation, we present a generic document understanding pipeline for structured documents. This pipeline makes use of a fully convolutional encoder-decoder network that predicts a segmentation mask and bounding boxes. We demonstrate its capabilities on an information extraction task from invoices and show that it significantly outperforms approaches based on sequential text or document images.
With increasing amounts of visual data being created in the form of videos and images, visual data selection and summarization are becoming ever increasing problems. We present Vis-DSS, an open-source toolkit for Visual Data Selection and Summarization. Vis-DSS implements a framework of models for summarization and data subset selection using submodular functions, which are becoming increasingly popular today for these problems. We present several classes of models, capturing notions of diversity, coverage, representation and importance, along with optimization/inference and learning algorithms. Vis-DSS is the first open source toolkit for several Data selection and summarization tasks including Image Collection Summarization, Video Summarization, Training Data selection for Classification and Diversified Active Learning. We demonstrate state-of-the art performance on all these tasks, and also show how we can scale to large problems. Vis-DSS allows easy integration for applications to be built on it, also can serve as a general skeleton that can be extended to several use cases, including video and image sharing platforms for creating GIFs, image montage creation, or as a component to surveillance systems and we demonstrate this by providing a graphical user-interface (GUI) desktop app built over Qt framework. Vis-DSS is available at https://…/vis-dss
We study the time complexity of induced subgraph isomorphism problems where the pattern graph is fixed. The earliest known example of an improvement over trivial algorithms is by Itai and Rodeh (1978) who sped up triangle detection in graphs using fast matrix multiplication. This algorithm was generalized by Ne\v{s}et\v{r}il and Poljak (1985) to speed up detection of k-cliques. Improved algorithms are known for certain small-sized patterns. For example, a linear-time algorithm is known for detecting length-4 paths. In this paper, we give the first pattern detection algorithm that improves upon Ne\v{s}et\v{r}il and Poljak’s algorithm for arbitrarily large pattern graphs (not cliques). The algorithm is obtained by reducing the induced subgraph isomorphism problem to the problem of detecting multilinear terms in constant-degree polynomials. We show that the same technique can be used to reduce the induced subgraph isomorphism problem of many pattern graphs to constructing arithmetic circuits computing homomorphism polynomials of these pattern graphs. Using this, we obtain faster combinatorial algorithms (algorithms that do not use fast matrix multiplication) for k-paths and k-cycles. We also obtain faster algorithms for 5-paths and 5-cycles that match the runtime for triangle detection. We show that these algorithms are expressible using polynomial families that we call graph pattern polynomial families. We then define a notion of reduction among these polynomials that allows us to compare the complexity of various pattern detection problems within this framework. For example, we show that the induced subgraph isomorphism polynomial for any pattern that contains a k-clique is harder than the induced subgraph isomorphism polynomial for k-clique. An analogue of this theorem is not known with respect to general algorithmic hardness.
This investigation aims to study different adaptive fuzzy inference algorithms capable of real-time sequential learning and prediction of time-series data. A brief qualitative description of these algorithms namely meta-cognitive fuzzy inference system (McFIS), sequential adaptive fuzzy inference system (SAFIS) and evolving Takagi-Sugeno (ETS) model provide a comprehensive comparison of their working principle, especially their unique characteristics are discussed. These algorithms are then simulated with dataset collected at one of the academic buildings at Nanyang Technological University, Singapore. The performance are compared by means of the root mean squared error (RMSE) and non-destructive error index (NDEI) of the predicted output. Analysis shows that McFIS shows promising results either with lower RMSE and NDEI or with lower architectural complexity over ETS and SAFIS. Statistical Analysis also reveals the significance of the outcome of these algorithms.
High-throughput data acquisition in synthetic biology leads to an abundance of data that need to be processed and aggregated into useful biological models. Building dynamical models based on this wealth of data is of paramount importance to understand and optimize designs of synthetic biology constructs. However, building models manually for each data set is inconvenient and might become infeasible for highly complex synthetic systems. In this paper, we present state-of-the-art system identification techniques and combine them with chemical reaction network theory (CRNT) to generate dynamic models automatically. On the system identification side, Sparse Bayesian Learning offers methods to learn from data the sparsest set of dictionary functions necessary to capture the dynamics of the system into ODE models; on the CRNT side, building on such sparse ODE models, all possible network structures within a given parameter uncertainty region can be computed. Additionally, the system identification process can be complemented with constraints on the parameters to, for example, enforce stability or non-negativity—thus offering relevant physical constraints over the possible network structures. In this way, the wealth of data can be translated into biologically relevant network structures, which then steers the data acquisition, thereby providing a vital step for closed-loop system identification.
Deep learning architectures have proved versatile in a number of drug discovery applications, including the modelling of in vitro compound activity. While controlling for prediction confidence is essential to increase the trust, interpretability and usefulness of virtual screening models in drug discovery, techniques to estimate the reliability of the predictions generated with deep learning networks remain largely underexplored. Here, we present Deep Confidence, a framework to compute valid and efficient confidence intervals for individual predictions using the deep learning technique Snapshot Ensembling and conformal prediction. Specifically, Deep Confidence generates an ensemble of deep neural networks by recording the network parameters throughout the local minima visited during the optimization phase of a single neural network. This approach serves to derive a set of base learners (i.e., snapshots) with comparable predictive power on average, that will however generate slightly different predictions for a given instance. The variability across base learners and the validation residuals are in turn harnessed to compute confidence intervals using the conformal prediction framework. Using a set of 24 diverse IC50 data sets from ChEMBL 23, we show that Snapshot Ensembles perform on par with Random Forest (RF) and ensembles of independently trained deep neural networks. In addition, we find that the confidence regions predicted using the Deep Confidence framework span a narrower set of values. Overall, Deep Confidence represents a highly versatile error prediction framework that can be applied to any deep learning-based application at no extra computational cost.
Event extraction is of practical utility in natural language processing. In the real world, it is a common phenomenon that multiple events existing in the same sentence, where extracting them are more difficult than extracting a single event. Previous works on modeling the associations between events by sequential modeling methods suffer a lot from the low efficiency in capturing very long-range dependencies. In this paper, we propose a novel Jointly Multiple Events Extraction (JMEE) framework to jointly extract multiple event triggers and arguments by introducing syntactic shortcut arcs to enhance information flow and attention-based graph convolution networks to model graph information. The experiment results demonstrate that our proposed framework achieves competitive results compared with state-of-the-art methods.
The generative learning phase of Autoencoder (AE) and its successor Denosing Autoencoder (DAE) enhances the flexibility of data stream method in exploiting unlabelled samples. Nonetheless, the feasibility of DAE for data stream analytic deserves in-depth study because it characterizes a fixed network capacity which cannot adapt to rapidly changing environments. An automated construction of a denoising autoeconder, namely deep evolving denoising autoencoder (DEVDAN), is proposed in this paper. DEVDAN features an open structure both in the generative phase and in the discriminative phase where input features can be automatically added and discarded on the fly. A network significance (NS) method is formulated in this paper and is derived from the bias-variance concept. This method is capable of estimating the statistical contribution of the network structure and its hidden units which precursors an ideal state to add or prune input features. Furthermore, DEVDAN is free of the problem- specific threshold and works fully in the single-pass learning fashion. The efficacy of DEVDAN is numerically validated using nine non-stationary data stream problems simulated under the prequential test-then-train protocol where DEVDAN is capable of delivering an improvement of classification accuracy to recently published online learning works while having flexibility in the automatic extraction of robust input features and in adapting to rapidly changing environments.
Implicit probabilistic models are models defined naturally in terms of a sampling procedure and often induces a likelihood function that cannot be expressed explicitly. We develop a simple method for estimating parameters in implicit models that does not require knowledge of the form of the likelihood function or any derived quantities, but can be shown to be equivalent to maximizing likelihood under some conditions. Our result holds in the non-asymptotic parametric setting, where both the capacity of the model and the number of data examples are finite. We also demonstrate encouraging experimental results.