Setting the threshold for high throughput detectors: A mathematical approach for ensembles of dynamic, heterogeneous, probabilistic anomaly detectors

Anomaly detection (AD) has garnered ample attention in security research, as such algorithms complement existing signature-based methods but promise detection of never-before-seen attacks. Cyber operations manage a high volume of heterogeneous log data; hence, AD in such operations involves multiple (e.g., per IP, per data type) ensembles of detectors modeling heterogeneous characteristics (e.g., rate, size, type) often with adaptive online models producing alerts in near real time. Because of high data volume, setting the threshold for each detector in such a system is an essential yet underdeveloped configuration issue that, if slightly mistuned, can leave the system useless, either producing a myriad of alerts and flooding downstream systems, or giving none. In this work, we build on the foundations of Ferragut et al. to provide a set of rigorous results for understanding the relationship between threshold values and alert quantities, and we propose an algorithm for setting the threshold in practice. Specifically, we give an algorithm for setting the threshold of multiple, heterogeneous, possibly dynamic detectors completely a priori, in principle. Indeed, if the underlying distribution of the incoming data is known (closely estimated), the algorithm provides provably manageable thresholds. If the distribution is unknown (e.g., has changed over time) our analysis reveals how the model distribution differs from the actual distribution, indicating a period of model refitting is necessary. We provide empirical experiments showing the efficacy of the capability by regulating the alert rate of a system with \approx2,500 adaptive detectors scoring over 1.5M events in 5 hours. Further, we demonstrate on the real network data and detection framework of Harshaw et al. the alternative case, showing how the inability to regulate alerts indicates the detection model is a bad fit to the data.

DPCA: Dimensionality Reduction for Discriminative Analytics of Multiple Large-Scale Datasets

Principal component analysis (PCA) has well-documented merits for data extraction and dimensionality reduction. PCA deals with a single dataset at a time, and it is challenged when it comes to analyzing multiple datasets. Yet in certain setups, one wishes to extract the most significant information of one dataset relative to other datasets. Specifically, the interest may be on identifying, namely extracting features that are specific to a single target dataset but not the others. This paper develops a novel approach for such so-termed discriminative data analysis, and establishes its optimality in the least-squares (LS) sense under suitable data modeling assumptions. The criterion reveals linear combinations of variables by maximizing the ratio of the variance of the target data to that of the remainders. The novel approach solves a generalized eigenvalue problem by performing SVD just once. Numerical tests using synthetic and real datasets showcase the merits of the proposed approach relative to its competing alternatives.

Representation Learning in Large Attributed Graphs

Graphs (networks) are ubiquitous and allow us to model entities (nodes) and the dependencies (edges) between them. Learning a useful feature representation from graph data lies at the heart and success of many machine learning tasks such as classification, anomaly detection, link prediction, among many others. Many existing techniques use random walks as a basis for learning features or estimating the parameters of a graph model for a downstream prediction task. Examples include recent node embedding methods such as DeepWalk, node2vec, as well as graph-based deep learning algorithms. However, the simple random walk used by these methods is fundamentally tied to the identity of the node. This has three main disadvantages. First, these approaches are inherently transductive and do not generalize to unseen nodes and other graphs. Second, they are not space-efficient as a feature vector is learned for each node which is impractical for large graphs. Third, most of these approaches lack support for attributed graphs. To make these methods more generally applicable, we propose a framework based on the notion of attributed random walk that is not tied to node identity and is instead based on learning a function \Phi : \mathrm{\rm \bf x} \rightarrow w that maps a node attribute vector \mathrm{\rm \bf x} to a type w. This framework serves as a basis for generalizing existing methods such as DeepWalk, node2vec, and many other previous methods that leverage traditional random walks.

Knowledge Projection for Deep Neural Networks

While deeper and wider neural networks are actively pushing the performance limits of various computer vision and machine learning tasks, they often require large sets of labeled data for effective training and suffer from extremely high computational complexity. In this paper, we will develop a new framework for training deep neural networks on datasets with limited labeled samples using cross-network knowledge projection which is able to improve the network performance while reducing the overall computational complexity significantly. Specifically, a large pre-trained teacher network is used to observe samples from the training data. A projection matrix is learned to project this teacher-level knowledge and its visual representations from an intermediate layer of the teacher network to an intermediate layer of a thinner and faster student network to guide and regulate its training process. Both the intermediate layers from the teacher network and the injection layers from the student network are adaptively selected during training by evaluating a joint loss function in an iterative manner. This knowledge projection framework allows us to use crucial knowledge learned by large networks to guide the training of thinner student networks, avoiding over-fitting, achieving better network performance, and significantly reducing the complexity. Extensive experimental results on benchmark datasets have demonstrated that our proposed knowledge projection approach outperforms existing methods, improving accuracy by up to 4% while reducing network complexity by 4 to 10 times, which is very attractive for practical applications of deep neural networks.

InterpNET: Neural Introspection for Interpretable Deep Learning

Humans are able to explain their reasoning. On the contrary, deep neural networks are not. This paper attempts to bridge this gap by introducing a new way to design interpretable neural networks for classification, inspired by physiological evidence of the human visual system’s inner-workings. This paper proposes a neural network design paradigm, termed InterpNET, which can be combined with any existing classification architecture to generate natural language explanations of the classifications. The success of the module relies on the assumption that the network’s computation and reasoning is represented in its internal layer activations. While in principle InterpNET could be applied to any existing classification architecture, it is evaluated via an image classification and explanation task. Experiments on a CUB bird classification and explanation dataset show qualitatively and quantitatively that the model is able to generate high-quality explanations. While the current state-of-the-art METEOR score on this dataset is 29.2, InterpNET achieves a much higher METEOR score of 37.9.

Maximum Principle Based Algorithms for Deep Learning

The continuous dynamical system approach to deep learning is explored in order to devise alternative frameworks for training algorithms. Training is recast as a control problem and this allows us to formulate necessary optimality conditions in continuous time using the Pontryagin’s maximum principle (PMP). A modification of the method of successive approximations is then used to solve the PMP, giving rise to an alternative training algorithm for deep learning. This approach has the advantage that rigorous error estimates and convergence results can be established. We also show that it may avoid some pitfalls of gradient-based methods, such as slow convergence on flat landscapes near saddle points. Furthermore, we demonstrate that it obtains favorable initial convergence rate per-iteration, provided Hamiltonian maximization can be efficiently carried out – a step which is still in need of improvement. Overall, the approach opens up new avenues to attack problems associated with deep learning, such as trapping in slow manifolds and inapplicability of gradient-based methods for discrete trainable variables.

Context-Aware Generative Adversarial Privacy

Preserving the utility of published datasets, while providing provable privacy guarantees, is a well-known challenge. On the one hand, context-free privacy solutions, such as differential privacy, provide strong privacy guarantees, but often lead to a significant reduction in utility. On the other hand, context-aware privacy solutions, such as information theoretic privacy, achieve an improved privacy-utility tradeoff, but assume that the data holder has access to the dataset’s statistics. To circumvent this problem, we present a novel context-aware privacy framework called generative adversarial privacy (GAP). GAP leverages recent advancements in generative adversarial networks (GANs) to allow the data holder to learn ‘optimal’ privatization schemes from the dataset itself. Under GAP, learning the privacy mechanism is formulated as a constrained minimax game between two players: a privatizer that sanitizes the dataset in a way that limits the risk of inference attacks on the individuals’ private variables, and an adversary that tries to infer the private variables from the sanitized dataset. To evaluate GAP’s performance, we investigate two simple (yet canonical) statistical dataset models: (a) the binary data model, and (b) the binary Gaussian mixture model. For both models, we derive game-theoretically optimal minimax privacy mechanisms, and show that the privacy mechanisms learned from data (in an iterative generative adversarial fashion) match the theoretically optimal ones. This demonstrates that our framework can be easily applied in practice, even in the absence of dataset statistics.

Artifact reduction for separable non-local means

It was recently demonstrated [J. Electron. Imaging, 25(2), 2016] that one can perform fast non-local means (NLM) denoising of one-dimensional signals using a method called lifting. The cost of lifting is independent of the patch length, which dramatically reduces the run-time for large patches. Unfortunately, it is difficult to directly extend lifting for non-local means denoising of images. To bypass this, the authors proposed a separable approximation in which the image rows and columns are filtered using lifting. The overall algorithm is significantly faster than NLM, and the results are comparable in terms of PSNR. However, the separable processing often produces vertical and horizontal stripes in the image. This problem was previously addressed by using a bilateral filter-based post-smoothing, which was effective in removing some of the stripes. In this letter, we demonstrate that stripes can be mitigated in the first place simply by involving the neighboring rows (or columns) in the filtering. In other words, we use a two-dimensional search (similar to NLM), while still using one-dimensional patches (as in the previous proposal). The novelty is in the observation that one can use lifting for performing two-dimensional searches. The proposed approach produces artifact-free images, whose quality and PSNR are comparable to NLM, while being significantly faster.

Big Data Classification Using Augmented Decision Trees

We present an algorithm for classification tasks on big data. Experiments conducted as part of this study indicate that the algorithm can be as accurate as ensemble methods such as random forests or gradient boosted trees. Unlike ensemble methods, the models produced by the algorithm can be easily interpreted. The algorithm is based on a divide and conquer strategy and consists of two steps. The first step consists of using a decision tree to segment the large dataset. By construction, decision trees attempt to create homogeneous class distributions in their leaf nodes. However, non-homogeneous leaf nodes are usually produced. The second step of the algorithm consists of using a suitable classifier to determine the class labels for the non-homogeneous leaf nodes. The decision tree segment provides a coarse segment profile while the leaf level classifier can provide information about the attributes that affect the label within a segment.

Causal Inference for a Single Group of Causally-Connected Units Under Stratified Interference

The assumption that no subject’s exposure affects another subject’s outcome, known as the assumption of no interference, has long held a foundational position in the study of causal inference. However, this assumption may be violated in many settings, and in recent years has been relaxed considerably. Often this has been achieved with either the aid of knowledge of an underlying network, or the assumption that the population can be partitioned into separate groups, between which there is no interference, and within which each subject’s outcome may be affected by all the other subjects in the group, but only as a function of the total number of subjects exposed (the stratified interference assumption). In this paper, we consider a setting in which we can rely on neither of these aids, as each subject affects every other subject’s outcome. In particular, we consider settings in which the stratified interference assumption is reasonable for a single group consisting of the entire sample, i.e., a subject’s outcome is affected by all other subjects’ exposures, but only via the total number of subjects exposed. This can occur when the exposure is a shared resource whose efficacy is modified by the number of subjects among whom it is shared. We present a doubly-robust estimator that allows for incorporation of machine learning, and tools for inference for a class of causal parameters that includes direct effects and overall effects under certain interventions. We conduct a simulation study, and present results from a data application where we study the effect of a nurse-based triage system on the outcomes of patients receiving HIV care in Kenyan health clinics.

ALL-IN-1: Short Text Classification with One Model for All Languages

We present ALL-IN-1, a simple model for multilingual text classification that does not require any parallel data. It is based on a traditional Support Vector Machine classifier exploiting multilingual word embeddings and character n-grams. Our model is simple, easily extendable yet very effective, overall ranking 1st (out of 12 teams) in the IJCNLP 2017 shared task on customer feedback analysis in four languages: English, French, Japanese and Spanish.

Distributed Spatial Data Clustering as a New Approach for Big Data Analysis

In this paper we propose a new approach for Big Data mining and analysis. This new approach works well on distributed datasets and deals with data clustering task of the analysis. The approach consists of two main phases, the first phase executes a clustering algorithm on local data, assuming that the datasets was already distributed among the system processing nodes. The second phase deals with the local clusters aggregation to generate global clusters. This approach not only generates local clusters on each processing node in parallel, but also facilitates the formation of global clusters without prior knowledge of the number of the clusters, which many partitioning clustering algorithm require. In this study, this approach was applied on spatial datasets. The proposed aggregation phase is very efficient and does not involve the exchange of large amounts of data between the processing nodes. The experimental results show that the approach has super linear speed up, scales up very well, and can take advantage of the recent programming models, such as MapReduce model, as its results are not affected by the types of communications.

Segment Parameter Labelling in MCMC Mean-Shift Change Detection

This work addresses the problem of segmentation in time series data with respect to a statistical parameter of interest in Bayesian models. It is common to assume that the parameters are distinct within each segment. As such, many Bayesian change point detection models do not exploit the segment parameter patterns, which can improve performance. This work proposes a Bayesian mean-shift change point detection algorithm that makes use of repetition in segment parameters, by introducing segment class labels that utilise a Dirichlet process prior. The performance of the proposed approach was assessed on both synthetic and real world data, highlighting the enhanced performance when using parameter labelling.

Sparse Diffusion-Convolutional Neural Networks

The predictive power and overall computational efficiency of Diffusion-convolutional neural networks make them an attractive choice for node classification tasks. However, a naive dense-tensor-based implementation of DCNNs leads to \mathcal{O}(N^2) memory complexity which is prohibitive for large graphs. In this paper, we introduce a simple method for thresholding input graphs that provably reduces memory requirements of DCNNs to O(N) (i.e. linear in the number of nodes in the input) without significantly affecting predictive performance.

Time-dependent variational principle in matrix-product state manifolds: pitfalls and potential
Dynamical potentials for non-equilibrium quantum many-body phases
Line Formation by Fat Robots under Limited Visibility
Trapped-ion quantum simulation of excitation transport: disordered, noisy, and long-range connected quantum networks
mixup: Beyond Empirical Risk Minimization
Cross-identification of stellar catalogs with multiple stars: Complexity and Resolution
Uniform Circle Formation by Transparent Fat Robots
A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares)
Benefits of Depth for Long-Term Memory of Recurrent Networks
Chromatic numbers of stable Kneser hypergraphs via topological Tverberg-type theorems
Malware Detection by Eating a Whole EXE
High Five: Improving Gesture Recognition by Embracing Uncertainty
General Bayesian Inference over the Stiefel Manifold via the Givens Transform
Stochastic Non-convex Optimization with Strong High Probability Second-order Convergence
Millimeter Wave MIMO Prototype: Measurements and Experimental Results
Spatial Field estimation from Samples taken at Unknown Locations generated by an Unknown Autoregressive Process
On Bandlimited Spatiotemporal Field Sampling with Location and Time Unaware Mobile Sensors
Optimal Mechanism Design with Flexible Consumers and Costly Supply
Comparing Dushnik-Miller Dimension, Boolean Dimension and Local Dimension
Triangular fractal approximating graphs and their covering paths and cycles
Fair division with multiple pieces
Additive Matrix Convolutions of Pólya Ensembles and Polynomial Ensembles
Multimodal Probabilistic Model-Based Planning for Human-Robot Interaction
A Differential Evaluation Markov Chain Monte Carlo algorithm for Bayesian Model Updating
Slow and Long-ranged Dynamical Heterogeneities in Dissipative Fluids
Complete 3D Scene Parsing from Single RGBD Image
Piercing Numbers in Approval Voting
Compressive sensing and truncated moment problems on spheres
Free deterministic equivalent Z-scores of compound Wishart models: A goodness of fit test of 2DARMA models
Evolutionary games under incompetence
Barriers for Rank Methods in Arithmetic Complexity
Conjectures for Ehrhart $h^*$-vectors of Hypersimplices and Dilated Simplices
Reparameterizing the Birkhoff Polytope for Variational Permutation Inference
Eigenvalue location in graphs of small clique-width
Soft Methodology for Cost-and-error Sensitive Classification
Vertex-primitive $s$-arc-transitive digraphs of linear groups
Online learning in optical tomography: a stochastic approach
Laplacian Prior Variational Automatic Relevance Determination for Transmission Tomography
On Graph Isomorphism Problem
Rotational Unit of Memory
Wong–Zakai Approximations of Stochastic Allen-Cahn Equation
A Generative Model for Volume Rendering
LPG: a four-groups probabilistic approach to leveraging pleiotropy in genome-wide association studies
Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior
Duality-free Methods for Stochastic Composition Optimization
Nonlinear Feedback Shift Registers and Zech Logarithms
From global scaling to the dynamics of individual cities
Biologically Inspired Feedforward Supervised Learning for Deep Self-Organizing Map Networks
A Note on Parallel Asynchronous Channels with Arbitrary Skews
Stochastic Wiener Filter in the White Noise Space
Tests for the weights of the global minimum variance portfolio in a high-dimensional setting
Quantum versus Classical Online Algorithms with Advice and Logarithmic Space
Watch Your Step: Learning Graph Embeddings Through Attention
Simple Distributed Graph Clustering using Modularity and Map Equation
Optimal Input Design for Parameter Estimation in AR(1) with Dependent Stationary Noise
Streaming Small-Footprint Keyword Spotting using Sequence-to-Sequence Models
Optimal control of a Vlasov-Poisson plasma by fixed magnetic field coils
SRE: Semantic Rules Engine For the Industrial Internet-Of-Things Gateways
Construction of optimal locally repairable codes via automorphism groups of rational function fields
Optimal survival strategy for branching Brownian motion in a Poissonian trap field
Hopf bifurcation with additive noise
Protograph LDPC Codes with Block Thresholds: Extension to Degree-1 and Generalized Nodes
Weighted variants of the Andrásfai-Erdős-Sós Theorem
Markovian Maximal Coupling of Markov Processes
Cointegration in continuous time for factor models
A Fast Algorithm for Solving Henderson’s Mixed Model Equation
PDE-Net: Learning PDEs from Data
Spitzer’s identity for discrete random walks
Improved Workflow for Unsupervised Multiphase Image Segmentation
1-skeletons of the spanning tree problems with additional constraints
Class Correlation affects Single Object Localization using Pre-trained ConvNets
Quantifying Transient Spreading Dynamics on Networks
On the minimum of the mean-squared error in 2-means clustering
Inserting rim-hooks into reverse plane partitions
On the Ubiquity of Information Inconsistency for Conjugate Priors
Learning Approximate Stochastic Transition Models
Lengths of closed geodesics on random surfaces of large genus
Testing for long memory in panel random-coefficient AR(1) data
Many-Body Localization and Level Repulsion
New Approach to General Nonlinear Discrete-Time Stochastic $H_\infty$ Control
Impact of Coreference Resolution on Slot Filling
Time-Division Transmission is Optimal for Covert Communication over Broadcast Channels
Deep Spatial Regression Model for Image Crowd Counting
Directional Metropolis-Hastings
How to Fool Radiologists with Generative Adversarial Networks? A Visual Turing Test for Lung Cancer Diagnosis
A Central Limit Theorem for Wasserstein type distances between two different laws
Impact of Random Receiver Orientation on Visible Light Communications Channel
Gale-Robinson quivers: from representations to combinatorial formulas
Meta Learning Shared Hierarchies
From Distance Correlation to Multiscale Generalized Correlation
Exit time asymptotics for small noise stochastic delay differential equations
Deep Multi-Modal Classification of Intraductal Papillary Mucinous Neoplasms (IPMN) with Canonical Correlation Analysis
Interactions of Computational Complexity Theory and Mathematics
Mutation frequencies in a birth-death branching process
Offset-Based Beamforming: A New Approach to Robust Downlink Transmission
Optimal Shrinkage of Singular Values Under Random Data Contamination
FashionBrain Project: A Vision for Understanding Europe’s Fashion Data Universe
Statistical Inference on Tree Swallow Migrations
A strong collapse increasing the geometric simplicial Lusternik-Schnirelmann category
Synchronization Strings: Explicit Constructions, Local Decoding, and Applications
Interference Queuing Networks on Grids
Lip2AudSpec: Speech reconstruction from silent lip movements video
Improving Negative Sampling for Word Representation using Self-embedded Features
Minimum Circuit Size, Graph Isomorphism, and Related Problems
Joint Screening Tests for LASSO
On the generalized Douglas-Rachford algorithm for feasibility problems
Spiking Optical Flow for Event-based Sensors Using IBM’s TrueNorth Neurosynaptic System
Cell Line Classification Using Electric Cell-substrate Impedance Sensing (ECIS)
Klout Topics for Modeling Interests and Expertise of Users Across Social Networks
On the role of synaptic stochasticity in training low-precision neural networks
Dynamic Routing Between Capsules