GraphPrints: Towards a Graph Analytic Method for Network Anomaly Detection

This paper introduces GraphPrints, a novel graph-analytic approach for detecting anomalies in network flow data. Building on foundational network-mining techniques, our method represents time slices of traffic as a graph, then counts graphlets, small induced subgraphs that describe local topology. Outlier detection on the sequence of graphlet counts identifies anomalous intervals of traffic and, furthermore, singles out individual IPs experiencing abnormal behavior. Initial testing of GraphPrints is performed on real network data with an implanted anomaly. Evaluation shows false positive rates bounded by 2.84% at the time-interval level and 0.05% at the IP level, with 100% true positive rates at both.
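
The pipeline described in this abstract (graph per time slice, graphlet counts, outlier detection) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say which graphlets or which outlier detector are used, so the sketch counts only undirected 3-node graphlets and scores windows with a simple median/MAD rule. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from statistics import median


def build_graph(flows):
    """Build undirected adjacency sets from (src_ip, dst_ip) flow pairs."""
    adj = defaultdict(set)
    for src, dst in flows:
        if src != dst:
            adj[src].add(dst)
            adj[dst].add(src)
    return adj


def count_3node_graphlets(adj):
    """Return (open_wedges, triangles), both counted as induced subgraphs."""
    centered_pairs = 0  # neighbor pairs around each node, closed or not
    closed_pairs = 0    # neighbor pairs that are themselves connected
    for node, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            centered_pairs += 1
            if b in adj.get(a, ()):
                closed_pairs += 1
    triangles = closed_pairs // 3  # each triangle is seen once per corner
    open_wedges = centered_pairs - closed_pairs
    return open_wedges, triangles


def robust_scores(vectors):
    """Score each window by its largest |x - median| / MAD over all features."""
    columns = list(zip(*vectors))
    medians = [median(col) for col in columns]
    mads = [median(abs(x - m) for x in col) for col, m in zip(columns, medians)]
    scores = []
    for vec in vectors:
        score = 0.0
        for x, m, d in zip(vec, medians, mads):
            deviation = abs(x - m)
            if deviation == 0:
                continue
            # A zero MAD means the baseline never varies, so any deviation is extreme.
            score = max(score, deviation / d if d > 0 else float("inf"))
        scores.append(score)
    return scores


def flag_windows(windows, threshold=5.0):
    """Return indices of time windows whose graphlet counts look anomalous."""
    vectors = [count_3node_graphlets(build_graph(flows)) for flows in windows]
    return [i for i, s in enumerate(robust_scores(vectors)) if s > threshold]
```

As a usage example, four identical windows followed by one in which a single host contacts ten new hosts (a scan-like pattern) lead `flag_windows` to report only the last window, because the open-wedge count jumps while the triangle count stays the same.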


Interactive algorithms: From pool to stream

We consider interactive algorithms in the pool-based setting and in the stream-based setting. Interactive algorithms observe suggested elements (representing actions or queries), interactively select some of them, and receive responses. Stream-based algorithms are not allowed to select suggested elements after more elements have been observed, while pool-based algorithms can select elements in any order. We assume that the available elements are generated independently according to some distribution, and design stream-based algorithms that emulate black-box pool-based interactive algorithms. We provide two such emulating algorithms. The first algorithm can emulate any pool-based algorithm, but the number of suggested elements that need to be observed might be exponential in the number of selected elements. The second algorithm applies to the class of utility-based interactive algorithms, and the number of suggested elements that it observes is linear in the number of selected elements. For the case of utility-based emulation, we also provide a lower bound showing that near-linearity is necessary.


High-Dimensional Regularized Discriminant Analysis

Friedman proposed the popular regularized discriminant analysis (RDA) classifier that utilizes a biased covariance-matrix estimator that partially pools the sample covariance matrices from linear and quadratic discriminant analysis and shrinks the resulting estimator towards a scaled identity matrix. The RDA classifier’s two tuning parameters are typically estimated via a computationally burdensome cross-validation procedure that uses a grid search. We formulate a new RDA-based classifier for the small-sample, high-dimensional setting and then show that the classification decision rule is equivalent to a classifier in a subspace having a much lower dimension. As a result, the utilization of the dimension-reduction step yields a substantial reduction in computation during model selection. Also, our parameterization offers interpretability that was previously lacking with the RDA classifier. We demonstrate that our proposed classifier is often superior to several recently proposed sparse and regularized classifiers in terms of classification accuracy with three artificial and six real high-dimensional data sets. Finally, we provide an implementation of our proposed classifier in the sparsediscrim R package, which is available on CRAN.
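
For orientation, the pooling-and-shrinkage estimator attributed to Friedman above is commonly written in the following simplified form. The notation is an assumption introduced here, not taken from the abstract: $\hat{\Sigma}_k$ is the sample covariance of class $k$, $\hat{\Sigma}$ the pooled sample covariance, $p$ the number of features, and $\lambda, \gamma \in [0, 1]$ the two tuning parameters. Friedman's original definition additionally weights the pooling step by class sample sizes, which is omitted in this sketch.

```latex
\begin{align}
  \hat{\Sigma}_k(\lambda)
    &= (1 - \lambda)\,\hat{\Sigma}_k + \lambda\,\hat{\Sigma}, \\
  \hat{\Sigma}_k(\lambda, \gamma)
    &= (1 - \gamma)\,\hat{\Sigma}_k(\lambda)
       + \gamma\,\frac{\operatorname{tr}\!\bigl(\hat{\Sigma}_k(\lambda)\bigr)}{p}\, I_p .
\end{align}
```

In this form, $\lambda = 0, \gamma = 0$ gives the quadratic discriminant analysis covariance, $\lambda = 1, \gamma = 0$ gives the linear discriminant analysis covariance, and $\gamma = 1$ gives a scaled identity matrix.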


Using Hadoop for Large Scale Analysis on Twitter: A Technical Report

Sentiment analysis (or opinion mining) on Twitter data has attracted much attention recently. One of Twitter's key features is the immediacy of communication with other users in an easy, user-friendly and fast way. Consequently, people tend to express their feelings freely, which makes Twitter an ideal source for accumulating a vast amount of opinions on a wide diversity of topics. This amount of information offers huge potential and can be harnessed to determine the sentiment tendency towards these topics. However, since no one can invest an infinite amount of time to read through these tweets, an automated decision-making approach is necessary. Most existing solutions, however, are limited to centralized environments and can therefore process at most a few thousand tweets. Given the massive number of tweets published daily, such a sample is not representative enough to define the sentiment polarity towards a topic. In this paper, we go one step further and develop a novel method for sentiment learning in the MapReduce framework. Our algorithm exploits the hashtags and emoticons inside a tweet as sentiment labels, and classifies diverse sentiment types in a parallel and distributed manner. Moreover, we utilize Bloom filters to compact the storage size of intermediate data and boost the performance of our algorithm. Through an extensive experimental evaluation, we demonstrate that our solution is efficient, robust and scalable, and confirm the quality of our sentiment identification.
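
To make the Bloom filter idea concrete, here is a minimal, generic sketch of the data structure. It is not the authors' MapReduce implementation: the class name, the default sizes, and the use of seeded SHA-256 digests as hash functions are all assumptions chosen for illustration. A Bloom filter stores set membership in a fixed-size bit array, so it never reports a false negative but may report a false positive.

```python
import hashlib


class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, possible false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # one bit per slot, packed

    def _positions(self, item):
        """Yield one bit position per hash function for the given string."""
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            # Prefixing a seed byte turns one hash into several (assumed scheme).
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    seen = BloomFilter()
    for token in ["#happy", ":)", "#fail"]:
        seen.add(token)
    print("#happy" in seen)  # always True for items that were added
```

The storage cost is `num_bits / 8` bytes regardless of how many items are added, which is the property that makes the structure attractive for compacting intermediate data; the trade-off is a false positive rate that grows as the filter fills.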


Canary: A Scheduling Architecture for High Performance Cloud Computing

We present Canary, a scheduling architecture that allows high performance analytics workloads to scale out to run on thousands of cores. Canary is motivated by the observation that a central scheduler is a bottleneck for high performance codes: a handful of multicore workers can execute tasks faster than a controller can schedule them. The key insight in Canary is to reverse the responsibilities between controllers and workers. Rather than dispatch tasks to workers, which then fetch data as necessary, in Canary the controller assigns data partitions to workers, which then spawn and schedule tasks locally. We evaluate three benchmark applications in Canary on up to 64 servers and 1,152 cores on Amazon EC2. Canary achieves up to 9-90X speedup over Spark and up to 4X speedup over GraphX, a highly optimized graph analytics engine. While current centralized schedulers can schedule 2,500 tasks/second, each Canary worker can schedule 136,000 tasks/second per core and experiments show this scales out linearly, with 64 workers scheduling over 120 million tasks per second, allowing Canary to support optimized jobs running on thousands of cores.


Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-faith Online Discussions

Do Cascades Recur?

On the existence of shadow prices for optimal investment with random endowment

Efficient Index for Weighted Sequences

On the Nyström and Column-Sampling Methods for the Approximate Principal Components Analysis of Large Data Sets

A Combinatorial Approach to the Symmetry of $q,t$-Catalan Numbers

A Dual Embedding Space Model for Document Ranking

High-dimensional variable selection via penalized credible regions with global-local shrinkage priors

Extreme values of the stationary distribution of random walks on directed graphs

Single-Solution Hypervolume Maximization and its use for Improving Generalization of Neural Networks

Learning Discriminative Features via Label Consistent Neural Network

Regression with network cohesion

Principal stratification analysis using principal scores

k-variates++: more pluses in the k-means++

denoiseR: A Package for Low Rank Matrix Estimation

Spatial Concept Acquisition for a Mobile Robot that Integrates Self-Localization and Unsupervised Word Discovery from Spoken Sentences

A Fractional Micro-Macro Model for Crowds of Pedestrians based on Fractional Mean Field Games

Maximum leave-one-out likelihood estimation for location parameter of unbounded densities

Maximal $m$-distance sets containing the representation of the Hamming graph $H(n,m)$

The leading term of the Yang-Mills free energy

On the Distribution of the Number of Goldbach Partitions of a Randomly Chosen Positive Even Integer

Probabilistic Trace and Poisson Summation Formulae on Locally Compact Abelian Groups

A computationally efficient nonparametric approach for changepoint detection

Hidden Regular Variation under Full and Strong Asymptotic Dependence

Frequentistic approximations to Bayesian prevision of exchangeable random elements

On random partitions induced by random maps

Sample path properties of multivariate operator-self-similar stable random fields

Limit theorems for number of edges in the generalized random graphs with random vertex weights

Dominating Sets in Circulant Graphs

An application of a functional inequality to quasi-invariance in infinite dimensions

How proofs are prepared at Camelot

Graphs with Large Girth are b-continuous

On the interplay between embedded graphs and delta-matroids

Continuity of the Feynman-Kac formula for a generalized parabolic equation

A continuum among logarithmic, linear, and exponential functions, and its potential to improve generalization in neural networks

Biclustering Readings and Manuscripts via Non-negative Matrix Factorization, with Application to the Text of Jude

On the large time behaviour of the solution of an SDE driven by a Poisson Point Process

Geometry of infinite planar maps with high degrees

Perfect (super) edge-magic crowns

Plurality Consensus via Shuffling: Lessons Learned from Load Balancing

A Probabilistic Modeling Approach to Hearing Loss Compensation

Universality of causal graph dynamics

Near-Optimality of Linear Recovery in Gaussian Observation Scheme under $\|\cdot\|_2^2$-Loss

Sensory evaluation of commercial coffee brands in Colombia

Analysis of generalized negative binomial distributions attached to hyperbolic Landau levels

Inv-ASKIT: A Parallel Fast Direct Solver for Kernel Matrices

The structure of large intersecting families

Making Walks Count: From Silent Circles to Hamiltonian Cycles

Finding the different patterns in buildings data using bag of words representation with clustering

A Kronecker-factored approximate Fisher matrix for convolution layers

A General Framework for Fast Image Deconvolution with Incomplete Observations. Applications to Unknown Boundaries, Inpainting, Superresolution, and Demosaicing

An SSD-based eigensolver for spectral analysis on billion-node graphs

‘Draw My Topics’: Find Desired Topics fast from large scale of Corpus

Sequential Bayesian Analysis of Multivariate Poisson Count Data

Decoherence of a quantum two-level system by spectral diffusion

A matrix model for random nilpotent groups