Stealing Hyperparameters in Machine Learning

Hyperparameters are critical in machine learning, as different hyperparameters often result in models with significantly different performance. Hyperparameters may be deemed confidential because of their commercial value and the confidentiality of the proprietary algorithms that the learner uses to learn them. In this work, we propose attacks on stealing the hyperparameters that are learned by a learner. We call our attacks hyperparameter stealing attacks. Our attacks are applicable to a variety of popular machine learning algorithms such as ridge regression, logistic regression, support vector machine, and neural network. We evaluate the effectiveness of our attacks both theoretically and empirically. For instance, we evaluate our attacks on Amazon Machine Learning. Our results demonstrate that our attacks can accurately steal hyperparameters. We also study countermeasures. Our results highlight the need for new defenses against our hyperparameter stealing attacks for certain machine learning algorithms.

High Dimensional Bayesian Optimization Using Dropout

Scaling Bayesian optimization to high dimensions is challenging task as the global optimization of high-dimensional acquisition function can be expensive and often infeasible. Existing methods depend either on limited active variables or the additive form of the objective function. We propose a new method for high-dimensional Bayesian optimization, that uses a dropout strategy to optimize only a subset of variables at each iteration. We derive theoretical bounds for the regret and show how it can inform the derivation of our algorithm. We demonstrate the efficacy of our algorithms for optimization on two benchmark functions and two real-world applications- training cascade classifiers and optimizing alloy composition.

‘Dependency Bottleneck’ in Auto-encoding Architectures: an Empirical Study

Recent works investigated the generalization properties in deep neural networks (DNNs) by studying the Information Bottleneck in DNNs. However, the measurement of the mutual information (MI) is often inaccurate due to the density estimation. To address this issue, we propose to measure the dependency instead of MI between layers in DNNs. Specifically, we propose to use Hilbert-Schmidt Independence Criterion (HSIC) as the dependency measure, which can measure the dependence of two random variables without estimating probability densities. Moreover, HSIC is a special case of the Squared-loss Mutual Information (SMI). In the experiment, we empirically evaluate the generalization property using HSIC in both the reconstruction and prediction auto-encoding (AE) architectures.

CADDeLaG: Framework for distributed anomaly detection in large dense graph sequences

Random walk based distance measures for graphs such as commute-time distance are useful in a variety of graph algorithms, such as clustering, anomaly detection, and creating low dimensional embeddings. Since such measures hinge on the spectral decomposition of the graph, the computation becomes a bottleneck for large graphs and do not scale easily to graphs that cannot be loaded in memory. Most existing graph mining libraries for large graphs either resort to sampling or exploit the sparsity structure of such graphs for spectral analysis. However, such methods do not work for dense graphs constructed for studying pairwise relationships among entities in a data set. Examples of such studies include analyzing pairwise locations in gridded climate data for discovering long distance climate phenomena. These graphs representations are fully connected by construction and cannot be sparsified without loss of meaningful information. In this paper we describe CADDeLaG, a framework for scalable computation of commute-time distance based anomaly detection in large dense graphs without the need to load the entire graph in memory. The framework relies on Apache Spark’s memory-centric cluster-computing infrastructure and consists of two building blocks: a decomposable algorithm for commute time distance computation and a distributed linear system solver. We illustrate the scalability of CADDeLaG and its dependency on various factors using both synthetic and real world data sets. We demonstrate the usefulness of CADDeLaG in identifying anomalies in a climate graph sequence, that have been historically missed due to ad hoc graph sparsification and on an election donation data set.

Mean Field Multi-Agent Reinforcement Learning

Existing multi-agent reinforcement learning methods are limited typically to a small number of agents. When the agent number increases largely, the learning becomes intractable due to the curse of the dimensionality and the exponential growth of user interactions. In this paper, we present Mean Field Reinforcement Learning where the interactions within the population of agents are approximated by those between a single agent and the average effect from the overall population or neighboring agents; the interplay between the two entities is mutually reinforced: the learning of the individual agent’s optimal policy depends on the dynamics of the population, while the dynamics of the population change according to the collective patterns of the individual policies. We develop practical mean field Q-learning and mean field Actor-Critic algorithms and analyze the convergence of the solution. Experiments on resource allocation, Ising model estimation, and battle game tasks verify the learning effectiveness of our mean field approaches in handling many-agent interactions in population.

History PCA: A New Algorithm for Streaming PCA

In this paper we propose a new algorithm for streaming principal component analysis. With limited memory, small devices cannot store all the samples in the high-dimensional regime. Streaming principal component analysis aims to find the k-dimensional subspace which can explain the most variation of the d-dimensional data points that come into memory sequentially. In order to deal with large d and large N (number of samples), most streaming PCA algorithms update the current model using only the incoming sample and then dump the information right away to save memory. However the information contained in previously streamed data could be useful. Motivated by this idea, we develop a new streaming PCA algorithm called History PCA that achieves this goal. By using O(Bd) memory with B\approx 10 being the block size, our algorithm converges much faster than existing streaming PCA algorithms. By changing the number of inner iterations, the memory usage can be further reduced to O(d) while maintaining a comparable convergence speed. We provide theoretical guarantees for the convergence of our algorithm along with the rate of convergence. We also demonstrate on synthetic and real world data sets that our algorithm compares favorably with other state-of-the-art streaming PCA methods in terms of the convergence speed and performance.

Black Hole Metric: Overcoming the PageRank Normalization Problem

In network science, there is often the need to sort the graph nodes. While the sorting strategy may be different, in general sorting is performed by exploiting the network structure. In particular, the metric PageRank has been used in the past decade in different ways to produce a ranking based on how many neighbors point to a specific node. PageRank is simple, easy to compute and effective in many applications, however it comes with a price: as PageRank is an application of the random walker, the arc weights need to be normalized. This normalization, while necessary, introduces a series of unwanted side-effects. In this paper, we propose a generalization of PageRank named Black Hole Metric which mitigates the problem. We devise a scenario in which the side-effects are particularily impactful on the ranking, test the new metric in both real and synthetic networks, and show the results.

Admissible Time Series Motif Discovery with Missing Data

The discovery of time series motifs has emerged as one of the most useful primitives in time series data mining. Researchers have shown its utility for exploratory data mining, summarization, visualization, segmentation, classification, clustering, and rule discovery. Although there has been more than a decade of extensive research, there is still no technique to allow the discovery of time series motifs in the presence of missing data, despite the well-documented ubiquity of missing data in scientific, industrial, and medical datasets. In this work, we introduce a technique for motif discovery in the presence of missing data. We formally prove that our method is admissible, producing no false negatives. We also show that our method can piggy-back off the fastest known motif discovery method with a small constant factor time/space overhead. We will demonstrate our approach on diverse datasets with varying amounts of missing data

Convolutional Analysis Operator Learning: Acceleration, Convergence, Application, and Neural Networks

Convolutional operator learning is increasingly gaining attention in many signal processing and computer vision applications. Learning kernels has mostly relied on so-called local approaches that extract and store many overlapping patches across training signals. Due to memory demands, local approaches have limitations when learning kernels from large datasets — particularly with multi-layered structures, e.g., convolutional neural network (CNN) — and/or applying the learned kernels to high-dimensional signal recovery problems. The so-called global approach has been studied within the ‘synthesis’ signal model, e.g., convolutional dictionary learning, overcoming the memory problems by careful algorithmic designs. This paper proposes a new convolutional analysis operator learning (CAOL) framework in the global approach, and develops a new convergent Block Proximal Gradient method using a Majorizer (BPG-M) to solve the corresponding block multi-nonconvex problems. To learn diverse filters within the CAOL framework, this paper introduces an orthogonality constraint that enforces a tight-frame (TF) filter condition, and a regularizer that promotes diversity between filters. Numerical experiments show that, for tight majorizers, BPG-M significantly accelerates the CAOL convergence rate compared to the state-of-the-art method, BPG. Numerical experiments for sparse-view computational tomography show that CAOL using TF filters significantly improves reconstruction quality compared to a conventional edge-preserving regularizer. Finally, this paper shows that CAOL can be useful to mathematically model a CNN, and the corresponding updates obtained via BPG-M coincide with core modules of the CNN.

Direct Estimation of Differences in Causal Graphs

We consider the problem of estimating the differences between two causal directed acyclic graph (DAG) models given i.i.d. samples from each model. This is of interest for example in genomics, where large-scale gene expression data is becoming available under different cellular contexts, for different cell types, or disease states. Changes in the structure or edge weights of the underlying causal graphs reflect alterations in the gene regulatory networks and provide important insights into the emergence of a particular phenotype. While the individual networks are usually very large, containing high-degree hub nodes and thus difficult to learn, the overall change between two related networks can be sparse. We here provide the first provably consistent method for directly estimating the differences in a pair of causal DAGs without separately learning two possibly large and dense DAG models and computing their difference. Our two-step algorithm first uses invariance tests between regression coefficients of the two data sets to estimate the skeleton of the difference graph and then orients some of the edges using invariance tests between regression residual variances. We demonstrate the properties of our method through a simulation study and apply it to the analysis of gene expression data from ovarian cancer and during T-cell activation.

DeepMatch: Balancing Deep Covariate Representations for Causal Inference Using Adversarial Training

We study optimal covariate balance for causal inferences from observational data when rich covariates and complex relationships necessitate flexible modeling with neural networks. Standard approaches such as propensity weighting and matching/balancing fail in such settings due to miscalibrated propensity nets and inappropriate covariate representations, respectively. We propose a new method based on adversarial training of a weighting and a discriminator network that effectively addresses this methodological gap. This is demonstrated through new theoretical characterizations of the method as well as empirical results using both fully connected architectures to learn complex relationships and convolutional architectures to handle image confounders, showing how this new method can enable strong causal analyses in these challenging settings.

Event Nugget Detection with Forward-Backward Recurrent Neural Networks

Traditional event detection methods heavily rely on manually engineered rich features. Recent deep learning approaches alleviate this problem by automatic feature engineering. But such efforts, like tradition methods, have so far only focused on single-token event mentions, whereas in practice events can also be a phrase. We instead use forward-backward recurrent neural networks (FBRNNs) to detect events that can be either words or phrases. To the best our knowledge, this is one of the first efforts to handle multi-word events and also the first attempt to use RNNs for event detection. Experimental results demonstrate that FBRNN is competitive with the state-of-the-art methods on the ACE 2005 and the Rich ERE 2015 event detection tasks.

Simulation assisted machine learning

Predicting how a proposed cancer treatment will affect a given tumor can be cast as a machine learning problem, but the complexity of biological systems, the number of potentially relevant genomic and clinical features, and the lack of very large scale patient data repositories make this a unique challenge. ‘Pure data’ approaches to this problem are underpowered to detect combinatorially complex interactions and are bound to uncover false correlations despite statistical precautions taken (1). To investigate this setting, we propose a method to integrate simulations, a strong form of prior knowledge, into machine learning, a combination which to date has been largely unexplored. The results of multiple simulations (under various uncertainty scenarios) are used to compute similarity measures between every pair of samples: sample pairs are given a high similarity score if they behave similarly under a wide range of simulation parameters. These similarity values, rather than the original high dimensional feature data, are used to train kernelized machine learning algorithms such as support vector machines, thus handling the curse-of-dimensionality that typically affects genomic machine learning. Using four synthetic datasets of complex systems–three biological models and one network flow optimization model–we demonstrate that when the number of training samples is small compared to the number of features, the simulation kernel approach dominates over no-prior-knowledge methods. In addition to biology and medicine, this approach should be applicable to other disciplines, such as weather forecasting, financial markets, and agricultural management, where predictive models are sought and informative yet approximate simulations are available. The Python SimKern software, the models (in MATLAB, Octave, and R), and the datasets are made freely available at https://…/SimKern.

Designing Random Graph Models Using Variational Autoencoders With Applications to Chemical Design
Flexible and objective time series analysis: a loss-based approach with two-piece location-scale distributions
Dynamic Fair Division Problem with General Valuations
Stronger generalization bounds for deep nets via a compression approach
Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise
Influential User Subscription on Time-Decaying Social Streams
Learning Deep Disentangled Embeddings with the F-Statistic Loss
Reinforcement Learning from Imperfect Demonstrations
Online Learning for Non-Stationary A/B Tests
500+ Times Faster Than Deep Learning (A Case Study Exploring Faster Methods for Text Mining StackOverflow)
Classifying movie genres by analyzing text reviews
Advancing System Performance with Redundancy: From Biological to Artificial Designs
Graphs with at most two trees in a forest building process
Bootstrap-Assisted Unit Root Testing With Piecewise Locally Stationary Errors
Multimodal Generative Models for Scalable Weakly-Supervised Learning
Two- and Multi-dimensional Curve Fitting using Bayesian Inference
From Gameplay to Symbolic Reasoning: Learning SAT Solver Heuristics in the Style of Alpha(Go) Zero
Spatial Coherence of Oriented White Matter Microstructure: Applications to White Matter Regions Associated with Genetic Similarity
Rescaled Whittaker driven stochastic differential equations converge to the additive stochastic heat equation
Generating dense packings of hard spheres by soft interaction design
Gibbs Partitions, Riemann-Liouville Fractional Operators, Mittag-Leffler Functions, and Fragmentations Derived From Stable Subordinators
The Role of Information Complexity and Randomization in Representation Learning
Economic Dispatch with Adjustable Transformer Ratio and Phase Shifter
Resolution of Conjectures Related to Lights Out! and Cartesian Products
Deep contextualized word representations
Universal Neural Machine Translation for Extremely Low Resource Languages
Covariance Function Pre-Training with m-Kernels for Accelerated Bayesian Optimisation
Input-Aware Auto-Tuning of Compute-Bound HPC Kernels
Improving Retrieval Modeling Using Cross Convolution Networks And Multi Frequency Word Embedding
A Progressive Batching L-BFGS Method for Machine Learning
Active Feature Acquisition with Supervised Matrix Completion
Value-Aware Item Weighting for Long-Tail Recommendation
Deep Learning Based Speech Beamforming
AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation
Fooling OCR Systems with Adversarial Text Images
Shamap: Shape-based Manifold Learning
Strongly connected components-Algorithm for finding the strongly connected components of a graph
Dynamical Galam model
Reducing over-clustering via the powered Chinese restaurant process
Cost-Effective Training of Deep CNNs with Active Model Adaptation
The Observation of Continuous Power Flow Solutions for IEEE-14 Bus System
Competitive caching with machine learned advice
Putting a bug in ML: The moth olfactory network learns to read MNIST
Collision of eigenvalues for matrix-valued processes
Selecting the Best in GANs Family: a Post Selection Inference Framework
NtMalDetect: A Machine Learning Approach to Malware Detection Using Native API System Calls
Teaching Machines to Code: Neural Markup Generation with Visual Attention
Dual Garside structures and Coxeter sortable elements
On Adaptive Cubic Regularized Newton’s Methods for Convex Optimization via Random Sampling
Blind Source Separation with Optimal Transport Non-negative Matrix Factorization
On the Theory of Variance Reduction for Stochastic Gradient Monte Carlo
Improving the Decoding Threshold of Tailbiting Spatially Coupled LDPC Codes by Energy Shaping
Genomics as a Service: a Joint Computing and Networking Perspective
A Weighted Likelihood Approach Based on Statistical Data Depths
Discrepancy-based Evolutionary Diversity Optimization
A range condition for polyconvex variational regularization
Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction
Massivizing Computer Systems: a Vision to Understand, Design, and Engineer Computer Ecosystems through and beyond Modern Distributed Systems
Chromatic symmetric functions via the group algebra of $S_n$
Smooth heaps and a dual view of self-adjusting data structures
Robust and sparse Gaussian graphical modeling under cell-wise contamination
Approximate quantum Markov chains
Evolution of Images with Diversity and Constraints Using a Generator Network
Chaos in Kuramoto Oscillator Networks
On the P vs NP question: a proof of inequality
Optimal control of the customer dynamics based on marketing policy
Grammar-based Compression of Unranked Trees
A (2+1)-dimensional Anisotropic KPZ growth model with a rigid phase
An Operational (Preasymptotic) Measure of Fat-tailedness
Finding small-width connected path decompositions in polynomial time
Extreme points of Gram spectrahedra of binary forms
Learning from a Handful Volumes: MRI Resolution Enhancement with Volumetric Super-Resolution Forests
Deep Learning for Lip Reading using Audio-Visual Information for Urdu Language
Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints
Singular control and optimal stopping of memory mean-field processes
Modelling spatial heterogeneity and discontinuities using Voronoi tessellations
Fine-Grained Complexity of Safety Verification
Semi-Supervised Learning on Graphs Based on Local Label Distributions
Open Information Extraction on Scientific Text: An Evaluation
DR-BiLSTM: Dependent Reading Bidirectional LSTM for Natural Language Inference
Fast Generalized Conditional Gradient Method with Applications to Matrix Recovery Problems
Distributed coloring in sparse graphs with fewer colors
Tools and resources for Romanian text-to-speech and speech-to-text applications
Towards End-to-End Lane Detection: an Instance Segmentation Approach
Prioritized Sweeping Neural DynaQ with Multiple Predecessors, and Hippocampal Replays
On the binomial approximation of the American put
A novel wavelet-based optimal linear quadratic tracker for time-varying systems with multiple delays
Conditioning of three-dimensional generative adversarial networks for pore and reservoir-scale models
An $O(1)$-Approximation Algorithm for Dynamic Weighted Vertex Cover with Soft Capacity
Contributions to the asymptotic study of Hermite driven processes
Speech Emotion Recognition with Data Augmentation and Layer-wise Learning Rate Adjustment
Tilings and matroids on the lattice points of a regular simplex
Nonparametric Bayesian posterior contraction rates for scalar diffusions with high-frequency data
cGANs with Projection Discriminator
Reliable Uncertain Evidence Modeling in Bayesian Networks by Credal Networks
Gradient Boosting With Piece-Wise Linear Regression Trees
The Mechanics of n-Player Differentiable Games
Learning Determinantal Point Processes by Sampling Inferred Negatives
Ranks and Pseudo-Ranks – Paradoxical Results of Rank Tests –
3D Convolutional Encoder-Decoder Network for Low-Dose CT via Transfer Learning from a 2D Trained Network
List Heaps
Adversarial Risk and the Dangers of Evaluating Against Weak Attacks
Calculating the similarity between words and sentences using a lexical database and corpus statistics
Model compression via distillation and quantization
Constraining the Dynamics of Deep Probabilistic Models
A Bio-inspired Redundant Sensing Architecture
Effect of Hilbert space truncation on Anderson localization
Learning DNFs under product distributions via μ-biased quantum Fourier sampling
Bandit Learning with Positive Externalities
Multinomial Adversarial Networks for Multi-Domain Text Classification
Explainable Prediction of Medical Codes from Clinical Text
Identification of the Polaron measure and its central limit theorem
A Machine Learning Approach for Virtual Flow Metering and Forecasting
Inverting The Generator Of A Generative Adversarial Network (II)