Generative Adversarial Networks (GANs) have been shown to outperform non-adversarial generative models in terms of image generation quality by a large margin. Recently, researchers have looked into improving non-adversarial alternatives that can close the gap in generation quality while avoiding some common issues of GANs, such as unstable training and mode collapse. Examples in this direction include Two-stage VAE and Generative Latent Nearest Neighbors. However, a major drawback of these models is that they are slow to train, and in particular, they require two training stages. To address this, we propose Generative Latent Flow (GLF), which uses an auto-encoder to learn the mapping to and from the latent space, and an invertible flow to map the distribution in the latent space to simple i.i.d. noise. The advantages of our method include a simple conceptual framework, single-stage training, and fast convergence. Quantitatively, the generation quality of our model significantly outperforms that of VAEs, and is competitive with GANs’ benchmark on commonly used datasets.
Federated learning involves training statistical models in massive, heterogeneous networks. Naively minimizing an aggregate loss function in such a network may disproportionately advantage or disadvantage some of the devices. In this work, we propose q-Fair Federated Learning (q-FFL), a novel optimization objective inspired by resource allocation in wireless networks that encourages a more fair (i.e., lower-variance) accuracy distribution across devices in federated networks. To solve q-FFL, we devise a communication-efficient method, q-FedAvg, that is suited to federated networks. We validate both the effectiveness of q-FFL and the efficiency of q-FedAvg on a suite of federated datasets, and show that q-FFL (along with q-FedAvg) outperforms existing baselines in terms of the resulting fairness, flexibility, and efficiency.
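The abstract does not state the q-FFL objective itself. As a minimal sketch, assume the form f_q(w) = Σ_k p_k/(q+1) · F_k(w)^(q+1), where F_k is the loss on device k and p_k its weight; this form, and the function names below, are assumptions for illustration. Under that assumption, the gradient weights each device by p_k · F_k^q, so a larger q shifts emphasis toward devices with higher loss, and q = 0 recovers the usual weighted average.

```python
def q_ffl_objective(losses, weights, q):
    """Assumed q-FFL objective: sum_k p_k / (q + 1) * F_k^(q + 1)."""
    return sum(p * (f ** (q + 1)) / (q + 1) for f, p in zip(losses, weights))


def relative_gradient_weights(losses, weights, q):
    """Share of the aggregate gradient contributed by each device.

    Differentiating the assumed objective gives sum_k p_k * F_k^q * grad F_k,
    so device k is weighted by p_k * F_k^q (normalized here to sum to 1).
    """
    raw = [p * (f ** q) for f, p in zip(losses, weights)]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical per-device losses and uniform device weights.
losses = [0.2, 0.4, 1.6]
weights = [1 / 3, 1 / 3, 1 / 3]

for q in (0, 1, 5):
    print(q, relative_gradient_weights(losses, weights, q))
```

With q = 0 every device contributes equally; as q grows, the device with loss 1.6 dominates the update, which is the mechanism that pushes the accuracy distribution toward lower variance.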
Malware detection is an ever-present challenge for all organizational gatekeepers. Organizations often deploy numerous different malware detection tools, and then combine their output to produce a final classification for an inspected file. This approach has two significant drawbacks. First, it requires large amounts of computing resources and time since every incoming file needs to be analyzed by all detectors. Secondly, it is difficult to accurately and dynamically enforce a predefined security policy that comports with the needs of each organization (e.g., how tolerant is the organization to false negatives and false positives). In this study we propose ASPIRE, a reinforcement learning (RL)-based method for malware detection. Our approach receives the organizational policy — defined solely by the perceived costs of correct/incorrect classifications and of computing resources — and then dynamically assigns detection tools and sets the detection threshold for each inspected file. We demonstrate the effectiveness and robustness of our approach by conducting an extensive evaluation on multiple organizational policies. ASPIRE performed well in all scenarios, even achieving near-optimal accuracy of 96.21% (compared to an optimum of 96.86%) at approximately 20% of the running time of this baseline.
Deep learning based recommender systems have been extensively explored in recent years. However, the large number of models proposed each year poses a big challenge for both researchers and practitioners in reproducing the results for further comparisons. Although a portion of papers provide source code, they adopt different programming languages or different deep learning packages, which also raises the bar for grasping the ideas. To alleviate this problem, we released the open source project DeepRec. In this toolkit, we have implemented a number of deep learning based recommendation algorithms using Python and the widely used deep learning package TensorFlow. Three major recommendation scenarios are considered: rating prediction, top-N recommendation (item ranking), and sequential recommendation. Meanwhile, DeepRec maintains good modularity and extensibility so that new models can be easily incorporated into the framework. It is distributed under the terms of the GNU General Public License. The source code is available on GitHub: https://…/DeepRec
With the booming development of data science, many clustering methods have been proposed. All clustering methods have inherent merits and deficiencies. Therefore, they are only capable of clustering some specific types of data robustly. In addition, the accuracies of the clustering methods rely heavily on the characteristics of the data. In this paper, we propose a new clustering method based on morphological operations. The morphological dilation is used to connect the data points based on their adjacency and form different connected domains. The iteration of the morphological dilation process stops when the number of connected domains equals the number of clusters or when the maximum number of iterations is reached. The morphological dilation is then used to label the connected domains. The Euclidean distance between each data point and the points in each labeled connected domain is calculated. For each data point, there is a labeled connected domain that contains a point that yields the smallest Euclidean distance. The data point is assigned the same label number as that labeled connected domain. We evaluate and compare the proposed method with state-of-the-art clustering methods on different types of data. Experimental results show that the proposed method is more robust and generic for clustering two-dimensional or three-dimensional data.
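The dilation-based procedure described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' implementation: it assumes 2D points already rasterized to integer pixel coordinates, a 3x3 structuring element, and 8-connectivity, and it labels domains with a plain breadth-first search. All function names are hypothetical.

```python
from collections import deque

NEIGHBORS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]


def dilate(pixels):
    """Morphological dilation of a pixel set with a 3x3 structuring element."""
    return {(x + dx, y + dy) for (x, y) in pixels for dx, dy in NEIGHBORS}


def connected_domains(pixels):
    """Split a pixel set into 8-connected domains using breadth-first search."""
    remaining = set(pixels)
    domains = []
    while remaining:
        seed = remaining.pop()
        domain, queue = {seed}, deque([seed])
        while queue:
            x, y = queue.popleft()
            for dx, dy in NEIGHBORS:
                n = (x + dx, y + dy)
                if n in remaining:
                    remaining.remove(n)
                    domain.add(n)
                    queue.append(n)
        domains.append(domain)
    return domains


def morphological_cluster(points, n_clusters, max_iter=50):
    """Dilate until the domain count reaches n_clusters, then label each point."""
    pixels = set(points)
    domains = connected_domains(pixels)
    for _ in range(max_iter):
        # "<=" rather than "==": one dilation can merge several domains at once.
        if len(domains) <= n_clusters:
            break
        pixels = dilate(pixels)
        domains = connected_domains(pixels)
    labels = []
    for px, py in points:
        # Assign the label of the domain containing the nearest pixel
        # (squared Euclidean distance gives the same argmin).
        nearest = min(
            range(len(domains)),
            key=lambda i: min((px - qx) ** 2 + (py - qy) ** 2 for qx, qy in domains[i]),
        )
        labels.append(nearest)
    return labels


# Two well-separated groups of hypothetical pixel coordinates.
points = [(0, 0), (2, 0), (0, 2), (2, 2), (20, 20), (22, 20), (20, 22)]
labels = morphological_cluster(points, n_clusters=2)
print(labels)
```

The label ids themselves are arbitrary; only the grouping matters. Here one dilation step is enough to merge each group into a single connected domain.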
In recent years, the proliferation of smart mobile devices has led to the gradual integration of search functionality within mobile platforms. This has created an incentive to move away from the “ten blue links” metaphor, as mobile users are less likely to click on them, expecting to get the answer directly from the snippets. In turn, this has revived the interest in Question Answering. Then, along came chatbots, conversational systems, and messaging platforms, where user needs could be better served with the system asking follow-up questions in order to better understand the user’s intent. While typically a user would expect a single response at any utterance, a system could also return multiple options for the user to select from, based on different system understandings of the user’s intent. However, this possibility should not be overused, as this practice could confuse and/or annoy the user. How to produce good variable-length lists, given the conflicting objectives of staying short while maximizing the likelihood of having a correct answer included in the list, is an underexplored problem. It is also unclear how to evaluate a system that tries to do that. Here we aim to bridge this gap. In particular, we define some necessary and some optional properties that an evaluation measure fit for this purpose should have. We further show that existing evaluation measures from the IR tradition are not entirely suitable for this setup, and we propose novel evaluation measures that address it satisfactorily.
Constrained Concept Factorization (CCF) yields enhanced representation ability over CF by incorporating label information as additional constraints, but it cannot classify and group unlabeled data appropriately. Directly minimizing the difference between the original data and its reconstruction enables CCF to model small noisy perturbations, but is not robust to gross sparse errors. Besides, CCF cannot preserve the manifold structures in the new representation space explicitly, especially in an adaptive manner. In this paper, we propose a joint label prediction based Robust Semi-Supervised Adaptive Concept Factorization (RS2ACF) framework. To obtain a robust representation, RS2ACF relaxes the factorization to make it simultaneously stable to small entrywise noise and robust to sparse errors. To enrich prior knowledge and enhance discrimination, RS2ACF clearly uses class information of labeled data and, more importantly, propagates it to unlabeled data by jointly learning an explicit label indicator for unlabeled data. Through the label indicator, RS2ACF can ensure that unlabeled data with the same predicted label are mapped into the same class in feature space. Besides, RS2ACF incorporates the joint neighborhood reconstruction error over the new representations and predicted labels of both labeled and unlabeled data, so the manifold structures can be preserved explicitly and adaptively in the representation space and label space at the same time. Owing to the adaptive manner, the tricky process of determining the neighborhood size or kernel width can be avoided. Extensive results on public databases verify that our RS2ACF can deliver state-of-the-art data representation, compared with other related methods.
Methods for extracting image features are key to many image processing tasks. At present, the most popular method is the deep neural network, which can automatically extract robust features through end-to-end training instead of hand-crafted feature extraction. However, the deep neural network currently faces many challenges: 1) its effectiveness is heavily dependent on large datasets, so the computational complexity is very high; 2) it is usually regarded as a black box model with poor interpretability. To meet the above challenges, a more interpretable and scalable feature learning method, i.e., deep image feature learning with fuzzy rules (DIFL-FR), is proposed in this paper, which combines the rule-based fuzzy modeling technique and the deep stacked learning strategy. The method progressively learns image features in a layer-by-layer manner based on fuzzy rules, so the feature learning process can be better explained by the generated rules. More importantly, the learning process of the method is based only on forward propagation, without back propagation or iterative learning, which results in high learning efficiency. In addition, the method operates in the unsupervised learning setting and can be easily extended to supervised and semi-supervised learning. Extensive experiments are conducted on image datasets of different scales. The results clearly show the effectiveness of the proposed method.
The unsupervised detection of anomalies in time series data has important applications, e.g., in user behavioural modelling, fraud detection, and cybersecurity. Anomaly detection has been extensively studied in categorical sequences; however, we often have access to time series data that contain paths through networks. Examples include transaction sequences in financial networks, click streams of users in networks of cross-referenced documents, or travel itineraries in transportation networks. To reliably detect anomalies, we must account for the fact that such data contain a large number of independent observations of short paths constrained by a graph topology. Moreover, the heterogeneity of real systems rules out frequency-based anomaly detection techniques, which do not account for highly skewed edge and degree statistics. To address this problem, we introduce a novel framework for the unsupervised detection of anomalies in large corpora of variable-length temporal paths in a graph, which provides an efficient analytical method to detect paths with anomalous frequencies that result from nodes being traversed in unexpected chronological order.
In this work we propose Hebbian-descent as a biologically plausible learning rule for hetero-associative as well as auto-associative learning in single layer artificial neural networks. It can be used as a replacement for gradient descent as well as Hebbian learning, in particular in online learning, as it inherits their advantages while not suffering from their disadvantages. We discuss the drawbacks of Hebbian learning as having problems with correlated input data and not profiting from seeing training patterns several times. For gradient descent we identify the derivative of the activation function as problematic, especially in online learning. Hebbian-descent addresses these problems by getting rid of the activation function’s derivative and by centering, i.e. keeping the neural activities mean free, leading to a biologically plausible update rule that is provably convergent, does not suffer from the vanishing error term problem, can deal with correlated data, profits from seeing patterns several times, and enables successful online learning when centering is used. We discuss its relationship to Hebbian learning, contrastive learning, and gradient descent and show that in case of a strictly positive derivative of the activation function Hebbian-descent leads to the same update rule as gradient descent but for a different loss function. In this case Hebbian-descent inherits the convergence properties of gradient descent, but we also show empirically that it converges when the derivative of the activation function is only non-negative, such as for the step function. Furthermore, in case of the mean squared error loss Hebbian-descent can be understood as the difference between two Hebb-learning steps, which in case of an invertible and integrable activation function actually optimizes a generalized linear model. …
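The abstract describes the update only in words: take the gradient-descent rule, drop the activation function's derivative, and center the input. A minimal sketch of one such step for a single-layer network follows. The exact form used here, W ← W − η (x − μ)(h − t)ᵀ with bias update b ← b − η (h − t), is an assumption inferred from that description, and the sigmoid activation and all names are chosen only for illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def hebbian_descent_step(W, b, x, t, mu, lr):
    """One assumed Hebbian-descent step; updates W and b in place, returns the output h.

    W is a list of rows (n_in x n_out), b has length n_out,
    x is the input, t the target, mu the input mean used for centering.
    """
    xc = [xi - mi for xi, mi in zip(x, mu)]  # centered input
    n_in, n_out = len(xc), len(b)
    h = [sigmoid(sum(xc[i] * W[i][j] for i in range(n_in)) + b[j]) for j in range(n_out)]
    err = [h[j] - t[j] for j in range(n_out)]
    # Gradient descent on the squared error would also multiply err[j]
    # by the activation derivative h[j] * (1 - h[j]); that factor is omitted here.
    for i in range(n_in):
        for j in range(n_out):
            W[i][j] -= lr * xc[i] * err[j]
    for j in range(n_out):
        b[j] -= lr * err[j]
    return h


# Hypothetical toy problem: one pattern, one output unit.
W = [[0.0], [0.0]]
b = [0.0]
x, t, mu = [1.0, 0.0], [1.0], [0.5, 0.5]

first = hebbian_descent_step(W, b, x, t, mu, lr=0.5)[0]
for _ in range(100):
    last = hebbian_descent_step(W, b, x, t, mu, lr=0.5)[0]
print(first, last)
```

Because the error term is not scaled by the derivative, the step size does not shrink when the unit saturates, which is the "vanishing error term" issue the abstract refers to.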
Multi-view clustering has received much attention recently. Most of the existing multi-view clustering methods only focus on one-sided clustering. As the co-occurring data elements involve the counts of sample-feature co-occurrences, it is more efficient to conduct two-sided clustering along the samples and features simultaneously. To take advantage of two-sided clustering for the co-occurrences in the multi-view clustering setting, a two-sided multi-view clustering method is proposed, i.e., multi-view information-theoretic co-clustering (MV-ITCC). The proposed method realizes two-sided clustering for co-occurring multi-view data under the formulation of information theory. More specifically, it exploits the agreement and disagreement among views by sharing a common clustering result along the sample dimension and keeping the clustering results of each view specific along the feature dimension. In addition, the mechanism of maximum entropy is also adopted to control the importance of different views, which can strike the right balance in leveraging the agreement and disagreement. Extensive experiments are conducted on text and image multi-view datasets. The results clearly demonstrate the superiority of the proposed method.
We propose a new family of fairness definitions for classification problems that combine some of the best properties of both statistical and individual notions of fairness. We posit not only a distribution over individuals, but also a distribution over (or collection of) classification tasks. We then ask that standard statistics (such as error or false positive/negative rates) be (approximately) equalized across individuals, where the rate is defined as an expectation over the classification tasks. Because we are no longer averaging over coarse groups (such as race or gender), this is a semantically meaningful individual-level constraint. Given a sample of individuals and classification problems, we design an oracle-efficient algorithm (i.e. one that is given access to any standard, fairness-free learning heuristic) for the fair empirical risk minimization task. We also show that given sufficiently many samples, the ERM solution generalizes in two directions: both to new individuals, and to new classification tasks, drawn from their corresponding distributions. Finally we implement our algorithm and empirically verify its effectiveness.
Large-scale face recognition in the wild has recently achieved mature performance in many real-world applications. However, such systems are built on GPU platforms and mostly deploy heavy deep network architectures. Given a high-performance heavy network as a teacher, this work presents a simple and elegant teacher-student learning paradigm, namely ShrinkTeaNet, to train a portable student network that has significantly fewer parameters and competitive accuracy against the teacher network. Unlike prior teacher-student frameworks, which mainly focus on accuracy and compression ratios in closed-set problems, our proposed teacher-student network is shown to be more robust on an open-set problem, i.e., large-scale face recognition. In addition, this work introduces a novel Angular Distillation Loss for distilling the feature direction and the sample distributions of the teacher’s hypersphere to its student. The ShrinkTeaNet framework can then efficiently guide the student’s learning process with the teacher’s knowledge presented in both intermediate and last stages of the feature embedding. Evaluations on LFW, CFP-FP, AgeDB, IJB-B and IJB-C Janus, and MegaFace with one million distractors have demonstrated the efficiency of the proposed approach in learning robust student networks with satisfactory accuracy and compact sizes. ShrinkTeaNet supports a light-weight architecture that achieves high performance, with 99.77% on LFW and 95.64% on the large-scale MegaFace protocol.
In deep neural nets, lower level embedding layers account for a large portion of the total number of parameters. Tikhonov regularization, graph-based regularization, and hard parameter sharing are approaches that introduce explicit biases into training in the hope of reducing statistical complexity. Alternatively, we propose stochastically shared embeddings (SSE), a data-driven approach to regularizing embedding layers, which stochastically transitions between embeddings during stochastic gradient descent (SGD). Because SSE integrates seamlessly with existing SGD algorithms, it can be used with only minor modifications when training large scale neural networks. We develop two versions of SSE: SSE-Graph, which uses knowledge graphs of embeddings, and SSE-SE, which uses no prior information. We provide theoretical guarantees for our method and show its empirical effectiveness on 6 distinct tasks, from simple neural networks with one hidden layer in recommender systems, to the transformer and BERT in natural languages. We find that when used along with widely-used regularization methods such as weight decay and dropout, our proposed SSE can further reduce overfitting, which often leads to more favorable generalization results.
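The idea of stochastically transitioning between embeddings can be illustrated for the prior-free case (SSE-SE). The abstract does not give the transition rule, so the sketch below assumes the simplest one: before each embedding lookup in an SGD step, every index is replaced, with probability p, by an index drawn uniformly at random. The function name and parameters are hypothetical.

```python
import random


def sse_se_indices(indices, num_embeddings, p, rng):
    """Return a copy of `indices` where each entry is swapped, with probability p,
    for a uniformly random embedding index (assumed SSE-SE transition rule)."""
    swapped = []
    for idx in indices:
        if rng.random() < p:
            swapped.append(rng.randrange(num_embeddings))
        else:
            swapped.append(idx)
    return swapped


rng = random.Random(0)
batch = [3, 3, 7, 1, 9]  # hypothetical item ids in a mini-batch
print(sse_se_indices(batch, num_embeddings=10, p=0.1, rng=rng))
```

In a training loop the swapped indices would be used for the forward and backward pass of that step only, so the model itself needs no changes; at evaluation time one would pass the original indices through unchanged (equivalent to p = 0).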
This work investigates fundamental questions related to locating and defining features in convolutional neural networks (CNN). The theoretical investigations guided by the locality principle show that the relevance of locations within a representation decreases with distance from the center. This is aligned with empirical findings across multiple architectures such as VGG, ResNet, Inception, DenseNet and MobileNet. To leverage our insights, we introduce Locality-promoting Regularization (LOCO-REG). It yields accuracy gains across multiple architectures and datasets.
Networks have been widely used as the data structure for abstracting real-world systems as well as organizing the relations among entities. Network embedding models are powerful tools for mapping nodes in a network into continuous vector-space representations in order to facilitate subsequent tasks such as classification and link prediction. Existing network embedding models comprehensively integrate all information of each node, such as links and attributes, into a single embedding vector to represent the node’s general role in the network. However, a real-world entity could be multifaceted, where it connects to different neighborhoods due to different motives or self-characteristics that are not necessarily correlated. For example, in a movie recommender system, a user may love both comedies and horror movies, but it is not likely that these two types of movies are mutually close in the embedding space, nor could the user embedding vector be sufficiently close to both of them at the same time. In this paper, we propose a polysemous embedding approach for modeling multiple facets of nodes, as motivated by the phenomenon of word polysemy in language modeling. Each facet of a node is mapped to an embedding vector, while we also maintain the association degree between each pair of node and facet. The proposed method is adaptive to various existing embedding models, without significantly complicating the optimization process. We also discuss how to engage embedding vectors of different facets for inference tasks including classification and link prediction. Experiments on real-world datasets help comprehensively evaluate the performance of the proposed method.
Recent advances in Artificial Intelligence, especially in Machine Learning (ML), have brought applications previously considered as science fiction (e.g., virtual personal assistants and autonomous cars) into the reach of millions of everyday users. Since modern ML technologies like deep learning require considerable technical expertise and resources to build custom models, reusing existing models trained by experts has become essential. This is why in the past year model stores have been introduced, which, similar to mobile app stores, offer organizations and developers access to pre-trained models and/or their code to train, evaluate, and predict samples. This paper conducts an exploratory study on three popular model stores (AWS marketplace, Wolfram neural net repository, and ModelDepot) that compares the information elements (features and policies) provided by model stores to those used by the two popular mobile app stores (Google Play and Apple’s App Store). We have found that the model information elements vary among the different model stores, with 65% of elements shared by all three studied stores. Model stores share five information elements with mobile app stores, while eight elements are unique to model stores and four elements are unique to app stores. Only a few models were available on multiple model stores. Our findings help to better understand the differences between ML models and ‘regular’ source code components or applications, and provide inspiration to identify software engineering practices (e.g., in requirements and delivery) specific to ML applications.
Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches often are not robust to dirty data and only detect a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on $686,765$ data columns retrieved from the VizNet corpus by matching $78$ semantic types from DBpedia to column headers. We characterize each matched column with $1,588$ features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F$_1$ score of $0.89$, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.
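For readers unfamiliar with the reported metric, a support-weighted F1 score averages the per-class F1 scores, weighting each class by the number of true instances it has. The sketch below is a plain-Python illustration of that definition; the labels and predictions are made up and have nothing to do with Sherlock's data.

```python
from collections import Counter


def support_weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical semantic-type labels for four columns.
y_true = ["city", "city", "city", "name"]
y_pred = ["city", "city", "name", "name"]
print(support_weighted_f1(y_true, y_pred))
```

In this toy case "city" has F1 = 0.8 with support 3 and "name" has F1 = 2/3 with support 1, so frequent types dominate the aggregate.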
This paper presents a conceptually simple and effective Deep Audio-Visual Embedding for dynamic saliency prediction dubbed “DAVE”. Several behavioral studies have shown a strong relation between auditory and visual cues for guiding gaze during scene free viewing. The existing video saliency models, however, only consider visual cues for predicting saliency over videos and neglect the auditory information that is ubiquitous in dynamic scenes. We propose a multimodal saliency model that utilizes audio and visual information for predicting saliency in videos. Our model consists of a two-stream encoder and a decoder. First, auditory and visual information are mapped into a feature space using 3D Convolutional Neural Networks (3D CNNs). Then, a decoder combines the features and maps them to a final saliency map. To train such a model, data from various eye tracking datasets containing video and audio are pulled together. We further categorised videos into ‘social’, ‘nature’, and ‘miscellaneous’ classes to analyze the models over different content types. Several analyses show that our audio-visual model significantly outperforms video-based models on all scores, both overall and within individual categories. Contextual analysis of the model performance over the location of the sound source reveals that the audio-visual model behaves similarly to humans in attending to the location of the sound source. Our endeavour demonstrates that audio is an important signal that can boost video saliency prediction and help get closer to human performance.
Transferring knowledge from one neural network to another has been shown to be helpful for learning tasks with few training examples. Prevailing fine-tuning methods could potentially contaminate pre-trained features with comparatively high-energy random noise. This noise mainly stems from a careless replacement of task-specific parameters. We theoretically analyze such knowledge contamination for classification tasks and propose a practical and easy-to-apply method to trap and minimize the contaminant. In our approach, the entropy of the output estimates is maximized initially and the first back-propagated error is stalled at the output of the last layer. Our proposed method not only outperforms traditional fine-tuning, but also significantly speeds up the convergence of the learner. It is robust to randomness and independent of the choice of architecture. Overall, our experiments show that the power of transfer learning has been substantially underestimated so far.
Over the past decade, knowledge graphs have become popular for capturing structured domain knowledge. Relational learning models enable the prediction of missing links inside knowledge graphs. More specifically, latent distance approaches model the relationships among entities via a distance between latent representations. Translating embedding models (e.g., TransE) are among the most popular latent distance approaches, which use one distance function to learn multiple relation patterns. However, they are not capable of capturing symmetric relations. They also force relations with reflexive patterns to become symmetric and transitive. In order to improve distance-based embedding, we propose multi-distance embeddings (MDE). Our solution is based on the idea that by learning independent embedding vectors for each entity and relation one can aggregate contrasting distance functions. Benefiting from MDE, we also develop supplementary distances resolving the above-mentioned limitations of TransE. We further propose an extended loss function for distance-based embeddings and show that MDE and TransE are fully expressive using this loss function. Furthermore, we obtain a bound on the size of their embeddings for full expressivity. Our empirical results show that MDE significantly improves the translating embeddings and outperforms several state-of-the-art embedding models on benchmark datasets.
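The claim that translating embeddings cannot capture symmetric relations follows from a short calculation. The derivation below uses the standard TransE formulation, in which a true triple (h, r, t) should satisfy h + r ≈ t; the notation is the usual one and is not taken from the abstract itself.

```latex
% TransE models a true triple (h, r, t) by requiring
\mathbf{h} + \mathbf{r} \approx \mathbf{t},
\qquad \text{with score } \; s(h, r, t) = \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert .

% For a symmetric relation, (h, r, t) and (t, r, h) both hold. With an exact fit:
\mathbf{h} + \mathbf{r} = \mathbf{t}
\quad \text{and} \quad
\mathbf{t} + \mathbf{r} = \mathbf{h}
\;\Longrightarrow\;
2\mathbf{r} = \mathbf{0}
\;\Longrightarrow\;
\mathbf{r} = \mathbf{0}, \;\; \mathbf{h} = \mathbf{t}.
```

So an exact fit of a symmetric relation forces the relation vector to zero and collapses the two entities onto the same point, which is why a single translation distance is insufficient for this pattern.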
Activation functions play an important role in the training of artificial neural networks, and the Rectified Linear Unit (ReLU) has been the mainstream choice in recent years. Most of the activation functions currently used are deterministic in nature, with a fixed input-output relationship. In this work, we propose a probabilistic activation function, called ProbAct. The output value of ProbAct is sampled from a normal distribution whose mean equals the output of ReLU and whose variance is either fixed or trainable for each element. In the trainable ProbAct, the variance of the activation distribution is trained through back-propagation. We also show that the stochastic perturbation through ProbAct is a viable generalization technique that can prevent overfitting. In our experiments, we demonstrate that using ProbAct can boost image classification performance on the CIFAR-10, CIFAR-100, and STL-10 datasets.
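Read literally, the description above amounts to ReLU plus zero-mean Gaussian noise: the output is max(0, x) + σ·ε with ε drawn from N(0, 1). The following sketch shows the fixed-σ case on scalars using only the standard library; the function name and the choice σ = 0.5 are illustrative assumptions, and the trainable-variance variant is not covered.

```python
import random


def probact(x, sigma, rng):
    """Sample from a normal distribution with mean ReLU(x) and standard deviation sigma."""
    return max(0.0, x) + sigma * rng.gauss(0.0, 1.0)


rng = random.Random(0)

# With sigma = 0 the function reduces to plain ReLU.
print(probact(-2.0, 0.0, rng), probact(1.5, 0.0, rng))

# With sigma > 0 the outputs scatter around the ReLU value.
samples = [probact(1.5, 0.5, rng) for _ in range(20000)]
print(sum(samples) / len(samples))
```

Averaged over many draws the output stays centered on the ReLU value, so the noise perturbs individual activations without shifting their expected value.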