If you did not already know

Pyramid Attention Network (PAN)
A Pyramid Attention Network(PAN) is proposed to exploit the impact of global contextual information in semantic segmentation. Different from most existing works, we combine attention mechanism and spatial pyramid to extract precise dense features for pixel labeling instead of complicated dilated convolution and artificially designed decoder networks. Specifically, we introduce a Feature Pyramid Attention module to perform spatial pyramid attention structure on high-level output and combining global pooling to learn a better feature representation, and a Global Attention Upsample module on each decoder layer to provide global context as a guidance of low-level features to select category localization details. The proposed approach achieves state-of-the-art performance on PASCAL VOC 2012 and Cityscapes benchmarks with a new record of mIoU accuracy 84.0% on PASCAL VOC 2012, while training without COCO dataset. …

KeyVec
Previous studies have demonstrated the empirical success of word embeddings in various applications. In this paper, we investigate the problem of learning distributed representations for text documents which many machine learning algorithms take as input for a number of NLP tasks. We propose a neural network model, KeyVec, which learns document representations with the goal of preserving key semantics of the input text. It enables the learned low-dimensional vectors to retain the topics and important information from the documents that will flow to downstream tasks. Our empirical evaluations show the superior quality of KeyVec representations in two different document understanding tasks. …

Feature Boosting and Suppression (FBS)
Making deep convolutional neural networks more accurate typically comes at the cost of increased computational and memory resources. In this paper, we exploit the fact that the importance of features computed by convolutional layers is highly input-dependent, and propose feature boosting and suppression (FBS), a new method to predictively amplify salient convolutional channels and skip unimportant ones at run-time. FBS introduces small auxiliary connections to existing convolutional layers. In contrast to channel pruning methods which permanently remove channels, it preserves the full network structures and accelerates convolution by dynamically skipping unimportant input and output channels. FBS-augmented networks are trained with conventional stochastic gradient descent, making it readily available for many state-of-the-art CNNs. We compare FBS to a range of existing channel pruning and dynamic execution schemes and demonstrate large improvements on ImageNet classification. Experiments show that FBS can accelerate VGG-16 by $5\times$ and improve the speed of ResNet-18 by $2\times$, both with less than $0.6\%$ top-5 accuracy loss. …

Stochastic Ordering
In probability theory and statistics, a stochastic order quantifies the concept of one random variable being ‘bigger’ than another. These are usually partial orders, so that one random variable A may be neither stochastically greater than, less than nor equal to another random variable B. Many different orders exist, which have different applications.
An Introduction to Stochastic Orders

If you did not already know

Time Series Database (TSDB)
A time series database (TSDB) is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range). In some fields these time series are called profiles, curves, or traces. A time series of stock prices might be called a price curve. A time series of energy consumption might be called a load profile. A log of temperature values over time might be called a temperature trace. Despite the disparate names, many of the same mathematical operations, queries, or database transactions are useful for analysing all of them. The implementation of a database that can correctly, reliably, and efficiently implement these operations must be specialized for time-series data. TSDBs are databases that are optimized for time series data. Software with complex logic or business rules and high transaction volume for time series data may not be practical with traditional relational database management systems. Flat file databases are not a viable option either, if the data and transaction volume reaches a maximum threshold determined by the capacity of individual servers (processing power and storage capacity). Queries for historical data, replete with time ranges and roll ups and arbitrary time zone conversions are difficult in a relational database. Compositions of those rules are even more difficult. This is a problem compounded by the free nature of relational systems themselves. Many relational systems are often not modelled correctly with respect to time series data. TSDBs on the other hand impose a model and this allows them to provide more features for doing so. Ideally, these repositories are often natively implemented using specialized database algorithms. However, it is possible to store time series as binary large objects (BLOBs) in a relational database or by using a VLDB approach coupled with a pure star schema. Efficiency is often improved if time is treated as a discrete quantity rather than as a continuous mathematical dimension. Database joins across multiple time series data sets is only practical when the time tag associated with each data entry spans the same set of discrete times for all data sets across which the join is performed. …

Intelligence Graph
In fact, there exist three genres of intelligence architectures: logics (e.g. \textit{Random Forest, A$^*$ Searching}), neurons (e.g. \textit{CNN, LSTM}) and probabilities (e.g. \textit{Naive Bayes, HMM}), all of which are incompatible to each other. However, to construct powerful intelligence systems with various methods, we propose the intelligence graph (short as \textbf{\textit{iGraph}}), which is composed by both of neural and probabilistic graph, under the framework of forward-backward propagation. By the paradigm of iGraph, we design a recommendation model with semantic principle. First, the probabilistic distributions of categories are generated from the embedding representations of users/items, in the manner of neurons. Second, the probabilistic graph infers the distributions of features, in the manner of probabilities. Last, for the recommendation diversity, we perform an expectation computation then conduct a logic judgment, in the manner of logics. Experimentally, we beat the state-of-the-art baselines and verify our conclusions. …

Physics-Informed Kriging (PhIK)
In this work, we propose a new Gaussian process regression (GPR) method: physics-informed Kriging (PhIK). In the standard data-driven Kriging, the unknown function of interest is usually treated as a Gaussian process with assumed stationary covariance with hyperparameters estimated from data. In PhIK, we compute the mean and covariance function from realizations of available stochastic models, e.g., from realizations of governing stochastic partial differential equations solutions. Such a constructed Gaussian process generally is non-stationary, and does not assume a specific form of the covariance function. Our approach avoids the costly optimization step in data-driven GPR methods to identify the hyperparameters. More importantly, we prove that the physical constraints in the form of a deterministic linear operator are guaranteed in the resulting prediction. We also provide an error estimate in preserving the physical constraints when errors are included in the stochastic model realizations. To reduce the computational cost of obtaining stochastic model realizations, we propose a multilevel Monte Carlo estimate of the mean and covariance functions. Further, we present an active learning algorithm that guides the selection of additional observation locations. The efficiency and accuracy of PhIK are demonstrated for reconstructing a partially known modified Branin function and learning a conservative tracer distribution from sparse concentration measurements. …

Neural Network Synthesis Tool (NeST)
Neural networks (NNs) have begun to have a pervasive impact on various applications of machine learning. However, the problem of finding an optimal NN architecture for large applications has remained open for several decades. Conventional approaches search for the optimal NN architecture through extensive trial-and-error. Such a procedure is quite inefficient. In addition, the generated NN architectures incur substantial redundancy. To address these problems, we propose an NN synthesis tool (NeST) that automatically generates very compact architectures for a given dataset. NeST starts with a seed NN architecture. It iteratively tunes the architecture with gradient-based growth and magnitude-based pruning of neurons and connections. Our experimental results show that NeST yields accurate yet very compact NNs with a wide range of seed architecture selection. For example, for the LeNet-300-100 (LeNet-5) NN architecture derived from the MNIST dataset, we reduce network parameters by 34.1x (74.3x) and floating-point operations (FLOPs) by 35.8x (43.7x). For the AlexNet NN architecture derived from the ImageNet dataset, we reduce network parameters by 15.7x and FLOPs by 4.6x. All these results are the current state-of-the-art for these architectures. …

If you did not already know

AutoPruner
Channel pruning is an important family of methods to speedup deep model’s inference. Previous filter pruning algorithms regard channel pruning and model fine-tuning as two independent steps. This paper argues that combining them in a single end-to-end trainable system will lead to better results. We propose an efficient channel selection layer, namely AutoPruner, to find less important filters automatically in a joint training manner. AutoPruner takes previous activation responses as input and generates a true binary index code for pruning. Hence, all the filters corresponding to zero index values can be removed safely after training. We empirically demonstrate that the gradient information of this channel selection layer is also helpful for the whole model training. Compared with previous state-of-the-art pruning algorithms, AutoPruner achieves significantly better performance. Furthermore, ablation experiments show that the proposed novel mini-batch pooling and binarization operations are vital for the success of filter pruning. …

Skilled Experience Catalogue (SEC)

Feature Crossing
Feature crossing captures interactions among categorical features and is useful to enhance learning from tabular data in real-world businesses. …

Learned Global Ranking (LeGR)
Filter pruning has shown to be effective for learning resource-constrained convolutional neural networks (CNNs). However, prior methods for resource-constrained filter pruning have some limitations that hinder their effectiveness and efficiency. When searching for constraint-satisfying CNNs, prior methods either alter the optimization objective or adopt local search algorithms with heuristic parameterization, which are sub-optimal, especially in low-resource regime. From the efficiency perspective, prior methods are often costly to search for constraint-satisfying CNNs. In this work, we propose learned global ranking, dubbed LeGR, which improves upon prior art in the two aforementioned dimensions. Inspired by theoretical analysis, LeGR is parameterized to learn layer-wise affine transformations over the filter norms to construct a learned global ranking. With global ranking, resource-constrained filter pruning at various constraint levels can be done efficiently. We conduct extensive empirical analyses to demonstrate the effectiveness of the proposed algorithm with ResNet and MobileNetV2 networks on CIFAR-10, CIFAR-100, Bird-200, and ImageNet datasets. Code is publicly available at https://…/LeGR.

If you did not already know

MOA
MOA is the most popular open source framework for data stream mining, with a very active growing community (blog). It includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection and recommender systems) and tools for evaluation. Related to the WEKA project, MOA is also written in Java, while scaling to more demanding problems. …

Fast Iterative Shrinkage-Thresholding Algorithm (FISTA)
The ‘fast iterative shrinkage-thresholding algorithm’, a.k.a. FISTA, is one of the most well-known first-order optimisation scheme in the literature, as it achieves the worst-case $O(1/k^2)$ optimal convergence rate in terms of objective function value. However, despite the optimal theoretical rate, in practice the (local) oscillatory behaviour of FISTA often damps its efficiency. Over the past years, various efforts are made in the literature to improve the practical performance of FISTA, such as monotone FISTA, restarting FISTA and backtracking strategies. In this paper, we propose a simple yet effective modification to FISTA which has two advantages: it allows us to 1) prove the convergence of generated sequence; 2) design a so-called ‘lazy-start’ strategy which can up to an order faster than the original scheme in practice. Moreover, we also propose novel adaptive and greedy strategies which can further improve the performance and outperform the state-of-the-art schemes in the literature. The advantages of the proposed schemes are illustrated through problems arising from inverse problem, machine learning and signal/image processing. …

PolyNeuron
Automated deep neural network architecture design has received a significant amount of recent attention. However, this attention has not been equally shared by one of the fundamental building blocks of a deep neural network, the neurons. In this study, we propose PolyNeuron, a novel automatic neuron discovery approach based on learned polyharmonic spline activations. More specifically, PolyNeuron revolves around learning polyharmonic splines, characterized by a set of control points, that represent the activation functions of the neurons in a deep neural network. A relaxed variant of PolyNeuron, which we term PolyNeuron-R, loosens the constraints imposed by PolyNeuron to reduce the computational complexity for discovering the neuron activation functions in an automated manner. Experiments show both PolyNeuron and PolyNeuron-R lead to networks that have improved or comparable performance on multiple network architectures (LeNet-5 and ResNet-20) using different datasets (MNIST and CIFAR10). As such, automatic neuron discovery approaches such as PolyNeuron is a worthy direction to explore. …

Story Ending Generation (SEG)
We introduce a new task named Story Ending Generation (SEG), which aims at generating a coherent story ending from a sequence of story plot. We propose a framework consisting of a Generator and a Reward Manager for thistask. The Generator follows the pointer-generator network with coverage mech-anism to deal with out-of-vocabulary (OOV) and repetitive words. Moreover, amixed loss method is introduced to enable the Generator to produce story endingsof high semantic relevance with story plots. In the Reward Manager, the rewardis computed to fine-tune the Generator with policy-gradient reinforcement learn-ing (PGRL). We conduct experiments on the recently-introduced ROCStoriesCorpus. We evaluate our model in both automatic evaluation and human evalua-tion. Experimental results show that our model exceeds the sequence-to-sequencebaseline model by 15.75% and 13.57% in terms of CIDEr and consistency scorerespectively. …

If you did not already know

M-PACT
Action classification is a widely known and popular task that offers an approach towards video understanding. The absence of an easy-to-use platform containing state-of-the-art (SOTA) models presents an issue for the community. Given that individual research code is not written with an end user in mind and in certain cases code is not released, even for published articles, the importance of a common unified platform capable of delivering results while removing the burden of developing an entire system cannot be overstated. To try and overcome these issues, we develop a tensorflow-based unified platform to abstract away unnecessary overheads in terms of an end-to-end pipeline setup in order to allow the user to quickly and easily prototype action classification models. With the use of a consistent coding style across different models and seamless data flow between various submodules, the platform lends itself to the quick generation of results on a wide range of SOTA methods across a variety of datasets. All of these features are made possible through the use of fully pre-defined training and testing blocks built on top of a small but powerful set of modular functions that handle asynchronous data loading, model initializations, metric calculations, saving and loading of checkpoints, and logging of results. The platform is geared towards easily creating models, with the minimum requirement being the definition of a network architecture and preprocessing steps from a large custom selection of layers and preprocessing functions. M-PACT currently houses four SOTA activity classification models which include, I3D, C3D, ResNet50+LSTM and TSN. The classification performance achieved by these models are, 43.86% for ResNet50+LSTM on HMDB51 while C3D and TSN achieve 93.66% and 85.25% on UCF101 respectively. …

Full Normalization (FN)
Batch Normalization (BN) has been used extensively in deep learning to achieve faster training process and better resulting models. However, whether BN works strongly depends on how the batches are constructed during training and it may not converge to a desired solution if the statistics on a batch are not close to the statistics over the whole dataset. In this paper, we try to understand BN from an optimization perspective by formulating the optimization problem which motivates BN. We show when BN works and when BN does not work by analyzing the optimization problem. We then propose a refinement of BN based on compositional optimization techniques called Full Normalization (FN) to alleviate the issues of BN when the batches are not constructed ideally. We provide convergence analysis for FN and empirically study its effectiveness to refine BN. …

Quasi-Support Vector Data Description (QSVDD)
In the area of data classification, the different classifiers have been developed by its own strengths and weaknesses. Among these classifiers, we propose a method which is based on the maximum margin between two classes. One of the main challenges in this area is dealt with noisy data. In this paper, our aim is to optimize the method of large margin classifier based on hyperdisk (LMC-HD) and incorporate it into quasi-support vector data description (QSVDD) method. In the proposed method, the bounding hypersphere is calculated based on the QSVDD method. So our convex class model is more robust in compared with support vector machine (SVM) and less tight than LMC-HD. Applying this idea causes the reduction of the impact of the noisy data set in classification. Large margin classifiers aim to maximize the margin and minimizing the risk. Sine our proposed method ignores the effect of outliers and noises, so this method has the widest margin compared with other large margin classifiers. In the end, we compare our proposed method with other popular large margin classifiers by the experiments on a set of standard data which indicates our results are more efficient than the others. …

Ridership estimation at station level plays a critical role in metro transportation planning. Among various existing ridership estimation methods, direct demand model has been recognized as an effective approach. However, existing direct demand models including Geographically Weighted Regression (GWR) have rarely included local model selection in ridership estimation. In practice, acquiring insights into metro ridership under multiple influencing factors from a local perspective is important for passenger volume management and transportation planning operations adapting to local conditions. In this study, we propose an Adapted Geographically Weighted Lasso (Ada-GWL) framework for modelling metro ridership, which performs regression-coefficient shrinkage and local model selection. It takes metro network connection intermedia into account and adopts network-based distance metric instead of Euclidean-based distance metric, making it so-called adapted to the context of metro networks. The real-world case of Shenzhen Metro is used to validate the superiority of our proposed model. The results show that the Ada-GWL model performs the best compared with the global model (Ordinary Least Square (OLS), GWR, GWR calibrated with network-based distance metric and GWL in terms of estimation error of the dependent variable and goodness-of-fit. Through understanding the variation of each coefficient across space (elasticities) and variables selection of each station, it provides more realistic conclusions based on local analysis. Besides, through clustering analysis of the stations according to the regression coefficients, clusters’ functional characteristics are found to be in compliance with the facts of the functional land use policy of Shenzhen. These results of the proposed Ada-GWL model demonstrate a great spatial explanatory power in transportation planning. …

If you did not already know

Data Lake
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
The term data lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization’s data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides on Hadoop’s cluster nodes of commodity computers.
Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried. …

Newton
This article introduces Newton, a specification language for notating the analytic form, units of measure, and sensor signal properties for physical-object-specific invariants and general physical laws. We designed Newton to provide a means for hardware designers (e.g., sensor integrated circuit manufacturers, computing hardware architects, or mechanical engineers) to specify properties of the physical environments in which embedded computing systems will be deployed (e.g., a sensing platform deployed on a bridge versus worn by a human). Compilers and other program analysis tools for embedded systems can use a library interface to the Newton compiler to obtain information about the sensors, sensor signals, and inter-signal relationships imposed by the structure and materials properties of a given physical system. The information encoded within Newton specifications could enable new compile-time transformations that exploit information about the physical world. …

Abstract Syntax Tree
In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code. The syntax is ‘abstract’ in the sense that it does not represent every detail appearing in the real syntax, but rather just the structural or content-related details. For instance, grouping parentheses are implicit in the tree structure, so these do not have to be represented as separate nodes. Likewise, a syntactic construct like an if-condition-then expression may be denoted by means of a single node with three branches. This distinguishes abstract syntax trees from concrete syntax trees, traditionally designated parse trees. Parse trees are typically built by a parser during the source code translation and compiling process. Once built, additional information is added to the AST by means of subsequent processing, e.g., contextual analysis. Abstract syntax trees are also used in program analysis and program transformation systems. …

Differential Privacy
In cryptography, differential privacy aims to provide means to maximize the accuracy of queries from statistical databases while minimizing the chances of identifying its records. …

If you did not already know

Transfer Feature Generating Networks With Semantic Classes Structure (TFGNSCS)
Suffering from the generating feature inconsistence of seen classes training model for following the distribution of unseen classes , most of existing feature generating networks difficultly obtain satisfactory performance for the challenging generalization zero-shot learning (GZSL) by adversarial learning the distribution of semantic classes. To alleviate the negative influence of this inconsistence for zero-shot learning (ZSL), transfer feature generating networks with semantic classes structure (TFGNSCS) is proposed to construct networks model for improving the performance of ZSL and GZSL. TFGNSCS can not only consider the semantic structure relationship between seen and unseen classes but also learn the difference of generating features by balancing transfer information between seen and unseen classes in networks. The proposed method can integrate a Wasserstein generative adversarial network with classification loss and transfer loss to generate enough CNN feature, on which softmax classifiers are trained for ZSL and GZSL. Experiments demonstrate that the performance of TFGNSCS outperforms that of the state of the arts on four challenging datasets, which are CUB,FLO,SUN, AWA in GZSL. …

Stochastic Subspace Descent
We present two stochastic descent algorithms that apply to unconstrained optimization and are particularly efficient when the objective function is slow to evaluate and gradients are not easily obtained, as in some PDE-constrained optimization and machine learning problems. The basic algorithm projects the gradient onto a random subspace at each iteration, similar to coordinate descent but without restricting directional derivatives to be along the axes. This algorithm is previously known but we provide new analysis. We also extend the popular SVRG method to this framework but without requiring that the objective function be written as a finite sum. We provide proofs of convergence for our methods under various convexity assumptions and show favorable results when compared to gradient descent and BFGS on non-convex problems from the machine learning and shape optimization literature. We also note that our analysis gives a proof that the iterates of SVRG and several other popular first-order stochastic methods, in their original formulation, converge almost surely to the optimum; to our knowledge, prior to this work the iterates of SVRG had only been known to converge in expectation. …

Deep Determinantal Point Process (Deep DPP)
Determinantal point processes (DPPs) have attracted significant attention as an elegant model that is able to capture the balance between quality and diversity within sets. DPPs are parameterized by a positive semi-definite kernel matrix. While DPPs have substantial expressive power, they are fundamentally limited by the parameterization of the kernel matrix and their inability to capture nonlinear interactions between items within sets. We present the deep DPP model as way to address these limitations, by using a deep feed-forward neural network to learn the kernel matrix. In addition to allowing us to capture nonlinear item interactions, the deep DPP also allows easy incorporation of item metadata into DPP learning. We show experimentally that the deep DPP can provide a considerable improvement in the predictive performance of DPPs. …

Block Markov Chain (BMC)
These Markov chains are characterized by a block structure in their transition matrix. More precisely, the $n$ possible states are divided into a finite number of $K$ groups or clusters, such that states in the same cluster exhibit the same transition rates to other states. One observes a trajectory of the Markov chain, and the objective is to recover, from this observation only, the (initially unknown) clusters. …

If you did not already know

Ramp-Based Twin Support Vector Clustering (RampTWSVC)
Traditional plane-based clustering methods measure the cost of within-cluster and between-cluster by quadratic, linear or some other unbounded functions, which may amplify the impact of cost. This letter introduces a ramp cost function into the plane-based clustering to propose a new clustering method, called ramp-based twin support vector clustering (RampTWSVC). RampTWSVC is more robust because of its boundness, and thus it is more easier to find the intrinsic clusters than other plane-based clustering methods. The non-convex programming problem in RampTWSVC is solved efficiently through an alternating iteration algorithm, and its local solution can be obtained in a finite number of iterations theoretically. In addition, the nonlinear manifold-based formation of RampTWSVC is also proposed by kernel trick. Experimental results on several benchmark datasets show the better performance of our RampTWSVC compared with other plane-based clustering methods. …

Sine-Cosine Algorithm (SCA)
This paper proposes a novel population-based optimization algorithm called Sine Cosine Algorithm (SCA) for solving optimization problems. The SCA creates multiple initial random candidate solutions and requires them to fluctuate outwards or towards the best solution using a mathematical model based on sine and cosine functions. Several random and adaptive variables also are integrated to this algorithm to emphasize exploration and exploitation of the search space in different milestones of optimization. The performance of SCA is benchmarked in three test phases. Firstly, a set of well-known test cases including unimodal, multi-modal, and composite functions are employed to test exploration, exploitation, local optima avoidance, and convergence of SCA. Secondly, several performance metrics (search history, trajectory, average fitness of solutions, and the best solution during optimization) are used to qualitatively observe and confirm the performance of SCA on shifted two-dimensional test functions. Finally, the cross-section of an aircraft’s wing is optimized by SCA as a real challenging case study to verify and demonstrate the performance of this algorithm in practice. The results of test functions and performance metrics prove that the algorithm proposed is able to explore different regions of a search space, avoid local optima, converge towards the global optimum, and exploit promising regions of a search space during optimization effectively. The SCA algorithm obtains a smooth shape for the airfoil with a very low drag, which demonstrates that this algorithm can highly be effective in solving real problems with constrained and unknown search spaces. Note that the source codes of the SCA algorithm are publicly available at http://…/SCA.html.

vitrivr
The growth of multimedia collections – in terms of size, heterogeneity, and variety of media types – necessitates systems that are able to conjointly deal with several forms of media, especially when it comes to searching for particular objects. However, existing retrieval systems are organized in silos and treat different media types separately. As a consequence, retrieval across media types is either not supported at all or subject to major limitations. In this paper, we present vitrivr, a content-based multimedia information retrieval stack. As opposed to the keyword search approach implemented by most media management systems, vitrivr makes direct use of the object’s content to facilitate different types of similarity search, such as Query-by-Example or Query-by-Sketch, for and, most importantly, across different media types – namely, images, audio, videos, and 3D models. Furthermore, we introduce a new web-based user interface that enables easy-to-use, multimodal retrieval from and browsing in mixed media collections. The effectiveness of vitrivr is shown on the basis of a user study that involves different query and media types. To the best of our knowledge, the full vitrivr stack is unique in that it is the first multimedia retrieval system that seamlessly integrates support for four different types of media. As such, it paves the way towards an all-purpose, content-based multimedia information retrieval system. …

In this paper we introduce MaSS (Momentum-added Stochastic Solver), an accelerated SGD method for optimizing over-parameterized networks. Our method is simple and efficient to implement and does not require changing parameters or computing full gradients in the course of optimization. We provide a detailed theoretical analysis for convergence and parameter selection including their dependence on the mini-batch size in the quadratic case. We also provide theoretical convergence results for a more general convex setting. We provide an experimental evaluation showing strong performance of our method in comparison to Adam and SGD for several standard architectures of deep networks including ResNet, convolutional and fully connected networks. We also show its performance for convex kernel machines. …

If you did not already know

Mercury-ML
Mercury-ML is an open source Machine Learning workflow management library. Its core contributors are employees of Alexander Thamm GmbH …

Differentiable Greedy Network (DGN)
Optimal selection of a subset of items from a given set is a hard problem that requires combinatorial optimization. In this paper, we propose a subset selection algorithm that is trainable with gradient-based methods yet achieves near-optimal performance via submodular optimization. We focus on the task of identifying a relevant set of sentences for claim verification in the context of the FEVER task. Conventional methods for this task look at sentences on their individual merit and thus do not optimize the informativeness of sentences as a set. We show that our proposed method which builds on the idea of unfolding a greedy algorithm into a computational graph allows both interpretability and gradient-based training. The proposed differentiable greedy network (DGN) outperforms discrete optimization algorithms as well as other baseline methods in terms of precision and recall. …

Balanced k-Means
Mesh partitioning is an indispensable tool for efficient parallel numerical simulations. Its goal is to minimize communication between the processes of a simulation while achieving load balance. Established graph-based partitioning tools yield a high solution quality; however, their scalability is limited. Geometric approaches usually scale better, but their solution quality may be unsatisfactory for `non-trivial’ mesh topologies. In this paper, we present a scalable version of $k$-means that is adapted to yield balanced clusters. Balanced $k$-means constitutes the core of our new partitioning algorithm Geographer. Bootstrapping of initial centers is performed with space-filling curves, leading to fast convergence of the subsequent balanced k-means algorithm. Our experiments with up to 16384 MPI processes on numerous benchmark meshes show the following: (i) Geographer produces partitions with a lower communication volume than state-of-the-art geometric partitioners from the Zoltan package; (ii) Geographer scales well on large inputs; (iii) a Delaunay mesh with a few billion vertices and edges can be partitioned in a few seconds. …

BlockPuzzle
In this work we propose a novel task framework under which a variety of physical reasoning puzzles can be constructed using very simple rules. Under sparse reward settings, most of these tasks can be very challenging for a reinforcement learning agent to learn. We build several simple environments with this task framework in Mujoco and OpenAI gym and attempt to solve them. We are able to solve the environments by designing curricula to guide the agent in learning and using imitation learning methods to transfer knowledge from a simpler environment. This is only a first step for the task framework, and further research on how to solve the harder tasks and transfer knowledge between tasks is needed. …

If you did not already know

Unified Attention Network (UAN)
We propose a new architecture that learns to attend to different Convolutional Neural Networks (CNN) layers (i.e., different levels of abstraction) and different spatial locations (i.e., specific layers within a given feature map) in a sequential manner to perform the task at hand. Specifically, at each Recurrent Neural Network (RNN) timestep, a CNN layer is selected and its output is processed by a spatial soft-attention mechanism. We refer to this architecture as the Unified Attention Network (UAN), since it combines the ‘what’ and ‘where’ aspects of attention, i.e., ‘what’ level of abstraction to attend to, and ‘where’ should the network look at. We demonstrate the effectiveness of this approach on two computer vision tasks: (i) image-based camera pose and orientation regression and (ii) indoor scene classification. We evaluate our method on standard benchmarks for camera localization (Cambridge, 7-Scene, and TUM-LSI datasets) and for scene classification (MIT-67 indoor dataset), and show that our method improves upon the results of previous methods. Empirically, we show that combining ‘what’ and ‘where’ aspects of attention improves network performance on both tasks. …

ClusART
Topic detection becomes more important due to the increase of information electronically available and the necessity to process and filter it. In this context our master’s thesis work was carried out, where we proposed to present a new approach to the detection of topics called ClusART. Thus, we proposed a three-phase approach, namely : a first phase during which lexical preprocessing was conducted. A second phase during which the construction and generation of vectors representing the documents was carried out. A third phase which is itself composed of two steps. In the first step we used the FuzzyART algorithm for the training phase. In the second step we used a classifier using Paragraph Vector for the test phase. The comparative study of our approach on the 20 Newsgroups dataset showed that our approach is able to detect almost relevant topics. …

Segregation Network
The problem of multiple class novelty detection is gaining increasing importance due to the large availability of multimedia data and the increasing requirement of the classification models to work in an open set scenario. To this end, novelty detection tries to answer this important question: given a test example should we even try to classify it? In this work, we design a novel deep learning framework, termed Segregation Network, which is trained using the mixup technique. We construct interpolated points using convex combinations of pairs of training data and use our novel loss function for prediction of its constituent classes. During testing, for each input query, mixed samples with the known class prototypes are generated and passed through the proposed network. The output of the network reveals the constituent classes which can be used to determine whether the incoming data is from the known class set or not. Our algorithm is trained using just the data from the known classes and does not require any auxiliary dataset or attributes. Extensive evaluation on two benchmark datasets namely Caltech-256 and Stanford Dogs and comparison with the state-of-the-art justifies the effectiveness of the proposed framework. …