# WhatIs-M

m2cgen m2cgen (Model 2 Code Generator) is a lightweight library which provides an easy way to transpile trained statistical models into native code (Python, C, Java, Go).

M3 M3, a metrics platform, and M3DB, a distributed time series database, were developed at Uber out of necessity. After using what was available as open source and finding we were unable to use them at our scale due to issues with their reliability, cost and operationally intensive nature, we built our own metrics platform piece by piece. We used our experience to help us build a native distributed time series database, a highly dynamic and performant aggregation service, query engine and other supporting infrastructure.

M3Lcmf Multi-view Multi-instance Multi-label Learning (M3L) deals with complex objects encompassing diverse instances, represented with different feature views, and annotated with multiple labels. Existing M3L solutions only partially explore the inter or intra relations between objects (or bags), instances, and labels, which can convey important contextual information for M3L. As such, they may have a compromised performance. In this paper, we propose a collaborative matrix factorization based solution called M3Lcmf. M3Lcmf first uses a heterogeneous network composed of nodes of bags, instances, and labels, to encode different types of relations via multiple relational data matrices. To preserve the intrinsic structure of the data matrices, M3Lcmf collaboratively factorizes them into low-rank matrices, explores the latent relationships between bags, instances, and labels, and selectively merges the data matrices. An aggregation scheme is further introduced to aggregate the instance-level labels into bag-level and to guide the factorization. An empirical study on benchmark datasets shows that M3Lcmf outperforms other related competitive solutions in both instance-level and bag-level prediction.
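
The M3Lcmf entry above mentions aggregating instance-level labels into bag-level labels but does not say how. As a rough illustration only, the sketch below assumes the common multi-instance convention that a bag carries a label when at least one of its instances does, so the bag score for a label is the maximum over its instances. The function name `aggregate_bag_labels`, the threshold, and the sample scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def aggregate_bag_labels(
    instance_scores: List[List[float]], threshold: float = 0.5
) -> Dict[str, list]:
    """Aggregate per-instance label scores into bag-level scores and labels.

    Assumption (not from the M3Lcmf paper): max-aggregation, i.e. a bag's
    score for a label is the highest score any of its instances has.
    """
    if not instance_scores:
        raise ValueError("a bag must contain at least one instance")
    num_labels = len(instance_scores[0])
    # For each label column, take the maximum score over all instances.
    bag_scores = [
        max(instance[j] for instance in instance_scores)
        for j in range(num_labels)
    ]
    # A label is assigned to the bag when its aggregated score passes the threshold.
    bag_labels = [score >= threshold for score in bag_scores]
    return {"scores": bag_scores, "labels": bag_labels}


# Hypothetical bag with two instances and three candidate labels.
bag = [
    [0.9, 0.1, 0.2],
    [0.2, 0.3, 0.7],
]
result = aggregate_bag_labels(bag)
print(result)
```

Other aggregation choices (mean, noisy-or, learned weights) fit the same interface; max is used here only because it is the simplest to reason about.
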

M4CD In this paper, we propose a robust change detection method for intelligent visual surveillance. This method, named M4CD, includes three major steps. Firstly, a sample-based background model that integrates color and texture cues is built and updated over time. Secondly, multiple heterogeneous features (including brightness variation, chromaticity variation, and texture variation) are extracted by comparing the input frame with the background model, and a multi-source learning strategy is designed to online estimate the probability distributions for both foreground and background. The three features are approximately conditionally independent, making multi-source learning feasible. Pixel-wise foreground posteriors are then estimated with Bayes rule. Finally, the Markov random field (MRF) optimization and heuristic post-processing techniques are used sequentially to improve accuracy. In particular, a two-layer MRF model is constructed to represent pixel-based and superpixel-based contextual constraints compactly. Experimental results on the CDnet dataset indicate that M4CD is robust under complex environments and ranks among the top methods.

MAC Network We present the MAC network, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning. Drawing inspiration from first principles of computer organization, MAC moves away from monolithic black-box neural architectures towards a design that encourages both transparency and versatility. The model approaches problems by decomposing them into a series of attention-based reasoning steps, each performed by a novel recurrent Memory, Attention, and Composition (MAC) cell that maintains a separation between control and memory. By stringing the cells together and imposing structural constraints that regulate their interaction, MAC effectively learns to perform iterative reasoning processes that are directly inferred from the data in an end-to-end approach.
We demonstrate the model’s strength, robustness and interpretability on the challenging CLEVR dataset for visual reasoning, achieving a new state-of-the-art 98.9% accuracy, halving the error rate of the previous best model. More importantly, we show that the model is computationally efficient and data-efficient, in particular requiring 5x less data than existing models to achieve strong results.

Machine Comprehension Model Sogou Machine Reading Comprehension Toolkit

Machine Learning Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders. The core of machine learning deals with representation and generalization. Representation of data instances and functions evaluated on these instances are part of all machine learning systems. Generalization is the property that the system will perform well on unseen data instances; the conditions under which this can be guaranteed are a key object of study in the subfield of computational learning theory.

Machine Learning Algorithms alphabetically A list of machine learning algorithms

Machine Learning Algorithms by Category A list of machine learning algorithms

Machine Learning AUtomation Toolbox (MLaut) In this paper we present MLaut (Machine Learning AUtomation Toolbox) for the Python data science ecosystem. MLaut automates large-scale evaluation and benchmarking of machine learning algorithms on a large number of datasets. MLaut provides a high-level workflow interface to machine learning algorithms, implements a local back-end to a database of dataset collections, trained algorithms, and experimental results, and provides easy-to-use interfaces to the scikit-learn and Keras modelling libraries.
Experiments are easy to set up with default settings in a few lines of code, while remaining fully customizable to the level of hyper-parameter tuning, pipeline composition, or deep learning architecture. As a principal test case for MLaut, we conducted a large-scale supervised classification study in order to benchmark the performance of a number of machine learning algorithms – to our knowledge also the first larger-scale study on standard supervised learning data sets to include deep learning algorithms. While corroborating a number of previous findings in literature, we found (within the limitations of our study) that deep neural networks do not perform well on basic supervised learning, i.e., outside the more specialized, image-, audio-, or text-based tasks.

Machine Learning Canvas A framework to connect the dots between data collection, machine learning, and value creation

Machine Learning Query Language (MLQL)

Machine Listening Intelligence This manifesto paper will introduce machine listening intelligence, an integrated research framework for acoustic and musical signals modelling, based on signal processing, deep learning and computational musicology.

Machine Reading Comprehension (MRC) Building Dynamic Knowledge Graphs from Text using Machine Reading Comprehension

Machine Reasoning Imagine that the toddler who was once pushing the glass off the table now understands the physics of movement and gravity. Even without having encountered this situation before, the toddler can surmise what will inevitably happen. The toddler can apply the same logic to another object on the table – adapting that knowledge and applying it to a TV remote on the same table – because he knows why it happens. That’s machine reasoning. Machine reasoning is a more human-like approach within the AI spectrum that’s highly relevant to big data investigations; it allows for more flexible adaptation than machine learning.
However, machine reasoning requires heuristics and curation, which is usually done by knowledgeable domain experts. This process is where machine reasoning may be difficult for companies to scale – it requires a great deal of expert human effort for this curation to take place. Machine reasoning is best applied in deterministic scenarios – that is, determining whether something is true or not, or whether something will happen or not. Knowing this, it’s clear why machine learning and machine reasoning work well together.

Machine Teaching In this paper, we consider the problem of machine teaching, the inverse problem of machine learning. Different from traditional machine teaching which views the learners as batch algorithms, we study a new paradigm where the learner uses an iterative algorithm and a teacher can feed examples sequentially and intelligently based on the current performance of the learner. We show that the teaching complexity in the iterative case is very different from that in the batch case. Instead of constructing a minimal training set for learners, our iterative machine teaching focuses on achieving fast convergence in the learner model. Depending on the level of information the teacher has from the learner model, we design teaching algorithms which can provably reduce the number of teaching examples and achieve faster convergence than learning without teachers. We also validate our theoretical findings with extensive experiments on different data distributions and real image datasets.

Machine Vision (MV) Machine vision (MV) is the technology and methods used to provide imaging-based automatic inspection and analysis for such applications as automatic inspection, process control, and robot guidance in industry. The scope of MV is broad. MV is related to, though distinct from, computer vision.

Machines Talking To Machines (M2M) We propose Machines Talking To Machines (M2M), a framework combining automation and crowdsourcing to rapidly bootstrap end-to-end dialogue agents for goal-oriented dialogues in arbitrary domains. M2M scales to new tasks with just a task schema and an API client from the dialogue system developer, but it is also customizable to cater to task-specific interactions. Compared to the Wizard-of-Oz approach for data collection, M2M achieves greater diversity and coverage of salient dialogue flows while maintaining the naturalness of individual utterances. In the first phase, a simulated user bot and a domain-agnostic system bot converse to exhaustively generate dialogue ‘outlines’, i.e. sequences of template utterances and their semantic parses. In the second phase, crowd workers provide contextual rewrites of the dialogues to make the utterances more natural while preserving their meaning. The entire process can finish within a few hours. We propose a new corpus of 3,000 dialogues spanning 2 domains collected with M2M, and present comparisons with popular dialogue datasets on the quality and diversity of the surface forms and dialogue flows.

Macroblock Scaling (MBS) We estimate the proper channel (width) scaling of Convolutional Neural Networks (CNNs) for model reduction. Unlike the traditional scaling method that reduces every CNN channel width by the same scaling factor, we address each CNN macroblock adaptively depending on its information redundancy measured by our proposed effective flops. Our proposed macroblock scaling (MBS) algorithm can be applied to various CNN architectures to reduce their model size. These applicable models range from compact CNN models such as MobileNet (25.53% reduction, ImageNet) and ShuffleNet (20.74% reduction, ImageNet) to ultra-deep ones such as ResNet-101 (51.67% reduction, ImageNet) and ResNet-1202 (72.71% reduction, CIFAR-10) with negligible accuracy degradation.
MBS also performs better reduction at a much lower cost than does the state-of-the-art optimization-based method. MBS’s simplicity and efficiency, its flexibility to work with any CNN model, and its scalability to work with models of any depth make it an attractive choice for CNN model size reduction.

MAESTRO We present MAESTRO, a framework to describe and analyze CNN dataflows, and predict performance and energy-efficiency when running neural network layers across various hardware configurations. This includes two components: (i) a concise language to describe arbitrary dataflows and (ii) an analysis framework that accepts the dataflow description, hardware resource description, and DNN layer description as inputs and generates buffer requirements, buffer access counts, network-on-chip (NoC) bandwidth requirements, and roofline performance information. We demonstrate both components across several dataflows as case studies.

MAgent We introduce MAgent, a platform to support research and development of many-agent reinforcement learning. Unlike previous research platforms on single or multi-agent reinforcement learning, MAgent focuses on supporting the tasks and the applications that require hundreds to millions of agents. Within the interactions among a population of agents, it enables not only the study of learning algorithms for agents’ optimal policies, but more importantly, the observation and understanding of individual agents’ behaviors and social phenomena emerging from the AI society, including communication languages, leadership, and altruism. MAgent is highly scalable and can host up to one million agents on a single GPU server. MAgent also provides flexible configurations for AI researchers to design their customized environments and agents. In this demo, we present three environments designed on MAgent and show emerged collective intelligence by learning from scratch.

Magma Magma is an open-source software platform that gives network operators an open, flexible and extendable mobile core network solution. Magma enables better connectivity by:

- Allowing operators to offer cellular service without vendor lock-in with a modern, open source core network
- Enabling operators to manage their networks more efficiently with more automation, less downtime, better predictability, and more agility to add new services and applications
- Enabling federation between existing MNOs and new infrastructure providers for expanding rural infrastructure
- Allowing operators who are constrained with licensed spectrum to add capacity and reach by using Wi-Fi and CBRS

MAGnet Over recent years, deep reinforcement learning has shown strong successes in complex single-agent tasks, and more recently this approach has also been applied to multi-agent domains. In this paper, we propose a novel approach, called MAGnet, to multi-agent reinforcement learning (MARL) that utilizes a relevance graph representation of the environment obtained by a self-attention mechanism, and a message-generation technique inspired by the NerveNet architecture. We applied our MAGnet approach to the Pommerman game and the results show that it significantly outperforms state-of-the-art MARL solutions, including DQN, MADDPG, and MCTS.

Magnetic Laplacian Matrix

MagneticMap

Magnitude Vector space embedding models like word2vec, GloVe, fastText, and ELMo are extremely popular representations in natural language processing (NLP) applications. We present Magnitude, a fast, lightweight tool for utilizing and processing embeddings. Magnitude is an open source Python package with a compact vector storage file format that allows for efficient manipulation of huge numbers of embeddings. Magnitude performs common operations up to 60 to 6,000 times faster than Gensim. Magnitude introduces several novel features for improved robustness like out-of-vocabulary lookups.
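
The Magnitude entry above names out-of-vocabulary lookups without describing how they work. The sketch below shows one generic way such a lookup can be built, under the assumption that a vector for an unseen word is the average of deterministic pseudo-random vectors derived from its character trigrams, so that similarly spelled words land near each other. It does not use or reproduce the Magnitude package; every name here (`DIM`, `char_ngrams`, `ngram_vector`, `oov_vector`, `cosine`) is hypothetical.

```python
import hashlib
import math
import random
from typing import List

DIM = 128  # embedding size, chosen arbitrarily for this illustration


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Split a word, padded with boundary markers, into character n-grams."""
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams or [padded]  # very short words fall back to the padded form


def ngram_vector(ngram: str) -> List[float]:
    """Deterministic pseudo-random vector seeded by a hash of the n-gram."""
    digest = hashlib.sha256(ngram.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def oov_vector(word: str) -> List[float]:
    """Average the n-gram vectors so shared spelling yields similar vectors."""
    vectors = [ngram_vector(g) for g in char_ngrams(word)]
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(DIM)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Words sharing most trigrams should score higher than unrelated strings.
similar = cosine(oov_vector("uberx"), oov_vector("uberxl"))
different = cosine(oov_vector("uberx"), oov_vector("zyqqk"))
print(f"similar spelling: {similar:.2f}, unrelated spelling: {different:.2f}")
```

Because each vector is derived from a hash rather than stored, the same unseen word always maps to the same vector without any extra storage.
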

Magnitude Bounded Matrix Factorisation (MBMF) Low rank matrix factorisation is often used in recommender systems as a way of extracting latent features. When dealing with large and sparse datasets, traditional recommendation algorithms face the problem of acquiring large, unrestrained, fluctuating values over predictions especially for users/items with very few corresponding observations. Although the problem has been somewhat solved by imposing bounding constraints over its objectives, and/or over all entries to be within a fixed range, in terms of gaining better recommendations, these approaches have two major shortcomings that we aim to mitigate in this work: one is they can only deal with one pair of fixed bounds for all entries, and the other one is they are very time-consuming when applied on large scale recommender systems. In this paper, we propose a novel algorithm named Magnitude Bounded Matrix Factorisation (MBMF), which allows different bounds for individual users/items and performs very fast on large scale datasets. The key idea of our algorithm is to construct a model by constraining the magnitudes of each individual user/item feature vector. We achieve this by converting from the Cartesian to Spherical coordinate system with radii set as the corresponding magnitudes, which allows the above constrained optimisation problem to become an unconstrained one. The Stochastic Gradient Descent (SGD) method is then applied to solve the unconstrained task efficiently. Experiments on synthetic and real datasets demonstrate that in most cases the proposed MBMF is superior to all existing algorithms in terms of accuracy and time complexity.

Magnitude-Shape Plot This article proposes a new graphical tool, the magnitude-shape (MS) plot, for visualizing both the magnitude and shape outlyingness of multivariate functional data.
The proposed tool builds on the recent notion of functional directional outlyingness, which measures the centrality of functional data by simultaneously considering the level and the direction of their deviation from the central region. The MS-plot intuitively presents not only levels but also directions of magnitude outlyingness on the horizontal axis or plane, and demonstrates shape outlyingness on the vertical axis. A dividing curve or surface is provided to separate non-outlying data from the outliers. Both the simulated data and the practical examples confirm that the MS-plot is superior to existing tools for visualizing centrality and detecting outliers for functional data.

Mahalanobis Distance The Mahalanobis distance is a descriptive statistic that provides a relative measure of a data point’s distance (residual) from a common point. It is a unitless measure introduced by P. C. Mahalanobis in 1936. The Mahalanobis distance is used to identify and gauge similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant. In other words, it has a multivariate effect size.

MAIA In recent decades, it has become a significant tendency for industrial manufacturers to adopt decentralization as a new manufacturing paradigm. This enables more efficient operations and facilitates the shift from mass to customized production. At the same time, advances in data analytics give more insights into the production lines, thus improving their overall productivity. The primary objective of this paper is to apply a decentralized architecture to address new challenges in industrial analytics. The main contributions of this work are therefore two-fold: (1) an assessment of the microservices’ feasibility in industrial environments, and (2) a microservices-based architecture for industrial data analytics.
Also, a prototype has been developed, analyzed, and evaluated, to provide further practical insights. Initial evaluation results of this prototype underpin the adoption of microservices in industrial analytics with less than 20 ms end-to-end processing latency for predicting movement paths for 100 autonomous robots on a commodity hardware server. However, it also identifies several drawbacks of the approach, among them the structural complexity, which leads to higher resource consumption.

Majority-CRF We explore active learning (AL) utterance selection for improving the accuracy of new underrepresented domains in a natural language understanding (NLU) system. Moreover, we propose an AL algorithm called Majority-CRF that uses an ensemble of classification and sequence labeling models to guide utterance selection for annotation. Experiments with three domains show that Majority-CRF achieves 6.6%-9% relative error rate reduction compared to random sampling with the same annotation budget, and statistically significant improvements compared to other AL approaches. Additionally, case studies with human-in-the-loop AL on six new domains show 4.6%-9% improvement on an existing NLU system.

MAKESPEARE Inductive program synthesis, from input/output examples, can provide an opportunity to automatically create programs from scratch without presupposing the algorithmic form of the solution. For induction of general programs with loops (as opposed to loop-free programs, or synthesis for domain-specific languages), the state of the art is at the level of introductory programming assignments. Most problems that require algorithmic subtlety, such as fast sorting, have remained out of reach without the benefit of significant problem-specific background knowledge. A key challenge is to identify cues that are available to guide search towards correct looping programs.
We present MAKESPEARE, a simple delayed-acceptance hillclimbing method that synthesizes low-level looping programs from input/output examples. During search, delayed acceptance bypasses small gains to identify significantly-improved stepping stone programs that tend to generalize and enable further progress. The method performs well on a set of established benchmarks, and succeeds on the previously unsolved ‘Collatz Numbers’ program synthesis problem. Additional benchmarks include the problem of rapidly sorting integer arrays, in which we observe the emergence of comb sort (a Shell sort variant that is empirically fast). MAKESPEARE has also synthesized a record-setting program on one of the puzzles from the TIS-100 assembly language programming game.

Maler In this paper, we study adaptive online convex optimization, and aim to design a universal algorithm that achieves optimal regret bounds for multiple common types of loss functions. Existing universal methods are limited in the sense that they are optimal for only a subclass of loss functions. To address this limitation, we propose a novel online method, namely Maler, which enjoys the optimal $O(\sqrt{T})$, $O(d\log T)$ and $O(\log T)$ regret bounds for general convex, exponentially concave, and strongly convex functions respectively. The essential idea is to run multiple types of learning algorithms with different learning rates in parallel, and utilize a meta algorithm to track the best one on the fly. Empirical results demonstrate the effectiveness of our method.

Mallows Rank Model BayesMallows: An R Package for the Bayesian Mallows Model

Malthusian Reinforcement Learning Here we explore a new algorithmic framework for multi-agent reinforcement learning, called Malthusian reinforcement learning, which extends self-play to include fitness-linked population size dynamics that drive ongoing innovation.
In Malthusian RL, increases in a subpopulation’s average return drive subsequent increases in its size, just as Thomas Malthus argued in 1798 was the relationship between preindustrial income levels and population growth. Malthusian reinforcement learning harnesses the competitive pressures arising from growing and shrinking population size to drive agents to explore regions of state and policy spaces that they could not otherwise reach. Furthermore, in environments where there are potential gains from specialization and division of labor, we show that Malthusian reinforcement learning is better positioned to take advantage of such synergies than algorithms based on self-play.

Malware Analysis and Attribution using Genetic Information (MAAGI) Artificial intelligence methods have often been applied to perform specific functions or tasks in the cyber-defense realm. However, as adversary methods become more complex and difficult to divine, piecemeal efforts to understand cyber-attacks, and malware-based attacks in particular, are not providing sufficient means for malware analysts to understand the past, present and future characteristics of malware. In this paper, we present the Malware Analysis and Attribution using Genetic Information (MAAGI) system. The underlying idea behind the MAAGI system is that there are strong similarities between malware behavior and biological organism behavior, and applying biologically inspired methods to corpora of malware can help analysts better understand the ecosystem of malware attacks. Due to the sophistication of the malware and the analysis, the MAAGI system relies heavily on artificial intelligence techniques to provide this capability. It has already yielded promising results over its development life, and will hopefully inspire more integration between the artificial intelligence and cyber-defense communities.

Managed Memory Computing (MMC) Aggregated data cubes are the most effective form of storage of aggregated or summarized data for quick analysis. This technology is driven by Online Analytical Processing technology. Utilizing these data cubes involves intense disk I/O operations. This at times lowers the speed for users of data. Conventional in-memory processing does not rely on stored and summarized or aggregated data but brings all the relevant data into memory. This technology then utilizes intense processing and large amounts of memory to perform all calculations and aggregations while in memory. Managed Memory Computing blends the best of both methods, allowing users to define data cubes with pre-structured and aggregated data, providing a logical business layer to users, and offering in-memory computation. These features make the response time for user interactions far superior and enable the most balanced approach between disk I/O and in-memory processing. The hybrid approach of Managed Memory Computing provides analysis, dashboards, graphical interaction, ad hoc querying, presentation, and discussion-driven analytics at blazing speeds, making the Business Intelligence Tool ready for everything from an interactive session in the boardroom to a production planning meeting on the factory floor.

Managed R Archive Network (MRAN) Revolution Analytics’ Managed R Archive Network

Mandolin Markov Logic Networks join probabilistic modeling with first-order logic and have been shown to integrate well with the Semantic Web foundations. While several approaches have been devised to tackle the subproblems of rule mining, grounding, and inference, no comprehensive workflow has been proposed so far. In this paper, we fill this gap by introducing a framework called Mandolin, which implements a workflow for knowledge discovery specifically on RDF datasets.
Our framework imports knowledge from referenced graphs, creates similarity relationships among similar literals, and relies on state-of-the-art techniques for rule mining, grounding, and inference computation. We show that our best configuration scales well and achieves at least comparable results with respect to other statistical-relational-learning algorithms on link prediction.

Manhattan Distance Taxicab geometry, considered by Hermann Minkowski in 19th-century Germany, is a form of geometry in which the usual distance function or metric of Euclidean geometry is replaced by a new metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. The taxicab metric is also known as rectilinear distance, L1 distance or norm, city block distance, Manhattan distance, or Manhattan length, with corresponding variations in the name of the geometry. The latter names allude to the grid layout of most streets on the island of Manhattan, which causes the shortest path a car could take between two intersections in the borough to have length equal to the intersections’ distance in taxicab geometry. ➚ “Lp Space”

Manhattan Plot A Manhattan plot is a type of scatter plot, usually used to display data with a large number of data-points – many of non-zero amplitude, and with a distribution of higher-magnitude values, for instance in genome-wide association studies (GWAS). It gains its name from the similarity of such a plot to the Manhattan skyline: a profile of skyscrapers towering above the lower level “buildings” which vary around a lower height.

Manifold Interpretation and diagnosis of machine learning models have gained renewed interest in recent years with breakthroughs in new approaches. We present Manifold, a framework that utilizes visual analysis techniques to support interpretation, debugging, and comparison of machine learning models in a more transparent and interactive manner.
Conventional techniques usually focus on visualizing the internal logic of a specific model type (i.e., deep neural networks), lacking the ability to extend to a more complex scenario where different model types are integrated. To this end, Manifold is designed as a generic framework that does not rely on or access the internal logic of the model and solely observes the input (i.e., instances or features) and the output (i.e., the predicted result and probability distribution). We describe the workflow of Manifold as an iterative process consisting of three major phases that are commonly involved in the model development and diagnosis process: inspection (hypothesis), explanation (reasoning), and refinement (verification). The visual components supporting these tasks include a scatterplot-based visual summary that overviews the models’ outcome and a customizable tabular view that reveals feature discrimination. We demonstrate current applications of the framework on the classification and regression tasks and discuss other potential machine learning use scenarios where Manifold can be applied.

Manifold: A Model-Agnostic Visual Debugging Tool for Machine Learning at Uber

Manifold Adversarial Training (MAT) The recently proposed adversarial training methods show robustness to both adversarial and original examples and achieve state-of-the-art results in supervised and semi-supervised learning. All the existing adversarial training methods consider only how the worst perturbed examples (i.e., adversarial examples) could affect the model output. Despite their success, we argue that such a setting may lack generalization, since the output space (or label space) is apparently less informative. In this paper, we propose a novel method, called Manifold Adversarial Training (MAT). MAT manages to build an adversarial framework based on how the worst perturbation could affect the distributional manifold rather than the output space.
In particular, a latent data space with the Gaussian Mixture Model (GMM) is first derived. On one hand, MAT tries to perturb the input samples in the way that would roughen the distributional manifold the most. On the other hand, the deep learning model is trained to promote manifold smoothness in the latent space, measured by the variation of Gaussian mixtures (given the local perturbation around the data point). Importantly, since the latent space is more informative than the output space, the proposed MAT can better learn a robust and compact data representation, leading to further performance improvement. The proposed MAT is important in that it can be considered as a superset of one recently-proposed discriminative feature learning approach called center loss. We conducted a series of experiments in both supervised and semi-supervised learning on three benchmark data sets, showing that the proposed MAT can achieve remarkable performance, much better than that of the state-of-the-art adversarial approaches.

Manifold Criterion guided Transfer Learning (MCTL) In many practical transfer learning scenarios, the feature distribution is different across the source and target domains (i.e. non-i.i.d.). Maximum mean discrepancy (MMD), as a domain discrepancy metric, has achieved promising performance in unsupervised domain adaptation (DA). We argue that MMD-based DA methods ignore the data locality structure, which, to some extent, would cause the negative transfer effect. The locality plays an important role in minimizing the nonlinear local domain discrepancy underlying the marginal distributions. For better exploiting the domain locality, a novel local generative discrepancy metric (LGDM) based intermediate domain generation learning called Manifold Criterion guided Transfer Learning (MCTL) is proposed in this paper.
The merits of the proposed MCTL are four-fold: 1) the concept of manifold criterion (MC) is first proposed as a measure validating the distribution matching across domains, and domain adaptation is achieved if the MC is satisfied; 2) the proposed MC can well guide the generation of the intermediate domain sharing similar distribution with the target domain, by minimizing the local domain discrepancy; 3) a global generative discrepancy metric (GGDM) is presented, such that both the global and local discrepancy can be effectively and positively reduced; 4) a simplified version of MCTL called MCTL-S is presented under a perfect domain generation assumption for more generic learning scenario. Experiments on a number of benchmark visual transfer tasks demonstrate the superiority of the proposed manifold criterion guided generative transfer method, by comparing with other state-of-the-art methods. The source code is available at https://…/MCTL. Manifold Learning Manifold Learning (often also referred to as non-linear dimensionality reduction) pursues the goal of embedding data that originally lies in a high dimensional space into a lower dimensional space, while preserving characteristic properties. This is possible because for any high dimensional data to be interesting, it must be intrinsically low dimensional. For example, images of faces might be represented as points in a high dimensional space (let’s say your camera has 5MP, so your images, considering each pixel consists of three values, lie in a 15M dimensional space), but not every 5MP image is a face. Faces lie on a sub-manifold in this high dimensional space. A sub-manifold is locally Euclidean, i.e. 
if you take two very similar points, for example two images of identical twins, you can interpolate between them and still obtain an image on the manifold, but globally not Euclidean: if you take two images that are very different, for example Arnold Schwarzenegger and Hillary Clinton, you cannot interpolate between them. I develop algorithms that map these high dimensional data points into a low dimensional space, while preserving local neighborhoods. This can be interpreted as a non-linear generalization of PCA. ➘ “Nonlinear Dimensionality Reduction” http://…/manifold.html Manifold Regularized Generative Adversarial Network(MR-GAN) Despite the growing interest in generative adversarial networks (GANs), training GANs remains a challenging problem, both from a theoretical and a practical standpoint. To address this challenge, in this paper, we propose a novel way to exploit the unique geometry of the real data, especially the manifold information. More specifically, we design a method to regularize GAN training by adding a regularization term referred to as a manifold regularizer. The manifold regularizer forces the generator to respect the unique geometry of the real data manifold and generate high quality data. Furthermore, we theoretically prove that the addition of this regularization term in any class of GANs including DCGAN and Wasserstein GAN leads to improved performance in terms of generalization, existence of equilibrium, and stability. Preliminary experiments show that the proposed manifold regularization helps in avoiding mode collapse and leads to stable training. ManifoldNet Deep neural networks have become the main workhorse for many tasks involving learning from data in a variety of applications in Science and Engineering. Traditionally, the input to these networks lies in a vector space and the operations employed within the network are well defined on vector spaces. 
In the recent past, due to technological advances in sensing, it has become possible to acquire manifold-valued data sets either directly or indirectly. Examples include but are not limited to data from omnidirectional cameras on automobiles, drones etc., synthetic aperture radar imaging, diffusion magnetic resonance imaging, elastography and conductance imaging in the Medical Imaging domain and others. Thus, there is a need to generalize the deep neural networks to cope with input data that reside on curved manifolds where vector space operations are not naturally admissible. In this paper, we present a novel theoretical framework to generalize the widely popular convolutional neural networks (CNNs) to high dimensional manifold-valued data inputs. We call these networks ManifoldNets. In ManifoldNets, convolution operation on data residing on Riemannian manifolds is achieved via a provably convergent recursive computation of the weighted Fréchet Mean (wFM) of the given data, where the weights make up the convolution mask, to be learned. Further, we prove that the proposed wFM layer achieves a contraction mapping and hence ManifoldNet does not need the non-linear ReLU unit used in standard CNNs. We present experiments, using the ManifoldNet framework, to achieve dimensionality reduction by computing the principal linear subspaces that naturally reside on a Grassmannian. The experimental results demonstrate the efficacy of ManifoldNets in the context of classification and reconstruction accuracy. ManiFool Deep convolutional neural networks have been shown to be vulnerable to arbitrary geometric transformations. However, there is no systematic method to measure the invariance properties of deep networks to such transformations. We propose ManiFool as a simple yet scalable algorithm to measure the invariance of deep networks. 
In particular, our algorithm measures the robustness of deep networks to geometric transformations in a worst-case regime as they can be problematic for sensitive applications. Our extensive experimental results show that ManiFool can be used to measure the invariance of fairly complex networks on high dimensional datasets and these values can be used for analyzing the reasons for it. Furthermore, we build on ManiFool to propose a new adversarial training scheme and we show its effectiveness on improving the invariance properties of deep neural networks. Mann-Kendall Trend Test(MK Test) Given n consecutive observations of a time series z_t, t = 1, …, n, Mann (1945) suggested using the Kendall rank correlation of z_t with t, t = 1, …, n, to test for monotonic trend. ➚ “Kendall Rank Correlation Coefficient” Many Task Learning(MaTL) Typical multi-task learning (MTL) methods rely on architectural adjustments and a large trainable parameter set to jointly optimize over several tasks. However, when the number of tasks increases so do the complexity of the architectural adjustments and resource requirements. In this paper, we introduce a method which applies a conditional feature-wise transformation over the convolutional activations that enables a model to successfully perform a large number of tasks. To distinguish from regular MTL, we introduce Many Task Learning (MaTL) as a special case of MTL where more than 20 tasks are performed by a single model. Our method dubbed Task Routing (TR) is encapsulated in a layer we call the Task Routing Layer (TRL), which applied in an MaTL scenario successfully fits hundreds of classification tasks in one model. We evaluate our method on 5 datasets against strong baselines and state-of-the-art approaches. 
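For the Mann-Kendall Trend Test entry above, a minimal sketch of the statistic may help. It computes Kendall's rank correlation between the observations and their time index. The function name `mann_kendall`, the no-ties assumption, and the continuity-corrected normal approximation for the p-value are illustrative additions that the entry itself does not spell out.

```python
from itertools import combinations
from math import erf, sqrt


def mann_kendall(z):
    """Return (S, tau, p_value) for a series z, assuming no tied values."""
    n = len(z)
    # S = (#pairs that increase over time) - (#pairs that decrease over time).
    s = sum((later > earlier) - (later < earlier)
            for earlier, later in combinations(z, 2))
    # Kendall's tau of z_t against t: S divided by the number of pairs.
    tau = s / (n * (n - 1) / 2)
    # Variance of S under the no-trend null hypothesis (no ties assumed).
    var_s = n * (n - 1) * (2 * n + 5) / 18
    # Continuity-corrected z-score (a common convention, assumed here).
    if s > 0:
        score = (s - 1) / sqrt(var_s)
    elif s < 0:
        score = (s + 1) / sqrt(var_s)
    else:
        score = 0.0
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(score) / sqrt(2))))
    return s, tau, p_value


if __name__ == "__main__":
    print(mann_kendall([1.0, 2.0, 3.0, 4.0, 5.0]))
```

A strictly increasing series gives tau = 1 and a strictly decreasing one gives tau = -1; the normal approximation is only rough for very short series.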
➘ “Multi-Task Learning” Map-Based Multi-Policy Reinforcement Learning(MMPRL) In order for robots to perform mission-critical tasks, it is essential that they are able to quickly adapt to changes in their environment as well as to injuries and/or other bodily changes. Deep reinforcement learning has been shown to be successful in training robot control policies for operation in complex environments. However, existing methods typically employ only a single policy. This can limit the adaptability since a large environmental modification might require a completely different behavior compared to the learning environment. To solve this problem, we propose Map-based Multi-Policy Reinforcement Learning (MMPRL), which aims to search and store multiple policies that encode different behavioral features while maximizing the expected reward in advance of the environment change. Thanks to these policies, which are stored in a multi-dimensional discrete map according to their behavioral features, adaptation can be performed within reasonable time without retraining the robot. An appropriate pre-trained policy from the map can be recalled using Bayesian optimization. Our experiments show that MMPRL enables robots to quickly adapt to large changes without requiring any prior knowledge on the type of injuries that could occur. A highlight of the learned behaviors can be found here: https://youtu.be/qcCepAKL32U. Maple Maple combines the world’s most powerful math engine with an interface that makes it extremely easy to analyze, explore, visualize, and solve mathematical problems. mapnik Mapnik is a high-powered rendering library that can take GIS data from a number of sources (ESRI shapefiles, PostGIS databases, etc.) and use them to render beautiful 2-dimensional maps. It’s used as the underlying rendering solution for a lot of online mapping services, most notably including MapQuest and the OpenStreetMap project, so it’s a truly production-quality framework. 
And, despite being written in C++, it comes with bindings for Python and Node, so you can leverage it in the language of your choice. Render Google Maps Tiles with Mapnik and Python Mapping and Debugging(MaD) Neuromorphic systems, or dedicated hardware for neuromorphic computing, are becoming popular with the advancement in research on different device materials for synapses, especially in crossbar architecture, and also on algorithms specific or compatible to neuromorphic hardware. Hence, an automated mapping of any deep neural network onto the neuromorphic chip with crossbar array of synapses and an efficient debugging framework are essential. Here, mapping is defined as the deployment of a section of deep neural network layer onto a neuromorphic core and the generation of connection lists among population of neurons to specify the connectivity between various neuromorphic cores on the neuromorphic chip. Debugging is the verification of computations performed on the neuromorphic chip during inferencing. Together the framework becomes the Mapping and Debugging (MaD) framework. The MaD framework is quite general in usage as it is a Python wrapper which can be integrated with almost any simulator tool for neuromorphic chips. This paper illustrates the MaD framework in detail, considering some optimizations while mapping onto a single neuromorphic core. Classification tasks on the MNIST and CIFAR-10 datasets are considered as test cases for the MaD framework. MapReduce for C(MR4C) MR4C is an implementation framework that allows you to run native code within the Hadoop execution framework. Pairing the performance and flexibility of natively developed algorithms with the unfettered scalability and throughput inherent in Hadoop, MR4C enables large-scale deployment of advanced data processing applications. MaRe Application containers are emerging as key components in scientific processing, as they can improve reproducibility and standardization of in-silico analysis. 
Chaining software tools in processing pipelines is a common practice in scientific applications and, as application containers gain momentum, workflow systems are starting to provide support for this emerging technology. Nevertheless, workflow systems fall short when it comes to data-intensive analysis, as they do not provide locality-aware scheduling for parallel workloads. To this extent, Big Data cluster-computing frameworks, such as Apache Spark, represent a natural choice. However, even though these frameworks excel at parallelizing code blocks, they do not provide any support for containerized tools parallelization. Here we introduce MaRe, which extends Apache Spark, providing an easy way to parallelize container-based analytics, with transparent management of data locality. MaRe is Docker-compliant, and it can be used as a standalone solution, as well as a workflow system add-on. We demonstrate MaRe on two data-intensive applications in virtual drug screening and in predictive toxicology, showing good scalability. MaRe is generally applicable and available as open source: https://…/MaRe Margin Disparity Discrepancy This paper addresses the problem of unsupervised domain adaption from theoretical and algorithmic perspectives. Existing domain adaptation theories naturally imply minimax optimization algorithms, which connect well with the adversarial-learning based domain adaptation methods. However, several disconnections still form the gap between theory and algorithm. We extend previous theories (Ben-David et al., 2010; Mansour et al., 2009c) to multiclass classification in domain adaptation, where classifiers based on scoring functions and margin loss are standard algorithmic choices. We introduce a novel measurement, margin disparity discrepancy, that is tailored both to distribution comparison with asymmetric margin loss, and to minimax optimization for easier training. 
Using this discrepancy, we derive new generalization bounds in terms of Rademacher complexity. Our theory can be seamlessly transformed into an adversarial learning algorithm for domain adaptation, successfully bridging the gap between theory and algorithm. A series of empirical studies show that our algorithm achieves the state-of-the-art accuracies on challenging domain adaptation tasks. Marginal Structural Model(MSM) Marginal structural models are a class of statistical models used for causal inference in epidemiology. Such models handle the issue of time-dependent confounding in evaluation of the efficacy of interventions by inverse probability weighting for receipt of treatment. For instance, in the study of the effect of zidovudine in AIDS-related mortality, CD4 lymphocyte is used both for treatment indication, is influenced by treatment, and affects survival. Time-dependent confounders are typically highly prognostic of health outcomes and applied in dosing or indication for certain therapies, such as body weight or lab values such as alanine aminotransferase or bilirubin. Marginal Structural Models for Time-varying Endogenous Treatments: A Time-Varying Instrumental Variable Approach Marginalized Average Aggregation(MAA) In weakly-supervised temporal action localization, previous works have failed to locate dense and integral regions for each entire action due to the overestimation of the most salient regions. To alleviate this issue, we propose a marginalized average attentional network (MAAN) to suppress the dominant response of the most salient regions in a principled manner. The MAAN employs a novel marginalized average aggregation (MAA) module and learns a set of latent discriminative probabilities in an end-to-end fashion. MAA samples multiple subsets from the video snippet features according to a set of latent discriminative probabilities and takes the expectation over all the averaged subset features. 
Theoretically, we prove that the MAA module with learned latent discriminative probabilities successfully reduces the difference in responses between the most salient regions and the others. Therefore, MAAN is able to generate better class activation sequences and identify dense and integral action regions in the videos. Moreover, we propose a fast algorithm to reduce the complexity of constructing MAA from $O(2^T)$ to $O(T^2)$. Extensive experiments on two large-scale video datasets show that our MAAN achieves superior performance on weakly-supervised temporal action localization. Margin-Based Pareto Deep Ensemble Pruning(MBPEP) Machine learning algorithms have been effectively applied to various real world tasks. However, it is difficult to provide high-quality machine learning solutions to accommodate an unknown distribution of input datasets; this difficulty is called the uncertainty prediction problem. In this paper, a margin-based Pareto deep ensemble pruning (MBPEP) model is proposed. It achieves the high-quality uncertainty estimation with a small value of the prediction interval width (MPIW) and a high confidence of prediction interval coverage probability (PICP) by using deep ensemble networks. In addition to these networks, unique loss functions are proposed, and these functions make the sub-learners available for standard gradient descent learning. Furthermore, the margin criterion fine-tuning-based Pareto pruning method is introduced to optimize the ensembles. Several experiments including predicting uncertainties of classification and regression are conducted to analyze the performance of MBPEP. The experimental results show that MBPEP achieves a small interval width and a low learning error with an optimal number of ensembles. For the real-world problems, MBPEP performs well on input datasets with unknown distributions and improves learning performance on a multi task problem when compared to that of each single model. 
Marian We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed. Marimekko Chart The Marimekko name has been adopted within business and the management consultancy industry to refer to a bar chart where all the bars are of equal height, there are no spaces between the bars, and the bars are in turn each divided into segments of different width. The design of the ‘marimekko’ chart is said to resemble a Marimekko print. The chart’s design encodes two variables (such as percentage of sales and market share), but it is criticised for making the data hard to perceive and to compare visually. Marked Point Process(MPP) A simple temporal point process (SPP) is an important class of time series, where the sample realization of the process is solely composed of the times at which events occur. Particular examples of point process data are neuronal spike patterns or spike trains, and a large number of distance and similarity metrics for those data have been proposed. A marked point process (MPP) is an extension of a simple temporal point process, in which a certain vector valued mark is associated with each of the temporal points in the SPP. Analyses of MPPs are of practical importance because instances of MPPs include recordings of natural disasters such as earthquakes and tornadoes. In this paper, we introduce an R package mmpp, which implements a number of distance and similarity metrics for SPP, and also extends those metrics for dealing with MPP. 
mmpp Marker Passing ➘ “Spreading Activation” Marker-Assisted Mini-Pooling(mMPA) mMPA Market Basket Analysis(MBA) Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. For example, if you are in an English pub and you buy a pint of beer and don’t buy a bar meal, you are more likely to buy crisps (US: chips) at the same time than somebody who didn’t buy beer. The set of items a customer buys is referred to as an itemset, and market basket analysis seeks to find relationships between purchases. Typically the relationship will be in the form of a rule: IF {beer, no bar meal} THEN {crisps}. The probability that a customer will buy beer without a bar meal (i.e. that the antecedent is true) is referred to as the support for the rule. The conditional probability that a customer will purchase crisps is referred to as the confidence. The algorithms for performing market basket analysis are fairly straightforward (Berry and Linoff is a reasonable introductory resource for this). The complexities mainly arise in exploiting taxonomies, avoiding combinatorial explosions (a supermarket may stock 10,000 or more line items), and dealing with the large amounts of transaction data that may be available. A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business. Although the volume of data has been reduced, we are still asking the user to find a needle in a haystack. Requiring rules to have a high minimum support level and a high confidence level risks missing any exploitable result we might have found. One partial solution to this problem is differential market basket analysis, as described below. Marketing Attribution Attribution is the process of identifying a set of user actions (‘events’) that contribute in some manner to a desired outcome, and then assigning a value to each of these events. 
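The support and confidence defined in the Market Basket Analysis entry above can be made concrete with a short sketch. The basket data and the function names `support` and `confidence` are made up for illustration, and the negated item from the rule (no bar meal) is encoded as a hypothetical item `"no_meal"`. Support follows the entry's definition (probability that the antecedent holds); note that some texts define it over antecedent and consequent together.

```python
def support(transactions, antecedent):
    """Fraction of transactions in which every antecedent item is present."""
    antecedent = set(antecedent)
    return sum(antecedent <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Conditional probability of the consequent given the antecedent."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(antecedent <= basket for basket in transactions)
    n_both = sum(both <= basket for basket in transactions)
    # No evidence for the rule if the antecedent never occurs.
    return n_both / n_antecedent if n_antecedent else 0.0


# Hypothetical pub transactions, one set of items per customer.
baskets = [
    {"beer", "no_meal", "crisps"},
    {"beer", "no_meal", "crisps"},
    {"beer", "no_meal"},
    {"beer", "meal"},
    {"wine", "meal"},
]

if __name__ == "__main__":
    rule_if, rule_then = {"beer", "no_meal"}, {"crisps"}
    print("support:", support(baskets, rule_if))
    print("confidence:", confidence(baskets, rule_if, rule_then))
```

In this toy data three of five customers satisfy the antecedent and two of those three also buy crisps, so the rule has support 3/5 and confidence 2/3.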
Marketing attribution provides a level of understanding of what combination of events influences individuals to engage in a desired behavior, typically referred to as a conversion. Attribution is the process of assigning credit to various marketing efforts when a sale is generated. In the modern world, this is no easy task. There are myriad ways to touch a customer today and the goal of attribution is to tease out the impact that each touch had in convincing you to make a purchase. Was it the email you were sent? Or the Google link you clicked? Or the banner ad you clicked when visiting a different site? Or the ad you saw with your video on YouTube? Or one of many other potential touch points? Or is it a mix? It is quite common today for a customer to have been exposed to multiple influences in the lead up to a purchase. How do you attribute the relationship? The question is not simply academic because it has real world consequences. Budgets are set based on performance. So, the person in charge of Google advertising has a huge motivation to ensure that they get all the credit they deserve. Also, accurate attribution will allow resources to be properly focused on the approaches that truly work best. https://…/1029 Markov Blanket In machine learning, the Markov blanket for a node A in a Bayesian network is the set of nodes ∂A composed of A’s parents, its children, and its children’s other parents. In a Markov network, the Markov blanket of a node is its set of neighboring nodes. A Markov blanket may also be denoted by MB(A). The Markov blanket of a node contains all the variables that shield the node from the rest of the network. This means that the Markov blanket of a node is the only knowledge needed to predict the behavior of that node. The term was coined by Pearl in 1988. 
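The Markov blanket definition above (parents, children, and the children's other parents) translates directly into a few set operations. The sketch below assumes the Bayesian network is given as a dictionary mapping each node to the set of its parents; the function name `markov_blanket` and the small sprinkler-style example network are illustrative and not taken from the entry.

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG given as {node: set of parents}."""
    # Children are the nodes that list `node` among their parents.
    children = {child for child, ps in parents.items() if node in ps}
    # Co-parents are all parents of those children (includes `node` itself).
    co_parents = {p for child in children for p in parents[child]}
    blanket = set(parents.get(node, ())) | children | co_parents
    blanket.discard(node)  # a node is not part of its own blanket
    return blanket


# Hypothetical network: Cloudy -> Sprinkler, Cloudy -> Rain,
# Sprinkler -> WetGrass, Rain -> WetGrass.
network = {
    "Cloudy": set(),
    "Sprinkler": {"Cloudy"},
    "Rain": {"Cloudy"},
    "WetGrass": {"Sprinkler", "Rain"},
}

if __name__ == "__main__":
    for name in network:
        print(name, sorted(markov_blanket(network, name)))
```

For `Sprinkler` the blanket contains its parent `Cloudy`, its child `WetGrass`, and `Rain`, the other parent of that child.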
In a Bayesian network, the values of the parents and children of a node evidently give information about that node; however, its children’s parents also have to be included, because they can be used to explain away the node in question. In a Markov random field, the Markov blanket for a node is simply its adjacent nodes. Markov Brains Markov Brains are a class of evolvable artificial neural networks (ANN). They differ from conventional ANNs in many aspects, but the key difference is that instead of a layered architecture, with each node performing the same function, Markov Brains are networks built from individual computational components. These computational components interact with each other, receive inputs from sensors, and control motor outputs. The function of the computational components, their connections to each other, as well as connections to sensors and motors are all subject to evolutionary optimization. Here we describe in detail how a Markov Brain works, what techniques can be used to study them, and how they can be evolved. Markov Chain A Markov chain (discrete-time Markov chain or DTMC), named after Andrey Markov, is a mathematical system that undergoes transitions from one state to another on a state space. It is a random process usually characterized as memoryless: the next state depends only on the current state and not on the sequence of events that preceded it. This specific kind of ‘memorylessness’ is called the Markov property. Markov chains have many applications as statistical models of real-world processes. http://…/9789814451505 Markov Chain Gradient Descent Stochastic gradient methods are the workhorse (algorithms) of large-scale optimization problems in machine learning, signal processing, and other computational sciences and engineering. This paper studies Markov chain gradient descent, a variant of stochastic gradient descent where the random samples are taken on the trajectory of a Markov chain. 
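To illustrate the idea of taking gradient samples along the trajectory of a Markov chain, here is a toy sketch rather than the method analysed in the paper. It minimizes the average of 0.5 * (x - a_i)^2, whose minimizer is the mean of the a_i, while the sample index i follows a lazy random walk on a ring. The objective, the chain, the 1/k step size, and the name `markov_chain_gd` are all assumptions chosen for simplicity.

```python
import random


def markov_chain_gd(a, steps=20000, seed=0):
    """Minimize mean(0.5 * (x - a_i)**2) with samples drawn along a Markov chain."""
    rng = random.Random(seed)
    n = len(a)
    i = 0    # current state of the chain, i.e. index of the current sample
    x = 0.0  # current iterate
    for k in range(1, steps + 1):
        # Lazy random walk on a ring: stay, or move to a neighbouring index.
        i = (i + rng.choice((-1, 0, 1))) % n
        grad = x - a[i]  # gradient of 0.5 * (x - a[i])**2 at x
        x -= grad / k    # diminishing step size 1/k (assumed)
    return x


if __name__ == "__main__":
    data = [0.0, 1.0, 2.0, 3.0, 4.0]
    print(markov_chain_gd(data), "vs. mean", sum(data) / len(data))
```

Unlike plain stochastic gradient descent, consecutive samples here are correlated because each index is a neighbour of the previous one; with the 1/k step the iterate is simply the running average of the visited a_i, which approaches the mean as the walk covers the ring evenly.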
Existing results for this method assume convex objectives and a reversible Markov chain and thus have their limitations. We establish new non-ergodic convergence under wider step sizes, for nonconvex problems, and for non-reversible finite-state Markov chains. Nonconvexity makes our method applicable to broader problem classes. Non-reversible finite-state Markov chains, on the other hand, can mix substantially faster. To obtain these results, we introduce a new technique that varies the mixing levels of the Markov chains. The reported numerical results validate our contributions. Markov Chain Las Vegas(MCLV) We propose a Las Vegas transformation of Markov Chain Monte Carlo (MCMC) estimators of Restricted Boltzmann Machines (RBMs). We denote our approach Markov Chain Las Vegas (MCLV). MCLV gives statistical guarantees in exchange for random running times. MCLV uses a stopping set built from the training data and has a maximum number of Markov chain steps K (referred to as MCLV-K). We present an MCLV-K gradient estimator (LVS-K) for RBMs and explore the correspondence and differences between LVS-K and Contrastive Divergence (CD-K), with LVS-K significantly outperforming CD-K training RBMs over the MNIST dataset, indicating MCLV to be a promising direction in learning generative models. Markov Chain Monte Carlo(MCMC) In statistics, Markov chain Monte Carlo (MCMC) methods (which include random walk Monte Carlo methods) are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a large number of steps is then used as a sample of the desired distribution. The quality of the sample improves as a function of the number of steps. Usually it is not hard to construct a Markov chain with the desired properties. 
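As an illustration of constructing a Markov chain whose equilibrium distribution is the desired one, the following sketch implements a random-walk Metropolis sampler, one of the random walk Monte Carlo methods mentioned above. The target (a standard normal, given through its unnormalized log-density), the uniform proposal width, the burn-in length, and the function name `metropolis` are assumptions chosen for the example.

```python
import math
import random


def metropolis(log_density, n_samples, x0=0.0, step=2.0, burn_in=1000, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalized log-density."""
    rng = random.Random(seed)
    x, log_p = x0, log_density(x0)
    samples = []
    for t in range(burn_in + n_samples):
        # Symmetric proposal: uniform perturbation around the current state.
        y = x + rng.uniform(-step, step)
        log_q = log_density(y)
        # Accept with probability min(1, p(y) / p(x)).
        if rng.random() < math.exp(min(0.0, log_q - log_p)):
            x, log_p = y, log_q
        # Discard early states, before the chain is near equilibrium.
        if t >= burn_in:
            samples.append(x)
    return samples


if __name__ == "__main__":
    draws = metropolis(lambda v: -0.5 * v * v, 50000)
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print("mean:", mean, "variance:", var)
```

The sample mean and variance should come out close to 0 and 1. Consecutive draws are correlated, so the effective sample size is smaller than the number of stored states.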
The more difficult problem is to determine how many steps are needed to converge to the stationary distribution within an acceptable error. A good chain will have rapid mixing (the stationary distribution is reached quickly starting from an arbitrary position), described further under Markov chain mixing time. Markov Chain Neural Network In this work we present a modified neural network model which is capable of simulating Markov Chains. We show how to express and train such a network, how to ensure given statistical properties reflected in the training data and we demonstrate several applications where the network produces non-deterministic outcomes. One example is a random walker model, e.g. useful for simulation of Brownian motions or a natural Tic-Tac-Toe network which ensures non-deterministic game behavior. Markov Cluster Algorithm(MCL) The MCL algorithm is short for the Markov Cluster Algorithm, a fast and scalable unsupervised cluster algorithm for graphs (also known as networks) based on simulation of (stochastic) flow in graphs. MCL Markov Decision Process(MDP) Markov decision processes (MDPs), named after Andrey Markov, provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying a wide range of optimization problems solved via dynamic programming and reinforcement learning. MDPs were known at least as early as the 1950s (cf. Bellman 1957). A core body of research on Markov decision processes resulted from Ronald A. Howard’s book published in 1960, Dynamic Programming and Markov Processes. They are used in a wide range of disciplines, including robotics, automated control, economics, and manufacturing. Markov Decision Process for Diversifying the Search Results in Information Retrieval(MDP-DIV) Recently, some studies have utilized the Markov Decision Process for diversifying (MDP-DIV) the search results in information retrieval. 
Though promising performances can be delivered, MDP-DIV suffers from a very slow convergence, which hinders its usability in real applications. In this paper, we aim to promote the performance of MDP-DIV by speeding up the convergence rate without much accuracy sacrifice. The slow convergence is incurred by two main reasons: the large action space and data scarcity. On the one hand, the sequential decision making at each position needs to evaluate the query-document relevance for all the candidate set, which results in a huge searching space for MDP; on the other hand, due to the data scarcity, the agent has to proceed more ‘trial and error’ interactions with the environment. To tackle this problem, we propose MDP-DIV-kNN and MDP-DIV-NTN methods. The MDP-DIV-kNN method adopts a $k$ nearest neighbor strategy, i.e., discarding the $k$ nearest neighbors of the recently-selected action (document), to reduce the diversification searching space. The MDP-DIV-NTN employs a pre-trained diversification neural tensor network (NTN-DIV) as the evaluation model, and combines the results with MDP to produce the final ranking solution. The experiment results demonstrate that the two proposed methods indeed accelerate the convergence rate of the MDP-DIV, which is 3x faster, while the accuracies produced barely degrade, or even are better. Markov Jump Process(MJP) In the context of a continuous-time Markov process, the Kolmogorov equations, including Kolmogorov forward equations and Kolmogorov backward equations, are a pair of systems of differential equations that describe the time-evolution of the probability P(x,s;y,t), where x,y in Omega (the state space) and t > s are the final and initial time respectively. Markov Logic Network(MLN) With the increase of dirty data, data cleaning turns into a crux of data analysis. Most of the existing algorithms rely on either qualitative techniques (e.g., data rules) or quantitative ones (e.g., statistical methods). 
In this paper, we present a novel hybrid data cleaning framework on top of Markov logic networks (MLNs), termed as MLNClean, which is capable of cleaning both schema-level and instance-level errors. MLNClean mainly consists of two cleaning stages, namely, first cleaning multiple data versions separately (each of which corresponds to one data rule), and then deriving the final clean data based on multiple data versions. Moreover, we propose a series of techniques/concepts, e.g., the MLN index, the concepts of reliability score and fusion score, to facilitate the cleaning process. Extensive experimental results on both real and synthetic datasets demonstrate the superiority of MLNClean to the state-of-the-art approach in terms of both accuracy and efficiency. Markov Modulated Hawkes Process(MMHP) Modeling event dynamics is central to many disciplines. Patterns in observed event arrival times are commonly modeled using point processes. Such event arrival data often exhibits self-exciting, heterogeneous and sporadic trends, which is challenging for conventional models. It is reasonable to assume that there exists a hidden state process that drives different event dynamics at different states. In this paper, we propose a Markov Modulated Hawkes Process (MMHP) model for learning such a mixture of event dynamics and develop corresponding inference algorithms. Numerical experiments using synthetic data and data from an animal behavior study demonstrate that MMHP with the proposed estimation algorithms consistently recover the true hidden state process in simulations, and separately captures distinct event dynamics that reveal interesting social structures in the real data. Markov Random Field(MRF) In the domain of physics and probability, a Markov random field (often abbreviated as MRF), Markov network or undirected graphical model is a set of random variables having a Markov property described by an undirected graph. 
A Markov random field is similar to a Bayesian network in its representation of dependencies; the differences are that Bayesian networks are directed and acyclic, whereas Markov networks are undirected and may be cyclic. Thus, a Markov network can represent certain dependencies that a Bayesian network cannot (such as cyclic dependencies); on the other hand, it can’t represent certain dependencies that a Bayesian network can (such as induced dependencies). Markov switch smooth-transition HYGARCH model The HYGARCH model is mainly used to model long-range dependence in volatility. We propose a Markov switch smooth-transition HYGARCH model, where the volatility in each state is a time-dependent convex combination of GARCH and FIGARCH. This model provides a flexible structure to capture different levels of volatility as well as short and long memory effects. The necessary and sufficient condition for asymptotic stability is derived. The forecast of the conditional variance is studied by using all past information in a parsimonious way. Bayesian estimations based on Gibbs sampling are provided. A simulation study is given to evaluate the estimations and model stability. The competitive performance of the proposed model is shown by comparing it with the HYGARCH and smooth-transition HYGARCH models for some period of the S&P 500 indices based on volatility and value-at-risk forecasts. Markov-Conley Chain(MCC) α-Rank: Multi-Agent Evaluation by Evolution Markov-Modulated Linear Regression Classical linear regression is considered for a case when regression parameters depend on the external random environment. The latter is described as a continuous-time Markov chain with finite state space. Here the expected sojourn times in various states are additional regressors. Necessary formulas for an estimation of regression parameters have been derived. The numerical example illustrates the results obtained. 
Markowitz Efficient Frontier In modern portfolio theory, the efficient frontier (or portfolio frontier) is an investment portfolio which occupies the ‘efficient’ parts of the risk-return spectrum. Formally, it is the set of portfolios which satisfy the condition that no other portfolio exists with a higher expected return but with the same standard deviation of return. The efficient frontier was first formulated by Harry Markowitz in 1952. Markowitz Portfolio Selection Problem Bayesian learning for the Markowitz portfolio selection problem MARVIN In this demo paper, we introduce the DARPA D3M program for automatic machine learning (ML) and JPL’s MARVIN tool that provides an environment to locate, annotate, and execute machine learning primitives for use in ML pipelines. MARVIN is a web-based application and associated back-end interface written in Python that enables composition of ML pipelines from hundreds of primitives from the world of Scikit-Learn, Keras, DL4J and other widely used libraries. MARVIN allows for the creation of Docker containers that run on Kubernetes clusters within DARPA to provide an execution environment for automated machine learning. MARVIN currently contains over 400 datasets and challenge problems from a wide array of ML domains including routine classification and regression to advanced video/image classification and remote sensing. Mashboard Also called real-time dashboard, a mashboard is a Web 2.0 buzzword that is used to describe analytic mash-ups that allow businesses to create or add components that may analyze and present data, look up inventory, accept orders, and other tasks without ever having to access the system that carries out the transaction. Mask Editor Deep convolutional neural network (DCNN) is the state-of-the-art method for image segmentation, which is one of key challenging computer vision tasks. However, DCNN requires a lot of training images with corresponding image masks to get a good segmentation result. 
Image annotation software which is easy to use and allows fast image mask generation is in great demand. To the best of our knowledge, all existing image annotation software support only drawing bounding polygons, bounding boxes, or bounding ellipses to mark target objects. These existing software are inefficient when targeting objects that have irregular shapes (e.g., defects in fabric images or tire images). In this paper we design an easy-to-use image annotation software called Mask Editor for image mask generation. Mask Editor allows drawing any bounding curve to mark objects and improves efficiency to mark objects with irregular shapes. Mask Editor also supports drawing bounding polygons, drawing bounding boxes, drawing bounding ellipses, painting, erasing, super-pixel-marking, image cropping, multi-class masks, mask loading, and mask modifying. Mask Scoring R-CNN Letting a deep network be aware of the quality of its own predictions is an interesting yet important problem. In the task of instance segmentation, the confidence of instance classification is used as mask quality score in most instance segmentation frameworks. However, the mask quality, quantified as the IoU between the instance mask and its ground truth, is usually not well correlated with classification score. In this paper, we study this problem and propose Mask Scoring R-CNN which contains a network block to learn the quality of the predicted instance masks. The proposed network block takes the instance feature and the corresponding predicted mask together to regress the mask IoU. The mask scoring strategy calibrates the misalignment between mask quality and mask score, and improves instance segmentation performance by prioritizing more accurate mask predictions during COCO AP evaluation. By extensive evaluations on the COCO dataset, Mask Scoring R-CNN brings consistent and noticeable gain with different models, and outperforms the state-of-the-art Mask R-CNN. 
We hope our simple and effective approach will provide a new direction for improving instance segmentation. The source code of our method is available at \url{https://…/maskscoring_rcnn}. Mask-Based Text Removal Network(MTRNet) Text removal algorithms have been proposed for uni-lingual scripts with regular shapes and layouts. However, to the best of our knowledge, a generic text removal method which is able to remove all or user-specified text regions regardless of font, script, language or shape is not available. Developing such a generic text eraser for real scenes is a challenging task, since it inherits all the challenges of multi-lingual and curved text detection and inpainting. To fill this gap, we propose a mask-based text removal network (MTRNet). MTRNet is a conditional adversarial generative network (cGAN) with an auxiliary mask. The introduced auxiliary mask not only makes the cGAN a generic text eraser, but also enables stable training and early convergence on a challenging large-scale synthetic dataset, initially proposed for text detection in real scenes. What’s more, MTRNet achieves state-of-the-art results on several real-world datasets including ICDAR 2013, ICDAR 2017 MLT, and CTW1500, without being explicitly trained on this data, outperforming previous state-of-the-art methods trained directly on these datasets. Masked Autoencoder for Distribution Estimation(MADE) There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder’s parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. 
We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. GitXiv Masked Convolutional Generative Flow(MaCow) Flow-based generative models, conceptually attractive due to tractability of both the exact log-likelihood computation and latent-variable inference, and efficiency of both training and sampling, have led to a number of impressive empirical successes and spawned many advanced variants and theoretical investigations. Despite their computational efficiency, the density estimation performance of flow-based generative models significantly falls behind that of state-of-the-art autoregressive models. In this work, we introduce masked convolutional generative flow (MaCow), a simple yet effective architecture of generative flow using masked convolution. By restricting the local connectivity in a small kernel, MaCow enjoys the properties of fast and stable training, and efficient sampling, while achieving significant improvements over Glow for density estimation on standard image benchmarks, considerably narrowing the gap to autoregressive models. MAsked Sequence to Sequence pre-training(MASS) Pre-training and fine-tuning, e.g., BERT, have achieved great success in language understanding by transferring knowledge from rich-resource pre-training tasks to low/zero-resource downstream tasks. Inspired by the success of BERT, we propose MAsked Sequence to Sequence pre-training (MASS) for encoder-decoder based language generation tasks. 
MASS adopts the encoder-decoder framework to reconstruct a sentence fragment given the remaining part of the sentence: its encoder takes a sentence with randomly masked fragment (several consecutive tokens) as input, and its decoder tries to predict this masked fragment. In this way, MASS can jointly train the encoder and decoder to develop the capability of representation extraction and language modeling. By further fine-tuning on a variety of zero/low-resource language generation tasks, including neural machine translation, text summarization and conversational response generation (3 tasks and totally 8 datasets), MASS achieves significant improvements over the baselines without pre-training or with other pre-training methods. Specially, we achieve the state-of-the-art accuracy (37.5 in terms of BLEU score) on the unsupervised English-French translation, even beating the early attention-based supervised model. MaskGAN Neural text generation models are often autoregressive language models or seq2seq models. These models generate text by sampling words sequentially, with each word conditioned on the previous word, and are state-of-the-art for several machine translation and summarization benchmarks. These benchmarks are often defined by validation perplexity even though this is not a direct measure of the quality of the generated text. Additionally, these models are typically trained via maximum likelihood and teacher forcing. These methods are well-suited to optimizing perplexity but can result in poor sample quality since generating text requires conditioning on sequences of words that may have never been observed at training time. We propose to improve sample quality using Generative Adversarial Networks (GANs), which explicitly train the generator to produce high quality samples and have shown a lot of success in image generation. GANs were originally designed to output differentiable values, so discrete language generation is challenging for them. 
We claim that validation perplexity alone is not indicative of the quality of text generated by a model. We introduce an actor-critic conditional GAN that fills in missing text conditioned on the surrounding context. We show, qualitatively and quantitatively, evidence that this produces more realistic conditional and unconditional text samples compared to a maximum likelihood trained model. Mask-ShadowGAN This paper presents a new method for shadow removal using unpaired data, enabling us to avoid tedious annotations and obtain more diverse training samples. However, directly employing adversarial learning and cycle-consistency constraints is insufficient to learn the underlying relationship between the shadow and shadow-free domains, since the mapping between shadow and shadow-free images is not simply one-to-one. To address the problem, we formulate Mask-ShadowGAN, a new deep framework that automatically learns to produce a shadow mask from the input shadow image and then takes the mask to guide the shadow generation via re-formulated cycle-consistency constraints. In particular, the framework simultaneously learns to produce shadow masks and learns to remove shadows, to maximize the overall performance. Also, we prepared an unpaired dataset for shadow removal and demonstrated the effectiveness of Mask-ShadowGAN in various experiments, even though it was trained on unpaired data. Mass Displacement Network(MDN) Despite the large improvements in performance attained by using deep learning in computer vision, one can often further improve results with some additional post-processing that exploits the geometric nature of the underlying task. This commonly involves displacing the posterior distribution of a CNN in a way that makes it more appropriate for the task at hand, e.g. better aligned with local image features, or more compact. 
In this work we integrate this geometric post-processing within a deep architecture, introducing a differentiable and probabilistically sound counterpart to the common geometric voting technique used for evidence accumulation in vision. We refer to the resulting neural models as Mass Displacement Networks (MDNs), and apply them to human pose estimation in two distinct setups: (a) landmark localization, where we collapse a distribution to a point, allowing for precise localization of body keypoints and (b) communication across body parts, where we transfer evidence from one part to the other, allowing for a globally consistent pose estimate. We evaluate on large-scale pose estimation benchmarks, such as MPII Human Pose and COCO datasets, and report systematic improvements when compared to strong baselines. Mass Personalization Mass personalization is defined as custom tailoring by a company in accordance with its end users tastes and preferences. From collaborative engineering perspective, mass customization can be viewed as collaborative efforts between customers and manufacturers, who have different sets of priorities and need to jointly search for solutions that best match customers’ individual specific needs with manufacturers’ customization capabilities. The main difference between mass customization and mass personalization is that customization is the ability for a company to give its customers an opportunity to create and choose product to certain specifications, but does have limits. Clothing industry has also adopted the mass customization paradigm and some footwear retailers are producing mass customized shoes. The gaming market is seeing personalization in the new custom controller industry. A new, and notable, company called “Experience Custom” gives customers the opportunity to order personalized gaming controllers. 
A website knowing a user’s location, and buying habits, will present offers and suggestions tailored to the user’s demographics; this is an example of mass personalization. The personalization is not individual but rather the user is first classified and then the personalization is based on the group they belong to. Behavioral targeting represents a concept that is similar to mass personalization. MASSES We introduce MASSES, a simple evaluation metric for the task of Visual Question Answering (VQA). In its standard form, the VQA task is operationalized as follows: Given an image and an open-ended question in natural language, systems are required to provide a suitable answer. Currently, model performance is evaluated by means of a somehow simplistic metric: If the predicted answer is chosen by at least 3 human annotators out of 10, then it is 100% correct. Though intuitively valuable, this metric has some important limitations. First, it ignores whether the predicted answer is the one selected by the Majority (MA) of annotators. Second, it does not account for the quantitative Subjectivity (S) of the answers in the sample (and dataset). Third, information about the Semantic Similarity (SES) of the responses is completely neglected. Based on such limitations, we propose a multi-component metric that accounts for all these issues. We show that our metric is effective in providing a more fine-grained evaluation both on the quantitative and qualitative level. Massive Online Analysis(MOA) MOA (Massive Online Analysis) is a free open-source software specific for Data stream mining with Concept drift. It’s written in Java and developed at the University of Waikato, New Zealand. MOA is an open-source framework software that allows to build and run experiments of machine learning or data mining on evolving data streams. It includes a set of learners and stream generators that can be used from the Graphical User Interface (GUI), the command-line, and the Java API. 
MOA contains several collections of machine learning algorithms for classification, regression, clustering, outlier detection and recommendation engines. http://moa.cms.waikato.ac.nz Massive Open Online Course(MOOC) A Massive Open Online Course (MOOC) is an online course aimed at unlimited participation and open access via the web. In addition to traditional course materials such as videos, readings, and problem sets, MOOCs provide interactive user forums that help build a community for students, professors, and teaching assistants (TAs). MOOCs are a recent development in distance education which began to emerge in 2012. Massively-Parallel Neural Array(MPNA) The state-of-the-art accelerators for Convolutional Neural Networks (CNNs) typically focus on accelerating only the convolutional layers, but do not prioritize the fully-connected layers much. Hence, they lack a synergistic optimization of the hardware architecture and diverse dataflows for the complete CNN design, which can provide a higher potential for performance/energy efficiency. Towards this, we propose a novel Massively-Parallel Neural Array (MPNA) accelerator that integrates two heterogeneous systolic arrays and respective highly-optimized dataflow patterns to jointly accelerate both the convolutional (CONV) and the fully-connected (FC) layers. Besides fully-exploiting the available off-chip memory bandwidth, these optimized dataflows enable high data-reuse of all the data types (i.e., weights, input and output activations), and thereby enable our MPNA to achieve high energy savings. We synthesized our MPNA architecture using the ASIC design flow for a 28nm technology, and performed functional and timing validation using multiple real-world complex CNNs. MPNA achieves 149.7GOPS/W at 280MHz and consumes 239mW. 
Experimental results show that our MPNA architecture provides 1.7x overall performance improvement compared to a state-of-the-art accelerator, and 51% energy saving compared to the baseline architecture. MATCHA The trade-off between convergence error and communication delays in decentralized stochastic gradient descent (SGD) is dictated by the sparsity of the inter-worker communication graph. In this paper, we propose MATCHA, a decentralized SGD method where we use matching decomposition sampling of the base graph to parallelize inter-worker information exchange so as to significantly reduce communication delay. At the same time, under standard assumptions for any general topology, in spite of the significant reduction of the communication delay, MATCHA maintains the same convergence rate as that of the state-of-the-art in terms of epochs. Experiments on a suite of datasets and deep neural networks validate the theoretical analysis and demonstrate the effectiveness of the proposed scheme as far as reducing communication delays is concerned. Matchbox We present a probabilistic model for generating personalised recommendations of items to users of a web service. The Matchbox system makes use of content information in the form of user and item meta data in combination with collaborative filtering information from previous user behavior in order to predict the value of an item for a user. Users and items are represented by feature vectors which are mapped into a low-dimensional ‘trait space’ in which similarity is measured in terms of inner products. The model can be trained from different types of feedback in order to learn user-item preferences. Here we present three alternatives: direct observation of an absolute rating each user gives to some items, observation of a binary preference (like/don’t like) and observation of a set of ordinal ratings on a user-specific scale. 
Efficient inference is achieved by approximate message passing involving a combination of Expectation Propagation (EP) and Variational Message Passing. We also include a dynamics model which allows an item’s popularity, a user’s taste or a user’s personal rating scale to drift over time. By using Assumed-Density Filtering (ADF) for training, the model requires only a single pass through the training data. This is an on-line learning algorithm capable of incrementally taking account of new data so the system can immediately reflect the latest user preferences. We evaluate the performance of the algorithm on the MovieLens and Netflix data sets consisting of approximately 1,000,000 and 100,000,000 ratings respectively. This demonstrates that training the model using the on-line ADF approach yields state-of-the-art performance with the option of improving performance further if computational resources are available by performing multiple EP passes over the training data. Matching Matching is a statistical technique which is used to evaluate the effect of a treatment by comparing the treated and the non-treated units in an observational study or quasi-experiment (i.e. when the treatment is not randomly assigned). The goal of matching is, for every treated unit, to find one (or more) non-treated unit(s) with similar observable characteristics against whom the effect of the treatment can be assessed. By matching treated units to similar non-treated units, matching enables a comparison of outcomes among treated and non-treated units to estimate the effect of the treatment reducing bias due to confounding. Propensity score matching, an early matching technique, was developed as part of the Rubin causal model. Matching has been promoted by Donald Rubin. It was prominently criticized in economics by LaLonde (1986), who compared estimates of treatment effects from an experiment to comparable estimates produced with matching methods and showed that matching methods are biased. 
Dehejia and Wahba (1999) reevaluated LaLonde’s critique and showed that matching is a good solution. Similar critiques have been raised in political science and sociology journals. Matching Sparsifier In this paper, we present a construction of a ‘matching sparsifier’, that is, a sparse subgraph of the given graph that preserves large matchings approximately and is robust to modifications of the graph. We use this matching sparsifier to obtain several new algorithmic results for the maximum matching problem: * An almost $(3/2)$-approximation one-way communication protocol for the maximum matching problem, significantly simplifying the $(3/2)$-approximation protocol of Goel, Kapralov, and Khanna (SODA 2012) and extending it from bipartite graphs to general graphs. * An almost $(3/2)$-approximation algorithm for the stochastic matching problem, improving upon and significantly simplifying the previous $1.999$-approximation algorithm of Assadi, Khanna, and Li (EC 2017). * An almost $(3/2)$-approximation algorithm for the fault-tolerant matching problem, which, to our knowledge, is the first non-trivial algorithm for this problem. Our matching sparsifier is obtained by proving new properties of the edge-degree constrained subgraph (EDCS) of Bernstein and Stein (ICALP 2015; SODA 2016), designed in the context of maintaining matchings in dynamic graphs, that identify EDCS as an excellent choice for a matching sparsifier. This leads to surprisingly simple and non-technical proofs of the above results in a unified way. Along the way, we also provide a much simpler proof of the fact that an EDCS is guaranteed to contain a large matching, which may be of independent interest. MatchZoo In recent years, deep neural models have been widely adopted for text matching tasks, such as question answering and information retrieval, showing improved performance as compared with previous methods. 
In this paper, we introduce the MatchZoo toolkit that aims to facilitate the designing, comparing and sharing of deep text matching models. Specifically, the toolkit provides a unified data preparation module for different text matching problems, a flexible layer-based model construction process, and a variety of training objectives and evaluation metrics. In addition, the toolkit has implemented two schools of representative deep text matching models, namely representation-focused models and interaction-focused models. Finally, users can easily modify existing models, create and share their own models for text matching in MatchZoo. MatchZoo Math Kernel Library(MKL) Intel Math Kernel Library (Intel MKL) is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, and vector math. The routines in MKL are hand optimized by exploiting Intel’s multicore and many-core processors. The library supports Intel and compatible processors and is available for Windows, Linux and OS X operating systems. MKL functions are optimized with each new processor releases from Intel. Mathematica Mathematica is a computational software program used in many scientific, engineering, mathematical and computing fields, based on symbolic mathematics. It was conceived by Stephen Wolfram and is developed by Wolfram Research of Champaign, Illinois. The Wolfram Language is the programming language used in Mathematica. Mathematical Statistics Mathematical statistics is the application of mathematics to statistics, which was originally conceived as the science of the state – the collection and analysis of facts about a country: its economy, land, military, population, and so forth. Mathematical techniques which are used for this include mathematical analysis, linear algebra, stochastic analysis, differential equations, and measure-theoretic probability theory. 
Mathematical Theory of Evidence ➘ “Theory of Evidence” Mathematics Mathematics (from Greek μάθημα máthēma, ‘knowledge, study, learning’), often shortened to maths or math, is the study of topics such as quantity (numbers), structure, space, and change. There is a range of views among mathematicians and philosophers as to the exact scope and definition of mathematics. Mathematicians seek out patterns and use them to formulate new conjectures. Mathematicians resolve the truth or falsity of conjectures by mathematical proof. When mathematical structures are good models of real phenomena, then mathematical reasoning can provide insight or predictions about nature. Through the use of abstraction and logic, mathematics developed from counting, calculation, measurement, and the systematic study of the shapes and motions of physical objects. Practical mathematics has been a human activity for as far back as written records exist. The research required to solve mathematical problems can take years or even centuries of sustained inquiry. Mathematics Content Understanding Although the scientific digital library is growing at a rapid pace, scholars/students often find reading Science, Technology, Engineering, and Mathematics (STEM) literature daunting, especially for the math-content/formula. In this paper, we propose a novel problem, “mathematics content understanding”, for cyberlearning and cyberreading. To address this problem, we create a Formula Evolution Map (FEM) offline and implement a novel online learning/reading environment, PDF Reader with Math-Assistant (PRMA), which incorporates innovative math-scaffolding methods. The proposed algorithm/system can auto-characterize student emerging math-information need while reading a paper and enable students to readily explore the formula evolution trajectory in FEM. 
Based on a math-information need, PRMA utilizes innovative joint embedding, formula evolution mining, and heterogeneous graph mining algorithms to recommend high quality Open Educational Resources (OERs), e.g., video, Wikipedia page, or slides, to help students better understand the math-content in the paper. Evaluation and exit surveys show that the PRMA system and the proposed formula understanding algorithm can effectively assist master and PhD students better understand the complex math-content in the class readings. MathJax A JavaScript display engine for mathematics that works in all browsers. MATLAB MATLAB is the high-level language and interactive environment used by millions of engineers and scientists worldwide. It lets you explore and visualize ideas and collaborate across disciplines including signal and image processing, communications, control systems, and computational finance. You can use MATLAB in projects such as modeling energy consumption to build smart power grids, developing control algorithms for hypersonic vehicles, analyzing weather data to visualize the track and intensity of hurricanes, and running millions of simulations to pinpoint optimal dosing for antibiotics. Matricized-Tensor Times Khatri-Rao Product(MTTKRP) The matricized-tensor times Khatri-Rao product (MTTKRP) is the computational bottleneck for algorithms computing CP decompositions of tensors. In this paper, we develop shared-memory parallel algorithms for MTTKRP involving dense tensors. The algorithms cast nearly all of the computation as matrix operations in order to use optimized BLAS subroutines, and they avoid reordering tensor entries in memory. We benchmark sequential and parallel performance of our implementations, demonstrating high sequential performance and efficient parallel scaling. We use our parallel implementation to compute a CP decomposition of a neuroimaging data set and achieve a speedup of up to $7.4\times$ over existing parallel software. 
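As a rough illustration of what MTTKRP computes, the sketch below evaluates it for a dense 3-way tensor in two ways: as a single matrix product against an explicitly formed Khatri-Rao product, and as a direct index contraction used as a cross-check. This is a naive reference version, not the shared-memory parallel algorithms of the entry above; the function names, the choice of the first mode, and the row-major unfolding are assumptions made for the example.

```python
import numpy as np


def khatri_rao(B, C):
    """Column-wise Kronecker product of B (J x R) and C (K x R).

    Row j*K + k of the result holds B[j, :] * C[k, :].
    """
    J, R = B.shape
    K, R2 = C.shape
    if R != R2:
        raise ValueError("factor matrices must have the same number of columns")
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)


def mttkrp_first_mode(X, B, C):
    """MTTKRP for the first mode of a dense tensor X (I x J x K).

    The row-major reshape of X puts entry (i, j, k) in column j*K + k,
    which matches the row ordering produced by khatri_rao(B, C), so the
    whole computation is one matrix-matrix product (a BLAS call in NumPy).
    """
    I, J, K = X.shape
    return X.reshape(I, J * K) @ khatri_rao(B, C)


def mttkrp_first_mode_reference(X, B, C):
    """Same quantity written as an explicit contraction, for checking."""
    return np.einsum("ijk,jr,kr->ir", X, B, C)


rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))
B = rng.standard_normal((5, 3))
C = rng.standard_normal((6, 3))

M = mttkrp_first_mode(X, B, C)
print(M.shape)
print(np.allclose(M, mttkrp_first_mode_reference(X, B, C)))
```

Forming the Khatri-Rao product explicitly costs extra memory (J*K rows), which is one reason optimized implementations restructure the computation; the sketch only fixes the definition.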
Matrix Calculus In mathematics, matrix calculus is a specialized notation for doing multivariable calculus, especially over spaces of matrices. It collects the various partial derivatives of a single function with respect to many variables, and/or of a multivariate function with respect to a single variable, into vectors and matrices that can be treated as single entities. This greatly simplifies operations such as finding the maximum or minimum of a multivariate function and solving systems of differential equations. The notation used here is commonly used in statistics and engineering, while the tensor index notation is preferred in physics. Two competing notational conventions split the field of matrix calculus into two separate groups. The two groups can be distinguished by whether they write the derivative of a scalar with respect to a vector as a column vector or a row vector. Both of these conventions are possible even when the common assumption is made that vectors should be treated as column vectors when combined with matrices (rather than row vectors). A single convention can be somewhat standard throughout a single field that commonly uses matrix calculus (e.g. econometrics, statistics, estimation theory and machine learning). However, even within a given field different authors can be found using competing conventions. Authors of both groups often write as though their specific convention is standard. Serious mistakes can result when combining results from different authors without carefully verifying that compatible notations are used. Therefore great care should be taken to ensure notational consistency. Definitions of these two conventions and comparisons between them are collected in the layout conventions section. Matrix Decomposition In the mathematical discipline of linear algebra, a matrix decomposition or matrix factorization is a factorization of a matrix into a product of matrices. 
There are many different matrix decompositions; each finds use among a particular class of problems. Matrix Krasulina We present Matrix Krasulina, an algorithm for online k-PCA, by generalizing the classic Krasulina’s method (Krasulina, 1969) from the vector to the matrix case. We show, both theoretically and empirically, that the algorithm naturally adapts to data low-rankness and converges exponentially fast to the ground-truth principal subspace. Notably, our result suggests that despite various recent efforts to accelerate the convergence of stochastic-gradient based methods by adding an O(n)-time variance reduction step, for the k-PCA problem, a truly online SGD variant suffices to achieve exponential convergence on intrinsically low-rank data. Matrix Linear Discriminant Analysis We propose a novel linear discriminant analysis approach for the classification of high-dimensional matrix-valued data that commonly arises from imaging studies. Motivated by the equivalence of the conventional linear discriminant analysis and the ordinary least squares, we consider an efficient nuclear norm penalized regression that encourages a low-rank structure. Theoretical properties including a non-asymptotic risk bound and a rank consistency result are established. Simulation studies and an application to electroencephalography data show the superior performance of the proposed method over the existing approaches. Matrix Profile The last decade has seen a flurry of research on all-pairs-similarity-search (or, self-join) for text, DNA, and a handful of other datatypes, and these systems have been applied to many diverse data mining problems. Surprisingly, however, little progress has been made on addressing this problem for time series subsequences. In this thesis, we have introduced a near universal time series data mining tool called matrix profile which solves the all-pairs-similarity-search problem and caches the output in an easy-to-access fashion. 
The proposed algorithm is not only parameter-free, exact and scalable, but also applicable for both single and multidimensional time series. By building time series data mining methods on top of matrix profile, many time series data mining tasks (e.g., motif discovery, discord discovery, shapelet discovery, semantic segmentation, and clustering) can be efficiently solved. Because the same matrix profile can be shared by a diverse set of time series data mining methods, matrix profile is a versatile, compute-once-use-many-times data structure. We demonstrate the utility of matrix profile for many time series data mining problems, including motif discovery, discord discovery, weakly labeled time series classification, and representation learning on domains as diverse as seismology, entomology, music processing, bioinformatics, human activity monitoring, electrical power-demand monitoring, and medicine. We hope the matrix profile is not the end but the beginning of many more time series data mining projects. Matrix-centric Neural Networks We present a new distributed representation in deep neural nets wherein the information is represented in native form as a matrix. This differs from current neural architectures that rely on vector representations. We consider matrices as central to the architecture and they compose the input, hidden and output layers. The model representation is more compact and elegant – the number of parameters grows only with the largest dimension of the incoming layer rather than the number of hidden units. We derive feed-forward nets that map an input matrix into an output matrix, and recurrent nets which map a sequence of input matrices into a sequence of output matrices. Experiments on handwritten digits recognition, face reconstruction, sequence to sequence learning and EEG classification demonstrate the efficacy and compactness of the matrix-centric architectures. 
MatrixRL Exploration in reinforcement learning (RL) suffers from the curse of dimensionality when the state-action space is large. A common practice is to parameterize the high-dimensional value and policy functions using given features. However, existing methods either have no theoretical guarantee or suffer a regret that is exponential in the planning horizon $H$. In this paper, we propose an online RL algorithm, namely the MatrixRL, that leverages ideas from linear bandits to learn a low-dimensional representation of the probability transition model while carefully balancing the exploitation-exploration tradeoff. We show that MatrixRL achieves a regret bound ${O}\big(H^2d\log T\sqrt{T}\big)$ where $d$ is the number of features. MatrixRL has an equivalent kernelized version, which is able to work with an arbitrary kernel Hilbert space without using explicit features. In this case, the kernelized MatrixRL satisfies a regret bound ${O}\big(H^2\widetilde{d}\log T\sqrt{T}\big)$, where $\widetilde{d}$ is the effective dimension of the kernel space. To the best of our knowledge, for RL using features or kernels, our results are the first regret bounds that are near-optimal in time $T$ and dimension $d$ (or $\widetilde{d}$) and polynomial in the planning horizon $H$. Matrix-Variate Gaussian(MVG) Differential privacy mechanism design has traditionally been tailored for a scalar-valued query function. Although many mechanisms such as the Laplace and Gaussian mechanisms can be extended to a matrix-valued query function by adding i.i.d. noise to each element of the matrix, this method is often suboptimal as it forfeits an opportunity to exploit the structural characteristics typically associated with matrix analysis. 
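As a rough illustration of why a matrix-to-matrix layer can be compact, the sketch below assumes a bilinear layer of the form Y = tanh(Uᵀ X V + B). The abstract does not give the formula, so this form, the `tanh` nonlinearity, and all function names are assumptions, not the authors' exact architecture.

```python
import math


def matmul(A, B):
    """Plain nested-list matrix product A (m x n) times B (n x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matrix_layer(X, U, V, B):
    """Assumed bilinear layer: Y = tanh(U^T X V + B).

    X is c x r, U is c x c_out, V is r x r_out, B is c_out x r_out.
    """
    Z = matmul(matmul(transpose(U), X), V)
    return [[math.tanh(z + b) for z, b in zip(z_row, b_row)]
            for z_row, b_row in zip(Z, B)]


def param_counts(c, r, c_out, r_out):
    """Parameters of the bilinear layer vs. a dense layer on the flattened input."""
    bilinear = c * c_out + r * r_out + c_out * r_out
    dense = (c * r) * (c_out * r_out) + c_out * r_out
    return bilinear, dense
```

Under these assumptions, mapping a 28 x 28 input to a 20 x 20 hidden matrix takes `param_counts(28, 28, 20, 20)` parameters: 1,520 for the bilinear layer, against 314,000 for a dense layer on the flattened vectors.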
To address this challenge, we propose a novel differential privacy mechanism called the Matrix-Variate Gaussian (MVG) mechanism, which adds a matrix-valued noise drawn from a matrix-variate Gaussian distribution, and we rigorously prove that the MVG mechanism preserves $(\epsilon,\delta)$-differential privacy. Furthermore, we introduce the concept of directional noise made possible by the design of the MVG mechanism. Directional noise allows the impact of the noise on the utility of the matrix-valued query function to be moderated. Finally, we experimentally demonstrate the performance of our mechanism using three matrix-valued queries on three privacy-sensitive datasets. We find that the MVG mechanism notably outperforms four previous state-of-the-art approaches, and provides comparable utility to the non-private baseline. Our work thus presents a promising prospect for both future research and implementation of differential privacy for matrix-valued query functions. Matroid In combinatorics, a branch of mathematics, a matroid is a structure that captures and generalizes the notion of linear independence in vector spaces. There are many equivalent ways to define a matroid, the most significant being in terms of independent sets, bases, circuits, closed sets or flats, closure operators, and rank functions. Matroid theory borrows extensively from the terminology of linear algebra and graph theory, largely because it is the abstraction of various notions of central importance in these fields. Matroids have found applications in geometry, topology, combinatorial optimization, network theory and coding theory. MatRox We present MatRox, a novel model-based algorithm and implementation of Hierarchically Semi-Separable (HSS) matrix computations on parallel architectures. MatRox uses a novel storage format to improve data locality and scalability of HSS matrix-matrix multiplications on shared memory multicore processors. 
We build a performance model for HSS matrix-matrix multiplications. Based on the performance model, a mixed-rank heuristic is introduced to find an optimal HSS-tree depth for a faster HSS matrix evaluation. Uniform sampling is used to improve the performance of HSS compression. MatRox outperforms state-of-the-art HSS matrix multiplication codes, GOFMM and STRUMPACK, with average speedups of 2.8x and 6.1x respectively on target multicore processors. Matryoshka Network In this paper, we develop novel, efficient 2D encodings for 3D geometry, which enable reconstructing full 3D shapes from a single image at high resolution. The key idea is to pose 3D shape reconstruction as a 2D prediction problem. To that end, we first develop a simple baseline network that predicts entire voxel tubes at each pixel of a reference view. By leveraging well-proven architectures for 2D pixel-prediction tasks, we attain state-of-the-art results, clearly outperforming purely voxel-based approaches. We scale this baseline to higher resolutions by proposing a memory-efficient shape encoding, which recursively decomposes a 3D shape into nested shape layers, similar to the pieces of a Matryoshka doll. This allows reconstructing highly detailed shapes with complex topology, as demonstrated in extensive experiments; we clearly outperform previous octree-based approaches despite having a much simpler architecture using standard network components. Our Matryoshka networks further enable reconstructing shapes from IDs or shape similarity, as well as shape sampling. Matthews Correlation Coefficient(MCC) The Matthews Correlation Coefficient (MCC) has a range of -1 to 1 where -1 indicates a completely wrong binary classifier while 1 indicates a completely correct binary classifier. Using the MCC allows one to gauge how well their classification model/function is performing. Another method for evaluating classifiers is known as the ROC curve. 
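For the Matthews Correlation Coefficient entry, a minimal sketch of the standard formula computed from the four confusion-matrix counts. The function name `mcc` is illustrative, and returning 0 when the denominator vanishes is a common convention assumed here; it is not stated in the entry.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: undefined cases are reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

A classifier with no errors scores 1, one that gets every label wrong scores -1, and one whose errors balance its correct predictions in every cell scores 0.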
Wikipedia mccr Maucha Diagrams This diagram was proposed by Rezso Maucha in 1932 as a way to visualise the relative ionic composition of water samples. oviz Max of Weighed Distance(MWD) Adversarial attacks add perturbations to the input features with the intent of changing the classification produced by a machine learning system. Small perturbations can yield adversarial examples which are misclassified despite being virtually indistinguishable from the unperturbed input. Classifiers trained with standard neural network techniques are highly susceptible to adversarial examples, allowing an adversary to create misclassifications of their choice. We introduce a new type of network unit, called MWD (max of weighed distance) units, that have a built-in resistance to adversarial attacks. These units are highly non-linear, and we develop the techniques needed to effectively train them. We show that simple interval techniques for propagating perturbation effects through the network enable the efficient computation of robustness (i.e., accuracy guarantees) for MWD networks under any perturbations, including adversarial attacks. MWD networks are significantly more robust to input perturbations than ReLU networks. On permutation invariant MNIST, when test examples can be perturbed by 20% of the input range, MWD networks provably retain accuracy above 83%, while the accuracy of ReLU networks drops below 5%. The provable accuracy of MWD networks is superior even to the observed accuracy of ReLU networks trained with the help of adversarial examples. In the absence of adversarial attacks, MWD networks match the performance of sigmoid networks, and have accuracy only slightly below that of ReLU networks. 
MaxDiff Model MaxDiff is a long-established academic mathematical theory with very specific assumptions about how people make choices: it assumes that respondents evaluate all possible pairs of items within the displayed set and choose the pair that reflects the maximum difference in preference or importance. It may be thought of as a variation of the method of Paired Comparisons. Consider a set in which a respondent evaluates four items: A, B, C and D. If the respondent says that A is best and D is worst, these two responses inform us about five of six possible implied paired comparisons: A > B, A > C, A > D, B > D, C > D. The only paired comparison that cannot be inferred is B vs. C. In a choice among five items, MaxDiff questioning informs on seven of ten implied paired comparisons. How to use Covariates to Improve your MaxDiff Model Maxima Units Search(MUS) An algorithm for extracting identity submatrices of small rank and pivotal units from large and sparse matrices is proposed. The procedure has already been satisfactorily applied for solving the label switching problem in Bayesian mixture models. Here we introduce it on its own and explore possible applications in different contexts. Maximal alpha-Leakage A tunable measure for information leakage called maximal $\alpha$-leakage is introduced. This measure quantifies the maximal gain of an adversary in refining a tilted version of its prior belief of any (potentially random) function of a dataset conditioning on a disclosed dataset. The choice of $\alpha$ determines the specific adversarial action ranging from refining a belief for $\alpha =1$ to guessing the best posterior for $\alpha = \infty$, and for these extremal values this measure simplifies to mutual information (MI) and maximal leakage (MaxL), respectively. For all other $\alpha$ this measure is shown to be the Arimoto channel capacity. 
Several properties of this measure are proven including: (i) quasi-convexity in the mapping between the original and disclosed datasets; (ii) data processing inequalities; and (iii) a composition property. Maximal Clique Enumeration(MCE) Shared-Memory Parallel Maximal Clique Enumeration Maximal Equilibrium-Independent Passivity(MEIP) The theory of network identification, namely identifying the (weighted) interaction topology among a known number of agents, has been widely developed for linear agents over recent years. However, the theory for nonlinear agents is far less developed, and non-applicable to large systems due to long running times. We use the notion of maximal equilibrium-independent passivity (MEIP) and network optimization theory to present a network identification method for nonlinear agents. We do so by first designing a sub-cubic time algorithm for LTI agents, and then augment it by linearization to achieve a sub-cubic time algorithm for network reconstruction for nonlinear agents and controllers. Lastly, we study the problem of network reconstruction from a complexity theory standpoint, showing that the presented algorithms are in fact optimal in terms of time complexity. We provide examples of reconstructing large-scale networks, including a network of first-order linear agents, and a non-linear neural network model. Maximal Information Coefficient(MIC) In statistics, the maximal information coefficient (MIC) is a measure of the strength of the linear or non-linear association between two variables X and Y. The MIC belongs to the maximal information-based nonparametric exploration (MINE) class of statistics. In a simulation study, MIC outperformed some selected low power tests, however concerns have been raised regarding reduced statistical power in detecting some associations in settings with low sample size when compared to powerful methods such as distance correlation and HHG. 
Comparisons with these methods, in which MIC was outperformed, have been published. It is claimed that MIC approximately satisfies a property called equitability which is illustrated by selected simulation studies. It was later proved that no non-trivial coefficient can exactly satisfy the equitability property as defined by Reshef et al. Some criticisms of MIC are addressed by Reshef et al. in further studies published on arXiv. Maximal Label Search(MLS) Many graph search algorithms use a vertex labeling to compute an ordering of the vertices. We examine such algorithms which compute a peo (perfect elimination ordering) of a chordal graph and corresponding algorithms which compute an meo (minimal elimination ordering) of a non-chordal graph, an ordering used to compute a minimal triangulation of the input graph. We express all known peo-computing search algorithms as instances of a generic algorithm called MLS (maximal label search) and generalize Algorithm MLS into CompMLS, which can compute any peo. We then extend these algorithms to versions which compute an meo and likewise generalize all known meo-computing search algorithms. We show that not all minimal triangulations can be computed by such a graph search, and, more surprisingly, that all these search algorithms compute the same set of minimal triangulations, even though the computed meos are different. Finally, we present a complexity analysis of these algorithms. An extended abstract of part of this paper was published in WG 2005. Computing a clique tree with algorithm MLS (Maximal Label Search) Maximal Mean Variance(MMV) We propose a new sufficient dimension reduction approach designed deliberately for high-dimensional classification. This novel method is named maximal mean variance (MMV), inspired by the mean variance index first proposed by Cui, Li and Zhong (2015), which measures the dependence between a categorical random variable with multiple classes and a continuous random variable. 
Our method requires reasonably mild restrictions on the predicting variables and keeps the model-free advantage without the need to estimate the link function. The consistency of the MMV estimator is established under regularity conditions for both fixed and diverging dimension (p) cases and the number of the response classes can also be allowed to diverge with the sample size n. We also construct the asymptotic normality for the estimator when the dimension of the predicting vector is fixed. Furthermore, our method works pretty well when n < p. The surprising classification efficiency gain of the proposed method is demonstrated by simulation studies and real data analysis. Maximally Divergent Intervals(MDI) Automatic detection of anomalies in space- and time-varying measurements is an important tool in several fields, e.g., fraud detection, climate analysis, or healthcare monitoring. We present an algorithm for detecting anomalous regions in multivariate spatio-temporal time-series, which allows for spotting the interesting parts in large amounts of data, including video and text data. In opposition to existing techniques for detecting isolated anomalous data points, we propose the ‘Maximally Divergent Intervals’ (MDI) framework for unsupervised detection of coherent spatial regions and time intervals characterized by a high Kullback-Leibler divergence compared with all other data given. In this regard, we define an unbiased Kullback-Leibler divergence that allows for ranking regions of different size and show how to enable the algorithm to run on large-scale data sets in reasonable time using an interval proposal technique. Experiments on both synthetic and real data from various domains, such as climate analysis, video surveillance, and text forensics, demonstrate that our method is widely applicable and a valuable tool for finding interesting events in different types of data. 
Maximally Filtered Clique Forest(MFCF) We propose a topological learning algorithm for the estimation of the conditional dependency structure of large sets of random variables from sparse and noisy data. The algorithm, named Maximally Filtered Clique Forest (MFCF), produces a clique forest and an associated Markov Random Field (MRF) by generalising Prim’s minimum spanning tree algorithm. To the best of our knowledge, the MFCF presents three elements of novelty with respect to existing structure learning approaches. The first is the repeated application of a local topological move, the clique expansion, that preserves the decomposability of the underlying graph. Through this move the decomposability and calculation of scores is performed incrementally at the variable (rather than edge) level, and this provides better computational performance and an intuitive application of multivariate statistical tests. The second is the capability to accommodate a variety of score functions and, while this paper is focused on multivariate normal distributions, it can be directly generalised to different types of statistics. Finally, the third is the variable range of allowed clique sizes which is an adjustable topological constraint that acts as a topological penalizer providing a way to tackle sparsity at $l_0$ semi-norm level; this allows a clean decoupling of structure learning and parameter estimation. The MFCF produces a representation of the clique forest, together with a perfect ordering of the cliques and a perfect elimination ordering for the vertices. As an example we propose an application to covariance selection models and we show that the MFCF outperforms the Graphical Lasso for a number of classes of matrices. Maximum a posteriori(MAP) In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is a mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. 
It is closely related to Fisher’s method of maximum likelihood (ML), but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation. Maximum Causal Tsallis Entropy(MCTE) In this paper, we propose a novel maximum causal Tsallis entropy (MCTE) framework for imitation learning which can efficiently learn a sparse multi-modal policy distribution from demonstrations. We provide the full mathematical analysis of the proposed framework. First, the optimal solution of an MCTE problem is shown to be a sparsemax distribution, whose supporting set can be adjusted. The proposed method has advantages over a softmax distribution in that it can exclude unnecessary actions by assigning zero probability. Second, we prove that an MCTE problem is equivalent to robust Bayes estimation in the sense of the Brier score. Third, we propose a maximum causal Tsallis entropy imitation learning (MCTEIL) algorithm with a sparse mixture density network (sparse MDN) by modeling mixture weights using a sparsemax distribution. In particular, we show that the causal Tsallis entropy of an MDN encourages exploration and efficient mixture utilization while Boltzmann-Gibbs entropy is less effective. We validate the proposed method in two simulation studies and MCTEIL outperforms existing imitation learning methods in terms of average returns and learning multi-modal policies. Maximum Complex Correntropy Criterion(MCCC) Recent studies have demonstrated that correntropy is an efficient tool for analyzing higher-order statistical moments in non-Gaussian noise environments. Although correntropy has been used with complex data, no theoretical study was pursued to elucidate its properties, nor how to best use it for optimization. This paper presents a probabilistic interpretation for correntropy using complex-valued data called complex correntropy. 
A recursive solution for the maximum complex correntropy criterion (MCCC) is introduced based on a fixed point solution. This technique is applied to a simple system identification case study, and the results demonstrate prominent advantages when compared to the complex recursive least squares (RLS) algorithm. By using such probabilistic interpretation, correntropy can be applied to solve several problems involving complex data in a more straightforward way. Keywords: complex-valued data correntropy, maximum complex correntropy criterion, fixed-point algorithm. Maximum Correntropy Criterion Kalman Filter(MCC-KF) We present robust dynamic resource allocation mechanisms to allocate application resources meeting Service Level Objectives (SLOs) agreed between cloud providers and customers. In fact, two filter-based robust controllers, i.e., an H-infinity filter and a Maximum Correntropy Criterion Kalman filter (MCC-KF), are proposed. The controllers are self-adaptive, with process noise variances and covariances calculated using previous measurements within a time window. In the allocation process, a bounded client mean response time (mRT) is maintained. Both controllers are deployed and evaluated on an experimental testbed hosting the RUBiS (Rice University Bidding System) auction benchmark web site. The proposed controllers offer improved performance under abrupt workload changes, shown via rigorous comparison with the current state of the art. On our experimental setup, the Single-Input-Single-Output (SISO) controllers can operate on the same server where the resource allocation is performed, while Multi-Input-Multi-Output (MIMO) controllers are on a separate server where all the data are collected for decision making. SISO controllers take decisions that do not depend on other system states (servers), whereas MIMO controllers are characterized by increased communication overhead and potential delays. 
While SISO controllers offer improved performance over MIMO ones, the latter enable a more informed decision making framework for the resource allocation problem of multi-tier applications. Maximum Distance Sub-Lattice Problem(MDSP) In this paper, we define a problem on lattices called the Maximum Distance Sub-lattice Problem (MDSP). The decision version of this problem is shown to be in NP. We prove that MDSP is isomorphic to a well-known problem called closest vector problem (CVP). We give an exact and a heuristic algorithm for MDSP. Using experimental results we show that the LLL algorithm can be accelerated when it is combined with the heuristic algorithm for MDSP. Maximum Entropy Flow Networks Maximum Entropy Flow Networks Maximum Entropy Regularizer(MER) Incremental learning suffers from two challenging problems: forgetting of old knowledge and intransigence on learning new knowledge. Predictions by a model incrementally learned with a subset of the dataset are thus uncertain, and the uncertainty accumulates through the tasks by knowledge transfer. To prevent overfitting to the uncertain knowledge, we propose to penalize confident fitting to the uncertain knowledge by the Maximum Entropy Regularizer (MER). Additionally, to reduce class imbalance and induce a self-paced curriculum on new classes, we exclude a few samples from the new classes in every mini-batch, which we call DropOut Sampling (DOS). We further rethink evaluation metrics for forgetting and intransigence in incremental learning by tracking each sample’s confusion at the transition of a task since the existing metrics that compute the difference in accuracy are often misleading. 
We show that the proposed method, named ‘MEDIC’, outperforms the state-of-the-art incremental learning algorithms in accuracy, forgetting, and intransigence measured by both the existing and the proposed metrics by a large margin in extensive empirical validations on CIFAR100 and a popular subset of ImageNet dataset (TinyImageNet). Maximum Entropy Spectral Analysis(MESA) Maximum Expected Utility Principle of maximum expected utility: A rational agent should choose the action which maximizes its expected utility, given its knowledge. ➚ “Expected Utility Hypothesis” Maximum Inner Product Search(MIPS) Maximum Likelihood(ML) In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model’s parameters. The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example, one may be interested in the heights of adult female penguins, but be unable to measure the height of every single penguin in a population due to cost or time constraints. Assuming that the heights are normally (Gaussian) distributed with some unknown mean and variance, the mean and variance can be estimated with MLE while only knowing the heights of some sample of the overall population. MLE would accomplish this by taking the mean and variance as parameters and finding particular parametric values that make the observed results the most probable (given the model). In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the ‘agreement’ of the selected model with the observed data, and for discrete random variables it indeed maximizes the probability of the observed data under the resulting distribution. 
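The penguin-height example above can be sketched in a few lines. For a Gaussian model the maximum-likelihood estimates have a closed form: the sample mean, and the variance with divisor n (not n - 1). The sample values below are made up purely for illustration, and the function names are hypothetical.

```python
import math


def gaussian_mle(xs):
    """Closed-form MLE for a normal model: sample mean and 1/n variance."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var


def log_likelihood(xs, mu, var):
    """Gaussian log-likelihood of the sample xs under N(mu, var)."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * var)
            - sum((x - mu) ** 2 for x in xs) / (2 * var))


# Hypothetical sample of measured heights (arbitrary units).
heights = [60.0, 62.0, 64.0, 66.0, 68.0]
mu_hat, var_hat = gaussian_mle(heights)
```

Evaluating `log_likelihood` at `(mu_hat, var_hat)` and at nearby parameter values shows that the likelihood peaks at the estimates.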
Maximum-likelihood estimation gives a unified approach to estimation, which is well-defined in the case of the normal distribution and many other problems. However, in some complicated problems, difficulties do occur: in such problems, maximum-likelihood estimators are unsuitable or do not exist. Maximum Likelihood Estimates(MLE) In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model’s parameters. ➘ “Maximum Likelihood” Maximum Margin Interval Trees Learning a regression function using censored or interval-valued output data is an important problem in fields such as genomics and medicine. The goal is to learn a real-valued prediction function, and the training output labels indicate an interval of possible values. Whereas most existing algorithms for this task are linear models, in this paper we investigate learning nonlinear tree models. We propose to learn a tree by minimizing a margin-based discriminative objective function, and we provide a dynamic programming algorithm for computing the optimal solution in log-linear time. We show empirically that this algorithm achieves state-of-the-art speed and prediction accuracy in a benchmark of several data sets. Maximum Margin Principal Components Principal Component Analysis (PCA) is a very successful dimensionality reduction technique, widely used in predictive modeling. A key factor in its widespread use in this domain is the fact that the projection of a dataset onto its first $K$ principal components minimizes the sum of squared errors between the original data and the projected data over all possible rank $K$ projections. Thus, PCA provides optimal low-rank representations of data for least-squares linear regression under standard modeling assumptions. 
On the other hand, when the loss function for a prediction problem is not the least-squares error, PCA is typically a heuristic choice of dimensionality reduction, in particular for classification problems under the zero-one loss. In this paper we target classification problems by proposing a straightforward alternative to PCA that aims to minimize the difference in margin distribution between the original and the projected data. Extensive experiments show that our simple approach typically outperforms PCA on any particular dataset, in terms of classification error, though this difference is not always statistically significant, and despite being a filter method is frequently competitive with Partial Least Squares (PLS) and Lasso on a wide range of datasets. Maximum Mean Discrepancy(MMD) The core idea in maximum mean discrepancy (MMD) in a reproducing kernel Hilbert space (RKHS) is to match two distributions based on the mean of features in the Hilbert space induced by a kernel K. This is justified because when K is universal there is an injection between the space of distributions and the space of mean feature vectors lying in its RKHS. From a practical perspective too, the MMD approach is appealing because unlike other parametric density estimation methods, it can be applied to arbitrary domains and to high-dimensional data, and is computationally tractable. This approach was earlier used in the covariate shift problem (Gretton et al., 2009), the two-sample problem (Gretton et al., 2012a), and recently in (Zhang et al., 2013) for estimating class ratios. Maximum Percent Error(MPE) To combine the proportions from different studies for meta-analysis, the Freeman-Tukey double arcsine transformation can be useful for normalization and variance stabilization. The inverse function of the double arcsine transformation has also been derived in the literature to recover the original scale of the proportion after aggregation. 
In this brief note, we present the domain and range of the inverse double arcsine transformation both analytically and graphically. We notice an erratic behavior in the mathematical formula for the inverse double arcsine transformation at both limits of its domain, and propose approximation methods for both small and large samples. We also propose a simple accuracy measure, the maximum percent error (MPE), of the large sample approximation, which can be used to determine the sample size that would provide a certain accuracy level, and conversely to determine the accuracy level of the approximation given a sample size. Maximum Variance Total Variation Denoising(MVTV) We consider the problem of estimating a regression function in the common situation where the number of features is small, where interpretability of the model is a high priority, and where simple linear or additive models fail to provide adequate performance. To address this problem, we present Maximum Variance Total Variation denoising (MVTV), an approach that is conceptually related both to CART and to the more recent CRISP algorithm, a state-of-the-art alternative method for interpretable nonlinear regression. MVTV divides the feature space into blocks of constant value and fits the value of all blocks jointly via a convex optimization routine. Our method is fully data-adaptive, in that it incorporates highly robust routines for tuning all hyperparameters automatically. We compare our approach against CART and CRISP via both a complexity-accuracy tradeoff metric and a human study, demonstrating that MVTV is a more powerful and interpretable method. Maximum-Entropy Fine-Grained Classification Fine-Grained Visual Classification (FGVC) is an important computer vision problem that involves small diversity within the different classes, and often requires expert annotators to collect data. 
Utilizing this notion of small visual diversity, we revisit Maximum-Entropy learning in the context of fine-grained classification, and provide a training routine that maximizes the entropy of the output probability distribution for training convolutional neural networks on FGVC tasks. We provide a theoretical as well as empirical justification of our approach, and achieve state-of-the-art performance across a variety of classification tasks in FGVC, that can potentially be extended to any fine-tuning task. Our method is robust to different hyperparameter values, amount of training data and amount of training label noise and can hence be a valuable tool in many similar problems. Maximum-Margin Markov Network(M3N) In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. 
We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. Maximum-Overlap Offset(MOO) ➘ “Mode Aware Data Flow” Max-Mahalanobis Linear Discriminant Analysis(MM-LDA) A deep neural network (DNN) consists of a nonlinear transformation from an input to a feature representation, followed by a common softmax linear classifier. Though many efforts have been devoted to designing a proper architecture for nonlinear transformation, little investigation has been done on the classifier part. In this paper, we show that a properly designed classifier can improve robustness to adversarial attacks and lead to better prediction results. Specifically, we define a Max-Mahalanobis distribution (MMD) and theoretically show that if the input distributes as a MMD, the linear discriminant analysis (LDA) classifier will have the best robustness to adversarial examples. We further propose a novel Max-Mahalanobis linear discriminant analysis (MM-LDA) network, which explicitly maps a complicated data distribution in the input space to a MMD in the latent feature space and then applies LDA to make predictions. Our results demonstrate that the MM-LDA networks are significantly more robust to adversarial attacks, and have better performance in class-biased classification. Max-Margin Deep Generative Models(mmDGMs) Deep generative models (DGMs) are effective on learning multilayered representations of complex data and performing inference of input data by exploring the generative ability. However, it is relatively insufficient to empower the discriminative ability of DGMs on making accurate predictions. 
This paper presents max-margin deep generative models (mmDGMs) and a class-conditional variant (mmDCGMs), which explore the strongly discriminative principle of max-margin learning to improve the predictive performance of DGMs in both supervised and semi-supervised learning, while retaining the generative capability. In semi-supervised learning, we use the predictions of a max-margin classifier as the missing labels instead of performing full posterior inference for efficiency; we also introduce additional max-margin and label-balance regularization terms of unlabeled data for effectiveness. We develop an efficient doubly stochastic subgradient algorithm for the piecewise linear objectives in different settings. Empirical results on various datasets demonstrate that: (1) max-margin learning can significantly improve the prediction performance of DGMs and meanwhile retain the generative ability; (2) in supervised learning, mmDGMs are competitive to the best fully discriminative networks when employing convolutional neural networks as the generative and recognition models; and (3) in semi-supervised learning, mmDCGMs can perform efficient inference and achieve state-of-the-art classification results on several benchmarks. Max-Margin Markov Graph Model(M3GM) Semantic graphs, such as WordNet, are resources which curate natural language on two distinguishable layers. On the local level, individual relations between synsets (semantic building blocks) such as hypernymy and meronymy enhance our understanding of the words used to express their meanings. Globally, analysis of graph-theoretic properties of the entire net sheds light on the structure of human language as a whole. In this paper, we combine global and local properties of semantic graphs through the framework of Max-Margin Markov Graph Models (M3GM), a novel extension of Exponential Random Graph Model (ERGM) that scales to large multi-relational graphs. 
We demonstrate how such global modeling improves performance on the local task of predicting semantic relations between synsets, yielding new state-of-the-art results on the WN18RR dataset, a challenging version of WordNet link prediction in which ‘easy’ reciprocal cases are removed. In addition, the M3GM model identifies multirelational motifs that are characteristic of well-formed lexical semantic ontologies. Maxout Network We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout’s fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state of the art classification performance. Maxout Networks GitXiv Max-Value Entropy Search(MES) Bayesian optimization (BO) is an effective tool for black-box optimization in which objective function evaluation is usually quite expensive. In practice, lower fidelity approximations of the objective function are often available. Recently, multi-fidelity Bayesian optimization (MFBO) has attracted considerable attention because it can dramatically accelerate the optimization process by using those cheaper observations. We propose a novel information theoretic approach to MFBO. Information-based approaches are popular and empirically successful in BO, but existing studies for information-based MFBO are plagued by difficulty for accurately estimating the information gain. Our approach is based on a variant of information-based BO called max-value entropy search (MES), which greatly facilitates evaluation of the information gain in MFBO. 
In fact, the computation of our acquisition function can be written analytically except for a one-dimensional integral and sampling, which can be calculated efficiently and accurately. We demonstrate the effectiveness of our approach using synthetic and benchmark datasets, and further show a real-world application to materials science data. McDiarmid Drift Detection Method(MDDM) Increasingly, Internet of Things (IoT) domains, such as sensor networks, smart cities, and social networks, generate vast amounts of data. Such data are not only unbounded and rapidly evolving; their content also changes dynamically over time, often in unforeseen ways. These variations are due to so-called concept drifts, caused by changes in the underlying data generation mechanisms. In a classification setting, concept drift causes the previously learned models to become inaccurate, unsafe and even unusable. Accordingly, concept drifts need to be detected, and handled, as soon as possible. In medical applications and military zones, for example, change in behaviors should be detected in near real-time, to avoid potential loss of life. To this end, we introduce the McDiarmid Drift Detection Method (MDDM), which utilizes McDiarmid’s inequality in order to detect concept drift. The MDDM approach proceeds by sliding a window over prediction results and associating window entries with weights. Higher weights are assigned to the most recent entries, in order to emphasize their importance. As instances are processed, the detection algorithm compares a weighted mean of elements inside the sliding window with the maximum weighted mean observed so far. A significant difference between the two weighted means, upper-bounded by the McDiarmid inequality, implies a concept drift. Our extensive experimentation against synthetic and real-world data streams shows that our novel method outperforms the state-of-the-art. 
Specifically, MDDM yields shorter detection delays as well as lower false negative rates, while maintaining high classification accuracies. MCES-P Optimal decision making with limited or no information in stochastic environments where multiple agents interact is a challenging topic in the realm of artificial intelligence. Reinforcement learning (RL) is a popular approach for arriving at optimal strategies by predicating stimuli, such as the reward for following a strategy, on experience. RL is heavily explored in the single-agent context, but is a nascent concept in multiagent problems. To this end, I propose several principled model-free and partially model-based reinforcement learning approaches for several multiagent settings. In the realm of normative reinforcement learning, I introduce scalable extensions to Monte Carlo exploring starts for partially observable Markov Decision Processes (POMDP), dubbed MCES-P, where I expand the theory and algorithm to the multiagent setting. I first examine MCES-P with probably approximately correct (PAC) bounds in the context of multiagent setting, showing MCESP+PAC holds in the presence of other agents. I then propose a more sample-efficient methodology for antagonistic settings, MCESIP+PAC. For cooperative settings, I extend MCES-P to the Multiagent POMDP, dubbed MCESMP+PAC. I then explore the use of reinforcement learning as a methodology in searching for optima in realistic and latent model environments. First, I explore a parameterized Q-learning approach in modeling humans learning to reason in an uncertain, multiagent environment. Next, I propose an implementation of MCES-P, along with image segmentation, to create an adaptive team-based reinforcement learning technique to positively identify the presence of phenotypically-expressed water and pathogen stress in crop fields. 
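The sliding-window procedure described in the McDiarmid Drift Detection Method (MDDM) entry above can be illustrated with a short Python sketch. It is not the authors' implementation: the class name, the linearly increasing weighting scheme, and the default values of `window_size`, `diff` and `delta` are assumptions chosen for illustration, and the threshold uses the standard form of McDiarmid's inequality for a weighted mean of values in [0, 1].

```python
import math
from collections import deque


class WeightedWindowDriftDetector:
    """Illustrative McDiarmid-style drift detector (hypothetical names and defaults)."""

    def __init__(self, window_size=100, diff=0.01, delta=1e-6):
        self.window_size = window_size
        # Oldest entry gets weight 1.0; each newer entry gets `diff` more (assumed scheme).
        self.weights = [1.0 + i * diff for i in range(window_size)]
        self.total_weight = sum(self.weights)
        # McDiarmid bound: eps = sqrt( (sum_i v_i^2 / 2) * ln(1 / delta) ),
        # where v_i are the normalized weights.
        squared = sum((w / self.total_weight) ** 2 for w in self.weights)
        self.epsilon = math.sqrt(0.5 * squared * math.log(1.0 / delta))
        self.window = deque(maxlen=window_size)
        self.max_mean = 0.0

    def update(self, prediction_correct):
        """Add one prediction result; return True if a drift is signalled."""
        self.window.append(1.0 if prediction_correct else 0.0)
        if len(self.window) < self.window_size:
            return False  # wait until the window is full
        mean = sum(w * x for w, x in zip(self.weights, self.window)) / self.total_weight
        self.max_mean = max(self.max_mean, mean)
        if self.max_mean - mean > self.epsilon:
            # Significant drop from the best weighted mean seen so far.
            self.window.clear()
            self.max_mean = 0.0
            return True
        return False


detector = WeightedWindowDriftDetector()
signals_while_stable = [detector.update(True) for _ in range(200)]
steps_until_drift = 0
drift_found = False
for _ in range(100):
    steps_until_drift += 1
    if detector.update(False):
        drift_found = True
        break
print(any(signals_while_stable), drift_found, steps_until_drift)
```

Because the newest entries carry the largest weights, a run of misclassifications pulls the weighted mean down faster than it would an unweighted mean, which is the property the entry credits for shorter detection delays.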
MC-ISTA-Net Optimization-inspired networks can bridge convex optimization and neural networks in Compressive Sensing (CS) reconstruction of natural images; an example is ISTA-Net+, which maps an optimization algorithm, the iterative shrinkage-thresholding algorithm (ISTA), into a network. However, the measurement matrix and input initialization are still hand-crafted, and multi-channel feature maps contain information at different frequencies, which is treated equally across channels, hindering the ability of CS reconstruction in optimization-inspired networks. In order to solve the above problems, we propose MC-ISTA-Net. McNemar Test In statistics, McNemar’s test is a statistical test used on paired nominal data. It is applied to 2 × 2 contingency tables with a dichotomous trait, with matched pairs of subjects, to determine whether the row and column marginal frequencies are equal (that is, whether there is “marginal homogeneity”). It is named after Quinn McNemar, who introduced it in 1947. An application of the test in genetics is the transmission disequilibrium test for detecting linkage disequilibrium. McTorch In this paper, we introduce McTorch, a manifold optimization library for deep learning that extends PyTorch. It aims to lower the barrier for users wishing to use manifold constraints in deep learning applications, i.e., when the parameters are constrained to lie on a manifold. Such constraints include the popular orthogonality and rank constraints, and have been recently used in a number of applications in deep learning. McTorch follows PyTorch’s architecture and decouples manifold definitions and optimizers, i.e., once a new manifold is added it can be used with any existing optimizer and vice-versa. McTorch is available at https://…/mctorch. 
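As a worked illustration of the McNemar Test entry above, the following Python sketch computes the test statistic from the two discordant cells of a 2 × 2 table of matched pairs. The function name and the example counts are hypothetical. The sketch uses the uncorrected statistic (b - c)^2 / (b + c), compared against a chi-squared distribution with one degree of freedom, and assumes the discordant counts are large enough for that approximation to be reasonable.

```python
import math


def mcnemar_test(b, c):
    """Return (statistic, p_value) for discordant pair counts b and c.

    b: pairs positive on the first measurement only
    c: pairs positive on the second measurement only
    No continuity correction is applied in this sketch.
    """
    if b + c == 0:
        raise ValueError("at least one discordant pair is required")
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of chi-squared with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical example: 15 pairs changed one way, 5 changed the other way.
stat, p = mcnemar_test(15, 5)
print(f"statistic={stat:.3f}, p={p:.4f}")
```

Only the discordant pairs enter the statistic; pairs that agree on both measurements carry no information about marginal homogeneity.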
Mean Absolute Deviation(MAD) The mean absolute deviation (MAD), also referred to as the mean deviation (or sometimes average absolute deviation), is the mean of the absolute deviations of a set of data about the data’s mean. In other words, it is the average distance of the data set from its mean. MAD has been proposed to be used in place of standard deviation since it corresponds better to real life. Because the MAD is a simpler measure of variability than the standard deviation, it can be used as a pedagogical tool to help motivate the standard deviation. Mean Absolute Percentage Deviation(MAPD) The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of accuracy of a method for constructing fitted time series values in statistics, specifically in trend estimation. It usually expresses accuracy as a percentage. Mean Average Percentage Error(MAPE) The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of accuracy of a method for constructing fitted time series values in statistics, specifically in trend estimation. It usually expresses accuracy as a percentage. Mean Average Precision(mAP) Mean average precision for a set of queries is the mean of the average precision scores for each query. Breaking down Mean Average Precision (mAP) Mean Directional Accuracy(MDA) Mean Directional Accuracy (MDA), also known as Mean Direction Accuracy, is a measure of prediction accuracy of a forecasting method in statistics. It compares the forecast direction (upward or downward) to the actual realized direction. In simple words, MDA provides the probability that the forecasting method under study can detect the correct direction of the time series. MDA is a popular metric for forecasting performance in economics and finance. 
MDA is used in economics applications where the economist is often interested only in the directional movement of the variable of interest. As an example in macroeconomics, a monetary authority wants to know the direction of inflation in order to raise or lower interest rates if inflation is predicted to rise or drop, respectively. Another example can be found in financial planning, where the user wants to know whether demand has an increasing or decreasing trend. Mean Field Reinforcement Learning(MFRL) Existing multi-agent reinforcement learning methods are limited typically to a small number of agents. When the number of agents grows large, the learning becomes intractable due to the curse of dimensionality and the exponential growth of user interactions. In this paper, we present Mean Field Reinforcement Learning where the interactions within the population of agents are approximated by those between a single agent and the average effect from the overall population or neighboring agents; the interplay between the two entities is mutually reinforced: the learning of the individual agent’s optimal policy depends on the dynamics of the population, while the dynamics of the population change according to the collective patterns of the individual policies. We develop practical mean field Q-learning and mean field Actor-Critic algorithms and analyze the convergence of the solution. Experiments on resource allocation, Ising model estimation, and battle game tasks verify the learning effectiveness of our mean field approaches in handling many-agent interactions in population. Mean Field Residual Network We study randomly initialized residual networks using mean field theory and the theory of difference equations. Classical feedforward neural networks, such as those with tanh activations, exhibit exponential behavior on the average when propagating inputs forward or gradients backward. 
The exponential forward dynamics causes rapid collapsing of the input space geometry, while the exponential backward dynamics causes drastic vanishing or exploding gradients. We show, in contrast, that by adding skip connections, the network will, depending on the nonlinearity, adopt subexponential forward and backward dynamics, and in many cases in fact polynomial. The exponents of these polynomials are obtained through analytic methods and proved and verified empirically to be correct. In terms of the ‘edge of chaos’ hypothesis, these subexponential and polynomial laws allow residual networks to ‘hover over the boundary between stability and chaos,’ thus preserving the geometry of the input space and the gradient information flow. In our experiments, for each activation function we study here, we initialize residual networks with different hyperparameters and train them on MNIST. Remarkably, our initialization time theory can accurately predict test time performance of these networks, by tracking either the expected amount of gradient explosion or the expected squared distance between the images of two input vectors. Importantly, we show, theoretically as well as empirically, that common initializations such as the Xavier or the He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth. Finally, we have made mathematical contributions by deriving several new identities for the kernels of powers of ReLU functions by relating them to the zeroth Bessel function of the second kind. Mean First Passage Time based DYNA(MFPT-DYNA) We propose a hybrid approach aimed at improving the sample efficiency in goal-directed reinforcement learning. We do this via a two-step mechanism where firstly, we approximate a model from Model-Free reinforcement learning. Then, we leverage this approximate model along with a notion of reachability using Mean First Passage Times to perform Model-Based reinforcement learning. 
Built on such a novel observation, we design two new algorithms – Mean First Passage Time based Q-Learning (MFPT-Q) and Mean First Passage Time based DYNA (MFPT-DYNA), that have been fundamentally modified from the state-of-the-art reinforcement learning techniques. Preliminary results have shown that our hybrid approaches converge with much fewer iterations than their corresponding state-of-the-art counterparts and therefore requiring much fewer samples and much fewer training trials to converge. Mean First Passage Time based Q-Learning(MFPT-Q) We propose a hybrid approach aimed at improving the sample efficiency in goal-directed reinforcement learning. We do this via a two-step mechanism where firstly, we approximate a model from Model-Free reinforcement learning. Then, we leverage this approximate model along with a notion of reachability using Mean First Passage Times to perform Model-Based reinforcement learning. Built on such a novel observation, we design two new algorithms – Mean First Passage Time based Q-Learning (MFPT-Q) and Mean First Passage Time based DYNA (MFPT-DYNA), that have been fundamentally modified from the state-of-the-art reinforcement learning techniques. Preliminary results have shown that our hybrid approaches converge with much fewer iterations than their corresponding state-of-the-art counterparts and therefore requiring much fewer samples and much fewer training trials to converge. Mean Shift Mean shift is a non-parametric feature-space analysis technique for locating the maxima of a density function, a so-called mode-seeking algorithm. Application domains include cluster analysis in computer vision and image processing. http://…/Mean-Shift-Theory.pdf Mean Shift Clustering The mean shift algorithm is a nonparametric clustering technique which does not require prior knowledge of the number of clusters, and does not constrain the shape of the clusters. 
http://…/mean_shift.pdf http://…/mean-shift Mean Squared Error(MSE) In statistics, the mean squared error (MSE) of an estimator measures the average of the squares of the “errors”, that is, the difference between the estimator and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss or quadratic loss. The difference occurs because of randomness or because the estimator doesn’t account for information that could produce a more accurate estimate. Meaningful Purposive Interaction Analysis(MPIA) This book introduces Meaningful Purposive Interaction Analysis (MPIA) theory, which combines social network analysis (SNA) with latent semantic analysis (LSA) to help create and analyse a meaningful learning landscape from the digital traces left by a learning community in the co-construction of knowledge. The hybrid algorithm is implemented in the statistical programming language and environment R, introducing packages which capture – through matrix algebra – elements of learners’ work with more knowledgeable others and resourceful content artefacts. The book provides comprehensive package-by-package application examples, and code samples that guide the reader through the MPIA model to show how the MPIA landscape can be constructed and the learner’s journey mapped and analysed. This building block application will allow the reader to progress to using and building analytics to guide students and support decision-making in learning. Measure Differential Equations(MDE) A new type of differential equations for probability measures on Euclidean spaces, called Measure Differential Equations (briefly MDEs), is introduced. MDEs correspond to Probability Vector Fields, which map measures on a Euclidean space to measures on its tangent bundle. Solutions are intended in the weak sense, and existence, uniqueness and continuous dependence results are proved under suitable conditions. 
The latter are expressed in terms of the Wasserstein metric on the base and fiber of the tangent bundle. MDEs represent a natural measure-theoretic generalization of Ordinary Differential Equations via a monoid morphism mapping sums of vector fields to fiber convolution of the corresponding Probability Vector Fields. Various examples, including finite-speed diffusion and concentration, are shown, together with relationships to Partial Differential Equations. Finally, MDEs are also natural mean-field limits of multi-particle systems, with convergence results extending the classical Dobrushin approach. Measure Forecast Accuracy MEBoost The class imbalance problem has been a challenging research problem in the fields of machine learning and data mining, as most real-life datasets are imbalanced. Several existing machine learning algorithms try to maximize classification accuracy by correctly identifying majority class samples while ignoring the minority class. However, the minority class instances are usually of higher interest than the majority class. Recently, several cost-sensitive methods, ensemble models and sampling techniques have been used in the literature in order to classify imbalanced datasets. In this paper, we propose MEBoost, a new boosting algorithm for imbalanced datasets. MEBoost mixes two different weak learners with boosting to improve the performance on imbalanced datasets. MEBoost is an alternative to the existing techniques such as SMOTEBoost, RUSBoost, Adaboost, etc. The performance of MEBoost has been evaluated on 12 benchmark imbalanced datasets against state-of-the-art ensemble methods like SMOTEBoost, RUSBoost, Easy Ensemble, EUSBoost, DataBoost. Experimental results show significant improvement over the other methods and it can be concluded that MEBoost is an effective and promising algorithm to deal with imbalanced datasets. 
Mechanical Turk(MTurk) Amazon Mechanical Turk (MTurk) is a crowdsourcing Internet marketplace that enables individuals and businesses (known as Requesters) to coordinate the use of human intelligence to perform tasks that computers are currently unable to do. It is one of the sites of Amazon Web Services. Employers are able to post jobs known as HITs (Human Intelligence Tasks), such as choosing the best among several photographs of a storefront, writing product descriptions, or identifying performers on music CDs. Workers (called Providers in Mechanical Turk’s Terms of Service, or, more colloquially, Turkers) can then browse among existing jobs and complete them for a monetary payment set by the employer. To place jobs, the requesting programs use an open application programming interface (API), or the more limited MTurk Requester site. Employers are restricted to US-based entities. Mechanism for Emergency Demand Response(MEDR) Demand response (DR) is not only a crucial solution for demand-side management but also a vital means for the electricity market to maintain power grid reliability, sustainability and stability. DR can enable consumers (e.g. data centers) to reduce their electricity consumption when the supply of electricity is short. The consumers will be rewarded in the case of DR if they reduce or shift some of their energy usage during peak hours. Aiming at improving the efficiency of DR, in this paper, we present MEDR, a mechanism for emergency DR in colocation data centers. First, we formalize the MEDR problem and propose a dynamic programming algorithm to solve the optimization version of the problem. We then design a deterministic mechanism as a solution to the MEDR problem. We show that our proposed mechanism is truthful. 
Next, we prove that our mechanism is an FPTAS, i.e., it can be approximated within $1 + \epsilon$ for any given $\epsilon > 0$, while the running time of our mechanism is polynomial in $n$ and $1/\epsilon$, where $n$ is the number of tenants in the datacenter. Furthermore, we also give an auction system covering the efficient FPTAS algorithm as bidding decision program for DR in colocation datacenter. Finally, we choose a practical smart grid dataset to build a large number of datasets for simulation in performance evaluation. By evaluating metrics of the approximation ratio of our mechanism, the non-negative utility of tenants and social cost of colocation datacenter, the results demonstrate the effectiveness of our work. Median Absolute Deviation(MAD) In statistics, the median absolute deviation (MAD) is a robust measure of the variability of a univariate sample of quantitative data. It can also refer to the population parameter that is estimated by the MAD calculated from a sample. Consider the data (1, 1, 2, 2, 4, 6, 9). It has a median value of 2. The absolute deviations about 2 are (1, 1, 0, 0, 2, 4, 7) which in turn have a median value of 1 (because the sorted absolute deviations are (0, 0, 1, 1, 2, 4, 7)). So the median absolute deviation for this data is 1. Median Polish The median polish is an exploratory data analysis procedure proposed by the statistician John Tukey. It finds an additively-fit model for data in a two-way layout table (usually, results from a factorial experiment) of the form row effect + column effect + overall median. STMedianPolish Median Probability Model(MPM) Often the goal of model selection is to choose a model for future prediction, and it is natural to measure the accuracy of a future prediction by squared error loss. Under the Bayesian approach, it is commonly perceived that the optimal predictive model is the model with highest posterior probability, but this is not necessarily the case. 
In this paper we show that, for selection among normal linear models, the optimal predictive model is often the median probability model, which is defined as the model consisting of those variables which have overall posterior probability greater than or equal to 1/2 of being in a model. The median probability model often differs from the highest probability model. The median probability model (MPM) of Barbieri and Berger (2004) is defined as the model consisting of those variables whose marginal posterior probability of inclusion is at least 0.5. The MPM rule yields the best single model for prediction in orthogonal and nested correlated designs. This result was originally conceived under a specific class of priors, such as the point mass mixtures of non-informative and g-type priors. The MPM rule, however, has become so popular that it is now being deployed for a wider variety of priors and under correlated designs, where the properties of MPM are not yet completely understood. The Median Probability Model and Correlated Variables MediaRank In the recent political climate, the topic of news quality has drawn attention both from the public and the academic communities. The growing distrust of traditional news media makes it harder to find a common base of accepted truth. In this work, we design and build MediaRank (www.media-rank.com), a fully automated system to rank over 50,000 online news sources around the world. MediaRank collects and analyzes one million news webpages and two million related tweets every day. We base our algorithmic analysis on four properties journalists have established to be associated with reporting quality: peer reputation, reporting bias / breadth, bottomline financial pressure, and popularity. Our major contributions of this paper include: (i) Open, interpretable quality rankings for over 50,000 of the world’s major news sources. 
Our rankings are validated against 35 published news rankings, including French, German, Russian, and Spanish language sources. MediaRank scores correlate positively with 34 of 35 of these expert rankings. (ii) New computational methods for measuring influence and bottomline pressure. To the best of our knowledge, we are the first to study the large-scale news reporting citation graph in-depth. We also propose new ways to measure the aggressiveness of advertisements and identify social bots, establishing a connection between both types of bad behavior. (iii) Analyzing the effect of media source bias and significance. We prove that news sources cite others despite different political views in accord with quality measures. However, in four English-speaking countries (US, UK, Canada, and Australia), the highest ranking sources all disproportionately favor left-wing parties, even when the majority of news sources exhibited conservative slants. Mediation In statistics, a mediation model is one that seeks to identify and explicate the mechanism or process that underlies an observed relationship between an independent variable and a dependent variable via the inclusion of a third explanatory variable, known as a mediator variable. Rather than hypothesizing a direct causal relationship between the independent variable and the dependent variable, a mediational model hypothesizes that the independent variable influences the mediator variable, which in turn influences the dependent variable. Thus, the mediator variable serves to clarify the nature of the relationship between the independent and dependent variables. In other words, mediating relationships occur when a third variable plays an important role in governing the relationship between the other two variables. mediation,mma,mlma MediChainTM The set of distributed ledger architectures known as blockchain is best known for cryptocurrency applications such as Bitcoin and Ethereum. 
These permissionless blockchains are showing the potential to be disruptive to the financial services industry. Their broader adoption is likely to be limited by the maximum block size, the cost of the Proof of Work consensus mechanism, and the increasing size of any given chain overwhelming most of the participating nodes. These factors have led many cryptocurrency blockchains to become centralized in the nodes with enough computing power and storage to be a dominant miner and validator. Permissioned chains operate in trusted environments and can, therefore, avoid the computationally expensive consensus mechanisms. Permissioned chains are still susceptible to asset storage demands and non-standard user interfaces that will impede their adoption. This paper describes an approach to addressing these limitations: a permissioned blockchain that uses off-chain storage of the data assets, accessed through a standard browser and mobile app. The implementation in the Hyperledger framework is described, as is an example use of patient-centered health data management. MedImpute Missing data is a common problem in real-world settings and particularly relevant in healthcare applications where researchers use Electronic Health Records (EHR) and results of observational studies to apply analytics methods. This issue becomes even more prominent for longitudinal data sets, where multiple instances of the same individual correspond to different observations in time. Standard imputation methods do not take into account patient specific information incorporated in multivariate panel data. We introduce the novel imputation algorithm MedImpute that addresses this problem, extending the flexible framework of OptImpute suggested by Bertsimas et al. (2018). Our algorithm provides imputations for data sets with missing continuous and categorical features, and we present the formulation and implement scalable first-order methods for a $K$-NN model. 
We test the performance of our algorithm on longitudinal data from the Framingham Heart Study when data are missing completely at random (MCAR). We demonstrate that MedImpute leads to significant improvements in both imputation accuracy and downstream model AUC compared to state-of-the-art methods. Medoid Medoids are representative objects of a data set or a cluster within a data set whose average dissimilarity to all the objects in the cluster is minimal. Medoids are similar in concept to means or centroids, but medoids are always members of the data set. Medoids are most commonly used on data when a mean or centroid cannot be defined such as 3-D trajectories or in the gene expression context. The term is used in computer science in data clustering algorithms. MedSim We present MedSim, a novel semantic SIMilarity method based on public well-established bio-MEDical knowledge graphs (KGs) and large-scale corpus, to study the therapeutic substitution of antibiotics. Besides hierarchy and corpus of KGs, MedSim further interprets medicine characteristics by constructing multi-dimensional medicine-specific feature vectors. A dataset of 528 antibiotic pairs scored by doctors is applied for evaluation and MedSim has produced statistically significant improvement over other semantic similarity methods. Furthermore, some promising applications of MedSim in drug substitution and drug abuse prevention are presented in a case study. Medusa Applications such as web search and social networking have been moving from centralized to decentralized cloud architectures to improve their scalability. MapReduce, a programming framework for processing large amounts of data using thousands of machines in a single cloud, also needs to be scaled out to multiple clouds to adapt to this evolution. The challenge of building a multi-cloud distributed architecture is substantial. 
Notwithstanding, the ability to deal with the new types of faults introduced by such setting, such as the outage of a whole datacenter or an arbitrary fault caused by a malicious cloud insider, increases the endeavor considerably. In this paper we propose Medusa, a platform that allows MapReduce computations to scale out to multiple clouds and tolerate several types of faults. Our solution fulfills four objectives. First, it is transparent to the user, who writes her typical MapReduce application without modification. Second, it does not require any modification to the widely used Hadoop framework. Third, the proposed system goes well beyond the fault-tolerance offered by MapReduce to tolerate arbitrary faults, cloud outages, and even malicious faults caused by corrupt cloud insiders. Fourth, it achieves this increased level of fault tolerance at reasonable cost. We performed an extensive experimental evaluation in the ExoGENI testbed, demonstrating that our solution significantly reduces execution time when compared to traditional methods that achieve the same level of resilience. Meeting Bot In this paper we present Meeting Bot, a reinforcement learning based conversational system that interacts with multiple users to schedule meetings. The system is able to interpret user utterances and map them to preferred time slots, which are then fed to a reinforcement learning (RL) system with the goal of converging on an agreeable time slot. The RL system is able to adapt to user preferences and environmental changes in meeting arrival rate while still scheduling effectively. Learning is performed via policy gradient with exploration, by utilizing an MLP as an approximator of the policy function. Results demonstrate that the system outperforms standard scheduling algorithms in terms of overall scheduling efficiency. 
Additionally, the system is able to adapt its strategy to situations when users consistently reject or accept meetings in certain slots (such as Friday afternoon versus Thursday morning), or when the meeting is called by members who are at a more senior designation. MEKA The MEKA project provides an open source implementation of methods for multi-label learning and evaluation. In multi-label classification, we want to predict multiple output variables for each input instance. This is different from the ‘standard’ case (binary, or multi-class classification) which involves only a single target variable. MEKA is based on the WEKA Machine Learning Toolkit; it includes dozens of multi-label methods from the scientific literature, as well as a wrapper to the related MULAN framework. Memetic Algorithms(MA) Memetic algorithms (MA) represent one of the recent growing areas of research in evolutionary computation. The term MA is now widely used as a synergy of evolutionary or any population-based approach with separate individual learning or local improvement procedures for problem search. Quite often, MA are also referred to in the literature as Baldwinian evolutionary algorithms (EA), Lamarckian EAs, cultural algorithms, or genetic local search. A Gentle Introduction to Memetic Algorithms Memetic Graph Clustering ➘ “VieClus” Memorized Sparse Backpropagation(MSBP) Neural network learning is typically slow since backpropagation needs to compute full gradients and backpropagate them across multiple layers. Despite the success of existing work in accelerating propagation through sparseness, the relevant theoretical characteristics remain unexplored and we empirically find that they suffer from the loss of information contained in unpropagated gradients. To tackle these problems, in this work, we present a unified sparse backpropagation framework and provide a detailed analysis of its theoretical characteristics. 
Analysis reveals that when applied to a multilayer perceptron, our framework essentially performs gradient descent using an estimated gradient similar enough to the true gradient, resulting in convergence in probability under certain conditions. Furthermore, a simple yet effective algorithm named memorized sparse backpropagation (MSBP) is proposed to remedy the problem of information loss by storing unpropagated gradients in memory for the next learning. The experiments demonstrate that the proposed MSBP is able to effectively alleviate the information loss in traditional sparse backpropagation while achieving comparable acceleration. Memory Attention-aware Recommender System(MARS) In this paper, we study the problem of modeling users’ diverse interests. Previous methods usually learn a fixed user representation, which has a limited ability to represent distinct interests of a user. In order to model users’ various interests, we propose a Memory Attention-aware Recommender System (MARS). MARS utilizes a memory component and a novel attentional mechanism to learn deep adaptive user representations. Trained in an end-to-end fashion, MARS adaptively summarizes users’ interests. In the experiments, MARS outperforms seven state-of-the-art methods on three real-world datasets in terms of recall and mean average precision. We also demonstrate that MARS has great interpretability to explain its recommendation results, which is important in many recommendation scenarios. Memory Augmented Control Network(MACN) Planning problems in partially observable environments cannot be solved directly with convolutional networks and require some form of memory. But, even memory networks with sophisticated addressing schemes are unable to learn intelligent reasoning satisfactorily due to the complexity of simultaneously learning to access memory and plan. To mitigate these challenges we introduce the Memory Augmented Control Network (MACN). 
The proposed network architecture consists of three main parts. The first part uses convolutions to extract features and the second part uses a neural network-based planning module to pre-plan in the environment. The third part uses a network controller that learns to store those specific instances of past information that are necessary for planning. The performance of the network is evaluated in discrete grid world environments for path planning in the presence of simple and complex obstacles. We show that our network learns to plan and can generalize to new environments. Memory Augmented Neural Network(MANN) Neural Turing Machines (NTMs) are an instance of Memory Augmented Neural Networks, a new class of recurrent neural networks which decouple computation from memory by introducing an external memory unit. NTMs have demonstrated superior performance over Long Short-Term Memory Cells in several sequence learning tasks. A number of open source implementations of NTMs exist but are unstable during training and/or fail to replicate the reported performance of NTMs. This paper presents the details of our successful implementation of a NTM. Our implementation learns to solve three sequential learning tasks from the original NTM paper. We find that the choice of memory contents initialization scheme is crucial in successfully implementing a NTM. Networks with memory contents initialized to small constant values converge on average 2 times faster than the next best memory contents initialization scheme. Memory In Memory(MIM) Natural spatiotemporal processes can be highly non-stationary in many ways, e.g. the low-level non-stationarity such as spatial correlations or temporal dependencies of local pixel values; and the high-level variations such as the accumulation, deformation or dissipation of radar echoes in precipitation forecasting. 
From Cramer’s Decomposition, any non-stationary process can be decomposed into deterministic, time-variant polynomials, plus a zero-mean stochastic term. By applying differencing operations appropriately, we may turn time-variant polynomials into a constant, making the deterministic component predictable. However, most previous recurrent neural networks for spatiotemporal prediction do not use the differential signals effectively, and their relatively simple state transition functions prevent them from learning too complicated variations in spacetime. We propose the Memory In Memory (MIM) networks and corresponding recurrent blocks for this purpose. The MIM blocks exploit the differential signals between adjacent recurrent states to model the non-stationary and approximately stationary properties in spatiotemporal dynamics with two cascaded, self-renewed memory modules. By stacking multiple MIM blocks, we could potentially handle higher-order non-stationarity. The MIM networks achieve the state-of-the-art results on three spatiotemporal prediction tasks across both synthetic and real-world datasets. We believe that the general idea of this work can be potentially applied to other time-series forecasting tasks. Memory Networks We describe a new class of learning models called memory networks. Memory networks reason with inference components combined with a long-term memory component; they learn how to use these jointly. The long-term memory can be read and written to, with the goal of using it for prediction. We investigate these models in the context of question answering (QA) where the long-term memory effectively acts as a (dynamic) knowledge base, and the output is a textual response. We evaluate them on a large-scale QA task, and a smaller, but more complex, toy task generated from a simulated world. 
In the latter, we show the reasoning power of such models by chaining multiple supporting sentences to answer questions that require understanding the intension of verbs. Memory Replay GAN(MeRGAN) Previous works on sequential learning address the problem of forgetting in discriminative models. In this paper we consider the case of generative models. In particular, we investigate generative adversarial networks (GANs) in the task of learning new categories in a sequential fashion. We first show that sequential fine tuning renders the network unable to properly generate images from previous categories (i.e. forgetting). Addressing this problem, we propose Memory Replay GANs (MeRGANs), a conditional GAN framework that integrates a memory replay generator. We study two methods to prevent forgetting by leveraging these replays, namely joint training with replay and replay alignment. Qualitative and quantitative experimental results in MNIST, SVHN and LSUN datasets show that our memory replay approach can generate competitive images while significantly mitigating the forgetting of previous categories. Memory Time-Series Network(MTNet) Multivariate time series forecasting is extensively studied throughout the years with ubiquitous applications in areas such as finance, traffic, environment, etc. Still, concerns have been raised on traditional methods for being incapable of modeling complex patterns or dependencies lying in real-world data. To address such concerns, various deep learning models, mainly Recurrent Neural Network (RNN) based methods, are proposed. Nevertheless, capturing extremely long-term patterns while effectively incorporating information from other variables remains a challenge for time-series forecasting. Furthermore, lack-of-explainability remains one serious drawback for deep neural network models. 
Inspired by Memory Network proposed for solving the question-answering task, we propose a deep learning based model named Memory Time-series network (MTNet) for time series forecasting. MTNet consists of a large memory component, three separate encoders, and an autoregressive component to train jointly. Additionally, the attention mechanism designed enables MTNet to be highly interpretable. We can easily tell which part of the historic data is referenced the most. Memory-Augmented Autoencoder(MemAE) Deep autoencoder has been extensively used for anomaly detection. Training on the normal data, the autoencoder is expected to produce higher reconstruction error for the abnormal inputs than the normal ones, which is adopted as a criterion for identifying anomalies. However, this assumption does not always hold in practice. It has been observed that sometimes the autoencoder ‘generalizes’ so well that it can also reconstruct anomalies well, leading to the missed detection of anomalies. To mitigate this drawback for autoencoder based anomaly detector, we propose to augment the autoencoder with a memory module and develop an improved autoencoder called memory-augmented autoencoder, i.e. MemAE. Given an input, MemAE firstly obtains the encoding from the encoder and then uses it as a query to retrieve the most relevant memory items for reconstruction. At the training stage, the memory contents are updated and are encouraged to represent the prototypical elements of the normal data. At the test stage, the learned memory will be fixed, and the reconstruction is obtained from a few selected memory records of the normal data. The reconstruction will thus tend to be close to a normal sample. Thus the reconstructed errors on anomalies will be strengthened for anomaly detection. MemAE is free of assumptions on the data type and thus general to be applied to different tasks. 
Experiments on various datasets prove the excellent generalization and high effectiveness of the proposed MemAE. Memory-Efficient Convolution(MEC) Convolution is a critical component in modern deep neural networks, thus several algorithms for convolution have been developed. Direct convolution is simple but suffers from poor performance. As an alternative, multiple indirect methods have been proposed including im2col-based convolution, FFT-based convolution, or Winograd-based algorithm. However, all these indirect methods have high memory-overhead, which creates performance degradation and offers a poor trade-off between performance and memory consumption. In this work, we propose a memory-efficient convolution or MEC with compact lowering, which reduces memory-overhead substantially and accelerates convolution process. MEC lowers the input matrix in a simple yet efficient/compact way (i.e., much less memory-overhead), and then executes multiple small matrix multiplications in parallel to get convolution completed. Additionally, the reduced memory footprint improves memory sub-system efficiency, improving performance. Our experimental results show that MEC reduces memory consumption significantly with good speedup on both mobile and server platforms, compared with other indirect convolution algorithms. Memory-Limited Online Subspace Estimation Scheme(MOSES) This paper introduces Memory-limited Online Subspace Estimation Scheme (MOSES) for both estimating the principal components of data and reducing its dimension. More specifically, consider a scenario where the data vectors are presented sequentially to a user who has limited storage and processing time available, for example in the context of sensor networks. In this scenario, MOSES maintains an estimate of leading principal components of the data that has arrived so far and also reduces its dimension. 
In terms of its origins, MOSES slightly generalises the popular incremental Singular Value Decomposition (SVD) to handle thin blocks of data. This simple generalisation is in part what allows us to complement MOSES with a comprehensive statistical analysis that is not available for incremental SVD, despite its empirical success. This generalisation also enables us to concretely interpret MOSES as an approximate solver for the underlying non-convex optimisation program. We also find that MOSES shows state-of-the-art performance in our numerical experiments with both synthetic and real-world datasets. Memristive Neural Network(MNN) Mendelian Randomization The basic idea behind Mendelian Randomization is the following. In a simple, randomly mating population Mendel’s laws tell us that at any genomic locus (a measured spot in the genome) the allele (genetic material you got) you get is assigned at random. At the chromosome level this is very close to true due to properties of meiosis (here is an example of how this looks in very cartoonish form in yeast). http://…/018150.full.pdf MentorNet Recent studies have discovered that deep networks are capable of memorizing the entire data even when the labels are completely random. Since deep models are trained on big data where labels are often noisy, the ability to overfit noise can lead to poor performance. To overcome the overfitting on corrupted training data, we propose a novel technique to regularize deep networks in the data dimension. This is achieved by learning a neural network called MentorNet to supervise the training of the base network, namely, StudentNet. Our work is inspired by curriculum learning and advances the theory by learning a curriculum from data by neural networks. We demonstrate the efficacy of MentorNet on several benchmarks. Comprehensive experiments show that it is able to significantly improve the generalization performance of the state-of-the-art deep networks on corrupted training data. 
Mercury-ML Mercury-ML is an open source Machine Learning workflow management library. Its core contributors are employees of Alexander Thamm GmbH. Merge and Select In this article we introduce Merge and Select – a methodology – and factorMerger – an R package – for exploration and visualization of k-group comparisons. Comparison of k-groups is one of the most important issues in exploratory analyses and it has zillions of applications. The classical solution is to test a null hypothesis that observations from all groups come from the same distribution. If the global null hypothesis is rejected a more detailed analysis of differences among pairs of groups is performed. The traditional approach is to use pairwise post hoc tests in order to verify which groups differ significantly. However, this approach fails with a large number of groups in both the interpretation and visualization layers. The Merge and Select methodology solves this problem by using an easy to understand description of LRT based similarity among groups. Merged-Averaged Classifiers via Hashing(MACH) We present Merged-Averaged Classifiers via Hashing (MACH) for K-classification with ultra-large values of K. Compared to traditional one-vs-all classifiers that require O(Kd) memory and inference cost, MACH only needs O(d log K) memory (d is the dimensionality) while only requiring O(K log K + d log K) operations for inference. MACH is a generic K-classification algorithm, with provably theoretical guarantees, which requires O(log K) memory without any assumption on the relationship between classes. MACH uses universal hashing to reduce classification with a large number of classes to few independent classification tasks with small (constant) number of classes. We provide theoretical quantification of discriminability-memory tradeoff. 
With MACH we can train on the ODP dataset with 100,000 classes and 400,000 features on a single Titan X GPU, with the classification accuracy of 19.28%, which is the best-reported accuracy on this dataset. Before this work, the best performing baseline is a one-vs-all classifier that requires 40 billion parameters (160 GB model size) and achieves 9% accuracy. In contrast, MACH can achieve 9% accuracy with 480x reduction in the model size (of mere 0.3GB). With MACH, we also demonstrate complete training of the fine-grained ImageNet dataset (compressed size 104GB), with 21,000 classes, on a single GPU. To the best of our knowledge, this is the first work to demonstrate complete training of these extreme-class datasets on a single Titan X. MergeNet We present a novel network architecture called MergeNet for discovering small obstacles for on-road scenes in the context of autonomous driving. The basis of the architecture rests on the central consideration of training with less data since the physical setup and the annotation process for small obstacles is hard to scale. For making effective use of the limited data, we propose a multi-stage training procedure involving weight-sharing, separate learning of low and high level features from the RGBD input and a refining stage which learns to fuse the obtained complementary features. The model is trained and evaluated on the Lost and Found dataset and is able to achieve state-of-the-art results with just 135 images in comparison to the 1000 images used by the previous benchmark. Additionally, we also compare our results with recent methods trained on 6000 images and show that our method achieves comparable performance with only 1000 training samples. MergeShuffle This article introduces an algorithm, MergeShuffle, which is an extremely efficient algorithm to generate random permutations (or to randomly permute an existing array). 
It is easy to implement, runs in $n\log_2 n + O(1)$ time, is in-place, uses $n\log_2 n + \Theta(n)$ random bits, and can be parallelized across any number of processes, in a shared-memory PRAM model. Finally, our preliminary simulations using OpenMP suggest it is more efficient than the Rao-Sandelius algorithm, one of the fastest existing random permutation algorithms. We also show how it is possible to further reduce the number of random bits consumed, by introducing a second algorithm BalancedShuffle, a variant of the Rao-Sandelius algorithm which is more conservative in the way it recursively partitions arrays to be shuffled. While this algorithm is of lesser practical interest, we believe it may be of theoretical value. Our full code is available at: https://…/mergeshuffle mermaid Generation of diagrams and flowcharts from text in a similar manner as markdown. Ever wanted to simplify documentation and avoid heavy tools like Visio when explaining your code? This is why mermaid was born, a simple markdown-like script language for generating charts from text via javascript. Mesa Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related to Google’s Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and systems requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Specifically, Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails. 
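As an illustration of the MergeShuffle entry above, the following is a minimal sequential sketch of a merge-based in-place shuffle: each half is shuffled recursively, then the two halves are interleaved by fair coin flips, and the leftover elements are inserted at uniformly random positions. The function names, the top-down recursion, and the use of Python's `random` module as the bit source are assumptions made for this sketch; they are not taken from the authors' code, and no parallelization is attempted.

```python
import random


def merge(a, start, mid, end, rng):
    """Randomly interleave the already-shuffled blocks a[start:mid] and a[mid:end] in place."""
    i, j = start, mid
    while True:
        if rng.getrandbits(1):
            # Heads: the next output element comes from the left block.
            if i == j:  # left block is exhausted
                break
        else:
            # Tails: the next output element comes from the right block.
            if j == end:  # right block is exhausted
                break
            a[i], a[j] = a[j], a[i]
            j += 1
        i += 1
    # Insert each leftover element at a uniformly random position
    # among the elements placed so far (Fisher-Yates style step).
    while i < end:
        m = rng.randint(start, i)  # both bounds inclusive
        a[i], a[m] = a[m], a[i]
        i += 1


def merge_shuffle(a, start=0, end=None, rng=random):
    """Shuffle a[start:end] in place (hypothetical helper name for this sketch)."""
    if end is None:
        end = len(a)
    if end - start < 2:
        return
    mid = (start + end) // 2
    merge_shuffle(a, start, mid, rng)
    merge_shuffle(a, mid, end, rng)
    merge(a, start, mid, end, rng)
```

Calling `merge_shuffle(items)` permutes the list `items` in place; passing `rng=random.Random(seed)` makes a run reproducible. The two recursive calls work on disjoint halves, which is the property a parallel version would exploit.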
MESH With the rapid growth of large online social networks, the ability to analyze large-scale social structure and behavior has become critically important, and this has led to the development of several scalable graph processing systems. In reality, however, social interaction takes place not only between pairs of individuals as in the graph model, but rather in the context of multi-user groups. Research has shown that such group dynamics can be better modeled through a more general hypergraph model, resulting in the need to build scalable hypergraph processing systems. In this paper, we present MESH, a flexible distributed framework for scalable hypergraph processing. MESH provides an easy-to-use and expressive application programming interface that naturally extends the think like a vertex model common to many popular graph processing systems. Our framework provides a flexible implementation based on an underlying graph processing system, and enables different design choices for the key implementation issues of partitioning a hypergraph representation. We implement MESH on top of the popular GraphX graph processing framework in Apache Spark. Using a variety of real datasets and experiments conducted on a local 8-node cluster as well as a 65-node Amazon AWS testbed, we demonstrate that MESH provides flexibility based on data and application characteristics, as well as scalability with cluster size. We further show that it is competitive in performance to HyperX, another hypergraph processing system based on Spark, while providing a much simpler implementation (requiring about 5X fewer lines of code), thus showing that simplicity and flexibility need not come at the cost of performance. MeshCNN A polygonal mesh representation provides an efficient approximation for 3D shapes. It explicitly captures both shape surface and topology, and leverages non-uniformity to represent large flat regions as well as sharp, intricate features. 
This non-uniformity and irregularity, however, inhibits mesh analysis efforts using neural networks that combine convolution and pooling operations. In this paper, we utilize the unique properties of the mesh for a direct analysis of 3D shapes using MeshCNN, a convolutional neural network designed specifically for triangular meshes. Analogous to classic CNNs, MeshCNN combines specialized convolution and pooling layers that operate on the mesh edges, by leveraging their intrinsic geodesic connections. Convolutions are applied on edges and the four edges of their incident triangles, and pooling is applied via an edge collapse operation that retains surface topology, thereby, generating new mesh connectivity for the subsequent convolutions. MeshCNN learns which edges to collapse, thus forming a task-driven process where the network exposes and expands the important features while discarding the redundant ones. We demonstrate the effectiveness of our task-driven pooling on various learning tasks applied to 3D meshes. MeshGAN Generative Adversarial Networks (GANs) are currently the method of choice for generating visual data. Certain GAN architectures and training methods have demonstrated exceptional performance in generating realistic synthetic images (in particular, of human faces). However, for 3D objects, GANs still fall short of the success they have had with images. One reason is that so far GANs have been applied as 3D convolutional architectures to discrete volumetric representations of 3D objects. In this paper, we propose the first intrinsic GANs architecture operating directly on 3D meshes (named as MeshGAN). Both quantitative and qualitative results are provided to show that MeshGAN can be used to generate high-fidelity 3D faces with rich identities and expressions. MeSH-gram Eliciting semantic similarity between concepts in the biomedical domain remains a challenging task. 
Recent approaches founded on embedding vectors have gained in popularity as they risen to efficiently capture semantic relationships The underlying idea is that two words that have close meaning gather similar contexts. In this study, we propose a new neural network model named MeSH-gram which relies on a straighforward approach that extends the skip-gram neural network model by considering MeSH (Medical Subject Headings) descriptors instead words. Trained on publicly available corpus PubMed MEDLINE, MeSH-gram is evaluated on reference standards manually annotated for semantic similarity. MeSH-gram is first compared to skip-gram with vectors of size 300 and at several windows contexts. A deeper comparison is performed with tewenty existing models. All the obtained results of Spearman’s rank correlations between human scores and computed similarities show that MeSH-gram outperforms the skip-gram model, and is comparable to the best methods but that need more computation and external resources. Mesh-TensorFlow Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and to implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. 
Where data-parallelism can be viewed as splitting tensors and operations along the ‘batch’ dimension, in Mesh-TensorFlow, the user can specify any tensor-dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into a SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state of the art results on WMT’14 English-to-French translation task and the one-billion-word language modeling benchmark. Mesh-Tensorflow is available at https://…/mesh . Message Importance Divergence(MID) Information transfer which reveals the state variation of variables can play a vital role in big data analytics and processing. In fact, the measure for information transfer can reflect the system change from the statistics by using the variable distributions, similar to KL divergence and Renyi divergence. Furthermore, in terms of the information transfer in big data, small probability events dominate the importance of the total message to some degree. Therefore, it is significant to design an information transfer measure based on the message importance which emphasizes the small probability events. In this paper, we propose the message importance divergence (MID) and investigate its characteristics and applications on three aspects. First, the message importance transfer capacity based on MID is presented to offer an upper bound for the information transfer with disturbance. Then, we utilize the MID to guide the queue length selection, which is the fundamental problem considered to have higher social or academic value in the caching operation of mobile edge computing. 
Finally, we extend the MID to the continuous case and discuss the robustness by using it to measure information distance. Message Importance Measure(MIM) Rare events attract more attention and interest in many scenarios of big data such as anomaly detection and security systems. To characterize the importance of rare events from a probabilistic perspective, the message importance measure (MIM) is proposed as a kind of semantics analysis tool. Similar to Shannon entropy, the MIM has its special function in information processing, in which the parameter $\varpi$ of MIM plays a vital role. Actually, the parameter $\varpi$ dominates the properties of MIM, based on which the MIM has three work regions where the corresponding parameters satisfy $0 \le \varpi \le 2/\max$, $\varpi > 2/\max$ and $\varpi < 0$ respectively. Furthermore, in the case $0 \le \varpi \le 2/\max$, there is some similarity between the MIM and Shannon entropy in information compression and transmission, which provides a new viewpoint for information theory. This paper first constructs a system model with message importance measure and proposes the message importance loss to enrich the information processing strategies. Moreover, we propose the message importance loss capacity to measure the information importance harvest in a transmission. Furthermore, the message importance distortion function is presented to give an upper bound of information compression based on message importance measure. Additionally, the bitrate transmission constrained by the message importance loss is investigated to broaden the scope for Shannon information theory. Message Passing Algorithms Constraint Satisfaction Problems (CSPs) are defined over a set of variables whose state must satisfy a number of constraints. We study a class of algorithms called Message Passing Algorithms, which aim at finding the probability distribution of the variables over the space of satisfying assignments. 
These algorithms involve passing local messages (according to some message update rules) over the edges of a factor graph constructed corresponding to the CSP. Message Passing Interface(MPI) Message Passing Interface (MPI) is a standardized and portable message-passing system designed by a group of researchers from academia and industry to function on a wide variety of parallel computers. The standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in Fortran or the C programming language. There are several well-tested and efficient implementations of MPI, including some that are free or in the public domain. These fostered the development of a parallel software industry, and encouraged the development of portable and scalable large-scale parallel applications. http://…/r-and-meta-analysis.html metaplus,MAVIS Message Understanding Conference(MUC) The Message Understanding Conferences (MUC) were initiated and financed by DARPA (Defense Advanced Research Projects Agency) to encourage the development of new and better methods of information extraction. The character of this competition (many concurrent research teams competing against one another) required the development of standards for evaluation, e.g. the adoption of metrics like precision and recall. Message-Dropout In this paper, we propose a new learning technique named message-dropout to improve the performance for multi-agent deep reinforcement learning under two application scenarios: 1) classical multi-agent reinforcement learning with direct message communication among agents and 2) centralized training with decentralized execution. 
In the first application scenario of multi-agent systems in which direct message communication among agents is allowed, the message-dropout technique drops out the received messages from other agents in a block-wise manner with a certain probability in the training phase and compensates for this effect by multiplying the weights of the dropped-out block units with a correction probability. The applied message-dropout technique effectively handles the increased input dimension in multi-agent reinforcement learning with communication and makes learning robust against communication errors in the execution phase. In the second application scenario of centralized training with decentralized execution, we particularly consider the application of the proposed message-dropout to Multi-Agent Deep Deterministic Policy Gradient (MADDPG), which uses a centralized critic to train a decentralized actor for each agent. We evaluate the proposed message-dropout technique for several games, and numerical results show that the proposed message-dropout technique with proper dropout rate improves the reinforcement learning performance significantly in terms of the training speed and the steady-state performance in the execution phase. M-Estimation In statistics, M-estimators are a broad class of estimators, which are obtained as the minima of sums of functions of the data. Least-squares estimators are a special case of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. The statistical procedure of evaluating an M-estimator on a data set is called M-estimation. More generally, an M-estimator may be defined to be a zero of an estimating function. This estimating function is often the derivative of another statistical function. 
For example, a maximum-likelihood estimate is often defined to be a zero of the derivative of the likelihood function with respect to the parameter; thus, a maximum-likelihood estimator is often a critical point of the score function. In many applications, such M-estimators can be thought of as estimating characteristics of the population. geex Meta Bag Algorithm Meta Continual Learning Using neural networks in practical settings would benefit from the ability of the networks to learn new tasks throughout their lifetimes without forgetting the previous tasks. This ability is limited in the current deep neural networks by a problem called catastrophic forgetting, where training on new tasks tends to severely degrade performance on previous tasks. One way to lessen the impact of the forgetting problem is to constrain parameters that are important to previous tasks to stay close to the optimal parameters. Recently, multiple competitive approaches for computing the importance of the parameters with respect to the previous tasks have been presented. In this paper, we propose a learning to optimize algorithm for mitigating catastrophic forgetting. Instead of trying to formulate a new constraint function ourselves, we propose to train another neural network to predict parameter update steps that respect the importance of parameters to the previous tasks. In the proposed meta-training scheme, the update predictor is trained to minimize loss on a combination of current and past tasks. We show experimentally that the proposed approach works in the continual learning setting. Meta Filter Pruning(MFP) Existing methods usually utilize pre-defined criterions, such as p-norm, to prune unimportant filters. There are two major limitations in these methods. First, the relations of the filters are largely ignored. The filters usually work jointly to make an accurate prediction in a collaborative way. 
Similar filters will have equivalent effects on the network prediction, and the redundant filters can be further pruned. Second, the pruning criterion remains unchanged during training. As the network is updated at each iteration, the filter distribution also changes continuously. The pruning criteria should also be adaptively switched. In this paper, we propose Meta Filter Pruning (MFP) to solve the above problems. First, as a complement to the existing p-norm criterion, we introduce a new pruning criterion considering the filter relation via filter distance. Additionally, we build a meta pruning framework for filter pruning, so that our method could adaptively select the most appropriate pruning criterion as the filter distribution changes. Experiments validate our approach on two image classification benchmarks. Notably, on ILSVRC-2012, our MFP reduces FLOPs by more than 50% on ResNet-50 with only 0.44% top-5 accuracy loss. Meta Networks Deep neural networks have been successfully applied in applications with a large amount of labeled data. However, there are major drawbacks of the neural networks that are related to rapid generalization with small data and continual learning of new concepts without forgetting. We present a novel meta learning method, Meta Networks (MetaNet), that acquires meta-level knowledge across tasks and shifts its inductive bias via fast parameterization for rapid generalization. When tested on the standard one-shot learning benchmarks, our MetaNet models achieved near human-level accuracy. We demonstrated several appealing properties of MetaNet relating to generalization and continual learning. Meta-Analysis for Pathway Enrichment(MAPE) Motivation: Many pathway analysis (or gene set enrichment analysis) methods have been developed to identify enriched pathways under different biological states within a genomic study. 
As more and more microarray datasets accumulate, meta-analysis methods have also been developed to integrate information among multiple studies. Currently, most meta-analysis methods for combining genomic studies focus on biomarker detection and meta-analysis for pathway analysis has not been systematically pursued. Results: We investigated two approaches of meta-analysis for pathway enrichment (MAPE) by combining statistical significance across studies at the gene level (MAPE_G) or at the pathway level (MAPE_P). Simulation results showed increased statistical power of meta-analysis approaches compared to a single study analysis and showed complementary advantages of MAPE_G and MAPE_P under different scenarios. We also developed an integrated method (MAPE_I) that incorporates advantages of both approaches. Comprehensive simulations and applications to real data on drug response of breast cancer cell lines and lung cancer tissues were evaluated to compare the performance of three MAPE variations. MAPE_P has the advantage of not requiring gene matching across studies. When MAPE_G and MAPE_P show complementary advantages, the hybrid version of MAPE_I is generally recommended. MetaPath MetaBags Ensembles are popular methods for solving practical supervised learning problems. They reduce the risk of having underperforming models in production-grade software. Although critical, methods for learning heterogeneous regression ensembles have not been proposed at large scale, whereas in classical ML literature, stacking, cascading and voting are mostly restricted to classification problems. Regression poses distinct learning challenges that may result in poor performance, even when using well established homogeneous ensemble schemas such as bagging or boosting. In this paper, we introduce MetaBags, a novel, practically useful stacking framework for regression. 
MetaBags is a meta-learning algorithm that learns a set of meta-decision trees designed to select one base model (i.e. expert) for each query, and focuses on inductive bias reduction. A set of meta-decision trees is learned using different types of meta-features, specially created for this purpose – to then be bagged at meta-level. This procedure is designed to learn a model with a fair bias-variance trade-off, and its improvement over base model performance is correlated with the prediction diversity of different experts on specific input space subregions. The proposed method and meta-features are designed in such a way that they enable good predictive performance even in subregions of space which are not adequately represented in the available training data. An exhaustive empirical testing of the method was performed, evaluating both generalization error and scalability of the approach on synthetic, open and real-world application datasets. The obtained results show that our method significantly outperforms existing state-of-the-art approaches. Meta-Cognitive Machine Learning Machine learning is usually defined in behaviourist terms, where external validation is the primary mechanism of learning. In this paper, I argue for a more holistic interpretation in which finding more probable, efficient and abstract representations is as central to learning as performance. In other words, machine learning should be extended with strategies to reason over its own learning process, leading to so-called meta-cognitive machine learning. As such, the de facto definition of machine learning should be reformulated in these intrinsically multi-objective terms, taking into account not only the task performance but also internal learning objectives. To this end, we suggest a ‘model entropy function’ to be defined that quantifies the efficiency of the internal learning processes. It is conjectured that the minimization of this model entropy leads to concept formation. 
Besides philosophical aspects, some initial illustrations are included to support the claims. Meta-Curvature We propose to learn curvature information for better generalization and fast model adaptation, called meta-curvature. Based on the model-agnostic meta-learner (MAML), we learn to transform the gradients in the inner optimization such that the transformed gradients achieve better generalization performance to a new task. For training large scale neural networks, we decompose the curvature matrix into smaller matrices and capture the dependencies of the model’s parameters with a series of tensor products. We demonstrate the effects of our proposed method on both few-shot image classification and few-shot reinforcement learning tasks. Experimental results show consistent improvements on classification tasks and promising results on reinforcement learning tasks. Furthermore, we observe faster convergence rates of the meta-training process. Finally, we present an analysis that explains better generalization performance with the meta-trained curvature. META-Dynamic Ensemble Selection(META-DES) The key issue in Dynamic Ensemble Selection (DES) is defining a suitable criterion for calculating the classifiers’ competence. There are several criteria available to measure the level of competence of base classifiers, such as local accuracy estimates and ranking. However, using only one criterion may lead to a poor estimation of the classifier’s competence. In order to deal with this issue, we have proposed a novel dynamic ensemble selection framework using meta-learning, called META-DES. An important aspect of the META-DES framework is that multiple criteria can be embedded in the system encoded as different sets of meta-features. However, some DES criteria are not suitable for every classification problem. For instance, local accuracy estimates may produce poor results when there is a high degree of overlap between the classes. 
Moreover, a higher classification accuracy can be obtained if the performance of the meta-classifier is optimized for the corresponding data. In this paper, we propose a novel version of the META-DES framework based on the formal definition of the Oracle, called META-DES.Oracle. The Oracle is an abstract method that represents an ideal classifier selection scheme. A meta-feature selection scheme using an overfitting cautious Binary Particle Swarm Optimization (BPSO) is proposed for improving the performance of the meta-classifier. The difference between the outputs obtained by the meta-classifier and those presented by the Oracle is minimized. Thus, the meta-classifier is expected to obtain results that are similar to the Oracle. Experiments carried out using 30 classification problems demonstrate that the optimization procedure based on the Oracle definition leads to a significant improvement in classification accuracy when compared to previous versions of the META-DES framework and other state-of-the-art DES techniques. Meta-Embedding Click-through rate (CTR) prediction has been one of the most central problems in computational advertising. Lately, embedding techniques that produce low-dimensional representations of ad IDs drastically improve CTR prediction accuracies. However, such learning techniques are data demanding and work poorly on new ads with little logging data, which is known as the cold-start problem. In this paper, we aim to improve CTR predictions during both the cold-start phase and the warm-up phase when a new ad is added to the candidate pool. We propose Meta-Embedding, a meta-learning-based approach that learns to generate desirable initial embeddings for new ad IDs. The proposed method trains an embedding generator for new ad IDs by making use of previously learned ads through gradient-based meta-learning. In other words, our method learns how to learn better embeddings. 
When a new ad comes, the trained generator initializes the embedding of its ID by feeding its contents and attributes. Next, the generated embedding can speed up the model fitting during the warm-up phase when a few labeled examples are available, compared to the existing initialization methods. Experimental results on three real-world datasets showed that Meta-Embedding can significantly improve both the cold-start and warm-up performances for six existing CTR prediction models, ranging from lightweight models such as Factorization Machines to complicated deep models such as PNN and DeepFM. All of the above apply to conversion rate (CVR) predictions as well. MetaForest A requirement of classic meta-analysis is that the studies being aggregated are conceptually similar, and ideally, close replications. However, in many fields, there is substantial heterogeneity between studies on the same topic. Similar research questions are studied in different laboratories, using different methods, instruments, and samples. Classic meta-analysis lacks the power to assess more than a handful of univariate moderators, or to investigate interactions between moderators, and non-linear effects. MetaForest, by contrast, has substantial power to explore heterogeneity in meta-analysis. It can identify important moderators from a larger set of potential candidates, even with as few as 20 studies (Van Lissa, in preparation). This is an appealing quality, because many meta-analyses have small sample sizes. Moreover, MetaForest yields a measure of variable importance which can be used to identify important moderators, and offers partial prediction plots to explore the shape of the marginal relationship between moderators and effect size. metaforest Meta-GNN Meta-learning has received tremendous recent attention as a possible approach for mimicking human intelligence, i.e., acquiring new knowledge and skills with little or even no demonstration. 
Most existing meta-learning methods are proposed to tackle few-shot learning problems on data such as images and text, that is, in Euclidean domains. However, there are very few works applying meta-learning to non-Euclidean domains, and the recently proposed graph neural network (GNN) models do not perform effectively on graph few-shot learning problems. Towards this, we propose a novel graph meta-learning framework, Meta-GNN, to tackle the few-shot node classification problem in graph meta-learning settings. It obtains the prior knowledge of classifiers by training on many similar few-shot learning tasks and then classifies the nodes from new classes with only a few labeled samples. Additionally, Meta-GNN is a general model that can be straightforwardly incorporated into any existing state-of-the-art GNN. Our experiments conducted on three benchmark datasets demonstrate that our proposed approach not only improves the node classification performance by a large margin on few-shot learning problems in meta-learning paradigm, but also learns a more general and flexible model for task adaptation. MetaGrasp Data-driven approaches to grasping have shown significant advances recently, but they usually require large amounts of training data. To increase the efficiency of grasping data collection, this paper presents a novel grasp training system including the whole pipeline from data collection to model inference. The system can collect effective grasp samples with a corrective strategy assisted by the antipodal grasp rule, and we design an affordance interpreter network to predict a pixelwise grasp affordance map. We define graspability, ungraspability and background as grasp affordances. The key advantage of our system is that the pixel-level affordance interpreter network trained with only a small number of grasp samples under the antipodal rule can achieve significant performance on totally unseen objects and backgrounds. The training samples are collected only in simulation. 
Extensive qualitative and quantitative experiments demonstrate the accuracy and robustness of our proposed approach. In the real-world grasp experiments, we achieve a grasp success rate of 93% on a set of household items and 91% on a set of adversarial items with only about 6,300 simulated samples. We also achieve 87% accuracy in a cluttered scenario. Although the model is trained using only RGB images, it also performs well when the background textures are changed and can even achieve 94% accuracy on the set of adversarial objects, which outperforms current state-of-the-art methods. meta-iNat Benchmark Traditional recognition methods typically require large, artificially-balanced training classes, while few-shot learning methods are tested on artificially small ones. In contrast to both extremes, real world recognition problems exhibit heavy-tailed class distributions, with cluttered scenes and a mix of coarse and fine-grained class distinctions. We show that prior methods designed for few-shot learning do not work out of the box in these challenging conditions, based on a new ‘meta-iNat’ benchmark. We introduce three parameter-free improvements: (a) better training procedures based on adapting cross-validation to meta-learning, (b) novel architectures that localize objects using limited bounding box annotations before classification, and (c) simple parameter-free expansions of the feature space based on bilinear pooling. Together, these improvements double the accuracy of state-of-the-art models on meta-iNat while generalizing to prior benchmarks, complex neural architectures, and settings with substantial domain shift. Meta-Interpretive Learning(MIGO) World-class human players have been outperformed in a number of complex two-person games (Go, Chess, Checkers) by Deep Reinforcement Learning systems. However, owing to tractability considerations, minimax regret of a learning system cannot be evaluated in such games. 
In this paper we consider simple games (Noughts-and-Crosses and Hexapawn) in which minimax regret can be efficiently evaluated. We use these games to compare Cumulative Minimax Regret for variants of both standard and deep reinforcement learning against two variants of a new Meta-Interpretive Learning system called MIGO. In our experiments all tested variants of both normal and deep reinforcement learning have worse performance (higher cumulative minimax regret) than both variants of MIGO on Noughts-and-Crosses and Hexapawn. Additionally, MIGO’s learned rules are relatively easy to comprehend, and are demonstrated to achieve significant transfer learning in both directions between Noughts-and-Crosses and Hexapawn. Meta-Learning Autoencoder(MeLA) Compared to humans, machine learning models generally require significantly more training examples and fail to extrapolate from experience to solve previously unseen challenges. To help close this performance gap, we augment single-task neural networks with a meta-recognition model which learns a succinct model code via its autoencoder structure, using just a few informative examples. The model code is then employed by a meta-generative model to construct parameters for the task-specific model. We demonstrate that for previously unseen tasks, without additional training, this Meta-Learning Autoencoder (MeLA) framework can build models that closely match the true underlying models, with loss significantly lower than given by fine-tuned baseline networks, and performance that compares favorably with state-of-the-art meta-learning algorithms. MeLA also adds the ability to identify influential training examples and predict which additional data will be most valuable to acquire to improve model prediction. 
Meta-Learning for Online Learning(MOLe) Humans and animals can learn complex predictive models that allow them to accurately and reliably reason about real-world phenomena, and they can adapt such models extremely quickly in the face of unexpected changes. Deep neural network models allow us to represent very complex functions, but lack this capacity for rapid online adaptation. The goal in this paper is to develop a method for continual online learning from an incoming stream of data, using deep neural network models. We formulate an online learning procedure that uses stochastic gradient descent to update model parameters, and an expectation maximization algorithm with a Chinese restaurant process prior to develop and maintain a mixture of models to handle non-stationary task distributions. This allows for all models to be adapted as necessary, with new models instantiated for task changes and old models recalled when previously seen tasks are encountered again. Furthermore, we observe that meta-learning can be used to meta-train a model such that this direct online adaptation with SGD is effective, which is otherwise not the case for large function approximators. In this work, we apply our meta-learning for online learning (MOLe) approach to model-based reinforcement learning, where adapting the predictive model is critical for control; we demonstrate that MOLe outperforms alternative prior methods, and enables effective continuous adaptation in non-stationary task distributions such as varying terrains, motor failures, and unexpected disturbances. Metalog Distribution In economics, business, engineering, science and other fields, continuous uncertainties frequently arise that are not easily- or well-characterized by previously-named continuous probability distributions. Frequently, there is data available from measurements, assessments, derivations, simulations or other sources that characterize the range of an uncertainty. 
But the underlying process that generated this data is either unknown or fails to lend itself to convenient derivation of equations that appropriately characterize the probability density (PDF), cumulative (CDF) or quantile distribution functions. The metalog distributions are a family of continuous univariate probability distributions that directly address this need. They can be used in most any situation in which CDF data is known and a flexible, simple, and easy-to-use continuous probability distribution is needed to represent that data. Consider their uses and benefits. Also consider their applications over a wide range of fields and data sources. rmetalog Metameric Sampling Despite their impressive performance, deep neural networks exhibit striking failures on out-of-distribution inputs. One core idea of adversarial example research is to reveal neural network errors under such distribution shift. We decompose these errors into two complementary sources: sensitivity and invariance. We show deep networks are not only too sensitive to task-irrelevant changes of their input, as is well-known from epsilon-adversarial examples, but are also too invariant to a wide range of task-relevant changes, thus making vast regions in input space vulnerable to adversarial attacks. After identifying this excessive invariance, we propose the usage of bijective deep networks to enable access to all variations. We introduce metameric sampling as an analytic attack for these networks, requiring no optimization, and show that it uncovers large subspaces of misclassified inputs. Then we apply these networks to MNIST and ImageNet and show that one can manipulate the class-specific content of almost any image without changing the hidden activations. Further, we extend the standard cross-entropy loss to strengthen the model against such manipulations via an information-theoretic analysis, providing the first approach tailored explicitly to overcome invariance-based vulnerability. 
We conclude by empirically illustrating its ability to control undesirable class-specific invariance, showing promise to overcome one major cause for adversarial examples. Meta-Metric-Learner Few-shot learning aims to learn classifiers for new classes with only a few training examples per class. Most existing few-shot learning approaches belong to either metric-based meta-learning or optimization-based meta-learning category, both of which have achieved successes in the simplified ‘$k$-shot $N$-way’ image classification settings. Specifically, the optimization-based approaches train a meta-learner to predict the parameters of the task-specific classifiers. The task-specific classifiers are required to be homogeneous-structured to ease the parameter prediction, so the meta-learning approaches could only handle few-shot learning problems where the tasks share a uniform number of classes. The metric-based approaches learn one task-invariant metric for all the tasks. Even though the metric-learning approaches allow different numbers of classes, they require the tasks all coming from a similar domain such that there exists a uniform metric that could work across tasks. In this work, we propose a hybrid meta-learning model called Meta-Metric-Learner which combines the merits of both optimization- and metric-based approaches. Our meta-metric-learning approach consists of two components, a task-specific metric-based learner as a base model, and a meta-learner that learns and specifies the base model. Thus our model is able to handle flexible numbers of classes as well as generate more generalized metrics for classification across tasks. We test our approach in the standard ‘$k$-shot $N$-way’ few-shot learning setting following previous works and a new realistic few-shot setting with flexible class numbers in both single-source form and multi-source forms. Experiments show that our approach can obtain superior performance in all settings. 
MetaMimic Humans are experts at high-fidelity imitation — closely mimicking a demonstration, often in one attempt. Humans use this ability to quickly solve a task instance, and to bootstrap learning of new tasks. Achieving these abilities in autonomous agents is an open problem. In this paper, we introduce an off-policy RL algorithm (MetaMimic) to narrow this gap. MetaMimic can learn both (i) policies for high-fidelity one-shot imitation of diverse novel skills, and (ii) policies that enable the agent to solve tasks more efficiently than the demonstrators. MetaMimic relies on the principle of storing all experiences in a memory and replaying these to learn massive deep neural network policies by off-policy RL. This paper introduces, to the best of our knowledge, the largest existing neural networks for deep RL and shows that larger networks with normalization are needed to achieve one-shot high-fidelity imitation on a challenging manipulation task. The results also show that both types of policy can be learned from vision, in spite of the task rewards being sparse, and without access to demonstrator actions. MetaOptNet Many meta-learning approaches for few-shot learning rely on simple base learners such as nearest-neighbor classifiers. However, even in the few-shot regime, discriminatively trained linear predictors can offer better generalization. We propose to use these predictors as base learners to learn representations for few-shot learning and show they offer better tradeoffs between feature size and performance across a range of few-shot recognition benchmarks. Our objective is to learn feature embeddings that generalize well under a linear classification rule for novel categories. To efficiently solve the objective, we exploit two properties of linear classifiers: implicit differentiation of the optimality conditions of the convex problem and the dual formulation of the optimization problem. 
This allows us to use high-dimensional embeddings with improved generalization at a modest increase in computational overhead. Our approach, named MetaOptNet, achieves state-of-the-art performance on miniImageNet, tieredImageNet, CIFAR-FS, and FC100 few-shot learning benchmarks. MetaPruning In this paper, we propose a novel meta learning approach for automatic channel pruning of very deep neural networks. We first train a PruningNet, a kind of meta network, which is able to generate weight parameters for any pruned structure given the target network. We use a simple stochastic structure sampling method for training the PruningNet. Then, we apply an evolutionary procedure to search for good-performing pruned networks. The search is highly efficient because the weights are directly generated by the trained PruningNet and we do not need any finetuning. With a single PruningNet trained for the target network, we can search for various Pruned Networks under different constraints with little human participation. We have demonstrated competitive performances on MobileNet V1/V2 networks, up to 9.0/9.9% higher ImageNet accuracy than V1/V2. Compared to the previous state-of-the-art AutoML-based pruning methods, like AMC and NetAdapt, we achieve higher or comparable accuracy under various conditions. Meta-Sim Training models to high-end performance requires availability of large labeled datasets, which are expensive to get. The goal of our work is to automatically synthesize labeled datasets that are relevant for a downstream task. We propose Meta-Sim, which learns a generative model of synthetic scenes, and obtains images as well as their corresponding ground-truth via a graphics engine. We parametrize our dataset generator with a neural network, which learns to modify attributes of scene graphs obtained from probabilistic scene grammars, so as to minimize the distribution gap between its rendered outputs and target data. 
If the real dataset comes with a small labeled validation set, we additionally aim to optimize a meta-objective, i.e. downstream task performance. Experiments show that the proposed method can greatly improve content generation quality over a human-engineered probabilistic scene grammar, both qualitatively and quantitatively as measured by performance on a downstream task. Metatrace Reinforcement learning (RL) has had many successes in both ‘deep’ and ‘shallow’ settings. In both cases, significant hyperparameter tuning is often required to achieve good performance. Furthermore, when nonlinear function approximation is used, non-stationarity in the state representation can lead to learning instability. A variety of techniques exist to combat this — most notably large experience replay buffers or the use of multiple parallel actors. These techniques come at the cost of moving away from the online RL problem as it is traditionally formulated (i.e., a single agent learning online without maintaining a large database of training examples). Meta-learning can potentially help with both these issues by tuning hyperparameters online and allowing the algorithm to more robustly adjust to non-stationarity in a problem. This paper applies meta-gradient descent to derive a set of step-size tuning algorithms specifically for online RL control with eligibility traces. Our novel technique, Metatrace, makes use of an eligibility trace analogous to methods like $TD(\lambda)$. We explore tuning both a single scalar step-size and a separate step-size for each learned parameter. We evaluate Metatrace first for control with linear function approximation in the classic mountain car problem and then in a noisy, non-stationary version. Finally, we apply Metatrace for control with nonlinear function approximation in 5 games in the Arcade Learning Environment where we explore how it impacts learning speed and robustness to initial step-size choice. 
Results show that the meta-step-size parameter of Metatrace is easy to set, Metatrace can speed learning, and Metatrace can allow an RL algorithm to deal with non-stationarity in the learning task. Meta-Transfer Learning Meta-learning has been proposed as a framework to address the challenging few-shot learning setting. The key idea is to leverage a large number of similar few-shot tasks in order to learn how to adapt a base-learner to a new task for which only a few labeled samples are available. As deep neural networks (DNNs) tend to overfit using a few samples only, meta-learning typically uses shallow neural networks (SNNs), thus limiting its effectiveness. In this paper we propose a novel few-shot learning method called meta-transfer learning (MTL) which learns to adapt a deep NN for few shot learning tasks. Specifically, ‘meta’ refers to training multiple tasks, and ‘transfer’ is achieved by learning scaling and shifting functions of DNN weights for each task. In addition, we introduce the hard task (HT) meta-batch scheme as an effective learning curriculum for MTL. We conduct experiments using (5-class, 1-shot) and (5-class, 5-shot) recognition tasks on two challenging few-shot learning benchmarks: miniImageNet and Fewshot-CIFAR100. Extensive comparisons to related works validate that our meta-transfer learning approach trained with the proposed HT meta-batch scheme achieves top performance. An ablation study also shows that both components contribute to fast convergence and high accuracy. Meta-Unsupervised-Learning We introduce a new paradigm to investigate unsupervised learning, reducing unsupervised learning to supervised learning. Specifically, we mitigate the subjectivity in unsupervised decision-making by leveraging knowledge acquired from prior, possibly heterogeneous, supervised learning tasks. 
We demonstrate the versatility of our framework via comprehensive expositions and detailed experiments on several unsupervised problems such as (a) clustering, (b) outlier detection, and (c) similarity prediction under a common umbrella of meta-unsupervised-learning. We also provide rigorous PAC-agnostic bounds to establish the theoretical foundations of our framework, and show that our framing of meta-clustering circumvents Kleinberg’s impossibility theorem for clustering. Metcalfe’s Law Metcalfe’s law states that the value of a telecommunications network is proportional to the square of the number of connected users of the system (n^2). First formulated in this form by George Gilder in 1993, and attributed to Robert Metcalfe in regard to Ethernet, Metcalfe’s law was originally presented, circa 1980, not in terms of users, but rather of ‘compatible communicating devices’ (for example, fax machines, telephones, etc.). Only more recently with the launch of the Internet did this law carry over to users and networks as its original intent was to describe Ethernet purchases and connections. The law is also very much related to economics and business management, especially with competitive companies looking to merge with one another. In the real world, requirements of Pareto efficiency imply that the law will not hold. Method of Codifferential Descent(MCD) Method of codifferential descent (MCD) developed by professor V.F. Demyanov for solving a large class of nonsmooth nonconvex optimization problems. ➚ “Generalised Method of Codifferential Descent” Method of Moments(MM) In statistics, the method of moments is a method of estimation of population parameters. One starts with deriving equations that relate the population moments (i.e., the expected values of powers of the random variable under consideration) to the parameters of interest. Then a sample is drawn and the population moments are estimated from the sample. 
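As a rough illustration of the method of moments, the sketch below fits a Gamma distribution by matching its first two moments. The choice of a Gamma model, the function names `gamma_method_of_moments` and `draw_gamma_sample`, and the sample size are assumptions made for this example only. For a Gamma with shape k and scale theta, E[X] = k * theta and Var[X] = k * theta^2, so replacing the population moments with sample moments and solving gives k = mean^2 / var and theta = var / mean.

```python
import random
import statistics


def gamma_method_of_moments(sample):
    """Estimate Gamma(shape k, scale theta) parameters by matching moments.

    Population moments: E[X] = k * theta and Var[X] = k * theta**2.
    Solving with sample moments gives k = mean**2 / var, theta = var / mean.
    """
    mean = statistics.fmean(sample)
    var = statistics.pvariance(sample, mu=mean)  # second central sample moment
    if mean <= 0 or var <= 0:
        raise ValueError("sample must have positive mean and non-zero variance")
    shape = mean ** 2 / var
    scale = var / mean
    return shape, scale


def draw_gamma_sample(shape, scale, n, seed=0):
    """Draw n values from Gamma(shape, scale) with a fixed seed (illustrative helper)."""
    rng = random.Random(seed)
    return [rng.gammavariate(shape, scale) for _ in range(n)]


if __name__ == "__main__":
    data = draw_gamma_sample(shape=2.0, scale=3.0, n=50_000)
    k_hat, theta_hat = gamma_method_of_moments(data)
    print(f"estimated shape ~ {k_hat:.3f}, estimated scale ~ {theta_hat:.3f}")
```

With a large sample the estimates should land close to the parameters used to generate the data, although method-of-moments estimates are generally less efficient than maximum likelihood.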
The equations are then solved for the parameters of interest, using the sample moments in place of the (unknown) population moments. This results in estimates of those parameters. The method of moments was introduced by Karl Pearson in 1894. momentchi2 Method of Simulated Moments(MSM) In econometrics, the method of simulated moments (MSM) (also called simulated method of moments) is a structural estimation technique introduced by Daniel McFadden. It extends the generalized method of moments to cases where theoretical moment functions cannot be evaluated directly, such as when moment functions involve high-dimensional integrals. MSM’s earliest and principal applications have been to research in industrial organization, after its development by Ariel Pakes, David Pollard, and others, though applications in consumption are emerging. Metric In mathematics, a metric or distance function is a function that defines a distance between each pair of elements of a set. A set with a metric is called a metric space. A metric induces a topology on a set, but not all topologies can be generated by a metric. A topological space whose topology can be described by a metric is called metrizable. Metric Expression Network(MEnet) Recent CNN-based saliency models have achieved great performance on public datasets, however, most of them are sensitive to distortion (e.g., noise, compression). In this paper, an end-to-end generic salient object segmentation model called Metric Expression Network (MEnet) is proposed to overcome this drawback. Within this architecture, we construct a new topological metric space, with the implicit metric being determined by the deep network. In this way, we succeed in grouping all the pixels within the observed image semantically within this latent space into two regions: a salient region and a non-salient region. With this method, all feature extractions are carried out at the pixel level, which makes the output boundaries of salient object fine-grained. 
Experimental results show that the proposed metric can generate robust salient maps that allow for object segmentation. Tests on several public benchmarks show that MEnet achieves good results. Furthermore, the proposed method outperforms previous CNN-based methods on distorted images. Metric Gaussian Variational Inference(MGVI) A variational Gaussian approximation of the posterior distribution can be an excellent way to infer posterior quantities. However, to capture all posterior correlations the parametrization of the full covariance is required, which scales quadratically with the problem size. This scaling prohibits full-covariance approximations for large-scale problems. As a solution to this limitation we propose Metric Gaussian Variational Inference (MGVI). This procedure approximates the variational covariance such that it requires no parameters on its own and still provides reliable posterior correlations and uncertainties for all model parameters. We approximate the variational covariance with the inverse Fisher metric, a local estimate of the true posterior uncertainty. This covariance is only stored implicitly and all necessary quantities can be extracted from it by independent samples drawn from the approximating Gaussian. MGVI requires the minimization of a stochastic estimate of the Kullback-Leibler divergence only with respect to the mean of the variational Gaussian, a quantity that only scales linearly with the problem size. We motivate the choice of this covariance from an information geometric perspective. The method is validated against established approaches in a small example and the scaling is demonstrated in a problem with over a million parameters. Metric Optimization Engine(MOE) MOE (Metric Optimization Engine) is an efficient way to optimize a system’s parameters, when evaluating parameters is time-consuming or expensive. 
It is an open source, machine learning tool for solving these global, black box optimization problems in an optimal way. Here are some examples of when you could use MOE: 1. Optimizing a system’s click-through rate (CTR). 2. Optimizing tunable parameters of a machine-learning prediction method. 3. Optimizing the design of an engineering system. 4. Optimizing the parameters of a real-world experiment. Metric-Based Adversarial Discriminative Domain Adaptation(M-ADDA) Unsupervised domain adaptation techniques have been successful for a wide range of problems where supervised labels are limited. The task is to classify an unlabeled ‘target’ dataset by leveraging a labeled ‘source’ dataset that comes from a slightly similar distribution. We propose metric-based adversarial discriminative domain adaptation (M-ADDA) which performs two main steps. First, it uses a metric learning approach to train the source model on the source dataset by optimizing the triplet loss function. This results in clusters where embeddings of the same label are close to each other and those with different labels are far from one another. Next, it uses the adversarial approach (as used in ADDA, arXiv:1702.05464) to make the extracted features from the source and target datasets indistinguishable. Simultaneously, we optimize a novel loss function that encourages the target dataset’s embeddings to form clusters. While ADDA and M-ADDA use similar architectures, we show that M-ADDA performs significantly better on the digits adaptation datasets of MNIST and USPS. This suggests that using metric-learning for domain adaptation can lead to large improvements in classification accuracy for the domain adaptation task. The code is available at https://…/M-ADDA. Metric-Constrained Kernel Union-of-Subspaces(MC-KUoS) Modern information processing relies on the axiom that high-dimensional data lie near low-dimensional geometric structures. 
This paper revisits the problem of data-driven learning of these geometric structures and puts forth two new nonlinear geometric models for data describing ‘related’ objects/phenomena. The first one of these models straddles the two extremes of the subspace model and the union-of-subspaces model, and is termed the metric-constrained union-of-subspaces (MC-UoS) model. The second one of these models—suited for data drawn from a mixture of nonlinear manifolds—generalizes the kernel subspace model, and is termed the metric-constrained kernel union-of-subspaces (MC-KUoS) model. The main contributions of this paper in this regard include the following. First, it motivates and formalizes the problems of MC-UoS and MC-KUoS learning. Second, it presents algorithms that efficiently learn an MC-UoS or an MC-KUoS underlying data of interest. Third, it extends these algorithms to the case when parts of the data are missing. Last, but not least, it reports the outcomes of a series of numerical experiments involving both synthetic and real data that demonstrate the superiority of the proposed geometric models and learning algorithms over existing approaches in the literature. These experiments also help clarify the connections between this work and the literature on (subspace and kernel k-means) clustering. GitXiv Metric-Constrained Union-of-Subspaces(MC-UoS) Modern information processing relies on the axiom that high-dimensional data lie near low-dimensional geometric structures. This paper revisits the problem of data-driven learning of these geometric structures and puts forth two new nonlinear geometric models for data describing ‘related’ objects/phenomena. The first one of these models straddles the two extremes of the subspace model and the union-of-subspaces model, and is termed the metric-constrained union-of-subspaces (MC-UoS) model. 
The second one of these models—suited for data drawn from a mixture of nonlinear manifolds—generalizes the kernel subspace model, and is termed the metric-constrained kernel union-of-subspaces (MC-KUoS) model. The main contributions of this paper in this regard include the following. First, it motivates and formalizes the problems of MC-UoS and MC-KUoS learning. Second, it presents algorithms that efficiently learn an MC-UoS or an MC-KUoS underlying data of interest. Third, it extends these algorithms to the case when parts of the data are missing. Last, but not least, it reports the outcomes of a series of numerical experiments involving both synthetic and real data that demonstrate the superiority of the proposed geometric models and learning algorithms over existing approaches in the literature. These experiments also help clarify the connections between this work and the literature on (subspace and kernel k-means) clustering. GitXiv MetricGAN Adversarial loss in a conditional generative adversarial network (GAN) is not designed to directly optimize evaluation metrics of a target task, and thus, may not always guide the generator in a GAN to generate data with improved metric scores. To overcome this issue, we propose a novel MetricGAN approach with an aim to optimize the generator with respect to one or multiple evaluation metrics. Moreover, based on MetricGAN, the metric scores of the generated data can also be arbitrarily specified by users. We tested the proposed MetricGAN on a speech enhancement task, which is particularly suitable to verify the proposed approach because there are multiple metrics measuring different aspects of speech signals. Moreover, these metrics are generally complex and could not be fully optimized by Lp or conventional adversarial losses. MetricsGraphics.js MetricsGraphics.js is a library built on top of D3 that is optimized for visualizing and laying out time-series data. 
It provides a simple way to produce common types of graphics in a principled, consistent and responsive way. The library currently supports line charts, scatterplots and histograms as well as features like rug plots and basic linear regression. metricsgraphics Metropolis Adjusted Langevin Algorithm(MALA) The Metropolis-Adjusted Langevin Algorithm (MALA) is a Markov Chain Monte Carlo method which creates a Markov chain reversible with respect to a given target distribution, $\pi^N$, with Lebesgue density on R^N; it can hence be used to approximately sample the target distribution. When the dimension N is large a key question is to determine the computational cost of the algorithm as a function of N. One approach to this question, which we adopt here, is to derive diffusion limits for the algorithm. The family of target measures that we consider in this paper are, in general, in non-product form and are of interest in applied problems as they arise in Bayesian nonparametric statistics and in the study of conditioned diffusions. Furthermore, we study the situation, which arises in practice, where the algorithm is started out of stationarity. We thereby significantly extend previous works which consider either only measures of product form, when the Markov chain is started out of stationarity, or measures defined via a density with respect to a Gaussian, when the Markov chain is started in stationarity. We prove that, in the non-stationary regime, the computational cost of the algorithm is of the order N^(1/2) with dimension, as opposed to what is known to happen in the stationary regime, where the cost is of the order N^(1/3). 
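To make the sampler in this entry more concrete, here is a minimal sketch of one common form of MALA. It assumes a standard Gaussian target on R^N, chosen only for illustration; the target measures studied in the paper are more general. The function names, the step size, and the starting point are hypothetical choices, not taken from the source. Each iteration proposes y = x + (h/2) * grad log pi(x) + sqrt(h) * xi with Gaussian noise xi, and then applies a Metropolis-Hastings accept/reject step so that the chain is reversible with respect to the target.

```python
import math
import random


def log_target(x):
    """Log-density (up to a constant) of a standard Gaussian on R^N (illustrative)."""
    return -0.5 * sum(v * v for v in x)


def grad_log_target(x):
    """Gradient of log_target."""
    return [-v for v in x]


def log_proposal(x_to, x_from, step):
    """Log-density (up to a constant) of the Langevin proposal x_from -> x_to."""
    grad = grad_log_target(x_from)
    sq = sum((t - f - 0.5 * step * g) ** 2 for t, f, g in zip(x_to, x_from, grad))
    return -sq / (2.0 * step)


def mala(x0, step, n_steps, seed=0):
    """Run MALA and return (list of states, acceptance rate)."""
    rng = random.Random(seed)
    x = list(x0)
    chain, accepted = [], 0
    for _ in range(n_steps):
        grad = grad_log_target(x)
        # Langevin proposal: gradient drift plus Gaussian noise.
        y = [xi + 0.5 * step * gi + math.sqrt(step) * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, grad)]
        # Metropolis-Hastings correction for the asymmetric proposal.
        log_alpha = (log_target(y) + log_proposal(x, y, step)
                     - log_target(x) - log_proposal(y, x, step))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = y
            accepted += 1
        chain.append(tuple(x))
    return chain, accepted / n_steps


if __name__ == "__main__":
    # Start away from the mode, i.e. out of stationarity.
    states, rate = mala([5.0, 5.0], step=0.5, n_steps=20_000)
    kept = states[1_000:]  # discard an initial transient
    mean0 = sum(s[0] for s in kept) / len(kept)
    print(f"acceptance rate ~ {rate:.2f}, mean of first coordinate ~ {mean0:.3f}")
```

The step size `step` controls the trade-off between acceptance rate and how far each move travels; how it should scale with the dimension N is the question the abstract above addresses.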
Counterstrike: Defending Deep Learning Architectures Against Adversarial Samples by Langevin Dynamics with Supervised Denoising Autoencoder Metropolis-Hastings Algorithm In statistics and in statistical physics, the Metropolis-Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult. This sequence can be used to approximate the distribution (i.e., to generate a histogram), or to compute an integral (such as an expected value). Metropolis-Hastings and other MCMC algorithms are generally used for sampling from multi-dimensional distributions, especially when the number of dimensions is high. For single-dimensional distributions, other methods are usually available (e.g. adaptive rejection sampling) that can directly return independent samples from the distribution, and are free from the problem of auto-correlated samples that is inherent in MCMC methods. http://…/1504.01896 Metropolis-Hastings Generative Adversarial Network(MH-GAN) We introduce the Metropolis-Hastings generative adversarial network (MH-GAN), which combines aspects of Markov chain Monte Carlo and GANs. The MH-GAN draws samples from the distribution implicitly defined by a GAN’s discriminator-generator pair, as opposed to sampling in a standard GAN which draws samples from the distribution defined by the generator. It uses the discriminator from GAN training to build a wrapper around the generator for improved sampling. With a perfect discriminator, this wrapped generator samples from the true distribution on the data exactly even when the generator is imperfect. We demonstrate the benefits of the improved generator on multiple benchmark datasets, including CIFAR-10 and CelebA, using DCGAN and WGAN. Metropolis-within-Gibbs(MWG) Sampling from lattice Gaussian distribution has emerged as an important problem in coding, decoding and cryptography. 
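The accept/reject step shared by the Metropolis-Hastings algorithm described above and by Metropolis-within-Gibbs schemes, which apply it to one coordinate at a time, can be sketched as follows. This is a minimal random-walk variant with a symmetric Gaussian proposal, so the proposal densities cancel in the acceptance ratio. The target (an unnormalised Gaussian with mean 3), the function names, and the tuning values are assumptions made for illustration only.

```python
import math
import random


def log_density(x):
    """Unnormalised log-density of the illustrative target, a N(3, 1) Gaussian."""
    return -0.5 * (x - 3.0) ** 2


def metropolis_hastings(log_p, x0, proposal_sd, n_steps, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    rng = random.Random(seed)
    x, log_px = x0, log_p(x0)
    samples = []
    for _ in range(n_steps):
        candidate = x + rng.gauss(0.0, proposal_sd)  # symmetric proposal
        log_pc = log_p(candidate)
        # Accept with probability min(1, p(candidate) / p(x)).
        if rng.random() < math.exp(min(0.0, log_pc - log_px)):
            x, log_px = candidate, log_pc
        samples.append(x)  # a rejected move repeats the current state
    return samples


if __name__ == "__main__":
    draws = metropolis_hastings(log_density, x0=0.0, proposal_sd=2.0, n_steps=20_000)
    kept = draws[1_000:]  # discard an initial transient
    print(f"estimated E[X] ~ {sum(kept) / len(kept):.3f}")
```

Only ratios of the target density are needed, so the normalising constant never has to be computed. Consecutive draws are correlated, which is the auto-correlation issue mentioned above.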
In this paper, the classic Gibbs algorithm from Markov chain Monte Carlo (MCMC) methods is demonstrated to be geometrically ergodic for lattice Gaussian sampling, which means the Markov chain arising from it converges exponentially fast to the stationary distribution. Meanwhile, the exponential convergence rate of the Markov chain is also derived through the spectral radius of the forward operator. Then, a comprehensive analysis of the convergence rate is carried out and two sampling schemes are proposed to further enhance the convergence performance. The first one, referred to as the Metropolis-within-Gibbs (MWG) algorithm, improves the convergence by refining the state space of the univariate sampling. On the other hand, the blocked strategy of the Gibbs algorithm, which performs the sampling over multiple variables at each Markov move, is also shown to yield a better convergence rate than the traditional univariate sampling. In order to perform blocked sampling efficiently, the Gibbs-Klein (GK) algorithm is proposed, which samples block by block using Klein’s algorithm. Furthermore, the validity of the GK algorithm is demonstrated by showing its ergodicity. Simulation results based on MIMO detection are presented to confirm the convergence gain brought by the proposed Gibbs sampling schemes. Metropolized Knockoff Sampling Model-X knockoffs is a wrapper that transforms essentially any feature importance measure into a variable selection algorithm, which discovers true effects while rigorously controlling the expected fraction of false positives. A frequently discussed challenge in applying this method is to construct knockoff variables, which are synthetic variables obeying a crucial exchangeability property with the explanatory variables under study. This paper introduces techniques for knockoff generation in great generality: we provide a sequential characterization of all possible knockoff distributions, which leads to a Metropolis-Hastings formulation of an exact knockoff sampler. 
We further show how to use conditional independence structure to speed up computations. Combining these two threads, we introduce an explicit set of sequential algorithms and empirically demonstrate their effectiveness. Our theoretical analysis proves that our algorithms achieve near-optimal computational complexity in certain cases. The techniques we develop are sufficiently rich to enable knockoff sampling in challenging models including cases where the covariates are continuous and heavy-tailed, and follow a graphical model such as the Ising model. Metzler Matrix In mathematics, a Metzler matrix is a matrix in which all the off-diagonal components are nonnegative (equal to or greater than zero). It is named after the American economist Lloyd Metzler. Metzler matrices appear in stability analysis of time delayed differential equations and positive linear dynamical systems. Their properties can be derived by applying the properties of nonnegative matrices to matrices of the form M + aI where M is a Metzler matrix. MFCMT Discriminative Correlation Filters (DCF)-based tracking algorithms exploiting conventional handcrafted features have achieved impressive results both in terms of accuracy and robustness. Template handcrafted features have shown excellent performance, but they perform poorly when the appearance of target changes rapidly such as fast motions and fast deformations. In contrast, statistical handcrafted features are insensitive to fast states changes, but they yield inferior performance in the scenarios of illumination variations and background clutters. In this work, to achieve an efficient tracking performance, we propose a novel visual tracking algorithm, named MFCMT, based on a complementary ensemble model with multiple features, including Histogram of Oriented Gradients (HOGs), Color Names (CNs) and Color Histograms (CHs). 
Additionally, to improve tracking results and prevent target drift, we introduce an effective fusion method by exploiting relative entropy to coalesce all basic response maps and get an optimal response. Furthermore, we suggest a simple but efficient update strategy to boost tracking performance. Comprehensive evaluations conducted on two tracking benchmarks demonstrate that our method is competitive with numerous state-of-the-art trackers. Our tracker achieves impressive performance with faster speed on these benchmarks. MF-MI-Greedy How can we efficiently gather information to optimize an unknown function, when presented with multiple, mutually dependent information sources with different costs? For example, when optimizing a robotic system, intelligently trading off computer simulations and real robot tests can lead to significant savings. Existing methods, such as multi-fidelity GP-UCB or Entropy Search-based approaches, either make simplistic assumptions on the interaction among different fidelities or use simple heuristics that lack theoretical guarantees. In this paper, we study multi-fidelity Bayesian optimization with complex structural dependencies among multiple outputs, and propose MF-MI-Greedy, a principled algorithmic framework for addressing this problem. In particular, we model different fidelities using additive Gaussian processes based on shared latent structures with the target function. Then we use cost-sensitive mutual information gain for efficient Bayesian global optimization. We propose a simple notion of regret which incorporates the cost of different fidelities, and prove that MF-MI-Greedy achieves low regret. We demonstrate the strong empirical performance of our algorithm on both synthetic and real-world datasets. MG-WFBP Distributed synchronous stochastic gradient descent has been widely used to train deep neural networks on computer clusters. 
With the increase of computational power, network communications have become one limiting factor on system scalability. In this paper, we observe that many deep neural networks have a large number of layers with only a small amount of data to be communicated. Based on the fact that merging some short communication tasks into a single one may reduce the overall communication time, we formulate an optimization problem to minimize the training iteration time. We develop an optimal solution named merged-gradient WFBP (MG-WFBP) and implement it in our open-source deep learning platform B-Caffe. Our experimental results on an 8-node GPU cluster with 10GbE interconnect and trace-based simulation results on a 64-node cluster both show that the MG-WFBP algorithm can achieve much better scaling efficiency than existing methods WFBP and SyncEASGD. MIaS Digital mathematical libraries (DMLs) such as arXiv, Numdam, and EuDML contain mainly documents from STEM fields, where mathematical formulae are often more important than text for understanding. Conventional information retrieval (IR) systems are unable to represent formulae and they are therefore ill-suited for math information retrieval (MIR). To fill the gap, we have developed, and open-sourced the MIaS MIR system. MIaS is based on the full-text search engine Apache Lucene. On top of text retrieval, MIaS also incorporates a set of tools for preprocessing mathematical formulae. We describe the design of the system and present speed, and quality evaluation results. We show that MIaS is both efficient, and effective, as evidenced by our victory in the NTCIR-11 Math-2 task. Micro-Browsing Model Click-through rate (CTR) is a key signal of relevance for search engine results, both organic and sponsored. CTR of a result has two core components: (a) the probability of examination of a result by a user, and (b) the perceived relevance of the result given that it has been examined by the user. 
There has been considerable work on user browsing models, to model and analyze both the examination and the relevance components of CTR. In this paper, we propose a novel formulation: a micro-browsing model for how users read result snippets. The snippet text of a result often plays a critical role in the perceived relevance of the result. We study how particular words within a line of snippet can influence user behavior. We validate this new micro-browsing user model by considering the problem of predicting which snippet will yield higher CTR, and show that classification accuracy is dramatically higher with our micro-browsing user model. The key insight in this paper is that varying relatively few words within a snippet, and even their location within a snippet, can have a significant influence on the clickthrough of a snippet. Micro-Macro Multilevel Modeling MicroMacroMultilevel Micro-Objective Reinforcement Learning The standard reinforcement learning (RL) formulation considers the expectation of the (discounted) cumulative reward. This is limiting in applications where we are concerned with not only the expected performance, but also the distribution of the performance. In this paper, we introduce micro-objective reinforcement learning — an alternative RL formalism that overcomes this issue. In this new formulation, a RL task is specified by a set of micro-objectives, which are constructs that specify the desirability or undesirability of events. In addition, micro-objectives allow prior knowledge in the form of temporal abstraction to be incorporated into the global RL objective. The generality of this formalism, and its relations to single/multi-objective RL, and hierarchical RL are discussed. Microsoft Academic Graph(MAG) We present the design and methodology for the large scale hybrid paper recommender system used by Microsoft Academic. The system provides recommendations for approximately 160 million English research papers and patents. 
Our approach handles incomplete citation information while also alleviating the cold-start problem that often affects other recommender systems. We use the Microsoft Academic Graph (MAG), titles, and available abstracts of research papers to build a recommendation list for all documents, thereby combining co-citation and content based approaches. Tuning system parameters also allows for blending and prioritization of each approach which, in turn, allows us to balance paper novelty versus authority in recommendation results. We evaluate the generated recommendations via a user study of 40 participants, with over 2400 recommendation pairs graded and discuss the quality of the results using P@10 and nDCG scores. We see that there is a strong correlation between participant scores and the similarity rankings produced by our system but that additional focus needs to be put towards improving recommender precision, particularly for content based recommendations. The results of the user survey and associated analysis scripts are made available via GitHub and the recommendations produced by our system are available as part of the MAG on Azure to facilitate further research and light up novel research paper recommendation applications. Microsoft Machine Learning for Apache Spark(MMLSpark) We introduce Microsoft Machine Learning for Apache Spark (MMLSpark), an ecosystem of enhancements that expand the Apache Spark distributed computing library to tackle problems in Deep Learning, Micro-Service Orchestration, Gradient Boosting, Model Interpretability, and other areas of modern computation. Furthermore, we present a novel system called Spark Serving that allows users to run any Apache Spark program as a distributed, sub-millisecond latency web service backed by their existing Spark Cluster. 
All MMLSpark contributions have the same API to enable simple composition across frameworks and usage across batch, streaming, and RESTful web serving scenarios on static, elastic, or serverless clusters. We showcase MMLSpark by creating a method for deep object detection capable of learning without human labeled data and demonstrate its effectiveness for Snow Leopard conservation. Microsoft Project Oxford Set of technologies dubbed Project Oxford that allows developers to create smarter apps, which can do things like recognize faces and interpret natural language even if the app developers are not experts in those fields. ‘If you are an app developer, you could just take the API capabilities and not worry about the machine learning aspect,’ said Vijay Vokkaarne, a principal group program manager with Bing, whose team is working on the speech aspect of Project Oxford. MIDA We consider the problem of identifying intermediate variables (or mediators) that regulate the effect of a treatment on a response variable. While there has been significant research on this topic, little work has been done when the set of potential mediators is high-dimensional and when they are interrelated. In particular, we assume that the causal structure of the treatment, the potential mediators and the response is a directed acyclic graph (DAG). High-dimensional DAG models have previously been used for the estimation of causal effects from observational data and methods called IDA and joint-IDA have been developed for estimating the effects of single interventions and multiple simultaneous interventions respectively. In this paper, we propose an IDA-type method called MIDA for estimating mediation effects from high-dimensional observational data. Although IDA and joint-IDA estimators have been shown to be consistent in certain sparse high-dimensional settings, their asymptotic properties such as convergence in distribution and inferential tools in such settings remained unknown. 
We prove high-dimensional consistency of MIDA for linear structural equation models with sub-Gaussian errors. More importantly, we derive distributional convergence results for MIDA in similar high-dimensional settings, which are applicable to IDA and joint-IDA estimators as well. To the best of our knowledge, these are the first distributional convergence results facilitating inference for IDA-type estimators. These results have been built on our novel theoretical results regarding uniform bounds for linear regression estimators over varying subsets of high-dimensional covariates, which may be of independent interest. Finally, we empirically validate our asymptotic theory and demonstrate the usefulness of MIDA in identifying large mediation effects via simulations and application to real data in genomics. MILABOT We present MILABOT: a deep reinforcement learning chatbot developed by the Montreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize competition. MILABOT is capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language generation and retrieval models, including neural network and template-based models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than other systems. The results highlight the potential of coupling ensemble systems with deep reinforcement learning as a fruitful path for developing real-world, open-domain conversational agents. Miller-Hagberg Algorithm We present an efficient algorithm to generate random graphs with a given sequence of expected degrees. Existing algorithms run in $O(N^2)$ time where N is the number of nodes. 
We prove that our algorithm runs in $O(N+M)$ expected time where M is the expected number of edges. If the expected degrees are chosen from a distribution with finite mean, this is $O(N)$ as $N \to \infty$. MiMatrix In this paper, we present a co-designed petascale high-density GPU cluster to expedite distributed deep learning training with synchronous Stochastic Gradient Descent (SSGD). The architecture of our heterogeneous cluster is inspired by the Harvard architecture. Nodes are configured with different specifications according to their roles in the system. Based on the topology of the whole system’s network and properties of different types of nodes, we develop and implement a novel job server parallel software framework, named ‘MiMatrix’, for distributed deep learning training. Compared to the parameter server framework, in which the parameter server is a bottleneck of data transfer in the AllReduce algorithm of SSGD, the job server undertakes all of the controlling, scheduling and monitoring tasks without model data transfer. In MiMatrix, we propose a novel GPUDirect Remote Direct Memory Access (RDMA)-aware parallel AllReduce algorithm executed by computing servers, in which both computation and handshake messages are $O(1)$ at each epoch. Min.Max Algorithm This paper focuses on modeling violent crime rates against population over the years 1960-2014 for the United States via a cubic spline based method. We propose a new min/max algorithm on knots detection and estimation for cubic spline regression. We employ least squares estimation to find potential regression coefficients based upon the cubic spline model and the knots chosen by the min/max algorithm. We then utilize the best subsets regression method to aid in model selection in which we find the minimum value of the Bayesian Information Criterion. Finally, we report the $R_{adj}^{2}$ as a measure of overall goodness-of-fit of our selected model. 
Among the fifty states and Washington D.C., we have found 42 out of 51 with $R_{adj}^{2}$ value that was greater than $90\%$. We also present an overall model for the United States as a whole. Our method can serve as a unified model for violent crime rate over future years. Mined Semantic Analysis(MSA) Mined Semantic Analysis (MSA) is a novel distributional semantics approach which employs data mining techniques. MSA embraces knowledge-driven analysis of natural languages. It uncovers implicit relations between concepts by mining for their associations in target encyclopedic corpora. MSA exploits not only target corpus content but also its knowledge graph (e.g., ‘See also’ link graph of Wikipedia). Empirical results show competitive performance of MSA compared to prior state-of-the-art methods for measuring semantic relatedness on benchmark data sets. Additionally, we introduce the first analytical study to examine statistical significance of results reported by different semantic relatedness methods. Our study shows that, top performing results could be statistically equivalent though mathematically different. The study positions MSA as one of state-of-the-art methods for measuring semantic relatedness. MineRL Competition Though deep reinforcement learning has led to breakthroughs in many difficult domains, these successes have required an ever-increasing number of samples. As state-of-the-art reinforcement learning (RL) systems require an exponentially increasing number of samples, their development is restricted to a continually shrinking segment of the AI community. Likewise, many of these systems cannot be applied to real-world problems, where environment samples are expensive. Resolution of these limitations requires new, sample-efficient methods. To facilitate research in this direction, we introduce the MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors. 
The primary goal of the competition is to foster the development of algorithms which can efficiently leverage human demonstrations to drastically reduce the number of samples needed to solve complex, hierarchical, and sparse environments. To that end, we introduce: (1) the Minecraft ObtainDiamond task, a sequential decision making environment requiring long-term planning, hierarchical control, and efficient exploration methods; and (2) the MineRL-v0 dataset, a large-scale collection of over 60 million state-action pairs of human demonstrations that can be resimulated into embodied trajectories with arbitrary modifications to game state and visuals. Participants will compete to develop systems which solve the ObtainDiamond task with a limited number of samples from the environment simulator, Malmo. The competition is structured into two rounds in which competitors are provided several paired versions of the dataset and environment with different game textures. At the end of each round, competitors will submit containerized versions of their learning algorithms and they will then be trained/evaluated from scratch on a hold-out dataset-environment pair for a total of 4-days on a prespecified hardware platform. Miniature Atari(MinAtar) The Arcade Learning Environment (ALE) is a popular platform for evaluating reinforcement learning agents. Much of the appeal comes from the fact that Atari games are varied, showcase aspects of competency we expect from an intelligent agent, and are not biased towards any particular solution approach. The challenge of the ALE includes 1) the representation learning problem of extracting pertinent information from the raw pixels, and 2) the behavioural learning problem of leveraging complex, delayed associations between actions and rewards. Often, in reinforcement learning research, we care more about the latter, but the representation learning problem adds significant computational expense. 
In response, we introduce MinAtar, short for miniature Atari, a new evaluation platform that captures the general mechanics of specific Atari games, while simplifying certain aspects. In particular, we reduce the representational complexity to focus more on behavioural challenges. MinAtar consists of analogues to five Atari games which play out on a 10×10 grid. MinAtar provides a 10x10xn state representation. The n channels correspond to game-specific objects, such as ball, paddle and brick in the game Breakout. While significantly simplified, these domains are still rich enough to allow for interesting behaviours. To demonstrate the challenges posed by these domains, we evaluated a smaller version of the DQN architecture. We also tried variants of DQN without experience replay, and without a target network, to assess the impact of those two prominent components in the MinAtar environments. In addition, we evaluated a simpler agent that used actor-critic with eligibility traces, online updating, and no experience replay. We hope that by introducing a set of simplified, Atari-like games we can allow researchers to more efficiently investigate the unique behavioural challenges provided by the ALE. Mini-Batch AUC Optimization(MBA) Area under the receiver operating characteristics curve (AUC) is an important metric for a wide range of signal processing and machine learning problems, and scalable methods for optimizing AUC have recently been proposed. However, handling very large datasets remains an open challenge for this problem. This paper proposes a novel approach to AUC maximization, based on sampling mini-batches of positive/negative instance pairs and computing U-statistics to approximate a global risk minimization problem. The resulting algorithm is simple, fast, and learning-rate free. 
We show that the number of samples required for good performance is independent of the number of pairs available, which is a quadratic function of the positive and negative instances. Extensive experiments show the practical utility of the proposed method. Mini-batch Tempered MCMC(MINT-MCMC) In this paper we propose a general framework of performing MCMC with only a mini-batch of data. We show that by estimating the Metropolis-Hastings ratio with only a mini-batch of data, one is essentially sampling from the true posterior raised to a known temperature. We show by experiments that our method, Mini-batch Tempered MCMC (MINT-MCMC), can efficiently explore multiple modes of a posterior distribution. As an application, we demonstrate MINT-MCMC as an inference tool for Bayesian neural networks. We also show that a cyclic version of our algorithm can be applied to build an ensemble of neural networks with little additional training cost. Minimal Achievable Sufficient Statistic(MASS) We introduce Minimal Achievable Sufficient Statistic (MASS) Learning, a training method for machine learning models that attempts to produce minimal sufficient statistics with respect to a class of functions (e.g. deep networks) being optimized over. In deriving MASS Learning, we also introduce Conserved Differential Information (CDI), an information-theoretic quantity that – unlike standard mutual information – can be usefully applied to deterministically-dependent continuous random variables like the input and output of a deep network. In a series of experiments, we show that deep networks trained with MASS Learning achieve competitive performance on supervised learning, regularization, and uncertainty quantification benchmarks. Minimal Support Vector Machine(Minimal SVM) Support Vector Machine (SVM) is an efficient classification approach, which finds a hyperplane to separate data from different classes. This hyperplane is determined by support vectors. 
In existing SVM formulations, the objective function uses L2 norm or L1 norm on slack variables. The number of support vectors is a measure of generalization errors. In this work, we propose a Minimal SVM, which uses L0.5 norm on slack variables. The resulting model further reduces the number of support vectors and increases the classification performance. Minimally Sufficient Statistic In using a statistic to estimate a parameter in a probability distribution, it is important to remember that there can be multiple sufficient statistics for the same parameter. Indeed, the entire data set, X1 … Xn, can be a sufficient statistic – it certainly contains all of the information that is needed to estimate the parameter. However, using all n variables is not very satisfying as a sufficient statistic, because it doesn’t reduce the information in any meaningful way – and a more compact, concise statistic is better than a complicated, multi-dimensional statistic. If we can use a lower-dimensional statistic that still contains all necessary information for estimating the parameter, then we have truly reduced our data set without stripping any value from it. Minimax Concave Penalty(MCP) regnet Minimax Entropy(MME) Contemporary domain adaptation methods are very effective at aligning feature distributions of source and target domains without any target supervision. However, we show that these techniques perform poorly when even a few labeled examples are available in the target. To address this semi-supervised domain adaptation (SSDA) setting, we propose a novel Minimax Entropy (MME) approach that adversarially optimizes an adaptive few-shot model. Our base model consists of a feature encoding network, followed by a classification layer that computes the features’ similarity to estimated prototypes (representatives of each class). 
Adaptation is achieved by alternately maximizing the conditional entropy of unlabeled target data with respect to the classifier and minimizing it with respect to the feature encoder. We empirically demonstrate the superiority of our method over many baselines, including conventional feature alignment and few-shot methods, setting a new state of the art for SSDA. MiniMax Entropy Network(MMEN) How to effectively learn from unlabeled data from the target domain is crucial for domain adaptation, as it helps reduce the large performance gap due to domain shift or distribution change. In this paper, we propose an easy-to-implement method dubbed MiniMax Entropy Networks (MMEN) based on adversarial learning. Unlike most existing approaches which employ a generator to deal with domain difference, MMEN focuses on learning the categorical information from unlabeled target samples with the help of labeled source samples. Specifically, we set an unfair multi-class classifier named categorical discriminator, which classifies source samples accurately but is confused about the categories of target samples. The generator learns a common subspace that aligns the unlabeled samples based on the target pseudo-labels. For MMEN, we also provide theoretical explanations to show that the learning of feature alignment reduces domain mismatch at the category level. Experimental results on various benchmark datasets demonstrate the effectiveness of our method over existing state-of-the-art baselines. Minimax Regularization The classical approach to regularization is to design norms enhancing smoothness or sparsity and then to use this norm or some power of this norm as a regularization function. The choice of the regularization function (for instance a power function) in terms of the norm is mostly dictated by computational purpose rather than theoretical considerations. In this work, we design regularization functions that are motivated by theoretical arguments. 
To that end we introduce a concept of optimal regularization called ‘minimax regularization’ and, as a proof of concept, we show how to construct such a regularization function for the $\ell_1^d$ norm for the random design setup. We develop a similar construction for the deterministic design setup. It appears that the resulting regularized procedures are different from the one used in the LASSO in both setups. Minimizing Approximated Information Criteria(MIC) coxphMIC Minimum Correlation Regularization In social networks, heterogeneous multimedia data correlate to each other, such as videos and their corresponding tags in YouTube and image-text pairs in Facebook. Nearest neighbor retrieval across multiple modalities on large data sets becomes a hot yet challenging problem. Hashing is expected to be an efficient solution, since it represents data as binary codes. As the bit-wise XOR operations can be fast handled, the retrieval time is greatly reduced. Few existing multimodal hashing methods consider the correlation among hashing bits. The correlation has negative impact on hashing codes. When the hashing code length becomes longer, the retrieval performance improvement becomes slower. In this paper, we propose a minimum correlation regularization (MCR) for multimodal hashing. First, the sigmoid function is used to embed the data matrices. Then, the MCR is applied on the output of sigmoid function. As the output of sigmoid function approximates a binary code matrix, the proposed MCR can efficiently decorrelate the hashing codes. Experiments show the superiority of the proposed method becomes greater as the code length increases. Minimum Description Length(MDL) Given data over variables $(X_1,…,X_m, Y)$ we consider the problem of finding out whether $X$ jointly causes $Y$ or whether they are all confounded by an unobserved latent variable $Z$. To do so, we take an information-theoretic approach based on Kolmogorov complexity. 
In a nutshell, we follow the postulate that first encoding the true cause, and then the effects given that cause, results in a shorter description than any other encoding of the observed variables. The ideal score is not computable, and hence we have to approximate it. We propose to do so using the Minimum Description Length (MDL) principle. We compare the MDL scores under the models where $X$ causes $Y$ and where there exists a latent variable $Z$ confounding both $X$ and $Y$ and show our scores are consistent. To find potential confounders we propose using latent factor modeling, in particular, probabilistic PCA (PPCA). Empirical evaluation on both synthetic and real-world data shows that our method, CoCa, performs very well — even when the true generating process of the data is far from the assumptions made by the models we use. Moreover, it is robust as its accuracy goes hand in hand with its confidence. Minimum Description Length Principle(MDL) The minimum description length (MDL) principle is a formalization of Occam’s razor in which the best hypothesis for a given set of data is the one that leads to the best compression of the data. MDL was introduced by Jorma Rissanen in 1978. It is an important concept in information theory and computational learning theory. Minimum Error Entropy Kalman Filter(MEE-KF) To date most linear and nonlinear Kalman filters (KFs) have been developed under the Gaussian assumption and the well-known minimum mean square error (MMSE) criterion. In order to improve the robustness with respect to impulsive (or heavy-tailed) non-Gaussian noises, the maximum correntropy criterion (MCC) has recently been used to replace the MMSE criterion in developing several robust Kalman-type filters. 
To deal with more complicated non-Gaussian noises such as noises from multimodal distributions, in the present paper we develop a new Kalman-type filter, called minimum error entropy Kalman filter (MEE-KF), by using the minimum error entropy (MEE) criterion instead of the MMSE or MCC. Similar to the MCC based KFs, the proposed filter is also an online algorithm with recursive process, in which the propagation equations are used to give prior estimates of the state and covariance matrix, and a fixed-point algorithm is used to update the posterior estimates. In addition, the minimum error entropy extended Kalman filter (MEE-EKF) is also developed for performance improvement in the nonlinear situations. The high accuracy and strong robustness of MEE-KF and MEE-EKF are confirmed by experimental results. Minimum Incremental Coding Length(MICL) We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We demonstrate the asymptotic optimality of this criterion for Gaussian distributions and analyze its relationships to classical classifiers. The theoretical results clarify the connections between our approach and popular classifiers such as maximum a posteriori (MAP), regularized discriminant analysis (RDA), $k$-nearest neighbor ($k$-NN), and support vector machine (SVM), as well as unsupervised methods based on lossy coding. Our formulation induces several good effects on the resulting classifier. First, minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small sample setting. Second, compression provides a uniform means of handling classes of varying dimension. 
The new criterion and its kernel and local versions perform competitively on synthetic examples, as well as on real imagery data such as handwritten digits and face images. On these problems, the performance of our simple classifier approaches the best reported results, without using domain-specific information. Minimum Spanning Tree(MST) Given a connected, undirected graph, a spanning tree of that graph is a subgraph that is a tree and connects all the vertices together. A single graph can have many different spanning trees. We can also assign a weight to each edge, which is a number representing how unfavorable it is, and use this to assign a weight to a spanning tree by computing the sum of the weights of the edges in that spanning tree. A minimum spanning tree (MST) or minimum weight spanning tree is then a spanning tree with weight less than or equal to the weight of every other spanning tree. More generally, any undirected graph (not necessarily connected) has a minimum spanning forest, which is a union of minimum spanning trees for its connected components. http://…/43mst http://…/t0000021.pdf Mining High Utility Itemset using PUN-Lists(MIP) In this paper, we propose a novel data structure called PUN-list, which maintains both the utility information about an itemset and utility upper bound for facilitating the processing of mining high utility itemsets. Based on PUN-lists, we present a method, called MIP (Mining high utility Itemset using PUN-Lists), for fast mining of high utility itemsets. The efficiency of MIP is achieved with three techniques. First, itemsets are represented by a highly condensed data structure, PUN-list, which avoids costly, repeated utility computation. Second, the utility of an itemset can be efficiently calculated by scanning the PUN-list of the itemset, and the PUN-lists of long itemsets can be quickly constructed from the PUN-lists of short itemsets. 
Third, by employing the utility upper bound lying in the PUN-lists as the pruning strategy, MIP directly discovers high utility itemsets from the search space, called set-enumeration tree, without generating numerous candidates. Extensive experiments on various synthetic and real datasets show that PUN-list is very effective since MIP is at least an order of magnitude faster than recently reported algorithms on average. Minka’s Expectation Propagation Minkowski Distance The Minkowski distance is a metric on Euclidean space which can be considered as a generalization of both the Euclidean distance and the Manhattan distance. Minkowski Weighted K-Means(MWK-Means) This paper represents another step in overcoming a drawback of K-Means, its lack of defense against noisy features, using feature weights in the criterion. The Weighted K-Means method by Huang et al. (2008, 2004, 2005) is extended to the corresponding Minkowski metric for measuring distances. Under Minkowski metric the feature weights become intuitively appealing feature rescaling factors in a conventional K-Means criterion. To see how this can be used in addressing another issue of K-Means, the initial setting, a method to initialize K-Means with anomalous clusters is adapted. The Minkowski metric based method is experimentally validated on datasets from the UCI Machine Learning Repository and generated sets of Gaussian clusters, both as they are and with additional uniform random noise features, and appears to be competitive in comparison with other K-Means based feature weighting algorithms. The problem we are tracking here relates to the fact that K-Means treats all features in a dataset as if they had the same degree of relevance. However, we do know that in most datasets different features will have different degrees of relevance. It is not just a matter of feature selection (in which we say: features a and b are relevant but c isn’t), but of feature weighting. 
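As a small illustration of the Minkowski distance entry above, the following sketch computes the distance of order p between two points. The function name `minkowski_distance` and the choice to reject p < 1 are assumptions made for this example, not part of any cited source. With p = 1 it reduces to the Manhattan distance and with p = 2 to the Euclidean distance.

```python
def minkowski_distance(x, y, p=2.0):
    """Minkowski distance of order p between two equal-length sequences.

    Hypothetical helper for illustration: p = 1 gives the Manhattan
    distance, p = 2 gives the Euclidean distance.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if p < 1:
        # For p < 1 the triangle inequality fails, so it is not a metric.
        raise ValueError("p must be >= 1")
    # Sum the p-th powers of coordinate differences, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


if __name__ == "__main__":
    a, b = [0.0, 0.0], [3.0, 4.0]
    print(minkowski_distance(a, b, p=1))  # Manhattan distance
    print(minkowski_distance(a, b, p=2))  # Euclidean distance
```

Larger values of p put more emphasis on the coordinate with the largest difference.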
Min-Max Scaling An alternative approach to Z-score normalization (or standardization) is the so-called Min-Max scaling (often also simply called “normalization” – a common cause for ambiguities). In this approach, the data is scaled to a fixed range – usually 0 to 1. MinNorm Training In this work, we propose a new training method for finding minimum weight norm solutions in over-parameterized neural networks (NNs). This method seeks to improve training speed and generalization performance by framing NN training as a constrained optimization problem wherein the sum of the norm of the weights in each layer of the network is minimized, under the constraint of exactly fitting training data. It draws inspiration from support vector machines (SVMs), which are able to generalize well, despite often having an infinite number of free parameters in their primal form, and from recent theoretical generalization bounds on NNs which suggest that lower norm solutions generalize better. To solve this constrained optimization problem, our method employs Lagrange multipliers that act as integrators of error over training and identify ‘support vector’-like examples. The method can be implemented as a wrapper around gradient based methods and uses standard back-propagation of gradients from the NN for both regression and classification versions of the algorithm. We provide theoretical justifications for the effectiveness of this algorithm in comparison to early stopping and $L_2$-regularization using simple, analytically tractable settings. In particular, we show faster convergence to the max-margin hyperplane in a shallow network (compared to vanilla gradient descent); faster convergence to the minimum-norm solution in a linear chain (compared to $L_2$-regularization); and initialization-independent generalization performance in a deep linear network. 
Finally, using the MNIST dataset, we demonstrate that this algorithm can boost test accuracy and identify difficult examples in real-world datasets. MinoanER Entity Resolution (ER) aims to identify different descriptions in various Knowledge Bases (KBs) that refer to the same entity. ER is challenged by the Variety, Volume and Veracity of entity descriptions published in the Web of Data. To address them, we propose the MinoanER framework that simultaneously fulfills full automation, support of highly heterogeneous entities, and massive parallelization of the ER process. MinoanER leverages a token-based similarity of entities to define a new metric that derives the similarity of neighboring entities from the most important relations, as they are indicated only by statistics. A composite blocking method is employed to capture different sources of matching evidence from the content, neighbors, or names of entities. The search space of candidate pairs for comparison is compactly abstracted by a novel disjunctive blocking graph and processed by a non-iterative, massively parallel matching algorithm that consists of four generic, schema-agnostic matching rules that are quite robust with respect to their internal configuration. We demonstrate that the effectiveness of MinoanER is comparable to existing ER tools over real KBs exhibiting low Variety, but it outperforms them significantly when matching KBs with high Variety. MINT We propose a test of independence of two multivariate random vectors, given a sample from the underlying population. Our approach, which we call MINT, is based on the estimation of mutual information, whose decomposition into joint and marginal entropies facilitates the use of recently-developed efficient entropy estimators derived from nearest neighbour distances. 
The proposed critical values, which may be obtained from simulation (in the case where one marginal is known) or resampling, guarantee that the test has nominal size, and we provide local power analyses, uniformly over classes of densities whose mutual information satisfies a lower bound. Our ideas may be extended to provide new goodness-of-fit tests of normal linear models based on assessing the independence of our vector of covariates and an appropriately-defined notion of an error vector. The theory is supported by numerical studies on both simulated and real data. Min-Wise Independent Permutations Locality Sensitive Hashing Scheme(MinHash) In computer science, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder (1997), and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words. MirrorGAN Generating an image from a given text description has two goals: visual realism and semantic consistency. Although significant progress has been made in generating high-quality and visually realistic images using generative adversarial networks, guaranteeing semantic consistency between the text description and visual content remains very challenging. In this paper, we address this problem by proposing a novel global-local attentive and semantic-preserving text-to-image-to-text framework called MirrorGAN. MirrorGAN exploits the idea of learning text-to-image generation by redescription and consists of three modules: a semantic text embedding module (STEM), a global-local collaborative attentive module for cascaded image generation (GLAM), and a semantic text regeneration and alignment module (STREAM). STEM generates word- and sentence-level embeddings. 
GLAM has a cascaded architecture for generating target images from coarse to fine scales, leveraging both local word attention and global sentence attention to progressively enhance the diversity and semantic consistency of the generated images. STREAM seeks to regenerate the text description from the generated image, which semantically aligns with the given text description. Thorough experiments on two public benchmark datasets demonstrate the superiority of MirrorGAN over other representative state-of-the-art methods. MisGAN Generative adversarial networks (GANs) have been shown to provide an effective way to model complex distributions and have obtained impressive results on various challenging tasks. However, typical GANs require fully-observed data during training. In this paper, we present a GAN-based framework for learning from complex, high-dimensional incomplete data. The proposed framework learns a complete data generator along with a mask generator that models the missing data distribution. We further demonstrate how to impute missing data by equipping our framework with an adversarially trained imputer. We evaluate the proposed framework using a series of experiments with several types of missing data processes under the missing completely at random assumption. Mislabeled VAE Class labels are often imperfectly observed, due to mistakes and to genuine ambiguity among classes. We propose a new semi-supervised deep generative model that explicitly models noisy labels, called the Mislabeled VAE (M-VAE). The M-VAE can perform better than existing deep generative models which do not account for label noise. Additionally, the derivation of M-VAE gives new theoretical insights into the popular M1+M2 semi-supervised model. Missing Data Encoder(MDE) Image completion is the problem of generating whole images from fragments only. 
It encompasses inpainting (generating a patch given its surrounding), reverse inpainting/extrapolation (generating the periphery given the central patch) as well as colorization (generating one or several channels given other ones). In this paper, we employ a deep network to perform image completion, with adversarial training as well as perceptual and completion losses, and call it the ‘missing data encoder’ (MDE). We consider several configurations based on how the seed fragments are chosen. We show that training MDE for ‘random extrapolation and colorization’ (MDE-REC), i.e. using random channel-independent fragments, allows a better capture of the image semantics and geometry. MDE training makes use of a novel ‘hide-and-seek’ adversarial loss, where the discriminator seeks the original non-masked regions, while the generator tries to hide them. We validate our models both qualitatively and quantitatively on several datasets, showing their interest for image completion, unsupervised representation learning as well as face occlusion handling. Missing Value PC(MVPC) Missing data are ubiquitous in many domains such as healthcare. Depending on how they are missing, the (conditional) independence relations in the observed data may be different from those for the complete data generated by the underlying causal process and, as a consequence, simply applying existing causal discovery methods to the observed data may lead to wrong conclusions. It is then essential to extend existing causal discovery approaches to find true underlying causal structure from such incomplete data. In this paper, we aim at solving this problem for data that are missing with different mechanisms, including missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). 
With missingness mechanisms represented by a missingness graph (m-graph), we analyze conditions under which additional correction is needed to derive conditional independence/dependence relations in the complete data. Based on our analysis, we propose missing value PC (MVPC), which combines additional corrections with a traditional causal discovery algorithm, namely PC. Our proposed MVPC is shown in theory to give asymptotically correct results even using data that are MAR and MNAR. Experimental results illustrate that the proposed algorithm can correct the conditional independence for values MCAR, MAR and rather general cases of values MNAR, both with synthetic data and in a real-life healthcare application. Missing View Imputation with Generative Adversarial Networks(VIGAN) In an era where big data is becoming the norm, we are becoming less concerned with the quantity of the data for our models, but rather the quality. With such large amounts of data collected from multiple heterogeneous sources come associated problems, often missing views. As most models cannot handle the problem of entirely missing views, this poses a significant challenge when conducting any multi-view analysis, especially when used in the context of very large and heterogeneous datasets. However, if dealt with properly, joint learning from these complementary sources can be advantageous. In this work, we present a method for imputing missing views based on generative adversarial networks called VIGAN which combines cross-domain relations given unpaired data with multi-view relations given paired data. In our model, VIGAN first learns a bidirectional mapping between view X and view Y using a cycle-consistent adversarial network. Moreover, we incorporate a denoising multimodal autoencoder to refine the initial approximation by making use of the joint representation. 
Empirical results give evidence indicating VIGAN offers competitive results compared to other methods on both numeric and image data. Missingness-Aware Temporal Convolutional Hitting-time Network(MATCH-Net) Accurate prediction of disease trajectories is critical for early identification and timely treatment of patients at risk. Conventional methods in survival analysis are often constrained by strong parametric assumptions and limited in their ability to learn from high-dimensional data, while existing neural network models are not readily-adapted to the longitudinal setting. This paper develops a novel convolutional approach that addresses these drawbacks. We present MATCH-Net: a Missingness-Aware Temporal Convolutional Hitting-time Network, designed to capture temporal dependencies and heterogeneous interactions in covariate trajectories and patterns of missingness. To the best of our knowledge, this is the first investigation of temporal convolutions in the context of dynamic prediction for personalized risk prognosis. Using real-world data from the Alzheimer’s Disease Neuroimaging Initiative, we demonstrate state-of-the-art performance without making any assumptions regarding underlying longitudinal or time-to-event processes attesting to the model’s potential utility in clinical decision support. Mix and Match(M&M) We introduce Mix&Match (M&M) – a training framework designed to facilitate rapid and effective learning in RL agents, especially those that would be too slow or too challenging to train otherwise. The key innovation is a procedure that allows us to automatically form a curriculum over agents. Through such a curriculum we can progressively train more complex agents by, effectively, bootstrapping from solutions found by simpler agents. In contradistinction to typical curriculum learning approaches, we do not gradually modify the tasks or environments presented, but instead use a process to gradually alter how the policy is represented internally. 
We show the broad applicability of our method by demonstrating significant performance gains in three different experimental setups: (1) We train an agent able to control more than 700 actions in a challenging 3D first-person task; using our method to progress through an action-space curriculum we achieve both faster training and better final performance than one obtains using traditional methods. (2) We further show that M&M can be used successfully to progress through a curriculum of architectural variants defining an agent’s internal state. (3) Finally, we illustrate how a variant of our method can be used to improve agent performance in a multitask setting. Mixed Data Frame(MDF) A mixed data frame (MDF) is a table collecting categorical, numerical and count observations. The use of MDF is widespread in statistics and the applications are numerous from abundance data in ecology to recommender systems. In many cases, an MDF exhibits simultaneously main effects, such as row, column or group effects and interactions, for which a low-rank model has often been suggested. MIXed data Multilevel Anomaly Detection(MIXMAD) Anomalies are those deviating from the norm. Unsupervised anomaly detection often translates to identifying low density regions. Major problems arise when data is high-dimensional and mixed of discrete and continuous attributes. We propose MIXMAD, which stands for MIXed data Multilevel Anomaly Detection, an ensemble method that estimates the sparse regions across multiple levels of abstraction of mixed data. The hypothesis is that, in domains where multiple data abstractions exist, a data point may be anomalous with respect to the raw representation or more abstract representations. To this end, our method sequentially constructs an ensemble of Deep Belief Nets (DBNs) with varying depths. Each DBN is an energy-based detector at a predefined abstraction level. 
At the bottom level of each DBN, there is a Mixed-variate Restricted Boltzmann Machine that models the density of mixed data. Predictions across the ensemble are finally combined via rank aggregation. The proposed MIXMAD is evaluated on high-dimensional real-world datasets of different characteristics. The results demonstrate that for anomaly detection, (a) multilevel abstraction of high-dimensional and mixed data is a sensible strategy, and (b) empirically, MIXMAD is superior to popular unsupervised detection methods for both homogeneous and mixed data. Mixed Formal Learning This paper presents Mixed Formal Learning, a new architecture that learns models based on formal mathematical representations of the domain of interest and exposes latent variables. The second element in the architecture learns a particular skill, typically by using traditional prediction or classification mechanisms. Our key findings include that this architecture: (1) Facilitates transparency by exposing key latent variables based on a learned mathematical model; (2) Enables Low Shot and Zero Shot training of machine learning without sacrificing accuracy or recall. Mixed Markov Models(MMM) Markov random fields can encode complex probabilistic relationships involving multiple variables and admit efficient procedures for probabilistic inference. However, from a knowledge engineering point of view, these models suffer from a serious limitation. The graph of a Markov field must connect all pairs of variables that are conditionally dependent even for a single choice of values of the other variables. This makes it hard to encode interactions that occur only in a certain context and are absent in all others. Furthermore, the requirement that two variables be connected unless always conditionally independent may lead to excessively dense graphs, obscuring the independencies present among the variables and leading to computationally prohibitive inference algorithms. 
Mumford proposed an alternative modeling framework where the graph need not be rigid and completely determined a priori. Mixed Markov models contain node-valued random variables that, when instantiated, augment the graph by a set of transient edges. A single joint probability distribution relates the values of regular and node-valued variables. In this article, we study the analytical and computational properties of mixed Markov models. In particular, we show that positive mixed models have a local Markov property that is equivalent to their global factorization. We also describe a computationally efficient procedure for answering probabilistic queries in mixed Markov models. Mixed Membership Models(MMM) … We have reviewed and seen mixture models in detail. And we’ve seen hierarchical models, particularly those that capture nested structure in the data. 1. We will now combine these ideas to form mixed membership models, which are a powerful modeling methodology. 2. The basic ideas are · Data are grouped. · Each group is modeled with a mixture. · The mixture components are shared across all the groups. · The mixture proportions vary from group to group. … mixedMem Mixed Neighbourhood Selection(MNS) MNS Mixed Variational Inference The Laplace approximation has been one of the workhorses of Bayesian inference. It often delivers good approximations in practice despite the fact that it does not strictly take into account where the volume of posterior density lies. Variational approaches avoid this issue by explicitly minimising the Kullback-Leibler divergence $D_{KL}$ between a postulated posterior and the true (unnormalised) logarithmic posterior. However, they rely on a closed form $D_{KL}$ in order to update the variational parameters. To address this, stochastic versions of variational inference have been devised that approximate the intractable $D_{KL}$ with a Monte Carlo average. This approximation allows calculating gradients with respect to the variational parameters. 
However, variational methods often postulate a factorised Gaussian approximating posterior. In doing so, they sacrifice a-posteriori correlations. In this work, we propose a method that combines the Laplace approximation with the variational approach. The advantages are that we maintain: applicability on non-conjugate models, posterior correlations and a reduced number of free variational parameters. Numerical experiments demonstrate improvement over the Laplace approximation and variational inference with factorised Gaussian posteriors. Mixed-Data Sampling(MIDAS) Mixed-data sampling (MIDAS) is an econometric regression or filtering method developed by Ghysels et al. The regression models can be viewed in some cases as substitutes for the Kalman filter when applied in the context of mixed frequency data. Bai, Ghysels and Wright (2010) examine the relationship between MIDAS regressions and Kalman filter state space models applied to mixed frequency data. In general, the latter involve a system of equations, whereas in contrast MIDAS regressions involve a (reduced form) single equation. As a consequence, MIDAS regressions might be less efficient, but also less prone to specification errors. In cases where the MIDAS regression is only an approximation, the approximation errors tend to be small. Mixed-Integer Linear Programming(MILP) This paper proposes an optimization strategy to assist utility operators to recover power distribution systems after large outages. Specifically, a novel mixed-integer linear programming (MILP) model is developed for co-optimizing crews, resources, and network operations. The MILP model coordinates the damage isolation, network reconfiguration, distributed generator re-dispatch, and crew/resource logistics. We consider two different types of crews, namely, line crews for damage repair and tree crews for obstacle removal. We also model the repair resource logistic constraints. 
Furthermore, a new algorithm is developed for solving the distribution system repair and restoration problem (DSRRP). The algorithm starts by solving DSRRP using an assignment-based method, then a neighborhood search method is designed to iteratively improve the solution. The proposed method is validated on the modified IEEE 123-bus distribution test system. Mixed-Integer Quadratic Optimization(MIQO) Learning directed acyclic graphs (DAGs) from data is a challenging task both in theory and in practice, because the number of possible DAGs scales superexponentially with the number of nodes. In this paper, we study the problem of learning an optimal DAG from continuous observational data. We cast this problem in the form of a mathematical programming model which can naturally incorporate a super-structure in order to reduce the set of possible candidate DAGs. We use the penalized negative log-likelihood score function with both $\ell_0$ and $\ell_1$ regularizations and propose a new mixed-integer quadratic optimization (MIQO) model, referred to as a layered network (LN) formulation. The LN formulation is a compact model, which enjoys as tight an optimal continuous relaxation value as the stronger but larger formulations under a mild condition. Computational results indicate that the proposed formulation outperforms existing mathematical formulations and scales better than available algorithms that can solve the same problem with only $\ell_1$ regularization. In particular, the LN formulation clearly outperforms existing methods in terms of computational time needed to find an optimal DAG in the presence of a sparse super-structure. MixHop Existing popular methods for semi-supervised learning with Graph Neural Networks (such as the Graph Convolutional Network) provably cannot learn a general class of neighborhood mixing relationships. 
To address this weakness, we propose a new model, MixHop, that can learn these relationships, including difference operators, by repeatedly mixing feature representations of neighbors at various distances. MixHop requires no additional memory or computational complexity, and outperforms challenging baselines. In addition, we propose sparsity regularization that allows us to visualize how the network prioritizes neighborhood information across different graph datasets. Our analysis of the learned architectures reveals that neighborhood mixing varies per dataset. MixMatch Semi-supervised learning has proven to be a powerful paradigm for leveraging unlabeled data to mitigate the reliance on large labeled datasets. In this work, we unify the current dominant approaches for semi-supervised learning to produce a new algorithm, MixMatch, that works by guessing low-entropy labels for data-augmented unlabeled examples and mixing labeled and unlabeled data using MixUp. We show that MixMatch obtains state-of-the-art results by a large margin across many datasets and labeled data amounts. For example, on CIFAR-10 with 250 labels, we reduce error rate by a factor of 4 (from 38% to 11%) and by a factor of 2 on STL-10. We also demonstrate how MixMatch can help achieve a dramatically better accuracy-privacy trade-off for differential privacy. Finally, we perform an ablation study to tease apart which components of MixMatch are most important for its success. MixTrain There is an arms race to defend neural networks against adversarial examples. Notably, adversarially robust training and verifiably robust training are the most promising defenses. The adversarially robust training scales well but cannot provide a provable robustness guarantee for the absence of attacks. We present an Interval Attack that reveals fundamental problems about the threat model used by adversarially robust training. 
On the contrary, verifiably robust training achieves sound guarantee, but it is computationally expensive and sacrifices accuracy, which prevents it being applied in practice. In this paper, we propose two novel techniques for verifiably robust training, stochastic output approximation and dynamic mixed training, to solve the aforementioned challenges. They are based on two critical insights: (1) soundness is only needed in a subset of training data; and (2) verifiable robustness and test accuracy are conflicting to achieve after a certain point of verifiably robust training. On both MNIST and CIFAR datasets, we are able to achieve similar test accuracy and estimated robust accuracy against PGD attacks within $14\times$ less training time compared to state-of-the-art adversarially robust training techniques. In addition, we have up to 95.2% verified robust accuracy as a bonus. Also, to achieve similar verified robust accuracy, we are able to save up to $5\times$ computation time and offer 9.2% test accuracy improvement compared to current state-of-the-art verifiably robust training techniques. Mixture Density Generative Adversarial Network(Mixture Density GAN) Generative Adversarial Networks have surprising ability for generating sharp and realistic images, though they are known to suffer from the so-called mode collapse problem. In this paper, we propose a new GAN variant called Mixture Density GAN that while being capable of generating high-quality images, overcomes this problem by encouraging the Discriminator to form clusters in its embedding space, which in turn leads the Generator to exploit these and discover different modes in the data. This is achieved by positioning Gaussian density functions in the corners of a simplex, using the resulting Gaussian mixture as a likelihood function over discriminator embeddings, and formulating an objective function for GAN training that is based on these likelihoods. 
We show that the optimum of our training objective is attained if and only if the generated and the real distribution match exactly. We further support our theoretical results with empirical evaluations on one synthetic and several real image datasets (CIFAR-10, CelebA, MNIST, and FashionMNIST). We demonstrate empirically (1) the quality of the generated images in Mixture Density GAN and their strong similarity to real images, as measured by the Fréchet Inception Distance (FID), which compares very favourably with state-of-the-art methods, and (2) the ability to avoid mode collapse and discover all data modes. Mixture Density Network The core idea is to have a Neural Net that predicts an entire (and possibly complex) distribution. In this example we’re predicting a mixture of Gaussian distributions via its sufficient statistics. This means that the network knows what it doesn’t know: it will predict diffuse distributions in situations where the target variable is very noisy, and it will predict a much more peaky distribution in nearly deterministic parts. Mixture Generative Adversarial Network(MIXGAN) In this work, we present an interesting attempt on mixture generation: absorbing different image concepts (e.g., content and style) from different domains and thus generating a new domain with learned concepts. In particular, we propose a mixture generative adversarial network (MIXGAN). MIXGAN learns concepts of content and style from two domains respectively, and thus can join them for mixture generation in a new domain, i.e., generating images with content from one domain and style from another. MIXGAN overcomes the limitation of current GAN-based models which either generate new images in the same domain as they observed in training stage, or require off-the-shelf content templates for transferring or translation. Extensive experimental results demonstrate the effectiveness of MIXGAN as compared to related state-of-the-art GAN-based models. 
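To make the Mixture Density Network entry above more concrete, here is a minimal sketch of the quantity such a network is typically trained on: the negative log-likelihood of a target under a one-dimensional Gaussian mixture. It assumes that some network (not shown) has already produced raw mixing scores, means and standard deviations for one input; all function and variable names are hypothetical, and only the Python standard library is used.

```python
import math


def softmax(scores):
    """Turn raw mixing scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_pdf(y, mean, sigma):
    """Density of a univariate Gaussian with the given mean and std deviation."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_density(y, weights, means, sigmas):
    """Density of y under a weighted sum of Gaussian components."""
    return sum(w * gaussian_pdf(y, m, s)
               for w, m, s in zip(weights, means, sigmas))


def mixture_nll(y, weights, means, sigmas):
    """Negative log-likelihood of y, the usual per-example training loss."""
    return -math.log(mixture_density(y, weights, means, sigmas))


# Assumed network outputs for a single input (illustrative numbers only).
weights = softmax([0.0, 0.0])  # two equally weighted components
loss = mixture_nll(1.0, weights, means=[-1.0, 1.0], sigmas=[0.5, 0.5])

# A narrow component gives a "peaky" density, a wide one a diffuse density.
peaked = mixture_density(0.0, [1.0], [0.0], [0.1])
diffuse = mixture_density(0.0, [1.0], [0.0], [2.0])
```

The last two lines mirror the behaviour described in the entry: a small predicted standard deviation concentrates the density around the mean, while a large one spreads it out.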
Mixture Hyper Long Short Term Memory Network Classifying human cognitive states from behavioral and physiological signals is a challenging problem with important applications in robotics. The problem is challenging due to the data variability among individual users, and sensor artefacts. In this work, we propose an end-to-end framework for real-time cognitive workload classification with mixture Hyper Long Short Term Memory Networks, a novel variant of HyperNetworks. Evaluating the proposed approach on an eye-gaze pattern dataset collected from simulated driving scenarios of different cognitive demands, we show that the proposed framework outperforms previous baseline methods and achieves 83.9% precision and 87.8% recall during test. We also demonstrate the merit of our proposed architecture by showing improved performance over other LSTM-based methods. Mixture Likelihood Ratio Test We explore the fundamental limits of heterogeneous distributed detection in an anonymous sensor network with n sensors and a single fusion center. The fusion center collects the single observation from each of the n sensors to detect a binary parameter. The sensors are clustered into multiple groups, and different groups follow different distributions under a given hypothesis. The key challenge for the fusion center is the anonymity of sensors: although it knows the exact number of sensors and the distribution of observations in each group, it does not know which group each sensor belongs to. It is hence natural to consider it as a composite hypothesis testing problem. First, we propose an optimal test called mixture likelihood ratio test, which is a randomized threshold test based on the ratio of the uniform mixture of all the possible distributions under one hypothesis to that under the other hypothesis. 
Optimality is shown by first arguing that there exists an optimal test that is symmetric, that is, it does not depend on the order of observations across the sensors, and then proving that the mixture likelihood ratio test is optimal among all symmetric tests. Second, we focus on the Neyman-Pearson setting and characterize the error exponent of the worst-case type-II error probability as n tends to infinity, assuming the number of sensors in each group is proportional to n. Finally, we generalize our result to find the collection of all achievable type-I and type-II error exponents, showing that the boundary of the region can be obtained by solving a convex optimization problem. Our results elucidate the price of anonymity in heterogeneous distributed detection. The results are also applied to distributed detection under Byzantine attacks, which hints that the conventional approach based on simple hypothesis testing might be too pessimistic. Mixture Model(MM) In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with “mixture distributions” relate to deriving the properties of the overall population from those of the sub-populations, “mixture models” are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. Mixture of Experts(MoE) Mixture of experts refers to a machine learning technique where multiple experts (learners) are used to divide the problem space into homogeneous regions. 
An example from the computer vision domain is combining a neural network model for human detection with another for pose estimation. If the output is conditioned on multiple levels of probabilistic gating functions, the mixture is called a hierarchical mixture of experts. A gating network decides which expert to use for each input region. Learning thus consists of 1) learning the parameters of individual learners and 2) learning the parameters of the gating network. Globally Consistent Algorithms for Mixture of Experts Mixture of Meta-Learners(MxML) A meta-model is trained on a distribution of similar tasks such that it learns an algorithm that can quickly adapt to a novel task with only a handful of labeled examples. Most of current meta-learning methods assume that the meta-training set consists of relevant tasks sampled from a single distribution. In practice, however, a new task is often out of the task distribution, yielding a performance degradation. One way to tackle this problem is to construct an ensemble of meta-learners such that each meta-learner is trained on different task distribution. In this paper we present a method for constructing a mixture of meta-learners (MxML), where mixing parameters are determined by the weight prediction network (WPN) optimized to improve the few-shot classification performance. Experiments on various datasets demonstrate that MxML significantly outperforms state-of-the-art meta-learners, or their naive ensemble in the case of out-of-distribution as well as in-distribution tasks. ML Health Deployment of machine learning (ML) algorithms in production for extended periods of time has uncovered new challenges such as monitoring and management of real-time prediction quality of a model in the absence of labels. However, such tracking is imperative to prevent catastrophic business outcomes resulting from incorrect predictions. 
The scale of these deployments makes manual monitoring prohibitive, making automated techniques to track and raise alerts imperative. We present a framework, ML Health, for tracking potential drops in the predictive performance of ML models in the absence of labels. The framework employs diagnostic methods to generate alerts for further investigation. We develop one such method to monitor potential problems when production data patterns do not match training data distributions. We demonstrate that our method performs better than standard ‘distance metrics’, such as RMSE, KL-Divergence, and Wasserstein at detecting issues with mismatched data sets. Finally, we present a working system that incorporates the ML Health approach to monitor and manage ML deployments within a realistic full production ML lifecycle. ML.NET 1. Machine Learning made for .NET: ML.NET is a machine learning framework built for .NET developers. Use your .NET and C# or F# skills to easily integrate custom machine learning into your applications without any prior expertise in developing or tuning machine learning models. 2. Open source and cross-platform: ML.NET is open source and runs on Windows, Linux, and macOS. Our public release is still in-development, and we want your help! Join the community and contribute your ideas to help us shape what comes next. 3. Proven and extensible: Use the same framework behind recognized Microsoft features like Windows Hello, Bing Ads, and PowerPoint Design Ideas to power your own applications. We’re building ML.NET as an extensible framework, with support for Light GBM, Accord.NET, CNTK, and TensorFlow coming soon. 4. Cover your developer scenarios: Enhance your .NET apps with sentiment analysis, price prediction, fraud detection, and more using custom models built with ML.NET. Machine Learning at Microsoft with ML.NET MLFlow MLflow is an open source platform for managing the end-to-end machine learning lifecycle. 
It tackles three primary functions: • Tracking experiments to record and compare parameters and results (MLflow Tracking). • Packaging ML code in a reusable, reproducible form in order to share with other data scientists or transfer to production (MLflow Projects). • Managing and deploying models from a variety of ML libraries to a variety of model serving and inference platforms (MLflow Models). MLflow is library-agnostic. You can use it with any machine learning library, and in any programming language, since all functions are accessible through a REST API and CLI. For convenience, the project also includes a Python API. MLJAR MLJAR is a platform for rapid prototyping, developing and deploying pattern recognition algorithms. It works with many data types – basically all data are arrays 🙂 mljar MLModelScope The current landscape of Machine Learning (ML) and Deep Learning (DL) is rife with non-uniform frameworks, models, and system stacks but lacks standard tools to facilitate the evaluation and measurement of models. Due to the absence of such tools, the current practice for evaluating and comparing the benefits of proposed AI innovations (be it hardware or software) on end-to-end AI pipelines is both arduous and error prone, stifling the adoption of the innovations. We propose MLModelScope, a hardware/software agnostic platform to facilitate the evaluation, measurement, and introspection of ML models within AI pipelines. MLModelScope aids application developers in discovering and experimenting with models, data scientists in replicating and evaluating models for publication, and system architects in understanding the performance of AI workloads. We describe the design and implementation of MLModelScope and show how it is able to give users a holistic view into the execution of models within AI pipelines. 
Using AlexNet as a case study, we demonstrate how MLModelScope aids in identifying deviation in accuracy, helps in pinpointing the source of system bottlenecks, and automates the evaluation and performance aggregation of models across frameworks and systems. MLog We demonstrate MLOG, a high-level language that integrates machine learning into data management systems. Unlike existing machine learning frameworks (e.g., TensorFlow, Theano, and Caffe), MLOG is declarative, in the sense that the system manages all data movement, data persistency, and machine-learning related optimizations (such as data batching) automatically. Our interactive demonstration will show the audience how this is achieved based on the novel notion of tensoral views (TViews), which are similar to relational views but operate over tensors with linear algebra. With MLOG, users can succinctly specify not only simple models such as SVM (in just two lines), but also sophisticated deep learning models that are not supported by existing in-database analytics systems (e.g., MADlib, PAL, and SciDB), as a series of cascaded TViews. Given the declarative nature of MLOG, we further demonstrate how query/program optimization techniques can be leveraged to translate MLOG programs into native TensorFlow programs. The performance of the automatically generated TensorFlow programs is comparable to that of hand-optimized ones. MLPerf The MLPerf effort aims to build a common set of benchmarks that enables the machine learning (ML) field to measure system performance for both training and inference from mobile devices to cloud services. We believe that a widely accepted benchmark suite will benefit the entire community, including researchers, developers, builders of machine learning frameworks, cloud service providers, hardware manufacturers, application providers, and end users. 
ML-Schema The ML-Schema, proposed by the W3C Machine Learning Schema Community Group, is a top-level ontology that provides a set of classes, properties, and restrictions for representing and interchanging information on machine learning algorithms, datasets, and experiments. It can be easily extended and specialized and it is also mapped to other more domain-specific ontologies developed in the area of machine learning and data mining. In this paper we overview existing state-of-the-art machine learning interchange formats and present the first release of ML-Schema, a canonical format resulting from more than seven years of experience among different research institutions. We argue that exposing semantics of machine learning algorithms, models, and experiments through a canonical format may pave the way to better interpretability and to realistically achieve the full interoperability of experiments regardless of platform or adopted workflow solution. MLWeaving Learning from the data stored in a database is an important function increasingly available in relational engines. Methods using lower precision input data are of special interest given their overall higher efficiency but, in databases, these methods have a hidden cost: the quantization of the real value into a smaller number is an expensive step. To address the issue, in this paper we present MLWeaving, a data structure and hardware acceleration technique intended to speed up learning of generalized linear models in databases. MLWeaving provides a compact, in-memory representation enabling the retrieval of data at any level of precision. MLWeaving also takes advantage of the increasing availability of FPGA-based accelerators to provide a highly efficient implementation of stochastic gradient descent. The solution adopted in MLWeaving is more efficient than existing designs in terms of space (since it can process any resolution on the same design) and resources (via the use of bit-serial multipliers). 
MLWeaving also enables the runtime tuning of precision, instead of a fixed precision level during the training. We illustrate this using a simple, dynamic precision schedule. Experimental results show MLWeaving achieves up to 16x performance improvement over low-precision CPU implementations of first-order methods. MMALFM Although the latent factor model achieves good accuracy in rating prediction, it suffers from many problems including cold-start, non-transparency, and suboptimal results for individual user-item pairs. In this paper, we exploit textual reviews and item images together with ratings to tackle these limitations. Specifically, we first apply a proposed multi-modal aspect-aware topic model (MATM) on text reviews and item images to model users’ preferences and items’ features from different aspects, and also estimate the aspect importance of a user towards an item. Then the aspect importance is integrated into a novel aspect-aware latent factor model (ALFM), which learns users’ and items’ latent factors based on ratings. In particular, ALFM introduces a weight matrix to associate those latent factors with the same set of aspects in MATM, such that the latent factors could be used to estimate aspect ratings. Finally, the overall rating is computed via a linear combination of the aspect ratings, which are weighted by the corresponding aspect importance. To this end, our model could alleviate the data sparsity problem and gain good interpretability for recommendation. Besides, every aspect rating is weighted by its aspect importance, which is dependent on the targeted user’s preferences and the targeted item’s features. Therefore, it is expected that the proposed method can model a user’s preferences on an item more accurately for each user-item pair. Comprehensive experimental studies have been conducted on the Yelp 2017 Challenge dataset and Amazon product datasets to demonstrate the effectiveness of our method. 
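The final step of the MMALFM entry above, combining aspect ratings into an overall rating, can be illustrated with a small sketch. The functional form used here is an assumption for illustration only: each aspect rating is taken to be a weighted inner product of the user and item latent factors, and the names `aspect_ratings`, `overall_rating`, `W`, and `importance` are hypothetical, not taken from the paper.

```python
def aspect_ratings(p_u, q_i, W):
    """Estimate one rating per aspect.

    p_u, q_i: latent factor vectors of the user and the item.
    W: one weight vector per aspect, linking latent factors to that aspect
       (assumed form: rating_a = sum_k W[a][k] * p_u[k] * q_i[k]).
    """
    return [
        sum(w_k * p_k * q_k for w_k, p_k, q_k in zip(w_a, p_u, q_i))
        for w_a in W
    ]


def overall_rating(p_u, q_i, W, importance):
    """Linear combination of aspect ratings, weighted by aspect importance."""
    ratings = aspect_ratings(p_u, q_i, W)
    return sum(rho * r for rho, r in zip(importance, ratings))


# Toy example: two latent factors, two aspects.
p_u = [1.0, 2.0]            # user factors
q_i = [3.0, 1.0]            # item factors
W = [[1.0, 0.0],            # aspect 0 only uses factor 0
     [0.0, 1.0]]            # aspect 1 only uses factor 1
importance = [0.25, 0.75]   # importance of each aspect for this user-item pair

print(aspect_ratings(p_u, q_i, W))
print(overall_rating(p_u, q_i, W, importance))
```

Because the importance weights depend on the user-item pair, the same aspect ratings can yield different overall ratings for different pairs, which is the interpretability argument the entry makes.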
MnasNet Designing convolutional neural network (CNN) models for mobile devices is challenging because mobile models need to be small and fast, yet still accurate. Although significant effort has been dedicated to design and improve mobile models on all three dimensions, it is challenging to manually balance these trade-offs when there are so many architectural possibilities to consider. In this paper, we propose an automated neural architecture search approach for designing resource-constrained mobile CNN models. We propose to explicitly incorporate latency information into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency. Unlike in previous work, where mobile latency is considered via another, often inaccurate proxy (e.g., FLOPS), in our experiments, we directly measure real-world inference latency by executing the model on a particular platform, e.g., Pixel phones. To further strike the right balance between flexibility and search space size, we propose a novel factorized hierarchical search space that permits layer diversity throughout the network. Experimental results show that our approach consistently outperforms state-of-the-art mobile CNN models across multiple vision tasks. On the ImageNet classification task, our model achieves 74.0% top-1 accuracy with 76ms latency on a Pixel phone, which is 1.5x faster than MobileNetV2 (Sandler et al. 2018) and 2.4x faster than NASNet (Zoph et al. 2018) with the same top-1 accuracy. On the COCO object detection task, our model family achieves both higher mAP quality and lower latency than MobileNets. MNIST Database(MNIST) The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. 
MOA MOA is the most popular open source framework for data stream mining, with a very active growing community. It includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection and recommender systems) and tools for evaluation. Related to the WEKA project, MOA is also written in Java, while scaling to more demanding problems. mobile Question Answering(mQA) In this paper, we present a novel proposal for Question Answering through mobile devices. Thus, an architecture for a mobile Question Answering system based on WAP technologies is deployed. The proposed architecture moves the issue of Question Answering to the context of mobility. This paradigm ensures that QA is seen as an activity that provides entertainment and excitement. This characteristic gives QA an added value. Furthermore, the method for answering definition questions is very precise. It could answer almost 90% of the questions; moreover, it never gives wrong or unsupported answers. Considering that the mobile phone has boomed in recent years and that a lot of people already have mobile telephones (approximately 3.5 billion), we propose an architecture for a new mobile system that makes QA something natural and effective for work in all fields of development. This is because the new mobile technology can help us achieve our prospects for growth. This system provides the user with permanent communication anytime, anywhere and on any device (PDAs, cell phones, NDS, etc.). MobiRNN In this paper, we explore optimizations to run Recurrent Neural Network (RNN) models locally on mobile devices. RNN models are widely used for Natural Language Processing, Machine Translation, and other tasks. However, existing mobile applications that use RNN models do so on the cloud. To address privacy and efficiency concerns, we show how RNN models can be run locally on mobile devices. 
Existing work on porting deep learning models to mobile devices focuses on Convolutional Neural Networks (CNNs) and cannot be applied directly to RNN models. In response, we present MobiRNN, a mobile-specific optimization framework that implements GPU offloading specifically for mobile GPUs. Evaluations using an RNN model for activity recognition show that MobiRNN significantly decreases the latency of running RNN models on phones. MobiVSR Visual speech recognition (VSR) is the task of recognizing spoken language from video input only, without any audio. VSR has many applications as an assistive technology, especially if it could be deployed in mobile devices and embedded systems. The need for intensive computational resources and a large memory footprint are two of the major obstacles in developing neural network models for VSR in a resource constrained environment. We propose a novel end-to-end deep neural network architecture for word level VSR called MobiVSR with a design parameter that aids in balancing the model’s accuracy and parameter count. We use depthwise-separable 3D convolution for the first time in the domain of VSR and show how it makes our model efficient. MobiVSR achieves an accuracy of 73% on a challenging Lip Reading in the Wild dataset with 6 times fewer parameters and a 20 times smaller memory footprint than the current state of the art. MobiVSR can also be compressed to 6 MB by applying post training quantization. MOCHA Federated learning poses new statistical and systems challenges in training machine learning models over distributed networks of devices. In this work, we show that multi-task learning is naturally suited to handle the statistical challenges of this setting, and propose a novel systems-aware optimization method, MOCHA, that is robust to practical systems issues. Our method and theory for the first time consider issues of high communication cost, stragglers, and fault tolerance for distributed multi-task learning. 
The resulting method achieves significant speedups compared to alternatives in the federated setting, as we demonstrate through simulations on real-world federated datasets. MoCHIN As one type of complex networks widely-seen in real-world application, heterogeneous information networks (HINs) often encapsulate higher-order interactions that crucially reflect the complex nature among nodes and edges in real-world data. Modeling higher-order interactions in HIN facilitates the user-guided clustering problem by providing an informative collection of signals. At the same time, network motifs have been used extensively to reveal higher-order interactions and network semantics in homogeneous networks. Thus, it is natural to extend the use of motifs to HIN, and we tackle the problem of user-guided clustering in HIN by using motifs. We highlight the benefits of comprehensively modeling higher-order interactions instead of decomposing the complex relationships to pairwise interaction. We propose the MoCHIN model which is applicable to arbitrary forms of HIN motifs, which is often necessary for the application scenario in HINs due to their rich and diverse semantics encapsulated in the heterogeneity. To overcome the curse of dimensionality since the tensor size grows exponentially as the number of nodes increases in our model, we propose an efficient inference algorithm for MoCHIN. In our experiment, MoCHIN surpasses all baselines in three evaluation tasks under different metrics. The advantage of our model when the supervision is weak is also discussed in additional experiments. modAL modAL is a modular active learning framework for Python, aimed to make active learning research and practice simpler. Its distinguishing features are (i) clear and modular object oriented design (ii) full compatibility with scikit-learn models and workflows. 
These features make fast prototyping and easy extensibility possible, aiding the development of real-life active learning pipelines and novel algorithms as well. modAL is fully open source, hosted on GitHub at https://…/modAL. To assure code quality, extensive unit tests are provided and continuous integration is applied. In addition, detailed documentation with several tutorials is available for ease of use. The framework is available in PyPI and distributed under the MIT license. Modality Dropout(m-drop) ➚ “Learning to Recommend with Missing Modalities” Mode Aware Data Flow(MADF) In real-time systems, the application’s behavior has to be predictable at compile-time to guarantee timing constraints. However, modern streaming applications, which exhibit adaptive behavior due to mode switching at run-time, may degrade system predictability due to unknown behavior of the application during mode transitions. Therefore, proper temporal analysis during mode transitions is imperative to preserve system predictability. To this end, in this paper, we initially introduce Mode Aware Data Flow (MADF) which is our new predictable Model of Computation (MoC) to efficiently capture the behavior of adaptive streaming applications. Then, as an important part of the operational semantics of MADF, we propose the Maximum-Overlap Offset (MOO) which is our novel protocol for mode transitions. The main advantage of this transition protocol is that, in contrast to self-timed transition protocols, it avoids timing interference between modes upon mode transitions. As a result, any mode transition can be analyzed independently from the mode transitions that occurred in the past. Based on this transition protocol, we propose a hard real-time analysis as well to guarantee timing constraints by avoiding processor overloading during mode transitions. 
Therefore, using this protocol, we can derive a lower bound and an upper bound on the earliest starting time of the tasks in the new mode during mode transitions in such a way that hard real-time constraints are respected. Mode Normalization Normalization methods are a central building block in the deep learning toolbox. They accelerate and stabilize training, while decreasing the dependence on manually tuned learning rate schedules. When learning from multi-modal distributions, the effectiveness of batch normalization (BN), arguably the most prominent normalization method, is reduced. As a remedy, we propose a more flexible approach: by extending the normalization to more than a single mean and variance, we detect modes of data on-the-fly, jointly normalizing samples that share common features. We demonstrate that our method outperforms BN and other widely used normalization techniques in several experiments, including single and multi-task datasets. Mode of Computing The Turing Machine is the paradigmatic case of computing machines, but there are others, such as Artificial Neural Networks, Table Computing, Relational-Indeterminate computing and diverse forms of analogical computing, each of which is based on a particular underlying intuition of the phenomenon of computing. This variety can be captured in terms of system levels, re-interpreting and generalizing Newell’s hierarchy, which includes the knowledge level at the top and the symbol level immediately below it. In this re-interpretation the knowledge level consists of human knowledge and the symbol level is generalized into a new level that here is called The Mode of Computing. Each computing paradigm uses a particular mode, and a central question for Cognition is what the mode of natural computing is. The mode of computing provides a novel perspective on the phenomena of computing, the representational and non-representational views of cognition, and consciousness. 
Model Agnostic Meta Learning(MAML) We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. In effect, our method trains the model to be easy to fine-tune. We demonstrate that this approach leads to state-of-the-art performance on two few-shot image classification benchmarks, produces good results on few-shot regression, and accelerates fine-tuning for policy gradient reinforcement learning with neural network policies. How to train your MAML Model Average Double Robust(MA-DR) Estimates average treatment effects using model average double robust (MA-DR) estimation. The MA-DR estimator is defined as a weighted average of double robust estimators, where each double robust estimator corresponds to a specific choice of the outcome model and the propensity score model. The MA-DR estimator extends the desirable double robustness property by achieving consistency under the much weaker assumption that either the true propensity score model or the true outcome model be within a specified, possibly large, class of models. madr Model Averaging Model Based Clustering for Mixed Data(clustMD) A model based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. 
The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed type data and prostate cancer patients, on whom mixed data have been recorded. Model Based Machine Learning(MBML) Several decades of research in the field of machine learning have resulted in a multitude of different algorithms for solving a broad range of problems. To tackle a new application, a researcher typically tries to map their problem onto one of these existing methods, often influenced by their familiarity with specific algorithms and by the availability of corresponding software implementations. In this study, we describe an alternative methodology for applying machine learning, in which a bespoke solution is formulated for each new application. The solution is expressed through a compact modelling language, and the corresponding custom machine learning code is then generated automatically. This model-based approach offers several major advantages, including the opportunity to create highly tailored models for specific scenarios, as well as rapid prototyping and comparison of a range of alternative models. Furthermore, newcomers to the field of machine learning do not have to learn about the huge range of traditional methods, but instead can focus their attention on understanding a single modelling environment. 
In this study, we show how probabilistic graphical models, coupled with efficient inference algorithms, provide a very flexible foundation for model-based machine learning, and we outline a large-scale commercial application of this framework involving tens of millions of users. Model Cards Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, we recommend that released models be accompanied by documentation detailing their performance characteristics. In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information. While we focus primarily on human-centered machine learning models in the application fields of computer vision and natural language processing, this framework can be used to document any trained machine learning model. To solidify the concept, we provide cards for two supervised models: One trained to detect smiling faces in images, and one trained to detect toxic comments in text. We propose model cards as a step towards the responsible democratization of machine learning and related AI technology, increasing transparency into how well AI technology works. 
We hope this work encourages those releasing trained machine learning models to accompany model releases with similar detailed evaluation numbers and other relevant documentation. Model Confidence Set(MCS) The Model Confidence Set (MCS) procedure was developed by Hansen et al. (2011). Hansen’s procedure consists of a sequence of tests that allows a set of ‘superior’ models to be constructed, for which the null hypothesis of Equal Predictive Ability (EPA) is not rejected at a certain confidence level. The EPA test statistic is calculated for an arbitrary loss function, meaning that we could test models on various aspects, for example point forecasts. MCS Model Explanation System(MES) We propose a general model explanation system (MES) for ‘explaining’ the output of black box classifiers. In this introduction we use the motivating example of a classifier trained to detect fraud in a credit card transaction history. The key aspect is that we provide explanations applicable to a single prediction, rather than provide an interpretable set of parameters. The labels in the provided examples are usually negative. Hence, we focus on explaining positive predictions (alerts). In many classification applications, but especially in fraud detection, there is an expectation of false positives. Alerts are given to a human analyst before any further action is taken. Analysts often insist on understanding ‘why’ there was an alert, since an opaque alert makes it difficult for them to proceed. Analogous scenarios occur in computer vision, credit risk, spam detection, etc. Furthermore, the MES framework is useful for model criticism. In the world of generative models, practitioners often generate synthetic data from a trained model to get an idea of ‘what the model is doing’. Our MES framework augments such tools. As an added benefit, MES is applicable to completely non-probabilistic black boxes that only provide hard labels. 
In Section 3 we use MES to visualize the decisions of a face recognition system. Model Features A key question in Reinforcement Learning is which representation an agent can learn to efficiently reuse knowledge between different tasks. Recently the Successor Representation was shown to have empirical benefits for transferring knowledge between tasks with shared transition dynamics. This paper presents Model Features: a feature representation that clusters behaviourally equivalent states and that is equivalent to a Model-Reduction. Further, we present a Successor Feature model which shows that learning Successor Features is equivalent to learning a Model-Reduction. A novel optimization objective is developed and we provide bounds showing that minimizing this objective results in an increasingly improved approximation of a Model-Reduction. Further, we provide transfer experiments on randomly generated MDPs which vary in their transition and reward functions but approximately preserve behavioural equivalence between states. These results demonstrate that Model Features are suitable for transfer between tasks with varying transition and reward functions. Model Independent Neural Decoder(MIND) Standard decoding approaches rely on model-based channel estimation methods to compensate for varying channel effects, which degrade in performance whenever there is a model mismatch. Recently proposed Deep learning based neural decoders address this problem by leveraging a model-free approach via gradient-based training. However, they require large amounts of data to retrain to achieve the desired adaptivity, which becomes intractable in practical systems. In this paper, we propose a new decoder: Model Independent Neural Decoder (MIND), which builds on the top of neural decoders and equips them with a fast adaptation capability to varying channels. This feature is achieved via the methodology of Model-Agnostic Meta-Learning (MAML). 
Here the decoder: (a) learns a ‘good’ parameter initialization in the meta-training stage where the model is exposed to a set of archetypal channels and (b) updates the parameter with respect to the observed channel in the meta-testing phase using minimal adaptation data and pilot bits. Building on top of existing state-of-the-art neural Convolutional and Turbo decoders, MIND outperforms the static benchmarks by a large margin and shows minimal performance gap when compared to the neural (Convolutional or Turbo) decoders designed for that particular channel. In addition, MIND also shows strong learning capability for channels not exposed during the meta training phase. Model Management Deep Neural Network(MMdnn) MMdnn is a set of tools to help users inter-operate among different deep learning frameworks, e.g. model conversion and visualization. It converts models between Caffe, Keras, MXNet, Tensorflow, CNTK, PyTorch and CoreML. It is a comprehensive, cross-framework solution to convert, visualize and diagnose deep neural network models. The ‘MM’ in MMdnn stands for model management and ‘dnn’ is an acronym for deep neural network. Basically, it converts many DNN models trained by one framework into others. The major features include: · Model File Converter: converting DNN models between frameworks · Model Code Snippet Generator: generating training or inference code snippets for frameworks · Model Visualization: visualizing DNN network architecture and parameters for frameworks · Model compatibility testing (on-going). This project is designed and developed by Microsoft Research (MSR). We also encourage researchers and students to leverage this project to analyze DNN models and we welcome any new ideas to extend this project. Model Performance Predictor(MPP) Operations is a key challenge in the domain of machine learning pipeline deployments involving monitoring and management of real-time prediction quality. 
Typically, metrics like accuracy, RMSE etc., are used to track the performance of models in deployment. However, these metrics cannot be calculated in production due to the absence of labels. We propose using an ML algorithm, Model Performance Predictor (MPP), to track the performance of the models in deployment. We argue that an ensemble of such metrics can be used to create a score representing the prediction quality in production. This in turn facilitates formulation and customization of ML alerts, that can be escalated by an operations team to the data science team. Such a score automates monitoring and enables ML deployments at scale. Model Predictive Control Model predictive control (MPC) is an advanced method of process control that is used to control a process while satisfying a set of constraints. It has been in use in the process industries in chemical plants and oil refineries since the 1980s. In recent years it has also been used in power system balancing models and in power electronics. Model predictive controllers rely on dynamic models of the process, most often linear empirical models obtained by system identification. The main advantage of MPC is the fact that it allows the current timeslot to be optimized, while keeping future timeslots in account. This is achieved by optimizing a finite time-horizon, but only implementing the current timeslot and then optimizing again, repeatedly, thus differing from Linear-Quadratic Regulator (LQR). Also MPC has the ability to anticipate future events and can take control actions accordingly. Proportional-Integral-Derivative (PID) controllers do not have this predictive ability. MPC is nearly universally implemented as a digital control, although there is research into achieving faster response times with specially designed analog circuitry. 
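The receding-horizon idea described above (optimize over a finite horizon, apply only the first input, then optimize again) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the scalar model `x[k+1] = A*x[k] + B*u[k]`, the quadratic cost, the input grid `U_GRID` (which enforces the constraint |u| <= 1 by construction) and the brute-force search over input sequences. Practical MPC implementations normally use a numerical optimizer rather than enumeration.

```python
from itertools import product

# Hypothetical scalar process model: x[k+1] = A*x[k] + B*u[k]
A, B = 1.0, 0.5
# Admissible inputs; restricting the search to this grid enforces |u| <= 1
U_GRID = (-1.0, -0.5, 0.0, 0.5, 1.0)
HORIZON = 4  # number of future timeslots considered at each step


def predict(x, u):
    """One-step prediction with the (assumed) linear model."""
    return A * x + B * u


def horizon_cost(x, inputs, target):
    """Sum of tracking error and input effort over the prediction horizon."""
    cost = 0.0
    for u in inputs:
        x = predict(x, u)
        cost += (x - target) ** 2 + 0.1 * u ** 2
    return cost


def mpc_step(x, target):
    """Optimize the whole horizon, but return only the first input."""
    best = min(product(U_GRID, repeat=HORIZON),
               key=lambda seq: horizon_cost(x, seq, target))
    return best[0]


def simulate(x0, target, steps):
    """Closed loop: re-optimize at every timeslot (receding horizon)."""
    xs = [x0]
    for _ in range(steps):
        u = mpc_step(xs[-1], target)
        xs.append(predict(xs[-1], u))
    return xs


trajectory = simulate(x0=0.0, target=2.0, steps=10)
print(trajectory)
```

In this toy setting the controller pushes with the largest admissible input until the target is reached and then holds the state there. The horizon is re-optimized at every step, so a change in the target or a disturbance would be picked up at the next timeslot.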
Model Reference Adaptive Controller(MRAC) In this paper, we present a hybrid direct-indirect model reference adaptive controller (MRAC), to address a class of problems with matched and unmatched uncertainties. In the proposed architecture, the unmatched uncertainty is estimated online through a companion observer model. Upon convergence of the observer, the unmatched uncertainty estimate is remodeled into a state dependent linear form to augment the nominal system dynamics. Meanwhile, a direct adaptive controller designed for a switching system cancels the effect of matched uncertainty in the system and achieves reference model tracking. We demonstrate that the proposed hybrid controller can handle a broad class of nonlinear systems with both matched and unmatched uncertainties. Model Selection Model selection is the task of selecting a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the data collected is well-suited to the problem of model selection. Given candidate models of similar predictive or explanatory power, the simplest model is most likely to be the best choice. Konishi & Kitagawa (2008, p.75) state, ‘The majority of the problems in statistical inference can be considered to be problems related to statistical modeling’. Relatedly, Sir David Cox (2006, p.197) has said, ‘How translation from subject-matter problem to statistical model is done is often the most critical part of an analysis’. Model Selection Algorithm Using Binary Ant Colony Optimization(MS-BACO) Stabilizing the complexity of Feedforward Neural Networks (FNNs) for the given approximation task can be managed by defining an appropriate model magnitude which is also greatly correlated with the generalization quality and computational efficiency. However, deciding on the right level of model complexity can be highly challenging in FNN applications. 
In this paper, a new Model Selection algorithm using Binary Ant Colony Optimization (MS-BACO) is proposed in order to achieve the optimal FNN model in terms of neural complexity and cross-entropy error. MS-BACO is a meta-heuristic algorithm that treats the problem as a combinatorial optimization problem. By quantifying both the amount of correlation that exists among hidden neurons and the sensitivity of the FNN output to the hidden neurons, using a sample-based sensitivity analysis method called the extended Fourier amplitude sensitivity test, the algorithm tends to select the FNN model containing hidden neurons with the most distinct hyperplanes and a high contribution percentage. Performance of the proposed algorithm with three different designs of heuristic information is investigated. Comparison of the findings verifies that the newly introduced algorithm is able to provide a more compact and accurate FNN model. Model to Learn Compact Embedding(MCNE) Network embedding, as a promising approach to network representation learning, is capable of supporting various subsequent network mining and analysis tasks, and has attracted growing research interest recently. Traditional approaches assign each node an independent continuous vector, which causes huge memory overhead for large networks. In this paper we propose a novel multi-hot compact embedding strategy to effectively reduce memory cost by learning partially shared embeddings. The insight is that a node embedding vector is composed of several basis vectors, which can significantly reduce the number of continuous vectors while maintaining similar data representation ability. Specifically, we propose an MCNE model to learn compact embeddings from pre-learned node features. A novel component named compressor is integrated into MCNE to tackle the challenge that popular back-propagation optimization cannot propagate through discrete samples. 
We further propose an end-to-end model MCNE$_{t}$ to learn compact embeddings from the input network directly. Empirically, we evaluate the proposed models over three real network datasets, and the results demonstrate that our proposals can save about 90% of the memory cost of network embeddings without a significant performance decline. Model, MetaModel and Anomaly Detection(M3A) ‘Alice’ is submitting one web search per five minutes, for three hours in a row – is it normal? How to detect abnormal search behaviors, among Alice and other users? Is there any distinct pattern in Alice’s (or other users’) search behavior? We studied what is probably the largest, publicly available, query log that contains more than 30 million queries from 0.6 million users. In this paper, we present a novel, user- and group-level framework, M3A: Model, MetaModel and Anomaly detection. For each user, we discover and explain a surprising, bi-modal pattern of the inter-arrival time (IAT) of landed queries (queries with user click-through). Specifically, the model Camel-Log is proposed to describe such an IAT distribution; we then notice the correlations among its parameters at the group level. Thus, we further propose the metamodel Meta-Click, to capture and explain the two-dimensional, heavy-tail distribution of the parameters. Combining Camel-Log and Meta-Click, the proposed M3A has the following strong points: (1) the accurate modeling of marginal IAT distribution, (2) quantitative interpretations, and (3) anomaly detection. Model-Agnostic Meta-Learning(MAML) ➘ “Model Independent Neural Decoder” What is Model-Agnostic Meta-learning ModelarDB To monitor critical infrastructure, high quality sensors sampled at a high frequency are increasingly installed. However, due to the large amounts of data produced, only simple aggregates are stored. This removes outliers and hides fluctuations that could indicate problems. 
As a solution we propose compressing time series with dimensions using a model-based method we name Multi-model Group Compression (MMGC). MMGC adaptively compresses groups of correlated time series with dimensions using an extensible set of models within a user-defined error bound (possibly zero). To partition time series into groups, we propose a set of primitives for efficiently describing correlation for data sets of varying sizes. We also propose efficient query processing algorithms for executing multi-dimensional aggregate queries on models instead of data points. Last, we provide an open-source implementation of our methods as extensions to the model-based Time Series Management System (TSMS) ModelarDB. ModelarDB interfaces with the stock versions of Apache Spark and Apache Cassandra and thus can reuse existing infrastructure. Through an evaluation we show that, compared to widely used systems, our extended ModelarDB provides up to 11 times faster ingestion due to high compression, 65 times better compression due to the adaptivity of MMGC, 92 times faster aggregate queries as they are executed on models, and close to linear scalability while also being extensible and supporting online query processing. Model-Augmented Neural neTwork with Incoherent k-space Sampling(MANTIS) Quantitative mapping of magnetic resonance (MR) parameters has been shown to be a valuable method for improved assessment of a range of diseases. Due to the need to image an anatomic structure multiple times, parameter mapping usually requires long scan times compared to conventional static imaging. Therefore, accelerated parameter mapping is highly desirable and remains a topic of great interest in the MR research community. While many recent deep learning methods have focused on highly efficient image reconstruction for conventional static MR imaging, applications of deep learning for dynamic imaging and in particular accelerated parameter mapping have been limited. 
The purpose of this work was to develop and evaluate a novel deep learning-based reconstruction framework called Model-Augmented Neural neTwork with Incoherent k-space Sampling (MANTIS) for efficient MR parameter mapping. Our approach combines end-to-end CNN mapping with k-space consistency using the concept of cyclic loss to further enforce data and model fidelity. Incoherent k-space sampling is used to improve reconstruction performance. A physical model is incorporated into the proposed framework, so that the parameter maps can be efficiently estimated directly from undersampled images. The performance of MANTIS was demonstrated for the spin-spin relaxation time (T2) mapping of the knee joint. Compared to conventional reconstruction approaches that exploited image sparsity, MANTIS yielded lower errors and higher similarity with respect to the reference in the T2 estimation. Our study demonstrated that the proposed MANTIS framework, with a combination of end-to-end CNN mapping, signal model-augmented data consistency, and incoherent k-space sampling, represents a promising approach for efficient MR parameter mapping. MANTIS can potentially be extended to other types of parameter mapping with appropriate models. Model-Averaged Confidence Intervals MuMIn Model-Averaged Tail Area Wald Confidence Interval(MATA-Wald) MATA Model-averaged Wald Confidence Intervals Model-Based Active EXploration(MAX) Efficient exploration is an unsolved problem in Reinforcement Learning. We introduce Model-Based Active eXploration (MAX), an algorithm that actively explores the environment. It minimizes data required to comprehensively model the environment by planning to observe novel events, instead of merely reacting to novelty encountered by chance. Non-stationarity induced by traditional exploration bonus techniques is avoided by constructing fresh exploration policies only at time of action. 
In semi-random toy environments where directed exploration is critical to make progress, our algorithm is at least an order of magnitude more efficient than strong baselines. Model-Based Approximate Query Processing Interactive visualizations are arguably the most important tool to explore, understand and convey facts about data. In the past years, the database community has been working on different techniques for Approximate Query Processing (AQP) that aim to deliver an approximate query result given a fixed time bound to support interactive visualizations better. However, classical AQP approaches suffer from various problems that limit the applicability to support the ad-hoc exploration of a new data set: (1) Classical AQP approaches that perform online sampling can support ad-hoc exploration queries but yield low quality if executed over rare subpopulations. (2) Classical AQP approaches that rely on offline sampling can use some form of biased sampling to mitigate these problems but require a priori knowledge of the workload, which is often not realistic if users want to explore a new database. In this paper, we present a new approach to AQP called Model-based Approximate Query Processing that leverages generative models learned over the complete database to answer SQL queries at interactive speeds. Different from classical AQP approaches, generative models allow us to compute responses to ad-hoc queries and deliver high-quality estimates also over rare subpopulations at the same time. In our experiments with real and synthetic data sets, we show that Model-based AQP can in many scenarios return more accurate results in a shorter runtime. Furthermore, we think that our techniques of using generative models presented in this paper can not only be used for AQP in databases but also has applications for other database problems including Query Optimization as well as Data Cleaning. 
Model-Based Clustering Sample observations arise from a distribution that is a mixture of two or more components. Each component is described by a density function and has an associated probability or “weight” in the mixture. In principle, we can adopt any probability model for the components, but typically we will assume that components are p-variate normal distributions. (This does not necessarily mean things are easy: inference is tractable, however.) Thus, the probability model for clustering will often be a mixture of multivariate normal distributions. Each component in the mixture is what we call a cluster. mclust, SelvarMix Model-based Clustering via Adaptive Projection(MCAP) Mixture models are a standard approach to dealing with heterogeneous data with non-i.i.d. structure. However, when the dimension $p$ is large relative to sample size $n$ and where either or both of means and covariances/graphical models may differ between the latent groups, mixture models face statistical and computational difficulties and currently available methods cannot realistically go beyond $p \! \sim \! 10^4$ or so. We propose an approach called Model-based Clustering via Adaptive Projections (MCAP). Instead of estimating mixtures in the original space, we work with a low-dimensional representation obtained by linear projection. The projection dimension itself plays an important role and governs a type of bias-variance tradeoff with respect to recovery of the relevant signals. MCAP sets the projection dimension automatically in a data-adaptive manner, using a proxy for the assignment risk. Combining a full covariance formulation with the adaptive projection allows detection of both mean and covariance signals in very high dimensional problems. We show real-data examples in which covariance signals are reliably detected in problems with $p \! \sim \! 10^4$ or more, and simulations going up to $p = 10^6$. 
In some examples, MCAP performs well even when the mean signal is entirely removed, leaving differential covariance structure in the high-dimensional space as the only signal. Across a number of regimes, MCAP performs as well or better than a range of existing methods, including a recently-proposed $\ell_1$-penalized approach; and performance remains broadly stable with increasing dimension. MCAP can be run ‘out of the box’ and is fast enough for interactive use on large-$p$ problems using standard desktop computing resources. Model-Based Meta-Policy-Optimization(MB-MPO) Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dynamics models. Using an ensemble of learned dynamic models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step. This steers the meta-policy towards internalizing consistent dynamics predictions among the ensemble while shifting the burden of behaving optimally w.r.t. the model discrepancies towards the adaptation step. Our experiments show that MB-MPO is more robust to model imperfections than previous model-based approaches. Finally, we demonstrate that our approach is able to match the asymptotic performance of model-free methods while requiring significantly less experience. Model-Based Pricing(MBP) Data analytics using machine learning (ML) has become ubiquitous in science, business intelligence, journalism and many other domains. 
While a lot of work focuses on reducing the training cost, inference runtime and storage cost of ML models, little work studies how to reduce the cost of data acquisition, which potentially leads to a loss of sellers’ revenue and buyers’ affordability and efficiency. In this paper, we propose a model-based pricing (MBP) framework, which instead of pricing the data, directly prices ML model instances. We first formally describe the desired properties of the MBP framework, with a focus on avoiding arbitrage. Next, we show a concrete realization of the MBP framework via a noise injection approach, which provably satisfies the desired formal properties. Based on the proposed framework, we then provide algorithmic solutions on how the seller can assign prices to models under different market scenarios (such as to maximize revenue). Finally, we conduct extensive experiments, which validate that the MBP framework can provide high revenue to the seller, high affordability to the buyer, and also operate on low runtime cost. Model-Based Priors for Model-Free Reinforcement Learning(MBMF) Reinforcement Learning is divided in two main paradigms: model-free and model-based. Each of these two paradigms has strengths and limitations, and has been successfully applied to real world domains that are appropriate to its corresponding strengths. In this paper, we present a new approach aimed at bridging the gap between these two paradigms. We aim to take the best of the two paradigms and combine them in an approach that is at the same time data-efficient and cost-savvy. We do so by learning a probabilistic dynamics model and leveraging it as a prior for the intertwined model-free optimization. As a result, our approach can exploit the generality and structure of the dynamics model, but is also capable of ignoring its inevitable inaccuracies, by directly incorporating the evidence provided by the direct observation of the cost. 
As a proof-of-concept, we demonstrate on simulated tasks that our approach outperforms purely model-based and model-free approaches, as well as the approach of simply switching from a model-based to a model-free setting. Model-Based Task Transfer Learning(MBTTL) A model-based task transfer learning (MBTTL) method is presented. We consider a constrained nonlinear dynamical system and assume that a dataset of state and input pairs that solve a task T1 is available. Our objective is to find a feasible state-feedback policy for a second task, T2, by using stored data from T1. Our approach applies to tasks T2 which are composed of the same subtasks as T1, but in a different order. In this paper we formally introduce the definition of subtask, the MBTTL problem and provide examples of MBTTL in the fields of autonomous cars and manipulators. Then, a computationally efficient approach to solve the MBTTL problem is presented along with proofs of feasibility for constrained linear dynamical systems. Simulation results show the effectiveness of the proposed method. Model-Based Value Expansion Recent model-free reinforcement learning algorithms have proposed incorporating learned dynamics models as a source of additional data with the intention of reducing sample complexity. Such methods hold the promise of incorporating imagined data coupled with a notion of model uncertainty to accelerate the learning of continuous control tasks. Unfortunately, they rely on heuristics that limit usage of the dynamics model. We present model-based value expansion, which controls for uncertainty in the model by only allowing imagination to fixed depth. By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, we improve value estimation, which, in turn, reduces the sample complexity of learning. ModelDepot A platform for discovering, sharing, and discussing easy to use and pre-trained machine learning models. 
Model-Guided Iterative Sample Selection(MISS) Nowadays, sampling-based Approximate Query Processing (AQP) is widely regarded as a promising way to achieve interactivity in big data analytics. To build such an AQP system, finding the minimal sample size for a query regarding given error constraints in general, called Sample Size Optimization (SSO), is an essential yet unsolved problem. Ideally, the goal of solving the SSO problem is to achieve statistical accuracy, computational efficiency and broad applicability all at the same time. Existing approaches either make idealistic assumptions on the statistical properties of the query, or completely disregard them. This may result in overemphasizing only one of the three goals while neglecting the others. To overcome these limitations, we first examine carefully the statistical properties shared by common analytical queries. Then, based on the properties, we propose a linear model describing the relationship between sample sizes and the approximation errors of a query, which is called the error model. Then, we propose a Model-guided Iterative Sample Selection (MISS) framework to solve the SSO problem generally. Afterwards, based on the MISS framework, we propose a concrete algorithm, called $L^2$Miss, to find optimal sample sizes under the $L^2$ norm error metric. Moreover, we extend the $L^2$Miss algorithm to handle other error metrics. Finally, we show theoretically and empirically that the $L^2$Miss algorithm and its extensions achieve satisfactory accuracy and efficiency for a considerably wide range of analytical queries. Model-Implied Instrumental Variable(MIIV) Model-implied instrumental variables are the observed variables in the model that can serve as instrumental variables in a given equation. 
➚ “Instrumental Variable” MIIVsem Model-Implied Instrumental Variable – Generalized Method of Moments(MIIV-GMM) The common maximum likelihood (ML) estimator for structural equation models (SEMs) has optimal asymptotic properties under ideal conditions (e.g., correct structure, no excess kurtosis, etc.) that are rarely met in practice. This paper proposes model-implied instrumental variable – generalized method of moments (MIIV-GMM) estimators for latent variable SEMs that are more robust than ML to violations of both the model structure and distributional assumptions. Under less demanding assumptions, the MIIV-GMM estimators are consistent, asymptotically unbiased, asymptotically normal, and have an asymptotic covariance matrix. They are ‘distribution-free,’ robust to heteroscedasticity, and have overidentification goodness-of-fit J-tests with asymptotic chi-square distributions. In addition, MIIV-GMM estimators are ‘scalable’ in that they can estimate and test the full model or any subset of equations, and hence allow better pinpointing of those parts of the model that fit and do not fit the data. An empirical example illustrates MIIV-GMM estimators. Two simulation studies explore their finite sample properties and find that they perform well across a range of sample sizes. Model-Implied Instrumental Variable Two-Stage Bayesian Model Averaging(MIIV-2SBMA) Model-Implied Instrumental Variable Two-Stage Least Squares (MIIV-2SLS) is a limited information, equation-by-equation, non-iterative estimator for latent variable models. Associated with this estimator are equation specific tests of model misspecification. We propose an extension to the existing MIIV-2SLS estimator that utilizes Bayesian model averaging which we term Model-Implied Instrumental Variable Two-Stage Bayesian Model Averaging (MIIV-2SBMA). 
MIIV-2SBMA accounts for uncertainty in optimal instrument set selection, and provides powerful instrument specific tests of model misspecification and instrument strength. We evaluate the performance of MIIV-2SBMA against MIIV-2SLS in a simulation study and show that it has comparable performance in terms of parameter estimation. Additionally, our instrument specific overidentification tests developed within the MIIV-2SBMA framework show increased power to detect model misspecification over the traditional equation level tests of model misspecification. Finally, we demonstrate the use of MIIV-2SBMA using an empirical example. Model-Implied Instrumental Variable Two-Stage Least Squares(MIIV-2SLS) Modeling Higher Order Network Effects(MOHONE) Many knowledge graph embedding methods operate on triples and are therefore implicitly limited by a very local view of the entire knowledge graph. We present a new framework MOHONE to effectively model higher order network effects in knowledge-graphs, thus enabling one to capture varying degrees of network connectivity (from the local to the global). Our framework is generic, explicitly models the network scale, and captures two different aspects of similarity in networks: (a) shared local neighborhood and (b) structural role-based similarity. First, we introduce methods that learn network representations of entities in the knowledge graph capturing these varied aspects of similarity. We then propose a fast, efficient method to incorporate the information captured by these network representations into existing knowledge graph embeddings. We show that our method consistently and significantly improves the performance on link prediction of several different knowledge-graph embedding methods including TRANSE, TRANSD, DISTMULT, and COMPLEX(by at least 4 points or 17% in some cases). ModelOps ModelOps is a DevOps variation that is purpose-fit for analytics. 
Forsgren and Humble, in their excellent DevOps book, ACCELERATE, demonstrate that faster software deployment and shorter delays between code commit and promotion correlate with higher software quality, superior employee satisfaction, and superior results. Based on extensive study, they list what’s required for continuous delivery of software. While data science and software development methods differ in important respects, Forsgren and Humble’s list is nonetheless foundational: version control, deployment automation, continuous integration, test automation, test data management, shift left on security, loosely coupled architecture, empowered teams (choosing their own tools), monitoring, and proactive notification. Production IT is a complex, fast-moving, high volume, governed and locked down universe where data scientists’ models have to be embedded into work streams that connect to data pipelines and endpoint applications and packaged with right-sized compute resources. The ModelOps team is the nexus of communication between data scientists, data engineers, application owners, and infrastructure owners. It coordinates proper hand-offs and execution of go-live protocol. ModelOps’ responsibilities include workflow automation, version management, promotions, compute resource management, monitoring, and scaling and tuning. For organizations that are serious about machine learning and analytics at scale, ModelOps is a must-have. Model-Protected Multi-Task Learning Multi-task learning (MTL) refers to the paradigm of learning multiple related tasks together. By contrast, single-task learning (STL) learns each individual task independently. MTL often leads to better trained models because they can leverage the commonalities among related tasks. However, because MTL algorithms will ‘transmit’ information on different models across different tasks, MTL poses a potential security risk. 
Specifically, an adversary may participate in the MTL process through a participating task, thereby acquiring the model information for another task. Previously proposed privacy-preserving MTL methods protect data instances rather than models, and some of them may underperform in comparison with STL methods. In this paper, we propose a privacy-preserving MTL framework to prevent the information on each model from leaking to other models based on a perturbation of the covariance matrix of the model matrix, and we study two popular MTL approaches for instantiation, namely, MTL approaches for learning the low-rank and group-sparse patterns of the model matrix. Our methods are built upon tools for differential privacy. Privacy guarantees and utility bounds are provided. Heterogeneous privacy budgets are considered. Our algorithms can be guaranteed not to underperform compared with STL methods. Experiments demonstrate that our algorithms outperform existing privacy-preserving MTL methods on the proposed model-protection problem. Moderated Network Model(MNM) Pairwise network models such as the Gaussian Graphical Model (GGM) are a powerful and intuitive way to analyze dependencies in multivariate data. A key assumption of the GGM is that each pairwise interaction is independent of the values of all other variables. However, in psychological research this is often implausible. In this paper, we extend the GGM by allowing each pairwise interaction between two variables to be moderated by (a subset of) all other variables in the model, and thereby introduce a Moderated Network Model (MNM). We show how to construct the MNM and propose an L1-regularized nodewise regression approach to estimate it. We provide performance results in a simulation study and show that MNMs outperform the split-sample based methods Network Comparison Test (NCT) and Fused Graphical Lasso (FGL) in detecting moderation effects. 
Finally, we provide a fully reproducible tutorial on how to estimate MNMs with the R-package mgm and discuss possible issues with model misspecification. Moderated Regression ➘ “Moderation” pequod Moderation In statistics and regression analysis, moderation occurs when the relationship between two variables depends on a third variable. The third variable is referred to as the moderator variable or simply the moderator. The effect of a moderating variable is characterized statistically as an interaction; that is, a categorical (e.g., sex, race, class) or quantitative (e.g., level of reward) variable that affects the direction and/or strength of the relation between dependent and independent variables. Specifically within a correlational analysis framework, a moderator is a third variable that affects the zero-order correlation between two other variables, or the value of the slope of the dependent variable on the independent variable. In analysis of variance (ANOVA) terms, a basic moderator effect can be represented as an interaction between a focal independent variable and a factor that specifies the appropriate conditions for its operation. pequod Modha-Spangler Clustering Modha-Spangler clustering, which uses a brute-force strategy to maximize the cluster separation simultaneously in the continuous and categorical variables. kamila Modified Generative Adversarial Network(MSGAN) Correcting measured detector-level distributions to particle-level is essential to make data usable outside the experimental collaborations. The term unfolding is used to describe this procedure. A new method of unfolding the data using a modified Generative Adversarial Network (MSGAN) is presented here. Applied to various distributions, it is demonstrated to perform at par with, or better than, currently used methods. Modified Multidimensional Scaling Multidimensional scaling is an important dimension reduction tool in statistics and machine learning. 
Yet few theoretical results characterizing its statistical performance exist, not to mention any in high dimensions. By considering a unified framework that includes low, moderate and high dimensions, we study multidimensional scaling in the setting of clustering noisy data. Our results suggest that, in order to achieve consistent estimation of the embedding scheme, the classical multidimensional scaling needs to be modified, especially when the noise level increases. To this end, we propose modified multidimensional scaling, which applies a nonlinear transformation to the sample eigenvalues. The nonlinear transformation depends on the dimensionality, sample size and unknown moment. We show that modified multidimensional scaling followed by various clustering algorithms can achieve exact recovery, i.e., all the cluster labels can be recovered correctly with probability tending to one. Numerical simulations and two real data applications lend strong support to our proposed methodology. As a byproduct, we unify and improve existing results on the $\ell_{\infty}$ bound for eigenvectors under only low bounded moment conditions. This can be of independent interest. Modified Sequential Probability Ratio Test(MSPRT) In an MSPRT design, the maximum sample size of an experiment is fixed prior to the start of the experiment, the alternative hypothesis used to define the rejection region of the test is derived from the size of the test (Type I error) and the maximum available sample size (N), and the targeted Type II error (equal to 1 minus the power) is also prespecified. Given these values, the MSPRT is defined in a manner very similar to Wald’s initial proposal. This test can reduce the average sample size required to perform statistical hypothesis tests at the specified levels of significance and power. 
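Since the MSPRT is described as closely following Wald's original proposal, a minimal sketch of Wald's classical SPRT for Bernoulli observations may help make the idea concrete. This is not the MSPRT itself: the function name `wald_sprt`, the simple point hypotheses `p0` and `p1`, and the optional cap `max_n` are assumptions made for illustration, and the cap only stops sampling rather than reproducing the MSPRT's derived alternative or terminal decision rule.

```python
import math


def wald_sprt(observations, p0, p1, alpha=0.05, beta=0.2, max_n=None):
    """Wald's SPRT for 0/1 data, testing H0: p = p0 against H1: p = p1.

    alpha is the targeted Type I error and beta the targeted Type II error.
    max_n is a hypothetical cap on the number of observations used.
    Returns (decision, number_of_observations_used).
    """
    upper = math.log((1 - beta) / alpha)   # crossing this boundary rejects H0
    lower = math.log(beta / (1 - alpha))   # crossing this boundary accepts H0
    llr = 0.0                              # running log-likelihood ratio
    n = 0
    for x in observations:
        if max_n is not None and n >= max_n:
            break
        n += 1
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject H0", n
        if llr <= lower:
            return "accept H0", n
    return "undecided", n


print(wald_sprt([1] * 10, p0=0.5, p1=0.8))           # ('reject H0', 6)
print(wald_sprt([0] * 10, p0=0.5, p1=0.8))           # ('accept H0', 2)
print(wald_sprt([1] * 10, p0=0.5, p1=0.8, max_n=3))  # ('undecided', 3)
```

The test stops as soon as the accumulated evidence crosses either boundary, which is where the reduction in average sample size comes from.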
MSPRT ModSpace Mango Solutions have developed a configurable software application to allow statisticians, programmers and analysts to centralise and manage the often-complex statistical knowledge (held in SAS, R, Matlab and other languages, documents, data, images etc). The application was designed to provide a centralised platform for analysts to store, share and reuse complex analytical IP in an approach which helps enforce business and coding standards and promote collaboration and continual improvement within teams. ModSpace has proved especially valuable for teams working in diverse geographic locations as it promotes increased interaction between sites and individuals. The easy to use tool contains intuitive searching capabilities, enabling analysts to re-use their code and reduce the duplication of effort. The system also supports quality assurance with the use of audit trails, version control and an archiving functionality, which allows valuable historic information to be accessed without interfering with day to day activities. The system can be configured for different coding style templates which promote standards and can identify current/legacy and customer specific standards. Managers are also able to take advantage of the powerful reporting environment which allows them to track usage within their teams, spot trends and identify areas of process improvement. http://…/#sthash.ZGls4IJx.dpuf Modular Attention Network(MAttNet) In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. 
In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show that MAttNet outperforms previous state-of-the-art methods by a large margin on both bounding-box-level and pixel-level comprehension tasks. Modular Centrality Identifying influential nodes in a network is a fundamental issue due to its wide applications, such as accelerating information diffusion or halting virus spreading. Many measures based on the network topology have emerged over the years to identify influential nodes such as Betweenness, Closeness, and Eigenvalue centrality. However, although most real-world networks are modular, few measures exploit this property. Recent works have shown that it has a significant effect on the dynamics on networks. In a modular network, a node has two types of influence: a local influence (on the nodes of its community) through its intra-community links and a global influence (on the nodes in other communities) through its inter-community links. Depending on the strength of the community structure, these two components are more or less influential. Based on this idea, we propose to extend all the standard centrality measures defined for networks with no community structure to modular networks. The so-called ‘Modular centrality’ is a two-dimensional vector. Its first component quantifies the local influence of a node in its community while the second component quantifies its global influence on the other communities of the network. 
In order to illustrate the effectiveness of the Modular centrality extensions, comparisons with their scalar counterparts are performed in an epidemic process setting. Simulation results using the Susceptible-Infected-Recovered (SIR) model on synthetic networks with controlled community structure give a clear idea of the relation between the strength of the community structure and the major type of influence (global/local). Furthermore, experiments on real-world networks demonstrate the merit of this approach. Modular Generative Adversarial Network(ModularGAN) Existing methods for multi-domain image-to-image translation (or generation) attempt to directly map an input image (or a random vector) to an image in one of the output domains. However, most existing methods have limited scalability and robustness, since they require building independent models for each pair of domains in question. This leads to two significant shortcomings: (1) the need to train an exponential number of pairwise models, and (2) the inability to leverage data from other domains when training a particular pairwise mapping. Inspired by recent work on module networks, this paper proposes ModularGAN for multi-domain image generation and image-to-image translation. ModularGAN consists of several reusable and composable modules that carry out different functions (e.g., encoding, decoding, transformations). These modules can be trained simultaneously, leveraging data from all domains, and then combined to construct specific GAN networks at test time, according to the specific image translation task. This leads to ModularGAN’s superior flexibility of generating (or translating to) an image in any desired domain. Experimental results demonstrate that our model not only presents compelling perceptual results but also outperforms state-of-the-art methods on multi-domain facial attribute transfer. 
Modular Meta-Learning Many prediction problems, such as those that arise in the context of robotics, have a simplifying underlying structure that could accelerate learning. In this paper, we present a strategy for learning a set of neural network modules that can be combined in different ways. We train different modular structures on a set of related tasks and generalize to new tasks by composing the learned modules in new ways. We show this improves performance in two robotics-related problems. Modular Ontology Design Library(MODL) Pattern-based, modular ontologies have several beneficial properties that lend themselves to FAIR data practices, especially as it pertains to Interoperability and Reusability. However, developing such ontologies has a high upfront cost, e.g. reusing a pattern is predicated upon being aware of its existence in the first place. Thus, to help overcome these barriers, we have developed MODL: a modular ontology design library. MODL is a curated collection of well-documented ontology design patterns, drawn from a wide variety of interdisciplinary use-cases. In this paper we present MODL as a resource, discuss its use, and provide some examples of its contents. Modular, Optimal Learning Testing Environment(MOLTE) We address the relative paucity of empirical testing of learning algorithms (of any type) by introducing a new public-domain, Modular, Optimal Learning Testing Environment (MOLTE) for Bayesian ranking and selection problem, stochastic bandits or sequential experimental design problems. The Matlab-based simulator allows the comparison of a number of learning policies (represented as a series of .m modules) in the context of a wide range of problems (each represented in its own .m module) which makes it easy to add new algorithms and new test problems. State-of-the-art policies and various problem classes are provided in the package. The choice of problems and policies is guided through a spreadsheet-based interface. 
Different graphical metrics are included. MOLTE is designed to be compatible with parallel computing to scale up from local desktop to clusters and clouds. We offer MOLTE as an easy-to-use tool for the research community that will make it possible to perform much more comprehensive testing, spanning a broader selection of algorithms and test problems. We demonstrate the capabilities of MOLTE through a series of comparisons of policies on a starter library of test problems. We also address the problem of tuning and constructing priors that have been largely overlooked in optimal learning literature. We envision MOLTE as a modest spur to provide researchers an easy environment to study interesting questions involved in optimal learning. Modularity Modularity is one measure of the structure of networks or graphs. It was designed to measure the strength of division of a network into modules (also called groups, clusters or communities). Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules. Modularity is often used in optimization methods for detecting community structure in networks. However, it has been shown that modularity suffers a resolution limit and, therefore, it is unable to detect small communities. Biological networks, including animal brains, exhibit a high degree of modularity. Modulated Policy Hierarchies(MPH) Solving tasks with sparse rewards is a main challenge in reinforcement learning. While hierarchical controllers are an intuitive approach to this problem, current methods often require manual reward shaping, alternating training phases, or manually defined sub tasks. We introduce modulated policy hierarchies (MPH), that can learn end-to-end to solve tasks from sparse rewards. To achieve this, we study different modulation signals and exploration for hierarchical controllers. 
Specifically, we find that communicating via bit-vectors is more efficient than selecting one out of multiple skills, as it enables mixing between them. To facilitate exploration, MPH uses its different time scales for temporally extended intrinsic motivation at each level of the hierarchy. We evaluate MPH on the robotics tasks of pushing and sparse block stacking, where it outperforms recent baselines. Module Graphical Lasso(MGL) We propose module graphical lasso (MGL), an aggressive dimensionality reduction and network estimation technique for a high-dimensional Gaussian graphical model (GGM). MGL achieves scalability, interpretability and robustness by exploiting the modularity property of many real-world networks. Variables are organized into tightly coupled modules and a graph structure is estimated to determine the conditional independencies among modules. MGL iteratively learns the module assignment of variables, the latent variables, each corresponding to a module, and the parameters of the GGM of the latent variables. In synthetic data experiments, MGL outperforms the standard graphical lasso and three other methods that incorporate latent variables into GGMs. Moment Alignment Network(MAN) This research strives for natural language moment retrieval in long, untrimmed video streams. The problem is nevertheless not trivial, especially when a video contains multiple moments of interest and the language describes complex temporal dependencies, which often happens in real scenarios. We identify two crucial challenges: semantic misalignment and structural misalignment. However, existing approaches treat different moments separately and do not explicitly model complex moment-wise temporal relations. In this paper, we present Moment Alignment Network (MAN), a novel framework that unifies the candidate moment encoding and temporal structural reasoning in a single-shot feed-forward network. 
MAN naturally assigns candidate moment representations aligned with language semantics over different temporal locations and scales. Most importantly, we propose to explicitly model moment-wise temporal relations as a structured graph and devise an iterative graph adjustment network to jointly learn the best structure in an end-to-end manner. We evaluate the proposed approach on two challenging public benchmarks Charades-STA and DiDeMo, where our MAN significantly outperforms the state-of-the-art by a large margin. Moment Matching Method The moment-matching methods are also called the Krylov subspace methods, as well as Padé approximation methods. They belong to the Projection based MOR methods. These methods are applicable to non-parametric linear time invariant systems, often descriptor systems … momentchi2 Momentum-added Stochastic Solver(MaSS) In this paper we introduce MaSS (Momentum-added Stochastic Solver), an accelerated SGD method for optimizing over-parameterized networks. Our method is simple and efficient to implement and does not require changing parameters or computing full gradients in the course of optimization. We provide a detailed theoretical analysis for convergence and parameter selection including their dependence on the mini-batch size in the quadratic case. We also provide theoretical convergence results for a more general convex setting. We provide an experimental evaluation showing strong performance of our method in comparison to Adam and SGD for several standard architectures of deep networks including ResNet, convolutional and fully connected networks. We also show its performance for convex kernel machines. Monalytics To effectively manage large-scale data centers and utility clouds, operators must understand current system and application behaviors. This requires continuous monitoring along with online analysis of the data captured by the monitoring system. 
As a result, there is a need to move to systems in which both tasks can be performed in an integrated fashion, thereby better able to drive online system management. Coining the term ‘monalytics’ to refer to the combined monitoring and analysis systems used for managing large-scale data center systems, this paper articulates principles for monalytics systems, describes software approaches for implementing them, and provides experimental evaluations justifying principles and implementation approach. Specific technical contributions include consideration of scalability across both ‘space’ and ‘time’, the ability to dynamically deploy and adjust monalytics functionality at multiple levels of abstraction in target systems, and the capability to operate across the range of application to hypervisor layers present in large-scale data center or cloud computing systems. Our monalytics implementation targets virtualized systems and cloud infrastructures, via the integration of its functionality into the Xen hypervisor. MongoDB MongoDB (from humongous) is a cross-platform document-oriented database. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. Released under a combination of the GNU Affero General Public License and the Apache License, MongoDB is free and open-source software. First developed by the software company 10gen (now MongoDB Inc.) in October 2007 as a component of a planned platform as a service product, the company shifted to an open source development model in 2009, with 10gen offering commercial support and other services. Since then, MongoDB has been adopted as backend software by a number of major websites and services, including Craigslist, eBay, Foursquare, SourceForge, Viacom, and The New York Times among others. 
As of 2014, MongoDB was the most popular NoSQL database system. Monica Can you remember the names of the children of all your friends? Can you remember the wedding anniversary of your brother? Can you tell the last time you called your grandmother and what you talked about? Monica lets you quickly and easily log all that information so you can be a better friend, family member or spouse. MonoCorpus MonoCorpus is a note taking app for software and machine learning engineers meant to encourage learning, sharing, and easier development. Increase documentation for yourself and your team without slowing your velocity. Take notes as part of your process instead of dedicating time to writing them. Monoidal Network In this paper we define and study the notion of a monoidal network, which consists of a commutative ring $R$ and a collection of groups $\Gamma_I$, indexed by the ideals of $R$, with $\Gamma_I$ acting on the quotient $R/I$ and satisfying a certain lifting condition. The examination of these objects is largely motivated by, and initially arose from, the study of the union-closed sets conjecture. This connection is made precise and other aspects of these structures are investigated. Monotone Data Augmentation(MDA) An efficient monotone data augmentation (MDA) algorithm is proposed for missing data imputation for incomplete multivariate nonnormal data that may contain variables of different types, and are modeled by a sequence of regression models including the linear, binary logistic, multinomial logistic, proportional odds, Poisson, negative binomial, skew-normal, skew-t regressions or a mixture of these models. The MDA algorithm is applied to the sensitivity analyses of longitudinal trials with nonignorable dropout using the controlled pattern imputations that assume the treatment effect reduces or disappears after subjects in the experimental arm discontinue the treatment. 
We also describe a heuristic approach to implement the controlled imputation, in which the fully conditional specification method is used to impute the intermediate missing data to create a monotone missing pattern, and the missing data after dropout are then imputed according to the assumed nonignorable mechanisms. The proposed methods are illustrated by simulation and real data analyses. Monotonic Classification Monotonic classification problems mean that both feature values and class labels are ordered and monotonicity relationships exist between some features and the decision label. Monotonic classification: an overview on algorithms, performance measures and data sets Monotonic Optimal Binning(MOB) Monotonic Optimal Binning (MOB) for Consumer Credit Risk Scorecard Development Monte Carlo Dependency Estimation(MCDE) Estimating the dependency of variables is a fundamental task in data analysis. Identifying the relevant attributes in databases leads to better data understanding and also improves the performance of learning algorithms, both in terms of runtime and quality. In data streams, dependency monitoring provides key insights into the underlying process, but is challenging. In this paper, we propose Monte Carlo Dependency Estimation (MCDE), a theoretical framework to estimate multivariate dependency in static and dynamic data. MCDE quantifies dependency as the average discrepancy between marginal and conditional distributions via Monte Carlo simulations. Based on this framework, we present Mann-Whitney P (MWP), a novel dependency estimator. We show that MWP satisfies a number of desirable properties and can accommodate any kind of numerical data. We demonstrate the superiority of our estimator by comparing it to the state-of-the-art multivariate dependency measures. 
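The MCDE idea of averaging the discrepancy between marginal and conditional distributions over random Monte Carlo slices can be illustrated with a small sketch. This is a simplified illustration in the spirit of that description, not the paper's MWP estimator: the function names, the slice width rule, the use of a normal-approximation Mann-Whitney statistic without tie correction, and the conversion of that statistic into a score in [0, 1] are all assumptions made here.

```python
import math
import random


def mann_whitney_z(a, b):
    """Standardized Mann-Whitney U statistic of sample a against sample b
    (normal approximation, average ranks for ties, no tie correction)."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    rank_sum_a = 0.0
    i, n = 0, len(pooled)
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(a), len(b)
    u = rank_sum_a - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (u - mean) / sd


def mc_dependency(data, iterations=200, alpha=0.5, seed=0):
    """Average discrepancy, over random slices, between the values of a
    reference column inside and outside the slice. data is a list of rows."""
    n, d = len(data), len(data[0])
    if d < 2:
        raise ValueError("need at least two columns")
    rng = random.Random(seed)
    # Row indices sorted by each column, so a slice is a contiguous rank block.
    order = [sorted(range(n), key=lambda i, c=c: data[i][c]) for c in range(d)]
    # Assumed width rule: keep roughly a fraction alpha of the rows overall.
    width = min(n, max(2, math.ceil(n * alpha ** (1.0 / (d - 1)))))
    total = 0.0
    for _ in range(iterations):
        ref = rng.randrange(d)            # column whose distribution is compared
        selected = set(range(n))
        for c in range(d):
            if c != ref:                  # condition on a random slice of column c
                start = rng.randrange(n - width + 1)
                selected &= set(order[c][start:start + width])
        inside = [data[i][ref] for i in selected]
        outside = [data[i][ref] for i in range(n) if i not in selected]
        if len(inside) < 2 or len(outside) < 2:
            continue                      # too few points: contributes zero
        z = mann_whitney_z(inside, outside)
        total += math.erf(abs(z) / math.sqrt(2.0))  # 1 - two-sided p-value
    return total / iterations
```

On strongly dependent columns most slices shift the reference column's distribution, so the averaged score tends towards 1, whereas for independent columns it stays around the middle of the range.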
Monte Carlo Fusion This paper proposes a new theory and methodology to tackle the problem of unifying distributed analyses and inferences on shared parameters from multiple sources, into a single coherent inference. This surprisingly challenging problem arises in many settings (for instance, expert elicitation, multi-view learning, distributed ‘big data’ problems etc.), but to date the framework and methodology proposed in this paper (Monte Carlo Fusion) is the first general approach which avoids any form of approximation error in obtaining the unified inference. In this paper we focus on the key theoretical underpinnings of this new methodology, and simple (direct) Monte Carlo interpretations of the theory. There is considerable scope to tailor the theory introduced in this paper to particular application settings (such as the big data setting), construct efficient parallelised schemes, understand the approximation and computational efficiencies of other such unification paradigms, and explore new theoretical and methodological directions. Monte Carlo Graph Search(MCGS) Recently, there has been great interest in Monte Carlo Tree Search (MCTS) in AI research. Although the sequential version of MCTS has been studied widely, its parallel counterpart still lacks systematic study. This leads us to the following questions: how to design efficient parallel MCTS (or more general cases) algorithms with rigorous theoretical guarantees? Is it possible to achieve linear speedup? In this paper, we consider the search problem on a more general acyclic one-root graph (namely, Monte Carlo Graph Search (MCGS)), which generalizes MCTS. We develop a parallel algorithm (P-MCGS) to assign multiple workers to investigate appropriate leaf nodes simultaneously. Our analysis shows that the P-MCGS algorithm achieves linear speedup and that the sample complexity is comparable to its sequential counterpart. 
Monte Carlo Tree Search(MCTS) In computer science, Monte Carlo tree search (MCTS) is a heuristic search algorithm for making decisions in some decision processes, most notably employed in game playing. The leading example of its use is in contemporary computer Go programs, but it is also used in other board games, as well as real-time video games and non-deterministic games such as poker. A Survey of Monte Carlo Tree Search Methods ➘ “Monte Carlo Graph Search” Montreal Data License(MDL) This paper provides a taxonomy for the licensing of data in the fields of artificial intelligence and machine learning. The paper’s goal is to build towards a common framework for data licensing akin to the licensing of open source software. Increased transparency and resolving conceptual ambiguities in existing licensing language are two noted benefits of the approach proposed in the paper. In parallel, such benefits may help foster fairer and more efficient markets for data through bringing about clearer tools and concepts that better define how data can be used in the fields of AI and ML. The paper’s approach is summarized in a new family of data license language – the Montreal Data License (MDL). Alongside this new license, the authors and their collaborators have developed a web-based tool to generate license language espousing the taxonomies articulated in this paper. MOOC Replication Framework(MORF) The MOOC Replication Framework (MORF) is a novel software system for feature extraction, model training/testing, and evaluation of predictive dropout models in Massive Open Online Courses (MOOCs). MORF makes large-scale replication of complex machine-learned models tractable and accessible for researchers, and enables public research on privacy-protected data. 
It does so by focusing on the high-level operations of an extract-train-test-evaluate workflow, and enables researchers to encapsulate their implementations in portable, fully reproducible software containers which are executed on data with a known schema. MORF’s workflow allows researchers to use data in analysis without providing them access to the underlying data directly, preserving privacy and data security. During execution, containers are sandboxed for security and to prevent data leakage, and parallelized for efficiency, allowing researchers to create and test new models rapidly, on large-scale multi-institutional datasets that were previously inaccessible to most researchers. MORF is provided both as a Python API (the MORF Software), for institutions to use on their own MOOC data, and in a platform-as-a-service (PaaS) model with a web API and a high-performance computing environment (the MORF Platform). MOPLS-N Multi-Objective Optimization (MOO) is very difficult for expensive functions because most current MOO methods rely on a large number of function evaluations to get an accurate solution. We address this problem with surrogate approximation and parallel computation. We develop an MOO algorithm MOPLS-N for expensive functions that combines iteratively updated surrogate approximations of the objective functions with a structure for efficiently selecting a population of $N$ points so that the expensive objectives for all points are simultaneously evaluated on $N$ processors in each iteration. MOPLS incorporates Radial Basis Function (RBF) approximation, Tabu Search and local candidate search around multiple points to strike a balance between exploration, exploitation and diversification during each algorithm iteration. 
Eleven test problems (with 8 to 24 decision variables) and two real-world watershed problems are used to compare performance of MOPLS to ParEGO, GOMORS, Borg, MOEA/D, and NSGA-III on a limited budget of evaluations with between 1 (serial) and 64 processors. MOPLS in serial is better than all non-RBF serial methods tested. Parallel speedup of MOPLS is higher than all other parallel algorithms with 16 and 64 processors. With both algorithms on 64 processors MOPLS is at least 2 times faster than NSGA-III on the watershed problems. Moran’s I In statistics, Moran’s I is a measure of spatial autocorrelation developed by Patrick Alfred Pierce Moran. Spatial autocorrelation is characterized by a correlation in a signal among nearby locations in space. Spatial autocorrelation is more complex than one-dimensional autocorrelation because spatial correlation is multi-dimensional (i.e. 2 or 3 dimensions of space) and multi-directional. Irescale Morphed Learning The concern of potential privacy violation has prevented efficient use of big data for improving deep learning based applications. In this paper, we propose Morphed Learning, a privacy-preserving technique for deep learning based on data morphing that allows data owners to share their data without leaking sensitive privacy information. Morphed Learning allows the data owners to send securely morphed data and provides the server with an Augmented Convolutional layer to train the network on morphed data without performance loss. Morphed Learning has these three features: (1) Strong protection against reverse-engineering on the morphed data; (2) Acceptable computational and data transmission overhead with no correlation to the depth of the neural network; (3) No degradation of the neural network performance. 
Theoretical analyses on the CIFAR-10 dataset and the VGG-16 network show that our method is capable of providing 10^89 morphing possibilities with only 5% computational overhead and 10% transmission overhead under a limited-knowledge attack scenario. Further analyses also proved that our method can offer the same resilience against a full-knowledge attack if more resources are provided. Morpheo Morpheo is a transparent and secure machine learning platform collecting and analysing large datasets. It aims at building state-of-the-art prediction models in various fields where data are sensitive. Indeed, it offers strong privacy of data and algorithm, by preventing anyone from reading the data, apart from the owner and the chosen algorithms. Computations in Morpheo are orchestrated by a blockchain infrastructure, thus offering total traceability of operations. Morpheo aims at building an attractive economic ecosystem around data prediction by channelling crypto-money from prediction requests to useful data and algorithms providers. Morpheo is designed to handle multiple data sources in a transfer learning approach in order to mutualize knowledge acquired from large datasets for applications with smaller but similar datasets. Morphing Network ➘ “Text Morphing” MorphNet We introduce MorphNet, a single model that combines morphological analysis and disambiguation. Traditionally, analysis of morphologically complex languages has been performed in two stages: (i) A morphological analyzer based on finite-state transducers produces all possible morphological analyses of a word, (ii) A statistical disambiguation model picks the correct analysis based on the context for each word. MorphNet uses a sequence-to-sequence recurrent neural network to combine analysis and disambiguation. We show that when trained with text labeled with correct morphological analyses, MorphNet obtains state-of-the-art or comparable results for nine different datasets in seven different languages. 
MortonNet We present a self-supervised task on point clouds, in order to learn meaningful point-wise features that encode local structure around each point. Our self-supervised network, named MortonNet, operates directly on unstructured/unordered point clouds. Using a multi-layer RNN, MortonNet predicts the next point in a point sequence created by a popular and fast Space Filling Curve, the Morton-order curve. The final RNN state (coined Morton feature) is versatile and can be used in generic 3D tasks on point clouds. In fact, we show how Morton features can be used to significantly improve performance (+3% for 2 popular semantic segmentation algorithms) in the task of semantic segmentation of point clouds on the challenging and large-scale S3DIS dataset. We also show how MortonNet trained on S3DIS transfers well to another large-scale dataset, vKITTI, leading to an improvement over state-of-the-art of 3.8%. Finally, we use Morton features to train a much simpler and more stable model for part segmentation in ShapeNet. Our results show how our self-supervised task results in features that are useful for 3D segmentation tasks, and generalize well to other datasets. Mother Compact Recurrent Memory(MCRM) LSTMs and GRUs are the most common recurrent neural network architectures used to solve temporal sequence problems. The two architectures have differing data flows dealing with a common component called the cell state also referred to as the memory. We attempt to enhance the memory by presenting a biologically inspired modification that we call the Mother Compact Recurrent Memory MCRM. MCRMs are a type of a nested LSTM-GRU architecture where the cell state is the GRU’s hidden state. The relationship between the womb and the fetus is analogous to the relationship between the LSTM and GRU inside MCRM in that the fetus is connected to its womb through the umbilical cord. The umbilical cord consists of two arteries and one vein. 
The two arteries are considered as an input to the fetus, which is analogous to the concatenation of the forget gate and input gate from the LSTM. The vein is the output from the fetus, which plays the role of the hidden state of the GRU. Because MCRMs have this type of nesting, they have a compact memory pattern consisting of neurons that act explicitly in both long-term and short-term fashions. For some specific tasks, empirical results show that MCRMs outperform previously used architectures. Motif Convolutional Network(MCN) Following the success of deep convolutional networks in various vision and speech related tasks, researchers have started investigating generalizations of the well-known technique for graph-structured data. A recently proposed method called Graph Convolutional Networks has been able to achieve state-of-the-art results in the task of node classification. However, since the proposed method relies on localized first-order approximations of spectral graph convolutions, it is unable to capture higher-order interactions between nodes in the graph. In this work, we propose a motif-based graph attention model, called Motif Convolutional Networks (MCNs), which generalizes past approaches by using weighted multi-hop motif adjacency matrices to capture higher-order neighborhoods. A novel attention mechanism is used to allow each individual node to select the most relevant neighborhood to apply its filter. Experiments show that our proposed method is able to achieve state-of-the-art results on the semi-supervised node classification task. Motif Correlation Clustering Motivated by applications in social and biological network analysis, we introduce a new form of agnostic clustering termed motif correlation clustering, which aims to minimize the cost of clustering errors associated with both edges and higher-order network structures. 
The problem may be succinctly described as follows: Given a complete graph $G$, partition the vertices of the graph so that certain predetermined ‘important’ subgraphs mostly lie within the same cluster, while ‘less relevant’ subgraphs are allowed to lie across clusters. Our contributions are as follows: We first introduce several variants of motif correlation clustering and then show that these clustering problems are NP-hard. We then proceed to describe polynomial-time clustering algorithms that provide constant approximation guarantees for the problems at hand. Despite following the frequently used LP relaxation and rounding procedure, the algorithms involve a sophisticated and carefully designed neighborhood growing step that combines information about both edge and motif structures. We conclude with several examples illustrating the performance of the developed algorithms on synthetic and real networks. Motion Planning Network(MPNet) Fast and efficient motion planning algorithms are crucial for many state-of-the-art robotics applications such as self-driving cars. Existing motion planning methods such as RRT*, A*, and D* become ineffective as their computational complexity increases exponentially with the dimensionality of the motion planning problem. To address this issue, we present a novel neural network-based planning algorithm which generates end-to-end collision-free paths irrespective of the obstacles’ geometry. The proposed method, called MPNet (Motion Planning Network), comprises a Contractive Autoencoder, which encodes the given workspaces directly from a point cloud measurement, and a deep feedforward neural network, which takes the workspace encoding and the start and goal configurations and generates end-to-end feasible motion trajectories for the robot to follow. We evaluate MPNet on multiple planning problems such as planning of a point-mass robot, rigid-body, and 7 DOF Baxter robot manipulators in various 2D and 3D environments. 
The results show that MPNet is not only consistently computationally efficient in all 2D and 3D environments but also shows remarkable generalization to completely unseen environments. The results also show that the computation time of MPNet consistently remains less than 1 second, which is significantly lower than that of existing state-of-the-art motion planning algorithms. Furthermore, through transfer learning, the MPNet trained in one scenario (e.g., indoor living places) can also quickly adapt to new scenarios (e.g., factory floors) with a small amount of data. Motion Transformation Variational Auto-Encoder(MT-VAE) Long-term human motion can be represented as a series of motion modes (motion sequences that capture short-term temporal dynamics) with transitions between them. We leverage this structure and present novel Motion Transformation Variational Auto-Encoders (MT-VAE) for learning motion sequence generation. Our model jointly learns a feature embedding for motion modes (that the motion sequence can be reconstructed from) and a feature transformation that represents the transition of one motion mode to the next motion mode. Our model is able to generate multiple diverse and plausible motion sequences in the future from the same input. We apply our approach to both facial and full body motion, and demonstrate applications like analogy-based motion transfer and video synthesis. Mountain Plot A mountain plot (or “folded empirical cumulative distribution plot”) is created by computing a percentile for each ranked difference between a new method and a reference method. To get a folded plot, the following transformation is performed for all percentiles above 50: percentile = 100 - percentile. These percentiles are then plotted against the differences between the two methods (Krouwer & Monti, 1995). The mountain plot is a useful complementary plot to the Bland & Altman plot. 
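As a minimal sketch of the folding step described above, the following Python function computes the points of a mountain plot from paired measurements. The function name `mountain_plot_points` and the percentile convention (rank divided by sample size, times 100) are assumptions made for illustration; other percentile conventions exist and would shift the values slightly.

```python
from typing import List, Sequence, Tuple


def mountain_plot_points(
    new: Sequence[float], reference: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (difference, folded percentile) pairs for a mountain plot.

    Assumption: the percentile of the k-th smallest difference out of n
    is 100 * k / n. This is one common convention, not the only one.
    """
    if len(new) != len(reference):
        raise ValueError("new and reference must have the same length")
    # Rank the differences between the new and the reference method.
    diffs = sorted(n - r for n, r in zip(new, reference))
    count = len(diffs)
    points = []
    for rank, diff in enumerate(diffs, start=1):
        percentile = 100.0 * rank / count
        # Fold: percentiles above 50 are reflected back down.
        if percentile > 50.0:
            percentile = 100.0 - percentile
        points.append((diff, percentile))
    return points


if __name__ == "__main__":
    pts = mountain_plot_points([10.5, 9.0, 11.0, 10.0], [10.0, 10.0, 10.0, 10.0])
    for diff, pct in pts:
        print(f"difference={diff:+.1f}  folded percentile={pct:.1f}")
```

Plotting the folded percentile (vertical axis) against the difference (horizontal axis) gives the characteristic peak near the median difference.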
In particular, the mountain plot offers the following advantages: · It is easier to find the central 95% of the data, even when the data are not Normally distributed. · Different distributions can be compared more easily. mountainplot Movie Intelligent Recommender Agent(MIRA) The human mind is still an unknown process of neuroscience in many aspects. Nevertheless, for decades the scientific community has proposed computational models that try to simulate their parts, specific applications, or their behavior in different situations. The most complete model in this line is undoubtedly the LIDA model, proposed by Stan Franklin with the aim of serving as a generic computational architecture for several applications. The present project is inspired by the LIDA model to apply it to the process of movie recommendation, the model called MIRA (Movie Intelligent Recommender Agent) presented percentages of precision similar to a traditional model when submitted to the same assay conditions. Moreover, the proposed model reinforced the precision indexes when submitted to tests with volunteers, proving once again its performance as a cognitive model, when executed with small data volumes. Considering that the proposed model achieved a similar behavior to the traditional models under conditions expected to be similar for natural systems, it can be said that MIRA reinforces the applicability of LIDA as a path to be followed for the study and generation of computational agents inspired by neural behaviors. Moving Average In statistics, a moving average (rolling average or running average) is a calculation to analyze data points by creating a series of averages of different subsets of the full data set. It is also called a moving mean (MM) or rolling mean and is a type of finite impulse response filter. Variations include: simple, and cumulative, or weighted forms (described below). 
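The simple, cumulative, and weighted variations mentioned above can be sketched in a few lines of Python. The function names and the weight ordering (oldest value first, so the last weight applies to the most recent value in each window) are illustrative assumptions, not taken from any particular package.

```python
from collections import deque
from typing import Iterable, List, Sequence


def simple_moving_average(values: Iterable[float], window: int) -> List[float]:
    """Unweighted mean of each run of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be positive")
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # this value is about to leave the window
        buf.append(v)
        total += v
        if len(buf) == window:
            out.append(total / window)
    return out


def cumulative_moving_average(values: Iterable[float]) -> List[float]:
    """Mean of all values seen so far, emitted after every value."""
    out = []
    total = 0.0
    for n, v in enumerate(values, start=1):
        total += v
        out.append(total / n)
    return out


def weighted_moving_average(
    values: Sequence[float], weights: Sequence[float]
) -> List[float]:
    """Weighted mean over each window; weights are given oldest first."""
    k = len(weights)
    norm = sum(weights)
    return [
        sum(w * v for w, v in zip(weights, values[i:i + k])) / norm
        for i in range(len(values) - k + 1)
    ]


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(simple_moving_average(data, 3))
    print(cumulative_moving_average(data))
    print(weighted_moving_average(data, [1.0, 2.0, 3.0]))
```

The simple variant keeps a running total so each step costs constant time; the weighted variant recomputes each window for clarity rather than speed.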
seismicRoll Moving Average Convergence Divergence(MACD) MACD, short for moving average convergence/divergence, is a trading indicator used in technical analysis of stock prices, created by Gerald Appel in the late 1970s. It is supposed to reveal changes in the strength, direction, momentum, and duration of a trend in a stock’s price. The MACD indicator (or ‘oscillator’) is a collection of three time series calculated from historical price data, most often the closing price. These three series are: the MACD series proper, the ‘signal’ or ‘average’ series, and the ‘divergence’ series which is the difference between the two. The MACD series is the difference between a ‘fast’ (short period) exponential moving average (EMA), and a ‘slow’ (longer period) EMA of the price series. The average series is an EMA of the MACD series itself. Moving Average Convergence Divergence (MACD) Moving Horizon Estimation Moving horizon estimation (MHE) is an optimization approach that uses a series of measurements observed over time, containing noise (random variations) and other inaccuracies, and produces estimates of unknown variables or parameters. Unlike deterministic approaches like the Kalman filter, MHE requires an iterative approach that relies on linear programming or nonlinear programming solvers to find a solution. MHE reduces to the Kalman filter under certain simplifying conditions. A critical evaluation of the extended Kalman filter and MHE found improved performance of MHE with the only cost of improvement being the increased computational expense. Because of the computational expense, MHE has generally been applied to systems where there are greater computational resources and moderate to slow system dynamics. However, in the literature there are some methods to accelerate this method. M-PACT Action classification is a widely known and popular task that offers an approach towards video understanding. 
The absence of an easy-to-use platform containing state-of-the-art (SOTA) models presents an issue for the community. Given that individual research code is not written with an end user in mind and in certain cases code is not released, even for published articles, the importance of a common unified platform capable of delivering results while removing the burden of developing an entire system cannot be overstated. To try to overcome these issues, we develop a TensorFlow-based unified platform to abstract away unnecessary overheads in terms of an end-to-end pipeline setup in order to allow the user to quickly and easily prototype action classification models. With the use of a consistent coding style across different models and seamless data flow between various submodules, the platform lends itself to the quick generation of results on a wide range of SOTA methods across a variety of datasets. All of these features are made possible through the use of fully pre-defined training and testing blocks built on top of a small but powerful set of modular functions that handle asynchronous data loading, model initializations, metric calculations, saving and loading of checkpoints, and logging of results. The platform is geared towards easily creating models, with the minimum requirement being the definition of a network architecture and preprocessing steps from a large custom selection of layers and preprocessing functions. M-PACT currently houses four SOTA activity classification models: I3D, C3D, ResNet50+LSTM and TSN. The classification performance achieved by these models is 43.86% for ResNet50+LSTM on HMDB51, while C3D and TSN achieve 93.66% and 85.25% on UCF101, respectively. MPDCompress Deep neural networks (DNNs) have become the state-of-the-art technique for machine learning tasks in various applications. However, due to their size and computational complexity, large DNNs are not readily deployable on edge devices in real time. 
To manage complexity and accelerate computation, network compression techniques based on pruning and quantization have been proposed and shown to be effective in reducing network size. However, such network compression can result in irregular matrix structures that are mismatched with modern hardware-accelerated platforms, such as graphics processing units (GPUs) designed to perform the DNN matrix multiplications in a structured (block-based) way. We propose MPDCompress, a DNN compression algorithm based on matrix permutation decomposition via random mask generation. In-training application of the masks molds the synaptic weight connection matrix to a sub-graph separation format. Aided by the random permutations, a hardware-desirable block matrix is generated, allowing for a more efficient implementation and compression of the network. To show versatility, we empirically verify MPDCompress on several network models, compression rates, and image datasets. On the LeNet 300-100 model (MNIST dataset), Deep MNIST, and CIFAR10, we achieve 10 X network compression with less than 1% accuracy loss compared to non-compressed accuracy performance. On AlexNet for the full ImageNet ILSVRC-2012 dataset, we achieve 8 X network compression with less than 1% accuracy loss, with top-5 and top-1 accuracies of 79.6% and 56.4%, respectively. Finally, we observe that the algorithm can offer inference speedups across various hardware platforms, with 4 X faster operation achieved on several mobile GPUs. mQAPViz Modern digital products and services are instrumental in understanding users activities and behaviors. In doing so, we have to extract relevant relationships and patterns from extensive data collections efficiently. Data visualization algorithms are essential tools in transforming data into narratives. Unfortunately, very few visualization algorithms can handle a significant amount of data. 
In this study, we address the visualization of large-scale datasets as a multi-objective optimization problem. We propose mQAPViz, a divide-and-conquer multi-objective optimization algorithm to compute large-scale data visualizations. Our method employs the Multi-Objective Quadratic Assignment Problem (mQAP) as the mathematical foundation to solve the visualization task at hand. The algorithm applies advanced machine learning sampling techniques and efficient data structures to scale to millions of data objects. The divide-and-conquer strategy can efficiently handle millions of objects which the algorithm allocates onto a layout that allows the visualization of a whole dataset. Experimental results on real-world and large datasets demonstrate that mQAPViz is a competitive alternative to compute large-scale visualizations that we can employ to inform the development and improvement of digital applications. MQCAL Multiple query criteria active learning (MQCAL) methods have a higher potential performance than conventional active learning methods in which only one criterion is deployed for sample selection. A central issue related to MQCAL methods concerns the development of an integration criteria strategy (ICS) that makes full use of all criteria. The conventional ICS adopted in relevant research all facilitate the desired effects, but several limitations still must be addressed. For instance, some of the strategies are not sufficiently scalable during the design process, and the number and type of criteria involved are dictated. Thus, it is challenging for the user to integrate other criteria into the original process unless modifications are made to the algorithm. 
Other strategies are too dependent on empirical parameters, which can only be acquired by experience or cross-validation and thus lack generality; additionally, these strategies run counter to the intention of active learning, as samples need to be labeled in the validation set before the active learning process can begin. To address these limitations, we propose a novel MQCAL method for classification tasks that employs a third strategy via weighted rank aggregation. The proposed method serves as a heuristic means to select high-value samples with high scalability and generality and is implemented through a three-step process: (1) the transformation of the sample selection to sample ranking and scoring, (2) the computation of the self-adaptive weights of each criterion, and (3) the weighted aggregation of each sample rank list. Ultimately, the sample at the top of the aggregated ranking list is the most comprehensively valuable and must be labeled. Several experiments generating 257 wins, 194 ties and 49 losses against other state-of-the-art MQCALs are conducted to verify that the proposed method can achieve superior results. MQGrad One of the most significant bottlenecks in training large-scale machine learning models on a parameter server (PS) is the communication overhead, because the model gradients must be exchanged frequently between the workers and servers during the training iterations. Gradient quantization has been proposed as an effective approach to reducing the communication volume. One key issue in gradient quantization is setting the number of bits for quantizing the gradients. A small number of bits can significantly reduce the communication overhead but hurts the gradient accuracy, and vice versa. An ideal quantization method would dynamically balance the communication overhead and model accuracy by adjusting the number of bits according to the knowledge learned from the immediate past training iterations. 
Existing methods, however, quantize the gradients either with a fixed number of bits or with predefined heuristic rules. In this paper we propose a novel adaptive quantization method within the framework of reinforcement learning. The method, referred to as MQGrad, formalizes the selection of quantization bits as actions in a Markov decision process (MDP) where the MDP state records the information collected from the past optimization iterations (e.g., the sequence of the loss function values). During the training iterations of a machine learning algorithm, MQGrad continuously updates the MDP state according to the changes of the loss function. Based on this information, the MDP learns to select the optimal actions (number of bits) to quantize the gradients. Experimental results based on a benchmark dataset showed that MQGrad can accelerate the learning of a large-scale deep neural network while keeping its prediction accuracy. MR3 Recommender systems (RSs) provide an effective way of alleviating the information overload problem by selecting personalized items for different users. Latent factor based collaborative filtering (CF) has become a popular approach for RSs due to its accuracy and scalability. Recently, online social networks and user-generated content have provided diverse sources for recommendation beyond ratings. Although social matrix factorization (Social MF) and topic matrix factorization (Topic MF) successfully exploit social relations and item reviews, respectively, both of them ignore some useful information. In this paper, we investigate effective data fusion by combining the aforementioned approaches. First, we propose a novel model, MR3, to jointly model three sources of information (i.e., ratings, item reviews, and social relations) effectively for rating prediction by aligning the latent factors and hidden topics. 
Second, we incorporate the implicit feedback from ratings into the proposed model to enhance its capability and to demonstrate its flexibility. We achieve more accurate rating prediction on real-life datasets over various state-of-the-art methods. Furthermore, we measure the contribution from each of the three data sources and the impact of implicit feedback from ratings, followed by the sensitivity analysis of hyperparameters. Empirical studies demonstrate the effectiveness and efficacy of our proposed model and its extension. MRAttractor Detecting groups of users, who have similar opinions, interests, or social behavior, has become an important task for many applications. A recent study showed that dynamic distance based Attractor, a community detection algorithm, outperformed other community detection algorithms such as Spectral clustering, Louvain and Infomap, achieving higher Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI). However, Attractor often takes long time to detect communities, requiring many iterations. To overcome the drawback and handle large-scale graphs, in this paper we propose MRAttractor, an advanced version of Attractor to be runnable on a MapReduce framework. In particular, we (i) apply a sliding window technique to reduce the running time, keeping the same community detection quality; (ii) design and implement the Attractor algorithm for a MapReduce framework; and (iii) evaluate MRAttractor’s performance on synthetic and real-world datasets. Experimental results show that our algorithm significantly reduced running time and was able to handle large-scale graphs. MRNet-Product2Vec E-commerce websites such as Amazon, Alibaba, Flipkart, and Walmart sell billions of products. Machine learning (ML) algorithms involving products are often used to improve the customer experience and increase revenue, e.g., product similarity, recommendation, and price estimation. 
The products are required to be represented as features before training an ML algorithm. In this paper, we propose an approach called MRNet-Product2Vec for creating generic embeddings of products within an e-commerce ecosystem. We learn a dense and low-dimensional embedding where a diverse set of signals related to a product are explicitly injected into its representation. We train a Discriminative Multi-task Bidirectional Recurrent Neural Network (RNN), where the input is a product title fed through a Bidirectional RNN and at the output, product labels corresponding to fifteen different tasks are predicted. The task set includes several intrinsic characteristics about a product such as price, weight, size, color, popularity, and material. We evaluate the proposed embedding quantitatively and qualitatively. We demonstrate that they are almost as good as sparse and extremely high-dimensional TF-IDF representation in spite of having less than 3% of the TF-IDF dimension. We also use a multimodal autoencoder for comparing products from different language-regions and show preliminary yet promising qualitative results. MS MARCO Microsoft Machine Reading Comprehension (MS MARCO) is a new large scale dataset for reading comprehension and question answering. In MS MARCO, all questions are sampled from real anonymized user queries. The context passages, from which answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human generated if they could summarize the answer. MSplit LBI It is one typical and general topic of learning a good embedding model to efficiently learn the representation coefficients between two spaces/subspaces. To solve this task, $L_{1}$ regularization is widely used for the pursuit of feature selection and avoiding overfitting, and yet the sparse estimation of features in $L_{1}$ regularization may cause the underfitting of training data. 
$L_{2}$ regularization is also frequently used, but it is a biased estimator. In this paper, we propose the idea that the features consist of three orthogonal parts, namely sparse strong signals, dense weak signals and random noise, in which both strong and weak signals contribute to the fitting of data. To facilitate such novel decomposition, MSplit LBI is for the first time proposed to realize feature selection and dense estimation simultaneously. We provide theoretical and simulational verification that our method exceeds $L_{1}$ and $L_{2}$ regularization, and extensive experimental results show that our method achieves state-of-the-art performance in few-shot and zero-shot learning. MT-Net ➘ “T-Net” m-TSNE Multivariate time series (MTS) have become increasingly common in healthcare domains where human vital signs and laboratory results are collected for predictive diagnosis. Recently, there have been increasing efforts to visualize healthcare MTS data based on star charts or parallel coordinates. However, such techniques might not be ideal for visualizing a large MTS dataset, since it is difficult to obtain insights or interpretations due to the inherent high dimensionality of MTS. In this paper, we propose ‘m-TSNE’: a simple and novel framework to visualize high-dimensional MTS data by projecting them into a low-dimensional (2-D or 3-D) space while capturing the underlying data properties. Our framework is easy to use and provides interpretable insights for healthcare professionals to understand MTS data. We evaluate our visualization framework on two real-world datasets and demonstrate that the results of our m-TSNE show patterns that are easy to understand while the other methods’ visualization may have limitations in interpretability. 
M-UCB Multi-armed bandit (MAB) is a class of online learning problems where a learning agent aims to maximize its expected cumulative reward while repeatedly selecting to pull arms with unknown reward distributions. In this paper, we consider a scenario in which the arms’ reward distributions may change in a piecewise-stationary fashion at unknown time steps. By connecting change-detection techniques with classic UCB algorithms, we motivate and propose a learning algorithm called M-UCB, which can detect and adapt to changes, for the considered scenario. We also establish an $O(\sqrt{MKT\log T})$ regret bound for M-UCB, where $T$ is the number of time steps, $K$ is the number of arms, and $M$ is the number of stationary segments. Comparison with the best available lower bound shows that M-UCB is nearly optimal in $T$ up to a logarithmic factor. We also compare M-UCB with state-of-the-art algorithms in a numerical experiment based on a public Yahoo! dataset. In this experiment, M-UCB achieves about 50% regret reduction with respect to the best performing state-of-the-art algorithm. mu-Forcing It has been previously observed that training Variational Recurrent Autoencoders (VRAE) for text generation suffers from a serious problem of uninformative latent variables. The model collapses into a plain language model that totally ignores the latent variables and can only generate repetitive and dull samples. In this paper, we explore the reason behind this issue and propose an effective regularizer-based approach to address it. The proposed method directly injects extra constraints on the posteriors of latent variables into the learning process of VRAE, which can flexibly and stably control the trade-off between the KL term and the reconstruction term, making the model learn dense and meaningful latent representations. 
The experimental results show that the proposed method outperforms several strong baselines and can make the model learn interpretable latent variables and generate diverse meaningful sentences. Furthermore, the proposed method can perform well without using other strategies, such as KL annealing. Muller Plot A Muller plot combines information about the succession of different OTUs (genotypes, phenotypes, species, …) and information about the dynamics of their abundances (populations or frequencies) over time. Muller plots may be used to visualize evolutionary dynamics. They may also be employed in the study of diversity and its dynamics; that is, how diversity emerges and how it changes over time. [Figure: an example of a Muller plot (produced by the MullerPlot package in R) showing the evolutionary dynamics of an artificial community.] They are called Muller plots in honor of Hermann Joseph Muller, who used them to explain his idea of Muller’s ratchet. ggmuller Multi Agent System(MAS) A multi-agent system (M.A.S.) is a computerized system composed of multiple interacting intelligent agents within an environment. Multi-agent systems can be used to solve problems that are difficult or impossible for an individual agent or a monolithic system to solve. Intelligence may include some methodic, functional, procedural approach, algorithmic search or reinforcement learning. Although there is considerable overlap, a multi-agent system is not always the same as an agent-based model (ABM). The goal of an ABM is to search for explanatory insight into the collective behavior of agents (which don’t necessarily need to be “intelligent”) obeying simple rules, typically in natural systems, rather than in solving specific practical or engineering problems. The terminology of ABM tends to be used more often in the sciences, and MAS in engineering and technology. 
Topics where multi-agent systems research may deliver an appropriate approach include online trading, disaster response, and modelling social structures. Multi Attribute Utility Theory(MAUT) mau Multi Expression Programming(MEP) In this paper a new evolutionary paradigm, called Multi-Expression Programming (MEP), intended for solving computationally difficult problems is proposed. A new encoding method is designed. MEP individuals are linear entities that encode complex computer programs. In this paper MEP is used for solving some computationally difficult problems like symbolic regression, game strategy discovering, and for generating heuristics. Other exciting applications of MEP are suggested. Some of them are currently under development. MEP is compared with Gene Expression Programming (GEP) by using a well-known test problem. For the considered problems MEP performs better than GEP. Evolving TSP heuristics using Multi Expression Programming Multi Preference Closure(MP-closure) The paper describes a preferential approach for dealing with exceptions in KLM preferential logics, based on the rational closure. It is well known that the rational closure does not allow an independent handling of the inheritance of different defeasible properties of concepts. Several solutions have been proposed to face this problem and the lexicographic closure is the most notable one. In this work, we consider an alternative closure construction, called the Multi Preference closure (MP-closure), that has been first considered for reasoning with exceptions in DLs. Here, we reconstruct the notion of MP-closure in the propositional case and we show that it is a natural variant of Lehmann’s lexicographic closure. Abandoning Maximal Entropy (an alternative route already considered but not explored by Lehmann) leads to a construction which exploits a different lexicographic ordering w.r.t. 
the lexicographic closure, and determines a preferential consequence relation rather than a rational consequence relation. We show that, building on the MP-closure semantics, rationality can be recovered, at least from the semantic point of view, resulting in a rational consequence relation which is stronger than the rational closure, but incomparable with the lexicographic closure. We also show that the MP-closure is stronger than the Relevant Closure. Multi Screen Penalty(MSP) We propose a multi-step method, called Multi Screen Penalty (MSP), to estimate high-dimensional sparse linear models. MSP uses a series of small and adaptive penalties to iteratively estimate the regression coefficients. This structure is shown to greatly improve the model selection and estimation accuracy, i.e., it precisely selects the true model when the irrepresentable condition fails; under mild regularity conditions, the MSP estimator can achieve the rate $\sqrt{q \log n /n}$ for the upper bound of the $\ell_2$-norm error. At each step, we restrict the selection of MSP to the reduced parameter space obtained from the last step; this makes its computational complexity roughly comparable to that of Lasso. This algorithm is found to be stable and reaches high accuracy over a range of small tuning parameters, hence removing the need for cross-validation. Numerical comparisons show that the method works effectively both in model selection and estimation and nearly uniformly outperforms others. We apply MSP and other methods to financial data. MSP is successful in asset selection and produces more stable and lower rates of fitted/predicted errors. Multi-Advisor Reinforcement Learning This article deals with a novel branch of Separation of Concerns, called Multi-Advisor Reinforcement Learning (MAd-RL), where a single-agent RL problem is distributed to $n$ learners, called advisors. Each advisor tries to solve the problem with a different focus. 
Their advice is then communicated to an aggregator, which is in control of the system. For the local training, three off-policy bootstrapping methods are proposed and analysed: local-max bootstraps with the local greedy action, rand-policy bootstraps with respect to the random policy, and agg-policy bootstraps with respect to the aggregator’s greedy policy. MAd-RL is positioned as a generalisation of Reinforcement Learning with Ensemble methods. An experiment is held on a simplified version of the Ms. Pac-Man Atari game. The results confirm the theoretical relative strengths and weaknesses of each method. Multi-Agent Actor-Critic Algorithm Imitation learning algorithms can be used to learn a policy from expert demonstrations without access to a reward signal. However, most existing approaches are not applicable in multi-agent settings due to the existence of multiple (Nash) equilibria and non-stationary environments. We propose a new framework for multi-agent imitation learning for general Markov games, where we build upon a generalized notion of inverse reinforcement learning. We further introduce a practical multi-agent actor-critic algorithm with good empirical performance. Our method can be used to imitate complex behaviors in high-dimensional environments with multiple cooperative or competing agents. Multi-Agent Common Knowledge Reinforcement Learning(MACKRL) In multi-agent reinforcement learning, centralised policies can only be executed if agents have access to either the global state or an instantaneous communication channel. An alternative approach that circumvents this limitation is to use centralised training of a set of decentralised policies. However, such policies severely limit the agents’ ability to coordinate. We propose multi-agent common knowledge reinforcement learning (MACKRL), which strikes a middle ground between these two extremes. 
Our approach is based on the insight that, even in partially observable settings, subsets of agents often have some common knowledge that they can exploit to coordinate their behaviour. Common knowledge can arise, e.g., if all agents can reliably observe things in their own field of view and know the field of view of other agents. Using this additional information, it is possible to find a centralised policy that conditions only on agents’ common knowledge and that can be executed in a decentralised fashion. A resulting challenge is then to determine at what level agents should coordinate. While the common knowledge shared among all agents may not contain much valuable information, there may be subgroups of agents that share common knowledge useful for coordination. MACKRL addresses this challenge using a hierarchical approach: at each level, a controller can either select a joint action for the agents in a given subgroup, or propose a partition of the agents into smaller subgroups whose actions are then selected by controllers at the next level. While action selection involves sampling hierarchically, learning updates are based on the probability of the joint action, calculated by marginalising across the possible decisions of the hierarchy. We show promising results on both a proof-of-concept matrix game and a multi-agent version of StarCraft II Micromanagement. Multi-Agent Inverse Reinforcement Learning(MIRL) Learning the reward function of an agent by observing its behavior is termed inverse reinforcement learning and has applications in learning from demonstration or apprenticeship learning. We introduce the problem of multiagent inverse reinforcement learning, where reward functions of multiple agents are learned by observing their uncoordinated behavior. A centralized controller then learns to coordinate their behavior by optimizing a weighted sum of reward functions of all the agents. 
We evaluate our approach on a traffic-routing domain, in which a controller coordinates actions of multiple traffic signals to regulate traffic density. We show that the learner is not only able to match but even significantly outperform the expert. Multi-agent Inverse Reinforcement Learning for General-sum Stochastic Games Multi-Agent Path Finding Explanation of the hot topic ‘multi-agent path finding’. Multi-Agent Recurrent Deterministic Policy Gradient(MA-RDPG) Ranking is a fundamental and widely studied problem in scenarios such as search, advertising, and recommendation. However, joint optimization for multi-scenario ranking, which aims to improve the overall performance of several ranking strategies in different scenarios, is rather untouched. Separately optimizing each individual strategy has two limitations. The first is a lack of collaboration between scenarios, meaning that each strategy maximizes its own objective but ignores the goals of other strategies, leading to a sub-optimal overall performance. The second is the inability to model the correlation between scenarios, meaning that independent optimization in one scenario only uses its own user data but ignores the context in other scenarios. In this paper, we formulate multi-scenario ranking as a fully cooperative, partially observable, multi-agent sequential decision problem. We propose a novel model named Multi-Agent Recurrent Deterministic Policy Gradient (MA-RDPG) which has a communication component for passing messages, several private actors (agents) for making actions for ranking, and a centralized critic for evaluating the overall performance of the co-working actors. Each scenario is treated as an agent (actor). Agents collaborate with each other by sharing a global action-value function (the critic) and passing messages that encode historical information across scenarios. The model is evaluated in online settings on a large E-commerce platform. 
Results show that the proposed model exhibits significant improvements over baselines in terms of the overall performance. Multiagent Reinforcement Learning(MARL) To achieve general intelligence, agents must learn how to interact with others in a shared environment: this is the challenge of multiagent reinforcement learning (MARL). The simplest form is independent reinforcement learning (InRL), where each agent treats its experience as part of its (non-stationary) environment. Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents Multiagent Soft Q-learning Policy gradient methods are often applied to reinforcement learning in continuous multiagent games. These methods perform local search in the joint-action space, and as we show, they are susceptible to a game-theoretic pathology known as relative overgeneralization. To resolve this issue, we propose Multiagent Soft Q-learning, which can be seen as the analogue of applying Q-learning to continuous controls. We compare our method to MADDPG, a state-of-the-art approach, and show that our method achieves better coordination in multiagent cooperative tasks, converging to better local optima in the joint action space. Multi-Agent Systems(MAS) A multi-agent system (M.A.S.) is a computerized system composed of multiple interacting intelligent agents within an environment. Multi-agent systems can be used to solve problems that are difficult or impossible for an individual agent or a monolithic system to solve. Intelligence may include some methodic, functional, procedural approach, algorithmic search or reinforcement learning. Although there is considerable overlap, a multi-agent system is not always the same as an agent-based model (ABM). 
The goal of an ABM is to search for explanatory insight into the collective behavior of agents (which don’t necessarily need to be ‘intelligent’) obeying simple rules, typically in natural systems, rather than in solving specific practical or engineering problems. The terminology of ABM tends to be used more often in the sciences, and MAS in engineering and technology. Topics where multi-agent systems research may deliver an appropriate approach include online trading, disaster response, and modelling social structures. Multi-Armed Bandit In probability theory, the multi-armed bandit problem (sometimes called the K- or N-armed bandit problem) is the problem a gambler faces at a row of slot machines, sometimes known as ‘one-armed bandits’, when deciding which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a random reward from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. ➚ “Gittins Index” Multi-Cell LSTM Language models, being at the heart of many NLP problems, are always of great interest to researchers. Neural language models come with the advantage of distributed representations and long range contexts. With its particular dynamics that allow the cycling of information within the network, the recurrent neural network (RNN) becomes an ideal paradigm for neural language modeling. The Long Short-Term Memory (LSTM) architecture solves the inadequacies of the standard RNN in modeling long-range contexts. In spite of a plethora of RNN variants, the possibility of adding multiple memory cells in LSTM nodes has seldom been explored. Here we propose a multi-cell node architecture for LSTMs and study its applicability for neural language modeling. The proposed multi-cell LSTM language models outperform the state-of-the-art results on the well-known Penn Treebank (PTB) setup. 
MultI-class learNing Algorithm for data Streams(MINAS) Novelty detection has been presented in the literature as a one-class problem. In this case, new examples are classified as either belonging to the target class or not. The examples not explained by the model are detected as belonging to a class named novelty. However, novelty detection is much more general, especially in data stream scenarios, where the number of classes might be unknown before learning and new classes can appear at any time. In this case, the novelty concept is composed of different classes. This work presents a new algorithm to address novelty detection in multi-class data stream problems, the MINAS algorithm. Moreover, we also present a new experimental methodology to evaluate novelty detection methods in multi-class problems. The data used in the experiments include artificial and real data sets. Experimental results show that MINAS is able to discover novelties in multi-class problems. ➘ “Novelty Detection” Multiclass Universum SVM(MU-SVM) We introduce Universum learning for multiclass problems and propose a novel formulation for multiclass universum SVM (MU-SVM). We also propose an analytic span bound for model selection with almost 2-4x faster computation times than standard resampling techniques. We empirically demonstrate the efficacy of the proposed MU-SVM formulation on several real world datasets achieving > 20% improvement in test accuracies compared to multi-class SVM. Multicollinearity In statistics, multicollinearity (also collinearity) is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. 
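As a minimal sketch of the definition above, the extreme case can be checked by hand: when one predictor is an exact linear combination of the others, the Gram matrix X'X has determinant zero. The toy predictors and the helper names `gram` and `det3` are hypothetical, chosen only for illustration; integer data keeps the arithmetic exact.

```python
def gram(X):
    """Return X'X for a matrix X given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Hypothetical toy predictors (not taken from any real data set).
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
x3_free = [1, -1, 1, -1, 2, -2]              # not a combination of x1 and x2
x3_tied = [a + b for a, b in zip(x1, x2)]    # exactly x1 + x2

X_free = [list(row) for row in zip(x1, x2, x3_free)]
X_tied = [list(row) for row in zip(x1, x2, x3_tied)]

det_free = det3(gram(X_free))  # non-zero: X'X can be inverted
det_tied = det3(gram(X_tied))  # zero: X'X is singular
```

With highly correlated but not exactly tied predictors the determinant is non-zero yet small relative to the scale of the data, which is one way to see why the coefficient estimates become unstable.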
Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others. In case of perfect multicollinearity the predictor matrix is singular and therefore cannot be inverted. Under these circumstances, the ordinary least-squares estimator $\hat{\beta} = (X'X)^{-1}X'y$ does not exist. Note that in statements of the assumptions underlying regression analyses such as ordinary least squares, the phrase ‘no multicollinearity’ is sometimes used to mean the absence of perfect multicollinearity, which is an exact (non-stochastic) linear relation among the regressors. Multi-Context Label Embedding(MCLE) Label embedding plays an important role in zero-shot learning. Side information such as attributes, semantic text representations, and label hierarchy are commonly used as the label embedding in zero-shot classification tasks. However, the label embedding used in former works considers either only one single context of the label, or multiple contexts without dependency. Therefore, different contexts of the label may not be well aligned in the embedding space to preserve the relatedness between labels, which will result in poor interpretability of the label embedding. In this paper, we propose a Multi-Context Label Embedding (MCLE) approach to incorporate multiple label contexts, e.g., label hierarchy and attributes, within a unified matrix factorization framework. To be specific, we model each single context by a matrix factorization formula and introduce a shared variable to capture the dependency among different contexts. 
Furthermore, we enforce sparsity constraint on our multi-context framework to strengthen the interpretability of the learned label embedding. Extensive experiments on two real-world datasets demonstrate the superiority of our MCLE in label description and zero-shot image classification. multiDA We introduce a new method of performing high dimensional discriminant analysis, which we call multiDA. We achieve this by constructing a hybrid model that seamlessly integrates a multiclass diagonal discriminant analysis model and feature selection components. Our feature selection component naturally simplifies to weights which are simple functions of likelihood ratio statistics allowing natural comparisons with traditional hypothesis testing methods. We provide heuristic arguments suggesting desirable asymptotic properties of our algorithm with regards to feature selection. We compare our method with several other approaches, showing marked improvements in regard to prediction accuracy, interpretability of chosen features, and algorithm run time. We demonstrate such strengths of our model by showing strong classification performance on publicly available high dimensional datasets, as well as through multiple simulation studies. We make an R package available implementing our approach. Multi-Differential Fairness Auditor(MDFA) Machine learning algorithms are increasingly involved in sensitive decision-making process with adversarial implications on individuals. This paper presents mdfa, an approach that identifies the characteristics of the victims of a classifier’s discrimination. We measure discrimination as a violation of multi-differential fairness. Multi-differential fairness is a guarantee that a black box classifier’s outcomes do not leak information on the sensitive attributes of a small group of individuals. 
We reduce the problem of identifying worst-case violations to matching distributions and predicting where sensitive attributes and classifier’s outcomes coincide. We apply mdfa to a recidivism risk assessment classifier and demonstrate that individuals identified as African-American with little criminal history are three times more likely to be considered at high risk of violent recidivism than similar individuals who are not African-American. MultiDimensional Feature Selection(MDFS) Identification of informative variables in an information system is often performed using simple one-dimensional filtering procedures that discard information about interactions between variables. Such an approach may result in removing some relevant variables from consideration. Here we present an R package MDFS (MultiDimensional Feature Selection) that performs identification of informative variables taking into account synergistic interactions between multiple descriptors and the decision variable. MDFS is an implementation of an algorithm based on information theory. The computational kernel of the package is implemented in C++. A high-performance version implemented in CUDA C is also available. The applications of MDFS are demonstrated using the well-known Madelon dataset that has synergistic variables by design. The dataset comes from the UCI Machine Learning Repository. It is shown that multidimensional analysis is more sensitive than one-dimensional tests and returns more reliable rankings of importance. Multi-dimensional Graph Convolutional Network(mGCN) Convolutional neural networks (CNNs) leverage the great power in representation learning on regular grid data such as image and video. Recently, increasing attention has been paid to generalizing CNNs to graph or network data which is highly irregular. Some focus on graph-level representation learning while others aim to learn node-level representations. 
These methods have been shown to boost the performance of many graph-level tasks such as graph classification and node-level tasks such as node classification. Most of these methods have been designed for single-dimensional graphs where a pair of nodes can only be connected by one type of relation. However, many real-world graphs have multiple types of relations and they can be naturally modeled as multi-dimensional graphs with each type of relation as a dimension. Multi-dimensional graphs bring about richer interactions between dimensions, which poses tremendous challenges to the graph convolutional neural networks designed for single-dimensional graphs. In this paper, we study the problem of graph convolutional networks for multi-dimensional graphs and propose a multi-dimensional convolutional neural network model mGCN aiming to capture rich information in learning node-level representations for multi-dimensional graphs. Comprehensive experiments on real-world multi-dimensional graphs demonstrate the effectiveness of the proposed framework. Multi-Dimensional Recurrent Neural Network(MDRNN) Some of the properties that make RNNs suitable for one dimensional sequence learning tasks, are also desirable in multidimensional domains. This paper introduces multi-dimensional recurrent neural networks (MDRNNs), thereby extending the potential applicability of RNNs to vision, video processing, medical imaging and many other areas. Multidimensional Scaling(MDS) Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. It refers to a set of related ordination techniques used in information visualization, in particular to display the information contained in a distance matrix. An MDS algorithm aims to place each object in N-dimensional space such that the between-object distances are preserved as well as possible. Each object is then assigned coordinates in each of the N dimensions. 
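The phrase "preserved as well as possible" can be made concrete with a stress function that measures how far a candidate layout is from the target distance matrix. The following is a minimal sketch, assuming a raw (unnormalised) stress and a hypothetical three-object distance matrix; it only evaluates layouts and does not implement a full MDS optimiser.

```python
import math


def raw_stress(distances, coords):
    """Sum of squared gaps between target distances and embedded distances."""
    n = len(coords)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            embedded = math.dist(coords[i], coords[j])
            total += (distances[i][j] - embedded) ** 2
    return total


# Hypothetical distance matrix for three objects (a 3-4-5 right triangle).
D = [
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 5.0],
    [4.0, 5.0, 0.0],
]

layout_2d = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]  # N = 2: reproduces every distance
layout_1d = [(0.0,), (3.0,), (4.0,)]               # N = 1: one pair is badly distorted

stress_2d = raw_stress(D, layout_2d)
stress_1d = raw_stress(D, layout_1d)
```

An MDS algorithm searches over coordinates to make such a stress value as small as it can for the chosen N; here the two-dimensional layout fits the distances exactly, while the one-dimensional layout cannot.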
The number of dimensions of an MDS plot N can exceed 2 and is specified a priori. Choosing N=2 optimizes the object locations for a two-dimensional scatterplot. Multi-Dimensional Utility-Oriented Sequential Useful Patterns(MDUS) Knowledge extraction from databases is a fundamental task in the database and data mining community, and has been applied to a wide range of real-world applications and situations. Different from the support-based mining models, the utility-oriented mining framework integrates utility theory to provide more informative and useful patterns. Time-dependent sequence data is commonly seen in real life. Sequence data has been widely utilized in many applications, such as analyzing sequential user behavior on the Web, influence maximization, route planning, and targeted marketing. Unfortunately, all the existing algorithms lose sight of the fact that the processed data not only contain rich features (e.g., occurrence quantity, risk, profit, etc.), but also may be associated with multi-dimensional auxiliary information, e.g., a transaction sequence can be associated with purchaser profile information. In this paper, we first formulate the problem of utility mining across multi-dimensional sequences, and propose a novel framework named MDUS to extract Multi-Dimensional Utility-oriented Sequential useful patterns. Two algorithms, named MDUS_EM and MDUS_SD respectively, are presented to address the formulated problem. The former algorithm is based on database transformation, and the latter performs pattern joins and a searching method to identify desired patterns across multi-dimensional sequences. Extensive experiments are carried out on five real-life datasets and one synthetic dataset to show that the proposed algorithms can effectively and efficiently discover useful knowledge from multi-dimensional sequential databases. 
Moreover, the MDUS framework can provide better insight, and it is more adaptable to real-life situations than the current existing models. Multi-Directional Recurrent Neural Network(M-RNN) Missing data is a ubiquitous problem. It is especially challenging in medical settings because many streams of measurements are collected at different – and often irregular – times. Accurate estimation of those missing measurements is critical for many reasons, including diagnosis, prognosis and treatment. Existing methods address this estimation problem by interpolating within data streams or imputing across data streams (both of which ignore important information) or ignoring the temporal aspect of the data and imposing strong assumptions about the nature of the data-generating process and/or the pattern of missing data (both of which are especially problematic for medical data). We propose a new approach, based on a novel deep learning architecture that we call a Multi-directional Recurrent Neural Network (M-RNN) that interpolates within data streams and imputes across data streams. We demonstrate the power of our approach by applying it to five real-world medical datasets. We show that it provides dramatically improved estimation of missing measurements in comparison to 11 state-of-the-art benchmarks (including Spline and Cubic Interpolations, MICE, MissForest, matrix completion and several RNN methods); typical improvements in Root Mean Square Error are between 35% – 50%. Additional experiments based on the same five datasets demonstrate that the improvements provided by our method are extremely robust. Multi-Discriminator Generative Adversarial Network(MDGAN) A recent technical breakthrough in the domain of machine learning is the discovery and the multiple applications of Generative Adversarial Networks (GANs). Those generative models are computationally demanding, as a GAN is composed of two deep neural networks, and because it trains on large datasets. 
A GAN is generally trained on a single server. In this paper, we address the problem of distributing GANs so that they are able to train over datasets that are spread on multiple workers. MD-GAN is exposed as the first solution for this problem: we propose a novel learning procedure for GANs so that they fit this distributed setup. We then compare the performance of MD-GAN to an adapted version of Federated Learning to GANs, using the MNIST and CIFAR10 datasets. MD-GAN exhibits a reduction by a factor of two of the learning complexity on each worker node, while providing better performances than federated learning on both datasets. We finally discuss the practical implications of distributing GANs. Multi-Distance Support Matrix Machine(MDSMM) Real-world data such as digital images, MRI scans and electroencephalography signals are naturally represented as matrices with structural information. Most existing classifiers aim to capture these structures by regularizing the regression matrix to be low-rank or sparse. Some other methodologies introduce factorization technique to explore nonlinear relationships of matrix data in kernel space. In this paper, we propose a multi-distance support matrix machine (MDSMM), which provides a principled way of solving matrix classification problems. The multi-distance is introduced to capture the correlation within matrix data, by means of intrinsic information in rows and columns of input data. A complex hyperplane is established upon these values to separate distinct classes. We further study the generalization bounds for i.i.d. processes and non i.i.d. process based on both SVM and SMM classifiers. For typical hypothesis classes where matrix norms are constrained, MDSMM achieves a faster learning rate than traditional classifiers. We also provide a more general approach for samples without prior knowledge. 
We demonstrate the merits of the proposed method by conducting exhaustive experiments on both a simulation study and a number of real-world datasets. Multi-Domain Adversarial Learning Approach(MuLANN) Multi-domain learning (MDL) aims at obtaining a model with minimal average risk across multiple domains. Our empirical motivation is automated microscopy data, where cultured cells are imaged after being exposed to known and unknown chemical perturbations, and each dataset displays significant experimental bias. This paper presents a multi-domain adversarial learning approach, MuLANN, to leverage multiple datasets with overlapping but distinct class sets, in a semi-supervised setting. Our contributions include: i) a bound on the average- and worst-domain risk in MDL, obtained using the H-divergence; ii) a new loss to accommodate semi-supervised multi-domain learning and domain adaptation; iii) the experimental validation of the approach, improving on the state of the art on two standard image benchmarks, and a novel bioimage dataset, Cell. Multi-Domain and Multi-Modality Event Dataset(MMED) In this work, we construct and release a multi-domain and multi-modality event dataset (MMED), containing 25,165 textual news articles collected from hundreds of news media sites (e.g., Yahoo News, Google News, CNN News) and 76,516 image posts shared on Flickr social media, which are annotated according to 412 real-world events. The dataset is collected to explore the problem of organizing heterogeneous data contributed by professionals and amateurs in different data domains, and the problem of transferring event knowledge obtained from one data domain to a heterogeneous data domain, thus summarizing the data with different contributors. We hope that the release of the MMED dataset can stimulate innovative research on related challenging problems, such as event discovery, cross-modal (event) retrieval, and visual question answering, etc. 
Multi-Domain Dictionary Learning(MDDL) In this paper, we propose multi-domain dictionary learning (MDDL) to make dictionary learning-based classification more robust to data represented in different domains. We use adversarial neural networks to generate data in different styles, and collect all the generated data into a miscellaneous dictionary. To tackle dictionary learning with many samples, we compute a weighting matrix that compresses the miscellaneous dictionary from multiple samples per class to a single sample per class. We show that the time complexity of solving the proposed MDDL with the weighting matrix is the same as that of solving the dictionary with a single sample per class. Moreover, since the weighting matrix could help the solver rely more on the training data, which possibly lie in the same domain as the testing data, the classification could be more accurate. Multi-Entity Bayesian Network(MEBN) Multi-Entity Bayesian Network (MEBN) is a knowledge representation formalism combining Bayesian Networks (BN) with First-Order Logic (FOL). MEBN has sufficient expressive power for general-purpose knowledge representation and reasoning. Developing a MEBN model to support a given application is a challenge, requiring definition of entities, relationships, random variables, conditional dependence relationships, and probability distributions. When available, data can be invaluable both to improve performance and to streamline development. By far the most common format for available data is the relational database (RDB). Relational databases describe and organize data according to the Relational Model (RM). Developing a MEBN model from data stored in an RDB therefore requires mapping between the two formalisms. Multi-Entity Bayesian Network Relational Model(MEBN-RM) Multi-Entity Bayesian Network (MEBN) is a knowledge representation formalism combining Bayesian Networks (BN) with First-Order Logic (FOL). 
MEBN has sufficient expressive power for general-purpose knowledge representation and reasoning. Developing a MEBN model to support a given application is a challenge, requiring definition of entities, relationships, random variables, conditional dependence relationships, and probability distributions. When available, data can be invaluable both to improve performance and to streamline development. By far the most common format for available data is the relational database (RDB). Relational databases describe and organize data according to the Relational Model (RM). Developing a MEBN model from data stored in an RDB therefore requires mapping between the two formalisms. This paper presents MEBN-RM, a set of mapping rules between key elements of MEBN and RM. We identify links between the two languages (RM and MEBN) and define four levels of mapping from elements of RM to elements of MEBN. These definitions are implemented in the MEBN-RM algorithm, which converts a relational schema in RM to a partial MEBN model. Through this research, the software has been released as a MEBN-RM open-source software tool. The method is illustrated through two example use cases using MEBN-RM to develop MEBN models: a Critical Infrastructure Defense System and a Smart Manufacturing System. Multifaceted Privacy Recent works in social network stream analysis show that a user’s online persona attributes (e.g., gender, ethnicity, political interest, location, etc.) can be accurately inferred from the topics the user writes about or engages with. Attribute and preference inferences have been widely used to serve personalized recommendations, directed ads, and to enhance the user experience in social networks. However, revealing a user’s sensitive attributes could represent a privacy threat to some individuals. Microtargeting (e.g.,Cambridge Analytica scandal), surveillance, and discriminating ads are examples of threats to user privacy caused by sensitive attribute inference. 
In this paper, we propose Multifaceted privacy, a novel privacy model that aims to obfuscate a user’s sensitive attributes while publicly preserving the user’s public persona. To achieve multifaceted privacy, we build Aegis, a prototype client-centric social network stream processing system that helps preserve multifaceted privacy, thus allowing social network users to freely express their online personas without revealing their sensitive attributes of choice. Aegis allows social network users to control which persona attributes should be publicly revealed and which ones should be kept private. For this, Aegis continuously suggests topics and hashtags to social network users to post in order to obfuscate their sensitive attributes and hence confuse content-based sensitive attribute inferences. The suggested topics are carefully chosen to preserve the user’s publicly revealed persona attributes while hiding their private sensitive persona attributes. Our experiments show that adding as few as 0 to 4 obfuscation posts (depending on how revealing the original post is) successfully hides the user specified sensitive attributes without changing the user’s public persona attributes. Multifractal Detrended Fluctuation Analysis(MFDFA) Fractal structures are found in biomedical time series from a wide range of physiological phenomena. The multifractal spectrum identifies the deviations in fractal structure within time periods with large and small fluctuations. MFDFA Multi-Frequency Long Short-Term Memory(mLSTM) ➘ “Multilevel Wavelet Decomposition Network” Multi-Function Recurrent Units(MuFuRU) Recurrent neural networks such as the GRU and LSTM found wide adoption in natural language processing and achieve state-of-the-art results for many tasks. These models are characterized by a memory state that can be written to and read from by applying gated composition operations to the current input and the previous state. 
However, they only cover a small subset of potentially useful compositions. We propose Multi-Function Recurrent Units (MuFuRUs) that allow for arbitrary differentiable functions as composition operations. Furthermore, MuFuRUs allow for an input- and state-dependent choice of these composition operations that is learned. Our experiments demonstrate that the additional functionality helps in different sequence modeling tasks, including the evaluation of propositional logic formulae, language modeling and sentiment analysis. Multi-Head Asymmetric Hashing(MAH) Extremely low bit (e.g., 4-bit) hashing is in high demand for retrieval and network compression, yet it could hardly guarantee a manageable convergence or performance due to its severe information loss and shrink of discrete solution space. In this paper, we propose a novel Collaborative Learning strategy for high-quality low-bit deep hashing. The core idea is to distill bit-specific representations for low-bit codes with a group of hashing learners, where hash codes of various length actively interact by sharing and accumulating knowledge. To achieve that, an asymmetric hashing framework with two variants of multi-head embedding structures is derived, termed as Multi-head Asymmetric Hashing (MAH), leading to great efficiency of training and querying. Multiple views from different embedding heads provide supplementary guidance as well as regularization for extremely low bit hashing, hence making convergence faster and more stable. Extensive experiments on three benchmark datasets have been conducted to verify the superiority of the proposed MAH, and show that 8-bit hash codes generated by MAH achieve 94.4% of MAP score, which significantly surpasses the performance of 48-bit codes by the state-of-the-arts for image retrieval. 
Multi-hop Assortativity Several social, medical, engineering and biological challenges rely on discovering the functionality of networks from their structure and node metadata, when it is available. For example, in chemoinformatics one might want to detect whether a molecule is toxic based on structure and atomic types, or discover the research field for scientific collaboration networks. Existing techniques rely on counting or measuring structural patterns that are known to show large variations from network to network, such as the number of triangles, or the assortativity of node metadata. We introduce the concept of multi-hop assortativity, which captures the similarity of nodes situated at the extremities of a randomly selected path of a given length. We show that multi-hop assortativity unifies various existing concepts and offers a versatile family of fingerprints to characterize networks. These fingerprints in turn make it possible to recover the functionalities of a network, with the help of the machine learning toolbox. Our method is evaluated empirically on established social and chemoinformatic network benchmarks. Results reveal that our assortativity-based features are competitive, providing highly accurate results and often outperforming state-of-the-art methods for the network classification task. Multi-Input Fully-Convolutional Network(MIFCN) In recent years, there has been a growing interest in applying convolutional neural networks (CNNs) to low-level vision tasks such as denoising and super-resolution. Optical coherence tomography (OCT) images are inevitably affected by noise, due to the coherent nature of the image formation process. In this paper, we take advantage of the progress in deep learning methods and propose a new method termed multi-input fully-convolutional networks (MIFCN) for denoising of OCT images. 
Unlike recently proposed natural image denoising CNNs, our proposed architecture allows exploiting high degrees of correlation and complementary information among neighboring OCT images through pixel by pixel fusion of multiple FCNs. We also show how the parameters of the proposed architecture can be learned by optimizing a loss function that is specifically designed to take into account consistency between the overall output and the contribution of each input image. We compare the proposed MIFCN method quantitatively and qualitatively with the state-of-the-art denoising methods on OCT images of normal and age-related macular degeneration eyes. Multi-Instance Learning(MIL) In machine learning, multiple-instance learning (MIL) is a variation on supervised learning. Instead of receiving a set of instances which are individually labeled, the learner receives a set of labeled bags, each containing many instances. In the simple case of multiple-instance binary classification, a bag may be labeled negative if all the instances in it are negative. On the other hand, a bag is labeled positive if there is at least one instance in it which is positive. From a collection of labeled bags, the learner tries to either (i) induce a concept that will label individual instances correctly or (ii) learn how to label bags without inducing the concept. Take image classification as an example, following Amores (2013). Given an image, we want to know its target class based on its visual content. For instance, the target class might be ‘beach’, where the image contains both ‘sand’ and ‘water’. In MIL terms, the image is described as a bag $X = \{X_1, \ldots, X_N\}$, where each $X_i$ is the feature vector (called instance) extracted from the corresponding i-th region in the image and N is the total number of regions (instances) partitioning the image. The bag is labeled positive (‘beach’) if it contains both ‘sand’ region instances and ‘water’ region instances. 
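The two bag-labeling rules described above can be illustrated with a minimal Python sketch. Everything named here is a hypothetical choice made for illustration: instances are assumed to be (sand_score, water_score) pairs, the detectors are simple thresholds, and the function names label_bag_standard and label_bag_conjunctive do not come from any MIL library.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical instance format: (sand_score, water_score) for one image region.
Instance = Tuple[float, float]
Detector = Callable[[Instance], bool]


def label_bag_standard(bag: Sequence[Instance], is_positive: Detector) -> bool:
    """Standard assumption: a bag is positive if at least one instance is positive."""
    return any(is_positive(x) for x in bag)


def label_bag_conjunctive(bag: Sequence[Instance], detectors: Sequence[Detector]) -> bool:
    """'Beach'-style rule: every required concept must appear in some instance."""
    return all(any(detect(x) for x in bag) for detect in detectors)


# Illustrative threshold detectors (assumed, not learned).
def is_sand(x: Instance) -> bool:
    return x[0] > 0.5


def is_water(x: Instance) -> bool:
    return x[1] > 0.5


beach = [(0.9, 0.1), (0.2, 0.8), (0.1, 0.1)]   # has a sand region and a water region
desert = [(0.9, 0.1), (0.8, 0.0)]              # sand regions only

beach_is_beach = label_bag_conjunctive(beach, [is_sand, is_water])
desert_is_beach = label_bag_conjunctive(desert, [is_sand, is_water])
desert_has_sand = label_bag_standard(desert, is_sand)
```

In a real MIL setting the instance-level detectors are not given; they are what approach (i) tries to induce from the bag labels, while approach (ii) learns the bag-level decision directly.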
Multiple-instance learning was originally proposed under this name by Dietterich, Lathrop & Lozano-Pérez (1997), but earlier examples of similar research exist, for instance in the work on handwritten digit recognition by Keeler, Rumelhart & Leow (1990). Recent reviews of the MIL literature include Amores (2013), which provides an extensive review and comparative study of the different paradigms, and Foulds & Frank (2010), which provides a thorough review of the different assumptions used by different paradigms in the literature. Examples of where MIL is applied are: · Molecule activity · Predicting binding sites of Calmodulin binding proteins · Predicting function for alternatively spliced isoforms Li, Menon et al. (2014), Eksi et al. (2013) · Image classification Maron & Ratan (1998) · Text or document categorization Kotzias et al. (2015) · Predicting functional binding sites of MicroRNA targets Bandyopadhyay, Ghosh et al. (2015) Numerous researchers have worked on adapting classical classification techniques, such as support vector machines or boosting, to work within the context of multiple-instance learning. Multiple Instance Learning: Algorithms and Applications Stable multi-instance learning via causal inference Multi-Item Gamma Poisson Shrinker(MGPS) MGPS is a disproportionality method that utilizes an empirical Bayesian model to detect the magnitude of drug-event associations in drug safety databases. MGPS calculates adjusted reporting ratios for pairs of drug event combinations. The adjusted reporting ratio values are termed the EBGM or the ‘Empirical Bayes Geometric Mean.’ EBGM values indicate the strength of the reporting relationship between a particular drug and event pair. openEBGM Multi-Kernel Correntropy(MKC) As a novel similarity measure that is defined as the expectation of a kernel function between two random variables, correntropy has been successfully applied in robust machine learning and signal processing to combat large outliers. 
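To make the definition concrete, the following is a minimal sketch of a sample estimate of correntropy, assuming an unnormalized zero-mean Gaussian kernel with width sigma. The function names and the toy data are illustrative assumptions, not taken from the MKC paper; the mean squared error is included only to contrast how a single large outlier affects each measure.

```python
import math
from typing import Sequence


def gaussian_kernel(e: float, sigma: float = 1.0) -> float:
    """Unnormalized zero-mean Gaussian kernel evaluated at the error e."""
    return math.exp(-(e * e) / (2.0 * sigma * sigma))


def correntropy(xs: Sequence[float], ys: Sequence[float], sigma: float = 1.0) -> float:
    """Sample mean of the kernel over paired observations (estimates E[k(X - Y)])."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    return sum(gaussian_kernel(x - y, sigma) for x, y in zip(xs, ys)) / len(xs)


def mean_squared_error(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain MSE, shown only for comparison."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys_outlier = [1.0, 2.0, 3.0, 104.0]  # one large outlier in the last pair

similarity_clean = correntropy(xs, xs)
similarity_outlier = correntropy(xs, ys_outlier)
mse_outlier = mean_squared_error(xs, ys_outlier)
```

Because the kernel saturates for large errors, the outlier pair contributes essentially nothing to the correntropy estimate, whereas it dominates the MSE. This bounded influence is what makes correntropy-based criteria attractive for robust learning.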
The kernel function in correntropy is usually a zero-mean Gaussian kernel. In a recent work, the concept of mixture correntropy (MC) was proposed to improve the learning performance, where the kernel function is a mixture Gaussian kernel, namely a linear combination of several zero-mean Gaussian kernels with different widths. In both correntropy and mixture correntropy, the center of the kernel function is, however, always located at zero. In the present work, to further improve the learning performance, we propose the concept of multi-kernel correntropy (MKC), in which each component of the mixture Gaussian kernel can be centered at a different location. The properties of the MKC are investigated and an efficient approach is proposed to determine the free parameters in MKC. Experimental results show that the learning algorithms under the maximum multi-kernel correntropy criterion (MMKCC) can outperform those under the original maximum correntropy criterion (MCC) and the maximum mixture correntropy criterion (MMCC). Multilabel Feature Selection(ML-FS) Multilabel feature selection: A comprehensive review and guiding experiments Multi-Lane Capsule Network(MLCN) We introduce Multi-Lane Capsule Networks (MLCN), which are a separable and resource efficient organization of Capsule Networks (CapsNet) that allows parallel processing, while achieving high accuracy at reduced cost. An MLCN is composed of a number of (distinct) parallel lanes, each contributing to a dimension of the result, trained using the routing-by-agreement organization of CapsNet. Our results indicate similar accuracy with a much reduced cost in number of parameters for the Fashion-MNIST and CIFAR-10 datasets. They also indicate that the MLCN outperforms the original CapsNet when using a proposed novel configuration for the lanes. MLCN also has faster training and inference times, being more than two-fold faster than the original CapsNet on the same accelerator. 
Multi-Layer Convolutional Sparse Coding(ML-CSC) The recently proposed Multi-Layer Convolutional Sparse Coding (ML-CSC) model, consisting of a cascade of convolutional sparse layers, provides a new interpretation of Convolutional Neural Networks (CNNs). Under this framework, the computation of the forward pass in a CNN is equivalent to a pursuit algorithm aiming to estimate the nested sparse representation vectors — or feature maps — from a given input signal. Despite having served as a pivotal connection between CNNs and sparse modeling, a deeper understanding of the ML-CSC is still lacking: there are no pursuit algorithms that can serve this model exactly, nor are there conditions to guarantee a non-empty model. While one can easily obtain signals that approximately satisfy the ML-CSC constraints, it remains unclear how to simply sample from the model and, more importantly, how one can train the convolutional filters from real data. In this work, we propose a sound pursuit algorithm for the ML-CSC model by adopting a projection approach. We provide new and improved bounds on the stability of the solution of such pursuit and we analyze different practical alternatives to implement this in practice. We show that the training of the filters is essential to allow for non-trivial signals in the model, and we derive an online algorithm to learn the dictionaries from real data, effectively resulting in cascaded sparse convolutional layers. Last, but not least, we demonstrate the applicability of the ML-CSC model for several applications in an unsupervised setting, providing competitive results. Our work represents a bridge between matrix factorization, sparse dictionary learning and sparse auto-encoders, and we analyze these connections in detail. Multi-Layer Fast ISTA(ML-FISTA) Parsimonious representations in data modeling are ubiquitous and central for processing information. 
Motivated by the recent Multi-Layer Convolutional Sparse Coding (ML-CSC) model, we herein generalize the traditional Basis Pursuit regression problem to a multi-layer setting, introducing similar sparse enforcing penalties at different representation layers in a symbiotic relation between synthesis and analysis sparse priors. We propose and analyze different iterative algorithms to solve this new problem in practice. We prove that the presented multi-layer Iterative Soft Thresholding (ML-ISTA) and multi-layer Fast ISTA (ML-FISTA) converge to the global optimum of our multi-layer formulation at a rate of $\mathcal{O}(1/k)$ and $\mathcal{O}(1/k^2)$, respectively. We further show how these algorithms effectively implement particular recurrent neural networks that generalize feed-forward architectures without any increase in the number of parameters. We demonstrate the different architectures resulting from unfolding the iterations of the proposed multi-layer pursuit algorithms, providing a principled way to construct deep recurrent CNNs from feed-forward ones. We demonstrate the emerging constructions by training them in an end-to-end manner, consistently improving the performance of classical networks without introducing extra filters or parameters. Multi-Layer Iterative Soft Thresholding(ML-ISTA) Parsimonious representations in data modeling are ubiquitous and central for processing information. Motivated by the recent Multi-Layer Convolutional Sparse Coding (ML-CSC) model, we herein generalize the traditional Basis Pursuit regression problem to a multi-layer setting, introducing similar sparse enforcing penalties at different representation layers in a symbiotic relation between synthesis and analysis sparse priors. We propose and analyze different iterative algorithms to solve this new problem in practice. 
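For orientation, the classic single-layer iterative soft-thresholding (ISTA) step can be sketched as below. This is not the multi-layer algorithm of the paper. It is a minimal pure-Python illustration for the problem of minimizing 0.5 * ||A x - y||^2 + lam * ||x||_1, and it assumes the caller supplies a step size no larger than the reciprocal of the largest eigenvalue of A^T A. All function names are illustrative.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def soft_threshold(v: Vector, t: float) -> Vector:
    """Shrink each entry toward zero by t; entries with |x| <= t become 0."""
    out = []
    for x in v:
        if x > t:
            out.append(x - t)
        elif x < -t:
            out.append(x + t)
        else:
            out.append(0.0)
    return out


def matvec(A: Matrix, x: Vector) -> Vector:
    """Matrix-vector product A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A: Matrix) -> Matrix:
    """Transpose of a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def ista(A: Matrix, y: Vector, lam: float, step: float, iters: int = 100) -> Vector:
    """Single-layer ISTA: gradient step on the quadratic term, then soft-thresholding."""
    x = [0.0] * len(A[0])
    At = transpose(A)
    for _ in range(iters):
        residual = [p - q for p, q in zip(matvec(A, x), y)]
        grad = matvec(At, residual)
        x = soft_threshold([xi - step * gi for xi, gi in zip(x, grad)], step * lam)
    return x


# With A equal to the identity the minimizer is the soft-thresholded observation.
identity = [[1.0, 0.0], [0.0, 1.0]]
estimate = ista(identity, [3.0, 0.5], lam=1.0, step=1.0)
```

Unrolling a fixed number of such iterations and treating the matrices as learnable filters is the general idea behind interpreting these pursuit algorithms as recurrent networks.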
We prove that the presented multi-layer Iterative Soft Thresholding (ML-ISTA) and multi-layer Fast ISTA (ML-FISTA) converge to the global optimum of our multi-layer formulation at a rate of $\mathcal{O}(1/k)$ and $\mathcal{O}(1/k^2)$, respectively. We further show how these algorithms effectively implement particular recurrent neural networks that generalize feed-forward architectures without any increase in the number of parameters. We demonstrate the different architectures resulting from unfolding the iterations of the proposed multi-layer pursuit algorithms, providing a principled way to construct deep recurrent CNNs from feed-forward ones. We demonstrate the emerging constructions by training them in an end-to-end manner, consistently improving the performance of classical networks without introducing extra filters or parameters. Multi-Layer K-Means(MLKM) Data-target association is an important step in multi-target localization for the intelligent operation of unmanned systems in numerous applications such as search and rescue, traffic management and surveillance. The objective of this paper is to present an innovative data association learning approach named multi-layer K-means (MLKM) based on leveraging the advantages of some existing machine learning approaches, including K-means, K-means++, and deep neural networks. To enable the accurate data association from different sensors for efficient target localization, MLKM relies on the clustering capabilities of K-means++ structured in a multi-layer framework with an error correction feature motivated by the backpropagation that is well known in deep learning research. To show the effectiveness of the MLKM method, numerous simulation examples are conducted to compare its performance with K-means, K-means++, and deep neural networks. Multi-layer Relation Network Relational Networks (RN) as introduced by Santoro et al. 
(2017) have demonstrated strong relational reasoning capabilities with a rather shallow architecture. Its single-layer design, however, only considers pairs of information objects, making it unsuitable for problems requiring reasoning across a higher number of facts. To overcome this limitation, we propose a multi-layer relation network architecture which enables successive refinements of relational information through multiple layers. We show that the increased depth allows for more complex relational reasoning by applying it to the bAbI 20 QA dataset, solving all 20 tasks with joint training and surpassing the state-of-the-art results. Multilayer Switch Network Multilayer switch networks are proposed as artificial generators of high-dimensional discrete data (e.g., binary vectors, categorical data, natural language, network log files, and discrete-valued time series). Unlike deconvolution networks which generate continuous-valued data and which consist of upsampling filters and reverse pooling layers, multilayer switch networks are composed of adaptive switches which model conditional distributions of discrete random variables. An interpretable, statistical framework is introduced for training these nonlinear networks based on a maximum-likelihood objective function. To learn network parameters, stochastic gradient descent is applied to the objective. This direct optimization is stable until convergence, and does not involve back-propagation over separate encoder and decoder networks, or adversarial training of dueling networks. While training remains tractable for moderately sized networks, Markov-chain Monte Carlo (MCMC) approximations of gradients are derived for deep networks which contain latent variables. The statistical framework is evaluated on synthetic data, high-dimensional binary data of handwritten digits, and web-crawled natural language data. 
Aspects of the model’s framework such as interpretability, computational complexity, and generalization ability are discussed. Multi-layer Time Series Periodic Pattern Recognition(PTSP) ➘ “Time Series Data Compression and Abstraction” Multi-Layer Vector Approximate Message Passing(ML-VAMP) Deep generative networks provide a powerful tool for modeling complex data in a wide range of applications. In inverse problems that use these networks as generative priors on data, one must often perform inference of the inputs of the networks from the outputs. Inference is also required for sampling during stochastic training on these generative models. This paper considers inference in a deep stochastic neural network where the parameters (e.g., weights, biases and activation functions) are known and the problem is to estimate the values of the input and hidden units from the output. While several approximate algorithms have been proposed for this task, there are few analytic tools that can provide rigorous guarantees on the reconstruction error. This work presents a novel and computationally tractable output-to-input inference method called Multi-Layer Vector Approximate Message Passing (ML-VAMP). The proposed algorithm, derived from expectation propagation, extends earlier AMP methods that are known to achieve the replica predictions for optimality in simple linear inverse problems. Our main contribution shows that the mean-squared error (MSE) of ML-VAMP can be exactly predicted in a certain large system limit (LSL) where the number of layers is fixed and weight matrices are random and orthogonally-invariant with dimensions that grow to infinity. ML-VAMP is thus a principled method for output-to-input inference in deep networks with a rigorous and precise performance achievability result in high dimensions. 
Multi-level Abstraction Object-oriented Predictor(MAOP) Object-based approaches for learning action-conditioned dynamics have demonstrated promise for generalization and interpretability. However, existing approaches suffer from structural limitations and optimization difficulties for common environments with multiple dynamic objects. In this paper, we present a novel self-supervised learning framework, called Multi-level Abstraction Object-oriented Predictor (MAOP), which employs a three-level learning architecture that enables efficient object-based dynamics learning from raw visual observations. We also design a spatial-temporal relational reasoning mechanism for MAOP to support instance-level dynamics learning and handle partial observability. Our results show that MAOP significantly outperforms previous methods in terms of sample efficiency and generalization over novel environments for learning environment models. We also demonstrate that learned dynamics models enable efficient planning in unseen environments, comparable to true environment models. In addition, MAOP learns semantically and visually interpretable disentangled representations. Multi-Level Evolution(MLE) Natural lifeforms specialise to their environmental niches across many levels; from low-level features such as DNA and proteins, through to higher-level artefacts including eyes, limbs, and overarching body plans. We propose Multi-Level Evolution (MLE), a bottom-up automatic process that designs robots across multiple levels and niches them to tasks and environmental conditions. MLE concurrently explores constituent molecular and material ‘building blocks’, as well as their possible assemblies into specialised morphological and sensorimotor configurations. MLE provides a route to fully harness a recent explosion in available candidate materials and ongoing advances in rapid manufacturing processes. 
We outline a feasible MLE architecture that realises this vision, highlight the main roadblocks and how they may be overcome, and show robotic applications to which MLE is particularly suited. By forming a research agenda to stimulate discussion between researchers in related fields, we hope to inspire the pursuit of multi-level robotic design all the way from material to machine. Multilevel Model(MLM) Multilevel models (also hierarchical linear models, nested models, mixed models, random coefficient, random-effects models, random parameter models, or split-plot designs) are statistical models of parameters that vary at more than one level. These models can be seen as generalizations of linear models (in particular, linear regression), although they can also extend to non-linear models. These models became much more popular after sufficient computing power and software became available. Multi-Level Monte Carlo Variational Inference In many statistics and machine learning frameworks, stochastic optimization with high variance gradients has become an important problem. For example, the performance of Monte Carlo variational inference (MCVI) seriously depends on the variance of its stochastic gradient estimator. In this paper, we focused on this problem and proposed a novel framework of variance reduction using multi-level Monte Carlo (MLMC) method. The framework is naturally compatible with reparameterization gradient estimators, which are one of the efficient variance reduction techniques that use the reparameterization trick. We also proposed a novel MCVI algorithm for stochastic gradient estimation on MLMC method in which sample size $N$ is adaptively estimated according to the ratio of the variance and computational cost for each iteration. We furthermore proved that, in our method, the norm of the gradient could converge to $0$ asymptotically. 
Finally, we evaluated our method by comparing it with benchmark methods in several experiments and showed that our method was able to reduce gradient variance and sampling cost efficiently and get closer to the optimum value than the other methods. Multilevel Networks Analysis Described in Lazega et al. (2008) and in Lazega and Snijders (2016, ISBN:978-3-319-24520-1). multinets Multilevel Wavelet Decomposition Network(mWDN) Recent years have witnessed an unprecedented rise of time series data from almost all kinds of academic and industrial fields. Various types of deep neural network models have been introduced to time series analysis, but the important frequency information still lacks effective modeling. In light of this, in this paper we propose a wavelet-based neural network structure called multilevel Wavelet Decomposition Network (mWDN) for building frequency-aware deep learning models for time series analysis. mWDN preserves the advantage of multilevel discrete wavelet decomposition in frequency learning while enabling the fine-tuning of all parameters under a deep neural network framework. Based on mWDN, we further propose two deep learning models called Residual Classification Flow (RCF) and multi-frequency Long Short-Term Memory (mLSTM) for time series classification and forecasting, respectively. The two models take all or partial mWDN decomposed sub-series in different frequencies as input, and resort to the back propagation algorithm to learn all the parameters globally, which enables seamless embedding of wavelet-based frequency analysis into deep learning frameworks. Extensive experiments on 40 UCR datasets and a real-world user volume dataset demonstrate the excellent performance of our time series models based on mWDN. In particular, we propose an importance analysis method for mWDN-based models, which successfully identifies those time-series elements and mWDN layers that are crucially important to time series analysis. 
This indicates the interpretability advantage of mWDN, and can be viewed as an in-depth exploration of interpretable deep learning. Multilinear Class-Specific Discriminant Analysis There has been a great effort to transfer linear discriminant techniques that operate on vector data to high-order data, generally referred to as Multilinear Discriminant Analysis (MDA) techniques. Many existing works focus on maximizing the ratio of inter-class to intra-class variances defined on tensor data representations. However, there has not been any attempt to employ class-specific discrimination criteria for the tensor data. In this paper, we propose a multilinear subspace learning technique suitable for applications requiring class-specific tensor models. The method maximizes the discrimination of each individual class in the feature space while retaining the spatial structure of the input. We evaluate the efficiency of the proposed method on two problems, i.e. facial image analysis and stock price prediction based on limit order book data. Multilinear Compressive Learning Compressive Learning is an emerging topic that combines signal acquisition via compressive sensing and machine learning to perform inference tasks directly on a small number of measurements. Many data modalities naturally have a multi-dimensional or tensorial format, with each dimension or tensor mode representing different features such as the spatial and temporal information in video sequences or the spatial and spectral information in hyperspectral images. However, in existing compressive learning frameworks, the compressive sensing component utilizes either random or learned linear projection on the vectorized signal to perform signal acquisition, thus discarding the multi-dimensional structure of the signals. 
In this paper, we propose Multilinear Compressive Learning, a framework that takes into account the tensorial nature of multi-dimensional signals in the acquisition step and builds the subsequent inference model on the structurally sensed measurements. Our theoretical complexity analysis shows that the proposed framework is more efficient compared to its vector-based counterpart in both memory and computation requirement. With extensive experiments, we also empirically show that our Multilinear Compressive Learning framework outperforms the vector-based framework in object classification and face recognition tasks, and scales favorably when the dimensionalities of the original signals increase, making it highly efficient for high-dimensional multi-dimensional signals. Multilinear Dynamical System(MLDS) We propose a novel multilinear dynamical system (MLDS) in a transform domain, named $\mathcal{L}$-MLDS, to model tensor time series. With transformations applied to a tensor data, the latent multidimensional correlations among the frontal slices are built, and thus resulting in the computational independence in the transform domain. This allows the exact separability of the multi-dimensional problem into multiple smaller LDS problems. To estimate the system parameters, we utilize the expectation-maximization (EM) algorithm to determine the parameters of each LDS. Further, $\mathcal{L}$-MLDSs significantly reduce the model parameters and allows parallel processing. Our general $\mathcal{L}$-MLDS model is implemented based on different transforms: discrete Fourier transform, discrete cosine transform and discrete wavelet transform. Due to the nonlinearity of these transformations, $\mathcal{L}$-MLDS is able to capture the nonlinear correlations within the data unlike the MLDS \cite{rogers2013multilinear} which assumes multi-way linear correlations. 
Using four real datasets, the proposed $\mathcal{L}$-MLDS is shown to achieve much higher prediction accuracy than the state-of-the-art MLDS and LDS with an equal number of parameters under different noise models. In particular, the relative errors are reduced by $50\% \sim 99\%$. Simultaneously, $\mathcal{L}$-MLDS achieves an exponential improvement in the model’s training time over MLDS. Multi-Linear Multi-View Clustering(MMC) In many real-world applications, data are often unlabeled and comprised of different representations/views which often provide information complementary to each other. Although several multi-view clustering methods have been proposed, most of them routinely assume one weight for one view of features, and thus inter-view correlations are only considered at the view-level. These approaches, however, fail to explore the explicit correlations between features across multiple views. In this paper, we introduce a tensor-based approach to incorporate the higher-order interactions among multiple views as a tensor structure. Specifically, we propose a multi-linear multi-view clustering (MMC) method that can efficiently explore the full-order structural information among all views and reveal the underlying subspace structure embedded within the tensor. Extensive experiments on real-world datasets demonstrate that our proposed MMC algorithm clearly outperforms other related state-of-the-art methods. Multilinear Subspace Learning(MSL) Multilinear subspace learning (MSL) aims to learn a specific small part of a large space of multidimensional objects having a particular desired property. It is a dimensionality reduction approach for finding a low-dimensional representation with certain preferred characteristics of high-dimensional tensor data through direct mapping, without going through vectorization. The term tensor in MSL refers to multidimensional arrays. 
Examples of tensor data include images (2D/3D), video sequences (3D/4D), and hyperspectral cubes (3D/4D). The mapping from a high-dimensional tensor space to a low-dimensional tensor space or vector space is called multilinear projection. MSL methods are higher-order generalizations of linear subspace learning methods such as principal component analysis (PCA), linear discriminant analysis (LDA) and canonical correlation analysis (CCA). In the literature, MSL is also referred to as tensor subspace learning or tensor subspace analysis. Research on MSL has progressed from heuristic exploration in the 2000s to systematic investigation in the 2010s. Multilingual Question Answering(mQA) In this paper, we present the mQA model, which is able to answer questions about the content of an image. The answer can be a sentence, a phrase or a single word. Our model contains four components: a Long-Short Term Memory (LSTM) to extract the question representation, a Convolutional Neural Network (CNN) to extract the visual representation, an LSTM for storing the linguistic context in an answer, and a fusing component to combine the information from the first three components and generate the answer. We construct a Freestyle Multilingual Image Question Answering (FM-IQA) dataset to train and evaluate our mQA model. It contains over 120,000 images and 250,000 freestyle Chinese question-answer pairs and their English translations. The quality of the generated answers of our mQA model on this dataset is evaluated by human judges through a Turing Test. Specifically, we mix the answers provided by humans and our model. The human judges need to distinguish our model from the human. They will also provide a score (i.e. 0, 1, 2, the larger the better) indicating the quality of the answer. We propose strategies to monitor the quality of this evaluation process. The experiments show that in 64.7% of cases, the human judges cannot distinguish our model from humans. 
The average score is 1.454 (1.918 for human). Multimapper Mapper is an algorithm that summarizes the topological information contained in a dataset and provides an insightful visualization. It takes as input a point cloud which is possibly high-dimensional, a filter function on it and an open cover on the range of the function. It returns the nerve simplicial complex of the pullback of the cover. Mapper can be considered a discrete approximation of the topological construct called Reeb space, as analysed in the $1$-dimensional case by [Carri et al.]. Despite its success in obtaining insights in various fields such as in [Kamruzzaman et al., 2016], Mapper is an ad hoc technique requiring lots of parameter tuning. There is also no measure to quantify goodness of the resulting visualization, which often deviates from the Reeb space in practice. In this paper, we introduce a new cover selection scheme for data that reduces the obscuration of topological information at both the computation and visualisation steps. To achieve this, we replace global scale selection of cover with a scale selection scheme sensitive to local density of data points. We also propose a method to detect some deviations in Mapper from Reeb space via computation of persistence features on the Mapper graph. Multi-Modal Aspect-Aware Topic Model(MATM) Although the latent factor model achieves good accuracy in rating prediction, it suffers from many problems including cold-start, non-transparency, and suboptimal results for individual user-item pairs. In this paper, we exploit textual reviews and item images together with ratings to tackle these limitations. Specifically, we first apply a proposed multi-modal aspect-aware topic model (MATM) on text reviews and item images to model users’ preferences and items’ features from different aspects, and also estimate the aspect importance of a user towards an item. 
Then the aspect importance is integrated into a novel aspect-aware latent factor model (ALFM), which learns user’s and item’s latent factors based on ratings. In particular, ALFM introduces a weight matrix to associate those latent factors with the same set of aspects in MATM, such that the latent factors could be used to estimate aspect ratings. Finally, the overall rating is computed via a linear combination of the aspect ratings, which are weighted by the corresponding aspect importance. To this end, our model could alleviate the data sparsity problem and gain good interpretability for recommendation. Besides, every aspect rating is weighted by its aspect importance, which is dependent on the targeted user’s preferences and the targeted item’s features. Therefore, it is expected that the proposed method can model a user’s preferences on an item more accurately for each user-item pair. Comprehensive experimental studies have been conducted on the Yelp 2017 Challenge dataset and Amazon product datasets to demonstrate the effectiveness of our method. Multimodal Attribute Extraction The broad goal of information extraction is to derive structured information from unstructured data. However, most existing methods focus solely on text, ignoring other types of unstructured data such as images, video and audio which comprise an increasing portion of the information on the web. To address this shortcoming, we propose the task of multimodal attribute extraction. Given a collection of unstructured and semi-structured contextual information about an entity (such as a textual description, or visual depictions) the task is to extract the entity’s underlying attributes. In this paper, we provide a dataset containing mixed-media data for over 2 million product items along with 7 million attribute-value pairs describing the items which can be used to train attribute extractors in a weakly supervised manner. 
We provide a variety of baselines which demonstrate the relative effectiveness of the individual modes of information towards solving the task, as well as study human performance. Multimodal Deep Gaussian Process We propose a novel Bayesian approach to modelling multimodal data generated by multiple independent processes, simultaneously solving the data association and induced supervised learning problems. Underpinning our approach is the use of Gaussian process priors which encode structure both on the functions and the associations themselves. The association of samples and functions are determined by taking both inputs and outputs into account while also obtaining a posterior belief about the relevance of the global components throughout the input space. We present an efficient learning scheme based on doubly stochastic variational inference and discuss how it can be applied to deep Gaussian process priors. We show results for an artificial data set, a noise separation problem, and a multimodal regression problem based on the cart-pole benchmark. Multimodal Deep Hashing Neural Decoder(MDHND) In this paper, we propose a novel three-stage multimodal deep hashing neural decoder (MDHND) architecture, which integrates a deep hashing framework with a neural network decoder (NND) to create an effective multibiometric authentication system. The MDHND consists of two separate modules: a multimodal deep hashing (MDH) module, which is used for feature-level fusion and binarization of multiple biometrics, and a neural network decoder (NND) module, which is used to refine the intermediate binary codes generated by the MDH and compensate for the difference between enrollment and probe biometrics (variations in pose, illumination, etc.). Use of NND helps to improve the performance of the overall multimodal authentication system. The MDHND framework is trained in 3 stages using joint optimization of the two modules. 
In Stage 1, the MDH parameters are trained and learned to generate a shared multimodal latent code; in Stage 2, the latent codes from Stage 1 are passed through a conventional error-correcting code (ECC) decoder to generate the ground truth to train a neural network decoder (NND); in Stage 3, the NND decoder is trained using the ground truth from Stage 2 and the MDH and NND are jointly optimized. Experimental results on a standard multimodal dataset demonstrate the superiority of our method relative to other current multimodal authentication systems. Furthermore, the proposed system can work in both identification and authentication modes. Multimodal Deep Network Embedding(MDNE) Network embedding is the process of learning low-dimensional representations for nodes in a network, while preserving node features. Existing studies only leverage network structure information and focus on preserving structural features. However, nodes in real-world networks often have a rich set of attributes providing extra semantic information. It has been demonstrated that both structural and attribute features are important for network analysis tasks. To preserve both features, we investigate the problem of integrating structure and attribute information to perform network embedding and propose a Multimodal Deep Network Embedding (MDNE) method. MDNE captures the non-linear network structures and the complex interactions among structures and attributes, using a deep model consisting of multiple layers of non-linear functions. Since structures and attributes are two different types of information, a multimodal learning method is adopted to pre-process them and help the model to better capture the correlations between node structure and attribute information. We employ both structural proximity and attribute proximity in the loss function to preserve the respective features and the representations are obtained by minimizing the loss function. 
Results of extensive experiments on four real-world datasets show that the proposed method performs significantly better than baselines on a variety of tasks, which demonstrate the effectiveness and generality of our method. Multimodal Dynamic Timetable Model(Multimodal DTM) We present multimodal DTM, a new model for multimodal journey planning in public (schedule-based) transport networks. Multimodal DTM constitutes an extension of the dynamic timetable model (DTM), developed originally for unimodal journey planning. Multimodal DTM exhibits a very fast query algorithm, meeting the request for real-time response to best journey queries and an extremely fast update algorithm for updating the timetable information in case of delays. In particular, an experimental study on real-world metropolitan networks demonstrates that our methods compare favorably with other state-of-the-art approaches when public transport along with unrestricted w.r.t. departing time traveling (walking and electric vehicles) is considered. Multimodal Fusion Architecture Search(MFAS) We tackle the problem of finding good architectures for multimodal classification problems. We propose a novel and generic search space that spans a large number of possible fusion architectures. In order to find an optimal architecture for a given dataset in the proposed search space, we leverage an efficient sequential model-based exploration approach that is tailored for the problem. We demonstrate the value of posing multimodal fusion as a neural architecture search problem by extensive experimentation on a toy dataset and two other real multimodal datasets. We discover fusion architectures that exhibit state-of-the-art performance for problems with different domain and dataset size, including the NTU RGB+D dataset, the largest multi-modal action recognition dataset available. 
Multi-Modal Generative Adversarial Network(MM-GAN) Nowadays, an increasing number of customers are in favor of using E-commerce Apps to browse and purchase products. Since merchants are usually inclined to employ redundant and over-informative product titles to attract customers’ attention, it is of great importance to concisely display short product titles on the limited screens of cell phones. Previous research mainly considers the textual information of long product titles and lacks a human-like view during the training and evaluation procedures. In this paper, we propose a Multi-Modal Generative Adversarial Network (MM-GAN) for short product title generation, which innovatively incorporates image information, attribute tags from the product and the textual information from original long titles. MM-GAN treats short title generation as a reinforcement learning process, where the generated titles are evaluated by the discriminator in a human-like view. Multimodal Intelligent inteRactIon for Autonomous systeMs(MIRIAM) We present MIRIAM (Multimodal Intelligent inteRactIon for Autonomous systeMs), a multimodal interface to support situation awareness of autonomous vehicles through chat-based interaction. The user is able to chat about the vehicle’s plan, objectives, previous activities and mission progress. The system is mixed initiative in that it pro-actively sends messages about key events, such as fault warnings. We will demonstrate MIRIAM using SeeByte’s SeeTrack command and control interface and Neptune autonomy simulator. Multimodal Interest-Related Item Similarity Model(Multimodal IRIS) Nowadays, recommendation systems are applied in the fields of e-commerce, video websites, social networking sites, etc., and bring great convenience to people’s daily lives. The types of information in recommendation systems are diverse and abundant, and therefore the proportion of unstructured multimodal data like text, image and video is increasing. 
However, due to the representation gap between different modalities, it is intractable to effectively use unstructured multimodal data to improve the efficiency of recommendation systems. In this paper, we propose an end-to-end Multimodal Interest-Related Item Similarity model (Multimodal IRIS) to provide recommendations based on multimodal data sources. Specifically, the Multimodal IRIS model consists of three modules, i.e., the multimodal feature learning module, the Interest-Related Network (IRN) module and the item similarity recommendation module. The multimodal feature learning module adds a knowledge sharing unit among different modalities. Then the IRN learns the interest relevance between the target item and each of the different historical items. Finally, the multimodal data feature learning, IRN and item similarity recommendation modules are unified into an integrated system to achieve performance enhancements and to accommodate the addition or absence of different modal data. Extensive experiments on real-world datasets show that, by dealing with the multimodal data which people may pay more attention to when selecting items, the proposed Multimodal IRIS significantly improves accuracy and interpretability on the top-N recommendation task over the state-of-the-art methods. Multi-Modal Knowledge Graph(MMKG) We present MMKG, a collection of three knowledge graphs that contain both numerical features and (links to) images for all entities as well as entity alignments between pairs of KGs. Therefore, multi-relational link prediction and entity matching communities can benefit from this resource. We believe this data set has the potential to facilitate the development of novel multi-modal learning approaches for knowledge graphs. We validate the utility of MMKG in the sameAs link prediction task with an extensive set of experiments. These experiments show that the task at hand benefits from learning of multiple feature types. 
Multimodal Learning Information in the real world usually comes in different modalities. For example, images are usually associated with tags and text explanations; texts contain images to more clearly express the main idea of the article. Different modalities are characterized by very different statistical properties. For instance, images are usually represented as pixel intensities or outputs of feature extractors, while texts are represented as discrete word count vectors. Due to the distinct statistical properties of different information resources, it is very important to discover the relationship between different modalities. Multimodal learning is a good model for learning joint representations of different modalities. The multimodal learning model is also capable of filling in a missing modality given the observed ones. The multimodal learning model combines two deep Boltzmann machines, each corresponding to one modality. An additional hidden layer is placed on top of the two Boltzmann machines to give the joint representation. Multimodal Machine Learning Our experience of the world is multimodal – we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. 
We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research. Multimodal Named Entity Recognition(MNER) We introduce a new task called Multimodal Named Entity Recognition (MNER) for noisy user-generated data such as tweets or Snapchat captions, which comprise short text with accompanying images. These social media posts often come in inconsistent or incomplete syntax and lexical notations with very limited surrounding textual contexts, bringing significant challenges for NER. To this end, we create a new dataset for MNER called SnapCaptions (Snapchat image-caption pairs submitted to public and crowd-sourced stories with fully annotated named entities). We then build upon the state-of-the-art Bi-LSTM word/character based NER models with 1) a deep image network which incorporates relevant visual context to augment textual information, and 2) a generic modality-attention module which learns to attenuate irrelevant modalities while amplifying the most informative ones to extract contexts from, adaptive to each sample and token. The proposed MNER model with modality attention significantly outperforms the state-of-the-art text-only NER models by successfully leveraging provided visual contexts, opening up potential applications of MNER on myriads of social media platforms. Multimodal Sequential Autoencoder(m-auto) ➚ “Learning to Recommend with Missing Modalities” multimodal sparse Bayesian dictionary learning(MSBDL) The purpose of this paper is to address the problem of learning dictionaries for multimodal datasets, i.e. datasets collected from multiple data sources. We present an algorithm called multimodal sparse Bayesian dictionary learning (MSBDL). 
The MSBDL algorithm is able to leverage information from all available data modalities through a joint sparsity constraint on each modality’s sparse codes without restricting the coefficients themselves to be equal. Our framework offers a considerable amount of flexibility to practitioners and addresses many of the shortcomings of existing multimodal dictionary learning approaches. Unlike existing approaches, MSBDL allows the dictionaries for each data modality to have different cardinality. In addition, MSBDL can be used in numerous scenarios, from small datasets to extensive datasets with large dimensionality. MSBDL can also be used in supervised settings and allows for learning multimodal dictionaries concurrently with classifiers for each modality. Multimodal Style Transfer(MST) An assumption widely used in recent neural style transfer methods is that image styles can be described by global statistics of deep features like Gram or covariance matrices. Alternative approaches have represented styles by decomposing them into local pixel or neural patches. Despite the recent progress, most existing methods treat the semantic patterns of the style image uniformly, resulting in unpleasing results on complex styles. In this paper, we introduce a more flexible and general universal style transfer technique: multimodal style transfer (MST). MST explicitly considers the matching of semantic patterns in content and style images. Specifically, the style image features are clustered into sub-style components, which are matched with local content features under a graph cut formulation. A reconstruction network is trained to transfer each sub-style and render the final stylized result. Extensive experiments demonstrate the superior effectiveness, robustness and flexibility of MST. Multimodal Variational RNN(MVRNN) Multimodal learning has been lacking principled ways of combining information from different modalities and learning a low-dimensional manifold of meaningful representations. 
We study multimodal learning and sensor fusion from a latent variable perspective. We first present a regularized recurrent attention filter for sensor fusion. This algorithm can dynamically combine information from different types of sensors in a sequential decision making task. Each sensor is bonded with a modular neural network to maximize the utility of its own information. A gating modular neural network dynamically generates a set of mixing weights for outputs from sensor networks by balancing the utility of all sensors’ information. We design a co-learning mechanism to encourage co-adaptation and independent learning of each sensor at the same time, and propose a regularization-based co-learning method. In the second part, we focus on recovering the manifold of latent representation. We propose a co-learning approach using a probabilistic graphical model which imposes a structural prior on the generative model: the multimodal variational RNN (MVRNN) model, and derive a variational lower bound for its objective functions. In the third part, we extend the siamese structure to sensor fusion for robust acoustic event detection. We perform experiments to investigate the latent representations that are extracted; further work will be done in the following months. Our experiments show that the recurrent attention filter can dynamically combine different sensor inputs according to the information carried in the inputs. We believe MVRNN can identify latent representations that are useful for many downstream tasks such as speech synthesis, activity recognition, and control and planning. Both algorithms are general frameworks which can be applied to other tasks where different types of sensors are jointly used for decision making. Multi-model and Multi-level Knowledge Distillation(M2KD) Incremental learning aims at achieving good performance on new categories without forgetting old ones. Knowledge distillation has been shown to be critical in preserving the performance on old classes. 
Conventional methods, however, sequentially distill knowledge only from the last model, leading to performance degradation on the old classes in later incremental learning steps. In this paper, we propose a multi-model and multi-level knowledge distillation strategy. Instead of sequentially distilling knowledge only from the last model, we directly leverage all previous model snapshots. In addition, we incorporate an auxiliary distillation to further preserve knowledge encoded at the intermediate feature levels. To make the model more memory efficient, we adapt mask based pruning to reconstruct all previous models with a small memory footprint. Experiments on standard incremental learning benchmarks show that our method preserves the knowledge on old classes better and improves the overall performance over standard distillation techniques. Multi-Model Ensemble via Adversarial Learning(MEAL) Often the best performing deep neural models are ensembles of multiple base-level networks. Unfortunately, the space required to store these many networks, and the time required to execute them at test-time, prohibits their use in applications where test sets are large (e.g., ImageNet). In this paper, we present a method for compressing large, complex trained ensembles into a single network, where knowledge from a variety of trained deep neural networks (DNNs) is distilled and transferred to a single DNN. In order to distill diverse knowledge from different trained (teacher) models, we propose to use adversarial-based learning strategy where we define a block-wise training loss to guide and optimize the predefined student network to recover the knowledge in teacher models, and to promote the discriminator network to distinguish teacher vs. student features simultaneously. 
The proposed ensemble method (MEAL) of transferring distilled knowledge with adversarial learning exhibits three important advantages: (1) the student network that learns the distilled knowledge with discriminators is optimized better than the original model; (2) fast inference is realized by a single forward pass, while the performance is even better than traditional ensembles from multi-original models; (3) the student network can learn the distilled knowledge from a teacher model that has arbitrary structures. Extensive experiments on CIFAR-10/100, SVHN and ImageNet datasets demonstrate the effectiveness of our MEAL method. On ImageNet, our ResNet-50 based MEAL achieves top-1/5 21.79%/5.99% val error, which outperforms the original model by 2.06%/1.14%. Code and models are available at: https://…/MEAL Multi-Model Group Compression(MMGC) To monitor critical infrastructure, high quality sensors sampled at a high frequency are increasingly installed. However, due to the big amounts of data produced, only simple aggregates are stored. This removes outliers and hides fluctuations that could indicate problems. As a solution we propose compressing time series with dimensions using a model-based method we name Multi-model Group Compression (MMGC). MMGC adaptively compresses groups of correlated time series with dimensions using an extensible set of models within a user-defined error bound (possibly zero). To partition time series into groups, we propose a set of primitives for efficiently describing correlation for data sets of varying sizes. We also propose efficient query processing algorithms for executing multi-dimensional aggregate queries on models instead of data points. Last, we provide an open-source implementation of our methods as extensions to the model-based Time Series Management System (TSMS) ModelarDB. ModelarDB interfaces with the stock versions of Apache Spark and Apache Cassandra and thus can reuse existing infrastructure. 
Through an evaluation we show that, compared to widely used systems, our extended ModelarDB provides up to 11 times faster ingestion due to high compression, 65 times better compression due to the adaptivity of MMGC, 92 times faster aggregate queries as they are executed on models, and close to linear scalability while also being extensible and supporting online query processing. Multi-Motivation Behavior Modeling(MMBM) In recent years, reinforcement learning (RL) methods have been applied to model gameplay with great success, achieving super-human performance in various environments, such as Atari, Go, and Poker. However, those studies mostly focus on winning the game and have largely ignored the rich and complex human motivations, which are essential for understanding different players’ diverse behaviors. In this paper, we present a novel method called Multi-Motivation Behavior Modeling (MMBM) that takes the multifaceted human motivations into consideration and models the underlying value structure of the players using inverse RL. Our approach does not require access to the dynamics of the system, making it feasible to model complex interactive environments such as massively multiplayer online games. MMBM is tested on the World of Warcraft Avatar History dataset, which recorded over 70,000 users’ gameplay spanning a three-year period. Our model reveals significant differences in value structures among different player groups. Using the results of motivation modeling, we also predict and explain their diverse gameplay behaviors and provide a quantitative assessment of how the redesign of the game environment impacts players’ behaviors. MultiNet Representation learning of networks via embeddings has garnered popularity and has witnessed significant progress recently. Such representations have been effectively used for classic network-based machine learning tasks like link prediction, community detection, and network alignment. 
However, most existing network embedding techniques largely focus on developing distributed representations for traditional flat networks and are unable to capture representations for multilayer networks. Large scale networks such as social networks and human brain tissue networks, for instance, can be effectively captured in multiple layers. In this work, we propose Multi-Net, a fast and scalable embedding technique for multilayer networks. Our work adds a new wrinkle to the recently introduced family of network embeddings like node2vec, LINE, DeepWalk, SIGNet, sub2vec, graph2vec, and OhmNet. We demonstrate the usability of Multi-Net by leveraging it to reconstruct the friends and followers network on Twitter using network layers mined from the body of tweets, like the mentions network and the retweet network. This is a work-in-progress paper and our preliminary contribution to multilayer network embeddings. MultiNet++ Multi-task learning is commonly used in autonomous driving for solving various visual perception tasks. It offers significant benefits in terms of both performance and computational complexity. Current work on multi-task learning networks focuses on processing a single input image and there is no known implementation of multi-task learning handling a sequence of images. In this work, we propose a multi-stream multi-task network to take advantage of using feature representations from preceding frames in a video sequence for joint learning of segmentation, depth, and motion. The weights of the current and previous encoder are shared so that features computed in the previous frame can be leveraged without additional computation. In addition, we propose to use the geometric mean of task losses as a better alternative to the weighted average of task losses. The proposed loss function facilitates better handling of the difference in convergence rates of different tasks. 
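As a rough illustration of the loss combination described above, the following sketch contrasts a weighted average of task losses with their geometric mean. It assumes strictly positive scalar losses; the function names and the example loss values are hypothetical, and the exact formulation used in the MultiNet++ paper may differ.

```python
import math

def weighted_average_loss(losses, weights):
    """Weighted arithmetic mean of per-task losses."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

def geometric_mean_loss(losses):
    """Geometric mean (L_1 * ... * L_n) ** (1 / n), computed in log space.

    Assumes every loss is strictly positive.
    """
    if any(l <= 0 for l in losses):
        raise ValueError("geometric mean requires strictly positive losses")
    return math.exp(sum(math.log(l) for l in losses) / len(losses))

# Hypothetical per-task losses on very different scales.
task_losses = {"segmentation": 0.8, "depth": 0.05, "motion": 2.0}
values = list(task_losses.values())

avg = weighted_average_loss(values, [1.0, 1.0, 1.0])
geo = geometric_mean_loss(values)
print(f"average={avg:.4f} geometric_mean={geo:.4f}")

# Halving any single task loss scales the geometric mean by the same factor,
# 2 ** (-1 / 3), no matter how large or small that loss was to begin with.
ratios = []
for i in range(len(values)):
    halved = values[:i] + [values[i] / 2] + values[i + 1:]
    ratios.append(geometric_mean_loss(halved) / geo)
```

In the plain average, the largest loss dominates the total, so the same relative improvement on a small-scale task barely changes it. The geometric mean rewards equal relative improvements equally, which is one way to read the claim about handling different convergence rates.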
Experimental results on KITTI, Cityscapes and SYNTHIA datasets demonstrate that the proposed strategies outperform various existing multi-task learning solutions. Multi-Node2vec Learning interpretable features from complex multilayer networks is a challenging and important problem. The need for such representations is particularly evident in multilayer networks of the brain, where nodal characteristics may help model and differentiate regions of the brain according to individual, cognitive task, or disease. Motivated by this problem, we introduce the multi-node2vec algorithm, an efficient and scalable feature engineering method that automatically learns continuous node feature representations from multilayer networks. Multi-node2vec relies upon a second-order random walk sampling procedure that efficiently explores the inter- and intra-layer ties of the observed multilayer network to identify multilayer neighborhoods. Maximum likelihood estimators of the nodal features are identified through the use of the Skip-gram neural network model on the collection of sampled neighborhoods. We investigate the conditions under which multi-node2vec is an approximation of a closed-form matrix factorization problem. We demonstrate the efficacy of multi-node2vec on a multilayer functional brain network from resting state fMRI scans over a group of 74 healthy individuals. We find that multi-node2vec outperforms contemporary methods on complex networks, and that multi-node2vec identifies nodal characteristics that closely associate with the functional organization of the brain. Multinomial Probit Bayesian Additive Regression Trees(MPBART) mpbart Multi-Object Tracking(MOT) In this paper, we propose a unified Multi-Object Tracking (MOT) framework learning to make full use of long term and short term cues for handling complex cases in MOT scenes. 
Besides, for better association, we propose switcher-aware classification (SAC), which takes the potential identity-switch causer (switcher) into consideration. Specifically, the proposed framework includes a Single Object Tracking (SOT) sub-net to capture short term cues, a re-identification (ReID) sub-net to extract long term cues and a switcher-aware classifier to make matching decisions using extracted features from the main target and the switcher. Short term cues help to find false negatives, while long term cues avoid critical mistakes when occlusion happens, and the SAC learns to combine multiple cues in an effective way and improves robustness. The method is evaluated on the challenging MOT benchmarks and achieves the state-of-the-art results. Multi-Object Tracking and Segmentation(MOTS) This paper extends the popular task of multi-object tracking to multi-object tracking and segmentation (MOTS). Towards this goal, we create dense pixel-level annotations for two existing tracking datasets using a semi-automatic annotation procedure. Our new annotations comprise 70,430 pixel masks for 1,084 distinct objects (cars and pedestrians) in 10,870 video frames. For evaluation, we extend existing multi-object tracking metrics to this new task. Moreover, we propose a new baseline method which jointly addresses detection, tracking, and segmentation with a single convolutional network. We demonstrate the value of our datasets by achieving improvements in performance when training on MOTS annotations. We believe that our datasets, metrics and baseline will become a valuable resource towards developing multi-object tracking approaches that go beyond 2D bounding boxes. Multi-Objective Automated Negotiation Based Online Feature Selection(MOANOFS) Feature Selection (FS) plays an important role in learning and classification tasks. The object of FS is to select the relevant and non-redundant features. 
Considering the huge number of features in real-world applications, FS methods using batch learning techniques cannot handle big data problems, especially when data arrive sequentially. In this paper, we propose an online feature selection system which resolves this problem. More specifically, we treat the problem of online supervised feature selection for binary classification as a decision-making problem. A philosophical view of this problem leads to a hybridization between two important domains: feature selection using online learning techniques (OFS) and automated negotiation (AN). The proposed OFS system, called MOANOFS (Multi-Objective Automated Negotiation based Online Feature Selection), uses two levels of decision. In the first level, from n learners (or OFS methods), we decide which are the k trustful ones (with high confidence or trust value). These elected k learners will participate in the second level. In this level, we integrate our proposed Multilateral Automated Negotiation based OFS (MANOFS) method to finally decide which is the best solution, that is, which features are relevant. We show that the MOANOFS system is successfully applicable to different domains and achieves high accuracy on several real-world applications. Index Terms: Feature selection, online learning, multi-objective automated negotiation, trust, classification, big data. Multiobjective Complex Systems This article focuses on the optimization of a complex system which is composed of several subsystems. On the one hand, these subsystems are subject to multiple objectives, local constraints as well as local variables, and each is associated with its own, subsystem-dependent decision maker. On the other hand, these subsystems are interconnected to each other by global variables or linking constraints. Due to these interdependencies, it is in general not possible to simply optimize each subsystem individually to improve the performance of the overall system. 
This article introduces a formal graph-based representation of such complex systems and generalizes the classical notions of feasibility and optimality to match this complex situation. Moreover, several algorithmic approaches are suggested and analyzed. Multi-Objective Deep Reinforcement Learning(MODRL) This paper presents a new multi-objective deep reinforcement learning (MODRL) framework based on deep Q-networks. We propose linear and non-linear methods to develop the MODRL framework that includes both single-policy and multi-policy strategies. The experimental results on a deep sea treasure environment indicate that the proposed approach is able to converge to the optimal Pareto solutions. The proposed framework is generic, which allows implementation of different deep reinforcement learning algorithms in various complex environments. Details of the framework implementation can be referred to http://…/drl.htm. Multiobjective Evolutionary Algorithm(MOEA) A multiobjective optimization problem involves several conflicting objectives and has a set of Pareto optimal solutions. By evolving a population of solutions, multiobjective evolutionary algorithms (MOEAs) are able to approximate the Pareto optimal set in a single run. MOEAs have attracted a lot of research effort during the last 20 years, and they are still one of the hottest research areas in the field of evolutionary computation. This paper surveys the development of MOEAs primarily during the last eight years. It covers algorithmic frameworks such as decomposition-based MOEAs (MOEA/Ds), memetic MOEAs, coevolutionary MOEAs, selection and offspring reproduction operators, MOEAs with specific search methods, MOEAs for multimodal problems, constraint handling and MOEAs, computationally expensive multiobjective optimization problems (MOPs), dynamic MOPs, noisy MOPs, combinatorial and discrete MOPs, benchmark problems, performance indicators, and applications. 
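As a small illustration of the Pareto optimal set that MOEAs try to approximate, the following minimal sketch filters the nondominated points out of a finite list of objective vectors. It assumes that every objective is to be minimized; the function names `dominates` and `pareto_front` are hypothetical and are not taken from any of the surveyed algorithms.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a dominates b under minimization.

    a dominates b when a is no worse in every objective and
    strictly better in at least one objective.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points that no other point dominates (O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Illustrative objective vectors (cost, error); both are minimized.
    candidates = [(1, 5), (2, 2), (3, 1), (3, 3), (4, 4)]
    print(pareto_front(candidates))
```

A real MOEA would evolve the candidate population rather than scan a fixed list; the quadratic scan here is only meant to make the dominance relation concrete.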
Multiobjective Evolutionary Algorithms based on Decomposition(MOEA/D) Multiobjective Evolutionary Algorithms based on Decomposition (MOEA/D) represent a widely used class of population-based metaheuristics for the solution of multicriteria optimization problems. We introduce the MOEADr package, which offers many of these variants as instantiations of a component-oriented framework. This approach contributes to easier reproducibility of existing MOEA/D variants from the literature, as well as to faster development and testing of new composite algorithms. The package offers a standardized, modular implementation of MOEA/D based on this framework, which was designed to provide researchers and practitioners with a standard way to discuss and express MOEA/D variants. In this paper we introduce the design principles behind the MOEADr package, as well as its current components. Three case studies are provided to illustrate the main aspects of the package. Multi-Objective Evolutionary Federated Learning Federated learning is an emerging technique used to prevent the leakage of private information. Unlike centralized learning that needs to collect data from users and store them collectively on a cloud server, federated learning makes it possible to learn a global model while the data are distributed on the users’ devices. However, compared with the traditional centralized approach, the federated setting consumes considerable communication resources of the clients, which is indispensable for updating global models and prevents this technique from being widely used. In this paper, we aim to optimize the structure of the neural network models in federated learning using a multi-objective evolutionary algorithm to simultaneously minimize the communication costs and the global model test errors. A scalable method for encoding network connectivity is adapted to federated learning to enhance the efficiency in evolving deep neural networks. 
Experimental results on both multilayer perceptrons and convolutional neural networks indicate that the proposed optimization method is able to find optimized neural network models that can not only significantly reduce communication costs but also improve the learning performance of federated learning compared with the standard fully connected neural networks. Multi-Objective Neural Architecture Search(MONAS) Recent studies on neural architecture search have shown that automatically designed neural networks perform as well as human-designed architectures. However, most existing works on neural architecture search aim at finding architectures that optimize for prediction accuracy. These methods may generate complex architectures with excessively high energy consumption, which are not suitable for computing environments with limited power budgets. We propose MONAS, a Multi-Objective Neural Architecture Search with novel reward functions that consider both prediction accuracy and power consumption when exploring neural architectures. MONAS effectively explores the design space and searches for architectures satisfying the given requirements. The experimental results demonstrate that the architectures found by MONAS achieve accuracy comparable to or better than the state-of-the-art models, while having better energy efficiency. Multi-Objective Nonnegative Matrix Factorization(MO-NMF) Nonnegative matrix factorization (NMF) is a linear dimensionality reduction technique for analyzing nonnegative data. A key aspect of NMF is the choice of the objective function that depends on the noise model (or statistics of the noise) assumed on the data. In many applications, the noise model is unknown and difficult to estimate. In this paper, we define a multi-objective nonnegative matrix factorization (MO-NMF) problem, where several objectives are combined within the same NMF model. 
We propose to use Lagrange duality to judiciously optimize for a set of weights to be used within the framework of the weighted-sum approach, that is, we minimize a single objective function which is a weighted sum of all the objective functions. We design a simple algorithm using multiplicative updates to minimize this weighted sum. We show how this can be used to find distributionally robust NMF (DR-NMF) solutions, that is, solutions that minimize the largest error among all objectives. We illustrate the effectiveness of this approach on synthetic, document and audio datasets. The results show that DR-NMF is robust to our lack of knowledge of the noise model of the NMF problem. Multi-Objective Optimization Multi-objective optimization (also known as multi-objective programming, vector optimization, multicriteria optimization, multiattribute optimization or Pareto optimization) is an area of multiple criteria decision making that is concerned with mathematical optimization problems involving more than one objective function to be optimized simultaneously. Multi-objective optimization has been applied in many fields of science, including engineering, economics and logistics, where optimal decisions need to be taken in the presence of trade-offs between two or more conflicting objectives. Minimizing cost while maximizing comfort when buying a car, and maximizing performance whilst minimizing fuel consumption and emission of pollutants of a vehicle, are examples of multi-objective optimization problems involving two and three objectives, respectively. In practical problems, there can be more than three objectives. For a nontrivial multi-objective optimization problem, no single solution exists that simultaneously optimizes each objective. In that case, the objective functions are said to be conflicting, and there exists a (possibly infinite) number of Pareto optimal solutions. 
A solution is called nondominated, Pareto optimal, Pareto efficient or noninferior, if none of the objective functions can be improved in value without degrading some of the other objective values. Without additional subjective preference information, all Pareto optimal solutions are considered equally good (as vectors cannot be ordered completely). Researchers study multi-objective optimization problems from different viewpoints and, thus, there exist different solution philosophies and goals when setting and solving them. The goal may be to find a representative set of Pareto optimal solutions, and/or quantify the trade-offs in satisfying the different objectives, and/or find a single solution that satisfies the subjective preferences of a human decision maker (DM). Multiobjective Programming Multi-objective optimization (also known as multi-objective programming, vector optimization, multicriteria optimization, multiattribute optimization or Pareto optimization) is an area of multiple criteria decision making that is concerned with mathematical optimization problems involving more than one objective function to be optimized simultaneously. Multi-objective optimization has been applied in many fields of science, including engineering, economics and logistics, where optimal decisions need to be taken in the presence of trade-offs between two or more conflicting objectives. Minimizing cost while maximizing comfort when buying a car, and maximizing performance whilst minimizing fuel consumption and emission of pollutants of a vehicle, are examples of multi-objective optimization problems involving two and three objectives, respectively. In practical problems, there can be more than three objectives. For a nontrivial multi-objective optimization problem, there does not exist a single solution that simultaneously optimizes each objective. 
In that case, the objective functions are said to be conflicting, and there exists a (possibly infinite) number of Pareto optimal solutions. A solution is called nondominated, Pareto optimal, Pareto efficient or noninferior, if none of the objective functions can be improved in value without degrading some of the other objective values. Without additional subjective preference information, all Pareto optimal solutions are considered equally good (as vectors cannot be ordered completely). Researchers study multi-objective optimization problems from different viewpoints and, thus, there exist different solution philosophies and goals when setting and solving them. The goal may be to find a representative set of Pareto optimal solutions, and/or quantify the trade-offs in satisfying the different objectives, and/or find a single solution that satisfies the subjective preferences of a human decision maker (DM). Multi-Objective Reinforced Evolution in Mobile Neural Architecture Search(MoreMNAS) Fabricating neural models for a wide range of mobile devices demands specific network designs due to highly constrained resources. Both evolutionary algorithms (EA) and reinforcement learning methods (RL) have been introduced to address Neural Architecture Search, and distinct efforts to integrate both categories have also been proposed. However, these combinations usually concentrate on a single objective such as the error rate of image classification. They also fail to harness the benefits of both sides. In this paper, we present a new multi-objective oriented algorithm called MoreMNAS (Multi-Objective Reinforced Evolution in Mobile Neural Architecture Search) by leveraging the virtues of both EA and RL. In particular, we incorporate a variant of the multi-objective genetic algorithm NSGA-II, in which the search space is composed of various cells so that crossovers and mutations can be performed at the cell level. 
Moreover, reinforced control is mixed with a random process to regulate arbitrary mutation, maintaining a delicate balance between exploration and exploitation. Therefore, not only does our method prevent the searched models from degrading during the evolution process, but it also makes better use of learned knowledge. Our preliminary experiments conducted in the Super Resolution (SR) domain deliver models that rival some state-of-the-art methods with far fewer FLOPS. More results will be disclosed very soon. Multi-Output Convolution Spectral Mixture Kernel(MOCSM) Multi-output Gaussian processes (MOGPs) have recently been extended with spectral mixture kernels, which enable expressive pattern extrapolation with a strong interpretation. In particular, the Multi-Output Spectral Mixture kernel (MOSM) is a recent, powerful state-of-the-art method. However, MOSM cannot reduce to the ordinary spectral mixture kernel (SM) when using a single channel. Moreover, when the spectral densities of different channels are either very close to or very far from each other in the frequency domain, MOSM generates unreasonable scale effects on cross weights, which produces an incorrect description of the channel correlation structure. In this paper, we tackle these drawbacks and introduce a principled multi-output convolution spectral mixture kernel (MOCSM) framework. In our framework, we model channel dependencies through convolution of time and phase delayed spectral mixtures between different channels. Results of extensive experiments on synthetic and real datasets demonstrate the advantages of MOCSM and its state-of-the-art performance. Multi-Output Learning Multi-output learning aims to simultaneously predict multiple outputs given an input. It is an important learning problem due to the pressing need for sophisticated decision making in real-world applications. 
Inspired by big data, the 4Vs characteristics of multi-output impose a set of challenges on multi-output learning, in terms of the volume, velocity, variety and veracity of the outputs. An increasing number of works in the literature have been devoted to the study of multi-output learning and the development of novel approaches for addressing the challenges encountered. However, the literature lacks a comprehensive overview of the different types of challenges that the characteristics of the multiple outputs bring to multi-output learning, and of the techniques proposed to overcome them. Multi-Parameter Regression(MPR) It is standard practice for covariates to enter a parametric model through a single distributional parameter of interest, for example, the scale parameter in many standard survival models. Indeed, the well-known proportional hazards model is of this kind. In this paper we discuss a more general approach whereby covariates enter the model through more than one distributional parameter simultaneously (e.g., scale and shape parameters). We refer to this practice as ‘multi-parameter regression’ (MPR) modelling and explore its use in a survival analysis context. We find that multi-parameter regression leads to more flexible models which can offer greater insight into the underlying data generating process. To illustrate the concept, we consider the two-parameter Weibull model which leads to time-dependent hazard ratios, thus relaxing the typical proportional hazards assumption and motivating a new test of proportionality. A novel variable selection strategy is introduced for such multi-parameter regression models. It accounts for the correlation arising between the estimated regression coefficients in two or more linear predictors – a feature which has not been considered by other authors in similar settings. The methods discussed have been implemented in the mpr package in R. mpr Multi-Player Bandit We study stochastic multi-armed bandits with many players. 
The players do not know the number of players and cannot communicate with each other, and if multiple players select a common arm, they collide and none of them receives any reward. We consider the static scenario, where the number of players remains fixed, and the dynamic scenario, where the players enter and leave at any time. We provide algorithms based on a novel ‘trekking approach’ that guarantees constant regret for the static case and sub-linear regret for the dynamic case with high probability. The trekking approach eliminates the need to estimate the number of players, resulting in fewer collisions and improved regret performance compared to the state-of-the-art algorithms. We also develop an epoch-less algorithm that eliminates any requirement of time synchronization across the players, provided each player can detect the presence of other players on an arm. We validate our theoretical guarantees using simulation-based and real test-bed-based experiments. Multi-Player Bandits: The Adversarial Case Multiple Block Convolutional Highway(MBCH) In the Text Classification areas of Sentiment Analysis, Subjectivity/Objectivity Analysis, and Opinion Polarity, Convolutional Neural Networks have gained special attention because of their performance and accuracy. In this work, we apply recent advances in CNNs and propose a novel architecture, Multiple Block Convolutional Highways (MBCH), which achieves improved accuracy on multiple popular benchmark datasets, compared to previous architectures. The MBCH is based on new techniques and architectures including highway networks, DenseNet, batch normalization and bottleneck layers. In addition, to cope with the limitations of existing pre-trained word vectors which are used as inputs for the CNN, we propose a novel method, Improved Word Vectors (IWV). The IWV improves the accuracy of CNNs which are used for text classification tasks. 
Multiple Block-Wise Imputation(MBI) For multi-source data, blocks of variable information from certain sources are likely missing. Existing methods for handling missing data do not take structures of block-wise missing data into consideration. In this paper, we propose a Multiple Block-wise Imputation (MBI) approach, which incorporates imputations based on both complete and incomplete observations. Specifically, for a given missing pattern group, the imputations in MBI incorporate more samples from groups with fewer observed variables in addition to the group with complete observations. We propose to construct estimating equations based on all available information, and optimally integrate informative estimating functions to achieve efficient estimators. We show that the proposed method has estimation and model selection consistency under both fixed-dimensional and high-dimensional settings. Moreover, the proposed estimator is asymptotically more efficient than the estimator based on a single imputation from complete observations only. In addition, the proposed method is not restricted to missing completely at random. Numerical studies and ADNI data application confirm that the proposed method outperforms existing variable selection methods under various missing mechanisms. Multiple Correspondence Analysis(MCA) In statistics, multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space. The procedure thus appears to be the counterpart of principal component analysis for categorical data. MCA is an extension of simple correspondence analysis (CA) in that it is applicable to a large set of categorical variables. GDAtools Multiple Criteria Decision Making(MCDM) A novel approach for solving a multiple judge, multiple criteria decision making (MCDM) problem is proposed. 
The ranking of alternatives that are evaluated based on multiple criteria is difficult, since the presence of multiple criteria leads to a non-total order relation. This issue is handled by reinterpreting the MCDM problem as a multivariate statistics one and by solving it via set optimization methods. A function that ranks alternatives as well as additional functions that categorize alternatives into sets of ‘good’ and ‘bad’ choices are presented. Moreover, the paper shows that the properties of these functions ensure a logical and reasonable decision making process. Multiple Factor Analysis(MFA) Multiple factor analysis (MFA) is a factorial method devoted to the study of tables in which a group of individuals is described by a set of variables (quantitative and / or qualitative) structured in groups. It may be seen as an extension of: · Principal component analysis (PCA) when variables are quantitative, · Multiple correspondence analysis (MCA) when variables are qualitative, · Factor analysis of mixed data (FAMD) when the active variables belong to the two types. FactoMineR,MFAg Multiple Graph Adversarial Learning(MGAL) Recently, Graph Convolutional Networks (GCNs) have been widely studied for graph-structured data representation and learning. However, in many real applications, data are coming with multiple graphs, and it is non-trivial to adapt GCNs to deal with data representation with multiple graph structures. One main challenge for multi-graph representation is how to exploit both structure information of each individual graph and correlation information across multiple graphs simultaneously. In this paper, we propose a novel Multiple Graph Adversarial Learning (MGAL) framework for multi-graph representation and learning. 
MGAL aims to learn an optimal structure-invariant and consistent representation for multiple graphs in a common subspace via a novel adversarial learning framework, which thus incorporates both intra-graph structure information and inter-graph correlation information simultaneously. Based on MGAL, we then provide a unified network for semi-supervised learning tasks. Promising experimental results demonstrate the effectiveness of the MGAL model. Multiple Graph Optimized Convolutional Network(M-GOCN) Graph Convolutional Networks (GCNs) have been widely studied for graph data representation and learning tasks. Existing GCNs generally use a fixed single graph, which may lead to suboptimal data representation/learning, and they also have difficulty dealing with multiple graphs. To address these issues, we propose a novel Graph Optimized Convolutional Network (GOCN) for graph data representation and learning. Our GOCN is motivated by our re-interpretation of graph convolution from a regularization/optimization framework. The core idea of GOCN is to formulate graph optimization and graph convolutional representation into a unified framework and thus conduct both of them cooperatively to boost their respective performance in the GCN learning scheme. Moreover, based on the proposed unified graph optimization-convolution framework, we propose a novel Multiple Graph Optimized Convolutional Network (M-GOCN) to naturally address data with multiple graphs. Experimental results demonstrate the effectiveness and benefit of the proposed GOCN and M-GOCN. Multiple Hypotheses Propagation for Video Object Segmentation(MHP-VOS) We address the problem of semi-supervised video object segmentation (VOS), where the masks of objects of interest are given in the first frame of an input video. To deal with challenging cases where objects are occluded or missing, previous work relies on greedy data association strategies that make decisions for each frame individually. 
In this paper, we propose a novel approach to defer the decision making for a target object in each frame, until a global view can be established with the entire video being taken into consideration. Our approach is in the same spirit as Multiple Hypotheses Tracking (MHT) methods, making several critical adaptations for the VOS problem. We employ the bounding box (bbox) hypothesis for tracking tree formation, and the multiple hypotheses are spawned by propagating the preceding bbox into the detected bbox proposals within a gated region starting from the initial object mask in the first frame. The gated region is determined by a gating scheme which takes into account a more comprehensive motion model rather than the simple Kalman filtering model in traditional MHT. To further design more customized algorithms tailored for VOS, we develop a novel mask propagation score instead of the appearance similarity score that could be brittle due to large deformations. The mask propagation score, together with the motion score, determines the affinity between the hypotheses during tree pruning. Finally, a novel mask merging strategy is employed to handle mask conflicts between objects. Extensive experiments on challenging datasets demonstrate the effectiveness of the proposed method, especially when objects are missing. Multiple Independent Subspace Clustering(MISC) Multiple clustering aims at discovering diverse ways of organizing data into clusters. Despite the progress made, it’s still a challenge for users to analyze and understand the distinctive structure of each output clustering. To ease this process, we consider diverse clusterings embedded in different subspaces, and analyze the embedding subspaces to shed light on the structure of each clustering. To this end, we provide a two-stage approach called MISC (Multiple Independent Subspace Clusterings). In the first stage, MISC uses independent subspace analysis to seek multiple statistically independent (i.e. 
non-redundant) subspaces, and determines the number of subspaces via the minimum description length principle. In the second stage, to account for the intrinsic geometric structure of samples embedded in each subspace, MISC performs graph regularized semi-nonnegative matrix factorization to explore clusters. It additionally integrates the kernel trick into matrix factorization to handle non-linearly separable clusters. Experimental results on synthetic datasets show that MISC can find different interesting clusterings from the sought independent subspaces, and it also outperforms other related and competitive approaches on real-world datasets. Multiple Instance Learning(MIL) We describe a novel weakly supervised deep learning framework that combines both discriminative and generative models to learn a meaningful representation in the multiple instance learning (MIL) setting. MIL is a weakly supervised learning problem where labels are associated with groups of instances (referred to as bags) instead of individual instances. To address the essential challenge in MIL problems arising from the uncertainty of positive instance labels, we use a discriminative model regularized by variational autoencoders (VAEs) to maximize the differences between latent representations of all instances and negative instances. As a result, the hidden layer of the variational autoencoder learns a meaningful representation. This representation can effectively be used for MIL problems, as illustrated by better performance on the standard benchmark datasets compared to the state-of-the-art approaches. More importantly, unlike most related studies, the proposed framework can be easily scaled to large dataset problems, as illustrated by the audio event detection and segmentation task. Visualization also confirms the effectiveness of the latent representation in discriminating positive and negative classes. 
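To make the bag-level labeling in MIL concrete, here is a minimal sketch of the standard MIL assumption, under which a bag is positive if at least one of its instances is positive. It is not the VAE-regularized framework described above; the function names `bag_score` and `bag_label`, the max-pooling rule and the 0.5 threshold are all illustrative assumptions.

```python
from typing import Sequence


def bag_score(instance_scores: Sequence[float]) -> float:
    """Aggregate per-instance scores into one bag score by max pooling."""
    if not instance_scores:
        raise ValueError("a bag must contain at least one instance")
    return max(instance_scores)


def bag_label(instance_scores: Sequence[float], threshold: float = 0.5) -> int:
    """Label the bag 1 if its most positive instance reaches the threshold."""
    return int(bag_score(instance_scores) >= threshold)


if __name__ == "__main__":
    # Hypothetical instance scores produced by some instance-level classifier.
    print(bag_label([0.1, 0.9, 0.2]))  # one confident instance makes the bag positive
    print(bag_label([0.1, 0.2]))       # no instance reaches the threshold
```

Max pooling is only one possible aggregation; mean or attention-based pooling are common alternatives when the standard assumption is too strict.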
Multiple Instance Spatial Transformer Network(MIST) We propose a deep network that can be trained to tackle image reconstruction and classification problems that involve detection of multiple object instances, without any supervision regarding their whereabouts. The network learns to extract the most significant top-K patches, and feeds these patches to a task-specific network (e.g., an auto-encoder or classifier) to solve a domain-specific problem. The challenge in training such a network is the non-differentiable top-K selection process. To address this issue, we lift the training optimization problem by treating the result of top-K selection as a slack variable, resulting in a simple, yet effective, multi-stage training. Our method is able to learn to detect recurrent structures in the training dataset by learning to reconstruct images. It can also learn to localize structures when only knowledge on the occurrence of the object is provided, and in doing so it outperforms the state-of-the-art. Multiple Measurement Vectors(MMV) Multiple Measurement Vectors Problem: A Decoupling Property and its Applications Multiple Response Permutation Procedure(MRPP) Multiple Response Permutation Procedure (MRPP) provides a test of whether there is a significant difference between two or more groups of sampling units. vegan,Blossom Multiple Search Neuroevolution(MSN) This paper presents an evolutionary metaheuristic called Multiple Search Neuroevolution (MSN) to optimize deep neural networks. The algorithm attempts to search multiple promising regions in the search space simultaneously, maintaining sufficient distance between them. It is tested by training neural networks for two tasks, and compared with other optimization algorithms. The first task is to solve Global Optimization functions with challenging topographies. We found MSN to outperform classic optimization algorithms such as Evolution Strategies, reducing the number of optimization steps performed by at least 2X. 
The second task is to train a convolutional neural network (CNN) on the popular MNIST dataset. Using 3.33% of the training set, MSN reaches a validation accuracy of 90%. Stochastic Gradient Descent (SGD) was able to match the same accuracy figure, while taking 7X fewer optimization steps. Despite lagging, the fact that the MSN metaheuristic trains a 4.7M-parameter CNN suggests promise for future development. This is by far the largest network ever evolved using a pool of only 50 samples. Multiple Team Formation Problem(MTFP) Allocating people to multiple projects is an important issue considering the efficiency of groups from the point of view of social interaction. In this paper, based on previous works, the Multiple Team Formation Problem (MTFP) based on sociometric techniques is formulated as an optimization problem taking into account the social interaction among team members. To solve the resulting optimization problem, we propose a Genetic Algorithm due to the NP-hard nature of the problem. Social cohesion is an important issue that directly impacts the productivity of the work environment. So, maintaining an appropriate level of cohesion keeps a group together, which will bring positive impacts on the results of a project. The aim of the proposal is to ensure the best possible effectiveness from the point of view of social interaction. In this way, the presented algorithm serves as a decision-making tool for managers to build teams of people in multiple projects. In order to analyze the performance of the proposed method, computational experiments with benchmarks were performed and compared with the exhaustive method. The results are promising and show that the algorithm generally obtains near-optimal results within a short computational time. 
Multiple-Criteria Decision Analysis(MCDA) ➘ “Multiple-Criteria Decision Analysis” Multiple-Input Multiple-Output(MIMO) A Model-Driven Deep Learning Network for MIMO Detection Multiple-Kernel Dictionary Learning(MKD) There exist many approaches for description and recognition of unseen classes in datasets. Nevertheless, it becomes a challenging problem when we deal with multivariate time-series (MTS) (e.g., motion data), where we cannot apply the vectorial algorithms directly to the inputs. In this work, we propose a novel multiple-kernel dictionary learning (MKD) which learns semantic attributes based on specific combinations of MTS dimensions in the feature space. Hence, MKD can fully or partially reconstruct the unseen classes based on the training data (seen classes). Furthermore, we obtain sparse encodings for unseen classes based on the learned MKD attributes, upon which we propose a simple but effective incremental clustering algorithm to categorize the unseen MTS classes in an unsupervised way. According to the empirical evaluation of our MKD framework on real benchmarks, it provides an interpretable reconstruction of unseen MTS data as well as a high performance regarding their online clustering. Multiple-Output Regression Predicting multivariate responses in multiple linear regression. Multi-output Decision Tree Regression Multiple Output Regression Multiplicative Integration(MI) We introduce a general and simple structural design called Multiplicative Integration (MI) to improve recurrent neural networks (RNNs). MI changes the way in which information from different sources flows and is integrated in the computational building block of an RNN, while introducing almost no extra parameters. The new structure can be easily embedded into many popular RNN models, including LSTMs and GRUs. We empirically analyze its learning behaviour and conduct evaluations on several tasks using different RNN models. 
Our experimental results demonstrate that Multiplicative Integration can provide a substantial performance boost over many of the existing RNN models. Multiplicative Latent Force Model Bayesian modelling of dynamic systems must achieve a compromise between providing a complete mechanistic specification of the process while retaining the flexibility to handle those situations in which data is sparse relative to model complexity, or a full specification is hard to motivate. Latent force models achieve this dual aim by specifying a parsimonious linear evolution equation with an additive latent Gaussian process (GP) forcing term. In this work we extend the latent force framework to allow for multiplicative interactions between the GP and the latent states leading to more control over the geometry of the trajectories. Unfortunately inference is no longer straightforward and so we introduce an approximation based on the method of successive approximations and examine its performance using a simulation study. Multiplicative Weights Update(MWU) The multiplicative weights update method is an algorithmic technique most commonly used for decision making and prediction, and also widely deployed in game theory and algorithm design. The simplest use case is the problem of prediction from expert advice, in which a decision maker needs to iteratively decide on an expert whose advice to follow. The method assigns initial weights to the experts (usually identical initial weights), and updates these weights multiplicatively and iteratively according to the feedback of how well an expert performed: reducing it in case of poor performance, and increasing it otherwise. It was discovered repeatedly in very diverse fields such as machine learning (AdaBoost, Winnow, Hedge), optimization (solving linear programs), theoretical computer science (devising fast algorithms for LPs and SDPs), and game theory. 
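The weight update described in the MWU entry can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `multiplicative_weights`, the learning rate `eta`, and the convention that each round supplies one cost per expert in [-1, 1] (positive for poor performance, negative for good performance) are assumptions made for the example.

```python
def multiplicative_weights(cost_rounds, eta=0.5):
    """Return the final normalized expert weights.

    cost_rounds: list of rounds; each round is a list with one cost per
                 expert, assumed to lie in [-1, 1] (hypothetical convention).
    eta:         learning rate, assumed to satisfy 0 < eta <= 0.5 so that
                 every factor (1 - eta * cost) stays positive.
    """
    n_experts = len(cost_rounds[0])
    weights = [1.0] * n_experts  # identical initial weights
    for costs in cost_rounds:
        # Multiplicative update: a positive cost shrinks the weight,
        # a negative cost (good performance) grows it.
        weights = [w * (1.0 - eta * c) for w, c in zip(weights, costs)]
    total = sum(weights)
    return [w / total for w in weights]


# Expert 0 is always wrong (cost +1), expert 1 is always right (cost -1).
probs = multiplicative_weights([[1, -1], [1, -1], [1, -1]])
```

After a few rounds nearly all of the normalized weight sits on the expert that performed well, which is the distribution a decision maker would sample from when choosing whose advice to follow.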
Multiplicative Weights Updates as a distributed constrained optimization algorithm: Convergence to second-order stationary points almost always Multipolar Analytics The layer-cake best-practice model of analytics (operational systems and external data feeding data marts and a data warehouse, with BI tools as the cherry on the top) is rapidly becoming obsolete. It’s being replaced by a new, multi-polar model where data is collected and analyzed in multiple places, according to the type of data and analysis required: · New HTAP systems (traditional operational data and real-time analytics) · Traditional data warehouses (finance, budgets, corporate KPIs, etc.) · Hadoop/Spark (sensor and polystructured data, long-term storage and analysis) · Standalone BI systems (personal and departmental analytics, including spreadsheets) Multi-Probe Count An important question that arises in the study of high dimensional vector representations learned from data is: given a set $\mathcal{D}$ of vectors and a query $q$, estimate the number of points within a specified distance threshold of $q$. We develop two estimators, LSH Count and Multi-Probe Count that use locality sensitive hashing to preprocess the data to accurately and efficiently estimate the answers to such questions via importance sampling. A key innovation is the ability to maintain a small number of hash tables via preprocessing data structures and algorithms that sample from multiple buckets in each hash table. We give bounds on the space requirements and sample complexity of our schemes, and demonstrate their effectiveness in experiments on a standard word embedding dataset. Multi-Range Reasoning Unit(MRU) We propose MRU (Multi-Range Reasoning Units), a new fast compositional encoder for machine comprehension (MC). 
Our proposed MRU encoders are characterized by multi-ranged gating, executing a series of parameterized contract-and-expand layers for learning gating vectors that benefit from long and short-term dependencies. The aims of our approach are as follows: (1) learning representations that are concurrently aware of long and short-term context, (2) modeling relationships between intra-document blocks and (3) fast and efficient sequence encoding. We show that our proposed encoder demonstrates promising results both as a standalone encoder and as a complementary building block. We conduct extensive experiments on three challenging MC datasets, namely RACE, SearchQA and NarrativeQA, achieving highly competitive performance on all. On the RACE benchmark, our model outperforms DFN (Dynamic Fusion Networks) by 1.5%-6% without using any recurrent or convolution layers. Similarly, we achieve competitive performance relative to AMANDA on the SearchQA benchmark and BiDAF on the NarrativeQA benchmark without using any LSTM/GRU layers. Finally, incorporating MRU encoders with standard BiLSTM architectures further improves performance, achieving state-of-the-art results. Multi-Reference Cosine The importance of an efficient and scalable document similarity detection system is undeniable nowadays. Search engines need batch text similarity measures to detect duplicated and near-duplicated web pages in their indexes in order to prevent indexing a web page multiple times. Furthermore, in the scoring phase, search engines need similarity measures to detect duplicated contents on web pages so as to increase the quality of their results. In this paper, a new approach to batch text similarity detection is proposed by combining some ideas from dimensionality reduction techniques and information gain theory. The new approach is focused on search engines' need to detect duplicated and near-duplicated web pages. 
The new approach is evaluated on the NEWS20 dataset and the results show that it outperforms the cosine text similarity algorithm in terms of both speed and performance. On top of that, it is faster and more accurate than the rival method, the Simhash similarity algorithm. Multiregression Dynamic Models(MDM) Multiregression dynamic models are defined to preserve certain conditional independence structures over time across a multivariate time series. They are non-Gaussian and yet they can often be updated in closed form. The first two moments of their one-step-ahead forecast distribution can be easily calculated. Furthermore, they can be built to contain all the features of the univariate dynamic linear model and promise more efficient identification of causal structures in a time series than has been possible in the past. multdyn Multi-Relevance Transfer Learning(MRTL) Transfer learning aims to facilitate learning tasks in a label-scarce target domain by leveraging knowledge from a related source domain with plenty of labeled data. Often we may have multiple domains with little or no labeled data as targets waiting to be solved. Most existing efforts tackle target domains separately by modeling the ‘source-target’ pairs without exploring the relatedness between them, which would cause loss of crucial information, thus failing to achieve optimal capability of knowledge transfer. In this paper, we propose a novel and effective approach called Multi-Relevance Transfer Learning (MRTL) for this purpose, which can simultaneously transfer different knowledge from the source and exploit the shared common latent factors between target domains. Specifically, we formulate the problem as an optimization task based on a collective nonnegative matrix tri-factorization framework. The proposed approach achieves both source-target transfer and target-target leveraging by sharing multiple decomposed latent subspaces. 
Further, an alternative minimization learning algorithm is developed with convergence guarantee. Empirical study validates the performance and effectiveness of MRTL compared to the state-of-the-art methods. Multi-Resolution Flexible Irregular Time Series Network(Multi-FIT) Missing values, irregularly collected samples, and multi-resolution signals commonly occur in multivariate time series data, making predictive tasks difficult. These challenges are especially prevalent in the healthcare domain, where patients’ vital signs and electronic records are collected at different frequencies and have occasionally missing information due to the imperfections in equipment or patient circumstances. Researchers have handled each of these issues differently, often handling missing data through mean value imputation and then using sequence models over the multivariate signals while ignoring the different resolution of signals. We propose a unified model named Multi-resolution Flexible Irregular Time series Network (Multi-FIT). The building block for Multi-FIT is the FIT network. The FIT network creates an informative dense representation at each time step using signal information such as last observed value, time difference since the last observed time stamp and overall mean for the signal. Vertical FIT (FIT-V) is a variant of FIT which also models the relationship between different temporal signals while creating the informative dense representations for the signal. The multi-FIT model uses multiple FIT networks for sets of signals with different resolutions, further facilitating the construction of flexible representations. Our model has three main contributions: a.) it does not impute values but rather creates informative representations to provide flexibility to the model for creating task-specific representations b.) it models the relationship between different signals in the form of support signals c.) 
it models different resolutions in parallel before merging them for the final prediction task. The FIT, FIT-V and Multi-FIT networks improve upon the state-of-the-art models for three predictive tasks, including the forecasting of patient survival. Multiresolution Graph Attention Network A large number of deep learning models have been proposed for the text matching problem, which is at the core of various typical natural language processing (NLP) tasks. However, existing deep models are mainly designed for the semantic matching between a pair of short texts, such as paraphrase identification and question answering, and do not perform well on the task of relevance matching between short-long text pairs. This is partially due to the fact that the essential characteristics of short-long text matching have not been well considered in these deep models. More specifically, these methods fail to handle extreme length discrepancy between text pieces, nor can they fully characterize the underlying structural information in long text documents. In this paper, we are especially interested in relevance matching between a piece of short text and a long document, which is critical to problems like query-document matching in information retrieval and web searching. To extract the structural information of documents, an undirected graph is constructed, with each vertex representing a keyword and the weight of an edge indicating the degree of interaction between keywords. Based on the keyword graph, we further propose a Multiresolution Graph Attention Network to learn multi-layered representations of vertices through a Graph Convolutional Network (GCN), and then match the short text snippet with the graphical representation of the document with the attention mechanisms applied over each layer of the GCN. Experimental results on two datasets demonstrate that our graph approach outperforms other state-of-the-art deep matching models. 
Multi-Resolution Graph Neural Network(MR-GNN) Predicting interactions between structured entities lies at the core of numerous tasks such as drug regimen and new material design. In recent years, graph neural networks have become attractive. They represent structured entities as graphs and then extract features from each individual graph using graph convolution operations. However, these methods have some limitations: i) their networks only extract features from a fixed-size subgraph structure (i.e., a fixed-size receptive field) of each node, and ignore features in substructures of different sizes, and ii) features are extracted by considering each entity independently, which may not effectively reflect the interaction between two entities. To resolve these problems, we present MR-GNN, an end-to-end graph neural network with the following features: i) it uses a multi-resolution based architecture to extract node features from different neighborhoods of each node, and, ii) it uses dual graph-state long short-term memory networks (LSTMs) to summarize local features of each graph and extracts the interaction features between pairwise graphs. Experiments conducted on real-world datasets show that MR-GNN improves the prediction of state-of-the-art methods. Multi-Resolution Scanning(MRS) MRS MultiResUNet In recent years Deep Learning has brought about a breakthrough in Medical Image Segmentation. U-Net is the most prominent deep network in this regard, which has been the most popular architecture in the medical imaging community. Despite outstanding overall performance in segmenting multimodal medical images, from extensive experimentation on challenging datasets, we found that the classical U-Net architecture seems to be lacking in certain aspects. Therefore, we propose some modifications to improve upon the already state-of-the-art U-Net model. 
Hence, following the modifications we develop a novel architecture MultiResUNet as the potential successor to the successful U-Net architecture. We have compared our proposed architecture MultiResUNet with the classical U-Net on a vast repertoire of multimodal medical images. Although the improvements are slight for ideal images, a remarkable gain in performance has been attained for challenging images. We have evaluated our model on five different datasets, each with their own unique challenges, and have obtained a relative improvement in performance of 10.15%, 5.07%, 2.63%, 1.41%, and 0.62% respectively. Multi-Robot Transfer Learning Multi-robot transfer learning allows a robot to use data generated by a second, similar robot to improve its own behavior. The potential advantages are reducing the time of training and the unavoidable risks that exist during the training phase. Transfer learning algorithms aim to find an optimal transfer map between different robots. In this paper, we investigate, through a theoretical study of single-input single-output (SISO) systems, the properties of such optimal transfer maps. We first show that the optimal transfer learning map is, in general, a dynamic system. The main contribution of the paper is to provide an algorithm for determining the properties of this optimal dynamic map including its order and regressors (i.e., the variables it depends on). The proposed algorithm does not require detailed knowledge of the robots’ dynamics, but relies on basic system properties easily obtainable through simple experimental tests. We validate the proposed algorithm experimentally through an example of transfer learning between two different quadrotor platforms. Experimental results show that an optimal dynamic map, with correct properties obtained from our proposed algorithm, achieves 60-70% reduction of transfer learning error compared to the cases when the data is directly transferred or transferred using an optimal static map. 
Multi-Round Distributed Linear-Type Estimator(MDL) The growing size of modern data brings many new challenges to existing statistical inference methodologies and theories, and calls for the development of distributed inferential approaches. This paper studies distributed inference for linear support vector machine (SVM) for the binary classification task. Despite a vast literature on SVM, much less is known about the inferential properties of SVM, especially in a distributed setting. In this paper, we propose a multi-round distributed linear-type (MDL) estimator for conducting inference for linear SVM. The proposed estimator is computationally efficient. In particular, it only requires an initial SVM estimator and then successively refines the estimator by solving simple weighted least squares problems. Theoretically, we establish the Bahadur representation of the estimator. Based on the representation, the asymptotic normality is further derived, which shows that the MDL estimator achieves the optimal statistical efficiency, i.e., the same efficiency as the classical linear SVM applied to the entire dataset in a single machine setup. Moreover, our asymptotic result avoids the condition on the number of machines or data batches, which is commonly assumed in distributed estimation literature, and allows the case of diverging dimension. We provide simulation studies to demonstrate the performance of the proposed MDL estimator. Multi-Scale Affinity With Sparse Convolution(MASC) We propose a new approach for 3D instance segmentation based on sparse convolution and point affinity prediction, which indicates the likelihood of two points belonging to the same instance. The proposed network, built upon submanifold sparse convolution [3], processes a voxelized point cloud and predicts semantic scores for each occupied voxel as well as the affinity between neighboring voxels at different scales. 
A simple yet effective clustering algorithm segments points into instances based on the predicted affinity and the mesh topology. The semantic for each instance is determined by the semantic prediction. Experiments show that our method outperforms the state-of-the-art instance segmentation methods by a large margin on the widely used ScanNet benchmark [2]. We share our code publicly at https://…/MASC. Multiscale Artificial Neural Network(MsANN) Multigrid modeling algorithms are a technique used to accelerate relaxation models running on a hierarchy of similar graphlike structures. We introduce and demonstrate a new method for training neural networks which uses multilevel methods. Using an objective function derived from a graph-distance metric, we perform orthogonally-constrained optimization to find optimal prolongation and restriction maps between graphs. We compare and contrast several methods for performing this numerical optimization, and additionally present some new theoretical results on upper bounds of this type of objective function. Once calculated, these optimal maps between graphs form the core of Multiscale Artificial Neural Network (MsANN) training, a new procedure we present which simultaneously trains a hierarchy of neural network models of varying spatial resolution. Parameter information is passed between members of this hierarchy according to standard coarsening and refinement schedules from the multiscale modelling literature. In our machine learning experiments, these models are able to learn faster than default training, achieving a comparable level of error in an order of magnitude fewer training examples. MultiScale AutoEncoder(MSAE) We propose a MultiScale AutoEncoder (MSAE) based extreme image compression framework to offer visually pleasing reconstruction at a very low bitrate. 
Our method leverages the ‘priors’ at different resolution scales to improve the compression efficiency, and also employs the generative adversarial network (GAN) with multiscale discriminators to perform the end-to-end trainable rate-distortion optimization. We compare the perceptual quality of our reconstructions with traditional compression algorithms using High-Efficiency Video Coding (HEVC) based Intra Profile and JPEG2000 on the public Cityscapes and ADE20K datasets, demonstrating a significant subjective quality improvement. Multi-Scale Convolutional Recurrent Encoder-Decoder(MSCRED) Nowadays, multivariate time series data are increasingly collected in various real world systems, e.g., power plants, wearable devices, etc. Anomaly detection and diagnosis in multivariate time series refer to identifying abnormal status in certain time steps and pinpointing the root causes. Building such a system, however, is challenging since it requires not only capturing the temporal dependency in each time series, but also encoding the inter-correlations between different pairs of time series. In addition, the system should be robust to noise and provide operators with different levels of anomaly scores based upon the severity of different incidents. Despite the fact that a number of unsupervised anomaly detection algorithms have been developed, few of them can jointly address these challenges. In this paper, we propose a Multi-Scale Convolutional Recurrent Encoder-Decoder (MSCRED), to perform anomaly detection and diagnosis in multivariate time series data. Specifically, MSCRED first constructs multi-scale (resolution) signature matrices to characterize multiple levels of the system statuses in different time steps. Subsequently, given the signature matrices, a convolutional encoder is employed to encode the inter-sensor (time series) correlations and an attention based Convolutional Long-Short Term Memory (ConvLSTM) network is developed to capture the temporal patterns. 
Finally, based upon the feature maps which encode the inter-sensor correlations and temporal information, a convolutional decoder is used to reconstruct the input signature matrices and the residual signature matrices are further utilized to detect and diagnose anomalies. Extensive empirical studies based on a synthetic dataset and a real power plant dataset demonstrate that MSCRED can outperform state-of-the-art baseline methods. Multi-Scale DCS Convolutional Neural Network(MS-DCSNet) With joint learning of sampling and recovery, the deep learning-based compressive sensing (DCS) has shown significant improvement in performance and running time reduction. Its reconstructed image, however, loses high-frequency content especially at low subrates. This happens similarly in the multi-scale sampling scheme which also samples more low-frequency components. In this paper, we propose a multi-scale DCS convolutional neural network (MS-DCSNet) in which we convert the image signal using a multiple scale-based wavelet transform, then capture it through convolution block by block across scales. The initial reconstructed image is directly recovered from multi-scale measurements. Multi-scale wavelet convolution is utilized to enhance the final reconstruction quality. The network is able to learn both multi-scale sampling and multi-scale reconstruction, thus resulting in better reconstruction quality. Multi-Scale Deep Compressive Sensing Network ➘ “Multi-Scale DCS Convolutional Neural Network” Multi-Scale Deep Neural Network(MSDNN) Salient object detection is a fundamental problem and has received a great deal of attention in computer vision. Recently, deep learning models have become a powerful tool for image feature extraction. In this paper, we propose a multi-scale deep neural network (MSDNN) for salient object detection. The proposed model first extracts global high-level features and context information over the whole source image with a recurrent convolutional neural network (RCNN). 
Then several stacked deconvolutional layers are adopted to get the multi-scale feature representation and obtain a series of saliency maps. Finally, we investigate a fusion convolution module (FCM) to build a final pixel level saliency map. The proposed model is extensively evaluated on four salient object detection benchmark datasets. Results show that our deep model significantly outperforms 12 other state-of-the-art approaches. Multi-Scale Gradients GAN(MSG-GAN) The Generative Adversarial Network (GAN), which is widely used for image synthesis via generative modelling, suffers peculiarly from training instability. One of the known reasons for this instability is the passage of uninformative gradients from the Discriminator to the Generator due to learning imbalance between them during training. In this work, we propose Multi-Scale Gradients Generative Adversarial Network (MSG-GAN), a simplistic but effective technique for addressing this problem by allowing the flow of gradients from the Discriminator to the Generator at multiple scales. This results in the Generator acquiring the ability to synthesize synchronized images at multiple resolutions simultaneously. We also highlight a suite of techniques that together buttress the stability of training without excessive hyperparameter tuning. Our MSG-GAN technique is a generic mathematical framework which has multiple instantiations. We present an intuitive form of this technique which uses the concatenation operation in the Discriminator computations and empirically validate it through experiments on the CelebA-HQ, CIFAR10 and Oxford102 flowers datasets and by comparing it with some of the current state-of-the-art techniques. Multiscale Network(MS) This paper proposes a dimension reduction process for computing Dijkstra’s shortest path algorithm in a complex network. This is done through a novel multiscale (MS) network decomposition into base-elements: links and landmark-nodes. 
All of them prove essential for keeping all the network connectivity information and for speeding up the exact computation of Dijkstra’s shortest path. The multiscale shortest path (MS-SP) algorithm proves advantageous when dealing with big-size utility networks in comparison with other shortest-path algorithms: traditional approaches are infeasible because of the curse of dimensionality, while others provide only approximate solutions. The novel methodology is of high interest when it is computed on urban utility networks as it exploits several of their inherent properties. However, the proposal extends straightforwardly to other kinds of networks. MS-SP has been successfully applied to 2 water utility networks (medium and big size). In both cases, MS-SP provides the same exact solution as that obtained by applying Dijkstra’s shortest path while showing its efficiency in terms of computational time. Multi-Scale Node Attention(MSNA) We introduce a novel approach to graph-level representation learning, which is to embed an entire graph into a vector space where the embeddings of two graphs preserve their graph-graph proximity. Our approach, UGRAPHEMB, is a general framework that provides a novel means of performing graph-level embedding in a completely unsupervised and inductive manner. The learned neural network can be considered as a function that receives any graph as input, either seen or unseen in the training set, and transforms it into an embedding. A novel graph-level embedding generation mechanism called Multi-Scale Node Attention (MSNA) is proposed. Experiments on five real graph datasets show that UGRAPHEMB achieves competitive accuracy in the tasks of graph classification, similarity ranking, and graph visualization. Multi-Scale Quasi-RNN How to better utilize sequential information has been extensively studied in the setting of recommender systems. 
To this end, architectural inductive biases such as Markov-Chains, Recurrent models, Convolutional networks and many others have demonstrated reasonable success on this task. This paper proposes a new neural architecture, multi-scale Quasi-RNN, for the next-item recommendation (QR-Rec) task. Our model provides the best of both worlds by exploiting multi-scale convolutional features as the compositional gating functions of a recurrent cell. The model is implemented in a multi-scale fashion, i.e., convolutional filters of various widths are implemented to capture different union-level features of input sequences which influence the compositional encoder. The key idea aims to capture the recurrent relations between different kinds of local features, which has never been studied previously in the context of recommendation. Through extensive experiments, we demonstrate that our model achieves state-of-the-art performance on 15 well-established datasets, outperforming strong competitors such as FPMC, Fossil and Caser absolutely by 0.57%-7.16% and relatively by 1.44%-17.65% in terms of MAP, Recall@10 and NDCG@10. Multiscale Shortest Path(MS-SP) ➘ “Multiscale Network” Multi-Scale, Deep Inception Convolutional Neural Network(MDCN) Object detection in challenging situations such as scale variation, occlusion, and truncation depends not only on feature details but also on contextual information. Most previous networks place too much emphasis on detailed feature extraction through deeper and wider networks, which may enhance the accuracy of object detection to a certain extent. However, the feature details are easily changed or washed out after passing through complicated filtering structures. To better handle these challenges, the paper proposes a novel framework, multi-scale, deep inception convolutional neural network (MDCN), which focuses on wider and broader object regions by activating feature maps produced in the deep part of the network. 
Instead of incepting inner layers in the shallow part of the network, multi-scale inceptions are introduced in the deep layers. The proposed framework integrates the contextual information into the learning process through a single-shot network structure. It is computationally efficient and avoids the hard training problem of previous macro feature extraction networks designed for shallow layers. Extensive experiments demonstrate the effectiveness and superior performance of MDCN over the state-of-the-art models. Multiset Dimension We introduce a variation of the metric dimension, called the multiset dimension. The representation multiset of a vertex $v$ with respect to $W$ (which is a subset of the vertex set of a graph $G$), $r_m (v|W)$, is defined as a multiset of distances between $v$ and the vertices in $W$ together with their multiplicities. If $r_m (u |W) \neq r_m(v|W)$ for every pair of distinct vertices $u$ and $v$, then $W$ is called a resolving set of $G$. If $G$ has a resolving set, then the cardinality of a smallest resolving set is called the multiset dimension of $G$, denoted by $md(G)$. If $G$ does not contain a resolving set, we write $md(G) = \infty$. We present basic results on the multiset dimension. We also study graphs of given diameter and give some sufficient conditions for a graph to have an infinite multiset dimension. Multi-sourcE onLine TrAnsfer learning for Non-statIonary Environments(Melanie) In data stream mining, predictive models typically suffer drops in predictive performance due to concept drift. As enough data representing the new concept must be collected for the new concept to be well learnt, the predictive performance of existing models usually takes some time to recover from concept drift. To speed up recovery from concept drift and improve predictive performance in data stream mining, this work proposes a novel approach called Multi-sourcE onLine TrAnsfer learning for Non-statIonary Environments (Melanie). 
Melanie is the first approach able to transfer knowledge between multiple data streaming sources in non-stationary environments. It creates several sub-classifiers to learn different aspects from different source and target concepts over time. The sub-classifiers that match the current target concept well are identified, and used to compose an ensemble for predicting examples from the target concept. We evaluate Melanie on several synthetic data streams containing different types of concept drift and on real world data streams. The results indicate that Melanie can deal with a variety of drifts and improve predictive performance over existing data stream learning algorithms by making use of multiple sources. Multi-Source Pointer Network In this paper, we study the product title summarization problem in E-commerce applications for display on mobile devices. Compared with conventional sentence summarization, product title summarization has some extra and essential constraints. For example, factual errors or loss of the key information are intolerable for E-commerce applications. Therefore, we abstract two more constraints for product title summarization: (i) do not introduce irrelevant information; (ii) retain the key information (e.g., brand name and commodity name). To address these issues, we propose a novel multi-source pointer network by adding a new knowledge encoder to the pointer network. The first constraint is handled by the pointer mechanism. For the second constraint, we restore the key information by copying words from the knowledge encoder with the help of the soft gating mechanism. For evaluation, we build a large collection of real-world product titles along with human-written short titles. Experimental results demonstrate that our model significantly outperforms the other baselines. Finally, online deployment of our proposed model has yielded a significant business impact, as measured by the click-through rate. 
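The soft gating idea in the Multi-Source Pointer Network entry can be illustrated with a small sketch. It is not the authors' implementation: the function name `gated_copy_distribution`, the raw attention scores, and the scalar `gate_logit` are all hypothetical stand-ins for quantities a trained network would produce. The sketch only shows how a gate can blend two copy distributions, one over the source title and one over the knowledge tokens, so that every output word is copied from one of the two inputs.

```python
import math


def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_copy_distribution(title_tokens, title_scores,
                            knowledge_tokens, knowledge_scores,
                            gate_logit):
    """Blend two pointer (copy) distributions with a soft gate.

    All arguments are illustrative assumptions: the scores and the gate
    logit would normally come from the decoder state of a trained model.
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))  # weight on the title source
    p_title = softmax(title_scores)
    p_knowledge = softmax(knowledge_scores)

    mixed = {}
    # Probability mass can only land on tokens present in one of the inputs,
    # so nothing outside the title or the knowledge tokens can be generated.
    for token, p in zip(title_tokens, p_title):
        mixed[token] = mixed.get(token, 0.0) + gate * p
    for token, p in zip(knowledge_tokens, p_knowledge):
        mixed[token] = mixed.get(token, 0.0) + (1.0 - gate) * p
    return mixed


if __name__ == "__main__":
    dist = gated_copy_distribution(
        ["acme", "red", "shoes", "sale"], [0.0, 0.0, 0.0, 0.0],
        ["acme", "shoes"], [0.0, 0.0],
        gate_logit=0.0,
    )
    for token, p in sorted(dist.items(), key=lambda kv: -kv[1]):
        print(f"{token}: {p:.3f}")
```

With uniform scores and a gate of 0.5, the brand and commodity words that appear in both inputs receive more mass than the remaining title words, which is the intended effect of copying from the knowledge source.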
Multi-Stage Self-Supervised Training(M3S) Graph Convolutional Networks(GCNs) play a crucial role in graph learning tasks, however, learning graph embedding with few supervised signals is still a difficult problem. In this paper, we propose a novel training algorithm for Graph Convolutional Network, called Multi-Stage Self-Supervised(M3S) Training Algorithm, combined with self-supervised learning approach, focusing on improving the generalization performance of GCNs on graphs with few labeled nodes. Firstly, a Multi-Stage Training Framework is provided as the basis of M3S training method. Then we leverage DeepCluster technique, a popular form of self-supervised learning, and design corresponding aligning mechanism on the embedding space to refine the Multi-Stage Training Framework, resulting in M3S Training Algorithm. Finally, extensive experimental results verify the superior performance of our algorithm on graphs with few labeled nodes under different label rates compared with other state-of-the-art approaches. Multi-Stage Temporal Convolutional Network(MS-TCN) Temporally locating and classifying action segments in long untrimmed videos is of particular interest to many applications like surveillance and robotics. While traditional approaches follow a two-step pipeline, by generating frame-wise probabilities and then feeding them to high-level temporal models, recent approaches use temporal convolutions to directly classify the video frames. In this paper, we introduce a multi-stage architecture for the temporal action segmentation task. Each stage features a set of dilated temporal convolutions to generate an initial prediction that is refined by the next one. This architecture is trained using a combination of a classification loss and a proposed smoothing loss that penalizes over-segmentation errors. Extensive evaluation shows the effectiveness of the proposed model in capturing long-range dependencies and recognizing action segments. 
Our model achieves state-of-the-art results on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset. Multi-State Adaptive Dynamic Principal Component Analysis mvMonitoring Multi-State Model Multi-state models are used to model a trajectory through multiple states. Survival models are a special case in which there are two states, alive and dead. Multi-state models are therefore useful in clinical settings because they can be used to predict or simulate disease progression in detail. Putter et al. provide a helpful tutorial. Multi-State Markov Model Multi-Target Embodied Question Answering(MT-EQA) Embodied Question Answering (EQA) is a relatively new task where an agent is asked to answer questions about its environment from egocentric perception. EQA makes the fundamental assumption that every question, e.g., ‘what color is the car?’, has exactly one target (‘car’) being inquired about. This assumption puts a direct limitation on the abilities of the agent. We present a generalization of EQA – Multi-Target Embodied Question Answering (MT-EQA). Specifically, we study questions that have multiple targets in them, such as ‘Is the dresser in the bedroom bigger than the oven in the kitchen?’, where the agent has to navigate to multiple locations (‘dresser in bedroom’, ‘oven in kitchen’) and perform comparative reasoning (‘dresser’ bigger than ‘oven’) before it can answer a question. Such questions require the development of entirely new modules or components in the agent. To address this, we propose a modular architecture composed of a program generator, a controller, a navigator, and a VQA module. The program generator converts the given question into sequential executable sub-programs; the navigator guides the agent to multiple locations pertinent to the navigation-related sub-programs; and the controller learns to select relevant observations along its path. 
These observations are then fed to the VQA module to predict the answer. We perform detailed analysis for each of the model components and show that our joint model can outperform previous methods and strong baselines by a significant margin. Multi-Target Filtering and Tracking(MTFT) Defining a multi-target motion model, which is an important step of tracking algorithms, can be very challenging. Using fixed models (as in several generative Bayesian algorithms, such as Kalman filters) can fail to accurately predict sophisticated target motions. On the other hand, sequential learning of the motion model (for example, using recurrent neural networks) can be computationally complex and difficult due to the variable unknown number of targets. In this paper, we propose a multi-target filtering and tracking (MTFT) algorithm which learns the motion model, simultaneously for all targets, from an implicitly represented state map and performs spatio-temporal data prediction. To this end, the multi-target state is modelled over a continuous hypothetical target space, using random finite sets and Gaussian mixture probability hypothesis density formulations. The prediction step is recursively performed using a deep convolutional recurrent neural network with a long short-term memory architecture, which is trained as a regression block, on the fly, over ‘probability density difference’ maps. Our approach is evaluated over widely used pedestrian tracking benchmarks, remarkably outperforming state-of-the-art multi-target filtering algorithms, while giving competitive results when compared with other tracking approaches. Multi-Task Attention Network(MTAN) In this paper, we propose a novel multi-task learning architecture, which incorporates recent advances in attention mechanisms. 
Our approach, the Multi-Task Attention Network (MTAN), consists of a single shared network containing a global feature pool, together with task-specific soft-attention modules, which are trainable in an end-to-end manner. These attention modules allow for learning of task-specific features from the global pool, whilst simultaneously allowing for features to be shared across different tasks. The architecture can be built upon any feed-forward neural network, is simple to implement, and is parameter efficient. Experiments on the CityScapes dataset show that our method outperforms several baselines in both single-task and multi-task learning, and is also more robust to the various weighting schemes in the multi-task loss function. We further explore the effectiveness of our method through experiments over a range of task complexities, and show how our method scales well with task complexity compared to baselines. Multi-task Determinantal Point Process(multi-task DPP) Determinantal point processes (DPPs) have received significant attention in recent years as an elegant model for a variety of machine learning tasks, due to their ability to elegantly model set diversity and item quality or popularity. Recent work has shown that DPPs can be effective models for product recommendation and basket completion tasks. We present an enhanced DPP model that is specialized for the task of basket completion, the multi-task DPP. We view the basket completion problem as a multi-class classification problem, and leverage ideas from tensor factorization and multi-class classification to design the multi-task DPP model. We evaluate our model on several real-world datasets, and find that the multi-task DPP provides significantly better predictive quality than a number of state-of-the-art models. Multi-Task Graph Autoencoder We examine two fundamental tasks associated with graph representation learning: link prediction and node classification. 
We present a new autoencoder architecture capable of learning a joint representation of local graph structure and available node features for the simultaneous multi-task learning of unsupervised link prediction and semi-supervised node classification. Our simple, yet effective and versatile model is efficiently trained end-to-end in a single stage, whereas previous related deep graph embedding methods require multiple training steps that are difficult to optimize. We provide an empirical evaluation of our model on five benchmark relational, graph-structured datasets and demonstrate significant improvement over three strong baselines for graph representation learning. Reference code and data are available at https://…/graph-representation-learning Multi-task Knowledge Distillation Model(MKDM) Deep pre-training and fine-tuning models (like BERT, OpenAI GPT) have demonstrated excellent results in question answering areas. However, due to the sheer amount of model parameters, the inference speed of these models is very slow. How to apply these complex models to real business scenarios becomes a challenging but practical problem. Previous works often leverage model compression approaches to resolve this problem. However, these methods usually induce information loss during the model compression procedure, leading to incomparable results between compressed model and the original model. To tackle this challenge, we propose a Multi-task Knowledge Distillation Model (MKDM for short) for web-scale Question Answering system, by distilling knowledge from multiple teacher models to a light-weight student model. In this way, more generalized knowledge can be transferred. The experiment results show that our method can significantly outperform the baseline methods and even achieve comparable results with the original teacher models, along with significant speedup of model inference. 
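As a rough illustration of distilling several teachers into one student, as in the MKDM entry, the sketch below averages a soft-label cross-entropy over the teachers. This is a generic knowledge distillation objective, not necessarily the loss used in MKDM. The function name `multi_teacher_distillation_loss`, the temperature value, and the equal weighting of teachers are assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperatures give softer targets."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def multi_teacher_distillation_loss(student_logits, teacher_logits_list,
                                    temperature=2.0):
    """Average cross-entropy between each teacher's soft labels and the student.

    Assumption for illustration: every teacher gets equal weight.
    """
    student_probs = softmax(student_logits, temperature)
    loss = 0.0
    for teacher_logits in teacher_logits_list:
        teacher_probs = softmax(teacher_logits, temperature)
        loss += -sum(t * math.log(s)
                     for t, s in zip(teacher_probs, student_probs))
    return loss / len(teacher_logits_list)


if __name__ == "__main__":
    teachers = [[2.0, 0.0], [1.0, 0.0]]  # both teachers prefer class 0
    print(multi_teacher_distillation_loss([1.5, 0.0], teachers))   # student agrees
    print(multi_teacher_distillation_loss([-1.5, 0.0], teachers))  # student disagrees
```

A student whose logits agree with the teachers gets a lower loss than one that prefers the other class, so minimizing this quantity pulls the lightweight student toward the teachers' combined soft predictions.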
Multi-Task Learning(MTL) One of the key challenges in predictive maintenance is to predict the impending downtime of a piece of equipment with a reasonable prediction horizon so that countermeasures can be put in place. Classically, this problem has been posed in two different ways which are typically solved independently: (1) Remaining useful life (RUL) estimation as a long-term prediction task to estimate how much time is left in the useful life of the equipment and (2) Failure prediction (FP) as a short-term prediction task to assess the probability of a failure within a pre-specified time window. As these two tasks are related, performing them separately is sub-optimal and might result in inconsistent predictions for the same equipment. In order to alleviate these issues, we propose two methods: Deep Weibull model (DW-RNN) and multi-task learning (MTL-RNN). DW-RNN is able to learn the underlying failure dynamics by fitting Weibull distribution parameters using a deep neural network, learned with a survival likelihood, without training directly on each task. While DW-RNN makes an explicit assumption on the data distribution, MTL-RNN exploits the implicit relationship between the long-term RUL and short-term FP tasks to learn the underlying distribution. Additionally, both our methods can leverage the non-failed equipment data for RUL estimation. We demonstrate that our methods consistently outperform baseline RUL methods that can be used for FP while producing consistent results for RUL and FP. We also show that our methods perform on par with baselines trained on the objectives optimized for either of the two tasks. Multitask Learning Deep Neural Network(MTLDNN) It is an enduring question how to combine revealed preference (RP) and stated preference (SP) data to analyze travel behavior. 
This study presents a new approach of using multitask learning deep neural network (MTLDNN) to combine RP and SP data and incorporate the traditional nested logit approach as a special case. Based on a combined RP and SP survey in Singapore to examine the demand for autonomous vehicles (AV), we designed, estimated and compared one hundred MTLDNN architectures with three major findings. First, the traditional nested logit approach of combining RP and SP can be regarded as a special case of MTLDNN and is only one of a large number of possible MTLDNN architectures, and the nested logit approach imposes the proportional parameter constraint under the MTLDNN framework. Second, out of the 100 MTLDNN models tested, the best one has one shared layer and five domain-specific layers with weak regularization, but the nested logit approach with proportional parameter constraint rivals the best model. Third, the proportional parameter constraint works well in the nested logit model, but is too restrictive for deeper architectures. Overall, this study introduces the MTLDNN model to combine RP and SP data, relates the nested logit approach to the hyperparameter space of MTLDNN, and explores hyperparameter training and architecture design for the joint demand analysis. Multitask Learning Encoder(MTLE) Learning visual feature representations for video analysis is a daunting task that requires a large amount of training samples and a proper generalization framework. Many of the current state-of-the-art methods for video captioning and movie description rely on simple encoding mechanisms through recurrent neural networks to encode temporal visual information extracted from video data. In this paper, we introduce a novel multitask encoder-decoder framework for automatic semantic description and captioning of video sequences. In contrast to current approaches, our method relies on distinct decoders that train a visual encoder in a multitask fashion. 
Our system does not depend solely on multiple labels and allows for a lack of training data working even with datasets where only one single annotation is viable per video. Our method shows improved performance over current state of the art methods in several metrics on multi-caption and single-caption datasets. To the best of our knowledge, our method is the first method to use a multitask approach for encoding video features. Our method demonstrates its robustness on the Large Scale Movie Description Challenge (LSMDC) 2017 where our method won the movie description task and its results were ranked among other competitors as the most helpful for the visually impaired. Multi-Task Learning Extreme Learning Machine(MTL-ELM) In multi-task learning (MTL), related tasks learn jointly to improve generalization performance. To exploit the high learning speed of extreme learning machines (ELMs), we apply the ELM framework to the MTL problem, where the output weights of ELMs for all the tasks are learned collaboratively. We first present the ELM based MTL problem in the centralized setting, which is solved by the proposed MTL-ELM (multi-task learning extreme learning machine) algorithm. Due to the fact that many data sets of different tasks are geo-distributed, decentralized machine learning is studied. We formulate the decentralized MTL problem based on ELM as majorized multi-block optimization with coupled bi-convex objective functions. To solve the problem, we propose the DMTL-ELM algorithm, which is a hybrid Jacobian and Gauss-Seidel Proximal multi-block alternating direction method of multipliers (ADMM). Further, to reduce the computation load of DMTL-ELM, DMTL-ELM with first-order approximation (FO-DMTL-ELM) is presented. Theoretical analysis shows that the convergence to the stationary point of DMTL-ELM and FO-DMTL-ELM can be guaranteed conditionally. 
Through simulations, we demonstrate the convergence of the proposed MTL-ELM, DMTL-ELM, and FO-DMTL-ELM algorithms, and also show that they can outperform existing MTL methods. Moreover, by adjusting the dimension of hidden feature space, there exists a trade-off between communication load and learning accuracy for DMTL-ELM. Multi-Task Multiple Kernel Relationship Learning(MK-MTRL) This paper presents a novel multitask multiple-kernel learning framework that efficiently learns the kernel weights leveraging the relationship across multiple tasks. The idea is to automatically infer this task relationship in the RKHS space corresponding to the given base kernels. The problem is formulated as a regularization-based approach called Multi-Task Multiple Kernel Relationship Learning (MK-MTRL), which models the task relationship matrix from the weights learned from latent feature spaces of task-specific base kernels. Unlike in previous work, the proposed formulation allows one to incorporate prior knowledge for simultaneously learning several related tasks. We propose an alternating minimization algorithm to learn the model parameters, kernel weights and task relationship matrix. In order to tackle large-scale problems, we further propose a two-stage MK-MTRL online learning algorithm and show that it significantly reduces the computational time, and also achieves performance comparable to that of the joint learning framework. Experimental results on benchmark datasets show that the proposed formulations outperform several state-of-the-art multi-task learning methods. Multitask Soft Option Learning(MSOL) We present Multitask Soft Option Learning (MSOL), a hierarchical multitask framework based on Planning as Inference. MSOL extends the concept of options, using separate variational posteriors for each task, regularized by a shared prior. 
This allows fine-tuning of options for new tasks without forgetting their learned policies, leading to faster training without reducing the expressiveness of the hierarchical policy. Additionally, MSOL avoids several instabilities during training in a multitask setting and provides a natural way to not only learn intra-option policies, but also their terminations. We demonstrate empirically that MSOL significantly outperforms both hierarchical and flat transfer-learning baselines in challenging multi-task environments. Multi-Task Triple-Stream Network(MTTSNet) Our goal in this work is to train an image captioning model that generates more dense and informative captions. We introduce ‘relational captioning,’ a novel image captioning task which aims to generate multiple captions with respect to relational information between objects in an image. Relational captioning is a framework that is advantageous in both diversity and amount of information, leading to image understanding based on relationships. Part-of-speech (POS, i.e. subject-object-predicate categories) tags can be assigned to every English word. We leverage the POS as a prior to guide the correct sequence of words in a caption. To this end, we propose a multi-task triple-stream network (MTTSNet) which consists of three recurrent units for the respective POS and jointly performs POS prediction and captioning. We demonstrate more diverse and richer representations generated by the proposed model against several baselines and competing methods. Multi-Task WaveNet This paper introduces an improved generative model for statistical parametric speech synthesis (SPSS) based on WaveNet under a multi-task learning framework. Different from the original WaveNet model, the proposed Multi-task WaveNet employs the frame-level acoustic feature prediction as the secondary task and the external fundamental frequency prediction model for the original WaveNet can be removed. 
Therefore the improved WaveNet can generate high-quality speech waveforms only conditioned on linguistic features. Multi-task WaveNet can produce more natural and expressive speech by addressing the pitch prediction error accumulation issue and possesses more succinct inference procedures than the original WaveNet. Experimental results prove that the SPSS method proposed in this paper can achieve better performance than the state-of-the-art approach utilizing the original WaveNet in both objective and subjective preference tests. Multi-Tasking Evolutionary Algorithm(MTEA) Multi-task learning uses auxiliary data or knowledge from relevant tasks to facilitate the learning in a new task. Multi-task optimization applies multi-task learning to optimization to study how to effectively and efficiently tackle multiple optimization problems simultaneously. Evolutionary multi-tasking, or multi-factorial optimization, is an emerging subfield of multi-task optimization, which integrates evolutionary computation and multi-task learning. This paper proposes a novel easy-to-implement multi-tasking evolutionary algorithm (MTEA), which copes well with significantly different optimization tasks by estimating and using the bias among them. Comparative studies with eight state-of-the-art single- and multi-task approaches in the literature on nine benchmarks demonstrated that on average the MTEA outperformed all of them, and has lower computational cost than six of them. Particularly, unlike other multi-task algorithms, the performance of the MTEA is consistently good whether the tasks are similar or significantly different, making it ideal for real-world applications. Multi-Temporal-Range Mixture Model(M3) Understanding temporal dynamics has proved to be highly valuable for accurate recommendation. Sequential recommenders have been successful in modeling the dynamics of users and items over time. 
However, while different model architectures excel at capturing various temporal ranges or dynamics, distinct application contexts require adapting to diverse behaviors. In this paper we examine how to build a model that can make use of different temporal ranges and dynamics depending on the request context. We begin with the analysis of an anonymized YouTube dataset comprising millions of user sequences. We quantify the degree of long-range dependence in these sequences and demonstrate that both short-term and long-term dependent behavioral patterns co-exist. We then propose a neural Multi-temporal-range Mixture Model (M3) as a tailored solution to deal with both short-term and long-term dependencies. Our approach employs a mixture of models, each with a different temporal range. These models are combined by a learned gating mechanism capable of exerting different model combinations given different contextual information. In empirical evaluations on a public dataset and our own anonymized YouTube dataset, M3 consistently outperforms state-of-the-art sequential recommendation methods. Multi-Turn cue-Words Driven Conversation System With Reinforcement Learning(RLCw) To build an open-domain multi-turn conversation system is one of the most interesting and challenging tasks in Artificial Intelligence. Many research efforts have been dedicated to building such dialogue systems, yet few shed light on modeling the conversation flow in an ongoing dialogue. Besides, it is common for people to talk about highly relevant aspects during a conversation. The topics are coherent and drift naturally, which demonstrates the necessity of dialogue flow modeling. To this end, we present the multi-turn cue-words driven conversation system with reinforcement learning method (RLCw), which strives to select an adaptive cue word with the greatest future credit, and therefore improve the quality of generated responses. 
We introduce a new reward to measure the quality of cue words in terms of effectiveness and relevance. To further optimize the model for long-term conversations, a reinforcement approach is adopted in this paper. Experiments on a real-life dataset demonstrate that our model consistently outperforms a set of competitive baselines in terms of simulated turns, diversity and human evaluation. Multi-vAlue Rule Set(MRS) We propose a Multi-vAlue Rule Set (MRS) model for predicting in-hospital patient mortality. Compared to rule sets built from single-valued rules, MRS adopts a more generalized form of association rules that allows multiple values in a condition. Rules of this form are more concise than classical single-valued rules in capturing and describing patterns in data. Our formulation also pursues a higher efficiency of feature utilization, which reduces possible cost in data collection and storage. We propose a Bayesian framework for formulating an MRS model and propose an efficient inference method for learning a maximum a posteriori model, incorporating theoretically grounded bounds to iteratively reduce the search space and improve the search efficiency. Experiments show that our model was able to achieve better performance than baseline methods, including the current system used by the hospital. Multivariate Adaptive Regression Splines(MARS) Deep neural networks (DNNs) generate much richer function spaces than shallow networks. Since the function spaces induced by shallow networks have several approximation theoretic drawbacks, this does not, however, necessarily explain the success of deep networks. In this article we take another route by comparing the expressive power of DNNs with ReLU activation function to piecewise linear spline methods. 
We show that MARS (multivariate adaptive regression splines) is improper learnable by DNNs in the sense that for any given function that can be expressed as a function in MARS with $M$ parameters there exists a multilayer neural network with $O(M \log (M/\varepsilon))$ parameters that approximates this function up to sup-norm error $\varepsilon.$ We show a similar result for expansions with respect to the Faber-Schauder system. Based on this, we derive risk comparison inequalities that bound the statistical risk of fitting a neural network by the statistical risk of spline-based methods. This shows that deep networks perform better or only slightly worse than the considered spline methods. We provide a constructive proof for the function approximations. earth Multivariate Anomaly Detection Generative Adversarial Network(MAD-GAN) The prevalence of networked sensors and actuators in many real-world systems such as smart buildings, factories, power plants, and data centers generates substantial amounts of multivariate time series data for these systems. The rich sensor data can be continuously monitored for intrusion events through anomaly detection. However, conventional threshold-based anomaly detection methods are inadequate due to the dynamic complexities of these systems, while supervised machine learning methods are unable to exploit the large amounts of data due to the lack of labeled data. On the other hand, current unsupervised machine learning approaches have not fully exploited the spatial-temporal correlation and other dependencies amongst the multiple variables (sensors/actuators) in the system for detecting anomalies. In this work, we propose an unsupervised multivariate anomaly detection method based on Generative Adversarial Networks (GANs). Instead of treating each data stream independently, our proposed MAD-GAN framework considers the entire variable set concurrently to capture the latent interactions amongst the variables. 
We also fully exploit both the generator and discriminator produced by the GAN, using a novel anomaly score called DR-score to detect anomalies by discrimination and reconstruction. We have tested our proposed MAD-GAN using two recent datasets collected from real-world CPS: the Secure Water Treatment (SWaT) and the Water Distribution (WADI) datasets. Our experimental results showed that the proposed MAD-GAN is effective in reporting anomalies caused by various cyber-intrusions in these complex real-world systems. Multivariate Bayesian Model with Shrinkage Priors(MBSP) The method is described in Bai and Ghosh (2018). MBSP Multivariate Bernoulli Autoregressive Process(BAR) Multivariate Bernoulli autoregressive (BAR) processes model time series of events in which the likelihood of current events is determined by the times and locations of past events. These processes can be used to model nonlinear dynamical systems corresponding to criminal activity, responses of patients to different medical treatment plans, opinion dynamics across social networks, epidemic spread, and more. Past work examines this problem under the assumption that the event data is complete, but in many cases only a fraction of events are observed. Incomplete observations pose a significant challenge in this setting because the unobserved events still govern the underlying dynamical system. In this work, we develop a novel approach to estimating the parameters of a BAR process in the presence of unobserved events via an unbiased estimator of the complete data log-likelihood function. We propose a computationally efficient estimation algorithm which approximates this estimator via Taylor series truncation and establish theoretical results for both the statistical error and optimization error of our algorithm. We further justify our approach by testing our method on both simulated data and a real data set consisting of crimes recorded by the city of Chicago. 
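To make the incomplete-observation setting of the BAR entry concrete, here is a minimal simulation sketch. It assumes a first-order process with a logistic link, $P(X_{t+1,i} = 1) = \sigma(\nu_i + \sum_j A_{ij} X_{t,j})$, and assumes each true event is observed independently with probability `observe_prob`. The entry does not fix these details, so the link function, the function name `simulate_bar`, and all parameter names are illustrative choices rather than the paper's specification.

```python
import math
import random


def simulate_bar(A, nu, steps, observe_prob, seed=0):
    """Simulate a Bernoulli autoregressive process with missing events.

    A[i][j] is the (assumed) influence of node j's last event on node i,
    nu[i] is a baseline log-odds, and observe_prob is the chance that a
    true event is actually recorded.
    """
    rng = random.Random(seed)
    n = len(nu)
    x = [0] * n                     # state at time 0: no events
    events, observed = [], []
    for _ in range(steps):
        nxt = []
        for i in range(n):
            logit = nu[i] + sum(A[i][j] * x[j] for j in range(n))
            p = 1.0 / (1.0 + math.exp(-logit))
            nxt.append(1 if rng.random() < p else 0)
        x = nxt                     # the TRUE events drive the dynamics
        events.append(x)
        # Each true event is recorded with probability observe_prob;
        # non-events are never turned into events.
        observed.append([xi if rng.random() < observe_prob else 0 for xi in x])
    return events, observed


if __name__ == "__main__":
    A = [[1.5, -0.5], [0.8, 0.0]]
    nu = [-1.0, -1.0]
    events, observed = simulate_bar(A, nu, steps=1000, observe_prob=0.6)
    print("true events:    ", sum(map(sum, events)))
    print("observed events:", sum(map(sum, observed)))
```

Because the next state depends on the true events rather than the recorded ones, fitting the model directly to `observed` would ignore part of what drives the dynamics, which is the difficulty the entry describes.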
Multivariate Count Autoregression We study the problems of modeling and inference for multivariate count time series data with Poisson marginals. The focus is on linear and log-linear models. For studying the properties of such processes we develop a novel conceptual framework which is based on copulas. However, our approach does not impose the copula on a vector of counts; instead the joint distribution is determined by imposing a copula function on a vector of associated continuous random variables. This specific construction avoids conceptual difficulties resulting from the joint distribution of discrete random variables yet it keeps the properties of the Poisson process marginally. We employ Markov chain theory and the notion of weak dependence to study ergodicity and stationarity of the models we consider. We obtain easily verifiable conditions for both linear and log-linear models under both theoretical frameworks. Suitable estimating equations are suggested for estimating unknown model parameters. The large sample properties of the resulting estimators are studied in detail. The work concludes with some simulations and a real data example. Multivariate D-Vine Time Series Model(mDvine) This paper proposes a novel semiparametric multivariate D-vine time series model (mDvine) that enables the simultaneous copula-based modeling of both temporal and cross-sectional dependence for multivariate time series. To construct the mDvine, we first build a semiparametric univariate D-vine time series model (uDvine) based on a D-vine. The uDvine generalizes the existing first-order copula-based Markov chain models to Markov chains of arbitrary order. Building upon the uDvine, we then construct the mDvine by joining multiple uDvines via another parametric copula. As a simple and tractable model, the mDvine provides flexible models for marginal behavior of time series and can also generate sophisticated temporal and cross-sectional dependence structures. 
Probabilistic properties of both the uDvine and mDvine are studied in detail. Furthermore, robust and computationally efficient procedures, including a sequential model selection method and a two-stage MLE, are proposed for model estimation and inference, and their statistical properties are investigated. Numerical experiments are conducted to demonstrate the flexibility of the mDvine, and to examine the performance of the sequential model selection procedure and the two-stage MLE. Real data applications on the Australian electricity price and the Ireland wind speed data demonstrate the superior performance of the mDvine to traditional multivariate time series models. Multivariate Exponentially Weighted Moving Average(MEWMA) The Steady-State Behavior of Multivariate Exponentially Weighted Moving Average Control Charts Multivariate Imputation by Chained Equations(MICE) Multivariate imputation by chained equations (MICE) is a particular multiple imputation technique (Raghunathan et al., 2001; Van Buuren, 2007). MICE operates under the assumption that given the variables used in the imputation procedure, the missing data are Missing At Random (MAR), which means that the probability that a value is missing depends only on observed values and not on unobserved values (Schafer & Graham, 2002). In other words, after controlling for all of the available data (i.e., the variables included in the imputation model) “any remaining missingness is completely random” (Graham, 2009). Implementing MICE when data are not MAR could result in biased estimates. In the remainder of this paper, we assume that the MICE procedures are used with data that are MAR. 
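The chained-equations idea can be sketched in a few lines. This is a simplified illustration, not the algorithm of the `mice` package: it assumes continuous variables, uses ordinary least squares for every conditional model, and adds residual noise to the predictions instead of also drawing the regression parameters from their posterior. The function name `mice_once` and the toy data are hypothetical.

```python
import numpy as np

def mice_once(X, n_iter=10, rng=None):
    """Return one completed copy of X (NaN = missing) via chained linear regressions."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # Initialise every missing cell with the observed mean of its column.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.nonzero(missing)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            # Regress column j on all other columns, using rows where j was observed.
            others = np.delete(X, j, axis=1)
            design = np.column_stack([np.ones(len(X)), others])
            obs = ~miss_j
            beta, *_ = np.linalg.lstsq(design[obs], X[obs, j], rcond=None)
            sigma = (X[obs, j] - design[obs] @ beta).std()
            # Replace the missing cells with predictions plus residual noise.
            noise = rng.normal(0.0, sigma, miss_j.sum())
            X[miss_j, j] = design[miss_j] @ beta + noise
    return X

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 3))
data[:, 2] = data[:, 0] + 0.5 * data[:, 1] + 0.1 * rng.normal(size=200)
holes = data.copy()
holes[rng.random(data.shape) < 0.1] = np.nan   # missing completely at random

# Multiple imputation: several completed datasets from different random draws.
imputations = [mice_once(holes, rng=np.random.default_rng(s)) for s in range(5)]
```

An analysis would then be run on each completed dataset and the results pooled; the spread across the imputations reflects the uncertainty due to the missing values.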
mice Multivariate Locally Stationary Wavelet Analysis(mvLSW) mvLSW Multivariate Ordinal Regression Model mvord Multivariate Process Capability Indices(MPCI) MPCI Multivariate Range Boxes dynRB Multivariate Response Regression Models Multivariate Statistics Multivariate statistics is a form of statistics encompassing the simultaneous observation and analysis of more than one outcome variable. The application of multivariate statistics is multivariate analysis. Multivariate statistics concerns understanding the different aims and background of each of the different forms of multivariate analysis, and how they relate to each other. The practical implementation of multivariate statistics to a particular problem may involve several types of univariate and multivariate analysis in order to understand the relationships between variables and their relevance to the actual problem being studied. In addition, multivariate statistics is concerned with multivariate probability distributions, in terms of both: 1. how these can be used to represent the distributions of observed data; 2. how they can be used as part of statistical inference, particularly where several different quantities are of interest to the same analysis. Certain types of problem involving multivariate data, for example simple linear regression and multiple regression, are NOT usually considered as special cases of multivariate statistics because the analysis is dealt with by considering the (univariate) conditional distribution of a single outcome variable given the other variables. Multivariate Subjective Fiducial Inference The aim of this paper is to firmly establish subjective fiducial inference as a rival to the more conventional schools of statistical inference, and to show that Fisher’s intuition concerning the importance of the fiducial argument was correct. 
In particular, methodology outlined in an earlier paper will be modified, enhanced and extended to deal with general inferential problems in which various parameters are unknown. Although the resulting theory is classified as being ‘subjective’, it is shown that this is simply due to the argument that all probability statements made about fixed but unknown parameters must be inherently subjective, rather than due to a need to emphasize how different the fiducial probabilities that can be derived using this theory are from objective probabilities. Some important examples of the application of this theory are presented. Multi-View Clustering(MVC) Multi-view Clustering: A Survey One-Pass Incomplete Multi-view Clustering Multi-view Intact Space Learning(MISL) It is practical to assume that an individual view is unlikely to be sufficient for effective multi-view learning. Therefore, integration of multi-view information is both valuable and necessary. In this paper, we propose the Multi-view Intact Space Learning (MISL) algorithm, which integrates the encoded complementary information in multiple views to discover a latent intact representation of the data. Even though each view on its own is insufficient, we show theoretically that by combining multiple views we can obtain abundant information for latent intact space learning. Employing the Cauchy loss (a technique used in statistical learning) as the error measurement strengthens robustness to outliers. We propose a new definition of multi-view stability and then derive the generalization error bound based on multi-view stability and Rademacher complexity, and show that the complementarity between multiple views is beneficial for the stability and generalization. MISL is efficiently optimized using a novel Iteratively Reweight Residuals (IRR) technique, whose convergence is theoretically analyzed. 
Experiments on synthetic data and real-world datasets demonstrate that MISL is an effective and promising algorithm for practical applications. Multiview Learning Multi-view learning is an emerging direction in machine learning which considers learning with multiple views to improve the generalization performance. Multi-view learning is also known as data fusion or data integration from multiple feature sets. Multi-View Locality Low-Rank Embedding for Dimension Reduction(MvL2E) During the last decades, we have witnessed a surge of interest in learning a low-dimensional space with discriminative information from one single view. Even though most of these methods can achieve satisfactory performance in certain situations, they fail to fully consider the information from multiple views which are highly relevant but sometimes look different from each other. Besides, correlations between features from multiple views always vary greatly, which challenges multi-view subspace learning. Therefore, how to learn an appropriate subspace which can maintain valuable information from multi-view features is of vital importance but challenging. To tackle this problem, this paper proposes a novel multi-view dimension reduction method named Multi-view Locality Low-rank Embedding for Dimension Reduction (MvL2E). MvL2E makes full use of correlations between multi-view features by adopting low-rank representations. Meanwhile, it aims to maintain the correlations and construct a suitable manifold space to capture the low-dimensional embedding for multi-view features. A centroid based scheme is designed to force multiple views to learn from each other. An iterative alternating strategy is developed to obtain the optimal solution of MvL2E. The proposed method is evaluated on 5 benchmark datasets. Comprehensive experiments show that our proposed MvL2E can achieve comparable performance with previous approaches proposed in the recent literature. 
Multi-View Multiple Clustering(MVMC) Multiple clustering aims at exploring alternative clusterings to organize the data into meaningful groups from different perspectives. Existing multiple clustering algorithms are designed for single-view data. We assume that the individuality and commonality of multi-view data can be leveraged to generate high-quality and diverse clusterings. To this end, we propose a novel multi-view multiple clustering (MVMC) algorithm. MVMC first adapts multi-view self-representation learning to explore the individuality encoding matrices and the shared commonality matrix of multi-view data. It additionally reduces the redundancy (i.e., enhancing the individuality) among the matrices using the Hilbert-Schmidt Independence Criterion (HSIC), and collects shared information by forcing the shared matrix to be smooth across all views. It then uses matrix factorization on the individual matrices, along with the shared matrix, to generate diverse clusterings of high-quality. We further extend multiple co-clustering on multi-view data and propose a solution called multi-view multiple co-clustering (MVMCC). Our empirical study shows that MVMC (MVMCC) can exploit multi-view data to generate multiple high-quality and diverse clusterings (co-clusterings), with superior performance to the state-of-the-art methods. Multiway Data Analysis Multiway data analysis is a method of analyzing large data sets by representing the data as a multidimensional array. The proper choice of array dimensions and analysis techniques can reveal patterns in the underlying data undetected by other methods. ➘ “Tensor Methods” http://…/Applied_multiway_data_analysis multiway Mumble Mumble is an open source, low-latency, high quality voice chat software primarily intended for use while gaming. MuProp Deep neural networks are powerful parametric models that can be trained efficiently using the backpropagation algorithm. 
Stochastic neural networks combine the power of large parametric functions with that of graphical models, which makes it possible to learn very complex distributions. However, as backpropagation is not directly applicable to stochastic networks that include discrete sampling operations within their computational graph, training such networks remains difficult. We present MuProp, an unbiased gradient estimator for stochastic networks, designed to make this task easier. MuProp improves on the likelihood-ratio estimator by reducing its variance using a control variate based on the first-order Taylor expansion of a mean-field network. Crucially, unlike prior attempts at using backpropagation for training stochastic networks, the resulting estimator is unbiased and well behaved. MuRel Multimodal attentional networks are currently state-of-the-art models for Visual Question Answering (VQA) tasks involving real images. Although attention allows the model to focus on the visual content relevant to the question, this simple mechanism is arguably insufficient to model complex reasoning features required for VQA or other high-level tasks. In this paper, we propose MuRel, a multimodal relational network which is learned end-to-end to reason over real images. Our first contribution is the introduction of the MuRel cell, an atomic reasoning primitive representing interactions between question and image regions by a rich vectorial representation, and modeling region relations with pairwise combinations. Secondly, we incorporate the cell into a full MuRel network, which progressively refines visual and question interactions, and can be leveraged to define visualization schemes finer than mere attention maps. We validate the relevance of our approach with various ablation studies, and show its superiority to attention-based methods on three datasets: VQA 2.0, VQA-CP v2 and TDIUC. Our final MuRel network is competitive with or outperforms state-of-the-art results in this challenging context. 
Our code is available: https://…/murel.bootstrap.pytorch Murphy Diagram In the context of probability forecasts for binary weather events, displays of this type have a rich tradition that can be traced to Thompson and Brier (1955) and Murphy (1977). More recent examples include the papers by Schervish (1989), Richardson (2000), Wilks (2001), Mylne (2002), and Berrocal et al. (2010), among many others. Murphy (1977) distinguished three kinds of diagrams that reflect the economic decisions involved. The negatively oriented expense diagram shows the mean raw loss or expense of a given forecast scheme; the positively oriented value diagram takes the unconditional or climatological forecast as reference and plots the difference in expense between this reference forecast and the forecast at hand, and lastly, the relative-value diagram plots the ratio of the utility of a given forecast and the utility of an oracle forecast. The displays introduced above are similar to the value diagrams of Murphy, and we refer to them as Murphy diagrams. Murphy diagrams in R Mutation-Selection Equilibrium We propose a class of evolutionary models that involves an arbitrary exchangeable process as the breeding process and different selection schemes. In those models, a new genome is born according to the breeding process, and then a genome is removed according to the selection scheme that involves fitness. Thus the population size remains constant. The process evolves according to a Markov chain, and, unlike in many other existing models, the stationary distribution — so called mutation-selection equilibrium — can be easily found and studied. The behaviour of the stationary distribution when the population size increases is our main object of interest. Several phase-transition theorems are proved. Mutex Watershed Image partitioning, or segmentation without semantics, is the task of decomposing an image into distinct segments, or equivalently to detect closed contours. 
Most prior work either requires seeds, one per segment; or a threshold; or formulates the task as multicut / correlation clustering, an NP-hard problem. Here, we propose a greedy algorithm for signed graph partitioning, the ‘Mutex Watershed’. Unlike seeded watershed, the algorithm can accommodate not only attractive but also repulsive cues, allowing it to find a previously unspecified number of segments without the need for explicit seeds or a tunable threshold. We also prove that this simple algorithm solves to global optimality an objective function that is intimately related to the multicut / correlation clustering integer linear programming formulation. The algorithm is deterministic, very simple to implement, and has empirically linearithmic complexity. When presented with short-range attractive and long-range repulsive cues from a deep neural network, the Mutex Watershed gives the best results currently known for the competitive ISBI 2012 EM segmentation benchmark. Mutual Information In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the ‘amount of information’ (in units such as shannons, commonly called bits) obtained about one random variable through observing the other random variable. The concept of mutual information is intricately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected ‘amount of information’ held in a random variable. Not limited to real-valued random variables like the correlation coefficient, MI is more general and determines how similar the joint distribution of the pair (X,Y) is to the product of the marginal distributions of X and Y. MI is the expected value of the pointwise mutual information (PMI). 
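The definition above can be made concrete for discrete variables. The sketch below computes MI in bits as the expected pointwise mutual information, MI = sum over (x, y) of p(x, y) * log2( p(x, y) / (p(x) p(y)) ), from a joint probability table; the function name and the two example tables are illustrative.

```python
import numpy as np

def mutual_information(joint):
    """MI in bits of a discrete joint distribution given as a 2-D table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()              # normalise counts or probabilities
    px = joint.sum(axis=1, keepdims=True)    # marginal of X, shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)    # marginal of Y, shape (1, m)
    nz = joint > 0                           # cells with p(x, y) = 0 contribute nothing
    pmi = np.log2(joint[nz] / (px @ py)[nz]) # pointwise mutual information per cell
    return float(np.sum(joint[nz] * pmi))    # expectation of PMI under p(x, y)

# Independent variables: the joint equals the product of the marginals.
independent = np.outer([0.5, 0.5], [0.25, 0.75])
# Y is an exact copy of a fair coin X.
identical = np.array([[0.5, 0.0],
                      [0.0, 0.5]])

mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
```

For the independent table the result is zero up to floating-point rounding, and for the copied fair coin it is one bit, which equals the entropy of X.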
Mutual Information Neural Estimator(MINE) Recently, a method called the Mutual Information Neural Estimator (MINE) that uses neural networks has been proposed to estimate mutual information and more generally the Kullback-Leibler (KL) divergence between two distributions. The method uses the Donsker-Varadhan representation to arrive at the estimate of the KL divergence and is better than the existing estimators in terms of scalability and flexibility. The output of the MINE algorithm is not guaranteed to be a consistent estimator. We propose a new estimator that, instead of searching among functions characterized by neural networks, searches the functions in a Reproducing Kernel Hilbert Space. We prove that the proposed estimator is consistent. We carry out simulations and show that when the datasets are small the proposed estimator is more reliable than the MINE estimator and when the datasets are large the performance of the two methods is close. Mutual Iterative Attention(MIA) In image-grounded text generation, fine-grained representations of the image are considered to be of paramount importance. Most of the current systems incorporate visual features and textual concepts as a sketch of an image. However, plainly inferred representations are usually undesirable in that they are composed of separate components, the relations of which are elusive. In this work, we aim at representing an image with a set of integrated visual regions and corresponding textual concepts. To this end, we build the Mutual Iterative Attention (MIA) module, which integrates correlated visual features and textual concepts, respectively, by aligning the two modalities. We evaluate the proposed approach on the COCO dataset for image captioning. Extensive experiments show that the refined image representations boost the baseline models by up to 12% in terms of CIDEr, demonstrating that our method is effective and generalizes well to a wide range of models. 
Mutual Posterior-Divergence Regularization Variational Autoencoder (VAE), a simple and effective deep generative model, has led to a number of impressive empirical successes and spawned many advanced variants and theoretical investigations. However, recent studies demonstrate that, when equipped with expressive generative distributions (a.k.a. decoders), VAE suffers from learning uninformative latent representations with the observation called KL vanishing, in which case VAE collapses into an unconditional generative model. In this work, we introduce mutual posterior-divergence regularization, a novel regularization that is able to control the geometry of the latent space to accomplish meaningful representation learning, while achieving comparable or superior capability of density estimation. Experiments on three image benchmark datasets demonstrate that, when equipped with powerful decoders, our model performs well both on density estimation and representation learning. mxnet MXNet is a deep learning framework designed for both efficiency and flexibility. It allows you to mix symbolic and imperative programming to maximize efficiency and productivity. At its core, MXNet contains a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly. A graph optimization layer on top of that makes symbolic execution fast and memory efficient. MXNet is portable and lightweight, scaling effectively to multiple GPUs and multiple machines. MXNet is also more than a deep learning project. It is also a collection of blueprints and guidelines for building deep learning systems, and interesting insights of DL systems for hackers. mxnet MXNET-MPI Existing Deep Learning frameworks exclusively use either Parameter Server(PS) approach or MPI parallelism. In this paper, we discuss the drawbacks of such approaches and propose a generic framework supporting both PS and MPI programming paradigms, co-existing at the same time. 
The key advantage of the new model is to embed the scaling benefits of MPI parallelism into the loosely coupled PS task model. Apart from providing a practical usage model of MPI in cloud, such framework allows for novel communication avoiding algorithms that do parameter averaging in Stochastic Gradient Descent(SGD) approaches. We show how MPI and PS models can synergistically apply algorithms such as Elastic SGD to improve the rate of convergence against existing approaches. These new algorithms directly help scaling SGD clusterwide. Further, we also optimize the critical component of the framework, namely global aggregation or allreduce using a novel concept of tensor collectives. These treat a group of vectors on a node as a single object allowing for the existing single vector algorithms to be directly applicable. We back our claims with sufficient empirical evidence using large scale ImageNet 1K data. Our framework is built upon MXNET but the design is generic and can be adapted to other popular DL infrastructures. MyCaffe Over the past few years Caffe, from Berkeley AI Research, has gained a strong following in the deep learning community with over 15K forks on the github.com/BLVC/Caffe site. With its well organized, very modular C++ design it is easy to work with and very fast. However, in the world of Windows development, C# has helped accelerate development with many of the enhancements that it offers over C++, such as garbage collection, a very rich .NET programming framework and easy database access via Entity Frameworks. So how can a C# developer use the advances of C# to take full advantage of the benefits offered by the Berkeley Caffe deep learning system? The answer is the fully open source, ‘MyCaffe’ for Windows .NET programmers. MyCaffe is an open source, complete C# language re-write of Berkeley’s Caffe. This article describes the general architecture of MyCaffe including the newly added MyCaffeTrainerRL for Reinforcement Learning. 
In addition, this article discusses how MyCaffe closely follows the C++ Caffe, while talking efficiently to the low level NVIDIA CUDA hardware to offer a high performance, highly programmable deep learning system for Windows .NET programmers. Myelin Machine learning models benefit from large and diverse datasets. Using such datasets, however, often requires trusting a centralized data aggregator. For sensitive applications like healthcare and finance this is undesirable as it could compromise patient privacy or divulge trade secrets. Recent advances in secure and privacy-preserving computation, including trusted hardware enclaves and differential privacy, offer a way for mutually distrusting parties to efficiently train a machine learning model without revealing the training data. In this work, we introduce Myelin, a deep learning framework which combines these privacy-preservation primitives, and use it to establish a baseline level of performance for fully private machine learning. Myia We review the current state of automatic differentiation (AD) for array programming in machine learning (ML), including the different approaches such as operator overloading (OO) and source transformation (ST) used for AD, graph-based intermediate representations for programs, and source languages. Based on these insights, we introduce a new graph-based intermediate representation (IR) which specifically aims to efficiently support fully-general AD for array programming. Unlike existing dataflow programming representations in ML frameworks, our IR naturally supports function calls, higher-order functions and recursion, making ML models easier to implement. The ability to represent closures allows us to perform AD using ST without a tape, making the resulting derivative (adjoint) program amenable to ahead-of-time optimization using tools from functional language compilers, and enabling higher-order derivatives. 
Lastly, we introduce a proof of concept compiler toolchain called Myia which uses a subset of Python as a front end.
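The idea of closure-based reverse-mode AD without a tape can be illustrated by hand. In the sketch below, each primitive returns its value together with a backpropagator closure, and the function `f` is written in the form a source transformation might plausibly produce. This is a hand-written illustration in plain Python; it is not Myia's API or its generated code, and all names are hypothetical.

```python
import math

# Each primitive returns (value, backpropagator). The backpropagator is a closure
# that maps the sensitivity of the output to the sensitivities of the inputs.
def mul(x, y):
    return x * y, lambda d: (d * y, d * x)

def add(x, y):
    return x + y, lambda d: (d, d)

def sin(x):
    return math.sin(x), lambda d: (d * math.cos(x),)

def f(x, y):
    """f(x, y) = x * y + sin(x), hand-transformed into value-plus-backpropagator form."""
    a, back_mul = mul(x, y)
    b, back_sin = sin(x)
    out, back_add = add(a, b)

    def backprop(d_out):
        # Walk the computation in reverse; the closures hold the intermediate
        # values, so no global tape is needed.
        d_a, d_b = back_add(d_out)
        d_x1, d_y = back_mul(d_a)
        (d_x2,) = back_sin(d_b)
        return d_x1 + d_x2, d_y   # x is used twice, so its sensitivities add up

    return out, backprop

value, backprop = f(2.0, 3.0)
grad_x, grad_y = backprop(1.0)    # df/dx = y + cos(x), df/dy = x
```

Because `backprop` is an ordinary function built from ordinary closures, a compiler can optimize it ahead of time like any other program, and the same transformation can be applied to it again to obtain higher-order derivatives.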