# WhatIs-D

 d3.compose Compose complex, data-driven visualizations from reusable charts and components with d3. · Get started quickly with standard charts and components · Layout charts and components automatically · Powerful foundation for creating custom charts and components Dafny Dafny is a programming language with a program verifier. As you type in your program, the verifier constantly looks over your shoulders and flags any errors. DAG Variational Autoencoder(D-VAE) Graph structured data are abundant in the real world. Among different graph types, directed acyclic graphs (DAGs) are of particular interests to machine learning researchers, as many machine learning models are realized as computations on DAGs, including neural networks and Bayesian networks. In this paper, we study deep generative models for DAGs, and propose a novel DAG variational autoencoder (D-VAE). To encode DAGs into the latent space, we leverage graph neural networks. We propose a DAG-style asynchronous message passing scheme that allows encoding the computations defined by DAGs, rather than using existing simultaneous message passing schemes to encode the graph structures. We demonstrate the effectiveness of our proposed D-VAE through two tasks: neural architecture search and Bayesian network structure learning. Experiments show that our model not only generates novel and valid DAGs, but also produces a smooth latent space that facilitates searching for DAGs with better performance through Bayesian optimization. DAG-exchangeability Motivated by problems in Bayesian nonparametrics and probabilistic programming discussed in Staton et al. (2018), we present a new kind of partial exchangeability for random arrays which we call DAG-exchangeability. In our setting, a given random array is indexed by certain subgraphs of a directed acyclic graph (DAG) of finite depth, where each nonterminal vertex has infinitely many outgoing edges. We prove a representation theorem for such arrays which generalizes the Aldous-Hoover representation theorem. In the case that the DAGs are finite collections of certain rooted trees, our arrays are hierarchically exchangeable in the sense of Austin and Panchenko (2014), and we recover the representation theorem proved by them. Additionally, our representation is fine-grained in the sense that representations at lower levels of the hierarchy are also available. This latter feature is important in applications to probabilistic programming, thus offering an improvement over the Austin-Panchenko representation even for hierarchical exchangeability. DAgger EnsembleDAgger: A Bayesian Approach to Safe Imitation Learning DAGOR Effective overload control for large-scale online service system is crucial for protecting the system backend from overload. Conventionally, the design of overload control is ad-hoc for individual service. However, service-specific overload control could be detrimental to the overall system due to intricate service dependencies or flawed implementation of service. Service developers usually have difficulty to accurately estimate the dynamics of actual workload during the development of service. Therefore, it is essential to decouple the overload control from the service logic. In this paper, we propose DAGOR, an overload control scheme designed for the account-oriented microservice architecture. DAGOR is service agnostic and system-centric. It manages overload at the microservice granule such that each microservice monitors its load status in real time and triggers load shedding in a collaborative manner among its relevant services when overload is detected. DAGOR has been used in the WeChat backend for five years. Experimental results show that DAGOR can benefit high success rate of service even when the system is experiencing overload, while ensuring fairness in the overload control. dAIrector dAIrector is an automated director which collaborates with humans storytellers for live improvisational performances and writing assistance. dAIrector can be used to create short narrative arcs through contextual plot generation. In this work, we present the system architecture, a quantitative evaluation of design choices, and a case-study usage of the system which provides qualitative feedback from a professional improvisational performer. We present relevant metrics for the understudied domain of human-machine creative generation, specifically long-form narrative creation. We include, alongside publication, open-source code so that others may test, evaluate, and run the dAIrector. Daisee We study adaptive importance sampling (AIS) as an online learning problem and argue for the importance of the trade-off between exploration and exploitation in this adaptation. Borrowing ideas from the bandits literature, we propose Daisee, a partition-based AIS algorithm. We further introduce a notion of regret for AIS and show that Daisee has $\mathcal{O}(\sqrt{T}(\log T)^{\frac{3}{4}})$ cumulative pseudo-regret, where $T$ is the number of iterations. We then extend Daisee to adaptively learn a hierarchical partitioning of the sample space for more efficient sampling and confirm the performance of both algorithms empirically. Daleel In this paper we present Daleel, a multi-criteria adaptive decision making framework that is developed to find the optimal IaaS deployment strategy. DALEX Predictive modeling is invaded by elastic, yet complex methods such as neural networks or ensembles (model stacking, boosting or bagging). Such methods are usually described by a large number of parameters or hyper parameters – a price that one needs to pay for elasticity. The very number of parameters makes models hard to understand. This paper describes a consistent collection of explainers for predictive models, a.k.a. black boxes. Each explainer is a technique for exploration of a black box model. Presented approaches are model-agnostic, what means that they extract useful information from any predictive method despite its internal structure. Each explainer is linked with a specific aspect of a model. Some are useful in decomposing predictions, some serve better in understanding performance, while others are useful in understanding importance and conditional responses of a particular variable. Every explainer presented in this paper works for a single model or for a collection of models. In the latter case, models can be compared against each other. Such comparison helps to find strengths and weaknesses of different approaches and gives additional possibilities for model validation. Presented explainers are implemented in the DALEX package for R. They are based on a uniform standardized grammar of model exploration which may be easily extended. The current implementation supports the most popular frameworks for classification and regression. Damerau-Levenshtein Distance In information theory and computer science, the Damerau-Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a string metric for measuring the edit distance between two sequences. Informally, the Damerau-Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other. The Damerau-Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions). In his seminal paper, Damerau stated that these four operations correspond to more than 80% of all human misspellings. Damerau’s paper considered only misspellings that could be corrected with at most one edit operation. While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers, Damerau-Levenshtein distance has also seen uses in biology to measure the variation between protein sequences. Damped Least-Squares(DLS) In mathematics and computing, the Levenberg-Marquardt algorithm (LMA), also known as the damped least-squares (DLS) method, is used to solve non-linear least squares problems. These minimization problems arise especially in least squares curve fitting. The LMA interpolates between the Gauss-Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be a bit slower than the GNA. LMA can also be viewed as Gauss-Newton using a trust region approach. The LMA is a very popular curve-fitting algorithm used in many software applications for solving generic curve-fitting problems. However, as for many fitting algorithms, the LMA finds only a local minimum, which is not necessarily the global minimum. onls DancingLines Nowadays, events usually burst and are propagated online through multiple modern media like social networks and search engines. There exists various research discussing the event dissemination trends on individual medium, while few studies focus on event popularity analysis from a cross-platform perspective. Challenges come from the vast diversity of events and media, limited access to aligned datasets across different media and a great deal of noise in the datasets. In this paper, we design DancingLines, an innovative scheme that captures and quantitatively analyzes event popularity between pairwise text media. It contains two models: TF-SW, a semantic-aware popularity quantification model, based on an integrated weight coefficient leveraging Word2Vec and TextRank; and wDTW-CD, a pairwise event popularity time series alignment model matching different event phases adapted from Dynamic Time Warping. We also propose three metrics to interpret event popularity trends between pairwise social platforms. Experimental results on eighteen real-world event datasets from an influential social network and a popular search engine validate the effectiveness and applicability of our scheme. DancingLines is demonstrated to possess broad application potentials for discovering the knowledge of various aspects related to events and different media. Dark Data The total amount of data in every organization is far, far greater than anyone including, most crucially, their Information Technology group knows about. Moreover this ‘missing’ data that can’t be seen and currently can’t be made use of is also the very stuff that holds the organization together. This is what we call ‘Dark Data.’ Dark Knowledge A simple way to improve classification performance is to average the predictions of a large ensemble of different classifiers. This is great for winning competitions but requires too much computation at test time for practical applications such as speech recognition. In a widely ignored paper in 2006, Caruana and his collaborators showed that the knowledge in the ensemble could be transferred to a single, efficient model by training the single model to mimic the log probabilities of the ensemble average. This technique works because most of the knowledge in the learned ensemble is in the relative probabilities of extremely improbable wrong answers. For example, the ensemble may give an image of a BMW a probability of one in a billion of being a garbage truck but this is still far greater (in the log domain) than its probability of being a carrot. This ‘dark knowledge’, which is practically invisible in the class probabilities, defines a similarity metric over the classes that makes it much easier to learn a good classifier. http://…/dark-knowledge-neural-network.html http://…/geoff-hintons-dark-knowledge http://…/1503.02531v1.pdf DARPA Open Catalog Welcome to the DARPA Open Catalog, which contains a curated list of DARPA-sponsored software and peer-reviewed publications. DARPA sponsors fundamental and applied research in a variety of areas that may lead to experimental results and reusable technology designed to benefit multiple government domains. The DARPA Open Catalog organizes publicly releasable material from DARPA programs. DARPA has an open strategy to help increase the impact of government investments. DARPA is interested in building communities around government-funded research. DARPA plans to continue to make available information generated by DARPA programs, including software, publications, data, and experimental results. The table on this page lists the programs currently participating in the catalog. Dash Dash is a Python framework for building analytical web applications. No JavaScript required. Built on top of Plotly.js, React, and Flask, Dash ties modern UI elements like dropdowns, sliders, and graphs to your analytical Python code. Dash: A Beginner´s Guide Dask Dask is a flexible parallel computing library for analytic computing. Dask is composed of two components: 1. Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads. 2. ‘Big Data’ collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers. Ultimate guide to handle Big Datasets for Machine Learning using Dask (in Python) dask-searchcv This library provides implementations of Scikit-Learn’s GridSearchCV and RandomizedSearchCV. They implement many (but not all) of the same parameters, and should be a drop-in replacement for the subset that they do implement. For certain problems, these implementations can be more efficient than those in Scikit-Learn, as they can avoid expensive repeated computations. DASNet Pixel-level annotation demands expensive human efforts and limits the performance of deep networks that usually benefits from more such training data. In this work we aim to achieve high quality instance and semantic segmentation results over a small set of pixel-level mask annotations and a large set of box annotations. The basic idea is exploring detection models to simplify the pixel-level supervised learning task and thus reduce the required amount of mask annotations. Our architecture, named DASNet, consists of three modules: detection, attention, and segmentation. The detection module detects all classes of objects, the attention module generates multi-scale class-specific features, and the segmentation module recovers the binary masks. Our method demonstrates substantially improved performance compared to existing semi-supervised approaches on PASCAL VOC 2012 dataset. Dat Build data pipelines – Dat is an open source project that provides a streaming interface between every file format and data storage backend. Data Acceleration Data technologies are evolving rapidly, but organizations have adopted most of these in piecemeal fashion. As a result, enterprise data – whether related to customer interactions, business performance, computer notifications, or external events in the business environment – is vastly underutilized. Moreover, companies’ data ecosystems have become complex and littered with data silos. This makes the data more difficult to access, which in turn limits the value that organizations can get out of it. Indeed, according to a recent Gartner, Inc. report, 85 percent of Fortune 500 organizations will be unable to exploit Big Data for competitive advantage through 2015. Furthermore, a recent Accenture study found that half of all companies have concerns about the accuracy of their data, and the majority of executives are unclear about the business outcomes they are getting from their data analytics programs. To unlock the value hidden in their data, companies must start treating data as a supply chain, enabling it to flow easily and usefully through the entire organization – and eventually throughout each company’s ecosystem of partners, including suppliers and customers. The time is right for this approach. For one thing, new external data sources are becoming available, providing fresh opportunities for data insights. In addition, the tools and technology required to build a better data platform are available and in use. These provide a foundation on which companies can construct an integrated, end-to-end data supply chain. Data Acquisition Data acquisition is the process of sampling signals that measure real world physical conditions and converting the resulting samples into digital numeric values that can be manipulated by a computer. Data acquisition systems (abbreviated with the acronym DAS or DAQ) typically convert analog waveforms into digital values for processing. Data Aggregation In statistics, aggregate data describes data combined from several measurements. When data are aggregated, groups of observations are replaced with summary statistics based on those observations. In economics, aggregate data or data aggregates describes high-level data that is composed from a multitude or combination of other more individual data. Data Analysis Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains. Data Analysis Expressions(DAX) Data Analysis Expressions are a collection of functions that can be used to perform a task and return one or more values. Although this sounds very similar to any other programming language, DAX is only a formula or a query language. DAX was developed around 2009 by Microsoft to be used with Microsoft’s PowerPivot, which at that time was available as an Excel (2010) add-in. It is extremely popular today as it is now the language of choice for Power BI and is supported by Tabular SSAS as well. Data Analytics Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains. Data analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information. Data analytics is used in many industries to allow companies and [organizations] to make better business decisions and in the sciences to verify or disprove existing models or theories. Definition Data Archaeology Data archaeology refers to the art and science of recovering computer data encoded and/or encrypted in now obsolete media or formats. Data archaeology can also refer to recovering information from damaged electronic formats after natural or man made disasters. Data as a Service(DaaS) Data as a Service, or DaaS, is a cousin of software as a service. Like all members of the “as a Service” (aaS) family, DaaS is based on the concept that the product, data in this case, can be provided on demand to the user regardless of geographic or organizational separation of provider and consumer. Additionally, the emergence of service-oriented architecture (SOA) has rendered the actual platform on which the data resides also irrelevant. This development has enabled the recent emergence of the relatively new concept of DaaS. Data provided as a service was at first primarily used in Web mashups, but now is being increasingly employed both commercially and, less commonly, within organisations such as the UN. Traditionally, most enterprises have used data stored in a self-contained repository, for which software was specifically developed to access and present the data in a human-readable form. One result of this paradigm is the bundling of both the data and the software needed to interpret it into a single package, sold as a consumer product. As the number of bundled software/data packages proliferated and required interaction among one another, another layer of interface was required. These interfaces, collectively known as enterprise application integration (EAI), often tended to encourage vendor lock-in, as it is generally easy to integrate applications that are built upon the same foundation technology. The result of the combined software/data consumer package and required EAI middleware has been an increased amount of software for organizations to manage and maintain, simply for the use of particular data. In addition to routine maintenance costs, a cascading amount of software updates are required as the format of the data changes. The existence of this situation contributes to the attractiveness of DaaS to data consumers, because it allows for the separation of data cost and usage from that of a specific software or platform. Data Assimilation Data assimilation is the process by which observations are incorporated into a computer model of a real system. Applications of data assimilation arise in many fields of geosciences, perhaps most importantly in weather forecasting and hydrology. Data assimilation proceeds by analysis cycles. In each analysis cycle, observations of the current (and possibly past) state of a system are combined with the results from a numerical model (the forecast) to produce an analysis, which is considered as ‘the best’ estimate of the current state of the system. This is called the analysis step. Essentially, the analysis step tries to balance the uncertainty in the data and in the forecast. The model is then advanced in time and its result becomes the forecast in the next analysis cycle. Book: Data Assimilation Book: Data Assimilation Data Assimilation Lecture Notes Data Augmentation Data augmentation adds value to base data by adding information derived from internal and external sources within an enterprise. Data is one of the core assets for an enterprise, making data management essential. Data augmentation can be applied to any form of data, but may be especially useful for customer data, sales patterns, product sales, where additional information can help provide more in-depth insight. Data augmentation can help reduce the manual interventation required to developed meaningful information and insight of business data, as well as significantly enhance data quality. Data augmentation is of the last steps done in enterprise data management after monitoring, profiling and integration Some of the common techniques used in data augmentation include: · Extrapolation Technique: Based on heuristics. The relevant fields are updated or provided with values. · Tagging Technique: Common records are tagged to a group, making it easier to understand and differentiate for the group. · Aggregation Technique: Using mathematical values of averages and means, values are estimated for relevant fields if needed · Probability Technique: Based on heuristics and analytical statistics, values are populated based on the probability of events. https://…/01-jcgs-art.pdf Data Blending Data blending is the process of combining data from multiple sources to reveal deeper intelligence that drives better business decision-making. Data blending differs from data integration and data warehousing in that its primary use is not to create the single, unified version of the truth that is stored in systems of record. Rather, business and data analysts use data blending to build an analytic dataset to assist in answering a specific business questions and driving a particular business process. Data Broker / Information Broker An information broker (independent information professional, information consultant, or data broker) collects information, often about individual people. The data are then sold to companies that use it to target advertising and marketing towards specific groups, to verify a person’s identity including for purposes of fraud detection, and to sell to individuals and organizations so they can research particular individuals. Critics, including consumer protection organizations, say the industry is secretive and unaccountable, and should be better regulated. Data Business Model According to Wikipedia, a business model “describes the rationale of how an organization creates, delivers, and captures value.” A Data Business Model is a business model where data is an indispensable component. If you remove the data, the business fails (or at least suffers greatly). To take one example, Amazon’s data is core to their business. Their historical transaction data helps them figure out how much inventory to hold and how to price products. Additionally, data about product views and purchases powers the recommendation engine, which drives a large portion of sales. Furthermore, product reviews drive traffic and SEO. As icing on the cake, all of this is a virtuous cycle: recommendations drive purchases, which result in more reviews, which lead to better SEO and more traffic, which results in more visitors and better recommendations. If Amazon wasn’t so effective as using data, it would be a much smaller company. The best part of data business models is that they often have the same kind of positive feedback loop as Amazon. In each business model, the more you use data to make money, the more data you get as a result, which helps you make more money in the future. Data Calculator Data structures are critical in any data-driven scenario, but they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware which evolve continuously. We present a design engine, the Data Calculator, which enables interactive and semi-automated design of data structures. It brings two innovations. First, it offers a set of fine-grained design primitives that capture the first principles of data layout design: how data structure nodes lay data out, and how they are positioned relative to each other. This allows for a structured description of the universe of possible data structure designs that can be synthesized as combinations of those primitives. The second innovation is computation of performance using learned cost models. These models are trained on diverse hardware and data profiles and capture the cost properties of fundamental data access primitives (e.g., random access). With these models, we synthesize the performance cost of complex operations on arbitrary data structure designs without having to: 1) implement the data structure, 2) run the workload, or even 3) access the target hardware. We demonstrate that the Data Calculator can assist data structure designers and researchers by accurately answering rich what-if design questions on the order of a few seconds or minutes, i.e., computing how the performance (response time) of a given data structure design is impacted by variations in the: 1) design, 2) hardware, 3) data, and 4) query workloads. This makes it effortless to test numerous designs and ideas before embarking on lengthy implementation, deployment, and hardware acquisition steps. We also demonstrate that the Data Calculator can synthesize entirely new designs, auto-complete partial designs, and detect suboptimal design choices. The Data Calculator: Data Structure Design and Cost Synthesis From First Principles, and Learned Cost Models Data Communications Data Communications concerns the transmission of digital messages to devices external to the message source. “External” devices are generally thought of as being independently powered circuitry that exists beyond the chassis of a computer or other digital message source. As a rule, the maximum permissible transmission rate of a message is directly proportional to signal power, and inversely proportional to channel noise. It is the aim of any communications system to provide the highest possible transmission rate at the lowest possible power and with the least possible noise. Data Cube Materialization Data cube materialization is a classical database operator introduced in Gray et al.~(Data Mining and Knowledge Discovery, Vol.~1), which is critical for many analysis tasks. Nandi et al.~(Transactions on Knowledge and Data Engineering, Vol.~6) first studied cube materialization for large scale datasets using the MapReduce framework, and proposed a sophisticated modification of a simple broadcast algorithm to handle a dataset with a 216GB cube size within 25 minutes with 2k machines in 2012. Data Cube Vocabulary There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts. The Data Cube vocabulary provides a means to do this using the W3C RDF (Resource Description Framework) standard. The model underpinning the Data Cube vocabulary is compatible with the cube model that underlies SDMX (Statistical Data and Metadata eXchange), an ISO standard for exchanging and sharing statistical data and metadata among organizations. The Data Cube vocabulary is a core foundation which supports extension vocabularies to enable publication of other aspects of statistical data flows or other multi-dimensional data sets. Data Curation Data curation is a term used to indicate management activities required to maintain research data long-term such that it is available for reuse and preservation. In science, data curation may indicate the process of extraction of important information from scientific texts, such as research articles by experts, to be converted into an electronic format, such as an entry of a biological database. The term is also used in the humanities, where increasing cultural and scholarly data from digital humanities projects requires the expertise and analytical practices of data curation. In broad terms, curation means a range of activities and processes done to create, manage, maintain, and validate a component. Data Decorations Alberto Cairo left a comment about ‘data decorations’. This is a name he’s using to describe something like the windshield-wiper chart I discussed the other day. It seems like the visual elements were purely ornamental and adds nothing to the experience – one might argue that the experience was worse than just staring at the data table. Data Distillery The paper tackles the unsupervised estimation of the effective dimension of a sample of dependent random vectors. The proposed method uses the principal components (PC) decomposition of sample covariance to establish a low-rank approximation that helps uncover the hidden structure. The number of PCs to be included in the decomposition is determined via a Probabilistic Principal Components Analysis (PPCA) embedded in a penalized profile likelihood criterion. The choice of penalty parameter is guided by a data-driven procedure that is justified via analytical derivations and extensive finite sample simulations. Application of the proposed penalized PPCA is illustrated with three gene expression datasets in which the number of cancer subtypes is estimated from all expression measurements. The analyses point towards hidden structures in the data, e.g. additional subgroups, that could be of scientific interest. Data Driven Business Model(DDBM) This paper contributes by providing a definition of a data-driven business model as a business model that relies on data as a key resource. Data Driven Documents(D3) D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation. Visualizing Data with D3.js D3 Tips and Tricks Awesome D3 Data Dropout Deep learning models learn to fit training data while they are highly expected to generalize well to testing data. Most works aim at finding such models by creatively designing architectures and fine-tuning parameters. To adapt to particular tasks, hand-crafted information such as image prior has also been incorporated into end-to-end learning. However, very little progress has been made on investigating how an individual training sample will influence the generalization ability of a model. In other words, to achieve high generalization accuracy, do we really need all the samples in a training dataset? In this paper, we demonstrate that deep learning models such as convolutional neural networks may not favor all training samples, and generalization accuracy can be further improved by dropping those unfavorable samples. Specifically, the influence of removing a training sample is quantifiable, and we propose a Two-Round Training approach, aiming to achieve higher generalization accuracy. We locate unfavorable samples after the first round of training, and then retrain the model from scratch with the reduced training dataset in the second round. Since our approach is essentially different from fine-tuning or further training, the computational cost should not be a concern. Our extensive experimental results indicate that, with identical settings, the proposed approach can boost performance of the well-known networks on both high-level computer vision problems such as image classification, and low-level vision problems such as image denoising. Data Envelopment Analysis(DEA) Data envelopment analysis (DEA) is a nonparametric method in operations research and economics for the estimation of production frontiers. It is used to empirically measure productive efficiency of decision making units (or DMUs). Although DEA has a strong link to production theory in economics, the tool is also used for benchmarking in operations management, where a set of measures is selected to benchmark the performance of manufacturing and service operations. Data Envelopment Analysis rDEA Data Fabric The Data Fabric is the platform that supports all the data in the company. How it’s managed, described, combined and universally accessed. This platform is formed from an Enterprise Knowledge Graph to create an uniform and unified data environment. Data Federation In most cases, if the term federation is used, it refers to combining autonomously operating objects. For example, states can be federated to form one country. If we apply this common explanation to data federation, it means combining autonomous data stores to form one large data store. Therefore, we propose the following definition ‘Data federation is a form of data virtualization where the data stored in a heterogeneous set of autonomous data stores is made accessible to data consumers as one integrated data store by using on-demand data integration.’ This definition is based on the following concepts: · Data virtualization: Data federation is a form of data virtualization. Note that not all forms of data virtualization imply data federation. For example, if an organization wants to virtualize the database of one application, no need exists for data federation. But data federation always results in data virtualization. · Heterogeneous set of data stores: Data federation should make it possible to bring data together from data stores using different storage structures, different access languages, and different APIs. An application using data federation should be able to access different types of database servers and files with various formats; it should be able to integrate data from all those data sources; it should offer features for transforming the data; and it should allow the applications and tools to access the data through various APIs and languages. · Autonomous data stores: Data stores accessed by data federation are able to operate independently; in other words, they can be used outside the scope of data federation. · One integrated data store: Regardless of how and where data is stored, it should be presented as one integrated data set. This implies that data federation involves transformation, cleansing, and possibly even enrichment of data. · On-demand integration: This refers to when the data from a heterogeneous set of data stores is integrated. With data federation, integration takes place on the fly, and not in batch. When the data consumers ask for data, only then data is accessed and integrated. So the data is not stored in an integrated way, but remains in its original location and format. Spark Reaches for the Holy Grail: Federated Queries Data Fusion Data fusion is the process of integration of multiple data and knowledge representing the same real-world object into a consistent, accurate, and useful representation. Data fusion processes are often categorized as low, intermediate or high, depending on the processing stage at which fusion takes place. Low level data fusion combines several sources of raw data to produce new raw data. The expectation is that fused data is more informative and synthetic than the original inputs. For example, sensor fusion is also known as (multi-sensor) data fusion and is a subset of information fusion. Data Hoarding ➘ “Digital Hoarding” http://…/Wikipedia:Avoid_data-hoarding http://…hoarding-pca-visualization-decisions.html Data Illustrator Data Illustrator: Augmenting Vector Design Tools with Lazy Data Binding for Expressive Visualization Authoring. Building graphical user interfaces for visualization authoring is challenging as one must reconcile the tension between flexible graphics manipulation and procedural visualization generation based on a graphical grammar or declarative languages. To better support designers’ workflows and practices, we propose Data Illustrator, a novel visualization framework. In our approach, all visualizations are initially vector graphics; data binding is applied when necessary and only constrains interactive manipulation to that data bound property. The framework augments graphic design tools with new concepts and operators, and describes the structure and generation of a variety of visualizations. Based on the framework, we design and implement a visualization authoring system. The system extends interaction techniques in modern vector design tools for direct manipulation of visualization configurations and parameters. We demonstrate the expressive power of our approach through a variety of examples. A qualitative study shows that designers can use our framework to compose visualizations. Data Impartment Data Journalism / Data Driven Journalism Data-driven journalism, often shortened to “ddj”, is a term in use since 2009/2010, to describe a journalistic process based on analyzing and filtering large data sets for the purpose of creating a news story. Main drivers for this process are newly available resources such as “open source” software and “open data”. This approach to journalism builds on older practices, most notably on CAR (acronym for “computer-assisted reporting”) a label used mainly in the US for decades. Other labels for partially similar approaches are “precision journalism”, based on a book by Philipp Meyer, published in 1972, where he advocated the use of techniques from social sciences in researching stories. Data Justice As a handful of data platforms generate massive amounts of user data, the barriers to entry rise since potential competitors have little data themselves to entice advertisers compared to the incumbents who have both the concentrated processing power and supply of user data to dominate particular sectors. The upshot of this market power by big data platforms is that the marketplace is doing little to create options for consumers that might alleviate the misuse of consumer data or encourage big data platforms to better compensate users who are willing to share their data. Data Justice has been launched as a project to promote public education and new alliances to challenge the danger of big data to workers, consumers and the public. Data Lake A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question. The term data lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization’s data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides on Hadoop’s cluster nodes of commodity computers. Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried. Data Leakage Data Leakage is the creation of unexpected additional information in the training data, allowing a model or machine learning algorithm to make unrealistically good predictions. Leakage is a pervasive challenge in applied machine learning, causing models to over-represent their generalization error and often rendering them useless in the real world. It can caused by human or mechanical error, and can be intentional or unintentional in both cases. Data Learning Technology is generating a huge and growing availability of observa tions of diverse nature. This big data is placing data learning as a central scientific discipline. It includes collection, storage, preprocessing, visualization and, essentially, statistical analysis of enormous batches of data. In this paper, we discuss the role of statistics regarding some of the issues raised by big data in this new paradigm and also propose the name of data learning to describe all the activities that allow to obtain relevant knowledge from this new source of information. Data Lineage Data lineage is generally defined as a kind of data life cycle that includes the data’s origins and where it moves over time. This term can also describe what happens to data as it goes through diverse processes. Data lineage can help with efforts to analyze how information is used and to track key bits of information that serve a particular purpose. How to track and visualize data lineage Data Lineage Analysis “Data lineage is defined as a data life cycle that includes the data’s origins and where it moves over time.” It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources. It also enables replaying specific portions or inputs of the dataflow for step-wise debugging or regenerating lost output. In fact, database systems have used such information, called data provenance, to address similar validation and debugging challenges already. Data provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins. The generated evidence supports essential forensic activities such as data-dependency analysis, error/compromise detection and recovery, and auditing and compliance analysis. “Lineage is a simple type of why provenance.” Data Literacy A statistical understanding and experience of applying analysis techniques to real data through code and visualization. Data literacy will be the fundamental skill for the 21st century. It’s also extremely easy to learn. The best way to develop this skill is to simply work with datasets. (https://…/data-year# ) Data literacy is the ability to read, understand, create and communicate data as information. Much like literacy as a general concept, data literacy focuses on the competencies involved in working with data. As data collection and sharing become routine and data analysis and big data become common ideas in the news, business, government and society, it becomes more and more important for students, citizens, and readers to have some data literacy. Data Mining(DM) Data mining (the analysis step of the “Knowledge Discovery in Databases” process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Data Mining Reality Check(DMRC) Is a means for ensuring the validity of data mining results. Data Mining Reality Check is technology for giving you the probability distribution against which to compare the best performance from your data mining exercise – that is, the probability distribution of your best network relative to the benchmark, viewed as a random variable generated by a random process in which really there is nothing better than the benchmark. Data Normalization Data normalization is the process of reducing data to its canonical form. For instance, Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. In the field of software security, a common vulnerability is unchecked malicious input. The mitigation for this problem is proper input validation. Before input validation may be performed, the input must be normalized, i.e., eliminating encoding (for instance HTML encoding) and reducing the input data to a single common character set. Data Oriented Design In computing, data-oriented design is a program optimization approach motivated by efficient usage of the CPU cache, used in video game development. The approach is to focus on the data layout, separating and sorting fields according to when they are needed, and to think about transformations of data. Proponents include Mike Acton and Scott Meyers. Data Pallets Trusting simulation output is crucial for Sandia’s mission objectives. We rely on these simulations to perform our high-consequence mission tasks given national treaty obligations. Other science and modeling applications, while they may have high-consequence results, still require the strongest levels of trust to enable using the result as the foundation for both practical applications and future research. To this end, the computing community has developed workflow and provenance systems to aid in both automating simulation and modeling execution as well as determining exactly how was some output was created so that conclusions can be drawn from the data. Current approaches for workflows and provenance systems are all at the user level and have little to no system level support making them fragile, difficult to use, and incomplete solutions. The introduction of container technology is a first step towards encapsulating and tracking artifacts used in creating data and resulting insights, but their current implementation is focused solely on making it easy to deploy an application in an isolated ‘sandbox’ and maintaining a strictly read-only mode to avoid any potential changes to the application. All storage activities are still using the system-level shared storage. This project explores extending the container concept to include storage as a new container type we call \emph{data pallets}. Data Pallets are potentially writeable, auto generated by the system based on IO activities, and usable as a way to link the contained data back to the application and input deck used to create it. Data Paring The problem that needs to be more discussed is data paring. The need for this is fairly obvious: data is growing exponentially, and growing your compute data exponentially will require budgets that aren’t realistic. One of the keys to winning at Big Data will be ignoring the noise. As the amount of data increases exponentially, the amount of interesting data doesn’t; I would bet that for most purposes the interesting data added is a tiny percentage of the new data that is added to the overall pool of data. Data Partitioning Data partitioning in data mining is the division of the whole data available into two or three non overlapping sets: the training set , the validation set , and the test set. If the data set is very large, often only a portion of it is selected for the partitions. Partitioning is normally used when the model for the data at hand is being chosen from a broad set of models. The basic idea of data partitioning is to keep a subset of available data out of analysis, and to use it later for verification of the model. Data Pattern Processing Data Plumbing Data Poisoning Data Poisoning Attacks against Online Learning Data Preprocessing Data pre-processing is an important step in the data mining process. The phrase “garbage in, garbage out” is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Sex: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of data is first and foremost before running an analysis. Data Profiling Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to: 1. Find out whether existing data can easily be used for other purposes 2. Improve the ability to search the data by tagging it with keywords, descriptions, or assigning it to a category 3. Give metrics on data quality including whether the data conforms to particular standards or patterns 4. Assess the risk involved in integrating data for new applications, including the challenges of joins 5. Assess whether metadata accurately describes the actual values in the source database 6. Understanding data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns. 7. Have an enterprise view of all data, for uses such as master data management where key data is needed, or data governance for improving data quality. Data Reuse Efficiency Recurrent neural networks (RNNs) have shown state of the art results for speech recognition, natural language processing, image captioning and video summarizing applications. Many of these applications run on low-power platforms, so their energy efficiency is extremely important. We observed that cache-oblivious RNN scheduling during inference typically results in 30-50x more data transferred on and off the CPU than the application’s working set size. This can potentially impact its energy efficiency. This paper presents a new metric called Data Reuse Efficiency to gauge the RNN scheduling efficiency of a platform and shows the factors that influence the DRE value. Additionally, this paper discusses an optimization to improve reuse in RNNs and highlights the positive impact of this optimization on the total amount of memory read from or written to the memory controller (and, hence, the DRE value) during the execution of an RNN application for a mobile SoC. Data Science Data science is a buzz word reflecting the application of statistics by advances in computer science. Data science is the study of the generalizable extraction of knowledge from data, yet the key word is science. It incorporates varying elements and builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products. Data Science is not restricted to only big data, although the fact that data is scaling up makes big data an important aspect of data science. Data Science Maturity Model(DSMM) Many organizations have been underwhelmed by the return on their investment in data science. This is due to a narrow focus on tools, rather than a broader consideration of how data science teams work and how they fit within the larger organization. To help data science practitioners and leaders identify their existing gaps and direct future investment, Domino has developed a framework called the Data Science Maturity Model (DSMM). The DSMM assesses how reliably and sustainably a data science team can deliver value for their organization. The model consists of four levels of maturity and is split along five dimensions that apply to all analytical organizations. By design, the model is not specific to any given industry – it applies as much to an insurance company as it does to a manufacturer. Data Science Virtual Machine(DSVM) The Data Science Virtual Machine runs on Windows Server 2012 and contains popular tools for data exploration, modeling and development activities. The main tools included are Microsoft R Server Developer Edition (An enterprise ready scalable R framework), Anaconda Python distribution, Julia Pro developer edition, Jupyter notebooks for R, Python and Julia, Visual Studio Community Edition with Python, R and node.js tools, Power BI desktop, SQL Server 2016 Developer edition including support In-Database analytics using Microsoft R Server. It also includes open source deep learning tools like Microsoft Cognitive Toolkit (CNTK 2.0) and mxnet; ML algorithms like xgboost, Vowpal Wabbit. The Azure SDK and libraries on the VM allows you to build your applications using various services in the cloud that are part of the Cortana Analytics Suite which includes Azure Machine Learning, Azure data factory, Stream Analytics and SQL Datawarehouse, Hadoop, Data Lake, Spark and more. You can deploy models as web services in the cloud on Azure Machine Learning OR deploy them either on the cloud or on-premises using the Microsoft R Server operationalization. Data Science-as-a-Service(DSaaS) Data Science as a Service is Analyze’s unique approach to providing today’s business and government with the most advanced and efficient big data and data science analytics. With more than 25 years combined experience in data science and cybersecurity, Analyze helps organizations stay ahead of the competition, increase revenue, and improve operational efficiency. Data Shapley As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what is an equitable valuation for individual data. In this work, we develop a principled framework to address data valuation in the context of supervised machine learning. Given a learning algorithm trained on $n$ data points to produce a predictor, we propose data Shapley as a metric to quantify the value of each training datum to the predictor performance. Data Shapley uniquely satisfies several natural properties of equitable data valuation. We develop Monte Carlo and gradient-based methods to efficiently estimate data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. In addition to being equitable, extensive experiments across biomedical, image and synthetic data demonstrate that data Shapley has several other benefits: 1) it is more powerful than the popular leave-one-out or leverage score in providing insight on what data is more valuable for a given learning task; 2) low Shapley value data effectively capture outliers and corruptions; 3) high Shapley value data inform what type of new data to acquire to improve the predictor. Data Shared Adaptive Bootstrap Aggregated(AdaBag) In this article we propose a new supervised ensemble learning method called Data Shared Adaptive Bootstrap Aggregated (AdaBag) Lasso for capturing low dimensional useful features for word based sentiment analysis and mining problems. The literature on ensemble methods is very rich in both statistics and machine learning. The algorithm is a substantial upgrade of the Data Shared Lasso uplift algorithm. The most significant conceptual addition to the existing literature lies in the final selection of bag of predictors through a special bootstrap aggregation scheme. We apply the algorithm to one simulated data and perform dimension reduction in grouped IMDb data (drama, comedy and horror) to extract reduced set of word features for predicting sentiment ratings of movie reviews demonstrating different aspects. We also compare the performance of the present method with the classical Principal Components with associated Linear Discrimination (PCA-LD) as baseline. There are few limitations in the algorithm. Firstly, the algorithm workflow does not incorporate online sequential data acquisition and it does not use sentence based models which are common in ANN algorithms . Our results produce slightly higher error rate compare to the reported state-of-the-art as a consequence. Data Sketch Data sketches are approximate succinct summaries of long streams. They are widely used for processing massive amounts of data and answering statistical queries about it in real-time. Existing libraries producing sketches are very fast, but do not allow parallelism for creating sketches using multiple threads or querying them while they are being built. Data Standardization When approaching data for modeling, some standard procedures should be used to prepare the data for modeling: 1.First the data should be filtered, and any outliers removed from the data (watch for a future post on how to scrub your raw data removing only legitimate outliers). 2.The data should be normalized or standardized to bring all of the variables into proportion with one another. For example, if one variable is 100 times larger than another (on average), then your model may be better behaved if you normalize/standardize the two variables to be approximately equivalent. Technically though, whether normalized/standardized, the coefficients associated with each variable will scale appropriately to adjust for the disparity in the variable sizes. Data Stewardship In metadata, a data steward is a person that is responsible for maintaining a data element in a metadata registry. A data steward is a broad job role that incorporates processes, policies, guidelines and responsibilities for administering organizations’ entire data in compliance with business and/or regulatory obligations. A data steward’s responsibility stems from an understanding of the business domain and the interaction of business processes with data entities/elements. A data steward ensures that there are documented procedures and guidelines for data access and use. A data steward may share some responsibilities with a data custodian, and work with database/warehouse administrators and other related staff to plan and execute an enterprise-wide data governance, control and compliance policy. Data stewardship roles are common when organizations are attempting to exchange data precisely and consistently between computer systems and reuse data-related resources. Master data management often makes references to the need for data stewardship for its implementation to succeed. Data Stream Mining Data Stream Mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that in many applications of data stream mining can be read only once or a small number of times using limited computing and storage capabilities. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery. In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances in the data stream. Machine learning techniques can be used to learn this prediction task from labeled examples in an automated fashion. Often, concepts from the field of incremental learning, a generalization of Incremental heuristic search are applied to cope with structural changes, on-line learning and real-time demands. In many applications, especially operating within non-stationary environments, the distribution underlying the instances or the rules underlying their labeling may change over time, i.e. the goal of the prediction, the class to be predicted or the target value to be predicted, may change over time. This problem is referred to as concept drift. Data Structure Graph A Data Structure Graph is a group of atomic entities that are related to each other, stored in a repository, then moved from one persistence layer to another, rendered as a Graph. Data Transfer Project(DTP) The Data Transfer Project was formed in 2017 to create an open-source, service-to-service data portability platform so that all individuals across the web could easily move their data between online service providers whenever they want. The contributors to the Data Transfer Project believe portability and interoperability are central to innovation. Making it easier for individuals to choose among services facilitates competition, empowers individuals to try new services and enables them to choose the offering that best suits their needs. Data Transfer Project (DTP) is a collaboration of organizations committed to building a common framework with open-source code that can connect any two online service providers, enabling a seamless, direct, user initiated portability of data between the two platforms. The Data Transfer Project uses services´ existing APIs and authorization mechanisms to access data. It then uses service specific adapters to transfer that data into a common format, and then back into the new service´s API. Data Understanding In the field of machine learning, data understanding is the practice of getting initial insights in unknown datasets. Such knowledge-intensive tasks require a lot of documentation, which is necessary for data scientists to grasp the meaning of the data. Usually, documentation is separate from the data in various external documents, diagrams, spreadsheets and tools which causes considerable look up overhead. Moreover, other supporting applications are not able to consume and utilize such unstructured data. That is why we propose a methodology that uses a single semantic model that interlinks data with its documentation. Hence, data scientists are able to directly look up the connected information about the data by simply following links. Equally, they can browse the documentation which always refers to the data. Furthermore, the model can be used by other approaches providing additional support, like searching, comparing, integrating or visualizing data. To showcase our approach we also demonstrate an early prototype. Data Version Control(DVC) DVC makes your data science projects reproducible by automatically building data dependency graph (DAG). Your code and the dependencies could be easily shared by Git, and data – through cloud storage (AWS S3, GCP) in a single DVC environment. Data Visualization Data visualization or data visualisation is viewed by many disciplines as a modern equivalent of visual communication. It is not owned by any one field, but rather finds interpretation across many (e.g. it is viewed as a modern branch of descriptive statistics by some, but also as a grounded theory development tool by others). It involves the creation and study of the visual representation of data, meaning ‘information that has been abstracted in some schematic form, including attributes or variables for the units of information’. A primary goal of data visualization is to communicate information clearly and efficiently to users via the information graphics selected, such as tables and charts. Effective visualization helps users in analyzing and reasoning about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design principle of the graphic (i.e., showing comparisons or showing causality) follows the task. Tables are generally used where users will look-up a specific measure of a variable, while charts of various types are used to show patterns or relationships in the data for one or more variables. Data visualization is both an art and a science. The rate at which data is generated has increased, driven by an increasingly information-based economy. Data created by internet activity and an expanding number of sensors in the environment, such as satellites and traffic cameras, are referred to as ‘Big Data’. Processing, analyzing and communicating this data present a variety of ethical and analytical challenges for data visualization. The field of data science and practitioners called data scientists have emerged to help address this challenge. Data Warehouse(DW) In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a database used for reporting and data analysis. Integrating data from one or more disparate sources creates a central repository of data, a data warehouse (DW). Data warehouses store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons. Data2Vis Rapidly creating effective visualizations using expressive grammars is challenging for users who have limited time and limited skills in statistics and data visualization. Even high-level, dedicated visualization tools often require users to manually select among data attributes, decide which transformations to apply, and specify mappings between visual encoding variables and raw or transformed attributes. In this paper, we introduce Data2Vis, a neural translation model, for automatically generating visualizations from given datasets. We formulate visualization generation as a sequence to sequence translation problem where data specification is mapped to a visualization specification in a declarative language (Vega-Lite). To this end, we train a multilayered Long Short-Term Memory (LSTM) model with attention on a corpus of visualization specifications. Qualitative results show that our model learns the vocabulary and syntax for a valid visualization specification, appropriate transformations (count, bins, mean) and how to use common data selection patterns that occur within data visualizations. Our model generates visualizations that are comparable to manually-created visualizations in a fraction of the time, with potential to learn more complex visualization strategies at scale. Data-Adaptive Nonparametric Kernel(DANK) Traditional kernels or their combinations are often not sufficiently flexible to fit the data in complicated practical tasks. In this paper, we present a Data-Adaptive Nonparametric Kernel (DANK) learning framework by imposing an adaptive matrix on the kernel/Gram matrix in an entry-wise strategy. Since we do not specify the formulation of the adaptive matrix, each entry in it can be directly and flexibly learned from the data. Therefore, the solution space of the learned kernel is largely expanded, which makes DANK flexible to adapt to the data. Specifically, the proposed kernel learning framework can be seamlessly embedded to support vector machines (SVM) and support vector regression (SVR), which has the capability of enlarging the margin between classes and reducing the model generalization error. Theoretically, we demonstrate that the objective function of our devised model is gradient-Lipschitz continuous. Thereby, the training process for kernel and parameter learning in SVM/SVR can be efficiently optimized in a unified framework. Further, to address the scalability issue in DANK, a decomposition-based scalable approach is developed, of which the effectiveness is demonstrated by both empirical studies and theoretical guarantees. Experimentally, our method outperforms other representative kernel learning based algorithms on various classification and regression benchmark datasets. Data-as-a-Service(DaaS) Data as a Service, or DaaS, is a cousin of software as a service. Like all members of the “as a Service” (aaS) family, DaaS is based on the concept that the product, data in this case, can be provided on demand to the user regardless of geographic or organizational separation of provider and consumer. Additionally, the emergence of service-oriented architecture (SOA) has rendered the actual platform on which the data resides also irrelevant. This development has enabled the recent emergence of the relatively new concept of DaaS. Data provided as a service was at first primarily used in Web mashups, but now is being increasingly employed both commercially and, less commonly, within organisations such as the UN. Data-Dependent Upsampling(DUpsampling) Recent semantic segmentation methods exploit encoder-decoder architectures to produce the desired pixel-wise segmentation prediction. The last layer of the decoders is typically a bilinear upsampling procedure to recover the final pixel-wise prediction. We empirically show that this oversimple and data-independent bilinear upsampling may lead to sub-optimal results. In this work, we propose a data-dependent upsampling (DUpsampling) to replace bilinear, which takes advantages of the redundancy in the label space of semantic segmentation and is able to recover the pixel-wise prediction from low-resolution outputs of CNNs. The main advantage of the new upsampling layer lies in that with a relatively lower-resolution feature map such as $\frac{1}{16}$ or $\frac{1}{32}$ of the input size, we can achieve even better segmentation accuracy, significantly reducing computation complexity. This is made possible by 1) the new upsampling layer’s much improved reconstruction capability; and more importantly 2) the DUpsampling based decoder’s flexibility in leveraging almost arbitrary combinations of the CNN encoders’ features. Experiments demonstrate that our proposed decoder outperforms the state-of-the-art decoder, with only $\sim$20\% of computation. Finally, without any post-processing, the framework equipped with our proposed decoder achieves new state-of-the-art performance on two datasets: 88.1\% mIOU on PASCAL VOC with 30\% computation of the previously best model; and 52.5\% mIOU on PASCAL Context. Data-Driven Design The aim of this research is to introduce a novel structural design process that allows architects and engineers to extend their typical design space horizon and thereby promoting the idea of creativity in structural design. The theoretical base of this work builds on the combination of structural form-finding and state-of-the-art machine learning algorithms. In the first step of the process, Combinatorial Equilibrium Modelling (CEM) is used to generate a large variety of spatial networks in equilibrium for given input parameters. In the second step, these networks are clustered and represented in a form-map through the implementation of a Self Organizing Map (SOM) algorithm. In the third step, the solution space is interpreted with the help of a Uniform Manifold Approximation and Projection algorithm (UMAP). This allows gaining important insights in the structure of the solution space. A specific case study is used to illustrate how the infinite equilibrium states of a given topology can be defined and represented by clusters. Furthermore, three classes, related to the non-linear interaction between the input parameters and the form space, are verified and a statement about the entire manifold of the solution space of the case study is made. To conclude, this work presents an innovative approach on how the manifold of a solution space can be grasped with a minimum amount of data and how to operate within the manifold in order to increase the diversity of solutions. Data-Driven Robust Model Predictive Control(DDRMPC) We develop a novel data-driven robust model predictive control (DDRMPC) approach for automatic control of irrigation systems. The fundamental idea is to integrate both mechanistic models, which describe dynamics in soil moisture variations, and data-driven models, which characterize uncertainty in forecast errors of evapotranspiration and precipitation, into a holistic systems control framework. To better capture the support of uncertainty distribution, we take a new learning-based approach by constructing uncertainty sets from historical data. For evapotranspiration forecast error, the support vector clustering-based uncertainty set is adopted, which can be conveniently built from historical data. As for precipitation forecast errors, we analyze the dependence of their distribution on forecast values, and further design a tailored uncertainty set based on the properties of this type of uncertainty. In this way, the overall uncertainty distribution can be elaborately described, which finally contributes to rational and efficient control decisions. To assure the quality of data-driven uncertainty sets, a training-calibration scheme is used to provide theoretical performance guarantees. A generalized affine decision rule is adopted to obtain tractable approximations of optimal control problems, thereby ensuring the practicability of DDRMPC. Case studies using real data show that, DDRMPC can reliably maintain soil moisture above the safety level and avoid crop devastation. The proposed DDRMPC approach leads to a 40\% reduction of total water consumption compared to the fine-tuned open-loop control strategy. In comparison with the carefully tuned rule-based control and certainty equivalent model predictive control, the proposed DDRMPC approach can significantly reduce the total water consumption and improve the control performance. Data-Driven Threshold Machine(DTM) We present a novel distribution-free approach, the data-driven threshold machine (DTM), for a fundamental problem at the core of many learning tasks: choose a threshold for a given pre-specified level that bounds the tail probability of the maximum of a (possibly dependent but stationary) random sequence. We do not assume data distribution, but rather relying on the asymptotic distribution of extremal values, and reduce the problem to estimate three parameters of the extreme value distributions and the extremal index. We specially take care of data dependence via estimating extremal index since in many settings, such as scan statistics, change-point detection, and extreme bandits, where dependence in the sequence of statistics can be significant. Key features of our DTM also include robustness and the computational efficiency, and it only requires one sample path to form a reliable estimate of the threshold, in contrast to the Monte Carlo sampling approach which requires drawing a large number of sample paths. We demonstrate the good performance of DTM via numerical examples in various dependent settings. Data-Enabled Predictive Control(DeePC) We consider the problem of optimal trajectory tracking for unknown systems. A novel data-enabled predictive control (DeePC) algorithm is presented that computes optimal and safe control policies using real-time feedback driving the unknown system along a desired trajectory while satisfying system constraints. Using a finite number of data samples from the unknown system, our proposed algorithm uses a behavioural systems theory approach to learn a non-parametric system model used to predict future trajectories. The DeePC algorithm is shown to be equivalent to the classical and widely adopted Model Predictive Control (MPC) algorithm in the case of deterministic linear time-invariant systems. In the case of nonlinear stochastic systems, we propose regularizations to the DeePC algorithm. Simulations are provided to illustrate performance and compare the algorithm with other methods. Dataflow Matrix Machine(DMM) Dataflow matrix machines generalize neural nets by replacing streams of numbers with streams of vectors (or other kinds of linear streams admitting a notion of linear combination of several streams) and adding a few more changes on top of that, namely arbitrary input and output arities for activation functions, countable-sized networks with finite dynamically changeable active part capable of unbounded growth, and a very expressive self-referential mechanism. While recurrent neural networks are Turing-complete, they form an esoteric programming platform, not conductive for practical general-purpose programming. Dataflow matrix machines are more suitable as a general-purpose programming platform, although it remains to be seen whether this platform can be made fully competitive with more traditional programming platforms currently in use. At the same time, dataflow matrix machines retain the key property of recurrent neural networks: programs are expressed via matrices of real numbers, and continuous changes to those matrices produce arbitrarily small variations in the programs associated with those matrices. Spaces of vector-like elements are of particular importance in this context. In particular, we focus on the vector space $V$ of finite linear combinations of strings, which can be also understood as the vector space of finite prefix trees with numerical leaves, the vector space of ‘mixed rank tensors’, or the vector space of recurrent maps. This space, and a family of spaces of vector-like elements derived from it, are sufficiently expressive to cover all cases of interest we are currently aware of, and allow a compact and streamlined version of dataflow matrix machines based on a single space of vector-like elements and variadic neurons. We call elements of these spaces V-values. Their role in our context is somewhat similar to the role of S-expressions in Lisp. Dataflow Reuse Algorithms Distributed Stream Processing Systems (DSPS) like Apache Storm and Spark Streaming enable composition of continuous dataflows that execute persistently over data streams. They are used by Internet of Things (IoT) applications to analyze sensor data from Smart City cyber-infrastructure, and make active utility management decisions. As the ecosystem of such IoT applications that leverage shared urban sensor streams continue to grow, applications will perform duplicate pre-processing and analytics tasks. This offers the opportunity to collaboratively reuse the outputs of overlapping dataflows, thereby improving the resource efficiency. In this paper, we propose \emph{dataflow reuse algorithms} that given a submitted dataflow, identifies the intersection of reusable tasks and streams from a collection of running dataflows to form a \emph{merged dataflow}. Similar algorithms to unmerge dataflows when they are removed are also proposed. We implement these algorithms for the popular Apache Storm DSPS, and validate their performance and resource savings for 35 synthetic dataflows based on public OPMW workflows with diverse arrival and departure distributions, and on 21 real IoT dataflows from RIoTBench. Dataflow-Flavored Model of Computation(MoC) The majority of contemporary mobile devices and personal computers are based on heterogeneous computing platforms that consist of a number of CPU cores and one or more Graphics Processing Units (GPUs). Despite the high volume of these devices, there are few existing programming frameworks that target full and simultaneous utilization of all CPU and GPU devices of the platform. This article presents a dataflow-flavored Model of Computation (MoC) that has been developed for deploying signal processing applications to heterogeneous platforms. The presented MoC is dynamic and allows describing applications with data dependent run-time behavior. On top of the MoC, formal design rules are presented that enable application descriptions to be simultaneously dynamic and decidable. Decidability guarantees compile-time application analyzability for deadlock freedom and bounded memory. The presented MoC and the design rules are realized in a novel Open Source programming environment ‘PRUNE’ and demonstrated with representative application examples from the domains of image processing, computer vision and wireless communications. Experimental results show that the proposed approach outperforms the state-of-the-art in analyzability, flexibility and performance. Data-free Automatic Acceleration of Convolutional Networks(DAC) Deploying a deep learning model on mobile/IoT devices is a challenging task. The difficulty lies in the trade-off between computation speed and accuracy. A complex deep learning model with high accuracy runs slowly on resource-limited devices, while a light-weight model that runs much faster loses accuracy. In this paper, we propose a novel decomposition method, namely DAC, that is capable of factorizing an ordinary convolutional layer into two layers with much fewer parameters. DAC computes the corresponding weights for the newly generated layers directly from the weights of the original convolutional layer. Thus, no training (or fine-tuning) or any data is needed. The experimental results show that DAC reduces a large number of floating-point operations (FLOPs) while maintaining high accuracy of a pre-trained model. If 2% accuracy drop is acceptable, DAC saves 53% FLOPs of VGG16 image classification model on ImageNet dataset, 29% FLOPS of SSD300 object detection model on PASCAL VOC2007 dataset, and 46% FLOPS of a multi-person pose estimation model on Microsoft COCO dataset. Compared to other existing decomposition methods, DAC achieves better performance. DataJoint The relational data model offers unrivaled rigor and precision in defining data structure and querying complex data. Yet the use of relational databases in scientific data pipelines is limited due to their perceived unwieldiness. We propose a simplified and conceptually refined relational data model named DataJoint. The model includes a language for schema definition, a language for data queries, and diagramming notation for visualizing entities and relationships among them. The model adheres to the principle of entity normalization, which requires that all data — both stored and derived — must be represented by well-formed entity sets. DataJoint’s data query language is an algebra on entity sets with five operators that provide matching capabilities to those of other relational query languages with greater clarity due to entity normalization. Practical implementations of DataJoint have been adopted in neuroscience labs for fluent interaction with scientific data pipelines. Datalog Datalog is a declarative logic programming language that syntactically is a subset of Prolog. It is often used as a query language for deductive databases. In recent years, Datalog has found new application in data integration, information extraction, networking, program analysis, security, and cloud computing. Its origins date back to the beginning of logic programming, but it became prominent as a separate area around 1977 when Hervé Gallaire and Jack Minker organized a workshop on logic and databases. David Maier is credited with coining the term Datalog. DataMarket DataOps DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics. While DataOps began as a set of best practices, it has now matured to become a new and independent approach to data analytics. DataOps applies to the entire data lifecycle from data preparation to reporting, and recognizes the interconnected nature of the data analytics team and information technology operations. From a process and methodology perspective, DataOps applies Agile software development, DevOps software development practices and the statistical process control used in lean manufacturing, to data analytics. In DataOps, development of new analytics is streamlined using Agile software development, an iterative project management methodology that replaces the traditional Waterfall sequential methodology. Studies show that software development projects complete significantly faster and with far fewer defects when Agile Development is used. The Agile methodology is particularly effective in environments where requirements are quickly evolving – a situation well known to data analytics professionals. DevOps focuses on continuous delivery by leveraging on-demand IT resources and by automating test and deployment of analytics. This merging of software development and IT operations has improved velocity, quality, predictability and scale of software engineering and deployment. Borrowing methods from DevOps, DataOps seeks to bring these same improvements to data analytics. Like lean manufacturing, DataOps utilizes statistical process control (SPC) to monitor and control the data analytics pipeline. With SPC in place, the data flowing through an operational system is constantly monitored and verified to be working. If an anomaly occurs, the data analytics team can be notified through an automated alert. DataOps is not tied to a particular technology, architecture, tool, language or framework. Tools that support DataOps promote collaboration, orchestration, agility, quality, security, access and ease of use. Datar Various tools, softwares and systems are proposed and implemented to tackle the challenges in big data on different emphases, e.g., data analysis, data transaction, data query, data storage, data visualization, data privacy. In this paper, we propose datar, a new prospective and unified framework for Big Data Management System (BDMS) from the point of system architecture by leveraging ideas from mainstream computer structure. We introduce five key components of datar by reviewing the current status of BDMS. Datar features with configuration chain of pluggable engines, automatic dataflow on job pipelines, intelligent self-driving system management and interactive user interfaces. Moreover, we present biggy as an implementation of datar with manipulation details demonstrated by four running examples. Evaluations on efficiency and scalability are carried out to show the performance. Our work argues that the envisioned datar is a feasible solution to the unified framework of BDMS, which can manage big data pluggablly, automatically and intelligently with specific functionalities, where specific functionalities refer to input, storage, computation, control and output of big data. Dataset Culling Real-time CNN based object detection models for applications like surveillance can achieve high accuracy but require extensive computations. Recent work has shown 10 to 100x reduction in computation cost with domain-specific network settings. However, this prior work focused on inference only: if the domain network requires frequent retraining, training and retraining costs can be a significant bottleneck. To address training costs, we propose Dataset Culling: a pipeline to significantly reduce the required training dataset size for domain specific models. Dataset Culling reduces the dataset size by filtering out non-essential data for train-ing, and reducing the size of each image until detection degrades. Both of these operations use a confusion loss metric which enables us to execute the culling with minimal computation overhead. On a custom long-duration dataset, we show that Dataset Culling can reduce the training costs 47x with no accuracy loss or even with slight improvements. Codes are available: https://…/DatasetCulling Dataset Distillation Model distillation aims to distill the knowledge of a complex model into a simpler one. In this paper, we consider an alternative formulation called {\em dataset distillation}: we keep the model fixed and instead attempt to distill the knowledge from a large training dataset into a small one. The idea is to {\em synthesize} a small number of data points that do not need to come from the correct data distribution, but will, when given to the learning algorithm as training data, approximate the model trained on the original data. For example, we show that it is possible to compress $60,000$ MNIST training images into just $10$ synthetic {\em distilled images} (one per class) and achieve close to original performance with only a few steps of gradient descent, given a particular fixed network initialization. We evaluate our method in a wide range of initialization settings and with different learning objectives. Experiments on multiple datasets show the advantage of our approach compared to alternative methods in most settings. Datasheets for Datasets Currently there is no standard way to identify how a dataset was created, and what characteristics, motivations, and potential skews it represents. To begin to address this issue, we propose the concept of a datasheet for datasets, a short document to accompany public datasets, commercial APIs, and pretrained models. The goal of this proposal is to enable better communication between dataset creators and users, and help the AI community move toward greater transparency and accountability. By analogy, in computer hardware, it has become industry standard to accompany everything from the simplest components (e.g., resistors), to the most complex microprocessor chips, with datasheets detailing standard operating characteristics, test results, recommended usage, and other information. We outline some of the questions a datasheet for datasets should answer. These questions focus on when, where, and how the training data was gathered, its recommended use cases, and, in the case of human-centric datasets, information regarding the subjects’ demographics and consent as applicable. We develop prototypes of datasheets for two well-known datasets: Labeled Faces in The Wild~\cite{lfw} and the Pang \& Lee Polarity Dataset~\cite{polarity}. Data-to-Decisions(D2D) Datification / Datafication A concept that tracks the conception, development, storage and marketing of all types of data, both for business and life. It has grown in popularity of late to capture how data measures things and organizations in order to compete and win. It is about making business visible. Datmo Workflow tools to help you experiment, deploy, and scale. By data scientists, for data scientists. Datmo is an open source model tracking and reproducibility tool for developers. Features • One command environment setup (languages, frameworks, packages, etc) • Tracking and logging for model config and results • Project versioning (model state tracking) • Experiment reproducibility (re-run tasks) • Visualize + export experiment history Dato Dato (formerly known as GraphLab): Your app drives business. From inspiration to production, build intelligent apps fast with the power of Dato’s machine learning platform. Data science at scale has never been easier. Dawid-Skene Algorithm(DSA) More and more online communities classify contributions based on collaborative ratings of these contributions. A popular method for such a rating-based classification is the Dawid-Skene algorithm (DSA). However, despite its popularity, DSA has two major shortcomings: (1) It is vulnerable to raters with a low competence, i.e., a low probability of rating correctly. (2) It is defenseless against collusion attacks. In a collusion attack, raters coordinate to rate the same data objects with the same value to artificially increase their remuneration. Error Rate Analysis of Labeling by Crowdsourcing Fast Dawid-Skene DAWNBENCH DAWNBench is a benchmark suite for end-to-end deep learning training and inference. Computation time and cost are critical resources in building deep models, yet many existing benchmarks focus solely on model accuracy. DAWNBench provides a reference set DBSCAN++ DBSCAN is a classical density-based clustering procedure which has had tremendous practical relevance. However, it implicitly needs to compute the empirical density for each sample point, leading to a quadratic worst-case time complexity, which may be too slow on large datasets. We propose DBSCAN++, a simple modification of DBSCAN which only requires computing the densities for a subset of the points. We show empirically that, compared to traditional DBSCAN, DBSCAN++ can provide not only competitive performance but also added robustness in the bandwidth hyperparameter while taking a fraction of the runtime. We also present statistical consistency guarantees showing the trade-off between computational cost and estimation rates. Surprisingly, up to a certain point, we can enjoy the same estimation rates while lowering computational cost, showing that DBSCAN++ is a sub-quadratic algorithm that attains minimax optimal rates for level-set estimation, a quality that may be of independent interest. DC.js dc.js is a javascript charting library with native crossfilter support and allowing highly efficient exploration on large multi-dimensional dataset (inspired by crossfilter’s demo). It leverages d3 engine to render charts in css friendly svg format. Charts rendered using dc.js are naturally data driven and reactive therefore providing instant feedback on user’s interaction. The main objective of this project is to provide an easy yet powerful javascript library which can be utilized to perform data visualization and analysis in browser as well as on mobile device. D-Calibration ➘ “Individual Survival Distribution” DCDistance Text Mining is a field that aims at extracting information from textual data. One of the challenges of such field of study comes from the pre-processing stage in which a vector (and structured) representation should be extracted from unstructured data. The common extraction creates large and sparse vectors representing the importance of each term to a document. As such, this usually leads to the curse-of-dimensionality that plagues most machine learning algorithms. To cope with this issue, in this paper we propose a new supervised feature extraction and reduction algorithm, named DCDistance, that creates features based on the distance between a document to a representative of each class label. As such, the proposed technique can reduce the features set in more than 99% of the original set. Additionally, this algorithm was also capable of improving the classification accuracy over a set of benchmark datasets when compared to traditional and state-of-the-art features selection algorithms. DCM Bandits Search engines recommend a list of web pages. The user examines this list, from the first page to the last, and may click on multiple attractive pages. This type of user behavior can be modeled by the \emph{dependent click model (DCM)}. In this work, we propose \emph{DCM bandits}, an online learning variant of the DCM model where the objective is to maximize the probability of recommending a satisfactory item. The main challenge of our problem is that the learning agent does not observe the reward. It only observes the clicks. This imbalance between the feedback and rewards makes our setting challenging. We propose a computationally-efficient learning algorithm for our problem, which we call dcmKL-UCB; derive gap-dependent upper bounds on its regret under reasonable assumptions; and prove a matching lower bound up to logarithmic factors. We experiment with dcmKL-UCB on both synthetic and real-world problems. Our algorithm outperforms a range of baselines and performs well even when our modeling assumptions are violated. To the best of our knowledge, this is the first regret-optimal online learning algorithm for learning to rank with multiple clicks in a cascade-like model. DCSVM We present DCSVM, an efficient algorithm for multi-class classification using Support Vector Machines. DCSVM is a divide and conquer algorithm which relies on data sparsity in high dimensional space and performs a smart partitioning of the whole training data set into disjoint subsets that are easily separable. A single prediction performed between two partitions eliminates at once one or more classes in one partition, leaving only a reduced number of candidate classes for subsequent steps. The algorithm continues recursively, reducing the number of classes at each step, until a final binary decision is made between the last two classes left in the competition. In the best case scenario, our algorithm makes a final decision between $k$ classes in $O(\log k)$ decision steps and in the worst case scenario DCSVM makes a final decision in $k-1$ steps, which is not worse than the existent techniques. DDRL-AM Learning powerful discriminative features for remote sensing image scene classification is a challenging computer vision problem. In the past, most classification approaches were based on handcrafted features. However, most recent approaches to remote sensing scene classification are based on Convolutional Neural Networks (CNNs). The de facto practice when learning these CNN models is only to use original RGB patches as input with training performed on large amounts of labeled data (ImageNet). In this paper, we show class activation map (CAM) encoded CNN models, codenamed DDRL-AM, trained using original RGB patches and attention map based class information provide complementary information to the standard RGB deep models. To the best of our knowledge, we are the first to investigate attention information encoded CNNs. Additionally, to enhance the discriminability, we further employ a recently developed object function called ‘center loss,’ which has proved to be very useful in face recognition. Finally, our framework provides attention guidance to the model in an end-to-end fashion. Extensive experiments on two benchmark datasets show that our approach matches or exceeds the performance of other methods. De Bruijn Entropy De Bruijn entropy and string similarity Deadline-Aware Task rEplication for Vehicular Cloud(DATE-V) Vehicular Cloud Computing (VCC) is a new technological shift which exploits the computation and storage resources on vehicles for computational service provisioning. Spare on-board resources are pooled by a VCC operator, e.g. a roadside unit, to complete task requests using the vehicle-as-a-resource framework. In this paper, we investigate timely service provisioning for deadline-constrained tasks in VCC systems by leveraging the task replication technique (i.e., allowing one task to be executed by several server vehicles). A learning-based algorithm, called DATE-V (Deadline-Aware Task rEplication for Vehicular Cloud), is proposed to address the special issues in VCC systems including uncertainty of vehicle movements, volatile vehicle members, and large vehicle population. The proposed algorithm is developed based on a novel Contextual-Combinatorial Multi-Armed Bandit (CC-MAB) learning framework. DATE-V is contextual’ because it utilizes side information (context) of vehicles and tasks to infer the completion probability of a task replication under random vehicle movements. DATE-V is combinatorial’ because it aims to replicate the received task and send the task replications to multiple server vehicles to guarantee the service timeliness. We rigorously prove that our learning algorithm achieves a sublinear regret bound compared to an oracle algorithm that knows the exact completion probability of any task replications. Simulations are carried out based on real-world vehicle movement traces and the results show that DATE-V significantly outperforms benchmark solutions. Debagging It is easy to convert a sentence into a bag of words, but it is much harder to convert a bag of words into a meaningful sentence. We name the latter the debagging problem. De-Biased Sparse PCA Sparse principal component analysis (sPCA) has become one of the most widely used techniques for dimensionality reduction in high-dimensional datasets. The main challenge underlying sPCA is to estimate the first vector of loadings of the population covariance matrix, provided that only a certain number of loadings are non-zero. In this paper, we propose confidence intervals for individual loadings and for the largest eigenvalue of the population covariance matrix. Given an independent sample $X^i \in\mathbb R^p, i = 1,…,n,$ generated from an unknown distribution with an unknown covariance matrix $\Sigma_0$, our aim is to estimate the first vector of loadings and the largest eigenvalue of $\Sigma_0$ in a setting where $p\gg n$. Next to the high-dimensionality, another challenge lies in the inherent non-convexity of the problem. We base our methodology on a Lasso-penalized M-estimator which, despite non-convexity, may be solved by a polynomial-time algorithm such as coordinate or gradient descent. We show that our estimator achieves the minimax optimal rates in $\ell_1$ and $\ell_2$-norm. We identify the bias in the Lasso-based estimator and propose a de-biased sparse PCA estimator for the vector of loadings and for the largest eigenvalue of the covariance matrix $\Sigma_0$. Our main results provide theoretical guarantees for asymptotic normality of the de-biased estimator. The major conditions we impose are sparsity in the first eigenvector of small order $\sqrt{n}/\log p$ and sparsity of the same order in the columns of the inverse Hessian matrix of the population risk. Decentralized Elimination We consider the \textit {decentralized exploration problem}: a set of players collaborate to identify the best arm by asynchronously interacting with the same stochastic environment. The objective is to insure privacy in the best arm identification problem between asynchronous, collaborative, and thrifty players. In the context of a digital service, we advocate that this decentralized approach allows a good balance between the interests of users and those of service providers: the providers optimize their services, while protecting the privacy of the users and saving resources. We define the privacy level as the amount of information an adversary could infer by intercepting the messages concerning a single user. We provide a generic algorithm {\sc Decentralized Elimination}, which uses any best arm identification algorithm as a subroutine. We prove that this algorithm insures privacy, with a low communication cost, and that in comparison to the lower bound of the best arm identification problem, its sample complexity suffers from a penalty depending on the inverse of the probability of the most frequent players. Finally, we propose an extension of the proposed algorithm to the non-stationary bandits. Experiments illustrate and complete the analysis. Decentralized Exploration Problem We consider the \textit {decentralized exploration problem}: a set of players collaborate to identify the best arm by asynchronously interacting with the same stochastic environment. The objective is to insure privacy in the best arm identification problem between asynchronous, collaborative, and thrifty players. In the context of a digital service, we advocate that this decentralized approach allows a good balance between the interests of users and those of service providers: the providers optimize their services, while protecting the privacy of the users and saving resources. We define the privacy level as the amount of information an adversary could infer by intercepting the messages concerning a single user. We provide a generic algorithm {\sc Decentralized Elimination}, which uses any best arm identification algorithm as a subroutine. We prove that this algorithm insures privacy, with a low communication cost, and that in comparison to the lower bound of the best arm identification problem, its sample complexity suffers from a penalty depending on the inverse of the probability of the most frequent players. Finally, we propose an extension of the proposed algorithm to the non-stationary bandits. Experiments illustrate and complete the analysis. Decentralized High-Dimensional Bayesian Optimization(DEC-HBO) This paper presents a novel decentralized high-dimensional Bayesian optimization (DEC-HBO) algorithm that, in contrast to existing HBO algorithms, can exploit the interdependent effects of various input components on the output of the unknown objective function f for boosting the BO performance and still preserve scalability in the number of input dimensions without requiring prior knowledge or the existence of a low (effective) dimension of the input space. To realize this, we propose a sparse yet rich factor graph representation of f to be exploited for designing an acquisition function that can be similarly represented by a sparse factor graph and hence be efficiently optimized in a decentralized manner using distributed message passing. Despite richly characterizing the interdependent effects of the input components on the output of f with a factor graph, DEC-HBO can still guarantee no-regret performance asymptotically. Empirical evaluation on synthetic and real-world experiments (e.g., sparse Gaussian process model with 1811 hyperparameters) shows that DEC-HBO outperforms the state-of-the-art HBO algorithms. Decentralized Simultaneous Perturbation Stochastic Approximations, with Constant Sensitivity Parameters(DSPG) In this paper, we present an asynchronous approximate gradient method that is easy to implement called DSPG (Decentralized Simultaneous Perturbation Stochastic Approximations, with Constant Sensitivity Parameters). It is obtained by modifying SPSA (Simultaneous Perturbation Stochastic Approximations) to allow for decentralized optimization in multi-agent learning and distributed control scenarios. SPSA is a popular approximate gradient method developed by Spall, that is used in Robotics and Learning. In the multi-agent learning setup considered herein, the agents are assumed to be asynchronous (agents abide by their local clocks) and communicate via a wireless medium, that is prone to losses and delays. We analyze the gradient estimation bias that arises from setting the sensitivity parameters to a single value, and the bias that arises from communication losses and delays. Specifically, we show that these biases can be countered through better and frequent communication and/or by choosing a small fixed value for the sensitivity parameters. We also discuss the variance of the gradient estimator and its effect on the rate of convergence. Finally, we present numerical results supporting DSPG and the aforementioned theories and discussions. DeceptionNet We present a novel approach to tackle domain adaptation between synthetic and real data. Instead of employing ‘blind’ domain randomization, i.e. augmenting synthetic renderings with random backgrounds or changing illumination and colorization, we leverage the task network as its own adversarial guide towards useful augmentations that maximize the uncertainty of the output. To this end, we design a min-max optimization scheme where a given task competes against a special deception network, with the goal of minimizing the task error subject to specific constraints enforced by the deceiver. The deception network samples from a family of differentiable pixel-level perturbations and exploits the task architecture to find the most destructive augmentations. Unlike GAN-based approaches that require unlabeled data from the target domain, our method achieves robust mappings that scale well to multiple target distributions from source data alone. We apply our framework to the tasks of digit recognition on enhanced MNIST variants as well as classification and object pose estimation on the Cropped LineMOD dataset and compare to a number of domain adaptation approaches, demonstrating similar results with superior generalization capabilities. Decision Analysis(DA) Decision analysis (DA) is the discipline comprising the philosophy, theory, methodology, and professional practice necessary to address important decisions in a formal manner. Decision analysis includes many procedures, methods, and tools for identifying, clearly representing, and formally assessing important aspects of a decision, for prescribing a recommended course of action by applying the maximum expected utility action axiom to a well-formed representation of the decision, and for translating the formal representation of a decision and its corresponding recommendation into insight for the decision maker and other stakeholders. Decision Forest Customer behavior is often assumed to follow weak rationality, which implies that adding a product to an assortment will not increase the choice probability of another product in that assortment. However, an increasing amount of research has revealed that customers are not necessarily rational when making decisions. In this paper, we study a new nonparametric choice model that relaxes this assumption and can model a wider range of customer behavior, such as decoy effects between products. In this model, each customer type is associated with a binary decision tree, which represents a decision process for making a purchase based on checking for the existence of specific products in the assortment. Together with a probability distribution over customer types, we show that the resulting model — a decision forest — is able to represent any customer choice model, including models that are inconsistent with weak rationality. We theoretically characterize the depth of the forest needed to fit a data set of historical assortments and prove that asymptotically, a forest whose depth scales logarithmically in the number of assortments is sufficient to fit most data sets. We also propose an efficient algorithm for estimating such models from data, based on combining randomization and optimization. Using synthetic data and real transaction data exhibiting non-rational behavior, we show that the model outperforms the multinomial logit and ranking-based models in out-of-sample predictive ability. Decision Model and Notation(DMN) The primary goal of DMN is to provide an industry standard modelling notation for decision management and business rules that is readily understandable by all business users: from the business analysts who need to create initial decision requirements and then more detailed decision models, to the technical developers responsible for automating the decisions in processes, and finally, to the business people who will manage and monitor those decisions. The submission has been designed to be complementary to and useable alongside the OMG Business Process Model & Notation (BPMN) standard and will ensure that decision models are interchangeable across organizations. Decision Requirement Knowledge Bases(DKB) The Decision Model and Notation (DMN) is a recent OMG standard for the elicitation and representation of decision models, and for managing their interconnection with business processes. DMN builds on the notion of decision table, and their combination into more complex decision requirements graphs (DRGs), which bridge between business process models and decision logic models. DRGs may rely on additional, external business knowledge models, whose functioning is not part of the standard. In this work, we consider one of the most important types of business knowledge, namely background knowledge that conceptually accounts for the structural aspects of the domain of interest, and propose decision requirement knowledge bases (DKBs), where DRGs are modeled in DMN, and domain knowledge is captured by means of first-order logic with datatypes. We provide a logic-based semantics for such an integration, and formalize different DMN reasoning tasks for DKBs. We then consider background knowledge formulated as a description logic ontology with datatypes, and show how the main verification tasks for DMN in this enriched setting, can be formalized as standard DL reasoning services, and actually carried out in ExpTime. We discuss the effectiveness of our framework on a case study in maritime security. This work is under consideration in Theory and Practice of Logic Programming (TPLP). Decision Scientist Decision Scientists build decision support tools to enable decision makers to make decisions, or take action, under uncertainty with a data-centric bias. Traditional analytics falls under this domain. Often decision makers like linear solutions that provide simple, explainable, socializable decision making frameworks. That is, they are looking for a rationale. Data Scientists build machines to make decisions about large-scale complex dynamical processes that are typically too fast (velocity, veracity, volume, etc.) for a human operator/manager. They typically don’t concern themselves with whether the algorithm is explainable or socializable, but are more concerned with whether it is functional, reliable, accurate, and robust. Decision Stream Various modifications of decision trees have been extensively used during the past years due to their high efficiency and interpretability. Selection of relevant features for spitting the tree nodes is a key property of their architecture, at the same time being their major shortcoming: the recursive nodes partitioning leads to geometric reduction of data quantity in the leaf nodes, which causes an excessive model complexity and data overfitting. In this paper, we present a novel architecture – a Decision Stream, – aimed to overcome this problem. Instead of building an acyclic tree structure during the training process, we propose merging nodes from different branches based on their similarity that is estimated with two-sample test statistics. To evaluate the proposed solution, we test it on several common machine learning problems~— credit scoring, twitter sentiment analysis, aircraft flight control, MNIST and CIFAR image classification, synthetic data classification and regression. Our experimental results reveal that the proposed approach significantly outperforms the standard decision tree method on both regression and classification tasks, yielding a prediction error decrease up to 35%. Decision Stump A decision stump is a machine learning model consisting of a one-level decision tree. That is, it is a decision tree with one internal node (the root) which is immediately connected to the terminal nodes (its leaves). A decision stump makes a prediction based on the value of just a single input feature. Sometimes they are also called 1-rules. Decision Support(DS) The term Decision Support (DS) is used often and in a variety of contexts related to decision making. Recently, for example, it is often mentioned in connection with Data Warehouses and On-Line Analytical Processing (OLAP). Another recent trend is to associate DS with Data Mining. This is the case in the project SolEuNet , which attempts to exploit these two approaches in a complementary way in order to support difficult real-life problem solving. Unfortunately, although the term ‘Decision Support’ seems rather intuitive and simple, it is in fact very loosely defined. It means different things to different people and in different contexts. Also, its meaning has shifted during the recent history. Nowadays, DS is probably most often associated with Data Warehouses and OLAP. A decade ago, it was coupled with Decision Support Systems (DSS). Still before that, there was a close link with Operations Research (OR) and Decision Analysis (DA). This causes a lot of confusion and misunderstanding, and provokes requests for clarification. The confusion is further exemplified by the multitude of related terms and acronyms that are either equal to, or start with ‘DS’: Decision Support, Decision Sciences, Decision Systems, Decision Support Systems, etc. This paper attempts to clarify these issues. We take the viewpoint that Decision Support is a broad, generic term that encompasses all aspects related to supporting people in making decisions. First, we present the results of a survey of WWW documents related to DS. On this basis, and on the basis of relevant literature and our previous experience in the field of DS, we provide a classification of DS and related disciplines. DS itself is given a role within Decision Making and Decision Sciences. Some most prominent DS disciplines are briefly overviewed: Operations Research, Decision Analysis, Decision Support Systems, Data Warehousing and OLAP, and Group Decision Support. Decision Support System(DSS) A Decision Support System (DSS) is a computer-based information system that supports business or organizational decision-making activities. DSSs serve the management, operations, and planning levels of an organization (usually mid and higher management) and help to make decisions, which may be rapidly changing and not easily specified in advance (Unstructured and Semi-Structured decision problems). Decision support systems can be either fully computerized, human or a combination of both. While academics have perceived DSS as a tool to support decision making process, DSS users see DSS as a tool to facilitate organizational processes. Some authors have extended the definition of DSS to include any system that might support decision making. Sprague (1980) defines DSS by its characteristics: 1. DSS tends to be aimed at the less well structured, underspecified problem that upper level managers typically face; 2. DSS attempts to combine the use of models or analytic techniques with traditional data access and retrieval functions; 3. DSS specifically focuses on features which make them easy to use by noncomputer people in an interactive mode; and 4. DSS emphasizes flexibility and adaptability to accommodate changes in the environment and the decision making approach of the user. DSSs include knowledge-based systems. A properly designed DSS is an interactive software-based system intended to help decision makers compile useful information from a combination of raw data, documents, and personal knowledge, or business models to identify and solve problems and make decisions. Typical information that a decision support application might gather and present includes: · inventories of information assets (including legacy and relational data sources, cubes, data warehouses, and data marts), · comparative sales figures between one period and the next, · projected revenue figures based on product sales assumptions. Decision Theory Decision theory or theory of choice in economics, psychology, philosophy, mathematics, computer science, and statistics is concerned with identifying the values, uncertainties and other issues relevant in a given decision, its rationality, and the resulting optimal decision. It is closely related to the field of game theory; decision theory is concerned with the choices of individual agents whereas game theory is concerned with interactions of agents whose decisions affect each other. Decision Tree Based Missing Value Imputation Technique(DMI) Decision tree based Missing value Imputation technique’ (DMI) makes use of an EM algorithm and a decision tree (DT) algorithm. Decision Tree Learning / Classification and Regression Trees(CART) Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item’s target value. It is one of the predictive modelling approaches used in statistics, data mining and machine learning. More descriptive names for such tree models are classification trees or regression trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Declarative Statistics In this work we introduce declarative statistics, a suite of declarative modelling tools for statistical analysis. Statistical constraints represent the key building block of declarative statistics. First, we introduce a range of relevant counting and matrix constraints and associated decompositions, some of which novel, that are instrumental in the design of statistical constraints. Second, we introduce a selection of novel statistical constraints and associated decompositions, which constitute a self-contained toolbox that can be used to tackle a wide range of problems typically encountered by statisticians. Finally, we deploy these statistical constraints to a wide range of application areas drawn from classical statistics and we contrast our framework against established practices. Decomposition-Based Transfer Distance Metric Learning(DTDML) Distance metric learning (DML) is a critical factor for image analysis and pattern recognition. To learn a robust distance metric for a target task, we need abundant side information (i.e., the similarity/dissimilarity pairwise constraints over the labeled data), which is usually unavailable in practice due to the high labeling cost. This paper considers the transfer learning setting by exploiting the large quantity of side information from certain related, but different source tasks to help with target metric learning (with only a little side information). The state-of-the-art metric learning algorithms usually fail in this setting because the data distributions of the source task and target task are often quite different. We address this problem by assuming that the target distance metric lies in the space spanned by the eigenvectors of the source metrics (or other randomly generated bases). The target metric is represented as a combination of the base metrics, which are computed using the decomposed components of the source metrics (or simply a set of random bases); we call the proposed method, decomposition-based transfer DML (DTDML). In particular, DTDML learns a sparse combination of the base metrics to construct the target metric by forcing the target metric to be close to an integration of the source metrics. The main advantage of the proposed method compared with existing transfer metric learning approaches is that we directly learn the base metric coefficients instead of the target metric. To this end, far fewer variables need to be learned. We therefore obtain more reliable solutions given the limited side information and the optimization tends to be faster. Experiments on the popular handwritten image (digit, letter) classification and challenge natural image annotation tasks demonstrate the effectiveness of the proposed method. Deconfounded Recommender The goal of a recommender system is to show its users items that they will like. In forming its prediction, the recommender system tries to answer: ‘what would the rating be if we ‘forced’ the user to watch the movie?’ This is a question about an intervention in the world, a causal question, and so traditional recommender systems are doing causal inference from observational data. This paper develops a causal inference approach to recommendation. Traditional recommenders are likely biased by unobserved confounders, variables that affect both the ‘treatment assignments’ (which movies the users watch) and the ‘outcomes’ (how they rate them). We develop the deconfounded recommender, a strategy to leverage classical recommendation models for causal predictions. The deconfounded recommender uses Poisson factorization on which movies users watched to infer latent confounders in the data; it then augments common recommendation models to correct for potential confounding bias. The deconfounded recommender improves recommendation and it enjoys stable performance against interventions on test sets. Deconvolutional Paragraph Representation Learning Learning latent representations from long text sequences is an important first step in many natural language processing applications. Recurrent Neural Networks (RNNs) have become a cornerstone for this challenging task. However, the quality of sentences during RNN-based decoding (reconstruction) decreases with the length of the text. We propose a sequence-to-sequence, purely convolutional and deconvolutional autoencoding framework that is free of the above issue, while also being computationally efficient. The proposed method is simple, easy to implement and can be leveraged as a building block for many applications. We show empirically that compared to RNNs, our framework is better at reconstructing and correcting long paragraphs. Quantitative evaluation on semi-supervised text classification and summarization tasks demonstrate the potential for better utilization of long unlabeled text data. Decoupled Learning Incorporating encoding-decoding nets with adversarial nets has been widely adopted in image generation tasks. We observe that the state-of-the-art achievements were obtained by carefully balancing the reconstruction loss and adversarial loss, and such balance shifts with different network structures, datasets, and training strategies. Empirical studies have demonstrated that an inappropriate weight between the two losses may cause instability, and it is tricky to search for the optimal setting, especially when lacking prior knowledge on the data and network. This paper gives the first attempt to relax the need of manual balancing by proposing the concept of \textit{decoupled learning}, where a novel network structure is designed that explicitly disentangles the backpropagation paths of the two losses. Experimental results demonstrate the effectiveness, robustness, and generality of the proposed method. The other contribution of the paper is the design of a new evaluation metric to measure the image quality of generative models. We propose the so-called \textit{normalized relative discriminative score} (NRDS), which introduces the idea of relative comparison, rather than providing absolute estimates like existing metrics. Decoupled Network Inner product-based convolution has been a central component of convolutional neural networks (CNNs) and the key to learning visual representations. Inspired by the observation that CNN-learned features are naturally decoupled with the norm of features corresponding to the intra-class variation and the angle corresponding to the semantic difference, we propose a generic decoupled learning framework which models the intra-class variation and semantic difference independently. Specifically, we first reparametrize the inner product to a decoupled form and then generalize it to the decoupled convolution operator which serves as the building block of our decoupled networks. We present several effective instances of the decoupled convolution operator. Each decoupled operator is well motivated and has an intuitive geometric interpretation. Based on these decoupled operators, we further propose to directly learn the operator from data. Extensive experiments show that such decoupled reparameterization renders significant performance gain with easier convergence and stronger robustness. Decreasing-Trend-Nature(DTN) We propose a novel diminishing learning rate scheme, coined Decreasing-Trend-Nature (DTN), which allows us to prove fast convergence of the Stochastic Gradient Descent (SGD) algorithm to a first-order stationary point for smooth general convex and some class of nonconvex including neural network applications for classification problems. We are the first to prove that SGD with diminishing learning rate achieves a convergence rate of $\mathcal{O}(1/t)$ for these problems. Our theory applies to neural network applications for classification problems in a straightforward way. DEDPUL This paper studies Positive-Unlabeled Classification, the problem of semi-supervised binary classification in the case when Negative (N) class in the training set is contaminated with instances of Positive (P) class. We develop a novel method (DEDPUL) that simultaneously solves two problems concerning the contaminated Unlabeled (U) sample: estimates the proportions of the mixing components (P and N) in U, and classifies U. By conducting experiments on synthetic and real-world data we favorably compare DEDPUL with current state-of-the-art methods for both problems. We introduce an automatic procedure for DEDPUL hyperparameter optimization. Additionally, we improve two methods in the literature and achieve DEDPUL level of performance with one of them. Deducer An R Graphical User Interface (GUI) for Everyone: Deducer is designed to be a free easy to use alternative to proprietary data analysis software such as SPSS, JMP, and Minitab. It has a menu system to do common data manipulation and analysis tasks, and an excel-like spreadsheet in which to view and edit data frames. The goal of the project is two fold. 1. Provide an intuitive graphical user interface (GUI) for R, encouraging non-technical users to learn and perform analyses without programming getting in their way. 2. Increase the efficiency of expert R users when performing common tasks by replacing hundreds of keystrokes with a few mouse clicks. Also, as much as possible the GUI should not get in their way if they just want to do some programming. Deducer is designed to be used with the Java based R console JGR, though it supports a number of other R environments (e.g. Windows RGUI and RTerm). Deductive Reasoning Deductive reasoning, also deductive logic, logical deduction is the process of reasoning from one or more statements (premises) to reach a logically certain conclusion. Deductive reasoning goes in the same direction as that of the conditionals, and links premises with conclusions. If all premises are true, the terms are clear, and the rules of deductive logic are followed, then the conclusion reached is necessarily true. Deductive reasoning (‘top-down logic’) contrasts with inductive reasoning (‘bottom-up logic’) in the following way; in deductive reasoning, a conclusion is reached reductively by applying general rules which hold over the entirety of a closed domain of discourse, narrowing the range under consideration until only the conclusion(s) is left. In inductive reasoning, the conclusion is reached by generalizing or extrapolating from specific cases to general rules, i.e., there is epistemic uncertainty. However, the inductive reasoning mentioned here is not the same as induction used in mathematical proofs – mathematical induction is actually a form of deductive reasoning. Deductive reasoning differs from abductive reasoning by the direction of the reasoning relative to the conditionals. Deductive reasoning goes in the same direction as that of the conditionals, whereas abductive reasoning goes in the opposite direction to that of the conditionals. Deductron The current paper is a study in Recurrent Neural Networks (RNN), motivated by the lack of examples simple enough so that they can be thoroughly understood theoretically, but complex enough to be realistic. We constructed an example of structured data, motivated by problems from image-to-text conversion (OCR), which requires long-term memory to decode. Our data is a simple writing system, encoding characters ‘X’ and ‘O’ as their upper halves, which is possible due to symmetry of the two characters. The characters can be connected, as in some languages using cursive, such as Arabic (abjad). The string ‘XOOXXO’ may be encoded as ‘${\vee}{\wedge}\kern-1.5pt{\wedge}{\vee}\kern-1.5pt{\vee}{\wedge}$’. It follows that we may need to know arbitrarily long past to decode a current character, thus requiring long-term memory. Subsequently we constructed an RNN capable of decoding sequences encoded in this manner. Rather than by training, we constructed our RNN ‘by inspection’, i.e. we guessed its weights. This involved a sequence of steps. We wrote a conventional program which decodes the sequences as the example above. Subsequently, we interpreted the program as a neural network (the only example of this kind known to us). Finally, we generalized this neural network to discover a new RNN architecture whose instance is our handcrafted RNN. It turns out to be a 3 layer network, where the middle layer is capable of performing simple logical inferences; thus the name ‘deductron’. It is demonstrated that it is possible to train our network by simulated annealing. Also, known variants of stochastic gradient descent (SGD) methods are shown to work. Deduplication with Hadoop(Dedoop) Entity Matching for Big Data: Automatically matching entities (objects) and ontologies are key technologies to semantically integrate heterogeneous data. These match techniques are needed to identify equivalent data objects (duplicates) or semantically equivalent metadata elements (ontology concepts, schema attributes). The proposed techniques demand very high resources that limit their applicability to large-scale (Big Data) problems unless a powerful cloud infrastructure can be utilized. This is because the (fuzzy) match approaches basically have a quadratic complexity to compare the all elements to be matched with each other. For sufficient match quality, multiple match algorithms need to be applied and combined within so-called match workflows adding further resource requirements as well as a significant optimization problem to select matchers and configure their combination. Deep Abstract Q-Network We examine the problem of learning and planning on high-dimensional domains with long horizons and sparse rewards. Recent approaches have shown great successes in many Atari 2600 domains. However, domains with long horizons and sparse rewards, such as Montezuma’s Revenge and Venture, remain challenging for existing methods. Methods using abstraction (Dietterich 2000; Sutton, Precup, and Singh 1999) have shown to be useful in tackling long-horizon problems. We combine recent techniques of deep reinforcement learning with existing model-based approaches using an expert-provided state abstraction. We construct toy domains that elucidate the problem of long horizons, sparse rewards and high-dimensional inputs, and show that our algorithm significantly outperforms previous methods on these domains. Our abstraction-based approach outperforms Deep Q-Networks (Mnih et al. 2015) on Montezuma’s Revenge and Venture, and exhibits backtracking behavior that is absent from previous methods. Deep Adversarial Data Augmentation(DADA) Deep learning has revolutionized the performance of classification, but meanwhile demands sufficient labeled data for training. Given insufficient data, while many techniques have been developed to help combat overfitting, the challenge remains if one tries to train deep networks, especially in the ill-posed extremely low data regimes: only a small set of labeled data are available, and nothing — including unlabeled data — else. Such regimes arise from practical situations where not only data labeling but also data collection itself is expensive. We propose a deep adversarial data augmentation (DADA) technique to address the problem, in which we elaborately formulate data augmentation as a problem of training a class-conditional and supervised generative adversarial network (GAN). Specifically, a new discriminator loss is proposed to fit the goal of data augmentation, through which both real and augmented samples are enforced to contribute to and be consistent in finding the decision boundaries. Tailored training techniques are developed accordingly. To quantitatively validate its effectiveness, we first perform extensive simulations to show that DADA substantially outperforms both traditional data augmentation and a few GAN-based options. We then extend experiments to three real-world small labeled datasets where existing data augmentation and/or transfer learning strategies are either less effective or infeasible. All results endorse the superior capability of DADA in enhancing the generalization ability of deep networks trained in practical extremely low data regimes. Source code is available at https://…/DADA. Deep Adversarial Network Alignment(DANA) Network alignment, in general, seeks to discover the hidden underlying correspondence between nodes across two (or more) networks when given their network structure. However, most existing network alignment methods have added assumptions of additional constraints to guide the alignment, such as having a set of seed node-node correspondences across the networks or the existence of side-information. Instead, we seek to develop a general network alignment algorithm that makes no additional assumptions. Recently, network embedding has proven effective in many network analysis tasks, but embeddings of different networks are not aligned. Thus, we present our Deep Adversarial Network Alignment (DANA) framework that first uses deep adversarial learning to discover complex mappings for aligning the embedding distributions of the two networks. Then, using our learned mapping functions, DANA performs an efficient nearest neighbor node alignment. We perform experiments on real world datasets to show the effectiveness of our framework for first aligning the graph embedding distributions and then discovering node alignments that outperform existing methods. Deep Alignment Network(DAN) In this paper, we propose Deep Alignment Network (DAN), a robust face alignment method based on a deep neural network architecture. DAN consists of multiple stages, where each stage improves the locations of the facial landmarks estimated by the previous stage. Our method uses entire face images at all stages, contrary to the recently proposed face alignment methods that rely on local patches. This is possible thanks to the use of landmark heatmaps which provide visual information about landmark locations estimated at the previous stages of the algorithm. The use of entire face images rather than patches allows DAN to handle face images with large variation in head pose and difficult initializations. An extensive evaluation on two publicly available datasets shows that DAN reduces the state-of-the-art failure rate by up to 70%. Our method has also been submitted for evaluation as part of the Menpo challenge. Classifying and Visualizing Emotions with Emotional DAN Deep AlterNations for Training nEural networks(DANTE) We present DANTE, a novel method for training neural networks using the alternating minimization principle. DANTE provides an alternate perspective to traditional gradient-based backpropagation techniques commonly used to train deep networks. It utilizes an adaptation of quasi-convexity to cast training a neural network as a bi-quasi-convex optimization problem. We show that for neural network configurations with both differentiable (e.g. sigmoid) and non-differentiable (e.g. ReLU) activation functions, we can perform the alternations very effectively. DANTE can also be extended to networks with multiple hidden layers. In experiments on standard datasets, neural networks trained using the proposed method were found to be very promising and competitive to traditional backpropagation techniques, both in terms of quality of the solution, as well as training speed. Deep Analytic Network(DAN) Stacking-based deep neural network (S-DNN) is aggregated with pluralities of basic learning modules, one after another, to synthesize a deep neural network (DNN) alternative for pattern classification. Contrary to the DNNs trained end to end by backpropagation (BP), each S-DNN layer, i.e., a self-learnable module, is to be trained decisively and independently without BP intervention. In this paper, a ridge regression-based S-DNN, dubbed deep analytic network (DAN), along with its kernelization (K-DAN), are devised for multilayer feature re-learning from the pre-extracted baseline features and the structured features. Our theoretical formulation demonstrates that DAN/K-DAN re-learn by perturbing the intra/inter-class variations, apart from diminishing the prediction errors. We scrutinize the DAN/K-DAN performance for pattern classification on datasets of varying domains – faces, handwritten digits, generic objects, to name a few. Unlike the typical BP-optimized DNNs to be trained from gigantic datasets by GPU, we disclose that DAN/K-DAN are trainable using only CPU even for small-scale training sets. Our experimental results disclose that DAN/K-DAN outperform the present S-DNNs and also the BP-trained DNNs, including multiplayer perceptron, deep belief network, etc., without data augmentation applied. Deep Anchored Convolutional Neural Network(DACNN) Convolutional Neural Networks (CNNs) have been proven to be extremely successful at solving computer vision tasks. State-of-the-art methods favor such deep network architectures for its accuracy performance, with the cost of having massive number of parameters and high weights redundancy. Previous works have studied how to prune such CNNs weights. In this paper, we go to another extreme and analyze the performance of a network stacked with a single convolution kernel across layers, as well as other weights sharing techniques. We name it Deep Anchored Convolutional Neural Network (DACNN). Sharing the same kernel weights across layers allows to reduce the model size tremendously, more precisely, the network is compressed in memory by a factor of L, where L is the desired depth of the network, disregarding the fully connected layer for prediction. The number of parameters in DACNN barely increases as the network grows deeper, which allows us to build deep DACNNs without any concern about memory costs. We also introduce a partial shared weights network (DACNN-mix) as well as an easy-plug-in module, coined regulators, to boost the performance of our architecture. We validated our idea on 3 datasets: CIFAR-10, CIFAR-100 and SVHN. Our results show that we can save massive amounts of memory with our model, while maintaining a high accuracy performance. Deep and Shallow Feature Learning Network(DSNet) In recent years, Discriminative Correlation Filter (DCF) based tracking methods have achieved great success in visual tracking. However, the multi-resolution convolutional feature maps trained from other tasks like image classification, cannot be naturally used in the conventional DCF formulation. Furthermore, these high-dimensional feature maps significantly increase the tracking complexity and thus limit the tracking speed. In this paper, we present a deep and shallow feature learning network, namely DSNet, to learn the multi-level same-resolution compressed (MSC) features for efficient online tracking, in an end-to-end offline manner. Specifically, the proposed DSNet compresses multi-level convolutional features to uniform spatial resolution features. The learned MSC features effectively encode both appearance and semantic information of objects in the same-resolution feature maps, thus enabling an elegant combination of the MSC features with any DCF-based methods. Additionally, a channel reliability measurement (CRM) method is presented to further refine the learned MSC features. We demonstrate the effectiveness of the MSC features learned from the proposed DSNet on two DCF tracking frameworks: the basic DCF framework and the continuous convolution operator framework. Extensive experiments show that the learned MSC features have the appealing advantage of allowing the equipped DCF-based tracking methods to perform favorably against the state-of-the-art methods while running at high frame rates. Deep Appearance map(DAM) We propose a deep representation of appearance, i. e. the relation of color, surface orientation, viewer position, material and illumination. Previous approaches have used deep learning to extract classic appearance representations relating to reflectance model parameters (e. g. Phong) or illumination (e. g. HDR environment maps). We suggest to directly represent appearance itself as a network we call a deep appearance map (DAM). This is a 4D generalization over 2D reflectance maps, which held the view direction fixed. First, we show how a DAM can be learned from images or video frames and later be used to synthesize appearance, given new surface orientations and viewer positions. Second, we demonstrate how another network can be used to map from an image or video frames to a DAM network to reproduce this appearance, without using a lengthy optimization such as stochastic gradient descent (learning-to-learn). Finally, we generalize this to an appearance estimation-and-segmentation task, where we map from an image showing multiple materials to multiple networks reproducing their appearance, as well as per-pixel segmentation. Deep Approximately Orthogonal Nonnegative Matrix Factorization Nonnegative Matrix Factorization (NMF) is a widely used technique for data representation. Inspired by the expressive power of deep learning, several NMF variants equipped with deep architectures have been proposed. However, these methods mostly use the only nonnegativity while ignoring task-specific features of data. In this paper, we propose a novel deep approximately orthogonal nonnegative matrix factorization method where both nonnegativity and orthogonality are imposed with the aim to perform a hierarchical clustering by using different level of abstractions of data. Experiment on two face image datasets showed that the proposed method achieved better clustering performance than other deep matrix factorization methods and state-of-the-art single layer NMF variants. Deep Archetypal Analysis Deep Archetypal Analysis’ generates latent representations of high-dimensional datasets in terms of fractions of intuitively understandable basic entities called archetypes. The proposed method is an extension of linear ‘Archetypal Analysis’ (AA), an unsupervised method to represent multivariate data points as sparse convex combinations of extremal elements of the dataset. Unlike the original formulation of AA, ‘Deep AA’ can also handle side information and provides the ability for data-driven representation learning which reduces the dependence on expert knowledge. Our method is motivated by studies of evolutionary trade-offs in biology where archetypes are species highly adapted to a single task. Along these lines, we demonstrate that ‘Deep AA’ also lends itself to the supervised exploration of chemical space, marking a distinct starting point for de novo molecular design. In the unsupervised setting we show how ‘Deep AA’ is used on CelebA to identify archetypal faces. These can then be superimposed in order to generate new faces which inherit dominant traits of the archetypes they are based on. Deep Asymmetric Multitask Feature Learning(Deep-AMTFL) We propose Deep Asymmetric Multitask Feature Learning (Deep-AMTFL) which can learn deep representations shared across multiple tasks while effectively preventing negative transfer that may happen in the feature sharing process. Specifically, we introduce an asymmetric autoencoder term that allows predictors for the confident tasks to have high contribution to the feature learning while suppressing the influences of less confident task predictors. This allows learning less noisy representations, and allows weak predictors to exploit knowledge from the strong predictors via the shared latent features. Such asymmetric knowledge transfer through shared features is also more scalable and efficient than inter-task asymmetric transfer. We validate our Deep-AMTFL model on multiple benchmark datasets for multitask learning and image classification, on which it significantly outperforms existing symmetric and asymmetric multitask learning models, by effectively preventing negative transfer in deep feature learning. Deep Asymmetric Network This work presents deep asymmetric networks with a set of node-wise variant activation functions. The nodes’ sensitivities are affected by activation function selections such that the nodes with smaller indices become increasingly more sensitive. As a result, features learned by the nodes are sorted by the node indices in the order of their importance. Asymmetric networks not only learn input features but also the importance of those features. Nodes of lesser importance in asymmetric networks can be pruned to reduce the complexity of the networks, and the pruned networks can be retrained without incurring performance losses. We validate the feature-sorting property using both shallow and deep asymmetric networks as well as deep asymmetric networks transferred from famous networks. Deep Attention GAN(DA-GAN) Unsupervised image translation, which aims in translating two independent sets of images, is challenging in discovering the correct correspondences without paired data. Existing works build upon Generative Adversarial Network (GAN) such that the distribution of the translated images are indistinguishable from the distribution of the target set. However, such set-level constraints cannot learn the instance-level correspondences (e.g. aligned semantic parts in object configuration task). This limitation often results in false positives (e.g. geometric or semantic artifacts), and further leads to mode collapse problem. To address the above issues, we propose a novel framework for instance-level image translation by Deep Attention GAN (DA-GAN). Such a design enables DA-GAN to decompose the task of translating samples from two sets into translating instances in a highly-structured latent space. Specifically, we jointly learn a deep attention encoder, and the instancelevel correspondences could be consequently discovered through attending on the learned instance pairs. Therefore, the constraints could be exploited on both set-level and instance-level. Comparisons against several state-ofthe- arts demonstrate the superiority of our approach, and the broad application capability, e.g, pose morphing, data augmentation, etc., pushes the margin of domain translation problem. Deep Attention-guided Hashing(DAgH) With the rapid growth of multimedia data (e.g., image, audio and video etc.) on the web, learning-based hashing techniques such as Deep Supervised Hashing (DSH) have proven to be very efficient for large-scale multimedia search. The recent successes seen in Learning-based hashing methods are largely due to the success of deep learning-based hashing methods. However, there are some limitations to previous learning-based hashing methods (e.g., the learned hash codes containing repetitive and highly correlated information). In this paper, we propose a novel learning-based hashing method, named Deep Attention-guided Hashing (DAgH). DAgH is implemented using two stream frameworks. The core idea is to use guided hash codes which are generated by the hashing network of the first stream framework (called first hashing network) to guide the training of the hashing network of the second stream framework (called second hashing network). Specifically, in the first network, it leverages an attention network and hashing network to generate the attention-guided hash codes from the original images. The loss function we propose contains two components: the semantic loss and the attention loss. The attention loss is used to punish the attention network to obtain the salient region from pairs of images; in the second network, these attention-guided hash codes are used to guide the training of the second hashing network (i.e., these codes are treated as supervised labels to train the second network). By doing this, DAgH can make full use of the most critical information contained in images to guide the second hashing network in order to learn efficient hash codes in a true end-to-end fashion. Results from our experiments demonstrate that DAgH can generate high quality hash codes and it outperforms current state-of-the-art methods on three benchmark datasets, CIFAR-10, NUS-WIDE, and ImageNet. Deep Auto-Encoder and Q-Network(DAQN) The deep reinforcement learning method usually requires a large number of training images and executing actions to obtain sufficient results. When it is extended a real-task in the real environment with an actual robot, the method will be required more training images due to complexities or noises of the input images, and executing a lot of actions on the real robot also becomes a serious problem. Therefore, we propose an extended deep reinforcement learning method that is applied a generative model to initialize the network for reducing the number of training trials. In this paper, we used a deep q-network method as the deep reinforcement learning method and a deep auto-encoder as the generative model. We conducted experiments on three different tasks: a cart-pole game, an atari game, and a real-game with an actual robot. The proposed method trained efficiently on all tasks than the previous method, especially 2.5 times faster on a task with real environment images. Deep Autoencoder MIxture Clustering(DAMIC) In this paper we propose a Deep Autoencoder MIxture Clustering (DAMIC) algorithm based on a mixture of deep autoencoders where each cluster is represented by an autoencoder. A clustering network transforms the data into another space and then selects one of the clusters. Next, the autoencoder associated with this cluster is used to reconstruct the data-point. The clustering algorithm jointly learns the nonlinear data representation and the set of autoencoders. The optimal clustering is found by minimizing the reconstruction loss of the mixture of autoencoder network. Unlike other deep clustering algorithms, no regularization term is needed to avoid data collapsing to a single point. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods. Deep Back-Projection Network(DBPN) The feed-forward architectures of recently proposed deep super-resolution networks learn representations of low-resolution inputs, and the non-linear mapping from those to high-resolution output. However, this approach does not fully address the mutual dependencies of low- and high-resolution images. We propose Deep Back-Projection Networks (DBPN), that exploit iterative up- and down-sampling layers, providing an error feedback mechanism for projection errors at each stage. We construct mutually-connected up- and down-sampling stages each of which represents different types of image degradation and high-resolution components. We show that extending this idea to allow concatenation of features across up- and down-sampling stages (Dense DBPN) allows us to reconstruct further improve super-resolution, yielding superior results and in particular establishing new state of the art results for large scaling factors such as 8x across multiple data sets. Deep Bayesian Active Semi-Supervised Learning In many applications the process of generating label information is expensive and time consuming. We present a new method that combines active and semi-supervised deep learning to achieve high generalization performance from a deep convolutional neural network with as few known labels as possible. In a setting where a small amount of labeled data as well as a large amount of unlabeled data is available, our method first learns the labeled data set. This initialization is followed by an expectation maximization algorithm, where further training reduces classification entropy on the unlabeled data by targeting a low entropy fit which is consistent with the labeled data. In addition the algorithm asks at a specified frequency an oracle for labels of data with entropy above a certain entropy quantile. Using this active learning component we obtain an agile labeling process that achieves high accuracy, but requires only a small amount of known labels. For the MNIST dataset we report an error rate of 2.06% using only 300 labels and 1.06% for 1000 labels. These results are obtained without employing any special network architecture or data augmentation. Deep Bayesian Multi-Target Learning(DBMTL) With the increasing variety of services that e-commerce platforms provide, criteria for evaluating their success become also increasingly multi-targeting. This work introduces a multi-target optimization framework with Bayesian modeling of the target events, called Deep Bayesian Multi-Target Learning (DBMTL). In this framework, target events are modeled as forming a Bayesian network, in which directed links are parameterized by hidden layers, and learned from training samples. The structure of Bayesian network is determined by model selection. We applied the framework to Taobao live-streaming recommendation, to simultaneously optimize (and strike a balance) on targets including click-through rate, user stay time in live room, purchasing behaviors and interactions. Significant improvement has been observed for the proposed method over other MTL frameworks and the non-MTL model. Our practice shows that with an integrated causality structure, we can effectively make the learning of a target benefit from other targets, creating significant synergy effects that improve all targets. The neural network construction guided by DBMTL fits in with the general probabilistic model connecting features and multiple targets, taking weaker assumption than the other methods discussed in this paper. This theoretical generality brings about practical generalization power over various targets distributions, including sparse targets and continuous-value ones. Deep Bayesian Regression Model(DBRM) Regression models are used for inference and prediction in a wide range of applications providing a powerful scientific tool for researchers and analysts from different fields. In many research fields the amount of available data as well as the number of potential explanatory variables is rapidly increasing. Variable selection and model averaging have become extremely important tools for improving inference and prediction. However, often linear models are not sufficient and the complex relationship between input variables and a response is better described by introducing non-linearities and complex functional interactions. Deep learning models have been extremely successful in terms of prediction although they are often difficult to specify and potentially suffer from overfitting. The aim of this paper is to bring the ideas of deep learning into a statistical framework which yields more parsimonious models and allows to quantify model uncertainty. To this end we introduce the class of deep Bayesian regression models (DBRM) consisting of a generalized linear model combined with a comprehensive non-linear feature space, where non-linear features are generated just like in deep learning but combined with variable selection in order to include only important features. DBRM can easily be extended to include latent Gaussian variables to model complex correlation structures between observations, which seems to be not easily possible with existing deep learning approaches. Two different algorithms based on MCMC are introduced to fit DBRM and to perform Bayesian inference. The predictive performance of these algorithms is compared with a large number of state of the art algorithms. Furthermore we illustrate how DBRM can be used for model inference in various applications. Deep Belief Networks(DBN) In machine learning, a deep belief network (DBN) is a generative graphical model, or alternatively a type of deep neural network, composed of multiple layers of latent variables (“hidden units”), with connections between the layers but not between units within each layer. When trained on a set of examples in an unsupervised way, a DBN can learn to probabilistically reconstruct its inputs. The layers then act as feature detectors on inputs. After this learning step, a DBN can be further trained in a supervised way to perform classification. DBNs can be viewed as a composition of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, where each sub-network’s hidden layer serves as the visible layer for the next. This also leads to a fast, layer-by-layer unsupervised training procedure, where contrastive divergence is applied to each sub-network in turn, starting from the “lowest” pair of layers (the lowest visible layer being a training set). The observation, due to Hinton’s student Teh, that DBNs can be trained greedily, one layer at a time, has been called a breakthrough in deep learning. Deep Bilevel Learning We present a novel regularization approach to train neural networks that enjoys better generalization and test error than standard stochastic gradient descent. Our approach is based on the principles of cross-validation, where a validation set is used to limit the model overfitting. We formulate such principles as a bilevel optimization problem. This formulation allows us to define the optimization of a cost on the validation set subject to another optimization on the training set. The overfitting is controlled by introducing weights on each mini-batch in the training set and by choosing their values so that they minimize the error on the validation set. In practice, these weights define mini-batch learning rates in a gradient descent update equation that favor gradients with better generalization capabilities. Because of its simplicity, this approach can be integrated with other regularization methods and training schemes. We evaluate extensively our proposed algorithm on several neural network architectures and datasets, and find that it consistently improves the generalization of the model, especially when labels are noisy. Deep Broad Learning(DBL) Deep learning has demonstrated the power of detailed modeling of complex high-order (multivariate) interactions in data. For some learning tasks there is power in learning models that are not only Deep but also Broad. By Broad, we mean models that incorporate evidence from large numbers of features. This is of especial value in applications where many different features and combinations of features all carry small amounts of information about the class. The most accurate models will integrate all that information. In this paper, we propose an algorithm for Deep Broad Learning called DBL. The proposed algorithm has a tunable parameter $n$, that specifies the depth of the model. It provides straightforward paths towards out-of-core learning for large data. We demonstrate that DBL learns models from large quantities of data with accuracy that is highly competitive with the state-of-the-art. Deep Canonical Correlation Analysis(DCCA) Deep Clustering In the context of recent deep clustering studies, discriminative models dominate the literature and report the most competitive performances. These models learn a deep discriminative neural network classifier in which the labels are latent. Typically, they use multinomial logistic regression posteriors and parameter regularization, as is very common in supervised learning. It is generally acknowledged that discriminative objective functions (e.g., those based on the mutual information or the KL divergence) are more flexible than generative approaches (e.g., K-means) in the sense that they make fewer assumptions about the data distributions and, typically, yield much better unsupervised deep learning results. On the surface, several recent discriminative models may seem unrelated to K-means. This study shows that these models are, in fact, equivalent to K-means under mild conditions and common posterior models and parameter regularization. We prove that, for the commonly used logistic regression posteriors, maximizing the $L_2$ regularized mutual information via an approximate alternating direction method (ADM) is equivalent to a soft and regularized K-means loss. Our theoretical analysis not only connects directly several recent state-of-the-art discriminative models to K-means, but also leads to a new soft and regularized deep K-means algorithm, which yields competitive performance on several image clustering benchmarks. Deep COACH(D-COACH) Deep Reinforcement Learning (DRL) has become a powerful strategy to solve complex decision making problems based on Deep Neural Networks (DNNs). However, it is highly data demanding, so unfeasible in physical systems for most applications. In this work, we approach an alternative Interactive Machine Learning (IML) strategy for training DNN policies based on human corrective feedback, with a method called Deep COACH (D-COACH). This approach not only takes advantage of the knowledge and insights of human teachers as well as the power of DNNs, but also has no need of a reward function (which sometimes implies the need of external perception for computing rewards). We combine Deep Learning with the COrrective Advice Communicated by Humans (COACH) framework, in which non-expert humans shape policies by correcting the agent’s actions during execution. The D-COACH framework has the potential to solve complex problems without much data or time required. Experimental results validated the efficiency of the framework in three different problems (two simulated, one with a real robot), with state spaces of low and high dimensions, showing the capacity to successfully learn policies for continuous action spaces like in the Car Racing and Cart-Pole problems faster than with DRL. Deep Coherence Model(DCM) In this paper, we propose a novel deep coherence model (DCM) using a convolutional neural network architecture to capture the text coherence. The text coherence problem is investigated with a new perspective of learning sentence distributional representation and text coherence modeling simultaneously. In particular, the model captures the interactions between sentences by computing the similarities of their distributional representations. Further, it can be easily trained in an end-to-end fashion. The proposed model is evaluated on a standard Sentence Ordering task. The experimental results demonstrate its effectiveness and promise in coherence assessment showing a significant improvement over the state-of-the-art by a wide margin. Deep Collaborative Autoencoder(DCAE) In recent years, deep neural networks have yielded state-of-the-art performance on several tasks. Although some recent works have focused on combining deep learning with recommendation, we highlight three issues of existing works. First, most works perform deep content feature learning and resort to matrix factorization, which cannot effectively model the highly complex user-item interaction function. Second, due to the difficulty on training deep neural networks, existing models utilize a shallow architecture, and thus limit the expressiveness potential of deep learning. Third, neural network models are easy to overfit on the implicit setting, because negative interactions are not taken into account. To tackle these issues, we present a novel recommender framework called Deep Collaborative Autoencoder (DCAE) for both explicit feedback and implicit feedback, which can effectively capture the relationship between interactions via its non-linear expressiveness. To optimize the deep architecture of DCAE, we develop a three-stage pre-training mechanism that combines supervised and unsupervised feature learning. Moreover, we propose a popularity-based error reweighting module and a sparsity-aware data-augmentation strategy for DCAE to prevent overfitting on the implicit setting. Extensive experiments on three real-world datasets demonstrate that DCAE can significantly advance the state-of-the-art. Deep Collaborative Weight-Based Classification(DeepCWC) One of the biggest problems in deep learning is its difficulty to retain consistent robustness when transferring the model trained on one dataset to another dataset. To conquer the problem, deep transfer learning was implemented to execute various vision tasks by using a pre-trained deep model in a diverse dataset. However, the robustness was often far from state-of-the-art. We propose a collaborative weight-based classification method for deep transfer learning (DeepCWC). The method performs the L2-norm based collaborative representation on the original images, as well as the deep features extracted by pre-trained deep models. Two distance vectors will be obtained based on the two representation coefficients, and then fused together via the collaborative weight. The two feature sets show a complementary character, and the original images provide information compensating the missed part in the transferred deep model. A series of experiments conducted on both small and large vision datasets demonstrated the robustness of the proposed DeepCWC in both face recognition and object recognition tasks. Deep Collective Matrix Factorization(dCMF) Learning by integrating multiple heterogeneous data sources is a common requirement in many tasks. Collective Matrix Factorization (CMF) is a technique to learn shared latent representations from arbitrary collections of matrices. It can be used to simultaneously complete one or more matrices, for predicting the unknown entries. Classical CMF methods assume linearity in the interaction of latent factors which can be restrictive and fails to capture complex non-linear interactions. In this paper, we develop the first deep-learning based method, called dCMF, for unsupervised learning of multiple shared representations, that can model such non-linear interactions, from an arbitrary collection of matrices. We address optimization challenges that arise due to dependencies between shared representations through Multi-Task Bayesian Optimization and design an acquisition function adapted for collective learning of hyperparameters. Our experiments show that dCMF significantly outperforms previous CMF algorithms in integrating heterogeneous data for predictive modeling. Further, on two tasks – recommendation and prediction of gene-disease association – dCMF outperforms state-of-the-art matrix completion algorithms that can utilize auxiliary sources of information. Deep Comparator Network(DCN) The objective of this work is set-based verification, e.g. to decide if two sets of images of a face are of the same person or not. The traditional approach to this problem is to learn to generate a feature vector per image, aggregate them into one vector to represent the set, and then compute the cosine similarity between sets. Instead, we design a neural network architecture that can directly learn set-wise verification. Our contributions are: (i) We propose a Deep Comparator Network (DCN) that can ingest a pair of sets (each may contain a variable number of images) as inputs, and compute a similarity between the pair–this involves attending to multiple discriminative local regions (landmarks), and comparing local descriptors between pairs of faces; (ii) To encourage high-quality representations for each set, internal competition is introduced for recalibration based on the landmark score; (iii) Inspired by image retrieval, a novel hard sample mining regime is proposed to control the sampling process, such that the DCN is complementary to the standard image classification models. Evaluations on the IARPA Janus face recognition benchmarks show that the comparator networks outperform the previous state-of-the-art results by a large margin. Deep Comparison Network(DCN) Few-shot deep learning is a topical challenge area for scaling visual recognition to open-ended growth in the space of categories to recognise. A promising line work towards realising this vision is deep networks that learn to match queries with stored training images. However, methods in this paradigm usually train a deep embedding followed by a single linear classifier. Our insight is that effective general-purpose matching requires discrimination with regards to features at multiple abstraction levels. We therefore propose a new framework termed Deep Comparison Network (DCN) that decomposes embedding learning into a sequence of modules, and pairs each with a relation module. The relation modules compute a non-linear metric to score the match using the corresponding embedding module’s representation. To ensure that all embedding module’s features are used, the relation modules are deeply supervised. Finally generalisation is further improved by a learned noise regulariser. The resulting network achieves state of the art performance on both miniImageNet and tieredImageNet, while retaining the appealing simplicity and efficiency of deep metric learning approaches. Deep Complex Network At present, the vast majority of building blocks, techniques, and architectures for deep learning are based on real-valued operations and representations. However, recent work on recurrent neural networks and older fundamental theoretical analysis suggests that complex numbers could have a richer representational capacity and could also facilitate noise-robust memory retrieval mechanisms. Despite their attractive properties and potential for opening up entirely new neural architectures, complex-valued deep neural networks have been marginalized due to the absence of the building blocks required to design such models. In this work, we provide the key atomic components for complex-valued deep neural networks and apply them to convolutional feed-forward networks and convolutional LSTMs. More precisely, we rely on complex convolutions and present algorithms for complex batch-normalization, complex weight initialization strategies for complex-valued neural nets and we use them in experiments with end-to-end training schemes. We demonstrate that such complex-valued models are competitive with their real-valued counterparts. We test deep complex models on several computer vision tasks, on music transcription using the MusicNet dataset and on Speech Spectrum Prediction using the TIMIT dataset. We achieve state-of-the-art performance on these audio-related tasks. Deep Complex U-Net Most deep learning-based models for speech enhancement have mainly focused on estimating the magnitude of spectrogram while reusing the phase from noisy speech for reconstruction. This is due to the difficulty of estimating the phase of clean speech. To improve speech enhancement performance, we tackle the phase estimation problem in three ways. First, we propose Deep Complex U-Net, an advanced U-Net structured model incorporating well-defined complex-valued building blocks to deal with complex-valued spectrograms. Second, we propose a polar coordinate-wise complex-valued masking method to reflect the distribution of complex ideal ratio masks. Third, we define a novel loss function, weighted source-to-distortion ratio (wSDR) loss, which is designed to directly correlate with a quantitative evaluation measure. Our model was evaluated on a mixture of the Voice Bank corpus and DEMAND database, which has been widely used by many deep learning models for speech enhancement. Ablation experiments were conducted on the mixed dataset showing that all three proposed approaches are empirically valid. Experimental results show that the proposed method achieves state-of-the-art performance in all metrics, outperforming previous approaches by a large margin. Deep Component Analysis(DeepCA) Despite a lack of theoretical understanding, deep neural networks have achieved unparalleled performance in a wide range of applications. On the other hand, shallow representation learning with component analysis is associated with rich intuition and theory, but smaller capacity often limits its usefulness. To bridge this gap, we introduce Deep Component Analysis (DeepCA), an expressive multilayer model formulation that enforces hierarchical structure through constraints on latent variables in each layer. For inference, we propose a differentiable optimization algorithm implemented using recurrent Alternating Direction Neural Networks (ADNNs) that enable parameter learning using standard backpropagation. By interpreting feed-forward networks as single-iteration approximations of inference in our model, we provide both a novel theoretical perspective for understanding them and a practical technique for constraining predictions with prior knowledge. Experimentally, we demonstrate performance improvements on a variety of tasks, including single-image depth prediction with sparse output constraints. Deep Comprehensive Correlation Mining(DCCM) Recent developed deep unsupervised methods allow us to jointly learn representation and cluster unlabelled data. These deep clustering methods %like DAC start with mainly focus on the correlation among samples, e.g., selecting high precision pairs to gradually tune the feature representation, which neglects other useful correlations. In this paper, we propose a novel clustering framework, named deep comprehensive correlation mining(DCCM), for exploring and taking full advantage of various kinds of correlations behind the unlabeled data from three aspects: 1) Instead of only using pair-wise information, pseudo-label supervision is proposed to investigate category information and learn discriminative features. 2) The features’ robustness to image transformation of input space is fully explored, which benefits the network learning and significantly improves the performance. 3) The triplet mutual information among features is presented for clustering problem to lift the recently discovered instance-level deep mutual information to a triplet-level formation, which further helps to learn more discriminative features. Extensive experiments on several challenging datasets show that our method achieves good performance, e.g., attaining $62.3\%$ clustering accuracy on CIFAR-10, and $34.0\%$ on CIFAR-100, both of which significantly surpass the state-of-the-art results more than $10.0\%$. Deep Compressed Sensing Compressed sensing (CS) provides an elegant framework for recovering sparse signals from compressed measurements. For example, CS can exploit the structure of natural images and recover an image from only a few random measurements. CS is flexible and data efficient, but its application has been restricted by the strong assumption of sparsity and costly reconstruction process. A recent approach that combines CS with neural network generators has removed the constraint of sparsity, but reconstruction remains slow. Here we propose a novel framework that significantly improves both the performance and speed of signal recovery by jointly training a generator and the optimisation process for reconstruction via meta-learning. We explore training the measurements with different objectives, and derive a family of models based on minimising measurement errors. We show that Generative Adversarial Nets (GANs) can be viewed as a special case in this family of models. Borrowing insights from the CS perspective, we develop a novel way of improving GANs using gradient information from the discriminator. Deep Confidence Deep learning architectures have proved versatile in a number of drug discovery applications, including the modelling of in vitro compound activity. While controlling for prediction confidence is essential to increase the trust, interpretability and usefulness of virtual screening models in drug discovery, techniques to estimate the reliability of the predictions generated with deep learning networks remain largely underexplored. Here, we present Deep Confidence, a framework to compute valid and efficient confidence intervals for individual predictions using the deep learning technique Snapshot Ensembling and conformal prediction. Specifically, Deep Confidence generates an ensemble of deep neural networks by recording the network parameters throughout the local minima visited during the optimization phase of a single neural network. This approach serves to derive a set of base learners (i.e., snapshots) with comparable predictive power on average, that will however generate slightly different predictions for a given instance. The variability across base learners and the validation residuals are in turn harnessed to compute confidence intervals using the conformal prediction framework. Using a set of 24 diverse IC50 data sets from ChEMBL 23, we show that Snapshot Ensembles perform on par with Random Forest (RF) and ensembles of independently trained deep neural networks. In addition, we find that the confidence regions predicted using the Deep Confidence framework span a narrower set of values. Overall, Deep Confidence represents a highly versatile error prediction framework that can be applied to any deep learning-based application at no extra computational cost. Deep Contextual Multi-armed Bandits Contextual multi-armed bandit problems arise frequently in important industrial applications. Existing solutions model the context either linearly, which enables uncertainty driven (principled) exploration, or non-linearly, by using epsilon-greedy exploration policies. Here we present a deep learning framework for contextual multi-armed bandits that is both non-linear and enables principled exploration at the same time. We tackle the exploration vs. exploitation trade-off through Thompson sampling by exploiting the connection between inference time dropout and sampling from the posterior over the weights of a Bayesian neural network. In order to adjust the level of exploration automatically as more data is made available to the model, the dropout rate is learned rather than considered a hyperparameter. We demonstrate that our approach substantially reduces regret on two tasks (the UCI Mushroom task and the Casino Parity task) when compared to 1) non-contextual bandits, 2) epsilon-greedy deep contextual bandits, and 3) fixed dropout rate deep contextual bandits. Our approach is currently being applied to marketing optimization problems at HubSpot. Deep Continuous Clustering Clustering high-dimensional datasets is hard because interpoint distances become less informative in high-dimensional spaces. We present a clustering algorithm that performs nonlinear dimensionality reduction and clustering jointly. The data is embedded into a lower-dimensional space by a deep autoencoder. The autoencoder is optimized as part of the clustering process. The resulting network produces clustered data. The presented approach does not rely on prior knowledge of the number of ground-truth clusters. Joint nonlinear dimensionality reduction and clustering are formulated as optimization of a global continuous objective. We thus avoid discrete reconfigurations of the objective that characterize prior clustering algorithms. Experiments on datasets from multiple domains demonstrate that the presented algorithm outperforms state-of-the-art clustering schemes, including recent methods that use deep networks. Deep Convolutional Cascade for Face Alignment(DeCaFA) Face Alignment is an active computer vision domain, that consists in localizing a number of facial landmarks that vary across datasets. State-of-the-art face alignment methods either consist in end-to-end regression, or in refining the shape in a cascaded manner, starting from an initial guess. In this paper, we introduce DeCaFA, an end-to-end deep convolutional cascade architecture for face alignment. DeCaFA uses fully-convolutional stages to keep full spatial resolution throughout the cascade. Between each cascade stage, DeCaFA uses multiple chained transfer layers with spatial softmax to produce landmark-wise attention maps for each of several landmark alignment tasks. Weighted intermediate supervision, as well as efficient feature fusion between the stages allow to learn to progressively refine the attention maps in an end-to-end manner. We show experimentally that DeCaFA significantly outperforms existing approaches on 300W, CelebA and WFLW databases. In addition, we show that DeCaFA can learn fine alignment with reasonable accuracy from very few images using coarsely annotated data. Deep Convolutional Decision Jungle(CDJ) We propose a novel method called deep convolutional decision jungle (CDJ) and its learning algorithm for image classification. The CDJ maintains the structure of standard convolutional neural networks (CNNs), i.e. multiple layers of multiple response maps fully connected. Each response map-or node-in both the convolutional and fully-connected layers selectively respond to class labels s.t. each data sample travels via a specific soft route of those activated nodes. The proposed method CDJ automatically learns features, whereas decision forests and jungles require pre-defined feature sets. Compared to CNNs, the method embeds the benefits of using data-dependent discriminative functions, which better handles multi-modal/heterogeneous data; further,the method offers more diverse sparse network responses, which in turn can be used for cost-effective learning/classification. The network is learnt by combining conventional softmax and proposed entropy losses in each layer. The entropy loss,as used in decision tree growing, measures the purity of data activation according to the class label distribution. The back-propagation rule for the proposed loss function is derived from stochastic gradient descent (SGD) optimization of CNNs. We show that our proposed method outperforms state-of-the-art methods on three public image classification benchmarks and one face verification dataset. We also demonstrate the use of auxiliary data labels, when available, which helps our method to learn more discriminative routing and representations and leads to improved classification. Deep Convolutional Gaussian Process We propose deep convolutional Gaussian processes, a deep Gaussian process architecture with convolutional structure. The model is a principled Bayesian framework for detecting hierarchical combinations of local features for image classification. We demonstrate greatly improved image classification performance compared to current Gaussian process approaches on the MNIST and CIFAR-10 datasets. In particular, we improve CIFAR-10 accuracy by over 10 percentage points. Deep Convolutional Generative Adversarial Networks(DCGAN) In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Deep Convolutional Neural Network(DCN) A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks such as visual object and speech recognition. The key factor complicating such tasks is the presence of numerous nuisance variables, for instance, the unknown object position, orientation, and scale in object recognition or the unknown voice pronunciation, pitch, and speed in speech recognition. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks; they are constructed from many layers of alternating linear and nonlinear processing units and are trained using large-scale algorithms and massive amounts of training data. The recent success of deep learning systems is impressive – they now routinely yield pattern recognition systems with nearor super-human capabilities – but a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on a Bayesian generative probabilistic model that explicitly captures variation due to nuisance variables. The graphical structure of the model enables it to be learned from data using classical expectation-maximization techniques. Furthermore, by relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks (DCNs) and random decision forests (RDFs), providing insights into their successes and shortcomings as well as a principled route to their improvement. Deep Convolutional Sparse Coding(D-CSC) Deep Convolutional Sparse Coding (D-CSC) is a framework reminiscent of deep convolutional neural networks (DCNNs), but by omitting the learning of the dictionaries one can more transparently analyse the role of the activation function and its ability to recover activation paths through the layers. Papyan, Romano, and Elad conducted an analysis of such an architecture, demonstrated the relationship with DCNNs and proved conditions under which the D-CSC is guaranteed to recover specific activation paths. A technical innovation of their work highlights that one can view the efficacy of the ReLU nonlinear activation function of a DCNN through a new variant of the tensor’s sparsity, referred to as stripe-sparsity. Using this they proved that representations with an activation density proportional to the ambient dimension of the data are recoverable. Deep Co-Space(DCS) Aiming at improving performance of visual classification in a cost-effective manner, this paper proposes an incremental semi-supervised learning paradigm called Deep Co-Space (DCS). Unlike many conventional semi-supervised learning methods usually performing within a fixed feature space, our DCS gradually propagates information from labeled samples to unlabeled ones along with deep feature learning. We regard deep feature learning as a series of steps pursuing feature transformation, i.e., projecting the samples from a previous space into a new one, which tends to select the reliable unlabeled samples with respect to this setting. Specifically, for each unlabeled image instance, we measure its reliability by calculating the category variations of feature transformation from two different neighborhood variation perspectives, and merged them into an unified sample mining criterion deriving from Hellinger distance. Then, those samples keeping stable correlation to their neighboring samples (i.e., having small category variation in distribution) across the successive feature space transformation, are automatically received labels and incorporated into the model for incrementally training in terms of classification. Our extensive experiments on standard image classification benchmarks (e.g., Caltech-256 and SUN-397) demonstrate that the proposed framework is capable of effectively mining from large-scale unlabeled images, which boosts image classification performance and achieves promising results compared to other semi-supervised learning methods. Deep Counterfactual Regret Minimization(Deep CFR) Counterfactual Regret Minimization (CFR) is the leading algorithm for solving large imperfect-information games. It iteratively traverses the game tree in order to converge to a Nash equilibrium. In order to deal with extremely large games, CFR typically uses domain-specific heuristics to simplify the target game in a process known as abstraction. This simplified game is solved with tabular CFR, and its solution is mapped back to the full game. This paper introduces Deep Counterfactual Regret Minimization (Deep CFR), a form of CFR that obviates the need for abstraction by instead using deep neural networks to approximate the behavior of CFR in the full game. We show that Deep CFR is principled and achieves strong performance in the benchmark game of heads-up no-limit Texas hold’em poker. This is the first successful use of function approximation in CFR for large games. Deep Curiosity Loop(DCL) Inspired by infants’ intrinsic motivation to learn, which values informative sensory channels contingent on their immediate social environment, we developed a deep curiosity loop (DCL) architecture. The DCL is composed of a learner, which attempts to learn a forward model of the agent’s state-action transition, and a novel reinforcement-learning (RL) component, namely, an Action-Convolution Deep Q-Network, which uses the learner’s prediction error as reward. The environment for our agent is composed of visual social scenes, composed of sitcom video streams, thereby both the learner and the RL are constructed as deep convolutional neural networks. The agent’s learner learns to predict the zero-th order of the dynamics of visual scenes, resulting in intrinsic rewards proportional to changes within its social environment. The sources of these socially informative changes within the sitcom are predominantly motions of faces and hands, leading to the unsupervised curiosity-based learning of social interaction features. The face and hand detection is represented by the value function and the social interaction optical-flow is represented by the policy. Our results suggest that face and hand detection are emergent properties of curiosity-based learning embedded in social environments. Deep Data What we call ‘deep data’ is a combination of experts’ domain knowledge of the area … combined with data science. Deep Decoder Deep neural networks, in particular convolutional neural networks, have become highly effective tools for compressing images and solving inverse problems including denoising, inpainting, and reconstruction from few and noisy measurements. This success can be attributed in part to their ability to represent and generate natural images well. Contrary to classical tools such as wavelets, image-generating deep neural networks have a large number of parameters—typically a multiple of their output dimension—and need to be trained on large datasets. In this paper, we propose an untrained simple image model, called the deep decoder, which is a deep neural network that can generate natural images from very few weight parameters. The deep decoder has a simple architecture with no convolutions and fewer weight parameters than the output dimensionality. This underparameterization enables the deep decoder to compress images into a concise set of network weights, which we show is on par with wavelet-based thresholding. Further, underparameterization provides a barrier to overfitting, allowing the deep decoder to have state-of-the-art performance for denoising. The deep decoder is simple in the sense that each layer has an identical structure that consists of only one upsampling unit, pixel-wise linear combination of channels, ReLU activation, and channelwise normalization. This simplicity makes the network amenable to theoretical analysis, and it sheds light on the aspects of neural networks that enable them to form effective signal representations. Deep Density Networks(DDN) Building robust online content recommendation systems requires learning complex interactions between user preferences and content features. The field has evolved rapidly in recent years from traditional multi-arm bandit and collaborative filtering techniques, with new methods integrating Deep Learning models that enable to capture non-linear feature interactions. Despite progress, the dynamic nature of online recommendations still poses great challenges, such as finding the delicate balance between exploration and exploitation. In this paper we provide a novel method, Deep Density Networks (DDN) which deconvolves measurement and data uncertainties and predicts probability density of CTR (Click Through Rate), enabling us to perform more efficient exploration of the feature space. We show the usefulness of using DDN online in a real world content recommendation system that serves billions of recommendations per day, and present online and offline results to evaluate the benefit of using DDN. Deep Determinantal Point Process(Deep DPP) Determinantal point processes (DPPs) have attracted significant attention as an elegant model that is able to capture the balance between quality and diversity within sets. DPPs are parameterized by a positive semi-definite kernel matrix. While DPPs have substantial expressive power, they are fundamentally limited by the parameterization of the kernel matrix and their inability to capture nonlinear interactions between items within sets. We present the deep DPP model as way to address these limitations, by using a deep feed-forward neural network to learn the kernel matrix. In addition to allowing us to capture nonlinear item interactions, the deep DPP also allows easy incorporation of item metadata into DPP learning. We show experimentally that the deep DPP can provide a considerable improvement in the predictive performance of DPPs. Deep Deterministic Policy Gradient(DDPG) We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies ‘end-to-end’: directly from raw pixel inputs. Deep Diffeomorphic Normalizing Flow(DDNF) The Normalizing Flow (NF) models a general probability density by estimating an invertible transformation applied on samples drawn from a known distribution. We introduce a new type of NF, called Deep Diffeomorphic Normalizing Flow (DDNF). A diffeomorphic flow is an invertible function where both the function and its inverse are smooth. We construct the flow using an ordinary differential equation (ODE) governed by a time-varying smooth vector field. We use a neural network to parametrize the smooth vector field and a recursive neural network (RNN) for approximating the solution of the ODE. Each cell in the RNN is a residual network implementing one Euler integration step. The architecture of our flow enables efficient likelihood evaluation, straightforward flow inversion, and results in highly flexible density estimation. An end-to-end trained DDNF achieves competitive results with state-of-the-art methods on a suite of density estimation and variational inference tasks. Finally, our method brings concepts from Riemannian geometry that, we believe, can open a new research direction for neural density estimation. Deep Differential Recurrent Neural Network(DDRNN) Due to the special gating schemes of Long Short-Term Memory (LSTM), LSTMs have shown greater potential to process complex sequential information than the traditional Recurrent Neural Network (RNN). The conventional LSTM, however, fails to take into consideration the impact of salient spatio-temporal dynamics present in the sequential input data. This problem was first addressed by the differential Recurrent Neural Network (dRNN), which uses a differential gating scheme known as Derivative of States (DoS). DoS uses higher orders of internal state derivatives to analyze the change in information gain caused by the salient motions between the successive frames. The weighted combination of several orders of DoS is then used to modulate the gates in dRNN. While each individual order of DoS is good at modeling a certain level of salient spatio-temporal sequences, the sum of all the orders of DoS could distort the detected motion patterns. To address this problem, we propose to control the LSTM gates via individual orders of DoS and stack multiple levels of LSTM cells in an increasing order of state derivatives. The proposed model progressively builds up the ability of the LSTM gates to detect salient dynamical patterns in deeper stacked layers modeling higher orders of DoS, and thus the proposed LSTM model is termed deep differential Recurrent Neural Network (d2RNN). The effectiveness of the proposed model is demonstrated on two publicly available human activity datasets: NUS-HGA and Violent-Flows. The proposed model outperforms both LSTM and non-LSTM based state-of-the-art algorithms. Deep Directional Statistics ➘ “Directional Statistics” Deep Directional Statistics: Pose Estimation with Uncertainty Quantification Deep Discrete Supervised Hashing(DDSH) Hashing has been widely used for large-scale search due to its low storage cost and fast query speed. By using supervised information, supervised hashing can significantly outperform unsupervised hashing. Recently, discrete supervised hashing and deep hashing are two representative progresses in supervised hashing. On one hand, hashing is essentially a discrete optimization problem. Hence, utilizing supervised information to directly guide discrete (binary) coding procedure can avoid sub-optimal solution and improve the accuracy. On the other hand, deep hashing, which integrates deep feature learning and hash-code learning into an end-to-end architecture, can enhance the feedback between feature learning and hash-code learning. The key in discrete supervised hashing is to adopt supervised information to directly guide the discrete coding procedure in hashing. The key in deep hashing is to adopt the supervised information to directly guide the deep feature learning procedure. However, there have not existed works which can use the supervised information to directly guide both discrete coding procedure and deep feature learning procedure in the same framework. In this paper, we propose a novel deep hashing method, called deep discrete supervised hashing (DDSH), to address this problem. DDSH is the first deep hashing method which can utilize supervised information to directly guide both discrete coding procedure and deep feature learning procedure, and thus enhance the feedback between these two important procedures. Experiments on three real datasets show that DDSH can outperform other state-of-the-art baselines, including both discrete hashing and deep hashing baselines, for image retrieval. Deep Discriminative Clustering(DDC) Traditional clustering methods often perform clustering with low-level indiscriminative representations and ignore relationships between patterns, resulting in slight achievements in the era of deep learning. To handle this problem, we develop Deep Discriminative Clustering (DDC) that models the clustering task by investigating relationships between patterns with a deep neural network. Technically, a global constraint is introduced to adaptively estimate the relationships, and a local constraint is developed to endow the network with the capability of learning high-level discriminative representations. By iteratively training the network and estimating the relationships in a mini-batch manner, DDC theoretically converges and the trained network enables to generate a group of discriminative representations that can be treated as clustering centers for straightway clustering. Extensive experiments strongly demonstrate that DDC outperforms current methods on eight image, text and audio datasets concurrently. Deep Distance Metric Learning(DDML) Deep distance metric learning (DDML), which is proposed to learn image similarity metrics in an end-to-end manner based on the convolution neural network. Deep Distribution Regression Due to their flexibility and predictive performance, machine-learning based regression methods have become an important tool for predictive modeling and forecasting. However, most methods focus on estimating the conditional mean or specific quantiles of the target quantity and do not provide the full conditional distribution, which contains uncertainty information that might be crucial for decision making. In this article, we provide a general solution by transforming a conditional distribution estimation problem into a constrained multi-class classification problem, in which tools such as deep neural networks. We propose a novel joint binary cross-entropy loss function to accomplish this goal. We demonstrate its performance in various simulation studies comparing to state-of-the-art competing methods. Additionally, our method shows improved accuracy in a probabilistic solar energy forecasting problem. Deep Divergence Graph Kernel(DDGK) Can neural networks learn to compare graphs without feature engineering? In this paper, we show that it is possible to learn representations for graph similarity with neither domain knowledge nor supervision (i.e.\ feature engineering or labeled graphs). We propose Deep Divergence Graph Kernels, an unsupervised method for learning representations over graphs that encodes a relaxed notion of graph isomorphism. Our method consists of three parts. First, we learn an encoder for each anchor graph to capture its structure. Second, for each pair of graphs, we train a cross-graph attention network which uses the node representations of an anchor graph to reconstruct another graph. This approach, which we call isomorphism attention, captures how well the representations of one graph can encode another. We use the attention-augmented encoder’s predictions to define a divergence score for each pair of graphs. Finally, we construct an embedding space for all graphs using these pair-wise divergence scores. Unlike previous work, much of which relies on 1) supervision, 2) domain specific knowledge (e.g. a reliance on Weisfeiler-Lehman kernels), and 3) known node alignment, our unsupervised method jointly learns node representations, graph representations, and an attention-based alignment between graphs. Our experimental results show that Deep Divergence Graph Kernels can learn an unsupervised alignment between graphs, and that the learned representations achieve competitive results when used as features on a number of challenging graph classification tasks. Furthermore, we illustrate how the learned attention allows insight into the the alignment of sub-structures across graphs. Deep Echo State Network(deepESN) The study of deep recurrent neural networks (RNNs) and, in particular, of deep Reservoir Computing (RC) is gaining an increasing research attention in the neural networks community. The recently introduced deep Echo State Network (deepESN) model opened the way to an extremely efficient approach for designing deep neural networks for temporal data. At the same time, the study of deepESNs allowed to shed light on the intrinsic properties of state dynamics developed by hierarchical compositions of recurrent layers, i.e. on the bias of depth in RNNs architectural design. In this paper, we summarize the advancements in the development, analysis and applications of deepESNs. Deep Ensemble Bayesian Active Learning(DEBAL) In image classification tasks, the ability of deep CNNs to deal with complex image data has proven to be unrivalled. However, they require large amounts of labeled training data to reach their full potential. In specialised domains such as healthcare, labeled data can be difficult and expensive to obtain. Active Learning aims to alleviate this problem, by reducing the amount of labelled data needed for a specific task while delivering satisfactory performance. We propose DEBAL, a new active learning strategy designed for deep neural networks. This method improves upon the current state-of-the-art deep Bayesian active learning method, which suffers from the mode collapse problem. We correct for this deficiency by making use of the expressive power and statistical properties of model ensembles. Our proposed method manages to capture superior data uncertainty, which translates into improved classification performance. We demonstrate empirically that our ensemble method yields faster convergence of CNNs trained on the MNIST and CIFAR-10 datasets. Deep Euclidean Feature Representations through Adaptation on the Grassmann Manifold(DEFRAG) We propose a novel technique for training deep networks with the objective of obtaining feature representations that exist in a Euclidean space and exhibit strong clustering behavior. Our desired features representations have three traits: they can be compared using a standard Euclidian distance metric, samples from the same class are tightly clustered, and samples from different classes are well separated. However, most deep networks do not enforce such feature representations. The DEFRAG training technique consists of two steps: first good feature clustering behavior is encouraged though an auxiliary loss function based on the Silhouette clustering metric. Then the feature space is retracted onto a Grassmann manifold to ensure that the L_2 Norm forms a similarity metric. The DEFRAG technique achieves state of the art results on standard classification datasets using a relatively small network architecture with significantly fewer parameters than many standard networks. Deep Evolutionary Network Structured Representation(DENSER) Deep Evolutionary Network Structured Representation (DENSER) is a novel approach to automatically design Artificial Neural Networks (ANNs) using Evolutionary Computation (EC). The algorithm not only searches for the best network topology (e.g., number of layers, type of layers), but also tunes hyper-parameters, such as, learning parameters or data augmentation parameters. The automatic design is achieved using a representation with two distinct levels, where the outer level encodes the general structure of the network, i.e., the sequence of layers, and the inner level encodes the parameters associated with each layer. The allowed layers and hyper-parameter value ranges are defined by means of a human-readable Context-Free Grammar. DENSER was used to evolve ANNs for two widely used image classification benchmarks obtaining an average accuracy result of up to 94.27% on the CIFAR-10 dataset, and of 78.75% on the CIFAR-100. To the best of our knowledge, our CIFAR-100 results are the highest performing models generated by methods that aim at the automatic design of Convolutional Neural Networks (CNNs), and is amongst the best for manually designed and fine-tuned CNNs . Deep Evolving Denoising Autoencoder(DEVDAN) The generative learning phase of Autoencoder (AE) and its successor Denosing Autoencoder (DAE) enhances the flexibility of data stream method in exploiting unlabelled samples. Nonetheless, the feasibility of DAE for data stream analytic deserves in-depth study because it characterizes a fixed network capacity which cannot adapt to rapidly changing environments. An automated construction of a denoising autoeconder, namely deep evolving denoising autoencoder (DEVDAN), is proposed in this paper. DEVDAN features an open structure both in the generative phase and in the discriminative phase where input features can be automatically added and discarded on the fly. A network significance (NS) method is formulated in this paper and is derived from the bias-variance concept. This method is capable of estimating the statistical contribution of the network structure and its hidden units which precursors an ideal state to add or prune input features. Furthermore, DEVDAN is free of the problem- specific threshold and works fully in the single-pass learning fashion. The efficacy of DEVDAN is numerically validated using nine non-stationary data stream problems simulated under the prequential test-then-train protocol where DEVDAN is capable of delivering an improvement of classification accuracy to recently published online learning works while having flexibility in the automatic extraction of robust input features and in adapting to rapidly changing environments. Deep Evolving Fuzzy Neural Network(DEVFNN) Existing fuzzy neural networks (FNNs) are mostly developed under a shallow network configuration having lower generalization power than those of deep structures. This paper proposes a novel self-organizing deep fuzzy neural network, namely deep evolving fuzzy neural networks (DEVFNN). Fuzzy rules can be automatically extracted from data streams or removed if they play little role during their lifespan. The structure of the network can be deepened on demand by stacking additional layers using a drift detection method which not only detects the covariate drift, variations of input space, but also accurately identifies the real drift, dynamic changes of both feature space and target space. DEVFNN is developed under the stacked generalization principle via the feature augmentation concept where a recently developed algorithm, namely Generic Classifier (gClass), drives the hidden layer. It is equipped by an automatic feature selection method which controls activation and deactivation of input attributes to induce varying subsets of input features. A deep network simplification procedure is put forward using the concept of hidden layer merging to prevent uncontrollable growth of input space dimension due to the nature of feature augmentation approach in building a deep network structure. DEVFNN works in the sample-wise fashion and is compatible for data stream applications. The efficacy of DEVFNN has been thoroughly evaluated using six datasets with non-stationary properties under the prequential test-then-train protocol. It has been compared with four state-of the art data stream methods and its shallow counterpart where DEVFNN demonstrates improvement of classification accuracy. Deep Expander Network(X-Net) Deep Neural Networks, while being unreasonably effective for several vision tasks, have their usage limited by the computational and memory requirements, both during training and inference stages. Analyzing and improving the connectivity patterns between layers of a network has resulted in several compact architectures like GoogleNet, ResNet and DenseNet-BC. In this work, we utilize results from graph theory to develop an efficient connection pattern between consecutive layers. Specifically, we use {\it expander graphs} that have excellent connectivity properties to develop a sparse network architecture, the deep expander network (X-Net). The X-Nets are shown to have high connectivity for a given level of sparsity. We also develop highly efficient training and inference algorithms for such networks. Experimental results show that we can achieve the similar or better accuracy as DenseNet-BC with two-thirds the number of parameters and FLOPs on several image classification benchmarks. We hope that this work motivates other approaches to utilize results from graph theory to develop efficient network architectures. Deep Extrofitting The retrofitting techniques, which inject external resources into word representations, have compensated the weakness of distributed representations in semantic and relational knowledge between words. Implicitly retrofitting word vectors by expansional technique (extrofitting), showed that our method outperforms retrofitting in word similarity task as well as is good at generalization. In this paper, we propose deep extrofitting, which is to stack extrofitting in depth. Furthermore, inspired by learning theory, we combine retrofitting with extrofitting, optimizing the result between specialization and generalization. When experimenting with GloVe, we show that our deep extrofitting not only outperforms the previous methods on most of word similarity task but also requires only synonyms. We also report further analysis on the effect of deep extrofitted word vectors on text classification task, resulting in the improvement on the performances. Deep Factor Alpha Deep Factor Alpha provides a framework for extracting nonlinear factors information to explain the time-series cross-section properties of asset returns. Sorting securities based on firm characteristics is viewed as a nonlinear activation function which can be implemented within a deep learning architecture. Multi-layer deep learners are constructed to augment traditional long-short factor models. Searching firm characteristic space over deep architectures of nonlinear transformations is compatible with the economic goal of eliminating mispricing Alphas. Joint estimation of factors and betas is achieved with stochastic gradient descent. To illustrate our methodology, we design long-short latent factors in a train-validation-testing framework of US stock market asset returns from 1975 to 2017. We perform an out-of-sample study to analyze Fama-French factors, in both the cross-section and time-series, versus their deep learning counterparts. Finally, we conclude with directions for future research. Deep Factor Model We propose to represent a return model and risk model in a unified manner with deep learning, which is a representative model that can express a nonlinear relationship. Although deep learning performs quite well, it has significant disadvantages such as a lack of transparency and limitations to the interpretability of the prediction. This is prone to practical problems in terms of accountability. Thus, we construct a multifactor model by using interpretable deep learning. We implement deep learning as a return model to predict stock returns with various factors. Then, we present the application of layer-wise relevance propagation (LRP) to decompose attributes of the predicted return as a risk model. By applying LRP to an individual stock or a portfolio basis, we can determine which factor contributes to prediction. We call this model a deep factor model. We then perform an empirical analysis on the Japanese stock market and show that our deep factor model has better predictive capability than the traditional linear model or other machine learning methods. In addition , we illustrate which factor contributes to prediction. Deep Feature Aggregation Network(DFANet) This paper introduces an extremely efficient CNN architecture named DFANet for semantic segmentation under resource constraints. Our proposed network starts from a single lightweight backbone and aggregates discriminative features through sub-network and sub-stage cascade respectively. Based on the multi-scale feature propagation, DFANet substantially reduces the number of parameters, but still obtains sufficient receptive field and enhances the model learning ability, which strikes a balance between the speed and segmentation performance. Experiments on Cityscapes and CamVid datasets demonstrate the superior performance of DFANet with 8$\times$ less FLOPs and 2$\times$ faster than the existing state-of-the-art real-time semantic segmentation methods while providing comparable accuracy. Specifically, it achieves 70.3\% Mean IOU on the Cityscapes test dataset with only 1.7 GFLOPs and a speed of 160 FPS on one NVIDIA Titan X card, and 71.3\% Mean IOU with 3.4 GFLOPs while inferring on a higher resolution image. Deep Feature Factorization We propose Deep Feature Factorization (DFF), a method capable of localizing similar semantic concepts within an image or a set of images. We use DFF to gain insight into a deep convolutional neural network’s learned features, where we detect hierarchical cluster structures in feature space. This is visualized as heat maps, which highlight semantically matching regions across a set of images, revealing what the network perceives’ as similar. DFF can also be used to perform co-segmentation and co-localization, and we report state-of-the-art results on these tasks. Deep Feature Fusion-Audio and Text Modal Fusion(DFF-ATMF) Sentiment analysis research has been rapidly developing in the last decade and has attracted widespread attention from academia and industry, most of which is based on text. However, the information in the real world usually comes as different modalities. In this paper, we consider the task of Multimodal Sentiment Analysis, using Audio and Text Modalities, proposed a novel fusion strategy including Multi-Feature Fusion and Multi-Modality Fusion to improve the accuracy of Audio-Text Sentiment Analysis. We call this the Deep Feature Fusion-Audio and Text Modal Fusion (DFF-ATMF) model, and the features learned from it are complementary to each other and robust. Experiments with the CMU-MOSI corpus and the recently released CMU-MOSEI corpus for Youtube video sentiment analysis show the very competitive results of our proposed model. Surprisingly, our method also achieved the state-of-the-art results in the IEMOCAP dataset, indicating that our proposed fusion strategy is also extremely generalization ability to Multimodal Emotion Recognition. Deep Feature Selection Using Paired-Input Nonlinear Knockoffs(DeepPINK) Deep learning has become increasingly popular in both supervised and unsupervised machine learning thanks to its outstanding empirical performance. However, because of their intrinsic complexity, most deep learning methods are largely treated as black box tools with little interpretability. Even though recent attempts have been made to facilitate the interpretability of deep neural networks (DNNs), existing methods are susceptible to noise and lack of robustness. Therefore, scientists are justifiably cautious about the reproducibility of the discoveries, which is often related to the interpretability of the underlying statistical models. In this paper, we describe a method to increase the interpretability and reproducibility of DNNs by incorporating the idea of feature selection with controlled error rate. By designing a new DNN architecture and integrating it with the recently proposed knockoffs framework, we perform feature selection with a controlled error rate, while maintaining high power. This new method, DeepPINK (Deep feature selection using Paired-Input Nonlinear Knockoffs), is applied to both simulated and real data sets to demonstrate its empirical utility. Deep Feature Synthesis(DFS) In this paper, we develop the Data Science Machine, which is able to derive predictive models from raw data automatically. To achieve this automation, we first propose and develop the Deep Feature Synthesis algorithm for automatically generating features for relational datasets. The algorithm follows relationships in the data to a base field, and then sequentially applies mathematical functions along that path to create the final feature. Second, we implement a generalizable machine learning pipeline and tune it using a novel Gaussian Copula process based approach. We entered the Data Science Machine in 3 data science competitions that featured 906 other data science teams. Our approach beats 615 teams in these data science competitions. In 2 of the 3 competitions we beat a majority of competitors, and in the third, we achieved 94% of the best competitor’s score. In the best case, with an ongoing competition, we beat 85.6% of the teams and achieved 95.7% of the top submissions score. Deep Feature Synthesis: How Automated Feature Engineering Works Deep Frame Interpolation This work presents a supervised learning based approach to the computer vision problem of frame interpolation. The presented technique could also be used in the cartoon animations since drawing each individual frame consumes a noticeable amount of time. The most existing solutions to this problem use unsupervised methods and focus only on real life videos with already high frame rate. However, the experiments show that such methods do not work as well when the frame rate becomes low and object displacements between frames becomes large. This is due to the fact that interpolation of the large displacement motion requires knowledge of the motion structure thus the simple techniques such as frame averaging start to fail. In this work the deep convolutional neural network is used to solve the frame interpolation problem. In addition, it is shown that incorporating the prior information such as optical flow improves the interpolation quality significantly. Deep Fundamental Factor Models Deep fundamental factor models are developed to interpret and capture non-linearity, interaction effects and non-parametric shocks in financial econometrics. Uncertainty quantification provides interpretability with interval estimation, ranking of factor importances and estimation of interaction effects. Estimating factor realizations under either homoscedastic or heteroscedastic error is also available. With no hidden layers we recover a linear factor model and for one or more hidden layers, uncertainty bands for the sensitivity to each input naturally arise from the network weights. To illustrate our methodology, we construct a six-factor model of assets in the S\&P 500 index and generate information ratios that are three times greater than generalized linear regression. We show that the factor importances are materially different from the linear factor model when accounting for non-linearity. Finally, we conclude with directions for future research Deep Gaussian Covariance Network(DGCP) The correlation length-scale next to the noise variance are the most used hyperparameters for the Gaussian processes. Typically, stationary covariance functions are used, which are only dependent on the distances between input points and thus invariant to the translations in the input space. The optimization of the hyperparameters is commonly done by maximizing the log marginal likelihood. This works quite well, if the distances are uniform distributed. In the case of a locally adapted or even sparse input space, the prediction of a test point can be worse dependent of its position. A possible solution to this, is the usage of a non-stationary covariance function, where the hyperparameters are calculated by a deep neural network. So that the correlation length scales and possibly the noise variance are dependent on the test point. Furthermore, different types of covariance functions are trained simultaneously, so that the Gaussian process prediction is an additive overlay of different covariance matrices. The right covariance functions combination and its hyperparameters are learned by the deep neural network. Additional, the Gaussian process will be able to be trained by batches or online and so it can handle arbitrarily large data sets. We call this framework Deep Gaussian Covariance Network (DGCP). There are also further extensions to this framework possible, for example sequentially dependent problems like time series or the local mixture of experts. The basic framework and some extension possibilities will be presented in this work. Moreover, a comparison to some recent state of the art surrogate model methods will be performed, also for a time dependent problem. Deep Gaussian Mixture Model Deep learning is a hierarchical inference method formed by subsequent multiple layers of learning able to more efficiently describe complex relationships. In this work, Deep Gaussian Mixture Models are introduced and discussed. A Deep Gaussian Mixture model (DGMM) is a network of multiple layers of latent variables, where, at each layer, the variables follow a mixture of Gaussian distributions. Thus, the deep mixture model consists of a set of nested mixtures of linear models, which globally provide a nonlinear model able to describe the data in a very flexible way. In order to avoid overparameterized solutions, dimension reduction by factor models can be applied at each layer of the architecture thus resulting in deep mixtures of factor analysers. Deep Gaussian Process(DGP) In this paper we introduce deep Gaussian process (GP) models. Deep GPs are a deep belief network based on Gaussian process mappings. The data is modeled as the output of a multivariate GP. The inputs to that Gaussian process are then governed by another GP. A single layer model is equivalent to a standard GP or the GP latent variable model (GP-LVM). We perform inference in the model by approximate variational marginalization. This results in a strict lower bound on the marginal likelihood of the model which we use for model selection (number of layers and nodes per layer). Deep belief networks are typically applied to relatively large data sets using stochastic gradient descent for optimization. Our fully Bayesian treatment allows for the application of deep models even when data is scarce. Model selection by our variational bound shows that a five layer hierarchy is justified even when modelling a digit data set containing only 150 examples. Robust Deep Gaussian Processes Deep Generalized Canonical Correlation Analysis(DGCCA) We present Deep Generalized Canonical Correlation Analysis (DGCCA) — a method for learning nonlinear transformations of arbitrarily many views of data, such that the resulting transformations are maximally informative of each other. While methods for nonlinear two-view representation learning (Deep CCA, (Andrew et al., 2013)) and linear many-view representation learning (Generalized CCA (Horst, 1961)) exist, DGCCA is the first CCA-style multiview representation learning technique that combines the flexibility of nonlinear (deep) representation learning with the statistical power of incorporating information from many independent sources, or views. We present the DGCCA formulation as well as an efficient stochastic optimization algorithm for solving it. We learn DGCCA repre- sentations on two distinct datasets for three downstream tasks: phonetic transcrip- tion from acoustic and articulatory measurements, and recommending hashtags and friends on a dataset of Twitter users. We find that DGCCA representations soundly beat existing methods at phonetic transcription and hashtag recommendation, and in general perform no worse than standard linear many-view techniques. Deep Generative Markov State Model(DeepGenMSM) We propose a deep generative Markov State Model (DeepGenMSM) learning framework for inference of metastable dynamical systems and prediction of trajectories. After unsupervised training on time series data, the model contains (i) a probabilistic encoder that maps from high-dimensional configuration space to a small-sized vector indicating the membership to metastable (long-lived) states, (ii) a Markov chain that governs the transitions between metastable states and facilitates analysis of the long-time dynamics, and (iii) a generative part that samples the conditional distribution of configurations in the next time step. The model can be operated in a recursive fashion to generate trajectories to predict the system evolution from a defined starting state and propose new configurations. The DeepGenMSM is demonstrated to provide accurate estimates of the long-time kinetics and generate valid distributions for molecular dynamics (MD) benchmark systems. Remarkably, we show that DeepGenMSMs are able to make long time-steps in molecular configuration space and generate physically realistic structures in regions that were not seen in training data. Deep Generative Model(DGM) We consider the semi-supervised clustering problem where crowdsourcing provides noisy information about the pairwise comparisons on a small subset of data, i.e., whether a sample pair is in the same cluster. We propose a new approach that includes a deep generative model (DGM) to characterize low-level features of the data, and a statistical relational model for noisy pairwise annotations on its subset. The two parts share the latent variables. To make the model automatically trade-off between its complexity and fitting data, we also develop its fully Bayesian variant. The challenge of inference is addressed by fast (natural-gradient) stochastic variational inference algorithms, where we effectively combine variational message passing for the relational part and amortized learning of the DGM under a unified framework. Empirical results on synthetic and real-world datasets show that our model outperforms previous crowdsourced clustering methods. Deep Genetic Network Optimizing a neural network’s performance is a tedious and time taking process, this iterative process does not have any defined solution which can work for all the problems. Optimization can be roughly categorized into – Architecture and Hyperparameter optimization. Many algorithms have been devised to address this problem. In this paper we introduce a neural network architecture (Deep Genetic Network) which will optimize its parameters during training based on its fitness. Deep Genetic Net uses genetic algorithms along with deep neural networks to address the hyperparameter optimization problem, this approach uses ideas like mating and mutation which are key to genetic algorithms which help the neural net architecture to learn to optimize its hyperparameters by itself rather than depending on a person to explicitly set the values. Using genetic algorithms for this problem proved to work exceptionally well when given enough time to train the network. The proposed architecture is found to work well in optimizing hyperparameters in affine, convolutional and recurrent layers proving to be a good choice for conventional supervised learning tasks. Deep Gradient Compression(DGC) Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and Librispeech Corpus. On these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270x to 600x without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB. Deep gradient compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile. Deep Graph Infomax(DGI) We present Deep Graph Infomax (DGI), a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs—both derived using established graph convolutional network architectures. The learnt patch representations summarize subgraphs centered around nodes of interest, and can thus be reused for downstream node-wise learning tasks. In contrast to most prior approaches to graph representation learning, DGI does not rely on random walks, and is readily applicable to both transductive and inductive learning setups. We demonstrate competitive performance on a variety of node classification benchmarks, which at times even exceeds the performance of supervised learning. Deep Graph Translation Inspired by the tremendous success of deep generative models on generating continuous data like image and audio, in the most recent year, few deep graph generative models have been proposed to generate discrete data such as graphs. They are typically unconditioned generative models which has no control on modes of the graphs being generated. Differently, in this paper, we are interested in a new problem named \emph{Deep Graph Translation}: given an input graph, we want to infer a target graph based on their underlying (both global and local) translation mapping. Graph translation could be highly desirable in many applications such as disaster management and rare event forecasting, where the rare and abnormal graph patterns (e.g., traffic congestions and terrorism events) will be inferred prior to their occurrence even without historical data on the abnormal patterns for this graph (e.g., a road network or human contact network). To achieve this, we propose a novel Graph-Translation-Generative Adversarial Networks (GT-GAN) which will generate a graph translator from input to target graphs. GT-GAN consists of a graph translator where we propose new graph convolution and deconvolution layers to learn the global and local translation mapping. A new conditional graph discriminator has also been proposed to classify target graphs by conditioning on input graphs. Extensive experiments on multiple synthetic and real-world datasets demonstrate the effectiveness and scalability of the proposed GT-GAN. Deep Grid Net(DGN) Grid maps obtained from fused sensory information are nowadays among the most popular approaches for motion planning for autonomous driving cars. In this paper, we introduce Deep Grid Net (DGN), a deep learning (DL) system designed for understanding the context in which an autonomous car is driving. DGN incorporates a learned driving environment representation based on Occupancy Grids (OG) obtained from raw Lidar data and constructed on top of the Dempster-Shafer (DS) theory. The predicted driving context is further used for switching between different driving strategies implemented within EB robinos, Elektrobit’s Autonomous Driving (AD) software platform. Based on genetic algorithms (GAs), we also propose a neuroevolutionary approach for learning the tuning hyperparameters of DGN. The performance of the proposed deep network has been evaluated against similar competing driving context estimation classifiers. Deep group-feature selection using Knockoff(Deep-gKnock) Feature selection is central to contemporary high-dimensional data analysis. Grouping structure among features arises naturally in various scientific problems. Many methods have been proposed to incorporate the grouping structure information into feature selection. However, these methods are normally restricted to a linear regression setting. To relax the linear constraint, we combine the deep neural networks (DNNs) with the recent Knockoffs technique, which has been successful in an individual feature selection context. We propose Deep-gKnock (Deep group-feature selection using Knockoffs) as a methodology for model interpretation and dimension reduction. Deep-gKnock performs model-free group-feature selection by controlling group-wise False Discovery Rate (gFDR). Our method improves the interpretability and reproducibility of DNNs. Experimental results on both synthetic and real data demonstrate that our method achieves superior power and accurate gFDR control compared with state-of-the-art methods. Deep Haar Scattering Network An orthogonal Haar scattering transform is a deep network, computed with a hierarchy of additions, subtractions and absolute values, over pairs of coefficients. It provides a simple mathematical model for unsupervised deep network learning. It implements non-linear contractions, which are optimized for classification, with an unsupervised pair matching algorithm, of polynomial complexity. A structured Haar scattering over graph data computes permutation invariant representations of groups of connected points in the graph. If the graph connectivity is unknown, unsupervised Haar pair learning can provide a consistent estimation of connected dyadic groups of points. Classification results are given on image data bases, defined on regular grids or graphs, with a connectivity which may be known or unknown. Deep Hashing Neural Network(HNN) In this paper we propose a synergistic melting of neural networks and decision trees into a deep hashing neural network (HNN) having a modeling capability exponential with respect to its number of neurons. We first derive a soft decision tree named neural decision tree allowing the optimization of arbitrary decision function at each split node. We then rewrite this soft space partitioning as a new kind of neural network layer, namely the hashing layer (HL), which can be seen as a generalization of the known soft-max layer. This HL can easily replace the standard last layer of ANN in any known network topology and thus can be used after a convolutional or recurrent neural network for example. We present the modeling capacity of this deep hashing function on small datasets where one can reach at least equally good results as standard neural networks by diminishing the number of output neurons. Finally, we show that for the case where the number of output neurons is large, the neural network can mitigate the absence of linear decision boundaries by learning for each difficult class a collection of not necessarily connected sub-regions of the space leading to more flexible decision surfaces. Finally, the HNN can be seen as a deep locality sensitive hashing function which can be trained in a supervised or unsupervised setting as we will demonstrate for classification and regression problems. Deep Hierarchical Machine(DHM) We propose Deep Hierarchical Machine (DHM), a model inspired from the divide-and-conquer strategy while emphasizing representation learning ability and flexibility. A stochastic routing framework as used by recent deep neural decision/regression forests is incorporated, but we remove the need to evaluate unnecessary computation paths by utilizing a different topology and introducing a probabilistic pruning technique. We also show a specified version of DHM (DSHM) for efficiency, which inherits the sparse feature extraction process as in traditional decision tree with pixel-difference feature. To achieve sparse feature extraction, we propose to utilize sparse convolution operation in DSHM and show one possibility of introducing sparse convolution kernels by using local binary convolution layer. DHM can be applied to both classification and regression problems, and we validate it on standard image classification and face alignment tasks to show its advantages over past architectures. Deep Hyperalignment(DHA) This paper proposes Deep Hyperalignment (DHA) as a regularized, deep extension, scalable Hyperalignment (HA) method, which is well-suited for applying functional alignment to fMRI datasets with nonlinearity, high-dimensionality (broad ROI), and a large number of subjects. Unlink previous methods, DHA is not limited by a restricted fixed kernel function. Further, it uses a parametric approach, rank-$m$ Singular Value Decomposition (SVD), and stochastic gradient descent for optimization. Therefore, DHA has a suitable time complexity for large datasets, and DHA does not require the training data when it computes the functional alignment for a new subject. Experimental studies on multi-subject fMRI analysis confirm that the DHA method achieves superior performance to other state-of-the-art HA algorithms. Deep Hyperspherical Learning ➘ “Hyperspherical Convolution” Deep Image Prior(DIP) Deep convolutional networks have become a popular tool for image generation and restoration. Generally, their excellent performance is imputed to their ability to learn realistic image priors from a large number of example images. In this paper, we show that, on the contrary, the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning. In order to do so, we show that a randomly-initialized neural network can be used as a handcrafted prior with excellent results in standard inverse problems such as denoising, super-resolution, and inpainting. Furthermore, the same prior can be used to invert deep neural representations to diagnose them, and to restore images based on flash-no flash input pairs. Apart from its diverse applications, our approach highlights the inductive bias captured by standard generator network architectures. It also bridges the gap between two very popular families of image restoration methods: learning-based methods using deep convolutional networks and learning-free methods based on handcrafted image priors such as self-similarity. Regularization by architecture: A deep prior approach for inverse problems Deep Image Retargeting(DeepIR) We present \emph{Deep Image Retargeting} (\emph{DeepIR}), a coarse-to-fine framework for content-aware image retargeting. Our framework first constructs the semantic structure of input image with a deep convolutional neural network. Then a uniform re-sampling that suits for semantic structure preserving is devised to resize feature maps to target aspect ratio at each feature layer. The final retargeting result is generated by coarse-to-fine nearest neighbor field search and step-by-step nearest neighbor field fusion. We empirically demonstrate the effectiveness of our model with both qualitative and quantitative results on widely used RetargetMe dataset. Deep Incremental Boosting This paper introduces Deep Incremental Boosting, a new technique derived from AdaBoost, specifically adapted to work with Deep Learning methods, that reduces the required training time and improves generalisation. We draw inspiration from Transfer of Learning approaches to reduce the start-up time to training each incremental Ensemble member. We show a set of experiments that outlines some preliminary results on some common Deep Learning datasets and discuss the potential improvements Deep Incremental Boosting brings to traditional Ensemble methods in Deep Learning. Deep Information Network We describe a novel classifier with a tree structure, designed using information theory concepts. This Information Network is made of information nodes, that compress the input data, and multiplexers, that connect two or more input nodes to an output node. Each information node is trained, independently of the others, to minimize a local cost function that minimizes the mutual information between its input and output with the constraint of keeping a given mutual information between its output and the target (information bottleneck). We show that the system is able to provide good results in terms of accuracy, while it shows many advantages in terms of modularity and reduced complexity. Deep Interest Evolution Network(DIEN) Click-through rate~(CTR) prediction, whose goal is to estimate the probability of the user clicks, has become one of the core tasks in advertising systems. For CTR prediction model, it is necessary to capture the latent user interest behind the user behavior data. Besides, considering the changing of the external environment and the internal cognition, user interest evolves over time dynamically. There are several CTR prediction methods for interest modeling, while most of them regard the representation of behavior as the interest directly, and lack specially modeling for latent interest behind the concrete behavior. Moreover, few work consider the changing trend of interest. In this paper, we propose a novel model, named Deep Interest Evolution Network~(DIEN), for CTR prediction. Specifically, we design interest extractor layer to capture temporal interests from history behavior sequence. At this layer, we introduce an auxiliary loss to supervise interest extracting at each step. As user interests are diverse, especially in the e-commerce system, we propose interest evolving layer to capture interest evolving process that is relative to the target item. At interest evolving layer, attention mechanism is embedded into the sequential structure novelly, and the effects of relative interests are strengthened during interest evolution. In the experiments on both public and industrial datasets, DIEN significantly outperforms the state-of-the-art solutions. Notably, DIEN has been deployed in the display advertisement system of Taobao, and obtained 20.7\% improvement on CTR. Deep Inverse Optimization Given a set of observations generated by an optimization process, the goal of inverse optimization is to determine likely parameters of that process. We cast inverse optimization as a form of deep learning. Our method, called deep inverse optimization, is to unroll an iterative optimization process and then use backpropagation to learn parameters that generate the observations. We demonstrate that by backpropagating through the interior point algorithm we can learn the coefficients determining the cost vector and the constraints, independently or jointly, for both non-parametric and parametric linear programs, starting from one or multiple observations. With this approach, inverse optimization can leverage concepts and algorithms from deep learning. Deep Invertible Network(i-RevNet) It is widely believed that the success of deep convolutional networks is based on progressively discarding uninformative variability about the input with respect to the problem at hand. This is supported empirically by the difficulty of recovering images from their hidden representations, in most commonly used network architectures. In this paper we show via a one-to-one mapping that this loss of information is not a necessary condition to learn representations that generalize well on complicated problems, such as ImageNet. Via a cascade of homeomorphic layers, we build the i-RevNet, a network that can be fully inverted up to the final projection onto the classes, i.e. no information is discarded. Building an invertible architecture is difficult, for one, because the local inversion is ill-conditioned, we overcome this by providing an explicit inverse. An analysis of i-RevNets learned representations suggests an alternative explanation for the success of deep networks by a progressive contraction and linear separation with depth. To shed light on the nature of the model learned by the i-RevNet we reconstruct linear interpolations between natural image representations. Deep Item Response Theory(Deep-IRT) Deep learning based knowledge tracing model has been shown to outperform traditional knowledge tracing model without the need for human-engineered features, yet its parameters and representations have long been criticized for not being explainable. In this paper, we propose Deep-IRT (deep item response theory) which is a synthesis of the item response theory (IRT) model and a knowledge tracing model that is based on the deep neural network architecture called dynamic key-value memory network (DKVMN) to make deep learning based knowledge tracing explainable. Specifically, we use the DKVMN model to process the student’s learning trajectory and estimate the student ability level and the item difficulty level over time. Then, we use the IRT model to estimate the probability that a student will answer an item correctly using the estimated student ability and the item difficulty. Experiments show that the Deep-IRT model retains the performance of the DKVMN model, while it provides a direct psychological interpretation of both students and items. Deep Kernelized Autoencoder Autoencoders learn data representations (codes) in such a way that the input is reproduced at the output of the network. However, it is not always clear what kind of properties of the input data need to be captured by the codes. Kernel machines have experienced great success by operating via inner-products in a theoretically well-defined reproducing kernel Hilbert space, hence capturing topological properties of input data. In this paper, we enhance the autoencoder’s ability to learn effective data representations by aligning inner products between codes with respect to a kernel matrix. By doing so, the proposed kernelized autoencoder allows learning similarity-preserving embeddings of input data, where the notion of similarity is explicitly controlled by the user and encoded in a positive semi-definite kernel matrix. Experiments are performed for evaluating both reconstruction and kernel alignment performance in classification tasks and visualization of high-dimensional data. Additionally, we show that our method is capable to emulate kernel principal component analysis on a denoising task, obtaining competitive results at a much lower computational cost. Deep k-Means The current trend of pushing CNNs deeper with convolutions has created a pressing demand to achieve higher compression gains on CNNs where convolutions dominate the computation and parameter amount (e.g., GoogLeNet, ResNet and Wide ResNet). Further, the high energy consumption of convolutions limits its deployment on mobile devices. To this end, we proposed a simple yet effective scheme for compressing convolutions though applying k-means clustering on the weights, compression is achieved through weight-sharing, by only recording $K$ cluster centers and weight assignment indexes. We then introduced a novel spectrally relaxed $k$-means regularization, which tends to make hard assignments of convolutional layer weights to $K$ learned cluster centers during re-training. We additionally propose an improved set of metrics to estimate energy consumption of CNN hardware implementations, whose estimation results are verified to be consistent with previously proposed energy estimation tool extrapolated from actual hardware measurements. We finally evaluated Deep $k$-Means across several CNN models in terms of both compression ratio and energy consumption reduction, observing promising results without incurring accuracy loss. The code is available at https://…/Deep-K-Means Deep $k$-Means: Jointly Clustering with $k$-Means and Learning Representations Deep k-Nearest Neighbors(DkNN) Deep neural networks (DNNs) enable innovative applications of machine learning like image recognition, machine translation, or malware detection. However, deep learning is often criticized for its lack of robustness in adversarial settings (e.g., vulnerability to adversarial inputs) and general inability to rationalize its predictions. In this work, we exploit the structure of deep learning to enable new learning-based inference and decision strategies that achieve desirable properties such as robustness and interpretability. We take a first step in this direction and introduce the Deep k-Nearest Neighbors (DkNN). This hybrid classifier combines the k-nearest neighbors algorithm with representations of the data learned by each layer of the DNN: a test input is compared to its neighboring training points according to the distance that separates them in the representations. We show the labels of these neighboring points afford confidence estimates for inputs outside the model’s training manifold, including on malicious inputs like adversarial examples–and therein provides protections against inputs that are outside the models understanding. This is because the nearest neighbors can be used to estimate the nonconformity of, i.e., the lack of support for, a prediction in the training data. The neighbors also constitute human-interpretable explanations of predictions. We evaluate the DkNN algorithm on several datasets, and show the confidence estimates accurately identify inputs outside the model, and that the explanations provided by nearest neighbors are intuitive and useful in understanding model failures. Deep Knowledge-Aware Network(DKN) Online news recommender systems aim to address the information explosion of news and make personalized recommendation for users. In general, news language is highly condensed, full of knowledge entities and common sense. However, existing methods are unaware of such external knowledge and cannot fully discover latent knowledge-level connections among news. The recommended results for a user are consequently limited to simple patterns and cannot be extended reasonably. Moreover, news recommendation also faces the challenges of high time-sensitivity of news and dynamic diversity of users’ interests. To solve the above problems, in this paper, we propose a deep knowledge-aware network (DKN) that incorporates knowledge graph representation into news recommendation. DKN is a content-based deep recommendation framework for click-through rate prediction. The key component of DKN is a multi-channel and word-entity-aligned knowledge-aware convolutional neural network (KCNN) that fuses semantic-level and knowledge-level representations of news. KCNN treats words and entities as multiple channels, and explicitly keeps their alignment relationship during convolution. In addition, to address users’ diverse interests, we also design an attention module in DKN to dynamically aggregate a user’s history with respect to current candidate news. Through extensive experiments on a real online news platform, we demonstrate that DKN achieves substantial gains over state-of-the-art deep recommendation models. We also validate the efficacy of the usage of knowledge in DKN. Deep Laplacian Pyramid Super-Resolution Network Convolutional neural networks have recently demonstrated high-quality reconstruction for single image super-resolution. However, existing methods often require a large number of network parameters and entail heavy computational loads at runtime for generating high-accuracy super-resolution results. In this paper, we propose the deep Laplacian Pyramid Super-Resolution Network for fast and accurate image super-resolution. The proposed network progressively reconstructs the sub-band residuals of high-resolution images at multiple pyramid levels. In contrast to existing methods that involve the bicubic interpolation for pre-processing (which results in large feature maps), the proposed method directly extracts features from the low-resolution input space and thereby entails low computational loads. We train the proposed network with deep supervision using the robust Charbonnier loss functions and achieve high-quality image reconstruction. Furthermore, we utilize the recursive layers to share parameters across as well as within pyramid levels, and thus drastically reduce the number of parameters. Extensive quantitative and qualitative evaluations on benchmark datasets show that the proposed algorithm performs favorably against the state-of-the-art methods in terms of run-time and image quality. Deep Layer Aggregation Convolutional networks have had great success in image classification and other areas of computer vision. Recent efforts have designed deeper or wider networks to improve performance; as convolutional blocks are usually stacked together, blocks at different depths represent information at different scales. Recent models have explored skip’ connections to aggregate information across layers, but heretofore such skip connections have themselves been shallow’, projecting to a single fusion node. In this paper, we investigate new deep-across-layer architectures to aggregate the information from multiple layers. We propose novel iterative and hierarchical structures for deep layer aggregation. The former can produce deep high resolution representations from a network whose final layers have low resolution, while the latter can effectively combine scale information from all blocks. Results show that the our proposed architectures can make use of network parameters and features more efficiently without dictating convolution module structure. We also show transfer of the learned networks to semantic segmentation tasks and achieve better results than alternative networks with baseline training settings. Deep LDA Hashing The conventional supervised hashing methods based on classification do not entirely meet the requirements of hashing technique, but Linear Discriminant Analysis (LDA) does. In this paper, we propose to perform a revised LDA objective over deep networks to learn efficient hashing codes in a truly end-to-end fashion. However, the complicated eigenvalue decomposition within each mini-batch in every epoch has to be faced with when simply optimizing the deep network w.r.t. the LDA objective. In this work, the revised LDA objective is transformed into a simple least square problem, which naturally overcomes the intractable problems and can be easily solved by the off-the-shelf optimizer. Such deep extension can also overcome the weakness of LDA Hashing in the limited linear projection and feature learning. Amounts of experiments are conducted on three benchmark datasets. The proposed Deep LDA Hashing shows nearly 70 points improvement over the conventional one on the CIFAR-10 dataset. It also beats several state-of-the-art methods on various metrics. Deep Learning Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations. Deep learning is part of a broader family of machine learning methods based on learning representations. An observation (e.g., an image) can be represented in many ways (e.g., a vector of pixels), but some representations make it easier to learn tasks of interest (e.g., is this the image of a human face?) from examples, and research in this area attempts to define what makes better representations and how to create models to learn these representations. Various deep learning architectures such as deep neural networks, convolutional deep neural networks, and deep belief networks have been applied to fields like computer vision, automatic speech recognition, natural language processing, and music/audio signal recognition where they have been shown to produce state-of-the-art results on various tasks. Deep Learning Accelerator Unit(DLAU) As the emerging field of machine learning, deep learning shows excellent ability in solving complex learning problems. However, the size of the networks becomes increasingly large scale due to the demands of the practical applications, which poses significant challenge to construct a high performance implementations of deep learning neural networks. In order to improve the performance as well to maintain the low power cost, in this paper we design DLAU, which is a scalable accelerator architecture for large-scale deep learning networks using FPGA as the hardware prototype. The DLAU accelerator employs three pipelined processing units to improve the throughput and utilizes tile techniques to explore locality for deep learning applications. Experimental results on the state-of-the-art Xilinx FPGA board demonstrate that the DLAU accelerator is able to achieve up to 36.1x speedup comparing to the Intel Core2 processors, with the power consumption at 234mW. Deep Learning Alternating Minimization(DLAM) In recent years, stochastic gradient descent (SGD) is a dominant optimization method for training deep neural networks. But the SGD suffers from several limitations including lack of theoretical guarantees, gradient vanishing, poor conditioning and difficulty in solving highly non-smooth constraints and functions, which motivates the development of alternating minimization-based methods for deep neural network optimization. However, as an emerging domain, there are still several challenges to overcome, where the major ones include: 1) no guarantee on the global convergence under mild conditions, and 2) low efficiency of computation for the subproblem optimization in each iteration. In this paper, we propose a novel deep learning alternating minimization (DLAM) algorithm to deal with those two challenges. Furthermore, global convergence of our DLAM algorithm is analyzed and guaranteed under mild conditions which are satisfied by commonly-used models. Experiments on real-world datasets demonstrate the effectiveness of our DLAM algorithm. Deep Learning Approximation Neural networks offer high-accuracy solutions to a range of problems, but are costly to run in production systems because of computational and memory requirements during a forward pass. Given a trained network, we propose a techique called Deep Learning Approximation to build a faster network in a tiny fraction of the time required for training by only manipulating the network structure and coefficients without requiring re-training or access to the training data. Speedup is achieved by by applying a sequential series of independent optimizations that reduce the floating-point operations (FLOPs) required to perform a forward pass. First, lossless optimizations are applied, followed by lossy approximations using singular value decomposition (SVD) and low-rank matrix decomposition. The optimal approximation is chosen by weighing the relative accuracy loss and FLOP reduction according to a single parameter specified by the user. On PASCAL VOC 2007 with the YOLO network, we show an end-to-end 2x speedup in a network forward pass with a 5% drop in mAP that can be re-gained by finetuning. Deep Learning Extreme Multi-Label Classification Extreme Multi-label classification (XML) is an important yet challenging machine learning task, that assigns to each instance its most relevant candidate labels from an extremely large label collection, where the numbers of labels, features and instances could be thousands or millions. XML is more and more on demand in the Internet industries, accompanied with the increasing business scale / scope and data accumulation. The extremely large label collections yield challenges such as computational complexity, inter-label dependency and noisy labeling. Many methods have been proposed to tackle these challenges, based on different mathematical formulations. In this paper, we propose a deep learning XML method, with a word-vector-based self-attention, followed by a ranking-based AutoEncoder architecture. The proposed method has three major advantages: 1) the autoencoder simultaneously considers the inter-label dependencies and the feature-label dependencies, by projecting labels and features onto a common embedding space; 2) the ranking loss not only improves the training efficiency and accuracy but also can be extended to handle noisy labeled data; 3) the efficient attention mechanism improves feature representation by highlighting feature importance. Experimental results on benchmark datasets show the proposed method is competitive to state-of-the-art methods. Deep Learning Impact(DLI) Deep Learning Impact (DLI) is a set of software tools to help users develop AI models with the leading open source deep learning frameworks, like TensorFlow and Caffe, for the deployment and prediction phases of deep learning. DLI enables users to run distributed deep learning workloads on x86 and Power, and complements the PowerAI deep learning software distribution. Deep Learning Library(DLL) Deep Learning Library (DLL) is a new library for machine learning with deep neural networks that focuses on speed. It supports feed-forward neural networks such as fully-connected Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs). It also has very comprehensive support for Restricted Boltzmann Machines (RBMs) and Convolutional RBMs. Our main motivation for this work was to propose and evaluate novel software engineering strategies with potential to accelerate runtime for training and inference. Such strategies are mostly independent of the underlying deep learning algorithms. On three different datasets and for four different neural network models, we compared DLL to five popular deep learning frameworks. Experimentally, it is shown that the proposed framework is systematically and significantly faster on CPU and GPU. In terms of classification performance, similar accuracies as the other frameworks are reported. Deep Learning Model Predictive Control(DeepMPC) The control of complex systems is of critical importance in many branches of science, engineering, and industry. Controlling an unsteady fluid flow is particularly important, as flow control is a key enabler for technologies in energy (e.g., wind, tidal, and combustion), transportation (e.g., planes, trains, and automobiles), security (e.g., tracking airborne contamination), and health (e.g., artificial hearts and artificial respiration). However, the high-dimensional, nonlinear, and multi-scale dynamics make real-time feedback control infeasible. Fortunately, these high-dimensional systems exhibit dominant, low-dimensional patterns of activity that can be exploited for effective control in the sense that knowledge of the entire state of a system is not required. Advances in machine learning have the potential to revolutionize flow control given its ability to extract principled, low-rank feature spaces characterizing such complex systems. We present a novel deep learning model predictive control (DeepMPC) framework that exploits low-rank features of the flow in order to achieve considerable improvements to control performance. Instead of predicting the entire fluid state, we use a recurrent neural network (RNN) to accurately predict the control relevant quantities of the system. The RNN is then embedded into a MPC framework to construct a feedback loop, and incoming sensor data is used to perform online updates to improve prediction accuracy. The results are validated using varying fluid flow examples of increasing complexity. Deep Learning Optimization Library(DLOPT) Deep learning hyper-parameter optimization is a tough task. Finding an appropriate network configuration is a key to success, however most of the times this labor is roughly done. In this work we introduce a novel library to tackle this problem, the Deep Learning Optimization Library: DLOPT. We briefly describe its architecture and present a set of use examples. This is an open source project developed under the GNU GPL v3 license and it is freely available at https://…/dlopt Deep Learning Optimizer Benchmark Suite(DeepOBS) Because the choice and tuning of the optimizer affects the speed, and ultimately the performance of deep learning, there is significant past and recent research in this area. Yet, perhaps surprisingly, there is no generally agreed-upon protocol for the quantitative and reproducible evaluation of optimization strategies for deep learning. We suggest routines and benchmarks for stochastic optimization, with special focus on the unique aspects of deep learning, such as stochasticity, tunability and generalization. As the primary contribution, we present DeepOBS, a Python package of deep learning optimization benchmarks. The package addresses key challenges in the quantitative assessment of stochastic optimizers, and automates most steps of benchmarking. The library includes a wide and extensible set of ready-to-use realistic optimization problems, such as training Residual Networks for image classification on ImageNet or character-level language prediction models, as well as popular classics like MNIST and CIFAR-10. The package also provides realistic baseline results for the most popular optimizers on these test problems, ensuring a fair comparison to the competition when benchmarking new optimizers, and without having to run costly experiments. It comes with output back-ends that directly produce LaTeX code for inclusion in academic publications. It supports TensorFlow and is available open source. Deep Learning Virtual Machine(DLVM) Many current approaches to deep learning make use of high-level toolkits such as TensorFlow, Torch, or Caffe. Toolkits such as Caffe have a layer-based programming framework with hard-coded gradients specified for each layer type, making research using novel layer types problematic. Toolkits such as Torch and TensorFlow define a computation graph in a host language such as Python, where each node represents a linear algebra operation parallelized as a compute kernel on GPU and stores the result of evaluation; some of these toolkits subsequently perform runtime interpretation over that graph, storing the results of forward calculations and reverse-accumulated gradients at each node. This approach is more flexible, but these toolkits take a very limited and ad-hoc approach to performing optimization. Also problematic are the facts that most toolkits lack type safety, and target only a single (usually GPU) architecture, limiting users’ abilities to make use of heterogeneous and emerging hardware architectures. We introduce a novel framework for high-level programming that addresses all of the above shortcomings. Deep Linear Discriminant Analysis(DeepLDA) We introduce Deep Linear Discriminant Analysis (DeepLDA) which learns linearly separable latent representations in an end-to-end fashion. Classic LDA extracts features which preserve class separability and is used for dimensionality reduction for many classification problems. The central idea of this paper is to put LDA on top of a deep neural network. This can be seen as a non-linear extension of classic LDA. Instead of maximizing the likelihood of target labels for individual samples, we propose an objective function that pushes the network to produce feature distributions which: (a) have low variance within the same class and (b) high variance between different classes. Our objective is derived from the general LDA eigenvalue problem and still allows to train with stochastic gradient descent and back-propagation. Deep Local Binary Patterns(Deep LBP) Local Binary Pattern (LBP) is a traditional descriptor for texture analysis that gained attention in the last decade. Being robust to several properties such as invariance to illumination translation and scaling, LBPs achieved state-of-the-art results in several applications. However, LBPs are not able to capture high-level features from the image, merely encoding features with low abstraction levels. In this work, we propose Deep LBP, which borrow ideas from the deep learning community to improve LBP expressiveness. By using parametrized data-driven LBP, we enable successive applications of the LBP operators with increasing abstraction levels. We validate the relevance of the proposed idea in several datasets from a wide range of applications. Deep LBP improved the performance of traditional and multiscale LBP in all cases. Deep Logic Model Deep learning is very effective at jointly learning feature representations and classification models, especially when dealing with high dimensional input patterns. Probabilistic logic reasoning, on the other hand, is capable to take consistent and robust decisions in complex environments. The integration of deep learning and logic reasoning is still an open-research problem and it is considered to be the key for the development of real intelligent agents. This paper presents Deep Logic Models, which are deep graphical models integrating deep learning and logic reasoning both for learning and inference. Deep Logic Models create an end-to-end differentiable architecture, where deep learners are embedded into a network implementing a continuous relaxation of the logic knowledge. The learning process allows to jointly learn the weights of the deep learners and the meta-parameters controlling the high-level reasoning. The experimental results show that the proposed methodology overtakes the limitations of the other approaches that have been proposed to bridge deep learning and reasoning. Deep Long Short Term Memory Network Based Approach to Learn the OR Function(LSTM-OR) Prognostics or Remaining Useful Life (RUL) Estimation from multi-sensor time series data is useful to enable condition-based maintenance and ensure high operational availability of equipment. We propose a novel deep learning based approach for Prognostics with Uncertainty Quantification that is useful in scenarios where: (i) access to labeled failure data is scarce due to rarity of failures (ii) future operational conditions are unobserved and (iii) inherent noise is present in the sensor readings. All three scenarios mentioned are unavoidable sources of uncertainty in the RUL estimation process often resulting in unreliable RUL estimates. To address (i), we formulate RUL estimation as an Ordinal Regression (OR) problem, and propose LSTM-OR: deep Long Short Term Memory (LSTM) network based approach to learn the OR function. We show that LSTM-OR naturally allows for incorporation of censored operational instances in training along with the failed instances, leading to more robust learning. To address (ii), we propose a simple yet effective approach to quantify predictive uncertainty in the RUL estimation models by training an ensemble of LSTM-OR models. Through empirical evaluation on C-MAPSS turbofan engine benchmark datasets, we demonstrate that LSTM-OR is significantly better than the commonly used deep metric regression based approaches for RUL estimation, especially when failed training instances are scarce. Further, our uncertainty quantification approach yields high quality predictive uncertainty estimates while also leading to improved RUL estimates compared to single best LSTM-OR models. Deep Loopy Neural Network Existing deep learning models may encounter great challenges in handling graph structured data. In this paper, we introduce a new deep learning model for graph data specifically, namely the deep loopy neural network. Significantly different from the previous deep models, inside the deep loopy neural network, there exist a large number of loops created by the extensive connections among nodes in the input graph data, which makes model learning an infeasible task. To resolve such a problem, in this paper, we will introduce a new learning algorithm for the deep loopy neural network specifically. Instead of learning the model variables based on the original model, in the proposed learning algorithm, errors will be back-propagated through the edges in a group of extracted spanning trees. Extensive numerical experiments have been done on several real-world graph datasets, and the experimental results demonstrate the effectiveness of both the proposed model and the learning algorithm in handling graph data. Deep Matching and Validation Network(DMVN) Image splicing is a very common image manipulation technique that is sometimes used for malicious purposes. A splicing detection and localization algorithm usually takes an input image and produces a binary decision indicating whether the input image has been manipulated, and also a segmentation mask that corresponds to the spliced region. Most existing splicing detection and localization pipelines suffer from two main shortcomings: 1) they use handcrafted features that are not robust against subsequent processing (e.g., compression), and 2) each stage of the pipeline is usually optimized independently. In this paper we extend the formulation of the underlying splicing problem to consider two input images, a query image and a potential donor image. Here the task is to estimate the probability that the donor image has been used to splice the query image, and obtain the splicing masks for both the query and donor images. We introduce a novel deep convolutional neural network architecture, called Deep Matching and Validation Network (DMVN), which simultaneously localizes and detects image splicing. The proposed approach does not depend on handcrafted features and uses raw input images to create deep learned representations. Furthermore, the DMVN is end-to-end op- timized to produce the probability estimates and the segmentation masks. Our extensive experiments demonstrate that this approach outperforms state-of-the-art splicing detection methods by a large margin in terms of both AUC score and speed. Deep Matching Autoencoder(DMAE) Increasingly many real world tasks involve data in multiple modalities or views. This has motivated the development of many effective algorithms for learning a common latent space to relate multiple domains. However, most existing cross-view learning algorithms assume access to paired data for training. Their applicability is thus limited as the paired data assumption is often violated in practice: many tasks have only a small subset of data available with pairing annotation, or even no paired data at all. In this paper we introduce Deep Matching Autoencoders (DMAE), which learn a common latent space and pairing from unpaired multi-modal data. Specifically we formulate this as a cross-domain representation learning and object matching problem. We simultaneously optimise parameters of representation learning auto-encoders and the pairing of unpaired multi-modal data. This framework elegantly spans the full regime from fully supervised, semi-supervised, and unsupervised (no paired data) multi-modal learning. We show promising results in image captioning, and on a new task that is uniquely enabled by our methodology: unsupervised classifier learning. Deep Maze ➘ “Dreaming Variational Autoencoder” Deep Mean Maps(DMM) The use of distributions and high-level features from deep architecture has become commonplace in modern computer vision. Both of these methodologies have separately achieved a great deal of success in many computer vision tasks. However, there has been little work attempting to leverage the power of these to methodologies jointly. To this end, this paper presents the Deep Mean Maps (DMMs) framework, a novel family of methods to non-parametrically represent distributions of features in convolutional neural network models. DMMs are able to both classify images using the distribution of top-level features, and to tune the top-level features for performing this task. We show how to implement DMMs using a special mean map layer composed of typical CNN operations, making both forward and backward propagation simple. Deep Memory(MemGEN) We propose a new learning paradigm called Deep Memory. It has the potential to completely revolutionize the Machine Learning field. Surprisingly, this paradigm has not been reinvented yet, unlike Deep Learning. At the core of this approach is the \textit{Learning By Heart} principle, well studied in primary schools all over the world. Inspired by poem recitation, or by $\pi$ decimal memorization, we propose a concrete algorithm that mimics human behavior. We implement this paradigm on the task of generative modeling, and apply to images, natural language and even the $\pi$ decimals as long as one can print them as text. The proposed algorithm even generated this paper, in a one-shot learning setting. In carefully designed experiments, we show that the generated samples are indistinguishable from the training examples, as measured by any statistical tests or metrics. Deep Meta-Learning Few-shot learning remains challenging for meta-learning that learns a learning algorithm (meta-learner) from many related tasks. In this work, we argue that this is due to the lack of a good representation for meta-learning, and propose deep meta-learning to integrate the representation power of deep learning into meta-learning. The framework is composed of three modules, a concept generator, a meta-learner, and a concept discriminator, which are learned jointly. The concept generator, e.g. a deep residual net, extracts a representation for each instance that captures its high-level concept, on which the meta-learner performs few-shot learning, and the concept discriminator recognizes the concepts. By learning to learn in the concept space rather than in the complicated instance space, deep meta-learning can substantially improve vanilla meta-learning, which is demonstrated on various few-shot image recognition problems. For example, on 5-way-1-shot image recognition on CIFAR-100 and CUB-200, it improves Matching Nets from 50.53% and 56.53% to 58.18% and 63.47%, improves MAML from 49.28% and 50.45% to 56.65% and 64.63%, and improves Meta-SGD from 53.83% and 53.34% to 61.62% and 66.95%, respectively. Deep Model Consolidation(DMC) Deep neural networks (DNNs) often suffer from ‘catastrophic forgetting’ during incremental learning (IL) — an abrupt degradation of performance on the original set of classes when the training objective is adapted to a newly added set of classes. Existing IL approaches attempting to overcome catastrophic forgetting tend to produce a model that is biased towards either the old classes or new classes, unless with the help of exemplars of the old data. To address this issue, we propose a class-incremental learning paradigm called Deep Model Consolidation (DMC), which works well even when the original training data is not available. The idea is to first train a separate model only for the new classes, and then combine the two individual models trained on data of two distinct set of classes (old classes and new classes) via a novel dual distillation training objective. The two models are consolidated by exploiting publicly available unlabeled auxiliary data. This overcomes the potential difficulties due to unavailability of original training data. Compared to the state-of-the-art techniques, DMC demonstrates significantly better performance in CIFAR-100 image classification and PASCAL VOC 2007 object detection benchmarks in the IL setting. Deep Modulation and Spectrum Assignment(DeepRMSA) This paper proposes DeepRMSA, a deep reinforcement learning framework for routing, modulation and spectrum assignment (RMSA) in elastic optical networks (EONs). DeepRMSA learns the correct online RMSA policies by parameterizing the policies with deep neural networks (DNNs) that can sense complex EON states. The DNNs are trained with experiences of dynamic lightpath provisioning. We first modify the asynchronous advantage actor-critic algorithm and present an episode-based training mechanism for DeepRMSA, namely, DeepRMSA-EP. DeepRMSA-EP divides the dynamic provisioning process into multiple episodes (each containing the servicing of a fixed number of lightpath requests) and performs training by the end of each episode. The optimization target of DeepRMSA-EP at each step of servicing a request is to maximize the cumulative reward within the rest of the episode. Thus, we obviate the need for estimating the rewards related to unknown future states. To overcome the instability issue in the training of DeepRMSA-EP due to the oscillations of cumulative rewards, we further propose a window-based flexible training mechanism, i.e., DeepRMSA-FLX. DeepRMSA-FLX attempts to smooth out the oscillations by defining the optimization scope at each step as a sliding window, and ensuring that the cumulative rewards always include rewards from a fixed number of requests. Evaluations with the two sample topologies show that DeepRMSA-FLX can effectively stabilize the training while achieving blocking probability reductions of more than 20.3% and 14.3%, when compared with the baselines. Deep Modulation Embedding Deep neural network has recently shown very promising applications in different research directions and attracted the industry attention as well. Although the idea was introduced in the past but just recently the main limitation of using this class of algorithms is solved by enabling parallel computing on GPU hardware. Opening the possibility of hardware prototyping with proven superiority of this class of algorithm, trigger several research directions in communication system too. Among them cognitive radio, modulation recognition, learning based receiver and transceiver are already given very interesting result in simulation and real experimental evaluation implemented on software defined radio. Specifically, modulation recognition is mostly approached as a classification problem which is a supervised learning framework. But it is here addressed as an unsupervised problem with introducing new features for training, a new loss function and investigating the robustness of the pipeline against several mismatch conditions. Deep Momentum Network While time series momentum is a well-studied phenomenon in finance, common strategies require the explicit definition of both a trend estimator and a position sizing rule. In this paper, we introduce Deep Momentum Networks — a hybrid approach which injects deep learning based trading rules into the volatility scaling framework of time series momentum. The model also simultaneously learns both trend estimation and position sizing in a data-driven manner, with networks directly trained by optimising the Sharpe ratio of the signal. Backtesting on a portfolio of 88 continuous futures contracts, we demonstrate that the Sharpe-optimised LSTM improved traditional methods by more than two times in the absence of transactions costs, and continue outperforming when considering transaction costs up to 2-3 basis points. To account for more illiquid assets, we also propose a turnover regularisation term which trains the network to factor in costs at run-time. Deep Motion Boundary Detection(MoBoNet) Motion boundary detection is a crucial yet challenging problem. Prior methods focus on analyzing the gradients and distributions of optical flow fields, or use hand-crafted features for motion boundary learning. In this paper, we propose the first dedicated end-to-end deep learning approach for motion boundary detection, which we term as MoBoNet. We introduce a refinement network structure which takes source input images, initial forward and backward optical flows as well as corresponding warping errors as inputs and produces high-resolution motion boundaries. Furthermore, we show that the obtained motion boundaries, through a fusion sub-network we design, can in turn guide the optical flows for removing the artifacts. The proposed MoBoNet is generic and works with any optical flows. Our motion boundary detection and the refined optical flow estimation achieve results superior to the state of the art. Deep Multimodal Attention Network(DMAN) Learning social media data embedding by deep models has attracted extensive research interest as well as boomed a lot of applications, such as link prediction, classification, and cross-modal search. However, for social images which contain both link information and multimodal contents (e.g., text description, and visual content), simply employing the embedding learnt from network structure or data content results in sub-optimal social image representation. In this paper, we propose a novel social image embedding approach called Deep Multimodal Attention Networks (DMAN), which employs a deep model to jointly embed multimodal contents and link information. Specifically, to effectively capture the correlations between multimodal contents, we propose a multimodal attention network to encode the fine-granularity relation between image regions and textual words. To leverage the network structure for embedding learning, a novel Siamese-Triplet neural network is proposed to model the links among images. With the joint deep model, the learnt embedding can capture both the multimodal contents and the nonlinear network information. Extensive experiments are conducted to investigate the effectiveness of our approach in the applications of multi-label classification and cross-modal search. Compared to state-of-the-art image embeddings, our proposed DMAN achieves significant improvement in the tasks of multi-label classification and cross-modal search. Deep Multimodal Learning(DML) Videos have become ubiquitous on the Internet. And video analysis can provide lots of information for detecting and recognizing objects as well as help people understand human actions and interactions with the real world. However, facing data as huge as TB level, effective methods should be applied. Recurrent neural network (RNN) architecture has wildly been used on many sequential learning problems such as Language Model, Time-Series Analysis, etc. In this paper, we propose some variations of RNN such as stacked bidirectional LSTM/GRU network with attention mechanism to categorize large-scale video data. We also explore different multimodal fusion methods. Our model combines both visual and audio information on both video and frame level and received great result. Ensemble methods are also applied. Because of its multimodal characteristics, we decide to call this method Deep Multimodal Learning(DML). Our DML-based model was trained on Google Cloud and our own server and was tested in a well-known video classification competition on Kaggle held by Google. Deep Multimodal Subspace Clustering Network We present convolutional neural network (CNN) based approaches for unsupervised multimodal subspace clustering. The proposed framework consists of three main stages – multimodal encoder, self-expressive layer, and multimodal decoder. The encoder takes multimodal data as input and fuses them to a latent space representation. We investigate early, late and intermediate fusion techniques and propose three different encoders corresponding to them for spatial fusion. The self-expressive layers and multimodal decoders are essentially the same for different spatial fusion-based approaches. In addition to various spatial fusion-based methods, an affinity fusion-based network is also proposed in which the self-expressiveness layer corresponding to different modalities is enforced to be the same. Extensive experiments on three datasets show that the proposed methods significantly outperform the state-of-the-art multimodal subspace clustering methods. Deep Multimodality Model for Multi-task Multi-view Learning(Deep-MTMV) Many real-world problems exhibit the coexistence of multiple types of heterogeneity, such as view heterogeneity (i.e., multi-view property) and task heterogeneity (i.e., multi-task property). For example, in an image classification problem containing multiple poses of the same object, each pose can be considered as one view, and the detection of each type of object can be treated as one task. Furthermore, in some problems, the data type of multiple views might be different. In a web classification problem, for instance, we might be provided an image and text mixed data set, where the web pages are characterized by both images and texts. A common strategy to solve this kind of problem is to leverage the consistency of views and the relatedness of tasks to build the prediction model. In the context of deep neural network, multi-task relatedness is usually realized by grouping tasks at each layer, while multi-view consistency is usually enforced by finding the maximal correlation coefficient between views. However, there is no existing deep learning algorithm that jointly models task and view dual heterogeneity, particularly for a data set with multiple modalities (text and image mixed data set or text and video mixed data set, etc.). In this paper, we bridge this gap by proposing a deep multi-task multi-view learning framework that learns a deep representation for such dual-heterogeneity problems. Empirical studies on multiple real-world data sets demonstrate the effectiveness of our proposed Deep-MTMV algorithm. Deep Multiscale Model Learning The objective of this paper is to design novel multi-layer neural network architectures for multiscale simulations of flows taking into account the observed data and physical modeling concepts. Our approaches use deep learning concepts combined with local multiscale model reduction methodologies to predict flow dynamics. Using reduced-order model concepts is important for constructing robust deep learning architectures since the reduced-order models provide fewer degrees of freedom. Flow dynamics can be thought of as multi-layer networks. More precisely, the solution (e.g., pressures and saturations) at the time instant $n+1$ depends on the solution at the time instant $n$ and input parameters, such as permeability fields, forcing terms, and initial conditions. One can regard the solution as a multi-layer network, where each layer, in general, is a nonlinear forward map and the number of layers relates to the internal time steps. We will rely on rigorous model reduction concepts to define unknowns and connections for each layer. In each layer, our reduced-order models will provide a forward map, which will be modified (‘trained’) using available data. It is critical to use reduced-order models for this purpose, which will identify the regions of influence and the appropriate number of variables. Because of the lack of available data, the training will be supplemented with computational data as needed and the interpolation between data-rich and data-deficient models. We will also use deep learning algorithms to train the elements of the reduced model discrete system. We will present main ingredients of our approach and numerical results. Numerical results show that using deep learning and multiscale models, we can improve the forward models, which are conditioned to the available data. Deep Multiset Canonical Correlation Analysis(dMCCA) We propose Deep Multiset Canonical Correlation Analysis (dMCCA) as an extension to representation learning using CCA when the underlying signal is observed across multiple (more than two) modalities. We use deep learning framework to learn non-linear transformations from different modalities to a shared subspace such that the representations maximize the ratio of between- and within-modality covariance of the observations. Unlike linear discriminant analysis, we do not need class information to learn these representations, and we show that this model can be trained for complex data using mini-batches. Using synthetic data experiments, we show that dMCCA can effectively recover the common signal across the different modalities corrupted by multiplicative and additive noise. We also analyze the sensitivity of our model to recover the correlated components with respect to mini-batch size and dimension of the embeddings. Performance evaluation on noisy handwritten datasets shows that our model outperforms other CCA-based approaches and is comparable to deep neural network models trained end-to-end on this dataset. Deep Mutual Learning(DML) Model distillation is an effective and widely used technique to transfer knowledge from a teacher to a student network. The typical application is to transfer from a powerful large network or ensemble to a small network, that is better suited to low-memory or fast execution requirements. In this paper, we present a deep mutual learning (DML) strategy where, rather than one way transfer between a static pre-defined teacher and a student, an ensemble of students learn collaboratively and teach each other throughout the training process. Our experiments show that a variety of network architectures benefit from mutual learning and achieve compelling results on CIFAR-100 recognition and Market-1501 person re-identification benchmarks. Surprisingly, it is revealed that no prior powerful teacher network is necessary — mutual learning of a collection of simple student networks works, and moreover outperforms distillation from a more powerful yet static teacher. Deep Nearest Neighbor Descent(D-NND) Most density-based clustering methods largely rely on how well the underlying density is estimated. However, density estimation itself is also a challenging problem, especially the determination of the kernel bandwidth. A large bandwidth could lead to the over-smoothed density estimation in which the number of density peaks could be less than the true clusters, while a small bandwidth could lead to the under-smoothed density estimation in which spurious density peaks, or called the ‘ripple noise’, would be generated in the estimated density. In this paper, we propose a density-based hierarchical clustering method, called the Deep Nearest Neighbor Descent (D-NND), which could learn the underlying density structure layer by layer and capture the cluster structure at the same time. The over-smoothed density estimation could be largely avoided and the negative effect of the under-estimated cases could be also largely reduced. Overall, D-NND presents not only the strong capability of discovering the underlying cluster structure but also the remarkable reliability due to its insensitivity to parameters. Deep Nearest Neighbor Neural Network(DN4) Few-shot learning in image classification aims to learn a classifier to classify images when only few training examples are available for each class. Recent work has achieved promising classification performance, where an image-level feature based measure is usually used. In this paper, we argue that a measure at such a level may not be effective enough in light of the scarcity of examples in few-shot learning. Instead, we think a local descriptor based image-to-class measure should be taken, inspired by its surprising success in the heydays of local invariant features. Specifically, building upon the recent episodic training mechanism, we propose a Deep Nearest Neighbor Neural Network (DN4 in short) and train it in an end-to-end manner. Its key difference from the literature is the replacement of the image-level feature based measure in the final layer by a local descriptor based image-to-class measure. This measure is conducted online via a $k$-nearest neighbor search over the deep local descriptors of convolutional feature maps. The proposed DN4 not only learns the optimal deep local descriptors for the image-to-class measure, but also utilizes the higher efficiency of such a measure in the case of example scarcity, thanks to the exchangeability of visual patterns across the images in the same class. Our work leads to a simple, effective, and computationally efficient framework for few-shot learning. Experimental study on benchmark datasets consistently shows its superiority over the related state-of-the-art, with the largest absolute improvement of $17\%$ over the next best. The source code can be available from \UrlFont{https://…/DN4.git}. Deep Nested Agent Framework Deep hierarchical reinforcement learning has gained a lot of attention in recent years due to its ability to produce state-of-the-art results in challenging environments where non-hierarchical frameworks fail to learn useful policies. However, as problem domains become more complex, deep hierarchical reinforcement learning can become inefficient, leading to longer convergence times and poor performance. We introduce the Deep Nested Agent framework, which is a variant of deep hierarchical reinforcement learning where information from the main agent is propagated to the low level $nested$ agent by incorporating this information into the nested agent’s state. We demonstrate the effectiveness and performance of the Deep Nested Agent framework by applying it to three scenarios in Minecraft with comparisons to a deep non-hierarchical single agent framework, as well as, a deep hierarchical framework. Deep Neural Decision Forests We present Deep Neural Decision Forests – a novel approach that unifies classification trees with the representation learning functionality known from deep convolutional networks, by training them in an end-to-end manner. To combine these two worlds, we introduce a stochastic and differentiable decision tree model, which steers the representation learning usually conducted in the initial layers of a (deep) convolutional network. Our model differs from conventional deep networks because a decision forest provides the final predictions and it differs from conventional decision forests since we propose a principled, joint and global optimization of split and leaf node parameters. We show experimental results on benchmark machine learning datasets like MNIST and ImageNet and find onpar or superior results when compared to state-of-the-art deep models. Most remarkably, we obtain Top5-Errors of only 7:84%=6:38% on ImageNet validation data when integrating our forests in a single-crop, single/seven model GoogLeNet architecture, respectively. Thus, even without any form of training data set augmentation we are improving on the 6.67% error obtained by the best GoogLeNet architecture (7 models, 144 crops). Deep Neural Decision Tree(DNDT) Deep neural networks have been proven powerful at processing perceptual data, such as images and audio. However for tabular data, tree-based models are more popular. A nice property of tree-based models is their natural interpretability. In this work, we present Deep Neural Decision Trees (DNDT) — tree models realised by neural networks. A DNDT is intrinsically interpretable, as it is a tree. Yet as it is also a neural network (NN), it can be easily implemented in NN toolkits, and trained with gradient descent rather than greedy splitting. We evaluate DNDT on several tabular datasets, verify its efficacy, and investigate similarities and differences between DNDT and vanilla decision trees. Interestingly, DNDT self-prunes at both split and feature-level. Deep Neural Linear Bandit We study the neural-linear bandit model for solving sequential decision-making problems with high dimensional side information. Neural-linear bandits leverage the representation power of deep neural networks and combine it with efficient exploration mechanisms, designed for linear contextual bandits, on top of the last hidden layer. Since the representation is being optimized during learning, information regarding exploration with ‘old’ features is lost. Here, we propose the first limited memory neural-linear bandit that is resilient to this phenomenon, which we term catastrophic forgetting. We evaluate our method on a variety of real-world data sets, including regression, classification, and sentiment analysis, and observe that our algorithm is resilient to catastrophic forgetting and achieves superior performance. Deep Neural Map(DNM) We introduce a new unsupervised representation learning and visualization using deep convolutional networks and self organizing maps called Deep Neural Maps (DNM). DNM jointly learns an embedding of the input data and a mapping from the embedding space to a two-dimensional lattice. We compare visualizations of DNM with those of t-SNE and LLE on the MNIST and COIL-20 data sets. Our experiments show that the DNM can learn efficient representations of the input data, which reflects characteristics of each class. This is shown via back-projecting the neurons of the map on the data space. Deep Neural Network Ensemble Current deep neural networks suffer from two problems; first, they are hard to interpret, and second, they suffer from overfitting. There have been many attempts to define interpretability in neural networks, but they typically lack causality or generality. A myriad of regularization techniques have been developed to prevent overfitting, and this has driven deep learning to become the hot topic it is today; however, while most regularization techniques are justified empirically and even intuitively, there is not much underlying theory. This paper argues that to extract the features used in neural networks to make decisions, it’s important to look at the paths between clusters existing in the hidden spaces of neural networks. These features are of particular interest because they reflect the true decision making process of the neural network. This analysis is then furthered to present an ensemble algorithm for arbitrary neural networks which has guarantees for test accuracy. Finally, a discussion detailing the aforementioned guarantees is introduced and the implications to neural networks, including an intuitive explanation for all current regularization methods, are presented. The ensemble algorithm has generated state-of-the-art results for Wide-ResNet on CIFAR-10 and has improved test accuracy for all models it has been applied to. Deep Octonion Network Deep learning is a research hot topic in the field of machine learning. Real-value neural networks (Real NNs), especially deep real networks (DRNs), have been widely used in many research fields. In recent years, the deep complex networks (DCNs) and the deep quaternion networks (DQNs) have attracted more and more attentions. The octonion algebra, which is an extension of complex algebra and quaternion algebra, can provide more efficient and compact expression. This paper constructs a general framework of deep octonion networks (DONs) and provides the main building blocks of DONs such as octonion convolution, octonion batch normalization and octonion weight initialization; DONs are then used in image classification tasks for CIFAR-10 and CIFAR-100 data sets. Compared with the DRNs, the DCNs, and the DQNs, the proposed DONs have better convergence and higher classification accuracy. The success of DONs is also explained by multi-task learning. Deep Optimisation Deep Optimisation (DO) combines evolutionary search with Deep Neural Networks (DNNs) in a novel way – not for optimising a learning algorithm, but for finding a solution to an optimisation problem. Deep learning has been successfully applied to classification, regression, decision and generative tasks and in this paper we extend its application to solving optimisation problems. Model Building Optimisation Algorithms (MBOAs), a branch of evolutionary algorithms, have been successful in combining machine learning methods and evolutionary search but, until now, they have not utilised DNNs. DO is the first algorithm to use a DNN to learn and exploit the problem structure to adapt the variation operator (changing the neighbourhood structure of the search process). We demonstrate the performance of DO using two theoretical optimisation problems within the MAXSAT class. The Hierarchical Transformation Optimisation Problem (HTOP) has controllable deep structure that provides a clear evaluation of how DO works and why using a layerwise technique is essential for learning and exploiting problem structure. The Parity Modular Constraint Problem (MCparity) is a simplistic example of a problem containing higher-order dependencies (greater than pairwise) which DO can solve and state of the art MBOAs cannot. Further, we show that DO can exploit deep structure in TSP instances. Together these results show that there exists problems that DO can find and exploit deep problem structure that other algorithms cannot. Making this connection between DNNs and optimisation allows for the utilisation of advanced tools applicable to DNNs that current MBOAs are unable to use. Deep Optimistic Linear Support Learning(DOL) We propose Deep Optimistic Linear Support Learning (DOL) to solve high-dimensional multi-objective decision problems where the relative importances of the objectives are not known a priori. Using features from the high-dimensional inputs, DOL computes the convex coverage set containing all potential optimal solutions of the convex combinations of the objectives. To our knowledge, this is the first time that deep reinforcement learning has succeeded in learning multi-objective policies. In addition, we provide a testbed with two experiments to be used as a benchmark for deep multi-objective reinforcement learning. Deep Ordinal Reinforcement Learning Reinforcement learning usually makes use of numerical rewards, which have nice properties but also come with drawbacks and difficulties. Using rewards on an ordinal scale (ordinal rewards) is an alternative to numerical rewards that has received more attention in recent years. In this paper, a general approach to adapting reinforcement learning problems to the use of ordinal rewards is presented and motivated. We show how to convert common reinforcement learning algorithms to an ordinal variation by the example of Q-learning and introduce Ordinal Deep Q-Networks, which adapt deep reinforcement learning to ordinal rewards. Additionally, we run evaluations on problems provided by the OpenAI Gym framework, showing that our ordinal variants exhibit a performance that is comparable to the numerical variations for a number of problems. We also give first evidence that our ordinal variant is able to produce better results for problems with less engineered and simpler-to-design reward signals. Deep Perm-Set Net We present a novel approach for learning to predict sets with unknown permutation and cardinality using deep neural networks. Even though the output of many real-world problems, e.g. object detection, are naturally expressed as sets of entities, existing deep learning architectures hinder a trivial extension to deal with this unstructured output. Even deep architectures that handle sequential data, such as recurrent neural networks, can only output an ordered set and may not guarantee a valid solution, i.e. a set with unique elements. In this paper, we derive a mathematical formulation for set prediction using feed-forward neural networks, where the output has unknown and unfixed cardinality and permutation. Specifically, in our formulation we incorporate the permutation as unobservable variable and estimate its distribution during the learning process using alternating optimization. We demonstrate the validity of this formulation on two relevant problems including object detection and a complex CAPTCHA test. Deep plAckeTt-luce modEL wIth uNcertainty mEasurements(DATELINE) The aggregation of k-ary preferences is a historical and important problem, since it has many real-world applications, such as peer grading, presidential elections and restaurant ranking. Meanwhile, variants of Plackett-Luce model has been applied to aggregate k-ary preferences. However, there are two urgent issues still existing in the current variants. First, most of them ignore feature information. Namely, they consider k-ary preferences instead of instance-dependent k-ary preferences. Second, these variants barely consider the uncertainty in k-ary preferences provided by agnostic crowds. In this paper, we propose Deep plAckeTt-luce modEL wIth uNcertainty mEasurements (DATELINE), which can address both issues simultaneously. To address the first issue, we employ deep neural networks mapping each instance into its ranking score in Plackett-Luce model. Then, we present a weighted Plackett-Luce model to solve the second issue, where the weight is a dynamic uncertainty vector measuring the worker quality. More importantly, we provide theoretical guarantees for DATELINE to justify its robustness. Deep Planning Network(PlaNet) Planning has been very successful for control tasks with known environment dynamics. To leverage planning in unknown environments, the agent needs to learn the dynamics from interactions with the world. However, learning dynamics models that are accurate enough for planning has been a long-standing challenge, especially in image-based domains. We propose the Deep Planning Network (PlaNet), a purely model-based agent that learns the environment dynamics from pixels and chooses actions through online planning in latent space. To achieve high performance, the dynamics model must accurately predict the rewards ahead for multiple time steps. We approach this problem using a latent dynamics model with both deterministic and stochastic transition function and a generalized variational inference objective that we name latent overshooting. Using only pixel observations, our agent solves continuous control tasks with contact dynamics, partial observability, and sparse rewards. PlaNet uses significantly fewer episodes and reaches final performance close to and sometimes higher than top model-free algorithms. Deep Poisson-Gamma Dynamical System(DPGDS) We develop deep Poisson-gamma dynamical systems (DPGDS) to model sequentially observed multivariate count data, improving previously proposed models by not only mining deep hierarchical latent structure from the data, but also capturing both first-order and long-range temporal dependencies. Using sophisticated but simple-to-implement data augmentation techniques, we derived closed-form Gibbs sampling update equations by first backward and upward propagating auxiliary latent counts, and then forward and downward sampling latent variables. Moreover, we develop stochastic gradient MCMC inference that is scalable to very long multivariate count time series. Experiments on both synthetic and a variety of real-world data demonstrate that the proposed model not only has excellent predictive performance, but also provides highly interpretable multilayer latent structure to represent hierarchical and temporal information propagation. Deep Policy Inference Q-Network(DPIQN) We present DPIQN, a deep policy inference Q-network that targets multi-agent systems composed of controllable agents, collaborators, and opponents that interact with each other. We focus on one challenging issue in such systems—modeling agents with varying strategies—and propose to employ ‘policy features’ learned from raw observations (e.g., raw images) of collaborators and opponents by inferring their policies. DPIQN incorporates the learned policy features as a hidden vector into its own deep Q-network (DQN), such that it is able to predict better Q values for the controllable agents than the state-of-the-art deep reinforcement learning models. We further propose an enhanced version of DPIQN, called deep recurrent policy inference Q-network (DRPIQN), for handling partial observability. Both DPIQN and DRPIQN are trained by an adaptive training procedure, which adjusts the network’s attention to learn the policy features and its own Q-values at different phases of the training process. We present a comprehensive analysis of DPIQN and DRPIQN, and highlight their effectiveness and generalizability in various multi-agent settings. Our models are evaluated in a classic soccer game involving both competitive and collaborative scenarios. Experimental results performed on 1 vs. 1 and 2 vs. 2 games show that DPIQN and DRPIQN demonstrate superior performance to the baseline DQN and deep recurrent Q-network (DRQN) models. We also explore scenarios in which collaborators or opponents dynamically change their policies, and show that DPIQN and DRPIQN do lead to better overall performance in terms of stability and mean scores. Deep Polynomial Network(DPN) The cloud-based speech recognition/API provides developers or enterprises an easy way to create speech-enabled features in their applications. However, sending audios about personal or company internal information to the cloud, raises concerns about the privacy and security issues. The recognition results generated in cloud may also reveal some sensitive information. This paper proposes a deep polynomial network (DPN) that can be applied to the encrypted speech as an acoustic model. It allows clients to send their data in an encrypted form to the cloud to ensure that their data remains confidential, at mean while the DPN can still make frame-level predictions over the encrypted speech and return them in encrypted form. One good property of the DPN is that it can be trained on unencrypted speech features in the traditional way. To keep the cloud away from the raw audio and recognition results, a cloud-local joint decoding framework is also proposed. We demonstrate the effectiveness of model and framework on the Switchboard and Cortana voice assistant tasks with small performance degradation and latency increased comparing with the traditional cloud-based DNNs. Deep Priority Hashing(DPH) Deep hashing enables image retrieval by end-to-end learning of deep representations and hash codes from training data with pairwise similarity information. Subject to the distribution skewness underlying the similarity information, most existing deep hashing methods may underperform for imbalanced data due to misspecified loss functions. This paper presents Deep Priority Hashing (DPH), an end-to-end architecture that generates compact and balanced hash codes in a Bayesian learning framework. The main idea is to reshape the standard cross-entropy loss for similarity-preserving learning such that it down-weighs the loss associated to highly-confident pairs. This idea leads to a novel priority cross-entropy loss, which prioritizes the training on uncertain pairs over confident pairs. Also, we propose another priority quantization loss, which prioritizes hard-to-quantize examples for generation of nearly lossless hash codes. Extensive experiments demonstrate that DPH can generate high-quality hash codes and yield state-of-the-art image retrieval results on three datasets, ImageNet, NUS-WIDE, and MS-COCO. Deep Private-Feature Extractor(DPFE) We present and evaluate Deep Private-Feature Extractor (DPFE), a deep model which is trained and evaluated based on information theoretic constraints. Using the selective exchange of information between a user’s device and a service provider, DPFE enables the user to prevent certain sensitive information from being shared with a service provider, while allowing them to extract approved information using their model. We introduce and utilize the log-rank privacy, a novel measure to assess the effectiveness of DPFE in removing sensitive information and compare different models based on their accuracy-privacy tradeoff. We then implement and evaluate the performance of DPFE on smartphones to understand its complexity, resource demands, and efficiency tradeoffs. Our results on benchmark image datasets demonstrate that under moderate resource utilization, DPFE can achieve high accuracy for primary tasks while preserving the privacy of sensitive features. Deep Probabilistic Ensemble(DPE) In this paper, we introduce Deep Probabilistic Ensembles (DPEs), a scalable technique that uses a regularized ensemble to approximate a deep Bayesian Neural Network (BNN). We do so by incorporating a KL divergence penalty term into the training objective of an ensemble, derived from the evidence lower bound used in variational inference. We evaluate the uncertainty estimates obtained from our models for active learning on visual classification, consistently outperforming baselines and existing approaches. Deep Probabilistic Programming Deep probabilistic programming combines deep neural networks (for automatic hierarchical representation learning) with probabilistic models (for principled handling of uncertainty). Unfortunately, it is difficult to write deep probabilistic models, because existing programming frameworks lack concise, high-level, and clean ways to express them. To ease this task, we extend Stan, a popular high-level probabilistic programming language, to use deep neural networks written in PyTorch. Training deep probabilistic models works best with variational inference, so we also extend Stan for that. We implement these extensions by translating Stan programs to Pyro. Our translation clarifies the relationship between different families of probabilistic programming languages. Overall, our paper is a step towards making deep probabilistic programming easier. Deep Product Quantization(DPQ) Despite their widespread adoption, Product Quantization techniques were recently shown to be inferior to other hashing techniques. In this work, we present an improved Deep Product Quantization (DPQ) technique that leads to more accurate retrieval and classification than the latest state of the art methods, while having similar computational complexity and memory footprint as the Product Quantization method. To our knowledge, this is the first work to introduce a representation that is inspired by Product Quantization and which is learned end-to-end, and thus benefits from the supervised signal. DPQ explicitly learns soft and hard representations to enable an efficient and accurate asymmetric search, by using a straight-through estimator. A novel loss function, Joint Central Loss, is introduced, which both improves the retrieval performance, and decreases the discrepancy between the soft and the hard representations. Finally, by using a normalization technique, we improve the results for cross-domain category retrieval. Deep Q-Network(DQN) We propose Deep Q-Networks (DQN) with model-based exploration, an algorithm combining both model-free and model-based approaches that explores better and learns environments with sparse rewards more efficiently. DQN is a general-purpose, model-free algorithm and has been proven to perform well in a variety of tasks including Atari 2600 games since it’s first proposed by Minh et el. However, like many other reinforcement learning (RL) algorithms, DQN suffers from poor sample efficiency when rewards are sparse in an environment. As a result, most of the transitions stored in the replay memory have no informative reward signal, and provide limited value to the convergence and training of the Q-Network. However, one insight is that these transitions can be used to learn the dynamics of the environment as a supervised learning problem. The transitions also provide information of the distribution of visited states. Our algorithm utilizes these two observations to perform a one-step planning during exploration to pick an action that leads to states least likely to be seen, thus improving the performance of exploration. We demonstrate our agent’s performance in two classic environments with sparse rewards in OpenAI gym: Mountain Car and Lunar Lander. Deep Quality-Value Learning(DQV) We introduce a novel Deep Reinforcement Learning (DRL) algorithm called Deep Quality-Value (DQV) Learning. Similarly to Advantage-Actor-Critic methods, DQV uses a Value neural network for estimating the temporal-difference errors which are then used by a second Quality network for directly learning the state-action values. We first test DQV’s update rules with Multilayer Perceptrons as function approximators on two classic RL problems, and then extend DQV with the use of Deep Convolutional Neural Networks, Experience Replay’ and Target Neural Networks’ for tackling four games of the Atari Arcade Learning environment. Our results show that DQV learns significantly faster and better than Deep Q-Learning and Double Deep Q-Learning, suggesting that our algorithm can potentially be a better performing synchronous temporal difference algorithm than what is currently present in DRL. Deep Quaternion Network The field of deep learning has seen significant advancement in recent years. However, much of the existing work has been focused on real-valued numbers. Recent work has shown that a deep learning system using the complex numbers can be deeper for a set parameter budget compared to its real-valued counterpart. In this work, we explore the benefits of generalizing one step further into the hyper-complex numbers, quaternions specifically, and provide the architecture components needed to build deep quaternion networks. We go over quaternion convolutions, present a quaternion weight initialization scheme, and present algorithms for quaternion batch-normalization. These pieces are tested by end-to-end training on the CIFAR-10 and CIFAR-100 data sets to show the improved convergence to a real-valued network. Deep Reasoning Deep Reasoning is the field of enabling machines to understand implicit relationships between different things. For example, consider the following: ‘all animals drink water. Cats are animals’. Here, the implicit relationship is that all cats drink water, but that was never explicitly stated. Turns out humans are really good at this kind of relational reasoning understanding how different things relate to one another, but it doesn’t come so easily to computers which operate on strict, explicit rules. Deep Regionlets In this paper, we propose a novel object detection algorithm named ‘Deep Regionlets’ by integrating deep neural networks and conventional detection schema for accurate generic object detection. Motivated by the advantages of regionlets on modeling object deformation and multiple aspect ratios, we incorporate regionlets into an end-to-end trainable deep learning framework. The deep regionlets framework consists of a region selection network and a deep regionlet learning module. Specifically, given a detection bounding box proposal, the region selection network provides guidance on where to select regions from which features can be learned from. The regionlet learning module focuses on local feature selection and transformation to alleviate the effects of appearance variations. To this end, we first realize non-rectangular region selection within the detection framework to accommodate variations in object appearance. Moreover, we design a ‘gating network’ within the regionlet leaning module to enable soft regionlet selection and pooling. The Deep Regionlets framework is trained end-to-end without additional efforts. We present the results of ablation studies and extensive experiments on PASCAL VOC and Microsoft COCO datasets. The proposed algorithm outperforms state-of-the-art algorithms, such as RetinaNet and Mask R-CNN, even without additional segmentation labels. Deep Reinforcement Learning based Recommendation(DRR) Recommendation is crucial in both academia and industry, and various techniques are proposed such as content-based collaborative filtering, matrix factorization, logistic regression, factorization machines, neural networks and multi-armed bandits. However, most of the previous studies suffer from two limitations: (1) considering the recommendation as a static procedure and ignoring the dynamic interactive nature between users and the recommender systems, (2) focusing on the immediate feedback of recommended items and neglecting the long-term rewards. To address the two limitations, in this paper we propose a novel recommendation framework based on deep reinforcement learning, called DRR. The DRR framework treats recommendation as a sequential decision making procedure and adopts an ‘Actor-Critic’ reinforcement learning scheme to model the interactions between the users and recommender systems, which can consider both the dynamic adaptation and long-term rewards. Furthermore, a state representation module is incorporated into DRR, which can explicitly capture the interactions between items and users. Three instantiation structures are developed. Extensive experiments on four real-world datasets are conducted under both the offline and online evaluation settings. The experimental results demonstrate the proposed DRR method indeed outperforms the state-of-the-art competitors. Deep Reinforcement One-shot Learning(DeROL) In recent years there has been a sharp rise in networking applications, in which significant events need to be classified but only a few training instances are available. These are known as cases of one-shot learning. Examples include analyzing network traffic under zero-day attacks, and computer vision tasks by sensor networks deployed in the field. To handle this challenging task, organizations often use human analysts to classify events under high uncertainty. Existing algorithms use a threshold-based mechanism to decide whether to classify an object automatically or send it to an analyst for deeper inspection. However, this approach leads to a significant waste of resources since it does not take the practical temporal constraints of system resources into account. Our contribution is threefold. First, we develop a novel Deep Reinforcement One-shot Learning (DeROL) framework to address this challenge. The basic idea of the DeROL algorithm is to train a deep-Q network to obtain a policy which is oblivious to the unseen classes in the testing data. Then, in real-time, DeROL maps the current state of the one-shot learning process to operational actions based on the trained deep-Q network, to maximize the objective function. Second, we develop the first open-source software for practical artificially intelligent one-shot classification systems with limited resources for the benefit of researchers in related fields. Third, we present an extensive experimental study using the OMNIGLOT dataset for computer vision tasks and the UNSW-NB15 dataset for intrusion detection tasks that demonstrates the versatility and efficiency of the DeROL framework. Deep Rendering Mixture Model(DRMM) A Probabilistic Framework for Deep Learning Deep Rendering Model(DRM) In this paper, we develop a new theoretical framework that provides insights into both the successes and shortcomings of deep learning systems, as well as a principled route to their design and improvement. Our framework is based on a generative probabilistic model that explicitly captures variation due to latent nuisance variables. The Rendering Model (RM) explicitly models nuisance variation through a rendering function that combines the task-specific variables of interest (e.g., object class in an object recognition task) and the collection of nuisance variables. The Deep Rendering Model (DRM) extends the RM in a hierarchical fashion by rendering via a product of affine nuisance transformations across multiple levels of abstraction. The graphical structures of the RM and DRM enable inference via message passing, using, for example, the sum-product or max-sum algorithms, and training via the expectation-maximization (EM) algorithm. A key element of the framework is the relaxation of the RM/DRM generative model to a discriminative one in order to optimize the bias-variance tradeoff. Deep Residual Hashing In this paper, we define an extension of the supersymmetric hyperbolic nonlinear sigma model introduced by Zirnbauer. We show that it arises as a weak joint limit of a time-changed version introduced by Sabot and Tarr\es of the vertex-reinforced jump process. It describes the asymptotics of rescaled crossing numbers, rescaled fluctuations of local times, asymptotic local times on a logarithmic scale, endpoints of paths, and last exit trees. Deep Residual Network(ResNet) Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers-8 deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. Deep Residual Reinforcement Learning We revisit residual algorithms in both model-free and model-based reinforcement learning settings. We propose the bidirectional target network technique to stabilize residual algorithms, yielding a residual version of DDPG that significantly outperforms vanilla DDPG in the DeepMind Control Suite benchmark. Moreover, we find the residual algorithm an effective approach to the distribution mismatch problem in model-based planning. Compared with the existing TD($k$) method, our residual-based method makes weaker assumptions about the model and yields a greater performance boost. Deep Rewiring(DEEP R) Neuromorphic hardware tends to pose limits on the connectivity of deep networks that one can run on them. But also generic hardware and software implementations of deep learning run more efficiently on sparse networks. Several methods exist for pruning connections of a neural network after it was trained without connectivity constraints. We present an algorithm, DEEP R, that enables us to train directly a sparsely connected neural network. DEEP R automatically rewires the network during supervised training so that connections are there where they are most needed for the task, while its total number is all the time strictly bounded. We demonstrate that DEEP R can be used to train very sparse feedforward and recurrent neural networks on standard benchmark tasks with just a minor loss in performance. DEEP R is based on a rigorous theoretical foundation that views rewiring as stochastic sampling of network configurations from a posterior. Deep Ritz Method We propose a deep learning based method, the Deep Ritz Method, for numerically solving variational problems, particularly the ones that arise from partial differential equations. The Deep Ritz method is naturally nonlinear, naturally adaptive and has the potential to work in rather high dimensions. The framework is quite simple and fits well with the stochastic gradient descent method used in deep learning. We illustrate the method on several problems including some eigenvalue problems. Deep Roots We propose a new method for training computationally efficient and compact convolutional neural networks (CNNs) using a novel sparse connection structure that resembles a tree root. Our sparse connection structure facilitates a significant reduction in computational cost and number of parameters of state-of-the-art deep CNNs without compromising accuracy. We validate our approach by using it to train more efficient variants of state-of-the-art CNN architectures, evaluated on the CIFAR10 and ILSVRC datasets. Our results show similar or higher accuracy than the baseline architectures with much less compute, as measured by CPU and GPU timings. For example, for ResNet 50, our model has 40% fewer parameters, 45% fewer floating point operations, and is 31% (12%) faster on a CPU (GPU). For the deeper ResNet 200 our model has 25% fewer floating point operations and 44% fewer parameters, while maintaining state-of-the-art accuracy. For GoogLeNet, our model has 7% fewer parameters and is 21% (16%) faster on a CPU (GPU). Deep Rotation Equivariant Network(DREN) Recently, learning equivariant representations has attracted considerable research attention. Dieleman et al. introduce four operations which can be inserted to CNN to learn deep representations equivariant to rotation. However, feature maps should be copied and rotated four times in each layer in their approach, which causes much running time and memory overhead. In order to address this problem, we propose Deep Rotation Equivariant Network(DREN) consisting of cycle layers, isotonic layers and decycle layers.Our proposed layers apply rotation transformation on filters rather than feature maps, achieving a speed up of more than 2 times with even less memory overhead. We evaluate DRENs on Rotated MNIST and CIFAR-10 datasets and demonstrate that it can improve the performance of state-of-the-art architectures. Our codes are released on GitHub. Deep Saliency Hashing(DSaH) In recent years, hashing methods have been proved efficient for large-scale Web media search. However, existing general hashing methods have limited discriminative power for describing fine-grained objects that share similar overall appearance but have subtle difference. To solve this problem, we for the first time introduce attention mechanism to the learning of hashing codes. Specifically, we propose a novel deep hashing model, named deep saliency hashing (DSaH), which automatically mines salient regions and learns semantic-preserving hashing codes simultaneously. DSaH is a two-step end-to-end model consisting of an attention network and a hashing network. Our loss function contains three basic components, including the semantic loss, the saliency loss, and the quantization loss. The saliency loss guides the attention network to mine discriminative regions from pairs of images. We conduct extensive experiments on both fine-grained and general retrieval datasets for performance evaluation. Experimental results on Oxford Flowers-17 and Stanford Dogs-120 demonstrate that our DSaH performs the best for fine-grained retrieval task and beats the existing best retrieval performance (DPSH) by approximately 12%. DSaH also outperforms several state-of-the-art hashing methods on general datasets, including CIFAR-10 and NUS-WIDE. Deep Self-Organization Human professionals are often required to make decisions based on complex multivariate time series measurements in an online setting, e.g. in health care. Since human cognition is not optimized to work well in high-dimensional spaces, these decisions benefit from interpretable low-dimensional representations. However, many representation learning algorithms for time series data are difficult to interpret. This is due to non-intuitive mappings from data features to salient properties of the representation and non-smoothness over time. To address this problem, we propose to couple a variational autoencoder to a discrete latent space and introduce a topological structure through the use of self-organizing maps. This allows us to learn discrete representations of time series, which give rise to smooth and interpretable embeddings with superior clustering performance. Furthermore, to allow for a probabilistic interpretation of our method, we integrate a Markov model in the latent space. This model uncovers the temporal transition structure, improves clustering performance even further and provides additional explanatory insights as well as a natural representation of uncertainty. We evaluate our model on static (Fashion-)MNIST data, a time series of linearly interpolated (Fashion-)MNIST images, a chaotic Lorenz attractor system with two macro states, as well as on a challenging real world medical time series application. In the latter experiment, our representation uncovers meaningful structure in the acute physiological state of a patient. Deep Semantic Multimodal Hashing Network(DSMHN) Hashing has been widely applied to multimodal retrieval on large-scale multimedia data due to its efficiency in computation and storage. Particularly, deep hashing has received unprecedented research attention in recent years, owing to its perfect retrieval performance. However, most of existing deep hashing methods learn binary hash codes by preserving the similarity relationship while without exploiting the semantic labels, which result in suboptimal binary codes. In this work, we propose a novel Deep Semantic Multimodal Hashing Network (DSMHN) for scalable multimodal retrieval. In DSMHN, two sets of modality-specific hash functions are jointly learned by explicitly preserving both the inter-modality similarities and the intra-modality semantic labels. Specifically, with the assumption that the learned hash codes should be optimal for task-specific classification, two stream networks are jointly trained to learn the hash functions by embedding the semantic labels on the resultant hash codes. Different from previous deep hashing methods, which are tied to some particular forms of loss functions, our deep hashing framework can be flexibly integrated with different types of loss functions. In addition, the bit balance property is investigated to generate binary codes with each bit having $50\%$ probability to be $1$ or $-1$. Moreover, a unified deep multimodal hashing framework is proposed to learn compact and high-quality hash codes by exploiting the feature representation learning, inter-modality similarity preserving learning, semantic label preserving learning and hash functions learning with bit balanced constraint simultaneously. We conduct extensive experiments for both unimodal and cross-modal retrieval tasks on three widely-used multimodal retrieval datasets. The experimental result demonstrates that DSMHN significantly outperforms state-of-the-art methods. Deep Sequential Model for Knowledge Graph Completion(DSKG) Knowledge graph (KG) completion aims to fill the missing facts in a KG, where a fact is represented as a triple in the form of $(subject, relation, object)$. Current KG completion models compel two-thirds of a triple provided (e.g., $subject$ and $relation$) to predict the remaining one. In this paper, we propose a new model, which uses a KG-specific multi-layer recurrent neutral network (RNN) to model triples in a KG as sequences. It outperformed several state-of-the-art KG completion models on the conventional entity prediction task for many evaluation metrics, based on two benchmark datasets and a more difficult dataset. Furthermore, our model is enabled by the sequential characteristic and thus capable of predicting the whole triples only given one entity. Our experiments demonstrated that our model achieved promising performance on this new triple prediction task. Deep Smoke Segmentation Inspired by the recent success of fully convolutional networks (FCN) in semantic segmentation, we propose a deep smoke segmentation network to infer high quality segmentation masks from blurry smoke images. To overcome large variations in texture, color and shape of smoke appearance, we divide the proposed network into a coarse path and a fine path. The first path is an encoder-decoder FCN with skip structures, which extracts global context information of smoke and accordingly generates a coarse segmentation mask. To retain fine spatial details of smoke, the second path is also designed as an encoder-decoder FCN with skip structures, but it is shallower than the first path network. Finally, we propose a very small network containing only add, convolution and activation layers to fuse the results of the two paths. Thus, we can easily train the proposed network end to end for simultaneous optimization of network parameters. To avoid the difficulty in manually labelling fuzzy smoke objects, we propose a method to generate synthetic smoke images. According to results of our deep segmentation method, we can easily and accurately perform smoke detection from videos. Experiments on three synthetic smoke datasets and a realistic smoke dataset show that our method achieves much better performance than state-of-the-art segmentation algorithms based on FCNs. Test results of our method on videos are also appealing. Deep Smoke Segmentation Network Inspired by the recent success of fully convolutional networks (FCN) in semantic segmentation, we propose a deep smoke segmentation network to infer high quality segmentation masks from blurry smoke images. To overcome large variations in texture, color and shape of smoke appearance, we divide the proposed network into a coarse path and a fine path. The first path is an encoder-decoder FCN with skip structures, which extracts global context information of smoke and accordingly generates a coarse segmentation mask. To retain fine spatial details of smoke, the second path is also designed as an encoder-decoder FCN with skip structures, but it is shallower than the first path network. Finally, we propose a very small network containing only add, convolution and activation layers to fuse the results of the two paths. Thus, we can easily train the proposed network end to end for simultaneous optimization of network parameters. To avoid the difficulty in manually labelling fuzzy smoke objects, we propose a method to generate synthetic smoke images. According to results of our deep segmentation method, we can easily and accurately perform smoke detection from videos. Experiments on three synthetic smoke datasets and a realistic smoke dataset show that our method achieves much better performance than state-of-the-art segmentation algorithms based on FCNs. Test results of our method on videos are also appealing. Deep SNR-Based Metric Learning(DSML) Deep metric learning, which learns discriminative features to process image clustering and retrieval tasks, has attracted extensive attention in recent years. A number of deep metric learning methods, which ensure that similar examples are mapped close to each other and dissimilar examples are mapped farther apart, have been proposed to construct effective structures for loss functions and have shown promising results. In this paper, different from the approaches on learning the loss structures, we propose a robust SNR distance metric based on Signal-to-Noise Ratio (SNR) for measuring the similarity of image pairs for deep metric learning. By exploring the properties of our SNR distance metric from the view of geometry space and statistical theory, we analyze the properties of our metric and show that it can preserve the semantic similarity between image pairs, which well justify its suitability for deep metric learning. Compared with Euclidean distance metric, our SNR distance metric can further jointly reduce the intra-class distances and enlarge the inter-class distances for learned features. Leveraging our SNR distance metric, we propose Deep SNR-based Metric Learning (DSML) to generate discriminative feature embeddings. By extensive experiments on three widely adopted benchmarks, including CARS196, CUB200-2011 and CIFAR10, our DSML has shown its superiority over other state-of-the-art methods. Additionally, we extend our SNR distance metric to deep hashing learning, and conduct experiments on two benchmarks, including CIFAR10 and NUS-WIDE, to demonstrate the effectiveness and generality of our SNR distance metric. Deep Sparse Subspace Clustering In this paper, we present a deep extension of Sparse Subspace Clustering, termed Deep Sparse Subspace Clustering (DSSC). Regularized by the unit sphere distribution assumption for the learned deep features, DSSC can infer a new data affinity matrix by simultaneously satisfying the sparsity principle of SSC and the nonlinearity given by neural networks. One of the appealing advantages brought by DSSC is: when original real-world data do not meet the class-specific linear subspace distribution assumption, DSSC can employ neural networks to make the assumption valid with its hierarchical nonlinear transformations. To the best of our knowledge, this is among the first deep learning based subspace clustering methods. Extensive experiments are conducted on four real-world datasets to show the proposed DSSC is significantly superior to 12 existing methods for subspace clustering. Deep Stacked Stochastic Configuration Network(DSSCN) The concept of stochastic configuration networks (SCNs) others a solid framework for fast implementation of feedforward neural networks through randomized learning. Unlike conventional randomized approaches, SCNs provide an avenue to select appropriate scope of random parameters to ensure the universal approximation property. In this paper, a deep version of stochastic configuration networks, namely deep stacked stochastic configuration network (DSSCN), is proposed for modeling non-stationary data streams. As an extension of evolving stochastic connfiguration networks (eSCNs), this work contributes a way to grow and shrink the structure of deep stochastic configuration networks autonomously from data streams. The performance of DSSCN is evaluated by six benchmark datasets. Simulation results, compared with prominent data stream algorithms, show that the proposed method is capable of achieving comparable accuracy and evolving compact and parsimonious deep stacked network architecture. Deep Structured Generative Model Deep generative models have shown promising results in generating realistic images, but it is still non-trivial to generate images with complicated structures. The main reason is that most of the current generative models fail to explore the structures in the images including spatial layout and semantic relations between objects. To address this issue, we propose a novel deep structured generative model which boosts generative adversarial networks (GANs) with the aid of structure information. In particular, the layout or structure of the scene is encoded by a stochastic and-or graph (sAOG), in which the terminal nodes represent single objects and edges represent relations between objects. With the sAOG appropriately harnessed, our model can successfully capture the intrinsic structure in the scenes and generate images of complicated scenes accordingly. Furthermore, a detection network is introduced to infer scene structures from a image. Experimental results demonstrate the effectiveness of our proposed method on both modeling the intrinsic structures, and generating realistic images. Deep Subspace Clustering Network We present a novel deep neural network architecture for unsupervised subspace clustering. This architecture is built upon deep auto-encoders, which non-linearly map the input data into a latent space. Our key idea is to introduce a novel self-expressive layer between the encoder and the decoder to mimic the ‘self-expressiveness’ property that has proven effective in traditional subspace clustering. Being differentiable, our new self-expressive layer provides a simple but effective way to learn pairwise affinities between all data points through a standard back-propagation procedure. Being nonlinear, our neural-network based method is able to cluster data points having complex (often nonlinear) structures. We further propose pre-training and fine-tuning strategies that let us effectively learn the parameters of our subspace clustering networks. Our experiments show that the proposed method significantly outperforms the state-of-the-art unsupervised subspace clustering methods. Deep Successor Feature Network(DSFN) We introduce a novel apprenticeship learning algorithm to learn an expert’s underlying reward structure in off-policy model-free \emph{batch} settings. Unlike existing methods that require a dynamics model or additional data acquisition for on-policy evaluation, our algorithm requires only the batch data of observed expert behavior. Such settings are common in real-world tasks—health care, finance or industrial processes —where accurate simulators do not exist or data acquisition is costly. To address challenges in batch settings, we introduce Deep Successor Feature Networks(DSFN) that estimate feature expectations in an off-policy setting and a transition-regularized imitation network that produces a near-expert initial policy and an efficient feature representation. Our algorithm achieves superior results in batch settings on both control benchmarks and a vital clinical task of sepsis management in the Intensive Care Unit. Deep Successor Reinforcement Learning(DSR) Learning robust value functions given raw observations and rewards is now possible with model-free and model-based deep reinforcement learning algorithms. There is a third alternative, called Successor Representations (SR), which decomposes the value function into two components — a reward predictor and a successor map. The successor map represents the expected future state occupancy from any given state and the reward predictor maps states to scalar rewards. The value function of a state can be computed as the inner product between the successor map and the reward weights. In this paper, we present DSR, which generalizes SR within an end-to-end deep reinforcement learning framework. DSR has several appealing properties including: increased sensitivity to distal reward changes due to factorization of reward and world dynamics, and the ability to extract bottleneck states (subgoals) given successor maps trained under a random policy. We show the efficacy of our approach on two diverse environments given raw pixel observations — simple grid-world domains (MazeBase) and the Doom game engine. Deep Super Learning Deep learning has become very popular for tasks such as predictive modeling and pattern recognition in handling big data. Deep learning is a powerful machine learning method that extracts lower level features and feeds them forward for the next layer to identify higher level features that improve performance. However, deep neural networks have drawbacks, which include many hyper-parameters and infinite architectures, opaqueness into results, and relatively slower convergence on smaller datasets. While traditional machine learning algorithms can address these drawbacks, they are not typically capable of the performance levels achieved by deep neural networks. To improve performance, ensemble methods are used to combine multiple base learners. Super learning is an ensemble that finds the optimal combination of diverse learning algorithms. This paper proposes deep super learning as an approach which achieves log loss and accuracy results competitive to deep neural networks while employing traditional machine learning algorithms in a hierarchical structure. The deep super learner is flexible, adaptable, and easy to train with good performance across different tasks using identical hyper-parameter values. Using traditional machine learning requires fewer hyper-parameters, allows transparency into results, and has relatively fast convergence on smaller datasets. Experimental results show that the deep super learner has superior performance compared to the individual base learners, single-layer ensembles, and in some cases deep neural networks. Performance of the deep super learner may further be improved with task-specific tuning. Deep Survival Previous research has shown that neural networks can model survival data in situations in which some patients’ death times are unknown, e.g. right-censored. However, neural networks have rarely been shown to outperform their linear counterparts such as the Cox proportional hazards model. In this paper, we run simulated experiments and use real survival data to build upon the risk-regression architecture proposed by Faraggi and Simon. We demonstrate that our model, DeepSurv, not only works as well as the standard linear Cox proportional hazards model but actually outperforms it in predictive ability on survival data with linear and nonlinear risk functions. We then show that the neural network can also serve as a recommender system by including a categorical variable representing a treatment group. This can be used to provide personalized treatment recommendations based on an individual’s calculated risk. We provide an open source Python module that implements these methods in order to advance research on deep learning and survival analysis. Deep Switch Network ➘ “Multilayer Switch Network” Deep Symbolic Network(DSN) We introduce the Deep Symbolic Network (DSN) model, which aims at becoming the white-box version of Deep Neural Networks (DNN). The DSN model provides a simple, universal yet powerful structure, similar to DNN, to represent any knowledge of the world, which is transparent to humans. The conjecture behind the DSN model is that any type of real world objects sharing enough common features are mapped into human brains as a symbol. Those symbols are connected by links, representing the composition, correlation, causality, or other relationships between them, forming a deep, hierarchical symbolic network structure. Powered by such a structure, the DSN model is expected to learn like humans, because of its unique characteristics. First, it is universal, using the same structure to store any knowledge. Second, it can learn symbols from the world and construct the deep symbolic networks automatically, by utilizing the fact that real world objects have been naturally separated by singularities. Third, it is symbolic, with the capacity of performing causal deduction and generalization. Fourth, the symbols and the links between them are transparent to us, and thus we will know what it has learned or not – which is the key for the security of an AI system. Fifth, its transparency enables it to learn with relatively small data. Sixth, its knowledge can be accumulated. Last but not least, it is more friendly to unsupervised learning than DNN. We present the details of the model, the algorithm powering its automatic learning ability, and describe its usefulness in different use cases. The purpose of this paper is to generate broad interest to develop it within an open source project centered on the Deep Symbolic Network (DSN) model towards the development of general AI. Deep Temporal Clustering(DTC) Unsupervised learning of time series data, also known as temporal clustering, is a challenging problem in machine learning. Here we propose a novel algorithm, Deep Temporal Clustering (DTC), to naturally integrate dimensionality reduction and temporal clustering into a single end-to-end learning framework, fully unsupervised. The algorithm utilizes an autoencoder for temporal dimensionality reduction and a novel temporal clustering layer for cluster assignment. Then it jointly optimizes the clustering objective and the dimensionality reduction objec tive. Based on requirement and application, the temporal clustering layer can be customized with any temporal similarity metric. Several similarity metrics and state-of-the-art algorithms are considered and compared. To gain insight into temporal features that the network has learned for its clustering, we apply a visualization method that generates a region of interest heatmap for the time series. The viability of the algorithm is demonstrated using time series data from diverse domains, ranging from earthquakes to spacecraft sensor data. In each case, we show that the proposed algorithm outperforms traditional methods. The superior performance is attributed to the fully integrated temporal dimensionality reduction and clustering criterion. Deep Temporal Network(DTNet) We introduce in this paper the principle of Deep Temporal Networks that allow to add time to convolutional networks by allowing deep integration principles not only using spatial information but also increasingly large temporal window. The concept can be used for conventional image inputs but also event based data. Although inspired by the architecture of brain that inegrates information over increasingly larger spatial but also temporal scales it can operate on conventional hardware using existing architectures. We introduce preliminary results to show the efficiency of the method. More in-depth results and analysis will be reported soon! Deep Tensor Adversarial Generative net(TGAN) Deep generative models have been successfully applied to many applications. However, existing works experience limitations when generating large images (the literature usually generates small images, e.g. 32 * 32 or 128 * 128). In this paper, we propose a novel scheme, called deep tensor adversarial generative nets (TGAN), that generates large high-quality images by exploring tensor structures. Essentially, the adversarial process of TGAN takes place in a tensor space. First, we impose tensor structures for concise image representation, which is superior in capturing the pixel proximity information and the spatial patterns of elementary objects in images, over the vectorization preprocess in existing works. Secondly, we propose TGAN that integrates deep convolutional generative adversarial networks and tensor super-resolution in a cascading manner, to generate high-quality images from random distributions. More specifically, we design a tensor super-resolution process that consists of tensor dictionary learning and tensor coefficients learning. Finally, on three datasets, the proposed TGAN generates images with more realistic textures, compared with state-of-the-art adversarial autoencoders. The size of the generated images is increased by over 8.5 times, namely 374 * 374 in PASCAL2. Deep Tensor Decomposition(DeepTD) In this paper we study the problem of learning the weights of a deep convolutional neural network. We consider a network where convolutions are carried out over non-overlapping patches with a single kernel in each layer. We develop an algorithm for simultaneously learning all the kernels from the training data. Our approach dubbed Deep Tensor Decomposition (DeepTD) is based on a rank-1 tensor decomposition. We theoretically investigate DeepTD under a realizable model for the training data where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to planted convolutional kernels. We show that DeepTD is data-efficient and provably works as soon as the sample size exceeds the total number of convolutional weights in the network. We carry out a variety of numerical experiments to investigate the effectiveness of DeepTD and verify our theoretical findings. Deep Texture Encoding Network(Deep TEN) We propose a Deep Texture Encoding Network (Deep-TEN) with a novel Encoding Layer integrated on top of convolutional layers, which ports the entire dictionary learning and encoding pipeline into a single model. Current methods build from distinct components, using standard encoders with separate off-the-shelf features such as SIFT descriptors or pre-trained CNN features for material recognition. Our new approach provides an end-to-end learning framework, where the inherent visual vocabularies are learned directly from the loss function. The features, dictionaries and the encoding representation for the classifier are all learned simultaneously. The representation is orderless and therefore is particularly useful for material and texture recognition. The Encoding Layer generalizes robust residual encoders such as VLAD and Fisher Vectors, and has the property of discarding domain specific information which makes the learned convolutional features easier to transfer. Additionally, joint training using multiple datasets of varied sizes and class labels is supported resulting in increased recognition performance. The experimental results show superior performance as compared to state-of-the-art methods using gold-standard databases such as MINC-2500, Flickr Material Database, KTH-TIPS-2b, and two recent databases 4D-Light-Field-Material and GTOS. The source code for the complete system are publicly available. Deep Transfer Learning As a new classification platform, deep learning has recently received increasing attention from researchers and has been successfully applied to many domains. In some domains, like bioinformatics and robotics, it is very difficult to construct a large-scale well-annotated dataset due to the expense of data acquisition and costly annotation, which limits its development. Transfer learning relaxes the hypothesis that the training data must be independent and identically distributed (i.i.d.) with the test data, which motivates us to use transfer learning to solve the problem of insufficient training data. This survey focuses on reviewing the current researches of transfer learning by using deep neural network and its applications. We defined deep transfer learning, category and review the recent research works based on the techniques used in deep transfer learning. Deep Transfer Network(DTN) In recent years, an increasing popularity of deep learning model for intelligent condition monitoring and diagnosis as well as prognostics used for mechanical systems and structures has been observed. In the previous studies, however, a major assumption accepted by default, is that the training and testing data are taking from same feature distribution. Unfortunately, this assumption is mostly invalid in real application, resulting in a certain lack of applicability for the traditional diagnosis approaches. Inspired by the idea of transfer learning that leverages the knowledge learnt from rich labeled data in source domain to facilitate diagnosing a new but similar target task, a new intelligent fault diagnosis framework, i.e., deep transfer network (DTN), which generalizes deep learning model to domain adaptation scenario, is proposed in this paper. By extending the marginal distribution adaptation (MDA) to joint distribution adaptation (JDA), the proposed framework can exploit the discrimination structures associated with the labeled data in source domain to adapt the conditional distribution of unlabeled target data, and thus guarantee a more accurate distribution matching. Extensive empirical evaluations on three fault datasets validate the applicability and practicability of DTN, while achieving many state-of-the-art transfer results in terms of diverse operating conditions, fault severities and fault types. Deep Transition Architecture for Neural Machine Translation(DTMT) Past years have witnessed rapid developments in Neural Machine Translation (NMT). Most recently, with advanced modeling and training techniques, the RNN-based NMT (RNMT) has shown its potential strength, even compared with the well-known Transformer (self-attentional) model. Although the RNMT model can possess very deep architectures through stacking layers, the transition depth between consecutive hidden states along the sequential axis is still shallow. In this paper, we further enhance the RNN-based NMT through increasing the transition depth between consecutive hidden states and build a novel Deep Transition RNN-based Architecture for Neural Machine Translation, named DTMT. This model enhances the hidden-to-hidden transition with multiple non-linear transformations, as well as maintains a linear transformation path throughout this deep transition by the well-designed linear transformation mechanism to alleviate the gradient vanishing problem. Experiments show that with the specially designed deep transition modules, our DTMT can achieve remarkable improvements on translation quality. Experimental results on Chinese->English translation task show that DTMT can outperform the Transformer model by +2.09 BLEU points and achieve the best results ever reported in the same dataset. On WMT14 English->German and English->French translation tasks, DTMT shows superior quality to the state-of-the-art NMT systems, including the Transformer and the RNMT+. Deep Variational Canonical Correlation Analysis(VCCA) We present deep variational canonical correlation analysis (VCCA), a deep multi-view learning model that extends the latent variable model interpretation of linear CCA~\citep{BachJordan05a} to nonlinear observation models parameterized by deep neural networks (DNNs). Marginal data likelihood as well as inference are intractable under this model. We derive a variational lower bound of the data likelihood by parameterizing the posterior density of the latent variables with another DNN, and approximate the lower bound via Monte Carlo sampling. Interestingly, the resulting model resembles that of multi-view autoencoders~\citep{Ngiam_11b}, with the key distinction of an additional sampling procedure at the bottleneck layer. We also propose a variant of VCCA called VCCA-private which can, in addition to the ‘common variables’ underlying both views, extract the ‘private variables’ within each view. We demonstrate that VCCA-private is able to disentangle the shared and private information for multi-view data without hard supervision. Deep Variational Koopman(DVK) Koopman theory asserts that a nonlinear dynamical system can be mapped to a linear system, where the Koopman operator advances observations of the state forward in time. However, the observable functions that map states to observations are generally unknown. We introduce the Deep Variational Koopman (DVK) model, a method for inferring distributions over observations that can be propagated linearly in time. By sampling from the inferred distributions, we obtain a distribution over dynamical models, which in turn provides a distribution over possible outcomes as a modeled system advances in time. Experiments show that the DVK model is effective at long-term prediction for a variety of dynamical systems. Furthermore, we describe how to incorporate the learned models into a control framework, and demonstrate that accounting for the uncertainty present in the distribution over dynamical models enables more effective control. Deep Variational Transfer(DVT) In real-world applications, it is often expensive and time-consuming to obtain labeled examples. In such cases, knowledge transfer from related domains, where labels are abundant, could greatly reduce the need for extensive labeling efforts. In this scenario, transfer learning comes in hand. In this paper, we propose Deep Variational Transfer (DVT), a variational autoencoder that transfers knowledge across domains using a shared latent Gaussian mixture model. Thanks to the combination of a semi-supervised ELBO and parameters sharing across domains, we are able to simultaneously: (i) align all supervised examples of the same class into the same latent Gaussian Mixture component, independently from their domain; (ii) predict the class of unsupervised examples from different domains and use them to better model the occurring shifts. We perform tests on MNIST and USPS digits datasets, showing DVT’s ability to perform transfer learning across heterogeneous datasets. Additionally, we present DVT’s top classification performances on the MNIST semi-supervised learning challenge. We further validate DVT on a astronomical datasets. DVT achieves states-of-the-art classification performances, transferring knowledge across real stars surveys datasets, EROS, MACHO and HiTS, . In the worst performance, we double the achieved F1-score for rare classes. These experiments show DVT’s ability to tackle all major challenges posed by transfer learning: different covariate distributions, different and highly imbalanced class distributions and different feature spaces. Deep Visual Explanation(DVE) The practical impact of deep learning on complex supervised learning problems has been significant, so much so that almost every Artificial Intelligence problem, or at least a portion thereof, has been somehow recast as a deep learning problem. The applications appeal is significant, but this appeal is increasingly challenged by what some call the challenge of explainability, or more generally the more traditional challenge of debuggability: if the outcomes of a deep learning process produce unexpected results (e.g., less than expected performance of a classifier), then there is little available in the way of theories or tools to help investigate the potential causes of such unexpected behavior, especially when this behavior could impact people’s lives. We describe a preliminary framework to help address this issue, which we call ‘deep visual explanation’ (DVE). ‘Deep,’ because it is the development and performance of deep neural network models that we want to understand. ‘Visual,’ because we believe that the most rapid insight into a complex multi-dimensional model is provided by appropriate visualization techniques, and ‘Explanation,’ because in the spectrum from instrumentation by inserting print statements to the abductive inference of explanatory hypotheses, we believe that the key to understanding deep learning relies on the identification and exposure of hypotheses about the performance behavior of a learned deep model. In the exposition of our preliminary framework, we use relatively straightforward image classification examples and a variety of choices on initial configuration of a deep model building scenario. By careful but not complicated instrumentation, we expose classification outcomes of deep models using visualization, and also show initial results for one potential application of interpretability. Deep Visual-Semantic Embedding Model(DeViSE) Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions by up to 65%, achieving hit rates of up to 10% across thousands of novel labels never seen by the visual model. Deep Web The Deep Web, Deep Net, Invisible Web, or Hidden Web, refers to the content on the World Wide Web that is not indexed by standard search engines. Computer scientist Mike Bergman is credited with coining the term in 2000. Deep Weibull Model(DW-RNN) One of the key challenges in predictive maintenance is to predict the impending downtime of an equipment with a reasonable prediction horizon so that countermeasures can be put in place. Classically, this problem has been posed in two different ways which are typically solved independently: (1) Remaining useful life (RUL) estimation as a long-term prediction task to estimate how much time is left in the useful life of the equipment and (2) Failure prediction (FP) as a short-term prediction task to assess the probability of a failure within a pre-specified time window. As these two tasks are related, performing them separately is sub-optimal and might results in inconsistent predictions for the same equipment. In order to alleviate these issues, we propose two methods: Deep Weibull model (DW-RNN) and multi-task learning (MTL-RNN). DW-RNN is able to learn the underlying failure dynamics by fitting Weibull distribution parameters using a deep neural network, learned with a survival likelihood, without training directly on each task. While DW-RNN makes an explicit assumption on the data distribution, MTL-RNN exploits the implicit relationship between the long-term RUL and short-term FP tasks to learn the underlying distribution. Additionally, both our methods can leverage the non-failed equipment data for RUL estimation. We demonstrate that our methods consistently outperform baseline RUL methods that can be used for FP while producing consistent results for RUL and FP. We also show that our methods perform at par with baselines trained on the objectives optimized for either of the two tasks. Deep500 We introduce Deep500: the first customizable benchmarking infrastructure that enables fair comparison of the plethora of deep learning frameworks, algorithms, libraries, and techniques. The key idea behind Deep500 is its modular design, where deep learning is factorized into four distinct levels: operators, network processing, training, and distributed training. Our evaluation illustrates that Deep500 is customizable (enables combining and benchmarking different deep learning codes) and fair (uses carefully selected metrics). Moreover, Deep500 is fast (incurs negligible overheads), verifiable (offers infrastructure to analyze correctness), and reproducible. Finally, as the first distributed and reproducible benchmarking system for deep learning, Deep500 provides software infrastructure to utilize the most powerful supercomputers for extreme-scale workloads. DeepAM Computer programs written in one language are often required to be ported to other languages to support multiple devices and environments. When programs use language specific APIs (Application Programming Interfaces), it is very challenging to migrate these APIs to the corresponding APIs written in other languages. Existing approaches mine API mappings from projects that have corresponding versions in two languages. They rely on the sparse availability of bilingual projects, thus producing a limited number of API mappings. In this paper, we propose an intelligent system called DeepAM for automatically mining API mappings from a large-scale code corpus without bilingual projects. The key component of DeepAM is based on the multimodal sequence to sequence learning architecture that aims to learn joint semantic representations of bilingual API sequences from big source code data. Experimental results indicate that DeepAM significantly increases the accuracy of API mappings as well as the number of API mappings, when compared with the state-of-the-art approaches. DeepArchitect In deep learning, performance is strongly affected by the choice of architecture and hyperparameters. While there has been extensive work on automatic hyperparameter optimization for simple spaces, complex spaces such as the space of deep architectures remain largely unexplored. As a result, the choice of architecture is done manually by the human expert through a slow trial and error process guided mainly by intuition. In this paper we describe a framework for automatically designing and training deep models. We propose an extensible and modular language that allows the human expert to compactly represent complex search spaces over architectures and their hyperparameters. The resulting search spaces are tree-structured and therefore easy to traverse. Models can be automatically compiled to computational graphs once values for all hyperparameters have been chosen. We can leverage the structure of the search space to introduce different model search algorithms, such as random search, Monte Carlo tree search (MCTS), and sequential model-based optimization (SMBO). We present experiments comparing the different algorithms on CIFAR-10 and show that MCTS and SMBO outperform random search. In addition, these experiments show that our framework can be used effectively for model discovery, as it is possible to describe expressive search spaces and discover competitive models without much effort from the human expert. Code for our framework and experiments has been made publicly available. DeepAtlas Deep convolutional neural networks (CNNs) are state-of-the-art for semantic image segmentation, but typically require many labeled training samples. Obtaining 3D segmentations of medical images for supervised training is difficult and labor intensive. Motivated by classical approaches for joint segmentation and registration we therefore propose a deep learning framework that jointly learns networks for image registration and image segmentation. In contrast to previous work on deep unsupervised image registration, which showed the benefit of weak supervision via image segmentations, our approach can use existing segmentations when available and computes them via the segmentation network otherwise, thereby providing the same registration benefit. Conversely, segmentation network training benefits from the registration, which essentially provides a realistic form of data augmentation. Experiments on knee and brain 3D magnetic resonance (MR) images show that our approach achieves large simultaneous improvements of segmentation and registration accuracy (over independently trained networks) and allows training high-quality models with very limited training data. Specifically, in a one-shot-scenario (with only one manually labeled image) our approach increases Dice scores (%) over an unsupervised registration network by 2.7 and 1.8 on the knee and brain images respectively. DeepBalance Class imbalance problems manifest in domains such as financial fraud detection or network intrusion analysis, where the prevalence of one class is much higher than another. Typically, practitioners are more interested in predicting the minority class than the majority class as the minority class may carry a higher misclassification cost. However, classifier performance deteriorates in the face of class imbalance as oftentimes classifiers may predict every point as the majority class. Methods for dealing with class imbalance include cost-sensitive learning or resampling techniques. In this paper, we introduce DeepBalance, an ensemble of deep belief networks trained with balanced bootstraps and random feature selection. We demonstrate that our proposed method outperforms baseline resampling methods such as SMOTE and under- and over-sampling in metrics such as AUC and sensitivity when applied to a highly imbalanced financial transaction data. Additionally, we explore performance and training time implications of various model parameters. Furthermore, we show that our model is easily parallelizable, which can reduce training times. Finally, we present an implementation of DeepBalance in R. DeepBase Although deep learning models perform remarkably across a range of tasks such as language translation, parsing, and object recognition, it remains unclear whether, and to what extent, these models follow human-understandable logic or procedures when making predictions. Understanding this can lead to more interpretable models, better model design, and faster experimentation. Recent machine learning research has leveraged statistical methods to identify hidden units that behave (e.g., activate) similarly to human understandable logic such as detecting language features, however each analysis requires considerable manual effort. Our insight is that, from a query processing perspective, this high level logic is a query evaluated over a database of neural network hidden unit behaviors. This paper describes DeepBase, a system to inspect neural network behaviors through a query-based interface. We model high-level logic as hypothesis functions that transform an input dataset into time series signals. DeepBase lets users quickly identify individual or groups of units that have strong statistical dependencies with desired hypotheses. In fact, we show how many existing analyses are expressible as a single DeepBase query. We use DeepBase to analyze recurrent neural network models, and propose a set of simple and effective optimizations to speed up existing analysis approaches by up to 413x. We also group and analyze different portions of a real-world neural translation model and show that learns syntactic structure, which is consistent with prior NLP studies, but can be performed with only 3 DeepBase queries. DEEPBEAM Multi-channel speech enhancement with ad-hoc sensors has been a challenging task. Speech model guided beamforming algorithms are able to recover natural sounding speech, but the speech models tend to be oversimplified or the inference would otherwise be too complicated. On the other hand, deep learning based enhancement approaches are able to learn complicated speech distributions and perform efficient inference, but they are unable to deal with variable number of input channels. Also, deep learning approaches introduce a lot of errors, particularly in the presence of unseen noise types and settings. We have therefore proposed an enhancement framework called DEEPBEAM, which combines the two complementary classes of algorithms. DEEPBEAM introduces a beamforming filter to produce natural sounding speech, but the filter coefficients are determined with the help of a monaural speech enhancement neural network. Experiments on synthetic and real-world data show that DEEPBEAM is able to produce clean, dry and natural sounding speech, and is robust against unseen noise. DeepBoost We present a new ensemble learning algorithm, DeepBoost, which can use as base classifiers a hypothesis set containing deep decision trees, or members of other rich or complex families, and succeed in achieving high accuracy without overfitting the data. The key to the success of the algorithm is a capacity-conscious criterion for the selection of the hypotheses. We give new datadependent learning bounds for convex ensembles expressed in terms of the Rademacher complexities of the sub-families composing the base classifier set, and the mixture weight assigned to each sub-family. Our algorithm directly benefits from these guarantees since it seeks to minimize the corresponding learning bound. We give a full description of our algorithm, including the details of its derivation, and report the results of several experiments showing that its performance compares favorably to that of AdaBoost and Logistic Regression and their L1-regularized variants. DeepBoost DeepCABAC We present DeepCABAC, a novel context-adaptive binary arithmetic coder for compressing deep neural networks. It quantizes each weight parameter by minimizing a weighted rate-distortion function, which implicitly takes the impact of quantization on to the accuracy of the network into account. Subsequently, it compresses the quantized values into a bitstream representation with minimal redundancies. We show that DeepCABAC is able to reach very high compression ratios across a wide set of different network architectures and datasets. For instance, we are able to compress by x63.6 the VGG16 ImageNet model with no loss of accuracy, thus being able to represent the entire network with merely 8.7MB. DeepCaps Capsule Network is a promising concept in deep learning, yet its true potential is not fully realized thus far, providing sub-par performance on several key benchmark datasets with complex data. Drawing intuition from the success achieved by Convolutional Neural Networks (CNNs) by going deeper, we introduce DeepCaps1, a deep capsule network architecture which uses a novel 3D convolution based dynamic routing algorithm. With DeepCaps, we surpass the state-of-the-art results in the capsule network domain on CIFAR10, SVHN and Fashion MNIST, while achieving a 68% reduction in the number of parameters. Further, we propose a class-independent decoder network, which strengthens the use of reconstruction loss as a regularization term. This leads to an interesting property of the decoder, which allows us to identify and control the physical attributes of the images represented by the instantiation parameters. DeepCheck Code reuse attack (CRA) is a powerful attack that reuses existing codes to hijack the program control flow. Control flow integrity (CFI) is one of the most popular mechanisms to prevent against CRAs. However, current CFI techniques are difficult to be deployed in real applications due to suffering several issues such as modifying binaries or compiler, extending instruction set architectures (ISA) and incurring unacceptable runtime overhead. To address these issues, we propose the first deep learning-based CFI technique, named DeepCheck, where the control flow graph (CFG) is split into chains for deep neural network (DNN) training. Then the integrity features of CFG can be learned by DNN to detect abnormal control flows. DeepCheck does not interrupt the application and hence incurs zero runtime overhead. Experimental results on Adobe Flash Player, Nginx, Proftpd and Firefox show that the average detection accuracy of DeepCheck is as high as 98.9%. In addition, 64 ROP exploits created by ROPGadget and Ropper are used to further test the effectiveness, which shows that the detection success rate reaches 100%. DeepCloud Generative systems have a significant potential to synthesize innovative design alternatives. Still, most of the common systems that have been adopted in design require the designer to explicitly define the specifications of the procedures and in some cases the design space. In contrast, a generative system could potentially learn both aspects through processing a database of existing solutions without the supervision of the designer. To explore this possibility, we review recent advancements of generative models in machine learning and current applications of learning techniques in design. Then, we describe the development of a data-driven generative system titled DeepCloud. It combines an autoencoder architecture for point clouds with a web-based interface and analog input devices to provide an intuitive experience for data-driven generation of design alternatives. We delineate the implementation of two prototypes of DeepCloud, their contributions, and potentials for generative design. DeepDetect DeepDetect is a deep learning API and server written in C++11. It makes state of the art deep learning easy to work with and integrate into existing applications. DeepDive DeepDive is a new type of system that enables developers to analyze data on a deeper level than ever before. DeepDive is a trained system: it uses machine learning techniques to leverage on domain-specific knowledge and incorporates user feedback to improve the quality of its analysis. DeepDSL In recent years, Deep Learning (DL) has found great success in domains such as multimedia understanding. However, the complex nature of multimedia data makes it difficult to develop DL-based software. The state-of-the art tools, such as Caffe, TensorFlow, Torch7, and CNTK, while are successful in their applicable domains, are programming libraries with fixed user interface, internal representation, and execution environment. This makes it difficult to implement portable and customized DL applications. In this paper, we present DeepDSL, a domain specific language (DSL) embedded in Scala, that compiles deep networks written in DeepDSL to Java source code. Deep DSL provides (1) intuitive constructs to support compact encoding of deep networks; (2) symbolic gradient derivation of the networks; (3) static analysis for memory consumption and error detection; and (4) DSL-level optimization to improve memory and runtime efficiency. DeepDSL programs are compiled into compact, efficient, customizable, and portable Java source code, which operates the CUDA and CUDNN interfaces running on Nvidia GPU via a Java Native Interface (JNI) library. We evaluated DeepDSL with a number of popular DL networks. Our experiments show that the compiled programs have very competitive runtime performance and memory efficiency compared to the existing libraries. DeepER Entity Resolution (ER) is a fundamental problem with many applications. Machine learning (ML)-based and rule-based approaches have been widely studied for decades, with many efforts being geared towards which features/attributes to select, which similarity functions to employ, and which blocking function to use – complicating the deployment of an ER system as a turn-key system. In this paper, we present DeepER, a turn-key ER system powered by deep learning (DL) techniques. The central idea is that distributed representations and representation learning from DL can alleviate the above human efforts for tuning existing ER systems. DeepER makes several notable contributions: encoding a tuple as a distributed representation of attribute values, building classifiers using these representations and a semantic aware blocking based on LSH, and learning and tuning the distributed representations for ER. We evaluate our algorithms on multiple benchmark datasets and achieve competitive results while requiring minimal interaction with experts. DeepFault Deep Neural Networks (DNNs) are increasingly deployed in safety-critical applications including autonomous vehicles and medical diagnostics. To reduce the residual risk for unexpected DNN behaviour and provide evidence for their trustworthy operation, DNNs should be thoroughly tested. The DeepFault whitebox DNN testing approach presented in our paper addresses this challenge by employing suspiciousness measures inspired by fault localization to establish the hit spectrum of neurons and identify suspicious neurons whose weights have not been calibrated correctly and thus are considered responsible for inadequate DNN performance. DeepFault also uses a suspiciousness-guided algorithm to synthesize new inputs, from correctly classified inputs, that increase the activation values of suspicious neurons. Our empirical evaluation on several DNN instances trained on MNIST and CIFAR-10 datasets shows that DeepFault is effective in identifying suspicious neurons. Also, the inputs synthesized by DeepFault closely resemble the original inputs, exercise the identified suspicious neurons and are highly adversarial. DeepFeat A deep feature based saliency model (DeepFeat) is developed to leverage the understanding of the prediction of human fixations. Traditional saliency models often predict the human visual attention relying on few level image cues. Although such models predict fixations on a variety of image complexities, their approaches are limited to the incorporated features. In this study, we aim to provide an intuitive interpretation of convolu- tional neural network deep features by combining low and high level visual factors. We exploit four evaluation metrics to evaluate the correspondence between the proposed framework and the ground-truth fixations. The key findings of the results demon- strate that the DeepFeat algorithm, incorporation of bottom up and top down saliency maps, outperforms the individual bottom up and top down approach. Moreover, in comparison to nine 9 state-of-the-art saliency models, our proposed DeepFeat model achieves satisfactory performance based on all four evaluation metrics. DeepFirearm There are great demands for automatically regulating inappropriate appearance of shocking firearm images in social media or identifying firearm types in forensics. Image retrieval techniques have great potential to solve these problems. To facilitate research in this area, we introduce Firearm 14k, a large dataset consisting of over 14,000 images in 167 categories. It can be used for both fine-grained recognition and retrieval of firearm images. Recent advances in image retrieval are mainly driven by fine-tuning state-of-the-art convolutional neural networks for retrieval task. The conventional single margin contrastive loss, known for its simplicity and good performance, has been widely used. We find that it performs poorly on the Firearm 14k dataset due to: (1) Loss contributed by positive and negative image pairs is unbalanced during training process. (2) A huge domain gap exists between this dataset and ImageNet. We propose to deal with the unbalanced loss by employing a double margin contrastive loss. We tackle the domain gap issue with a two-stage training strategy, where we first fine-tune the network for classification, and then fine-tune it for retrieval. Experimental results show that our approach outperforms the conventional single margin approach by a large margin (up to 88.5% relative improvement) and even surpasses the strong triplet-loss-based approach. DeepFlow The calibration of a reservoir model with observed transient data of fluid pressures and rates is a key task in obtaining a predictive model of the flow and transport behaviour of the earth’s subsurface. The model calibration task, commonly referred to as ‘history matching’, can be formalised as an ill-posed inverse problem where we aim to find the underlying spatial distribution of petrophysical properties that explain the observed dynamic data. We use a generative adversarial network pretrained on geostatistical object-based models to represent the distribution of rock properties for a synthetic model of a hydrocarbon reservoir. The dynamic behaviour of the reservoir fluids is modelled using a transient two-phase incompressible Darcy formulation. We invert for the underlying reservoir properties by first modeling property distributions using the pre-trained generative model then using the adjoint equations of the forward problem to perform gradient descent on the latent variables that control the output of the generative model. In addition to the dynamic observation data, we include well rock-type constraints by introducing an additional objective function. Our contribution shows that for a synthetic test case, we are able to obtain solutions to the inverse problem by optimising in the latent variable space of a deep generative model, given a set of transient observations of a non-linear forward problem. DeepFool State-of-the-art deep neural networks have achieved impressive results on many image classification tasks. However, these same architectures have been shown to be unstable to small, well sought, perturbations of the images. Despite the importance of this phenomenon, no effective methods have been proposed to accurately compute the robustness of state-of-the-art deep classifiers to such perturbations on large-scale datasets. In this paper, we fill this gap and propose the DeepFool algorithm to efficiently compute perturbations that fool deep networks, and thus reliably quantify the robustness of these classifiers. Extensive experimental results show that our approach outperforms recent methods in the task of computing adversarial perturbations and making classifiers more robust DeepFool – A simple and accurate method to fool Deep Neural Networks. DeepFuse We present a novel deep learning architecture for fusing static multi-exposure images. Current multi-exposure fusion (MEF) approaches use hand-crafted features to fuse input sequence. However, the weak hand-crafted representations are not robust to varying input conditions. Moreover, they perform poorly for extreme exposure image pairs. Thus, it is highly desirable to have a method that is robust to varying input conditions and capable of handling extreme exposure without artifacts. Deep representations have known to be robust to input conditions and have shown phenomenal performance in a supervised setting. However, the stumbling block in using deep learning for MEF was the lack of sufficient training data and an oracle to provide the ground-truth for supervision. To address the above issues, we have gathered a large dataset of multi-exposure image stacks for training and to circumvent the need for ground truth images, we propose an unsupervised deep learning framework for MEF utilizing a no-reference quality metric as loss function. The proposed approach uses a novel CNN architecture trained to learn the fusion operation without reference ground truth image. The model fuses a set of common low level features extracted from each image to generate artifact-free perceptually pleasing results. We perform extensive quantitative and qualitative evaluation and show that the proposed technique outperforms existing state-of-the-art approaches for a variety of natural images. DeepGini Deep neural network (DNN) based systems have been deployed to assist various tasks, including many safety-critical scenarios such as autonomous driving and medical image diagnostics. In company with the DNN-based systems’ fantastic accuracy on the well-defined tasks, these systems could also exhibit incorrect behaviors and thus severe accidents and losses. Therefore, beyond the conventional accuracy-based evaluation, the testing method that can assist developers in detecting incorrect behaviors in the earlier stage is critical for quality assurance of these systems. However, given the fact that automated oracle is often not available, testing DNN-based system usually requires prohibitively expensive human efforts to label the testing data. In this paper, to reduce the efforts in labeling the testing data of DNN-based systems, we propose DeepGini, a test prioritization technique for assisting developers in identifying the tests that can reveal the incorrect behavior. DeepGini is designed based on a statistical perspective of DNN, which allows us to transform the problem of measuring the likelihood of misclassification to the problem of measuing the impurity of data set. To validate our technique, we conduct an extensive empirical study on four popular datasets. The experiment results show that DeepGini outperforms the neuron-coverage-based test prioritization in terms of both efficacy and efficiency. DeepGLO Forecasting high-dimensional time series plays a crucial role in many applications such as demand forecasting and financial predictions. Modern real-world datasets can have millions of correlated time-series that evolve together, i.e they are extremely high dimensional (one dimension for each individual time-series). Thus there is need for exploiting these global patterns and coupling them with local calibration for better prediction. However, most recent deep learning approaches in the literature are one-dimensional, i.e, even though they are trained on the whole dataset, during prediction, the future forecast for a single dimension mainly depends on past values from the same dimension. In this paper, we seek to correct this deficiency and propose DeepGLO, a deep forecasting model which thinks globally and acts locally. In particular, DeepGLO is a hybrid model that combines a global matrix factorization model regularized by a temporal deep network with a local deep temporal model that captures patterns specific to each dimension. The global and local models are combined via a data-driven attention mechanism for each dimension. The proposed deep architecture used is a variation of temporal convolution termed as leveled network which can be trained effectively on high-dimensional but diverse time series, where different time series can have vastly different scales, without a priori normalization or rescaling. Empirical results demonstrate that DeepGLO outperforms state-of-the-art approaches on various datasets; for example, we see more than 30% improvement in WAPE over other methods on a real-world dataset that contains more than 100K-dimensional time series. DeepGraph The topological (or graph) structures of real-world networks are known to be predictive of multiple dynamic properties of the networks. Conventionally, a graph structure is represented using an adjacency matrix or a set of hand-crafted structural features. These representations either fail to highlight local and global properties of the graph or suffer from a severe loss of structural information. There lacks an effective graph representation, which hinges the realization of the predictive power of network structures. In this study, we propose to learn the represention of a graph, or the topological structure of a network, through a deep learning model. This end-to-end prediction model, named DeepGraph, takes the input of the raw adjacency matrix of a real-world network and outputs a prediction of the growth of the network. The adjacency matrix is first represented using a graph descriptor based on the heat kernel signature, which is then passed through a multi-column, multi-resolution convolutional neural network. Extensive experiments on five large collections of real-world networks demonstrate that the proposed prediction model significantly improves the effectiveness of existing methods, including linear or nonlinear regressors that use hand-crafted features, graph kernels, and competing deep learning methods. DeepImageSpam Hackers and spammers are employing innovative and novel techniques to deceive novice and even knowledgeable internet users. Image spam is one of such technique where the spammer varies and changes some portion of the image such that it is indistinguishable from the original image fooling the users. This paper proposes a deep learning based approach for image spam detection using the convolutional neural networks which uses a dataset with 810 natural images and 928 spam images for classification achieving an accuracy of 91.7% outperforming the existing image processing and machine learning techniques DeepInf Social and information networking activities such as on Facebook, Twitter, WeChat, and Weibo have become an indispensable part of our everyday life, where we can easily access friends’ behaviors and are in turn influenced by them. Consequently, an effective social influence prediction for each user is critical for a variety of applications such as online recommendation and advertising. Conventional social influence prediction approaches typically design various hand-crafted rules to extract user- and network-specific features. However, their effectiveness heavily relies on the knowledge of domain experts. As a result, it is usually difficult to generalize them into different domains. Inspired by the recent success of deep neural networks in a wide range of computing applications, we design an end-to-end framework, DeepInf, to learn users’ latent feature representation for predicting social influence. In general, DeepInf takes a user’s local network as the input to a graph neural network for learning her latent social representation. We design strategies to incorporate both network structures and user-specific features into convolutional neural and attention networks. Extensive experiments on Open Academic Graph, Twitter, Weibo, and Digg, representing different types of social and information networks, demonstrate that the proposed end-to-end model, DeepInf, significantly outperforms traditional feature engineering-based approaches, suggesting the effectiveness of representation learning for social applications. DeepInspect Image classification is an important task in today’s world with many applications from socio-technical to safety-critical domains. The recent advent of Deep Neural Network (DNN) is the key behind such a wide-spread success. However, such wide adoption comes with the concerns about the reliability of these systems, as several erroneous behaviors have already been reported in many sensitive and critical circumstances. Thus, it has become crucial to rigorously test the image classifiers to ensure high reliability. Many reported erroneous cases in popular neural image classifiers appear because the models often confuse one class with another, or show biases towards some classes over others. These errors usually violate some group properties. Most existing DNN testing and verification techniques focus on per image violations and thus fail to detect such group-level confusions or biases. In this paper, we design, implement and evaluate DeepInspect, a white box testing tool, for automatically detecting confusion and bias of DNN-driven image classification applications. We evaluate DeepInspect using popular DNN-based image classifiers and detect hundreds of classification mistakes. Some of these cases are able to expose potential biases of the network towards certain populations. DeepInspect further reports many classification errors in state-of-the-art robust models. DeepJDOT In computer vision, one is often confronted with problems of domain shifts, which occur when one applies a classifier trained on a source dataset to target data sharing similar characteristics (e.g. same classes), but also different latent data structures (e.g. different acquisition conditions). In such a situation, the model will perform poorly on the new data, since the classifier is specialized to recognize visual cues specific to the source domain. In this work we explore a solution, named DeepJDOT, to tackle this problem: through a measure of discrepancy on joint deep representations/labels based on optimal transport, we not only learn new data representations aligned between the source and target domain, but also simultaneously preserve the discriminative information used by the classifier. We applied DeepJDOT to a series of visual recognition tasks, where it compares favorably against state-of-the-art deep domain adaptation methods. DeepLab DeepLab is a state-of-art deep learning model for semantic image segmentation, where the goal is to assign semantic labels (e.g., person, dog, cat and so on) to every pixel in the input image. DeepLearningKit In this paper we present DeepLearningKit – an open source framework that supports using pretrained deep learning models (convolutional neural networks) for iOS, OS X and tvOS. DeepLearningKit is developed in Metal in order to utilize the GPU efficiently and Swift for integration with applications, e.g. iOS-based mobile apps on iPhone/iPad, tvOS-based apps for the big screen, or OS X desktop applications. The goal is to support using deep learning models trained with popular frameworks such as Caffe, Torch, TensorFlow, Theano, Pylearn, Deeplearning4J and Mocha. Given the massive GPU resources and time required to train Deep Learning models we suggest an App Store like model to distribute and download pretrained and reusable Deep Learning models. DeepLens Advances in deep learning have greatly widened the scope of automatic computer vision algorithms and enable users to ask questions directly about the content in images and video. This paper explores the necessary steps towards a future Visual Data Management System (VDMS), where the predictions of such deep learning models are stored, managed, queried, and indexed. We propose a query and data model that disentangles the neural network models used, the query workload, and the data source semantics from the query processing layer. Our system, DeepLens, is based on dataflow query processing systems and this research prototype presents initial experiments to elicit important open research questions in visual analytics systems. One of our main conclusions is that any future ‘declarative’ VDMS will have to revisit query optimization and automated physical design from a unified perspective of performance and accuracy tradeoffs. Physical design and query optimization choices can not only change performance by orders of magnitude, they can potentially affect the accuracy of results. DeepLink Recently, link prediction has attracted more attentions from various disciplines such as computer science, bioinformatics and economics. In this problem, unknown links between nodes are discovered based on numerous information such as network topology, profile information and user generated contents. Most of the previous researchers have focused on the structural features of the networks. While the recent researches indicate that contextual information can change the network topology. Although, there are number of valuable researches which combine structural and content information, but they face with the scalability issue due to feature engineering. Because, majority of the extracted features are obtained by a supervised or semi supervised algorithm. Moreover, the existing features are not general enough to indicate good performance on different networks with heterogeneous structures. Besides, most of the previous researches are presented for undirected and unweighted networks. In this paper, a novel link prediction framework called ‘DeepLink’ is presented based on deep learning techniques. In contrast to the previous researches which fail to automatically extract best features for the link prediction, deep learning reduces the manual feature engineering. In this framework, both the structural and content information of the nodes are employed. The framework can use different structural feature vectors, which are prepared by various link prediction methods. It considers all proximity orders that are presented in a network during the structural feature learning. We have evaluated the performance of DeepLink on two real social network datasets including Telegram and irBlogs. On both datasets, the proposed framework outperforms several structural and hybrid approaches for link prediction problem. Deeply Supervised Object Detector(DSOD) We propose Deeply Supervised Object Detectors (DSOD), an object detection framework that can be trained from scratch. Recent advances in object detection heavily depend on the off-the-shelf models pre-trained on large-scale classification datasets like ImageNet and OpenImage. However, one problem is that adopting pre-trained models from classification to detection task may incur learning bias due to the different objective function and diverse distributions of object categories. Techniques like fine-tuning on detection task could alleviate this issue to some extent but are still not fundamental. Furthermore, transferring these pre-trained models across discrepant domains will be more difficult (e.g., from RGB to depth images). Thus, a better solution to handle these critical problems is to train object detectors from scratch, which motivates our proposed method. Previous efforts on this direction mainly failed by reasons of the limited training data and naive backbone network structures for object detection. In DSOD, we contribute a set of design principles for learning object detectors from scratch. One of the key principles is the deep supervision, enabled by layer-wise dense connections in both backbone networks and prediction layers, plays a critical role in learning good detectors from scratch. After involving several other principles, we build our DSOD based on the single-shot detection framework (SSD). We evaluate our method on PASCAL VOC 2007, 2012 and COCO datasets. DSOD achieves consistently better results than the state-of-the-art methods with much more compact models. Specifically, DSOD outperforms baseline method SSD on all three benchmarks, while requiring only 1/2 parameters. We also observe that DSOD can achieve comparable/slightly better results than Mask RCNN + FPN (under similar input size) with only 1/3 parameters, using no extra data or pre-trained models. Deeply-Recursive Network(DR-ResNet) The estimation of crowd count in images has a wide range of applications such as video surveillance, traffic monitoring, public safety and urban planning. Recently, the convolutional neural network (CNN) based approaches have been shown to be more effective in crowd counting than traditional methods that use handcrafted features. However, the existing CNN-based methods still suffer from large number of parameters and large storage space, which require high storage and computing resources and thus limit the real-world application. Consequently, we propose a deeply-recursive network (DR-ResNet) based on ResNet blocks for crowd counting. The recursive structure makes the network deeper while keeping the number of parameters unchanged, which enhances network capability to capture statistical regularities in the context of the crowd. Besides, we generate a new dataset from the video-monitoring data of Beijing bus station. Experimental results have demonstrated that proposed method outperforms most state-of-the-art methods with far less number of parameters. Deeply-Supervised Nets(DSN) Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce ‘companion objective’ to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN). DeepMatch We study optimal covariate balance for causal inferences from observational data when rich covariates and complex relationships necessitate flexible modeling with neural networks. Standard approaches such as propensity weighting and matching/balancing fail in such settings due to miscalibrated propensity nets and inappropriate covariate representations, respectively. We propose a new method based on adversarial training of a weighting and a discriminator network that effectively addresses this methodological gap. This is demonstrated through new theoretical characterizations of the method as well as empirical results using both fully connected architectures to learn complex relationships and convolutional architectures to handle image confounders, showing how this new method can enable strong causal analyses in these challenging settings. DeepMNE Networks are ubiquitous structure that describes complex relationships between different entities in the real world. As a critical component of prediction task over nodes in networks, learning the feature representation of nodes has become one of the most active areas recently. Network Embedding, aiming to learn non-linear and low-dimensional feature representation based on network topology, has been proved to be helpful on tasks of network analysis, especially node classification. For many real-world systems, multiple types of relations are naturally represented by multiple networks. However, existing network embedding methods mainly focus on single network embedding and neglect the information shared among different networks. In this paper, we propose a novel multiple network embedding method based on semisupervised autoencoder, named DeepMNE, which captures complex topological structures of multi-networks and takes the correlation among multi-networks into account. We evaluate DeepMNE on the task of node classification with two real-world datasets. The experimental results demonstrate the superior performance of our method over four state-of-the-art algorithms. DeepMoD We introduce DeepMoD, a deep learning based model discovery algorithm which seeks the partial differential equation underlying a spatio-temporal data set. DeepMoD employs sparse regression on a library of basis functions and their corresponding spatial derivatives. A feed-forward neural network approximates the data set and automatic differentiation is used to construct this function library and perform regression within the neural network. This construction makes it extremely robust to noise and applicable to small data sets and, contrary to other deep learning methods, does not require a training set and is impervious to overfitting. We illustrate this approach on several physical problems, such as the Burgers’, Korteweg-de Vries, advection-diffusion and Keller-Segel equations, and find that it requires as few as O(10^2) samples and works at noise levels up to 75%. This resilience to noise and high performance at very few samples highlights the potential of this method to be applied on experimental data. Code and examples available at https://…/DeePyMoD. DeepMutation Deep learning (DL) defines a new data-driven programming paradigm where the internal system logic is largely shaped by the training data. The standard way of evaluating DL models is to examine their performance on a test dataset. The quality of the test dataset is of great importance to gain confidence of the trained models. Using an inadequate test dataset, DL models that have achieved high test accuracy may still lack generality and robustness. In traditional software testing, mutation testing is a well-established technique for quality evaluation of test suites, which analyzes to what extent a test suite detects the injected faults. However, due to the fundamental difference between traditional software and deep learning-based software, traditional mutation testing techniques cannot be directly applied to DL systems. In this paper, we propose a mutation testing framework specialized for DL systems to measure the quality of test data. To do this, by sharing the same spirit of mutation testing in traditional software, we first define a set of source-level mutation operators to inject faults to the source of DL (i.e., training data and training programs). Then we design a set of model-level mutation operators that directly inject faults into DL models without a training process. Eventually, the quality of test data could be evaluated from the analysis on to what extent the injected faults could be detected. The usefulness of the proposed mutation testing techniques is demonstrated on two public datasets, namely MNIST and CIFAR-10, with three DL models. DeepPath We study the problem of learning to reason in large scale knowledge graphs (KGs). More specifically, we describe a novel reinforcement learning framework for learning multi-hop relational paths: we use a policy-based agent with continuous states based on knowledge graph embeddings, which reasons in a KG vector space by sampling the most promising relation to extend its path. In contrast to prior work, our approach includes a reward function that takes the accuracy, diversity, and efficiency into consideration. Experimentally, we show that our proposed method outperforms a path-ranking based algorithm and knowledge graph embedding methods on Freebase and Never-Ending Language Learning datasets. DeepPavlov DeepPavlov is an open-source conversational AI library built on TensorFlow and Keras. It is designed for • development of production ready chat-bots and complex conversational systems, • NLP and dialog systems research. DeepPool The success of modern ride-sharing platforms crucially depends on the profit of the ride-sharing fleet operating companies, and how efficiently the resources are managed. Further, ride-sharing allows sharing costs and, hence, reduces the congestion and emission by making better use of vehicle capacities. In this work, we develop a distributed model-free, DeepPool, that uses deep Q-network (DQN) techniques to learn optimal dispatch policies by interacting with the environment. Further, DeepPool efficiently incorporates travel demand statistics and deep learning models to manage dispatching vehicles for improved ride sharing services. Using real-world dataset of taxi trip records in New York City, DeepPool performs better than other strategies, proposed in the literature, that do not consider ride sharing or do not dispatch the vehicles to regions where the future demand is anticipated. Finally, DeepPool can adapt rapidly to dynamic environments since it is implemented in a distributed manner in which each vehicle solves its own DQN individually without coordination. DeepPos The widespread mobile devices facilitated the emergence of many new applications and services. Among them are location-based services (LBS) that provide services based on user’s location. Several techniques have been presented to enable LBS even in indoor environments where Global Positioning System (GPS) has low localization accuracy. These methods use some environment measurements (like Channel State Information (CSI) or Received Signal Strength (RSS)) for user localization. In this paper, we will use CSI and a novel deep learning algorithm to design a robust and efficient system for indoor localization. More precisely, we use supervised autoencoder (SAE) to model the environment using the data collected during the training phase. Then, during the testing phase, we use the trained model and estimate the coordinates of the unknown point by checking different possible labels. Unlike the previous fingerprinting approaches, in this work, we do not store the {CSI/RSS} of fingerprints and instead we model the environment only with a single SAE. The performance of the proposed scheme is then evaluated in two indoor environments and compared with that of similar approaches. DeepProbe Information extraction and user intention identification are central topics in modern query understanding and recommendation systems. In this paper, we propose DeepProbe, a generic information-directed interaction framework which is built around an attention-based sequence to sequence (seq2seq) recurrent neural network. DeepProbe can rephrase, evaluate, and even actively ask questions, leveraging the generative ability and likelihood estimation made possible by seq2seq models. DeepProbe makes decisions based on a derived uncertainty (entropy) measure conditioned on user inputs, possibly with multiple rounds of interactions. Three applications, namely a rewritter, a relevance scorer and a chatbot for ad recommendation, were built around DeepProbe, with the first two serving as precursory building blocks for the third. We first use the seq2seq model in DeepProbe to rewrite a user query into one of standard query form, which is submitted to an ordinary recommendation system. Secondly, we evaluate DeepProbe’s seq2seq model-based relevance scoring. Finally, we build a chatbot prototype capable of making active user interactions, which can ask questions that maximize information gain, allowing for a more efficient user intention idenfication process. We evaluate first two applications by 1) comparing with baselines by BLEU and AUC, and 2) human judge evaluation. Both demonstrate significant improvements compared with current state-of-the-art systems, proving their values as useful tools on their own, and at the same time laying a good foundation for the ongoing chatbot application. DeepProbLog We introduce DeepProbLog, a probabilistic logic programming language that incorporates deep learning by means of neural predicates. We show how existing inference and learning techniques can be adapted for the new language. Our experiments demonstrate that DeepProbLog supports both symbolic and subsymbolic representations and inference, 1) program induction, 2) probabilistic (logic) programming, and 3) (deep) learning from examples. To the best of our knowledge, this work is the first to propose a framework where general-purpose neural networks and expressive probabilistic-logical modeling and reasoning are integrated in a way that exploits the full expressiveness and strengths of both worlds and can be trained end-to-end based on examples. DeepRank This paper concerns a deep learning approach to relevance ranking in information retrieval (IR). Existing deep IR models such as DSSM and CDSSM directly apply neural networks to generate ranking scores, without explicit understandings of the relevance. According to the human judgement process, a relevance label is generated by the following three steps: 1) relevant locations are detected, 2) local relevances are determined, 3) local relevances are aggregated to output the relevance label. In this paper we propose a new deep learning architecture, namely DeepRank, to simulate the above human judgment process. Firstly, a detection strategy is designed to extract the relevant contexts. Then, a measure network is applied to determine the local relevances by utilizing a convolutional neural network (CNN) or two-dimensional gated recurrent units (2D-GRU). Finally, an aggregation network with sequential integration and term gating mechanism is used to produce a global relevance score. DeepRank well captures important IR characteristics, including exact/semantic matching signals, proximity heuristics, query term importance, and diverse relevance requirement. Experiments on both benchmark LETOR dataset and a large scale clickthrough data show that DeepRank can significantly outperform learning to ranking methods, and existing deep learning methods. DeepSaucer In recent years, a number of methods for verifying DNNs have been developed. Because the approaches of the methods differ and have their own limitations, we think that a number of verification methods should be applied to a developed DNN. To apply a number of methods to the DNN, it is necessary to translate either the implementation of the DNN or the verification method so that one runs in the same environment as the other. Since those translations are time-consuming, a utility tool, named DeepSaucer, which helps to retain and reuse implementations of DNNs, verification methods, and their environments, is proposed. In DeepSaucer, code snippets of loading DNNs, running verification methods, and creating their environments are retained and reused as software assets in order to reduce cost of verifying DNNs. The feasibility of DeepSaucer is confirmed by implementing it on the basis of Anaconda, which provides virtual environment for loading a DNN and running a verification method. In addition, the effectiveness of DeepSaucer is demonstrated by usecase examples. DeepSense Mobile sensing applications usually require time-series inputs from sensors. Some applications, such as tracking, can use sensed acceleration and rate of rotation to calculate displacement based on physical system models. Other applications, such as activity recognition, extract manually designed features from sensor inputs for classification. Such applications face two challenges. On one hand, on-device sensor measurements are noisy. For many mobile applications, it is hard to find a distribution that exactly describes the noise in practice. Unfortunately, calculating target quantities based on physical system and noise models is only as accurate as the noise assumptions. Similarly, in classification applications, although manually designed features have proven to be effective, it is not always straightforward to find the most robust features to accommodate diverse sensor noise patterns and user behaviors. To this end, we propose DeepSense, a deep learning framework that directly addresses the aforementioned noise and feature customization challenges in a unified manner. DeepSense integrates convolutional and recurrent neural networks to exploit local interactions among similar mobile sensors, merge local interactions of different sensory modalities into global interactions, and extract temporal relationships to model signal dynamics. DeepSense thus provides a general signal estimation and classification framework that accommodates a wide range of applications. We demonstrate the effectiveness of DeepSense using three representative and challenging tasks: car tracking with motion sensors, heterogeneous human activity recognition, and user identification with biometric motion analysis. DeepSense significantly outperforms the state-of-the-art methods for all three tasks. In addition, DeepSense is feasible to implement on smartphones due to its moderate energy consumption and low latency Deep-Shallow Incremental Learning(DeeSIL) Incremental Learning (IL) is an interesting AI problem when the algorithm is assumed to work on a budget. This is especially true when IL is modeled using a deep learning approach, where two complex challenges arise due to limited memory, which induces catastrophic forgetting and delays related to the retraining needed in order to incorporate new classes. Here we introduce DeeSIL, an adaptation of a known transfer learning scheme that combines a fixed deep representation used as feature extractor and learning independent shallow classifiers to increase recognition capacity. This scheme tackles the two aforementioned challenges since it works well with a limited memory budget and each new concept can be added within a minute. Moreover, since no deep retraining is needed when the model is incremented, DeeSIL can integrate larger amounts of initial data that provide more transferable features. Performance is evaluated on ImageNet LSVRC 2012 against three state of the art algorithms. Results show that, at scale, DeeSIL performance is 23 and 33 points higher than the best baseline when using the same and more initial data respectively. DeepSOFA Traditional methods for assessing illness severity and predicting in-hospital mortality among critically ill patients require manual, time-consuming, and error-prone calculations that are further hindered by the use of static variable thresholds derived from aggregate patient populations. These coarse frameworks do not capture time-sensitive individual physiological patterns and are not suitable for instantaneous assessment of patients’ acuity trajectories, a critical task for the ICU where conditions often change rapidly. Furthermore, they are ill-suited to capitalize on the emerging availability of streaming electronic health record data. We propose a novel acuity score framework (DeepSOFA) that leverages temporal patient measurements in conjunction with deep learning models to make accurate assessments of a patient’s illness severity at any point during their ICU stay. We compare DeepSOFA with SOFA baseline models using the same predictors and find that at any point during an ICU admission, DeepSOFA yields more accurate predictions of in-hospital mortality. DeepSSM Statistical shape modeling is an important tool to characterize variation in anatomical morphology. Typical shapes of interest are measured using 3D imaging and a subsequent pipeline of registration, segmentation, and some extraction of shape features or projections onto some lower-dimensional shape space, which facilitates subsequent statistical analysis. Many methods for constructing compact shape representations have been proposed, but are often impractical due to the sequence of image preprocessing operations, which involve significant parameter tuning, manual delineation, and/or quality control by the users. We propose DeepSSM: a deep learning approach to extract a low-dimensional shape representation directly from 3D images, requiring virtually no parameter tuning or user assistance. DeepSSM uses a convolutional neural network (CNN) that simultaneously localizes the biological structure of interest, establishes correspondences, and projects these points onto a low-dimensional shape representation in the form of PCA loadings within a point distribution model. To overcome the challenge of the limited availability of training images, we present a novel data augmentation procedure that uses existing correspondences on a relatively small set of processed images with shape statistics to create plausible training samples with known shape parameters. Hence, we leverage the limited CT/MRI scans (40-50) into thousands of images needed to train a CNN. After the training, the CNN automatically produces accurate low-dimensional shape representations for unseen images. We validate DeepSSM for three different applications pertaining to modeling pediatric cranial CT for characterization of metopic craniosynostosis, femur CT scans identifying morphologic deformities of the hip due to femoroacetabular impingement, and left atrium MRI scans for atrial fibrillation recurrence prediction. DeepSurvival Pedestrian’s road crossing behaviour is one of the important aspects of urban dynamics that will be affected by the introduction of autonomous vehicles. In this study we introduce DeepSurvival, a novel framework for estimating pedestrian’s waiting time at unsignalized mid-block crosswalks in mixed traffic conditions. We exploit the strengths of deep learning in capturing the nonlinearities in the data and develop a cox proportional hazard model with a deep neural network as the log-risk function. An embedded feature selection algorithm for reducing data dimensionality and enhancing the interpretability of the network is also developed. We test our framework on a dataset collected from 160 participants using an immersive virtual reality environment. Validation results showed that with a C-index of 0.64 our proposed framework outperformed the standard cox proportional hazard-based model with a C-index of 0.58. DeepSwarm In this paper we propose DeepSwarm, a novel neural architecture search (NAS) method based on Swarm Intelligence principles. At its core DeepSwarm uses Ant Colony Optimization (ACO) to generate ant population which uses the pheromone information to collectively search for the best neural architecture. Furthermore, by using local and global pheromone update rules our method ensures the balance between exploitation and exploration. On top of this, to make our method more efficient we combine progressive neural architecture search with weight reusability. Furthermore, due to the nature of ACO our method can incorporate heuristic information which can further speed up the search process. After systematic and extensive evaluation, we discover that on three different datasets (MNIST, Fashion-MNIST, and CIFAR-10) when compared to existing systems our proposed method demonstrates competitive performance. Finally, we open source DeepSwarm as a NAS library and hope it can be used by more deep learning researchers and practitioners. DeepTag In many under-resourced settings, clinicians lack time and expertise to annotate patients with standard medical diagnosis codes. Veterinary medicine is an example of this and clinical encounters are largely captured in free text notes which are not labeled with diagnosis code. The lack of such standard coding makes it challenging to apply data science to improve patient care. It is also a major impediment to translational research, where, for example, we would like to leverage veterinary data to inform drug development for humans. We develop a deep learning algorithm, DeepTag, to automatically infer diagnosis codes from veterinarian free text notes. DeepTag is trained on a newly curated dataset of 112,558 veterinary notes manually annotated by experts. DeepTag extends multi-task LSTM with an improved hierarchical objective that captures structures between diseases. To foster human-machine collaboration, DeepTag also learns to abstain in examples when it is uncertain and defer them to human experts, resulting in improved performance of the model. DeepTag accurately infers disease codes from free text even in challenging out-of-domain settings where the text comes from different clinics than the ones used for training. It enables automated disease annotation across a broad range of clinical diagnoses with minimal pre-processing. The technical framework in this work can be applied in other medical domains that currently lack medical coding infrastructure. DeepTagRec In this paper, we develop a content-cum-user based deep learning framework DeepTagRec to recommend appropriate question tags on Stack Overflow. The proposed system learns the content representation from question title and body. Subsequently, the learnt representation from heterogeneous relationship between user and tags is fused with the content representation for the final tag prediction. On a very large-scale dataset comprising half a million question posts, DeepTagRec beats all the baselines; in particular, it significantly outperforms the best performing baseline T agCombine achieving an overall gain of 60.8% and 36.8% in precision@3 and recall@10 respectively. DeepTagRec also achieves 63% and 33.14% maximum improvement in exact-k accuracy and top-k accuracy respectively over TagCombine DeepThin As the industry deploys increasingly large and complex neural networks to mobile devices, more pressure is put on the memory and compute resources of those devices. Deep compression, or compression of deep neural network weight matrices, is a technique to stretch resources for such scenarios. Existing compression methods cannot effectively compress models smaller than 1-2% of their original size. We develop a new compression technique, DeepThin, building on existing research in the area of low rank factorization. We identify and break artificial constraints imposed by low rank approximations by combining rank factorization with a reshaping process that adds nonlinearity to the approximation function. We deploy DeepThin as a plug-gable library integrated with TensorFlow that enables users to seamlessly compress models at different granularities. We evaluate DeepThin on two state-of-the-art acoustic models, TFKaldi and DeepSpeech, comparing it to previous compression work (Pruning, HashNet, and Rank Factorization), empirical limit study approaches, and hand-tuned models. For TFKaldi, our DeepThin networks show better word error rates (WER) than competing methods at practically all tested compression rates, achieving an average of 60% relative improvement over rank factorization, 57% over pruning, 23% over hand-tuned same-size networks, and 6% over the computationally expensive HashedNets. For DeepSpeech, DeepThin-compressed networks achieve better test loss than all other compression methods, reaching a 28% better result than rank factorization, 27% better than pruning, 20% better than hand-tuned same-size networks, and 12% better than HashedNets. DeepThin also provide inference performance benefits ranging from 2X to 14X speedups, depending on the compression ratio and platform cache sizes. DeepTileBars Most neural Information Retrieval (Neu-IR) models derive query-to-document ranking scores based on term-level matching. Inspired by TileBars, a classic term distribution visualization method, in this paper, we propose a novel Neu-IR model that models query-to-document matching at the subtopic and higher levels. Our system first splits the documents into topical segments, ‘visualizes’ the matching between the query and the segments, and then feeds the interaction matrix into a Neu-IR model, DeepTileBars, to obtain the final ranking score. DeepTileBars models the relevance signals happening at different granularities in a document’s topic hierarchy. It thus better captures the discourse structure of the document and the matching patterns. Although its design and implementation are light-weight, DeepTileBars outperforms other state-of-the-art Neu-IR models on benchmark datasets including the Text REtrieval Conference (TREC) 2010-2012 Web Tracks and LETOR 4.0. DeepTracker Deep convolutional neural networks (CNNs) have achieved remarkable success in various fields. However, training an excellent CNN is practically a trial-and-error process that consumes a tremendous amount of time and computer resources. To accelerate the training process and reduce the number of trials, experts need to understand what has occurred in the training process and why the resulting CNN behaves as such. However, current popular training platforms, such as TensorFlow, only provide very little and general information, such as training/validation errors, which is far from enough to serve this purpose. To bridge this gap and help domain experts with their training tasks in a practical environment, we propose a visual analytics system, DeepTracker, to facilitate the exploration of the rich dynamics of CNN training processes and to identify the unusual patterns that are hidden behind the huge amount of training log. Specifically,we combine a hierarchical index mechanism and a set of hierarchical small multiples to help experts explore the entire training log from different levels of detail. We also introduce a novel cube-style visualization to reveal the complex correlations among multiple types of heterogeneous training data including neuron weights, validation images, and training iterations. Three case studies are conducted to demonstrate how DeepTracker provides its users with valuable knowledge in an industry-level CNN training process, namely in our case, training ResNet-50 on the ImageNet dataset. We show that our method can be easily applied to other state-of-the-art ‘very deep’ CNN models. Deep-Tree Generation(DTG) A novel graph-to-tree conversion mechanism called the deep-tree generation (DTG) algorithm is first proposed to predict text data represented by graphs. The DTG method can generate a richer and more accurate representation for nodes (or vertices) in graphs. It adds flexibility in exploring the vertex neighborhood information to better reflect the second order proximity and homophily equivalence in a graph. Then, a Deep-Tree Recursive Neural Network (DTRNN) method is presented and used to classify vertices that contains text data in graphs. To demonstrate the effectiveness of the DTRNN method, we apply it to three real-world graph datasets and show that the DTRNN method outperforms several state-of-the-art benchmarking methods. Deep-Tree Recursive Neural Network ➘ “Deep-Tree Generation” DeepTurbo Present-day communication systems routinely use codes that approach the channel capacity when coupled with a computationally efficient decoder. However, the decoder is typically designed for the Gaussian noise channel and is known to be sub-optimal for non-Gaussian noise distribution. Deep learning methods offer a new approach for designing decoders that can be trained and tailored for arbitrary channel statistics. We focus on Turbo codes and propose DeepTurbo, a novel deep learning based architecture for Turbo decoding. The standard Turbo decoder (Turbo) iteratively applies the Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm with an interleaver in the middle. A neural architecture for Turbo decoding termed (NeuralBCJR), was proposed recently. There, the key idea is to create a module that imitates the BCJR algorithm using supervised learning, and to use the interleaver architecture along with this module, which is then fine-tuned using end-to-end training. However, knowledge of the BCJR algorithm is required to design such an architecture, which also constrains the resulting learned decoder. Here we remedy this requirement and propose a fully end-to-end trained neural decoder – Deep Turbo Decoder (DeepTurbo). With novel learnable decoder structure and training methodology, DeepTurbo reveals superior performance under both AWGN and non-AWGN settings as compared to the other two decoders – Turbo and NeuralBCJR. Furthermore, among all the three, DeepTurbo exhibits the lowest error floor. DeepWalk We present DeepWalk, a novel approach for learning latent representations of vertices in a network. These latent representations encode social relations in a continuous vector space, which is easily exploited by statistical models. DeepWalk generalizes recent advancements in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. We demonstrate DeepWalk’s latent representations on several multi-label network classification tasks for social networks such as BlogCatalog, Flickr, and YouTube. Our results show that DeepWalk outperforms challenging baselines which are allowed a global view of the network, especially in the presence of missing information. DeepWalk’s representations can provide F1 scores up to 10% higher than competing methods when labeled data is sparse. In some experiments, DeepWalk’s representations are able to outperform all baseline methods while using 60% less training data. DeepWalk is also scalable. It is an online learning algorithm which builds useful incremental results, and is trivially parallelizable. These qualities make it suitable for a broad class of real world applications such as network classification, and anomaly detection. DeepXplore Deep learning (DL) systems are increasingly deployed in security-critical domains including self-driving cars and malware detection, where the correctness and predictability of a system’s behavior for corner-case inputs are of great importance. However, systematic testing of large-scale DL systems with thousands of neurons and millions of parameters for all possible corner-cases is a hard problem. Existing DL testing depends heavily on manually labeled data and therefore often fails to expose different erroneous behaviors for rare inputs. We present DeepXplore, the first whitebox framework for systematically testing real-world DL systems. We address two problems: (1) generating inputs that trigger different parts of a DL system’s logic and (2) identifying incorrect behaviors of DL systems without manual effort. First, we introduce neuron coverage for estimating the parts of DL system exercised by a set of test inputs. Next, we leverage multiple DL systems with similar functionality as cross-referencing oracles and thus avoid manual checking for erroneous behaviors. We demonstrate how finding inputs triggering differential behaviors while achieving high neuron coverage for DL algorithms can be represented as a joint optimization problem and solved efficiently using gradient-based optimization techniques. DeepXplore finds thousands of incorrect corner-case behaviors in state-of-the-art DL models trained on five popular datasets. For all tested DL models, on average, DeepXplore generated one test input demonstrating incorrect behavior within one second while running on a commodity laptop. The inputs generated by DeepXplore achieved 33.2% higher neuron coverage on average than existing testing methods. We further show that the test inputs generated by DeepXplore can also be used to retrain the corresponding DL model to improve classification accuracy or identify polluted training data. DeepZip Sequential data is being generated at an unprecedented pace in various forms, including text and genomic data. This creates the need for efficient compression mechanisms to enable better storage, transmission and processing of such data. To solve this problem, many of the existing compressors attempt to learn models for the data and perform prediction-based compression. Since neural networks are known as universal function approximators with the capability to learn arbitrarily complex mappings, and in practice show excellent performance in prediction tasks, we explore and devise methods to compress sequential data using neural network predictors. We combine recurrent neural network predictors with an arithmetic coder and losslessly compress a variety of synthetic, text and genomic datasets. The proposed compressor outperforms Gzip on the real datasets and achieves near-optimal compression for the synthetic datasets. The results also help understand why and where neural networks are good alternatives for traditional finite context models Deequ Deequ’s purpose is to ‘unit-test’ data to find errors early, before the data gets fed to consuming systems or machine learning algorithms. In the following, we will walk you through a toy example to showcase the most basic usage of our library. An executable version of the example is available here. Deequ works on tabular data, e.g., CSV files, database tables, logs, flattened json files, basically anything that you can fit into a Spark dataframe. For this example, we assume that we work on some kind of Item data, where every item has an id, a name, a description, a priority and a count of how often it has been viewed. DeFactoNLP In this paper, we describe DeFactoNLP, the system we designed for the FEVER 2018 Shared Task. The aim of this task was to conceive a system that can not only automatically assess the veracity of a claim but also retrieve evidence supporting this assessment from Wikipedia. In our approach, the Wikipedia documents whose Term Frequency-Inverse Document Frequency (TFIDF) vectors are most similar to the vector of the claim and those documents whose names are similar to those of the named entities (NEs) mentioned in the claim are identified as the documents which might contain evidence. The sentences in these documents are then supplied to a textual entailment recognition module. This module calculates the probability of each sentence supporting the claim, contradicting the claim or not providing any relevant information to assess the veracity of the claim. Various features computed using these probabilities are finally used by a Random Forest classifier to determine the overall truthfulness of the claim. The sentences which support this classification are returned as evidence. Our approach achieved a 0.4277 evidence F1-score, a 0.5136 label accuracy and a 0.3833 FEVER score. Defense-GAN In recent years, deep neural network approaches have been widely adopted for machine learning tasks, including classification. However, they were shown to be vulnerable to adversarial perturbations: carefully crafted small perturbations can cause misclassification of legitimate images. We propose Defense-GAN, a new framework leveraging the expressive capability of generative models to defend deep neural networks against such attacks. Defense-GAN is trained to model the distribution of unperturbed images. At inference time, it finds a close output to a given image which does not contain the adversarial changes. This output is then fed to the classifier. Our proposed method can be used with any classification model and does not modify the classifier structure or training procedure. It can also be used as a defense against any attack as it does not assume knowledge of the process for generating the adversarial examples. We empirically show that Defense-GAN is consistently effective against different attack methods and improves on existing defense strategies. Our code has been made publicly available at https://…/defensegan. Defensive Quantization Neural network quantization is becoming an industry standard to efficiently deploy deep learning models on hardware platforms, such as CPU, GPU, TPU, and FPGAs. However, we observe that the conventional quantization approaches are vulnerable to adversarial attacks. This paper aims to raise people’s awareness about the security of the quantized models, and we designed a novel quantization methodology to jointly optimize the efficiency and robustness of deep learning models. We first conduct an empirical study to show that vanilla quantization suffers more from adversarial attacks. We observe that the inferior robustness comes from the error amplification effect, where the quantization operation further enlarges the distance caused by amplified noise. Then we propose a novel Defensive Quantization (DQ) method by controlling the Lipschitz constant of the network during quantization, such that the magnitude of the adversarial noise remains non-expansive during inference. Extensive experiments on CIFAR-10 and SVHN datasets demonstrate that our new quantization method can defend neural networks against adversarial examples, and even achieves superior robustness than their full-precision counterparts while maintaining the same hardware efficiency as vanilla quantization approaches. As a by-product, DQ can also improve the accuracy of quantized models without adversarial attack. Deferred Acceptance Algorithm(DAA) The Deferred Acceptance Algorithm (DAA) goes back to Gale and Shapley (1962). They introduce a rather simple algorithm that finds a stable matching for example for college admissions or in a marriage market. In a marriage market where M men have preferences over W women, and men take the role of the proposing party, the DAA produces what is called the M-stable matching: each man strictly prefers the M-stable matching to any other potential matching. “Stable” means that no couple of a man and a woman could break the matching by choosing another mate. This is quite a strong result. Variations of this algoritm are used in Hospital assignments in the USA, whereby recently graduated doctors submit preferences over hospitals, and hospitals submit preferences over graduates. Another application is the kidney exchange, where the algorithm is used to find the best match between a set of donors and a set of receivers. matchingMarkets Define Differential Message Importance Measure Data collection is a fundamental problem in the scenario of big data, where the size of sampling sets plays a very important role, especially in the characterization of data structure. This paper considers the information collection process by taking message importance into account, and gives a distribution-free criterion to determine how many samples are required in big data structure characterization. Similar to differential entropy, we define differential message importance measure (DMIM) as a measure of message importance for continuous random variable. The DMIM for many common densities is discussed, and high-precision approximate values for normal distribution are given. Moreover, it is proved that the change of DMIM can describe the gap between the distribution of a set of sample values and a theoretical distribution. In fact, the deviation of DMIM is equivalent to Kolmogorov-Smirnov statistic, but it offers a new way to characterize the distribution goodness-of-fit. Numerical results show some basic properties of DMIM and the accuracy of the proposed approximate values. Furthermore, it is also obtained that the empirical distribution approaches the real distribution with decreasing of the DMIM deviation, which contributes to the selection of suitable sampling points in actual system. Definition Extraction Tool(DefExt) We present DefExt, an easy to use semi supervised Definition Extraction Tool. DefExt is designed to extract from a target corpus those textual fragments where a term is explicitly mentioned together with its core features, i.e. its definition. It works on the back of a Conditional Random Fields based sequential labeling algorithm and a bootstrapping approach. Bootstrapping enables the model to gradually become more aware of the idiosyncrasies of the target corpus. In this paper we describe the main components of the toolkit as well as experimental results stemming from both automatic and manual evaluation. We release DefExt as open source along with the necessary files to run it in any Unix machine. We also provide access to training and test data for immediate use. Deflated Deterministic Parallel Analysis(DDPA) ➘ “Deterministic Parallel Analysis” DefNet Graph neural network (GNN), as a powerful representation learning model on graph data, attracts much attention across various disciplines. However, recent studies show that GNN is vulnerable to adversarial attacks. How to make GNN more robust? What are the key vulnerabilities in GNN? How to address the vulnerabilities and defense GNN against the adversarial attacks? In this paper, we propose DefNet, an effective adversarial defense framework for GNNs. In particular, we first investigate the latent vulnerabilities in every layer of GNNs and propose corresponding strategies including dual-stage aggregation and bottleneck perceptron. Then, to cope with the scarcity of training data, we propose an adversarial contrastive learning method to train the GNN in a conditional GAN manner by leveraging the high-level graph representation. Extensive experiments on three public datasets demonstrate the effectiveness of \defmodel in improving the robustness of popular GNN variants, such as Graph Convolutional Network and GraphSAGE, under various types of adversarial attacks. Deformable Convolution ➘ “Deformable Convolutional Networks” Deformable Convolutional Networks Convolutional neural networks (CNNs) are inherently limited to model geometric transformations due to the fixed geometric structures in its building modules. In this work, we introduce two new modules to enhance the transformation modeling capacity of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard back-propagation, giving rise to deformable convolutional networks. Extensive experiments validate the effectiveness of our approach on sophisticated vision tasks of object detection and semantic segmentation. The code would be released. Deformable RoI Pooling ➘ “Deformable Convolutional Networks” Deformable Volume Network(Devon) We propose a lightweight neural network model, Deformable Volume Network (Devon) for learning optical flow. Devon benefits from a multi-stage framework to iteratively refine its prediction. Each stage is by itself a neural network with an identical architecture. The optical flow between two stages is propagated with a newly proposed module, the deformable cost volume. The deformable cost volume does not distort the original images or their feature maps and therefore avoids the artifacts associated with warping, a common drawback in previous models. Devon only has one million parameters. Experiments show that Devon achieves comparable results to previous neural network models, despite of its small size. Degradation Data Analysis Given that products are more frequently being designed with higher reliability and developed in a shorter amount of time, it is often not possible to test new designs to failure under normal operating conditions. In some cases, it is possible to infer the reliability behavior of unfailed test samples with only the accumulated test time information and assumptions about the distribution. However, this generally leads to a great deal of uncertainty in the results. Another option in this situation is the use of degradation analysis. Degradation analysis involves the measurement of performance data that can be directly related to the presumed failure of the product in question. Many failure mechanisms can be directly linked to the degradation of part of the product, and degradation analysis allows the analyst to extrapolate to an assumed failure time based on the measurements of degradation over time. Degree Penalty Network embedding aims to learn the low-dimensional representations of vertexes in a network, while structure and inherent properties of the network is preserved. Existing network embedding works primarily focus on preserving the microscopic structure, such as the first- and second-order proximity of vertexes, while the macroscopic scale-free property is largely ignored. Scale-free property depicts the fact that vertex degrees follow a heavy-tailed distribution (i.e., only a few vertexes have high degrees) and is a critical property of real-world networks, such as social networks. In this paper, we study the problem of learning representations for scale-free networks. We first theoretically analyze the difficulty of embedding and reconstructing a scale-free network in the Euclidean space, by converting our problem to the sphere packing problem. Then, we propose the ‘degree penalty’ principle for designing scale-free property preserving network embedding algorithm: punishing the proximity between high-degree vertexes. We introduce two implementations of our principle by utilizing the spectral techniques and a skip-gram model respectively. Extensive experiments on six datasets show that our algorithms are able to not only reconstruct heavy-tailed distributed degree distribution, but also outperform state-of-the-art embedding models in various network mining tasks, such as vertex classification and link prediction. Degree Weighted Lasso DWLasso Degree-Redundant-Influence(DRS) Influence overlap is a universal phenomenon in influence spreading for social networks. In this paper, we argue that the redundant influence generated by influence overlap cause negative effect for maximizing spreading influence. Firstly, we present a theoretical method to calculate the influence overlap and record the redundant influence. Then in term of eliminating redundant influence, we present two algorithms, namely, Degree-Redundant-Influence (DRS) and Degree-Second-Neighborhood (DSN) for multiple spreaders identification. The experiments for four empirical social networks successfully verify the methods, and the spreaders selected by the DSN algorithm show smaller degree and k-core values. Degree-Second-Neighborhood(DSN) Influence overlap is a universal phenomenon in influence spreading for social networks. In this paper, we argue that the redundant influence generated by influence overlap cause negative effect for maximizing spreading influence. Firstly, we present a theoretical method to calculate the influence overlap and record the redundant influence. Then in term of eliminating redundant influence, we present two algorithms, namely, Degree-Redundant-Influence (DRS) and Degree-Second-Neighborhood (DSN) for multiple spreaders identification. The experiments for four empirical social networks successfully verify the methods, and the spreaders selected by the DSN algorithm show smaller degree and k-core values. DeGroot-Friedkin Model(DF) The DeGroot-Friedkin model in , contains two stages and studies the evolution of self-confidence, i.e., how confident an individual is for her opinions on a sequence of issues. In the first stage, individuals update their opinions for a particular issue according to the classical DeGroot model, and in the second stage, the self-confidence for the next issue is governed by the reflected appraisal mechanism studied in ,. Reflected appraisal mechanism, in simple words, describes the phenomenon that individuals’ self-appraisals on some dimension (e.g., selfconfidence, self-esteem) are influenced by the appraisals of other individuals on them. Delaunay Diagram Delaunay Outlyingness Outlier detection is a major topic in robust statistics due to the high practical significance of anomalous observations. Many existing methods are, however, either parametric or cease to perform well when the data is far from linearly structured. In this paper, we propose a quantity, Delaunay outlyingness, that is a nonparametric outlyingness score applicable to data with complicated structure. The approach is based a well known triangulation of the sample, which seems to reflect the sparsity of the pointset to different directions in a useful way. In addition to appealing to heuristics, we derive results on the asymptotic behaviour of Delaunay outlyingness in the case of a sufficiently simple set of observations. Simulations and an application to financial data are also discussed. Delayed-Acceptance Markov Chain Monte Carlo(DA-MCMC) Delayed-acceptance Markov chain Monte Carlo (DA-MCMC) samples from a probability distribution, via a two-stages version of the Metropolis-Hastings algorithm, by combining the target distribution with a ‘surrogate’ (i.e. an approximate and computationally cheaper version) of said distribution. DA-MCMC accelerates MCMC sampling in complex applications, while still targeting the exact distribution. Delayed-Action Game(DAG) Stochastic multiplayer games (SMGs) have gained attention in the field of strategy synthesis for multi-agent reactive systems. However, standard SMGs are limited to modeling systems where all agents have full knowledge of the state of the game. In this paper, we introduce delayed-action games (DAGs) formalism that simulates hidden-information games (HIGs) as SMGs, by eliminating hidden information by delaying a player’s actions. The elimination of hidden information enables the usage of SMG off-the-shelf model checkers to implement HIGs. Furthermore, we demonstrate how a DAG can be decomposed into a number of independent subgames. Since each subgame can be independently explored, parallel computation can be utilized to reduce the model checking time, while alleviating the state space explosion problem that SMGs are notorious for. In addition, we propose a DAG-based framework for strategy synthesis and analysis. Finally, we demonstrate applicability of the DAG-based synthesis framework on a case study of a human-on-the-loop unmanned-aerial vehicle system that may be under stealthy attack, where the proposed framework is used to formally model, analyze and synthesize security-aware strategies for the system. Delegative Reinforcement Learning Most known regret bounds for reinforcement learning are either episodic or assume an environment without traps. We derive a regret bound without making either assumption, by allowing the algorithm to occasionally delegate an action to an external advisor. We thus arrive at a setting of active one-shot model-based reinforcement learning that we call DRL (delegative reinforcement learning.) The algorithm we construct in order to demonstrate the regret bound is a variant of Posterior Sampling Reinforcement Learning supplemented by a subroutine that decides which actions should be delegated. The algorithm is not anytime, since the parameters must be adjusted according to the target time discount. Currently, our analysis is limited to Markov decision processes with finite numbers of hypotheses, states and actions. DELIMIT DELIMIT is a framework extension for deep learning in diffusion imaging, which extends the basic framework PyTorch towards spherical signals. Based on several novel layers, deep learning can be applied to spherical diffusion imaging data in a very convenient way. First, two spherical harmonic interpolation layers are added to the extension, which allow to transform the signal from spherical surface space into the spherical harmonic space, and vice versa. In addition, a local spherical convolution layer is introduced that adds the possibility to include gradient neighborhood information within the network. Furthermore, these extensions can also be utilized for the preprocessing of diffusion signals. DELIP Partially observable Markov decision processes (POMDPs) are a powerful abstraction for tasks that require decision making under uncertainty, and capture a wide range of real world tasks. Today, effective planning approaches exist that generate effective strategies given black-box models of a POMDP task. Yet, an open question is how to acquire accurate models for complex domains. In this paper we propose DELIP, an approach to model learning for POMDPs that utilizes amortized structured variational inference. We empirically show that our model leads to effective control strategies when coupled with state-of-the-art planners. Intuitively, model-based approaches should be particularly beneficial in environments with changing reward structures, or where rewards are initially unknown. Our experiments confirm that DELIP is particularly effective in this setting. DE-LSTM We present a deep learning model, DE-LSTM, for the simulation of a stochastic process with underlying nonlinear dynamics. The deep learning model aims to approximate the probability density function of a stochastic process via numerical discretization and the underlying nonlinear dynamics is modeled by the Long Short-Term Memory (LSTM) network. After the numerical discretization by a softmax function, the function estimation problem is solved by a multi-label classification problem. A penalized maximum log likelihood method is proposed to impose smoothness in the predicted probability distribution. It is shown that LSTM is a state space model, where the internal dynamics consists of a system of relaxation processes. A sequential Monte Carlo method is outlined to compute the time evolution of the probability distribution. The behavior of DE-LSTM is investigated by using the Ornstein-Uhlenbeck process and noisy observations of Mackey-Glass equation and forced Van der Pol oscillators. While the probability distribution computed by the conventional maximum log likelihood method makes a good prediction of the first and second moments, the Kullback-Leibler divergence shows that the penalized maximum log likelihood method results in a probability distribution closer to the ground truth. It is shown that DE-LSTM makes a good prediction of the probability distribution without assuming any distributional properties of the noise. For a multiple-step forecast, it is found that the prediction uncertainty, denoted by the 95% confidence interval, does not grow monotonically in time. For a chaotic system, Mackey-Glass time series, the 95% confidence interval first grows, then exhibits an oscillatory behavior, instead of growing indefinitely, while for the forced Van der Pol oscillator, the prediction uncertainty does not grow in time even for 3,000-step forecast. DeLTA Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. Especially, convolution layers account for the majority of the execution time of CNN training, and GPUs are commonly used to accelerate these layer workloads. GPU design optimization for efficient CNN training acceleration requires the accurate modeling of how their performance improves when computing and memory resources are increased. We present DeLTA, the first analytical model that accurately estimates the traffic at each GPU memory hierarchy level, while accounting for the complex reuse patterns of a parallel convolution algorithm. We demonstrate that our model is both accurate and robust for different CNNs and GPU architectures. We then show how this model can be used to carefully balance the scaling of different GPU resources for efficient CNN performance improvement. Delta Embedding Learning Learning from corpus and learning from supervised NLP tasks both give useful semantics that can be incorporated into a good word representation. We propose an embedding learning method called Delta Embedding Learning, to learn semantic information from high-level supervised tasks like reading comprehension, and combine it with an unsupervised word embedding. The simple technique not only improved the performance of various supervised NLP tasks, but also simultaneously learns improved universal word embeddings out of these tasks. Delta Epsilon Alpha Star Delta Epsilon Alpha Star is a minimal coverage, real-time robotic search algorithm that yields a moderately aggressive search path with minimal backtracking. Search performance is bounded by a placing a combinatorial bound, epsilon and delta, on the maximum deviation from the theoretical shortest path and the probability at which further deviations can occur. Additionally, we formally define the notion of PAC-admissibility — a relaxed admissibility criteria for algorithms, and show that PAC-admissible algorithms are better suited to robotic search situations than epsilon-admissible or strict algorithms. DeltaRho DeltaRho is an open source project with the goal of providing methods and tools that enable deep analysis of large complex data. Big data is usually complex, and to get the most out of the data – to do deep analysis – requires a great deal of flexibility in analytical methods and data structures. Our goal with DeltaRho is to provide flexibility at scale, allowing the thousands of analytic, visualization, and machine learning methods available in R, along with any R data structure, to be used with large complex data. Behind this effort is a statistical approach called Divide and Recombine (D&R). Visualization is essential in all aspects of data analysis. To avoid missing critical insights, it is important to be able to visualize the data in detail, particularly with big data. Our visualization system in DeltaRho, Trelliscope, provides a way to easily and flexibly specify scalable detailed visualizations. Trelliscope is a natural visual extension of the D&R approach. Delta-Screening Community detection is a discovery tool used by network scientists to analyze the structure of real-world networks. It seeks to identify natural divisions that may exist in the input networks that partition the vertices into coherent modules (or communities). While this problem space is rich with efficient algorithms and software, most of this literature caters to the static use-case where the underlying network does not change. However, many emerging real-world use-cases give rise to a need to incorporate dynamic graphs as inputs. In this paper, we present a fast and efficient incremental approach toward dynamic community detection. The key contribution is a generic technique called $\Delta-screening$, which examines the most recent batch of changes made to an input graph and selects a subset of vertices to reevaluate for potential community (re)assignment. This technique can be incorporated into any of the community detection methods that use modularity as its objective function for clustering. For demonstration purposes, we incorporated the technique into two well-known community detection tools. Our experiments demonstrate that our new incremental approach is able to generate performance speedups without compromising on the output quality (despite its heuristic nature). For instance, on a real-world network with 63M temporal edges (over 12 time steps), our approach was able to complete in 1056 seconds, yielding a 3x speedup over a baseline implementation. In addition to demonstrating the performance benefits, we also show how to use our approach to delineate appropriate intervals of temporal resolutions at which to analyze an input network. DelugeNets Human brains are adept at dealing with the deluge of information they continuously receive, by suppressing the non-essential inputs and focusing on the important ones. Inspired by such capability, we propose Deluge Networks (DelugeNets), a novel class of neural networks facilitating massive cross-layer information inflows from preceding layers to succeeding layers. The connections between layers in DelugeNets are efficiently established through cross-layer depthwise convolutional layers with learnable filters, acting as a flexible selection mechanism. By virtue of the massive cross-layer information inflows, DelugeNets can propagate information across many layers with greater flexibility and utilize network parameters more effectively, compared to existing ResNet models. Experiments show the superior performances of DelugeNets in terms of both classification accuracies and parameter efficiencies. Remarkably, a DelugeNet model with just 20.2M parameters achieve state-of-the-art error of 19.02% on CIFAR-100 dataset, outperforming DenseNet model with 27.2M parameters. Moreover, DelugeNet performs comparably to ResNet-200 on ImageNet dataset with merely half of the computations needed by the latter. Demand Sensing Demand Sensing is a next generation forecasting method that leverages new mathematical techniques and near real-time information to create an accurate forecast of demand, based on the current realities of the supply chain. The typical performance of demand sensing systems reduces near-term forecast error by 30% or more compared to traditional time-series forecasting techniques. The jump in forecast accuracy helps companies manage the effects of market volatility and gain the benefits of a demand-driven supply chain, including more efficient operations, increased service levels, and a range of financial benefits including higher revenue, better profit margins, less inventory, better perfect order performance and a shorter cash-to-cash cycle time. Gartner, Inc. insight on demand sensing can be found in its report, “Supply Chain Strategy for Manufacturing Leaders: The Handbook for Becoming Demand Driven.” Deming Regression In statistics, Deming regression, named after W. Edwards Deming, is an errors-in-variables model which tries to find the line of best fit for a two-dimensional dataset. It differs from the simple linear regression in that it accounts for errors in observations on both the x- and the y- axis. It is a special case of total least squares, which allows for any number of predictors and a more complicated error structure. deming Demixing-GAN Recently, Generative Adversarial Networks (GANs) have emerged as a popular alternative for modeling complex high dimensional distributions. Most of the existing works implicitly assume that the clean samples from the target distribution are easily available. However, in many applications, this assumption is violated. In this paper, we consider the observation setting when the samples from target distribution are given by the superposition of two structured components and leverage GANs for learning the structure of the components. We propose two novel frameworks: denoising-GAN and demixing-GAN. The denoising-GAN assumes access to clean samples from the second component and try to learn the other distribution, whereas demixing-GAN learns the distribution of the components at the same time. Through extensive numerical experiments, we demonstrate that proposed frameworks can generate clean samples from unknown distributions, and provide competitive performance in tasks such as denoising, demixing, and compressive sensing. Democratization Democratization is defined as the action/development of making something accessible to everyone, to the ‘common masses.’ History provides democratization lessons from the Industrial and Information Revolutions. Both of these moments in history were driven by the standardization of parts, tools, architectures, interfaces, designs and trainings that allowed for the creation of common platforms. Instead of being dependent upon a ‘high priesthood’ of specialists to assemble your guns or cars or computer systems, organizations of all sizes where able to leverage common platforms to build their own sources of customer, business and financial differentiation. Dempsterian-Shaferian Belief Network Shenoy and Shafer {Shenoy:90} demonstrated that both for Dempster-Shafer Theory and probability theory there exists a possibility to calculate efficiently marginals of joint belief distributions (by so-called local computations) provided that the joint distribution can be decomposed (factorized) into a belief network. A number of algorithms exists for decomposition of probabilistic joint belief distribution into a bayesian (belief) network from data. For example Spirtes, Glymour and Schein{Spirtes:90b} formulated a Conjecture that a direct dependence test and a head-to-head meeting test would suffice to construe bayesian network from data in such a way that Pearl’s concept of d-separation {Geiger:90} applies. This paper is intended to transfer Spirtes, Glymour and Scheines {Spirtes:90b} approach onto the ground of the Dempster-Shafer Theory (DST). For this purpose, a frequentionistic interpretation of the DST developed in {Klopotek:93b} is exploited. A special notion of conditionality for DST is introduced and demonstrated to behave with respect to Pearl’s d-separation {Geiger:90} much the same way as conditional probability (though some differences like non-uniqueness are evident). Based on this, an algorithm analogous to that from {Spirtes:90b} is developed. The notion of a partially oriented graph (pog) is introduced and within this graph the notion of p-d-separation is defined. If direct dependence test and head-to-head meeting test are used to orient the pog then its p-d-separation is shown to be equivalent to the Pearl’s d-separation for any compatible dag. Dempster’s Rule of Combination How to combine two independent sets of probability mass assignments in specific situations. In case different sources express their beliefs over the frame in terms of belief constraints such as in case of giving hints or in case of expressing preferences, then Dempster’s rule of combination is the appropriate fusion operator. dst Dempster–Shafer Theory(DST) The Dempster-Shafer theory (DST) is a mathematical theory of evidence. It allows one to combine evidence from different sources and arrive at a degree of belief (represented by a belief function) that takes into account all the available evidence. The theory was first developed by Arthur P. Dempster and Glenn Shafer. In a narrow sense, the term Dempster-Shafer theory refers to the original conception of the theory by Dempster and Shafer. However, it is more common to use the term in the wider sense of the same general approach, as adapted to specific kinds of situations. In particular, many authors have proposed different rules for combining evidence, often with a view to handling conflicts in evidence better. dst Dendrogram A dendrogram (from Greek dendron “tree” and gramma “drawing”) is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Denoising Autoencoder(dA) The idea behind denoising autoencoders is simple. In order to force the hidden layer to discover more robust features and prevent it from simply learning the identity, we train the autoencoder to reconstruct the input from a corrupted version of it. The denoising auto-encoder is a stochastic version of the auto-encoder. Intuitively, a denoising auto-encoder does two things: try to encode the input (preserve the information about the input), and try to undo the effect of a corruption process stochastically applied to the input of the auto-encoder. The latter can only be done by capturing the statistical dependencies between the inputs. The denoising auto-encoder can be understood from different perspectives (the manifold learning perspective, stochastic operator perspective, bottom-up – information theoretic perspective, top-down – generative model perspective), all of which are explained in. See also section 7.2 of for an overview of auto-encoders. In , the stochastic corruption process randomly sets some of the inputs (as many as half of them) to zero. Hence the denoising auto-encoder is trying to predict the corrupted (i.e. missing) values from the uncorrupted (i.e., non-missing) values, for randomly selected subsets of missing patterns. Note how being able to predict any subset of variables from the rest is a sufficient condition for completely capturing the joint distribution between a set of variables (this is how Gibbs sampling works). To convert the autoencoder class into a denoising autoencoder class, all we need to do is to add a stochastic corruption step operating on the input. The input can be corrupted in many ways, but in this tutorial we will stick to the original corruption mechanism of randomly masking entries of the input by making them zero. Denoising based Saliency Prediction with Generative Adversarial Network(DSAL-GAN) Synthesizing high quality saliency maps from noisy images is a challenging problem in computer vision and has many practical applications. Samples generated by existing techniques for saliency detection cannot handle the noise perturbations smoothly and fail to delineate the salient objects present in the given scene. In this paper, we present a novel end-to-end coupled Denoising based Saliency Prediction with Generative Adversarial Network (DSAL-GAN) framework to address the problem of salient object detection in noisy images. DSAL-GAN consists of two generative adversarial-networks (GAN) trained end-to-end to perform denoising and saliency prediction altogether in a holistic manner. The first GAN consists of a generator which denoises the noisy input image, and in the discriminator counterpart we check whether the output is a denoised image or ground truth original image. The second GAN predicts the saliency maps from raw pixels of the input denoised image using a data-driven metric based on saliency prediction method with adversarial loss. Cycle consistency loss is also incorporated to further improve salient region prediction. We demonstrate with comprehensive evaluation that the proposed framework outperforms several baseline saliency models on various performance benchmarks. Denoising Random Forest This paper proposes a novel type of random forests called a denoising random forests that are robust against noises contained in test samples. Such noise-corrupted samples cause serious damage to the estimation performances of random forests, since unexpected child nodes are often selected and the leaf nodes that the input sample reaches are sometimes far from those for a clean sample. Our main idea for tackling this problem originates from a binary indicator vector that encodes a traversal path of a sample in the forest. Our proposed method effectively employs this vector by introducing denoising autoencoders into random forests. A denoising autoencoder can be trained with indicator vectors produced from clean and noisy input samples, and non-leaf nodes where incorrect decisions are made can be identified by comparing the input and output of the trained denoising autoencoder. Multiple traversal paths with respect to the nodes with incorrect decisions caused by the noises can then be considered for the estimation. Denoising-GAN Recently, Generative Adversarial Networks (GANs) have emerged as a popular alternative for modeling complex high dimensional distributions. Most of the existing works implicitly assume that the clean samples from the target distribution are easily available. However, in many applications, this assumption is violated. In this paper, we consider the observation setting when the samples from target distribution are given by the superposition of two structured components and leverage GANs for learning the structure of the components. We propose two novel frameworks: denoising-GAN and demixing-GAN. The denoising-GAN assumes access to clean samples from the second component and try to learn the other distribution, whereas demixing-GAN learns the distribution of the components at the same time. Through extensive numerical experiments, we demonstrate that proposed frameworks can generate clean samples from unknown distributions, and provide competitive performance in tasks such as denoising, demixing, and compressive sensing. Dense Adaptive Cascade Forest(daForest) Recent research has shown that deep ensemble for forest can achieve a huge increase in classification accuracy compared with the general ensemble learning method. Especially when there are only few training data. In this paper, we decide to take full advantage of this observation and introduce the Dense Adaptive Cascade Forest (daForest), which has better performance than the original one named Cascade Forest. And it is particularly noteworthy that daForest has a powerful ability to handle high-dimensional sparse data without any preprocessing on raw data like PCA or any other dimensional reduction methods. Our model is distinguished by three major features: the first feature is the combination of the SAMME.R boosting algorithm in the model, boosting gives the model the ability to continuously improve as the number of layer increases, which is not possible in stacking model or plain cascade forest. The second feature is our model connects each layer to its subsequent layers in a feed-forward fashion, to some extent this structure enhances the ability of the model to resist degeneration. When number of layers goes up, accuracy of model goes up a little in the first few layers then drop down quickly, we call this phenomenon degeneration in training stacking model. The third feature is that we add a hyper-parameter optimization layer before the first classification layer in the proposed deep model, which can search for the optimal hyper-parameter and set up the model in a brief period and nearly halve the training time without having too much impact on the final performance. Experimental results show that daForest performs particularly well on both high-dimensional low-order features and low-dimensional high-order features, and in some cases, even better than neural networks and achieves state-of-the-art results. Dense Convolutional Network(DenseNet) Recent work has shown that convolutional networks can be substantially deeper, more accurate and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper we embrace this observation and introduce the Dense Convolutional Network (DenseNet), where each layer is directly connected to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections, one between each layer and its subsequent layer (treating the input as layer 0), our network has L(L+1)/2 direct connections. For each layer, the feature maps of all preceding layers are treated as separate inputs whereas its own feature maps are passed on as inputs to all subsequent layers. Our proposed connectivity pattern has several compelling advantages: it alleviates the vanishing gradient problem and strengthens feature propagation; despite the increase in connections, it encourages feature reuse and leads to a substantial reduction of parameters; its models tend to generalize surprisingly well. We evaluate our proposed architecture on five highly competitive object recognition benchmark tasks. The DenseNet obtains significant improvements over the state-of-the-art on all five of them (e.g., yielding 3.74% test error on CIFAR-10, 19.25% on CIFAR-100 and 1.59% on SVHN). Dense Morphological Network(DenMo-Net) Artificial neural networks are built on the basic operation of linear combination and non-linear activation function. Theoretically this structure can approximate any continuous function with three layer architecture. But in practice learning the parameters of such network can be hard. Also the choice of activation function can greatly impact the performance of the network. In this paper we are proposing to replace the basic linear combination operation with non-linear operations that do away with the need of additional non-linear activation function. To this end we are proposing the use of elementary morphological operations (dilation and erosion) as the basic operation in neurons. We show that these networks (Denoted as DenMo-Net) with morphological operations can approximate any smooth function requiring less number of parameters than what is necessary for normal neural networks. The results show that our network perform favorably when compared with similar structured network. Dense Multimodal Fusion(DMF) Multiple modalities can provide more valuable information than single one by describing the same contents in various ways. Hence, it is highly expected to learn effective joint representation by fusing the features of different modalities. However, previous methods mainly focus on fusing the shallow features or high-level representations generated by unimodal deep networks, which only capture part of the hierarchical correlations across modalities. In this paper, we propose to densely integrate the representations by greedily stacking multiple shared layers between different modality-specific networks, which is named as Dense Multimodal Fusion (DMF). The joint representations in different shared layers can capture the correlations in different levels, and the connection between shared layers also provides an efficient way to learn the dependence among hierarchical correlations. These two properties jointly contribute to the multiple learning paths in DMF, which results in faster convergence, lower training loss, and better performance. We evaluate our model on three typical multimodal learning tasks, including audiovisual speech recognition, cross-modal retrieval, and multimodal classification. The noticeable performance in the experiments demonstrates that our model can learn more effective joint representation. Dense Quantum Measurement Quantum measurement is a fundamental cornerstone of experimental quantum computations. The main issues in current quantum measurement strategies are the high number of measurement rounds to determine a global optimal measurement output and the low success probability of finding a global optimal measurement output. Each measurement round requires preparing the quantum system and applying quantum operations and measurements with high-precision control in the physical layer. These issues result in extremely high-cost measurements with a low probability of success at the end of the measurement rounds. Here, we define a novel measurement for quantum computations called dense quantum measurement. The dense measurement strategy aims at fixing the main drawbacks of standard quantum measurements by achieving a significant reduction in the number of necessary measurement rounds and by radically improving the success probabilities of finding global optimal outputs. We provide application scenarios for quantum circuits with arbitrary unitary sequences, and prove that dense measurement theory provides an experimentally implementable solution for gate-model quantum computer architectures. Dense Relational Captioning Our goal in this work is to train an image captioning model that generates more dense and informative captions. We introduce ‘relational captioning,’ a novel image captioning task which aims to generate multiple captions with respect to relational information between objects in an image. Relational captioning is a framework that is advantageous in both diversity and amount of information, leading to image understanding based on relationships. Part-of speech (POS, i.e. subject-object-predicate categories) tags can be assigned to every English word. We leverage the POS as a prior to guide the correct sequence of words in a caption. To this end, we propose a multi-task triple-stream network (MTTSNet) which consists of three recurrent units for the respective POS and jointly performs POS prediction and captioning. We demonstrate more diverse and richer representations generated by the proposed model against several baselines and competing methods. Dense Transformer Networks The key idea of current deep learning methods for dense prediction is to apply a model on a regular patch centered on each pixel to make pixel-wise predictions. These methods are limited in the sense that the patches are determined by network architecture instead of learned from data. In this work, we propose the dense transformer networks, which can learn the shapes and sizes of patches from data. The dense transformer networks employ an encoder-decoder architecture, and a pair of dense transformer modules are inserted into each of the encoder and decoder paths. The novelty of this work is that we provide technical solutions for learning the shapes and sizes of patches from data and efficiently restoring the spatial correspondence required for dense prediction. The proposed dense transformer modules are differentiable, thus the entire network can be trained. We apply the proposed networks on natural and biological image segmentation tasks and show superior performance is achieved in comparison to baseline methods. Dense xUnit Net(DxNet) Deep net architectures have constantly evolved over the past few years, leading to significant advancements in a wide array of computer vision tasks. However, besides high accuracy, many applications also require a low computational load and limited memory footprint. To date, efficiency has typically been achieved either by architectural choices at the macro level (e.g. using skip connections or pruning techniques) or modifications at the level of the individual layers (e.g. using depth-wise convolutions or channel shuffle operations). Interestingly, much less attention has been devoted to the role of the activation functions in constructing efficient nets. Recently, Kligvasser et al. showed that incorporating spatial connections within the activation functions, enables a significant boost in performance in image restoration tasks, at any given budget of parameters. However, the effectiveness of their xUnit module has only been tested on simple small models, which are not characteristic of those used in high-level vision tasks. In this paper, we adopt and improve the xUnit activation, show how it can be incorporated into the DenseNet architecture, and illustrate its high effectiveness for classification and image restoration tasks alike. While the DenseNet architecture is extremely efficient to begin with, our dense xUnit net (DxNet) can typically achieve the same performance with far fewer parameters. For example, on ImageNet, our DxNet outperforms a ReLU-based DenseNet having 30% more parameters and achieves state-of-the-art results for this budget of parameters. Furthermore, in denoising and super-resolution, DxNet significantly improves upon all existing lightweight solutions, including the xUnit-based nets of Kligvasser et al. Densely Connected Convolutional Network(DenseNet) Classical approaches for estimating optical flow have achieved rapid progress in the last decade. However, most of them are too slow to be applied in real-time video analysis. Due to the great success of deep learning, recent work has focused on using CNNs to solve such dense prediction problems. In this paper, we investigate a new deep architecture, Densely Connected Convolutional Networks (DenseNet), to learn optical flow. This specific architecture is ideal for the problem at hand as it provides shortcut connections throughout the network, which leads to implicit deep supervision. We extend current DenseNet to a fully convolutional network to learn motion estimation in an unsupervised manner. Evaluation results on three standard benchmarks demonstrate that DenseNet is a better fit than other widely adopted CNN architectures for optical flow estimation. Densely Fused Spatial Transformer Network(DeSTNet) Modern Convolutional Neural Networks (CNN) are extremely powerful on a range of computer vision tasks. However, their performance may degrade when the data is characterised by large intra-class variability caused by spatial transformations. The Spatial Transformer Network (STN) is currently the method of choice for providing CNNs the ability to remove those transformations and improve performance in an end-to-end learning framework. In this paper, we propose Densely Fused Spatial Transformer Network (DeSTNet), which, to the best of our knowledge, is the first dense fusion pattern for combining multiple STNs. Specifically, we show how changing the connectivity pattern of multiple STNs from sequential to dense leads to more powerful alignment modules. Extensive experiments on three benchmarks namely, MNIST, GTSRB, and IDocDB show that the proposed technique outperforms related state-of-the-art methods (i.e., STNs and CSTNs) both in terms of accuracy and robustness. Densely Supervised Grasp Detector(DSGD) This paper presents Densely Supervised Grasp Detector (DSGD), a deep learning framework which combines CNN structures with layer-wise feature fusion and produces grasps and their confidence scores at different levels of the image hierarchy (i.e., global-, region-, and pixel-levels). Specifically, at the global-level, DSGD uses the entire image information to predict a grasp and its confidence score. At the region-level, DSGD uses a region proposal network to identify salient regions in the image and predicts a grasp for each salient region. At the pixel-level, DSGD uses a fully convolutional network and predicts a grasp and its confidence at every pixel. The grasp with the highest confidence score is selected as the output of DSGD. This selection from hierarchically generated grasp candidates overcomes limitations of the individual models. DSGD outperforms state-of-the-art methods on the Cornell grasp dataset in terms of grasp accuracy. Evaluation on a multi-object dataset and real-world robotic grasping experiments show that DSGD produces highly stable grasps on a set of unseen objects in new environments. It achieves 96% grasp detection accuracy and 90% robotic grasping success rate with real-time inference speed. DenseNMT Recently, neural machine translation has achieved remarkable progress by introducing well-designed deep neural networks into its encoder-decoder framework. From the optimization perspective, residual connections are adopted to improve learning performance for both encoder and decoder in most of these deep architectures, and advanced attention connections are applied as well. Inspired by the success of the DenseNet model in computer vision problems, in this paper, we propose a densely connected NMT architecture (DenseNMT) that is able to train more efficiently for NMT. The proposed DenseNMT not only allows dense connection in creating new features for both encoder and decoder, but also uses the dense attention structure to improve attention quality. Our experiments on multiple datasets show that DenseNMT structure is more competitive and efficient. Density-based spatial clustering of applications with noise(DBSCAN) Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm because it finds a number of clusters starting from the estimated density distribution of corresponding nodes. DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature. OPTICS can be seen as a generalization of DBSCAN to multiple ranges, effectively replacing the e parameter with a maximum search radius. dbscan DensRay Word embeddings are useful for a wide variety of tasks, but they lack interpretability. By rotating word spaces, interpretable dimensions can be identified while preserving the information contained in the embeddings without any loss. In this work, we investigate three methods for making word spaces interpretable by rotation: Densifier (Rothe et al., 2016), linear SVMs and DensRay, a new method we propose. While DensRay is very closely related to the Densifier, it can be computed in closed form, is hyperparameter-free and thus more robust than the Densifier. We evaluate the methods on lexicon induction and set-based word analogy and conclude that analytical methods such as DensRay and SVMs are preferable. For word analogy we propose a new method to solve the task which outperforms the previous state of the art by large margins. DensSiam Convolutional Siamese neural networks have been recently used to track objects using deep features. Siamese architecture can achieve real time speed, however it is still difficult to find a Siamese architecture that maintains the generalization capability, high accuracy and speed while decreasing the number of shared parameters especially when it is very deep. Furthermore, a conventional Siamese architecture usually processes one local neighborhood at a time, which makes the appearance model local and non-robust to appearance changes. To overcome these two problems, this paper proposes DensSiam, a novel convolutional Siamese architecture, which uses the concept of dense layers and connects each dense layer to all layers in a feed-forward fashion with a similarity-learning function. DensSiam also includes a Self-Attention mechanism to force the network to pay more attention to the non-local features during offline training. Extensive experiments are performed on four tracking benchmarks: OTB2013 and OTB2015 for validation set; and VOT2015, VOT2016 and VOT2017 for testing set. The obtained results show that DensSiam achieves superior results on these benchmarks compared to other current state-of-the-art methods. Deontic Logic Deontic logic is the field of philosophical logic that is concerned with obligation, permission, and related concepts. Alternatively, a deontic logic is a formal system that attempts to capture the essential logical features of these concepts. Typically, a deontic logic uses OA to mean it is obligatory that A, (or it ought to be (the case) that A), and PA to mean it is permitted (or permissible) that A. Stanford Encyclopedia of Philosophy:Deontic Logic DepecheMood++ Several lexica for sentiment analysis have been developed and made available in the NLP community. While most of these come with word polarity annotations (e.g. positive/negative), attempts at building lexica for finer-grained emotion analysis (e.g. happiness, sadness) have recently attracted significant attention. Such lexica are often exploited as a building block in the process of developing learning models for which emotion recognition is needed, and/or used as baselines to which compare the performance of the models. In this work, we contribute two new resources to the community: a) an extension of an existing and widely used emotion lexicon for English; and b) a novel version of the lexicon targeting Italian. Furthermore, we show how simple techniques can be used, both in supervised and unsupervised experimental settings, to boost performances on datasets and tasks of varying degree of domain-specificity. Dependability In systems engineering, dependability is a measure of a system’s availability, reliability, and its maintainability, and maintenance support performance, and, in some cases, other characteristics such as durability, safety and security. In software engineering, dependability is the ability to provide services that can defensibly be trusted within a time-period. This may also encompass mechanisms designed to increase and maintain the dependability of a system or software. The International Electrotechnical Commission (IEC), via its Technical Committee TC 56 develops and maintains international standards that provide systematic methods and tools for dependability assessment and management of equipment, services, and systems throughout their life cycles. Dependability can be broken down into three elements: • Attributes – A way to assess the dependability of a system • Threats – An understanding of the things that can affect the dependability of a system • Means – Ways to increase a system’s dependability Dependence Modeling ➚ “Copula” http://…/9781466583221 http://…/7699_chap01.pdf Dependency Leakage A phenomenon in which learning on noisy clusters biases cross-validation and model selection results. Dependency Leakage: Analysis and Scalable Estimators Dependency Network The dependency network approach provides a new system level analysis of the activity and topology of directed networks. The approach extracts causal topological relations between the network’s nodes (when the network structure is analyzed), and provides an important step towards inference of causal activity relations between the network nodes (when analyzing the network activity). This methodology has originally been introduced for the study of financial data, it has been extended and applied to other systems, such as the immune system, and semantic networks. In the case of network activity, the analysis is based on partial correlations, which are becoming ever more widely used to investigate complex systems. In simple words, the partial (or residual) correlation is a measure of the effect (or contribution) of a given node, say j, on the correlations between another pair of nodes, say i and k. Using this concept, the dependency of one node on another node, is calculated for the entire network. This results in a directed weighted adjacency matrix, of a fully connected network. Once the adjacency matrix has been constructed, different algorithms can be used to construct the network, such as a threshold network, Minimal Spanning Tree (MST), Planar Maximally Filtered Graph (PMFG), and others. Dependency Weighted Aggregation We study a new class of aggregation problems, called dependency weighted aggregation. The underlying idea is to aggregate the answer tuples of a query while accounting for dependencies between them, where two tuples are considered dependent when they have the same value on some attribute. The main problem we are interested in is to compute the dependency weighted count of a conjunctive query. This aggregate can be seen as a form of weighted counting, where the weights of the answer tuples are computed by solving a linear program. This linear program enforces that dependent tuples are not over represented in the final weighted count. The dependency weighted count can be used to compute the s-measure, a measure that is used in data mining to estimate the frequency of a pattern in a graph database. Computing the dependency weighted count of a conjunctive query is NP-hard in general. In this paper, we show that this problem is actually tractable for a large class of structurally restricted conjunctive queries such as acyclic or bounded hypertree width queries. Our algorithm works on a factorized representation of the answer set, in order to avoid enumerating it exhaustively. Our technique produces a succinct representation of the weighting of the answers. It can be used to solve other dependency weighted aggregation tasks, such as computing the (dependency) weighted average of the value of an attribute in the answers set. DeployR DeployR is a server-based framework that provides simple, secure R integration for application developers. It’s available in two editions: DeployR Open, which is free and open-source; and Revolution R Enterprise DeployR, which adds a scalable grid framework and enterprise authentication features for production applications integrated with R. If you’re looking for an overview of what DeployR is and how you can use it to access R from other applications, we’ve just released a new white paper, Using DeployR to Solve the R Integration Problem. DeployR Data I/O Depth Coefficient(DC) Depth completion involves estimating a dense depth image from sparse depth measurements, often guided by a color image. While linear upsampling is straight forward, it results in artifacts including depth pixels being interpolated in empty space across discontinuities between objects. Current methods use deep networks to upsample and ‘complete’ the missing depth pixels. Nevertheless, depth smearing between objects remains a challenge. We propose a new representation for depth called Depth Coefficients (DC) to address this problem. It enables convolutions to more easily avoid inter-object depth mixing. We also show that the standard Mean Squared Error (MSE) loss function can promote depth mixing, and thus propose instead to use cross-entropy loss for DC. With quantitative and qualitative evaluation on benchmarks, we show that switching out sparse depth input and MSE loss with our DC representation and cross-entropy loss is a simple way to improve depth completion performance, and reduce pixel depth mixing, which leads to improved depth-based object detection. Depth-first Search(DFS) Depth-first search (DFS) is an algorithm for traversing or searching tree or graph data structures. One starts at the root (selecting some arbitrary node as the root in the case of a graph) and explores as far as possible along each branch before backtracking. A version of depth-first search was investigated in the 19th century by French mathematician Charles Pierre Tremaux as a strategy for solving mazes. Descriptive Statistics Descriptive statistics is the discipline of quantitatively describing the main features of a collection of information, or the quantitative description itself. Descriptive statistics are distinguished from inferential statistics (or inductive statistics), in that descriptive statistics aim to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, are not developed on the basis of probability theory. Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example in a paper reporting on a study involving human subjects, there typically appears a table giving the overall sample size, sample sizes in important subgroups (e.g., for each treatment or exposure group), and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, and the proportion of subjects with related comorbidities. Design of Experiments(DoE) The design of experiments (DOE, DOX, or experimental design) is the design of any task that aims to describe or explain the variation of information under conditions that are hypothesized to reflect the variation. The term is generally associated with experiments in which the design introduces conditions that directly affect the variation, but may also refer to the design of quasi-experiments, in which natural conditions that influence the variation are selected for observation. In its simplest form, an experiment aims at predicting the outcome by introducing a change of the preconditions, which is represented by one or more independent variables, also referred to as ‘input variables’ or ‘predictor variables.’ The change in one or more independent variables is generally hypothesized to result in a change in one or more dependent variables, also referred to as ‘output variables’ or ‘response variables.’ The experimental design may also identify control variables that must be held constant to prevent external factors from affecting the results. Experimental design involves not only the selection of suitable independent, dependent, and control variables, but planning the delivery of the experiment under statistically optimal conditions given the constraints of available resources. There are multiple approaches for determining the set of design points (unique combinations of the settings of the independent variables) to be used in the experiment. Main concerns in experimental design include the establishment of validity, reliability, and replicability. For example, these concerns can be partially addressed by carefully choosing the independent variable, reducing the risk of measurement error, and ensuring that the documentation of the method is sufficiently detailed. Related concerns include achieving appropriate levels of statistical power and sensitivity. Correctly designed experiments advance knowledge in the natural and social sciences and engineering. Other applications include marketing and policy making. Design-Execute-Examine-Deploy – Framework(DEED) Despeckling Residual Neural Network(DRNN) Unsupervised Despeckling Destruction Rate It is difficult to detect and remove secret images that are hidden in natural images using deep-learning algorithms. Our technique is the first work to effectively disable covert communications and transactions that use deep-learning steganography. We address the problem by exploiting sophisticated pixel distributions and edge areas of images using a deep neural network. Based on the given information, we adaptively remove secret information at the pixel level. We also introduce a new quantitative metric called destruction rate since the decoding method of deep-learning steganography is approximate (lossy), which is different from conventional steganography. We evaluate our technique using three public benchmarks in comparison with conventional steganalysis methods and show that the decoding rate improves by 10 ~ 20%. Detail-Preserving Pooling(DPP) Most convolutional neural networks use some method for gradually downscaling the size of the hidden layers. This is commonly referred to as pooling, and is applied to reduce the number of parameters, improve invariance to certain distortions, and increase the receptive field size. Since pooling by nature is a lossy process, it is crucial that each such layer maintains the portion of the activations that is most important for the network’s discriminability. Yet, simple maximization or averaging over blocks, max or average pooling, or plain downsampling in the form of strided convolutions are the standard. In this paper, we aim to leverage recent results on image downscaling for the purposes of deep learning. Inspired by the human visual system, which focuses on local spatial changes, we propose detail-preserving pooling (DPP), an adaptive pooling method that magnifies spatial changes and preserves important structural detail. Importantly, its parameters can be learned jointly with the rest of the network. We analyze some of its theoretical properties and show its empirical benefits on several datasets and networks, where DPP consistently outperforms previous pooling approaches. Detection Network(DetNet) In this paper we consider Multiple-Input-Multiple-Output (MIMO) detection using deep neural networks. We introduce two different deep architectures: a standard fully connected multi-layer network, and a Detection Network (DetNet) which is specifically designed for the task. The structure of DetNet is obtained by unfolding the iterations of a projected gradient descent algorithm into a network. We compare the accuracy and runtime complexity of the purposed approaches and achieve state-of-the-art performance while maintaining low computational requirements. Furthermore, we manage to train a single network to detect over an entire distribution of channels. Finally, we consider detection with soft outputs and show that the networks can easily be modified to produce soft decisions. Determinantal Point Process(DPP) In mathematics, a determinantal point process is a stochastic point process, the probability distribution of which is characterized as a determinant of some function. Such processes arise as important tools in random matrix theory, combinatorics, and physics. Improving the Diversity of Top-N Recommendation via Determinantal Point Process Optimized Algorithms to Sample Determinantal Point Processes DPPy: Sampling Determinantal Point Processes with Python Deep Determinantal Point Processes Determinantal Point Processes Network(DPPNet) Determinantal Point Processes (DPPs) provide an elegant and versatile way to sample sets of items that balance the point-wise quality with the set-wise diversity of selected items. For this reason, they have gained prominence in many machine learning applications that rely on subset selection. However, sampling from a DPP over a ground set of size $N$ is a costly operation, requiring in general an $O(N^3)$ preprocessing cost and an $O(Nk^3)$ sampling cost for subsets of size $k$. We approach this problem by introducing DPPNets: generative deep models that produce DPP-like samples for arbitrary ground sets. We develop an inhibitive attention mechanism based on transformer networks that captures a notion of dissimilarity between feature vectors. We show theoretically that such an approximation is sensible as it maintains the guarantees of inhibition or dissimilarity that makes DPPs so powerful and unique. Empirically, we demonstrate that samples from our model receive high likelihood under the more expensive DPP alternative. Deterministic Heron Inference(Heron) Bayesian graphical models have been shown to be a powerful tool for discovering uncertainty and causal structure from real-world data in many application fields. Current inference methods primarily follow different kinds of trade-offs between computational complexity and predictive accuracy. At one end of the spectrum, variational inference approaches perform well in computational efficiency, while at the other end, Gibbs sampling approaches are known to be relatively accurate for prediction in practice. In this paper, we extend an existing Gibbs sampling method, and propose a new deterministic Heron inference (Heron) for a family of Bayesian graphical models. In addition to the support for nontrivial distributability, one more benefit of Heron is that it is able to not only allow us to easily assess the convergence status but also largely improve the running efficiency. We evaluate Heron against the standard collapsed Gibbs sampler and state-of-the-art state augmentation method in inference for well-known graphical models. Experimental results using publicly available real-life data have demonstrated that Heron significantly outperforms the baseline methods for inferring Bayesian graphical models. Deterministic Parallel Analysis(DPA) Factor analysis is widely used in many application areas. The first step, choosing the number of factors, remains a serious challenge. One of the most popular methods is parallel analysis (PA), which compares the observed factor strengths to simulated ones under a noise-only model. % Abstracts are commonly just one paragraph. This paper presents a deterministic version of PA (DPA), which is faster and more reproducible than PA. We show that DPA selects large factors and does not select small factors just like [Dobriban, 2017] shows for PA. Both PA and DPA are prone to a shadowing phenomenon in which a strong factor makes it hard to detect smaller but more interesting factors. We develop a deflated version of DPA (DDPA) that counters shadowing. By raising the decision threshold in DDPA, a new method (DDPA+) also improves estimation accuracy. We illustrate our methods on data from the Human Genome Diversity Project (HGDP). There PA and DPA select seemingly too many factors, while DDPA+ selects only a few. A Matlab implementation is available. Deterministic Statistical Machine E.g. you input a data set and then specify the question you are asking (is variable Y related to variable X? can i predict Z from W?) then, depending on your question, it uses a deterministic set of methods to analyze the data. Say regression for inference, linear discriminant analysis for prediction, etc. But the method is fixed and deterministic for each question. It also performs a pre-specified set of checks for outliers, confounders, missing data, maybe even data fudging. It generates a report with a markdown tool and then immediately publishes the result. Deterministic Stretchy Regression An extension of the regularized least-squares in which the estimation parameters are stretchable is introduced and studied in this paper. The solution of this ridge regression with stretchable parameters is given in primal and dual spaces and in closed-form. Essentially, the proposed solution stretches the covariance computation by a power term, thereby compressing or amplifying the estimation parameters. To maintain the computation of power root terms within the real space, an input transformation is proposed. The results of an empirical evaluation in both synthetic and real-world data illustrate that the proposed method is effective for compressive learning with high-dimensional data. Determinized Sparse Partially Observable Tree(DESPOT) The partially observable Markov decision process (POMDP) provides a principled general framework for planning under uncertainty, but solving POMDPs optimally is computationally intractable, due to the ‘curse of dimensionality’ and the ‘curse of history’. To overcome these challenges, we introduce the Determinized Sparse Partially Observable Tree (DESPOT), a sparse approximation of the standard belief tree, for online planning under uncertainty. A DESPOT focuses online planning on a set of randomly sampled scenarios and compactly captures the ‘execution’ of all policies under these scenarios. We show that the best policy obtained from a DESPOT is near-optimal, with a regret bound that depends on the representation size of the optimal policy. Leveraging this result, we give an anytime online planning algorithm, which searches a DESPOT for a policy that optimizes a regularized objective function. Regularization balances the estimated value of a policy under the sampled scenarios and the policy size, thus avoiding overfitting. The algorithm demonstrates strong experimental results, compared with some of the best online POMDP algorithms available. It has also been incorporated into an autonomous driving system for realtime vehicle control. DetMCD Algorithm Most algorithms for highly robust estimators of multivariate location and scatter start by drawing a large number of random subsets. For instance, the FASTMCD algorithm of Rousseeuw and Van Driessen starts in this way, and then takes so-called concentration steps to obtain a more accurate approximation to the MCD. The FASTMCD algorithm is affine equivariant but not permutation invariant. In this article,we present a deterministic algorithm, denoted as DetMCD, which does not use random subsets and is even faster. It computes a small number of deterministic initial estimators, followed by concentration steps. DetMCD is permutation invariant and very close to affine equivariant. DetMCD DetNAS Object detectors are usually equipped with networks designed for image classification as backbones, e.g., ResNet. Although it is publicly known that there is a gap between the task of image classification and object detection, designing a suitable detector backbone is still manually exhaustive. In this paper, we propose DetNAS to automatically search neural architectures for the backbones of object detectors. In DetNAS, the search space is formulated into a supernet and the search method relies on evolution algorithm (EA). In experiments, we show the effectiveness of DetNAS on various detectors, the one-stage detector, RetinaNet, and the two-stage detector, FPN. For each case, we search in both training from scratch scheme and ImageNet pre-training scheme. There is a consistent superiority compared to the architectures searched on ImageNet classification. Our main result architecture achieves better performance than ResNet-101 on COCO with the FPN detector. In addition, we illustrate the architectures searched by DetNAS and find some meaningful patterns. Detrended Fluctuation Analysis(DFA) In stochastic processes, chaos theory and time series analysis, detrended fluctuation analysis (DFA) is a method for determining the statistical self-affinity of a signal. It is useful for analysing time series that appear to be long-memory processes (diverging correlation time, e.g. power-law decaying autocorrelation function) or 1/f noise. The obtained exponent is similar to the Hurst exponent, except that DFA may also be applied to signals whose underlying statistics (such as mean and variance) or dynamics are non-stationary (changing with time). It is related to measures based upon spectral techniques such as autocorrelation and Fourier transform. Peng et al. introduced DFA in 1994 in a paper that has been cited over 2000 times as of 2013 and represents an extension of the (ordinary) fluctuation analysis (FA), which is affected by non-stationarities. Developed Lagrange Function(DLF) ➘ “Developed Lagrange Interpolation” Developed Lagrange Interpolation(DLI) In this work, we introduce the new class of functions which can use to solve the nonlinear/linear multi-dimensional differential equations. Based on these functions, a numerical method is provided which is called the Developed Lagrange Interpolation (DLI). For this, firstly, we define the new class of the functions, called the Developed Lagrange Functions (DLFs), which satisfy in the Kronecker Delta at the collocation points. Then, for the DLFs, the first-order derivative operational matrix of $\textbf{D}^{(1)}$ is obtained, and a recurrence relation is provided to compute the high-order derivative operational matrices of $\textbf{D}^{(m)}$, $m\in \mathbb{N}$; that is, we develop the theorem of the derivative operational matrices of the classical Lagrange polynomials for the DLFs and show that the relation of $\textbf{D}^{(m)}=(\textbf{D}^{(1)})^{m}$ for the DLFs is not established and is developable. Finally, we develop the error analysis of the classical Lagrange interpolation for the developed Lagrange interpolation. Finally, for demonstrating the convergence and efficiency of the DLI, some well-known differential equations, which are applicable in applied sciences, have been investigated based upon the various choices of the points of interpolation/collocation. Deviance In statistics, deviance is a quality of fit statistic for a model that is often used for statistical hypothesis testing. It is a generalization of the idea of using the sum of squares of residuals in ordinary least squares to cases where model-fitting is achieved by maximum likelihood. Deviance Information Criterion(DIC) The deviance information criterion (DIC) is a hierarchical modeling generalization of the AIC (Akaike information criterion) and BIC (Bayesian information criterion, also known as the Schwarz criterion). It is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation. Like AIC and BIC it is an asymptotic approximation as the sample size becomes large. It is only valid when the posterior distribution is approximately multivariate normal. The idea is that models with smaller DIC should be preferred to models with larger DIC. DevSecOps DevSecOps is a methodology similar to DevOps in that both of them are within an agile framework that breaks projects into smaller chunks. However, DevSecOps incorporates security into every step of the development process. It requires ongoing communications between the development and security department?-?departments that didn’t typically communicate until the later stages. It’s also useful to note that DevOps and DevSecOps teams both often integrate automation into their practices. In that way, and others, these methodologies are much more similar than different. DevSecOps merely prioritizes security, a factor that developers may have limited knowledge about and not be overly concerned about unless the problem results in a bug. Dex This paper introduces Dex, a reinforcement learning environment toolkit specialized for training and evaluation of continual learning methods as well as general reinforcement learning problems. We also present the novel continual learning method of incremental learning, where a challenging environment is solved using optimal weight initialization learned from first solving a similar easier environment. We show that incremental learning can produce vastly superior results than standard methods by providing a strong baseline method across ten Dex environments. We finally develop a saliency method for qualitative analysis of reinforcement learning, which shows the impact incremental learning has on network attention. DEXON A blockchain system is a replicated state machine that must be fault tolerant. When designing a blockchain system, there is usually a trade-off between decentralization, scalability, and security. In this paper, we propose a novel blockchain system, DEXON, which achieves high scalability while remaining decentralized and robust in the real-world environment. We have two main contributions. First, we present a highly scalable sharding framework for blockchain. This framework takes an arbitrary number of single chains and transforms them into the \textit{blocklattice} data structure, enabling \textit{high scalability} and \textit{low transaction confirmation latency} with asymptotically optimal communication overhead. Second, we propose a single-chain protocol based on our novel verifiable random function and a new Byzantine agreement that achieves high decentralization and low latency. Dfuntest New ideas in distributed systems (algorithms or protocols) are commonly tested by simulation, because experimenting with a prototype deployed on a realistic platform is cumbersome. However, a prototype not only measures performance but also verifies assumptions about the underlying system. We developed dfuntest – a testing framework for distributed applications that defines abstractions and test structure, and automates experiments on distributed platforms. Dfuntest aims to be jUnit’s analogue for distributed applications; a framework that enables the programmer to write robust and flexible scenarios of experiments. Dfuntest requires minimal bindings that specify how to deploy and interact with the application. Dfuntest’s abstractions allow execution of a scenario on a single machine, a cluster, a cloud, or any other distributed infrastructure, e.g. on PlanetLab. A scenario is a procedure; thus, our framework can be used both for functional tests and for performance measurements. We show how to use dfuntest to deploy our DHT prototype on 60 PlanetLab nodes and verify whether the prototype maintains a correct topology. DGD+LOCAL We study the convergence of a variant of distributed gradient descent (DGD) on a distributed low-rank matrix approximation problem wherein some optimization variables are used for consensus (as in classical DGD) and some optimization variables appear only locally at a single node in the network. We term the resulting algorithm DGD+LOCAL. Using algorithmic connections to gradient descent and geometric connections to the well-behaved landscape of the centralized low-rank matrix approximation problem, we identify sufficient conditions where DGD+LOCAL is guaranteed to converge with exact consensus to a global minimizer of the original centralized problem. For the distributed low-rank matrix approximation problem, these guarantees are stronger—in terms of consensus and optimality—than what appear in the literature for classical DGD and more general problems. dhSegment In recent years there have been multiple successful attempts tackling document processing problems separately by designing task specific hand-tuned strategies. We argue that the diversity of historical document processing tasks prohibits to solve them one at a time and shows a need for designing generic approaches in order to handle the variability of historical series. In this paper, we address multiple tasks simultaneously such as page extraction, baseline extraction, layout analysis or multiple typologies of illustrations and photograph extraction. We propose an open-source implementation of a CNN-based pixel-wise predictor coupled with task dependent post-processing blocks. We show that a single CNN-architecture can be used across tasks with competitive results. Moreover most of the task-specific post-precessing steps can be decomposed in a small number of simple and standard reusable operations, adding to the flexibility of our approach. Diagonalwise Refactorization Depthwise convolutions provide significant performance benefits owing to the reduction in both parameters and mult-adds. However, training depthwise convolution layers with GPUs is slow in current deep learning frameworks because their implementations cannot fully utilize the GPU capacity. To address this problem, in this paper we present an efficient method (called diagonalwise refactorization) for accelerating the training of depthwise convolution layers. Our key idea is to rearrange the weight vectors of a depthwise convolution into a large diagonal weight matrix so as to convert the depthwise convolution into one single standard convolution, which is well supported by the cuDNN library that is highly-optimized for GPU computations. We have implemented our training method in five popular deep learning frameworks. Evaluation results show that our proposed method gains $15.4\times$ training speedup on Darknet, $8.4\times$ on Caffe, $5.4\times$ on PyTorch, $3.5\times$ on MXNet, and $1.4\times$ on TensorFlow, compared to their original implementations of depthwise convolutions. Diagram Generating Function(DGF) The recently-introduced self-learning Monte Carlo method is a general-purpose numerical method that speeds up Monte Carlo simulations by training an effective model to propose uncorrelated configurations in the Markov chain. We implement this method in the framework of continuous time Monte Carlo method with auxiliary field in quantum impurity models. We introduce and train a diagram generating function (DGF) to model the probability distribution of auxiliary field configurations in continuous imaginary time, at all orders of diagrammatic expansion. By using DGF to propose global moves in configuration space, we show that the self-learning continuous-time Monte Carlo method can significantly reduce the computational complexity of the simulation. Diagrammatic AI Language(DIAL) Currently, there is no consistent model for visually or formally representing the architecture of AI systems. This lack of representation brings interpretability, correctness and completeness challenges in the description of existing models and systems. DIAL (The Diagrammatic AI Language) has been created with the aspiration of being an ‘engineering schematic’ for AI Systems. It is presented here as a starting point for a community dialogue towards a common diagrammatic language for AI Systems. Diagrammatic Reasoning Diagrammatic reasoning is reasoning by means of visual representations. The study of diagrammatic reasoning is about the understanding of concepts and ideas, visualized with the use of diagrams and imagery instead of by linguistic or algebraic means. Can We Automate Diagrammatic Reasoning? Dialogue Description(Dial2Desc) We first propose a new task named Dialogue Description (Dial2Desc). Unlike other existing dialogue summarization tasks such as meeting summarization, we do not maintain the natural flow of a conversation but describe an object or an action of what people are talking about. The Dial2Desc system takes a dialogue text as input, then outputs a concise description of the object or the action involved in this conversation. After reading this short description, one can quickly extract the main topic of a conversation and build a clear picture in his mind, without reading or listening to the whole conversation. Based on the existing dialogue dataset, we build a new dataset, which has more than one hundred thousand dialogue-description pairs. As a step forward, we demonstrate that one can get more accurate and descriptive results using a new neural attentive model that exploits the interaction between utterances from different speakers, compared with other baselines. DialogueRNN Emotion detection in conversations is a necessary step for a number of applications, including opinion mining over chat history, social media threads, debates, argumentation mining, understanding consumer feedback in live conversations, etc. Currently, systems do not treat the parties in the conversation individually by adapting to the speaker of each utterance. In this paper, we describe a new method based on recurrent neural networks that keeps track of the individual party states throughout the conversation and uses this information for emotion classification. Our model outperforms the state of the art by a significant margin on two different datasets. DiamondGAN Recent studies on medical image synthesis reported promising results using generative adversarial networks, mostly focusing on one-to-one cross-modality synthesis. Naturally, the idea arises that a target modality would benefit from multi-modal input. Synthesizing MR imaging sequences is highly attractive for clinical practice, as often single sequences are missing or of poor quality (e.g. due to motion). However, existing methods fail to scale up to image volumes with high numbers of modalities and extensive non-aligned volumes, facing common draw-backs of complex multi-modal imaging sequences. To address these limitations, we propose a novel, scalable and multi-modal approach called DiamondGAN. Our model is capable of performing flexible non-aligned cross-modality synthesis and data infill, when given multiple modalities or any of their arbitrary subsets. It learns structured information using non-aligned input modalities in an end-to-end fashion. We synthesize two MRI sequences with clinical relevance (i.e., double inversion recovery (DIR) and contrast-enhanced T1 (T1-c)), which are reconstructed from three common MRI sequences. In addition, we perform multi-rater visual evaluation experiment and find that trained radiologists are unable to distinguish our synthetic DIR images from real ones. Diceware Diceware is a method for creating passphrases, passwords, and other cryptographic variables using an ordinary die from a pair of dice as a hardware random number generator. For each word in the passphrase, five rolls of the dice are required. The numbers from 1 to 6 that come up in the rolls are assembled as a five digit number, e.g. 43146. That number is then used to look up a word in a word list. In the English list 43146 corresponds to munch. Lists have been compiled for several languages, including English, Finnish, German, Italian, Polish, Romanian, Russian, Spanish and Swedish. A Diceware word list is any list of 6^5 = 7,776 unique words, preferably ones the user will find easy to spell and to remember. The contents of the word list do not have to be protected or concealed in any way, as the security of a Diceware passphrase is in the number of words selected, and the number of words each selected word could be taken from. The level of unpredictability of a Diceware passphrase can be easily calculated: each word adds 12.9 bits of entropy to the passphrase (that is, \log_2( 6^5 ) bits). Originally, in 1995, Diceware creator Arnold Reinhold considered five words (64 bits) the minimum length needed by average users. However, starting in 2014, Reinhold recommends that at least six words (77 bits) should be used. This level of unpredictability assumes that a potential attacker knows both that Diceware has been used to generate the passphrase, the particular word list used, and exactly how many words make up the passphrase. If the attacker has less information, the entropy can be greater than 12.9 bits per word. If words were simply concatenated rather than separated by spaces, concatenating could form words that are already in the word list. For example, ‘in’ and ‘put’ form ‘input’; all three words can be found in the above mentioned word list. This could slightly decrease the entropy, when compared with the recommended method of using spaces to separate each word in the list. riceware dIconomy “dIconomy” = “Digital Economy” Dictionary Learning(DL) Dictionary learning is a branch of signal processing and machine learning that aims at finding a frame (called dictionary) in which some training data admits a sparse representation. The sparser the representation, the better the dictionary. Dictionary Learning Algorithms for Sparse Representation Dictionary Learning Dictionary Learning – Separating the Particularity and the Commonality(DL-COPAR) Empirically, we find that, despite the class-specific features owned by the objects appearing in the images, the objects from different categories usually share some common patterns, which do not contribute to the discrimination of them. Concentrating on this observation and under the general dictionary learning (DL) framework, we propose a novel method to explicitly learn a common pattern pool (the commonality) and class-specific dictionaries (the particularity) for classification. We call our method DL-COPAR, which can learn the most compact and most discriminative class-specific dictionaries used for classification. The proposed DL-COPAR is extensively evaluated both on synthetic data and on benchmark image databases in comparison with existing DL-based classification methods. The experimental results demonstrate that DL-COPAR achieves very promising performances in various applications, such as face recognition, handwritten digit recognition, scene classification and object recognition. Dictionary Learning Based Image-Domain Material Decomposition(DLIMD) The potential huge advantage of spectral computed tomography (CT) is its capability to provide accuracy material identification and quantitative tissue information. This can benefit clinical applications, such as brain angiography, early tumor recognition, etc. To achieve more accurate material components with higher material image quality, we develop a dictionary learning based image-domain material decomposition (DLIMD) for spectral CT in this paper. First, we reconstruct spectral CT image from projections and calculate material coefficients matrix by selecting uniform regions of basis materials from image reconstruction results. Second, we employ the direct inversion (DI) method to obtain initial material decomposition results, and a set of image patches are extracted from the mode-1 unfolding of normalized material image tensor to train a united dictionary by the K-SVD technique. Third, the trained dictionary is employed to explore the similarities from decomposed material images by constructing the DLIMD model. Fourth, more constraints (i.e., volume conservation and the bounds of each pixel within material maps) are further integrated into the model to improve the accuracy of material decomposition. Finally, both physical phantom and preclinical experiments are employed to evaluate the performance of the proposed DLIMD in material decomposition accuracy, material image edge preservation and feature recovery. Dictionary Matching With k Mismatches In the $k$-mismatch problem we are given a pattern of length $m$ and a text and must find all locations where the Hamming distance between the pattern and the text is at most $k$. A series of recent breakthroughs has resulted in an ultra-efficient streaming algorithm for this problem that requires only $O(k \log \frac{m}{k})$ space [Clifford, Kociumaka, Porat, 2017]. In this work we consider a strictly harder problem called dictionary matching with $k$ mismatches, where we are given a dictionary of $d$ patterns of lengths $\le m$ and must find all their $k$-mismatch occurrences in the text, and show the first streaming algorithm for it. The algorithm uses $O(k d \log^k d \; \mathrm{polylog} \; {m})$ space and processes each position of the text in $O(k \log^{k+1} d \; \mathrm{polylog} \; m + occ)$ time, where $occ$ is the number of $k$-mismatch occurrences of the patterns that end at this position. DID-MDN Single image rain streak removal is an extremely challenging problem due to the presence of non-uniform rain densities in images. We present a novel density-aware multi-stream densely connected convolutional neural network-based algorithm, called DID-MDN, for joint rain density estimation and de-raining. The proposed method enables the network itself to automatically determine the rain-density information and then efficiently remove the corresponding rain-streaks guided by the estimated rain-density label. To better characterize rain-streaks with different scales and shapes, a multi-stream densely connected de-raining network is proposed which efficiently leverages features from different scales. Furthermore, a new dataset containing images with rain-density labels is created and used to train the proposed density-aware network. Extensive experiments on synthetic and real datasets demonstrate that the proposed method achieves significant improvements over the recent state-of-the-art methods. In addition, an ablation study is performed to demonstrate the improvements obtained by different modules in the proposed method. Code can be found at: https://…/hezhangsprinter Difference Compression Optimizing distributed learning systems is an art of balancing between computation and communication. There have been two lines of research that try to deal with slower networks: {\em quantization} for low bandwidth networks, and {\em decentralization} for high latency networks. In this paper, we explore a natural question: {\em can the combination of both decentralization and quantization lead to a system that is robust to both bandwidth and latency?} Although the system implication of such combination is trivial, the underlying theoretical principle and algorithm design is challenging: simply quantizing data sent in a decentralized training algorithm would accumulate the error. In this paper, we develop a framework of quantized, decentralized training and propose two different strategies, which we call {\em extrapolation compression} and {\em difference compression}. We analyze both algorithms and prove both converge at the rate of $O(1/\sqrt{nT})$ where $n$ is the number of workers and $T$ is the number of iterations, matching the {\rc convergence} rate for full precision, centralized training. We evaluate our algorithms on training deep learning models, and find that our proposed algorithm outperforms the best of merely decentralized and merely quantized algorithm significantly for networks with {\em both} high latency and low bandwidth. Difference Differential Description Length This paper introduces a new method for model selection and more generally hyperparameter selection in machine learning. The paper first proves a relationship between generalization error and a difference of description lengths of the training data; we call this difference differential description length (DDL). This allows prediction of generalization error from the training data \emph{alone} by performing encoding of the training data. This can now be used for model selection by choosing the model that has the smallest predicted generalization error. We show how this encoding can be done for linear regression and neural networks. We provide experiments showing that this leads to smaller generalization error than cross-validation and traditional MDL and Bayes methods. Difference Guided Generative Adversarial Network(DGGAN) Predicting the future is a fantasy but practicality work. It is the key component to intelligent agents, such as self-driving vehicles, medical monitoring devices and robotics. In this work, we consider generating unseen future frames from previous obeservations, which is notoriously hard due to the uncertainty in frame dynamics. While recent works based on generative adversarial networks (GANs) made remarkable progress, there is still an obstacle for making accurate and realistic predictions. In this paper, we propose a novel GAN based on inter-frame difference to circumvent the difficulties. More specifically, our model is a multi-stage generative network, which is named the Difference Guided Generative Adversarial Netwok (DGGAN). The DGGAN learns to explicitly enforce future-frame predictions that is guided by synthetic inter-frame difference. Given a sequence of frames, DGGAN first uses dual paths to generate meta information. One path, called Coarse Frame Generator, predicts the coarse details about future frames, and the other path, called Difference Guide Generator, generates the difference image which include complementary fine details. Then our coarse details will then be refined via guidance of difference image under the support of GANs. With this model and novel architecture, we achieve state-of-the-art performance for future video prediction on UCF-101, KITTI. Difference of Convex Functions Algorithm(DCA) The DC programming and its DC algorithm (DCA) address the problem of minimizing a function f=g-h (with g,h being lower semicontinuous proper convex functions on R n ) on the whole space. Based on local optimality conditions and DC duality, DCA was successfully applied to a lot of different and various nondifferentiable nonconvex optimization problems to which it quite often gave global solutions and proved to be more robust and more efficient than related standard methods, especially in the large scale setting. The computational efficiency of DCA suggests to us a deeper and more complete study on DC programming, using the special class of DC programs (when either g or h is polyhedral convex) called polyhedral DC programs. A DCA-Like Algorithm and its Accelerated Version with Application in Data Visualization Differences-in-Differences(DID) Difference in differences (sometimes ‘Difference-in-Differences’, ‘DID’, or ‘DD’) is a statistical technique used in econometrics and quantitative sociology, which attempts to mimic an experimental research design using observational study data. It calculates the effect of a treatment (i.e., an explanatory variable or an independent variable) on an outcome (i.e., a response variable or dependent variable) by comparing the average change over time in the outcome variable for the treatment group to the average change over time for the control group. This method may be subject to certain biases (mean reversion bias, etc.), although it is intended to eliminate some of the effect of selection bias. In contrast to a within-subjects estimate of the treatment effect (which measures differences over time) or a between-subjects estimate of the treatment effect (which measures the difference between the treatment and control groups), the DID measures the difference in the differences between the treatment and control group over time. Differentiable ARchitecture Compression(DARC) In many learning situations, resources at inference time are significantly more constrained than resources at training time. This paper studies a general paradigm, called Differentiable ARchitecture Compression (DARC), that combines model compression and architecture search to learn models that are resource-efficient at inference time. Given a resource-intensive base architecture, DARC utilizes the training data to learn which sub-components can be replaced by cheaper alternatives. The high-level technique can be applied to any neural architecture, and we report experiments on state-of-the-art convolutional neural networks for image classification. For a WideResNet with $97.2\%$ accuracy on CIFAR-10, we improve single-sample inference speed by $2.28\times$ and memory footprint by $5.64\times$, with no accuracy loss. For a ResNet with $79.15\%$ Top1 accuracy on ImageNet, we improve batch inference speed by $1.29\times$ and memory footprint by $3.57\times$ with $1\%$ accuracy loss. We also give theoretical Rademacher complexity bounds in simplified cases, showing how DARC avoids overfitting despite over-parameterization. Differentiable Architecture Search(DARTS) This paper addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Unlike conventional approaches of applying evolution or reinforcement learning over a discrete and non-differentiable search space, our method is based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent. Extensive experiments on CIFAR-10, ImageNet, Penn Treebank and WikiText-2 show that our algorithm excels in discovering high-performance convolutional architectures for image classification and recurrent architectures for language modeling, while being orders of magnitude faster than state-of-the-art non-differentiable techniques. Differentiable Boundary Sets We introduce a semiparametric approach to neighbor-based classification. We build off the recently proposed Boundary Trees algorithm by Mathy et al.(2015) which enables fast neighbor-based classification, regression and retrieval in large datasets. While boundary trees use an Euclidean measure of similarity, the Differentiable Boundary Tree algorithm by Zoran et al.(2017) was introduced to learn low-dimensional representations of complex input data, on which semantic similarity can be calculated to train boundary trees. As is pointed out by its authors, the differentiable boundary tree approach contains a few limitations that prevents it from scaling to large datasets. In this paper, we introduce Differentiable Boundary Sets, an algorithm that overcomes the computational issues of the differentiable boundary tree scheme and also improves its classification accuracy and data representability. Our algorithm is efficiently implementable with existing tools and offers a significant reduction in training time. We test and compare the algorithms on the well known MNIST handwritten digits dataset and the newer Fashion-MNIST dataset by Xiao et al.(2017). Differentiable Greedy Network(DGN) Optimal selection of a subset of items from a given set is a hard problem that requires combinatorial optimization. In this paper, we propose a subset selection algorithm that is trainable with gradient-based methods yet achieves near-optimal performance via submodular optimization. We focus on the task of identifying a relevant set of sentences for claim verification in the context of the FEVER task. Conventional methods for this task look at sentences on their individual merit and thus do not optimize the informativeness of sentences as a set. We show that our proposed method which builds on the idea of unfolding a greedy algorithm into a computational graph allows both interpretability and gradient-based training. The proposed differentiable greedy network (DGN) outperforms discrete optimization algorithms as well as other baseline methods in terms of precision and recall. Differentiable Lasso(dlasso) DLASSO Differentiable Linearized ADMM(D-LADMM) Recently, a number of learning-based optimization methods that combine data-driven architectures with the classical optimization algorithms have been proposed and explored, showing superior empirical performance in solving various ill-posed inverse problems, but there is still a scarcity of rigorous analysis about the convergence behaviors of learning-based optimization. In particular, most existing analyses are specific to unconstrained problems but cannot apply to the more general cases where some variables of interest are subject to certain constraints. In this paper, we propose Differentiable Linearized ADMM (D-LADMM) for solving the problems with linear constraints. Specifically, D-LADMM is a K-layer LADMM inspired deep neural network, which is obtained by firstly introducing some learnable weights in the classical Linearized ADMM algorithm and then generalizing the proximal operator to some learnable activation function. Notably, we rigorously prove that there exist a set of learnable parameters for D-LADMM to generate globally converged solutions, and we show that those desired parameters can be attained by training D-LADMM in a proper way. To the best of our knowledge, we are the first to provide the convergence analysis for the learning-based optimization method on constrained problems. Differentiable Neural Architecture Search(DNAS) Designing accurate and efficient ConvNets for mobile devices is challenging because the design space is combinatorially large. Due to this, previous neural architecture search (NAS) methods are computationally expensive. ConvNet architecture optimality depends on factors such as input resolution and target devices. However, existing approaches are too expensive for case-by-case redesigns. Also, previous work focuses primarily on reducing FLOPs, but FLOP count does not always reflect actual latency. To address these, we propose a differentiable neural architecture search (DNAS) framework that uses gradient-based methods to optimize ConvNet architectures, avoiding enumerating and training individual architectures separately as in previous methods. FBNets, a family of models discovered by DNAS surpass state-of-the-art models both designed manually and generated automatically. FBNet-B achieves 74.1% top-1 accuracy on ImageNet with 295M FLOPs and 23.1 ms latency on a Samsung S8 phone, 2.4x smaller and 1.5x faster than MobileNetV2-1.3 with similar accuracy. Despite higher accuracy and lower latency than MnasNet, we estimate FBNet-B’s search cost is 420x smaller than MnasNet’s, at only 216 GPU-hours. Searched for different resolutions and channel sizes, FBNets achieve 1.5% to 6.4% higher accuracy than MobileNetV2. The smallest FBNet achieves 50.2% accuracy and 2.9 ms latency (345 frames per second) on a Samsung S8. Over a Samsung-optimized FBNet, the iPhone-X-optimized model achieves a 1.4x speedup on an iPhone X. Differentiable Particle Filter(DPF) We present differentiable particle filters (DPFs): a differentiable implementation of the particle filter algorithm with learnable motion and measurement models. Since DPFs are end-to-end differentiable, we can efficiently train their models by optimizing end-to-end state estimation performance, rather than proxy objectives such as model accuracy. DPFs encode the structure of recursive state estimation with prediction and measurement update that operate on a probability distribution over states. This structure represents an algorithmic prior that improves learning performance in state estimation problems while enabling explainability of the learned model. Our experiments on simulated and real data show substantial benefits from end-to- end learning with algorithmic priors, e.g. reducing error rates by ~80%. Our experiments also show that, unlike long short-term memory networks, DPFs learn localization in a policy-agnostic way and thus greatly improve generalization. Source code is available at https://…/differentiable-particle-filters. Differentiable Physics-informed Graph Network(DPGN) While physics conveys knowledge of nature built from an interplay between observations and theory, it has been considered less importantly in deep neural networks. Especially, there are few works leveraging physics behaviors when the knowledge is given less explicitly. In this work, we propose a novel architecture called Differentiable Physics-informed Graph Networks (DPGN) to incorporate implicit physics knowledge which is given from domain experts by informing it in latent space. Using the concept of DPGN, we demonstrate that climate prediction tasks are significantly improved. Besides the experiment results, we validate the effectiveness of the proposed module and provide further applications of DPGN, such as inductive learning and multistep predictions. Differentiable Programming Differentiable programming is a programming paradigm in which the programs can be differentiated throughout, usually via automatic differentiation. This allows for gradient based optimization of parameters in the program, often via gradient descent. Differentiable programming has found use in areas such as combining deep learning with physics engines in robotics, differentiable ray tracing, and image processing. Differentiable Satisfiability and Differentiable Answer Set Programming(Differentiable SAT/ASP) We propose Differentiable Satisfiability and Differentiable Answer Set Programming (Differentiable SAT/ASP) for multi-model optimization. Models (answer sets or satisfying truth assignments) are sampled using a novel SAT/ASP solving approach which uses a gradient descent-based branching mechanism. Sampling proceeds until the value of a user-defined multi-model cost function reaches a given threshold. As major use cases for our approach we propose distribution-aware model sampling and expressive yet scalable probabilistic logic programming. As our main algorithmic approach to Differentiable SAT/ASP, we introduce an enhancement of the state-of-the-art CDNL/CDCL algorithm for SAT/ASP solving. Additionally, we present alternative algorithms which use an unmodified ASP solver (Clingo/clasp) and map the optimization task to conventional answer set optimization or use so-called propagators. We also report on the open source software DelSAT, a recent prototype implementation of our main algorithm, and on initial experimental results which indicate that DelSATs performance is, when applied to the use case of probabilistic logic inference, on par with Markov Logic Network (MLN) inference performance, despite having advantageous properties compared to MLNs, such as the ability to express inductive definitions and to work with probabilities as weights directly in all cases. Our experiments also indicate that our main algorithm is strongly superior in terms of performance compared to the presented alternative approaches which reduce a common instance of the general problem to regular SAT/ASP. Differentiable Subset Sampling Many machine learning tasks require sampling a subset of items from a collection. Due to the non-differentiability of subset sampling, the procedure is usually not included in end-to-end deep learning models. We show that through a connection to weighted reservoir sampling, the Gumbel-max trick can be extended to produce exact subset samples, and that a recently proposed top-k relaxation can be used to differentiate through the subset sampling procedure. We test our method on end-to-end tasks requiring subset sampling, including a differentiable k-nearest neighbors task and an instance-wise feature selection task for model interpretability. Differential Adaptive Stress Testing(DAST) Finding the most likely path to a set of failure states is important to the analysis of safety-critical dynamic systems. While efficient solutions exist for certain classes of systems, a scalable general solution for stochastic, partially-observable, and continuous-valued systems remains challenging. Existing approaches in formal and simulation-based methods either cannot scale to large systems or are computationally inefficient. This paper presents adaptive stress testing (AST), a framework for searching a simulator for the most likely path to a failure event. We formulate the problem as a Markov decision process and use reinforcement learning to optimize it. The approach is simulation-based and does not require internal knowledge of the system. As a result, the approach is very suitable for black box testing of large systems. We present formulations for both systems where the state is fully-observable and partially-observable. In the latter case, we present a modified Monte Carlo tree search algorithm that only requires access to the pseudorandom number generator of the simulator to overcome partial observability. We also present an extension of the framework, called differential adaptive stress testing (DAST), that can be used to find failures that occur in one system but not in another. This type of differential analysis is useful in applications such as regression testing, where one is concerned with finding areas of relative weakness compared to a baseline. We demonstrate the effectiveness of the approach on an aircraft collision avoidance application, where we stress test a prototype aircraft collision avoidance system to find high-probability scenarios of near mid-air collisions. Differential Equation Unit(DEU) Most deep neural networks use simple, fixed activation functions, such as sigmoids or rectified linear units, regardless of domain or network structure. We introduce differential equation units (DEUs), an improvement to modern neural networks, which enables each neuron to learn a particular nonlinear activation function from a family of solutions to an ordinary differential equation. Specifically, each neuron may change its functional form during training based on the behavior of the other parts of the network. We show that using neurons with DEU activation functions results in a more compact network capable of achieving comparable, if not superior, performance when is compared to much larger networks. Differential Evolution(DE) In evolutionary computation, differential evolution (DE) is a method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. Such methods are commonly known as metaheuristics as they make few or no assumptions about the problem being optimized and can search very large spaces of candidate solutions. However, metaheuristics such as DE do not guarantee an optimal solution is ever found. DE is used for multidimensional real-valued functions but does not use the gradient of the problem being optimized, which means DE does not require for the optimization problem to be differentiable as is required by classic optimization methods such as gradient descent and quasi-newton methods. DE can therefore also be used on optimization problems that are not even continuous, are noisy, change over time, etc. DE optimizes a problem by maintaining a population of candidate solutions and creating new candidate solutions by combining existing ones according to its simple formulae, and then keeping whichever candidate solution has the best score or fitness on the optimization problem at hand. In this way the optimization problem is treated as a black box that merely provides a measure of quality given a candidate solution and the gradient is therefore not needed. DE is originally due to Storn and Price. Books have been published on theoretical and practical aspects of using DE in parallel computing, multiobjective optimization, constrained optimization, and the books also contain surveys of application areas. http://…/9783540209508 Differential Expression Analysis DEVis Differential Fairness We introduce a measure of fairness for algorithms and data with regard to multiple protected attributes. Our proposed definition, differential fairness, is informed by the framework of intersectionality, which analyzes how interlocking systems of power and oppression affect individuals along overlapping dimensions including race, gender, sexual orientation, class, and disability. We show that our criterion behaves sensibly for any subset of the protected attributes, and we illustrate links to differential privacy. A case study on census data demonstrates the utility of our approach. Differential Generative Adversarial Network(D-GAN) In face-related applications with a public available dataset, synthesizing non-linear facial variations (e.g., facial expression, head-pose, illumination, etc.) through a generative model is helpful in addressing the lack of training data. In reality, however, there is insufficient data to even train the generative model for face synthesis. In this paper, we propose Differential Generative Adversarial Networks (D-GAN) that can perform photo-realistic face synthesis even when training data is small. Two adversarial networks are devised to ensure the generator to approximate a face manifold, which can express face changes as it wants. Experimental results demonstrate that the proposed method is robust to the amount of training data and synthesized images are useful to improve the performance of a face expression classifier. Differential Item Functioning(DIF) Differential item functioning (DIF), also referred to as measurement bias, occurs when people from different groups (commonly gender or ethnicity) with the same latent trait (ability/skill) have a different probability of giving a certain response on a questionnaire or test. DIF analysis provides an indication of unexpected behavior of items on a test. An item does not display DIF if people from different groups have a different probability to give a certain response; it displays DIF if and only if people from different groups with the same underlying true ability have a different probability of giving a certain response. Common procedures for assessing DIF are Mantel-Haenszel, item response theory (IRT) based methods, and logistic regression. difR Differential Message Importance Measure(DMIM) Information collection is a fundamental problem in big data, where the size of sampling sets plays a very important role. This work considers the information collection process by taking message importance into account. Similar to differential entropy, we define differential message importance measure (DMIM) as a measure of message importance for continuous random variable. It is proved that the change of DMIM can describe the gap between the distribution of a set of sample values and a theoretical distribution. In fact, the deviation of DMIM is equivalent to Kolmogorov-Smirnov statistic, but it offers a new way to characterize the distribution goodness-of-fit. Numerical results show some basic properties of DMIM and the accuracy of the proposed approximate values. Furthermore, it is also obtained that the empirical distribution approaches the real distribution with decreasing of the DMIM deviation, which contributes to the selection of suitable sampling points in actual system. Differential Privacy In cryptography, differential privacy aims to provide means to maximize the accuracy of queries from statistical databases while minimizing the chances of identifying its records. Differential Privacy Cleaning Data cleaning, or the process of detecting and repairing inaccurate or corrupt records in the data, is inherently human-driven. State of the art systems assume cleaning experts can access the data (or a sample of it) to tune the cleaning process. However, in many cases, privacy constraints disallow unfettered access to the data. To address this challenge, we observe and provide empirical evidence that data cleaning can be achieved without access to the sensitive data, but with access to a (noisy) query interface that supports a small set of linear counting query primitives. Motivated by this, we present DPClean, a first of a kind system that allows engineers tune data cleaning workflows while ensuring differential privacy. In DPClean, a cleaning engineer can pose sequences of aggregate counting queries with error tolerances. A privacy engine translates each query into a differentially private mechanism that returns an answer with error matching the specified tolerance, and allows the data owner track the overall privacy loss. With extensive experiments using human and simulated cleaning engineers on blocking and matching tasks, we demonstrate that our approach is able to achieve high cleaning quality while ensuring a reasonable privacy loss. Differential Temporal Difference Learning Value functions derived from Markov decision processes arise as a central component of algorithms as well as performance metrics in many statistics and engineering applications of machine learning techniques. Computation of the solution to the associated Bellman equations is challenging in most practical cases of interest. A popular class of approximation techniques, known as Temporal Difference (TD) learning algorithms, are an important sub-class of general reinforcement learning methods. The algorithms introduced in this paper are intended to resolve two well-known difficulties of TD-learning approaches: Their slow convergence due to very high variance, and the fact that, for the problem of computing the relative value function, consistent algorithms exist only in special cases. First we show that the gradients of these value functions admit a representation that lends itself to algorithm design. Based on this result, a new class of differential TD-learning algorithms is introduced. For Markovian models on Euclidean space with smooth dynamics, the algorithms are shown to be consistent under general conditions. Numerical results show dramatic variance reduction when compared to standard methods. Differentially Private Autoencoder-Based Generative Model(DP-AuGM) Deep neural networks (DNNs) have recently been widely adopted in various applications, and such success is largely due to a combination of algorithmic breakthroughs, computation resource improvements, and access to a large amount of data. However, the large-scale data collections required for deep learning often contain sensitive information, therefore raising many privacy concerns. Prior research has shown several successful attacks in inferring sensitive training data information, such as model inversion, membership inference, and generative adversarial networks (GAN) based leakage attacks against collaborative deep learning. In this paper, to enable learning efficiency as well as to generate data with privacy guarantees and high utility, we propose a differentially private autoencoder-based generative model (DP-AuGM) and a differentially private variational autoencoder-based generative model (DP-VaeGM). We evaluate the robustness of two proposed models. We show that DP-AuGM can effectively defend against the model inversion, membership inference, and GAN-based attacks. We also show that DP-VaeGM is robust against the membership inference attack. We conjecture that the key to defend against the model inversion and GAN-based attacks is not due to differential privacy but the perturbation of training data. Finally, we demonstrate that both DP-AuGM and DP-VaeGM can be easily integrated with real-world machine learning applications, such as machine learning as a service and federated learning, which are otherwise threatened by the membership inference attack and the GAN-based attack, respectively. Differentially Private Continual Learning Catastrophic forgetting can be a significant problem for institutions that must delete historic data for privacy reasons. For example, hospitals might not be able to retain patient data permanently. But neural networks trained on recent data alone will tend to forget lessons learned on old data. We present a differentially private continual learning framework based on variational inference. We estimate the likelihood of past data given the current model using differentially private generative models of old datasets. Differentially Private Markov Chain Monte Carlo(DPMCMC) Recent developments in differentially private (DP) machine learning and DP Bayesian learning have enabled learning under strong privacy guarantees for the training data subjects. In this paper, we further extend the applicability of DP Bayesian learning by presenting the first general DP Markov chain Monte Carlo (MCMC) algorithm whose privacy-guarantees are not subject to unrealistic assumptions on Markov chain convergence and that is applicable to posterior inference in arbitrary models. Our algorithm is based on a decomposition of the Barker acceptance test that allows evaluating the R\’enyi DP privacy cost of the accept-reject choice. We further show how to improve the DP guarantee through data subsampling and approximate acceptance tests. Differentially Private Regression for Discrete-Time Survival Analysis In survival analysis, regression models are used to understand the effects of explanatory variables (e.g., age, sex, weight, etc.) to the survival probability. However, for sensitive survival data such as medical data, there are serious concerns about the privacy of individuals in the data set when medical data is used to fit the regression models. The closest work addressing such privacy concerns is the work on Cox regression which linearly projects the original data to a lower dimensional space. However, the weakness of this approach is that there is no formal privacy guarantee for such projection. In this work, we aim to propose solutions for the regression problem in survival analysis with the protection of differential privacy which is a golden standard of privacy protection in data privacy research. To this end, we extend the Output Perturbation and Objective Perturbation approaches which are originally proposed to protect differential privacy for the Empirical Risk Minimization (ERM) problems. In addition, we also propose a novel sampling approach based on the Markov Chain Monte Carlo (MCMC) method to practically guarantee differential privacy with better accuracy. We show that our proposed approaches achieve good accuracy as compared to the non-private results while guaranteeing differential privacy for individuals in the private data set. Diffix A longstanding open problem is that of how to get high quality statistics through direct queries to databases containing information about individuals without revealing information specific to those individuals. Diffix is a new framework for anonymous database query that adds noise based on the filter conditions in the query. A previous paper described Diffix for a simplified query semantics. This paper extends that description to include a wide variety of common features found in SQL. It describes attacks associated with various features, and the anonymization steps used to defend against those attacks. This paper describes the version of Diffix used for bounty program sponsored by Aircloak starting December 2017. DiffPool Recently, graph neural networks (GNNs) have revolutionized the field of graph representation learning through effectively learned node embeddings, and achieved state-of-the-art results in tasks such as node classification and link prediction. However, current GNN methods are inherently flat and do not learn hierarchical representations of graphs—a limitation that is especially problematic for the task of graph classification, where the goal is to predict the label associated with an entire graph. Here we propose DiffPool, a differentiable graph pooling module that can generate hierarchical representations of graphs and can be combined with various graph neural network architectures in an end-to-end fashion. DiffPool learns a differentiable soft cluster assignment for nodes at each layer of a deep GNN, mapping nodes to a set of clusters, which then form the coarsened input for the next GNN layer. Our experimental results show that combining existing GNN methods with DiffPool yields an average improvement of 5-10% accuracy on graph classification benchmarks, compared to all existing pooling approaches, achieving a new state-of-the-art on four out of five benchmark data sets. DiffSharp DiffSharp is an algorithmic differentiation or automatic differentiation (AD) library for the .NET ecosystem, which is targeted by the C# and F# languages, among others. The library has been designed with machine learning applications in mind, allowing very succinct implementations of models and optimization routines. DiffSharp is implemented in F# and exposes forward and reverse AD operators as general nestable higher-order functions, usable by any .NET language. It provides high-performance linear algebra primitives—scalars, vectors, and matrices, with a generalization to tensors underway—that are fully supported by all the AD operators, and which use a BLAS/LAPACK backend via the highly optimized OpenBLAS library. DiffSharp currently uses operator overloading, but we are developing a transformation-based version of the library using F#’s ‘code quotation’ metaprogramming facility. Work on a CUDA-based GPU backend is also underway. Diffusion Based Network Embedding In network embedding, random walks play a fundamental role in preserving network structures. However, random walk based embedding methods have two limitations. First, random walk methods are fragile when the sampling frequency or the number of node sequences changes. Second, in disequilibrium networks such as highly biases networks, random walk methods often perform poorly due to the lack of global network information. In order to solve the limitations, we propose in this paper a network diffusion based embedding method. To solve the first limitation, our method employs a diffusion driven process to capture both depth information and breadth information. The time dimension is also attached to node sequences that can strengthen information preserving. To solve the second limitation, our method uses the network inference technique based on cascades to capture the global network information. To verify the performance, we conduct experiments on node classification tasks using the learned representations. Results show that compared with random walk based methods, diffusion based models are more robust when samplings under each node is rare. We also conduct experiments on a highly imbalanced network. Results shows that the proposed model are more robust under the biased network structure. Diffusion Map Diffusion maps is a machine learning algorithm introduced by R. R. Coifman and S. Lafon. It computes a family of embeddings of a data set into Euclidean space (often low-dimensional) whose coordinates can be computed from the eigenvectors and eigenvalues of a diffusion operator on the data. The Euclidean distance between points in the embedded space is equal to the “diffusion distance” between probability distributions centered at those points. Different from other dimensionality reduction methods such as principal component analysis (PCA) and multi-dimensional scaling (MDS), diffusion maps is a non-linear method that focuses on discovering the underlying manifold that the data has been sampled from. By integrating local similarities at different scales, diffusion maps gives a global description of the data-set. Compared with other methods, the diffusion maps algorithm is robust to noise perturbation and is computationally inexpensive. Diffusion Variational Autoencoder A standard Variational Autoencoder, with a Euclidean latent space, is structurally incapable of capturing topological properties of certain datasets. To remove topological obstructions, we introduce Diffusion Variational Autoencoders with arbitrary manifolds as a latent space. A Diffusion Variational Autoencoder uses transition kernels of Brownian motion on the manifold. In particular, it uses properties of the Brownian motion to implement the reparametrization trick and fast approximations to the KL divergence. We show that the Diffusion Variational Autoencoder is capable of capturing topological properties of synthetic datasets. Additionally, we train MNIST on spheres, tori, projective spaces, SO(3), and a torus embedded in R3. Although a natural dataset like MNIST does not have latent variables with a clear-cut topological structure, training it on a manifold can still highlight topological and geometrical properties. digGOF This paper concerns the problem of applying the generalized goodness-of-fit (gGOF) type tests for analyzing correlated data. The gGOF family broadly covers the maximum-based testing procedures by ordered input $p$-values, such as the false discovery rate procedure, the Kolmogorov-Smirnov type statistics, the $\phi$-divergence family, etc. Data analysis framework and a novel $p$-value calculation approach is developed under the Gaussian mean model and the generalized linear model (GLM). We reveal the influence of data transformations to the signal-to-noise ratio and the statistical power under both sparse and dense signal patterns and various correlation structures. In particular, the innovated transformation (IT), which is shown equivalent to the marginal model-fitting under the GLM, is often preferred for detecting sparse signals in correlated data. We propose a testing strategy called the digGOF, which combines a double-adaptation procedure (i.e., adapting to both the statistic’s formula and the truncation scheme of the input $p$-values) and the IT within the gGOF family. It features efficient computation and robust adaptation to the family-retained advantages for given data. Relevant approaches are assessed by extensive simulations and by genetic studies of Crohn’s disease and amyotrophic lateral sclerosis. Computations have been included into the R package SetTest available on CRAN. Digital Analytics Digital analytics is the analysis of qualitative and quantitative data from your business and the competition to drive a continual improvement of the online experience that your customers and potential customers have which translates to your desired outcomes (both online and offline). One of the most important steps of digital analytics is determining what your ultimate business objectives or outcomes are and how you expect to measure those outcomes. In the online world, there are five common business objectives: · For ecommerce sites, an obvious objective is selling products or services. · For lead generation sites, the goal is to collect user information for sales teams to connect with potential leads. · For content publishers, the goal is to encourage engagement and frequent visitation. · For online informational or support sites, helping users find the information they need at the right time is of primary importance. · For branding, the main objective is to drive awareness, engagement and loyalty. There are key actions on any website or mobile application that tie back to a business’ objectives. The actions can indicate an objective, like a purchase on an ecommerce site, has been fully met. These are ‘macro’ conversions. Some of the actions on a site might also be behavioral indicators that a customer hasn’t fully reached your main objectives but is coming closer, like, in the ecommerce example, signing up to receive an email coupon or a new product notification. These are ‘micro’ conversions. It’s important to measure both micro and macro conversions so that you are equipped with more behavioral data to understand what experiences help drive the right outcomes for your site. ➘ “Web Analytics” Digital Analytics Association(DAA) The Digital Analytics Association makes analytics professionals more effective and valuable through professional development and community. Digital Asset Management(DAM) Digital asset management (DAM) consists of management tasks and decisions surrounding the ingestion, annotation, cataloguing, storage, retrieval and distribution of digital assets. Digital photographs, animations, videos and music exemplify the target areas of media asset management (a sub-category of DAM). Digital asset management systems (DAMS) include computer software and hardware systems that aid in the process of digital asset management. The term “digital asset management” (DAM) also refers to the protocol for downloading, renaming, backing up, rating, grouping, archiving, optimizing, maintaining, thinning, and exporting files. The “media asset management” (MAM) sub-category of digital asset management mainly addresses audio, video and other media content. The more recent concept of enterprise content management (ECM) often deals with solutions which address similar features but in a wider range of industries or applications. Digital Decisioning Platform Digital Decisioning Platforms is a new segment identified by Forrester that marries Business Process Automation, Business Rules Management, and Advanced Analytics. For platform developers it’s a new way to slice the market. For users it eases integration of predictive models into the production environment. Digital Hoarding Digital hoarding (also known as e-hoarding) is excessive acquisition and reluctance to delete electronic material no longer valuable to the user. The behavior includes the mass storage of digital artifacts and the retainment of unnecessary or irrelevant electronic data. The term is increasingly common in pop culture, used to describe the habitual characteristics of compulsive hoarding, but in cyberspace. As with physical space in which excess items are described as ‘clutter’ or ‘junk,’ excess digital media is often referred to as ‘digital clutter.’ Digital Native(DN) The term Digital Native was coined and popularized by education consultant, Marc Prensky in his 2001 article entitled Digital Natives, Digital Immigrants, in which he relates the contemporaneous decline in American education to educators’ failure to understand the needs of modern students. His article posited that ‘the arrival and rapid dissemination of digital technology in the last decade of the 20th century’ had fundamentally changed the way students think and process information, making it impossible for them to excel academically using the outdated teaching methods of the day. In other words, children raised in the post-digital, media saturated world, require a media-rich learning environment to hold their attention. Contextually, his ideas were introduced after a decade of worry over increased diagnosis of children with ADD and ADHD, which itself turned out to be largely overblown. Prensky did not strictly define the Digital Native in his 2001 article, but it was later, somewhat arbitrarily, applied to children born after 1980, due to the fact that computer bulletin board systems, and Usenet were already in use at the time. The idea became popular among educators and parents, whose children fell within Prensky’s definition of a Digital Native, and has since been embraced as an effective marketing tool. Digital Neuron We propose a Digital Neuron, a hardware inference accelerator for convolutional deep neural networks with integer inputs and integer weights for embedded systems. The main idea to reduce circuit area and power consumption is manipulating dot products between input feature and weight vectors by Barrel shifters and parallel adders. The reduced area allows the more computational engines to be mounted on an inference accelerator, resulting in high throughput compared to prior HW accelerators. We verified that the multiplication of integer numbers with 3-partial sub-integers does not cause significant loss of inference accuracy compared to 32-bit floating point calculation. The proposed digital neuron can perform 800 MAC operations in one clock for computation for convolution as well as full-connection. This paper provides a scheme that reuses input, weight, and output of all layers to reduce DRAM access. In addition, this paper proposes a configurable architecture that can provide inference of adaptable feature of convolutional neural networks. The throughput in terms of Watt of the digital neuron is achieved 754.7 GMACs/W. Digital Passport In order to prevent deep neural networks from being infringed by unauthorized parties, we propose a generic solution which embeds a designated digital passport into a network, and subsequently, either paralyzes the network functionalities for unauthorized usages or maintain its functionalities in the presence of a verified passport. Such a desired network behavior is successfully demonstrated in a number of implementation schemes, which provide reliable, preventive and timely protections against tens of thousands of fake-passport deceptions. Extensive experiments also show that the deep neural network performance under unauthorized usages deteriorate significantly (e.g. with 33% to 82% reductions of CIFAR10 classification accuracies), while networks endorsed with valid passports remain intact. Digital Twin Digital twin refers to a digital replica of physical assets, processes and systems that can be used for various purposes. The digital representation provides both the elements and the dynamics of how an Internet of Things device operates and lives throughout its life cycle. Digital Twins integrate artificial intelligence, machine learning and software analytics with data to create living digital Simulation models that update and change as their physical counterparts’ change. A digital twin continuously learns and updates itself from multiple sources to represent their near real-time status, working condition or position. This learning system, learns from itself, using sensor data that conveys various aspects of its operating condition; from human experts, such as engineers with deep and relevant industry domain knowledge; from other similar machines; from other similar fleets of machines; and from the larger systems and environment in which it may be a part of. A digital twin also integrates historical data from past machine usage to factor into its digital model. Dijkstra Algorithm Dijkstra’s algorithm, conceived by computer scientist Edsger Dijkstra in 1956 and published in 1959, is a graph search algorithm that solves the single-source shortest path problem for a graph with non-negative edge path costs, producing a shortest path tree. This algorithm is often used in routing and as a subroutine in other graph algorithms. Dilated Convolution The idea of Dilated Convolution is come from the wavelet decomposition. It is also called ‘atrous convolution’, ‘algorithme à trous’ and ‘hole algorithm’. Thus, any ideas from the past are still useful if we can turn them into the deep learning framework. Dilated Recurrent Neural Network(DILATEDRNN) Notoriously, learning with recurrent neural networks (RNNs) on long sequences is a difficult task. There are three major challenges: 1) extracting complex dependencies, 2) vanishing and exploding gradients, and 3) efficient parallelization. In this paper, we introduce a simple yet effective RNN connection structure, the DILATEDRNN, which simultaneously tackles all these challenges. The proposed architecture is characterized by multi-resolution dilated recurrent skip connections and can be combined flexibly with different RNN cells. Moreover, the DILATEDRNN reduces the number of parameters and enhances training efficiency significantly, while matching state-of-the-art performance (even with Vanilla RNN cells) in tasks involving very long-term dependencies. To provide a theory-based quantification of the architecture’s advantages, we introduce a memory capacity measure – the mean recurrent length, which is more suitable for RNNs with long skip connections than existing measures. We rigorously prove the advantages of the DILATEDRNN over other recurrent neural architectures. Dilated-Residual U-Net Deep Learning Network(DRUNET) Given that the neural and connective tissues of the optic nerve head (ONH) exhibit complex morphological changes with the development and progression of glaucoma, their simultaneous isolation from optical coherence tomography (OCT) images may be of great interest for the clinical diagnosis and management of this pathology. A deep learning algorithm was designed and trained to digitally stain (i.e. highlight) 6 ONH tissue layers by capturing both the local (tissue texture) and contextual information (spatial arrangement of tissues). The overall dice coefficient (mean of all tissues) was $0.91 \pm 0.05$ when assessed against manual segmentations performed by an expert observer. We offer here a robust segmentation framework that could be extended for the automated parametric study of the ONH tissues. Dimensional Clustering This paper introduces a new clustering technique, called {\em dimensional clustering}, which clusters each data point by its latent {\em pointwise dimension}, which is a measure of the dimensionality of the data set local to that point. Pointwise dimension is invariant under a broad class of transformations. As a result, dimensional clustering can be usefully applied to a wide range of datasets. Concretely, we present a statistical model which estimates the pointwise dimension of a dataset around the points in that dataset using the distance of each point from its $n^{\text{th}}$ nearest neighbor. We demonstrate the applicability of our technique to the analysis of dynamical systems, images, and complex human movements. Dimensional Collapse … When I stared at the plot, I ask myself, why not map the x-axis information of the points to the very first one according to the y-axis ‘connections’. When everything goes well and all done, all the grey points should be mapped along the red arrows to the first marks of the groups, and there should be only 4 marks leave on x-axis: a, b, d and g, instead of 9 marks in the first place. And the y-axis information, after contributing all the ‘connection rules’, can be put away now, since the left x-axis marks are exactly what I want: the final flags. It is why I like to call it ‘Dimensional Collapse’. … Dimensionality Reduction In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction. DimmWitted A storage abstraction that captures the access patterns of popular statistical analytics tasks and a prototype called DimmWitted. dimple dimple is a simple-to-use charting API powered by D3.js. The aim of dimple is to open up the power and flexibility of d3 to analysts. It aims to give a gentle learning curve and minimal code to achieve something productive. It also exposes the d3 objects so you can pick them up and run to create some really cool stuff. DiNoDB As data sets grow in size, analytics applications struggle to get instant insight into large datasets. Modern applications involve heavy batch processing jobs over large volumes of data and at the same time require efficient ad-hoc interactive analytics on temporary data. Existing solutions, however, typically focus on one of these two aspects, largely ignoring the need for synergy between the two. Consequently, interactive queries need to re-iterate costly passes through the entire dataset (e.g., data loading) that may provide meaningful return on investment only when data is queried over a long period of time. In this paper, we propose DiNoDB, an interactive-speed query engine for ad-hoc queries on temporary data. DiNoDB avoids the expensive loading and transformation phase that characterizes both traditional RDBMSs and current interactive analytics solutions. It is tailored to modern workflows found in machine learning and data exploration use cases, which often involve iterations of cycles of batch and interactive analytics on data that is typically useful for a narrow processing window. The key innovation of DiNoDB is to piggyback on the batch processing phase the creation of metadata that DiNoDB exploits to expedite the interactive queries. Our experimental analysis demonstrates that DiNoDB achieves very good performance for a wide range of ad-hoc queries compared to alternatives %such as Hive, Stado, SparkSQL and Impala. Dionysius We address the following problem: How do we incorporate user item interaction signals as part of the relevance model in a large-scale personalized recommendation system such that, (1) the ability to interpret the model and explain recommendations is retained, and (2) the existing infrastructure designed for the (user profile) content-based model can be leveraged? We propose Dionysius, a hierarchical graphical model based framework and system for incorporating user interactions into recommender systems, with minimal change to the underlying infrastructure. We learn a hidden fields vector for each user by considering the hierarchy of interaction signals, and replace the user profile-based vector with this learned vector, thereby not expanding the feature space at all. Thus, our framework allows the use of existing recommendation infrastructure that supports content based features. We implemented and deployed this system as part of the recommendation platform at LinkedIn for more than one year. We validated the efficacy of our approach through extensive offline experiments with different model choices, as well as online A/B testing experiments. Our deployment of this system as part of the job recommendation engine resulted in significant improvement in the quality of retrieved results, thereby generating improved user experience and positive impact for millions of users. dipm-SC How is popularity gained online? Is being successful strictly related to rapidly becoming viral in an online platform or is it possible to acquire popularity in a steady and disciplined fashion? What are other temporal characteristics that can unveil the popularity of online content? To answer these questions, we leverage a multi-faceted temporal analysis of the evolution of popular online contents. Here, we present dipm-SC: a multi-dimensional shape-based time-series clustering algorithm with a heuristic to find the optimal number of clusters. First, we validate the accuracy of our algorithm on synthetic datasets generated from benchmark time series models. Second, we show that dipm-SC can uncover meaningful clusters of popularity behaviors in a real-world Twitter dataset. By clustering the multidimensional time-series of the popularity of contents coupled with other domain-specific dimensions, we uncover two main patterns of popularity: bursty and steady temporal behaviors. Moreover, we find that the way popularity is gained over time has no significant impact on the final cumulative popularity. DiracNet Deep neural networks with skip-connections, such as ResNet, show excellent performance in various image classification benchmarks. It is though observed that the initial motivation behind them – training deeper networks – does not actually hold true, and the benefits come from increased capacity, rather than from depth. Motivated by this, and inspired from ResNet, we propose a simple Dirac weight parameterization, which allows us to train very deep plain networks without skip-connections, and achieve nearly the same performance. This parameterization has a minor computational cost at training time and no cost at all at inference. We’re able to achieve 95.5% accuracy on CIFAR-10 with 34-layer deep plain network, surpassing 1001-layer deep ResNet, and approaching Wide ResNet. Our parameterization also mostly eliminates the need of careful initialization in residual and non-residual networks. The code and models for our experiments are available at https://…/diracnets Direct Optimization The Deep Learning (DL) community sees many novel topologies published each year. Achieving high performance on each new topology remains challenging, as each requires some level of manual effort. This issue is compounded by the proliferation of frameworks and hardware platforms. The current approach, which we call ‘direct optimization’, requires deep changes within each framework to improve the training performance for each hardware backend (CPUs, GPUs, FPGAs, ASICs) and requires $\mathcal{O}(fp)$ effort; where $f$ is the number of frameworks and $p$ is the number of platforms. While optimized kernels for deep-learning primitives are provided via libraries like Intel Math Kernel Library for Deep Neural Networks (MKL-DNN), there are several compiler-inspired ways in which performance can be further optimized. Building on our experience creating neon (a fast deep learning library on GPUs), we developed Intel nGraph, a soon to be open-sourced C++ library to simplify the realization of optimized deep learning performance across frameworks and hardware platforms. Initially-supported frameworks include TensorFlow, MXNet, and Intel neon framework. Initial backends are Intel Architecture CPUs (CPU), the Intel(R) Nervana Neural Network Processor(R) (NNP), and NVIDIA GPUs. Currently supported compiler optimizations include efficient memory management and data layout abstraction. In this paper, we describe our overall architecture and its core components. In the future, we envision extending nGraph API support to a wider range of frameworks, hardware (including FPGAs and ASICs), and compiler optimizations (training versus inference optimizations, multi-node and multi-device scaling via efficient sub-graph partitioning, and HW-specific compounding of operations). Directed Acyclic Graph(DAG) In mathematics and computer science, a directed acyclic graph (DAG), is a directed graph with no directed cycles. That is, it is formed by a collection of vertices and directed edges, each edge connecting one vertex to another, such that there is no way to start at some vertex v and follow a sequence of edges that eventually loops back to v again. dimple Directed Acyclic Graph Auto-Regressive(DAGAR) Directed Exploration Learning(DEL) We address reinforcement learning problems with finite state and action spaces where the underlying MDP has some known structure that could be potentially exploited to minimize the exploration of suboptimal (state, action) pairs. For any arbitrary structure, we derive problem-specific regret lower bounds satisfied by any learning algorithm. These lower bounds are made explicit for unstructured MDPs and for those whose transition probabilities and average reward function are Lipschitz continuous w.r.t. the state and action. For Lipschitz MDPs, the bounds are shown not to scale with the sizes $S$ and $A$ of the state and action spaces, i.e., they are smaller than $c \log T$ where $T$ is the time horizon and the constant $c$ only depends on the Lipschitz structure, the span of the bias function, and the minimal action sub-optimality gap. This contrasts with unstructured MDPs where the regret lower bound typically scales as $SA \log T$ . We devise DEL (Directed Exploration Learning), an algorithm that matches our regret lower bounds. We further simplify the algorithm for Lipschitz MDPs, and show that the simplified version is still able to efficiently exploit the structure. Directed Graph In mathematics, and more specifically in graph theory, a directed graph (or digraph) is a graph, or set of nodes connected by edges, where the edges have a direction associated with them. In formal terms, a digraph is a pair G=(V,A) (sometimes G=(V,E)) of: · a set V, whose elements are called vertices or nodes, · a set A of ordered pairs of vertices, called arcs, directed edges, or arrows (and sometimes simply edges with the corresponding set named E instead of A). It differs from an ordinary or undirected graph, in that the latter is defined in terms of unordered pairs of vertices, which are usually called edges. A digraph is called ‘simple’ if it has no loops, and no multiple arcs (arcs with same starting and ending nodes). A directed multigraph, in which the arcs constitute a multiset, rather than a set, of ordered pairs of vertices may have loops (that is, ‘self-loops’ with same starting and ending node) and multiple arcs. Some, but not all, texts allow a digraph, without the qualification simple, to have self loops, multiple arcs, or both. Directed Random Geometric Graph(DRGG) Many real-world networks are intrinsically directed. Such networks include activation of genes, hyperlinks on the internet, and the network of followers on Twitter among many others. The challenge, however, is to create a network model that has many of the properties of real-world networks such as powerlaw degree distributions and the small-world property. To meet these challenges, we introduce the \textit{Directed} Random Geometric Graph (DRGG) model, which is an extension of the random geometric graph model. We prove that it is scale-free with respect to the indegree distribution, has binomial outdegree distribution, has a high clustering coefficient, has few edges and is likely small-world. These are some of the main features of aforementioned real world networks. We empirically observe that word association networks have many of the theoretical properties of the DRGG model. directional Bat Algorithm(dBA) Bat algorithm (BA) is a recent optimization algorithm based on swarm intelligence and inspiration from the echolocation behavior of bats. One of the issues in the standard bat algorithm is the premature convergence that can occur due to the low exploration ability of the algorithm under some conditions. To overcome this deficiency, directional echolocation is introduced to the standard bat algorithm to enhance its exploration and exploitation capabilities. In addition to such directional echolocation, three other improvements have been embedded into the standard bat algorithm to enhance its performance. The new proposed approach, namely the directional Bat Algorithm (dBA), has been then tested using several standard and non-standard benchmarks from the CEC’2005 benchmark suite. The performance of dBA has been compared with ten other algorithms and BA variants using non-parametric statistical tests. The statistical test results show the superiority of the directional bat algorithm. Directional Statistics Directional statistics is the subdiscipline of statistics that deals with directions (unit vectors in Rn), axes (lines through the origin in Rn) or rotations in Rn. More generally, directional statistics deals with observations on compact Riemannian manifolds. The fact that 0 degrees and 360 degrees are identical angles, so that for example 180 degrees is not a sensible mean of 2 degrees and 358 degrees, provides one illustration that special statistical methods are required for the analysis of some types of data (in this case, angular data). Other examples of data that may be regarded as directional include statistics involving temporal periods (e.g. time of day, week, month, year, etc.), compass directions, dihedral angles in molecules, orientations, rotations and so on. Directional DiReliefF Feature selection (FS) is a key research area in the machine learning and data mining fields, removing irrelevant and redundant features usually helps to reduce the effort required to process a dataset while maintaining or even improving the processing algorithm’s accuracy. However, traditional algorithms designed for executing on a single machine lack scalability to deal with the increasing amount of data that has become available in the current Big Data era. ReliefF is one of the most important algorithms successfully implemented in many FS applications. In this paper, we present a completely redesigned distributed version of the popular ReliefF algorithm based on the novel Spark cluster computing model that we have called DiReliefF. Spark is increasing its popularity due to its much faster processing times compared with Hadoop’s MapReduce model implementation. The effectiveness of our proposal is tested on four publicly available datasets, all of them with a large number of instances and two of them with also a large number of features. Subsets of these datasets were also used to compare the results to a non-distributed implementation of the algorithm. The results show that the non-distributed implementation is unable to handle such large volumes of data without specialized hardware, while our design can process them in a scalable way with much better processing times and memory usage. Dirichlet Distribution When it comes to recommendation systems and natural language processing, data that can be modeled as a multinomial or as a vector of counts is ubiquitous. For example if there are 2 possible user-generated ratings (like and dislike), then each item is represented as a vector of 2 counts. In a higher dimensional case, each document may be expressed as a count of words, and the vector size is large enough to encompass all the important words in that corpus of documents. The Dirichlet distribution is one of the basic probability distributions for describing this type of data. Dirichlet Lasso(DLASSO) Selection of the most important predictor variables in regression analysis is one of the key problems statistical research has been concerned with for long time. In this article, we propose the methodology, Dirichlet Lasso (abbreviated as DLASSO) to address this issue in a Bayesian framework. In many modern regression settings, large set of predictor variables are grouped and the coefficients belonging to any one of these groups are either all redundant or all important in predicting the response; we say in those cases that the predictors exhibit a group structure. We show that DLASSO is particularly useful where the group structure is not fully known. We exploit the clustering property of Dirichlet Process priors to infer the possibly missing group information. The Dirichlet Process has the advantage of simultaneously clustering the variable coefficients and selecting the best set of predictor variables. We compare the predictive performance of DLASSO to Group Lasso and ordinary Lasso with real data and simulation studies. Our results demonstrate that the predictive performance of DLASSO is almost as good as that of Group Lasso when group label information is given; and superior to the ordinary Lasso for missing group information. For high dimensional data (e.g., genetic data) with missing group information, DLASSO will be a powerful approach of variable selection since it provides a superior predictive performance and higher statistical accuracy. ➘ “Least Absolute Shrinkage and Selection Operator” Dirichlet Process(DP) In probability theory, a Dirichlet process is a way of assigning a probability distribution over probability distributions. That is, a Dirichlet process is a probability distribution whose domain is itself a set of probability distributions. The probability distributions in the domain are almost surely discrete and may be infinite dimensional. Assigning an arbitrary probability distribution over a domain of infinite dimensional probability distributions would require an infinite amount of computational resources. The main function of the Dirichlet process is that it allows the specification of a distribution over infinite dimensional distributions in a way that uses only finite resources. Dirichlet Process Forest Methods based on Bayesian decision tree ensembles have proven valuable in constructing high-quality predictions, and are particularly attractive in certain settings because they encourage low-order interaction effects. Despite adapting to the presence of low-order interactions for prediction purpose, we show that Bayesian decision tree ensembles are generally anti-conservative for the purpose of conducting interaction detection. We address this problem by introducing Dirichlet process forests (DP-Forests), which leverage the presence of low-order interactions by clustering the trees so that trees within the same cluster focus on detecting a specific interaction. We show on both simulated and benchmark data that DP-Forests perform well relative to existing interaction detection techniques for detecting low-order interactions, attaining very low false-positive and false-negative rates while maintaining the same performance for prediction using a comparable computational budget. Dirichlet Process Mixture Model(DPMM) The Dirichlet process is a family of non-parametric Bayesian models which are commonly used for density estimation, semi-parametric modelling and model selection/averaging. The Dirichlet processes are non-parametric in a sense that they have infinite number of parameters. Since they are treated in a Bayesian approach we are able to construct large models with infinite parameters which we integrate out to avoid overfitting. Dirichlet Variational Autoencoder(DirVAE) This paper proposes Dirichlet Variational Autoencoder (DirVAE) using a Dirichlet prior for a continuous latent variable that exhibits the characteristic of the categorical probabilities. To infer the parameters of DirVAE, we utilize the stochastic gradient method by approximating the Gamma distribution, which is a component of the Dirichlet distribution, with the inverse Gamma CDF approximation. Additionally, we reshape the component collapsing issue by investigating two problem sources, which are decoder weight collapsing and latent value collapsing, and we show that DirVAE has no component collapsing; while Gaussian VAE exhibits the decoder weight collapsing and Stick-Breaking VAE shows the latent value collapsing. The experimental results show that 1) DirVAE models the latent representation result with the best log-likelihood compared to the baselines; and 2) DirVAE produces more interpretable latent values with no collapsing issues which the baseline models suffer from. Also, we show that the learned latent representation from the DirVAE achieves the best classification accuracy in the semi-supervised and the supervised classification tasks on MNIST, OMNIGLOT, and SVHN compared to the baseline VAEs. Finally, we demonstrated that the DirVAE augmented topic models show better performances in most cases. Dirichlet-Process-Means(DP-Means) Generalized Dirichlet-process-means for f-separable distortion measures DirNet Recurrent neural networks (RNNs) achieve cutting-edge performance on a variety of problems. However, due to their high computational and memory demands, deploying RNNs on resource constrained mobile devices is a challenging task. To guarantee minimum accuracy loss with higher compression rate and driven by the mobile resource requirement, we introduce a novel model compression approach DirNet based on an optimized fast dictionary learning algorithm, which 1) dynamically mines the dictionary atoms of the projection dictionary matrix within layer to adjust the compression rate 2) adaptively changes the sparsity of sparse codes cross the hierarchical layers. Experimental results on language model and an ASR model trained with a 1000h speech dataset demonstrate that our method significantly outperforms prior approaches. Evaluated on off-the-shelf mobile devices, we are able to reduce the size of original model by eight times with real-time model inference and negligible accuracy loss. Disaggregation Disaggregation is the breakdown of observations, usually within a common branch of a hierarchy, to a more detailed level to that at which detailed observations are taken. Disciplined Convex Optimization An object-oriented modeling language for disciplined convex programming (DCP). It allows the user to formulate convex optimization problems in a natural way following mathematical convention and DCP rules. The system analyzes the problem, verifies its convexity, converts it into a canonical form, and hands it off to an appropriate solver to obtain the solution. ➘ “Disciplined Convex Programming” CVXR Disciplined Convex Programming(DCP) Convex programming is a subclass of nonlinear programming (NLP) that unifies and generalizes least squares (LS), linear programming (LP), and convex quadratic programming (QP). It has become quite popular recently for a number of reasons, including its attractive theoretical properties, efficient numerical algorithms, and practical applications. Nevertheless, there remains a significant impediment to the more widespread adoption of convex programming: the high level of expertise required to use it. We introduce a new modeling methodology called disciplined convex programming. As the term ‘disciplined’ suggests, the methodology imposes a set of conventions that one must follow when constructing convex programs. The conventions are simple and teachable, taken from basic principles of convex analysis, and inspired by the practices of those who regularly study and apply convex optimization today. The conventions do not limit generality; but they do allow much of the manipulation and transformation required to analyze and solve convex programs to be automated. ➘ “Disciplined Convex Optimization” Disciplined Quasiconvex Programming We present a composition rule involving quasiconvex functions that generalizes the classical composition rule for convex functions. This rule complements well-known rules for the curvature of quasiconvex functions under increasing functions and pointwise maximums. We refer to the class of optimization problems generated by these rules, along with a base set of quasiconvex and quasiconcave functions, as disciplined quasiconvex programs. Disciplined quasiconvex programming generalizes disciplined convex programming, the class of optimization problems targeted by most modern domain-specific languages for convex optimization. We describe an implementation of disciplined quasiconvex programming that makes it possible to specify and solve quasiconvex programs in CVXPY 1.0. DisCoCat The Mathematics of Text Structure DiscoFuse A problem of considerable importance within the field of uncertainty quantification (UQ) is the development of efficient methods for the construction of accurate surrogate models. Such efforts are particularly important to applications constrained by high-dimensional uncertain parameter spaces. The difficulty of accurate surrogate modeling in such systems, is further compounded by data scarcity brought about by the large cost of forward model evaluations. Traditional response surface techniques, such as Gaussian process regression (or Kriging) and polynomial chaos are difficult to scale to high dimensions. To make surrogate modeling tractable in expensive high-dimensional systems, one must resort to dimensionality reduction of the stochastic parameter space. A recent dimensionality reduction technique that has shown great promise is the method of active subspaces’. The classical formulation of active subspaces, unfortunately, requires gradient information from the forward model – often impossible to obtain. In this work, we present a simple, scalable method for recovering active subspaces in high-dimensional stochastic systems, without gradient-information that relies on a reparameterization of the orthogonal active subspace projection matrix, and couple this formulation with deep neural networks. We demonstrate our approach on synthetic and real world datasets and show favorable predictive comparison to classical active subspaces. Discounted Cumulative Gain(DCG) Discounted cumulative gain (DCG) is a measure of ranking quality. In information retrieval, it is often used to measure effectiveness of web search engine algorithms or related applications. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom with the gain of each result discounted at lower ranks. Discourse Analysis(DA) Discourse analysis (DA), or discourse studies, is a general term for a number of approaches to analyzing written, vocal, or sign language use or any significant semiotic event. The objects of discourse analysis – discourse, writing, conversation, communicative event – are variously defined in terms of coherent sequences of sentences, propositions, speech, or turns-at-talk. Contrary to much of traditional linguistics, discourse analysts not only study language use ‘beyond the sentence boundary’, but also prefer to analyze ‘naturally occurring’ language use, and not invented examples. Text linguistics is related. The essential difference between discourse analysis and text linguistics is that it aims at revealing socio-psychological characteristics of a person/persons rather than text structure. Discourse analysis has been taken up in a variety of social science disciplines, including linguistics, education, sociology, anthropology, social work, cognitive psychology, social psychology, area studies, cultural studies, international relations, human geography, communication studies, and translation studies, each of which is subject to its own assumptions, dimensions of analysis, and methodologies. Discover, Access, Distill(DAD) DAD is comprised of: · Discover: Find, identify the sources of good data, and the metrics. Sometimes request the data to be created (work with data engineers and business analysts) · Access: Access the data. Sometimes via an API, a web crawler, an Internet download, a database access or sometimes in-memory within a database. · Distill: Extract essence from data, the stuff that leads to decisions, increased ROI, and actions (such as determining optimum bid prices in an automated bidding system). It involves · Exploring the data (creating a data dictionary and exploratory analysis) · Cleaning (removing impurities) · Refining (data summarization, sometimes multiple layers of summarization or hierarchical summarization)- Analyzing: statistical analyses (sometimes including stuff like experimental design that can take place even before the Access stage), both automated and manual. Might or might not require statistical modeling · Presenting results or integrating results in some automated process Discovery Testing Concept Activation Vectors(DTCAV) Interpretability has become an important topic of research as more machine learning (ML) models are deployed and widely used to make important decisions. Due to it’s complexity, i For high-stakes domains such as medical, providing intuitive explanations that can be consumed by domain experts without ML expertise becomes crucial. To this demand, concept-based methods (e.g., TCAV) were introduced to provide explanations using user-chosen high-level concepts rather than individual input features. While these methods successfully leverage rich representations learned by the networks to reveal how human-defined concepts are related to the prediction, they require users to select concepts of their choice and collect labeled examples of those concepts. In this work, we introduce DTCAV (Discovery TCAV) a global concept-based interpretability method that can automatically discover concepts as image segments, along with each concept’s estimated importance for a deep neural network’s predictions. We validate that discovered concepts are as coherent to humans as hand-labeled concepts. We also show that the discovered concepts carry significant signal for prediction by analyzing a network’s performance with stitched/added/deleted concepts. DTCAV results revealed a number of undesirable correlations (e.g., a basketball player’s jersey was a more important concept for predicting the basketball class than the ball itself) and show the potential shallow reasoning of these networks. Discrete Attend Infer Repeat(Discrete-AIR) In this work we present Discrete Attend Infer Repeat (Discrete-AIR), a Recurrent Auto-Encoder with structured latent distributions containing discrete categorical distributions, continuous attribute distributions, and factorised spatial attention. While inspired by the original AIR model andretaining AIR model’s capability in identifying objects in an image, Discrete-AIR provides direct interpretability of the latent codes. We show that for Multi-MNIST and a multiple-objects version of dSprites dataset, the Discrete-AIR model needs just one categorical latent variable, one attribute variable (for Multi-MNIST only), together with spatial attention variables, for efficient inference. We perform analysis to show that the learnt categorical distributions effectively capture the categories of objects in the scene for Multi-MNIST and for Multi-Sprites. Discrete Choice In economics, discrete choice models, or qualitative choice models, describe, explain, and predict choices between two or more discrete alternatives, such as entering or not entering the labor market, or choosing between modes of transport. Such choices contrast with standard consumption models in which the quantity of each good consumed is assumed to be a continuous variable. In the continuous case, calculus methods (e.g. first-order conditions) can be used to determine the optimum amount chosen, and demand can be modeled empirically using regression analysis. On the other hand, discrete choice analysis examines situations in which the potential outcomes are discrete, such that the optimum is not characterized by standard first-order conditions. Thus, instead of examining ‘how much’ as in problems with continuous choice variables, discrete choice analysis examines ‘which one.’ However, discrete choice analysis can also be used to examine the chosen quantity when only a few distinct quantities must be chosen from, such as the number of vehicles a household chooses to own [1] and the number of minutes of telecommunications service a customer decides to purchase.[2] Techniques such as logistic regression and probit regression can be used for empirical analysis of discrete choice. Book: Discrete Choice Analysis Discrete Dantzig Selector We propose a new high-dimensional linear regression estimator: the Discrete Dantzig Selector, which minimizes the number of nonzero regression coefficients, subject to a budget on the maximal absolute correlation between the features and the residuals. We show that the estimator can be expressed as a solution to a Mixed Integer Linear Optimization (MILO) problem—a computationally tractable framework that enables the computation of provably optimal global solutions. Our approach has the appealing characteristic that even if we terminate the optimization problem at an early stage, it exits with a certificate of sub-optimality on the quality of the solution. We develop new discrete first order methods, motivated by recent algorithmic developments in first order continuous convex optimization, to obtain high quality feasible solutions for the Discrete Dantzig Selector problem. Our proposal leads to advantages over the off-the-shelf state-of-the-art integer programming algorithms, which include superior upper bounds obtained for a given computational budget. When a solution obtained from the discrete first order methods is passed as a warm-start to a MILO solver, the performance of the latter improves significantly. Exploiting problem specific information, we propose enhanced MILO formulations that further improve the algorithmic performance of the MILO solvers. We demonstrate, both theoretically and empirically, that, in a wide range of regimes, the statistical properties of the Discrete Dantzig Selector are superior to those of popular $\ell_{1}$-based approaches. For problem instances with $p \approx 2500$ features and $n \approx 900$ observations, our computational framework delivers optimal solutions in a few minutes and certifies optimality within an hour. Discrete Event Simulation(DES) In the field of simulation, a discrete-event simulation (DES), models the operation of a system as a discrete sequence of events in time. Each event occurs at a particular instant in time and marks a change of state in the system. Between consecutive events, no change in the system is assumed to occur; thus the simulation can directly jump in time from one event to the next. This contrasts with continuous simulation in which the simulation continuously tracks the system dynamics over time. Instead of being event-based, this is called an activity-based simulation; time is broken up into small time slices and the system state is updated according to the set of activities happening in the time slice. Because discrete-event simulations do not have to simulate every time slice, they can typically run much faster than the corresponding continuous simulation. Another alternative to event-based simulation is process-based simulation. In this approach, each activity in a system corresponds to a separate process, where a process is typically simulated by a thread in the simulation program. In this case, the discrete events, which are generated by threads, would cause other threads to sleep, wake, and update the system state. A more recent method is the three-phased approach to discrete event simulation (Pidd, 1998). In this approach, the first phase is to jump to the next chronological event. The second phase is to execute all events that unconditionally occur at that time (these are called B-events). The third phase is to execute all events that conditionally occur at that time (these are called C-events). The three phase approach is a refinement of the event-based approach in which simultaneous events are ordered so as to make the most efficient use of computer resources. The three-phase approach is used by a number of commercial simulation software packages, but from the user’s point of view, the specifics of the underlying simulation method are generally hidden. simmer,DES Discrete Event System Specification(DEVS) DEVS abbreviating Discrete Event System Specification is a modular and hierarchical formalism for modeling and analyzing general systems that can be discrete event systems which might be described by state transition tables, and continuous state systems which might be described by differential equations, and hybrid continuous state and discrete event systems. DEVS is a timed event system. Discrete False Discovery Rate(DFRD+/-) This article introduces a discrete false discovery rate (DFRD+/-) controlling method for data snooping testing. We investigate with DFRD+/- the performance of dynamic portfolios constructed upon over 21,000 technical trading rules on 12 categorical and country-specific markets over the study period 2004-2017. The profitability, robustness and persistence of the technical rules are examined. We note that technical analysis has still short-term value in advanced, emerging and frontier markets. A cross-validation exercise highlights the importance of frequent rebalancing and the variability of profitability in trading with technical analysis. Discrete Fourier Cosine Quadrature Transform(FCQT) The Hilbert transform (HT) and associated Gabor analytic signal (GAS) representation are well-known and widely used mathematical formulations for modeling and analysis of signals in various applications. In this study, like the HT, to obtain quadrature component of a signal, we propose the novel discrete Fourier cosine quadrature transforms (FCQTs) and discrete Fourier sine quadrature transforms (FSQTs), designated as Fourier quadrature transforms (FQTs). Using these FQTs, we propose sixteen Fourier-Singh analytic signal (FSAS) representations with following properties: (1) real part of eight FSAS representations is the original signal and imaginary part is the FCQT of the real part, (2) imaginary part of eight FSAS representations is the original signal and real part is the FSQT of the real part, (3) like the GAS, Fourier spectrum of the all FSAS representations has only positive frequencies, however unlike the GAS, the real and imaginary parts of the proposed FSAS representations are not orthogonal to each other. The Fourier decomposition method (FDM) is an adaptive data analysis approach to decompose a signal into a set of small number of Fourier intrinsic band functions which are AM-FM components. This study also proposes a new formulation of the FDM using the discrete cosine transform (DCT) with the GAS and FSAS representations, and demonstrate its efficacy for improved time-frequency-energy representation and analysis of nonlinear and non-stationary time series. Discrete Label Set Although many scalable event matching algorithms have been proposed to achieve scalability for large-scale content-based networks, content-based publish/subscribe networks (especially for large-scale real time systems) still suffer performance deterioration when subscription scale increases. While subscription aggregation techniques can be useful to reduce the amount of subscription dissemination traffic and the subscription table size by exploiting the similarity among subscriptions, efficient subscription aggregation is not a trivial task to accomplish. Previous research works have proved that it is either a NP-Complete or a co-NP complete problem. In this paper, we propose DLS (Discrete Label Set), a novel subscription representation model, and design algorithms to achieve the mapping from traditional Boolean predicate model to the DLS model. Based on the DLS model, we propose a subscription aggregation algorithm with O(1) time complexity in most cases, and an event matching algorithm with O(1) time complexity. The significant performance improvement is at the cost of memory consumption and controllable false positive rate. Our theoretical analysis shows that these algorithms are inherently scalable and can achieve real time event matching in a large-scale content-based publish/subscribe network. We discuss the tradeoff between memory, false positive rate and partition granules of content space. Experimental results show that proposed algorithms achieve expected performance. With the increasing of computer memory capacity and the dropping of memory price, more and more large-scale real time applications can benefit from our proposed DLS model, such as stock quote distribution, earthquake monitoring, and severe weather alert. Discrete Morse Theory Discrete Morse theory is a tool for determining equivalences between topological spaces arising from discrete mathematical structures. This theory was developed by Robin Forman in the 1990s as a combinatorial analog to Morse theory, developed by Marston Morse in the 1920s. The original theory deals with analyzing such equivalences for general topological spaces, while discrete Morse theory provides similar methods of analysis for topological spaces endowed with additional, discrete structure. For these structures, applications of the discrete theory are often more natural, as well as simpler and more straightforward to apply. Discrete Morse theory has applications throughout many fields of pure and applied mathematics. Within pure mathematics, for example, the theory has been widely applied to problems in geometry, topology, and knot theory; and within computer science, the theory has been used to evaluate data compression algorithms and to bound the complexity of algorithms that determine whether graphs have certain properties – for example, whether all components of a graph are connected. If we wish to know whether a given property holds for a certain topological space, our question can often be reduced to the question of whether the space is equivalent to another space for which the property holds. For example, whether a simple algorithm exists for determining if a graph is connected depends on whether the structure that represents the space of not-connected graphs can be shrunken to a point. Alas, it cannot, so any algorithm for testing graph connectedness must, at least in some cases, conduct an exhaustive search. This result has real-world implications: for example, it means that if we want to test a communications system – say, immediately after a disaster – to determine whether it is still connected, there is no guaranteed way of finding the answer without testing every component individually. http://…/Discrete_Morse_theory http://…/Morse_theory http://…/s48forman.pdf TDAmapper Discrete Neural Process Many data generating processes involve latent random variables over discrete combinatorial spaces whose size grows factorially with the dataset. In these settings, existing posterior inference methods can be inaccurate and/or very slow. In this work we develop methods for efficient amortized approximate Bayesian inference over discrete combinatorial spaces, with applications to random permutations, probabilistic clustering (such as Dirichlet process mixture models) and random communities (such as stochastic block models). The approach is based on mapping distributed, symmetry-invariant representations of discrete arrangements into conditional probabilities. The resulting algorithms parallelize easily, yield iid samples from the approximate posteriors, and can easily be applied to both conjugate and non-conjugate models, as training only requires samples from the generative model. Discrete Reasoning Over the Content of Paragraphs(DROP) Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new English reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 96k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature on this dataset and show that the best systems only achieve 32.7% F1 on our generalized accuracy metric, while expert human performance is 96.0%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 47.0% F1. Discrete Sparklines Discrete Statistical Model A discrete statistical model is a subset of a probability simplex. Its maximum likelihood estimator (MLE) is a retraction from that simplex onto the model. Discrete Time Markov Chain(DTMC) A discrete-time Markov chain is a sequence of random variables X1, X2, X3, … with the Markov property, namely that the probability of moving to the next state depends only on the present state and not on the previous states. Discrete-Event Systems A discrete event system is a dynamic system with discrete states the transitions of which are triggered by events. This provides a general framework for many man-made systems where the system dynamics not only follow physical laws but also the man-made rules. It is difficult to describe the dynamics of these systems using closed-form expressions. In many cases simulation is the only faithful way to describe the system dynamics and for performance evaluation. Discrete-Time Method of Successive Approximations(MSA) Deep learning is formulated as a discrete-time optimal control problem. This allows one to characterize necessary conditions for optimality and develop training algorithms that do not rely on gradients with respect to the trainable parameters. In particular, we introduce the discrete-time method of successive approximations (MSA), which is based on the Pontryagin’s maximum principle, for training neural networks. A rigorous error estimate for the discrete MSA is obtained, which sheds light on its dynamics and the means to stabilize the algorithm. The developed methods are applied to train, in a rather principled way, neural networks with weights that are constrained to take values in a discrete set. We obtain competitive performance and interestingly, very sparse weights in the case of ternary networks, which may be useful in model deployment in low-memory devices. Discretification Discretification’ is the mechanism of making continuous data discrete. If you really grasp the concept, you may be thinking ‘Wait a minute, the type of data we are collecting is discrete in and of itself! Data can EITHER be discrete OR continuous, it can’t be both!’ You would be correct. But what if we manually selected values along that continuous measurement, and declared them to be in a specific category? For instance, if we declare 72.0 degrees and greater to be ‘Hot’, 35.0-71.9 degrees to be ‘Moderate’, and anything lower than 35.0 degrees to be ‘Cold’, we have ‘discretified’ temperature! Our readings that were once continuous now fit into distinct categories. So, where we do we draw the boundaries for these categories? What makes 35.0 degrees ‘Cold’ and 35.1 degrees ‘Moderate’? At is at this juncture that the TRUE decision is being made. The beauty of approaching the challenge in this manner is that it is data-centric, not concept-centric. Let’s walk through our marketing example first without using discretification, then with it. Discriminant Analysis Discriminant analysis is used to distinguish distinct sets of observations and allocate new observations to previously defined groups. This method is commonly used in biological species classification, in medical classification of tumors, in facial recognition technologies, and in the credit card and insurance industries for determining risk. HiDimDA Discriminant Function Analysis Discriminant function analysis is a statistical analysis to predict a categorical dependent variable (called a grouping variable) by one or more continuous or binary independent variables (called predictor variables). The original dichotomous discriminant analysis was developed by Sir Ronald Fisher in 1936. It is different from an ANOVA or MANOVA, which is used to predict one (ANOVA) or multiple (MANOVA) continuous dependent variables by one or more independent categorical variables. Discriminant function analysis is useful in determining whether a set of variables is effective in predicting category membership. Discriminant analysis is used when groups are known a priori (unlike in cluster analysis). Each case must have a score on one or more quantitative predictor measures, and a score on a group measure. In simple terms, discriminant function analysis is classification – the act of distributing things into groups, classes or categories of the same type. Moreover, it is a useful follow-up procedure to a MANOVA instead of doing a series of one-way ANOVAs, for ascertaining how the groups differ on the composite of dependent variables. In this case, a significant F test allows classification based on a linear combination of predictor variables. Terminology can get confusing here, as in MANOVA, the dependent variables are the predictor variables, and the independent variables are the grouping variables. Discriminative and Contrast Invertible(DCI) Local feature descriptors have been widely used in fine-grained visual object search thanks to their robustness in scale and rotation variation and cluttered background. However, the performance of such descriptors drops under severe illumination changes. In this paper, we proposed a Discriminative and Contrast Invertible (DCI) local feature descriptor. In order to increase the discriminative ability of the descriptor under illumination changes, a Laplace gradient based histogram is proposed. A robust contrast flipping estimate is proposed based on the divergence of a local region. Experiments on fine-grained object recognition and retrieval applications demonstrate the superior performance of DCI descriptor to others. Discriminative Convolutional Analysis Dictionary Learning(DCADL) Discriminative Dictionary Learning (DL) methods have been widely advocated for image classification problems. To further sharpen their discriminative capabilities, most state-of-the-art DL methods have additional constraints included in the learning stages. These various constraints, however, lead to additional computational complexity. We hence propose an efficient Discriminative Convolutional Analysis Dictionary Learning (DCADL) method, as a lower cost Discriminative DL framework, to both characterize the image structures and refine the interclass structure representations. The proposed DCADL jointly learns a convolutional analysis dictionary and a universal classifier, while greatly reducing the time complexity in both training and testing phases, and achieving a competitive accuracy, thus demonstrating great performance in many experiments with standard databases. Discriminative Differentiable Dynamic Time Warping(D^3TW) We address weakly-supervised action alignment and segmentation in videos, where only the order of occurring actions is available during training. We propose Discriminative Differentiable Dynamic Time Warping (D${}^3$TW), which is the first discriminative model for weak ordering supervision. This allows us to bypass the degenerated sequence problem usually encountered in previous work. The key technical challenge for discriminative modeling with weak-supervision is that the loss function of the ordering supervision is usually formulated using dynamic programming and is thus not differentiable. We address this challenge by continuous relaxation of the min-operator in dynamic programming and extend the DTW alignment loss to be differentiable. The proposed D${}^3$TW innovatively solves sequence alignment with discriminative modeling and end-to-end training, which substantially improves the performance in weakly supervised action alignment and segmentation tasks. We show that our model outperforms the current state-of-the-art across three evaluation metrics in two challenging datasets. Discriminative Encoding for Domain Adaptation(DEDA) The primary objective of domain adaptation methods is to transfer knowledge from a source domain to a target domain that has similar but different data distributions. Thus, in order to correctly classify the unlabeled target domain samples, the standard approach is to learn a common representation for both source and target domain, thereby indirectly addressing the problem of learning a classifier in the target domain. However, such an approach does not address the task of classification in the target domain directly. In contrast, we propose an approach that directly addresses the problem of learning a classifier in the unlabeled target domain. In particular, we train a classifier to correctly classify the training samples while simultaneously classifying the samples in the target domain in an unsupervised manner. The corresponding model is referred to as Discriminative Encoding for Domain Adaptation (DEDA). We show that this simple approach for performing unsupervised domain adaptation is indeed quite powerful. Our method achieves state of the art results in unsupervised adaptation tasks on various image classification benchmarks. We also obtained state of the art performance on domain adaptation in Amazon reviews sentiment classification dataset. We perform additional experiments when the source data has less labeled examples and also on zero-shot domain adaptation task where no target domain samples are used for training. Discriminative k-shot learning This paper introduces a probabilistic framework for k-shot image classification. The goal is to generalise from an initial large-scale classification task to a separate task comprising new classes and small numbers of examples. The new approach not only leverages the feature-based representation learned by a neural network from the initial task (representational transfer), but also information about the form of the classes (concept transfer). The concept information is encapsulated in a probabilistic model for the final layer weights of the neural network which then acts as a prior when probabilistic k-shot learning is performed. Surprisingly, simple probabilistic models and inference schemes outperform many existing k-shot learning approaches and compare favourably with the state-of-the-art method in terms of error-rate. The new probabilistic methods are also able to accurately model uncertainty, leading to well calibrated classifiers, and they are easily extensible and flexible, unlike many recent approaches to k-shot learning. Discriminative Model Discriminative models, also called conditional models, are a class of models used in machine learning for modeling the dependence of an unobserved variable y on an observed variable x. Within a probabilistic framework, this is done by modeling the conditional probability distribution P(y|x), which can be used for predicting y from x. Discriminative models, as opposed to generative models, do not allow one to generate samples from the joint distribution of x and y. However, for tasks such as classification and regression that do not require the joint distribution, discriminative models can yield superior performance. On the other hand, generative models are typically more flexible than discriminative models in expressing dependencies in complex learning tasks. In addition, most discriminative models are inherently supervised and cannot easily be extended to unsupervised learning. Application specific details ultimately dictate the suitability of selecting a discriminative versus generative model. Discriminative Optimization(DO) Many computer vision problems are formulated as the optimization of a cost function. This approach faces two main challenges: (i) designing a cost function with a local optimum at an acceptable solution, and (ii) developing an efficient numerical method to search for one (or multiple) of these local optima. While designing such functions is feasible in the noiseless case, the stability and location of local optima are mostly unknown under noise, occlusion, or missing data. In practice, this can result in undesirable local optima or not having a local optimum in the expected place. On the other hand, numerical optimization algorithms in high-dimensional spaces are typically local and often rely on expensive first or second order information to guide the search. To overcome these limitations, this paper proposes Discriminative Optimization (DO), a method that learns search directions from data without the need of a cost function. Specifically, DO explicitly learns a sequence of updates in the search space that leads to stationary points that correspond to desired solutions. We provide a formal analysis of DO and illustrate its benefits in the problem of 3D point cloud registration, camera pose estimation, and image denoising. We show that DO performed comparably or outperformed state-of-the-art algorithms in terms of accuracy, robustness to perturbations, and computational efficiency. Discriminative PCA(dPCA) Principal component analysis (PCA) is widely used for feature extraction and dimensionality reduction, with documented merits in diverse tasks involving high-dimensional data. Standard PCA copes with one dataset at a time, but it is challenged when it comes to analyzing multiple datasets jointly. In certain data science settings however, one is often interested in extracting the most discriminative information from one dataset of particular interest (a.k.a. target data) relative to the other(s) (a.k.a. background data). To this end, this paper puts forth a novel approach, termed discriminative (d) PCA, for such discriminative analytics of multiple datasets. Under certain conditions, dPCA is proved to be least-squares optimal in recovering the component vector unique to the target data relative to background data. To account for nonlinear data correlations, (linear) dPCA models for one or multiple background datasets are generalized through kernel-based learning. Interestingly, all dPCA variants admit an analytical solution obtainable with a single (generalized) eigenvalue decomposition. Finally, corroborating dimensionality reduction tests using both synthetic and real datasets are provided to validate the effectiveness of the proposed methods. Discriminative Principal Component Analysis(Discriminative PCA) In this paper, we propose a novel approach named by Discriminative Principal Component Analysis which is abbreviated as Discriminative PCA in order to enhance separability of PCA by Linear Discriminant Analysis (LDA). The proposed method performs feature extraction by determining a linear projection that captures the most scattered discriminative information. The most innovation of Discriminative PCA is performing PCA on discriminative matrix rather than original sample matrix. For calculating the required discriminative matrix under low complexity, we exploit LDA on a converted matrix to obtain within-class matrix and between-class matrix thereof. During the computation process, we utilise direct linear discriminant analysis (DLDA) to solve the encountered SSS problem. For evaluating the performances of Discriminative PCA in face recognition, we analytically compare it with DLAD and PCA on four well known facial databases, they are PIE, FERET, YALE and ORL respectively. Results in accuracy and running time obtained by nearest neighbour classifier are compared when different number of training images per person used. Not only the superiority and outstanding performance of Discriminative PCA showed in recognition rate, but also the comparable results of running time. Discriminative Regression Machine(DRM) We introduce a discriminative regression approach to supervised classification in this paper. It estimates a representation model while accounting for discriminativeness between classes, thereby enabling accurate derivation of categorical information. This new type of regression models extends existing models such as ridge, lasso, and group lasso through explicitly incorporating discriminative information. As a special case we focus on a quadratic model that admits a closed-form analytical solution. The corresponding classifier is called discriminative regression machine (DRM). Three iterative algorithms are further established for the DRM to enhance the efficiency and scalability for real applications. Our approach and the algorithms are applicable to general types of data including images, high-dimensional data, and imbalanced data. We compare the DRM with currently state-of-the-art classifiers. Our extensive experimental results show superior performance of the DRM and confirm the effectiveness of the proposed approach. Discriminative Subgraph Learning(DSL) The goal in network state prediction (NSP) is to classify the global state (label) associated with features embedded in a graph. This graph structure encoding feature relationships is the key distinctive aspect of NSP compared to classical supervised learning. NSP arises in various applications: gene expression samples embedded in a protein-protein interaction (PPI) network, temporal snapshots of infrastructure or sensor networks, and fMRI coherence network samples from multiple subjects to name a few. Instances from these domains are typically “wide” (more features than samples), and thus, feature sub-selection is required for robust and generalizable prediction. How to best employ the network structure in order to learn succinct connected subgraphs encompassing the most discriminative features becomes a central challenge in NSP. Prior work employs connected subgraph sampling or graph smoothing within optimization frameworks, resulting in either large variance of quality or weak control over the connectivity of selected subgraphs. In this work we propose an optimization framework for discriminative subgraph learning (DSL) which simultaneously enforces (i) sparsity, (ii) connectivity and (iii) high discriminative power of the resulting subgraphs of features. Our optimization algorithm is a single-step solution for the NSP and the associated feature selection problem. It is rooted in the rich literature on maximal-margin optimization, spectral graph methods and sparse subspace self-representation. DSL simultaneously ensures solution interpretability and superior predictive power (up to 16% improvement in challenging instances compared to baselines), with execution times up to an hour for large instances. Discriminative Supervised Hashing(DSH) With the advantage of low storage cost and high retrieval efficiency, hashing techniques have recently been an emerging topic in cross-modal similarity search. As multiple modal data reflect similar semantic content, many researches aim at learning unified binary codes. However, discriminative hashing features learned by these methods are not adequate. This results in lower accuracy and robustness. We propose a novel hashing learning framework which jointly performs classifier learning, subspace learning and matrix factorization to preserve class-specific semantic content, termed Discriminative Supervised Hashing (DSH), to learn the discrimative unified binary codes for multi-modal data. Besides, reducing the loss of information and preserving the non-linear structure of data, DSH non-linearly projects different modalities into the common space in which the similarity among heterogeneous data points can be measured. Extensive experiments conducted on the three publicly available datasets demonstrate that the framework proposed in this paper outperforms several state-of -the-art methods. Discriminator Rejection Sampling(DRS) We propose a rejection sampling scheme using the discriminator of a GAN to approximately correct errors in the GAN generator distribution. We show that under quite strict assumptions, this will allow us to recover the data distribution exactly. We then examine where those strict assumptions break down and design a practical algorithm – called Discriminator Rejection Sampling (DRS) – that can be used on real data-sets. Finally, we demonstrate the efficacy of DRS on a mixture of Gaussians and on the state of the art SAGAN model. On ImageNet, we train an improved baseline that increases the best published Inception Score from 52.52 to 62.36 and reduces the Frechet Inception Distance from 18.65 to 14.79. We then use DRS to further improve on this baseline, improving the Inception Score to 76.08 and the FID to 13.75. Disentangled Attribution Curve(DAC) Tree ensembles, such as random forests and AdaBoost, are ubiquitous machine learning models known for achieving strong predictive performance across a wide variety of domains. However, this strong performance comes at the cost of interpretability (i.e. users are unable to understand the relationships a trained random forest has learned and why it is making its predictions). In particular, it is challenging to understand how the contribution of a particular feature, or group of features, varies as their value changes. To address this, we introduce Disentangled Attribution Curves (DAC), a method to provide interpretations of tree ensemble methods in the form of (multivariate) feature importance curves. For a given variable, or group of variables, DAC plots the importance of a variable(s) as their value changes. We validate DAC on real data by showing that the curves can be used to increase the accuracy of logistic regression while maintaining interpretability, by including DAC as an additional feature. In simulation studies, DAC is shown to out-perform competing methods in the recovery of conditional expectations. Finally, through a case-study on the bike-sharing dataset, we demonstrate the use of DAC to uncover novel insights into a dataset. DisentAngled Representation Learning Agent(DARLA) Domain adaptation is an important open problem in deep reinforcement learning (RL). In many scenarios of interest data is hard to obtain, so agents may learn a source policy in a setting where data is readily available, with the hope that it generalises well to the target domain. We propose a new multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act. DARLA’s vision is based on learning a disentangled representation of the observed environment. Once DARLA can see, it is able to acquire source policies that are robust to many domain shifts – even with no access to the target domain. DARLA significantly outperforms conventional baselines in zero-shot domain adaptation scenarios, an effect that holds across a variety of RL environments (Jaco arm, DeepMind Lab) and base RL algorithms (DQN, A3C and EC). DisguiseNet This paper describes our approach for the Disguised Faces in the Wild (DFW) 2018 challenge. The task here is to verify the identity of a person among disguised and impostors images. Given the importance of the task of face verification it is essential to compare methods across a common platform. Our approach is based on VGG-face architecture paired with Contrastive loss based on cosine distance met- ric. For augmenting the data set, we source more data from the internet. The experiments show the effectiveness of the approach on the DFW data. We show that adding extra data to the DFW dataset with noisy labels also helps in increasing the gen 11 eralization performance of the network. The proposed network achieves 27.13% absolute increase in accuracy over the DFW baseline. Disparity Matrix ➚ “Canonical Space” DISPATCH This work presents the first algorithm for the problem of weighted online perfect bipartite matching with i.i.d. arrivals. Previous work only considered adversarial arrival sequences. In this problem, we are given a known set of workers, a distribution over job types, and non-negative utility weights for each worker, job type pair. At each time step, a job is drawn i.i.d. from the distribution over job types. Upon arrival, the job must be irrevocably assigned to a worker. The goal is to maximize the expected sum of utilities after all jobs are assigned. Our work is motivated by the application of ride-hailing, where jobs represent passengers and workers represent drivers. We introduce \algname{}, a 0.5-competitive, randomized algorithm and prove that 0.5-competitive is the best possible. \algname{} first selects a ‘preferred worker’ and assign the job to this worker if it is available. The preferred worker is determined based on an optimal solution to a fractional transportation problem. If the preferred worker is not available, \algname{} randomly selects a worker from the available workers. We show that \algname{} maintains a uniform distribution over the workers even when the distribution over the job types is non-uniform. DISPONTE When modeling real world domains we have to deal with information that is incomplete or that comes from sources with different trust levels. This motivates the need for managing uncertainty in the Semantic Web. To this purpose, we introduced a probabilistic semantics, named DISPONTE, in order to combine description logics with probability theory. The probability of a query can be then computed from the set of its explanations by building a Binary Decision Diagram (BDD). The set of explanations can be found using the tableau algorithm, which has to handle non-determinism. Prolog, with its efficient handling of non-determinism, is suitable for implementing the tableau algorithm. TRILL and TRILLP are systems offering a Prolog implementation of the tableau algorithm. TRILLP builds a pinpointing formula, that compactly represents the set of explanations and can be directly translated into a BDD. Both reasoners were shown to outperform state-of-the-art DL reasoners. In this paper, we present an improvement of TRILLP, named TORNADO, in which the BDD is directly built during the construction of the tableau, further speeding up the overall inference process. An experimental comparison shows the effectiveness of TORNADO. All systems can be tried online in the TRILL on SWISH web application at http://…/. DISsimilarity COefficient Networks(DISCO Nets) We present a new type of probabilistic model which we call DISsimilarity COefficient Networks (DISCO Nets). DISCO Nets allow us to efficiently sample from a posterior distribution parametrised by a neural network. During training, DISCO Nets are learned by minimising the dissimilarity coefficient between the true distribution and the estimated distribution. This allows us to tailor the training to the loss related to the task at hand. We empirically show that (i) by modeling uncertainty on the output value, DISCO Nets outperform equivalent non-probabilistic predictive networks and (ii) DISCO Nets accurately model the uncertainty of the output, outperforming existing probabilistic models based on deep neural networks. Dissimilarity Measure If features are given or can be defined they can be used to define a distance measure between objects. This can be understood as the euclidean distance in a properly scaled feature space. For good features this will also result in good dissimilarities. However, as long a dissimilarities are based on features their performance will be determined by the quality of these. As features do not describe the full objects it is possible that two objects that are different have a zero distance based on the available features. This is an essential cause of class overlap in feature spaces. Dissimilarities offer the possibility to overcome this. If the dissimilarity measure is defined in such a way that objects have a zero distance to itself or to entirely identical copies of themselves (which thereby should belong to the same class) there is no class overlap. Distance Based on Conditional Ordered List(DCOL) nlnet Distance Correlation In statistics and in probability theory, distance correlation is a measure of statistical dependence between two random variables or two random vectors of arbitrary, not necessarily equal dimension. An important property is that this measure of dependence is zero if and only if the random variables are statistically independent. This measure is derived from a number of other quantities that are used in its specification, specifically: distance variance, distance standard deviation and distance covariance. These take the same roles as the ordinary moments with corresponding names in the specification of the Pearson product-moment correlation coefficient. These distance-based measures can be put into an indirect relationship to the ordinary moments by an alternative formulation (described below) using ideas related to Brownian motion, and this has led to the use of names such as Brownian covariance and Brownian distance covariance. cdcsis Distance Metric Learned Collaborative Representation Classifier(DML-CRC) Any generic deep machine learning algorithm is essentially a function fitting exercise, where the network tunes its weights and parameters to learn discriminatory features by minimizing some cost function. Though the network tries to learn the optimal feature space, it seldom tries to learn an optimal distance metric in the cost function, and hence misses out on an additional layer of abstraction. We present a simple effective way of achieving this by learning a generic Mahalanabis distance in a collaborative loss function in an end-to-end fashion with any standard convolutional network as the feature learner. The proposed method DML-CRC gives state-of-the-art performance on benchmark fine-grained classification datasets CUB Birds, Oxford Flowers and Oxford-IIIT Pets using the VGG-19 deep network. The method is network agnostic and can be used for any similar classification tasks. Distance Metric Learning(DML) Distance metric learning (DML), which learns a distance metric from labeled ‘similar’ and ‘dissimilar’ data pairs, is widely utilized. Recently, several works investigate orthogonality-promoting regularization (OPR), which encourages the projection vectors in DML to be close to being orthogonal, to achieve three effects: (1) high balancedness — achieving comparable performance on both frequent and infrequent classes; (2) high compactness — using a small number of projection vectors to achieve a ‘good’ metric; (3) good generalizability — alleviating overfitting to training data. While showing promising results, these approaches suffer three problems. First, they involve solving non-convex optimization problems where achieving the global optimal is NP-hard. Second, it lacks a theoretical understanding why OPR can lead to balancedness. Third, the current generalization error analysis of OPR is not directly on the regularizer. In this paper, we address these three issues by (1) seeking convex relaxations of the original nonconvex problems so that the global optimal is guaranteed to be achievable; (2) providing a formal analysis on OPR’s capability of promoting balancedness; (3) providing a theoretical analysis that directly reveals the relationship between OPR and generalization performance. Experiments on various datasets demonstrate that our convex methods are more effective in promoting balancedness, compactness, and generalization, and are computationally more efficient, compared with the nonconvex methods. Distance Multivariance We introduce two new measures for the dependence of $n \ge 2$ random variables: distance multivariance’ and `total distance multivariance’. Both measures are based on the weighted $L^2$-distance of quantities related to the characteristic functions of the underlying random variables. They extend distance covariance (introduced by Szekely, Rizzo and Bakirov) and generalized distance covariance (introduced in part I) from pairs of random variables to $n$-tuplets of random variables. We show that total distance multivariance can be used to detect the independence of $n$ random variables and has a simple finite-sample representation in terms of distance matrices of the sample points, where distance is measured by a continuous negative definite function. Based on our theoretical results, we present a test for independence of multiple random vectors which is consistent against all alternatives. multivariance Distance Preservation to Local Mean(DPLM) In this paper, we propose a nonlinear dimensionality reduction algorithm for the manifold of Symmetric Positive Definite (SPD) matrices that considers the geometry of SPD matrices and provides a low dimensional representation of the manifold with high class discrimination. The proposed algorithm, tries to preserve the local structure of the data by preserving distance to local mean (DPLM) and also provides an implicit projection matrix. DPLM is linear in terms of the number of training samples and may use the label information when they are available in order to performance improvement in classification tasks. We performed several experiments on the multi-class dataset IIa from BCI competition IV. The results show that our approach as dimensionality reduction technique – leads to superior results in comparison with other competitor in the related literature because of its robustness against outliers. The experiments confirm that the combination of DPLM with FGMDM as the classifier leads to the state of the art performance on this dataset. Distance to Kernel and Embedding(D2KE) For many machine learning problem settings, particularly with structured inputs such as sequences or sets of objects, a distance measure between inputs can be specified more naturally than a feature representation. However, most standard machine models are designed for inputs with a vector feature representation. In this work, we consider the estimation of a function $f:\mathcal{X} \rightarrow \R$ based solely on a dissimilarity measure $d:\mathcal{X}\times\mathcal{X} \rightarrow \R$ between inputs. In particular, we propose a general framework to derive a family of \emph{positive definite kernels} from a given dissimilarity measure, which subsumes the widely-used \emph{representative-set method} as a special case, and relates to the well-known \emph{distance substitution kernel} in a limiting case. We show that functions in the corresponding Reproducing Kernel Hilbert Space (RKHS) are Lipschitz-continuous w.r.t. the given distance metric. We provide a tractable algorithm to estimate a function from this RKHS, and show that it enjoys better generalizability than Nearest-Neighbor estimates. Our approach draws from the literature of Random Features, but instead of deriving feature maps from an existing kernel, we construct novel kernels from a random feature map, that we specify given the distance measure. We conduct classification experiments with such disparate domains as strings, time series, and sets of vectors, where our proposed framework compares favorably to existing distance-based learning methods such as $k$-nearest-neighbors, distance-substitution kernels, pseudo-Euclidean embedding, and the representative-set method. Distance to Measure(DTM) Data often comes in the form of a point cloud sampled from an unknown compact subset of Euclidean space. The general goal of geometric inference is then to recover geometric and topological features (e.g., Betti numbers, normals) of this subset from the approximating point cloud data. It appears that the study of distance functions allows one to address many of these questions successfully. However, one of the main limitations of this framework is that it does not cope well with outliers or with background noise. In this paper, we show how to extend the framework of distance functions to overcome this problem. Replacing compact subsets by measures, we introduce a notion of distance function to a probability distribution in R d . These functions share many properties with classical distance functions, which make them suitable for inference purposes. In particular, by considering appropriate level sets of these distance functions, we show that it is possible to reconstruct offsets of sampled shapes with topological guarantees even in the presence of outliers. Moreover, in settings where empirical measures are considered, these functions can be easily evaluated, making them of particular practical interest. Distance Weighted Discrimination(DWD) High Dimension Low Sample Size statistical analysis is becoming increasingly important in a wide range of applied contexts. In such situations, it is seen that the appealing discrimination method called the Support Vector Machine can be improved. The revealing concept is ‘data piling’ at the margin. This leads naturally to the development of ‘Distance Weighted Discrimination’, which also is based on modern computationally intensive optimization methods, and seems to give improved ‘generalizability’. Another Look at DWD: Thrifty Algorithm and Bayes Risk Consistency in RKHS sdwd Distance-Based Independence Screening for Canonical Analysis(DISCA) This paper introduces a new method named Distance-based Independence Screening for Canonical Analysis (DISCA) to reduce dimensions of two random vectors with arbitrary dimensions. The objective of our method is to identify the low dimensional linear projections of two random vectors, such that any dimension reduction based on linear projection with lower dimensions will surely affect some dependent structure — the removed components are not independent. The essence of DISCA is to use the distance correlation to eliminate the ‘redundant’ dimensions until infeasible. Unlike the existing canonical analysis methods, DISCA does not require the dimensions of the reduced subspaces of the two random vectors to be equal, nor does it require certain distributional assumption on the random vectors. We show that under mild conditions, our approach does undercover the lowest possible linear dependency structures between two random vectors, and our conditions are weaker than some sufficient linear subspace-based methods. Numerically, DISCA is to solve a non-convex optimization problem. We formulate it as a difference-of-convex (DC) optimization problem, and then further adopt the alternating direction method of multipliers (ADMM) on the convex step of the DC algorithms to parallelize/accelerate the computation. Some sufficient linear subspace-based methods use potentially numerically-intensive bootstrap method to determine the dimensions of the reduced subspaces in advance; our method avoids this complexity. In simulations, we present cases that DISCA can solve effectively, while other methods cannot. In both the simulation studies and real data cases, when the other state-of-the-art dimension reduction methods are applicable, we observe that DISCA performs either comparably or better than most of them. Codes and an R package can be found in GitHub https://…/DISCA. Distance-Based k-Medoids This paper proposes a new algorithm for K-medoids clustering which runs like the K-means algorithm and tests several methods for selecting initial medoids. The proposed algorithm calculates the distance matrix once and uses it for finding new medoids at every iterative step. To evaluate the proposed algorithm, we use some real and artificial data sets and compare with the results of other algorithms in terms of the adjusted Rand index. Experimental results show that the proposed algorithm takes a significantly reduced time in computation with comparable performance against the partitioning around medoids. kmed Distance-preserving Grid(DGrid) Distance preserving visualization techniques have emerged as one of the fundamental tools for data analysis. One example are the techniques that arrange data instances into two-dimensional grids so that the pairwise distances among the instances are preserved into the produced layouts. Currently, the state-of-the-art approaches produce such grids by solving assignment problems or using permutations to optimize cost functions. Although precise, such strategies are computationally expensive, limited to small datasets or being dependent on specialized hardware to speed up the process. In this paper, we present a new technique, called Distance-preserving Grid (DGrid), that employs a binary space partitioning process in combination with multidimensional projections to create orthogonal regular grid layouts. Our results show that DGrid is as precise as the existing state-of-the-art techniques whereas requiring only a fraction of the running time and computational resources. DistCache Load balancing is critical for distributed storage to meet strict service-level objectives (SLOs). It has been shown that a fast cache can guarantee load balancing for a clustered storage system. However, when the system scales out to multiple clusters, the fast cache itself would become the bottleneck. Traditional mechanisms like cache partition and cache replication either result in load imbalance between cache nodes or have high overhead for cache coherence. We present DistCache, a new distributed caching mechanism that provides provable load balancing for large-scale storage systems. DistCache co-designs cache allocation with cache topology and query routing. The key idea is to partition the hot objects with independent hash functions between cache nodes in different layers, and to adaptively route queries with the power-of-two-choices. We prove that DistCache enables the cache throughput to increase linearly with the number of cache nodes, by unifying techniques from expander graphs, network flows, and queuing theory. DistCache is a general solution that can be applied to many storage systems. We demonstrate the benefits of DistCache by providing the design, implementation, and evaluation of the use case for emerging switch-based caching. Distill and Transfer Learning(Distral) Most deep reinforcement learning algorithms are data inefficient in complex and rich environments, limiting their applicability to many scenarios. One direction for improving data efficiency is multitask learning with shared neural network parameters, where efficiency may be improved through transfer across related tasks. In practice, however, this is not usually observed, because gradients from different tasks can interfere negatively, making learning unstable and sometimes even less data efficient. Another issue is the different reward schemes between tasks, which can easily lead to one task dominating the learning of a shared model. We propose a new approach for joint training of multiple tasks, which we refer to as Distral (Distill & transfer learning). Instead of sharing parameters between the different workers, we propose to share a ‘distilled’ policy that captures common behaviour across tasks. Each worker is trained to solve its own task while constrained to stay close to the shared policy, while the shared policy is trained by distillation to be the centroid of all task policies. Both aspects of the learning process are derived by optimizing a joint objective function. We show that our approach supports efficient transfer on complex 3D environments, outperforming several related methods. Moreover, the proposed learning process is more robust and more stable—attributes that are critical in deep reinforcement learning. Distilled Dropout Network(DDN) We propose an efficient way to output better calibrated uncertainty scores from neural networks. The Distilled Dropout Network (DDN) makes standard (non-Bayesian) neural networks more introspective by adding a new training loss which prevents them from being overconfident. Our method is more efficient than Bayesian neural networks or model ensembles which, despite providing more reliable uncertainty scores, are more cumbersome to train and slower to test. We evaluate DDN on the the task of image classification on the CIFAR-10 dataset and show that our calibration results are competitive even when compared to 100 Monte Carlo samples from a dropout network while they also increase the classification accuracy. We also propose better calibration within the state of the art Faster R-CNN object detection framework and show, using the COCO dataset, that DDN helps train better calibrated object detectors. Distilled-Exposition Enhanced Matching Network(DEMN) This paper proposes a Distilled-Exposition Enhanced Matching Network (DEMN) for story-cloze test, which is still a challenging task in story comprehension. We divide a complete story into three narrative segments: an \textit{exposition}, a \textit{climax}, and an \textit{ending}. The model consists of three modules: input module, matching module, and distillation module. The input module provides semantic representations for the three segments and then feeds them into the other two modules. The matching module collects interaction features between the ending and the climax. The distillation module distills the crucial semantic information in the exposition and infuses it into the matching module in two different ways. We evaluate our single and ensemble model on ROCStories Corpus \cite{Mostafazadeh2016ACA}, achieving an accuracy of 80.1\% and 81.2\% on the test set respectively. The experimental results demonstrate that our DEMN model achieves a state-of-the-art performance. DistMult Knowledge Base Completion: Baselines Strike Back Distributed Chebyshev-Accelerated Primal-Dual Algorithm We consider a distributed optimization problem over a network of agents aiming to minimize a global objective function that is the sum of local convex and composite cost functions. To this end, we propose a distributed Chebyshev-accelerated primal-dual algorithm to achieve faster ergodic convergence rates. In standard distributed primal-dual algorithms, the speed of convergence towards a global optimum (i.e., a saddle point in the corresponding Lagrangian function) is directly influenced by the eigenvalues of the Laplacian matrix representing the communication graph. In this paper, we use Chebyshev matrix polynomials to generate gossip matrices whose spectral properties result in faster convergence speeds, while allowing for a fully distributed implementation. As a result, the proposed algorithm requires fewer gradient updates at the cost of additional rounds of communications between agents. We illustrate the performance of the proposed algorithm in a distributed signal recovery problem. Our simulations show how the use of Chebyshev matrix polynomials can be used to improve the convergence speed of a primal-dual algorithm over communication networks, especially in networks with poor spectral properties, by trading local computation by communication rounds. Distributed Computing Distributed computing is a field of computer science that studies distributed systems. A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications. A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs. There are many alternatives for the message passing mechanism, including RPC-like connectors and message queues. A goal and challenge pursued by some computer scientists and practitioners in distributed systems is location transparency; however, this goal has fallen out of favour in industry, as distributed systems are different from conventional non-distributed systems, and the differences, such as network partitions, partial system failures, and partial upgrades, cannot simply be ‘papered over’ by attempts at ‘transparency’ – see CAP theorem. Distributed computing also refers to the use of distributed systems to solve computational problems. In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers, which communicate with each other by message passing. Distributed Convolutional Dictionary Learning(DiCoDiLe) Convolutional dictionary learning (CDL) estimates shift invariant basis adapted to multidimensional data. CDL has proven useful for image denoising or inpainting, as well as for pattern discovery on multivariate signals. As estimated patterns can be positioned anywhere in signals or images, optimization techniques face the difficulty of working in extremely high dimensions with millions of pixels or time samples, contrarily to standard patch-based dictionary learning. To address this optimization problem, this work proposes a distributed and asynchronous algorithm, employing locally greedy coordinate descent and an asynchronous locking mechanism that does not require a central server. This algorithm can be used to distribute the computation on a number of workers which scales linearly with the encoded signal’s size. Experiments confirm the scaling properties which allows us to learn patterns on large scales images from the Hubble Space Telescope. Distributed Cooperative Logistics Platform(DCLP) Supply Chains and Logistics have a growing importance in global economy. Supply Chain Information Systems over the world are heterogeneous and each one can both produce and receive massive amounts of structured and unstructured data in real-time, which are usually generated by information systems, connected objects or manually by humans. This heterogeneity is due to Logistics Information Systems components and processes that are developed by different modelling methods and running on many platforms; hence, decision making process is difficult in such multi-actor environment. In this paper we identify some current challenges and integration issues between separately designed Logistics Information Systems (LIS), and we propose a Distributed Cooperative Logistics Platform (DCLP) framework based on NoSQL, which facilitates real-time cooperation between stakeholders and improves decision making process in a multi-actor environment. We included also a case study of Hospital Supply Chain (HSC), and a brief discussion on perspectives and future scope of work. Distributed Coordinated Multi-Agent Bidding(DCMAB) Real-time advertising allows advertisers to bid for each impression for a visiting user. To optimize a specific goal such as maximizing the revenue led by ad placements, advertisers not only need to estimate the relevance between the ads and user’s interests, but most importantly require a strategic response with respect to other advertisers bidding in the market. In this paper, we formulate bidding optimization with multi-agent reinforcement learning. To deal with a large number of advertisers, we propose a clustering method and assign each cluster with a strategic bidding agent. A practical Distributed Coordinated Multi-Agent Bidding (DCMAB) has been proposed and implemented to balance the tradeoff between the competition and cooperation among advertisers. The empirical study on our industry-scaled real-world data has demonstrated the effectiveness of our modeling methods. Our results show that a cluster based bidding would largely outperform single-agent and bandit approaches, and the coordinated bidding achieves better overall objectives than the purely self-interested bidding agents. Distributed Data Shuffling Data shuffling of training data among different computing nodes (workers) has been identified as a core element to improve the statistical performance of modern large scale machine learning algorithms. Data shuffling is often considered one of the most significant bottlenecks in such systems due to the heavy communication load. Under a master-worker architecture (where a master has access to the entire dataset and only communications between the master and workers is allowed) coding has been recently proved to considerably reduce the communication load. In this work, we consider a different communication paradigm referred to as distributed data shuffling, where workers, connected by a shared link, are allowed to communicate with one another while no communication between the master and workers is allowed. Under the constraint of uncoded cache placement, we first propose a general coded distributed data shuffling scheme, which achieves the optimal communication load within a factor two. Then, we propose an improved scheme achieving the exact optimality for either large memory size or at most four workers in the system. Distributed Dynamic Data-intensive Science(D3 Science) A common feature across many science and engineering applications is the amount and diversity of data and computation that must be integrated to yield insights. Data sets are growing larger and becoming distributed; and their location, availability and properties are often time-dependent. Collectively, these characteristics give rise to dynamic distributed data-intensive applications. While ‘static’ data applications have received significant attention, the characteristics, requirements, and software systems for the analysis of large volumes of dynamic, distributed data, and data-intensive applications have received relatively less attention. This paper surveys several representative dynamic distributed data-intensive application scenarios, provides a common conceptual framework to understand them, and examines the infrastructure used in support of applications. Distributed Expectation-Maximization(DEM) The family of Expectation-Maximization (EM) algorithms provides a general approach to fitting flexible models for large and complex data. The expectation (E) step of EM-type algorithms is time-consuming in massive data applications because it requires multiple passes through the full data. We address this problem by proposing an asynchronous and distributed generalization of the EM called the Distributed EM (DEM). Using DEM, existing EM-type algorithms are easily extended to massive data settings by exploiting the divide-and-conquer technique and widely available computing power, such as grid computing. The DEM algorithm reserves two groups of computing processes called \emph{workers} and \emph{managers} for performing the E step and the maximization step (M step), respectively. The samples are randomly partitioned into a large number of disjoint subsets and are stored on the worker processes. The E step of DEM algorithm is performed in parallel on all the workers, and every worker communicates its results to the managers at the end of local E step. The managers perform the M step after they have received results from a $\gamma$-fraction of the workers, where $\gamma$ is a fixed constant in $(0, 1]$. The sequence of parameter estimates generated by the DEM algorithm retains the attractive properties of EM: convergence of the sequence of parameter estimates to a local mode and linear global rate of convergence. Across diverse simulations focused on linear mixed-effects models, the DEM algorithm is significantly faster than competing EM-type algorithms while having a similar accuracy. The DEM algorithm maintains its superior empirical performance on a movie ratings database consisting of 10 million ratings. Distributed Feature Extraction Tool(DIFET) In this paper, we propose distributed feature extraction tool from high spatial resolution remote sensing images. Tool is based on Apache Hadoop framework and Hadoop Image Processing Interface. Two corner detection (Harris and Shi-Tomasi) algorithms and five feature descriptors (SIFT, SURF, FAST, BRIEF, and ORB) are considered. Robustness of the tool in the task of feature extraction from LandSat-8 imageries are evaluated in terms of horizontal scalability. Distributed gRaph cOmputiNg Engine(DRONE) Nowadays, in big data era, social networks, graph database, knowledge graph, electronic commerce and etc. demand efficient and scalable capability to process ever increasingly volume of graph-structured data. To meet the challenge, two mainstream distributed programming models, vertex-centric VC and subgraph-centric (SC) were proposed. Compared to the VC model, the SC model converges faster with less communication overhead on well-partitioned graphs, and is easy to program with due to the ‘think like a graph’ philosophy. However, edge-cut method causes significant performance bottleneck for preprocessing large graphs, especially power-law graphs. Although the edge-cut method is considered as a natural choice of subgraph-centric model for graph partitioning, and adopted by Giraph++, Blogel, GRAPE. Thus, the SC model is less competitive in practice. In this paper, we present an innovative distributed graph computing framework, DRONE(Distributed gRaph cOmputiNg Engine). It combines the subgraph-centric model and the vertex-cut graph partitioning strategy. Experiments show that DRONE outperform the state-of-art distributed graph computing engines on real-world graphs and synthetic power-law graphs. DRONE is capable to scale up to process one-trillion-edges synthetic power-law graphs, which is orders of magnitude larger than previously reported by existing SC-based frameworks. Distributed Heavy-Ball We study distributed optimization to minimize a global objective that is a sum of smooth and strongly-convex local cost functions. Recently, several algorithms over undirected and directed graphs have been proposed that use a gradient tracking method to achieve linear convergence to the global minimizer. However, a connection between these different approaches has been unclear. In this paper, we first show that many of the existing first-order algorithms are in fact related with a simple state transformation, at the heart of which lies the~$\mc{AB}$ algorithm. We then describe \textit{distributed heavy-ball}, denoted as~$\mc{AB}m$, i.e.,~$\mc{AB}$ with momentum, that combines gradient tracking with a momentum term and uses nonidentical local step-sizes. By~simultaneously implementing both row- and column-stochastic weights,~$\mc{AB}m$ removes the conservatism in the related work due to doubly-stochastic weights or eigenvector estimation.~$\mc{AB}m$ thus naturally leads to optimization and average-consensus over both undirected and directed graphs, casting a unifying framework over several well-known consensus algorithms over arbitrary graphs. We show that~$\mathcal{AB}m$ has a global $R$-linear convergence when the largest step-size is positive and sufficiently small. Following the standard practice in the heavy-ball literature, we numerically show that~$\mc{AB}m$ achieves accelerated convergence especially when the objective function is ill-conditioned. Distributed Lag Model A distributed-lag model is a dynamic model in which the effect of a regressor x on y occurs over time rather than all at once. In the simple case of one explanatory variable and a linear relationship. This form is very similar to the infinite-moving-average representation of an ARMA process, except that the lag polynomial on the right-hand side is applied to the explanatory variable x rather than to a white-noise process e. The individual coefficients ßs are called lag weights and the collectively comprise the lag distribution. They define the pattern of how x affects y over time. Distributed Lance-William Clustering Algorithm One important tool is the optimal clustering of data into useful categories. Dividing similar objects into a smaller number of clusters is of importance in many applications. These include search engines, monitoring of academic performance, biology and wireless networks. We first discuss a number of clustering methods. We present a parallel algorithm for the efficient clustering of objects into groups based on their similarity to each other. The input consists of an n by n distance matrix. This matrix would have a distance ranking for each pair of objects. The smaller the number, the more similar the two objects are to each other. We utilize parallel processors to calculate a hierarchal cluster of these n items based on this matrix. Another advantage of our method is distribution of the large n by n matrix. We have implemented our algorithm and have found it to be scalable both in terms of processing speed and storage. Distributed Machine Learning Toolkit(DMTK) Distributed machine learning has become more important than ever in this big data era. Especially in recent years, practices have demonstrated the trend that bigger models tend to generate better accuracies in various applications. However, it remains a challenge for common machine learning researchers and practitioners to learn big models, because the task usually requires a large number of computation resources. In order to enable the training of big models using just a modest cluster and in an efficient manner, we release the Microsoft Distributed Machine Learning Toolkit (DMTK), which contains both algorithmic and system innovations. These innovations make machine learning tasks on big data highly scalable, efficient and flexible. Distributed Matrix A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Three types of distributed matrices have been implemented so far. The basic type is called RowMatrix. A RowMatrix is a row-oriented distributed matrix without meaningful row indices, e.g., a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. We assume that the number of columns is not huge for a RowMatrix so that a single local vector can be reasonably communicated to the driver and can also be stored / operated on using a single node. An IndexedRowMatrix is similar to a RowMatrix but with row indices, which can be used for identifying rows and executing joins. A CoordinateMatrix is a distributed matrix stored in coordinate list (COO) format, backed by an RDD of its entries. Distributed Newton-Type Method for Gradient-Norm Optimization(DINGO) For optimization of a sum of functions in a distributed computing environment, we present a novel communication efficient Newton-type algorithm that enjoys a variety of advantages over similar existing methods. Similar to Newton-MR, our algorithm, DINGO, is derived by optimization of the gradient’s norm as a surrogate function. DINGO does not impose any specific form on the underlying functions, and its application range extends far beyond convexity. In addition, the distribution of the data across the computing environment can be almost arbitrary. Further, the underlying sub-problems of DINGO are simple linear least-squares, for which a plethora of efficient algorithms exist. Lastly, DINGO involves a few hyper-parameters that are easy to tune. Moreover, we theoretically show that DINGO is not sensitive to the choice of its hyper-parameters in that a strict reduction in the gradient norm is guaranteed, regardless of the selected hyper-parameters. We demonstrate empirical evidence of the effectiveness, stability and versatility of our method compared to other relevant algorithms, on both convex and non-convex problems. Distributed Online Linear Regression We study online linear regression problems in a distributed setting, where the data is spread over a network. In each round, each network node proposes a linear predictor, with the objective of fitting the \emph{network-wide} data. It then updates its predictor for the next round according to the received local feedback and information received from neighboring nodes. The predictions made at a given node are assessed through the notion of regret, defined as the difference between their cumulative network-wide square errors and those of the best off-line network-wide linear predictor. Various scenarios are investigated, depending on the nature of the local feedback (full information or bandit feedback), on the set of available predictors (the decision set), and the way data is generated (by an oblivious or adaptive adversary). We propose simple and natural distributed regression algorithms, involving, at each node and in each round, a local gradient descent step and a communication and averaging step where nodes aim at aligning their predictors to those of their neighbors. We establish regret upper bounds typically in ${\cal O}(T^{3/4})$ when the decision set is unbounded and in ${\cal O}(\sqrt{T})$ in case of bounded decision set. Distributed Randomized Gradient-Free Mirror Descent(DRGFMD) This paper is concerned with multi-agent optimization problem. A distributed randomized gradient-free mirror descent (DRGFMD) method is developed by introducing a randomized gradient-free oracle in the mirror descent scheme where the non-Euclidean Bregman divergence is used. The classical gradient descent method is generalized without using subgradient information of objective functions. The proposed algorithm is the first distributed non-Euclidean zeroth-order method which achieves an $O(1/\sqrt{T})$ convergence rate, recovering the best known optimal rate of distributed compact constrained convex optimization. Also, the DRGFMD algorithm achieves an $O(\ln T/T)$ convergence rate for the strongly convex constrained optimization case. The rate matches the best known non-compact constraint result. Moreover, a decentralized reciprocal weighted average approximating sequence is investigated and first used in distributed algorithm. A class of convergence rates are also achieved for the algorithm with weighted averaging (DRGFMD-WA). The technique on constructing the decentralized weighted average sequence provides new insight in searching for minimizers in distributed algorithms. Distributed Robust Algorithm for Count-based Learning(Dracula) Distributed Self-Paced Learning Method(DSPL) Self-paced learning (SPL) mimics the cognitive process of humans, who generally learn from easy samples to hard ones. One key issue in SPL is the training process required for each instance weight depends on the other samples and thus cannot easily be run in a distributed manner in a large-scale dataset. In this paper, we reformulate the self-paced learning problem into a distributed setting and propose a novel Distributed Self-Paced Learning method (DSPL) to handle large-scale datasets. Specifically, both the model and instance weights can be optimized in parallel for each batch based on a consensus alternating direction method of multipliers. We also prove the convergence of our algorithm under mild conditions. Extensive experiments on both synthetic and real datasets demonstrate that our approach is superior to those of existing methods. Distributed Stream Data Processing System(DSDPS) In this paper, we focus on general-purpose Distributed Stream Data Processing Systems (DSDPSs), which deal with processing of unbounded streams of continuous data at scale distributedly in real or near-real time. A fundamental problem in a DSDPS is the scheduling problem with the objective of minimizing average end-to-end tuple processing time. A widely-used solution is to distribute workload evenly over machines in the cluster in a round-robin manner, which is obviously not efficient due to lack of consideration for communication delay. Model-based approaches do not work well either due to the high complexity of the system environment. We aim to develop a novel model-free approach that can learn to well control a DSDPS from its experience rather than accurate and mathematically solvable system models, just as a human learns a skill (such as cooking, driving, swimming, etc). Specifically, we, for the first time, propose to leverage emerging Deep Reinforcement Learning (DRL) for enabling model-free control in DSDPSs; and present design, implementation and evaluation of a novel and highly effective DRL-based control framework, which minimizes average end-to-end tuple processing time by jointly learning the system environment via collecting very limited runtime statistics data and making decisions under the guidance of powerful Deep Neural Networks. To validate and evaluate the proposed framework, we implemented it based on a widely-used DSDPS, Apache Storm, and tested it with three representative applications. Extensive experimental results show 1) Compared to Storm’s default scheduler and the state-of-the-art model-based method, the proposed framework reduces average tuple processing by 33.5% and 14.0% respectively on average. 2) The proposed framework can quickly reach a good scheduling solution during online learning, which justifies its practicability for online control in DSDPSs. Distributed Stream Processing Engines(DSPE) Distributed stream processing engines (DSPEs) are a new emergent family of MapReduce inspired technologies that address this issue. These engines allow to express parallel computation on streams, and combine the scalability of distributed processing with the efficiency of streaming algorithms. Examples of these engines include Storm, S4, and Samza. Distribution Networks for Open Set Learning In open set learning, a model must be able to generalize to novel classes when it encounters a sample that does not belong to any of the classes it has seen before. Open set learning poses a realistic learning scenario that is receiving growing attention. Existing studies on open set learning mainly focused on detecting novel classes, but few studies tried to model them for differentiating novel classes. We recognize that novel classes should be different from each other, and propose distribution networks for open set learning that can learn and model different novel classes. We hypothesize that, through a certain mapping, samples from different classes with the same classification criterion should follow different probability distributions from the same distribution family. We estimate the probability distribution for each known class and a novel class is detected when a sample is not likely to belong to any of the known distributions. Due to the large feature dimension in the original feature space, the probability distributions in the original feature space are difficult to estimate. Distribution networks map the samples in the original feature space to a latent space where the distributions of known classes can be jointly learned with the network. In the latent space, we also propose a distribution parameter transfer strategy for novel class detection and modeling. By novel class modeling, the detected novel classes can serve as known classes to the subsequent classification. Our experimental results on image datasets MNIST and CIFAR10 and text dataset Ohsumed show that the distribution networks can detect novel classes accurately and model them well for the subsequent classification tasks. Distribution Regression Linear regression is a fundamental and popular statistical method. There are various kinds of linear regression, such as mean regression and quantile regression. In this paper, we propose a new one called distribution regression, which allows broad-spectrum of the error distribution in the linear regression. Our method uses nonparametric technique to estimate regression parameters. Our studies indicate that our method provides a better alternative than mean regression and quantile regression under many settings, particularly for asymmetrical heavy-tailed distribution or multimodal distribution of the error term. Under some regular conditions, our estimator is $\sqrt n$-consistent and possesses the asymptotically normal distribution. The proof of the asymptotic normality of our estimator is very challenging because our nonparametric likelihood function cannot be transformed into sum of independent and identically distributed random variables. Furthermore, penalized likelihood estimator is proposed and enjoys the so-called oracle property with diverging number of parameters. Numerical studies also demonstrate the effectiveness and the flexibility of the proposed method. Distribution Regression Network(DRN) We introduce our Distribution Regression Network (DRN) which performs regression from input probability distributions to output probability distributions. Compared to existing methods, DRN learns with fewer model parameters and easily extends to multiple input and multiple output distributions. On synthetic and real-world datasets, DRN performs similarly or better than the state-of-the-art. Furthermore, DRN generalizes the conventional multilayer perceptron (MLP). In the framework of MLP, each node encodes a real number, whereas in DRN, each node encodes a probability distribution. Distribution Separation Method(DSM) Distribution Shifting Convolution(DSConv) We introduce a variation of the convolutional layer called DSConv (Distribution Shifting Convolution) that can be readily substituted into standard neural network architectures and achieve both lower memory usage and higher computational speed. DSConv breaks down the traditional convolution kernel into two components: Variable Quantized Kernel (VQK), and Distribution Shifts. Lower memory usage and higher speeds are achieved by storing only integer values in the VQK, whilst preserving the same output as the original convolution by applying both kernel and channel based distribution shifts. We test DSConv in ImageNet on ResNet50 and 34, as well as AlexNet and MobileNet. We achieve a reduction in memory usage of up to 14x in the convolutional kernels and speed up operations of up to 10x by substituting floating point operations to integer operations. Furthermore, unlike other quantization approaches, our work allows for a degree of retraining to new tasks and datasets. Distribution System Repair and Restoration Problem(DSRRP) ➘ “Mixed-Integer Linear Programming” Distributional Adversarial Networks We propose a framework for adversarial training that relies on a sample rather than a single sample point as the fundamental unit of discrimination. Inspired by discrepancy measures and two-sample tests between probability distributions, we propose two such distributional adversaries that operate and predict on samples, and show how they can be easily implemented on top of existing models. Various experimental results show that generators trained with our distributional adversaries are much more stable and are remarkably less prone to mode collapse than traditional models trained with pointwise prediction discriminators. The application of our framework to domain adaptation also results in considerable improvement over recent state-of-the-art. Distributional Approach to Reinforcement Learning(DiRL) Distributional Multivariate Policy Evaluation and Exploration with the Bellman GAN Distributional Reinforcement Learning(Distributional RL) In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent. This is in contrast to the common approach to reinforcement learning which models the expectation of this return, or value. Although there is an established body of literature studying the value distribution, thus far it has always been used for a specific purpose such as implementing risk-aware behaviour. We begin with theoretical results in both the policy evaluation and control settings, exposing a significant distributional instability in the latter. We then use the distributional perspective to design a new algorithm which applies Bellman’s equation to the learning of approximate value distributions. We evaluate our algorithm using the suite of games from the Arcade Learning Environment. We obtain both state-of-the-art results and anecdotal evidence demonstrating the importance of the value distribut