Have you noticed the growing array of job positions with the word data in front of them? Back in 2012, the job title ‘data scientist’ was deemed ‘the sexiest job of the 21st century’ by Harvard Business Review. Ever since, and as correctly predicted, numerous spin-offs and new job titles featuring the word data have proliferated, and the ambition to become a priceless ‘data science unicorn’ has been matched by an ongoing talent war and a feeding frenzy among companies and headhunters desperate to discover and catch one.

If you ask any group of data science students about the types of machine learning algorithms, they will answer without hesitation: supervised and unsupervised. However, if we ask that same group to list different types of unsupervised learning, we are likely to get an answer like clustering but not much more. While supervised methods lead the current wave of innovation in areas such as deep learning, there is very little doubt that the future of artificial intelligence (AI) will transition towards more unsupervised forms of learning. In recent years, we have seen a lot of progress on several new forms of unsupervised learning methods that expand way beyond traditional clustering or principal component analysis (PCA) techniques. Today, I would like to explore some of the most prominent new schools of thought in the unsupervised space and their role in the future of AI.

As a data scientist, you expect to get a job that lets you do cool stuff: big data, big machines (or the cloud, like the grown-ups) and deep neural networks. Reality quickly creeps in as you realise the mismatch between your model, your project manager’s timeline and your stakeholders’ expectations. What they (often) needed is not a 128-layer ResNet, but a simple SELECT & GROUP BY query that delivers actionable insights. There you are, two months into your new job, with your shiny model that just got shelved, grunting: ‘What is actionable insights anyway? Insights for who? Actioned with what?’ This article outlines the common traps in (not) defining the data science project scope, along with tips and a framework for defusing or preventing these situations.
Trap 1: The Scope is too Broad or Undefined
Trap 2: The Scope is too Narrow or Defined
Tip 1: Find out What Business is Really Asking
Tip 2: Don’t Just get Thrown a Problem, Help Define it

During a recent project, I was working on a clustering problem with data collected from users of a mobile app. The goal was to classify the users in terms of their behavior, potentially using k-means clustering. However, after inspecting the data it turned out that some users exhibited abnormal behavior – they were outliers. Many machine learning algorithms suffer in terms of performance when outliers are not taken care of. To avoid this kind of problem you could, for example, drop them from your sample, cap the values at some reasonable point (based on domain knowledge) or transform the data. In this article, however, I would like to focus on identifying outliers and leave the possible solutions for another time. Since I was taking a lot of features into consideration, I ideally wanted an algorithm that would identify the outliers in a multidimensional space. That is when I came across Isolation Forest, a method which is similar in principle to the well-known and popular Random Forest. In this article, I will focus on the Isolation Forest, without describing in detail the ideas behind decision trees and ensembles, as there is already a plethora of good sources available.
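To make the idea concrete, here is a deliberately tiny, pure-Python sketch of the isolation principle (not the article’s implementation – in practice you would reach for scikit-learn’s IsolationForest): points are separated by recursive random splits, and outliers need far fewer splits than inliers.

```python
import random

def isolation_depth(point, data, depth=0, max_depth=12):
    """Depth at which `point` is isolated by random axis-aligned splits.
    Outliers tend to end up alone after very few splits."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = random.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth
    split = random.uniform(lo, hi)
    # keep only the half of the data on the same side of the split as `point`
    same_side = [row for row in data if (row[dim] < split) == (point[dim] < split)]
    return isolation_depth(point, same_side, depth + 1, max_depth)

def average_isolation_depth(point, data, n_trees=50):
    """Average the random isolation depth over many trees (the 'forest')."""
    return sum(isolation_depth(point, data) for _ in range(n_trees)) / n_trees
```

A lower average isolation depth marks a point as more anomalous; the real algorithm adds subsampling and a normalized anomaly score on top of this idea.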

Regularization helps to overcome the problem of overfitting a model. Overfitting happens when a model tries to learn too much from the training data and picks up noise along the way – here, noise means data points that don’t really represent the true properties of the data. An overfit model scores well on its training data but has poor accuracy on new data; managing overfitting is a matter of balancing bias and variance.
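As a minimal illustration (a one-feature model without intercept, purely for exposition), the closed-form ridge solution shows how an L2 penalty shrinks the fitted weight, trading a little bias for lower variance:

```python
def ridge_fit_1d(xs, ys, lam):
    """One-feature ridge regression (no intercept): the closed-form
    minimizer of sum((y - w*x)^2) + lam * w^2, i.e.
    w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)
```

With `lam = 0` this is ordinary least squares; any positive `lam` pulls the weight towards zero, which is exactly the shrinkage that regularization uses to damp the influence of noise.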

According to Gartner’s 2019 CIO Survey, AI adoption by businesses grew 270% over the last four years, and over 37% of businesses have implemented AI in some facet. Businesses are adopting the technology at staggering rates, and Chief Information Officers and data scientists are facing difficult decisions regarding which speed of AI fits their business needs. AI can be broken down into three scoring patterns: batch, event-driven, and real-time. Each scoring pattern provides different capabilities, depending on the goal of the model. For example, while batch computing may work ideally in a payroll setting, it would not be an effective way to track fraud in banking transactions. By understanding these three patterns and their trade-offs, companies can match each model to the scoring speed that delivers the most value.

Time series forecasting is the use of statistical methods to predict future behavior based on historical data. This is similar to other statistical learning approaches, such as supervised or unsupervised learning. However, time series forecasting has many nuances that make it different from regular machine learning. From data processing all the way to model validation, time series forecasting is a different beast. Many companies are exploring time series forecasting as a way of making better business decisions. Take a hotel as an example. If a manager has a good idea of how many guests to expect next summer, they can use these insights to plan for staff management, budget, or even a facility expansion. Likewise, confident insights for future events can benefit a wide range of industries and problems, from traditional agriculture to on-demand transportation and more. In this article, we explore the fundamentals of time series data. We talk about how very simple forecasting methods work. Plus, we describe the most common patterns found in time series data.
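The very simple forecasting methods mentioned above fit in a few lines; this sketch assumes the series is a plain Python list of observations:

```python
def naive_forecast(series):
    """Naive method: tomorrow looks like today (the last observation)."""
    return series[-1]

def seasonal_naive_forecast(series, season_length):
    """Seasonal naive: next summer looks like last summer, i.e. repeat
    the observation from one full season ago."""
    return series[-season_length]

def moving_average_forecast(series, window=3):
    """Average the last few observations to smooth out noise."""
    return sum(series[-window:]) / window
```

For the hotel example, monthly guest counts with `season_length=12` would make the seasonal naive forecast for next July simply last July’s value – a surprisingly strong baseline that any fancier model must beat.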

As you move to the cloud, there’s a lot to learn about using containers to simplify the deployment and management of your applications. Advance your skillset with this e-book from Packt, along with an accompanying GitHub repository of hands-on labs to help you gain experience.
Discover how to package and run both monolithic and new, microservice-based applications in containers by following real-world examples of best practices in action. Delve into the design of the Kubernetes platform to understand the steps you’ll need to take to successfully use Kubernetes with Docker to manage containers. Learn how to:
Use containers to manage, scale and orchestrate applications.
Improve the efficiency and security of containers.
Build highly available and scalable Kubernetes clusters.
Design and deploy large clusters on various cloud platforms.
Update or roll back a distributed application with zero downtime.
Monitor and secure your deployments.

Logistic regression is a statistical method for analysing a dataset in which one or more independent variables determine an outcome, and the outcome is measured with a dichotomous variable – one with only two possible values. In other words, it is used to predict a binary outcome (1/0, Yes/No, True/False) from a set of independent variables.
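A minimal from-scratch sketch (one feature, plain gradient descent on the log-loss, illustrative constants) shows the mechanics behind the definition:

```python
import math

def sigmoid(z):
    """Squash a linear score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_1d(xs, ys, lr=0.1, n_iter=2000):
    """One-feature logistic regression fitted by gradient descent;
    returns (intercept, slope)."""
    b, w = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        grad_b = grad_w = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b + w * x)        # predicted P(y = 1 | x)
            grad_b += (p - y) / n         # gradient of the mean log-loss
            grad_w += (p - y) * x / n
        b -= lr * grad_b
        w -= lr * grad_w
    return b, w

def predict_1d(b, w, x):
    """Classify as 1 when the predicted probability reaches 0.5."""
    return 1 if sigmoid(b + w * x) >= 0.5 else 0
```

Real work would use a library implementation (with regularization and multiple features), but the predicted probability followed by a 0.5 threshold is exactly the binary-outcome prediction the paragraph describes.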

Say the words ‘inferential statistics’ and often people’s eyes will open wide in horror, but they need not be so scary, and once you know what they’re doing they can be a useful tool to have. Like it or not, data science is built around interpreting data and telling the story it gives us. To that end, for people to have confidence (p-values, anyone?) in what we say, we often need to quantify the values we are giving. This requires a knowledge and understanding of statistics, and therefore of error estimation. For example, being able to tell whether a large sudden change on a graph is real or not can be the difference between a company spending resources on correcting it or leaving it alone. For that reason, it is vitally important that you can understand, interpret and explain the three main types of interval you may see or use when results are presented. So, I’ll be talking about confidence intervals and their two siblings (prediction and tolerance intervals), along with confidence levels.
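As a concrete example, here is a normal-approximation confidence interval for a sample mean (a sketch: for small samples you would use the t distribution, and prediction and tolerance intervals are computed differently and come out wider, since they must cover individual future observations or a proportion of the population rather than just the mean):

```python
from statistics import mean, stdev, NormalDist

def confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # ~1.96 for a 95% interval
    m = mean(sample)
    se = stdev(sample) / len(sample) ** 0.5      # standard error of the mean
    return (m - z * se, m + z * se)
```

Raising the confidence level widens the interval – the usual trade-off between how sure you are and how precise a statement you can make.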

At the Summer School of the Swiss Association of Actuaries in Lausanne, I will start talking about text-based data and NLP this Thursday. Slides are available online.

Processing unstructured text data in real-time is challenging when applying NLP or NLU. Find out how Domain-Specific Language Processing can also help mine valuable information from data by following your guidance and using the language of your business.

As a Data Scientist you will often be handed a set of data and asked a question. Clear logical thinking can really help tease out the solution, so let us start with a made-up, farcical example to get your brain cells going.

BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model developed by Google. It represented one of the major machine learning breakthroughs of the year, as it achieved state-of-the-art results across 11 different Natural Language Processing (NLP) tasks. Moreover, Google open-sourced the code and made pretrained models available for download similar to computer vision models pretrained on ImageNet. For these reasons, there continues to be a great deal of interest in BERT (even as other models slightly overtake it). While BERT broke records on many different tasks from Question-Answering (SQuAD v1.1) to Natural Language Inference, text classification remains one of the most practically useful and widely applicable NLP tasks. In this article, we will show how you can apply BERT to the problem of text classification in as little as 3 lines of code. To accomplish this, we will be using ktrain, a fastai-like interface to Keras.

A while back I wrote a post demonstrating how to bootstrap follow-up contrasts for repeated-measures ANOVAs for cases where your data violates some or all assumptions. Here is a demo of how to conduct the same bootstrap analysis more simply (no need to make your data wide!).

This blog post is an excerpt of my ebook Modern R with the tidyverse that you can read for free here. This is taken from Chapter 8, in which I discuss advanced functional programming methods for modeling. As written just above (note: as written above in the book), map() simply applies a function to a list of inputs, and in the previous section we mapped ggplot() to generate many plots at once. This approach can also be used to map any modeling functions, for instance lm() to a list of datasets. For instance, suppose that you wish to perform a Monte Carlo simulation. Suppose that you are dealing with a binary choice problem; usually, you would use a logistic regression for this.
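The map-a-model-over-datasets pattern carries over directly to other languages. Here is a hypothetical Python sketch of a small Monte Carlo setup for a binary-choice problem, with a deliberately trivial stand-in estimator where the R original would fit a logistic regression with glm():

```python
import math
import random

def simulate_dataset(n, seed):
    """One simulated dataset for a binary-choice problem: the outcome is 1
    with probability given by a logistic curve in x (a hypothetical
    data-generating process chosen for illustration)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-2, 2)
        p = 1.0 / (1.0 + math.exp(-(0.5 + 1.5 * x)))
        data.append((x, 1 if rng.random() < p else 0))
    return data

def positive_rate(dataset):
    """A deliberately simple stand-in for a fitted model: the share of 1s."""
    return sum(y for _, y in dataset) / len(dataset)

# map() applies the same fitting function to every simulated dataset at once,
# mirroring map(glm, list_of_datasets) in the R original.
datasets = [simulate_dataset(500, seed) for seed in range(20)]
estimates = list(map(positive_rate, datasets))
```

The spread of `estimates` across the simulated datasets is exactly what a Monte Carlo study examines; swapping `positive_rate` for a real model-fitting function changes nothing about the mapping pattern.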

With the recent progress in machine learning, boosted by techniques such as deep learning, many tasks can be successfully solved once a large enough dataset is available for training. Nonetheless, human-annotated datasets are often expensive to produce, especially when labels are fine-grained, as is the case of Named Entity Recognition (NER), a task that operates with labels at the word level. In this paper, we propose a method to automatically generate labeled datasets for NER from public data sources by exploiting links and structured data from DBpedia and Wikipedia. Due to the massive size of these data sources, the resulting dataset — SESAME, available at https://sesame-pt.github.io — is composed of millions of labeled sentences. We detail the method to generate the dataset, report relevant statistics, and design a baseline using a neural network, showing that our dataset helps build better NER predictors.

Machine learning algorithms learn from data and use data from databases that are mutable; therefore, the data and the results of machine learning cannot be fully trusted. Also, the machine learning process is often difficult to automate. A unified analytical framework for trustable machine learning has been presented in the literature. It proposed building a trustable machine learning system by using blockchain technology, which can store data in a permanent and immutable way. In addition, smart contracts on blockchain are used to automate the machine learning process. In the proposed framework, a core machine learning algorithm can have three implementations: server layer implementation, streaming layer implementation, and smart contract implementation. However, there are still open questions. First, the streaming layer usually deploys on edge devices and therefore has limited memory and computing power. How can we run machine learning on the streaming layer? Second, most data that are stored on blockchain are financial transactions, for which fraud detection is often needed. However, in some applications, training data are hard to obtain. Can we build good machine learning models to do fraud detection with limited training data? These questions motivated this paper, which makes two contributions. First, it proposes training a machine learning model on the server layer and saving the model with a special binary data format. Then, the streaming layer can take this blob of binary data as input and score incoming data online. The blob of binary data is very compact and can be deployed on edge devices. Second, the paper presents a new method of synthetic data generation that can enrich the training data set. Experiments show that this synthetic data generation is very effective in applications such as fraud detection in financial data.

We present a new local entity disambiguation system. The key to our system is a novel approach for learning entity representations. In our approach we learn an entity-aware extension of Embeddings from Language Models (ELMo) which we call Entity-ELMo (E-ELMo). Given a paragraph containing one or more named entity mentions, each mention is first represented as a function of the entire paragraph (including other mentions), and this representation is then used to predict the referent entities. Utilizing E-ELMo for local entity disambiguation, we outperform all of the state-of-the-art local and global models on the popular benchmarks, improving micro average accuracy by about 0.5% on AIDA test-b with the Yago candidate set. The evaluation setup of the training data and candidate set are the same as our baselines for fair comparison.

Recently, there has been strong interest in developing natural language applications that live on personal devices such as mobile phones, watches and IoT with the objective to preserve user privacy and have low memory. Advances in Locality-Sensitive Hashing (LSH)-based projection networks have demonstrated state-of-the-art performance without any embedding lookup tables and instead computing on-the-fly text representations. However, previous works have not investigated ‘What makes projection neural networks effective at capturing compact representations for text classification?’ and ‘Are these projection models resistant to perturbations and misspellings in input text?’. In this paper, we analyze and answer these questions through perturbation analyses and by running experiments on multiple dialog act prediction tasks. Our results show that the projections are resistant to perturbations and misspellings compared to widely-used recurrent architectures that use word embeddings. On ATIS intent prediction task, when evaluated with perturbed input data, we observe that the performance of recurrent models that use word embeddings drops significantly by more than 30% compared to just 5% with projection networks, showing that LSH-based projection representations are robust and consistently lead to high quality performance.

Data stream classification methods demonstrate promising performance on a single data stream by exploring the cohesion in the data stream. However, multiple data streams that involve several correlated data streams are common in many practical scenarios, which can be viewed as multi-task data streams. Instead of handling them separately, it is beneficial to consider the correlations among the multi-task data streams for data stream modeling tasks. In this regard, a novel classification method, double-coupling support vector machines (DC-SVM), is proposed for classifying them simultaneously. DC-SVM considers the external correlations between multiple data streams, while handling the internal relationship within the individual data stream. Experimental results on artificial and real-world multi-task data streams demonstrate that the proposed method outperforms traditional data stream classification methods.

Model-based Reinforcement Learning (MBRL) allows data-efficient learning, which is required in real-world applications such as robotics. However, despite the impressive data-efficiency, MBRL does not achieve the final performance of state-of-the-art Model-free Reinforcement Learning (MFRL) methods. We leverage the strengths of both realms and propose an approach that obtains high performance with a small amount of data. In particular, we combine MFRL and Model Predictive Control (MPC). While MFRL’s strength in exploration allows us to train a better forward dynamics model for MPC, MPC improves the performance of the MFRL policy by sampling-based planning. The experimental results in standard continuous control benchmarks show that our approach can achieve MFRL’s level of performance while being as data-efficient as MBRL.

The Pearson distance between a pair of random variables with correlation ρ, namely 1 − ρ, has gained widespread use, particularly for clustering, in areas such as gene expression analysis, brain imaging and cyber security. In all these applications it is implicitly assumed/required that the distance measure be a metric, thus satisfying the triangle inequality. We show, however, that the Pearson distance is not a metric. We go on to show that this can be repaired by recalling the result (well known in other literature) that √(1 − ρ) is a metric. We similarly show that a related measure of interest, 1 − ρ², which is invariant to the sign of ρ, is not a metric but that √(1 − ρ²) is. We also give generalizations of these results.
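A small numerical check makes the claim tangible: three zero-mean vectors whose pairwise correlations are cos 60° and cos 120° violate the triangle inequality under 1 − ρ, yet satisfy it under √(1 − ρ) (which, for standardized variables, is proportional to the ordinary Euclidean distance and hence a metric):

```python
import math

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    na = math.sqrt(sum((x - ma) ** 2 for x in a))
    nb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (na * nb)

# An orthonormal pair spanning the zero-mean plane in 3 dimensions, so that
# vectors built at angles t have pairwise correlation cos(t1 - t2).
u = (1 / math.sqrt(2), -1 / math.sqrt(2), 0.0)
v = (1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6))

def at_angle(t):
    return tuple(math.cos(t) * ui + math.sin(t) * vi for ui, vi in zip(u, v))

x, y, z = at_angle(0.0), at_angle(math.pi / 3), at_angle(2 * math.pi / 3)

d = lambda a, b: 1 - corr(a, b)                    # Pearson distance
d_fixed = lambda a, b: math.sqrt(1 - corr(a, b))   # repaired version
```

Here d(x, z) = 1.5 exceeds d(x, y) + d(y, z) = 1.0, so 1 − ρ is not a metric, while the square-rooted version passes the same test.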

This paper studies stochastic optimization problems with polynomials. We propose an optimization model with sample averages and perturbations. The Lasserre type Moment-SOS relaxations are used to solve the sample average optimization. Properties of the optimization and its relaxations are studied. Numerical experiments are presented.

Predictions and predictive knowledge have seen recent success in improving not only robot control but also other applications ranging from industrial process control to rehabilitation. A property that makes these predictive approaches well suited for robotics is that they can be learned online and incrementally through interaction with the environment. However, a remaining challenge for many prediction-learning approaches is an appropriate choice of prediction-learning parameters, especially parameters that control the magnitude of a learning machine’s updates to its predictions (the learning rate or step size). To begin to address this challenge, we examine the use of online step-size adaptation using a sensor-rich robotic arm. Our method of choice, Temporal-Difference Incremental Delta-Bar-Delta (TIDBD), learns and adapts step sizes on a feature level; importantly, TIDBD allows step-size tuning and representation learning to occur at the same time. We show that TIDBD is a practical alternative for classic Temporal-Difference (TD) learning via an extensive parameter search. Both approaches perform comparably in terms of predicting future aspects of a robotic data stream. Furthermore, the use of a step-size adaptation method like TIDBD appears to allow a system to automatically detect and characterize common sensor failures in a robotic application. Together, these results promise to improve the ability of robotic devices to learn from interactions with their environments in a robust way, providing key capabilities for autonomous agents and robots.
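For intuition about per-feature step-size adaptation, here is a sketch of IDBD for supervised least-mean-squares learning (Sutton’s precursor that TIDBD extends to temporal-difference learning); the constants are illustrative, not tuned:

```python
import math

def idbd_lms(data, theta=0.01, init_alpha=0.05):
    """Incremental Delta-Bar-Delta (IDBD): LMS learning where each
    feature's step size is itself learned online. `data` is a list of
    (feature_vector, target) pairs; returns final weights and the
    per-step squared errors."""
    n = len(data[0][0])
    w = [0.0] * n                       # prediction weights
    beta = [math.log(init_alpha)] * n   # log step sizes (adapted online)
    h = [0.0] * n                       # trace of recent weight updates
    errors = []
    for x, target in data:
        delta = target - sum(wi * xi for wi, xi in zip(w, x))
        errors.append(delta ** 2)
        for i in range(n):
            beta[i] += theta * delta * x[i] * h[i]   # meta-learning step
            alpha = math.exp(beta[i])                # per-feature step size
            w[i] += alpha * delta * x[i]             # ordinary LMS update
            h[i] = h[i] * max(0.0, 1.0 - alpha * x[i] ** 2) + alpha * delta * x[i]
    return w, errors
```

The point of the abstract is precisely that updates like `beta[i] += ...` remove the need to hand-tune a global step size, which matters on a sensor-rich robot where features behave very differently.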

In this paper, we propose a new method to build fair Neural-Network classifiers by using a constraint based on the Wasserstein distance. More specifically, we detail how to efficiently compute the gradients of Wasserstein-2 regularizers for Neural-Networks. The proposed strategy is then used to train Neural-Network decision rules which favor fair predictions. Our method fully takes into account two specificities of Neural-Network training: (1) The network parameters are indirectly learned based on automatic differentiation and on the loss gradients, and (2) batch training is the gold standard to approximate the parameter gradients, as it requires a reasonable amount of computations and it can efficiently explore the parameters space. Results are shown on synthetic data, as well as on the UCI Adult Income Dataset. Our method is shown to perform well compared with ‘ZafarICWWW17’ and linear-regression with Wasserstein-1 regularization, as in ‘JiangUAI19’, in particular when non-linear decision rules are required for accurate predictions.
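For intuition about the quantity being penalized: in one dimension, the Wasserstein-2 distance between two equal-size empirical samples has a simple closed form (this is a sketch of the distance itself, not of the paper’s gradient computation):

```python
def wasserstein2_1d(a, b):
    """Wasserstein-2 distance between two equal-size 1-D empirical
    distributions: the root-mean-square difference of the sorted samples
    (the optimal transport plan in 1-D matches sorted order)."""
    assert len(a) == len(b)
    sa, sb = sorted(a), sorted(b)
    return (sum((x - y) ** 2 for x, y in zip(sa, sb)) / len(a)) ** 0.5
```

In a fairness constraint, `a` and `b` would be the classifier’s scores on two demographic groups; driving this distance towards zero pushes the two score distributions together.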

Multimodal language analysis is an emerging research area in natural language processing that models language in a multimodal manner. It aims to understand language from the modalities of text, visual, and acoustic by modeling both intra-modal and cross-modal interactions. BERT (Bidirectional Encoder Representations from Transformers) provides strong contextual language representations after training on large-scale unlabeled corpora. Fine-tuning the vanilla BERT model has shown promising results in building state-of-the-art models for diverse NLP tasks like question answering and language inference. However, fine-tuning BERT in the presence of information from other modalities remains an open research problem. In this paper, we inject multimodal information within the input space of BERT network for modeling multimodal language. The proposed injection method allows BERT to reach a new state of the art of binary accuracy on CMU-MOSI dataset (multimodal sentiment analysis) with a gap of 5.98 percent to the previous state of the art and 1.02 percent to the text-only BERT.

Zero-shot learning (ZSL) is a challenging problem that aims to recognize the target categories without seen data, where semantic information is leveraged to transfer knowledge from some source classes. Although ZSL has made great progress in recent years, most existing approaches easily overfit the source classes in the generalized zero-shot learning (GZSL) task, which indicates that they learn little knowledge about target classes. To tackle this problem, we propose a novel Transferable Contrastive Network (TCN) that explicitly transfers knowledge from the source classes to the target classes. It automatically contrasts one image with different classes to judge whether they are consistent or not. By exploiting the class similarities to transfer knowledge from source images to similar target classes, our approach is more robust in recognizing the target images. Experiments on five benchmark datasets show the superiority of our approach for GZSL.

Recent systems for converting natural language descriptions into regular expressions have achieved some success, but typically deal with short, formulaic text and can only produce simple regular expressions, limiting their applicability. Real-world regular expressions are complex, hard to describe with brief sentences, and sometimes require examples to fully convey the user’s intent. We present a framework for regular expression synthesis in this setting where both natural language and examples are available. First, a semantic parser (either grammar-based or neural) maps the natural language description into an intermediate sketch, which is an incomplete regular expression containing holes to denote missing components. Then a program synthesizer enumerates the regular expression space defined by the sketch and finds a regular expression that is consistent with the given string examples. Our semantic parser can be trained from supervised or heuristically-derived sketches and additionally fine-tuned with weak supervision based on correctness of the synthesized regex. We conduct experiments on two public large-scale datasets (Kushman and Barzilay, 2013; Locascio et al., 2016) and a real-world dataset we collected from StackOverflow. Our system achieves state-of-the-art performance on the public datasets and successfully solves 57% of the real-world dataset, which existing neural systems completely fail on.

Learning with minimal data is one of the key challenges in the development of practical, production-ready goal-oriented dialogue systems. In a real-world enterprise setting where dialogue systems are developed rapidly and are expected to work robustly for an ever-growing variety of domains, products, and scenarios, efficient learning from a limited number of examples becomes indispensable. In this paper, we introduce a technique to achieve state-of-the-art dialogue generation performance in a few-shot setup, without using any annotated data. We do this by leveraging background knowledge from a larger, more highly represented dialogue source — namely, the MetaLWOz dataset. We evaluate our model on the Stanford Multi-Domain Dialogue Dataset, consisting of human-human goal-oriented dialogues in in-car navigation, appointment scheduling, and weather information domains. We show that our few-shot approach achieves state-of-the-art results on that dataset by consistently outperforming the previous best model in terms of BLEU and Entity F1 scores, while being more data-efficient by not requiring any data annotation.

This paper proposes a dually interactive matching network (DIM) for presenting the personalities of dialogue agents in retrieval-based chatbots. This model develops from the interactive matching network (IMN) which models the matching degree between a context composed of multiple utterances and a response candidate. Compared with previous persona fusion approaches which enhance the representation of a context by calculating its similarity with a given persona, the DIM model adopts a dual matching architecture, which performs interactive matching between responses and contexts and between responses and personas respectively for ranking response candidates. Experimental results on PERSONA-CHAT dataset show that the DIM model outperforms its baseline model, i.e., IMN with persona fusion, by a margin of 14.5% and outperforms the current state-of-the-art model by a margin of 27.7% in terms of top-1 accuracy hits@1.

Most existing person re-identification (ReID) methods learn good feature representations to distinguish pedestrians with deep convolutional neural network (CNN) and metric learning methods. However, these works concentrate on the similarity between encoder output and ground-truth, ignoring the correlation between input and encoder output, which affects the performance of identifying different pedestrians. To address this limitation, we design a Deep InfoMax (DIM) network to maximize the mutual information (MI) between the input image and encoder output, which doesn’t need any auxiliary labels. To evaluate the effectiveness of the DIM network, we propose end-to-end Global-DIM and Local-DIM models. Additionally, the DIM network provides a new solution for the cross-dataset unsupervised ReID issue as it needs no extra labels. The experiments prove the superiority of MI theory on the ReID issue, achieving state-of-the-art results.

We propose a novel domain-specific loss, which is a differentiable loss function based on the dose volume histogram, and combine it with an adversarial loss for the training of deep neural networks to generate Pareto optimal dose distributions. The mean squared error (MSE) loss, dose volume histogram (DVH) loss, and adversarial (ADV) loss were used to train 4 instances of the neural network model: 1) MSE, 2) MSE+ADV, 3) MSE+DVH, and 4) MSE+DVH+ADV. Data from 70 prostate patients were acquired, and the dose influence arrays were calculated for each patient. 1200 Pareto surface plans per patient were generated by pseudo-randomizing the tradeoff weights (84,000 plans total). We divided the data into 54 training, 6 validation, and 10 testing patients. Each model was trained for 100,000 iterations, with a batch size of 2. The prediction time of each model is 0.052 seconds. Quantitatively, the MSE+DVH+ADV model had the lowest prediction error of 0.038 (conformation), 0.026 (homogeneity), 0.298 (R50), 1.65% (D95), 2.14% (D98), 2.43% (D99). The MSE model had the worst prediction error of 0.134 (conformation), 0.041 (homogeneity), 0.520 (R50), 3.91% (D95), 4.33% (D98), 4.60% (D99). For both the mean dose PTV error and the max dose PTV, Body, Bladder and rectum error, the MSE+DVH+ADV outperformed all other models. All models’ predictions have an average mean and max dose error less than 2.8% and 4.2%, respectively. Expert human domain-specific knowledge can be the largest driver in the performance improvement, and adversarial learning can be used to further capture nuanced features. The real-time prediction capabilities allow for a physician to quickly navigate the tradeoff space, and produce a dose distribution as a tangible endpoint for the dosimetrist to use for planning. This can considerably reduce the treatment planning time, allowing for clinicians to focus their efforts on challenging cases.

One-class learning is the classic problem of fitting a model to data for which annotations are available only for a single class. In this paper, we propose a novel objective for one-class learning. Our key idea is to use a pair of orthonormal frames — as subspaces — to ‘sandwich’ the labeled data via optimizing for two objectives jointly: i) minimize the distance between the origins of the two subspaces, and ii) maximize the margin between the hyperplanes and the data, with each subspace demanding the data to be in its positive and negative orthant respectively. Our proposed objective however leads to a non-convex optimization problem, for which we resort to Riemannian optimization schemes and derive an efficient conjugate gradient scheme on the Stiefel manifold. To study the effectiveness of our scheme, we propose a new dataset~\emph{Dash-Cam-Pose}, consisting of clips with skeleton poses of humans seated in a car, the task being to classify the clips as normal or abnormal; the latter is when any human pose is out-of-position with regard to say an airbag deployment. Our experiments on the proposed Dash-Cam-Pose dataset, as well as several other standard anomaly/novelty detection benchmarks demonstrate the benefits of our scheme, achieving state-of-the-art one-class accuracy.

Outcome regressed on class labels identified by unsupervised clustering is common practice in many applications. However, the misclassification of class labels caused by the learning algorithm is commonly ignored, which potentially leads to serious bias in the estimated effect parameters. Due to its generality, we suggest redressing the situation by use of the simulation and extrapolation method. Performance is illustrated by simulated data from Gaussian mixture models. Finally, we apply our method to a study which regressed overall survival on class labels derived from unsupervised clustering of gene expression data from bone marrow samples of multiple myeloma patients.

In this paper, we report our method for the Information Extraction task in the 2019 Language and Intelligence Challenge. We incorporate BERT into the multi-head selection framework for joint entity-relation extraction. This model extends existing approaches from three perspectives. First, BERT is adopted as a feature extraction layer at the bottom of the multi-head selection framework. We further optimize BERT by introducing a semantic-enhanced task during BERT pre-training. Second, we introduce a large-scale Baidu Baike corpus for entity recognition pre-training, which is weakly supervised since there are no actual named entity labels. Third, soft label embedding is proposed to effectively transmit information between entity recognition and relation extraction. Combining these three contributions, we enhance the information extracting ability of the multi-head selection model and achieve an F1-score of 0.876 on testset-1 with a single model. By ensembling four variants of our model, we finally achieve an F1 score of 0.892 (1st place) on testset-1 and an F1 score of 0.8924 (2nd place) on testset-2.

With the prosperity of business intelligence, recommender systems have evolved into a new stage in which we care not only about what to recommend, but why it is recommended. Explainability of recommendations thus emerges as a focal point of research and becomes extremely desired in e-commerce. Existing studies along this line often exploit item attributes and correlations from different perspectives, but they still lack an effective way to combine both types of information for deep learning of personalized interests. In light of this, we propose a novel graph structure, the attribute network, based on both items’ co-purchase network and important attributes. A novel neural model called eRAN is then proposed to generate recommendations from attribute networks with explainability and cold-start capability. Specifically, eRAN first maps items connected in attribute networks to low-dimensional embedding vectors through a deep autoencoder, and then an attention mechanism is applied to model the attraction of attributes to users, from which personalized item representations can be derived. Moreover, a pairwise ranking loss is built into eRAN to improve recommendations, under the assumption that, from a personalized view, item pairs co-purchased by a user should be more similar than non-paired items drawn by negative sampling. Experiments on real-world datasets demonstrate the effectiveness of our method compared with some state-of-the-art competitors. In particular, eRAN shows its unique abilities in recommending cold-start items with higher accuracy, as well as in understanding user preferences underlying complicated co-purchasing behaviors.

Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, such as natural language processing, information retrieval and computational biology. Sampled string matching is an efficient approach recently introduced in order to overcome the prohibitive space requirements of index construction, on the one hand, and to drastically reduce the searching time of online solutions, on the other. In this paper we present a new algorithm for the sampled string matching problem, based on a character-distance sampling approach. The main idea is to sample the distances between consecutive occurrences of a given pivot character and then to search the sampled data online for any occurrence of the sampled pattern, before verifying against the original text. From a theoretical point of view we prove that, under suitable conditions, our solution achieves both linear worst-case time complexity and optimal average-time complexity. From a practical point of view it turns out that our solution shows sub-linear behaviour in practice and speeds up online searching by a factor of up to 9, using limited additional space ranging from 2.8% to 11% of the text size, with a gain of up to 50% compared with previous solutions.
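The core idea is easy to prototype: store, for one pivot character, the distances between its consecutive occurrences, match the pattern's own distance signature against that sequence, and verify candidates in the original text. A minimal Python sketch of the sampling idea (my own illustration, not the authors' optimized algorithm):

```python
def sampled_search(text, pattern, pivot):
    # Sampled text: positions of the pivot character and the distances
    # between its consecutive occurrences.
    pos = [i for i, c in enumerate(text) if c == pivot]
    dist = [b - a for a, b in zip(pos, pos[1:])]
    # Distance signature of the pattern with respect to the same pivot.
    ppos = [i for i, c in enumerate(pattern) if c == pivot]
    if not ppos:
        # No pivot in the pattern: fall back to a plain online scan.
        return [i for i in range(len(text) - len(pattern) + 1)
                if text.startswith(pattern, i)]
    pdist = [b - a for a, b in zip(ppos, ppos[1:])]
    k = len(pdist)
    matches = []
    for j in range(len(dist) - k + 1):
        if dist[j:j + k] == pdist:
            start = pos[j] - ppos[0]  # align the first pivot occurrence
            if 0 <= start <= len(text) - len(pattern) and \
               text[start:start + len(pattern)] == pattern:
                matches.append(start)
    return matches
```

On `"abracadabra"` with pivot `'a'`, searching `"abra"` inspects only the distance sequence `[3, 2, 2, 3]` before verifying the two true occurrences at positions 0 and 7.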

Tremendous advances in parallel computing and graphics hardware have opened up several novel real-time GPU applications in the fields of computer vision, computer graphics, augmented reality (AR) and virtual reality (VR). Although these applications are built upon established open-source frameworks that provide highly optimized algorithms, they often come with custom, self-written data structures to manage the underlying data. In this work, we present stdgpu, an open-source library which defines several generic GPU data structures for fast and reliable data management. Rather than abandoning previously established frameworks, our library aims to extend them, thereby bridging the gap between CPU and GPU computing. This way, it provides clean and familiar interfaces and integrates seamlessly into new as well as existing projects. We hope to foster further developments towards unified CPU and GPU computing and welcome contributions from the community.

We focus on graph-to-sequence learning, which can be framed as transducing graph structures to sequences for text generation. To capture structural information associated with graphs, we investigate the problem of encoding graphs using graph convolutional networks (GCNs). Unlike various existing approaches, in which shallow architectures capture only local structural information, we introduce a dense connection strategy, proposing novel Densely Connected Graph Convolutional Networks (DCGCNs). Such a deep architecture is able to integrate both local and non-local features to learn a better structural representation of a graph. Our model significantly outperforms state-of-the-art neural models on AMR-to-text generation and syntax-based neural machine translation.
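The dense connection strategy itself is simple to express: layer l receives the concatenation of the input features and every earlier layer's output. A small numpy sketch of such a block (a generic densely connected GCN stack under my own simplifications, such as symmetric normalization with self-loops; not the paper's full DCGCN model):

```python
import numpy as np

def normalized_adj(A):
    # Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def dcgcn_block(A, X, weights):
    # Densely connected stack: each layer sees the concatenation of the
    # original features X and all earlier layer outputs.
    A_norm = normalized_adj(A)
    outputs = [X]
    for W in weights:
        H_in = np.concatenate(outputs, axis=1)
        H = np.maximum(0.0, A_norm @ H_in @ W)  # one GCN layer with ReLU
        outputs.append(H)
    return np.concatenate(outputs, axis=1)

# Demo on a 5-node ring graph with 4 input features and 3 layers of width 8.
A = np.zeros((5, 5))
for i in range(5):
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1.0
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
weights = [rng.normal(size=(4, 8)),    # layer 1 input: 4 dims
           rng.normal(size=(12, 8)),   # layer 2 input: 4 + 8 dims
           rng.normal(size=(20, 8))]   # layer 3 input: 4 + 8 + 8 dims
out = dcgcn_block(A, X, weights)       # final representation: 4 + 3*8 = 28 dims
```

The growing input widths (4, 12, 20) are exactly the dense-connectivity bookkeeping: every layer's weight matrix must accept all previously computed features.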

A recent analysis of a model of iterative neural networks in Hilbert spaces established fundamental properties of such networks, such as the existence of fixed-point sets, convergence analysis, and Lipschitz continuity. Building on these results, we show that under a single mild condition on the weights of the network, one is guaranteed to obtain a neural network converging to its unique fixed point. We provide a bound on the norm of this fixed point in terms of the norms of the weights and biases of the network. We also show why this model of a feed-forward neural network is not able to accommodate Hopfield networks under our assumption.

In computational science and in computer science, research software is a central asset for research. Computational science is the application of computer science and software engineering principles to solving scientific problems, whereas computer science is the study of computer hardware and software design. The Open Science agenda holds that science advances faster when we can build on existing results. Therefore, research software has to be reusable for advancing science, and we need proper research software engineering to obtain reusable and sustainable research software. In this way, software engineering methods may improve research in other disciplines. However, research in software engineering and computer science itself will also benefit from reuse when research software is involved. For good scientific practice, the resulting research software should be open and adhere to the FAIR principles (findable, accessible, interoperable and reusable) to allow repeatability, reproducibility, and reuse. Compared to research data, research software should be both archived for reproducibility and actively maintained for reusability. The FAIR data principles do not require openness, but research software should be open source software. Established open source software licenses provide sufficient licensing options, such that it should be the rare exception to keep research software closed. We review and analyze the current state of this area in order to give recommendations for making computer science research software FAIR and open. We observe that research software publishing practices in computer science and in computational science show significant differences.

This paper is about regularizing deep convolutional networks (ConvNets) based on an adaptive multi-objective framework for transfer learning with limited training data in the target domain. Recent advances in ConvNet regularization in this context commonly rely on additional regularization objectives, which guide the training away from the target task using some concrete auxiliary tasks. Unlike those related approaches, we report that an objective without a concrete goal can serve surprisingly well as a regularizer. In particular, we demonstrate Pseudo-task Regularization (PtR), which dynamically regularizes a network by simply attempting to regress image representations to a pseudo-target during fine-tuning. Through numerous experiments, the improvements in classification accuracy from PtR are shown to be greater than or on a par with recent state-of-the-art methods. These results also indicate room for rethinking the requirements for a regularizer, i.e., whether a specifically designed task for regularization is indeed a key ingredient. The contributions of this paper are: a) PtR provides an effective and efficient alternative for regularization without dependence on concrete tasks or extra data; b) the desired strength of the regularization effect in PtR is dynamically adjusted and maintained based on the gradient norms of the target objective and the pseudo-task.

One-shot neural architecture search features fast training of a supernet in a single run. A pivotal issue for this weight-sharing approach is its lack of scalability. A simple adjustment with identity blocks renders a scalable supernet, but it causes unstable training, which makes the subsequent model ranking unreliable. In this paper, we introduce a linearly equivalent transformation to stabilize training, together with a proof that the transformed path is identical to the original one in representational power. The overall method is named SCARLET (SCAlable supeRnet with Linearly Equivalent Transformation). We show through experiments that linearly equivalent transformations can indeed harmonize supernet training. With an EfficientNet-like search space and a multi-objective reinforced evolutionary backend, it generates a series of competitive models: Scarlet-A achieves 76.9% Top-1 accuracy on ImageNet, outperforming EfficientNet-B0 by a large margin; the shallower Scarlet-B exemplifies the proposed scalability, attaining the same 76.3% accuracy as EfficientNet-B0 with far fewer FLOPs; Scarlet-C scores a competitive 75.6% at a comparable size. The models and evaluation code are released online at https://…/ScarletNAS .

It is well known that overparametrized neural networks trained using gradient-based methods quickly achieve small training error with appropriate hyperparameter settings. Recent papers have proved this statement theoretically for highly overparametrized networks under reasonable assumptions. The limiting case when the network size approaches infinity has also been considered. These results either assume that the activation function is ReLU or crucially depend on the minimum eigenvalue of a certain Gram matrix, which in turn depends on the data, the random initialization and the activation function. In the latter case, existing works only prove that this minimum eigenvalue is non-zero and do not provide quantitative bounds. On the empirical side, a contemporary line of investigation has proposed a number of alternative activation functions which tend to perform better than ReLU, at least in some settings, but no clear understanding has emerged. This state of affairs underscores the importance of theoretically understanding the impact of activation functions on training. In the present paper, we provide theoretical results about the effect of the activation function on the training of highly overparametrized 2-layer neural networks. We show that for smooth activations, such as tanh and swish, the minimum eigenvalue can be exponentially small depending on the span of the dataset, implying that training can be very slow. In contrast, for activations with a ‘kink,’ such as ReLU, SELU and ELU, all eigenvalues are large under minimal assumptions on the data. Several new ideas are involved. Finally, we corroborate our results empirically.
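The quantity at the heart of these results, the minimum eigenvalue of the activation-dependent Gram matrix, is easy to estimate by Monte Carlo. A sketch of the object being analysed (my own illustration of the standard NTK-style Gram matrix, not a verification of the paper's bounds):

```python
import numpy as np

def gram_min_eig(X, act_deriv, m=2000, seed=0):
    # Monte Carlo estimate of H_ij = E_w[s'(w.x_i) s'(w.x_j)] * (x_i . x_j)
    # for w ~ N(0, I); the smallest eigenvalue of H governs convergence speed
    # in analyses of this kind.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(m, d))
    S = act_deriv(W @ X.T)        # m x n matrix of s'(w_k . x_i)
    E = S.T @ S / m               # n x n Monte Carlo expectation, PSD
    H = E * (X @ X.T)             # entrywise (Schur) product keeps H PSD
    return np.linalg.eigvalsh(H)[0]

# Unit-norm inputs, as is common in such analyses.
X = np.random.default_rng(1).normal(size=(8, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)

lam_relu = gram_min_eig(X, lambda z: (z > 0).astype(float))  # ReLU derivative
lam_tanh = gram_min_eig(X, lambda z: 1 - np.tanh(z) ** 2)    # tanh derivative
```

Comparing the minimum eigenvalue for kinked and smooth activation derivatives on a given dataset gives a concrete feel for how data geometry interacts with the activation; the paper's point is that for smooth activations this eigenvalue can be exponentially small in the worst case.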

Computer simulations have become a popular tool for assessing complex skills such as problem solving. Log files of computer-based items record the entire human-computer interactive process for each respondent. The response processes are very diverse, noisy, and of nonstandard formats, and few generic methods have been developed for exploiting the information they contain. In this article, we propose a method to extract latent variables from process data. The method utilizes a sequence-to-sequence autoencoder to compress response processes into standard numerical vectors. It does not require prior knowledge of the specific items or human-computer interaction patterns. The proposed method is applied to both simulated and real process data to demonstrate that the resulting latent variables extract useful information from the response processes.

“Naively, a lot of businesses think all it takes is to hit the big red button marked ‘Data Science’, get the answer of 42, and sit back and watch the $$$’s roll in. If you’ve been in this world, you know this isn’t the case.” – Michael Pearmain

VIPE This paper presents a new interactive opinion mining tool that helps users to classify large sets of short texts originating from Web opinion polls, technical forums or Twitter. From a manual multi-label pre-classification of a very limited subset of texts, a learning algorithm predicts the labels of the remaining texts of the corpus and the texts most likely associated with a selected label. Using fast matrix factorization, the algorithm is able to handle large corpora and is well adapted to interactivity by integrating the corrections proposed by users on the fly. Experimental results on classical datasets of various sizes, and feedback from users in the marketing services of the telecommunication company Orange, confirm the quality of the obtained results. …

Adaptive Shivers Sort (ASS) We present a stable mergesort, called Adaptive Shivers Sort, that exploits the existence of monotonic runs to sort partially sorted data efficiently. We also prove that, although this algorithm is simple to implement, its computational cost, in number of comparisons performed, is optimal up to an additive linear term. …
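Adaptive Shivers Sort belongs to the family of run-adaptive mergesorts. The run-detection idea it builds on can be sketched as a plain natural merge sort (a baseline illustration only: this merges runs pairwise and does not implement the Shivers-style stack policy that gives the algorithm its cost guarantee):

```python
def find_runs(a):
    # Split into maximal non-decreasing runs; strictly decreasing runs are
    # reversed, which preserves stability for equal elements.
    runs, i, n = [], 0, len(a)
    while i < n:
        j = i + 1
        if j < n and a[j] < a[i]:
            while j < n and a[j] < a[j - 1]:
                j += 1
            runs.append(list(reversed(a[i:j])))
        else:
            while j < n and a[j] >= a[j - 1]:
                j += 1
            runs.append(a[i:j])
        i = j
    return runs

def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:          # <= keeps the merge stable
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def natural_mergesort(a):
    runs = find_runs(list(a))
    if not runs:
        return []
    while len(runs) > 1:
        merged = [merge(runs[t], runs[t + 1]) for t in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            merged.append(runs[-1])
        runs = merged
    return runs[0]
```

On nearly sorted input, run detection leaves only a handful of runs to merge, which is the source of the adaptivity; the Shivers-style policy additionally controls the order in which runs on a stack are merged.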

Robust Options Deep Q Network Robust reinforcement learning aims to produce policies with strong guarantees even in the face of environments or transition models whose parameters are highly uncertain. Existing work uses value-based methods and the usual primitive-action setting. In this paper, we propose robust methods for learning temporally abstract actions, in the framework of options. We present a Robust Options Policy Iteration (ROPI) algorithm with convergence guarantees, which learns options that are robust to model uncertainty. We utilize ROPI to learn robust options with the Robust Options Deep Q Network (RO-DQN), which solves multiple tasks and mitigates model misspecification due to model uncertainty. We present experimental results which suggest that policy iteration with linear features may have an inherent form of robustness when using coarse feature representations. In addition, we present experimental results which demonstrate that robustness helps policy iteration implemented on top of deep neural networks to generalize over a much broader range of dynamics than non-robust policy iteration. …

IteRank Neighbor-based collaborative ranking (NCR) techniques follow three consecutive steps to recommend items to each target user: first they calculate the similarities among users; then they estimate the concordance of pairwise preferences with the target user based on the calculated similarities; finally, they use the estimated pairwise preferences to infer a total ranking of items for the target user. This general approach faces a problem: rank data is usually sparse, since users have typically compared only a few pairs of items. Consequently, the similarities among users are calculated from limited information, are not accurate enough for inferring true values of preference concordance, and can lead to an invalid ranking of items. This article presents a novel framework, called IteRank, that models the data as a bipartite network containing users and pairwise preferences. It simultaneously refines users’ similarities and preferences’ concordances using a random-walk method on this graph structure, then uses the information from this first step in another network structure to simultaneously adjust the concordances of preferences and the rankings of items. Using this approach, IteRank can overcome some of the problems caused by the sparsity of the data. Experimental results show that IteRank improves recommendation performance compared to state-of-the-art NCR techniques that use the traditional NCR framework. …

Very large databases are a major opportunity for science and data analytics is a remarkable new field of investigation in computer science. The effectiveness of these tools is used to support a ‘philosophy’ against the scientific method as developed throughout history. According to this view, computer-discovered correlations should replace understanding and guide prediction and action. Consequently, there will be no need to give scientific meaning to phenomena, by proposing, say, causal relations, since regularities in very large databases are enough: ‘with enough data, the numbers speak for themselves’. The ‘end of science’ is proclaimed. Using classical results from ergodic theory, Ramsey theory and algorithmic information theory, we show that this ‘philosophy’ is wrong. For example, we prove that very large databases have to contain arbitrary correlations. These correlations appear only due to the size, not the nature, of data. They can be found in ‘randomly’ generated, large enough databases, which – as we will prove – implies that most correlations are spurious. Too much information tends to behave like very little information. The scientific method can be enriched by computer mining in immense databases, but not replaced by it.

Causal questions are omnipresent in many scientific problems. While much progress has been made in the analysis of causal relationships between random variables, these methods are not well suited if the causal mechanisms manifest themselves only in extremes. This work aims to connect the two fields of causal inference and extreme value theory. We define the causal tail coefficient, which captures asymmetries in the extremal dependence of two random variables. In the population case, the causal tail coefficient is shown to reveal the causal structure if the distribution follows a linear structural causal model. This holds even in the presence of latent common causes that have the same tail index as the observed variables. Based on a consistent estimator of the causal tail coefficient, we propose a computationally highly efficient algorithm that infers the causal structure from finite data. We prove that our method consistently estimates the causal order and compare it to other well-established, non-extremal approaches to causal discovery on synthetic data. The code is available as an open-access R package on GitHub.
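A plausible plug-in estimator of such a tail coefficient (my own sketch from the abstract's description, not necessarily the authors' exact definition) averages the empirical CDF of Y over the observations where X is largest:

```python
import random

def empirical_cdf_ranks(v):
    # F_hat(v_i) = rank of v_i divided by n
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    for r, i in enumerate(order, 1):
        ranks[i] = r / len(v)
    return ranks

def causal_tail_coefficient(x, y, k):
    # Average the empirical CDF of y over the k observations with largest x.
    fy = empirical_cdf_ranks(y)
    top = sorted(range(len(x)), key=lambda i: x[i], reverse=True)[:k]
    return sum(fy[i] for i in top) / k

# Heavy-tailed toy SCM X -> Y (hypothetical data, for illustration only).
random.seed(3)
n, k = 5000, 100
x = [random.paretovariate(2) for _ in range(n)]
y = [2.0 * xi + random.paretovariate(2) for xi in x]

psi_xy = causal_tail_coefficient(x, y, k)  # should be close to 1 (causal direction)
psi_yx = causal_tail_coefficient(y, x, k)  # should be visibly smaller
```

The asymmetry between the two directions is what a causal-order estimate would exploit: extreme causes force extreme effects, but extreme effects can also arise from extreme noise.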

Policy evaluation is central to economic data analysis, but economists mostly work with observational data in view of limited opportunities to carry out controlled experiments. In the potential outcome framework, the panel data approach (Hsiao, Ching and Wan, 2012) constructs the counterfactual by exploiting the correlation between cross-sectional units in panel data. The choice of cross-sectional control units, a key step in its implementation, is nevertheless unresolved in data-rich environments where many possible controls are at the researcher’s disposal. We propose the forward selection method to choose control units, and establish the validity of post-selection inference. Our asymptotic framework allows the number of possible controls to grow much faster than the time dimension. The easy-to-implement algorithms and their theoretical guarantees extend the panel data approach to big data settings. Monte Carlo simulations are conducted to demonstrate the finite-sample performance of the proposed method. Two empirical examples illustrate the usefulness of our procedure when many controls are available in real-world applications.
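In its simplest form, forward selection of control units greedily adds whichever candidate most reduces the in-sample fit error of the counterfactual regression. A stylized sketch (generic forward stepwise selection on synthetic data; the paper's procedure additionally involves stopping rules and post-selection inference that are not shown):

```python
import numpy as np

def forward_select_controls(y, X, max_k):
    # Greedy forward selection: at each step, add the control unit whose
    # inclusion (with an intercept) most reduces the sum of squared residuals.
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(max_k):
        best_j, best_ssr = None, np.inf
        for j in remaining:
            Z = np.column_stack([np.ones(n)] + [X[:, c] for c in selected + [j]])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            ssr = np.sum((y - Z @ beta) ** 2)
            if ssr < best_ssr:
                best_j, best_ssr = j, ssr
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Synthetic pre-treatment panel: 80 periods, 10 candidate controls;
# the treated unit loads mainly on controls 3 and 0 (hypothetical data).
rng = np.random.default_rng(0)
T, N = 80, 10
X = rng.normal(size=(T, N))
y = 2.0 * X[:, 3] + 0.8 * X[:, 0] + 0.1 * rng.normal(size=T)

selected = forward_select_controls(y, X, 3)
```

Because each step refits a full regression per remaining candidate, this naive version costs O(k·p) least-squares solves; the appeal noted in the abstract is that such greedy selection remains tractable even when the number of candidate controls far exceeds the time dimension.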

Graph drawing and visualisation techniques are important tools for the exploratory analysis of complex systems. While these methods are regularly applied to visualise data on complex networks, we increasingly have access to time series data that can be modelled as temporal networks or dynamic graphs. In such dynamic graphs, the temporal ordering of time-stamped edges determines the causal topology of a system, i.e. which nodes can directly and indirectly influence each other via a so-called causal path. While this causal topology is crucial to understand dynamical processes, the role of nodes, or cluster structures, we lack graph drawing techniques that incorporate this information into static visualisations. Addressing this gap, we present a novel dynamic graph drawing algorithm that utilises higher-order graphical models of causal paths in time series data to compute time-aware static graph visualisations. These visualisations combine the simplicity of static graphs with a time-aware layout algorithm that highlights patterns in the causal topology that result from the temporal dynamics of edges.

We propose a method to open the black box of the Multi-Layer Perceptron by inferring from it a simpler and generally more accurate general additive model. The resulting model comprises non-linear univariate and bivariate partial responses derived from the original Multi-Layer Perceptron. The responses are combined using the Lasso and further optimised within a modular structure. The approach is generic and provides a constructive framework to simplify and explain the Multi-Layer Perceptron for any data set, opening the door to validation against prior knowledge. Experimental results on benchmark datasets indicate that the partial responses are intuitive to interpret and the Area Under the Curve is competitive with Gradient Boosting, Support Vector Machines and Random Forests. The performance improvement compared with a fully connected Multi-Layer Perceptron is attributed to reduced confounding in the second stage of optimisation of the weights. The main limitation of the method is that it explicitly models only up to pairwise interactions. For many practical applications this will be optimal, but where it is not, this will be indicated by the performance difference relative to the original model. The streamlined model simultaneously interprets and optimises this frequently used flexible model.
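The first stage, extracting univariate partial responses from a trained network, can be sketched as follows (with a randomly weighted toy MLP standing in for a fitted model; the actual method also extracts bivariate responses and re-optimizes with the Lasso, which is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 'trained' MLP: fixed random weights stand in for a fitted model
# (hypothetical; a real application would use the network after training).
W1, b1 = rng.normal(size=(3, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 1)), rng.normal(size=1)

def mlp(X):
    return np.tanh(X @ W1 + b1) @ W2 + b2

def univariate_partial_response(predict, X, feature, grid):
    # Sweep one input over `grid` while holding the others at their mean.
    base = X.mean(axis=0)
    out = []
    for v in grid:
        x = base.copy()
        x[feature] = v
        out.append(predict(x[None, :])[0, 0])
    return np.array(out)

X_data = rng.normal(size=(50, 3))
grid = np.linspace(-2, 2, 9)
response = univariate_partial_response(mlp, X_data, 0, grid)
```

Plotting `response` against `grid` for each feature gives the interpretable univariate curves that the additive model is then built from.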

Empirical value of the Hellinger correlation, a new measure of dependence between two continuous random variables that satisfies a set of 7 desirable axioms (existence, symmetry, normalisation, characterisation of independence, weak Gaussian conformity, characterisation of pure dependence, generalised Data Processing Inequality). More details can be found in Geenens and Lafaye De Micheaux (2018).

• csdmpy A Python module for importing and exporting CSD model files. The `csdmpy` package provides Python support for the core scientific dataset (CSD) model file-exchange format. The CSD model is designed as a building block in the development of a more sophisticated, portable scientific dataset file standard, and is capable of handling a wide variety of scientific datasets both within and across disciplinary fields.

• lisc Literature Scanner. LISC, or ‘literature scanner’ is a package for collecting and analyzing scientific literature. LISC acts as a wrapper and connector between available APIs, allowing users to collect data from and about scientific articles, and to do analyses on this data, such as performing automated meta-analyses.

• pylift Python implementation of uplift modeling.

• pythonGraph simple graphics in Python. pythonGraph is a simple graphics package for Python. Its design is loosely descended from Adagraph and RAPTOR Graph. pythonGraph supports the creation of a graphics window, the display of simple geometric shapes, and simple user interactions using the mouse and keyboard.

• pytorch-transformers-pvt-nightly Repository of pre-trained NLP Transformer models: BERT & RoBERTa, GPT & GPT-2, Transformer-XL, XLNet and XLM

• rl-musician Composition of music with reinforcement learning.

• shapeletpy Time series classification using shapelets

• torch-radiate Automatic deep learning research report generator

• torch-stethoscope Automatic deep learning research report generator

• yahoo-finance-hdd Download historical financial data from Yahoo Finance

• affnine-deltaleaf An easy way to create a map (undirected graph) and find the shortest path from one node to another.

• apai ML & DL helper library. Helper library built on TensorFlow, providing utility and helper functions for machine and deep learning tasks.

• easyagents-v1 reinforcement learning for practitioners. Use EasyAgents if
• you are looking for an easy and simple way to get started with reinforcement learning
• you have implemented your own environment and want to experiment with it
• you want to mix and match different implementations and algorithms

• ignis Intuitive library for training neural nets in PyTorch. `ignis` is a high-level library that helps you write compact but full-featured training loops with metrics, early stopping, and model checkpoints for the deep learning library [PyTorch](https://pytorch.org).

Auto-Adaptive Parentage Inference Software Tolerant to Missing Parents (APIS) Parentage assignment package. Parentage assignment is performed based on observed average Mendelian transmission probability distributions. The main function of this package is APIS(), which performs the parentage assignment.

Functional Latent Data Models for Clustering Heterogeneous Curves (‘FLaMingos’) (flamingos) Provides a variety of original and flexible, user-friendly statistical latent variable models for the simultaneous clustering and segmentation of heterogeneous functional data (i.e., time series or, more generally, longitudinal data), fitted by unsupervised algorithms, including EM algorithms. Functional Latent Data Models for Clustering Heterogeneous Curves (‘FLaMingos’) were originally introduced and written in ‘Matlab’ by Faicel Chamroukhi <https://…&type=public&language=matlab>. The main references are the following: Chamroukhi F. (2010) <https://…/FChamroukhi-PhD.pdf>. Chamroukhi F., Same A., Govaert G. and Aknin P. (2010) <doi:10.1016/j.neucom.2009.12.023>. Chamroukhi F., Same A., Aknin P. and Govaert G. (2011) <doi:10.1109/IJCNN.2011.6033590>. Same A., Chamroukhi F., Govaert G. and Aknin P. (2011) <doi:10.1007/s11634-011-0096-5>. Chamroukhi F. and Glotin H. (2012) <doi:10.1109/IJCNN.2012.6252818>. Chamroukhi F., Glotin H. and Same A. (2013) <doi:10.1016/j.neucom.2012.10.030>. Chamroukhi F. (2015) <https://…/FChamroukhi-HDR.pdf>. Chamroukhi F. and Nguyen H-D. (2019) <doi:10.1002/widm.1298>.

Create Split Packed Bubble Charts (hpackedbubble) By binding R functions and the ‘Highcharts’ <http://…/> charting library, the ‘hpackedbubble’ package provides a simple way to draw split packed bubble charts.

Count Transformation Models (cotram) Count transformation models featuring parameters interpretable as discrete hazard ratios, odds ratios, reverse-time discrete hazard ratios, or transformed expectations. An appropriate data transformation for a count outcome and regression coefficients are simultaneously estimated by maximising the exact discrete log-likelihood using the computational framework provided in package ‘mlt’, technical details are given in Hothorn et al. (2018) <DOI:10.1111/sjos.12291>.

Machine Learning, Big Data, and Smart Buildings: A Comprehensive Survey Future buildings will offer new convenience, comfort, and efficiency possibilities to their residents. The way people live will change as technology becomes woven into their lives and information processing is fully integrated into their daily activities and objects. The expectation of future smart buildings is to make the residents’ experience as easy and comfortable as possible. The massive streaming data generated and captured by smart-building appliances and devices contains valuable information that needs to be mined to facilitate timely actions and better decision making. Machine learning and big data analytics will undoubtedly play a critical role in enabling the delivery of such smart services. In this paper, we survey the area of smart buildings, with a special focus on the role of techniques from machine learning and big data analytics. The survey also reviews the current trends and the challenges faced in the development of smart building services.

“The most important step for applying machine learning to DevOps is to select a method (accuracy, f1, or other), define the expected target, and its evolution.” – Boris Tvaroska (2017)