# If you did not already know

Fair on Average Causal Effect (FACE)
As virtually all aspects of our lives are increasingly impacted by algorithmic decision making systems, it is incumbent upon us as a society to ensure such systems do not become instruments of unfair discrimination on the basis of gender, race, ethnicity, religion, etc. We consider the problem of determining whether the decisions made by such systems are discriminatory, through the lens of causal models. We introduce two definitions of group fairness grounded in causality: fair on average causal effect (FACE), and fair on average causal effect on the treated (FACT). We use the Rubin-Neyman potential outcomes framework for the analysis of cause-effect relationships to robustly estimate FACE and FACT. We demonstrate the effectiveness of our proposed approach on synthetic data. Our analyses of two real-world data sets, the Adult income data set from the UCI repository (with gender as the protected attribute), and the NYC Stop and Frisk data set (with race as the protected attribute), show that the evidence of discrimination obtained by FACE and FACT, or lack thereof, is often in agreement with the findings from other studies. We further show that FACT, being somewhat more nuanced compared to FACE, can yield findings of discrimination that differ from those obtained using FACE. …

Denoising-GAN
Recently, Generative Adversarial Networks (GANs) have emerged as a popular alternative for modeling complex high-dimensional distributions. Most existing works implicitly assume that clean samples from the target distribution are easily available. However, in many applications, this assumption is violated. In this paper, we consider the observation setting in which the samples from the target distribution are given by the superposition of two structured components, and leverage GANs for learning the structure of the components. We propose two novel frameworks: denoising-GAN and demixing-GAN. The denoising-GAN assumes access to clean samples from the second component and tries to learn the distribution of the other component, whereas the demixing-GAN learns the distributions of both components at the same time. Through extensive numerical experiments, we demonstrate that the proposed frameworks can generate clean samples from unknown distributions, and provide competitive performance in tasks such as denoising, demixing, and compressive sensing. …

Stochastic Weight Averaging in Low-Precision Training (SWALP)
Low precision operations can provide scalability, memory savings, portability, and energy efficiency. This paper proposes SWALP, an approach to low precision training that averages low-precision SGD iterates with a modified learning rate schedule. SWALP is easy to implement and can match the performance of full-precision SGD even with all numbers quantized down to 8 bits, including the gradient accumulators. Additionally, we show that SWALP converges arbitrarily close to the optimal solution for quadratic objectives, and to a noise ball asymptotically smaller than low precision SGD in strongly convex settings. …
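The averaging idea can be sketched in one dimension, assuming a symmetric fixed-point quantizer and a deterministic quadratic objective (the paper's actual number format, learning-rate schedule, and stochastic setting differ):

```python
def quantize(x, bits=8, scale=4.0):
    # symmetric fixed-point quantization (one plausible scheme,
    # not necessarily the paper's exact format)
    levels = 2 ** (bits - 1) - 1
    step = scale / levels
    q = max(-levels, min(levels, round(x / step)))
    return q * step

def swalp_1d(grad, w0=5.0, lr=0.5, steps=200, avg_start=100):
    # run low-precision gradient descent, then average the later
    # (still-quantized) iterates in full precision
    w = quantize(w0)
    avg, n = 0.0, 0
    for t in range(steps):
        w = quantize(w - lr * grad(w))
        if t >= avg_start:
            avg, n = avg + w, n + 1
    return avg / n

# quadratic objective f(w) = w**2 / 2, whose gradient is w and optimum is 0
w_hat = swalp_1d(lambda w: w)
```

Even though every SGD iterate lives on the coarse quantization grid, the full-precision average can land much closer to the optimum, which is the intuition behind the convergence claim.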

CP-Net
While several convolution-like operators have recently been proposed for extracting features out of point clouds, down-sampling an unordered point cloud in a deep neural network has not been rigorously studied. Existing methods down-sample the points regardless of their importance for the output. As a result, some important points in the point cloud may be removed, while less valuable points may be passed to the next layers. In contrast, adaptive down-sampling methods sample the points by taking into account the importance of each point, which varies based on the application, task and training data. In this paper, we propose a permutation-invariant learning-based adaptive down-sampling layer, called Critical Points Layer (CPL), which reduces the number of points in an unordered point cloud while retaining the important points. Unlike most graph-based point cloud down-sampling methods that use a $k$-NN search algorithm to find the neighbouring points, CPL is a global down-sampling method, rendering it computationally very efficient. The proposed layer can be used along with any graph-based point cloud convolution layer to form a convolutional neural network, dubbed CP-Net in this paper. We introduce a CP-Net for $3$D object classification that achieves the best accuracy for the ModelNet$40$ dataset among point cloud-based methods, which validates the effectiveness of the CPL. …
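The core idea of importance-based down-sampling can be sketched with a toy scoring rule (the scoring here, a point's maximum feature response, is an assumption for illustration; the paper's CPL learns its criterion):

```python
def critical_points(points, features, k):
    # score each point by its strongest feature response and keep the
    # top-k; sorting selected indices back into input order keeps the
    # result a set-like, permutation-invariant selection
    scores = [max(f) for f in features]
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:k])
    return [points[i] for i in keep]

pts = [(0, 0), (1, 0), (0, 1), (2, 2)]
feats = [[0.1, 0.2], [0.9, 0.1], [0.3, 0.8], [0.05, 0.02]]
down = critical_points(pts, feats, 2)
```

Note the contrast with uniform or random down-sampling: the low-response point (2, 2) is dropped regardless of where it sits in the input order.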

# Distilled News

One approach to using random number generation inside a function without affecting the outer state of the random generator.
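One minimal way to do this in Python is to give the function its own generator instance instead of touching the module-level one:

```python
import random

def sample_noise(n, seed=0):
    # a dedicated Random instance keeps the module-level
    # generator's state untouched
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

random.seed(42)
before = random.random()
random.seed(42)
sample_noise(5)        # draws from its own generator, not the global one
after = random.random()
```

Because `sample_noise` never calls the module-level functions, the global stream is reproducible regardless of how often the helper runs.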
Data is key for any analysis in data science, be it inferential analysis, predictive analysis, or prescriptive analysis. The predictive power of a model depends on the quality of the data that was used in building the model. Data comes in different forms, such as text, tables, images, voice or video. Most often, data that is used for analysis has to be mined, processed and transformed into a form suitable for further analysis. The most common type of dataset used in analysis is clean data stored in a comma-separated values (CSV) table. However, because the portable document format (PDF) file is one of the most used file formats, every data scientist should understand how to extract data from a PDF file and transform it into a format such as CSV that can then be used for analysis or model building.
The strength of particle filter technology in nonlinear and non-Gaussian systems accounts for its wide range of applications. In addition, the particle filter's multi-modal processing capability is one of the reasons it is so widely used. Internationally, particle filtering has been applied in a variety of fields.
Why the confusion of these concepts has profound implications, from healthcare to business management. In correlated data, a pair of variables are related in that one thing is likely to change when the other does. This relationship might lead us to assume that a change to one thing causes the change in the other. This article clarifies that kind of faulty thinking by explaining correlation, causation, and the bias that often lumps the two together. The human brain simplifies incoming information, so we can make sense of it. Our brains often do that by making assumptions about things based on slight relationships, or bias. But that thinking process isn’t foolproof. An example is when we mistake correlation for causation. Bias can make us conclude that one thing must cause another if both change in the same way at the same time. This article clears up the misconception that correlation equals causation by exploring both of those subjects and the human brain’s tendency toward bias.
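The classic confounder pattern described above can be simulated in a few lines (a hypothetical example: a hidden variable drives two others, which then correlate without any causal link between them):

```python
import random

def pearson(x, y):
    # plain Pearson correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(1)
# z is a hidden confounder driving both x and y;
# x never causes y, yet the two are strongly correlated
z = [random.gauss(0, 1) for _ in range(2000)]
x = [zi + random.gauss(0, 0.3) for zi in z]
y = [zi + random.gauss(0, 0.3) for zi in z]
r = pearson(x, y)
```

Seeing `r` close to 1 here tells us nothing about x causing y; conditioning on z would make the association vanish.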
This is Part III of the ‘Building An A.I. Music Generator’ series. I’ll be covering the basics of Multitask training with Music Models – which we’ll use to do really cool things like harmonization, melody generation, and song remixing. We’ll be building off of Part I and Part II.
Some of the most accurate predictive models today are black box models, meaning it is hard to really understand how they work. To address this problem, techniques have arisen to understand feature importance: for a given prediction, how important is each input feature value to that prediction? Two well-known techniques are SHapley Additive exPlanations (SHAP) and Integrated Gradients (IG). In fact, they each represent a different type of explanation algorithm: a Shapley-value-based algorithm (SHAP) and a gradient-based algorithm (IG). There is a fundamental difference between these two algorithm types. This post describes that difference. First, we need some background. Below, we review Shapley values, Shapley-value-based methods (including SHAP), and gradient-based methods (including IG). Finally, we get back to our central question: When should you use a Shapley-value-based algorithm (like SHAP) versus a gradient-based explanation algorithm (like IG)?
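For very small feature counts, exact Shapley values can be computed directly by averaging marginal contributions over all feature orderings (a sketch of the definition itself, not of SHAP's sampling approximations; the toy linear model is hypothetical):

```python
from itertools import permutations
from math import factorial

def exact_shapley(model, baseline, x):
    # exact Shapley values: average each feature's marginal contribution
    # over every ordering in which features are "switched on";
    # feasible only for a handful of features (n! orderings)
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]      # switch on feature i
            now = model(current)
            phi[i] += now - prev
            prev = now
    return [p / factorial(n) for p in phi]

# hypothetical linear model f(x) = 2*x0 + 3*x1
f = lambda v: 2 * v[0] + 3 * v[1]
phi = exact_shapley(f, baseline=[0, 0], x=[1, 1])
```

For a linear model the attributions reduce to coefficient times (input minus baseline), which is a handy sanity check for any Shapley implementation.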
Together with blockchain and machine learning, stream processing seems to be one of the hottest topics nowadays. Companies are onboarding modern stream processing tools, service providers are releasing better and more powerful stream processing products, and specialists are in high demand. This article introduces the basics of stream processing. It starts with a rationale for why we need stream processing and how it works under the hood. Then it goes into how to write simple, scalable distributed stream processing applications. All in fewer than 40 lines of code! Since stream processing is a vast topic, this article is focused mostly on the data management part while sophisticated processing is left for another article. To make the article more practical, it discusses AWS Kinesis, a stream processing solution from Amazon, but it also refers to other popular Open Source technologies to present a broader picture.
Companies of all sizes and in all industries are developing ways to harness the power of big data for better decision-making. To provide valuable insights and meet expectations, data science teams have long turned to predictive analytics – or using historical data to model a problem and uncover the key factors that generated specific outcomes in the past to make predictions about the future. Predictive analytics has been around for years; however, prior to machine learning, the technology was not easy to adopt or scale in real-time. Machine learning is modernizing predictive analytics, providing data scientists with the ability to augment their efforts with more real-time insights. And thanks to hybrid cloud infrastructure opportunities, it’s now possible to embed and scale predictive analytics in almost any business application quickly and efficiently. The ability to process larger quantities of data in real-time results in more accurate predictions, and therefore, better business decisions. However, modernizing predictive analytics is not without its challenges. Here are a few ways companies can modernize the deployment of their legacy predictive models, and the pros and cons of these popular approaches.
Object tracking is one of computer vision's primary tasks. It is used in a wide range of applications, such as video surveillance, car tracking (distance estimation), and people detection and tracking. Object trackers usually need an initialization step, such as the initial object location, which can be provided manually or automatically by an object detector such as the Viola-Jones detector or fast template matching. There are several major problems related to tracking:
• occlusion
• multiple objects
• scale, illumination, appearance change
• difficult and rapid motions
• …
Although the object tracking problem has been studied for years, it is still not solved, and there are many object trackers: some built for special purposes and some generic.
The Kalman filter assumes a linear motion model and Gaussian noise, and returns only one hypothesis (e.g., the most probable position of a tracked object). If the movements are rapid and unpredictable (e.g., a leaf on a tree on a windy day), the Kalman filter is likely to fail. The particle filter returns multiple hypotheses (each particle represents one hypothesis) and can thus deal with non-Gaussian noise and support non-linear models. Beyond object tracking, where the state is a position vector (x, y), the state can be anything, e.g., the shape of the model. This article explains the main idea behind the particle filter and focuses on its practical usage for object tracking, along with samples.
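The predict/weight/resample loop can be sketched for a 1-D state (an illustrative bootstrap particle filter, not the article's code; noise levels and particle count are assumptions):

```python
import math
import random

def particle_filter(measurements, n_particles=500, proc_std=0.5,
                    meas_std=1.0, seed=0):
    # minimal bootstrap particle filter for a 1-D position;
    # each particle is one hypothesis about the state
    rng = random.Random(seed)
    particles = [rng.uniform(-10.0, 10.0) for _ in range(n_particles)]
    estimates = []
    for z in measurements:
        # predict: diffuse each hypothesis with process noise (motion model)
        particles = [p + rng.gauss(0.0, proc_std) for p in particles]
        # update: weight each hypothesis by the measurement likelihood
        weights = [math.exp(-0.5 * ((z - p) / meas_std) ** 2) for p in particles]
        total = sum(weights)
        weights = [w / total for w in weights]
        # estimate: posterior mean over hypotheses
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        # resample: keep likely hypotheses, drop unlikely ones
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates

meas_rng = random.Random(42)
true_pos = 3.0
zs = [true_pos + meas_rng.gauss(0.0, 1.0) for _ in range(30)]
est = particle_filter(zs)
```

Because the weighting step accepts any likelihood function and the particles are never assumed Gaussian, the same loop works for non-linear, multi-modal problems where the Kalman filter breaks down.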
Very large databases are a major opportunity for science, and data analytics is a remarkable new field of investigation in computer science. The effectiveness of these tools is used to support a "philosophy" against the scientific method as developed throughout history. According to this view, computer-discovered correlations should replace understanding and guide prediction and action. Consequently, there will be no need to give scientific meaning to phenomena by proposing, say, causal relations, since regularities in very large databases are enough: "with enough data, the numbers speak for themselves". The "end of science" is proclaimed. Using classical results from ergodic theory, Ramsey theory and algorithmic information theory, we show that this "philosophy" is wrong. For example, we prove that very large databases have to contain arbitrary correlations. These correlations appear only due to the size, not the nature, of data. They can be found in "randomly" generated, large enough databases, which, as we will prove, implies that most correlations are spurious. Too much information tends to behave like very little information. The scientific method can be enriched by computer mining in immense databases, but not replaced by it.
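The "size, not nature" point is easy to demonstrate empirically: scan a table of completely independent random columns and you will still find strong pairwise correlations, purely by chance (illustrative simulation, not the paper's proof):

```python
import random

def pearson(x, y):
    # plain Pearson correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n_rows, n_cols = 30, 200
# a "database" of completely independent random columns
cols = [[random.random() for _ in range(n_rows)] for _ in range(n_cols)]
# the strongest pairwise correlation found anywhere in the table
best = max(abs(pearson(cols[i], cols[j]))
           for i in range(n_cols) for j in range(i + 1, n_cols))
```

With 200 columns there are nearly 20,000 pairs to test, so some pair correlates strongly even though every column was generated independently; every such correlation is spurious by construction.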
Recent advances in AI have been made possible through access to 'Big Data' and cheap computing power. But can it go wrong? Big data is suddenly everywhere. From scarcity and difficulty finding data (and information), we now have a deluge of data. In recent years, the amount of available data has been growing at an exponential pace. This in turn is made possible by the immense growth in the number of devices recording data, as well as the connectivity between all these devices through the internet of things. Everyone seems to be collecting, analyzing, making money from and celebrating (or fearing) the powers of Big data. By combining the power of modern computing, it promises to solve virtually any problem – just by crunching the numbers.
The exploration-exploitation dilemma is faced by our agents while learning to play the game tic-tac-toe [Medium article]. This dilemma is a fundamental problem in reinforcement learning, as well as in real life, which we frequently face when choosing between options. Would you rather:
• pick something you are familiar with, in order to maximise the chance of getting what you want
• or pick something you have not tried, and possibly learn more, which may (or may not) result in you making better decisions in future
This trade-off determines whether you earn your reward sooner, or learn about the environment first and earn your rewards later.
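The simplest formalisation of this trade-off is the epsilon-greedy bandit: explore a random option with small probability, otherwise exploit the best option seen so far (a generic sketch, not the tic-tac-toe agent from the article; the arm payoffs are made up):

```python
import random

def epsilon_greedy(true_means, steps=5000, eps=0.1, seed=0):
    # with probability eps explore a random arm; otherwise exploit the
    # arm with the best empirical mean so far
    rng = random.Random(seed)
    k = len(true_means)
    counts, values = [0] * k, [0.0] * k
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(k)                         # explore
        else:
            arm = max(range(k), key=lambda a: values[a])   # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # running mean
    return values, counts

# three options whose true payoffs the agent does not know
values, counts = epsilon_greedy([0.1, 0.5, 0.9])
```

The occasional exploration is what lets the agent discover that the third arm pays best; with eps = 0 it could lock onto a worse arm forever.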
The tech stack for Data Science teams is misunderstood by companies of all sizes. Oftentimes there is a failure to understand what tooling is necessary for what jobs. Fortunately, most trends in technology result in standardized workflows across industry. As of yet, this seems to have been limited in the Data Science world. There isn’t a clear route to building and deploying an AI product like there is for something like a basic web application. Maybe your AI solution is going to be deployed and provide predictions through a basic web application. This adds an extra layer of complexity, a layer most teams are not prepared to deal with.
In the previous post we saw what Bayes' Theorem is, and went through an easy, intuitive example of how it works. You can find this post here. If you don't know what Bayes' Theorem is, and you have not had the pleasure to read it yet, I recommend you do, as it will make understanding this present article a lot easier. In this post, we will see the uses of this theorem in Machine Learning. Ready? Let's go, then!
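As a refresher, the theorem itself is a one-liner; here it is applied to a hypothetical spam-filter setting (the probabilities below are invented for illustration):

```python
def bayes_posterior(prior, likelihood, alt_likelihood):
    # P(H|E) = P(E|H) P(H) / (P(E|H) P(H) + P(E|not H) P(not H))
    num = likelihood * prior
    return num / (num + alt_likelihood * (1 - prior))

# made-up numbers: 80% of spam contains the word "free",
# 10% of ham contains it, and 30% of all mail is spam
p_spam_given_free = bayes_posterior(prior=0.3, likelihood=0.8,
                                    alt_likelihood=0.1)
```

Seeing the word raises the spam probability from the 30% prior to about 77%, which is exactly the belief update machine-learning classifiers exploit.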
Before setting the parameter λ and plugging it into the formula, let's pause a second and ask a question. Why did Poisson have to invent the Poisson Distribution?
Distributions play an essential role in the life of every statistician. Coming from a non-statistical background, I have always found distributions somewhat mystical, and the fact is that there are a lot of them. So which ones should I know? And how do I come to know and understand them? This post covers some of the most used discrete distributions that you need to know, along with some intuition and proofs.
1. Bernoulli Distribution
2. Binomial Distribution
3. Geometric Distribution
4. Negative Binomial Distribution
5. Poisson Distribution
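The probability mass functions of the five distributions above fit in a few lines of stdlib Python (standard textbook formulas, shown here as a quick reference):

```python
from math import comb, exp, factorial

def bernoulli_pmf(k, p):
    # single trial: k in {0, 1}
    return p ** k * (1 - p) ** (1 - k)

def binomial_pmf(k, n, p):
    # k successes in n independent Bernoulli(p) trials
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def geometric_pmf(k, p):
    # first success occurs on trial k (k = 1, 2, ...)
    return (1 - p) ** (k - 1) * p

def neg_binomial_pmf(k, r, p):
    # r-th success occurs on trial k (k = r, r + 1, ...)
    return comb(k - 1, r - 1) * p ** r * (1 - p) ** (k - r)

def poisson_pmf(k, lam):
    # k events in an interval with average rate lam
    return exp(-lam) * lam ** k / factorial(k)
```

Note how the geometric distribution is the r = 1 special case of the negative binomial, and how the Poisson arises as the limit of the binomial with n large and n*p = λ fixed.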
Matplotlib’s default properties often yield unappealing plots. Here are several simple ways to spruce up your visualizations.

# Fresh from the Python Package Index

rgxg
ReGular eXpression Generator

robolab
Open source development tools for Robot Framework RPA developers

smitter
HPC submission for deep learning.

smoothfit
Smooth data fitting. Given experimental data, it is often desirable to produce a function whose values match the data to some degree. This package implements a robust approach to data fitting based on a minimization problem.

tensorflow-gcp
TensorFlow for GCP. TensorFlow for GCP is a software library for high performance numerical computation based on TensorFlow. Its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices. Originally developed by researchers and engineers from the Google Brain team within Google’s AI organization, it comes with strong support for machine learning and deep learning and the flexible numerical computation core is used across many other scientific domains.

typecats
Structure unstructured data for the purpose of static type checking. In many web services it is common to consume or generate JSON or some JSON-like representation of data. JSON translates quite nicely to core Python objects such as dicts and lists. However, if your data is structured, it is nice to be able to work with it in a structured manner, i.e. with Python objects. Python objects give you better code readability, and in more recent versions of Python they are also capable of being statically type-checked with a tool like mypy.
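The core idea, turning a JSON-like dict into a typed, statically checkable object, can be sketched with stdlib dataclasses (an illustration of the concept, not typecats' actual API):

```python
from dataclasses import dataclass, fields

@dataclass
class User:
    name: str
    age: int

def structure(cls, data):
    # build a typed object from a JSON-like dict,
    # silently ignoring unknown keys
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in data.items() if k in known})

u = structure(User, {"name": "Ada", "age": 36, "unexpected": True})
```

Once the data lives in `User` rather than a raw dict, attribute access is readable and tools like mypy can check `u.age` is an `int` at analysis time.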

awkward1
Development of awkward 1.0, to replace scikit-hep/awkward-array in 2020.

Azure-Sentinel-Utilities
AZURE SENTINEL NOTEBOOKS PYTHON TOOLS: This package is developed to support Azure Sentinel Notebooks. It is in an early preview stage, so please provide feedback, report bugs, and suggest new features.

graphqt
Create Graph DataStructure. This is a test library for Python that lets you create graphs using a few functions. It allows you to create undirected graphs and provides some other functionality. The library is aimed at beginner programmers who want to get started with graph programming and visualize graphs easily. It is highly recommended to learn the underlying code so that you can implement these structures without the library in interviews and competitions. The library was designed for programmers who are currently studying graphs; it gives them a working brute-force solution that they can then optimize further.
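The underlying structure such libraries wrap is just an adjacency collection; a minimal undirected version looks like this (illustrative only, not graphqt's actual interface):

```python
from collections import defaultdict

class Graph:
    # minimal undirected graph backed by adjacency sets
    def __init__(self):
        self.adj = defaultdict(set)

    def add_edge(self, u, v):
        # undirected: record the edge in both directions
        self.adj[u].add(v)
        self.adj[v].add(u)

    def neighbors(self, u):
        return sorted(self.adj[u])

g = Graph()
g.add_edge("a", "b")
g.add_edge("a", "c")
```

This is the brute-force baseline worth being able to write from scratch before reaching for a library.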

mk-tf-img-mod
Easy to use Image Learning Module for Tensorflow

Multi-Template-Matching
Object-recognition in images using multiple templates. Multi-Template-Matching is a package to perform object-recognition in images using one or several smaller template images. The template and images should have the same bit depth (8, 16, 32-bit) and number of channels (single/grayscale or RGB). The main function MTM.matchTemplates returns the best predicted locations given a score_threshold and/or the expected number of objects in the image.

A reinforcement learning library for training, evaluating, and deploying robust trading agents with TF2.

Apache Flink Python API. Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities. Learn more about Flink at (https://flink.apache.org ). This packaging allows you to write Flink programs in Python, but it is currently a very early version and will change in future releases.

change_detection
Package for detecting changes in time-series data. Detects changes in time series using the R package gets (https://…/index.html ). Uses a combination of Google BigQuery and Python to query the data, which is then fed to the R change detection code. Outputs a table containing the results.

gdascore
Tools for generating General Data Anonymity Scores (www.gda-score.org)

gdascore-mock
Tools for generating General Data Anonymity Scores (www.gda-score.org)

ingaia-gcloud
Python utility library for Google Cloud Platform

py2store
Interfacing with stored data through Python. Storage CRUD, how and where you want it. List, read, write, and delete data in a structured data source/target as if manipulating simple Python builtins (dicts, lists), or through the interface **you** want to interact with, with configuration and physical particularities out of the way. You can also change these particularities without having to change the business-logic code.
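The pattern behind this, exposing a storage backend through the dict interface, can be sketched with `collections.abc.MutableMapping` (an illustration of the idea, not py2store's actual API):

```python
import os
import tempfile
from collections.abc import MutableMapping

class FileStore(MutableMapping):
    # dict-like view over a directory: keys are file names,
    # values are file contents
    def __init__(self, rootdir):
        self.rootdir = rootdir

    def _path(self, k):
        return os.path.join(self.rootdir, k)

    def __getitem__(self, k):
        try:
            with open(self._path(k)) as f:
                return f.read()
        except FileNotFoundError:
            raise KeyError(k)

    def __setitem__(self, k, v):
        with open(self._path(k), "w") as f:
            f.write(v)

    def __delitem__(self, k):
        os.remove(self._path(k))

    def __iter__(self):
        return iter(os.listdir(self.rootdir))

    def __len__(self):
        return len(os.listdir(self.rootdir))

store = FileStore(tempfile.mkdtemp())
store["greeting.txt"] = "hello"
```

Business logic written against this mapping never mentions the filesystem, so swapping the backend (e.g. for an in-memory dict in tests) requires no changes to that logic.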

# Whats new on arXiv

Using class labels to represent class similarity is a typical approach to training deep hashing systems for retrieval; samples from the same or different classes take binary 1 or 0 similarity values. This similarity does not model the full rich knowledge of semantic relations that may be present between data points. In this work we build upon the idea of using semantic hierarchies to form distance metrics between all available sample labels; for example cat to dog has a smaller distance than cat to guitar. We combine this type of semantic distance into a loss function to promote similar distances between the deep neural network embeddings. We also introduce an empirical Kullback-Leibler divergence loss term to promote binarization and uniformity of the embeddings. We test the resulting SHREWD method and demonstrate improvements in hierarchical retrieval scores using compact, binary hash codes instead of real valued ones, and show that in a weakly supervised hashing setting we are able to learn competitively without explicitly relying on class labels, but instead on similarities between labels.
The concepts of risk-aversion, chance-constrained optimization, and robust optimization have developed significantly over the last decade. Statistical learning community has also witnessed a rapid theoretical and applied growth by relying on these concepts. A modeling framework, called distributionally robust optimization (DRO), has recently received significant attention in both the operations research and statistical learning communities. This paper surveys main concepts and contributions to DRO, and its relationships with robust optimization, risk-aversion, chance-constrained optimization, and function regularization.
Recently, researchers have utilized Knowledge Graphs (KGs) as side information in recommendation systems to address cold-start and sparsity issues and improve recommendation performance. Existing KG-aware recommendation models use the features of neighboring entities and structural information to update the embedding of the current entity. Although this rich information benefits the downstream task, the cost of exploring the entire graph is massive and impractical. To reduce the computational cost while retaining the pattern of feature extraction, KG-aware recommendation models usually utilize a fixed-size, randomly selected set of neighbors rather than the complete information in the KG. Nonetheless, these approaches have two critical issues: first, fixed-size, randomly selected neighbors restrict the view of the graph; in addition, as the order of the graph features increases, the growth of the model's parameter dimensionality may make the training process hard to converge. To address these limitations, we propose GraphSW, a strategy based on a stage-wise training framework that accesses only a subset of the entities in the KG at each stage. In subsequent stages, the embedding learned in previous stages is provided to the network, and the model can gradually learn the information in the KG. We apply stage-wise training to two state-of-the-art recommendation models, RippleNet and Knowledge Graph Convolutional Networks (KGCN). Moreover, we evaluate the performance on six real-world datasets: Last.FM 2011, Book-Crossing, movie, LFM-1b 2015, Amazon-book and Yelp 2018. Our experimental results show that the proposed strategy helps both models collect more information from the KG and improves their performance. Furthermore, we observe that GraphSW helps KGCN converge effectively on high-order graph features.
How do you make the best decision among the opinions and tastes of your friends and acquaintances? Recommender systems are used to solve such issues. Common algorithms use a similarity measure to predict an active user's taste for a particular item. Owing to the cold-start and data-sparsity problems, these systems often cannot predict and suggest particular items to users. In this paper, we introduce a new recommender system that is able to find user preferences and, based on them, provide recommendations. Our proposed system, called CUPCF, combines two similarity measures in collaborative filtering to address the data-sparsity and poor-prediction (high prediction error rate) problems for better recommendation. Experimental results on the MovieLens dataset show that, combined with the preferences of the user's nearest neighbor, the proposed system's error rate improves compared to a number of state-of-the-art recommendation methods. Furthermore, the results indicate the efficiency of CUPCF: the maximum improvement in error rate is 15.5%, and the maximum Accuracy, Precision and Recall values of CUPCF are 0.91402, 0.91436 and 0.9974, respectively.
Recent years have witnessed a wave of research activity in systems science toward the study of population systems. The driving force behind this shift has been the numerous emerging and ever-changing technologies in the life and physical sciences and engineering, from neuroscience, biology, and quantum physics to robotics, where many control-enabled applications involve manipulating a large ensemble of structurally identical dynamic units, or agents. Analyzing fundamental properties of ensemble control systems in turn plays a foundational and critical role in enabling and, further, advancing these applications, and the analysis is largely beyond the capability of classical control techniques. In this paper, we consider an ensemble of time-invariant linear systems evolving on an infinite-dimensional space of continuous functions. We exploit the notion of separating points and techniques of polynomial approximation to develop necessary and sufficient ensemble controllability conditions. In particular, we introduce an extended notion of controllability matrix, called the Ensemble Controllability Gramian. This notion enables the characterization of ensemble controllability through evaluating the controllability of each individual system in the ensemble. As a result, the work provides a unified framework with a systematic procedure for analyzing control systems defined on an infinite-dimensional space by a finite-dimensional approach.
We study the problem of end-to-end learning from complex multigraphs with potentially very large numbers of edges between two vertices, each edge labeled with rich information. Examples of such graphs include financial transactions, communication networks, or flights between airports. We propose Latent-Graph Convolutional Networks (L-GCNs), which can successfully propagate information from these edge labels to a latent adjacency tensor, after which further propagation and downstream tasks can be performed, such as node classification. We evaluate the performance of several variations of the model on two synthetic datasets simulating fraud in financial transaction networks, to ensure that the model must make use of edge labels in order to achieve good classification performance. We find that allowing for nonlinear interactions on a per-neighbor basis enhances performance significantly, while also showing promising results in an inductive setting.
In machine learning applications for online product offerings and marketing strategies, there are often hundreds or thousands of features available to build such models. Feature selection is one essential method in such applications for multiple objectives: improving the prediction accuracy by eliminating irrelevant features, accelerating the model training and prediction speed, reducing the monitoring and maintenance workload for the feature data pipeline, and providing better model interpretation and diagnosis capability. However, selecting an optimal feature subset from a large feature space is considered an NP-complete problem. The mRMR (Minimum Redundancy and Maximum Relevance) feature selection framework addresses this problem by selecting relevant features while controlling for the redundancy within the selected features. This paper describes the approach to extend, evaluate, and implement the mRMR feature selection methods for classification problems in a marketing machine learning platform at Uber that automates creation and deployment of targeting and personalization models at scale. This study first extends the existing mRMR methods by introducing a non-linear feature redundancy measure and a model-based feature relevance measure. Then an extensive empirical evaluation is performed for eight different feature selection methods, using one synthetic dataset and three real-world marketing datasets at Uber to cover different use cases. Based on the empirical results, the selected mRMR method is implemented in production for the marketing machine learning platform. A description of the production implementation is provided and an online experiment deployed through the platform is discussed.
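The greedy mRMR loop, repeatedly picking the feature that is most relevant to the target and least redundant with what was already chosen, can be sketched with absolute correlation standing in for the mutual-information-style measures the framework actually uses:

```python
def pearson(x, y):
    # plain Pearson correlation, guarded against constant columns
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def mrmr(features, target, k):
    # greedy mRMR sketch: score = relevance to target minus mean
    # redundancy with the already-selected features
    selected, remaining = [], list(range(len(features)))
    while len(selected) < k and remaining:
        def score(j):
            rel = abs(pearson(features[j], target))
            red = (sum(abs(pearson(features[j], features[s]))
                       for s in selected) / len(selected)) if selected else 0.0
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

target = [1, 2, 3, 4, 5]
feats = [
    [1, 2, 3, 4, 6],   # highly relevant
    [1, 2, 3, 4, 6],   # equally relevant, but fully redundant duplicate
    [1, 2, 3, 5, 4],   # somewhat relevant, much less redundant
]
chosen = mrmr(feats, target, 2)
```

Note what happens on the toy data: a plain relevance ranking would pick the duplicate second, whereas the redundancy penalty makes mRMR skip it in favor of the less-correlated third feature.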
In this paper, we propose a novel end-to-end framework called KBRD, which stands for Knowledge-Based Recommender Dialog System. It integrates the recommender system and the dialog generation system. The dialog system can enhance the performance of the recommendation system by introducing knowledge-grounded information about users’ preferences, and the recommender system can improve that of the dialog generation system by providing recommendation-aware vocabulary bias. Experimental results demonstrate that our proposed model has significant advantages over the baselines in both the evaluation of dialog generation and recommendation. A series of analyses show that the two systems can bring mutual benefits to each other, and the introduced knowledge bridges the gap between the two systems.
We study the fair allocation of indivisible goods under the assumption that the goods form an undirected graph and each agent must receive a connected subgraph. Our focus is on well-studied fairness notions including envy-freeness and maximin share fairness. We establish graph-specific maximin share guarantees, which are tight for large classes of graphs in the case of two agents and for paths and stars in the general case. Unlike in previous work, our guarantees are with respect to the complete-graph maximin share, which allows us to compare possible guarantees for different graphs. For instance, we show that for biconnected graphs it is possible to obtain at least $3/4$ of the maximin share, while for the remaining graphs the guarantee is at most $1/2$. In addition, we determine the optimal relaxation of envy-freeness that can be obtained with each graph for two agents, and characterize the set of trees and complete bipartite graphs that always admit an allocation satisfying envy-freeness up to one good (EF1) for three agents. Our work demonstrates several applications of graph-theoretical tools and concepts to fair division problems.
This paper presents Knowledge-Based Reinforcement Learning (KB-RL) as a method that combines a knowledge-based approach and a reinforcement learning (RL) technique into one method for intelligent problem solving. The proposed approach focuses on multi-expert knowledge acquisition, with reinforcement learning applied as a conflict resolution strategy aimed at integrating the knowledge of multiple experts into one knowledge base. The article describes the KB-RL approach in detail and applies the reported method to one of the most challenging problems of current Artificial Intelligence (AI) research, namely playing a strategy game. The results show that the KB-RL system is able to play and complete the full FreeCiv game, and to win against the computer players in various game settings. Moreover, with more games played, the system improves its gameplay by shortening the number of rounds that it takes to win the game. Overall, the reported experiment supports the idea that, based on human knowledge and empowered by reinforcement learning, the KB-RL system can deliver a strong solution to complex, multi-strategic problems and, importantly, improve that solution with increased experience.
Recently, a variety of regularization techniques have been widely applied in deep neural networks, such as dropout, batch normalization, data augmentation, and so on. These methods mainly focus on regularizing the weight parameters to prevent overfitting effectively. In addition, label regularization techniques such as label smoothing and label disturbance have also been proposed with the motivation of adding a stochastic perturbation to labels. In this paper, we propose a novel adaptive label regularization method, which enables the neural network to learn from erroneous experience and update the optimal label representation online. Compared with knowledge distillation, which learns the correlation of categories using a teacher network, our proposed method requires only a minuscule increase in parameters without a cumbersome teacher network. Furthermore, we evaluate our method on the CIFAR-10/CIFAR-100/ImageNet datasets for image recognition tasks and the AGNews/Yahoo/Yelp-Full datasets for text classification tasks. The empirical results show significant improvement under all experimental settings.
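As context for the label-regularization techniques the abstract mentions, here is a minimal sketch of standard label smoothing (this is the prior technique the paper builds on, not its adaptive method; the function name and the value of epsilon are illustrative):

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Standard label smoothing: mix the one-hot target with a
    uniform distribution over the K classes."""
    k = len(one_hot)
    return [(1 - epsilon) * y + epsilon / k for y in one_hot]

# A 4-class one-hot target for class 2 becomes a softened target:
print(smooth_labels([0.0, 0.0, 1.0, 0.0], epsilon=0.1))
# ≈ [0.025, 0.025, 0.925, 0.025]
```

The smoothed target still sums to one, so it remains a valid distribution for a cross-entropy loss; the paper's adaptive method instead learns the label representation online rather than fixing it up front.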
There has been considerable growth and interest in industrial applications of machine learning (ML) in recent years. ML engineers, as a consequence, are in high demand across the industry, yet improving the efficiency of ML engineers remains a fundamental challenge. Automated machine learning (AutoML) has emerged as a way to save time and effort on repetitive tasks in ML pipelines, such as data pre-processing, feature engineering, model selection, hyperparameter optimization, and prediction result analysis. In this paper, we investigate the current state of AutoML tools aiming to automate these tasks. We conduct various evaluations of the tools on many datasets, in different data segments, to examine their performance, and compare their advantages and disadvantages on different test cases.
Language model pre-training, such as BERT, has achieved remarkable results in many NLP tasks. However, it is unclear why the pre-training-then-fine-tuning paradigm can improve performance and generalization capability across different tasks. In this paper, we propose to visualize loss landscapes and optimization trajectories of fine-tuning BERT on specific datasets. First, we find that pre-training reaches a good initial point across downstream tasks, which leads to wider optima and easier optimization compared with training from scratch. We also demonstrate that the fine-tuning procedure is robust to overfitting, even though BERT is highly over-parameterized for downstream tasks. Second, the visualization results indicate that fine-tuning BERT tends to generalize better because of the flat and wide optima, and the consistency between the training loss surface and the generalization error surface. Third, the lower layers of BERT are more invariant during fine-tuning, which suggests that the layers that are close to input learn more transferable representations of language.
Self-supervision techniques have allowed neural language models to advance the frontier in Natural Language Understanding. However, existing self-supervision techniques operate at the word-form level, which serves as a surrogate for the underlying semantic content. This paper proposes a method to employ self-supervision directly at the word-sense level. Our model, named SenseBERT, is pre-trained to predict not only the masked words but also their WordNet supersenses. Accordingly, we attain a lexical-semantic level language model, without the use of human annotation. SenseBERT achieves significantly improved lexical understanding, as we demonstrate by experimenting on SemEval, and by attaining a state of the art result on the Word in Context (WiC) task. Our approach is extendable to other linguistic signals, which can be similarly integrated into the pre-training process, leading to increasingly semantically informed language models.

# Distilled News

A non-technical guide for managers, leaders, thinkers and dreamers.
1. Formulate an Executive Strategy
2. Identify and Prioritise Ideas
4. Perform the necessary Risk Assessments
5. Choose the relevant Method & Model
6. Make a BBP decision
7. Run performance checks
8. Deploy the algorithm
9. Communicate both successes and failures
10. Tracking
Today we are going to talk about quantile regression. When we use the lm command in R we are fitting a linear regression using Ordinary Least Squares (OLS), which has the interpretation of a model for the conditional mean of y given x. However, sometimes we may need to look at more than the conditional mean to understand our data, and quantile regression may be a good alternative. Instead of looking at the mean, quantile regression models particular quantiles chosen by the user. The simplest case where quantile regression helps is when your data contain outliers, because the median (the 0.5 quantile) is much less affected by extreme values than the mean. But there are other cases where quantile regression may be used, for example to identify heterogeneous effects of a variable or to add robustness to your results.
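The robustness point can be made concrete with the pinball (check) loss that underlies quantile regression: minimising it over a constant recovers the sample quantile, and for tau = 0.5 that is the median. A pure-Python sketch (the grid search is restricted to the data points for brevity, which is valid because the total loss is piecewise linear with kinks at the data points):

```python
def pinball_loss(residual, tau):
    # Check loss for quantile tau: tau*r for r >= 0, (tau-1)*r otherwise
    return tau * residual if residual >= 0 else (tau - 1) * residual

def best_constant(ys, tau, grid):
    # Brute-force the constant c minimising the total pinball loss;
    # for tau = 0.5 this recovers the sample median.
    return min(grid, key=lambda c: sum(pinball_loss(y - c, tau) for y in ys))

data = [1, 2, 3, 4, 100]          # one big outlier
mean = sum(data) / len(data)      # dragged up by the outlier
median = best_constant(data, 0.5, data)
print(mean, median)               # 22.0 3
```

The mean jumps to 22 because of the single outlier, while the median stays at 3; other choices of tau model other conditional quantiles in the same way.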
Surveys play a central role in gathering client feedback on a particular product or service offered to the public. Are we getting too much negative feedback? Why? How can we fix the issues? What are we doing well, and what have we improved over time? Which issues matter most? Assessing responses from customer surveys and creating a report that answers these questions is easier said than done. It may take us hours, or even days, to go through all the responses and find the root of a problem.
Recently, I came across a great video of Prof. Andrew Ng explaining in front of a CS class at Stanford how one can excel in the field of artificial intelligence. I will rephrase his words below. Deep learning is evolving so fast that, even once you have learned its foundations, you need to read research papers while working on specific applications to stay on top of the most recent ideas.
What makes a great data driven product? Fancy models? Ground breaking ideas? The truth is that the secret sauce usually rests in successfully implementing a product methodology. In this post I carry out a retro on a recent hackathon experience, using lean and agile methodology concepts of Minimum Viable Product, Risky Assumptions, and Spikes. I explore how these approaches can help a team quickly identify a use case, map the risks and complexity of the solutions envisioned, and iterate rapidly towards a shippable product.
I’ve found that, for most people, it is a little difficult to get started with Apache Spark (this post will focus on PySpark) and install it on a local machine. With this simple tutorial you’ll get there really fast! Apache Spark is a must for big data lovers, as it is a fast, easy-to-use general engine for big data processing with built-in modules for streaming, SQL, machine learning and graph processing. This technology is an in-demand skill for data engineers, but data scientists can also benefit from learning Spark when doing Exploratory Data Analysis (EDA), feature extraction and, of course, ML. But please remember that Spark’s potential is only truly realized when it is run on a cluster with a large number of nodes.
1. Squared Error Loss
2. Absolute Error Loss
3. Huber Loss
4. Binary Cross Entropy Loss
5. Hinge Loss
6. Multi-Class Cross Entropy Loss
7. Kullback-Leibler Divergence
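Several of the losses listed above have one-line definitions; a pure-Python sketch of a few of them (per-example versions, with the usual conventions: y in {-1, +1} for the hinge loss and y in {0, 1} for binary cross-entropy):

```python
import math

def squared_error(y, p):
    return (y - p) ** 2

def absolute_error(y, p):
    return abs(y - p)

def huber(y, p, delta=1.0):
    # Quadratic near zero, linear in the tails: robust to outliers
    r = abs(y - p)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def hinge(y, score):
    # y in {-1, +1}; zero loss once the margin y*score exceeds 1
    return max(0.0, 1.0 - y * score)

def binary_cross_entropy(y, p):
    # y in {0, 1}, predicted probability p in (0, 1)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(huber(0.0, 3.0))   # 2.5 — linear regime: 1.0 * (3 - 0.5)
print(hinge(1, 0.3))     # 0.7 — inside the margin
```

The multi-class cross-entropy and KL divergence extend binary cross-entropy from two outcomes to a full distribution over classes.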
At the Summer School of the Swiss Association of Actuaries in Lausanne, following Jean-Philippe Boucher’s (UQAM) session on telematic data, I will start talking about pictures this Wednesday.
The utility of technology is dependent on its accessibility. One key component of accessibility is automatic speech recognition (ASR), which can greatly improve the ability of those with speech impairments to interact with everyday smart devices. However, ASR systems are most often trained on ‘typical’ speech, which means that underrepresented groups, such as those with speech impairments or heavy accents, don’t experience the same degree of utility. For example, amyotrophic lateral sclerosis (ALS) is a disease that can adversely affect a person’s speech – about 25% of people with ALS experience slurred speech as their first symptom. In addition, most people with ALS eventually lose the ability to walk, so being able to interact with automated devices from a distance can be very important. Yet current state-of-the-art ASR models can yield high word error rates (WER) for speakers with only a moderate speech impairment from ALS, effectively barring access to ASR-reliant technologies.
We develop model averaging estimation in the linear regression model where some covariates are subject to measurement error. The absence of the true covariates in this framework makes the calculation of the standard residual-based loss function impossible. We take advantage of the explicit form of the parameter estimators and construct a weight choice criterion. It is asymptotically equivalent to the unknown model average estimator minimizing the loss function. When the true model is not included in the set of candidate models, the method achieves optimality in terms of minimizing the relative loss, whereas, when the true model is included, the method estimates the model parameters at the root-n rate. Simulation results in comparison with existing Bayesian information criterion and Akaike information criterion model selection and model averaging methods strongly favour our model averaging method. The method is applied to a study on health.
We propose a novel class of dynamic shrinkage processes for Bayesian time series and regression analysis. Building on a global-local framework of prior construction, in which continuous scale mixtures of Gaussian distributions are employed for both desirable shrinkage properties and computational tractability, we model dependence between the local scale parameters. The resulting processes inherit the desirable shrinkage behaviour of popular global-local priors, such as the horseshoe prior, but provide additional localized adaptivity, which is important for modelling time series data or regression functions with local features. We construct a computationally efficient Gibbs sampling algorithm based on a Pólya-gamma scale mixture representation of the process proposed. Using dynamic shrinkage processes, we develop a Bayesian trend filtering model that produces more accurate estimates and tighter posterior credible intervals than do competing methods, and we apply the model for irregular curve fitting of minute-by-minute Twitter central processor unit usage data. In addition, we develop an adaptive time varying parameter regression model to assess the efficacy of the Fama-French five-factor asset pricing model with momentum added as a sixth factor. Our dynamic analysis of manufacturing and healthcare industry data shows that, with the exception of the market risk, no other risk factors are significant except for brief periods.
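Schematically, and with notation assumed here rather than taken verbatim from the paper, a global-local dynamic shrinkage prior on time-varying coefficients $\beta_t$ can be written as

$$\beta_t \mid \tau, \lambda_t \sim N(0, \tau^2 \lambda_t^2), \qquad h_t = \log(\tau^2 \lambda_t^2), \qquad h_{t+1} = \mu + \phi\,(h_t - \mu) + \eta_t,$$

so the local scales $\lambda_t$ evolve as a log-scale autoregression: shrinkage information is carried across time through $h_t$, while horseshoe-like marginal behaviour is retained for suitable choices of the innovation distribution of $\eta_t$.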
EDA (Exploratory Data Analysis) is one of the key steps in any data science project. The better the EDA, the better the feature engineering that can be done. From modelling to communication, EDA has many hidden benefits that often aren’t emphasised when data science is taught to beginners.

# If you did not already know

Snapshot Distillation
Optimizing a deep neural network is a fundamental task in computer vision, yet direct training methods often suffer from over-fitting. Teacher-student optimization aims at providing complementary cues from a model trained previously, but these approaches are often considerably slow due to the pipeline of training a few generations in sequence, i.e., time complexity is increased several times. This paper presents snapshot distillation (SD), the first framework which enables teacher-student optimization in one generation. The idea of SD is very simple: instead of borrowing supervision signals from previous generations, we extract such information from earlier epochs in the same generation, while making sure that the difference between teacher and student is sufficiently large so as to prevent under-fitting. To achieve this goal, we implement SD with a cyclic learning rate policy, in which the last snapshot of each cycle is used as the teacher for all iterations in the next cycle, and the teacher signal is smoothed to provide richer information. On standard image classification benchmarks such as CIFAR100 and ILSVRC2012, SD achieves consistent accuracy gains without heavy computational overhead. We also verify that models pre-trained with SD transfer well to object detection and semantic segmentation on the PascalVOC dataset. …
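The cyclic learning rate policy mentioned above can be sketched as a cosine-annealed restart schedule (in the spirit of SGDR/snapshot ensembling, on which snapshot distillation builds; the exact schedule and constants here are illustrative, not the paper's):

```python
import math

def cyclic_cosine_lr(t, total_iters, num_cycles, base_lr):
    """Cosine-annealed cyclic schedule: the learning rate restarts at
    base_lr at the start of each cycle and decays to ~0 at its end,
    which is where a teacher snapshot would be taken."""
    cycle_len = total_iters // num_cycles
    pos = (t % cycle_len) / cycle_len   # position within the cycle, in [0, 1)
    return base_lr * 0.5 * (math.cos(math.pi * pos) + 1.0)

# Start of a cycle: full rate; end of a cycle: near zero; then restart.
print(cyclic_cosine_lr(0, 300, 3, 0.1))    # 0.1
print(cyclic_cosine_lr(99, 300, 3, 0.1))   # close to 0
print(cyclic_cosine_lr(100, 300, 3, 0.1))  # 0.1 again
```

Driving the rate to near zero at each cycle boundary makes the snapshot a converged-looking model, which is what lets it serve as a teacher for the following cycle.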

ImageNet-P
In this paper we establish rigorous benchmarks for image classifier robustness. Our first benchmark, ImageNet-C, standardizes and expands the corruption robustness topic, while showing which classifiers are preferable in safety-critical applications. Then we propose a new dataset called ImageNet-P which enables researchers to benchmark a classifier’s robustness to common perturbations. Unlike recent robustness research, this benchmark evaluates performance on common corruptions and perturbations not worst-case adversarial perturbations. We find that there are negligible changes in relative corruption robustness from AlexNet classifiers to ResNet classifiers. Afterward we discover ways to enhance corruption and perturbation robustness. We even find that a bypassed adversarial defense provides substantial common perturbation robustness. Together our benchmarks may aid future work toward networks that robustly generalize. …

THUMT
This paper introduces THUMT, an open-source toolkit for neural machine translation (NMT) developed by the Natural Language Processing Group at Tsinghua University. THUMT implements the standard attention-based encoder-decoder framework on top of Theano and supports three training criteria: maximum likelihood estimation, minimum risk training, and semi-supervised training. It features a visualization tool for displaying the relevance between hidden states in neural networks and contextual words, which helps to analyze the internal workings of NMT. Experiments on Chinese-English datasets show that THUMT using minimum risk training significantly outperforms GroundHog, a state-of-the-art toolkit for NMT. …

Layer Reuse Network (LruNet)
A convolutional layer in a Convolutional Neural Network (CNN) consists of many filters which apply the convolution operation to the input, capture some special patterns and pass the result to the next layer. If the same patterns also occur at the deeper layers of the network, why wouldn’t the same convolutional filters be used in those layers as well? In this paper, we propose a CNN architecture, Layer Reuse Network (LruNet), where the convolutional layers are used repeatedly without the need to introduce new layers to obtain better performance. This approach introduces several advantages: (i) a considerable number of parameters is saved, since we reuse layers instead of introducing new ones, (ii) the Memory Access Cost (MAC) can be reduced, since reused layer parameters need to be fetched only once, (iii) the number of nonlinearities increases with layer reuse, and (iv) reused layers get gradient updates from multiple parts of the network. The proposed approach is evaluated on the CIFAR-10, CIFAR-100 and Fashion-MNIST datasets for the image classification task, and layer reuse improves the performance by 5.14%, 5.85% and 2.29%, respectively. The source code and pretrained models are publicly available. …
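Advantage (i) is easy to see with back-of-the-envelope arithmetic; the layer sizes below are illustrative, not taken from the paper:

```python
def conv_params(c_in, c_out, k):
    # Weight count of one k x k convolutional layer (biases ignored)
    return c_in * c_out * k * k

depth, c, k = 16, 64, 3
distinct = depth * conv_params(c, c, k)   # 16 separate 3x3 conv layers
reused = conv_params(c, c, k)             # one layer applied 16 times
print(distinct, reused)                   # 589824 36864
```

Reusing one layer at 16 positions stores 16x fewer convolution weights than stacking 16 distinct layers of the same shape, while the forward pass still applies 16 convolutions.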

# Document worth reading: “Metrics for Graph Comparison: A Practitioner’s Guide”

Comparison of graph structure is a ubiquitous task in data analysis and machine learning, with diverse applications in fields such as neuroscience, cyber security, social network analysis, and bioinformatics, among others. Discovery and comparison of structures such as modular communities, rich clubs, hubs, and trees in data in these fields yields insight into the generative mechanisms and functional properties of the graph. Often, two graphs are compared via a pairwise distance measure, with a small distance indicating structural similarity and vice versa. Common choices include spectral distances (also known as $\lambda$ distances) and distances based on node affinities. However, there has as yet been no comparative study of the efficacy of these distance measures in discerning between common graph topologies and different structural scales. In this work, we compare commonly used graph metrics and distance measures, and demonstrate their ability to discern between common topological features found in both random graph models and empirical datasets. We put forward a multi-scale picture of graph structure, in which the effect of global and local structure upon the distance measures is considered. We make recommendations on the applicability of different distance measures to empirical graph data problems based on this multi-scale view. Finally, we introduce the Python library NetComp which implements the graph distances used in this work. Metrics for Graph Comparison: A Practitioner’s Guide
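To make the idea of a spectral ($\lambda$) distance concrete, here is a pure-Python toy that compares two three-node graphs by their leading adjacency eigenvalue. Real spectral distances compare many eigenvalues of the adjacency or Laplacian matrices (e.g. as implemented in NetComp); this sketch is illustrative only:

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def top_eigenvalue(A, iters=500):
    """Largest eigenvalue of a small symmetric matrix via shifted power
    iteration; the diagonal shift makes the spectrum positive so the
    iteration converges even when eigenvalues come in +/- pairs."""
    n = len(A)
    shift = float(n)  # adjacency eigenvalues exceed -n, so A + n*I > 0
    B = [[A[i][j] + (shift if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    x = [1.0] * n
    for _ in range(iters):
        y = matvec(B, x)
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    y = matvec(B, x)
    # Rayleigh quotient of the converged unit vector, shifted back
    return sum(xi * yi for xi, yi in zip(x, y)) - shift

triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]  # K3: leading eigenvalue 2
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]      # P3: leading eigenvalue sqrt(2)
lambda_dist = abs(top_eigenvalue(triangle) - top_eigenvalue(path))
print(lambda_dist)  # ≈ 0.586: the two topologies are spectrally distinct
```

A small distance would indicate spectral (hence, loosely, structural) similarity; here the triangle and the path are clearly separated even by this single-eigenvalue comparison.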