# Document worth reading: “Algorithms and Statistical Models for Scientific Discovery in the Petabyte Era”

The field of astronomy has arrived at a turning point in terms of size and complexity of both datasets and scientific collaboration. Commensurately, algorithms and statistical models have begun to adapt — e.g., via the onset of artificial intelligence — which itself presents new challenges and opportunities for growth. This white paper aims to offer guidance and ideas for how we can evolve our technical and collaborative frameworks to promote efficient algorithmic development and take advantage of opportunities for scientific discovery in the petabyte era. We discuss challenges for discovery in large and complex data sets; challenges and requirements for the next stage of development of statistical methodologies and algorithmic tool sets; how we might change our paradigms of collaboration and education; and the ethical implications of scientists’ contributions to widely applicable algorithms and computational modeling. We start with six distinct recommendations that are supported by the commentary following them. This white paper is related to a larger corpus of effort that has taken place within and around the Petabytes to Science Workshops (https://petabytestoscience.github.io/). Algorithms and Statistical Models for Scientific Discovery in the Petabyte Era

# What’s going on on PyPI

Scanning all new published packages on PyPI I know that the quality is often quite bad. I try to filter out the worst ones and list here the ones which might be worth a look, being followed or inspire you in some way.

essay-eval
Evaluation of essays using NLP

genclf
Gender Classifier ML Package for classifying gender using firstname

pyTsetlinMachineParallel
Multi-threaded implementation of the Tsetlin Machine, Convolutional Tsetlin Machine, Regression Tsetlin Machine, and Weighted Tsetlin Machine, with support for continuous features and multigranularity.

resystools
a lib with some recommendation algorithms

seqvec
Embeeder tool for ‘Modelling the Language of Life – Deep Learning Protein Sequences’. One common task in Computational Biology is the prediction of aspects of protein function and structure from their amino acid sequence. For 26 years, most state-of-the-art approaches toward this end have been marrying machine learning and evolutionary information. The retrieval of related proteins from ever growing sequence databases is becoming so time-consuming that the analysis of entire proteomes becomes challenging. On top, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome.

sinabs
A spiking deep neural network simulator, and neuromoprhic hardware emulator. sinabs (pytorch based library) is developed to design and implement Spiking Convolutional Neural Networks (SCNNs). The library implements several layers that are spiking equivalents of CNN layers. In addition it provides support to import CNN models implemented in keras conveniently to test their spiking equivalent implementation.

gplpy
Grammar Guided Genetic Programming Library

mskit
A tool kit for MS data post-processing

mxnet-osx
mxnet-osx

nbeats-keras
N-Beats: Neural basis expansion analysis for interpretable time series forecasting
• Implementation in Pytorch
• Implementation in Keras

nbeats-pytorch
N-Beats: Neural basis expansion analysis for interpretable time series forecasting
• Implementation in Pytorch
• Implementation in Keras

# Finding out why

Difference-in-differences (diff-in-diff) is a study design that compares outcomes of two groups (treated and comparison) at two time points (pre- and post-treatment) and is widely used in evaluating new policy implementations. For instance, diff-in-diff has been used to estimate the effect that increasing minimum wage has on employment rates and to assess the Affordable Care Act’s effect on health outcomes. Although diff-in-diff appears simple, potential pitfalls lurk. In this paper, we discuss one such complication: time-varying confounding. We provide rigorous definitions for confounders in diff-in-diff studies and explore regression strategies to adjust for confounding. In simulations, we show how and when regression adjustment can ameliorate confounding for both time-invariant and time-varying covariates. We compare our regression approach to those models commonly fit in applied literature, which often fail to address the time-varying nature of confounding in diff-in-diff.
Counterfactual explanations help users understand why machine learned models make certain decisions, and more specifically, how these decisions can be changed. In this work, we frame the problem of finding counterfactual explanations — the minimal perturbation to an input such that the prediction changes — as an optimization task. Previously, optimization techniques for generating counterfactual examples could only be applied to differentiable models, or alternatively via query access to the model by estimating gradients from randomly sampled perturbations. In order to accommodate non-differentiable models such as tree ensembles, we propose using probabilistic model approximations in the optimization framework. We introduce a novel approximation technique that is effective for finding counterfactual explanations while also closely approximating the original model. Our results show that our method is able to produce counterfactual examples that are closer to the original instance in terms of Euclidean, Cosine, and Manhattan distance compared to other methods specifically designed for tree ensembles.
Causal inference goes beyond prediction by modeling the outcome of interventions and formalizing counterfactual reasoning. In this blog post, I provide an introduction to the graphical approach to causal inference in the tradition of Sewell Wright, Judea Pearl, and others. We first rehash the common adage that correlation is not causation. We then move on to climb what Pearl calls the ‘ladder of causal inference’, from association (seeing) to intervention (doing) to counterfactuals (imagining). We will discover how directed acyclic graphs describe conditional (in)dependencies; how the do-calculus describes interventions; and how Structural Causal Models allow us to imagine what could have been. This blog post is by no means exhaustive, but should give you a first appreciation of the concepts that surround causal inference; references to further readings are provided below. Let’s dive in!1

Python Library: causalnex

Propensity score matching is commonly used to draw causal inference from observational survival data. However, there is no gold standard approach to analyze survival data after propensity score matching, and variance estimation after matching is open to debate. We derive the statistical properties of the propensity score matching estimator of the marginal causal hazard ratio based on matching with replacement and a fixed number of matches. We also propose a double-resampling technique for variance estimation that takes into account the uncertainty due to propensity score estimation prior to matching.
For decades, researchers in fields, such as the natural and social sciences, have been verifying causal relationships and investigating hypotheses that are now well-established or understood as truth. These causal mechanisms are properties of the natural world, and thus are invariant conditions regardless of the collection domain or environment. We show in this paper how prior knowledge in the form of a causal graph can be utilized to guide model selection, i.e., to identify from a set of trained networks the models that are the most robust and invariant to unseen domains. Our method incorporates prior knowledge (which can be incomplete) as a Structural Causal Model (SCM) and calculates a score based on the likelihood of the SCM given the target predictions of a candidate model and the provided input variables. We show on both publicly available and synthetic datasets that our method is able to identify more robust models in terms of generalizability to unseen out-of-distribution test examples and domains where covariates have shifted.
Marginal structural models (MSMs) are commonly used to estimate causal intervention effects in longitudinal non-randomised studies. A common issue when analysing data from observational studies is the presence of incomplete confounder data, which might lead to bias in the intervention effect estimates if they are not handled properly in the statistical analysis. However, there is currently no recommendation on how to address missing data on covariates in MSMs under a variety of missingness mechanisms encountered in practice. We reviewed existing methods to handling missing data in MSMs and performed a simulation study to compare the performance of complete case (CC) analysis, the last observation carried forward (LOCF), the missingness pattern approach (MPA), multiple imputation (MI) and inverse-probability-of-missingness weighting (IPMW). We considered three mechanisms for non-monotone missing data which are common in observational studies using electronic health record data. Whereas CC analysis lead to biased estimates of the intervention effect in almost all scenarios, the performance of the other approaches varied across scenarios. The LOCF approach led to unbiased estimates only under a specific non-random mechanism in which confounder values were missing when their values remained unchanged since the previous measurement. In this scenario, MI, the MPA and IPMW were biased. MI and IPMW led to the estimation of unbiased effects when data were missing at random, given the covariates or the treatment but only MI was unbiased when the outcome was a predictor of missingness. Furthermore, IPMW generally lead to very large standard errors. Lastly, regardless of the missingness mechanism, the MPA led to unbiased estimates only when the failure to record a confounder at a given time-point modified the subsequent relationships between the partially observed covariate and the outcome.
The aim of this paper is to investigate various information-theoretic measures, including entropy, mutual information, and some systematic measures that based on mutual information, for a class of structured spiking neuronal network. In order to analyze and compute these information-theoretic measures for large networks, we coarse-grained the data by ignoring the order of spikes that fall into the same small time bin. The resultant coarse-grained entropy mainly capture the information contained in the rhythm produced by a local population of the network. We first proved that these information theoretical measures are well-defined and computable by proving the stochastic stability and the law of large numbers. Then we use three neuronal network examples, from simple to complex, to investigate these information-theoretic measures. Several analytical and computational results about properties of these information-theoretic measures are given.

# If you did not already know

Minimax Entropy (MME)
Contemporary domain adaptation methods are very effective at aligning feature distributions of source and target domains without any target supervision. However, we show that these techniques perform poorly when even a few labeled examples are available in the target. To address this semi-supervised domain adaptation (SSDA) setting, we propose a novel Minimax Entropy (MME) approach that adversarially optimizes an adaptive few-shot model. Our base model consists of a feature encoding network, followed by a classification layer that computes the features’ similarity to estimated prototypes (representatives of each class). Adaptation is achieved by alternately maximizing the conditional entropy of unlabeled target data with respect to the classifier and minimizing it with respect to the feature encoder. We empirically demonstrate the superiority of our method over many baselines, including conventional feature alignment and few-shot methods, setting a new state of the art for SSDA. …

Integrated Rational Prediction and Motionless TextbfANalysis (IRON-MAN)
Analyzing video for traffic categorization is an important pillar of Intelligent Transport Systems. However, it is difficult to analyze and predict traffic based on image frames because the representation of each frame may vary significantly within a short time period. This also would inaccurately represent the traffic over a longer period of time such as the case of video. We propose a novel bio-inspired methodology that integrates analysis of the previous image frames of the video to represent the analysis of the current image frame, the same way a human being analyzes the current situation based on past experience. In our proposed methodology, called IRON-MAN (Integrated Rational prediction and Motionless textbfANalysis), we utilize Bayesian update on top of the individual image frame analysis in the videos and this has resulted in highly accurate prediction of Temporal Motionless Analysis of the Videos (TMAV) for most of the chosen test cases. The proposed approach could be used for TMAV using Convolutional Neural Network (CNN) for applications where the number of objects in an image is the deciding factor for prediction and results also show that our proposed approach outperforms the state-of-the-art for the chosen test case. We also introduce a new metric named, Energy Consumption per Training Image (ECTI). Since, different CNN based models have different training capability and computing resource utilization, some of the models are more suitable for embedded device implementation than the others, and ECTI metric is useful to assess the suitability of using a CNN model in multi-processor systems-on-chips (MPSoCs) with a focus on energy consumption and reliability in terms of lifespan of the embedded device using these MPSoCs. …

Functional Aggregate Queries (FAQ)
Motivated by fundamental applications in databases and relational machine learning, we formulate and study the problem of answering Functional Aggregate Queries (FAQ) in which some of the input factors are defined by a collection of Additive Inequalities between variables. We refer to these queries as FAQ-AI for short. To answer FAQ-AI in the Boolean semiring, we define ‘relaxed’ tree decompositions and ‘relaxed’ submodular and fractional hypertree width parameters. We show that an extension of the InsideOut algorithm using Chazelle’s geometric data structure for solving the semigroup range search problem can answer Boolean FAQ-AI in time given by these new width parameters. This new algorithm achieves lower complexity than known solutions for FAQ-AI. It also recovers some known results in database query answering. Our second contribution is a relaxation of the set of polymatroids that gives rise to the counting version of the submodular width, denoted by ‘#subw’. This new width is sandwiched between the submodular and the fractional hypertree widths. Any FAQ and FAQ-AI over one semiring can be answered in time proportional to #subw and respectively to the relaxed version of #subw. We present three applications of our FAQ-AI framework to relational machine learning: k-means clustering, training linear support vector machines, and training models using non-polynomial loss. These optimization problems can be solved over a database asymptotically faster than computing the join of the database relations. …

Approximate Common Variance (A-ComVar)
We consider nonregular fractions of factorial experiments for a class of linear models. These models have a common general mean and main effects, however they may have different 2-factor interactions. Here we assume for simplicity that 3-factor and higher order interactions are negligible. In the absence of a priori knowledge about which interactions are important, it is reasonable to prefer a design that results in equal variance for the estimates of all interaction effects to aid in model discrimination. Such designs are called common variance designs and can be quite challenging to identify without performing an exhaustive search of possible designs. In this work, we introduce an extension of common variance designs called approximate common variance, or A-ComVar designs. We develop a numerical approach to finding A-ComVar designs that is much more efficient than an exhaustive search. We present the types of A-ComVar designs that can be found for different number of factors, runs, and interactions. We further demonstrate the competitive performance of both common variance and A-ComVar designs with Plackett-Burman designs for model selection using simulation. …

# Document worth reading: “Machine Learning for Fluid Mechanics”

The field of fluid mechanics is rapidly advancing, driven by unprecedented volumes of data from experiments, field measurements, and large-scale simulations at multiple spatiotemporal scales. Machine learning presents us with a wealth of techniques to extract information from data that can be translated into knowledge about the underlying fluid mechanics. Moreover, machine learning algorithms can augment domain knowledge and automate tasks related to flow control and optimization. This article presents an overview of past history, current developments, and emerging opportunities of machine learning for fluid mechanics. We outline fundamental machine learning methodologies and discuss their uses for understanding, modeling, optimizing, and controlling fluid flows. The strengths and limitations of these methods are addressed from the perspective of scientific inquiry that links data with modeling, experiments, and simulations. Machine learning provides a powerful information processing framework that can augment, and possibly even transform, current lines of fluid mechanics research and industrial applications. Machine Learning for Fluid Mechanics

# Whats new on arXiv

Past literature has been effective in demonstrating ideological gaps in machine learning (ML) fairness definitions when considering their use in complex socio-technical systems. However, we go further to demonstrate that these definitions often misunderstand the legal concepts from which they purport to be inspired, and consequently inappropriately co-opt legal language. In this paper, we demonstrate examples of this misalignment and discuss the differences in ML terminology and their legal counterparts, as well as what both the legal and ML fairness communities can learn from these tensions. We focus this paper on U.S. anti-discrimination law since the ML fairness research community regularly references terms from this body of law.
Deep neural networks with more parameters and FLOPs have higher capacity and generalize better to diverse domains. But to be deployed on edge devices, the model’s complexity has to be constrained due to limited compute resource. In this work, we propose a method to improve the model capacity without increasing inference-time complexity. Our method is based on an assumption of data locality: for an edge device, within a short period of time, the input data to the device are sampled from a single domain with relatively low diversity. Therefore, it is possible to utilize a specialized, low-complexity model to achieve good performance in that input domain. To leverage this, we propose Domain-aware Dynamic Network (DDN), which is a high-capacity dynamic network in which each layer contains multiple weights. During inference, based on the input domain, DDN dynamically combines those weights into one single weight that specializes in the given domain. This way, DDN can keep the inference-time complexity low but still maintain a high capacity. Experiments show that without increasing the parameters, FLOPs, and actual latency, DDN achieves up to 2.6\% higher AP50 than a static network on the BDD100K object-detection benchmark.
Recent advances in artificial intelligence research have led to a profusion of studies that apply deep learning to problems in image analysis and natural language processing among others. Additionally, the availability of open-source computational frameworks has lowered the barriers to implementing state-of-the-art methods across multiple domains. Albeit leading to major performance breakthroughs in some tasks, effective dissemination of deep learning algorithms remains challenging, inhibiting reproducibility and benchmarking studies, impeding further validation, and ultimately hindering their effectiveness in the cumulative scientific progress. In developing a platform for sharing research outputs, we present ModelHub.AI (www.modelhub.ai), a community-driven container-based software engine and platform for the structured dissemination of deep learning models. For contributors, the engine controls data flow throughout the inference cycle, while the contributor-facing standard template exposes model-specific functions including inference, as well as pre- and post-processing. Python and RESTful Application programming interfaces (APIs) enable users to interact with models hosted on ModelHub.AI and allows both researchers and developers to utilize models out-of-the-box. ModelHub.AI is domain-, data-, and framework-agnostic, catering to different workflows and contributors’ preferences.
The aim of this paper is to investigate various information-theoretic measures, including entropy, mutual information, and some systematic measures that based on mutual information, for a class of structured spiking neuronal network. In order to analyze and compute these information-theoretic measures for large networks, we coarse-grained the data by ignoring the order of spikes that fall into the same small time bin. The resultant coarse-grained entropy mainly capture the information contained in the rhythm produced by a local population of the network. We first proved that these information theoretical measures are well-defined and computable by proving the stochastic stability and the law of large numbers. Then we use three neuronal network examples, from simple to complex, to investigate these information-theoretic measures. Several analytical and computational results about properties of these information-theoretic measures are given.
Supervised systems require human labels for training. But, are humans themselves always impartial during the annotation process? We examine this question in the context of automated assessment of human behavioral tasks. Specifically, we investigate whether human ratings themselves can be trusted at their face value when scoring video-based structured interviews, and whether such ratings can impact machine learning models that use them as training data. We present preliminary empirical evidence that indicates there might be biases in such annotations, most of which are visual in nature.
Recently the concept of transformative AI (TAI) has begun to receive attention in the AI policy space. TAI is often framed as an alternative formulation to notions of strong AI (e.g. artificial general intelligence or superintelligence) and reflects increasing consensus that advanced AI which does not fit these definitions may nonetheless have extreme and long-lasting impacts on society. However, the term TAI is poorly defined and often used ambiguously. Some use the notion of TAI to describe levels of societal transformation associated with previous ‘general purpose technologies’ (GPTs) such as electricity or the internal combustion engine. Others use the term to refer to more drastic levels of transformation comparable to the agricultural or industrial revolutions. The notion has also been used much more loosely, with some implying that current AI systems are already having a transformative impact on society. This paper unpacks and analyses the notion of TAI, proposing a distinction between TAI and radically transformative AI (RTAI), roughly corresponding to societal change on the level of the agricultural or industrial revolutions. We describe some relevant dimensions associated with each and discuss what kinds of advances in capabilities they might require. We further consider the relationship between TAI and RTAI and whether we should necessarily expect a period of TAI to precede the emergence of RTAI. This analysis is important as it can help guide discussions among AI policy researchers about how to allocate resources towards mitigating the most extreme impacts of AI and it can bring attention to negative TAI scenarios that are currently neglected.
We present, FT-SWRL, a fuzzy temporal extension to the Semantic Web Rule Language (SWRL), which combines fuzzy theories based on the valid-time temporal model to provide a standard approach for modeling imprecise temporal domain knowledge in OWL ontologies. The proposal introduces a fuzzy temporal model for the semantic web, which is syntactically defined as a fuzzy temporal SWRL ontology (SWRL-FTO) with a new set of fuzzy temporal SWRL built-ins for defining their semantics. The SWRL-FTO hierarchically defines the necessary linguistic terminologies and variables for the fuzzy temporal model. An example model demonstrating the usefulness of the fuzzy temporal SWRL built-ins to model imprecise temporal information is also represented. Fuzzification process of interval-based temporal logic is further discussed as a reasoning paradigm for our FT-SWRL rules, with the aim of achieving a complete OWL-based fuzzy temporal reasoning. Literature review on fuzzy temporal representation approaches, both within and without the use of ontologies, led to the conclusion that the FT-SWRL model can authoritatively serve as a formal specification for handling imprecise temporal expressions on the semantic web.
This paper introduces the $d$-distance matching problem, in which we are given a bipartite graph $G=(S,T;E)$ with $S=\{s_1,\dots,s_n\}$, a weight function on the edges and an integer $d\in\mathbb{N}$. The goal is to find a maximum weight subset $M\subseteq E$ of the edges satisfying the following two conditions: i) the degree of every node of $S$ is at most one in $M$, ii) if $ts_i,ts_j\in M$ for some $i, then $j-i\geq d$. The question finds applications, for example, in various scheduling problems. $\quad$We show that the problem is NP-complete in general and admits a simple $3$-approximation. We give an FPT algorithm parameterized by $d$ and also settle the case when the size of $T$ is constant. From an approximability point of view, we consider several greedy approaches. In particular, a local search algorithm is presented that achieves an approximation ratio of $3/2+\epsilon$ for any constant $\epsilon>0$ in the unweighted case. We show that the integrality gap of the natural integer programming model is at most $2-\frac{1}{2d-1}$, and give an LP-based approximation algorithm for the weighted case with the same guarantee. The novel approaches used in the analysis of the integrality gap and the approximation ratio of locally optimal solutions might be of independent combinatorial interest.
In this paper we present a new framework for time-series modeling that combines the best of traditional statistical models and neural networks. We focus on time-series with long-range dependencies, needed for monitoring fine granularity data (e.g. minutes, seconds, milliseconds), prevalent in operational use-cases. Traditional models, such as auto-regression fitted with least squares (Classic-AR) can model time-series with a concise and interpretable model. When dealing with long-range dependencies, Classic-AR models can become intractably slow to fit for large data. Recently, sequence-to-sequence models, such as Recurrent Neural Networks, which were originally intended for natural language processing, have become popular for time-series. However, they can be overly complex for typical time-series data and lack interpretability. A scalable and interpretable model is needed to bridge the statistical and deep learning-based approaches. As a first step towards this goal, we propose modelling AR-process dynamics using a feed-forward neural network approach, termed AR-Net. We show that AR-Net is as interpretable as Classic-AR but also scales to long-range dependencies. Our results lead to three major conclusions: First, AR-Net learns identical AR-coefficients as Classic-AR, thus being equally interpretable. Second, the computational complexity with respect to the order of the AR process, is linear for AR-Net as compared to a quadratic for Classic-AR. This makes it possible to model long-range dependencies within fine granularity data. Third, by introducing regularization, AR-Net automatically selects and learns sparse AR-coefficients. This eliminates the need to know the exact order of the AR-process and allows to learn sparse weights for a model with long-range dependencies.
Today, Android devices are able to provide various services. They support applications for different purposes such as entertainment, business, health, education, and banking services. Because of the functionality and popularity of Android devices as well as the open-source policy of Android OS, they have become a suitable target for attackers. Android Botnet is one of the most dangerous malwares because an attacker called Botmaster can control that remotely to perform destructive attacks. A number of researchers have used different well-known Machine Learning (ML) methods to recognize Android Botnets from benign applications. However, these conventional methods are not able to detect new sophisticated Android Botnets. In this paper, we propose a novel method based on Android permissions and Convolutional Neural Networks (CNNs) to classify Botnets and benign Android applications. Being the first developed method that uses CNNs for this aim, we also proposed a novel method to represent each application as an image which is constructed based on the co-occurrence of used permissions in that application. The proposed CNN is a binary classifier that is trained using these images. Evaluating the proposed method on 5450 Android applications consist of Botnet and benign samples, the obtained results show the accuracy of 97.2% and recall of 96% which is a promising result just using Android permissions.
In this paper, we propose a new product knowledge graph (PKG) embedding approach for learning the intrinsic product relations as product knowledge for e-commerce. We define the key entities and summarize the pivotal product relations that are critical for general e-commerce applications including marketing, advertisement, search ranking and recommendation. We first provide a comprehensive comparison between PKG and ordinary knowledge graph (KG) and then illustrate why KG embedding methods are not suitable for PKG learning. We construct a self-attention-enhanced distributed representation learning model for learning PKG embeddings from raw customer activity data in an end-to-end fashion. We design an effective multi-task learning schema to fully leverage the multi-modal e-commerce data. The Poincare embedding is also employed to handle complex entity structures. We use a real-world dataset from grocery.walmart.com to evaluate the performances on knowledge completion, search ranking and recommendation. The proposed approach compares favourably to baselines in knowledge completion and downstream tasks.
Graph convolutional networks (GCNs) have shown the powerful ability in text structure representation and effectively facilitate the task of text classification. However, challenges still exist in adapting GCN on learning discriminative features from texts due to the main issue of graph variants incurred by the textual complexity and diversity. In this paper, we propose a dual-attention GCN to model the structural information of various texts as well as tackle the graph-invariant problem through embedding two types of attention mechanisms, i.e. the connection-attention and hop-attention, into the classic GCN. To encode various connection patterns between neighbour words, connection-attention adaptively imposes different weights specified to neighbourhoods of each word, which captures the short-term dependencies. On the other hand, the hop-attention applies scaled coefficients to different scopes during the graph diffusion process to make the model learn more about the distribution of context, which captures long-term semantics in an adaptive way. Extensive experiments are conducted on five widely used datasets to evaluate our dual-attention GCN, and the achieved state-of-the-art performance verifies the effectiveness of dual-attention mechanisms.
Quantization and Knowledge distillation (KD) methods are widely used to reduce memory and power consumption of deep neural networks (DNNs), especially for resource-constrained edge devices. Although their combination is quite promising to meet these requirements, it may not work as desired. It is mainly because the regularization effect of KD further diminishes the already reduced representation power of a quantized model. To address this short-coming, we propose Quantization-aware Knowledge Distillation (QKD) wherein quantization and KD are care-fully coordinated in three phases. First, Self-studying (SS) phase fine-tunes a quantized low-precision student network without KD to obtain a good initialization. Second, Co-studying (CS) phase tries to train a teacher to make it more quantizaion-friendly and powerful than a fixed teacher. Finally, Tutoring (TU) phase transfers knowledge from the trained teacher to the student. We extensively evaluate our method on ImageNet and CIFAR-10/100 datasets and show an ablation study on networks with both standard and depthwise-separable convolutions. The proposed QKD outperformed existing state-of-the-art methods (e.g., 1.3% improvement on ResNet-18 with W4A4, 2.6% on MobileNetV2 with W4A4). Additionally, QKD could recover the full-precision accuracy at as low as W3A3 quantization on ResNet and W6A6 quantization on MobilenetV2.
Current approaches to deep learning are beginning to rely heavily on transfer learning as an effective method for reducing overfitting, improving model performance, and quickly learning new tasks. Similarly, such pre-trained models are often used to create embedding representations for various types of data, such as text and images, which can then be fed as input into separate, downstream models. However, in cases where such transfer learning models perform poorly (i.e., for data outside of the training distribution), one must resort to fine-tuning such models, or even retraining them completely. Currently, no form of data augmentation has been proposed that can be applied directly to embedding inputs to improve downstream model performance. In this work, we introduce four new types of data augmentation that are generally applicable to embedding inputs, thus making them useful in both Natural Language Processing (NLP) and Computer Vision (CV) applications. For models trained on downstream tasks with such embedding inputs, these augmentation methods are shown to improve the AUC score of the models from a score of 0.9582 to 0.9812 and significantly increase the model’s ability to identify classes of data that are not seen during training.
Deep metric learning (DML) is a popular approach for images retrieval, solving verification (same or not) problems and addressing open set classification. Arguably, the most common DML approach is with triplet loss, despite significant advances in the area of DML. Triplet loss suffers from several issues such as collapse of the embeddings, high sensitivity to sampling schemes and more importantly a lack of performance when compared to more modern methods. We attribute this adoption to a lack of fair comparisons between various methods and the difficulty in adopting them for novel problem statements. In this paper, we perform an unbiased comparison of the most popular DML baseline methods under same conditions and more importantly, not obfuscating any hyper parameter tuning or adjustment needed to favor a particular method. We find, that under equal conditions several older methods perform significantly better than previously believed. In fact, our unified implementation of 12 recently introduced DML algorithms achieve state-of-the art performance on CUB200, CAR196, and Stanford Online products datasets which establishes a new set of baselines for future DML research. The codebase and all tuned hyperparameters will be open-sourced for reproducibility and to serve as a source of benchmark.
Recent work has presented intriguing results examining the knowledge contained in language models (LM) by having the LM fill in the blanks of prompts such as ‘Obama is a _ by profession’. These prompts are usually manually created, and quite possibly sub-optimal; another prompt such as ‘Obama worked as a _’ may result in more accurately predicting the correct profession. Because of this, given an inappropriate prompt, we might fail to retrieve facts that the LM does know, and thus any given prompt only provides a lower bound estimate of the knowledge contained in an LM. In this paper, we attempt to more accurately estimate the knowledge contained in LMs by automatically discovering better prompts to use in this querying process. Specifically, we propose mining-based and paraphrasing-based methods to automatically generate high-quality and diverse prompts and ensemble methods to combine answers from different prompts. Extensive experiments on the LAMA benchmark for extracting relational knowledge from LMs demonstrate that our methods can improve accuracy from 31.1% to 38.1%, providing a tighter lower bound on what LMs know. We have released the code and the resulting LM Prompt And Query Archive (LPAQA) at https://…/LPAQA.
Deep learning has gained tremendous success and great popularity in the past few years. However, recent research found that it is suffering several inherent weaknesses, which can threaten the security and privacy of the stackholders. Deep learning’s wide use further magnifies the caused consequences. To this end, lots of research has been conducted with the purpose of exhaustively identifying intrinsic weaknesses and subsequently proposing feasible mitigation. Yet few is clear about how these weaknesses are incurred and how effective are these attack approaches in assaulting deep learning. In order to unveil the security weaknesses and aid in the development of a robust deep learning system, we are devoted to undertaking a comprehensive investigation on attacks towards deep learning, and extensively evaluating these attacks in multiple views. In particular, we focus on four types of attacks associated with security and privacy of deep learning: model extraction attack, model inversion attack, poisoning attack and adversarial attack. For each type of attack, we construct its essential workflow as well as adversary capabilities and attack goals. Many pivot metrics are devised for evaluating the attack approaches, by which we perform a quantitative and qualitative analysis. From the analysis, we have identified significant and indispensable factors in an attack vector, \eg, how to reduce queries to target models, what distance used for measuring perturbation. We spot light on 17 findings covering these approaches’ merits and demerits, success probability, deployment complexity and prospects. Moreover, we discuss other potential security weaknesses and possible mitigation which can inspire relevant researchers in this area.
We propose a novel approach to address one aspect of the non-stationarity problem in multi-agent reinforcement learning (RL), where the other agents may alter their policies due to environment changes during execution. This violates the Markov assumption that governs most single-agent RL methods and is one of the key challenges in multi-agent RL. To tackle this, we propose to train multiple policies for each agent and postpone the selection of the best policy at execution time. Specifically, we model the environment non-stationarity with a finite set of scenarios and train policies fitting each scenario. In addition to multiple policies, each agent also learns a policy predictor to determine which policy is the best with its local information. By doing so, each agent is able to adapt its policy when the environment changes and consequentially the other agents alter their policies during execution. We empirically evaluated our method on a variety of common benchmark problems proposed for multi-agent deep RL in the literature. Our experimental results show that the agents trained by our algorithm have better adaptiveness in changing environments and outperform the state-of-the-art methods in all the tested environments.
As a metric to measure the performance of an online method, dynamic regret with switching cost has drawn much attention for online decision making problems. Although the sublinear regret has been provided in many previous researches, we still have little knowledge about the relation between the dynamic regret and the switching cost. In the paper, we investigate the relation for two classic online settings: Online Algorithms (OA) and Online Convex Optimization (OCO). We provide a new theoretical analysis framework, which shows an interesting observation, that is, the relation between the switching cost and the dynamic regret is different for settings of OA and OCO. Specifically, the switching cost has significant impact on the dynamic regret in the setting of OA. But, it does not have an impact on the dynamic regret in the setting of OCO. Furthermore, we provide a lower bound of regret for the setting of OCO, which is same with the lower bound in the case of no switching cost. It shows that the switching cost does not change the difficulty of online decision making problems in the setting of OCO.
We present a method for solving the general mixed constrained convex quadratic programming problem using an active set method on the dual problem. The approach is similar to existing active set methods, but we present a new way of solving the linear systems arising in the algorithm. There are two main contributions; we present a new way of factorizing the linear systems, and show how iterative refinement can be used to achieve good accuracy and to solve both types of sub-problems that arise from semi-definite problems.
Decentralized optimization algorithms have attracted intensive interests recently, as it has a balanced communication pattern, especially when solving large-scale machine learning problems. Stochastic Path Integrated Differential Estimator Stochastic First-Order method (SPIDER-SFO) nearly achieves the algorithmic lower bound in certain regimes for nonconvex problems. However, whether we can find a decentralized algorithm which achieves a similar convergence rate to SPIDER-SFO is still unclear. To tackle this problem, we propose a decentralized variant of SPIDER-SFO, called decentralized SPIDER-SFO (D-SPIDER-SFO). We show that D-SPIDER-SFO achieves a similar gradient computation cost—that is, $\mathcal{O}(\epsilon^{-3})$ for finding an $\epsilon$-approximate first-order stationary point—to its centralized counterpart. To the best of our knowledge, D-SPIDER-SFO achieves the state-of-the-art performance for solving nonconvex optimization problems on decentralized networks in terms of the computational cost. Experiments on different network configurations demonstrate the efficiency of the proposed method.
There are massive amounts of textual data residing in databases, valuable for many machine learning (ML) tasks. Since ML techniques depend on numerical input representations, word embeddings are increasingly utilized to convert symbolic representations such as text into meaningful numbers. However, a naive one-to-one mapping of each word in a database to a word embedding vector is not sufficient and would lead to poor accuracies in ML tasks. Thus, we argue to additionally incorporate the information given by the database schema into the embedding, e.g. which words appear in the same column or are related to each other. In this paper, we propose RETRO (RElational reTROfitting), a novel approach to learn numerical representations of text values in databases, capturing the best of both worlds, the rich information encoded by word embeddings and the relational information encoded by database tables. We formulate relation retrofitting as a learning problem and present an efficient algorithm solving it. We investigate the impact of various hyperparameters on the learning problem and derive good settings for all of them. Our evaluation shows that the proposed embeddings are ready-to-use for many ML tasks such as classification and regression and even outperform state-of-the-art techniques in integration tasks such as null value imputation and link prediction.
Dropout has been proven to be an effective algorithm for training robust deep networks because of its ability to prevent overfitting by avoiding the co-adaptation of feature detectors. Current explanations of dropout include bagging, naive Bayes, regularization, and sex in evolution. According to the activation patterns of neurons in the human brain, when faced with different situations, the firing rates of neurons are random and continuous, not binary as current dropout does. Inspired by this phenomenon, we extend the traditional binary dropout to continuous dropout. On the one hand, continuous dropout is considerably closer to the activation characteristics of neurons in the human brain than traditional binary dropout. On the other hand, we demonstrate that continuous dropout has the property of avoiding the co-adaptation of feature detectors, which suggests that we can extract more independent feature detectors for model averaging in the test stage. We introduce the proposed continuous dropout to a feedforward neural network and comprehensively compare it with binary dropout, adaptive dropout, and DropConnect on MNIST, CIFAR-10, SVHN, NORB, and ILSVRC-12. Thorough experiments demonstrate that our method performs better in preventing the co-adaptation of feature detectors and improves test performance. The code is available at: https://…/dropout.
We consider a three-layer Sejnowski machine and show that features learnt via contrastive divergence have a dual representation as patterns in a dense associative memory of order P=4. The latter is known to be able to Hebbian-store an amount of patterns scaling as N^{P-1}, where N denotes the number of constituting binary neurons interacting P-wisely. We also prove that, by keeping the dense associative network far from the saturation regime (namely, allowing for a number of patterns scaling only linearly with N, while P>2) such a system is able to perform pattern recognition far below the standard signal-to-noise threshold. In particular, a network with P=4 is able to retrieve information whose intensity is O(1) even in the presence of a noise O(\sqrt{N}) in the large N limit. This striking skill stems from a redundancy representation of patterns — which is afforded given the (relatively) low-load information storage — and it contributes to explain the impressive abilities in pattern recognition exhibited by new-generation neural networks. The whole theory is developed rigorously, at the replica symmetric level of approximation, and corroborated by signal-to-noise analysis and Monte Carlo simulations.
This paper presents a new approach to the detection of discontinuities in the n-th derivative of observational data. This is achieved by performing two polynomial approximations at each interstitial point. The polynomials are coupled by constraining their coefficients to ensure continuity of the model up to the (n-1)-th derivative; while yielding an estimate for the discontinuity of the n-th derivative. The coefficients of the polynomials correspond directly to the derivatives of the approximations at the interstitial points through the prudent selection of a common coordinate system. The approximation residual and extrapolation errors are investigated as measures for detecting discontinuity. This is necessary since discrete observations of continuous systems are discontinuous at every point. It is proven, using matrix algebra, that positive extrema in the combined approximation-extrapolation error correspond exactly to extrema in the difference of the Taylor coefficients. This provides a relative measure for the severity of the discontinuity in the observational data. The matrix algebraic derivations are provided for all aspects of the methods presented here; this includes a solution for the covariance propagation through the computation. The performance of the method is verified with a Monte Carlo simulation using synthetic piecewise polynomial data with known discontinuities. It is also demonstrated that the discontinuities are suitable as knots for B-spline modelling of data. For completeness, the results of applying the method to sensor data acquired during the monitoring of heavy machinery are presented.
One of the most remarkable properties of word embeddings is the fact that they capture certain types of semantic and syntactic relationships. Recently, pre-trained language models such as BERT have achieved groundbreaking results across a wide range of Natural Language Processing tasks. However, it is unclear to what extent such models capture relational knowledge beyond what is already captured by standard word embeddings. To explore this question, we propose a methodology for distilling relational knowledge from a pre-trained language model. Starting from a few seed instances of a given relation, we first use a large text corpus to find sentences that are likely to express this relation. We then use a subset of these extracted sentences as templates. Finally, we fine-tune a language model to predict whether a given word pair is likely to be an instance of some relation, when given an instantiated template for that relation as input.
In this paper, we consider the problem of finding the constraints in bow-free acyclic directed mixed graphs (ADMGs). ADMGs are a generalisation of directed acyclic graphs (DAGs) that allow for certain latent variables. We first show that minimal generators for the ideal $\I(G)$ containing all the constraints of a Gaussian ADMG $G$ corresponds precisely to the pairs of non-adjacent vertices in $G$. The proof of this theorem naturally leads to an efficient algorithm that fits a bow-free Gaussian ADMG by maximum likelihood. In particular, we can test for the goodness of fit of a given data set to a bow-free ADMG.
Cyber-Physical Systems (CPS) are present in many settings addressing a myriad of purposes. Examples are Internet-of-Things (IoT) or sensing software embedded in appliances or even specialised meters that measure and respond to electricity demands in smart grids. Due to their pervasive nature, they are usually chosen as recipients for larger scope cyber-security attacks. Those promote system-wide disruptions and are directed towards one key aspect such as confidentiality, integrity, availability or a combination of those characteristics. Our paper focuses on a particular and distressing attack where coordinated malware infected IoT units are maliciously employed to synchronously turn on or off high-wattage appliances, affecting the grid’s primary control management. Our model could be extended to larger (smart) grids, Active Buildings as well as similar infrastructures. Our approach models Coordinated Load-Changing Attacks (CLCA) also referred as GridLock or BlackIoT, against a theoretical power grid, containing various types of power plants. It employs Continuous-Time Markov Chains where elements such as Power Plants and Botnets are modelled under normal or attack situations to evaluate the effect of CLCA in power reliant infrastructures. We showcase our modelling approach in the scenario of a power supplier (e.g. power plant) being targeted by a botnet. We demonstrate how our modelling approach can quantify the impact of a botnet attack and be abstracted for any CPS system involving power load management in a smart grid. Our results show that by prioritising the type of power-plants, the impact of the attack may change: in particular, we find the most impacting attack times and show how different strategies impact their success. We also find the best power generator to use depending on the current demand and strength of attack.
We consider a finite impulse response system with centered independent sub-Gaussian design covariates and noise components that are not necessarily identically distributed. We derive non-asymptotic near-optimal estimation and prediction bounds for the least-squares estimator of the parameters. Our results are based on two concentration inequalities on the norm of sums of dependent covariate vectors and on the singular values of their covariance operator that are of independent value on their own and where the dependence arises from the time shift structure of the time series. These results generalize the known bounds for the independent case.
The performance of acquisition functions for Bayesian optimisation is investigated in terms of the Pareto front between exploration and exploitation. We show that Expected Improvement and the Upper Confidence Bound always select solutions to be expensively evaluated on the Pareto front, but Probability of Improvement is never guaranteed to do so and Weighted Expected Improvement does only for a restricted range of weights. We introduce two novel $\epsilon$-greedy acquisition functions. Extensive empirical evaluation of these together with random search, purely exploratory and purely exploitative search on 10 benchmark problems in 1 to 10 dimensions shows that $\epsilon$-greedy algorithms are generally at least as effective as conventional acquisition functions, particularly with a limited budget. In higher dimensions $\epsilon$-greedy approaches are shown to have improved performance over conventional approaches. These results are borne out on a real world computational fluid dynamics optimisation problem and a robotics active learning problem.
With the improvements of Los Angeles in many aspects, people in mounting numbers tend to live or travel to the city. The primary objective of this paper is to apply a set of methods for the time series analysis of traffic accidents in Los Angeles in the past few years. The number of traffic accidents, collected from 2010 to 2019 monthly reveals that the traffic accident happens seasonally and increasing with fluctuation. This paper utilizes the ensemble methods to combine several different methods to model the data from various perspectives, which can lead to better forecasting accuracy. The IMA(1, 1), ETS(A, N, A), and two models with Fourier items are failed in independence assumption checking. However, the Online Gradient Descent (OGD) model generated by the ensemble method shows the perfect fit in the data modeling, which is the state-of-the-art model among our candidate models. Therefore, it can be easier to accurately forecast future traffic accidents based on previous data through our model, which can help designers to make better plans.

# What’s going on on PyPI

Scanning all new published packages on PyPI I know that the quality is often quite bad. I try to filter out the worst ones and list here the ones which might be worth a look, being followed or inspire you in some way.

git-hammer
Statistics tool for git repositories

Self Organizing Maps Package

n2dit
Deep learning-based Night-to-Day image-translation software

pysparkcli
PySpark Project Buiding Tool. Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

serial-visualisation
A package for visualisation of serial data in a grid.

topsis-jamesfallon
Implementation of TOPSIS decision making

wf-smc-kalman
Sequential Monte Carlo modeling for linear Gaussian systems

woe-monotonic-binning
Optimal binning algorithm and function to apply on a pandas DataFrame

arboretum-dev

args-topic-modeling
modeling the topic of arguments

deepl-pro
A wrapper for the DeepL Pro API.

dj-kaggle-pipeline
Pipelines and utility classes for Kaggle and data science!

# What’s going on on PyPI

Scanning all new published packages on PyPI I know that the quality is often quite bad. I try to filter out the worst ones and list here the ones which might be worth a look, being followed or inspire you in some way.

deepforest
Tree crown prediction using deep learning retinanets

distilkobert
Distillation of KoBERT

ohsome2label
Historical OpenStreetMap Objects to Machine Learning Training Samples

pyapprox
High-dimensional function approximation and estimation

sc2rl
A python-sc2 wrapper for Reinforcement Learning

valkyrie
Implementation of machine learning algorithm.

yao-framework

do-py
Standardized data structures for Python.

drlkit
A High Level Python Deep Reinforcement Learning library. Great for beginners, for prototyping and quickly comparing algorithms

exkaldi
ExKaldi Automatic Speech Recognition Toolkit

gdpr-prov-neo4j-api
This project exports a set of methods that enable graph-based database management.

gimbiseo
A system of Human-Machine Dialogue