# Distilled News

Polyaxon is a platform for managing the whole life cycle of machine learning (ML) and deep learning (DL) applications. Today, we are pleased to announce the 0.4 release, our most stable release to date. This release brings a lot of new features, integrations, improvements, and fixes. For over a year now, Polyaxon has been delivering software that enables many teams and organizations to be more productive, iterate faster on their research and ideas, and ship robust models to production.
Training a machine learning algorithm on raw data is not always the best choice. An algorithm trained on raw data has to do the feature mining itself in order to tell the different groups apart, and doing that automatically requires large amounts of data. For small datasets, it is preferable for the data scientist to do the feature mining step and simply tell the machine learning algorithm which feature set to use. The feature set has to be representative of the data samples, so the best features must be selected with care. Based on previous experience, the data scientist suggests types of features that seem helpful in representing the data samples. Some features may prove robust in representing the samples and others may not.
Deep feedforward networks, also known as multilayer perceptrons, are the foundation of most deep learning models. Networks such as CNNs are specialized feedforward networks, and RNNs extend them with feedback connections. These networks are mostly used for supervised machine learning tasks where we already know the target function, i.e. the result we want our network to achieve. They are extremely important for practicing machine learning and form the basis of many commercial applications; areas such as computer vision and NLP were greatly affected by these networks.
The building block of deep neural networks is called the sigmoid neuron. Sigmoid neurons are similar to perceptrons, but they are slightly modified so that the output of the sigmoid neuron is much smoother than the step-function output of the perceptron. In this post, we will talk about the motivation behind the sigmoid neuron and how the sigmoid neuron model works.
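To make the contrast between the two outputs concrete, here is a minimal sketch in Python. The weights, bias and inputs are hypothetical values chosen only for illustration; they do not come from the linked post.

```python
import math

def perceptron(inputs, weights, bias):
    """Step output: 1 if the weighted sum plus bias is non-negative, else 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0

def sigmoid_neuron(inputs, weights, bias):
    """Smooth output in (0, 1): the logistic function of the same weighted sum."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and bias, for illustration only.
weights, bias = [2.0, -1.0], -0.5

# Sweep the first input; the perceptron jumps from 0 to 1, the sigmoid rises gradually.
for x in ([0.0, 0.0], [0.2, 0.0], [0.25, 0.0], [0.3, 0.0], [1.0, 0.0]):
    print(x, perceptron(x, weights, bias), round(sigmoid_neuron(x, weights, bias), 3))
```

Both neurons compute the same weighted sum; only the final activation differs. A small change in the input near the threshold flips the perceptron output entirely, while the sigmoid output changes only slightly.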
There are dozens of different hypothesis tests, so choosing one can be a little overwhelming. The good news is that one of the more popular tests will usually do the trick, unless you have unusual data or are working within very specific guidelines (e.g. in medical research). The following picture shows several tests for a single population, and what kind of data (nominal, ordinal, interval/ratio) is best suited to those tests.
Big data operates differently from traditional relational database structures: indexes and keys are not usually present in big data systems, where distributed-systems concerns tend to have the upper hand. Nevertheless, there are specific ways to operate on big data, and understanding how best to work with these types of datasets can prove the key to unlocking insights.
Model explainability techniques show you what your model is learning, and seeing inside your model is even more useful than most people expect. I’ve interviewed many data scientists in the last 10 years, and model explainability techniques are my favorite topic to distinguish the very best data scientists from the average. Some people think machine learning models are black boxes, useful for making predictions but otherwise unintelligible; but the best data scientists know techniques to extract real-world insights from any model.
Why we should worry about gender inequality in Natural Language Processing techniques.
Augmented reality (AR) helps you do more with what you see by overlaying digital content and information on top of the physical world. For example, AR features coming to Google Maps will let you find your way with directions overlaid on top of your real world. With Playground – a creative mode in the Pixel camera — you can use AR to see the world differently. And with the latest release of YouTube Stories and ARCore’s new Augmented Faces API you can add objects like animated masks, glasses, 3D hats and more to your own selfies! One of the key challenges in making these AR features possible is proper anchoring of the virtual content to the real world; a process that requires a unique set of perceptive technologies able to track the highly dynamic surface geometry across every smile, frown or smirk.
In a previous post, I discussed k-means clustering as a way of summarising text data. I also talked about some of the limitations of k-means and in what situations it may not be the most appropriate solution. Probably the biggest limitation is that each cluster has the same diagonal covariance matrix. This produces spherical clusters that are quite inflexible in terms of the types of distributions they can model. In this post, I wanted to address some of those limitations and talk about one method in particular that can avoid these issues, Gaussian Mixture Modelling (GMM). The format of this post will be very similar to the last one where I explain the theory behind GMM and how it works. I then want to dive into coding the algorithm in Python and we can see how the results differ from k-means and why using GMM may be a good alternative.
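The difference between a hard k-means style assignment and the soft assignment of a mixture model can be sketched in a few lines. This is a one-dimensional illustration with hypothetical, hand-picked mixture parameters; in practice they would be fitted with the EM algorithm, and this is not the code from the linked post.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gmm_responsibilities(x, weights, means, variances):
    """Soft assignment: posterior probability of each component given x."""
    weighted = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(weighted)
    return [d / total for d in weighted]

def nearest_centroid(x, means):
    """Hard assignment in the style of k-means: index of the closest mean."""
    return min(range(len(means)), key=lambda k: abs(x - means[k]))

# Hypothetical mixture: a tight cluster at 0 and a wide cluster at 4.
weights, means, variances = [0.5, 0.5], [0.0, 4.0], [0.25, 9.0]
x = 1.5

print("nearest centroid:", nearest_centroid(x, means))
print("responsibilities:", gmm_responsibilities(x, weights, means, variances))
```

The point 1.5 is closer to the mean of the tight cluster, so the nearest-centroid rule picks component 0. Because each mixture component carries its own variance, most of the responsibility goes to the wide component instead, which is the extra flexibility the paragraph above refers to.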
The analysis of time series data is an integral part of any data scientist’s job, more so in the quantitative trading world. Financial data is the most perplexing of time series data and often seems erratic. However, over these few articles, I will build a framework of analyzing such time series first using well established theories, and then delving into more exotic, modern day approaches such as machine learning. So let’s begin!
Swarm intelligence based optimal feature selection for enhanced predictive sentiment accuracy on twitter
This article will simplify the Kalman Filter for you. Hopefully you’ll learn and demystify all these cryptic things that you find in Wikipedia when you google Kalman filters. So let’s get started!
A guide to the less desirable aspects of deep learning environment configurations. Thanks to cheaper and bigger storage, we have more data than we had a couple of years back. We do owe our thanks to Big Data, no matter how much hype it has created. However, the real MVP here is faster and better computing, which made papers from the 1980s and 90s more relevant (LSTMs were actually invented in 1997)! We are finally able to leverage the true power of neural networks and deep learning thanks to better and faster CPUs and GPUs. Whether we like it or not, traditional statistical and machine learning models have severe limitations on problems with high dimensionality, unstructured data, more complexity and large volumes of data.
Natural Language Processing (NLP) applications have become ubiquitous these days. I seem to stumble across websites and applications regularly that are leveraging NLP in one form or another. In short, this is a wonderful time to be involved in the NLP domain. This rapid increase in NLP adoption has happened largely thanks to the concept of transfer learning enabled through pretrained models. Transfer learning, in the context of NLP, is essentially the ability to train a model on one dataset and then adapt that model to perform different NLP functions on a different dataset. This breakthrough has made things incredibly easy and simple for everyone, especially folks who don’t have the time or resources to build NLP models from scratch. It’s perfect for beginners as well who want to learn or transition into NLP.

# If you did not already know

Ensemble Clustering Algorithm for Graphs (ECG)
We propose an ensemble clustering algorithm for graphs (ECG), which is based on the Louvain algorithm and the concept of consensus clustering. We validate our approach by replicating a recently published study comparing graph clustering algorithms over artificial networks, showing that ECG outperforms the leading algorithms from that study. We also illustrate how the ensemble obtained with ECG can be used to quantify the presence of community structure in the graph. …

Object-Driven Attentive Generative Adversarial Network (Obj-GAN)
In this paper, we propose Object-driven Attentive Generative Adversarial Networks (Obj-GANs) that allow object-centered text-to-image synthesis for complex scenes. Following the two-step (layout-image) generation process, a novel object-driven attentive image generator is proposed to synthesize salient objects by paying attention to the most relevant words in the text description and the pre-generated semantic layout. In addition, a new Fast R-CNN based object-wise discriminator is proposed to provide rich object-wise discrimination signals on whether the synthesized object matches the text description and the pre-generated layout. The proposed Obj-GAN significantly outperforms the previous state of the art in various metrics on the large-scale COCO benchmark, increasing the Inception score by 27% and decreasing the FID score by 11%. A thorough comparison between the traditional grid attention and the new object-driven attention is provided through analyzing their mechanisms and visualizing their attention layers, showing insights of how the proposed model generates complex scenes in high quality. …

Graph Processing Framework for Large Dynamic Graphs (BLADYG)
Recently, distributed processing of large dynamic graphs has become very popular, especially in certain domains such as social network analysis, Web graph analysis and spatial network analysis. In this context, many distributed/parallel graph processing systems have been proposed, such as Pregel, GraphLab, and Trinity. These systems can be divided into two categories: (1) vertex-centric and (2) block-centric approaches. In vertex-centric approaches, each vertex corresponds to a process, and messages are exchanged among vertices. In block-centric approaches, the unit of computation is a block, a connected subgraph of the graph, and message exchanges occur among blocks. In this paper, we consider the issues of scale and dynamism in the case of block-centric approaches. We present BLADYG, a block-centric framework that addresses the issue of dynamism in large-scale graphs. We present an implementation of BLADYG on top of the Akka framework. We experimentally evaluate the performance of the proposed framework. …

# Let’s get it right

Article: Changing contexts and intents

Every day, someone comes up with a new use for old data. Recently, IBM scraped a million photos from Flickr and turned them into a training data set for an AI project intending to reduce bias in facial recognition. That’s a noble goal, promoted to researchers as an opportunity to make more ethical AI.

Article: Round Up: Ethics and Skepticism

There are a whole lot of different ways to misunderstand or be duped by data. This is my round up of good links that illustrate some of the most common problems with relying on data. Additions are welcome, but I’m looking for news stories, rather than theoretical examples. Your job, as a reporter, is to put the data in context. Sometimes that means making an honest decision about whether or not maps are even the right way to tell the story, because how you tell a story matters.
Kate Klonick, an assistant professor at St John’s Law School, teaches an Information Privacy course for second- and third-year law students; she devised a wonderful and simple exercise to teach her students about ‘anonymous speech, reasonable expectation of privacy, third party doctrine, and privacy by obscurity’ over the spring break. Klonick’s students were assigned to sit in a public place and eavesdrop on nearby conversations, then, using only Google searches, ‘see if you can de-anonymize someone based on things they say loudly enough for lots of others to hear and/or things that are displayed on their clothing or bags.’

Article: Designing Ethical Algorithms

Ethical algorithm design is becoming a hot topic as machine learning becomes more widespread. But how do you make an algorithm ethical? Here are 5 suggestions to consider.
Current advances in research, development and application of artificial intelligence (AI) systems have yielded a far-reaching discourse on AI ethics. In consequence, a number of ethics guidelines have been released in recent years. These guidelines comprise normative principles and recommendations aimed to harness the ‘disruptive’ potentials of new AI technologies. Designed as a comprehensive evaluation, this paper analyzes and compares these guidelines highlighting overlaps but also omissions. As a result, I give a detailed overview of the field of AI ethics. Finally, I also examine to what extent the respective ethical principles and values are implemented in the practice of research, development and application of AI systems – and how the effectiveness in the demands of AI ethics can be improved.
The data revolution continues to transform every sector of science, industry and government. Due to the incredible impact of data-driven technology on society, we are becoming increasingly aware of the imperative to use data and algorithms responsibly — in accordance with laws and ethical norms. In this article we discuss three recent regulatory frameworks: the European Union’s General Data Protection Regulation (GDPR), the New York City Automated Decisions Systems (ADS) Law, and the Net Neutrality principle, that aim to protect the rights of individuals who are impacted by data collection and analysis. These frameworks are prominent examples of a global trend: Governments are starting to recognize the need to regulate data-driven algorithmic technology. Our goal in this paper is to bring these regulatory frameworks to the attention of the data management community, and to underscore the technical challenges they raise and which we, as a community, are well-equipped to address. The main take-away of this article is that legal and ethical norms cannot be incorporated into data-driven systems as an afterthought. Rather, we must think in terms of responsibility by design, viewing it as a systems requirement.
We provide a formal definition of blameworthiness in settings where multiple agents can collaborate to avoid a negative outcome. We first provide a method for ascribing blameworthiness to groups relative to an epistemic state (a distribution over causal models that describe how the outcome might arise). We then show how we can go from an ascription of blameworthiness for groups to an ascription of blameworthiness for individuals using a standard notion from cooperative game theory, the Shapley value. We believe that getting a good notion of blameworthiness in a group setting will be critical for designing autonomous agents that behave in a moral manner.
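The Shapley value mentioned above averages each player's marginal contribution over all orders in which the players could join a coalition. The sketch below computes it by brute force for a small, hypothetical three-agent game; the `could_prevent` function is an invented stand-in and is not the paper's blameworthiness measure.

```python
from itertools import permutations

def shapley_values(players, value):
    """Average marginal contribution of each player over all join orders."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

def could_prevent(coalition):
    """Hypothetical game: the outcome is avoided if 'a' acts together with 'b' or 'c'."""
    return 1.0 if "a" in coalition and ("b" in coalition or "c" in coalition) else 0.0

shares = shapley_values(["a", "b", "c"], could_prevent)
print(shares)
```

In this toy game agent `a` is indispensable and receives a share of 2/3, while `b` and `c` are interchangeable and receive 1/6 each. Enumerating all orderings grows factorially with the number of players, so this approach is only practical for small groups.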
Back in the day, machine experiences were a drag. Hit a button, pull a lever, and get the task done. Decades later, with subsequent computing innovation, machines have transformed into ultra-smart, self-learning, automated versions of themselves that are sweeping the human landscape. The underlying technology that’s reinventing machines to personalize human experiences is Machine Learning (ML), a branch of Artificial Intelligence and a strong buzzword in today’s digital-first world. In essence, it’s about programming machines with the ability to learn by themselves by leveraging Big Data. Information extracted from various touchpoints is analyzed and used to predict intentions for actionable intelligence. And the good news is that the latest technology is advancing consistently and revolutionizing every facet of our routines. Humans had their first brush with Machine Learning when voice-controlled personal assistants (Amazon’s Echo and Alexa) were launched. These devices are a new normal with the trend of smart homes picking up. Driverless cars, which were a quintessential sci-fi fantasy, aren’t something of the far-off future now. These new-age vehicles, aimed at cutting down human labor, are tested across the world for their utility benefits. Initially, the idea of intelligent machines was preposterous. Machines that act on behalf of humans weren’t a norm. However, with enablement and evolution of Machine Learning in our daily lives, the human landscape is radically changing, and how. Below, we have mentioned 10 ways in which Machine Learning is revolutionizing our lives. Let’s dive right in.

# R Packages worth a look

A Class for Working with Time Series Based on ‘data.table’ and ‘R6’ with Largely Optional Reference Semantics (DTSg)
Basic time series functionalities such as listing of missing values, application of arbitrary aggregation as well as rolling window functions and autom …

Finds the Price of Anarchy for Routing Games (PoA)
Computes the optimal flow, Nash flow and the Price of Anarchy for any routing game defined within the game theoretical framework. The input is a routin …

A Fully Featured Logging Framework (lgr)
A flexible, feature-rich yet light-weight logging framework based on ‘R6’ classes. It supports hierarchical loggers, custom log levels, arbitrary data …

Split, Combine and Compress PDF Files (qpdf)
Content-preserving transformations of PDF files such as split, combine, and compress. This package interfaces directly to the ‘qpdf’ C+ …

Encoding of Sequences Based on Frequency Matrix Chaos Game Representation (kaos)
Sequences encoding by using the chaos game representation. Löchel et al. (2019) <doi:10.1101/575324>.

Efficient solvers for 10 regularized multi-task learning algorithms applicable for regression, classification, joint feature selection, task clustering …

# Book Memo: “All Data Are Local”

Thinking Critically in a Data-Driven Society
How to analyze data settings rather than data sets, acknowledging the meaning-making power of the local.
In our data-driven society, it is too easy to assume the transparency of data. Instead, Yanni Loukissas argues in All Data Are Local, we should approach data sets with an awareness that data are created by humans and their dutiful machines, at a time, in a place, with the instruments at hand, for audiences that are conditioned to receive them. All data are local. The term data set implies something discrete, complete, and portable, but it is none of those things. Examining a series of data sources important for understanding the state of public life in the United States (Harvard’s Arnold Arboretum, the Digital Public Library of America, UCLA’s Television News Archive, and the real estate marketplace Zillow), Loukissas shows us how to analyze data settings rather than data sets.
Loukissas sets out six principles: all data are local; data have complex attachments to place; data are collected from heterogeneous sources; data and algorithms are inextricably entangled; interfaces recontextualize data; and data are indexes to local knowledge. He then provides a set of practical guidelines to follow. To make his argument, Loukissas employs a combination of qualitative research on data cultures and exploratory data visualizations. Rebutting the ‘myth of digital universalism,’ Loukissas reminds us of the meaning-making power of the local.

# Distilled News

1. Apache Spark
2. Apache Kafka
5. Cassandra
6. Apache Storm
7. RapidMiner
8. Graph Databases (Neo4J and GraphX)
9. Elastic Search
10. Tableau
This article focuses on the structure of filtering algorithms used in image processing. The novel algorithm is based on binary matrices of input data to create sub-clusters which are later used as filters in classification. These sub-clusters are called Probability Kernels. The developed algorithm was tested on different data sets and shows accuracy rates that vary with the diversity of the data set and with the matching value used for comparing matrices. The Diabetes, Breast Cancer and Ionosphere data sets are used to compare the performance of the developed algorithm with already known algorithms such as Naïve Bayes, AdaBoost, Random Forest and Multilayer Perceptron. Test results regarding accuracy rate confirm the advantages of the proposed algorithm.
What is MCMC exactly? To answer that question we first need a refresher on Bayesian statistics. Bayesian statistics are built on the idea that the probability of a thing happening is influenced by the prior assumption of the probability and the likelihood that something happened as indicated by the data. With Bayesian statistics, probability is represented by a distribution. If the prior and likelihood probability distributions are normally distributed, we are able to describe the posterior distribution with a function. This is called a closed-form solution. This type of Bayes is shown below. As you can see the posterior distribution is shaped by both the prior and likelihood distributions and ends up somewhere in the middle.
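The closed-form case described above can be written out directly. The following sketch assumes a normal prior on an unknown mean and normally distributed observations with a known noise variance; the function name and the numbers are illustrative assumptions, not taken from the linked article.

```python
def normal_posterior(prior_mean, prior_var, data, noise_var):
    """Posterior mean and variance of an unknown mean under a normal prior
    and a normal likelihood with known noise variance (conjugate update)."""
    n = len(data)
    sample_mean = sum(data) / n
    prior_precision = 1.0 / prior_var
    data_precision = n / noise_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * sample_mean)
    return post_mean, post_var

# Hypothetical numbers: prior centred at 0, four observations at 2.
post_mean, post_var = normal_posterior(
    prior_mean=0.0, prior_var=1.0, data=[2.0, 2.0, 2.0, 2.0], noise_var=4.0
)
print(post_mean, post_var)
```

Here the prior and the data carry equal precision, so the posterior mean lands halfway between the prior mean and the sample mean, and the posterior variance is smaller than the prior variance. When no such closed form exists, sampling methods such as MCMC are used to approximate the posterior instead.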
How do you deal with events in the cluster that cross over to the next day? Some users might be night owls or from another timezone; if their activity spans midnight, the math becomes harder. If a user sends messages at 23:57, 23:58 and 23:59, calculating the average is straightforward. What if the user is a minute late? Obviously we don’t want to add 00:00 to the average. Imagine you are trying to predict some outcome that is dependent on the time of day: a simple model (regression) will suffer from the discontinuity at midnight.
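One common way around the midnight discontinuity is to treat the clock as a circle and average angles instead of raw minutes. The sketch below is a minimal illustration of that idea under this assumption; it is not necessarily the approach taken in the linked article, and the function name is hypothetical.

```python
import math

MINUTES_PER_DAY = 24 * 60

def circular_mean_minutes(minutes):
    """Mean time of day in minutes after midnight, treating the clock as a circle."""
    angles = [2.0 * math.pi * m / MINUTES_PER_DAY for m in minutes]
    sin_sum = sum(math.sin(a) for a in angles)
    cos_sum = sum(math.cos(a) for a in angles)
    mean_angle = math.atan2(sin_sum, cos_sum)  # result lies in [-pi, pi]
    return (mean_angle * MINUTES_PER_DAY / (2.0 * math.pi)) % MINUTES_PER_DAY

# Messages at 23:58, 23:59, 00:00 and 00:01, as minutes after midnight.
times = [23 * 60 + 58, 23 * 60 + 59, 0, 1]

naive = sum(times) / len(times)          # lands around midday
circular = circular_mean_minutes(times)  # lands just before midnight
print(naive, circular)
```

The naive average falls near noon, far from any of the actual events, while the circular mean falls between 23:59 and 00:00. The same sine and cosine encoding can be fed to a regression model as two features so that 23:59 and 00:00 end up close together. If the events are spread evenly around the clock, both sums are close to zero and the circular mean is not meaningful.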
In this article, we will demonstrate how to generate a dataset to build a machine learning model. According to this, Medicare fraud and abuse cost taxpayers \$60 billion per year. AI/ML could significantly help identify and prevent fraud and abuse, but since privacy is of utmost importance in medical patient data, it is extremely difficult to access this data. This prevents data scientists from generating models which would potentially have a positive impact on this field. Is there a way to design and develop models without access to underlying data? Yes, you can generate a prototype using realistic randomly generated data. Concretely, we will build an auto-generated medical insurance dataset and use it to identify potentially fraudulent claims.
By using feature inversion to visualize millions of activations from an image classification network, we create an explorable activation atlas of features the network has learned which can reveal how the network typically represents some concepts.
We’ve created activation atlases (in collaboration with Google researchers), a new technique for visualizing what interactions between neurons can represent. As AI systems are deployed in increasingly sensitive contexts, having a better understanding of their internal decision-making processes will let us identify weaknesses and investigate failures.
Neural networks have become the de facto standard for image-related tasks in computing, currently being deployed in a multitude of scenarios, ranging from automatically tagging photos in your image library to autonomous driving systems. These machine-learned systems have become ubiquitous because they perform more accurately than any system humans were able to directly design without machine learning. But because essential details of these systems are learned during the automated training process, understanding how a network goes about its given task can sometimes remain a bit of a mystery. Today, in collaboration with colleagues at OpenAI, we’re publishing ‘Exploring Neural Networks with Activation Atlases’, which describes a new technique aimed at helping to answer the question of what image classification neural networks ‘see’ when provided an image. Activation atlases provide a new way to peer into convolutional vision networks, giving a global, hierarchical, and human-interpretable overview of concepts within the hidden layers of a network. We think of activation atlases as revealing a machine-learned alphabet for images – an array of simple, atomic concepts that are combined and recombined to form much more complex visual ideas. We are also releasing some Jupyter notebooks to help you get started in making your own activation atlases.
Review of Dropout: A Simple Way to Prevent Neural Networks from Overfitting: In their paper, Srivastava et al. claim that repeatedly eliminating randomly selected nodes and their corresponding connections of a neural network during training will reduce overfitting and result in a significantly improved neural network. Dropout was the first method to address the issue of overfitting in parametric models. Even today, Dropout remains efficient and works well. This review will go over the basics of understanding this paper such as ‘What is a neural network’ and ‘What is overfitting.’
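The core operation is easy to sketch. The example below uses the "inverted dropout" variant, which rescales the surviving activations during training; this is a swapped-in convention that is common in current libraries, whereas the paper describes scaling the weights at test time. The function name and parameters are illustrative assumptions.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob during
    training and rescale the survivors so the expected activation is unchanged."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)  # at evaluation time the layer is the identity
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [1.0] * 10

print(dropout(hidden, 0.5, rng))                  # a mix of 0.0 and 2.0
print(dropout(hidden, 0.5, rng, training=False))  # unchanged
```

Because a different random subset of units is removed on every training step, no unit can rely on the presence of any particular other unit, which is the intuition the paper gives for why the technique reduces overfitting.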
It’s easy to find examples of businesses getting value from data. It’s also easy to find examples of all of the technical tools you can use to leverage data. Those technical tools are often what people have in mind when they talk about data science, but I’ve found the tools matter less than how they are deployed. Those technologies are part of a larger system of business processes that, taken all together, produces business outcomes. Integration of data science into the business can mean you simply output reports and then someone decides how to act on the findings, or it can mean a whole lot more. This post is about that whole lot more, which is often called ‘productionization.’ For something to be ‘in production’ means it is part of the pipeline from the business to its customers. In manufacturing, something is in production if it exists somewhere in the process that will result in actual goods being put in stores where consumers can buy them and take them home. In data science, if something is in production, it’s on the path to putting information in a place where it is consumed.
A few months ago I mentioned some of the exciting new features that will be included in TensorFlow 2.0. And guess what? Today (at the point of writing) the TensorFlow 2.0 Alpha preview package was officially released and the documentation was updated on the official website! I’m so pumped and excited about this and just can’t wait to share this with you!! To get to know some of the previous use cases of TensorFlow and some of the changes in TensorFlow 2.0, check out the short video below: https://…/href
How much data do you need to train a seq2seq model? Let’s say that you want to translate sentences from one language to another. You probably need a bigger dataset to translate longer sentences than if you wanted to translate shorter ones. How does the need for data grow as the sentence length increases?
Educated decision making involves two major ingredients: probabilistic forecasts for future events or quantities and an assessment of predictive performance. This thesis focuses on the latter topic and illustrates its importance and implications from both theoretical and applied perspectives. Receiver operating characteristic (ROC) curves are key tools for the assessment of predictions for binary events. Despite their popularity and ubiquitous use, the mathematical understanding of ROC curves is still incomplete. We establish the equivalence between ROC curves and cumulative distribution functions (CDFs) on the unit interval and elucidate the crucial role of concavity in interpreting and modeling ROC curves. Under this essential requirement, the classical binormal ROC model is strongly inhibited in its flexibility and we propose the novel beta ROC model as an alternative. For a class of models that includes the binormal and the beta model, we derive the large sample distribution of the minimum distance estimator. This allows for uncertainty quantification and statistical tests of goodness-of-fit or equal predictive ability. Turning …

# What’s new on arXiv

Of primary importance in formulating a response to the increasing prevalence and power of artificial intelligence (AI) applications in society are questions of ontology. Questions such as: What ‘are’ these systems? How are they to be regarded? How does an algorithm come to be regarded as an agent? We discuss three factors which hinder discussion and obscure attempts to form a clear ontology of AI: (1) the various and evolving definitions of AI, (2) the tendency for pre-existing technologies to be assimilated and regarded as ‘normal,’ and (3) the tendency of human beings to anthropomorphize. This list is not intended as exhaustive, nor is it seen to preclude entirely a clear ontology; however, these challenges are a necessary set of topics for consideration. Each of these factors is seen to present a ‘moving target’ for discussion, which poses a challenge for both technical specialists and non-practitioners of AI systems development (e.g., philosophers and theologians) to speak meaningfully given that the corpus of AI structures and capabilities evolves at a rapid pace. Finally, we present avenues for moving forward, including opportunities for collaborative synthesis for scholars in philosophy and science.
This study provides a systematic review of the recent advances in designing the intelligent tutoring robot (ITR), and summarises the status quo of applying artificial intelligence (AI) techniques. We first analyse the environment of the ITR and propose a relationship model for describing interactions of ITR with the students, the social milieu and the curriculum. Then, we transform the relationship model into the perception-planning-action model for exploring what AI techniques are suitable to be applied in the ITR. This article provides insights on promoting human-robot teaching-learning process and AI-assisted educational techniques, illustrating the design guidelines and future research perspectives in intelligent tutoring robots.
The interaction between syntax (formal language) and its semantics (meanings of language) is well studied in categorical logic. Results of this study are employed to understand how the brain could create meanings. To emphasize the toy character of the proposed model, we prefer to speak of a homunculus’ brain rather than just of the brain. The homunculus’ brain consists of neurons, each of which is modeled by a category, and axons between neurons, which are modeled by functors between the corresponding neuron-categories. Each neuron (category) has its own program enabling its working, i.e. a ‘theory’ of this neuron. In analogy with what is known from categorical logic, we postulate the existence of a pair of adjoint functors, called Lang and Syn, from a category, now called BRAIN, of categories, to a category, now called MIND, of theories. Our homunculus is a kind of ‘mathematical robot’, the neuronal architecture of which is not important. Its only aim is to provide us with the opportunity to study how such a simple brain-like structure could ‘create meanings’ out of its purely syntactic program. The pair of adjoint functors Lang and Syn models mutual dependencies between the syntactical structure of a given theory of MIND and the internal logic of its semantics given by a category of BRAIN. In this way, a formal language (syntax) and its meanings (semantics) are interwoven with each other in a manner corresponding to the adjointness of the functors Lang and Syn. Categories BRAIN and MIND interact with each other with their entire structures and, at the same time, these very structures are shaped by this interaction.
Deep Learning (DL) algorithms are the central focus of modern machine learning systems. As data volumes keep growing, it has become customary to train large neural networks with hundreds of millions of parameters with enough capacity to memorize these volumes and obtain state-of-the-art accuracy. To get around the costly computations associated with large models and data, the community is increasingly investing in specialized hardware for model training. However, with the end of Moore’s law, there is a limit to such scaling. The progress on the algorithmic front has failed to demonstrate a direct advantage over powerful hardware such as NVIDIA-V100 GPUs. This paper provides an exception. We propose SLIDE (Sub-LInear Deep learning Engine) that uniquely blends smart randomized algorithms, which drastically reduce the computation during both training and inference, with simple multi-core parallelism on a modest CPU. SLIDE is an auspicious illustration of the power of smart randomized algorithms over CPUs in outperforming the best available GPU with an optimized implementation. Our evaluations on large industry-scale datasets, with some large fully connected architectures, show that training with SLIDE on a 44-core CPU is more than 2.7 times (2 hours vs. 5.5 hours) faster than the same network trained using TensorFlow on Tesla V100 at any given accuracy level. We provide code and benchmark scripts for reproducibility.
In this paper we explore tying together the ideas from Scattering Transforms and Convolutional Neural Networks (CNN) for Image Analysis by proposing a learnable ScatterNet. Previous attempts at tying them together in hybrid networks have tended to keep the two parts separate, with the ScatterNet forming a fixed front end and a CNN forming a learned backend. We instead look at adding learning between scattering orders, as well as adding learned layers before the ScatterNet. We do this by breaking down the scattering orders into single convolutional-like layers we call ‘locally invariant’ layers, and adding a learned mixing term to this layer. Our experiments show that these locally invariant layers can improve accuracy when added to either a CNN or a ScatterNet. We also discover some surprising results in that the ScatterNet may be best positioned after one or more layers of learning rather than at the front of a neural network.
There has been strong recent interest in testing interval null hypotheses for improved scientific inference. For example, Lakens et al. (2018) and Lakens and Harms (2017) use this approach to study if there is a pre-specified meaningful treatment effect in gerontology and clinical trials, which is different from the more traditional point null hypothesis that tests for any treatment effect. Two popular Bayesian approaches are available for interval null hypothesis testing. One is the standard Bayes factor and the other is the Region of Practical Equivalence (ROPE) procedure championed by Kruschke and others over many years. This paper establishes a formal connection between these two approaches with two benefits. First, it helps to better understand and improve the ROPE procedure. Second, it leads to a simple and effective algorithm for computing Bayes factors in a wide range of problems using draws from posterior distributions generated by standard Bayesian programs such as BUGS, JAGS and Stan. The tedious and error-prone task of coding custom-made software specific to Bayes factors is then avoided.
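The abstract does not spell out the paper's algorithm, so the following is only a minimal sketch of the general idea. It relies on the standard identity that a Bayes factor equals posterior odds divided by prior odds, and estimates both odds from draws. The function name, the normal prior and posterior, and the interval bounds are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def interval_bayes_factor(posterior_draws, prior_draws, low, high):
    """Estimate the Bayes factor for H0: low <= theta <= high against
    H1: theta outside the interval, using BF = posterior odds / prior odds.

    Assumes neither interval probability is estimated as exactly 0 or 1.
    """
    post_in = np.mean((posterior_draws >= low) & (posterior_draws <= high))
    prior_in = np.mean((prior_draws >= low) & (prior_draws <= high))
    posterior_odds = post_in / (1.0 - post_in)
    prior_odds = prior_in / (1.0 - prior_in)
    return posterior_odds / prior_odds

# Hypothetical stand-ins for draws exported from BUGS, JAGS or Stan.
rng = np.random.default_rng(0)
prior_draws = rng.normal(0.0, 1.0, size=200_000)       # assumed N(0, 1) prior
posterior_draws = rng.normal(0.05, 0.1, size=200_000)  # assumed posterior

bf = interval_bayes_factor(posterior_draws, prior_draws, low=-0.1, high=0.1)
print(f"Bayes factor in favour of the interval null: {bf:.1f}")
```

The posterior probability of the interval is the quantity a ROPE-style analysis inspects, so the same draws serve both procedures. In practice the prior probability of the interval can often be computed analytically rather than sampled.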
Network embedding, as a promising approach to network representation learning, is capable of supporting various subsequent network mining and analysis tasks, and has attracted growing research interest recently. Traditional approaches assign each node an independent continuous vector, which causes huge memory overhead for large networks. In this paper we propose a novel multi-hot compact embedding strategy to effectively reduce memory cost by learning partially shared embeddings. The insight is that a node embedding vector is composed of several basis vectors, which can significantly reduce the number of continuous vectors while maintaining similar data representation ability. Specifically, we propose a MCNE model to learn compact embeddings from pre-learned node features. A novel component named compressor is integrated into MCNE to tackle the challenge that popular back-propagation optimization cannot propagate through discrete samples. We further propose an end-to-end model MCNE$_{t}$ to learn compact embeddings from the input network directly. Empirically, we evaluate the proposed models over three real network datasets, and the results demonstrate that our proposals can save about 90% of the memory cost of network embeddings without significant performance decline.
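To make the memory argument concrete, here is a rough sketch of composing node embeddings from a small shared set of basis vectors. It is not the MCNE model: the codes are random rather than learned, the compressor is omitted, and every size below is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
num_nodes, dim = 10_000, 64
num_bases, hot = 256, 4  # each node selects `hot` of the `num_bases` basis vectors

# Shared continuous basis vectors, the only float storage needed.
basis = rng.normal(size=(num_bases, dim)).astype(np.float32)

# Discrete multi-hot codes: indices of the basis vectors each node uses.
# Random here; in a real model these would be learned.
codes = rng.integers(0, num_bases, size=(num_nodes, hot)).astype(np.uint8)

def embed(node_ids):
    """Compose embeddings by summing the basis vectors selected by each node."""
    return basis[codes[node_ids]].sum(axis=1)

# Memory of one independent float32 vector per node versus the compact form.
full_bytes = num_nodes * dim * 4
compact_bytes = basis.nbytes + codes.nbytes
saving = 1.0 - compact_bytes / full_bytes

print(embed(np.array([0, 1, 2])).shape)
print(f"memory saving with these sizes: {saving:.1%}")
```

The saving depends entirely on the chosen sizes, so the figure printed here should not be read as a reproduction of the paper's reported result.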
Heterogeneous knowledge naturally arises among different agents in cooperative multiagent reinforcement learning. As such, learning can be greatly improved if agents can effectively pass their knowledge on to other agents. Existing work has demonstrated that peer-to-peer knowledge transfer, a process referred to as action advising, improves team-wide learning. In contrast to previous frameworks that advise at the level of primitive actions, we aim to learn high-level teaching policies that decide when and what high-level action (e.g., sub-goal) to advise a teammate. We introduce a new learning to teach framework, called hierarchical multiagent teaching (HMAT). The proposed framework solves difficulties faced by prior work on multiagent teaching when operating in domains with long horizons, delayed rewards, and continuous states/actions by leveraging temporal abstraction and deep function approximation. Our empirical evaluations show that HMAT accelerates team-wide learning progress in difficult environments that are more complex than those explored in previous work. HMAT also learns teaching policies that can be transferred to different teammates/tasks and can even teach teammates with heterogeneous action spaces.
Optimizing the physical data storage and retrieval of data are two key database management problems. In this paper, we propose a language that can express a wide range of physical database layouts, going well beyond the row- and column- based methods that are widely used in database management systems. We also build a compiler for this language, which is specialized for a dataset and a query workload. We conduct experiments using a popular database benchmark, which shows that the performance of these specialized queries is competitive with a state-of-the-art in memory compiled database system.
Adversarial methods for imitation learning have been shown to perform well on various control tasks. However, they require a large number of environment interactions for convergence. In this paper, we propose an end-to-end differentiable adversarial imitation learning algorithm in a Dyna-like framework for switching between model-based planning and model-free learning from expert data. Our results on both discrete and continuous environments show that our approach of using model-based planning along with model-free learning converges to an optimal policy with fewer environment interactions than state-of-the-art learning methods.
The objective of deep metric learning (DML) is to learn embeddings that can capture semantic similarity information among data points. Existing pairwise or tripletwise loss functions used in DML are known to suffer from slow convergence due to a large proportion of trivial pairs or triplets as the model improves. To address this, ranking-motivated structured losses have recently been proposed to incorporate multiple examples and exploit the structured information among them. They converge faster and achieve state-of-the-art performance. In this work, we present two limitations of existing ranking-motivated structured losses and propose a novel ranked list loss to solve both of them. First, given a query, only a fraction of data points is incorporated to build the similarity structure. Consequently, some useful examples are ignored and the structure is less informative. To address this, we propose to build a set-based similarity structure by exploiting all instances in the gallery. The samples are split into a positive and a negative set. Our objective is to make the query closer to the positive set than to the negative set by a margin. Second, previous methods aim to pull positive pairs as close as possible in the embedding space. As a result, the intraclass data distribution might be dropped. In contrast, we propose to learn a hypersphere for each class in order to preserve the similarity structure inside it. Our extensive experiments show that the proposed method achieves state-of-the-art performance on three widely used benchmarks.
Attribute acquisition for classes is a key step in ontology construction, which is often carried out manually by community members. This paper investigates an attention-based automatic paradigm called TransATT for attribute acquisition, by learning the representation of hierarchical classes and attributes in Chinese ontology. The attributes of an entity can be acquired by merely inspecting its classes, because the entity can be regarded as an instance of its classes and inherits their attributes. To describe the class of an entity explicitly and unambiguously, we propose the class-path to represent the hierarchical classes in ontology, instead of the terminal class word of the hypernym-hyponym relation (i.e., is-a relation) based hierarchy. The high performance of TransATT on attribute acquisition indicates the promising ability of the learned representation of class-paths and attributes. Moreover, we construct a dataset named BigCilin11k. To the best of our knowledge, this is the first Chinese dataset with abundant hierarchical classes and entities with attributes.
Since the introduction and the public availability of the UCR time series benchmark data sets, numerous Time Series Classification (TSC) methods have been designed, evaluated and compared to each other. We suggest a critical view of TSC performance evaluation protocols put in place in recent TSC literature. The main goal of this ‘position’ paper is to stimulate discussion and reflection about performance evaluation in TSC literature.
The Era of Big Data has forced researchers to explore new distributed solutions for building fuzzy classifiers, which often introduce approximation errors or make strong assumptions to reduce computational and memory requirements. As a result, Big Data classifiers might be expected to be inferior to those designed for standard classification tasks (Small Data) in terms of accuracy and model complexity. To our knowledge, however, there is no empirical evidence to confirm such a conjecture yet. Here, we investigate the extent to which state-of-the-art fuzzy classifiers for Big Data sacrifice performance in favor of scalability. To this end, we carry out an empirical study that compares these classifiers with some of the best performing algorithms for Small Data. Assuming the latter were generally designed for maximizing performance without considering scalability issues, the results of this study provide some intuition around the tradeoff between performance and scalability achieved by current Big Data solutions. Our findings show that, although slightly inferior, Big Data classifiers are gradually catching up with state-of-the-art classifiers for Small data, suggesting that a unified learning algorithm for Big and Small Data might be possible.
Multiple kernel learning (MKL) algorithms combine different base kernels to obtain a more efficient representation in the feature space. Focusing on discriminative tasks, MKL has been used successfully for feature selection and finding the significant modalities of the data. In such applications, each base kernel represents one dimension of the data or is derived from one specific descriptor. Therefore, MKL finds an optimal weighting scheme for the given kernels to increase the classification accuracy. Nevertheless, the majority of the works in this area focus on only binary classification problems or aim for linear separation of the classes in the kernel space, which are not realistic assumptions for many real-world problems. In this paper, we propose a novel multi-class MKL framework which improves the state-of-the-art by enhancing the local separation of the classes in the feature space. Besides, by using a sparsity term, our large-margin multiple kernel algorithm (LMMK) performs discriminative feature selection by aiming to employ a small subset of the base kernels. Based on our empirical evaluations on different real-world datasets, LMMK provides a competitive classification accuracy compared with the state-of-the-art algorithms in MKL. Additionally, it learns a sparse set of non-zero kernel weights which leads to a more interpretable feature selection and representation learning.
It has long been recognized as a difficult problem to determine whether the observed statistical correlation between two classical variables arises from causality or from common causes. Recent research has shown that in a quantum theoretical framework, the mechanisms of entanglement and quantum coherence provide an advantage in tackling this problem. In some particular cases, quantum common causes and quantum causality can be effectively distinguished using observations only. However, these solutions do not apply to all cases. There still exist many cases in which quantum common causes and quantum causality cannot be distinguished. In this paper, along the line of considering unitary transformation as causality in the quantum world, we formally show that quantum common causes and quantum causality are universally separable. Based on the analysis, we further provide a general method to discriminate the two.
Understanding the power of depth in feed-forward neural networks is an ongoing challenge in the field of deep learning theory. While current works account for the importance of depth for the expressive power of neural-networks, it remains an open question whether these benefits are exploited during a gradient-based optimization process. In this work we explore the relation between expressivity properties of deep networks and the ability to train them efficiently using gradient-based algorithms. We give a depth separation argument for distributions with fractal structure, showing that they can be expressed efficiently by deep networks, but not with shallow ones. These distributions have a natural coarse-to-fine structure, and we show that the balance between the coarse and fine details has a crucial effect on whether the optimization process is likely to succeed. We prove that when the distribution is concentrated on the fine details, gradient-based algorithms are likely to fail. Using this result we prove that, at least in some distributions, the success of learning deep networks depends on whether the distribution can be well approximated by shallower networks, and we conjecture that this property holds in general.
We propose a Three-Player Generative Adversarial Network to improve classification networks. In addition to the game played between the discriminator and generator, a competition is introduced between the generator and the classifier. The generator’s objective is to synthesize samples that are both realistic and hard to label for the classifier. Even though we make no assumptions on the type of augmentations to learn, we find that the model is able to synthesize realistic-looking examples that are hard for the classification model. Furthermore, the classifier becomes more robust when trained on these difficult samples. The method is evaluated on a public dataset for traffic sign recognition.

# Book Memo: “Machine Learning and AI for Healthcare”

Big Data for Improved Health Outcomes. Explore the theory and practical applications of artificial intelligence (AI) and machine learning in healthcare. This book offers a guided tour of machine learning algorithms, architecture design, and applications of learning in healthcare and big data challenges. You’ll discover the ethical implications of healthcare data analytics and the future of AI in population and patient health optimization. You’ll also create a machine learning model, evaluate performance and operationalize its outcomes within your organization. Machine Learning and AI for Healthcare provides techniques on how to apply machine learning within your organization and evaluate the efficacy, suitability, and efficiency of AI applications. These are illustrated through leading case studies, including how chronic disease is being redefined through patient-led data learning and the Internet of Things.

# If you did not already know

SemiPsm
Recent years have witnessed a surge of manipulation of public opinion and political events by malicious social media actors. These users are referred to as ‘Pathogenic Social Media (PSM)’ accounts. PSMs are key users in spreading misinformation in social media to viral proportions. These accounts can be either controlled by real users or automated bots. Identification of PSMs is thus of utmost importance for social media authorities. The burden usually falls to automatic approaches that can identify these accounts and protect social media reputation. However, lack of sufficient labeled examples for devising and training sophisticated approaches to combat these accounts is still one of the foremost challenges facing social media firms. In contrast, unlabeled data is abundant and cheap to obtain thanks to massive user-generated data. In this paper, we propose a semi-supervised causal inference PSM detection framework, SemiPsm, to compensate for the lack of labeled data. In particular, the proposed method leverages unlabeled data in the form of manifold regularization and only relies on cascade information. This is in contrast to the existing approaches that use exhaustive feature engineering (e.g., profile information, network structure, etc.). Evidence from empirical experiments on a real-world ISIS-related dataset from Twitter suggests promising results of utilizing unlabeled instances for detecting PSMs. …

User-Sensitive Recommendation Ensemble with Clustered Multi-Task Learning (UREC)
This paper considers recommendation algorithm ensembles in a user-sensitive manner. Recently researchers have proposed various effective recommendation algorithms, which utilize different aspects of the data and different techniques. However, the ‘user skewed prediction’ problem may exist for almost all recommendation algorithms: algorithms with the best average predictive accuracy may mask poor performance for some subset of users, which will lead to biased services in real scenarios. In this paper, we propose a user-sensitive ensemble method named ‘UREC’ to address this issue. We first cluster users based on the recommendation predictions, then we use multi-task learning to learn the user-sensitive ensemble function for the users. In addition, to alleviate the negative effects of the new-user problem on user clustering, we propose an approximate approach based on a spectral relaxation. Experiments on real-world datasets demonstrate the superiority of our methods. …

Knowledge Tracing Machine
Knowledge tracing is a sequence prediction problem where the goal is to predict the outcomes of students over questions as they are interacting with a learning platform. By tracking the evolution of the knowledge of some student, one can optimize instruction. Existing methods are either based on temporal latent variable models, or factor analysis with temporal features. We here show that factorization machines (FMs), a model for regression or classification, encompass several existing models in the educational literature as special cases, notably the additive factor model, the performance factor model, and multidimensional item response theory. We show, using several real datasets of tens of thousands of users and items, that FMs can estimate student knowledge accurately and fast even when student data is sparsely observed, and handle side information such as multiple knowledge components and number of attempts at item or skill level. Our approach allows fitting student models of higher dimension than existing models, and provides a testbed to try new combinations of features in order to improve existing models. …
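As a small illustration of what a factorization machine computes, the sketch below scores one student-item interaction with the usual second-order FM formula. The one-hot encoding of students and items, the toy sizes and the random weights are assumptions made for this example, and the logistic link is one common choice for classification. None of it is the authors' trained model.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine score for one feature vector.

    x: (d,) features, w0: bias, w: (d,) linear weights, V: (d, k) factors.
    """
    linear = w0 + x @ w
    # Pairwise term sum_{i<j} <V_i, V_j> x_i x_j, computed in O(d * k).
    pairwise = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    return linear + pairwise

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy setup: 3 students and 4 items, one-hot encoded side by side.
n_students, n_items, k = 3, 4, 2
rng = np.random.default_rng(0)
w0 = 0.1
w = rng.normal(size=n_students + n_items)
V = rng.normal(size=(n_students + n_items, k))

student, item = 1, 2
x = np.zeros(n_students + n_items)
x[student] = 1.0
x[n_students + item] = 1.0

score = fm_predict(x, w0, w, V)
p_correct = sigmoid(score)  # read as the probability of a correct answer
print(f"predicted probability of a correct answer: {p_correct:.3f}")
```

With only a student and an item active, the score reduces to a bias, one weight per student and item, and the inner product of their factor vectors. That reduction hints at how item-response-style models can appear as special cases. Side information such as skills or attempt counts would enter as extra columns of `x`.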