Magister Dixit

“Within the next five years, big data will become the norm, enabling a new horizon of personalization for both products and services. Wise leaders will soon embrace the game-changing opportunities that big data afford to their societies and organizations, and will provide the necessary sponsorship to realize this potential. Skeptics and laggards, meanwhile, look set to pay a heavy price.” Strategy& (2014)


Distilled News

Adding Custom Fonts to ggplot in R

ggplot – You can spot one from a mile away, which is great! And when you do, it’s a silent fist bump. But sometimes you want more than the standard theme. Fonts can breathe new life into your plots, helping to match the theme of your presentation, poster or report. This is always an afterthought for me, and I need to work out how to do it again each time, hence the post.

Basic Binary Sentiment Analysis using NLTK

In today’s context, it turns out a LOT. Social media has opened the floodgates of customer opinions, which now flow freely in mammoth proportions for businesses to analyze. Today, using machine learning, companies are able to extract these opinions in the form of text or audio and then analyze the emotions behind them on an unprecedented scale. Sentiment analysis, opinion mining, call it what you like: if you have a product or service to sell, you need to be on it. ‘When captured electronically, customer sentiment – expressions beyond facts, that convey mood, opinion, and emotion – carries immense business value. We’re talking the voice of the customer, and of the prospect, patient, voter, and opinion leader.’ From user reviews in media to the analysis of stock prices, sentiment analysis has become a ubiquitous tool in almost all industries. For example, the graph below shows the stock price movement of eBay with a sentiment index created based on an analysis of tweets that mention eBay.

How I implemented googleSignIn in R (shiny) and lived

Knowing the user’s identity when building shiny apps can sometimes come in really handy. While you can implement your own user login, for instance using cookies, you can also use one of the services that authenticate a user for you, such as Google. This way, you don’t have to handle cookies or passwords, just a small bit of bureaucracy in your database.

IT Support Ticket Classification and Deployment using Machine Learning and AWS Lambda

As part of our final project for Cognitive Computing, we decided to address a real-life business challenge, for which we chose IT Service Management. Of all the business cases, we were interested in four use cases that might be befitting for our project.

[eBook] Standardizing the Machine Learning Lifecycle

Successfully building and deploying a machine learning model can be difficult to do once. Enabling other data scientists (or yourself) to reproduce your pipeline, compare the results of different versions, track what’s running where, and redeploy and rollback updated models, is much harder. In this eBook, we’ll explore what makes the machine learning lifecycle so challenging compared to the traditional software-development lifecycle, and share the Databricks approach to addressing these challenges.

R and Python: Using reticulate to get the best of both worlds

It’s March 15th and that means it’s World Sleep Day (WSD). Don’t snooze off just yet! We’re about to check out a package that can make using R and Python a dream. It’s called reticulate and we’ll use it to train a Support Vector Machine for a simple classification task.

Full Stack Visualizations For Complex Solutions – For Data Scientists

This post is mainly for data scientists who want to develop an interface around their solution quickly. While it is true that you can build some interactive dashboards in Jupyter Notebooks or other places, I personally have encountered their limitations in a couple of my projects. Plus, sometimes it’s just much easier to let people play around with the solution rather than you explaining to them.

Top R Packages for Data Cleaning

Data cleaning is one of the most important and time-consuming tasks for data scientists. Here are the top R packages for data cleaning.

15 Great Articles about Bayesian Methods and Networks

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, decision trees, ensembles, correlation, Python, R, Tensorflow, SVM, data reduction, feature selection, experimental design, cross-validation, model fitting, and many more.

New word for Data Science – Signuology

I propose a new word for data science for sparking new thinking. Signuology is defined as the study of sets of characteristic predictive signals contained within data in the form of combined features of the data that are characteristic of an observation of interest within the data. The terms data mining and data structure imply rigid and discrete characteristics. A signal has more flexibility, borrowing from ideas contained in the superposition principle in physics. One can take the same data and ask a different question, with a different dependent variable, and find a different signal; the data structure will be the same. Data structure as a high-level concept appears to limit one’s thinking.

How do I know if my AI idea is possible?

One of the questions that I get asked often as an AI consultant is, in some ways, the most simple: Is this possible? People will come to me with some very vague notion of something they want automated or some sort of AI product they want to create. They usually don’t come from a technology background, but they are smart, intelligent, informed people. They have read about AI technology being applied in other, similar domains and they see a similar opportunity in their own domain.

Checklist for debugging neural networks

Tangible steps you can take to identify and fix issues with training, generalization, and optimization for machine learning models

An All-Neural On-Device Speech Recognizer

In 2012, speech recognition research showed significant accuracy improvements with deep learning, leading to early adoption in products such as Google’s Voice Search. It was the beginning of a revolution in the field: each year, new architectures were developed that further increased quality, from deep neural networks (DNNs) to recurrent neural networks (RNNs), long short-term memory networks (LSTMs), convolutional networks (CNNs), and more. During this time, latency remained a prime focus – an automated assistant feels a lot more helpful when it responds quickly to requests.


Magma is an open-source software platform that gives network operators an open, flexible and extendable mobile core network solution. Magma enables better connectivity by:
• Allowing operators to offer cellular service without vendor lock-in with a modern, open source core network
• Enabling operators to manage their networks more efficiently with more automation, less downtime, better predictability, and more agility to add new services and applications
• Enabling federation between existing MNOs and new infrastructure providers for expanding rural infrastructure
• Allowing operators who are constrained with licensed spectrum to add capacity and reach by using Wi-Fi and CBRS

Lessons learned building natural language processing systems in health care

We’re in an exciting decade for natural language processing (NLP). Computers will get as good as humans in complex tasks like reading comprehension, language translation, and creative writing. Language understanding benefits from every part of the fast-improving ABC of software: AI (freely available deep learning libraries like PyText and language models like BERT), big data (Hadoop, Spark, and Spark NLP), and cloud (GPUs on demand and NLP-as-a-service from all the major cloud providers).

What’s new on arXiv

Autoregressive Convolutional Recurrent Neural Network for Univariate and Multivariate Time Series Prediction

Time series forecasting (univariate and multivariate) is a problem of high complexity due to the different patterns that have to be detected in the input, ranging from high- to low-frequency ones. In this paper we propose a new model for time series prediction that utilizes convolutional layers for feature extraction, a recurrent encoder and a linear autoregressive component. We motivate the model and we test and compare it against a baseline of widely used existing architectures for univariate and multivariate time series. The proposed model appears to outperform the baselines in almost every case of the multivariate time series datasets, in some cases even with 50% improvement, which shows the strengths of such a hybrid architecture in complex time series.

IMEXnet: A Forward Stable Deep Neural Network

Deep convolutional neural networks have revolutionized many machine learning and computer vision tasks. Despite their enormous success, remaining key challenges limit their wider use. Pressing challenges include improving the network’s robustness to perturbations of the input images and simplifying the design of architectures that generalize. Another problem relates to the limited ‘field of view’ of convolution operators, which means that very deep networks are required to model nonlocal relations in high-resolution image data. We introduce the IMEXnet that addresses these challenges by adapting semi-implicit methods for partial differential equations. Compared to similar explicit networks such as the residual networks (ResNets) our network is more stable. This stability has been recently shown to reduce the sensitivity to small changes in the input features and improve generalization. The implicit step connects all pixels in the images and therefore addresses the field of view problem, while being comparable to standard convolutions in terms of the number of parameters and computational complexity. We also present a new dataset for semantic segmentation and demonstrate the effectiveness of our architecture using the NYU depth dataset.

Generative Graph Convolutional Network for Growing Graphs

Modeling the generative process of growing graphs has wide applications in social networks and recommendation systems, where the cold start problem leaves new nodes isolated from the existing graph. Despite the emerging literature on learning graph representations and graph generation, most existing methods cannot handle isolated new nodes without nontrivial modifications. The challenge arises because learning to generate representations for nodes in the observed graph relies heavily on topological features, whereas for new nodes only node attributes are available. Here we propose a unified generative graph convolutional network that learns node representations for all nodes adaptively in a generative model framework, by sampling graph generation sequences constructed from observed graph data. We optimize over a variational lower bound that consists of a graph reconstruction term and an adaptive Kullback-Leibler divergence regularization term. We demonstrate the superior performance of our approach on several benchmark citation network datasets.

Structure-Preserving Community In A Multilayer Network: Definition, Detection, And Analysis

Multilayer networks or MLNs (also called multiplexes or networks of networks) are being used extensively for modeling and analysis of data sets with multiple entity and feature types as well as their relationships. As the concepts of communities and hubs are used for these analyses, a structure-preserving definition of them on MLNs (one that retains the original MLN structure and node/edge labels and types) and its efficient detection are critical. There is no structure-preserving definition of a community for an MLN, as most of the current analyses aggregate an MLN to a single graph. Although there is consensus on community definition for single graphs (and detection packages) and to a lesser extent for homogeneous MLNs, it is lacking for heterogeneous MLNs. In this paper, we not only provide a structure-preserving definition for the first time, but also its efficient computation using a decoupling approach, and discuss its characteristics and significance for analysis. The proposed decoupling approach for efficiency combines communities from individual layers to form a serial k-community for connected k layers in an MLN. We propose several weight metrics for composing layer-wise communities using the bipartite graph match approach based on the analysis semantics. Our proposed approach has a number of advantages. It: i) leverages extant single graph community detection algorithms, ii) is based on the widely used maximal flow bipartite graph matching for composing k layers, iii) introduces several weight metrics that are customized for the community concept, and iv) experimentally validates the definition, mapping, and efficiency from a flexible analysis perspective on the widely used IMDb data set. Keywords: Heterogeneous Multilayer Networks; Bipartite Graphs; Community Definition and Detection; Decoupling-Based Composition

A Character-Level Approach to the Text Normalization Problem Based on a New Causal Encoder

Text normalization is a ubiquitous process that appears as the first step of many Natural Language Processing problems. However, previous Deep Learning approaches have suffered from so-called silly errors, which are undetectable in unsupervised frameworks, making those models unsuitable for deployment. In this work, we make use of an attention-based encoder-decoder architecture that overcomes these undetectable errors by using a fine-grained character-level approach rather than a word-level one. Furthermore, our new general-purpose encoder based on causal convolutions, called Causal Feature Extractor (CFE), is introduced and compared to other common encoders. The experimental results show the feasibility of this encoder, which leverages the attention mechanisms the most and obtains better results in terms of accuracy, number of parameters and convergence time. While our method results in a slightly worse initial accuracy (92.74%), errors can be automatically detected and, thus, more readily solved, obtaining a more robust model for deployment. Furthermore, there is still plenty of room for future improvements that will push these advantages even further.

Using World Models for Pseudo-Rehearsal in Continual Learning

The utility of learning a dynamics/world model of the environment in reinforcement learning has been shown in many ways. When using neural networks, however, these models suffer catastrophic forgetting when learned in a lifelong or continual fashion. Current solutions to the continual learning problem require experience to be segmented and labeled as discrete tasks; however, in continuous experience it is generally unclear what a sufficient segmentation of tasks would be. Here we propose a method to continually learn these internal world models through the interleaving of internally generated rollouts from past experiences (i.e., pseudo-rehearsal). We show this method can sequentially learn unsupervised temporal prediction, without task labels, in a disparate set of Atari games. Empirically, this interleaving of the internally generated rollouts with the external environment’s observations leads to an average 4.5x reduction in temporal prediction loss compared to non-interleaved learning. Similarly, we show that the representations of this internal model remain stable across learned environments. Here, an agent trained using an initial version of the internal model can perform equally well when using a subsequent version that has successfully incorporated experience from multiple new environments.

Multi-Instance Learning for End-to-End Knowledge Base Question Answering

End-to-end training has been a popular approach for knowledge base question answering (KBQA). However, real world applications often contain answers of varied quality for users’ questions. It is not appropriate to treat all available answers of a user question equally. This paper proposes a novel approach based on multiple instance learning to address the problem of noisy answers by exploring consensus among answers to the same question in training end-to-end KBQA models. In particular, the QA pairs are organized into bags with dynamic instance selection and different options of instance weighting. Curriculum learning is utilized to select instance bags during training. On the public CQA dataset, the new method significantly improves both entity accuracy and the Rouge-L score over a state-of-the-art end-to-end KBQA baseline.

Fast Parallel Algorithms for Feature Selection

In this paper, we analyze a fast parallel algorithm to efficiently select and build a set of k random variables from a large set of n candidate elements. This combinatorial optimization problem can be viewed in the context of feature selection for the prediction of a response variable. Using the adaptive sampling technique, which has recently been shown to exponentially speed up submodular maximization algorithms, we propose a new parallelizable algorithm that dramatically speeds up previous selection algorithms by reducing the number of rounds from $\mathcal{O}(k)$ to $\mathcal{O}(\log n)$ for objectives that do not conform to the submodularity property. We introduce a new metric to quantify the closeness of the objective function to submodularity and analyze the performance of adaptive sampling under this regime. We also conduct experiments on synthetic and real datasets and show that the empirical performance of adaptive sampling on non-submodular objectives greatly outperforms its theoretical lower bound. Additionally, the empirical running time improved drastically in all experiments without compromising the terminal value, showing the practicality of adaptive sampling.
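
For context, the O(k)-round behaviour mentioned in the abstract above is what sequential greedy selection exhibits: each round adds one element, so k elements need k rounds. The following is a minimal sketch of that baseline only, not of the paper's adaptive sampling algorithm; the function name `greedy_select` and the toy coverage objective are assumptions made for illustration.

```python
def greedy_select(candidates, objective, k):
    """Sequential greedy selection: one element per round, so k rounds in total.

    `objective` is any function mapping a list of selected candidates to a score.
    """
    selected = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        base = objective(selected)
        # Pick the candidate with the largest marginal gain over the current set.
        best = max(remaining, key=lambda c: objective(selected + [c]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy objective: how many distinct items the chosen sets cover.
TOY_SETS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2}}


def coverage(chosen):
    covered = set()
    for name in chosen:
        covered |= TOY_SETS[name]
    return len(covered)
```

Calling `greedy_select(TOY_SETS, coverage, 2)` runs two rounds. The rounds cannot be parallelised across each other because each depends on the set chosen so far, which is the bottleneck that round-reducing methods target.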

Concurrent Meta Reinforcement Learning

State-of-the-art meta reinforcement learning algorithms typically assume the setting of a single agent interacting with its environment in a sequential manner. A negative side-effect of this sequential execution paradigm is that, as the environment becomes more and more challenging and thus requires more interaction episodes for the meta-learner, the agent needs to reason over longer and longer time-scales. To combat the difficulty of long time-scale credit assignment, we propose an alternative parallel framework, which we name ‘Concurrent Meta-Reinforcement Learning’ (CMRL), that transforms the temporal credit assignment problem into a multi-agent reinforcement learning one. In this multi-agent setting, a set of parallel agents is executed in the same environment and each of these ‘rollout’ agents is given the means to communicate with the others. The goal of the communication is to coordinate, in a collaborative manner, the most efficient exploration of the shared task the agents are currently assigned. This coordination therefore represents the meta-learning aspect of the framework, as each agent can be assigned or assign itself a particular section of the current task’s state space. This framework is in contrast to standard RL methods that assume that each parallel rollout occurs independently, which can potentially waste computation if many of the rollouts end up sampling the same part of the state space. Furthermore, the parallel setting enables us to define several reward sharing functions and auxiliary losses that are non-trivial to apply in the sequential setting. We demonstrate the effectiveness of our proposed CMRL at improving over sequential methods in a variety of challenging tasks.

Can Sophisticated Dispatching Strategy Acquired by Reinforcement Learning? – A Case Study in Dynamic Courier Dispatching System

In this paper, we study a courier dispatching problem (CDP) arising from an online pickup-service platform of Alibaba. The CDP aims to assign a set of couriers to serve pickup requests with stochastic spatial and temporal arrival rates among urban regions. The objective is to maximize the revenue of served requests given a limited number of couriers over a period of time. Many online algorithms such as dynamic matching and vehicle routing strategies from the existing literature could be applied to tackle this problem. However, these methods rely on appropriately predefined optimization objectives at each decision point, which is hard in dynamic situations. This paper formulates the CDP as a Markov decision process (MDP) and proposes a data-driven approach to derive the optimal dispatching rule-set under different scenarios. Our method stacks multi-layer images of the spatial-and-temporal map and applies multi-agent reinforcement learning (MARL) techniques to evolve dispatching models. This method solves the learning inefficiency caused by traditional centralized MDP modeling. Through comprehensive experiments on both an artificial dataset and a real-world dataset, we show: 1) by utilizing historical data and considering long-term revenue gains, MARL achieves better performance than myopic online algorithms; 2) MARL is able to construct the mapping from complex scenarios to sophisticated decisions such as the dispatching rule; 3) MARL scales to large real-world scenarios.

Allocation of Computation-Intensive Graph Jobs over Vehicular Clouds

Recent years have witnessed dramatic growth in smart vehicles and computation-intensive jobs, which pose new challenges to the provision of efficient services related to the internet of vehicles. Graph jobs, in which computations are represented by graphs consisting of components (denoting either data sources or data processing) and edges (corresponding to data flows between the components) are one type of computation-intensive job warranting attention. Limitations on computational resources and capabilities of on-board equipment are primary obstacles to fulfilling the requirements of such jobs. Vehicular clouds, formed by a collection of vehicles allowing jobs to be offloaded among vehicles, can substantially alleviate heavy on-board workloads and enable on-demand provisioning of computational resources. In this article, we present a novel framework for vehicular clouds that maps components of graph jobs to service providers via opportunistic vehicle-to-vehicle communication. Then, graph job allocation over vehicular clouds is formulated as a form of non-linear integer programming with respect to vehicles’ contact duration and available resources, aiming to minimize job completion time and data exchange cost. The problem is approached from two scenarios: low-traffic and rush-hours. For the former, we determine the optimal solutions for the problem. In the latter case, given intractable computations for deriving feasible allocations, we propose a novel low-complexity randomized algorithm. Numerical analysis and comparative evaluations are performed for the proposed algorithms under different graph job topologies and vehicular cloud configurations.

Multimapper: Data Density Sensitive Topological Visualization

Mapper is an algorithm that summarizes the topological information contained in a dataset and provides an insightful visualization. It takes as input a point cloud which is possibly high-dimensional, a filter function on it and an open cover on the range of the function. It returns the nerve simplicial complex of the pullback of the cover. Mapper can be considered a discrete approximation of the topological construct called Reeb space, as analysed in the 1-dimensional case by [Carri et al.]. Despite its success in obtaining insights in various fields such as in [Kamruzzaman et al., 2016], Mapper is an ad hoc technique requiring lots of parameter tuning. There is also no measure to quantify goodness of the resulting visualization, which often deviates from the Reeb space in practice. In this paper, we introduce a new cover selection scheme for data that reduces the obscuration of topological information at both the computation and visualisation steps. To achieve this, we replace global scale selection of cover with a scale selection scheme sensitive to local density of data points. We also propose a method to detect some deviations in Mapper from Reeb space via computation of persistence features on the Mapper graph.
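
As a rough illustration of the pipeline described in the abstract above (point cloud, filter function, cover, nerve), here is a minimal pure-Python sketch of standard Mapper, not of the density-sensitive cover selection the paper proposes. All function names, the uniform interval cover, and the single-linkage clustering with a fixed `eps` threshold are assumptions made for the example. Clustering each preimage is the usual practical refinement of the pullback, and only the nodes and edges of the nerve are built.

```python
from itertools import combinations
from math import dist


def interval_cover(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] with equal-length intervals overlapping by a given fraction."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]


def single_linkage(points, idx, eps):
    """Split the indices in idx into components linked by distances <= eps."""
    clusters, unseen = [], set(idx)
    while unseen:
        stack = [unseen.pop()]
        component = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in unseen if dist(points[i], points[j]) <= eps}
            unseen -= near
            component |= near
            stack.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper(points, filter_fn, n_intervals=4, overlap=0.3, eps=0.5):
    """Return (nodes, edges): nodes are clusters of point indices, edges join
    clusters that share at least one point."""
    values = [filter_fn(p) for p in points]
    nodes = []
    for a, b in interval_cover(min(values), max(values), n_intervals, overlap):
        # Pull the interval back to the points whose filter value lies in it
        # (with a small tolerance for floating-point endpoints).
        idx = [i for i, v in enumerate(values) if a - 1e-9 <= v <= b + 1e-9]
        nodes.extend(single_linkage(points, idx, eps))
    edges = {(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]}
    return nodes, edges
```

The global `n_intervals`, `overlap`, and `eps` values are exactly the kind of hand-tuned parameters the abstract refers to when it calls Mapper an ad hoc technique.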

Doubly Aligned Incomplete Multi-view Clustering

Nowadays, multi-view clustering has attracted more and more attention. To date, almost all the previous studies assume that views are complete. However, in reality, it is often the case that each view may contain some missing instances. Such incompleteness makes it impossible to directly use traditional multi-view clustering methods. In this paper, we propose a Doubly Aligned Incomplete Multi-view Clustering algorithm (DAIMC) based on weighted semi-nonnegative matrix factorization (semi-NMF). Specifically, on the one hand, DAIMC utilizes the given instance alignment information to learn a common latent feature matrix for all the views. On the other hand, DAIMC establishes a consensus basis matrix with the help of L_{2,1}-Norm regularized regression for reducing the influence of missing instances. Consequently, compared with existing methods, besides inheriting the strength of semi-NMF with ability to handle negative entries, DAIMC has two unique advantages: 1) solving the incomplete view problem by introducing a respective weight matrix for each view, making it able to easily adapt to the case with more than two views; 2) reducing the influence of view incompleteness on clustering by enforcing the basis matrices of individual views being aligned with the help of regression. Experiments on four real-world datasets demonstrate its advantages.

GRATIS: GeneRAting TIme Series with diverse and controllable characteristics

The explosion of time series data in recent years has brought a flourish of new time series analysis methods, for forecasting, clustering, classification and other tasks. The evaluation of these new methods requires a diverse collection of time series benchmarking data to enable reliable comparisons against alternative approaches. We propose GeneRAting TIme Series with diverse and controllable characteristics, named GRATIS, with the use of mixture autoregressive (MAR) models. We generate sets of time series using MAR models and investigate the diversity and coverage of the generated time series in a time series feature space. By tuning the parameters of the MAR models, GRATIS is also able to efficiently generate new time series with controllable features. In general, as a costless surrogate to the traditional data collection approach, GRATIS can be used as an evaluation tool for tasks such as time series forecasting and classification. We illustrate the usefulness of our time series generation process through a time series forecasting application.
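
To make the generation step more concrete, the following is a minimal sketch of simulating a series from a mixture autoregressive (MAR) model with Gaussian components: at each time step one component is drawn according to the mixing weights, and the next value is produced by that component's autoregression. This is not the GRATIS implementation; the function name `simulate_mar`, the parameter layout, and the burn-in length are assumptions chosen for illustration.

```python
import random


def simulate_mar(n, weights, coefs, sigmas, seed=0, burn_in=100):
    """Simulate n points from a mixture autoregressive process.

    weights[k]: mixing probability of component k
    coefs[k]:   [intercept, phi_1, ..., phi_p] for component k
    sigmas[k]:  standard deviation of the Gaussian noise of component k
    """
    rng = random.Random(seed)
    max_order = max(len(c) - 1 for c in coefs)
    series = [0.0] * max_order  # zero initial history
    for _ in range(n + burn_in):
        # Draw which component generates this observation.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        intercept, *phis = coefs[k]
        mean = intercept + sum(phi * series[-lag]
                               for lag, phi in enumerate(phis, start=1))
        series.append(mean + rng.gauss(0.0, sigmas[k]))
    return series[-n:]  # discard the initial history and burn-in


# Example: a two-component mixture of an AR(1) and an AR(2) process.
example = simulate_mar(200, weights=[0.7, 0.3],
                       coefs=[[0.0, 0.9], [1.0, -0.5, 0.2]],
                       sigmas=[1.0, 0.5], seed=42)
```

Varying `weights`, `coefs`, and `sigmas` changes the character of the simulated series, which is the sense in which the abstract describes tuning MAR parameters to control time series features.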

Interpretable Deep Learning in Drug Discovery

Without any means of interpretation, neural networks that predict molecular properties and bioactivities are merely black boxes. We will unravel these black boxes and will demonstrate approaches to understand the learned representations which are hidden inside these models. We show how single neurons can be interpreted as classifiers which determine the presence or absence of pharmacophore- or toxicophore-like structures, thereby generating new insights and relevant knowledge for chemistry, pharmacology and biochemistry. We further discuss how these novel pharmacophores/toxicophores can be determined from the network by identifying the most relevant components of a compound for the prediction of the network. Additionally, we propose a method which can be used to extract new pharmacophores from a model and will show that these extracted structures are consistent with literature findings. We envision that having access to such interpretable knowledge is a crucial aid in the development and design of new pharmaceutically active molecules, and helps to investigate and understand failures and successes of current methods.

Multi-output Bus Travel Time Prediction with Convolutional LSTM Neural Network

Accurate and reliable travel time predictions in public transport networks are essential for delivering an attractive service that is able to compete with other modes of transport in urban areas. The traditional application of this information, where arrival and departure predictions are displayed on digital boards, is highly visible in the city landscape of most modern metropolises. More recently, the same information has become critical as input for smart-phone trip planners in order to alert passengers about unreachable connections, alternative route choices and prolonged travel times. More sophisticated Intelligent Transport Systems (ITS) include the predictions of connection assurance, i.e. to hold back services in case a connecting service is delayed. In order to operate such systems, and to ensure the confidence of passengers in the systems, the information provided must be accurate and reliable. Traditional methods have trouble with this as congestion, and thus travel time variability, increases in cities, consequently making travel time predictions in urban areas a non-trivial task. This paper presents a system for bus travel time prediction that leverages the non-static spatio-temporal correlations present in urban bus networks, allowing the discovery of complex patterns not captured by traditional methods. The underlying model is a multi-output, multi-time-step, deep neural network that uses a combination of convolutional and long short-term memory (LSTM) layers. The method is empirically evaluated and compared to other popular approaches for link travel time prediction and currently available services, including the currently deployed model in Copenhagen, Denmark. We find that the proposed model significantly outperforms all the other methods we compare with, and is able to detect small irregular peaks in bus travel times very quickly.

Predicting Research Trends From Arxiv

We perform trend detection on two datasets of Arxiv papers, derived from its machine learning (cs.LG) and natural language processing (cs.CL) categories. Our approach is bottom-up: we first rank papers by their normalized citation counts, then group top-ranked papers into different categories based on the tasks that they pursue and the methods they use. We then analyze these resulting topics. We find that the dominating paradigm in cs.CL revolves around natural language generation problems and those in cs.LG revolve around reinforcement learning and adversarial principles. By extrapolation, we predict that these topics will remain lead problems/approaches in their fields in the short- and mid-term.

COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis

There are a substantial number of instructional videos on the Internet, which enable us to acquire knowledge for completing various tasks. However, most existing datasets for instructional video analysis have limitations in diversity and scale, which makes them far from many real-world applications where more diverse activities occur. Moreover, it still remains a great challenge to organize and harness such data. To address these problems, we introduce a large-scale dataset called ‘COIN’ for COmprehensive INstructional video analysis. Organized with a hierarchical structure, the COIN dataset contains 11,827 videos of 180 tasks in 12 domains (e.g., vehicles, gadgets, etc.) related to our daily life. With a newly developed toolbox, all the videos are annotated effectively with a series of step descriptions and the corresponding temporal boundaries. Furthermore, we propose a simple yet effective method to capture the dependencies among different steps, which can be easily plugged into conventional proposal-based action detection methods for localizing important steps in instructional videos. In order to provide a benchmark for instructional video analysis, we evaluate plenty of approaches on the COIN dataset under different evaluation criteria. We expect the introduction of the COIN dataset will promote future in-depth research on instructional video analysis for the community.

Robust and Communication-Efficient Federated Learning from Non-IID Data

Federated Learning allows multiple parties to jointly train a deep learning model on their combined data, without any of the participants having to reveal their local data to a centralized server. This form of privacy-preserving collaborative learning however comes at the cost of a significant communication overhead during training. To address this problem, several compression methods have been proposed in the distributed training literature that can reduce the amount of required communication by up to three orders of magnitude. These existing methods however are only of limited utility in the Federated Learning setting, as they either only compress the upstream communication from the clients to the server (leaving the downstream communication uncompressed) or only perform well under idealized conditions such as iid distribution of the client data, which typically can not be found in Federated Learning. In this work, we propose Sparse Ternary Compression (STC), a new compression framework that is specifically designed to meet the requirements of the Federated Learning environment. Our experiments on four different learning tasks demonstrate that STC distinctively outperforms Federated Averaging in common Federated Learning scenarios where clients either a) hold non-iid data, b) use small batch sizes during training, or where c) the number of clients is large and the participation rate in every communication round is low. We furthermore show that even if the clients hold iid data and use medium sized batches for training, STC still behaves pareto-superior to Federated Averaging in the sense that it achieves fixed target accuracies on our benchmarks within both fewer training iterations and a smaller communication budget.
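The abstract does not spell out how STC compresses an update, so the following is only a rough sketch of the general idea that the name suggests: keep the largest-magnitude entries of a weight update and replace them with a single shared magnitude and a sign. The function name `sparse_ternary_compress`, the `sparsity` parameter, and the choice of the mean magnitude as the shared value are assumptions made for illustration, not the authors' implementation.

```python
def sparse_ternary_compress(update, sparsity=0.25):
    """Keep the top-k entries by magnitude and map them to {-mu, 0, +mu}.

    Illustrative assumption: mu is the mean absolute value of the kept entries.
    """
    k = max(1, int(len(update) * sparsity))
    # Indices of the k largest-magnitude entries.
    top = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)[:k]
    mu = sum(abs(update[i]) for i in top) / k
    compressed = [0.0] * len(update)
    for i in top:
        sign = (update[i] > 0) - (update[i] < 0)  # -1, 0, or +1
        compressed[i] = sign * mu
    return compressed


result = sparse_ternary_compress([0.1, -4.0, 2.0, 0.05], sparsity=0.5)
```

With a result of this shape, only the positions of the kept entries, their signs, and one magnitude need to be communicated, which is where the saving in communication would come from.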

Detection of Advanced Malware by Machine Learning Techniques

In today’s digital world, most anti-malware tools are signature based, which makes them ineffective at detecting advanced unknown malware, viz. metamorphic malware. In this paper, we study the frequency of opcode occurrence to detect unknown malware using machine learning techniques. For this purpose, we used the Kaggle Microsoft Malware Classification Challenge dataset. The top 20 features obtained from the Fisher score, information gain, gain ratio, chi-square and symmetric uncertainty feature selection methods are compared. We also studied multiple classifiers available in WEKA, a GUI-based machine learning tool, and found that five of them (Random Forest, LMT, NBT, J48 Graft and REPTree) detect malware with almost 100% accuracy.

Scheduling OLTP Transactions via Machine Learning

Current main memory database system architectures are still challenged by high contention workloads and this challenge will continue to grow as the number of cores in processors continues to increase. These systems schedule transactions randomly across cores to maximize concurrency and to produce a uniform load across cores. Scheduling never considers potential conflicts. Performance could be improved if scheduling balanced between concurrency to maximize throughput and scheduling transactions linearly to avoid conflicts. In this paper, we present the design of several intelligent transaction scheduling algorithms that consider both potential transaction conflicts and concurrency. To incorporate reasoning about transaction conflicts, we develop a supervised machine learning model that estimates the probability of conflict. This model is incorporated into several scheduling algorithms. In addition, we integrate an unsupervised machine learning algorithm into an intelligent scheduling algorithm. We then empirically measure the performance impact of different scheduling algorithms on OLTP and social networking workloads. Our results show that, with appropriate settings, intelligent scheduling can increase throughput by 54% and reduce abort rate by 80% on a 20-core machine, relative to random scheduling. In summary, the paper provides preliminary evidence that intelligent scheduling significantly improves DBMS performance.

When random search is not enough: Sample-Efficient and Noise-Robust Blackbox Optimization of RL Policies

Interest in derivative-free optimization (DFO) and ‘evolutionary strategies’ (ES) has recently surged in the Reinforcement Learning (RL) community, with growing evidence that they match state of the art methods for policy optimization tasks. However, blackbox DFO methods suffer from high sampling complexity since they require a substantial number of policy rollouts for reliable updates. They can also be very sensitive to noise in the rewards, actuators or the dynamics of the environment. In this paper we propose to replace the standard ES derivative-free paradigm for RL based on simple reward-weighted averaged random perturbations for policy updates, that has recently become a subject of voluminous research, by an algorithm where gradients of blackbox RL functions are estimated via regularized regression methods. In particular, we propose to use L1/L2 regularized regression-based gradient estimation to exploit sparsity and smoothness, as well as LP decoding techniques for handling adversarial stochastic and deterministic noise. Our methods can be naturally aligned with sliding trust region techniques for efficient samples reuse to further reduce sampling complexity. This is not the case for standard ES methods requiring independent sampling in each epoch. We show that our algorithms can be applied in locomotion tasks, where training is conducted in the presence of substantial noise, e.g. for learning in sim transferable stable walking behaviors for quadruped robots or training quadrupeds how to follow a path. We further demonstrate our methods on several OpenAI Gym MuJoCo RL tasks. We manage to train effective policies even if up to 25% of all measurements are arbitrarily corrupted, where standard ES methods produce sub-optimal policies or do not manage to learn at all. Our empirical results are backed by theoretical guarantees.

Fast Exact Dynamic Time Warping on Run-Length Encoded Time Series

Dynamic Time Warping (DTW) is a well-known similarity measure for time series. The standard dynamic programming approach to compute the dtw-distance of two length-n time series, however, requires O(n^2) time, which is often too slow in applications. Therefore, many heuristics have been proposed to speed up the dtw computation. These are often based on approximating or bounding the true dtw-distance or considering special inputs (e.g. binary or piecewise constant time series). In this paper, we present a fast and exact algorithm to compute the dtw-distance of two run-length encoded time series. This might be used for fast and accurate indexing and classification of time series in combination with preprocessing techniques such as piecewise aggregate approximation (PAA).
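For reference, the standard quadratic-time dynamic program mentioned above can be sketched as follows. This is the textbook baseline, not the paper's run-length-encoded algorithm; the squared-difference local cost and the final square root are assumptions about the distance convention, and `dtw_distance` is a name chosen for illustration.

```python
def dtw_distance(x, y):
    """Textbook O(n*m) DTW with squared-difference local cost."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of warping x[:i] onto y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] ** 0.5


# Series that differ only by repeated values warp onto each other at zero cost.
same_shape = dtw_distance([1, 1, 1, 2], [1, 2])
shifted = dtw_distance([0, 0], [1, 1])
```

The first example hints at why run-length encoding is attractive: long constant runs add table cells to fill but no new information.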

HEAT: Hyperbolic Embedding of Attributed Networks

Finding a low dimensional representation of hierarchical, structured data described by a network remains a challenging problem in the machine learning community. An emerging approach is embedding these networks into hyperbolic space because it can naturally represent a network’s hierarchical structure. However, existing hyperbolic embedding approaches cannot deal with attributed networks, in which nodes are annotated with additional attributes. These attributes might provide additional proximity information to constrain the representations of the nodes, which is important to learn high quality hyperbolic embeddings. To fill this gap, we introduce HEAT (Hyperbolic Embedding of ATributed networks), the first method for embedding attributed networks to a hyperbolic space. HEAT consists of 1) a modified random walk algorithm to obtain training samples that capture both topological and attribute similarity; and 2) a learning algorithm for learning hyperboloid embeddings from the obtained training samples. We show that by leveraging node attributes, HEAT can outperform a state-of-the-art Hyperbolic embedding algorithm on several downstream tasks. As a general embedding method, HEAT opens the door to hyperbolic manifold learning on a wide range of attributed and unattributed networks.

Analysis Dictionary Learning: An Efficient and Discriminative Solution

Discriminative Dictionary Learning (DL) methods have been widely advocated for image classification problems. To further sharpen their discriminative capabilities, most state-of-the-art DL methods have additional constraints included in the learning stages. These various constraints, however, lead to additional computational complexity. We hence propose an efficient Discriminative Convolutional Analysis Dictionary Learning (DCADL) method, as a lower cost Discriminative DL framework, to both characterize the image structures and refine the interclass structure representations. The proposed DCADL jointly learns a convolutional analysis dictionary and a universal classifier, while greatly reducing the time complexity in both training and testing phases, and achieving a competitive accuracy, thus demonstrating great performance in many experiments with standard databases.

Intelligent Knowledge Distribution: Constrained-Action POMDPs for Resource-Aware Multi-Agent Communication

This paper addresses a fundamental question of multi-agent knowledge distribution: what information should be sent to whom and when, with the limited resources available to each agent? Communication requirements for multi-agent systems can be rather high when an accurate picture of the environment and the state of other agents must be maintained. To reduce the impact of multi-agent coordination on networked systems, e.g., power and bandwidth, this paper introduces two concepts for partially observable Markov decision processes (POMDPs): 1) action-based constraints which yield constrained-action partially observable Markov decision processes (CA-POMDPs); and 2) soft probabilistic constraint satisfaction for the resulting infinite-horizon controllers. To enable constraint analysis over an infinite horizon, an unconstrained policy is first represented as a Finite State Controller (FSC) and optimized with policy iteration. The FSC representation then allows for a combination of Markov chain Monte Carlo and discrete optimization to improve the probabilistic constraint satisfaction of the controller while minimizing the impact to the value function. Within the CA-POMDP framework we then propose Intelligent Knowledge Distribution (IKD) which yields per-agent policies for distributing knowledge between agents subject to interaction constraints. Finally, the CA-POMDP and IKD concepts are validated using an asset tracking problem where multiple unmanned aerial vehicles (UAVs) with heterogeneous sensors collaborate to localize a ground asset to assist in avoiding unseen obstacles in a disaster area. The IKD model was able to maintain asset tracking through multi-agent communications while only violating soft power and bandwidth constraints 3% of the time, while greedy and naive approaches violated constraints more than 60% of the time.

Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples

Few-shot classification refers to learning a classifier for new classes given only a few examples. While a plethora of models have emerged to tackle this recently, we find the current procedure and datasets that are used to systematically assess progress in this setting lacking. To address this, we propose Meta-Dataset: a new benchmark for training and evaluating few-shot classifiers that is large-scale, consists of multiple datasets, and presents more natural and realistic tasks. The aim is to measure the ability of state-of-the-art models to leverage diverse sources of data to achieve higher generalization, and to evaluate that generalization ability in a more challenging setting. We additionally measure robustness of current methods to variations in the number of available examples and the number of classes. Finally our extensive empirical evaluation leads us to identify weaknesses in Prototypical Networks and MAML, two popular few-shot classification methods, and to propose a new method, Proto-MAML, which achieves improved performance on our benchmark.

R Packages worth a look

Knowledge-Based Guided Regularized Random Forest (KnowGRRF)
Random Forest (RF) and Regularized Random Forest can be used for feature selection. Moreover, by Guided Regularized Random Forest, statistical-based we …

Accessing Data Stored in ‘LaBB-CAT’ Instances (nzilbb.labbcat)
‘LaBB-CAT’ is a web-based language corpus management system developed by the New Zealand Institute of Language, Brain and Behaviour (NZILBB) – see << …

Wavelet Scalogram Tools for Time Series Analysis (wavScalogram)
Provides scalogram based wavelet tools for time series analysis: wavelet power spectrum, scalogram, windowed scalogram, windowed scalogram difference ( …

Causal Effect Identification from Multiple Incomplete Data Sources (dosearch)
Identification of causal effects from arbitrary observational and experimental probability distributions via do-calculus and standard probability manip …

Rasterize Graphical Output (rasterize)
Provides R functions to selectively rasterize components of ‘grid’ output.

Enterprise Streamlined ‘Shiny’ Application Framework (periscope)
An enterprise-targeted scalable and UI-standardized ‘shiny’ framework including a variety of developer convenience functions with the goal of both stre …

Document worth reading: “Core Decomposition in Multilayer Networks: Theory, Algorithms, and Applications”

Multilayer networks are a powerful paradigm to model complex systems, where various relations might occur among the same set of entities. Despite the keen interest in a variety of problems, algorithms, and analysis methods in this type of network, the problem of extracting dense subgraphs has remained largely unexplored. As a first step in this direction, we study the problem of core decomposition of a multilayer network. Unlike the single-layer counterpart in which cores are all nested into one another, in the multilayer context no total order exists among multilayer cores: they form a lattice whose size is exponential in the number of layers. In this setting we devise three algorithms which differ in the way they visit the core lattice and in their pruning techniques. We assess time and space efficiency of the three algorithms on a large variety of real-world multilayer networks. We then study the problem of extracting only the inner-most cores, i.e., the cores that are not dominated by any other core in terms of their index on all the layers. As inner-most cores are orders of magnitude less than all the cores, it is desirable to develop algorithms that effectively exploit the maximality property and extract inner-most cores directly, without first computing a complete decomposition. Moreover, we showcase an application of the multilayer core-decomposition tool to the problem of densest-subgraph extraction from multilayer networks. We introduce a definition of multilayer densest subgraph that trades-off between high density and number of layers in which the high density holds, and show how multilayer core decomposition can be exploited to approximate this problem with quality guarantees. We also exploit multilayer core decomposition to speed-up the extraction of frequent cross-graph quasi-cliques and to generalize the community-search problem to the multilayer setting. Core Decomposition in Multilayer Networks: Theory, Algorithms, and Applications

R Packages worth a look

Assessing Essential Unidimensionality Using External Validity Information (unival)
Assess essential unidimensionality using external validity information using the procedure proposed by Ferrando & Lorenzo-Seva (2019) <doi:10.11 …

Grouped Statistical Analyses in a Tidy Way (broomExtra)
Collection of functions to assist ‘broom’ and ‘broom.mixed’ package-related data analysis workflows. In particular, the generic functions tidy(), glanc …

Sample Size Calculations for Longitudinal Data (longpower)
Compute power and sample size for linear models of longitudinal data. Supported models include mixed-effects models and models fit by generalized least …

Multi-Objective Clustering Algorithm Guided by a-Priori Biological Knowledge (moc.gapbk)
Implements the Multi-Objective Clustering Algorithm Guided by a-Priori Biological Knowledge (MOC-GaPBK) which was proposed by Parraga-Alava, J. et. al. …

Autoencoder-based Residual Deep Network with Keras Support (resautonet)
This package is the R implementation of the Autoencoder-based Residual Deep Network that is based on this paper ( …

Hypothesis Test for Dependent Clusterings of Two Data Views (multiviewtest)
Implements a hypothesis test of whether clusterings of two data views are independent from Gao, L.L., Bien, J., and Witten, D. (2019) Are Clusterings o …

Distilled News

What Is An Enterprise Knowledge Graph and Why Do I Want One?

Enterprise Knowledge Graphs have been on the rise. We see them as an incredibly valuable tool for relating your structured and unstructured information and discovering facts about your organization. Yet, knowledge graphs have been and still are far too underutilized. Organizations are still struggling to find and, more importantly, discover their valuable content. To take it a step further, knowledge graphs are a prerequisite for achieving smart, semantic artificial intelligence applications (AI) that can help you discover facts from your content, data, and organizational knowledge, which otherwise would go unnoticed. A smart semantic AI application, whether it is a chatbot, a cognitive search utilizing Natural Language Processing (NLP), or a recommendation engine, leverages your enterprise knowledge graph to extract, relate, and deliver answers, recommendations, and insights. With semantic technologies, several terms have been thrown around, such as ontology, triple store, semantic data model, graph database, and knowledge graph. And that is before we even get into the standards like SKOS, RDF, OWL, etc. While it is easy to get into the details, for the purposes of this blog, I will focus on a high level overview of the components that make up an enterprise knowledge graph.

The State of Machine Learning Adoption in the Enterprise

While the use of machine learning (ML) in production started near the turn of the century, it’s taken roughly 20 years for the practice to become mainstream throughout industry. With this report, you’ll learn how more than 11,000 data specialists responded to a recent O’Reilly survey about their organization’s approach – or intended approach – to machine learning. Data scientists, machine learning engineers, and deep learning engineers throughout the world answered detailed questions about their organization’s level of ML adoption. About half of the respondents work for enterprises in the early stages of exploring ML, while the rest have moderate or extensive experience deploying ML models to production.

When ‘Zoë’ !== ‘Zoë’. Or why you need to normalize Unicode strings

Never heard of Unicode normalization? You’re not alone. But it will save you a lot of trouble.
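A minimal sketch of the problem in the title, using Python's standard `unicodedata` module: the same visible name can be stored as different code point sequences, and the two only compare equal after normalization. The helper name `same_text` and the default of NFC are choices made here for illustration.

```python
import unicodedata

composed = "Zo\u00eb"     # 'ë' as a single code point (U+00EB)
decomposed = "Zoe\u0308"  # 'e' followed by a combining diaeresis (U+0308)


def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


raw_equal = composed == decomposed                 # plain comparison of code points
normalized_equal = same_text(composed, decomposed)  # comparison after NFC
```

Normalizing at the boundaries of a system (on input, before storing or comparing) is usually simpler than normalizing at every comparison.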

Iodide: an experimental tool for scientific communication and exploration on the web

Iodide lets you do data science entirely in your browser. Create, share, collaborate, and reproduce powerful reports and visualizations with tools you already know. In the last 10 years, there has been an explosion of interest in ‘scientific computing’ and ‘data science’: that is, the application of computation to answer questions and analyze data in the natural and social sciences. To address these needs, we’ve seen a renaissance in programming languages, tools, and techniques that help scientists and researchers explore and understand data and scientific concepts, and to communicate their findings. But to date, very few tools have focused on helping scientists gain unfiltered access to the full communication potential of modern web browsers. So today we’re excited to introduce Iodide, an experimental tool meant to help scientists write beautiful interactive documents using web technologies, all within an iterative workflow that will be familiar to many scientists.

Jupyter Lab: Evolution of the Jupyter Notebook

An overview of JupyterLab, the next generation of the Jupyter Notebook. Data says there are more than three million Jupyter Notebooks available publicly on GitHub. There is roughly a similar number of private ones too. Even without this data, we are quite aware of the popularity of notebooks in the Data Science domain. The ability to write code, inspect the results, and get rich outputs are some of the features that made Jupyter Notebooks so popular. But as it is said that all good things (must) come to an end, so will our favourite Notebook too. JupyterLab will eventually replace the classic Jupyter Notebook, and for good reason.

Radical Change Is Coming To Data Science Jobs

Within 10 years, data science will be so enmeshed within industry-specific applications and broad productivity tools that we may no longer think of it as a hot career. Just as generations of math and statistics students have gone on to fill all manner of roles in business and academia without thinking of themselves as mathematicians or statisticians, the newly minted data scientist grads will be tomorrow’s manufacturing engineers, marketing leaders and medical researchers.

12 things I wish I’d known before starting as a Data Scientist

1. ‘Data science’ is a vague term, so treat it accordingly
2. Imposter syndrome is a normal part of the job
3. You’ll never have to know all the tools
4. However, learn your basic tools well
5. You’re an expert in a domain, not just methods
6. The most important skill is critical thinking
7. Take relevant classes – not just technical classes
8. Practice communication – written, visual, and verbal
9. Work on real data problems
10. Publish your work and get feedback however you can
11. Go to events – hackathons, conferences, meetups
12. Be flexible with how you enter the field

Let’s build an Article Recommender using LDA

Due to keen interest in learning new topics, I decided to work on a project where a Latent Dirichlet Allocation (LDA) model can recommend Wikipedia articles based on a search phrase. This article explains my approach towards building the project in Python. Check out the project on GitHub below.

Light on Math ML: Attention with Keras

In this article, you will first grok what a sequence-to-sequence model is, followed by why attention is important for sequential models. Next, you will learn the nitty-gritty of the attention mechanism. The post ends by explaining how to use the attention layer.

Robotic Control with Graph Networks

Machine learning is helping to transform many fields across diverse industries, as anyone interested in technology undoubtedly knows. Things like computer vision and natural language processing were changed dramatically due to deep learning algorithms in the past few years, and the effects of that change are seeping in to our daily lives. One of the fields that artificial intelligence is expected to make drastic changes to, is the field of robotics. Decades ago, science fiction writers envisioned robots powered by artificial intelligence interacting with human society and either helping solve humanity’s problems or trying to destroy human-kind. Our reality is far from it, and we understand today that creating intelligent robots is a harder challenge than was expected back in those days. Robots must sense the world and understand their environment, they must reason about their goals and how to achieve them, and execute their plans using their actuation means.

Skip-Gram: NLP context words prediction algorithm

NLP is a field of Artificial Intelligence in which we try to process human language as text or speech to make computers similar to humans. Humans have a large amount of data written in a very unorganized format. So, it’s difficult for any machine to find meaning from raw text. To make a machine learn from the raw text we need to transform this data into a vector format which then can easily be processed by our computers. This transformation of raw text into a vector format is known as word representation.
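As a small illustration of the data a skip-gram model is typically trained on, the sketch below turns a token list into (center word, context word) pairs. The function name `skipgram_pairs` and the `window` parameter are hypothetical, and a real pipeline would go on to map words to vectors and train a model on these pairs.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["machines", "learn", "from", "text"], window=1)
```

Each pair asks the model to predict a neighbouring word from the center word, which is what pushes words used in similar contexts toward similar vectors.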

PCA and SVD explained with numpy

How exactly are principal component analysis and singular value decomposition related and how to implement using numpy.
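One way to see the relationship, sketched with NumPy on synthetic data (the data, seed, and variable names are assumptions made for illustration): the eigenvalues of the sample covariance matrix equal the squared singular values of the centered data divided by n - 1, and the eigenvectors match the right singular vectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with three differently scaled, correlated columns.
mixing = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing
Xc = X - X.mean(axis=0)  # PCA assumes centered columns
n = Xc.shape[0]

# Route 1: eigendecomposition of the sample covariance matrix.
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: singular value decomposition of the centered data.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = S ** 2 / (n - 1)

# Projections onto the principal axes are the same up to sign.
scores_eig = Xc @ eigvecs
scores_svd = U * S
```

The SVD route avoids forming the covariance matrix explicitly, which is generally the numerically safer choice.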

Hyper-parameter Tuning Techniques in Deep Learning

The process of setting the hyper-parameters requires expertise and extensive trial and error. There are no simple and easy ways to set hyper-parameters, specifically learning rate, batch size, momentum, and weight decay.

How to create professional reports from R scripts, with custom styles.

In the practical tips for R Markdown post, we talked briefly about how we can easily create professional reports directly from R scripts, without the need for converting them manually to Rmd and creating code chunks. In this one, we will provide useful tips on advanced options for styling, using themes and producing light-weight HTML reports directly from R scripts. We will also provide a repository with an example R script and rendering code to get differently styled and sized outputs easily.

Developing a DCGAN Model in Tensorflow 2.0

In early March 2019, TensorFlow 2.0 was released and we decided to create an image generator based on Taehoon Kim’s implementation of DCGAN. Here’s a tutorial on how to develop a DCGAN model in TensorFlow 2.0. ‘To avoid the fast convergence of D (discriminator) network, G (generator) network is updated twice for each D network update, which differs from original paper.’ • Taehoon Kim

Book Memo: “AIQ”

How artificial intelligence works and how we can harness its power for a better world
Two leading data scientists offer an up-close and user-friendly look at artificial intelligence: what it is, how it works, where it came from and how to harness its power for a better world. ‘There comes a time in the life of a subject when someone steps up and writes the book about it. AIQ explores the fascinating history of the ideas that drive this technology of the future and demystifies the core concepts behind it; the result is a positive and entertaining look at the great potential unlocked by marrying human creativity with powerful machines.’ Steven D. Levitt, co-author of Freakonomics Dozens of times per day, we all interact with intelligent machines that are constantly learning from the wealth of data now available to them. These machines, from smart phones to talking robots to self-driving cars, are remaking the world in the twenty first century in the same way that the Industrial Revolution remade the world in the nineteenth. AIQ is based on a simple premise: if you want to understand the modern world, then you have to know a little bit of the mathematical language spoken by intelligent machines. AIQ will teach you that language but in an unconventional way, anchored in stories rather than equations. You will meet a fascinating cast of historical characters who have a lot to teach you about data, probability and better thinking. Along the way, you’ll see how these same ideas are playing out in the modern age of big data and intelligent machines, and how these technologies will soon help you to overcome some of your built-in cognitive weaknesses, giving you a chance to lead a happier, healthier, more fulfilled life.

If you did not already know

Time Perception Machine
Numerous powerful point process models have been developed to understand temporal patterns in sequential data from fields such as health-care, electronic commerce, social networks, and natural disaster forecasting. In this paper, we develop novel models for learning the temporal distribution of human activities in streaming data (e.g., videos and person trajectories). We propose an integrated framework of neural networks and temporal point processes for predicting when the next activity will happen. Because point processes are limited to taking event frames as input, we propose a simple yet effective mechanism to extract features at frames of interest while also preserving the rich information in the remaining frames. We evaluate our model on two challenging datasets. The results show that our model outperforms traditional statistical point process approaches significantly, demonstrating its effectiveness in capturing the underlying temporal dynamics as well as the correlation within sequential activities. Furthermore, we also extend our model to a joint estimation framework for predicting the timing, spatial location, and category of the activity simultaneously, to answer the when, where, and what of activity prediction. …

ARMA Point Process
We introduce the ARMA (autoregressive-moving-average) point process, which is a Hawkes process driven by a Neyman-Scott process with Poisson immigration. It contains both the Hawkes and Neyman-Scott process as special cases and naturally combines self-exciting and shot-noise cluster mechanisms, useful in a variety of applications. The name ARMA is used because the ARMA point process is an appropriate analogue of the ARMA time series model for integer-valued series. As such, the ARMA point process framework accommodates a flexible family of models sharing methodological and mathematical similarities with ARMA time series. We derive an estimation procedure for ARMA point processes, as well as the integer ARMA models, based on an MCEM (Monte Carlo Expectation Maximization) algorithm. This powerful framework for estimation accommodates trends in immigration, multiple parametric specifications of excitement functions, as well as cases where marks and immigrants are not observed. …

Explanatory Graph
This paper introduces a graphical model, namely an explanatory graph, which reveals the knowledge hierarchy hidden inside conv-layers of a pre-trained CNN. Each filter in a conv-layer of a CNN for object classification usually represents a mixture of object parts. We develop a simple yet effective method to disentangle object-part pattern components from each filter. We construct an explanatory graph to organize the mined part patterns, where a node represents a part pattern, and each edge encodes co-activation relationships and spatial relationships between patterns. More crucially, given a pre-trained CNN, the explanatory graph is learned without a need of annotating object parts. Experiments show that each graph node consistently represented the same object part through different images, which boosted the transferability of CNN features. We transferred part patterns in the explanatory graph to the task of part localization, and our method significantly outperformed other approaches. …

Document worth reading: “Deep learning in bioinformatics: introduction, application, and perspective in big data era”

Deep learning, which is especially formidable in handling big data, has achieved great success in various fields, including bioinformatics. With the advances of the big data era in biology, it is foreseeable that deep learning will become increasingly important in the field and will be incorporated in vast majorities of analysis pipelines. In this review, we provide both the exoteric introduction of deep learning, and concrete examples and implementations of its representative applications in bioinformatics. We start from the recent achievements of deep learning in the bioinformatics field, pointing out the problems which are suitable to use deep learning. After that, we introduce deep learning in an easy-to-understand fashion, from shallow neural networks to legendary convolutional neural networks, legendary recurrent neural networks, graph neural networks, generative adversarial networks, variational autoencoder, and the most recent state-of-the-art architectures. After that, we provide eight examples, covering five bioinformatics research directions and all the four kinds of data type, with the implementation written in Tensorflow and Keras. Finally, we discuss the common issues, such as overfitting and interpretability, that users will encounter when adopting deep learning methods and provide corresponding suggestions. The implementations are freely available at \url{}. Deep learning in bioinformatics: introduction, application, and perspective in big data era