Magister Dixit

“The mechanical process by which data scientists and citizen data scientists make better use of data and analytics is underpinned by a deeper question about the organization as a whole: Does it have processes for sharing anything? This is not always a given in companies that have grown quickly, have grown through mergers and acquisitions or have begun to shrink. If the culture has never embraced or fostered the notion of transparency and sharing, then whatever process the company may put in place to use software to publish analytical models and the data they harvest is unlikely to succeed.” TIBCO ( 2017 )


R Packages worth a look

Bi-Level Selection of Conditional Main Effects (cmenet)
Provides functions for implementing cmenet – a bi-level variable selection method for conditional main effects (see Mak and Wu (2018) <doi:10.1080/0 …

Instrumental Variable Analysis in Case-Control Association Studies (iva)
Mendelian randomization (MR) analysis is a special case of instrumental variable analysis with genetic instruments. It is used to estimate the unconfou …

Segment Data With Maximum Likelihood (segmentr)
Given a likelihood provided by the user, this package applies it to a given matrix dataset in order to find change points in the data that maximize the …

Distilled News

Building Interactive Histograms with Bokeh

You are probably familiar with Matplotlib and Seaborn, two excellent (and highly related) Python plotting libraries. The purpose of this article is to get you started with Bokeh if you are not yet familiar with it. You will learn how to write a custom Python class to simplify plotting interactive histograms with Bokeh.

Seamlessly Integrated Deep Learning Environment with Terraform, Google cloud, Gitlab and Docker

When you are starting with some serious deep learning projects, you usually have the problem that you need a proper GPU. Buying reasonable workstations which are suitable for deep learning workloads can easily become very expensive. Luckily there are some options in the cloud. One that I tried out was using the wonderful Google Compute Engine. GPUs are available in the GCE as external accelerators of an instance. Currently, there are these GPUs available (prices for us-central1).

Implementing a ResNet model from scratch.

When implementing the ResNet architecture in a deep learning project I was working on, it was a huge leap from the basic, simple convolutional neural networks I was used to. One prominent feature of ResNet is that it utilizes a micro-architecture within it’s larger macroarchitecture: residual blocks! I decided to look into the model myself to gain a better understanding of it, as well as look into why it was so successful at ILSVRC. I implemented the exact same ResNet model class in Deep Learning for Computer Vision with Python by Dr. Adrian Rosebrock , which followed the ResNet model from the 2015 ResNet academic publication, Deep Residual Learning for Image Recognition by He et al. .

The Tentpoles of Data Science

When I ask myself the question ‘What is data science?’ I tend to think of the following five components. Data science is
• the application of design thinking to data problems;
• the creation and management of workflows for transforming and processing data;
• the negotiation of human relationships to identify context, allocate resources, and characterize audiences for data analysis products;
• the application of statistical methods to quantify evidence; and
• the transformation of data analytic information into coherent narratives and stories

AI Policy 101: An Introduction to the 10 Key Aspects of AI Policy

What in the world is AI policy? First, a definition: AI policy is defined as public policies that maximize the benefits of AI, while minimizing its potential costs and risks. From this perspective, the purpose of AI policy is two-fold. On the one hand, governments should invest in the development and adoption of AI to secure its many benefits for the economy and society. Governments can do this by investing in fundamental and applied research, the development of specialized AI and ‘AI + X’ talent, digital infrastructure and related technologies, and programs to help the private and public sectors adopt and apply new AI technologies. On the other hand, governments need to also respond to the economic and societal challenges brought on by advances in AI. Automation, algorithmic bias, data exploitation, and income inequality are just a few of the many challenges that governments around the world need to develop policy solutions for. These policies include investments into skills development, the creation of new regulations and standards, and targeted efforts to remove bias from AI algorithms and data sets.

A Common Data Science Mistake: Prediction/Recommendation by Manipulating Model Inputs

We trained a machine learning model with high performance. However, it did not work and was not useful in practice.’ I have heard this sentence several times, and each time I was eager to find out the reason. There could be different reasons that a model failed to work in practice. As these issues are not usually addressed in data science courses, in this article I address one of the common mistakes in designing and deploying a machine learning model. In the rest of this article, first, I will discuss the confusion between Correlation and Causation that leads to the misuse of machine learning models. I will illustrate the discussion with an example. After that, different possibilities between inputs and outputs of the model are shown. Finally, I provide some suggestions to avoid this mistake.

Gini Regressions and Heteroskedasticity

We propose an Aitken estimator for Gini regression. The suggested A-Gini estimator is proven to be a U-statistics. Monte Carlo simulations are provided to deal with heteroskedasticity and to make some comparisons between the generalized least squares and the Gini regression. A Gini-White test is proposed and shows that a better power is obtained compared with the usual White test when outlying observations contaminate the data

How to Monitor Machine Learning Models in Real-Time

We present practical methods for near real-time monitoring of machine learning systems which detect system-level or model-level faults and can see when the world changes.

Everything You Need to Know About Decision Trees

Tree-based methods can be used for regression or classification. They involve segmenting the prediction space into a number of simple regions. The set of splitting rules can be summarized in a tree, hence the name decision tree methods. A single decision tree is often not as performant as linear regression, logistic regression, LDA, etc. However, by introducing bagging, random forests, and boosting, it can result in dramatic improvements in prediction accuracy at the expense of some loss in interpretation. In this post, we introduce everything you need to know about decision trees, bagging, random forests, and boosting. It will be a long read, but it will be worth it!

Combining supervised learning and unsupervised learning to improve word vectors

To achieve state-of-the-art result in NLP tasks, researchers try tremendous way to let machine understand language and solving downstream tasks such as textual entailment, semantic classification. OpenAI released a new model which named as Generative Pre-Training (GPT). After reading this article, you will understand:
• Finetuned Transformer LM Design
• Architecture
• Experiments
• Implementation
• Take Away

Prediction task with Multivariate TimeSeries and VAR model.

Time Series data is can be confusing, but very interesting to explore. The reason this sort of data grabbed my attention is that it can be found in almost every business (sales, deliveries, weather conditions etc.). For instance: using Google BigQuery ho

Ethics Commission Automated and Connected Driving

Throughout the world, mobility is becoming increasingly shaped by the digital revolution. The ‘automation’ of private transport operating in the public road environment is taken to mean technological driving aids that relieve the pressure on drivers, assist or even replace them in part or in whole. The partial automation of driving is already standard equipment in new vehicles. Conditionally and highly automated systems which, without human intervention, can autonomously change lanes, brake and steer are available or about to go into mass production. In both Germany and the US, there are test tracks on which conditionally automated vehicles can operate. For local public transport, driverless robot taxis or buses are being developed and trialled. Today, processors are already available or are being developed that are able, by means of appropriate sensors, to detect in real time the traffic situation in the immediate surroundings of a car, determine the car’s own position on appropriate mapping material and dynamically plan and modify the car’s route and adapt it to the traffic conditions. As the ‘perception’ of the vehicle’s surroundings becomes increasingly perfected, there is likely to be an ever better differentiation of road users, obstacles and hazardous situations. This makes it likely that it will be possible to significantly enhance road safety. Indeed, it cannot be ruled out that, at the end of this development, there will be motor vehicles that are inherently safe, in other words will never be involved in an accident under any circumstances. Nevertheless, at the level of what is technologically possible today, and given the realities of heterogeneous and nonconnected road traffic, it will not be possible to prevent accidents completely. This makes it essential that decisions be taken when programming the software of conditionally and highly automated driving systems. The technological developments are forcing government and society to reflect on the emerging changes. The decision that has to be taken is whether the licensing of automated driving systems is ethically justifiable or possibly even imperative. If these systems are licensed – and it is already apparent that this is happening at international level – everything hinges on the conditions in which they are used and the way in which they are designed. At the fundamental level, it all comes down to the following questions. How much dependence on technologically complex systems – which in the future will be based on artificial intelligence, possibly with machine learning capabilities – are we willing to accept in order to achieve, in return, more safety, mobility and convenience? What precautions need to be taken to ensure controllability, transparency and data autonomy? What technological development guidelines are required to ensure that we do not blur the contours of a human society that places individuals, their freedom of development, their physical and intellectual integrity and their entitlement to social respect at the heart of its legal regime?

An Evaluation of Early Warning Models for Systemic Banking Crises: Does Machine Learning Improve Predictions?

This paper compares the out-of-sample predictive performance of different early warning models for systemic banking crises using a sample of advanced economies covering the past 45 years. We compare a benchmark logit approach to several machine learning approaches recently proposed in the literature. We find that while machine learning methods often attain a very high in-sample fit, they are outperformed by the logit approach in recursive out-of-sample evaluations. This result is robust to the choice of performance measure, crisis definition, preference parameter, and sample length, as well as to using different sets of variables and data transformations. Thus, our paper suggests that further enhancements to machine learning early warning models are needed before they are able to offer a substantial value-added for predicting systemic banking crises. Conventional logit models appear to use the available information already fairly effciently, and would for instance have been able to predict the 2007/2008 financial crisis out-of-sample for many countries. In line with economic intuition, these models identify credit expansions, asset price booms and external imbalances as key predictors of systemic banking crises.

Book Memo: “Advanced Time Series Data Analysis”

Forecasting Using EViews
Introduces the latest developments in forecasting in advanced quantitative data analysis This book presents advanced univariate multiple regressions, which can directly be used to forecast their dependent variables, evaluate their in-sample forecast values, and compute forecast values beyond the sample period. Various alternative multiple regressions models are presented based on a single time series, bivariate, and triple time-series, which are developed by taking into account specific growth patterns of each dependent variables, starting with the simplest model up to the most advanced model. Graphs of the observed scores and the forecast evaluation of each of the models are offered to show the worst and the best forecast models among each set of the models of a specific independent variable. Advanced Time Series Data Analysis: Forecasting Using EViews provides readers with a number of modern, advanced forecast models not featured in any other book. They include various interaction models, models with alternative trends (including the models with heterogeneous trends), and complete heterogeneous models for monthly time series, quarterly time series, and annually time series. Each of the models can be applied by all quantitative researchers.

R Packages worth a look

An R-Shiny Application for Creating Visual Abstracts (abstractr)
An R-Shiny application to create visual abstracts for original research. A variety of user defined options and formatting are included.

Bayesian Reliability Estimation (Bayesrel)
So far, it provides the most common single test reliability estimates, being: Coefficient Alpha, Guttman’s lambda-2/-4/-6, greatest lower bound and Mcd …

Inferring Causal Effects on Collective Outcomes under Interference (netchain)
In networks, treatments may spill over from the treated individual to his or her social contacts and outcomes may be contagious over time. Under this s …

Whats new on arXiv

A Multilevel Approach for the Performance Analysis of Parallel Algorithms

We provide a multilevel approach for analysing performances of parallel algorithms. The main outcome of such approach is that the algorithm is described by using a set of operators which are related to each other according to the problem decomposition. Decomposition level determines the granularity of the algorithm. A set of block matrices (decomposition and execution) highlights fundamental characteristics of the algorithm, such as inherent parallelism and sources of overheads.

Exploring Communities in Large Profiled Graphs

Given a graph G and a vertex q\in G, the community search (CS) problem aims to efficiently find a subgraph of G whose vertices are closely related to q. Communities are prevalent in social and biological networks, and can be used in product advertisement and social event recommendation. In this paper, we study profiled community search (PCS), where CS is performed on a profiled graph. This is a graph in which each vertex has labels arranged in a hierarchical manner. Extensive experiments show that PCS can identify communities with themes that are common to their vertices, and is more effective than existing CS approaches. As a naive solution for PCS is highly expensive, we have also developed a tree index, which facilitate efficient and online solutions for PCS.

Supplementary Notes: Segment Parameter Labelling in MCMC Change Detection

This work addresses the problem of segmentation in time series data with respect to a statistical parameter of interest in Bayesian models. It is common to assume that the parameters are distinct within each segment. As such, many Bayesian change point detection models do not exploit the segment parameter patterns, which can improve performance. This work proposes a Bayesian change point detection algorithm that makes use of repetition in segment parameters, by introducing segment class labels that utilise a Dirichlet process prior.

Multi-Agent Pathfinding (MAPF) with Continuous Time

MAPF is the problem of finding paths for multiple agents such that every agent reaches its goal and the agents do not collide. Most prior work on MAPF were on grid, assumed all actions cost the same, agents do not have a volume, and considered discrete time steps. In this work we propose a MAPF algorithm that do not assume any of these assumptions, is complete, and provides provably optimal solutions. This algorithm is based on a novel combination of SIPP, a continuous time single agent planning algorithms, and CBS, a state of the art multi-agent pathfinding algorithm. We analyze this algorithm, discuss its pros and cons, and evaluate it experimentally on several standard benchmarks.

The information-theoretic value of unlabeled data in semi-supervised learning

We quantify the separation between the numbers of labeled examples required to learn in two settings: Settings with and without the knowledge of the distribution of the unlabeled data. More specifically, we prove a separation by \Theta(\log n) multiplicative factor for the class of projections over the Boolean hypercube of dimension n. We prove that there is no separation for the class of all functions on domain of any size. Learning with the knowledge of the distribution (a.k.a. fixed-distribution learning) can be viewed as an idealized scenario of semi-supervised learning where the number of unlabeled data points is so great that the unlabeled distribution is known exactly. For this reason, we call the separation the value of unlabeled data.

Trends in Demand, Growth, and Breadth in Scientific Computing Training Delivered by a High-Performance Computing Center

We analyze the changes in the training and educational efforts of the SciNet HPC Consortium, a Canadian academic High Performance Computing center, in the areas of Scientific Computing and High-Performance Computing, over the last six years. Initially, SciNet offered isolated training events on how to use HPC systems and write parallel code, but the training program now consists of a broad range of workshops and courses that users can take toward certificates in scientific computing, data science, or high-performance computing. Using data on enrollment, attendence, and certificate numbers from SciNet’s education website, used by almost 1800 users so far, we extract trends on the growth, demand, and breadth of SciNet’s training program. Among the results are a steady overall growth, a sharp and steady increase in the demand for data science training, and a wider participation of ‘non-traditional’ computing disciplines, which has motivated an increasingly broad spectrum of training offerings. Of interest is also that many of the training initiatives have evolved into courses that can be taken as part of the graduate curriculum at the University of Toronto.

Artificial Neural Networks

These are lecture notes for my course on Artificial Neural Networks that I have given at Chalmers (FFR135) and Gothenburg University (FIM720). This course describes the use of neural networks in machine learning: deep learning, recurrent networks, and other supervised and unsupervised machine-learning algorithms.

Ensemble Kalman Methods With Constraints

Ensemble Kalman methods constitute an increasingly important tool in both state and parameter estimation problems. Their popularity stems from the derivative-free nature of the methodology which may be readily applied when computer code is available for the underlying state-space dynamics (for state estimation) or for the parameter-to-observable map (for parameter estimation). There are many applications in which it is desirable to enforce prior information in the form of equality or inequality constraints on the state or parameter. This paper establishes a general framework for doing so, describing a widely applicable methodology, a theory which justifies the methodology, and a set of numerical experiments exemplifying it.

Evolving embodied intelligence from materials to machines

Natural lifeforms specialise to their environmental niches across many levels; from low-level features such as DNA and proteins, through to higher-level artefacts including eyes, limbs, and overarching body plans. We propose Multi-Level Evolution (MLE), a bottom-up automatic process that designs robots across multiple levels and niches them to tasks and environmental conditions. MLE concurrently explores constituent molecular and material ‘building blocks’, as well as their possible assemblies into specialised morphological and sensorimotor configurations. MLE provides a route to fully harness a recent explosion in available candidate materials and ongoing advances in rapid manufacturing processes. We outline a feasible MLE architecture that realises this vision, highlight the main roadblocks and how they may be overcome, and show robotic applications to which MLE is particularly suited. By forming a research agenda to stimulate discussion between researchers in related fields, we hope to inspire the pursuit of multi-level robotic design all the way from material to machine.

Efficient Matrix Profile Computation Using Different Distance Functions

Matrix profile has been recently proposed as a promising technique to the problem of all-pairs-similarity search on time series. Efficient algorithms have been proposed for computing it, e.g., STAMP, STOMP and SCRIMP++. All these algorithms use the z-normalized Euclidean distance to measure the distance between subsequences. However, as we observed, for some datasets other Euclidean measurements are more useful for knowledge discovery from time series. In this paper, we propose efficient algorithms for computing matrix profile for a general class of Euclidean distances. We first propose a simple but efficient algorithm called AAMP for computing matrix profile with the ‘pure’ (non-normalized) Euclidean distance. Then, we extend our algorithm for the p-norm distance. We also propose an algorithm, called ACAMP, that uses the same principle as AAMP, but for the case of z-normalized Euclidean distance. We implemented our algorithms, and evaluated their performance through experimentation. The experiments show excellent performance results. For example, they show that AAMP is very efficient for computing matrix profile for non-normalized Euclidean distances. The results also show that the ACAMP algorithm is significantly faster than SCRIMP++ (the state of the art matrix profile algorithm) for the case of z-normalized Euclidean distance.

AI Coding: Learning to Construct Error Correction Codes

In this paper, we investigate an artificial-intelligence (AI) driven approach to design error correction codes (ECC). Classic error correction code was designed upon coding theory that typically defines code properties (e.g., hamming distance, subchannel reliability, etc.) to reflect code performance. Its code design is to optimize code properties. However, an AI-driven approach doesn’t necessarily rely on coding theory any longer. Specifically, we propose a constructor-evaluator framework, in which the code constructor is realized by AI algorithms and the code evaluator provides code performance metric measurements. The code constructor keeps improving the code construction to maximize code performance that is evaluated by the code evaluator. As examples, we construct linear block codes and polar codes with reinforcement learning (RL) and evolutionary algorithms. The results show that comparable code performance can be achieved with respect to the existing codes. It is noteworthy that our method can provide superior performances where existing classic constructions fail to achieve optimum for a specific decoder (e.g., list decoding for polar codes).

SAFE: Scale Aware Feature Encoder for Scene Text Recognition

In this paper, we address the problem of having characters with different scales in scene text recognition. We propose a novel scale aware feature encoder (SAFE) that is designed specifically for encoding characters with different scales. SAFE is composed of a multi-scale convolutional encoder and a scale attention network. The multi-scale convolutional encoder targets at extracting character features under multiple scales, and the scale attention network is responsible for selecting features from the most relevant scale(s). SAFE has two main advantages over the traditional single-CNN encoder used in current state-of-the-art text recognizers. First, it explicitly tackles the scale problem by extracting scale-invariant features from the characters. This allows the recognizer to put more effort in handling other challenges in scene text recognition, like those caused by view distortion and poor image quality. Second, it can transfer the learning of feature encoding across different character scales. This is particularly important when the training set has a very unbalanced distribution of character scales, as training with such a dataset will make the encoder biased towards extracting features from the predominant scale. To evaluate the effectiveness of SAFE, we design a simple text recognizer named scale-spatial attention network (S-SAN) that employs SAFE as its feature encoder, and carry out experiments on six public benchmarks. Experimental results demonstrate that S-SAN can achieve state-of-the-art (or, in some cases, extremely competitive) performance without any post-processing.

Conditional Optimal Stopping: A Time-Inconsistent Optimization

Inspired by recent work of P.-L. Lions on conditional optimal control, we introduce a problem of optimal stopping under bounded rationality: the objective is the expected payoff at the time of stopping, conditioned on another event. For instance, an agent may care only about states where she is still alive at the time of stopping, or a company may condition on not being bankrupt. We observe that conditional optimization is time-inconsistent due to the dynamic change of the conditioning probability and develop an equilibrium approach in the spirit of R. H. Strotz’ work for sophisticated agents in discrete time. Equilibria are found to be essentially unique in the case of a finite time horizon whereas an infinite horizon gives rise to non-uniqueness and other interesting phenomena. We also introduce a theory which generalizes the classical Snell envelope approach for optimal stopping by considering a pair of processes with Snell-type properties.

AuxNet: Auxiliary tasks enhanced Semantic Segmentation for Automated Driving

Decision making in automated driving is highly specific to the environment and thus semantic segmentation plays a key role in recognizing the objects in the environment around the car. Pixel level classification once considered a challenging task which is now becoming mature to be productized in a car. However, semantic annotation is time consuming and quite expensive. Synthetic datasets with domain adaptation techniques have been used to alleviate the lack of large annotated datasets. In this work, we explore an alternate approach of leveraging the annotations of other tasks to improve semantic segmentation. Recently, multi-task learning became a popular paradigm in automated driving which demonstrates joint learning of multiple tasks improves overall performance of each tasks. Motivated by this, we use auxiliary tasks like depth estimation to improve the performance of semantic segmentation task. We propose adaptive task loss weighting techniques to address scale issues in multi-task loss functions which become more crucial in auxiliary tasks. We experimented on automotive datasets including SYNTHIA and KITTI and obtained 3% and 5% improvement in accuracy respectively.

EAT-NAS: Elastic Architecture Transfer for Accelerating Large-scale Neural Architecture Search

Neural architecture search (NAS) methods have been proposed to release human experts from tedious architecture engineering. However, most current methods are constrained in small-scale search due to the issue of computational resources. Meanwhile, directly applying architectures searched on small datasets to large-scale tasks often bears no performance guarantee. This limitation impedes the wide use of NAS on large-scale tasks. To overcome this obstacle, we propose an elastic architecture transfer mechanism for accelerating large-scale neural architecture search (EAT-NAS). In our implementations, architectures are first searched on a small dataset (the width and depth of architectures are taken into consideration as well), e.g., CIFAR-10, and the best is chosen as the basic architecture. Then the whole architecture is transferred with elasticity. We accelerate the search process on a large-scale dataset, e.g., the whole ImageNet dataset, with the help of the basic architecture. What we propose is not only a NAS method but a mechanism for architecture-level transfer. In our experiments, we obtain two final models EATNet-A and EATNet-B that achieve competitive accuracies, 73.8% and 73.7% on ImageNet, respectively, which also surpass the models searched from scratch on ImageNet under the same settings. For computational cost, EAT-NAS takes only less than 5 days on 8 TITAN X GPUs, which is significantly less than the computational consumption of the state-of-the-art large-scale NAS methods.

A Learning Framework for An Accurate Prediction of Rainfall Rates

The present work is aimed to examine the potential of advanced machine learning strategies to predict the monthly rainfall (precipitation) for the Indus Basin, using climatological variables such as air temperature, geo-potential height, relative humidity and elevation. In this work, the focus is on thirteen geographical locations, called index points, within the basin. Arguably, not all of the hydrological components are relevant to the precipitation rate, and therefore, need to be filtered out, leading to a lower-dimensional feature space. Towards this goal, we adopted the gradient boosting method to extract the most contributive features for precipitation rate prediction. Five state-of-the-art machine learning methods have then been trained where pearson correlation coefficient and mean absolute error have been reported as the prediction performance criteria. The Random Forest regression model outperformed the other regression models achieving the maximum pearson correlation coefficient and minimum mean absolute error for most of the index points. Our results suggest the relative humidity (for pressure levels of 300 mb and 150 mb, respectively), the u-direction wind (for pressure level of 700 mb), air temperature (for pressure levels of 150 mb and 10 mb, respectively) as the top five influencing features for accurate forecasting the precipitation rate.

Applying SVGD to Bayesian Neural Networks for Cyclical Time-Series Prediction and Inference

A regression-based BNN model is proposed to predict spatiotemporal quantities like hourly rider demand with calibrated uncertainties. The main contributions of this paper are (i) A feed-forward deterministic neural network (DetNN) architecture that predicts cyclical time series data with sensitivity to anomalous forecasting events; (ii) A Bayesian framework applying SVGD to train large neural networks for such tasks, capable of producing time series predictions as well as measures of uncertainty surrounding the predictions. Experiments show that the proposed BNN reduces average estimation error by 10% across 8 U.S. cities compared to a fine-tuned multilayer perceptron (MLP), and 4% better than the same network architecture trained without SVGD.

Diverse mini-batch Active Learning

We study the problem of reducing the amount of labeled training data required to train supervised classification models. We approach it by leveraging Active Learning, through sequential selection of examples which benefit the model most. Selecting examples one by one is not practical for the amount of training examples required by the modern Deep Learning models. We consider the mini-batch Active Learning setting, where several examples are selected at once. We present an approach which takes into account both informativeness of the examples for the model, as well as the diversity of the examples in a mini-batch. By using the well studied K-means clustering algorithm, this approach scales better than the previously proposed approaches, and achieves comparable or better performance.

If you did not already know

TechKG google
Knowledge graph is a kind of valuable knowledge base which would benefit lots of AI-related applications. Up to now, lots of large-scale knowledge graphs have been built. However, most of them are non-Chinese and designed for general purpose. In this work, we introduce TechKG, a large scale Chinese knowledge graph that is technology-oriented. It is built automatically from massive technical papers that are published in Chinese academic journals of different research domains. Some carefully designed heuristic rules are used to extract high quality entities and relations. Totally, it comprises of over 260 million triplets that are built upon more than 52 million entities which come from 38 research domains. Our preliminary ex-periments indicate that TechKG has high adaptability and can be used as a dataset for many diverse AI-related applications. We released TechKG at:

Vertex-Diminished Random Walk (VDRW) google
Imbalanced data widely exists in many high-impact applications. An example is in air traffic control, where we aim to identify the leading indicators for each type of accident cause from historical records. Among all three types of accident causes, historical records with ‘personnel issues’ are much more than the other two types (‘aircraft issues’ and ‘environmental issues’) combined. Thus, the resulting dataset is highly imbalanced, and can be naturally modeled as a network. Up until now, most existing work on imbalanced data analysis focused on the classification setting, and very little is devoted to learning the node representation from imbalanced networks. To address this problem, in this paper, we propose Vertex-Diminished Random Walk (VDRW) for imbalanced network analysis. The key idea is to encourage the random particle to walk within the same class by adjusting the transition probabilities each step. It resembles the existing Vertex Reinforced Random Walk in terms of the dynamic nature of the transition probabilities, as well as some convergence properties. However, it is more suitable for analyzing imbalanced networks as it leads to more separable node representations in the embedding space. Then, based on VDRW, we propose a semi-supervised network representation learning framework named ImVerde for imbalanced networks, in which context sampling uses VDRW and the label information to create node-context pairs, and balanced-batch sampling adopts a simple under-sampling method to balance these pairs in different classes. Experimental results demonstrate that ImVerde based on VDRW outperforms state-of-the-art algorithms for learning network representation from imbalanced data. …

Caffe google
Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley. Caffe is released under the BSD 2-Clause license.

Document worth reading: “A Mathematical Theory of Interpersonal Interactions and Group Behavior”

Emergent collective group processes and capabilities have been studied through analysis of transactive memory, measures of group task performance, and group intelligence, among others. In their approach to collective behaviors, these approaches transcend traditional studies of group decision making that focus on how individual preferences combine through power relationships, social choice by voting, negotiation and game theory. Understanding more generally how individuals contribute to group effectiveness is important to a broad set of social challenges. Here we formalize a dynamic theory of interpersonal communications that classifies individual acts, sequences of actions, group behavioral patterns, and individuals engaged in group decision making. Group decision making occurs through a sequence of communications that convey personal attitudes and preferences among members of the group. The resulting formalism is relevant to psychosocial behavior analysis, rules of order, organizational structures and personality types, as well as formalized systems such as social choice theory. More centrally, it provides a framework for quantifying and even anticipating the structure of informal dialog, allowing specific conversations to be coded and analyzed in relation to a quantitative model of the participating individuals and the parameters that govern their interactions. A Mathematical Theory of Interpersonal Interactions and Group Behavior

R Packages worth a look

Fourier Transform Textural Ordination (foto)
The Fourier Transform Textural Ordination method uses a principal component analysis on radially averaged two dimensional Fourier spectra to characteri …

Calculation of the Integrated Flow of Particles Between Polygons (RCALI)
Calculate the flow of particles between polygons by two integration methods: integration by a cubature method and integration on a grid of points. Anni …

Comparison of Phylogenetic Trees Using Quartet and Bipartition Measures (Quartet)
Calculates the number of four-taxon subtrees consistent with a pair of cladograms, calculating the symmetric quartet distance of Bandelt & Dress (1 …