A Case Study on the Impact of Similarity Measure on Information Retrieval based Software Engineering Tasks

Information Retrieval (IR) plays a pivotal role in diverse Software Engineering (SE) tasks, e.g., bug localization and triaging, code retrieval, requirements analysis, etc. The choice of similarity measure is the core component of an IR technique. The performance of any IR method critically depends on selecting an appropriate similarity measure for the given application domain. Since different SE tasks operate on different document types like bug reports, software descriptions, source code, etc. that often contain non-standard domain-specific vocabulary, it is essential to understand which similarity measures work best for different SE documents. This paper presents two case studies on the effect of different similarity measures on various SE documents w.r.t. two tasks: (i) project recommendation: finding similar GitHub projects and (ii) bug localization: retrieving the buggy source file(s) corresponding to a bug report. These tasks contain a diverse combination of textual (i.e., description, readme) and code (i.e., source code, API, import package) artifacts. We observe that the performance of IR models varies when applied to different artifact types. We find that, in general, the context-aware models achieve better performance on textual artifacts. In contrast, simple keyword-based bag-of-words models perform better on code artifacts. On the other hand, the probabilistic ranking model BM25 performs better on a mixture of text and code artifacts. We further investigate how such an informed choice of similarity measure impacts the performance of SE tools. In particular, we analyze two previously proposed tools for project recommendation and bug localization tasks, which leverage diverse software artifacts, and observe that an informed choice of similarity measure indeed leads to improved performance of the existing SE tools.
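
To make the contrast between a keyword-based bag-of-words measure and BM25 concrete, the following is a minimal sketch, not the study's implementation. The whitespace tokenizer, the function names, and the parameter defaults k1 = 1.2 and b = 0.75 are assumptions chosen for illustration; the IDF term uses the common non-negative smoothed variant.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer. Real SE artifacts
    # would likely need identifier splitting, stemming, etc.
    return text.lower().split()


def cosine_bow(query, doc):
    """Cosine similarity between raw term-count (bag-of-words) vectors."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """BM25 score of the query against each document in a non-empty corpus."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            idf = math.log((n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * length_norm)
        scores.append(score)
    return scores
```

Unlike the cosine measure, BM25 down-weights terms that occur in many documents and normalizes for document length, which is one plausible reason it behaves differently on mixed text and code artifacts.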

(Sequential) Importance Sampling Bandits

The multi-armed bandit (MAB) problem is a sequential allocation task in which the goal is to learn a policy that maximizes long-term payoff while only the reward of the executed action is observed; i.e., sequential optimal decisions are made, while simultaneously learning how the world operates. In the stochastic setting, the reward for each action is generated from an unknown distribution. To decide the next optimal action to take, one must compute sufficient statistics of this unknown reward distribution, e.g., upper-confidence bounds (UCB), or expectations in Thompson sampling. Closed-form expressions for these statistics of interest are analytically intractable except for simple cases. We here propose to leverage Monte Carlo estimation and, in particular, the flexibility of (sequential) importance sampling (IS) to allow for accurate estimation of the statistics of interest within the MAB problem. IS methods estimate posterior densities or expectations in probabilistic models that are analytically intractable. We first show how IS can be combined with state-of-the-art MAB algorithms (Thompson sampling and Bayes-UCB) for classic (Bernoulli and contextual linear-Gaussian) bandit problems. Furthermore, we leverage the power of sequential IS to extend the applicability of these algorithms beyond the classic settings, and tackle additional useful cases. Specifically, we study the dynamic linear-Gaussian bandit, and both the static and dynamic logistic cases. The flexibility of (sequential) importance sampling is shown to be fundamental for obtaining efficient estimates of the key sufficient statistics in these challenging scenarios.
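
As an illustration of how importance sampling can stand in for a closed-form posterior, the following is a minimal sketch for the Bernoulli bandit only, not the authors' algorithm. It assumes a Uniform(0, 1) prior that doubles as the proposal, so each importance weight is simply the Bernoulli likelihood of the observed counts; all function names are hypothetical.

```python
import random


def is_posterior(successes, failures, n_samples=2000, rng=random):
    """Weighted-particle approximation of an arm's posterior success rate.

    Assumption: particles are drawn from a Uniform(0, 1) prior used as the
    proposal, so the unnormalized weight is the Bernoulli likelihood.
    For very large counts the weights may underflow; a log-domain version
    would be needed in that case.
    """
    thetas = [rng.random() for _ in range(n_samples)]
    weights = [t ** successes * (1.0 - t) ** failures for t in thetas]
    total = sum(weights)
    return thetas, [w / total for w in weights]


def posterior_mean(thetas, weights):
    """Self-normalized importance sampling estimate of E[theta | data]."""
    return sum(t * w for t, w in zip(thetas, weights))


def thompson_choose(counts, n_samples=2000, rng=random):
    """Thompson sampling step: draw one theta per arm from its particle
    approximation and play the arm with the largest draw.

    counts is a list of (successes, failures) pairs, one per arm.
    """
    draws = []
    for successes, failures in counts:
        thetas, weights = is_posterior(successes, failures, n_samples, rng)
        draws.append(rng.choices(thetas, weights=weights, k=1)[0])
    return max(range(len(draws)), key=draws.__getitem__)
```

In the Bernoulli case the exact posterior is a Beta distribution, so the estimate can be checked against the known mean (s + 1) / (s + f + 2); the value of the weighted-particle form is that the same recipe carries over to models where no such closed form exists.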

Feature Dimensionality Reduction for Video Affect Classification: A Comparative Study

Affective computing has become a very important research area in human-machine interaction. However, affects are subjective, subtle, and uncertain. So, it is very difficult to obtain a large number of labeled training samples, compared with the number of possible features we could extract. Thus, dimensionality reduction is critical in affective computing. This paper presents our preliminary study on dimensionality reduction for affect classification. Five popular dimensionality reduction approaches are introduced and compared. Experiments on the DEAP dataset showed that no approach can universally outperform others, and performing classification using the raw features directly may not always be a bad choice.

Sentimental Content Analysis and Knowledge Extraction from News Articles

In the web era, since technology has revolutionized human life, plenty of data and information are published on the Internet each day. For instance, news agencies all over the world publish news on their websites. These raw data could be an important resource for knowledge extraction. These shared data contain emotions (i.e., positive, neutral or negative) toward various topics; therefore, sentimental content extraction could be a beneficial task in many respects. Extracting the sentiment of news reveals highly valuable information about events over a period of time and about the viewpoint of a media outlet or news agency toward these events. In this paper, we propose an approach for analyzing news and extracting useful knowledge from it. First, we attempt to extract a noise-robust sentiment of news documents; to this end, news associated with six countries (United States, United Kingdom, Germany, Canada, France and Australia) in five news categories (Politics, Sports, Business, Entertainment and Technology) is downloaded. We compare the condition of different countries in each of the five news topics based on the extracted sentiments and emotional contents in news documents. Moreover, we propose an approach to reduce the bulky news data in order to extract the hottest topics and news titles as knowledge. Finally, we generate a word model that maps each word to a fixed-size vector using Word2Vec in order to understand the relations between words in our collected news database.

On feature selection and evaluation of transportation mode prediction strategies

Transportation mode prediction is a fundamental task for decision making in smart cities and traffic management systems. Traffic policies designed based on trajectory mining can save money and time for authorities and the public. They may reduce fuel consumption and commute time and, moreover, may provide more pleasant moments for residents and tourists. Since the number of features that may be used to predict a user's transportation mode can be substantial, finding a subset of features that maximizes a performance measure is worth investigating. In this work, we explore wrapper and information retrieval methods to find the best subset of trajectory features. After finding the best classifier and the best feature subset, we compared our results with two related papers that applied deep learning methods, and the results showed that our framework achieved better performance. Furthermore, two types of cross-validation approaches were investigated, and the performance results show that the random cross-validation method provides optimistic results.

Change Point Estimation in Panel Data with Time-Varying Individual Effects

This paper proposes a method for estimating multiple change points in panel data models with unobserved individual effects via ordinary least-squares (OLS). Typically, in this setting, the OLS slope estimators are inconsistent due to the unobserved individual effects bias. As a consequence, existing methods remove the individual effects before change point estimation through data transformations such as first-differencing. We prove that under reasonable assumptions, the unobserved individual effects bias has no impact on the consistent estimation of change points. Our simulations show that since our method does not remove any variation in the dataset before change point estimation, it performs better in small samples compared to first-differencing methods. We focus on short panels because they are commonly used in practice, and allow for the unobserved individual effects to vary over time. Our method is illustrated via two applications: the environmental Kuznets curve and the U.S. house price expectations after the financial crisis.
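
The idea of locating a break by least squares without first removing individual effects can be sketched as follows. This is a deliberately simplified illustration, not the paper's estimator: it assumes a single change point, one regressor, no intercept, and simulated individual effects that are time-invariant and independent of the regressor. All names and parameter values are hypothetical.

```python
import random


def pooled_ssr(xs, ys):
    """Sum of squared residuals of a pooled OLS fit y = b * x (no intercept)."""
    sxx = sum(x * x for x in xs)
    if sxx == 0:
        return sum(y * y for y in ys)
    b = sum(x * y for x, y in zip(xs, ys)) / sxx
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys))


def estimate_change_point(x, y):
    """Return the first period of the second regime.

    x and y are lists of N rows (units), each holding T period values.
    Every candidate break k splits the periods into [0, k) and [k, T);
    the estimate is the k with the smallest total pooled SSR.
    """
    n_periods = len(x[0])
    best_k, best_ssr = None, float("inf")
    for k in range(1, n_periods):
        pre_x = [row[t] for row in x for t in range(k)]
        pre_y = [row[t] for row in y for t in range(k)]
        post_x = [row[t] for row in x for t in range(k, n_periods)]
        post_y = [row[t] for row in y for t in range(k, n_periods)]
        ssr = pooled_ssr(pre_x, pre_y) + pooled_ssr(post_x, post_y)
        if ssr < best_ssr:
            best_k, best_ssr = k, ssr
    return best_k


def simulate_panel(n_units, n_periods, break_at, rng):
    """Simulate y_it = a_i + b_t * x_it + e_it with the slope jumping 1 -> 3."""
    x, y = [], []
    for _ in range(n_units):
        effect = rng.gauss(0, 1)  # unobserved individual effect a_i
        xi = [rng.gauss(0, 1) for _ in range(n_periods)]
        yi = [
            effect + (1.0 if t < break_at else 3.0) * xi[t] + rng.gauss(0, 0.5)
            for t in range(n_periods)
        ]
        x.append(xi)
        y.append(yi)
    return x, y
```

In this toy setting the individual effects are left in the data and simply inflate the residuals in every regime, yet the misfit caused by pooling two different slopes dominates, so the SSR is still minimized at the true break date.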

Training De-Confusion: An Interactive, Network-Supported Visual Analysis System for Resolving Errors in Image Classification Training Data

Convolutional neural networks are gaining popularity in image classification tasks since they are often even able to outperform human classifiers. While much research has been targeted towards network architecture optimization, the optimization of the labeled training data has not been explicitly targeted yet. Since labeling of training data is time-consuming, it is often performed by less experienced domain experts or even outsourced to online services. Unfortunately, this results in labeling errors, which directly impact the classification performance of the trained network. To overcome this problem, we propose an interactive visual analysis system that helps to spot and correct errors in the training dataset. For this purpose, we have identified instance interpretation errors, class interpretation errors and similarity errors as frequently occurring errors, which should be resolved to improve classification performance. After we detect these errors, users are guided towards them through a two-step visual analysis process, in which they can directly reassign labels to resolve the detected errors. With the proposed visual analysis system, the user has to inspect far fewer items to resolve labeling errors in the training dataset, and thus arrives at satisfying training results more quickly.

A Survey on Sentiment and Emotion Analysis for Computational Literary Studies

Emotions have often been a crucial part of compelling narratives: literature tells about people with goals, desires, passions, and intentions. In the past, classical literary studies usually scrutinized the affective dimension of literature within the framework of hermeneutics. However, with the emergence of the research field known as Digital Humanities (DH), some studies of emotions in a literary context have taken a computational turn. Given that DH is still being formed as a science, this direction of research can be regarded as relatively new. At the same time, research in sentiment analysis started in computational linguistics almost two decades ago and is nowadays an established field that has dedicated workshops and tracks in the main computational linguistics conferences. This leads us to the question: what are the commonalities and discrepancies between sentiment analysis research in computational linguistics and digital humanities? In this survey, we offer an overview of the existing body of research on sentiment and emotion analysis as applied to literature. We precede the main part of the survey with a short introduction to natural language processing and machine learning and to psychological models of emotions, and provide an overview of existing approaches to sentiment and emotion analysis in computational linguistics. The papers presented in this survey come directly from either DH or computational linguistics venues and are limited to sentiment and emotion analysis as applied to literary text.

OBOE: Collaborative Filtering for AutoML Initialization

Algorithm selection and hyperparameter tuning remain two of the most challenging tasks in machine learning. The number of machine learning applications is growing much faster than the number of machine learning experts, hence we see an increasing demand for efficient automation of learning processes. Here, we introduce OBOE, an algorithm for time-constrained model selection and hyperparameter tuning. Taking advantage of similarity between datasets, OBOE finds promising algorithm and hyperparameter configurations through collaborative filtering. Our system explores these models under time constraints, so that rapid initializations can be provided to warm-start more fine-grained optimization methods. One novel aspect of our approach is a new heuristic for active learning in time-constrained matrix completion based on optimal experiment design. Our experiments demonstrate that OBOE delivers state-of-the-art performance faster than competing approaches on a test bed of supervised learning problems.

Counterfactual Normalization: Proactively Addressing Dataset Shift and Improving Reliability Using Causal Mechanisms

Predictive models can fail to generalize from training to deployment environments because of dataset shift, posing a threat to model reliability and the safety of downstream decisions made in practice. Instead of using samples from the target distribution to reactively correct dataset shift, we use graphical knowledge of the causal mechanisms relating variables in a prediction problem to proactively remove relationships that do not generalize across environments, even when these relationships may depend on unobserved variables (violations of the ‘no unobserved confounders’ assumption). To accomplish this, we identify variables with unstable paths of statistical influence and remove them from the model. We also augment the causal graph with latent counterfactual variables that isolate unstable paths of statistical influence, allowing us to retain stable paths that would otherwise be removed. Our experiments demonstrate that models that remove vulnerable variables and use estimates of the latent variables transfer better, often outperforming in the target domain despite some accuracy loss in the training domain.

A Survey on the Theory of Bonds
Application of Bounded Total Variation Denoising in Urban Traffic Analysis
Mobility helps problem-solving systems to avoid Groupthink
A New Optimization Layer for Real-Time Bidding Advertising Campaigns
ARQ with Cumulative Feedback to Compensate for Burst Errors
Efficient Methods in Counting Generalized Necklaces
Suitable sets of permutations, packings of triples, and Ramsey’s theorem
Packing colouring of some classes of cubic graphs
Bounds for the diameter of the weight polytope
Strong Subgraph Connectivity of Digraphs: A Survey
Efficient Continuous Top-$k$ Geo-Image Search on Road Network
A Hybrid Dynamic-regenerative Damping Scheme for Energy Regeneration in Variable Impedance Actuators
Quantum generative adversarial learning in a superconducting quantum circuit
Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems
On Rich Clubs of Path-Based Centralities in Networks
Random Walk Laplacian and Network Centrality Measures
OCT segmentation: Integrating open parametric contour model of the retinal layers and shape constraint to the Mumford-Shah functional
Characterizing Co-located Datacenter Workloads: An Alibaba Case Study
Towards Massive Machine Type Communications in Ultra-Dense Cellular IoT Networks: Current Issues and Machine Learning-Assisted Solutions
Nonparametric Gaussian mixture models for the multi-armed contextual bandit
Towards Learning Fine-Grained Disentangled Representations from Speech
On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization
Distributed heavy-ball: A generalization and acceleration of first-order methods with gradient tracking
Low-complexity 8-point DCT Approximation Based on Angle Similarity for Image and Video Coding
Some Statistical Problems with High Dimensional Financial data
Dynamic Laplace: Efficient Centrality Measure for Weighted or Unweighted Evolving Networks
Exploiting Effective Representations for Chinese Sentiment Analysis Using a Multi-Channel Convolutional Neural Network
Optimal Solutions to Infinite-Player Stochastic Teams and Mean-Field Teams
Central Limit Theorem for the volume of the zero set of Kostlan-Shub-Smale random polynomial systems
Dynamical counterexamples regarding the Extremal Index and the mean of the limiting cluster size distribution
Sampling-Based Tour Generation of Arbitrarily Oriented Dubins Sensor Platforms
Controllable Image-to-Video Translation: A Case Study on Facial Expression Generation
Secrecy Outage Performance of Multi-antenna Wiretap Channels With Diversity Combinings Over Correlated Rayleigh Fading Channels
Object Detection in Satellite Imagery using 2-Step Convolutional Neural Networks
Sample size estimation for power and accuracy in the experimental comparison of algorithms
Compressed Sensing Using Binary Matrices of Nearly Optimal Dimensions
An Iterative Boundary Random Walks Algorithm for Interactive Image Segmentation
Advances in Distributed Graph Filtering
Upper density of monochromatic infinite paths
LED Arrays of Laser Printers as Sources of Valuable Emissions for Electromagnetic Penetration Process
Radon Inversion via Deep Learning
Random tree recursions: which fixed points correspond to tangible sets of trees?
Laplacian Controllability of Interconnected Graphs
Arithmetic Word Problem Solver using Frame Identification
Policy Optimization as Wasserstein Gradient Flows
Fractal and Multi-Fractal Analysis for A Family of Subset Sum Functions: Combinatorial Structures of Embedding Dimension 1
Passive Compliance Control of Aerial Manipulators
Efficient Outlier Removal for Large Scale Global Structure-from-Motion
Hunting for Tractable Languages for Judgment Aggregation
Accelerated Bregman Proximal Gradient Methods for Relatively Smooth Convex Optimization
On Minimizing Energy Consumption for D2D Clustered Caching Networks
Optimal conditions for connectedness of discretized sets
Gradient and Newton Boosting for Classification and Regression
On the growth of Artin–Tits monoids and the partial theta function
Network-based Referral Mechanism in a Crowdfunding-based Marketing Pattern
Necessary Field Size and Probability for MDP and Complete MDP Convolutional Codes
Improved linear programming methods for checking avoiding sure loss
Paired 3D Model Generation with Conditional Generative Adversarial Networks
Generalized budgeted submodular set function maximization
Discrete Stieltjes classes for log-Heine type distributions
Image Inspired Poetry Generation in XiaoIce
Efficiently decoding the 3D toric codes and welded codes on cubic lattices
Robust classification via MOM minimization
Energy Efficiency Maximization for C-RANs: Discrete Monotonic Optimization, Penalty, and l0-Approximation Methods
Finite Query Answering in Expressive Description Logics with Transitive Roles
A Geostatistical Framework for Combining Spatially Referenced Disease Prevalence Data from Multiple Diagnostics
The Ramsey number of books
Construction of a Scale of Non-Gaussian Measures in 3D
The speed of critically biased random walk in a one-dimensional percolation model
Building a Kannada POS Tagger Using Machine Learning and Neural Network Models
Optimal transport and unitary orbits in C*-algebras
The financial value of knowing the distribution of stock prices in discrete market models
On the depth and Stanley depth of integral closure of powers of monomial ideals
Overcoming Missing and Incomplete Modalities with Generative Adversarial Networks for Building Footprint Segmentation
Learning to Optimize Join Queries With Deep Reinforcement Learning
A note on limit results for the Penrose-Banzhaf index
Estimation of Location and Orientation for Underwater Vehicles from Range Measurements
A note on optimal design for hierarchical generalized group testing
Data Rates for Network Linear Equations
Exponential line-crossing inequalities
The Buck-Passing Game
Divisibility of some binomial sums
Scalable Gaussian Process Computations Using Hierarchical Matrices
Data-driven polynomial chaos expansion for machine learning regression
Spatial extreme values: variational techniques and stochastic integrals
Does Hamiltonian Monte Carlo mix faster than a random walk on multimodal densities?
Deep Video Color Propagation
Simple Conditions for Metastability of Continuous Markov Chains
User-Guided Deep Anime Line Art Colorization with Conditional Adversarial Networks
Joint Transceiver Optimization for Wireless Communication PHY with Convolutional Neural Network
Augmenting Physical Simulators with Stochastic Neural Networks: Case Study of Planar Pushing and Bouncing
3D Shape Perception from Monocular Vision, Touch, and Shape Priors