BabelNet  BabelNet is a multilingual lexicalized semantic network and ontology developed at the Linguistic Computing Laboratory in the Department of Computer Science of the Sapienza University of Rome. BabelNet was automatically created by linking the largest multilingual Web encyclopedia, Wikipedia, to the most popular computational lexicon of the English language, WordNet. The integration is performed by means of an automatic mapping and by filling in lexical gaps in resource-poor languages with the aid of statistical machine translation. The result is an ‘encyclopedic dictionary’ that provides concepts and named entities lexicalized in many languages and connected with large amounts of semantic relations. Additional lexicalizations and definitions are added by linking to free-license wordnets, OmegaWiki, the English Wiktionary and Wikidata. Similarly to WordNet, BabelNet groups words in different languages into sets of synonyms, called Babel synsets. For each Babel synset, BabelNet provides short definitions (called glosses) in many languages harvested from both WordNet and Wikipedia. BabelNet
BabyAI  Allowing humans to interactively train artificial agents to understand language instructions is desirable for both practical and scientific reasons, but given the poor data efficiency of the current learning methods, this goal may require substantial research efforts. Here, we introduce the BabyAI research platform to support investigations towards including humans in the loop for grounded language learning. The BabyAI platform comprises an extensible suite of 19 levels of increasing difficulty. The levels gradually lead the agent towards acquiring a combinatorially rich synthetic language which is a proper subset of English. The platform also provides a heuristic expert agent for the purpose of simulating a human teacher. We report baseline results and estimate the amount of human involvement that would be required to train a neural network-based agent on some of the BabyAI levels. We put forward strong evidence that current deep learning methods are not yet sufficiently sample efficient when it comes to learning a language with compositional properties.
Bach2Bach  A model of music needs to have the ability to recall past details and have a clear, coherent understanding of musical structure. Detailed in the paper is a deep reinforcement learning architecture that predicts and generates polyphonic music aligned with musical rules. The probabilistic model presented is a Biaxial LSTM trained with a pseudo-kernel reminiscent of a convolutional kernel. To encourage exploration and impose greater global coherence on the generated music, a deep reinforcement learning approach (DQN) is adopted. When analyzed quantitatively and qualitatively, this approach performs well in composing polyphonic music.
Backdoor Injection Attack  Deep learning models have consistently outperformed traditional machine learning models in various classification tasks, including image classification. As such, they have become increasingly prevalent in many real-world applications including those where security is of great concern. Such popularity, however, may attract attackers to exploit the vulnerabilities of the deployed deep learning models and launch attacks against security-sensitive applications. In this paper, we focus on a specific type of data poisoning attack, which we refer to as a backdoor injection attack. The main goal of the adversary performing such an attack is to generate and inject a backdoor into a deep learning model that can be triggered to recognize certain embedded patterns with a target label of the attacker’s choice. Additionally, a backdoor injection attack should occur in a stealthy manner, without undermining the efficacy of the victim model. Specifically, we propose two approaches for generating a backdoor that is hardly perceptible yet effective in poisoning the model. We consider two attack settings, with backdoor injection carried out either before model training or during model updating. We carry out extensive experimental evaluations under various assumptions on the adversary model, and demonstrate that such attacks can be effective and achieve a high attack success rate (above 90%) at a small cost of model accuracy loss (below 1%) with a small injection rate (around 1%), even under the weakest assumption wherein the adversary has no knowledge either of the original training data or the classifier model.
Backdrop  We introduce backdrop, a flexible and simple-to-implement method, intuitively described as dropout acting only along the backpropagation pipeline. Backdrop is implemented via one or more masking layers which are inserted at specific points along the network. Each backdrop masking layer acts as the identity in the forward pass, but randomly masks parts of the backward gradient propagation. Intuitively, inserting a backdrop layer after any convolutional layer leads to stochastic gradients corresponding to features of that scale. Therefore, backdrop is well suited for problems in which the data have a multiscale, hierarchical structure. Backdrop can also be applied to problems with non-decomposable loss functions where standard SGD methods are not well suited. We perform a number of experiments and demonstrate that backdrop leads to significant improvements in generalization.
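The masking-layer idea can be illustrated with a minimal sketch, assuming a hand-written forward/backward interface rather than any particular deep learning framework. The class name `BackdropMask`, the `keep_prob` parameter, and the rescaling of kept gradient entries (borrowed from inverted dropout) are assumptions made for illustration and are not taken from the paper.

```python
import random


class BackdropMask:
    """Identity in the forward pass; random gradient masking in the backward pass.

    Hypothetical illustration: names and the rescaling rule are assumptions.
    """

    def __init__(self, keep_prob, seed=None):
        if not 0.0 < keep_prob <= 1.0:
            raise ValueError("keep_prob must be in (0, 1]")
        self.keep_prob = keep_prob
        self.rng = random.Random(seed)

    def forward(self, activations):
        # Forward pass: values pass through unchanged.
        return list(activations)

    def backward(self, upstream_grad):
        # Backward pass: keep each gradient entry with probability keep_prob
        # and zero it otherwise. Kept entries are rescaled so that the
        # expected gradient is unchanged (an assumption, as in inverted dropout).
        return [
            g / self.keep_prob if self.rng.random() < self.keep_prob else 0.0
            for g in upstream_grad
        ]
```

In a real network such a layer would sit between two ordinary layers, so only the gradients flowing to earlier layers become stochastic while the forward computation stays deterministic.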
Backplay  A longstanding problem in model-free reinforcement learning (RL) is that it requires a large number of trials to learn a good policy, especially in environments with sparse rewards. We explore a method to increase the sample efficiency of RL when we have access to demonstrations. Our approach, which we call Backplay, uses a single demonstration to construct a curriculum for a given task. Rather than starting each training episode in the environment’s fixed initial state, we start the agent near the end of the demonstration and move the starting point backwards during the course of training until we reach the initial state. We perform experiments in a competitive four-player game (Pommerman) and a path-finding maze game. We find that this weak form of guidance provides significant gains in sample complexity with a stark advantage in sparse reward environments. In some cases, standard RL did not yield any improvement while Backplay reached success rates greater than 50% and generalized to unseen initial conditions in the same amount of training time. Additionally, we see that agents trained via Backplay can learn policies superior to those of the original demonstration.
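The curriculum described above (start near the end of the demonstration, then move the start point backwards) can be sketched as a function that picks the starting state for each episode. The linear schedule, the `window` parameter, and the function name are assumptions chosen for illustration; the paper's actual schedule may differ.

```python
import random


def backplay_start_index(demo_length, progress, window=2, rng=random):
    """Pick the index of the demonstration state to start an episode from.

    demo_length: number of states in the single demonstration.
    progress:    fraction of training completed, in [0, 1].
    window:      how many states before the latest allowed start may be sampled
                 (hypothetical parameter).
    """
    if demo_length < 1:
        raise ValueError("demonstration must contain at least one state")
    progress = min(max(progress, 0.0), 1.0)
    last = demo_length - 1
    # The latest allowed start moves linearly from the final demonstration
    # state (progress = 0) back to the initial state (progress = 1).
    upper = round(last * (1.0 - progress))
    lower = max(0, upper - window)
    return rng.randint(lower, upper)
```

Early in training the agent therefore starts a few steps from the goal, where sparse rewards are easy to reach; by the end it starts from the environment's true initial state.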
Backpropagation  Backpropagation, an abbreviation for “backward propagation of errors”, is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respect to all the weights in the network. The gradient is fed to the optimization method, which in turn uses it to update the weights in an attempt to minimize the loss function. Backpropagation requires a known, desired output for each input value in order to calculate the loss function gradient. It is therefore usually considered to be a supervised learning method, although it is also used in some unsupervised networks such as autoencoders. It is a generalization of the delta rule to multilayered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer. Backpropagation requires that the activation function used by the artificial neurons (or “nodes”) be differentiable.
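As a minimal sketch of the chain-rule computation, consider a toy network with one sigmoid hidden unit and one linear output, trained with a squared-error loss. The network shape, the variable names, and the learning rate are assumptions chosen to keep the example small.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, w2, x):
    h = sigmoid(w1 * x)  # hidden activation (differentiable)
    y = w2 * h           # linear output
    return h, y


def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2


def backprop(w1, w2, x, target):
    """Return (dL/dw1, dL/dw2) by applying the chain rule layer by layer."""
    h, y = forward(w1, w2, x)
    dL_dy = y - target                  # derivative of the squared error
    dL_dw2 = dL_dy * h                  # output-layer weight
    dL_dh = dL_dy * w2                  # error propagated back to the hidden unit
    dL_dw1 = dL_dh * h * (1.0 - h) * x  # sigmoid'(z) = h * (1 - h)
    return dL_dw1, dL_dw2


def gradient_step(w1, w2, x, target, lr=0.1):
    """One gradient-descent update using the backpropagated gradient."""
    g1, g2 = backprop(w1, w2, x, target)
    return w1 - lr * g1, w2 - lr * g2
```

The gradients can be checked against central finite differences of `loss`, which is a common sanity test for hand-written backward passes.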
Backpropagation Through Time (BPTT)
➘ “Predictive State Recurrent Neural Networks” 
Backtesting  Backtesting is jargon used in financial industries to refer to testing a trading strategy or predictive model using existing historical data. Backtesting is a special type of cross-validation applied to time series data.
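One common way to respect time order when backtesting is a walk-forward (expanding-window) split, in which each test block comes strictly after the data used for fitting. The sketch below is an assumption about how such a split might be generated; the function name and parameters are hypothetical.

```python
def walk_forward_splits(n_samples, initial_train, test_size):
    """Yield (train_indices, test_indices) pairs in chronological order.

    The training window always starts at index 0 and grows by test_size
    after each split, so no test observation precedes its training data.
    """
    if initial_train < 1 or test_size < 1:
        raise ValueError("initial_train and test_size must be positive")
    start = initial_train
    while start + test_size <= n_samples:
        train = list(range(0, start))
        test = list(range(start, start + test_size))
        yield train, test
        start += test_size
```

For example, `list(walk_forward_splits(10, 4, 3))` produces two splits: training on indices 0 to 3 and testing on 4 to 6, then training on 0 to 6 and testing on 7 to 9.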
Backward Projection  Dimensionality reduction is a common method for analyzing and visualizing high-dimensional data. However, reasoning dynamically about the results of a dimensionality reduction is difficult. Dimensionality-reduction algorithms use complex optimizations to reduce the number of dimensions of a dataset, but these new dimensions often lack a clear relation to the initial data dimensions, thus making them difficult to interpret. Here we propose a visual interaction framework to improve dimensionality-reduction based exploratory data analysis. We introduce two interaction techniques, forward projection and backward projection, for dynamically reasoning about dimensionally reduced data. We also contribute two visualization techniques, prolines and feasibility maps, to facilitate the effective use of the proposed interactions. We apply our framework to PCA and autoencoder-based dimensionality reductions. Through data-exploration examples, we demonstrate how our visual interactions can improve the use of dimensionality reduction in exploratory data analysis.
Backwards Analysis  Backwards analysis (or backward analysis) is a technique for analyzing a randomized algorithm by imagining that it is running backwards in time, from output to input. Most of the more interesting applications of backward analysis are in computational geometry, but there are other interesting applications, and we survey some of them here.
BadNet  Deep learning-based techniques have achieved state-of-the-art performance on a wide variety of recognition and classification tasks. However, these networks are typically computationally expensive to train, requiring weeks of computation on many GPUs; as a result, many users outsource the training procedure to the cloud or rely on pre-trained models that are then fine-tuned for a specific task. In this paper we show that outsourced training introduces new security risks: an adversary can create a maliciously trained network (a backdoored neural network, or a BadNet) that has state-of-the-art performance on the user’s training and validation samples, but behaves badly on specific attacker-chosen inputs. We first explore the properties of BadNets in a toy example, by creating a backdoored handwritten digit classifier. Next, we demonstrate backdoors in a more realistic scenario by creating a U.S. street sign classifier that identifies stop signs as speed limits when a special sticker is added to the stop sign; we then show in addition that the backdoor in our U.S. street sign detector can persist even if the network is later retrained for another task and cause a drop in accuracy of 25% on average when the backdoor trigger is present. These results demonstrate that backdoors in neural networks are both powerful and, because the behavior of neural networks is difficult to explicate, stealthy. This work provides motivation for further research into techniques for verifying and inspecting neural networks, just as we have developed tools for verifying and debugging software.
Bag Of Centroids Model  https://…/Word2Vec_BagOfCentroids.py 
Bag of Little Bootstraps (BLB) 
The Bag of Little Bootstraps (BLB) is a procedure that incorporates features of both the bootstrap and subsampling to yield a robust, computationally efficient means of assessing the quality of estimators. BLB is well suited to modern parallel and distributed computing architectures and furthermore retains the generic applicability and statistical efficiency of the bootstrap. We demonstrate BLB’s favorable statistical performance via a theoretical analysis elucidating the procedure’s properties, as well as a simulation study comparing BLB to the bootstrap, the m out of n bootstrap, and subsampling. Introduction to Bag of Little Bootstrap
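A rough sketch of the procedure, assuming the quantity of interest is the standard error of an estimator: draw a few small subsamples, bootstrap each one up to the full data size, and average the per-subsample quality assessments. The function name, the default parameters, and the choice to materialize each resample (rather than use multinomial weights, as efficient implementations typically do) are simplifying assumptions.

```python
import random
import statistics


def blb_standard_error(data, estimator=statistics.fmean, n_subsets=5,
                       subset_exponent=0.6, n_resamples=50, seed=0):
    """Estimate the standard error of `estimator` with a simplified BLB."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    rng = random.Random(seed)
    # Subsample size b = n ** subset_exponent, much smaller than n.
    b = min(n, max(2, int(n ** subset_exponent)))
    subset_errors = []
    for _ in range(n_subsets):
        # Subsampling step: b points drawn without replacement.
        subset = rng.sample(list(data), b)
        estimates = []
        for _ in range(n_resamples):
            # Bootstrap step: resample n points (the full data size)
            # with replacement from the small subset.
            resample = rng.choices(subset, k=n)
            estimates.append(estimator(resample))
        subset_errors.append(statistics.stdev(estimates))
    # Average the per-subset assessments.
    return statistics.fmean(subset_errors)
```

Because each subset is processed independently, the outer loop is the part that would be distributed across workers in a parallel setting.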
Bag of Symbolic Fourier Approximation Symbols (BOSS) 
From BOP to BOSS and Beyond: Time Series Classification with Dictionary Based Classifiers 
Bag Of Words Model  The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model is commonly used in methods of document classification, where the (frequency of) occurrence of each word is used as a feature for training a classifier.
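A minimal sketch of the representation, assuming a simple lowercase regular-expression tokenizer (real pipelines usually use a more careful tokenizer). The function names are chosen for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return the multiset of word tokens in `text` as a Counter."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


def to_feature_vector(bag, vocabulary):
    """Map a bag to word counts in a fixed vocabulary order."""
    return [bag[word] for word in vocabulary]
```

For instance, `bag_of_words("John likes movies. Mary likes movies too.")` counts `likes` and `movies` twice each and every other word once, and any reordering of the same words yields the same bag.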
Bagging Hierarchical Clustering  Bagging (bootstrap aggregating) is usually used with supervised methods to improve their stability and accuracy. The idea is to bootstrap the sample, build a predictive model on each bootstrapped sample, and then combine the results: a vote on the predicted class for classification, and an average prediction for the continuous case. If we bootstrap our data and build a separate hierarchical clustering solution on each sample, can we then combine the results to produce a more stable clustering solution?
BagNet  Deep Neural Networks (DNNs) excel on many complex perceptual tasks but it has proven notoriously difficult to understand how they reach their decisions. We here introduce a high-performance DNN architecture on ImageNet whose decisions are considerably easier to explain. Our model, a simple variant of the ResNet-50 architecture called BagNet, classifies an image based on the occurrences of small local image features without taking into account their spatial ordering. This strategy is closely related to the bag-of-feature (BoF) models popular before the onset of deep learning and reaches a surprisingly high accuracy on ImageNet (87.6% top-5 for 33 x 33 px features and AlexNet performance for 17 x 17 px features). The constraint on local features makes it straightforward to analyse how exactly each part of the image influences the classification. Furthermore, the BagNets behave similarly to state-of-the-art deep neural networks such as VGG-16, ResNet-152 or DenseNet-169 in terms of feature sensitivity, error distribution and interactions between image parts. This suggests that the improvements of DNNs over previous bag-of-feature classifiers in the last few years are mostly achieved by better fine-tuning rather than by qualitatively different decision strategies.
Bag-of-Concepts (BoC) 
This paper focuses on a traditional relation extraction task in the context of limited annotated data and a narrow knowledge domain. We explore this task with a clinical corpus consisting of 200 breast cancer follow-up treatment letters in which 16 distinct types of relations are annotated. We experiment with an approach to extracting typed relations called window-bounded co-occurrence (WBC), which uses an adjustable context window around entity mentions of a relevant type, and compare its performance with a more typical intra-sentential co-occurrence baseline. We further introduce a new bag-of-concepts (BoC) approach to feature engineering based on the state-of-the-art word embeddings and word synonyms. We demonstrate the competitiveness of BoC by comparing with methods of higher complexity, and explore its effectiveness on this small dataset.
Balanced and Conformal Optimized Prediction set (BCOPS) 
We consider the multi-class classification problem when the training data and the out-of-sample test data may have different distributions and propose a method called BCOPS (balanced and conformal optimized prediction set) that constructs a prediction set C(x) which tries to optimize out-of-sample performance, aiming to include the correct class as often as possible, but also detecting outliers x, for which the method returns no prediction (corresponding to C(x) equal to the empty set). BCOPS combines supervised-learning algorithms with the method of conformal prediction to minimize a misclassification loss averaged over the out-of-sample distribution. The constructed prediction sets have a finite-sample coverage guarantee without distributional assumptions. We also develop a variant of BCOPS in the online setting where we optimize the misclassification loss averaged over a proxy of the out-of-sample distribution. We also describe new methods for the evaluation of out-of-sample performance with mismatched data. We prove asymptotic consistency and efficiency of the proposed methods under suitable assumptions and illustrate our methods on real data examples.
Balanced Distribution Adaptation (BDA) 
Transfer learning has achieved promising results by leveraging knowledge from the source domain to annotate the target domain which has few or no labels. Existing methods often seek to minimize the distribution divergence between domains, such as the marginal distribution, the conditional distribution or both. However, these two distances are often treated equally in existing algorithms, which will result in poor performance in real applications. Moreover, existing methods usually assume that the dataset is balanced, which also limits their performance on imbalanced tasks that are quite common in real problems. To tackle the distribution adaptation problem, in this paper, we propose a novel transfer learning approach, named Balanced Distribution Adaptation (BDA), which can adaptively leverage the importance of the marginal and conditional distribution discrepancies, and several existing methods can be treated as special cases of BDA. Based on BDA, we also propose a novel Weighted Balanced Distribution Adaptation (WBDA) algorithm to tackle the class imbalance issue in transfer learning. WBDA not only considers the distribution adaptation between domains but also adaptively changes the weight of each class. To evaluate the proposed methods, we conduct extensive experiments on several transfer learning tasks, which demonstrate the effectiveness of our proposed algorithms over several state-of-the-art methods.
Balanced k-Means  Mesh partitioning is an indispensable tool for efficient parallel numerical simulations. Its goal is to minimize communication between the processes of a simulation while achieving load balance. Established graph-based partitioning tools yield a high solution quality; however, their scalability is limited. Geometric approaches usually scale better, but their solution quality may be unsatisfactory for ‘non-trivial’ mesh topologies. In this paper, we present a scalable version of $k$-means that is adapted to yield balanced clusters. Balanced $k$-means constitutes the core of our new partitioning algorithm Geographer. Bootstrapping of initial centers is performed with space-filling curves, leading to fast convergence of the subsequent balanced $k$-means algorithm. Our experiments with up to 16384 MPI processes on numerous benchmark meshes show the following: (i) Geographer produces partitions with a lower communication volume than state-of-the-art geometric partitioners from the Zoltan package; (ii) Geographer scales well on large inputs; (iii) a Delaunay mesh with a few billion vertices and edges can be partitioned in a few seconds.
Balanced Linear Contextual Bandits  Contextual bandit algorithms are sensitive to the estimation method of the outcome model as well as the exploration method used, particularly in the presence of rich heterogeneity or complex outcome models, which can lead to difficult estimation problems along the path of learning. We develop algorithms for contextual bandits with linear payoffs that integrate balancing methods from the causal inference literature in their estimation to make it less prone to problems of estimation bias. We provide the first regret bound analyses for linear contextual bandits with balancing and show that our algorithms match the state-of-the-art theoretical guarantees. We demonstrate the strong practical advantage of balanced contextual bandits on a large number of supervised learning datasets and on a synthetic example that simulates model misspecification and prejudice in the initial training data.
Balanced Random Forest Approach for WEKA  Data analysis and machine learning have become an integral part of the modern scientific methodology, providing automated techniques to predict further information based on observations. One of these classification and regression techniques is the random forest approach. Those decision tree based predictors are best known for their good computational performance and scalability. However, in case of severely imbalanced training data, as often seen in medical studies’ data with large control groups, the training algorithm or the sampling process has to be altered in order to improve the prediction quality for minority classes. In this work, a balanced random forest approach for WEKA is proposed. Furthermore, the prediction quality of the unmodified random forest implementation and the new balanced random forest version for WEKA are evaluated against reference implementations in R. Two-class problems on balanced data sets and imbalanced medical studies’ data are investigated. A superior prediction quality using the proposed method for imbalanced data is shown compared to the other three techniques.
Balanced Similarity for Online Discrete Hashing (BSODH) 
When facing large-scale image datasets, online hashing serves as a promising solution for online retrieval and prediction tasks. It encodes the online streaming data into compact binary codes, and simultaneously updates the hash functions to renew codes of the existing dataset. To this end, the existing methods update hash functions solely based on the new data batch, without investigating the correlation between such new data and the existing dataset. In addition, existing works update the hash functions using a relaxation process in the corresponding approximated continuous space, and it remains an open problem to directly apply discrete optimization in online hashing. In this paper, we propose a novel supervised online hashing method, termed Balanced Similarity for Online Discrete Hashing (BSODH), to solve the above problems in a unified framework. BSODH employs a well-designed hashing algorithm to preserve the similarity between the streaming data and the existing dataset via an asymmetric graph regularization. We further identify the ‘data-imbalance’ problem brought by the constructed asymmetric graph, which restricts the application of discrete optimization in our problem. Therefore, a novel balanced similarity is further proposed, which uses two equilibrium factors to balance the similar and dissimilar weights and eventually enables the usage of discrete optimizations. Extensive experiments conducted on three widely-used benchmarks demonstrate the advantages of the proposed method over the state-of-the-art methods.
Balanced Sparsity  In trained deep neural networks, unstructured pruning can reduce redundant weights to lower storage cost. However, it requires the customization of hardware to speed up practical inference. Another trend accelerates sparse model inference on general-purpose hardware by adopting coarse-grained sparsity to prune or regularize consecutive weights for efficient computation. But this method often sacrifices model accuracy. In this paper, we propose a novel fine-grained sparsity approach, balanced sparsity, to achieve high model accuracy efficiently on commercial hardware. Our approach adapts to the high-parallelism property of GPUs, showing incredible potential for sparsity in the wide deployment of deep learning services. Experiment results show that balanced sparsity achieves up to 3.1x practical speedup for model inference on GPU, while retaining the same high model accuracy as fine-grained sparsity.
Balancing GAN (BAGAN) 
Image classification datasets are often imbalanced, a characteristic that negatively affects the accuracy of deep-learning classifiers. In this work we propose balancing GANs (BAGANs) as an augmentation tool to restore balance in imbalanced datasets. This is challenging because the few minority-class images may not be enough to train a GAN. We overcome this issue by including during training all available images of majority and minority classes. The generative model learns useful features from majority classes and uses these to generate images for minority classes. We apply class-conditioning in the latent space to drive the generation process towards a target class. Additionally, we couple GANs with autoencoding techniques to reduce the risk of collapsing toward the generation of few foolish examples. We compare the proposed methodology with state-of-the-art GANs and demonstrate that BAGAN generates images of superior quality when trained with an imbalanced dataset.
Banach Wasserstein GAN  Wasserstein Generative Adversarial Networks (WGANs) can be used to generate realistic samples from complicated image distributions. The Wasserstein metric used in WGANs is based on a notion of distance between individual images, which induces a notion of distance between probability distributions of images. So far the community has considered $\ell^2$ as the underlying distance. We generalize the theory of WGAN with gradient penalty to Banach spaces, allowing practitioners to select the features to emphasize in the generator. We further discuss the effect of some particular choices of underlying norms, focusing on Sobolev norms. Finally, we demonstrate the impact of the choice of norm on model performance and show state-of-the-art inception scores for non-progressive growing GANs on CIFAR-10.
Bandit on Large Action set Graph (BLAG) 
Information diffusion in social networks facilitates rapid and large-scale propagation of content. However, spontaneous diffusion behavior could also lead to the cascading of sensitive information, which is neglected in prior art. In this paper, we present the first look into adaptive diffusion of sensitive information, which we aim to prevent from widely spreading without incurring much information loss. We undertake the investigation in networks with partially known topology, meaning that some users’ ability to forward information is unknown. Formulating the problem as a bandit model, we propose BLAG (Bandit on Large Action set Graph), which adaptively diffuses sensitive information towards users with weak forwarding ability that is learnt from tentative transmissions and the corresponding feedback. BLAG enjoys a low complexity of O(n), and is provably more efficient, in the sense of half the regret bound, compared with the prior learning method. Experiments on synthetic and three real datasets further demonstrate the superiority of BLAG in terms of adaptive diffusion of sensitive information over several baselines, with at least 40 percent less information loss, at least 10 times the learning efficiency given limited learning rounds and significantly postponed cascading of sensitive information.
Bandit Principal Component Analysis  We consider a partial-feedback variant of the well-studied online PCA problem where a learner attempts to predict a sequence of $d$-dimensional vectors in terms of a quadratic loss, while only having limited feedback about the environment’s choices. We focus on a natural notion of bandit feedback where the learner only observes the loss associated with its own prediction. Based on the classical observation that this decision-making problem can be lifted to the space of density matrices, we propose an algorithm that is shown to achieve a regret of $O(d^{3/2}\sqrt{T})$ after $T$ rounds in the worst case. We also prove data-dependent bounds that improve on the basic result when the loss matrices of the environment have bounded rank or the loss of the best action is bounded. One version of our algorithm runs in $O(d)$ time per trial which massively improves over every previously known online PCA method. We complement these results by a lower bound of $\Omega(d\sqrt{T})$.
Banzhaf Random Forests (BRF) 
Random forests are a type of ensemble method which makes predictions by combining the results of several independent trees. However, the theory of random forests has long been outpaced by their application. In this paper, we propose a novel random forests algorithm based on cooperative game theory. The Banzhaf power index is employed to evaluate the power of each feature by traversing possible feature coalitions. Unlike the previously used information gain rate of information theory, which simply chooses the most informative feature, the Banzhaf power index can be considered as a metric of the importance of each feature on the dependency among a group of features. More importantly, we have proved the consistency of the proposed algorithm, named Banzhaf random forests (BRF). This theoretical analysis takes a step towards narrowing the gap between the theory and practice of random forests for classification problems. Experiments on several UCI benchmark data sets show that BRF is competitive with state-of-the-art classifiers and dramatically outperforms previous consistent random forests. In particular, it is much more efficient than previous consistent random forests.
Barista  Pre-trained deep learning models are increasingly being used to offer a variety of compute-intensive predictive analytics services such as fitness tracking, speech and image recognition. The stateless and highly parallelizable nature of deep learning models makes them well-suited for the serverless computing paradigm. However, making effective resource management decisions for these services is a hard problem due to the dynamic workloads and diverse set of available resource configurations that have their own deployment and management costs. To address these challenges, we present a distributed and scalable deep-learning prediction serving system called Barista and make the following contributions. First, we present a fast and effective methodology for forecasting workloads by identifying various trends. Second, we formulate an optimization problem to minimize the total cost incurred while ensuring bounded prediction latency with reasonable accuracy. Third, we propose an efficient heuristic to identify suitable compute resource configurations. Fourth, we propose an intelligent agent to allocate and manage the compute resources by horizontal and vertical scaling to maintain the required prediction latency. Finally, using representative real-world workloads for an urban transportation service, we demonstrate and validate the capabilities of Barista.
Barnard’s Test  In statistics, Barnard’s test is an exact test used in the analysis of contingency tables. The test was first published by George Alfred Barnard (1945, 1947), who claimed this test is a more powerful alternative to Fisher’s exact test for 2×2 contingency tables. A previous barrier to the widespread use of Barnard’s test was likely the computational difficulty of calculating the p-value; nowadays, computers can implement Barnard’s test.
Basic Linear Algebra Subprograms (BLAS) 
The Basic Linear Algebra Subprograms (BLAS) are a specified set of low-level kernel subroutines that perform common linear algebra operations such as copying, vector scaling, vector dot products, linear combinations, and matrix multiplication. They were first published as a Fortran library in 1979 and are still used as a building block in higher-level math programming languages and libraries, including LINPACK, LAPACK, MATLAB, Mathematica, NumPy and R. BLAS subroutines are a de facto standard API for linear algebra libraries and routines. Several BLAS library implementations have been tuned for specific computer architectures. Highly optimized implementations have been developed by hardware vendors such as Intel and AMD, as well as by other authors, e.g. GotoBLAS and ATLAS (a portable self-optimizing BLAS). The LINPACK and HPL benchmarks rely heavily on DGEMM, a BLAS subroutine, for their performance measurements.
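To make the DGEMM operation concrete, the sketch below spells out what a general matrix multiply computes, namely C ← alpha·A·B + beta·C. It is a plain-Python reference of the semantics only, written for readability; it is not the BLAS calling convention (which also takes transpose flags, dimensions, and leading dimensions) and is far slower than any tuned implementation. The function name `gemm` is chosen for illustration.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices given as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape (rows of a) x (columns of b)")
    return [
        [
            alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]
```

With `a = [[1, 2], [3, 4]]`, `b = [[5, 6], [7, 8]]`, `c` filled with ones, `alpha = 2` and `beta = 1`, the product A·B is `[[19, 22], [43, 50]]`, so the result is `[[39, 45], [87, 101]]`.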
Basic Recurrent Neural Network Model (bRNN) 
We present a model of a basic recurrent neural network (or bRNN) that includes a separate linear term with a slightly ‘stable’ fixed matrix to guarantee bounded solutions and fast dynamic response. We formulate a state space viewpoint and adapt the constrained optimization Lagrange Multiplier (CLM) technique and the vector Calculus of Variations (CoV) to derive the (stochastic) gradient descent. In this process, one avoids the commonly used re-application of the circular chain rule and identifies the error backpropagation with the co-state backward dynamic equations. We assert that this bRNN can successfully perform regression tracking of time series. Moreover, the ‘vanishing and exploding’ gradients are explicitly quantified and explained through the co-state dynamics and the update laws. The adapted CoV framework, in addition, can correctly and in a principled way integrate new loss functions in the network on any variable and for varied goals, e.g., for supervised learning on the outputs and unsupervised learning on the internal (hidden) states.
Basis-Path Norm  Recently, path norm was proposed as a new capacity measure for neural networks with the Rectified Linear Unit (ReLU) activation function, which takes the rescaling-invariant property of ReLU into account. It has been shown that the generalization error bound in terms of the path norm explains the empirical generalization behaviors of ReLU neural networks better than that of other capacity measures. Moreover, optimization algorithms which take path norm as the regularization term to the loss function, like Path-SGD, have been shown to achieve better generalization performance. However, the path norm counts the values of all paths, and hence the capacity measure based on path norm could be improperly influenced by the dependency among different paths. It is also known that each path of a ReLU network can be represented by a small group of linearly independent basis paths with multiplication and division operations, which indicates that the generalization behavior of the network depends on only a few basis paths. Motivated by this, we propose a new norm, the Basis-path Norm, based on a group of linearly independent paths to measure the capacity of neural networks more accurately. We establish a generalization error bound based on this basis-path norm, and show via extensive experiments that it explains the generalization behaviors of ReLU networks more accurately than previous capacity measures. In addition, we develop optimization algorithms which minimize the empirical risk regularized by the basis-path norm. Our experiments on benchmark datasets demonstrate that the proposed regularization method achieves clearly better performance on the test set than the previous regularization approaches.
Bass Diffusion Model  The Bass Model or Bass Diffusion Model was developed by Frank Bass. It consists of a simple differential equation that describes how new products get adopted in a population. The model presents a rationale of how current adopters and potential adopters of a new product interact. The basic premise of the model is that adopters can be classified as innovators or as imitators, and that the speed and timing of adoption depend on their degree of innovativeness and the degree of imitation among adopters. The Bass model has been widely used in forecasting, especially new products’ sales forecasting and technology forecasting. Mathematically, the basic Bass diffusion is a Riccati equation with constant coefficients. Frank Bass published his paper “A new product growth for model consumer durables” in 1969; the title indeed contained a typographical error. Prior to this, Everett Rogers published Diffusion of Innovations, a highly influential work that described the different stages of product adoption. Bass contributed some mathematical ideas to the concept.
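The differential equation is commonly written as dF/dt = (p + q F)(1 - F), where F(t) is the cumulative fraction of adopters, p is the coefficient of innovation and q the coefficient of imitation. The sketch below evaluates its closed-form solution; the function names and the parameter values are illustrative assumptions, not figures from the entry above.

```python
import math


def bass_cumulative(t, p, q):
    """Cumulative adoption fraction F(t), with F(0) = 0 and p > 0."""
    e = math.exp(-(p + q) * t)
    return (1.0 - e) / (1.0 + (q / p) * e)


def bass_rate(t, p, q):
    """Adoption rate dF/dt = (p + q * F) * (1 - F) at time t."""
    f = bass_cumulative(t, p, q)
    return (p + q * f) * (1.0 - f)


# Hypothetical coefficients chosen only for illustration.
p, q = 0.03, 0.38
adoption = [bass_cumulative(t, p, q) for t in range(21)]
for t in (0, 5, 10, 20):
    print(t, round(adoption[t], 4), round(bass_rate(t, p, q), 4))
```

The term p * (1 - F) models adoption by innovators, while q * F * (1 - F) models adoption by imitators, which grows as more of the population has already adopted.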
Batch Normalization (BN) 
Batch normalization is a technique for improving the performance and stability of artificial neural networks. It was introduced in a 2015 paper (https://…/1502.03167.pdf). It normalizes the inputs to a layer by adjusting and scaling the activations. It can mitigate the problem of internal covariate shift, where parameter initialization and changes in the distribution of the inputs of each layer affect the learning rate of the network.
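A minimal sketch of the training-time forward pass may help: each feature is standardized with the mean and variance of the current mini-batch, then rescaled by a learned scale and shift. The function name `batch_norm`, the parameter names `gamma` and `beta`, and the value of `eps` are assumptions for illustration; the running averages used at inference time are omitted.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature (column) of a mini-batch, then scale and shift."""
    n = len(batch)
    d = len(batch[0])
    # Per-feature statistics computed over the mini-batch.
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(d)]
    return [
        [
            gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
            for j in range(d)
        ]
        for row in batch
    ]


batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in normalized:
    print([round(v, 3) for v in row])
```

With `gamma` set to 1 and `beta` set to 0, both columns end up with approximately zero mean and unit variance even though their original scales differ by a factor of ten.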
Batch Normalized Recurrent Highway Network  Gradient control plays an important role in feedforward networks applied to various computer vision tasks. Previous work has shown that Recurrent Highway Networks minimize the problem of vanishing or exploding gradients. They achieve this by setting the eigenvalues of the temporal Jacobian to 1 across the time steps. In this work, batch normalized recurrent highway networks are proposed to control the gradient flow in an improved way for network convergence. Specifically, the introduced model can be formed by batch normalizing the inputs at each recurrence loop. The proposed model is tested on an image captioning task using the MSCOCO dataset. Experimental results indicate that the batch normalized recurrent highway networks converge faster and perform better than the traditional LSTM and RHN based models.
Batch Sampling  Deep Neural Networks (DNNs) have thrived in recent years, with Batch Normalization (BN) playing an indispensable role. However, it has been observed that BN is costly due to the reduction operations. In this paper, we propose alleviating this problem by sampling only a small fraction of data for normalization at each iteration. Specifically, we model it as a statistical sampling problem and identify that by sampling less correlated data, we can largely reduce the amount of data required for statistics estimation in BN, which directly simplifies the reduction operations. Based on this conclusion, we propose two sampling strategies, ‘Batch Sampling’ (randomly select several samples from each batch) and ‘Feature Sampling’ (randomly select a small patch from each feature map of all samples), that take both computational efficiency and sample correlation into consideration. Furthermore, we introduce an extremely simple variant of BN, termed Virtual Dataset Normalization (VDN), that can normalize the activations well with few synthetic random samples. All the proposed methods are evaluated on various datasets and networks, where an overall training speedup of up to 20% on GPU is practically achieved without the support of any specialized libraries, and the loss in accuracy and convergence rate is negligible. Finally, we extend our work to the ‘micro-batch normalization’ problem and yield comparable performance with existing approaches in the case of tiny batch sizes.
Batch Tournament Selection (BTS) 
Lexicase selection achieves very good solution quality by introducing ordered test cases. However, the computational complexity of lexicase selection can prohibit its use in many applications. In this paper, we introduce Batch Tournament Selection (BTS), a hybrid of tournament and lexicase selection which is approximately one order of magnitude faster than lexicase selection while achieving a competitive quality of solutions. Tests on a number of regression datasets show that BTS compares well with lexicase selection in terms of mean absolute error while having a speedup of up to 25 times. Surprisingly, BTS and lexicase selection have almost no difference in both diversity and performance. This reveals that batches and ordered test cases are completely different mechanisms which share the same general principle fostering the specialization of individuals. This work introduces an efficient algorithm that sheds light onto the main principles behind the success of lexicase, potentially opening up a new range of possibilities for algorithms to come. 
Batch Virtual Adversarial Training (BVAT) 
We present batch virtual adversarial training (BVAT), a novel regularization method for graph convolutional networks (GCNs). BVAT addresses the shortcoming of GCNs that do not consider the smoothness of the model’s output distribution against local perturbations around the input. We propose two algorithms, sample-based BVAT and optimization-based BVAT, which are suitable to promote the smoothness of the model for graph-structured data by either finding virtual adversarial perturbations for a subset of nodes far from each other or generating virtual adversarial perturbations for all nodes with an optimization process. Extensive experiments on three citation network datasets (Cora, Citeseer and Pubmed) and a knowledge graph dataset (Nell) validate the effectiveness of the proposed method, which establishes state-of-the-art results in the semi-supervised node classification tasks.
Batched Successive Elimination (BaSE) 
In this paper, we study the multi-armed bandit problem in the batched setting where the employed policy must split data into a small number of batches. While the minimax regret for the two-armed stochastic bandits has been completely characterized by Perchet et al. (2016), the effect of the number of arms on the regret for the multi-armed case is still open. Moreover, the question of whether adaptively chosen batch sizes will help to reduce the regret also remains underexplored. We propose the BaSE (batched successive elimination) policy to achieve the rate-optimal regret (within logarithmic factors) for batched multi-armed bandits, with matching lower bounds even if the batch sizes are determined in a data-driven manner.
Batch-Expansion Training (BET) 
We propose Batch-Expansion Training (BET), a framework for running a batch optimizer on a gradually expanding dataset. As opposed to stochastic approaches, batches do not need to be resampled i.i.d. at every iteration, thus making BET more resource efficient in a distributed setting, and when disk access is constrained. Moreover, BET can be easily paired with most batch optimizers, does not require any parameter tuning, and compares favorably to existing stochastic and batch methods. We show that when the batch size grows exponentially with the number of outer iterations, BET achieves the optimal $\tilde{O}(1/\epsilon)$ data-access convergence rate for strongly convex objectives.
Batch-Mode Active Learning  Recently, Convolutional Neural Networks (CNNs) have shown unprecedented success in the field of computer vision, especially on challenging image classification tasks, by relying on a universal approach, i.e., training a deep model on a massive dataset of supervised examples. While unlabeled data are often an abundant resource, collecting a large set of labeled data is very expensive and often requires considerable human effort. One way to ease this is to effectively select and label highly informative instances from a pool of unlabeled data (i.e., active learning). This paper proposes a new method of batch-mode active learning, Dual Active Sampling (DAS), which is based on a simple assumption: if two deep neural networks (DNNs) of the same structure and trained on the same dataset give significantly different outputs for a given sample, then that particular sample should be picked for additional training. While other state-of-the-art methods in this field usually require intensive computational power or rely on a complicated structure, DAS is simpler to implement and obtains improved results on CIFAR-10 with preferable computational time compared to the core-set method.
BatchPPO  An efficient implementation of the proximal policy optimization algorithm. ➘ “TensorFlow Agents” 
Battery Reduction  Battery reduction is used to select a subset of m variables from an original set of n variables (m < n) that reproduce a large proportion of the variance in the original set of n variables. There are a number of procedures for performing battery reduction analysis. A popular method involves performing a principal components analysis first to select m components, which account for the salient variance in the original data. Gram-Schmidt orthogonal rotations are then performed to determine the m variables that account for the largest proportion of variance.
Baum-Welch Algorithm  In electrical engineering, computer science, statistical computing and bioinformatics, the Baum-Welch algorithm is used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm and is named for Leonard E. Baum and Lloyd R. Welch.
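The forward-backward step that the algorithm relies on can be sketched as follows. This is only the inference part: it computes the posterior probability of each hidden state at each time step for fixed parameters, and does not perform the parameter re-estimation that completes a Baum-Welch iteration. The function name and the toy two-state model are assumptions for illustration, and the unscaled recursion shown here would underflow on long sequences.

```python
def forward_backward(obs, start, trans, emit):
    """Return (gamma, likelihood) for a discrete HMM.

    gamma[t][i] is the posterior probability of state i at time t given all
    observations; likelihood is the probability of the observation sequence.
    """
    n_states = len(start)
    steps = len(obs)
    # Forward pass: alpha[t][i] = P(obs[0..t], state_t = i).
    alpha = [[start[i] * emit[i][obs[0]] for i in range(n_states)]]
    for t in range(1, steps):
        prev = alpha[-1]
        alpha.append([
            emit[j][obs[t]] * sum(prev[i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ])
    # Backward pass: beta[t][i] = P(obs[t+1..] | state_t = i).
    beta = [[1.0] * n_states for _ in range(steps)]
    for t in range(steps - 2, -1, -1):
        beta[t] = [
            sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] for j in range(n_states))
            for i in range(n_states)
        ]
    likelihood = sum(alpha[-1])
    gamma = [
        [alpha[t][i] * beta[t][i] / likelihood for i in range(n_states)]
        for t in range(steps)
    ]
    return gamma, likelihood


# Hypothetical two-state, two-symbol model.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
gamma, likelihood = forward_backward([0, 1, 1], start, trans, emit)
print(likelihood)
for row in gamma:
    print([round(g, 4) for g in row])
```

In a full Baum-Welch iteration, these posteriors (together with pairwise state posteriors) would be used to update the start, transition and emission probabilities, and the procedure would repeat until the likelihood stops improving.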
Bayes Binning in Quantiles (BBQ) 
BBQ (Bayes Binning in Quantiles) Naeini (2015, ISBN:0262511290) CalibratR 
Bayes Factor  In statistics, the use of Bayes factors is a Bayesian alternative to classical hypothesis testing. Bayesian model comparison is a method of model selection based on Bayes factors. Bayes factors provide a numerical value that quantifies how well a hypothesis predicts the empirical data relative to a competing hypothesis. For example, a BF of 4 indicates: ‘This empirical data is 4 times more probable if H1 were true than if H0 were true.’ Hence, the evidence points towards H1. A BF of 1 means that the data are equally likely to occur under both hypotheses. In this case, it would be impossible to decide between the two. http://…/ashorttaxonomyofbayesfactors
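As a small worked example, the sketch below computes the Bayes factor for two point hypotheses about a success probability, given binomial data. The function names and the numbers are illustrative assumptions. For composite hypotheses the likelihoods would have to be averaged over a prior, which this sketch does not cover.

```python
import math


def binomial_likelihood(k, n, theta):
    """Probability of k successes in n trials with success probability theta."""
    return math.comb(n, k) * theta ** k * (1.0 - theta) ** (n - k)


def bayes_factor(k, n, theta1, theta0):
    """BF = P(data | H1: theta = theta1) / P(data | H0: theta = theta0)."""
    return binomial_likelihood(k, n, theta1) / binomial_likelihood(k, n, theta0)


# Hypothetical data: 7 successes in 10 trials; H1: theta = 0.7, H0: theta = 0.5.
bf = bayes_factor(k=7, n=10, theta1=0.7, theta0=0.5)
print(round(bf, 3))  # 2.277
```

Here the data are roughly 2.3 times more probable under H1 than under H0.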
Bayes Imbalance Impact Index (BI^3) 
Recent studies have shown that imbalance ratio is not the only cause of the performance loss of a classifier in imbalanced data classification. In fact, other data factors, such as small disjuncts, noise and overlapping, also play a role in tandem with imbalance ratio, which makes the problem difficult. Thus far, the empirical studies have demonstrated the relationship between the imbalance ratio and other data factors only. To the best of our knowledge, there is no measurement of the extent to which class imbalance influences the classification performance on imbalanced data. Further, it is also unknown for a given dataset which data factor is actually the main barrier for classification. In this paper, we focus on the Bayes optimal classifier and study the influence of class imbalance from a theoretical perspective. Accordingly, we propose an instance measure called Individual Bayes Imbalance Impact Index ($IBI^3$) and a data measure called Bayes Imbalance Impact Index ($BI^3$). $IBI^3$ and $BI^3$ reflect the extent of influence purely by the factor of imbalance in terms of each minority class sample and the whole dataset, respectively. Therefore, $IBI^3$ can be used as an instance complexity measure of imbalance and $BI^3$ is a criterion to show the degree to which imbalance deteriorates the classification. As a result, we can use $BI^3$ to judge whether it is worth using imbalance recovery methods like sampling or cost-sensitive methods to recover the performance loss of a classifier. The experiments show that $IBI^3$ is highly consistent with the increase of prediction score made by the imbalance recovery methods and $BI^3$ is highly consistent with the improvement of F1 score made by the imbalance recovery methods on both synthetic and real benchmark datasets.
Bayes Point Machine  Kernel classifiers comprise a powerful class of non-linear decision functions for binary classification. The support vector machine is an example of a learning algorithm for kernel classifiers that singles out the consistent classifier with the largest margin, i.e. minimal real-valued output on the training sample, within the set of consistent hypotheses, the so-called version space. We suggest the Bayes point machine as a well-founded improvement which approximates the Bayes-optimal decision by the centre of mass of version space. We present two algorithms to stochastically approximate the centre of mass of version space: a billiard sampling algorithm and a sampling algorithm based on the well known perceptron algorithm. It is shown how both algorithms can be extended to allow for soft boundaries in order to admit training errors. Experimentally, we find that, for the zero training error case, Bayes point machines consistently outperform support vector machines on both surrogate data and real-world benchmark data sets. In the soft-boundary/soft-margin case, the improvement over support vector machines is shown to be reduced. Finally, we demonstrate that the real-valued output of single Bayes points on novel test points is a valid confidence measure and leads to a steady decrease in generalisation error when used as a rejection criterion. http://…/bayes%20point%20machine%20tutorial.aspx
Bayes via Goodness of fit  The two key issues of modern Bayesian statistics are: (i) establishing a principled approach for distilling a statistical prior that is consistent with the given data from an initial believable scientific prior; and (ii) developing a Bayes-frequentist consolidated data analysis workflow that is more effective than either of the two separately. In this paper, we propose the idea of ‘Bayes via goodness of fit’ as a framework for exploring these fundamental questions, in a way that is general enough to embrace almost all of the familiar probability models. Several illustrative examples show the benefit of this new point of view as a practical data analysis tool. The relationship with other Bayesian cultures is also discussed.
Bayes-Adaptive Markov Decision Process (BAMDP) 
Addressing uncertainty is critical for autonomous systems to robustly adapt to the real world. We formulate the problem of model uncertainty as a continuous Bayes-Adaptive Markov Decision Process (BAMDP), where an agent maintains a posterior distribution over the latent model parameters given a history of observations and maximizes its expected long-term reward with respect to this belief distribution. Our algorithm, Bayesian Policy Optimization, builds on recent policy optimization algorithms to learn a universal policy that navigates the exploration-exploitation trade-off to maximize the Bayesian value function. To address challenges from discretizing the continuous latent parameter space, we propose a policy network architecture that independently encodes the belief distribution from the observable state. Our method significantly outperforms algorithms that address model uncertainty without explicitly reasoning about belief distributions, and is competitive with state-of-the-art Partially Observable Markov Decision Process solvers.
Bayes-CPACE  We present, to the best of our knowledge, the first PAC optimal algorithm for Bayes-Adaptive Markov Decision Processes (BAMDPs) in continuous state and action spaces. The BAMDP framework elegantly addresses model uncertainty by incorporating Bayesian belief updates into long-term expected return. However, computing an exact optimal Bayesian policy is intractable. Our key insight is to compute a near-optimal value function by covering the continuous state-belief-action space with a finite set of representative samples and exploiting the Lipschitz continuity of the value function. We prove the near-optimality of our algorithm and analyze a number of schemes that boost the algorithm’s efficiency. Finally, we empirically validate our approach on a number of discrete and continuous BAMDPs and show that the learned policy has consistently competitive performance against baseline approaches.
BayesDB  BayesDB is a probabilistic programming platform that enables users to query the probable implications of their data as directly as SQL databases enable them to query the data itself. The default modeling assumptions that BayesDB makes are suitable for a broad class of problems, but statisticians can customize these assumptions when necessary. BayesDB also enables domain experts that lack statistical expertise to perform qualitative model checking and encode simple forms of qualitative prior knowledge. BayesDB: A probabilistic programming system for querying the probable implications of data BayesDB 
BayesGrad  Recent advances in graph convolutional networks have significantly improved the performance of chemical predictions, raising a new research question: ‘how do we explain the predictions of graph convolutional networks?’ A possible approach to answer this question is to visualize evidence substructures responsible for the predictions. For chemical property prediction tasks, the sample size of the training data is often small and/or a label imbalance problem occurs, where a few samples belong to a single class and the majority of samples belong to the other classes. This can lead to uncertainty related to the learned parameters of the machine learning model. To address this uncertainty, we propose BayesGrad, utilizing the Bayesian predictive distribution, to define the importance of each node in an input graph, which is computed efficiently using the dropout technique. We demonstrate that BayesGrad successfully visualizes the substructures responsible for the label prediction in the artificial experiment, even when the sample size is small. Furthermore, we use a real dataset to evaluate the effectiveness of the visualization. The basic idea of BayesGrad is not limited to graph-structured data and can be applied to other data types.
Bayesian Action Decoder (BAD) 
When observing the actions of others, humans carry out inferences about why the others acted as they did, and what this implies about their view of the world. Humans also use the fact that their actions will be interpreted in this manner when observed by others, allowing them to act informatively and thereby communicate efficiently with others. Although learning algorithms have recently achieved superhuman performance in a number of two-player, zero-sum games, scalable multi-agent reinforcement learning algorithms that can discover effective strategies and conventions in complex, partially observable settings have proven elusive. We present the Bayesian action decoder (BAD), a new multi-agent learning method that uses an approximate Bayesian update to obtain a public belief that conditions on the actions taken by all agents in the environment. Together with the public belief, this Bayesian update effectively defines a new Markov decision process, the public belief MDP, in which the action space consists of deterministic partial policies, parameterised by deep neural networks, that can be sampled for a given public state. It exploits the fact that an agent acting only on this public belief state can still learn to use its private information if the action space is augmented to be over partial policies mapping private information into environment actions. The Bayesian update is also closely related to the theory of mind reasoning that humans carry out when observing others’ actions. We first validate BAD on a proof-of-principle two-step matrix game, where it outperforms traditional policy gradient methods. We then evaluate BAD on the challenging, cooperative partial-information card game Hanabi, where in the two-player setting the method surpasses all previously published learning and hand-coded approaches.
Bayesian Additive Regression Tree (BART) 
There is currently a dearth of appropriate methods to estimate the causal effects of multiple treatments when the outcome is binary. For such settings, we propose the use of nonparametric Bayesian modeling, Bayesian Additive Regression Trees (BART). We conduct an extensive simulation study to compare BART to several existing, propensity score-based methods and to identify its operating characteristics when estimating average treatment effects on the treated. BART consistently demonstrates low bias and mean-squared errors. We illustrate the use of BART through a comparative effectiveness analysis of a large dataset, drawn from the latest SEER-Medicare linkage, on patients who were operated on via robotic-assisted surgery, video-assisted thoracic surgery or open thoracotomy. BayesTree, BART
Bayesian Adjustment for Confounding (BAC) 

Bayesian Allocation Model (BAM) 
We introduce a dynamic generative model, Bayesian allocation model (BAM), which establishes explicit connections between nonnegative tensor factorization (NTF), graphical models of discrete probability distributions and their Bayesian extensions, and topic models such as latent Dirichlet allocation. BAM is based on a Poisson process, whose events are marked by using a Bayesian network, where the conditional probability tables of this network are then integrated out analytically. We show that the resulting marginal process turns out to be a Polya urn, an integer-valued self-reinforcing process. This urn process, which we name a Polya-Bayes process, obeys certain conditional independence properties that provide further insight about the nature of NTF. These insights also let us develop space efficient simulation algorithms that respect the potential sparsity of data: we propose a class of sequential importance sampling algorithms for computing NTF and approximating their marginal likelihood, which would be useful for model selection. The resulting methods can also be viewed as a model scoring method for topic models and discrete Bayesian networks with hidden variables. The new algorithms have favourable properties in the sparse data regime when contrasted with variational algorithms that become more accurate when the total sum of the elements of the observed tensor goes to infinity. We illustrate the performance on several examples and numerically study the behaviour of the algorithms for various data regimes.
Bayesian Anomaly Detection And Classification (BADAC) 
Statistical uncertainties are rarely incorporated in machine learning algorithms, especially for anomaly detection. Here we present the Bayesian Anomaly Detection And Classification (BADAC) formalism, which provides a unified statistical approach to classification and anomaly detection within a hierarchical Bayesian framework. BADAC deals with uncertainties by marginalising over the unknown, true value of the data. Using simulated data with Gaussian noise, BADAC is shown to be superior to standard algorithms in both classification and anomaly detection performance in the presence of uncertainties, though with significantly increased computational cost. Additionally, BADAC provides well-calibrated classification probabilities, valuable for use in scientific pipelines. We show that BADAC can work in online mode and is fairly robust to model errors, which can be diagnosed through model-selection methods. In addition it can perform unsupervised new class detection and can naturally be extended to search for anomalous subsets of data. BADAC is therefore ideal where computational cost is not a limiting factor and statistical rigour is important. We discuss approximations to speed up BADAC, such as the use of Gaussian processes, and finally introduce a new metric, the Rank-Weighted Score (RWS), that is particularly suited to evaluating the ability of algorithms to detect anomalies.
Bayesian Belief Network (BBN) 
A Bayesian belief network is a graphical representation of a probabilistic dependency model. It consists of a set of interconnected nodes, where each node represents a variable in the dependency model and the connecting arcs represent the causal relationships between these variables. Each node or variable may take one of a number of possible states or values. The belief in, or certainty of, each of these states is determined from the belief in each possible state of every node directly connected to it and its relationship with each of these nodes. The belief in each state of a node is updated whenever the belief in each state of any directly connected node changes. 
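The belief update described above can be illustrated with the smallest possible network: a single arc from a binary parent node to a binary child node. Observing the child changes the belief in the parent through Bayes' rule. The node names (rain, wet grass), the function name and all probabilities below are hypothetical values chosen for illustration.

```python
def posterior_parent(prior, p_child_given_parent, p_child_given_not_parent):
    """Belief in the parent being true after observing the child to be true."""
    joint_true = prior * p_child_given_parent
    joint_false = (1.0 - prior) * p_child_given_not_parent
    return joint_true / (joint_true + joint_false)


# Hypothetical network: Rain -> WetGrass.
p_rain = 0.2                 # prior belief in rain
p_wet_given_rain = 0.9       # conditional probability table entry
p_wet_given_dry = 0.1        # conditional probability table entry

belief = posterior_parent(p_rain, p_wet_given_rain, p_wet_given_dry)
print(round(belief, 4))  # 0.6923
```

Observing wet grass raises the belief in rain from 0.2 to about 0.69. Larger networks apply the same principle across many connected nodes, typically with dedicated inference algorithms rather than direct enumeration.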
Bayesian Bootstrap  Bootstrapping can be interpreted in a Bayesian framework using a scheme that creates new datasets through reweighting the initial data. http://…parametricbootstrapasabayesianmodel http://…/1176345338 
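A minimal sketch of the reweighting scheme, under the common formulation in which each replicate draws a weight vector from a flat Dirichlet distribution over the observations: the weights are generated here by normalizing independent exponential draws. The function names, the data values and the choice of the mean as the statistic are assumptions for illustration.

```python
import random


def flat_dirichlet_weights(n, rng):
    """Draw n non-negative weights summing to 1 from a flat Dirichlet."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def bayesian_bootstrap_mean(data, n_draws=2000, seed=0):
    """Return posterior draws of the mean obtained by reweighting the data."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_draws):
        weights = flat_dirichlet_weights(len(data), rng)
        means.append(sum(w * x for w, x in zip(weights, data)))
    return means


# Hypothetical observations.
data = [2.1, 3.4, 1.9, 5.0, 4.2]
draws = sorted(bayesian_bootstrap_mean(data))
# Approximate central 95% interval from the sorted posterior draws.
lower, upper = draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))]
print(round(lower, 2), round(upper, 2))
```

Unlike the classical bootstrap, which resamples observations with integer counts, every observation keeps a strictly positive continuous weight in each replicate.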
Bayesian Bridge  Bridge regression is a regularized regression in which the regression coefficient’s prior is an exponential power distribution. BayesBridge 
Bayesian Causal Effect Estimation (BCEE) 
Estimating causal exposure effects in observational studies ideally requires the analyst to have a vast knowledge of the domain of application. Investigators often bypass difficulties related to the identification and selection of confounders through the use of fully adjusted outcome regression models. However, since such models likely contain more covariates than required, the variance of the regression coefficient for exposure may be unnecessarily large. Instead of using a fully adjusted model, model selection can be attempted. Most classical statistical model selection approaches, such as Bayesian model averaging, do not readily address causal effect estimation. We present a novel model averaged approach to causal inference, the Bayesian Causal Effect Estimation (BCEE) algorithm, which is motivated by the graphical framework for causal inference. BCEE aims to unbiasedly estimate the causal effect of a continuous exposure on a continuous outcome while being more efficient than a fully adjusted model. 
Bayesian Causal Forest  This paper develops a semi-parametric Bayesian regression model for estimating heterogeneous treatment effects from observational data. Standard nonlinear regression models, which may work quite well for prediction, can yield badly biased estimates of treatment effects when fit to data with strong confounding. Our Bayesian causal forests model avoids this problem by directly incorporating an estimate of the propensity function in the specification of the response model, implicitly inducing a covariate-dependent prior on the regression function. This new parametrization also allows treatment heterogeneity to be regularized separately from the prognostic effect of control variables, making it possible to informatively ‘shrink to homogeneity’, in contrast to existing Bayesian non- and semi-parametric approaches. bcf
Bayesian Causal Inference (BCI) 
We address the problem of two-variable causal inference. This task is to infer an existing causal relation between two random variables, i.e. $X \rightarrow Y$ or $Y \rightarrow X$, from purely observational data. We briefly review a number of state-of-the-art methods for this, including very recent ones. A novel inference method is introduced, Bayesian Causal Inference (BCI), which assumes a generative Bayesian hierarchical model to pursue the strategy of Bayesian model selection. In the model, the distribution of the cause variable is given by a Poisson lognormal distribution, which allows discretization effects to be treated explicitly. We assume Fourier-diagonal field covariance operators. The assumed generative model provides synthetic causal data for benchmarking our model in comparison to existing state-of-the-art models, namely LiNGAM, ANM-HSIC, ANM-MML, IGCI and CGNN. We explore how well the above methods perform in the case of high-noise settings, strongly discretized data and very sparse data. BCI performs reliably in general with synthetic data as well as with the real-world TCEP benchmark set, with an accuracy comparable to state-of-the-art algorithms.
Bayesian Classifier Combination Neural Network  Machine learning research for developing countries can demonstrate clear sustainable impact by delivering actionable and timely information to in-country government organisations (GOs) and NGOs in response to their critical information requirements. We co-create products with UK and in-country commercial, GO and NGO partners to ensure the machine learning algorithms address appropriate user needs, whether for tactical decision making or evidence-based policy decisions. In one particular case, we developed and deployed a novel algorithm, BCCNet, to quickly process large quantities of unstructured data to prevent and respond to natural disasters. Crowdsourcing provides an efficient mechanism to generate labels from unstructured data to prime machine learning algorithms for large scale data analysis. However, these labels are often imperfect, with qualities varying among different citizen scientists, which prohibits their direct use with many state-of-the-art machine learning techniques. We describe BCCNet, a framework that simultaneously aggregates biased and contradictory labels from the crowd and trains an automatic classifier to process new data. Our case studies, mosquito sound detection for malaria prevention and damage detection for disaster response, show the efficacy of our method in the challenging context of developing world applications.
Bayesian Conditional Generative Adversarial Networks (BCGAN) 
Traditional GANs use a deterministic generator function (typically a neural network) to transform a random noise input $z$ to a sample $\mathbf{x}$ that the discriminator seeks to distinguish. We propose a new GAN called Bayesian Conditional Generative Adversarial Networks (BCGANs) that use a random generator function to transform a deterministic input $y'$ to a sample $\mathbf{x}$. Our BCGANs extend traditional GANs to a Bayesian framework, and naturally handle unsupervised learning, supervised learning, and semi-supervised learning problems. Experiments show that the proposed BCGANs outperform the state of the art.
Bayesian Constrained Generalised Linear Model  See Meyer et al. (2011) <doi:10.1080/10485252.2011.597852> for more details. bcgam 
Bayesian Convolutional Neural Network (BayesCNN) 
Artificial Neural Networks are connectionist systems that perform a given task by learning from examples without prior knowledge about the task. This is done by finding an optimal point estimate for the weights in every node. Generally, networks using point estimates as weights perform well with large datasets, but they fail to express uncertainty in regions with little or no data, leading to overconfident decisions. In this paper, a Bayesian Convolutional Neural Network (BayesCNN) using Variational Inference is proposed, which introduces a probability distribution over the weights. Furthermore, the proposed BayesCNN architecture is applied to tasks such as Image Classification, Image Super-Resolution and Generative Adversarial Networks. The results are compared to point-estimate-based architectures on the MNIST, CIFAR-10 and CIFAR-100 datasets for the Image Classification task, on the BSD300 dataset for the Image Super-Resolution task and on the CIFAR-10 dataset again for the Generative Adversarial Network task. BayesCNN is based on Bayes by Backprop, which derives a variational approximation to the true posterior. We therefore introduce the idea of applying two convolutional operations, one for the mean and one for the variance. Our proposed method not only achieves performance equivalent to frequentist inference in identical architectures but also incorporates a measurement of uncertainty and regularisation. It further eliminates the use of dropout in the model. Moreover, we predict how certain the model prediction is based on the epistemic and aleatoric uncertainties and empirically show how the uncertainty can decrease, allowing the decisions made by the network to become more deterministic as the training accuracy increases. Finally, we propose ways to prune the Bayesian architecture and to make it more efficient in computation and time. 
Bayesian Decision Theory  Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. It is considered the ideal case in which the probability structure underlying the categories is known perfectly. While this sort of situation rarely occurs in practice, it permits us to determine the optimal (Bayes) classifier against which we can compare all other classifiers. Moreover, in some problems it enables us to predict the error we will get when we generalize to novel patterns. This approach is based on quantifying the tradeoffs between various classification decisions using probability and the costs that accompany such decisions. It makes the assumption that the decision problem is posed in probabilistic terms, and that all of the relevant probability values are known. http://…/Bayesian_decision_theory http://…/bayesian.pdf 
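The trade-off between probabilities and costs described above can be illustrated with a minimal sketch. The function name `bayes_decision`, the two-class setup and all numbers are assumptions made for illustration; they do not come from the linked sources. The rule shown is the standard one: compute the posterior with Bayes' rule, then pick the action with the smallest expected loss.

```python
def bayes_decision(priors, likelihoods, loss):
    """Pick the action with minimum expected loss for one observation.

    priors[c]      : P(class c)
    likelihoods[c] : p(x | class c), evaluated at the observed x
    loss[a][c]     : cost of taking action a when the true class is c
    """
    # Bayes' rule: P(c | x) = p(x | c) P(c) / p(x)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)
    posterior = [j / evidence for j in joint]
    # Conditional risk R(a | x) = sum_c loss[a][c] * P(c | x)
    risks = [sum(l_ac * p for l_ac, p in zip(row, posterior)) for row in loss]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks, posterior


# Hypothetical numbers: class 1 is rare but costly to miss.
priors = [0.9, 0.1]
likelihoods = [0.2, 0.6]
zero_one = [[0, 1], [1, 0]]      # every error costs the same
asymmetric = [[0, 10], [1, 0]]   # missing class 1 costs ten times more

print(bayes_decision(priors, likelihoods, zero_one)[0])    # action 0
print(bayes_decision(priors, likelihoods, asymmetric)[0])  # action 1
```

With equal costs the rule reduces to choosing the most probable class; changing only the loss matrix flips the decision even though the posterior is unchanged.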
Bayesian Distance Clustering  Model-based clustering is widely used in a variety of application areas. However, fundamental concerns remain about robustness. In particular, results can be sensitive to the choice of kernel representing the within-cluster data density. Leveraging properties of pairwise differences between data points, we propose a class of Bayesian distance clustering methods, which rely on modeling the likelihood of the pairwise distances in place of the original data. Although some information in the data is discarded, we gain substantial robustness to modeling assumptions. The proposed approach represents an appealing middle ground between distance-based and model-based clustering, drawing advantages from each of these canonical approaches. We illustrate dramatic gains in the ability to infer clusters that are not well represented by the usual choices of kernel. A simulation study is included to assess performance relative to competitors, and we apply the approach to clustering of brain genome expression data. Keywords: Distance-based clustering; Mixture model; Model-based clustering; Model misspecification; Pairwise distance matrix; Partial likelihood; Robustness. 
Bayesian Dynamic Factor Analysis (DFA) 
bayesdfa 
Bayesian Estimation  
Bayesian Exponential Random Graph Models  Bergm 
Bayesian Generative Active Deep Learning  Deep learning models have demonstrated outstanding performance in several problems, but their training process tends to require immense amounts of computational and human resources for training and labeling, constraining the types of problems that can be tackled. Therefore, the design of effective training methods that require small labeled training sets is an important research direction that will allow a more effective use of resources. Among current approaches designed to address this issue, two are particularly interesting: data augmentation and active learning. Data augmentation achieves this goal by artificially generating new training points, while active learning relies on the selection of the ‘most informative’ subset of unlabeled training samples to be labeled by an oracle. Although successful in practice, data augmentation can waste computational resources because it indiscriminately generates samples that are not guaranteed to be informative, and active learning selects a small subset of informative samples (from a large unannotated set) that may be insufficient for the training process. In this paper, we propose a Bayesian generative active deep learning approach that combines active learning with data augmentation. We provide theoretical and empirical evidence (MNIST, CIFAR-$\{10,100\}$, and SVHN) that our approach has more efficient training and better classification results than data augmentation and active learning. 
Bayesian Gradient Descent  We suggest a novel approach for the estimation of the posterior distribution of the weights of a neural network, using an online version of the variational Bayes method. Having a confidence measure of the weights allows us to combat several shortcomings of neural networks, such as their parameter redundancy and their notorious vulnerability to changes in the input distribution (‘catastrophic forgetting’). Specifically, we show that this approach helps alleviate the catastrophic forgetting phenomenon, even without knowledge of when the tasks are switched. Furthermore, it improves the robustness of the network to weight pruning, even without retraining. 
Bayesian Graph Convolutional Neural Network (Bayesian GCNN) 
Recently, techniques for applying convolutional neural networks to graph-structured data have emerged. Graph convolutional neural networks (GCNNs) have been used to address node and graph classification and matrix completion. Although the performance has been impressive, the current implementations have limited capability to incorporate uncertainty in the graph structure. Almost all GCNNs process a graph as though it is a ground-truth depiction of the relationship between nodes, but often the graphs employed in applications are themselves derived from noisy data or modelling assumptions. Spurious edges may be included; other edges may be missing between nodes that have very strong relationships. In this paper we adopt a Bayesian approach, viewing the observed graph as a realization from a parametric family of random graphs. We then target inference of the joint posterior of the random graph parameters and the node (or graph) labels. We present the Bayesian GCNN framework and develop an iterative learning procedure for the case of assortative mixed-membership stochastic block models. We present the results of experiments that demonstrate that the Bayesian formulation can provide better performance when there are very few labels available during the training process. 
Bayesian Heatmap  Unstructured data from diverse sources, such as social media and aerial imagery, can provide valuable up-to-date information for intelligent situation assessment. Mining these different information sources could bring major benefits to applications such as situation awareness in disaster zones and mapping the spread of diseases. Such applications depend on classifying the situation across a region of interest, which can be depicted as a spatial ‘heatmap’. Annotating unstructured data using crowdsourcing or automated classifiers produces individual classifications at sparse locations that typically contain many errors. We propose a novel Bayesian approach that models the relevance, error rates and bias of each information source, enabling us to learn a spatial Gaussian Process classifier by aggregating data from multiple sources with varying reliability and relevance. Our method does not require gold-labelled data and can make predictions at any location in an area of interest given only sparse observations. We show empirically that our approach can handle noisy and biased data sources, and that simultaneously inferring reliability and transferring information between neighbouring reports leads to more accurate predictions. We demonstrate our method on two real-world problems from disaster response, showing how our approach reduces the amount of crowdsourced data required and can be used to generate valuable heatmap visualisations from SMS messages and satellite images. 
Bayesian Hierarchical Matrix Factorization (BHMF) 
BHPMF 
Bayesian Highest Posterior Density (HPD) 

Bayesian Hypergraph  We propose a directed acyclic hypergraph framework for a probabilistic graphical model that we call Bayesian hypergraphs. The space of directed acyclic hypergraphs is much larger than the space of chain graphs. Hence Bayesian hypergraphs can model much finer factorizations than Bayesian networks or LWF chain graphs and provide simpler and more computationally efficient procedures for factorizations and interventions. Bayesian hypergraphs also allow a modeler to represent causal patterns of interaction such as Noisy-OR graphically (without additional annotations). We introduce global, local and pairwise Markov properties of Bayesian hypergraphs and prove under which conditions they are equivalent. We define a projection operator, called shadow, that maps Bayesian hypergraphs to chain graphs, and show that the Markov properties of a Bayesian hypergraph are equivalent to those of its corresponding chain graph. We extend the causal interpretation of LWF chain graphs to Bayesian hypergraphs and provide corresponding formulas and a graphical criterion for intervention. 
Bayesian Hypernetworks  We propose Bayesian hypernetworks: a framework for approximate Bayesian inference in neural networks. A Bayesian hypernetwork, $h$, is a neural network which learns to transform a simple noise distribution, $p(\epsilon) = \mathcal{N}(0,I)$, to a distribution $q(\theta) \doteq q(h(\epsilon))$ over the parameters $\theta$ of another neural network (the ‘primary network’). We train $q$ with variational inference, using an invertible $h$ to enable efficient estimation of the variational lower bound on the posterior $p(\theta \mid \mathcal{D})$ via sampling. In contrast to most methods for Bayesian deep learning, Bayesian hypernets can represent a complex multimodal approximate posterior with correlations between parameters, while enabling cheap i.i.d. sampling of $q(\theta)$. We demonstrate these qualitative advantages of Bayesian hypernets, which also achieve competitive performance on a suite of tasks that demonstrate the advantage of estimating model uncertainty, including active learning and anomaly detection. 
Bayesian Information Criterion (BIC) 
This paper considers a structural-factor approach to modeling high-dimensional time series where individual series are decomposed into trend, seasonal, and irregular components. For ease in analyzing many time series, we employ a time polynomial for the trend, a linear combination of trigonometric series for the seasonal component, and a new factor model for the irregular components. The new factor model can simplify the modeling process and achieve parsimony in parameterization. We propose a Bayesian Information Criterion (BIC) to consistently determine the order of the polynomial trend and the number of trigonometric functions. A test statistic is used to determine the number of common factors. The convergence rates for the estimators of the trend and seasonal components and the limiting distribution of the test statistic are established under the setting that the number of time series tends to infinity with the sample size, but at a slower rate. We use simulation to study the performance of the proposed analysis in finite samples and apply the proposed approach to two real examples. The first example considers modeling weekly PM$_{2.5}$ data of 15 monitoring stations in the southern region of Taiwan and the second example consists of monthly value-weighted returns of 12 industrial portfolios. 
Bayesian Inverse Hierarchical RL (BIHRL) 
We introduce a new generative model for human planning under the Bayesian Inverse Reinforcement Learning (BIRL) framework which takes into account the fact that humans often plan using hierarchical strategies. We describe the Bayesian Inverse Hierarchical RL (BIHRL) algorithm for inferring the values of hierarchical planners, and use an illustrative toy model to show that BIHRL retains accuracy where standard BIRL fails. Furthermore, BIHRL is able to accurately predict the goals of ‘Wikispeedia’ game players, with inclusion of hierarchical structure in the model resulting in a large boost in accuracy. We show that BIHRL is able to significantly outperform BIRL even when we only have a weak prior on the hierarchical structure of the plans available to the agent, and discuss the significant challenges that remain for scaling up this framework to more realistic settings. 
Bayesian Inversion  Deep Bayesian Inversion 
Bayesian Joint Matrix Decomposition (BJMD) 
Matrix decomposition is a popular and fundamental approach in machine learning and data mining. It has been successfully applied in various fields. Most matrix decomposition methods focus on decomposing a data matrix from one single source. However, it is common that data come from different sources with heterogeneous noise. A few matrix decomposition methods have been extended for such multi-view data integration and pattern discovery, but only a few were designed to explicitly consider the heterogeneity of noise in such multi-view data. To this end, we propose a joint matrix decomposition framework (BJMD), which models the heterogeneity of noise by Gaussian distributions in a Bayesian framework. We develop two algorithms to solve this model: one is a variational Bayesian inference algorithm, which makes full use of the posterior distribution; the other is a maximum a posteriori algorithm, which is more scalable and can be easily parallelized. Extensive experiments on synthetic and real-world datasets demonstrate that BJMD, by considering the heterogeneity of noise, is superior or competitive to the state-of-the-art methods. 
Bayesian Kernel Machine Regression – Causal Mediation Analysis (BKMR-CMA) 
Exposure to complex mixtures is a real-world scenario. As such, it is important to understand the mechanisms through which a mixture operates in order to reduce the burden of disease. Currently, there are few methods in the causal mediation analysis literature to estimate the direct and indirect effects of an exposure mixture on an outcome operating through an intermediate (mediator) variable. This paper presents new statistical methodology to estimate the natural direct effect (NDE), natural indirect effect (NIE), and controlled direct effects (CDEs) of a potentially complex mixture exposure on an outcome through a mediator variable. We implement Bayesian kernel machine regression (BKMR) to allow for all possible interactions and nonlinear effects of the co-exposures on the mediator, and the co-exposures and mediator on the outcome. From the posterior predictive distributions of the mediator and the outcome, we simulate counterfactual outcomes to obtain posterior samples, estimates, and credible intervals (CI) of the NDE, NIE, and CDE. We perform a simulation study that shows that when the exposure-mediator and exposure-mediator-outcome relationships are complex, our proposed Bayesian kernel machine regression – causal mediation analysis (BKMR-CMA) performs better than current mediation methods. We apply our methodology to quantify the contribution of birth length as a mediator between in utero co-exposure to arsenic, manganese and lead, and children’s neurodevelopment, in a prospective birth cohort in rural Bangladesh. We found a negative association between co-exposure to lead, arsenic, and manganese and neurodevelopment, a negative association between exposure to this metal mixture and birth length, and evidence that birth length mediates the effect of co-exposure to lead, arsenic, and manganese on children’s neurodevelopment. 
Bayesian Kernel Projection Classifier (BKPC) 
Bayesian kernel projection classifier is a nonlinear multicategory classifier which performs the classification of the projections of the data to the principal axes of the feature space. A Gibbs sampler is implemented to find the posterior distributions of the parameters. BKPC 
Bayesian Latent Class Analysis  ➘ “Latent Class Analysis” 
Bayesian Latent Space Model (BLSM) 
BLSM 
Bayesian Layer  We describe Bayesian Layers, a module designed for fast experimentation with neural network uncertainty. It extends neural network libraries with layers capturing uncertainty over weights (Bayesian neural nets), pre-activation units (dropout), activations (‘stochastic output layers’), and the function itself (Gaussian processes). With reversible layers, one can also propagate uncertainty from input to output such as for flow-based distributions and constant-memory backpropagation. Bayesian Layers are a drop-in replacement for other layers, maintaining core features that one typically desires for experimentation. As a demonstration, we fit a 10-billion-parameter ‘Bayesian Transformer’ on 512 TPUv2 cores, which replaces attention layers with their Bayesian counterpart. 
BAyesian Least Squares Optimization with Nonnegative L1-norm constraint (BALSON) 
A Bayesian approach termed BAyesian Least Squares Optimization with Nonnegative L1-norm constraint (BALSON) is proposed. The error distribution of data fitting is described by a Gaussian likelihood. The parameter distribution is assumed to be a Dirichlet distribution. With the Bayes rule, searching for the optimal parameters is equivalent to finding the mode of the posterior distribution. In order to explicitly characterize the nonnegative L1-norm constraint of the parameters, we further approximate the true posterior distribution by a Dirichlet distribution. We estimate the statistics of the approximating Dirichlet posterior distribution by sampling methods. Four sampling methods have been introduced. With the estimated posterior distributions, the original parameters can be effectively reconstructed in polynomial fitting problems, and the BALSON framework is found to perform better than conventional methods. 
Bayesian Least-Squares Policy Iteration (BLSPI) 
We introduce Bayesian least-squares policy iteration (BLSPI), an off-policy, model-free, policy iteration algorithm that uses the Bayesian least-squares temporal-difference (BLSTD) learning algorithm to evaluate policies. An online variant of BLSPI is also proposed, called randomised BLSPI (RBLSPI), which improves its policy based on an incomplete policy evaluation step. In the online setting, the exploration-exploitation dilemma must be addressed, as we try to discover the optimal policy using samples collected by ourselves. RBLSPI exploits the ability of BLSTD to quantify our uncertainty about the value function. Inspired by Thompson sampling, RBLSPI first samples a value function from a posterior distribution over value functions, and then selects actions based on the sampled value function. The effectiveness and the exploration abilities of RBLSPI are demonstrated experimentally in several environments. 
Bayesian Linear Regression  In statistics, Bayesian linear regression is an approach to linear regression in which the statistical analysis is undertaken within the context of Bayesian inference. When the regression model has errors that have a normal distribution, and if a particular form of prior distribution is assumed, explicit results are available for the posterior probability distributions of the model’s parameters. 
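As a worked illustration of the "explicit results" mentioned above, here is a minimal sketch for the simplest case: a single slope $w$ with no intercept, Gaussian noise of known variance, and a Gaussian prior on $w$. The function name `posterior_slope` and the data are assumptions for illustration only. Under these assumptions the posterior is again Gaussian, with precision $1/s_0^2 + \sum x_i^2/\sigma^2$ and mean $(m_0/s_0^2 + \sum x_i y_i/\sigma^2)$ divided by that precision.

```python
def posterior_slope(xs, ys, noise_var, prior_mean=0.0, prior_var=1.0):
    """Posterior of w in y = w * x + e, e ~ N(0, noise_var), w ~ N(prior_mean, prior_var).

    Returns (posterior_mean, posterior_variance). Conjugacy makes both closed-form.
    """
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Precisions (inverse variances) of prior and likelihood add up.
    precision = 1.0 / prior_var + sum_xx / noise_var
    # Posterior mean is the precision-weighted combination of prior and data.
    mean = (prior_mean / prior_var + sum_xy / noise_var) / precision
    return mean, 1.0 / precision


# Hypothetical data lying exactly on y = 2x.
mean, var = posterior_slope([1, 2, 3], [2, 4, 6], noise_var=1.0)
print(mean, var)  # mean is pulled from 2 towards the prior mean 0
```

With a very wide prior (large `prior_var`) the posterior mean approaches the ordinary least-squares estimate; with a tight prior it stays near `prior_mean`.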
Bayesian Masking  A common strategy for sparse linear regression is to introduce regularization, which eliminates irrelevant features by setting the corresponding weights to zero. Regularization, however, often shrinks the estimator for relevant features, which leads to incorrect feature selection. Motivated by this issue, we propose Bayesian masking (BM), a sparse estimation method which imposes no regularization on the weights. The key concept of BM is to introduce binary latent variables that randomly mask features. Estimating the masking rates determines the relevance of the features automatically. We derive a variational Bayesian inference algorithm that maximizes a lower bound of the factorized information criterion (FIC), which is a recently developed asymptotic evaluation of the marginal log-likelihood. We also propose a reparametrization that accelerates the convergence. We demonstrate that BM outperforms Lasso and automatic relevance determination (ARD) in terms of the sparsity-shrinkage tradeoff. 
Bayesian Meta-network Architecture Learning  For deep neural networks, the particular structure often plays a vital role in achieving state-of-the-art performance in many practical applications. However, existing architecture search methods can only learn the architecture for a single task at a time. In this paper, we first propose a Bayesian inference view of architecture learning and use this novel view to derive a variational inference method to learn the architecture of a meta-network, which will be shared across multiple tasks. To account for the task distribution in the posterior distribution of the architecture and its corresponding weights, we exploit the optimization embedding technique to design the parameterization of the posterior. Our method finds architectures which achieve state-of-the-art performance on the few-shot learning problem and demonstrates the advantages of meta-network learning for both architecture search and meta-learning. 
Bayesian Minimum Expected Loss  Central to many inferential situations is the estimation of rational functions of parameters. The mainstream in statistics and econometrics estimates these quantities based on the plug-in approach without consideration of the main objective of the inferential situation. We propose the Bayesian Minimum Expected Loss (MELO) approach focusing explicitly on the function of interest, and calculating its frequentist variability. Asymptotic properties of the MELO estimator are similar to the plug-in approach. Nevertheless, simulation exercises show that our proposal is better in situations characterized by small sample sizes and noisy models. In addition, we observe in the applications that our approach gives lower standard errors than frequently used alternatives when datasets are not very informative. 
Bayesian Model Averaging  Bayesian Model Averaging is a technique designed to help account for the uncertainty inherent in the model selection process, something which traditional statistical analysis often neglects. By averaging over many different competing models, BMA incorporates model uncertainty into conclusions about parameters and prediction. BMA has been applied successfully to many statistical model classes including linear regression, generalized linear models, Cox regression models, and discrete graphical models, in all cases improving predictive performance. 
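The averaging step can be sketched as follows. This is a minimal illustration, not a full BMA implementation: the function names are hypothetical, and each model's log marginal likelihood (log evidence) is assumed to be already available. Posterior model probabilities are proportional to prior probability times evidence, and the BMA prediction is the probability-weighted average of the per-model predictions.

```python
import math


def bma_weights(log_evidences, prior_probs=None):
    """Posterior model probabilities from log marginal likelihoods."""
    k = len(log_evidences)
    if prior_probs is None:
        prior_probs = [1.0 / k] * k  # uniform prior over models
    log_post = [le + math.log(p) for le, p in zip(log_evidences, prior_probs)]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_post)
    unnorm = [math.exp(lp - m) for lp in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def bma_predict(predictions, weights):
    """Model-averaged point prediction."""
    return sum(w * p for w, p in zip(weights, predictions))


# Hypothetical example: the second model has three times the evidence of the first.
weights = bma_weights([0.0, math.log(3.0)])
print(weights)                           # approximately [0.25, 0.75]
print(bma_predict([1.0, 2.0], weights))  # approximately 1.75
```

The averaged prediction sits between the two models' predictions, closer to the better-supported one, which is how model uncertainty enters the final answer.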
Bayesian Model-Agnostic Meta-Learning  Learning to infer a Bayesian posterior from a few-shot dataset is an important step towards robust meta-learning due to the model uncertainty inherent in the problem. In this paper, we propose a novel Bayesian model-agnostic meta-learning method. The proposed method combines scalable gradient-based meta-learning with nonparametric variational inference in a principled probabilistic framework. During fast adaptation, the method is capable of learning complex uncertainty structure beyond a point estimate or a simple Gaussian approximation. In addition, a robust Bayesian meta-update mechanism with a new meta-loss prevents overfitting during meta-update. Remaining an efficient gradient-based meta-learner, the method is also model-agnostic and simple to implement. Experimental results show the accuracy and robustness of the proposed method in various tasks: sinusoidal regression, image classification, active learning, and reinforcement learning. 
Bayesian Network (BN) 
A Bayesian network, Bayes network, belief network, Bayes(ian) model or probabilistic directed acyclic graphical model is a probabilistic graphical model (a type of statistical model) that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. 
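The disease and symptom example can be made concrete with a small sketch. The network structure (Flu → Fever, Flu → Cough, Cold → Cough) and every probability below are invented for illustration. The joint distribution factorizes along the DAG, and the posterior of a disease given symptoms is obtained by summing the joint over the unobserved variables (inference by enumeration, which is only practical for tiny networks).

```python
from itertools import product

# Hypothetical conditional probability tables.
P_FLU = 0.1
P_COLD = 0.2
P_FEVER_GIVEN_FLU = {True: 0.9, False: 0.05}
P_COUGH_GIVEN = {  # keyed by (flu, cold)
    (True, True): 0.95, (True, False): 0.8,
    (False, True): 0.6, (False, False): 0.1,
}


def joint(flu, cold, fever, cough):
    """P(flu, cold, fever, cough) = P(flu) P(cold) P(fever|flu) P(cough|flu,cold)."""
    p = (P_FLU if flu else 1 - P_FLU) * (P_COLD if cold else 1 - P_COLD)
    pf = P_FEVER_GIVEN_FLU[flu]
    p *= pf if fever else 1 - pf
    pc = P_COUGH_GIVEN[(flu, cold)]
    p *= pc if cough else 1 - pc
    return p


def posterior_flu(fever, cough):
    """P(flu = True | fever, cough), marginalizing over the unobserved 'cold'."""
    num = sum(joint(True, c, fever, cough) for c in (True, False))
    den = sum(joint(f, c, fever, cough)
              for f, c in product((True, False), repeat=2))
    return num / den


print(posterior_flu(fever=True, cough=True))  # far above the 0.1 prior
```

Observing both symptoms raises the probability of flu from its prior of 0.1 to roughly 0.89 under these made-up tables.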
Bayesian Neural Network (BNN) 
This paper describes and discusses Bayesian Neural Networks (BNNs). The paper showcases a few different applications of them for classification and regression problems. BNNs are comprised of a Probabilistic Model and a Neural Network. The intent of such a design is to combine the strengths of Neural Networks and Stochastic modeling. Neural Networks exhibit continuous function approximator capabilities. Stochastic models allow direct specification of a model with known interaction between parameters to generate data. During the prediction phase, stochastic models generate a complete posterior distribution and produce probabilistic guarantees on the predictions. Thus BNNs are a unique combination of neural network and stochastic models, with the stochastic model forming the core of this integration. BNNs can then produce probabilistic guarantees on their predictions and also generate the distribution of the parameters that they have learnt from the observations. That means, in the parameter space, one can deduce the nature and shape of the neural network’s learnt parameters. These two characteristics make them highly attractive to theoreticians as well as practitioners. Recently there has been a lot of activity in this area, with the advent of numerous probabilistic programming libraries such as PyMC3, Edward and Stan. Further, this area is rapidly gaining ground as a standard machine learning approach for numerous problems. 
Bayesian Nonparametric Model  A Bayesian nonparametric model is a Bayesian model on an infinite-dimensional parameter space. The parameter space is typically chosen as the set of all possible solutions for a given learning problem. For example, in a regression problem the parameter space can be the set of continuous functions, and in a density estimation problem the space can consist of all densities. A Bayesian nonparametric model uses only a finite subset of the available parameter dimensions to explain a finite sample of observations, with the set of dimensions chosen depending on the sample, such that the effective complexity of the model (as measured by the number of dimensions used) adapts to the data. Classical adaptive problems, such as nonparametric estimation and model selection, can thus be formulated as Bayesian inference problems. Popular examples of Bayesian nonparametric models include Gaussian process regression, in which the correlation structure is refined with growing sample size, and Dirichlet process mixture models for clustering, which adapt the number of clusters to the complexity of the data. Bayesian nonparametric models have recently been applied to a variety of machine learning problems, including regression, classification, clustering, latent variable modeling, sequential modeling, image segmentation, source separation and grammar induction. 
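To illustrate how the number of clusters can adapt to the sample, the sketch below draws a random partition from the Chinese restaurant process, the distribution over partitions induced by a Dirichlet process. It only samples from the prior (no data likelihood is involved), and the function name, the concentration parameter `alpha` and the seeds are illustrative assumptions.

```python
import random


def crp_partition(n, alpha, seed=0):
    """Sample cluster sizes for n items from a Chinese restaurant process.

    Item i (0-based) joins an existing cluster k with probability
    size_k / (i + alpha) and opens a new cluster with probability
    alpha / (i + alpha).
    """
    rng = random.Random(seed)
    sizes = []
    for i in range(n):
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, size in enumerate(sizes):
            acc += size
            if r < acc:
                sizes[k] += 1  # join existing cluster k
                break
        else:
            sizes.append(1)    # open a new cluster
    return sizes


# The number of clusters is not fixed in advance; it tends to grow slowly with n.
for n in (10, 100, 1000):
    print(n, len(crp_partition(n, alpha=1.0)))
```

No cluster count is specified anywhere; larger samples simply tend to occupy more of the unbounded set of potential clusters, which is the sense in which model complexity adapts to the data.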
Bayesian Nonparametric Principal Component Analysis (BNP-PCA) 
Principal component analysis (PCA) is a very popular method for dimension reduction. The selection of the number of significant components is essential but often based on practical heuristics that depend on the application. Only a few works have proposed a probabilistic approach able to infer the number of significant components. To this end, this paper introduces a Bayesian nonparametric principal component analysis (BNP-PCA). The proposed model projects observations onto a random orthogonal basis which is assigned a prior distribution defined on the Stiefel manifold. The prior on factor scores involves an Indian buffet process to model the uncertainty related to the number of components. The parameters of interest as well as the nuisance parameters are finally inferred within a fully Bayesian framework via Monte Carlo sampling. A study of the (in)consistency of the marginal maximum a posteriori estimator of the latent dimension is carried out. A new estimator of the subspace dimension is proposed. Moreover, for the sake of statistical significance, a Kolmogorov-Smirnov test based on the posterior distribution of the principal components is used to refine this estimate. The behaviour of the algorithm is first studied on various synthetic examples. Finally, the proposed BNP dimension reduction approach is shown to be easily yet efficiently coupled with clustering or latent factor models within a unique framework. 
Bayesian Optimisation Algorithm (BOA) 

Bayesian Optimization  Bayesian optimization is a sequential design strategy for global optimization of black-box functions that does not require derivatives. Since the objective function is unknown, the Bayesian strategy is to treat it as a random function and place a prior over it. The prior captures our beliefs about the behaviour of the function. After gathering the function evaluations, which are treated as data, the prior is updated to form the posterior distribution over the objective function. The posterior distribution, in turn, is used to construct an acquisition function (often also referred to as infill sampling criteria) that determines what the next query point should be. 
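The loop described above (prior, evaluations, posterior, acquisition, next query) can be sketched in a few dozen lines. This is a toy under stated assumptions, not a reference implementation: it uses a zero-mean Gaussian process prior with an RBF kernel, an upper-confidence-bound acquisition, a finite candidate grid, and hand-written Cholesky routines so that it needs no external libraries. All names, the length scale, `kappa` and the jitter are arbitrary illustrative choices.

```python
import math


def rbf(a, b, length=0.3):
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            # Clamp tiny or negative pivots caused by round-off.
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution for L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def solve_upper_t(L, y):
    """Back substitution for L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, ys))
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(rbf(x_new, x_new) - sum(t * t for t in v), 0.0)
    return mean, var


def bayes_opt(f, grid, n_iter=10, kappa=2.0):
    """Maximise f over a finite grid, starting from the two end points."""
    xs = [grid[0], grid[-1]]
    ys = [f(x) for x in xs]
    for _ in range(n_iter):
        def ucb(x):
            m, v = gp_posterior(xs, ys, x)
            return m + kappa * math.sqrt(v)  # optimism under uncertainty
        candidates = [x for x in grid if x not in xs]
        x_next = max(candidates, key=ucb)    # acquisition picks the next query
        xs.append(x_next)
        ys.append(f(x_next))
    best = max(range(len(xs)), key=ys.__getitem__)
    return xs[best], ys[best]


# Hypothetical black-box objective with its maximum at x = 0.3.
grid = [i / 20 for i in range(21)]
print(bayes_opt(lambda x: -(x - 0.3) ** 2, grid))
```

The acquisition function trades off the posterior mean (exploitation) against the posterior standard deviation (exploration); other criteria such as expected improvement slot into the same loop by replacing `ucb`.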
Bayesian Optimization with Cylindrical Kernels (BOCK) 
A major challenge in Bayesian Optimization is the boundary issue (Swersky, 2017) where an algorithm spends too many evaluations near the boundary of its search space. In this paper, we propose BOCK, Bayesian Optimization with Cylindrical Kernels, whose basic idea is to transform the ball geometry of the search space using a cylindrical transformation. Because of the transformed geometry, the Gaussian Process-based surrogate model spends less budget searching near the boundary, while concentrating its efforts relatively more near the center of the search region, where we expect the solution to be located. We evaluate BOCK extensively, showing that it is not only more accurate and efficient, but also scales successfully to problems with a dimensionality as high as 500. We show that the better accuracy and scalability of BOCK even allow optimizing modestly sized neural network layers, as well as neural network hyperparameters. 
Bayesian Optimized Continual Learning (BOCL) 
Though neural networks have achieved much progress in various applications, it is still highly challenging for them to learn from a continuous stream of tasks without forgetting. Continual learning, a new learning paradigm, aims to solve this issue. In this work, we propose a new model for continual learning, called Bayesian Optimized Continual Learning with Attention Mechanism (BOCL), that dynamically expands the network capacity upon the arrival of new tasks by Bayesian optimization and selectively utilizes previous knowledge (e.g. feature maps of previous tasks) via an attention mechanism. Our experiments on variants of MNIST and CIFAR-100 demonstrate that our methods outperform the state of the art in preventing catastrophic forgetting and fit new tasks better. 
Bayesian Passive-Aggressive Learning (BayesPA) 
Online Passive-Aggressive (PA) learning is an effective framework for performing max-margin online learning. But the deterministic formulation and estimated single large-margin model could limit its capability in discovering descriptive structures underlying complex data. This paper presents online Bayesian Passive-Aggressive (BayesPA) learning, which subsumes the online PA and extends naturally to incorporate latent variables and perform nonparametric Bayesian inference, thus providing great flexibility for explorative analysis. We apply BayesPA to topic modeling and derive efficient online learning algorithms for max-margin topic models. We further develop nonparametric methods to resolve the number of topics. Experimental results on real datasets show that our approaches significantly improve time efficiency while maintaining comparable results with the batch counterparts. 
Bayesian Patchwork  Doctors often rely on their past experience in order to diagnose patients. For a doctor with enough experience, almost every patient would have similarities to key cases seen in the past, and each new patient could be viewed as a mixture of these key past cases. Because doctors often tend to reason this way, an efficient computationally aided diagnostic tool that thinks in the same way might be helpful in locating key past cases of interest that could assist with diagnosis. This article develops a novel mathematical model to mimic the type of logical thinking that physicians use when considering past cases. The proposed model can also provide physicians with explanations that would be similar to the way they would naturally reason about cases. The proposed method is designed to yield predictive accuracy, computational efficiency, and insight into medical data. The key element is the insight into medical data: in some sense we are automating a complicated process that physicians might perform manually. Finally, we applied this work to two publicly available healthcare datasets, for heart disease prediction and breast cancer prediction. 
Bayesian Programming  Bayesian programming is a formalism and a methodology to specify probabilistic models and solve problems when less than the necessary information is available. Edwin T. Jaynes proposed that probability could be considered as an alternative and an extension of logic for rational reasoning with incomplete and uncertain information. In his founding book Probability Theory: The Logic of Science he developed this theory and proposed what he called ‘the robot,’ which was not a physical device, but an inference engine to automate probabilistic reasoning – a kind of Prolog for probability instead of logic. Bayesian programming is a formal and concrete implementation of this ‘robot’. Bayesian programming may also be seen as an algebraic formalism to specify graphical models such as, for instance, Bayesian networks, dynamic Bayesian networks, Kalman filters or hidden Markov models. Indeed, Bayesian Programming is more general than Bayesian networks and has a power of expression equivalent to probabilistic factor graphs. 
Bayesian Robust Principal Component Regression  Principal component regression is a linear regression model with principal components as regressors. This type of modelling is particularly useful for prediction in settings with high-dimensional covariates. Surprisingly, the existing literature on Bayesian approaches is relatively sparse. In this paper, we aim at filling some gaps through the following practical contribution: we introduce a Bayesian approach with detailed guidelines for a straightforward implementation. The approach features two characteristics that we believe are important. First, it effectively involves the relevant principal components in the prediction process. This is achieved in two steps. The first one is model selection; the second one is to average out the predictions obtained from the selected models according to model averaging mechanisms, which accounts for model uncertainty. The model posterior probabilities are required for model selection and model averaging. For this purpose, we include a procedure leading to an efficient reversible jump algorithm. The second characteristic of our approach is whole robustness, meaning that the impact of outliers on inference gradually vanishes as they approach plus or minus infinity. The conclusions obtained are consequently consistent with the majority of observations (the bulk of the data). 
Bayesian Tensor Factorization Linked to External Data (BaTFLED) 
The vast majority of current machine learning algorithms are designed to predict single responses or a vector of responses, yet many types of response are more naturally organized as matrices or higher-order tensor objects where characteristics are shared across modes. We present a new machine learning algorithm BaTFLED (Bayesian Tensor Factorization Linked to External Data) that predicts values in a three-dimensional response tensor using input features for each of the dimensions. BaTFLED uses a probabilistic Bayesian framework to learn projection matrices mapping input features for each mode into latent representations that multiply to form the response tensor. By utilizing a Tucker decomposition, the model can capture weights for interactions between latent factors for each mode in a small core tensor. Priors that encourage sparsity in the projection matrices and core tensor allow for feature selection and model regularization. This method is shown to far outperform elastic net and neural net models on ‘cold start’ tasks from data simulated in a three-mode structure. Additionally, we apply the model to predict dose-response curves in a panel of breast cancer cell lines treated with drug compounds that was used as a Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge. 
Bayesian Weighted Mendelian Randomization (BWMR) 
The results from Genome-Wide Association Studies (GWAS) on thousands of phenotypes provide an unprecedented opportunity to infer the causal effect of one phenotype (exposure) on another (outcome). Mendelian randomization (MR), an instrumental variable (IV) method, has been introduced for causal inference using GWAS data. Due to the polygenic architecture of complex traits/diseases and the ubiquity of pleiotropy, however, MR has many unique challenges compared to conventional IV methods. In this paper, we propose a Bayesian weighted Mendelian randomization (BWMR) for causal inference to address these challenges. In our BWMR model, the uncertainty of weak effects owing to polygenicity has been taken into account and the violation of the IV assumption due to pleiotropy has been addressed through outlier detection by Bayesian weighting. To make the causal inference based on BWMR computationally stable and efficient, we developed a variational expectation-maximization (VEM) algorithm. Moreover, we have also derived an exact closed-form formula to correct the posterior covariance, which is well known to be underestimated in variational inference. Through comprehensive simulation studies, we evaluated the performance of BWMR, demonstrating the advantage of BWMR over its competitors. Then we applied BWMR to 10,762 pairs of exposure and outcome traits from 54 GWAS, uncovering novel causal relationships between exposure and outcome traits. The BWMR package is available at https://…/BWMR. 
Bayesian Ying-Yang Learning Algorithm (BYY) 
Ying-Yang learning considers a learning system featured with two pathways between the external observation domain X and its inner representation domain R. The domain R and the pathway R→X are modeled by one subsystem, while the domain X and the pathway X→R are modeled by another subsystem. From the view of the ancient Ying-Yang philosophy, the former is called Ying and the latter is called Yang, and the two coordinately form a Ying-Yang system, with the structure of Ying subject to a principle of compactness (least complexity) and the structure of Yang subject to a principle of proper vitality (matched dynamic range) with respect to the Ying. Moreover, all the remaining unknowns in the Ying-Yang system are learned under the guidance of a Ying-Yang best harmony principle. http://…/indexpubbyy.html http://…/chaptersXu03a.pdf 
BayesNAS  One-Shot Neural Architecture Search (NAS) is a promising method to significantly reduce search time without any separate training. It can be treated as a Network Compression problem on the architecture parameters from an overparameterized network. However, there are two issues associated with most one-shot NAS methods. First, dependencies between a node and its predecessors and successors are often disregarded, which results in improper treatment of zero operations. Second, pruning architecture parameters based on their magnitude is questionable. In this paper, we employ the classic Bayesian learning approach to alleviate these two issues by modeling architecture parameters using hierarchical automatic relevance determination (HARD) priors. Unlike other NAS methods, we train the overparameterized network for only one epoch and then update the architecture. Impressively, this enabled us to find the architecture in both proxy and proxyless tasks on CIFAR-10 within only 0.2 GPU days using a single GPU. As a byproduct, our approach can be transferred directly to compress convolutional neural networks by enforcing structural sparsity, which achieves extremely sparse networks without accuracy deterioration. 
BayesOD  One of the challenging aspects of incorporating deep neural networks into robotic systems is the lack of uncertainty measures associated with their output predictions. Recent work has identified aleatoric and epistemic as two types of uncertainty in the output of deep neural networks, and provided methods for their estimation. However, these methods have had limited success when applied to the object detection task. This paper introduces BayesOD, a Bayesian approach for estimating the uncertainty in the output of deep object detectors, which reformulates the neural network inference and Non-Maximum Suppression components of standard object detectors from a Bayesian perspective. As a result, BayesOD provides uncertainty estimates associated with detected object instances, which allows the deep object detector to be treated as any other sensor in a robotic system. BayesOD is shown to be capable of reliably identifying erroneous detection output instances using their estimated uncertainty measure. The estimated uncertainty measures are also shown to be better correlated with the correctness of a detection than the state-of-the-art methods available in the literature. 
BayesToMoP  Multiagent algorithms often aim to accurately predict the behaviors of other agents and find a best response during interactions accordingly. Previous works usually assume an opponent uses a stationary strategy or randomly switches among several stationary ones. However, in practice, an opponent may exhibit more sophisticated behaviors by adopting more advanced strategies, e.g., using a Bayesian reasoning strategy. This paper presents a novel algorithm called BayesToMoP which can efficiently detect and handle opponents using either stationary or higher-level reasoning strategies. BayesToMoP also supports the detection of previously unseen policies and learning a best response policy accordingly. Deep BayesToMoP is proposed by extending BayesToMoP with DRL techniques. Experimental results show both BayesToMoP and deep BayesToMoP outperform the state-of-the-art approaches when faced with different types of opponents in two-agent competitive games. 
BBGAN  One major factor impeding more widespread adoption of deep neural networks (DNNs) is their issues with robustness, which is essential for safety-critical applications such as autonomous driving. This has motivated much recent work on adversarial attacks for DNNs, which mostly focus on pixel-level perturbations void of semantic meaning. In contrast, we present a general framework for adversarial black box attacks on agents, which are intimately related to the semantics of the task being performed by the agent. To do this, our proposed adversary (denoted as BBGAN) is trained to appropriately parametrize the environment (black box) with which the agent interacts, such that this agent performs poorly on its dedicated task. We illustrate the application of our BBGAN framework on three different tasks (primarily targeting aspects of autonomous navigation): object detection, self-driving, and autonomous UAV racing. On these tasks, our approach can be used to generate failure cases that fool an agent consistently. 
BCRNet  This paper proposes a novel neural network architecture inspired by the nonstandard form proposed by Beylkin, Coifman, and Rokhlin in [Communications on Pure and Applied Mathematics, 44(2), 141–183]. The nonstandard form is a highly effective wavelet-based compression scheme for linear integral operators. In this work, we first represent the matrix-vector product algorithm of the nonstandard form as a linear neural network where every scale of the multiresolution computation is carried out by a locally connected linear subnetwork. In order to address nonlinear problems, we propose an extension, called BCRNet, by replacing each linear subnetwork with a deeper and more powerful nonlinear one. Numerical results demonstrate the efficiency of the new architecture by approximating nonlinear maps that arise in homogenization theory and stochastic computation. 
BeakerX  BeakerX is a collection of kernels and extensions to the Jupyter interactive computing environment. It provides JVM support, Spark cluster support, polyglot programming, interactive plots, tables, forms, publishing, and more. BeakerX supports: • Groovy, Scala, Clojure, Kotlin, Java, and SQL, including many magics; • Widgets for time-series plotting, tables, forms, and more (there are Python and JavaScript APIs in addition to the JVM languages); • Polyglot magics and autotranslation, allowing you to access multiple languages in the same notebook, and seamlessly communicate between them; • Apache Spark integration including GUI configuration, status, progress, interrupt, and tables; • One-click publication with interactive plots and tables, and • Jupyter Lab. BeakerX is available via conda, pip, and docker. 
Beetle Antennae Search (BAS) 
Metaheuristic algorithms have become very popular because of their strong performance on optimization problems. A new algorithm called beetle antennae search algorithm (BAS), inspired by the searching behavior of longhorn beetles, is proposed in the paper. The BAS algorithm imitates the function of antennae and the random walking mechanism of beetles in nature, and then two main steps of detecting and searching are implemented. Finally, the algorithm is benchmarked on two well-known test functions, and the numerical results validate the efficacy of the proposed BAS algorithm. BSAS: Beetle Swarm Antennae Search Algorithm for Optimization Problems 
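The two steps named above, detecting with the antennae and searching by a random walk, can be sketched roughly as follows. This is a simplified illustration rather than the paper's exact update rule: the antenna length `d`, the step size `step`, the decay schedule and the function name are all assumptions made for this example, and the sketch minimizes the objective.

```python
import math
import random

def beetle_antennae_search(f, x0, n_iter=200, d=1.0, step=1.0, decay=0.95, seed=0):
    """Minimize f from x0 with a single 'beetle'; returns the best point seen."""
    rng = random.Random(seed)
    x = list(x0)
    best_x, best_f = list(x), f(x)
    for _ in range(n_iter):
        # Random unit vector: the direction the beetle currently faces.
        b = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(v * v for v in b)) or 1.0
        b = [v / norm for v in b]
        # Detecting: evaluate the objective at the two antenna tips.
        f_right = f([xi + d * bi for xi, bi in zip(x, b)])
        f_left = f([xi - d * bi for xi, bi in zip(x, b)])
        # Searching: step toward the side with the lower objective value.
        sign = (f_right > f_left) - (f_right < f_left)
        x = [xi - step * sign * bi for xi, bi in zip(x, b)]
        fx = f(x)
        if fx < best_f:
            best_x, best_f = list(x), fx
        # Shrink the antenna length and step size over time (assumed schedule).
        d = decay * d + 0.01
        step = decay * step
    return best_x, best_f

def sphere(v):
    return sum(c * c for c in v)

best_x, best_f = beetle_antennae_search(sphere, [3.0, -2.0])
```

Each iteration costs three objective evaluations regardless of dimension, which is part of the appeal of this kind of search.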
Beetle Swarm Optimization Algorithm  In this paper, a new metaheuristic algorithm, called beetle swarm optimization algorithm, is proposed by enhancing the performance of swarm optimization through beetle foraging principles. Its performance on 23 benchmark functions is tested and compared with widely used algorithms, including particle swarm optimization algorithm, genetic algorithm (GA) and grasshopper optimization algorithm. Numerical experiments show that the beetle swarm optimization algorithm outperforms its counterparts. Besides, to demonstrate the practical impact of the proposed algorithm, two classic engineering design problems, namely, the pressure vessel design problem and Himmelblau’s optimization problem, are also considered and the proposed beetle swarm optimization algorithm is shown to be competitive in those applications. 
Behavior Tree  A Behavior Tree (BT) is a way to structure the switching between different tasks in an autonomous agent, such as a robot or a virtual entity in a computer game. BTs are a very efficient way of creating complex systems that are both modular and reactive. These properties are crucial in many applications, which has led to the spread of BT from computer game programming to many branches of AI and Robotics. 
Behavioral Analytics  Behavioral Analytics is a subset of business analytics that focuses on how and why users of eCommerce platforms, online games, & web applications behave. While business analytics has a broader focus on the who, what, where and when of business intelligence, behavioral analytics narrows that scope, allowing one to take seemingly unrelated data points in order to extrapolate, predict and determine errors and future trends. It takes a more holistic and human view of data, connecting individual data points to tell us not only what is happening, but also how and why it is happening. Behavioral analytics utilizes user data captured while the web application, game, or website is in use by analytic platforms like Google Analytics. Platform traffic data like navigation paths, clicks, social media interactions, purchasing decisions and marketing responsiveness are all recorded. More specific advertising metrics, such as click-to-conversion time, are recorded as well, along with comparisons between other metrics such as the monetary value of an order and the amount of time spent on the site. These data points are then compiled and analyzed, whether by looking at the timeline progression from when a user first entered the platform until a sale was made, or what other products a user bought or looked at before this purchase. Behavioral analysis allows future actions and trends to be predicted based on all the data collected. 
Behavioral Change Point Analysis (BCPA) 
The Behavioral Change Point Analysis (BCPA) is a method of identifying hidden shifts in the underlying parameters of a time series, developed specifically to be applied to animal movement data which is irregularly sampled. 
Behavior-based Process Translator (BePT) 
Sharing process models on the web has emerged as a widely used concept. Users can collect and share their experimental process models with others. However, some users always feel confused about the shared process models for lack of necessary guidelines or instructions. Therefore, several process translators have been proposed to explain the semantics of process models in natural language (NL) in order to extract more value from process repositories. We find that previous studies suffer from information loss and generate semantically erroneous descriptions that diverge from original model behaviors. In this paper, we propose a novel process translator named BePT (Behavior-based Process Translator) based on the encoder-decoder paradigm, encoding a process model into a middle representation and decoding the representation into a NL text. The theoretical analysis demonstrates that BePT satisfies behavior correctness, behavior completeness and description minimality. The qualitative and quantitative experiments show that BePT outperforms the state-of-the-art methods in terms of capability, detailedness, consistency, understandability and reproducibility. 
Behavior-Intensive Neural Network (BINN) 
In modern e-commerce, the behaviors of customers contain rich information, e.g., consumption habits and the dynamics of preferences. Recently, session-based recommendations are becoming popular to explore the temporal characteristics of customers’ interactive behaviors. However, existing works mainly exploit the short-term behaviors without fully taking the customers’ long-term stable preferences and evolutions into account. In this paper, we propose a novel Behavior-Intensive Neural Network (BINN) for next-item recommendation by incorporating both users’ historical stable preferences and present consumption motivations. Specifically, BINN contains two main components, i.e., Neural Item Embedding and Discriminative Behaviors Learning. Firstly, a novel item embedding method based on user interactions is developed for obtaining a unified representation for each item. Then, with the embedded items and the interactive behaviors over item sequences, BINN discriminatively learns the historical preferences and present motivations of the target users. Thus, BINN could better perform recommendations of the next items for the target users. Finally, for evaluating the performance of BINN, we conduct extensive experiments on two real-world datasets, i.e., Tianchi and JD. The experimental results clearly demonstrate the effectiveness of BINN compared with several state-of-the-art methods. 
BeholderGAN  Beauty is in the eye of the beholder. This maxim, emphasizing the subjectivity of the perception of beauty, has enjoyed a wide consensus since ancient times. In the digital era, data-driven methods have been shown to be able to predict human-assigned beauty scores for facial images. In this work, we augment this ability and train a generative model that generates faces conditioned on a requested beauty score. In addition, we show how this trained generator can be used to beautify an input face image. By doing so, we achieve an unsupervised beautification model, in the sense that it relies on no ground truth target images. 
BELIEF  With the advent of the Big Data era, data reduction methods are in high demand given their ability to simplify huge data and ease complex learning processes. Concretely, algorithms that are able to filter relevant dimensions from a set of millions are of huge importance. Although effective, these techniques suffer from the ‘scalability’ curse as well. In this work, we propose a distributed feature weighting algorithm, which is able to rank millions of features in parallel using large samples. This method, inspired by the well-known RELIEF algorithm, introduces a novel redundancy elimination measure that provides similar schemes to those based on entropy at a much lower cost. It also allows smooth scale-up when more instances are demanded in feature estimations. Empirical tests performed on our method show its estimation ability in manifold huge sets (both in number of features and instances), as well as its simplified runtime cost (especially at the redundancy detection step). 
Belief Functions  The theory of belief functions provides a non-Bayesian way of using mathematical probability to quantify subjective judgements. Whereas a Bayesian assesses probabilities directly for the answer to a question of interest, a belief-function user assesses probabilities for related questions and then considers the implications of these probabilities for the question of interest. 
Belief Network  see “Bayesian Network” or “Bayes Net” 
Belief Propagation  Belief propagation, also known as sum-product message passing, is a message passing algorithm for performing inference on graphical models, such as Bayesian networks and Markov random fields. It calculates the marginal distribution for each unobserved node, conditional on any observed nodes. Belief propagation is commonly used in artificial intelligence and information theory and has demonstrated empirical success in numerous applications including low-density parity-check codes, turbo codes, free energy approximation, and satisfiability. The algorithm was first proposed by Judea Pearl in 1982, who formulated this algorithm on trees, and was later extended to polytrees. It has since been shown to be a useful approximate algorithm on general graphs. 
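As a small illustration of sum-product message passing, the sketch below computes marginals on a chain-structured Markov random field, the simplest tree, and checks them against brute-force enumeration. The potentials, the three-node chain and the function names are made up for this example. An observed node could be encoded by a unary potential that puts all its weight on the observed state.

```python
from itertools import product

def chain_marginals(unary, pairwise):
    """Exact marginals on a chain via sum-product message passing.

    unary[i][s]       : potential of node i taking state s
    pairwise[i][s][t] : potential linking node i (state s) to node i+1 (state t)
    """
    n, k = len(unary), len(unary[0])
    fwd = [[1.0] * k for _ in range(n)]  # messages passed left to right
    bwd = [[1.0] * k for _ in range(n)]  # messages passed right to left
    for i in range(1, n):
        for t in range(k):
            fwd[i][t] = sum(fwd[i - 1][s] * unary[i - 1][s] * pairwise[i - 1][s][t]
                            for s in range(k))
    for i in range(n - 2, -1, -1):
        for s in range(k):
            bwd[i][s] = sum(pairwise[i][s][t] * unary[i + 1][t] * bwd[i + 1][t]
                            for t in range(k))
    marginals = []
    for i in range(n):
        # Belief at node i: local potential times both incoming messages.
        belief = [unary[i][s] * fwd[i][s] * bwd[i][s] for s in range(k)]
        z = sum(belief)
        marginals.append([b / z for b in belief])
    return marginals

def brute_force_marginals(unary, pairwise):
    """Reference answer: sum the joint weight over every assignment."""
    n, k = len(unary), len(unary[0])
    totals = [[0.0] * k for _ in range(n)]
    for states in product(range(k), repeat=n):
        w = 1.0
        for i, s in enumerate(states):
            w *= unary[i][s]
        for i in range(n - 1):
            w *= pairwise[i][states[i]][states[i + 1]]
        for i, s in enumerate(states):
            totals[i][s] += w
    return [[v / sum(row) for v in row] for row in totals]

unary = [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]]
pairwise = [[[0.9, 0.1], [0.1, 0.9]]] * 2   # neighbours prefer equal states
bp = chain_marginals(unary, pairwise)
exact = brute_force_marginals(unary, pairwise)
```

Message passing costs time linear in the chain length, whereas enumeration grows exponentially. On graphs with cycles the same updates are iterated and give only approximate marginals.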
Bellman Equation  A Bellman equation, named after its discoverer, Richard Bellman, also known as a dynamic programming equation, is a necessary condition for optimality associated with the mathematical optimization method known as dynamic programming. It writes the value of a decision problem at a certain point in time in terms of the payoff from some initial choices and the value of the remaining decision problem that results from those initial choices. This breaks a dynamic optimization problem into simpler subproblems, as Bellman’s Principle of Optimality prescribes. The Bellman equation was first applied to engineering control theory and to other topics in applied mathematics, and subsequently became an important tool in economic theory. Almost any problem which can be solved using optimal control theory can also be solved by analyzing the appropriate Bellman equation. However, the term ‘Bellman equation’ usually refers to the dynamic programming equation associated with discrete-time optimization problems. In continuous-time optimization problems, the analogous equation is a partial differential equation which is usually called the Hamilton-Jacobi-Bellman equation. 
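For a discrete-time problem with deterministic transitions, the equation takes the form V(s) = max over actions a of [ r(s, a) + γ · V(s′) ], where s′ is the state reached from s by action a and γ is a discount factor. A minimal sketch of solving it by repeated substitution (value iteration) follows. The two-state problem, its rewards and the discount factor are invented for this example.

```python
def value_iteration(transitions, gamma=0.5, n_sweeps=100):
    """Repeatedly apply the Bellman update V(s) = max_a [r + gamma * V(s')].

    transitions[s][a] = (reward, next_state); transitions are deterministic.
    """
    values = {s: 0.0 for s in transitions}
    for _ in range(n_sweeps):
        values = {
            s: max(r + gamma * values[s2] for r, s2 in actions.values())
            for s, actions in transitions.items()
        }
    return values

# Hypothetical two-state decision problem.
transitions = {
    "A": {"stay": (0.0, "A"), "move": (1.0, "B")},
    "B": {"stay": (2.0, "B"), "move": (0.0, "A")},
}
values = value_iteration(transitions)
```

Here the fixed point can be checked by hand: staying in B forever is worth 2 / (1 − 0.5) = 4, and moving from A to B is worth 1 + 0.5 · 4 = 3, so the iteration approaches V(A) = 3 and V(B) = 4.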
Benford’s Law  Benford’s Law, also called the First-Digit Law, refers to the frequency distribution of digits in many (but not all) real-life sources of data. In this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently: 9 as the first digit less than 5% of the time. Benford’s Law also concerns the expected distribution for digits beyond the first, which approach a uniform distribution. 
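The first-digit probabilities follow the formula P(d) = log10(1 + 1/d) for d = 1, …, 9. The short sketch below evaluates it and compares it with the leading digits of the first thousand powers of 2, a sequence commonly used to illustrate the law. The helper names are chosen for this example.

```python
import math
from collections import Counter

def benford_probability(d):
    """Probability that the leading digit equals d under Benford's Law."""
    return math.log10(1.0 + 1.0 / d)

def leading_digit_frequencies(numbers):
    """Observed share of each leading digit 1..9 among positive integers."""
    counts = Counter(int(str(n)[0]) for n in numbers)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

expected = {d: benford_probability(d) for d in range(1, 10)}
observed = leading_digit_frequencies(2 ** n for n in range(1, 1001))

for d in range(1, 10):
    print(d, round(expected[d], 4), round(observed[d], 4))
```

The formula gives about 0.301 for digit 1 and about 0.046 for digit 9, matching the "about 30%" and "less than 5%" figures quoted above.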
Bent-Cable Regression  We use the so-called bent-cable model to describe natural phenomena which exhibit a potentially sharp change in slope. The model comprises two linear segments, joined smoothly by a quadratic bend. The class of bent cables includes, as a limiting case, the popular piecewise-linear model (with a sharp kink), otherwise known as the broken stick. Associated with bent-cable regression is the estimation of the bend-width parameter, through which the abruptness of the underlying transition may be assessed. 
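One common parametrization of the bent-cable mean function, given here as a sketch with parameter names chosen for this example, is f(t) = b0 + b1·t + b2·q(t), where q is zero before the bend, quadratic inside the bend of half-width gamma centred at tau, and linear afterwards. The incoming slope is b1 and the outgoing slope is b1 + b2.

```python
def bent_cable(t, b0, b1, b2, tau, gamma):
    """Bent-cable mean function with bend centre tau and half-width gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive; gamma -> 0 gives the broken stick")
    if t < tau - gamma:
        bend = 0.0                                       # incoming linear segment
    elif t <= tau + gamma:
        bend = (t - tau + gamma) ** 2 / (4.0 * gamma)    # quadratic bend
    else:
        bend = t - tau                                   # outgoing linear segment
    return b0 + b1 * t + b2 * bend

# Incoming slope 2, outgoing slope 2 + 3 = 5, bend between t = 4 and t = 6.
curve = [bent_cable(t, b0=1.0, b1=2.0, b2=3.0, tau=5.0, gamma=1.0)
         for t in (0.0, 4.0, 6.0, 10.0)]
```

The quadratic piece matches both the value and the slope of the linear pieces at tau − gamma and tau + gamma, which is what makes the join smooth. Fitting the five parameters to data would require a nonlinear least-squares routine, which this sketch does not include.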
BentoML  BentoML is a Python library for packaging and deploying machine learning models. It provides high-level APIs for defining an ML service and packaging its artifacts, source code, dependencies, and configurations into a production-system-friendly format that is ready for deployment. 
Berkeley Container Library (BCL) 
One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI’s one-sided interface and PGAS programming languages, lack application-level libraries to support these applications. We present the Berkeley Container Library, a set of generic, cross-platform, high-performance data structures for irregular applications, including queues, hash tables, Bloom filters and more. BCL is written in C++ using an internal DSL called the BCL Core that provides one-sided communication primitives such as remote get and remote put operations. The BCL Core has backends for MPI, OpenSHMEM, GASNet-EX, and UPC++, allowing BCL data structures to be used natively in programs written using any of these programming environments. Along with our internal DSL, we present the BCL ObjectContainer abstraction, which allows BCL data structures to transparently serialize complex data types while maintaining efficiency for primitive types. We also introduce the set of BCL data structures and evaluate their performance across a number of high-performance computing systems, demonstrating that BCL programs are competitive with hand-optimized code, even while hiding many of the underlying details of message aggregation, serialization, and synchronization. 
Berkeley Data Analytics Stack (BDAS) 
BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the Berkeley AMPLab to make sense of Big Data. 
BERTScore  We propose BERTScore, an automatic evaluation metric for text generation. Analogous to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference. However, instead of looking for exact matches, we compute similarity using contextualized BERT embeddings. We evaluate on several machine translation and image captioning benchmarks, and show that BERTScore correlates better with human judgments than existing metrics, often significantly outperforming even task-specific supervised metrics. 
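The token-to-token similarity idea can be sketched as follows. This is an illustration only: the real metric uses contextualized BERT embeddings, whereas the vectors below are hand-written stand-ins, and the greedy best-match aggregation into precision, recall and F1 is stated here as an assumption beyond the text above. The function names are hypothetical and are not the API of any BERTScore package.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_score(candidate, reference):
    """Match each token to its most similar counterpart and average.

    candidate, reference: lists of token embedding vectors.
    Assumes precision + recall > 0, which holds for the toy data below.
    """
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 2-d "embeddings" standing in for contextual token vectors.
reference = [[1.0, 0.0], [0.0, 1.0]]
same = greedy_match_score([[1.0, 0.0], [0.0, 1.0]], reference)
shorter = greedy_match_score([[1.0, 0.0]], reference)
```

With these toy vectors, an identical candidate scores 1 on all three measures, while a candidate missing the second token keeps precision 1 but drops to recall 0.5.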
BERTSUM  BERT, a pretrained Transformer model, has achieved groundbreaking performance on multiple NLP tasks. In this paper, we describe BERTSUM, a simple variant of BERT, for extractive summarization. Our system is the state of the art on the CNN/Dailymail dataset, outperforming the previous best-performing system by 1.65 on ROUGE-L. The code to reproduce our results is available at https://…/BertSum 
Bessel’s Correction  In statistics, Bessel’s correction, named after Friedrich Bessel, is the use of n – 1 instead of n in the formula for the sample variance and sample standard deviation, where n is the number of observations in a sample. This corrects the bias in the estimation of the population variance, and some (but not all) of the bias in the estimation of the population standard deviation, but often increases the mean squared error in these estimations. 
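A small numerical illustration of the two divisors, using an invented sample of eight observations. The `ddof` argument name ("delta degrees of freedom") is a convention borrowed for this example, and the results are cross-checked against the standard library's `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n – 1).

```python
import statistics

def variance(data, ddof):
    """Sum of squared deviations from the sample mean, divided by n - ddof."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - ddof)

sample = [2, 4, 4, 4, 5, 5, 7, 9]
biased = variance(sample, ddof=0)      # divides by n
corrected = variance(sample, ddof=1)   # divides by n - 1 (Bessel's correction)

print(biased, corrected)
print(statistics.pvariance(sample), statistics.variance(sample))
```

The sample mean is 5 and the squared deviations sum to 32, so the uncorrected value is 32 / 8 = 4 and the corrected value is 32 / 7, roughly 4.57.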
Best Estimate Results with Reduced Uncertainties (BERRU) 
Book: BERRU Predictive Modeling 
Best Friends For Ever (BFF) 
Graphs form a natural model for relationships and interactions between entities, for example, between people in social and cooperation networks, servers in computer networks, or tags and words in documents and tweets. But, which of these relationships or interactions are the most lasting ones? In this paper, given a set of graph snapshots, which may correspond to the state of a dynamic graph at different time instances, we look at the problem of identifying the set of nodes that are the most densely connected at all snapshots. We call this problem the Best Friends For Ever (Bff) problem. We provide definitions for density over multiple graph snapshots, that capture different semantics of connectedness over time, and we study the corresponding variants of the Bff problem. We then look at the On-Off Bff (O2Bff) problem that relaxes the requirement of nodes being connected in all snapshots, and asks for the densest set of nodes in at least k of a given set of graph snapshots. We show that this problem is NP-complete for all definitions of density, and we propose a set of efficient algorithms. Finally, we present experiments with synthetic and real datasets that show both the efficiency of our algorithms and the usefulness of the Bff and the O2Bff problems. 
Best Linear Adaptive Enhancement (BLADE) 
The Rapid and Accurate Image Super Resolution (RAISR) method of Romano, Isidoro, and Milanfar is a computationally efficient image upscaling method using a trained set of filters. We describe a generalization of RAISR, which we name Best Linear Adaptive Enhancement (BLADE). This approach is a trainable edge-adaptive filtering framework that is general, simple, computationally efficient, and useful for a wide range of image processing problems. We show applications to denoising, compression artifact removal, demosaicing, and approximation of anisotropic diffusion equations. 
Best Linear Unbiased Estimator (BLUE) 
In statistics, the Gauss-Markov theorem, named after Carl Friedrich Gauss and Andrey Markov, states that in a linear regression model in which the errors have expectation zero and are uncorrelated and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares (OLS) estimator. Here ‘best’ means giving the lowest variance of the estimate, as compared to other unbiased, linear estimators. The errors don’t need to be normal, nor do they need to be independent and identically distributed (only uncorrelated and homoscedastic). The hypothesis that the estimator be unbiased cannot be dropped, since otherwise estimators better than OLS exist. See, for example, the James-Stein estimator (which also drops linearity) or ridge regression. ➘ “Gauss-Markov Theorem” 
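For a single regressor the OLS estimator has a simple closed form: the slope is the ratio of the sample covariance of x and y to the sample variance of x, and the intercept makes the fitted line pass through the point of means. The sketch below computes it on invented data. The function name and the numbers are illustrative only.

```python
def ols_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x (one regressor)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

x = [0.0, 1.0, 2.0, 3.0, 4.0]
exact_fit = ols_fit(x, [1.0, 3.0, 5.0, 7.0, 9.0])    # data exactly on y = 1 + 2x
noisy_fit = ols_fit(x, [1.1, 2.9, 5.2, 6.8, 9.0])    # same line with small errors
```

Both coefficients are linear functions of the observed y values, which is the sense of 'linear' in BLUE. On the noisy data the estimates come out as intercept 1.06 and slope 1.97.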
Best Linear Unbiased Prediction (BLUP) 
In statistics, best linear unbiased prediction (BLUP) is used in linear mixed models for the estimation of random effects. BLUP was derived by Charles Roy Henderson in 1950 but the term ‘best linear unbiased predictor’ (or ‘prediction’) seems not to have been used until 1962. ‘Best linear unbiased predictions’ (BLUPs) of random effects are similar to best linear unbiased estimates (BLUEs) (➘ “GaussMarkov Theorem”) of fixed effects. The distinction arises because it is conventional to talk about estimating fixed effects but predicting random effects, but the two terms are otherwise equivalent. (This is a bit strange since the random effects have already been ‘realized’; they already exist. The use of the term ‘prediction’ may be because in the field of animal breeding in which Henderson worked, the random effects were usually genetic merit, which could be used to predict the quality of offspring.) However, the equations for the ‘fixed’ effects and for the random effects are different. In practice, it is often the case that the parameters associated with the random effect(s) term(s) are unknown; these parameters are the variances of the random effects and residuals. Typically the parameters are estimated and plugged into the predictor, leading to the Empirical Best Linear Unbiased Predictor (EBLUP). Notice that by simply plugging the estimated parameters into the predictor, additional variability is unaccounted for, leading to overly optimistic prediction variances for the EBLUP. Best linear unbiased predictions are similar to empirical Bayes estimates of random effects in linear mixed models, except that in the latter case, where weights depend on unknown values of components of variance, these unknown variances are replaced by sample-based estimates. 
Best Subsets  Best Subsets Regression is a method used to help determine which predictor (independent) variables should be included in a multiple regression model. The method examines the models built from all possible combinations of predictor variables and uses $R^2$ to compare them. Because the number of candidate models grows quickly with the number of predictors, it is impractical to compute without statistical software. First, all models with exactly one predictor variable are checked and the two models with the highest $R^2$ are selected. Then all models with exactly two predictor variables are checked and, again, the two models with the highest $R^2$ are chosen. This process continues until all combinations of all predictor variables have been taken into account. Fast Best Subset Selection: Coordinate Descent and Local Combinatorial Optimization Algorithms 
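The procedure described above can be sketched in plain Python as follows. This is a minimal illustration under assumptions: the helper names (`solve`, `r_squared`, `best_subsets`), the toy data, and the use of normal equations solved by Gaussian elimination are choices made for this example, and a real analysis would rely on a statistics package.

```python
from itertools import combinations

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def r_squared(columns, y):
    """R^2 of an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def best_subsets(predictors, y, keep=2):
    """For each subset size, keep the `keep` subsets with the highest R^2."""
    names = sorted(predictors)
    results = {}
    for size in range(1, len(names) + 1):
        scored = [(r_squared([predictors[n] for n in subset], y), subset)
                  for subset in combinations(names, size)]
        scored.sort(reverse=True)
        results[size] = scored[:keep]
    return results

# Toy data: y depends on x1 and x2 only (y = 2*x1 + 3*x2).
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "x3": [1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
}
y = [8.0, 7.0, 18.0, 17.0, 28.0, 27.0]
results = best_subsets(predictors, y)
for size, models in results.items():
    print(size, models)
```

Since $R^2$ never decreases when a predictor is added, it is only meaningful for comparing models of the same size, which is why the search keeps a separate shortlist per size.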
Best-scored Random Forest Density Estimation  This paper presents a brand new nonparametric density estimation strategy named the best-scored random forest density estimation, whose effectiveness is supported by both solid theoretical analysis and significant experimental performance. The term best-scored stands for selecting the one density tree with the best estimation performance out of a certain number of purely random density tree candidates; we name the selected one the best-scored random density tree. In this manner, the ensemble of these selected trees, that is, the best-scored random density forest, can achieve even better estimation results than simply integrating trees without selection. From the theoretical perspective, by decomposing the error term into two, we are able to carry out the following analysis: First of all, we establish the consistency of the best-scored random density trees under the $L_1$-norm. Secondly, we provide their convergence rates under the $L_1$-norm with respect to three different tail assumptions. Thirdly, the convergence rates under the $L_{\infty}$-norm are presented. Last but not least, we also carry out the above convergence rate analysis for the best-scored random density forest. In comparative experiments with other state-of-the-art density estimation approaches on both synthetic and real data sets, our algorithm shows not only significant advantages in estimation accuracy over other methods, but also stronger resistance to the curse of dimensionality. 
Best-Worst Scaling  Best-Worst Scaling (BWS) can be a method of data collection, and/or a theory of how respondents provide top and bottom ranked items from a list. BWS is increasingly used to obtain more choice data from individuals and/or to understand choice processes. The three cases of BWS are described, together with the intuition behind the models that are applied in each case. A summary of the main theoretical results is provided, including an exposition of the possible theoretical relationships between estimates from the different cases, and of the theoretical properties of best minus worst scores. BWS data can be analysed using relatively simple extensions to maximum-likelihood based methods used in discrete choice experiments. These are summarised, before the benefits of simple functions of the best and 
Beta Autoregressive Fractionally Integrated Moving Average Model  In this work we introduce the class of beta autoregressive fractionally integrated moving average models for continuous random variables taking values in the continuous unit interval $(0,1)$. The proposed model accommodates a set of regressors and a long-range dependent time series structure. We derive the partial likelihood estimator for the parameters of the proposed model and obtain the associated score vector and Fisher information matrix. We also prove the consistency and asymptotic normality of the estimator under mild conditions. Hypothesis testing, diagnostic tools and forecasting are also proposed. A Monte Carlo simulation is considered to evaluate the finite sample performance of the partial likelihood estimators and to study some of the proposed tests. An empirical application is also presented and discussed. 
Beta Process Sticky Hidden Markov Model (BPSHMM) 
Spectrum sensing in a large-scale heterogeneous network is very challenging as it usually requires a large number of static secondary users (SUs) to obtain the global spectrum states. To tackle this problem, in this paper, we propose a new framework based on Bayesian machine learning. We exploit the mobility of multiple SUs to simultaneously collect spectrum sensing data, and cooperatively derive the global spectrum states. We first develop a novel nonparametric Bayesian learning model, referred to as beta process sticky hidden Markov model (BPSHMM), to capture the spatial-temporal correlation in the collected spectrum data, where SHMM models the latent statistical correlation within each mobile SU’s time series data, while BP realizes the cooperation among multiple SUs. Bayesian inference is then carried out to automatically infer the heterogeneous spectrum states. Based on the inference results, we also develop a new algorithm with a refinement mechanism to predict the spectrum availability, which enables a newly joining SU to immediately access the unoccupied frequency band without sensing. Simulation results show that the proposed framework can significantly improve spectrum sensing performance compared with the existing spectrum sensing techniques. 
Beta Regression  How should one perform a regression analysis in which the dependent variable is restricted to the standard unit interval, such as rates and proportions? Ferrari and Cribari-Neto (2004) proposed a regression model for continuous variates that assume values in the standard unit interval, e.g., rates, proportions, or concentration indices. The model is based on the assumption that the response is beta-distributed; they called it the beta regression model. The regression parameters are interpretable in terms of the mean of y (the variable of interest), and the model is naturally heteroskedastic and easily accommodates asymmetries. A variant of the beta regression model that allows for nonlinearities and variable dispersion was proposed by Simas et al. (2010). zoib: An R package for Bayesian Inference for Beta Regression and Zero/One Inflated Beta Regression A Short Course in Beta Regression Models 
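To make the mean-based interpretation concrete, the following sketch writes the beta density in terms of a mean `mu` and a precision `phi` (shape parameters `mu * phi` and `(1 - mu) * phi`) and links the mean to a linear predictor through a logit link. The logit link, the function names, and the toy numbers are assumptions chosen for illustration; this evaluates a log-likelihood only and is not a fitting routine.

```python
import math

def logistic(eta):
    """Inverse of the logit link: maps a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def beta_log_density(y, mu, phi):
    """Log density of a beta distribution with mean mu and precision phi."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

def beta_regression_loglik(coefs, phi, rows, ys):
    """Log-likelihood of a beta regression with a logit link for the mean."""
    total = 0.0
    for row, y in zip(rows, ys):
        eta = sum(c * v for c, v in zip(coefs, row))  # linear predictor
        total += beta_log_density(y, logistic(eta), phi)
    return total

# Toy design matrix with an intercept column and one regressor.
rows = [[1.0, 0.2], [1.0, 0.5], [1.0, 0.9]]
ys = [0.30, 0.55, 0.80]
loglik = beta_regression_loglik([-1.0, 2.5], 10.0, rows, ys)
print(loglik)
```

With `mu = 0.5` and `phi = 2` both shape parameters equal 1, so the density reduces to the uniform distribution on the unit interval, which gives a quick sanity check.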
Beta Seasonal Autoregressive Moving Average (betaSARMA) 
In this paper we introduce the class of beta seasonal autoregressive moving average ($\beta$SARMA) models for modeling and forecasting time series data that assume values in the standard unit interval. It generalizes the class of beta autoregressive moving average models [Rocha and Cribari-Neto, Test, 2009] by incorporating seasonal dynamics into the model's dynamic structure. Besides introducing the new class of models, we develop parameter estimation, hypothesis testing inference, and diagnostic analysis tools. We also discuss out-of-sample forecasting. In particular, we provide closed-form expressions for the conditional score vector and for the conditional Fisher information matrix. We also evaluate the finite sample performance of conditional maximum likelihood estimators and white noise tests using Monte Carlo simulations. An empirical application is presented and discussed. 
beta^3IRT  Item Response Theory (IRT) aims to assess latent abilities of respondents based on the correctness of their answers in aptitude test items with different difficulty levels. In this paper, we propose the $\beta^3$-IRT model, which models continuous responses and can generate a much enriched family of Item Characteristic Curves (ICC). In experiments we applied the proposed model to data from an online exam platform, and show that our model outperforms a more standard 2PL-ND model on all datasets. Furthermore, we show how to apply $\beta^3$-IRT to assess the ability of machine learning classifiers. This novel application results in a new metric for evaluating the quality of the classifier’s probability estimates, based on the inferred difficulty and discrimination of data instances. 
Bezier Simplex Model  Multi-objective optimization problems require simultaneously optimizing two or more objective functions. Many studies have reported that the solution set of an M-objective optimization problem often forms an (M-1)-dimensional topological simplex (a curved line for M=2, a curved triangle for M=3, a curved tetrahedron for M=4, etc.). Since the dimensionality of the solution set increases as the number of objectives grows, an exponentially large sample size is needed to cover the solution set. To reduce the required sample size, this paper proposes a Bezier simplex model and its fitting algorithm. These techniques can exploit the simplex structure of the solution set and decompose a high-dimensional surface fitting task into a sequence of low-dimensional ones. An approximation theorem of Bezier simplices is proven. Numerical experiments with synthetic and real-world optimization problems demonstrate that the proposed method achieves an accurate approximation of high-dimensional solution sets with small samples. In practice, such an approximation will be conducted in the post-optimization process and enable a better trade-off analysis. 
BFGAN  In many natural language generation tasks, it is important to incorporate additional knowledge such as lexical constraints into the model’s output; these constraints take the form of phrases or words that must be present in the output sequence. Unfortunately, existing neural language models cannot be used directly to generate lexically constrained sentences. In this paper, we propose a new algorithmic framework called BFGAN to address this challenge. We employ a backward generator and a forward generator to generate a lexically constrained sentence together, and use a discriminator to guide the joint training of the two generators by assigning them reward signals. Experimental results on automatic and human evaluation demonstrate significant improvements over previous baselines. 
Bhattacharyya Distance  In statistics, the Bhattacharyya distance measures the similarity of two discrete or continuous probability distributions. It is closely related to the Bhattacharyya coefficient, which is a measure of the amount of overlap between two statistical samples or populations. Both measures are named after Anil Kumar Bhattacharya, a statistician who worked in the 1930s at the Indian Statistical Institute. The coefficient can be used to determine the relative closeness of the two samples being considered. It is used to measure the separability of classes in classification and it is considered to be more reliable than the Mahalanobis distance, as the Mahalanobis distance is a particular case of the Bhattacharyya distance when the standard deviations of the two classes are the same. Therefore, when two classes have similar means but different standard deviations, the Mahalanobis distance would tend to zero, whereas the Bhattacharyya distance would grow depending on the difference between the standard deviations. 
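For discrete distributions, the coefficient is the sum over outcomes of the square root of the product of the two probabilities, and the distance is the negative logarithm of the coefficient. A minimal sketch, with function names and toy distributions chosen for illustration:

```python
import math

def bhattacharyya_coefficient(p, q):
    """Overlap between two discrete distributions given as aligned lists."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def bhattacharyya_distance(p, q):
    """Negative log of the coefficient; infinite when there is no overlap."""
    bc = bhattacharyya_coefficient(p, q)
    return math.inf if bc == 0.0 else -math.log(bc)

same = bhattacharyya_distance([0.5, 0.5], [0.5, 0.5])
shifted = bhattacharyya_distance([0.5, 0.5], [0.9, 0.1])
disjoint = bhattacharyya_distance([1.0, 0.0], [0.0, 1.0])
print(same, shifted, disjoint)
```

Identical distributions have coefficient 1 and distance 0, while distributions with no common support have coefficient 0 and infinite distance.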
Bi-Adversarial Auto-Encoder  Existing generative Zero-Shot Learning (ZSL) methods only consider the unidirectional alignment from the class semantics to the visual features while ignoring the alignment from the visual features to the class semantics, and thus fail to construct the visual-semantic interactions well. In this paper, we propose to synthesize visual features based on an auto-encoder framework paired with bi-adversarial networks, one each for the visual and semantic modalities, to reinforce the visual-semantic interactions with a bidirectional alignment, which ensures that the synthesized visual features fit the real visual distribution and are highly related to the semantics. The encoder aims at synthesizing real-like visual features while the decoder forces both the real and the synthesized visual features to be more related to the class semantics. To further capture the discriminative information of the synthesized visual features, both the real and synthesized visual features are forced to be classified into the correct classes via a classification network. Experimental results on four benchmark datasets show that the proposed approach is particularly competitive on both the traditional ZSL and the generalized ZSL tasks. 
Bias  Statistical bias is a feature of a statistical technique or of its results whereby the expected value of the results differs from the true underlying quantitative parameter being estimated. 
Bias Corrected Minimum Distance Estimator (BCMDE) 
This work proposes a new minimum distance estimator (MDE) for the parameters of short and long memory models. This bias corrected minimum distance estimator (BCMDE) considers a correction in the usual MDE to account for the bias of the sample autocorrelation function when the mean is unknown. We prove the weak consistency of the BCMDE for the general fractional autoregressive moving average (ARFIMA(p, d, q)) model and derive its asymptotic distribution for some particular cases. Simulation studies show that the BCMDE presents a good performance compared to other procedures frequently used in the literature, such as the maximum likelihood estimator, the Whittle estimator and the MDE. The results also show that the BCMDE presents, in general, the smallest mean squared error and is less biased than the MDE when the mean is a nontrivial function of time. 
Bias of an Estimator  In statistics, the bias (or bias function) of an estimator is the difference between this estimator’s expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. Otherwise the estimator is said to be biased. In statistics, ‘bias’ is an objective property of an estimator, and while not a desired property, it is not pejorative, unlike the ordinary English use of the term ‘bias’. Bias can also be measured with respect to the median, rather than the mean (expected value), in which case one distinguishes median-unbiased from the usual mean-unbiasedness property. Bias is related to consistency in that consistent estimators are convergent and asymptotically unbiased (hence converge to the correct value as the number of data points grows arbitrarily large), though individual estimators in a consistent sequence may be biased (so long as the bias converges to zero); see bias versus consistency. 
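A standard illustration of this definition is the variance estimator that divides by n instead of n - 1. The sketch below computes the expected value of both versions exactly, by enumerating every sample of size n drawn with replacement from a small population; the population, the sample size, and the function names are assumptions made for the example.

```python
from itertools import product

def variance_estimate(sample, ddof):
    """Sum of squared deviations from the sample mean, divided by n - ddof."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - ddof)

def expected_estimate(population, n, ddof):
    """Exact expectation over all equally likely samples with replacement."""
    samples = list(product(population, repeat=n))
    return sum(variance_estimate(s, ddof) for s in samples) / len(samples)

population = [1.0, 2.0, 3.0, 4.0]
n = 3
pop_mean = sum(population) / len(population)
true_variance = sum((x - pop_mean) ** 2 for x in population) / len(population)

# Bias = expected value of the estimator minus the true parameter value.
bias_divide_by_n = expected_estimate(population, n, ddof=0) - true_variance
bias_divide_by_n_minus_1 = expected_estimate(population, n, ddof=1) - true_variance
print(true_variance, bias_divide_by_n, bias_divide_by_n_minus_1)
```

Dividing by n underestimates the variance by a factor of (n - 1)/n on average, while dividing by n - 1 removes the bias.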
Bias of an Estimator / Unbiasedness  In statistics, the bias (or bias function) of an estimator is the difference between this estimator’s expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. Otherwise the estimator is said to be biased. In statistics, “bias” is an objective statement about a function, and while not a desired property, it is not pejorative, unlike the ordinary English use of the term “bias”. 
Bias/Variance Tradeoff  In machine learning, the bias-variance dilemma or bias-variance tradeoff is the problem of simultaneously minimizing the bias (how accurate a model is across different training sets) and variance of the model error (how sensitive the model is to small changes in training set). This tradeoff applies to all forms of supervised learning: classification, function fitting, and structured output learning. It has also been invoked to explain the effectiveness of heuristics in human learning. 
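For squared-error loss the tradeoff can be written as an explicit decomposition. The notation below is an assumption introduced for this illustration: the data follow $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, $\hat{f}$ is the model fitted on a random training set, and expectations are taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{squared bias}}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

More flexible models typically lower the first term and raise the second, which is why the two cannot usually be minimized at the same time.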
Bias-Compensated Normalized Maximum Correntropy Criterion (BCNMCC) 
This paper proposes a bias-compensated normalized maximum correntropy criterion (BCNMCC) algorithm, characterized by its low steady-state misalignment, for system identification with noisy input in an impulsive output noise environment. The normalized maximum correntropy criterion (NMCC) is derived from a correntropy-based cost function, which is rather robust with respect to impulsive noise. To deal with the noisy input, we introduce a bias-compensated vector (BCV) to the NMCC algorithm, and then an unbiasedness criterion and some reasonable assumptions are used to compute the BCV. Taking advantage of the BCV, the bias caused by the input noise can be effectively suppressed. System identification simulation results demonstrate that the proposed BCNMCC algorithm can outperform other related algorithms with noisy input, especially in an impulsive output noise environment. 
BiasedWalk  Network embedding algorithms are able to learn latent feature representations of nodes, transforming networks into lower dimensional vector representations. Typical key applications, which have effectively been addressed using network embeddings, include link prediction, multi-label classification and community detection. In this paper, we propose BiasedWalk, a scalable, unsupervised feature learning algorithm that is based on biased random walks to sample context information about each node in the network. Our random-walk based sampling can behave as Breadth-First-Search (BFS) and Depth-First-Search (DFS) samplings with the goal of capturing homophily and role equivalence between the nodes in the network. We have performed a detailed experimental evaluation comparing the performance of the proposed algorithm against various baseline methods, on several datasets and learning tasks. The experimental results show that the proposed method outperforms the baseline ones in most of the tasks and datasets. 
Bias-Jacobian  Nonlinear functions such as neural networks can be locally approximated by affine planes. Recent works make use of input-Jacobians, which describe the normal to these planes. In this paper, we introduce full-Jacobians, which include this normal along with an additional intercept term called the bias-Jacobians, and which together completely describe local planes. For ReLU neural networks, bias-Jacobians correspond to sums of gradients of outputs w.r.t. intermediate layer activations. We first use these full-Jacobians for distillation by aligning gradients of their intermediate representations. Next, we regularize bias-Jacobians alone to improve generalization. Finally, we show that full-Jacobian maps can be viewed as saliency maps. Experimental results show improved distillation on small datasets, improved generalization for neural network training, and sharper saliency maps. 
BiBit  Binary datasets represent a compact and simple way to store data about the relationships between a group of objects and their possible properties. In the last few years, different biclustering algorithms have been specially developed to be applied to binary datasets. Several approaches based on matrix factorization, suffix trees or divide-and-conquer techniques have been proposed to extract useful biclusters from binary data, and these approaches provide information about the distribution of patterns and intrinsic correlations. A novel approach to extracting biclusters from binary datasets, BiBit, is introduced here. The results obtained from different experiments with synthetic data reveal the excellent performance and the robustness of BiBit to density and size of input data. Also, BiBit is applied to a central nervous system embryonic tumor gene expression dataset to test the quality of the results. A novel gene expression preprocessing methodology, based on expression level layers, and the selective search performed by BiBit, based on a very fast bit-pattern processing technique, provide very satisfactory results in quality and computational cost. The power of biclustering in finding genes involved simultaneously in different cancer processes is also shown. Finally, a comparison with Bimax, one of the most cited binary biclustering algorithms, shows that BiBit is faster while providing essentially the same results. ➘ “Biclustering” BiBitR 
Biclustering  Biclustering, co-clustering, or two-mode clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. 
Bidirectional Conditional Generative Adversarial Network  Conditional variants of Generative Adversarial Networks (GANs), known as cGANs, are generative models that can produce data samples ($x$) conditioned on both latent variables ($z$) and known auxiliary information ($c$). Another GAN variant, Bidirectional GAN (BiGAN), is a recently developed framework for learning the inverse mapping from $x$ to $z$ through an encoder trained simultaneously with the generator and the discriminator of an unconditional GAN. We propose the Bidirectional Conditional GAN (BCGAN), which combines cGANs and BiGANs into a single framework with an encoder that learns inverse mappings from $x$ to both $z$ and $c$, trained simultaneously with the conditional generator and discriminator in an end-to-end setting. We present crucial techniques for training BCGANs, which incorporate an extrinsic factor loss along with an associated dynamically-tuned importance weight. As compared to other encoder-based GANs, BCGANs not only encode $c$ more accurately but also utilize $z$ and $c$ more effectively and in a more disentangled way to generate data samples. 
Bidirectional Deep Echo State Network  In this work we propose a deep architecture for the classification of multivariate time series. By means of a recurrent and untrained reservoir we generate a vectorial representation that embeds the temporal relationships in the data. To overcome the limitations of the reservoir's vanishing memory, we introduce a bidirectional reservoir, whose last state also captures the past dependencies in the input. We apply dimensionality reduction to the final reservoir states to obtain compressed fixed-size representations of the time series. These are subsequently fed into a deep feedforward network, which is trained to perform the final classification. We test our architecture on benchmark datasets and on a real-world use case of blood sample classification. Results show that our method performs better than a standard echo state network, and it can be trained much faster than a fully-trained recurrent network. 
Bidirectional Encoder Representations from Transformers for sequential Recommendation (BERT4Rec) 
Modeling users’ dynamic and evolving preferences from their historical behaviors is challenging and crucial for recommendation systems. Previous methods employ sequential neural networks (e.g., Recurrent Neural Network) to encode users’ historical interactions from left to right into hidden representations for making recommendations. Although these methods achieve satisfactory results, they often assume a rigidly ordered sequence, which is not always practical. We argue that such left-to-right unidirectional architectures restrict the power of the historical sequence representations. For this purpose, we introduce Bidirectional Encoder Representations from Transformers for sequential Recommendation (BERT4Rec). However, jointly conditioning on both left and right context in a deep bidirectional model would make the training become trivial, since each item can indirectly “see the target item”. To address this problem, we train the bidirectional model using the Cloze task, predicting the masked items in the sequence by jointly conditioning on their left and right context. Compared with predicting the next item at each position in a sequence, the Cloze task can produce more samples to train a more powerful bidirectional model. Extensive experiments on four benchmark datasets show that our model consistently outperforms various state-of-the-art sequential models. 
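The Cloze-style training input described above can be sketched as random masking of an interaction sequence. This is only an illustration under assumptions: the mask token name, the masking probability, and the function `mask_sequence` are hypothetical and are not taken from the BERT4Rec implementation.

```python
import random

MASK = "[mask]"  # hypothetical placeholder token for hidden items

def mask_sequence(items, mask_prob, rng):
    """Hide each item with probability mask_prob; return inputs and targets."""
    masked, targets = [], {}
    for position, item in enumerate(items):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = item  # the model must recover this item
        else:
            masked.append(item)
    return masked, targets

history = ["item_12", "item_7", "item_33", "item_5", "item_19"]
rng = random.Random(0)
masked, targets = mask_sequence(history, mask_prob=0.3, rng=rng)
print(masked)
print(targets)
```

Because a different random mask can be drawn each time a sequence is visited, one sequence yields many training samples, in line with the entry's remark that the Cloze task produces more samples than next-item prediction.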
Bidirectional Inference Network (BIN) 
We consider the problem of inferring the values of an arbitrary set of variables (e.g., risk of diseases) given other observed variables (e.g., symptoms and diagnosed diseases) and high-dimensional signals (e.g., MRI images or EEG). This is a common problem in healthcare since variables of interest often differ for different patients. Existing methods including Bayesian networks and structured prediction either do not incorporate high-dimensional signals or fail to model conditional dependencies among variables. To address these issues, we propose bidirectional inference networks (BIN), which stitch together multiple probabilistic neural networks, each modeling a conditional dependency. Predictions are then made via iteratively updating variables using backpropagation (BP) to maximize the corresponding posterior probability. Furthermore, we extend BIN to composite BIN (CBIN), which involves the iterative prediction process in the training stage and improves both accuracy and computational efficiency by adaptively smoothing the optimization landscape. Experiments on synthetic and real-world datasets (a sleep study and a dermatology dataset) show that CBIN is a single model that can achieve state-of-the-art performance and obtain better accuracy in most inference tasks than multiple models each specifically trained for a different task. 
Bidirectional Learning (BL) 
Bidirectional Learning for Robust Neural Networks 
BiDirectional Long Short Term Memory Network (BLSTM) 
Most existing methods for the biomedical entity recognition task rely on explicit feature engineering, where many features either are specific to a particular task or depend on the output of other existing NLP tools. Neural architectures have shown across various domains that the effort spent on explicit feature design can be reduced. In this work we propose a unified framework using a bidirectional long short term memory network (BLSTM) for named entity recognition (NER) tasks in biomedical and clinical domains. Three important characteristics of the framework are as follows: (1) the model learns contextual as well as morphological features using two different BLSTMs in a hierarchy, (2) the model uses a first-order linear conditional random field (CRF) in its output layer, in cascade with the BLSTM, to infer the label or tag sequence, (3) the model does not use any domain-specific features or dictionary, i.e., the same set of features is used in the three NER tasks, namely, disease name recognition (Disease NER), drug name recognition (Drug NER) and clinical entity recognition (Clinical NER). We compare the performance of the proposed model with existing state-of-the-art models on the standard benchmark datasets of the three tasks. We show empirically that the proposed framework outperforms all existing models. Further, our analysis of the CRF layer and of the word embedding obtained using character-based embedding shows their importance. 
Bidirectional LSTM  Recurrent neural networks like long short-term memory (LSTM) are important architectures for sequential prediction tasks. LSTMs (and RNNs in general) model sequences along the forward time direction. Bidirectional LSTMs (BiLSTMs), on the other hand, model sequences along both forward and backward directions and are generally known to perform better at such tasks because they capture a richer representation of the data. In the training of BiLSTMs, the forward and backward paths are learned independently. 
Bidirectional Recurrent Imputation for Time Series (BRITS) 
Time series are widely used as signals in many classification/regression tasks. It is ubiquitous that time series contain many missing values. Given multiple correlated time series data, how to fill in missing values and to predict their class labels? Existing imputation methods often impose strong assumptions on the underlying data generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing value imputation in time series data. Our proposed method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of the RNN graph and can be effectively updated during backpropagation. BRITS has three advantages: (a) it can handle multiple correlated missing values in time series; (b) it generalizes to time series with underlying nonlinear dynamics; (c) it provides a data-driven imputation procedure and applies to general settings with missing data. We evaluate our model on three real-world datasets, including an air quality dataset, a healthcare dataset, and a localization dataset for human activity. Experiments show that our model outperforms the state-of-the-art methods in both imputation and classification/regression accuracies. 
Bidirectional-Inference Variational Autoencoder (BIVA) 
With the introduction of the variational autoencoder (VAE), probabilistic latent variable models have received renewed attention as powerful generative models. However, their performance in terms of test likelihood and quality of generated samples has been surpassed by autoregressive models without stochastic units. Furthermore, flow-based models have recently been shown to be an attractive alternative that scales well to high-dimensional data. In this paper we close the performance gap by constructing VAE models that can effectively utilize a deep hierarchy of stochastic variables and model complex covariance structures. We introduce the Bidirectional-Inference Variational Autoencoder (BIVA), characterized by a skip-connected generative model and an inference network formed by a bidirectional stochastic inference path. We show that BIVA reaches state-of-the-art test likelihoods, generates sharp and coherent natural images, and uses the hierarchy of latent variables to capture different aspects of the data distribution. We observe that BIVA, in contrast to recent results, can be used for anomaly detection. We attribute this to the hierarchy of latent variables, which is able to extract high-level semantic features. Finally, we extend BIVA to semi-supervised classification tasks and show that it performs comparably to state-of-the-art results by generative adversarial networks. 
Big Data  “Big Data” is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. 
Big Data Analytics (BDA) 
Big data analytics refers to the process of collecting, organizing and analyzing large sets of data (“big data”) to discover patterns and other useful information. Not only will big data analytics help you to understand the information contained within the data, but it will also help identify the data that is most important to the business and future business decisions. Big data analysts basically want the knowledge that comes from analyzing the data. 
Big Data Discovery  Big Data Discovery = {Big Data, Data Discovery, Data Science} 
Big Data Integration Ontology  Big Data architectures make it possible to flexibly store and process heterogeneous data, from multiple sources, in their original format. The structure of those data, commonly supplied by means of REST APIs, is continuously evolving. Thus data analysts need to adapt their analytical processes after each API release. This gets more challenging when performing an integrated or historical analysis. To cope with such complexity, in this paper, we present the Big Data Integration ontology, the core construct to govern the data integration process under schema evolution by systematically annotating it with information regarding the schema of the sources. We present a query rewriting algorithm that, using the annotated ontology, converts queries posed over the ontology to queries over the sources. To cope with syntactic evolution in the sources, we present an algorithm that semi-automatically adapts the ontology upon new releases. This guarantees that ontology-mediated queries correctly retrieve data from the most recent schema version, as well as correctness in historical queries. A functional and performance evaluation on real-world APIs is performed to validate our approach. 
Big Data Management (BDM) 
Big Data Management (BDM) is an amalgam of old and new best practices, skills, teams, data types, and homegrown or vendor-built functionality. All of these are expanding and realigning so that businesses can fully leverage big data, not merely manage it. At the same time, big data must eventually find a permanent place in enterprise data management. BDM is well worth doing because managing big data leads to a number of benefits. According to this report’s survey, the business and technology tasks that improve most are analytic insights, the completeness of analytic data sets, business value drawn from big data, and all sales and marketing activities. BDM also has challenges, and common barriers include low organizational maturity relative to big data, weak business support, and the need to learn new technology approaches. 
Big O Notation  In mathematics, big O notation describes the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. It is a member of a larger family of notations that is called Landau notation, Bachmann-Landau notation (after Edmund Landau and Paul Bachmann), or asymptotic notation. In computer science, big O notation is used to classify algorithms by how they respond (e.g., in their processing time or working space requirements) to changes in input size. In analytic number theory, it is used to estimate the ‘error committed’ while replacing the asymptotic size, or asymptotic mean size, of an arithmetical function, by the value, or mean value, it takes at a large finite argument. A famous example is the problem of estimating the remainder term in the prime number theorem. Big O notation characterizes functions according to their growth rates: different functions with the same growth rate may be represented using the same O notation. The letter O is used because the growth rate of a function is also referred to as order of the function. A description of a function in terms of big O notation usually only provides an upper bound on the growth rate of the function. Associated with big O notation are several related notations, using the symbols o, Ω, ω, and Θ, to describe other kinds of bounds on asymptotic growth rates. Big O notation is also used in many other fields to provide similar estimates. 
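As a rough illustration of how big O classifies algorithms by their response to input size, the sketch below counts the steps taken by a linear scan, which is O(n), and by a binary search, which is O(log n), on sorted lists of growing length. The function names are hypothetical and chosen only for this example.

```python
def linear_search_steps(items, target):
    """Return the number of comparisons a left-to-right scan makes."""
    steps = 0
    for value in items:
        steps += 1
        if value == target:
            break
    return steps


def binary_search_steps(items, target):
    """Return the number of loop iterations binary search makes on a sorted list."""
    lo, hi = 0, len(items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            break
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


for n in (1_000, 1_000_000):
    data = list(range(n))
    # Worst case for the scan: the target is the last element.
    print(n, linear_search_steps(data, n - 1), binary_search_steps(data, n - 1))
```

Multiplying the input size by a thousand multiplies the scan's step count by a thousand, while the binary search's step count roughly doubles. That difference in growth rate is what the notation captures.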
Big Workflow  Big Workflow is an industry term coined by Adaptive Computing for accelerating insights by processing intense simulations and big data analysis more efficiently. Adaptive Computing’s Big Workflow solution derives its name from its ability to solve big data challenges by streamlining the workflow to deliver valuable insights from massive quantities of data across multiple platforms, environments and locations. 
BigDataBench  As architecture, system, data management, and machine learning communities pay greater attention to innovative big data and data-driven artificial intelligence (in short, AI) algorithms, architecture, and systems, the pressure of benchmarking rises. However, the complexity, diversity, frequently changing workloads, and rapid evolution of big data and especially AI systems raise great challenges in benchmarking. First, for the sake of conciseness, benchmarking scalability, portability cost, reproducibility, and better interpretation of performance data, we need to understand the abstractions of frequently appearing units of computation, which we call dwarfs, among big data and AI workloads. Second, for the sake of fairness, the benchmarks must include diversity of data and workloads. Third, for co-design of software and hardware, the benchmarks should be consistent across different communities. Rather than creating a new benchmark or proxy for every possible workload, we propose using dwarf-based benchmarks (the combination of eight dwarfs) to represent the diversity of big data and AI workloads. The current version, BigDataBench 4.0, provides 13 representative real-world data sets and 47 big data and AI benchmarks, covering seven workload types: online service, offline analytics, graph analytics, AI, data warehouse, NoSQL, and streaming. BigDataBench 4.0 is publicly available from http://…/BigDataBench. Also, for the first time, we comprehensively characterize the benchmarks of seven workload types in BigDataBench 4.0, in addition to traditional benchmarks like SPECCPU, PARSEC and HPCC, in a hierarchical manner and drill down on five levels, using the Top-Down analysis from an architecture perspective. 
BigDL  In this paper, we present BigDL, a distributed deep learning framework for Big Data platforms and workflows. It is implemented on top of Apache Spark, and allows users to write their deep learning applications as standard Spark programs (running directly on large-scale big data clusters in a distributed fashion). It provides an expressive, ‘data-analytics integrated’ deep learning programming model, so that users can easily build the end-to-end analytics + AI pipelines under a unified programming paradigm; by implementing an AllReduce-like operation using existing primitives in Spark (e.g., shuffle, broadcast, and in-memory data persistence), it also provides a highly efficient ‘parameter server’ style architecture, so as to achieve highly scalable, data-parallel distributed training. Since its initial open source release, BigDL users have built many analytics and deep learning applications (e.g., object detection, sequence-to-sequence generation, neural recommendations, fraud detection, etc.) on Spark. 
Biggy  ➘ “Datar” 
BigLittle Net  In this paper, we propose a novel Convolutional Neural Network (CNN) architecture for learning multi-scale feature representations with good trade-offs between speed and accuracy. This is achieved by using a multi-branch network, which has different computational complexity at different branches. Through frequent merging of features from branches at distinct scales, our model obtains multi-scale features while using less computation. The proposed approach demonstrates improvement of model efficiency and performance on both object recognition and speech recognition tasks, using popular architectures including ResNet and ResNeXt. For object recognition, our approach reduces computation by 33% while improving accuracy by 0.9%. Furthermore, our model surpasses state-of-the-art CNN acceleration approaches by a large margin in accuracy and FLOPs reduction. On the task of speech recognition, our proposed multi-scale CNNs save 30% FLOPs with slightly better word error rates, showing good generalization across domains. 
BigQuery  Querying massive datasets can be time consuming and expensive without the right hardware and infrastructure. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google’s infrastructure. Simply move your data into BigQuery and let us handle the hard work. You can control access to both the project and your data based on your business needs, such as giving others the ability to view or query your data. You can access BigQuery by using a web UI or a command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Java, PHP or Python. There are also a variety of third-party tools that you can use to interact with BigQuery, such as visualizing the data or loading the data. Get started now with creating an app, running a web query or using the command-line tool, or read on for more information about BigQuery fundamentals and how you can work with the product. BigQuery Big Data Visualization With D3.js Introduction to BigQuery ML 
Bilateral Adversarial Training (BAT) 
In this paper, we study fast training of adversarially robust models. From the analyses of the state-of-the-art defense method, i.e., multi-step adversarial training (Madry et al., 2017), we hypothesize that the gradient magnitude is linked to model robustness. Motivated by this, we propose to perturb both the image and the label during training, which we call Bilateral Adversarial Training (BAT). To generate the adversarial label, we derive a closed-form heuristic solution. To generate the adversarial image, we use a one-step targeted attack with the target label being the most confusing class. In the experiments, we first show that random start and the most confusing target attack effectively prevent the label leaking and gradient masking problems. Then, coupled with the adversarial label part, our model significantly improves the state-of-the-art results. For example, against the PGD100 attack with cross-entropy loss, on CIFAR-10, we achieve 63.7% versus 47.2%; on SVHN, we achieve 59.1% versus 42.1%; on CIFAR-100, we achieve 25.3% versus 23.4%. Note that these results are obtained by the fast one-step adversarial training. 
Bilateral Segmentation Network (BiSeNet) 
Semantic segmentation requires both rich spatial information and a sizeable receptive field. However, modern approaches usually compromise spatial resolution to achieve real-time inference speed, which leads to poor performance. In this paper, we address this dilemma with a novel Bilateral Segmentation Network (BiSeNet). We first design a Spatial Path with a small stride to preserve the spatial information and generate high-resolution features. Meanwhile, a Context Path with a fast downsampling strategy is employed to obtain a sufficient receptive field. On top of the two paths, we introduce a new Feature Fusion Module to combine features efficiently. The proposed architecture strikes a good balance between speed and segmentation performance on the Cityscapes, CamVid, and COCO-Stuff datasets. Specifically, for a 2048×1024 input, we achieve 68.4% Mean IOU on the Cityscapes test dataset with a speed of 105 FPS on one NVIDIA Titan XP card, which is significantly faster than the existing methods with comparable performance. 
BiLayer Hidden Markov Model (BiHMM) 
As one of the most popular services in online communities, social recommendation has attracted increasing research effort recently. Among all the recommendation tasks, an important one is social item recommendation over high speed social media streams. Existing streaming recommendation techniques are not effective for handling social users with diverse interests. Meanwhile, approaches for recommending items to a particular user are not efficient when applied to a huge number of users over high speed streams. In this paper, we propose a novel framework for social recommendation over streaming environments. Specifically, we first propose a novel BiLayer Hidden Markov Model (BiHMM) that adaptively captures the behaviors of social users and their interactions with influential official accounts to predict their long-term and short-term interests. Then, we design a new probabilistic entity matching scheme for effectively identifying the relevance score of a streaming item to a user. Following that, we propose a novel indexing scheme called {\Tree} for improving the efficiency of our solution. Extensive experiments are conducted to prove the high performance of our approach in terms of the recommendation quality and time cost. 
Bilinear Attention Network (BAN) 
Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, the computational cost to learn attention distributions for every pair of multimodal input channels is prohibitively expensive. To solve this problem, co-attention builds two separate attention distributions for each modality, neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions among two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit eight-attention maps of the BAN efficiently. We quantitatively and qualitatively evaluate our model on visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-art results on both datasets. 
BilingualGAN  Latent space based GAN methods and attention based sequence to sequence models have achieved impressive results in text generation and unsupervised machine translation respectively. Leveraging the two domains, we propose an adversarial latent space based model capable of generating parallel sentences in two languages concurrently and translating bidirectionally. The bilingual generation goal is achieved by sampling from the latent space that is shared between both languages. First, two denoising autoencoders are trained, with shared encoders and back-translation to enforce a shared latent state between the two languages. The decoder is shared for the two translation directions. Next, a GAN is trained to generate synthetic ‘code’ mimicking the languages’ shared latent space. This code is then fed into the decoder to generate text in either language. We perform our experiments on Europarl and Multi30k datasets, on the English-French language pair, and document our performance using both supervised and unsupervised machine translation. 
Binacox  Determining significant prognostic biomarkers is of increasing importance in many areas of medicine. In order to translate a continuous biomarker into a clinical decision, it is often necessary to determine cut-points. There is so far no standard method to help evaluate how many cut-points are optimal for a given feature in a survival analysis setting. Moreover, most existing methods are univariate, hence not well suited for high-dimensional frameworks. This paper introduces a prognostic method called Binacox to deal with the problem of detecting multiple cut-points per feature in a multivariate setting where a large number of continuous features are available. It is based on the Cox model and combines one-hot encodings with the binarsity penalty. This penalty uses total-variation regularization together with an extra linear constraint to avoid collinearity between the one-hot encodings and enable feature selection. A non-asymptotic oracle inequality is established. The statistical performance of the method is then examined in an extensive Monte Carlo simulation study, and finally illustrated on three publicly available genetic cancer datasets with high-dimensional features. On these datasets, our proposed methodology significantly outperforms the state-of-the-art survival models regarding risk prediction in terms of C-index, with a computing time orders of magnitude faster. In addition, it provides powerful interpretability by automatically pinpointing significant cut-points on relevant features from a clinical point of view. 
Binarized Back Propagation  
Binarized Deep Neural Network (BDNN) 
In this work we introduce a binarized deep neural network (BDNN) model. BDNNs are trained using a novel binarized back propagation algorithm (BBP), which uses binary weights and binary neurons during the forward and backward propagation, while retaining precision of the stored weights in which gradients are accumulated. At test phase, BDNNs are fully binarized and can be implemented in hardware with low circuit complexity. The proposed binarized networks can be implemented using binary convolutions and proxy matrix multiplications with only standard binary XNOR and population count (popcount) operations. BBP is expected to reduce energy consumption by at least two orders of magnitude when compared to the hardware implementation of existing training algorithms. We obtained near state-of-the-art results with BDNNs on the permutation-invariant MNIST, CIFAR-10 and SVHN datasets. 
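The XNOR and popcount arithmetic mentioned in the entry above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes weights and activations take values in {+1, -1}, packs them into integer bit masks (bit set for +1), and uses the identity dot = 2 * matches - n. The helper names are hypothetical.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an int; bit i is set when signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two length-n {+1, -1} vectors given as packed bit masks."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 wherever the two signs match
    matches = bin(agree).count("1")     # popcount
    # Each match contributes +1 and each mismatch -1 to the dot product.
    return 2 * matches - n


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
print(xnor_popcount_dot(pack_bits(a), pack_bits(b), len(a)))
print(sum(x * y for x, y in zip(a, b)))  # reference result with ordinary multiplies
```

Both printed values agree, which is why a binarized layer can replace multiply-accumulate units with bitwise logic.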
Binary Direct Feedback Alignment (BDFA) 
Many algorithms have been proposed as substitutes for backpropagation (BP) in deep neural network (DNN) training. However, they have not become popular because their training accuracy and computational efficiency are worse than those of BP. One of them is direct feedback alignment (DFA), but it shows low training performance, especially for convolutional neural networks (CNNs). In this paper, we overcome this limitation of the DFA algorithm by combining it with conventional BP during CNN training. To improve training stability, we also suggest a feedback weight initialization method based on analyzing the patterns of the fixed random matrices in DFA. Finally, we propose a new training algorithm, binary direct feedback alignment (BDFA), to minimize the computational cost while maintaining the training accuracy compared with DFA. In our experiments, we use the CIFAR-10 and CIFAR-100 datasets to simulate CNN learning from scratch and apply BDFA to an online-learning-based object tracking application to examine training in a small-dataset environment. Our proposed algorithms show better performance than conventional BP in both training tasks, especially when the dataset is small. 
Binary Ensemble Neural Network (BENN) 
Binary neural networks (BNN) have been studied extensively since they run dramatically faster at lower memory and power consumption than floating-point networks, thanks to the efficiency of bit operations. However, contemporary BNNs whose weights and activations are both single bits suffer from severe accuracy degradation. To understand why, we investigate the representation ability, speed and bias/variance of BNNs through extensive experiments. We conclude that the error of BNNs is predominantly caused by the intrinsic instability (training time) and non-robustness (train and test time). Inspired by this investigation, we propose the Binary Ensemble Neural Network (BENN) which leverages ensemble methods to improve the performance of BNNs with limited efficiency cost. While ensemble techniques have been broadly believed to be only marginally helpful for strong classifiers such as deep neural networks, our analyses and experiments show that they are naturally a perfect fit to boost BNNs. We find that our BENN, which is faster and much more robust than state-of-the-art binary networks, can even surpass the accuracy of the full-precision floating number network with the same architecture. 
Binary Image SelectiON (BISON) 
Providing systems the ability to relate linguistic and visual content is one of the hallmarks of computer vision. Tasks such as image captioning and retrieval were designed to test this ability, but come with complex evaluation measures that gauge various other abilities and biases simultaneously. This paper presents an alternative evaluation task for visual-grounding systems: given a caption, the system is asked to select the image that best matches the caption from a pair of semantically similar images. The system’s accuracy on this Binary Image SelectiON (BISON) task is not only interpretable, but also measures the ability to relate fine-grained text content in the caption to visual content in the images. We gathered a BISON dataset that complements the COCO Captions dataset and used this dataset in auxiliary evaluations of captioning and caption-based retrieval systems. While captioning measures suggest visual grounding systems outperform humans, BISON shows that these systems are still far away from human performance. 
Binary Matching Pursuit  We study the problem of learning latent feature models (LFMs) for tensor data commonly observed in science and engineering such as hyperspectral imagery. However, the problem is challenging not only due to the non-convex formulation, the combinatorial nature of the constraints in LFMs, but also the high-order correlations in the data. In this work, we formulate a tensor latent feature learning problem by representing the data as a mixture of high-order latent features and binary codes, which are memory efficient and easy to interpret. To make the learning tractable, we propose a novel optimization procedure, Binary matching pursuit (BMP), that iteratively searches for binary bases via a MAXCUT-like boolean quadratic solver. Such a procedure is guaranteed to achieve an $\epsilon$-suboptimal solution in $O(1/\epsilon)$ greedy steps, resulting in a trade-off between accuracy and sparsity. When evaluated on both synthetic and real datasets, our experiments show superior performance over baseline methods. 
Binary Network Embedding (BinaryNE) 
Traditional network embedding primarily focuses on learning a dense vector representation for each node, which encodes network structure and/or node content information, such that off-the-shelf machine learning algorithms can be easily applied to the vector-format node representations for network analysis. However, the learned dense vector representations are inefficient for large-scale similarity search, which requires finding the nearest neighbor measured by Euclidean distance in a continuous vector space. In this paper, we propose a search efficient binary network embedding algorithm called BinaryNE to learn a sparse binary code for each node, by simultaneously modeling node context relations and node attribute relations through a three-layer neural network. BinaryNE learns binary node representations efficiently through a stochastic gradient descent based online learning algorithm. The learned binary encoding not only reduces memory usage to represent each node, but also allows fast bitwise comparisons to support much quicker network node search compared to Euclidean distance or other distance measures. Our experiments and comparisons show that BinaryNE not only delivers more than 23 times faster search speed, but also provides comparable or better search quality than traditional continuous vector based network embedding methods. 
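To make the "fast bitwise comparisons" in the entry above concrete, here is a small sketch of nearest-neighbor search over binary codes using Hamming distance. It only illustrates the search side, not how BinaryNE learns its codes. The node names, the 8-bit codes and the helper functions are all made up for the example.

```python
def hamming(a, b):
    """Number of differing bits between two binary codes stored as ints."""
    return bin(a ^ b).count("1")


def nearest(query, codes):
    """Return the node whose code has the smallest Hamming distance to query."""
    return min(codes, key=lambda node: hamming(query, codes[node]))


# Hypothetical 8-bit node codes.
codes = {
    "a": 0b10110010,
    "b": 0b11110111,
    "c": 0b01001101,
}
query = 0b10110011

for node, code in codes.items():
    print(node, hamming(query, code))
print("nearest:", nearest(query, codes))
```

Each comparison is one XOR plus a bit count, with no floating-point arithmetic, which is the source of the speed advantage over Euclidean distance on dense vectors.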
Binary Paragraph Vector  In this dissertation we report results of our research on dense distributed representations of text data. We propose two novel neural models for learning such representations. The first model learns representations at the document level, while the second model learns word-level representations. For document-level representations we propose Binary Paragraph Vector: a family of neural network models for learning binary representations of text documents, which can be used for fast document retrieval. We provide a thorough evaluation of these models and demonstrate that they outperform the seminal method in the field in the information retrieval task. We also report strong results in transfer learning settings, where our models are trained on a generic text corpus and then used to infer codes for documents from a domain-specific dataset. In contrast to previously proposed approaches, Binary Paragraph Vector models learn embeddings directly from raw text data. For word-level representations we propose Disambiguated Skip-gram: a neural network model for learning multi-sense word embeddings. Representations learned by this model can be used in downstream tasks, like part-of-speech tagging or identification of semantic relations. In the word sense induction task Disambiguated Skip-gram outperforms state-of-the-art models on three out of four benchmark datasets. Our model has an elegant probabilistic interpretation. Furthermore, unlike previous models of this kind, it is differentiable with respect to all its parameters and can be trained with backpropagation. In addition to quantitative results, we present qualitative evaluation of Disambiguated Skip-gram, including two-dimensional visualisations of selected word-sense embeddings. 
Binary Stochastic Filtering (BSF) 
Binary Stochastic Filtering (BSF), an algorithm for feature selection and neuron pruning, is proposed in this work. The filtering layer stochastically passes or filters out features based on individual weights, which are tuned during the neural network training process. Placing BSF after the neural network input filters the input features, i.e. performs feature selection. A more than 5-fold decrease in dimensionality was achieved in the experiments. Placing a BSF layer between hidden layers allows filtering of neuron outputs and could be used for neuron pruning. Up to a 34-fold decrease in the number of weights in the network was reached, which corresponds to a significant increase in performance that is especially important for mobile and embedded applications. 
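The forward pass of a stochastic filtering layer like the one described above might look like the sketch below. It assumes, purely for illustration, that each per-feature weight is read as the probability of passing that feature and that a filtered feature is replaced by zero. The entry does not specify these details, and training of the weights is not shown.

```python
import random


def stochastic_filter(features, keep_probs, rng):
    """Pass each feature with its own probability; filtered features become 0.0."""
    return [
        x if rng.random() < p else 0.0
        for x, p in zip(features, keep_probs)
    ]


rng = random.Random(0)          # seeded so repeated runs are reproducible
features = [3.0, 5.0, 7.0]
keep_probs = [1.0, 0.0, 0.5]    # hypothetical weights: always pass, never pass, coin flip

print(stochastic_filter(features, keep_probs, rng))
```

A feature whose weight is driven to zero during training is never passed and can be dropped from the input (feature selection) or, between hidden layers, the corresponding neuron can be pruned.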
Binary Unique Number of Word (BUNOW) 
Text classification plays a vital role today, especially with the intensive use of social networking media. Recently, different architectures of convolutional neural networks have been used for text classification, in which one-hot vectors and word embedding methods are commonly used. This paper presents a new language-independent word encoding method for text classification. The proposed model converts raw text data to a low-level feature dimension with minimal or no preprocessing steps by using a new approach called binary unique number of word ‘BUNOW’. BUNOW allows each unique word to have an integer ID in a dictionary that is represented as a k-dimensional vector of its binary equivalent. The output vector of this encoding is fed into a convolutional neural network (CNN) model for classification. Moreover, the proposed model reduces the number of neural network parameters and allows faster computation with few network layers: a word is the atomic unit of the document, as in word-level representations, while memory consumption is lower than for character-level representations. The provided CNN model is able to work with other languages or multilingual text without the need for any changes in the encoding method. The model outperforms the character-level and very deep character-level CNN models in terms of accuracy, network parameters, and memory consumption; the results show a total classification accuracy of 91.99% and an error of 8.01% on the AG’s News dataset, compared to state-of-the-art methods that have a total classification accuracy of 91.45% and an error of 8.55%, in addition to reductions in the input feature vector and neural network parameters of 62% and 34%, respectively. 
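The encoding step described above (unique word, integer ID, k-dimensional binary vector) can be sketched as follows. Several details are assumptions made for the example and are not taken from the paper: whitespace tokenization, IDs assigned from 1 in order of first appearance, ID 0 reserved for unknown words, and most-significant-bit-first ordering.

```python
def build_vocab(texts):
    """Assign each unique word an integer ID, starting at 1, in order of appearance."""
    vocab = {}
    for text in texts:
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode_word(word, vocab, k):
    """Return the word's ID as a list of k bits, most significant bit first."""
    word_id = vocab.get(word, 0)  # assumption: 0 stands for an unknown word
    if word_id >= 1 << k:
        raise ValueError("k is too small to represent this ID")
    return [(word_id >> bit) & 1 for bit in range(k - 1, -1, -1)]


vocab = build_vocab(["the cat sat", "the dog sat"])
for word in ("the", "cat", "sat", "dog"):
    print(word, vocab[word], encode_word(word, vocab, k=4))
```

With k bits the dictionary can hold up to 2^k - 1 words, so the input vector grows logarithmically with vocabulary size instead of linearly as with one-hot encoding.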
Binary Weight and Hadamard-transformed Image Network (BWHIN) 
Deep learning has made significant improvements at many image processing tasks in recent years, such as image classification, object recognition and object detection. Convolutional neural networks (CNNs), a popular deep learning architecture designed to process data in multiple array form, show great success in almost all detection and recognition problems and computer vision tasks. However, the number of parameters in a CNN is so high that computers require more energy and a larger memory size. In order to solve this problem, we propose a novel energy efficient model, Binary Weight and Hadamard-transformed Image Network (BWHIN), which is a combination of Binary Weight Network (BWN) and Hadamard-transformed Image Network (HIN). It is observed that energy efficiency is achieved with a slight sacrifice in classification accuracy. Our novel ensemble model outperforms the other energy efficient models. 
Binary-Forking Model  Optimal Parallel Algorithms in the Binary-Forking Model 
BinaryNet  We introduce BinaryNet, a method which trains DNNs with binary weights and activations when computing parameters’ gradient. We show that it is possible to train a Multi Layer Perceptron (MLP) on MNIST and ConvNets on CIFAR-10 and SVHN with BinaryNet and achieve nearly state-of-the-art results. At run-time, BinaryNet drastically reduces memory usage and replaces most multiplications by 1-bit exclusive-not-or (XNOR) operations, which might have a big impact on both general-purpose and dedicated Deep Learning hardware. We wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST MLP 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for BinaryNet is available. 
Binci  Binci is a utility that allows you to easily containerize your development workflow using Docker. Simply put, it’s like having a clean room for all of your development processes which contain services (like databases) without needing to set up and maintain these environments manually. 
BinderHub  BinderHub allows you to BUILD and REGISTER a Docker image using a GitHub repository, then CONNECT with JupyterHub, allowing you to create a public IP address that allows users to interact with the code and environment within a live JupyterHub instance. You can select a specific branch name, commit, or tag to serve. BinderHub ties together: • JupyterHub to provide a scalable system for authenticating users and spawning single user Jupyter Notebook servers, and • Repo2Docker which generates a Docker image using a Git repository hosted online. BinderHub is created using Python, Kubernetes, tornado, and traitlets. As such, it should be a familiar technical foundation for Jupyter developers. Why BinderHub? Collections of Jupyter notebooks are becoming more common in scientific research and data science. The ability to serve these collections on demand enhances the usefulness of these notebooks. 
BindsNET  The development of spiking neural network simulation software is a critical component enabling the modeling of neural systems and the development of biologically inspired algorithms. Existing software frameworks support a wide range of neural functionality, software abstraction levels, and hardware devices, yet are typically not suitable for rapid prototyping or application to problems in the domain of machine learning. In this paper, we describe a new Python package for the simulation of spiking neural networks, specifically geared towards machine learning and reinforcement learning. Our software, called BindsNET, enables rapid building and simulation of spiking networks and features user-friendly, concise syntax. BindsNET is built on top of the PyTorch deep neural networks library, enabling fast CPU and GPU computation for large spiking networks. The BindsNET framework can be adjusted to meet the needs of other existing computing and hardware environments, e.g., TensorFlow. We also provide an interface into the OpenAI gym library, allowing for training and evaluation of spiking networks on reinforcement learning problems. We argue that this package facilitates the use of spiking networks for large-scale machine learning experimentation, and show some simple examples of how we envision BindsNET can be used in practice. BindsNET code is available at https://…/bindsnet 
BINet  In this paper, we introduce BINet, a neural network architecture for real-time multi-perspective anomaly detection in business process event logs. BINet is designed to handle both the control flow and the data perspective of a business process. Additionally, we propose a set of heuristics for setting the threshold of an anomaly detection algorithm automatically. We demonstrate that BINet can be used to detect anomalies in event logs not only on a case level but also on event attribute level. Finally, we demonstrate that a simple set of rules can be used to utilize the output of BINet for anomaly classification. We compare BINet to eight other state-of-the-art anomaly detection algorithms and evaluate their performance on an elaborate data corpus of 29 synthetic and 15 real-life event logs. BINet outperforms all other methods both on the synthetic as well as on the real-life datasets. 
BinGAN  In this paper, we propose a novel regularization method for Generative Adversarial Networks, which allows the model to learn discriminative yet compact binary representations of image patches (image descriptors). We employ the dimensionality reduction that takes place in the intermediate layers of the discriminator network and train a binarized low-dimensional representation of the penultimate layer to mimic the distribution of the higher-dimensional preceding layers. To achieve this, we introduce two loss terms that aim at: (i) reducing the correlation between the dimensions of the binarized low-dimensional representation of the penultimate layer (i.e., maximizing joint entropy) and (ii) propagating the relations between the dimensions in the high-dimensional space to the low-dimensional space. We evaluate the resulting binary image descriptors on two challenging applications, image matching and retrieval, and achieve state-of-the-art results.
Binning  Data binning is a data preprocessing technique used to reduce the effects of minor observation errors. The original data values which fall in a given small interval, a bin, are replaced by a value representative of that interval, often the central value. It is a form of quantization. Binning is the term used in scoring modeling for what is also known in Machine Learning as Discretization, the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and its relationship with a binary variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a Scorecard. http://…optimalbinningforscoringmodeling.html
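As a minimal sketch of the replacement step described above, the following Python function assigns each value to one of a fixed number of equal-width intervals and replaces it with the central value of that interval. The function name `equal_width_bins` and the choice of equal-width bins are assumptions made for illustration; the entry itself does not prescribe how the intervals are chosen.

```python
def equal_width_bins(values, n_bins):
    """Replace each value with the midpoint of its equal-width bin.

    Illustrative helper (hypothetical name): bins span [min, max] of the data.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is nothing to quantize.
        return list(values)
    width = (hi - lo) / n_bins
    binned = []
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        binned.append(lo + (idx + 0.5) * width)
    return binned


if __name__ == "__main__":
    print(equal_width_bins([1, 2, 3, 10], 3))
```

Scoring applications often choose bin edges by other criteria (for example, quantiles or a relationship with the target variable); the equal-width rule here is only the simplest case.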
Binocular Speculation  MapReduce speculation plays an important role in finding potential task stragglers and failures. But a tacit dichotomy exists in MapReduce due to its inherent two-phase (map and reduce) management scheme in which map tasks and reduce tasks have distinctly different execution behaviors, yet reduce tasks are dependent on the results of map tasks. We reveal that speculation policies for fault handling in MapReduce do not recognize this dichotomy between map and reduce tasks, which leads to an issue of speculation myopia for MapReduce fault recovery. These issues cause significant performance degradation upon network and node failures. To address the speculation myopia caused by MapReduce dichotomy, we introduce a new scheme called binocular speculation to help MapReduce increase its assessment scope for speculation. As part of the scheme, we also design three component techniques including neighborhood glance, collective speculation and speculative rollback. Our evaluation shows that, with these techniques, binocular speculation can increase the coordination of map and reduce phases, and enhance the efficiency of MapReduce fault recovery.
Binomial Options Pricing Model (BOPM) 
In finance, the binomial options pricing model (BOPM) provides a generalizable numerical method for the valuation of options. The binomial model was first proposed by Cox, Ross and Rubinstein in 1979. Essentially, the model uses a ‘discrete-time’ (lattice-based) model of the varying price over time of the underlying financial instrument. In general, binomial options pricing models do not have closed-form solutions. (Binomial Tree Option Model)
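The lattice idea can be illustrated with a short Python sketch of a recombining binomial tree. It assumes the commonly used Cox-Ross-Rubinstein parametrization (up factor exp(sigma * sqrt(dt)), down factor equal to its reciprocal) and a constant risk-free rate; the function name `crr_price` and its parameters are hypothetical and chosen only for this example.

```python
import math


def crr_price(spot, strike, rate, sigma, maturity, steps, kind="call", american=False):
    """Price an option on a recombining binomial lattice (CRR parametrization assumed)."""
    dt = maturity / steps
    up = math.exp(sigma * math.sqrt(dt))
    down = 1.0 / up
    discount = math.exp(-rate * dt)
    # Risk-neutral probability of an up move.
    p = (math.exp(rate * dt) - down) / (up - down)
    sign = 1.0 if kind == "call" else -1.0

    # Payoffs at maturity; node j has seen j up moves and (steps - j) down moves.
    values = [
        max(sign * (spot * up**j * down**(steps - j) - strike), 0.0)
        for j in range(steps + 1)
    ]

    # Roll back through the lattice one time step at a time.
    for i in range(steps - 1, -1, -1):
        for j in range(i + 1):
            value = discount * (p * values[j + 1] + (1.0 - p) * values[j])
            if american:
                # Early exercise is allowed, so compare with the immediate payoff.
                exercise = max(sign * (spot * up**j * down**(i - j) - strike), 0.0)
                value = max(value, exercise)
            values[j] = value
    return values[0]


if __name__ == "__main__":
    print(crr_price(100.0, 100.0, 0.05, 0.2, 1.0, 200, kind="call"))
    print(crr_price(100.0, 100.0, 0.05, 0.2, 1.0, 200, kind="put", american=True))
```

The backward pass is what makes the method generalizable: features such as early exercise only change what happens at each node, not the structure of the lattice.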
Bio7  The application Bio7 is an integrated development environment for ecological modelling and contains powerful tools for model creation, scientific image analysis and statistical analysis. The application itself is based on an RCP-Eclipse-Environment (Rich-Client-Platform) which offers a huge flexibility in configuration and extensibility because of its plugin structure and the possibility of customization. Features: · Creation and analysis of simulation models. · Statistical analysis. · Advanced R Graphical User Interface with editor, spreadsheet, ImageJ plot device and debugging interface. · Spatial statistics (possibility to send values from a specialized panel to R). · Image Analysis (embedded ImageJ). · Fast transfer of image data from ImageJ to R and vice versa. · Fast communication between R and Java (with RServe) and the possibility to use R methods inside Java. · Interpretation of Java and script creation (BeanShell, Groovy, Jython). · Dynamic compilation of Java. · Creation of methods for Java, BeanShell, Groovy, Jython and R (integrated editors for Java, R, BeanShell, Groovy, Jython). · Sensitivity analysis with an embedded flowchart editor in which scripts, macros and compiled code can be dragged and executed. · Creation of 3D OpenGL (Jogl) models. · Visualizations and simulations on an embedded 3D globe (World Wind Java SDK). · Creation of Graphical User Interfaces with the embedded JavaFX SceneBuilder.
BioWorkbench  Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing (HPC) techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems (SWfMS) and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives, by using queries to a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%. We also show how the application of machine learning techniques can enrich the analysis process.
Bipartite Graph  In the mathematical field of graph theory, a bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint sets U and V (that is, U and V are each independent sets) such that every edge connects a vertex in U to one in V. Equivalently, a bipartite graph is a graph that does not contain any odd-length cycles.
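The two equivalent characterizations above suggest a simple test, sketched here in Python: try to assign every vertex to U or V by breadth-first search, and report failure as soon as an edge joins two vertices on the same side (which happens exactly when an odd-length cycle exists). The function name `is_bipartite` and the adjacency-dictionary input format are assumptions made for this example.

```python
from collections import deque


def is_bipartite(adjacency):
    """Return True if the undirected graph can be split into two independent sets.

    `adjacency` maps each vertex to an iterable of its neighbours.
    """
    side = {}  # vertex -> 0 (set U) or 1 (set V)
    for start in adjacency:
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency.get(node, ()):
                if neighbour not in side:
                    # Put the neighbour on the opposite side of the current vertex.
                    side[neighbour] = 1 - side[node]
                    queue.append(neighbour)
                elif side[neighbour] == side[node]:
                    # An edge inside one side implies an odd-length cycle.
                    return False
    return True


if __name__ == "__main__":
    square = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [3, 1]}
    triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
    print(is_bipartite(square), is_bipartite(triangle))
```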
Biplot  Biplots are a type of exploratory graph used in statistics, a generalization of the simple two-variable scatterplot. A biplot allows information on both samples and variables of a data matrix to be displayed graphically. Samples are displayed as points while variables are displayed either as vectors, linear axes or nonlinear trajectories. In the case of categorical variables, category level points may be used to represent the levels of a categorical variable. A generalised biplot displays information on both continuous and categorical variables.
Biregular Irreducible Functions (BRI) 
It is investigated how to achieve semantic security for the wiretap channel. It is shown that asymptotically, every rate achievable with strong secrecy is also achievable with semantic security if the strong secrecy information leakage decreases sufficiently fast. If the decrease is slow, this continues to hold with a weaker formulation of semantic security. A new type of functions called biregular irreducible (BRI) functions, similar to universal hash functions, is introduced. BRI functions provide a universal method of establishing secrecy. It is proved that the known secrecy rates of any discrete and Gaussian wiretap channel are achievable with semantic security by modular wiretap codes constructed from a BRI function and an error-correcting code. A concrete universal hash function given by finite-field arithmetic can be converted into a BRI function for certain parameters. A characterization of BRI functions in terms of edge-disjoint biregular graphs on a common vertex set is derived. New BRI functions are constructed from families of Ramanujan graphs. It is shown that BRI functions used in modular schemes which achieve the semantic security capacity of discrete or Gaussian wiretap channels should be nearly Ramanujan. Moreover, BRI functions are universal hash functions on average.
BiSeg  We present a simple and effective framework for simultaneous semantic segmentation and instance segmentation with Fully Convolutional Networks (FCNs). The method, called BiSeg, predicts instance segmentation as a posterior in Bayesian inference, where semantic segmentation is used as a prior. We extend the idea of position-sensitive score maps used in recent methods to a fusion of multiple score maps at different scales and partition modes, and adopt it as a robust likelihood for instance segmentation inference. As both Bayesian inference and map fusion are performed per pixel, BiSeg is a fully convolutional end-to-end solution that inherits all the advantages of FCNs. We demonstrate state-of-the-art instance segmentation accuracy on PASCAL VOC.
Bi-Stream Emotion Attribution-Classification Network (BEAC-Net)
Emotional content is a crucial ingredient in user-generated videos. However, the sparsely expressed emotions in user-generated videos make emotion analysis difficult. In this paper, we propose a new neural approach, the Bi-stream Emotion Attribution-Classification Network (BEAC-Net), to solve three related emotion analysis tasks: emotion recognition, emotion attribution and emotion-oriented summarization, in an integrated framework. BEAC-Net has two major constituents, an attribution network and a classification network. The attribution network extracts the main emotional segment that classification should focus on in order to mitigate the sparsity problem. The classification network utilizes both the extracted segment and the original video in a bi-stream architecture. We contribute a new dataset for the emotion attribution task with human-annotated ground-truth labels for emotion segments. Experiments on two video datasets demonstrate superior performance of the proposed framework and the complementary nature of the dual classification streams.
Bit Stream Computing  In this study, we propose a novel computing paradigm, ‘Bit Stream Computing’, that is constructed on the logic used in stochastic computing, but does not necessarily employ randomly or binomially distributed bit streams as stochastic computing does. Any type of stream can be used, either stochastic or deterministic. The proposed paradigm benefits from the area advantage of stochastic logic and the accuracy advantage of conventional binary logic. We implement accurate arithmetic multiplier and adder circuits, classified as asynchronous or synchronous; we also consider their suitability for processing successive streams. The proposed circuits are simulated both at gate level and at transistor level with AMS 0.35 µm CMOS technology to show the circuits’ potential for practical use. We thoroughly compare the proposed adders and multipliers with their predecessors in the literature, individually and in a neural network application. Comparisons made in terms of area and accuracy clearly favor the proposed designs. We believe that this study opens up new horizons for computing that enable us to implement much smaller yet accurate arithmetic circuits compared to the conventional binary and stochastic ones.
Bitcoin  Bitcoin is a payment system invented by Satoshi Nakamoto in 2008 and introduced as open-source software in 2009. The system is peer-to-peer; all nodes verify transactions in a public distributed ledger called the block chain. The ledger uses its own unit of account, also called bitcoin. The system works without a central repository or single administrator, which has led the US Treasury to categorize it as a decentralized virtual currency. While bitcoin is not the first virtual currency, it is the first decentralized digital currency and cryptocurrency. It is the largest of its kind in terms of total market value.
Bit-Flip Attack (BFA)
Several important security issues of Deep Neural Networks (DNNs) have been raised recently, associated with different applications and components. The most widely investigated security concern of DNNs is malicious input, a.k.a. adversarial examples. Nevertheless, the security challenge of DNN parameters is not well explored yet. In this work, we are the first to propose a novel DNN weight attack methodology called Bit-Flip Attack (BFA), which can crush a neural network by maliciously flipping an extremely small number of bits within its weight storage memory system (i.e., DRAM). The bit-flip operations could be conducted through the well-known RowHammer attack, while our main contribution is to develop an algorithm to identify the most vulnerable bits of DNN weight parameters (stored in memory as binary bits) that could maximize the accuracy degradation with a minimum number of bit-flips. Our proposed BFA utilizes a Progressive Bit Search (PBS) method which combines gradient ranking and progressive search to identify the most vulnerable bit to be flipped. With the aid of PBS, we can drive a ResNet-18 to complete malfunction (i.e., top-1 accuracy degrades from 69.8% to 0.1%) with only 13 bit-flips out of 93 million bits, while randomly flipping 100 bits merely degrades the accuracy by less than 1%.
Bit-Regularized Deep Neural Network (BitNet)
We present a novel regularization scheme for training deep neural networks. The parameters of neural networks are usually unconstrained and have a dynamic range dispersed over the real line. Our key idea is to control the expressive power of the network by dynamically quantizing the range and set of values that the parameters can take. We formulate this idea using a novel end-to-end approach that regularizes the traditional classification loss function. Our regularizer is inspired by the Minimum Description Length principle. For each layer of the network, our approach optimizes a translation and scaling factor along with integer-valued parameters. We empirically compare BitNet to an equivalent unregularized model on the MNIST and CIFAR-10 datasets. We show that BitNet converges faster to a superior quality solution. Additionally, the resulting model is significantly smaller in size due to the use of integer parameters instead of floats.
BitSplitNet  Significant computational cost and memory requirements for deep neural networks (DNNs) make it difficult to utilize DNNs in resource-constrained environments. The binary neural network (BNN), which uses binary weights and binary activations, has been gaining interest for its hardware-friendly characteristics and minimal resource requirements. However, BNNs usually suffer from accuracy degradation. In this paper, we introduce ‘BitSplitNet’, a neural network which maintains the hardware-friendly characteristics of BNNs while improving accuracy by using multi-bit precision. In BitSplitNet, each bit of multi-bit activations propagates independently throughout the network before being merged at the end of the network. Thus, each bit path of the BitSplitNet resembles a BNN, and hardware-friendly features of BNNs, such as the bitwise binary activation function, are preserved in our scheme. We demonstrate that the BitSplit versions of LeNet-5, VGG-9, AlexNet, and ResNet-18 can be trained to have similar classification accuracy at a lower computational cost compared to conventional multi-bit networks with low bit precision (<= 4-bit). We further evaluate BitSplitNet on GPU with a custom CUDA kernel, showing that BitSplitNet can achieve better hardware performance in comparison to conventional multi-bit networks.
Bivariate Pareto Model  Bivariate.Pareto 
blabr  Scientific computing for the web. Create your own interactive computation directly in the browser. Share on the web. 
Black Hole Metric  In network science, there is often the need to sort the graph nodes. While the sorting strategy may be different, in general sorting is performed by exploiting the network structure. In particular, the metric PageRank has been used in the past decade in different ways to produce a ranking based on how many neighbors point to a specific node. PageRank is simple, easy to compute and effective in many applications; however, it comes with a price: as PageRank is an application of the random walker, the arc weights need to be normalized. This normalization, while necessary, introduces a series of unwanted side effects. In this paper, we propose a generalization of PageRank named Black Hole Metric which mitigates the problem. We devise a scenario in which the side effects are particularly impactful on the ranking, test the new metric in both real and synthetic networks, and show the results.
Black-Box Optimization Benchmarking (BBOB)

Black-Litterman Model  In finance, the Black-Litterman model is a mathematical model for portfolio allocation developed in 1990 at Goldman Sachs by Fischer Black and Robert Litterman, and published in 1992. It seeks to overcome problems that institutional investors have encountered in applying modern portfolio theory in practice. The model starts with the equilibrium assumption that the asset allocation of a representative agent should be proportional to the market values of the available assets, and then modifies that to take into account the ‘views’ (i.e., the specific opinions about asset returns) of the investor in question to arrive at a bespoke asset allocation.
BlackOut  We propose BlackOut, an approximation algorithm to efficiently train massive recurrent neural network language models (RNNLMs) with million word vocabularies. BlackOut is motivated by using a discriminative loss, and we describe a new sampling strategy which significantly reduces computation while improving stability, sample efficiency, and rate of convergence. One way to understand BlackOut is to view it as an extension of the DropOut strategy to the output layer, wherein we use a discriminative training loss and a weighted sampling scheme. We also establish close connections between BlackOut, importance sampling, and noise contrastive estimation (NCE). Our experiments, on the recently released one billion word language modeling benchmark, demonstrate scalability and accuracy of BlackOut; we outperform the state of the art, and achieve the lowest perplexity scores on this dataset. Moreover, unlike other established methods which typically require GPUs or CPU clusters, we show that a carefully implemented version of BlackOut requires only 1-10 days on a single machine to train an RNNLM with a million word vocabulary and billions of parameters on one billion words.
Blackwell Regret  n-discount optimality was introduced as a hierarchical form of policy and value-function optimality, with Blackwell optimality lying at the top level of the hierarchy (Veinott, 1969; Blackwell, 1962). We formalize notions of myopic discount factors, value functions and policies in terms of Blackwell optimality in MDPs, and we provide a novel concept of regret, called Blackwell regret, which measures the regret compared to a Blackwell optimal policy. Our main analysis focuses on long horizon MDPs with sparse rewards. We show that selecting the discount factor under which zero Blackwell regret can be achieved becomes arbitrarily hard. Moreover, even with oracle knowledge of such a discount factor that can realize a Blackwell regret-free value function, an $\epsilon$-Blackwell optimal value function may not even be gain optimal. Difficulties associated with this class of problems are discussed, and the notion of a policy gap is defined as the difference in expected return between a given policy and any other policy that differs at that state; we prove certain properties related to this gap. Finally, we provide experimental results that further support our theoretical results.
Bland-Altman Plot  A Bland-Altman plot (Difference plot) in analytical chemistry and biostatistics is a method of data plotting used in analyzing the agreement between two different assays. It is identical to a Tukey mean-difference plot, the name by which it is known in other fields, but was popularised in medical statistics by J. Martin Bland and Douglas G. Altman. BlandAltmanLeh, agRee
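The quantities behind such a plot can be sketched in a few lines of Python: for each pair of measurements, the x-coordinate is the mean of the two assays and the y-coordinate is their difference. The sketch also reports the mean difference (bias) and the conventional limits of agreement at the bias plus or minus 1.96 sample standard deviations, which is a customary choice and not something the entry above specifies. The function name `bland_altman` is hypothetical.

```python
from statistics import mean, stdev


def bland_altman(assay_a, assay_b):
    """Return plot coordinates, bias, and limits of agreement for paired measurements."""
    pairs = list(zip(assay_a, assay_b))
    means = [(a + b) / 2 for a, b in pairs]   # x-axis: mean of the two assays
    diffs = [a - b for a, b in pairs]         # y-axis: difference between assays
    bias = mean(diffs)
    spread = stdev(diffs)                     # sample standard deviation
    # Conventional 95% limits of agreement (assumes roughly normal differences).
    limits = (bias - 1.96 * spread, bias + 1.96 * spread)
    return means, diffs, bias, limits


if __name__ == "__main__":
    means, diffs, bias, limits = bland_altman([10, 12, 14, 16], [9, 13, 13, 17])
    print(means, diffs, bias, limits)
```

Plotting `diffs` against `means`, with horizontal lines at `bias` and at the two limits, gives the usual picture.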
Blaze  MapReduce and its variants have significantly simplified and accelerated the process of developing parallel programs. However, most MapReduce implementations focus on data-intensive tasks while many real-world tasks are compute intensive and their data can fit distributedly into the memory. For these tasks, the speed of MapReduce programs can be much slower than that of hand-optimized ones. We present Blaze, a C++ library that makes it easy to develop high performance parallel programs for such compute intensive tasks. At the core of Blaze is a highly-optimized in-memory MapReduce function, which has three main improvements over conventional MapReduce implementations: eager reduction, fast serialization, and special treatment for a small fixed key range. We also offer additional conveniences that make developing parallel programs similar to developing serial programs. These improvements make Blaze an easy-to-use cluster computing library that approaches the speed of hand-optimized parallel code. We apply Blaze to some common data mining tasks, including word frequency count, PageRank, k-means, expectation maximization (Gaussian mixture model), and k-nearest neighbors. Blaze outperforms Apache Spark by more than 10 times on average for these tasks, and the speed of Blaze scales almost linearly with the number of nodes. In addition, Blaze uses only the MapReduce function and 3 utility functions in its implementation while Spark uses almost 30 different parallel primitives in its official implementation.
Blazer  Explore your data with SQL. Easily create charts and dashboards, and share them with your team. 
Bleach  In this paper we address the problem of rule-based stream data cleaning, which sets stringent requirements on latency, rule dynamics and ability to cope with the unbounded nature of data streams. We design a system, called Bleach, which achieves real-time violation detection and data repair on a dirty data stream. Bleach relies on efficient, compact and distributed data structures to maintain the necessary state to repair data, using an incremental version of the equivalence class algorithm. Additionally, it supports rule dynamics and uses a ‘cumulative’ sliding window operation to improve cleaning accuracy. We evaluate a prototype of Bleach using a TPC-DS derived dirty data stream and observe its high throughput, low latency and high cleaning accuracy, even with rule dynamics. Experimental results indicate superior performance of Bleach compared to a baseline system built on the micro-batch streaming paradigm.
Blinder-Oaxaca Decomposition  The Blinder-Oaxaca decomposition technique, or simply the Oaxaca decomposition, decomposes wage differentials into two components: a portion that arises because two comparison groups, on average, have different qualifications or credentials (e.g., years of schooling and experience in the labor market) when both groups receive the same treatment (explained component), and a portion that arises because one group is more favorably treated than the other given the same individual characteristics (unexplained component). The two portions are also called the characteristics and coefficients effects, using the terminology of regression analysis, which provides the basis of this decomposition technique. The coefficients effect is frequently interpreted as a measure of labor market discrimination. For a comprehensive review of issues related to labor market discrimination, see Joseph Altonji and Rebecca Blank (1999). oaxaca
Bling Fire  We are a team at Microsoft called Bling (Beyond Language Understanding); we help Bing be smarter. Here we wanted to share with all of you our FInite State machine and REgular expression manipulation library (FIRE). We use Fire for many linguistic operations inside Bing such as Tokenization, Multi-word expression matching, Unknown word-guessing, Stemming / Lemmatization, just to mention a few.
BlinkML  The rising volume of datasets has made training machine learning (ML) models a major computational cost in the enterprise. Given the iterative nature of model and parameter tuning, many analysts use a small sample of their entire data during their initial stage of analysis to make quick decisions (e.g., what features or hyperparameters to use) and use the entire dataset only in later stages (i.e., when they have converged to a specific model). This sampling, however, is performed in an ad hoc fashion. Most practitioners cannot precisely capture the effect of sampling on the quality of their model, and eventually on their decision-making process during the tuning phase. Moreover, without systematic support for sampling operators, many optimizations and reuse opportunities are lost. In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML training. BlinkML allows users to make error-computation trade-offs: instead of training a model on their full data (i.e., full model), BlinkML can quickly train an approximate model with quality guarantees using a sample. The quality guarantees ensure that, with high probability, the approximate model makes the same predictions as the full model. BlinkML currently supports any ML model that relies on maximum likelihood estimation (MLE), which includes Generalized Linear Models (e.g., linear regression, logistic regression, max entropy classifier, Poisson regression) as well as PPCA (Probabilistic Principal Component Analysis). Our experiments show that BlinkML can speed up the training of large-scale ML tasks by 6.26x-629x while guaranteeing the same predictions, with 95% probability, as the full model.
Blip  Edge environments offer a number of advantages for software developers, including the ability to create services which can offer lower latency, better privacy, and reduced operational costs than traditional cloud hosted services. However, large technical challenges exist which prevent developers from utilising the Edge: complexities related to the heterogeneous nature of the Edge environment, issues with orchestration and application management, and lastly, the inherent issues in creating decentralised distributed applications which operate at a large geographic scale. In this conceptual and architectural paper we envision a solution, Blip, which offers an easy to use programming and operational environment which addresses these issues. It aims to remove the technical barriers which will inhibit the wider adoption of Edge application development. This paper validates the Blip concept by demonstrating how it will deliver on the advantages of the Edge for a familiar scenario.
BlitzWS  By reducing optimization to a sequence of smaller subproblems, working set algorithms achieve fast convergence times for many machine learning problems. Despite such performance, working set implementations often resort to heuristics to determine subproblem size, makeup, and stopping criteria. We propose BlitzWS, a working set algorithm with useful theoretical guarantees. Our theory relates subproblem size and stopping criteria to the amount of progress during each iteration. This result motivates strategies for optimizing algorithmic parameters and discarding irrelevant components as BlitzWS progresses toward a solution. BlitzWS applies to many convex problems, including training L1-regularized models and support vector machines. We showcase this versatility with empirical comparisons, which demonstrate BlitzWS is indeed a fast algorithm.
Block Chain  A block chain is a transaction database shared by all nodes participating in a system based on the Bitcoin protocol. A full copy of a currency’s block chain contains every transaction ever executed in the currency. With this information, one can find out how much value belonged to each address at any point in history. Every block contains a hash of the previous block. This has the effect of creating a chain of blocks from the genesis block to the current block. Each block is guaranteed to come after the previous block chronologically because the previous block’s hash would otherwise not be known. Each block is also computationally impractical to modify once it has been in the chain for a while because every block after it would also have to be regenerated. These properties are what make double-spending of bitcoins very difficult. The block chain is the main innovation of Bitcoin. The block chain is a public ledger that records bitcoin transactions. A novel solution accomplishes this without any trusted central authority: maintenance of the block chain is performed by a network of communicating nodes running bitcoin software. Transactions of the form payer X sends Y bitcoins to payee Z are broadcast to this network using readily available software applications. Network nodes can validate transactions, add them to their copy of the ledger, and then broadcast these ledger additions to other nodes. The block chain is a distributed database; in order to independently verify the chain of ownership of any and every bitcoin (amount), each network node stores its own copy of the block chain. Approximately six times per hour, a new group of accepted transactions, a block, is created, added to the block chain, and quickly published to all nodes. This allows bitcoin software to determine when a particular bitcoin amount has been spent, which is necessary in order to prevent double-spending in an environment without central oversight.
Whereas a conventional ledger records the transfers of actual bills or promissory notes that exist apart from it, the block chain is the only place that bitcoins can be said to exist in the form of unspent outputs of transactions. Blockchain Technology Explained 
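The chaining property described in this entry (every block contains a hash of the previous block, so changing an old block invalidates everything after it) can be illustrated with a toy Python sketch. This is not Bitcoin's actual block format: the dictionary layout, the JSON serialization, the single SHA-256 pass, and the function names are all assumptions made for illustration, and proof of work and networking are omitted entirely.

```python
import hashlib
import json


def block_hash(block):
    """Hash a block's full contents (toy serialization via sorted-key JSON)."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain, transactions):
    """Append a block that commits to the hash of the previous block."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64  # genesis has no parent
    chain.append({
        "index": len(chain),
        "prev_hash": prev_hash,
        "transactions": transactions,
    })


def is_valid(chain):
    """Check that every block still points at the hash of its predecessor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


if __name__ == "__main__":
    chain = []
    append_block(chain, ["X sends 5 to Z"])
    append_block(chain, ["Z sends 2 to Y"])
    append_block(chain, ["Y sends 1 to X"])
    print(is_valid(chain))                      # chain is intact
    chain[0]["transactions"] = ["X sends 50 to Z"]
    print(is_valid(chain))                      # tampering breaks the link to block 1
```

To hide the tampering, block 1 would need a new `prev_hash`, which changes its own hash and forces block 2 to be regenerated as well, and so on down the chain.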
Block Markov Chain (BMC) 
These Markov chains are characterized by a block structure in their transition matrix. More precisely, the $n$ possible states are divided into a finite number of $K$ groups or clusters, such that states in the same cluster exhibit the same transition rates to other states. One observes a trajectory of the Markov chain, and the objective is to recover, from this observation only, the (initially unknown) clusters. 
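A minimal Python sketch of such a block-structured transition matrix follows. It assumes one particular construction: the probability of moving from state x to state y is the cluster-to-cluster probability divided by the size of y's cluster, with self-transitions allowed for simplicity. That construction, the function name `block_transition_matrix`, and its arguments are assumptions for illustration rather than the definition used in the source.

```python
def block_transition_matrix(cluster_of, cluster_probs):
    """Build an n x n transition matrix from a K x K cluster-level matrix.

    cluster_of[x]       -> cluster index of state x
    cluster_probs[a][b] -> probability of jumping from cluster a to cluster b
    """
    n = len(cluster_of)
    sizes = {}
    for c in cluster_of:
        sizes[c] = sizes.get(c, 0) + 1
    # Mass sent to a cluster is spread uniformly over the states in that cluster.
    return [
        [cluster_probs[cluster_of[x]][cluster_of[y]] / sizes[cluster_of[y]]
         for y in range(n)]
        for x in range(n)
    ]


if __name__ == "__main__":
    # Three states: states 0 and 1 form cluster 0, state 2 forms cluster 1.
    matrix = block_transition_matrix([0, 0, 1], [[0.5, 0.5], [1.0, 0.0]])
    for row in matrix:
        print(row)
```

States in the same cluster end up with identical rows, which is the structure a clustering procedure would try to recover from an observed trajectory.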
Block Neural Autoregressive Flow (BNAF) 
Normalising flows (NFs) map two density functions via a differentiable bijection whose Jacobian determinant can be computed efficiently. Recently, as an alternative to hand-crafted bijections, Huang et al. (2018) proposed neural autoregressive flow (NAF) which is a universal approximator for density functions. Their flow is a neural network (NN) whose parameters are predicted by another NN. The latter grows quadratically with the size of the former and thus an efficient technique for parametrization is needed. We propose block neural autoregressive flow (BNAF), a much more compact universal approximator of density functions, where we model a bijection directly using a single feed-forward network. Invertibility is ensured by carefully designing each affine transformation with block matrices that make the flow autoregressive and (strictly) monotone. We compare BNAF to NAF and other established flows on density estimation and approximate inference for latent variable models. Our proposed flow is competitive across datasets while using orders of magnitude fewer parameters.
Block Point Process Model (BPPM) 
Many application settings involve the analysis of timestamped relations or events between a set of entities, e.g. messages between users of an online social network. Static and discrete-time network models are typically used as analysis tools in these settings; however, they discard a significant amount of information by aggregating events over time to form network snapshots. In this paper, we introduce a block point process model (BPPM) for dynamic networks evolving in continuous time in the form of events at irregular time intervals. The BPPM is inspired by the well-known stochastic block model (SBM) for static networks and is a simpler version of the recently-proposed Hawkes infinite relational model (IRM). We show that networks generated by the BPPM follow an SBM in the limit of a growing number of nodes and leverage this property to develop an efficient inference procedure for the BPPM. We fit the BPPM to several real network data sets, including a Facebook network with over 3,500 nodes and 130,000 events, several orders of magnitude larger than the Hawkes IRM and other existing point process network models.
Block Power Methods  This paper is concerned with the extension of the power method, used for finding the largest eigenvalue and associated eigenvector of a matrix, to its block form for computing the largest block eigenvalue and associated block eigenvector of a nonsymmetric matrix. Based on the developed block power method, several algorithms are developed for solving the complete set of solvents and spectral factors of a matrix polynomial, without prior knowledge of the latent roots of the matrix polynomial. Moreover, when any right/left solvent of a matrix polynomial is given, the proposed method can be used to determine the corresponding left/right solvent such that both right and left solvents have the same eigenspectra. The matrix polynomial of interest must have distinct block solvents and a corresponding nonsingular polynomial matrix. The established algorithms can be applied in the analysis and/or design of systems described by high-degree vector differential equations and/or matrix fraction descriptions. 
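As a point of reference, the ordinary (scalar) power method that the block variant extends can be sketched in a few lines. This is a minimal illustration, not the block algorithm of the paper: the function name `power_method`, the Rayleigh-quotient estimate, the stopping tolerance and the 2x2 example matrix are all assumptions made for the sketch.

```python
import math


def power_method(matrix, iterations=200, tol=1e-12):
    """Estimate the dominant eigenvalue and eigenvector of a square matrix.

    Illustrative only: assumes a real dominant eigenvalue that is strictly
    larger in magnitude than all others.
    """
    n = len(matrix)
    vec = [1.0] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        # Multiply the current vector by the matrix.
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            raise ValueError("iteration collapsed to the zero vector")
        new_vec = [x / norm for x in product]
        # Rayleigh quotient of the normalised vector as the eigenvalue estimate.
        image = [sum(matrix[i][j] * new_vec[j] for j in range(n)) for i in range(n)]
        new_eigenvalue = sum(new_vec[i] * image[i] for i in range(n))
        converged = abs(new_eigenvalue - eigenvalue) < tol
        vec, eigenvalue = new_vec, new_eigenvalue
        if converged:
            break
    return eigenvalue, vec


# Hypothetical example matrix; its eigenvalues are (5 +/- sqrt(5)) / 2.
example = [[2.0, 1.0], [1.0, 3.0]]
dominant_value, dominant_vector = power_method(example)
```

The block form described above replaces the vector by a block of columns and the scalar eigenvalue by a block eigenvalue; that extension is not attempted here.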
Block Randomized Adaptive Iterative Lasso (BRAIL) 
Data integration methods that analyze multiple sources of data simultaneously can often provide more holistic insights than can separate inquiries of each data source. Motivated by the advantages of data integration in the era of ‘big data’, we investigate feature selection for highdimensional multiview data with mixed data types (e.g. continuous, binary, countvalued). This heterogeneity of multiview data poses numerous challenges for existing feature selection methods. However, after critically examining these issues through empirical and theoreticallyguided lenses, we develop a practical solution, the Block Randomized Adaptive Iterative Lasso (BRAIL), which combines the strengths of the randomized Lasso, adaptive weighting schemes, and stability selection. BRAIL serves as a versatile data integration method for sparse regression and graph selection, and we demonstrate the effectiveness of BRAIL through extensive simulations and a case study to infer the ovarian cancer gene regulatory network. In this case study, BRAIL successfully identifies wellknown biomarkers associated with ovarian cancer and hints at novel candidates for future ovarian cancer research. 
Block Term Network (BTnet) 
Recently, deep neural networks (DNNs) have been regarded as the state-of-the-art classification methods in a wide range of applications, especially in image classification. Despite this success, their huge number of parameters blocks their deployment in situations with light computing resources. Researchers exploit the redundancy in the weights of DNNs and attempt to determine how few parameters can be retained while preserving accuracy. Although several promising results have been shown along this research line, most existing methods either fail to significantly compress a well-trained deep network or require a heavy fine-tuning process for the compressed network to regain the original performance. In this paper, we propose the Block Term networks (BT-nets) in which the commonly used fully-connected layers (FC-layers) are replaced with block term layers (BT-layers). In BT-layers, the inputs and the outputs are reshaped into two low-dimensional high-order tensors, then block-term decomposition is applied as tensor operators to connect them. We conduct extensive experiments on benchmark datasets to demonstrate that BT-layers can achieve a very large compression ratio on the number of parameters while preserving the representation power of the original FC-layers as much as possible. Specifically, we can get a higher performance while requiring fewer parameters compared with the tensor train method. 
Block Tree (BT) 
The Block Tree (BT) is a novel compact data structure designed to compress sequence collections. It obtains compression ratios close to LempelZiv and supports efficient direct access to any substring. The BT divides the text recursively into fixedsize blocks and those appearing earlier are represented with pointers. On repetitive collections, a few blocks can represent all the others, and thus the BT reduces the size by orders of magnitude. 
BlockCNN  We present a general technique that performs both artifact removal and image compression. For artifact removal, we input a JPEG image and try to remove its compression artifacts. For compression, we input an image and process its 8 by 8 blocks in a sequence. For each block, we first try to predict its intensities based on previous blocks; then, we store a residual with respect to the input image. Our technique reuses JPEG’s legacy compression and decompression routines. Both our artifact removal and our image compression techniques use the same deep network, but with different training weights. Our technique is simple and fast and it significantly improves the performance of artifact removal and image compression. 
BlockDistributed Gradient Boosted Tree  The Gradient Boosted Tree (GBT) algorithm is one of the most popular machine learning algorithms used in production, for tasks that include ClickThrough Rate (CTR) prediction and learningtorank. To deal with the massive datasets available today, many distributed GBT methods have been proposed. However, they all assume a rowdistributed dataset, addressing scalability only with respect to the number of data points and not the number of features, and increasing communication cost for highdimensional data. In order to allow for scalability across both the data point and feature dimensions, and reduce communication cost, we propose blockdistributed Gradient Boosted Trees. We achieve communication efficiency by making full use of the data sparsity and adapting the Quickscorer algorithm to the blockdistributed setting. We evaluate our approach using datasets with millions of features, and demonstrate that we are able to achieve multiple orders of magnitude reduction in communication cost for sparse data, with no loss in accuracy, while providing a more scalable design. As a result, we are able to reduce the training time for highdimensional data, and allow more costeffective scaleout without the need for expensive network communication. 
BlockPuzzle  In this work we propose a novel task framework under which a variety of physical reasoning puzzles can be constructed using very simple rules. Under sparse reward settings, most of these tasks can be very challenging for a reinforcement learning agent to learn. We build several simple environments with this task framework in Mujoco and OpenAI gym and attempt to solve them. We are able to solve the environments by designing curricula to guide the agent in learning and using imitation learning methods to transfer knowledge from a simpler environment. This is only a first step for the task framework, and further research on how to solve the harder tasks and transfer knowledge between tasks is needed. 
BlockSci  Analysis of blockchain data is useful for both scientific research and commercial applications. We present BlockSci, an opensource software platform for blockchain analysis. BlockSci is versatile in its support for different blockchains and analysis tasks. It incorporates an inmemory, analytical (rather than transactional) database, making it several hundred times faster than existing tools. We describe BlockSci’s design and present four analyses that illustrate its capabilities. This is a working paper that accompanies the first public release of BlockSci, available at https://…/BlockSci. We seek input from the community to further develop the software and explore other potential applications. 
Blockspring  Blockspring lets you dramatically scale analytics with limited technical resources. It’s a platform that makes distribution and consumption of technology simple within your organization. Here’s how it works: · Developers and data scientists post common company functions – queries, algorithms, visualizations, API calls, etc – to Blockspring in their favorite programming language. · Business users search Blockspring for the function they need, and easily use it in their spreadsheet. This model helps IT teams produce more functionality. Simultaneously, it lets business users find and use the tools they need, when they need them. 
BlockWise Network Generation Pipeline (BlockQNN) 
Convolutional neural networks have gained a remarkable success in computer vision. However, most usable network architectures are handcrafted and usually require expertise and elaborate design. In this paper, we provide a blockwise network generation pipeline called BlockQNN which automatically builds highperformance networks using the QLearning paradigm with epsilongreedy exploration strategy. The optimal network block is constructed by the learning agent which is trained to choose component layers sequentially. We stack the block to construct the whole autogenerated network. To accelerate the generation process, we also propose a distributed asynchronous framework and an early stop strategy. The blockwise generation brings unique advantages: (1) it yields stateoftheart results in comparison to the handcrafted networks on image classification, particularly, the best network generated by BlockQNN achieves 2.35% top1 error rate on CIFAR10. (2) it offers tremendous reduction of the search space in designing networks, spending only 3 days with 32 GPUs. A faster version can yield a comparable result with only 1 GPU in 20 hours. (3) it has strong generalizability in that the network built on CIFAR also performs well on the largerscale dataset. The best network achieves very competitive accuracy of 82.0% top1 and 96.0% top5 on ImageNet. 
BlockwiseMajorizationDescent  gglasso 
Bloom Filter  A Bloom filter is a spaceefficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate. In other words, a query returns either ‘possibly in set’ or ‘definitely not in set’. Elements can be added to the set, but not removed (though this can be addressed with a ‘counting’ filter). The more elements that are added to the set, the larger the probability of false positives. Bloom proposed the technique for applications where the amount of source data would require an impracticably large hash area in memory if ‘conventional’ errorfree hashing techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of 500,000 words, out of which 90% follow simple hyphenation rules, but the remaining 10% require expensive disk accesses to retrieve specific hyphenation patterns. With sufficient core memory, an errorfree hash could be used to eliminate all unnecessary disk accesses; on the other hand, with limited core memory, Bloom’s technique uses a smaller hash area but still eliminates most unnecessary accesses. For example, a hash area only 15% of the size needed by an ideal errorfree hash still eliminates 85% of the disk accesses (Bloom (1970)). Role of Bloom Filter in Big Data Research: A Survey 
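The behaviour described above (no false negatives, possible false positives, insert-only) can be illustrated with a small sketch. The class name `BloomFilter`, the default sizes, and the use of salted SHA-256 digests as the hash family are assumptions chosen for readability, not part of Bloom's original description; a production filter would normally pack bits and use faster hashes.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: k salted hashes over a fixed-size bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, item):
        # Derive k positions by salting SHA-256 with the hash index (an assumption).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly in set"; False means "definitely not in set".
        return all(self.bits[pos] for pos in self._positions(item))


bloom = BloomFilter()
for word in ["alpha", "beta", "gamma"]:
    bloom.add(word)
```

Every added word is reported as possibly present, while a word that was never added is usually, but not always, reported as absent; the chance of a false positive grows as more bits are set.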
BLOSSOM  We develop the first Bayesian Optimization algorithm, BLOSSOM, which selects between multiple alternative acquisition functions and traditional local optimization at each step. This is combined with a novel stopping condition based on expected regret. This pairing allows us to obtain the best characteristics of both local and Bayesian optimization, making efficient use of function evaluations while yielding superior convergence to the global minimum on a selection of optimization problems, and also halting optimization once a principled and intuitive stopping condition has been fulfilled. 
Blossom Belief Propagation (BlossomBP) 
Maxproduct Belief Propagation (BP) is a popular messagepassing algorithm for computing a MaximumAPosteriori (MAP) assignment over a distribution represented by a Graphical Model (GM). It has been shown that BP can solve a number of combinatorial optimization problems including minimum weight matching, shortest path, network flow and vertex cover under the following common assumption: the respective Linear Programming (LP) relaxation is tight, i.e., no integrality gap is present. However, when LP shows an integrality gap, no model has been known which can be solved systematically via sequential applications of BP. In this paper, we develop the first such algorithm, coined BlossomBP, for solving the minimum weight matching problem over arbitrary graphs. Each step of the sequential algorithm requires applying BP over a modified graph constructed by contractions and expansions of blossoms, i.e., odd sets of vertices. Our scheme guarantees termination in O(n^2) of BP runs, where n is the number of vertices in the original graph. In essence, the BlossomBP offers a distributed version of the celebrated Edmonds’ Blossom algorithm by jumping at once over many substeps with a single BP. Moreover, our result provides an interpretation of the Edmonds’ algorithm as a sequence of LPs. 
BlueSky Statistics  Fully featured Statistics application and development framework built on the open source R project. Provides familiar powerful user interface available in mainstream statistical applications like SPSS, SAS etc. Unlocks the power of R for the analyst community by providing a rich GUI and output for several popular statistics, data mining, data manipulation and graphics commands, all out of the box… Provide a rich development framework for developing and deploying new statistical modules, applications or functions with rich graphical user interfaces and output, all through intuitive drag and drop user interfaces (No programming required). A quick look at BlueSky Statistics 
BlurRing  A code package, BlurRing, is developed as a method to allow for multidimensional likelihood visualisation. From the BlurRing visualisation additional information about the likelihood can be extracted. The spread in any direction of the overlaid likelihood curves gives information about the uncertainty on the confidence intervals presented in the twodimensional likelihood plots. 
BMGAN  Machine learning (ML) has progressed rapidly during the past decade and the major factor that drives such development is the unprecedented largescale data. As data generation is a continuous process, this leads to ML service providers updating their models frequently with newlycollected data in an online learning scenario. In consequence, if an ML model is queried with the same set of data samples at two different points in time, it will provide different results. In this paper, we investigate whether the change in the output of a blackbox ML model before and after being updated can leak information of the dataset used to perform the update. This constitutes a new attack surface against blackbox ML models and such information leakage severely damages the intellectual property and data privacy of the ML model owner/provider. In contrast to membership inference attacks, we use an encoderdecoder formulation that allows inferring diverse information ranging from detailed characteristics to full reconstruction of the dataset. Our new attacks are facilitated by stateoftheart deep learning techniques. In particular, we propose a hybrid generative model (BMGAN) that is based on generative adversarial networks (GANs) but includes a reconstructive loss that allows generating accurate samples. Our experiments show effective prediction of dataset characteristics and even full reconstruction in challenging conditions. 
BNN+  Deep neural networks (DNN) are widely used in many applications. However, their deployment on edge devices has been difficult because they are resource hungry. Binary neural networks (BNN) help to alleviate the prohibitive resource requirements of DNN, where both activations and weights are limited to $1$ bit. We propose an improved binary training method (BNN+), by introducing a regularization function that encourages training weights around binary values. In addition to this, to enhance model performance we add trainable scaling factors to our regularization functions. Furthermore, we use an improved approximation of the derivative of the sign activation function in the backward computation. These additions are based on linear operations that are easily implementable into the binary training framework. We show experimental results on CIFAR-10, obtaining an accuracy of $86.7\%$ on AlexNet and $91.3\%$ with the VGG network. On ImageNet, our method also outperforms the traditional BNN method and XNOR-net, using AlexNet, by a margin of $4\%$ and $2\%$ top-$1$ accuracy respectively. 
BOABased Optimisation Approach  The Bayesian Optimisation Algorithm (BOA) is an Estimation of Distribution Algorithm (EDA) that uses a Bayesian network as probabilistic graphical model (PGM). Determining the optimal Bayesian network structure given a solution sample is an NPhard problem. This step should be completed at each iteration of BOA, resulting in a very timeconsuming process. For this reason most implementations use greedy estimation algorithms such as K2. However, we show in this paper that significant changes in PGM structure do not occur so frequently, and can be particularly sparse at the end of evolution. A statistical study of BOA is thus presented to characterise a pattern of PGM adjustments that can be used as a guide to reduce the frequency of PGM updates during the evolutionary process. This is accomplished by proposing a new BOAbased optimisation approach (FBOA) whose PGM is not updated at each iteration. This new approach avoids the computational burden usually found in the standard BOA. The results compare the performances of both algorithms on an NKlandscape optimisation problem using the correlation between the ruggedness and the expected runtime over enumerated instances. The experiments show that FBOA presents competitive results while significantly saving computational time. 
BOAug  In recent years, deep learning has achieved remarkable achievements in many fields, including computer vision, natural language processing, speech recognition and others. Adequate training data is the key to ensure the effectiveness of the deep models. However, obtaining valid data requires a lot of time and labor resources. Data augmentation (DA) is an effective alternative approach, which can generate new labeled data based on existing data using labelpreserving transformations. Although we can benefit a lot from DA, designing appropriate DA policies requires a lot of expert experience and time consumption, and the evaluation of searching the optimal policies is costly. So we raise a new question in this paper: how to achieve automated data augmentation at as low cost as possible? We propose a method named BOAug for automating the process by finding the optimal DA policies using the Bayesian optimization approach. Our method can find the optimal policies at a relatively low search cost, and the searched policies based on a specific dataset are transferable across different neural network architectures or even different datasets. We validate the BOAug on three widely used image classification datasets, including CIFAR10, CIFAR100 and SVHN. Experimental results show that the proposed method can achieve stateoftheart or near advanced classification accuracy. Code to reproduce our experiments is available at https://…/BOAug. 
BOINC  ‘Volunteer computing’ is the use of consumer digital devices for high-throughput scientific computing. It can provide large computing capacity at low cost, but presents challenges due to device heterogeneity, unreliability, and churn. BOINC, a widely used open-source middleware system for volunteer computing, addresses these challenges. We describe its features, architecture, and implementation. 
Bokeh  Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, but also deliver this capability with highperformance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications. 
Bollinger Bands  Bollinger Bands were introduced by John Bollinger in the 1980s. The bands depict the volatility of a stock as it increases or decreases. They are placed above and below the moving average line of the stock price. The wider the gap between the bands, the higher the volatility; conversely, as the gap between the bands narrows, the volatility of the stock is lower. At times the width stays constant over a period of time, which indicates steady behavior of the stock over that period. There are three lines in the Bollinger Bands: · The middle line, an N-period moving average (MA); e.g. 20-day SMA · An upper band at K times an N-period standard deviation above the moving average; e.g. 20-day SMA + (20-day standard deviation of price x 2) · A lower band at K times an N-period standard deviation below the moving average; e.g. 20-day SMA – (20-day standard deviation of price x 2) 
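The three lines can be computed directly from a price series. The sketch below is illustrative: the function name `bollinger_bands`, the defaults N = 20 and K = 2, the use of the population standard deviation, and the synthetic price list are assumptions (some charting tools use the sample standard deviation instead).

```python
import statistics


def bollinger_bands(prices, window=20, k=2.0):
    """Return a list of (lower, middle, upper) tuples, one per full window."""
    bands = []
    for end in range(window, len(prices) + 1):
        recent = prices[end - window:end]
        middle = statistics.fmean(recent)    # N-period simple moving average
        spread = statistics.pstdev(recent)   # N-period (population) standard deviation
        bands.append((middle - k * spread, middle, middle + k * spread))
    return bands


# Synthetic prices 1.0 .. 25.0, purely for demonstration.
demo_prices = [float(p) for p in range(1, 26)]
demo_bands = bollinger_bands(demo_prices)
```

With 25 prices and a 20-day window the function yields six band triples; the gap between upper and lower band is 2K standard deviations, so it widens and narrows with volatility.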
bolt  Bringing multidimensional arrays to distributed settings through a unified Python interface. Bolt is an open source library providing a Python interface to ndarrays backed by local or distributed implementations (currently targeting Spark). We want to make working with big array data in Python as easy and seamless as in local settings, while exploiting the speed of proven distributed engines. 
BOLTSure Screening Interactions (BOLTSSI) 
Detecting interaction effects is a crucial step in various applications. In this paper, we first propose a simple method for sure screening interactions (SSI). SSI works well for problems of moderate dimensionality, without heredity assumptions. For ultra-high dimensional problems, we propose a fast algorithm, named ‘BOLTsure screening interactions’. This is motivated by the fact that the interaction effects on a response variable can be exactly evaluated using the contingency table when all variables are discrete. The counts in the contingency table can be collected in an efficient manner by Boolean representation and operations. To generalize this idea, we propose a discretization step such that BOLTSSI is applicable for interaction detection among continuous variables. Statistical theory has been established for SSI and BOLTSSI, guaranteeing their sure screening property. Experimental results demonstrate that SSI and BOLTSSI can often outperform their competitors in terms of computational efficiency and statistical accuracy, especially for data with more than 300,000 predictors. Based on these results, we believe there is a great need to rethink the relationship between statistical accuracy and computational efficiency. The computational performance of a statistical method can often be greatly improved by exploiting the advantages of the computational architecture, and the loss of statistical accuracy can be tolerated. 
Bonsai  Extreme multi-label classification refers to supervised multi-label learning involving hundreds of thousands or even millions of labels. In this paper, we develop a shallow tree-based algorithm, called Bonsai, which promotes diversity of the label space and easily scales to millions of labels. Bonsai relaxes the two main constraints of the recently proposed tree-based algorithm, Parabel, which partitions labels at each tree node into exactly two child nodes, and imposes label balancedness between these nodes. Instead, Bonsai encourages diversity in the partitioning process by (i) allowing a much larger fan-out at each node, and (ii) maintaining the diversity of the label set further by enabling potentially imbalanced partitioning. By allowing such flexibility, it achieves the best of both worlds – fast training of tree-based methods, and prediction accuracy better than Parabel, and on par with one-vs-rest methods. As a result, Bonsai outperforms state-of-the-art one-vs-rest methods such as DiSMEC in terms of prediction accuracy, while being orders of magnitude faster to train. The code for Bonsai is available at https://…/bonsai. 
Boolean Satisfiability Problem (SAT) 
In computer science, the Boolean satisfiability problem (sometimes called Propositional Satisfiability Problem and abbreviated as SATISFIABILITY or SAT) is the problem of determining if there exists an interpretation that satisfies a given Boolean formula. In other words, it asks whether the variables of a given Boolean formula can be consistently replaced by the values TRUE or FALSE in such a way that the formula evaluates to TRUE. If this is the case, the formula is called satisfiable. On the other hand, if no such assignment exists, the function expressed by the formula is FALSE for all possible variable assignments and the formula is unsatisfiable. For example, the formula ‘a AND NOT b’ is satisfiable because one can find the values a = TRUE and b = FALSE, which make (a AND NOT b) = TRUE. In contrast, ‘a AND NOT a’ is unsatisfiable. rpicosat 
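The two example formulas above can be checked by brute force, trying every TRUE/FALSE assignment. This is a minimal sketch for illustration: the function name `find_satisfying_assignment` and the encoding of a formula as a Python callable are assumptions, and exhaustive search takes time exponential in the number of variables, which is why practical solvers use far more refined techniques.

```python
from itertools import product


def find_satisfying_assignment(formula, variables):
    """Return the first assignment (a dict) that makes formula true, or None."""
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(assignment):
            return assignment
    return None  # no assignment works: the formula is unsatisfiable


# 'a AND NOT b' is satisfiable; 'a AND NOT a' is not.
sat_result = find_satisfying_assignment(lambda v: v["a"] and not v["b"], ["a", "b"])
unsat_result = find_satisfying_assignment(lambda v: v["a"] and not v["a"], ["a"])
```

Here `sat_result` is the assignment a = TRUE, b = FALSE, and `unsat_result` is `None`.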
Boomerang  Paid crowdsourcing platforms suffer from lowquality work and unfair rejections, but paradoxically, most workers and requesters have high reputation scores. These inflated scores, which make highquality work and workers difficult to find, stem from social pressure to avoid giving negative feedback. We introduce Boomerang, a reputation system for crowdsourcing that elicits more accurate feedback by rebounding the consequences of feedback directly back onto the person who gave it. With Boomerang, requesters find that their highlyrated workers gain earliest access to their future tasks, and workers find tasks from their highlyrated requesters at the top of their task feed. Field experiments verify that Boomerang causes both workers and requesters to provide feedback that is more closely aligned with their private opinions. Inspired by a gametheoretic notion of incentivecompatibility, Boomerang opens opportunities for interaction design to incentivize honest reporting over strategic dishonesty. 
Boosting  Boosting is a machine learning metaalgorithm for reducing bias in supervised learning. Boosting is based on the question posed by Kearns: Can a set of weak learners create a single strong learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily wellcorrelated with the true classification. Schapire’s affirmative answer to Kearns’ question has had significant ramifications in machine learning and statistics, most notably leading to the development of boosting. 
Boosting Independent Embeddings Robustly (BIER) 
Learning similarity functions between image pairs with deep neural networks yields highly correlated activations of embeddings. In this work, we show how to improve the robustness of such embeddings by exploiting the independence within ensembles. To this end, we divide the last embedding layer of a deep network into an embedding ensemble and formulate training this ensemble as an online gradient boosting problem. Each learner receives a reweighted training sample from the previous learners. Further, we propose two loss functions which increase the diversity in our ensemble. These loss functions can be applied either for weight initialization or during training. Together, our contributions leverage large embedding sizes more effectively by significantly reducing correlation of the embedding and consequently increase retrieval accuracy of the embedding. Our method works with any differentiable loss function and does not introduce any additional parameters during test time. We evaluate our metric learning method on image retrieval tasks and show that it improves over stateoftheart methods on the CUB 2002011, Cars196, Stanford Online Products, InShop Clothes Retrieval and VehicleID datasets. 
Boosting Smooth Transition Regression Tree (BooST) 
In this paper we introduce a new machine learning (ML) model for nonlinear regression called Boosting Smooth Transition Regression Tree (BooST). The main advantage of the BooST is that it estimates the derivatives (partial effects) of very general nonlinear models, providing more interpretation than other tree based models concerning the mapping between the covariates and the dependent variable. We provide some asymptotic theory that shows consistency of the partial derivatives and we present some examples on simulated and empirical data. 
Boosting Variational Inference  Variational Inference is a popular technique to approximate a possibly intractable Bayesian posterior with a more tractable one. Recently, Boosting Variational Inference has been proposed as a new paradigm to approximate the posterior by a mixture of densities by greedily adding components to the mixture. In the present work, we study the convergence properties of this approach from a modern optimization viewpoint by establishing connections to the classic Frank-Wolfe algorithm. Our analysis yields novel theoretical insights on the Boosting of Variational Inference regarding the sufficient conditions for convergence, explicit sublinear/linear rates, and algorithmic simplifications. 
BoostJet  Recommenders have become widely popular in recent years because of their broader applicability in many ecommerce applications. These applications rely on recommenders for generating advertisements for various offers or providing content recommendations. However, the quality of the generated recommendations depends on user features (like demography, temporality), offer features (like popularity, price), and useroffer features (like implicit or explicit feedback). Current stateoftheart recommenders do not explore such diverse features concurrently while generating the recommendations. In this paper, we first introduce the notion of Trackers which enables us to capture the abovementioned features and thus incorporate users’ online behaviour through statistical aggregates of different features (demography, temporality, popularity, price). We also show how to capture offertooffer relations, based on their consumption sequence, leveraging neural embeddings for offers in our Offer2Vec algorithm. We then introduce BoostJet, a novel recommender which integrates the Trackers along with the neural embeddings using MatrixNet, an efficient distributed implementation of gradient boosted decision tree, to improve the recommendation quality significantly. We provide an indepth evaluation of BoostJet on Yandex’s dataset, collecting online behaviour from tens of millions of online users, to demonstrate the practicality of BoostJet in terms of recommendation quality as well as scalability. 
Bootstrap Aggregating (Bagging) 
Bootstrap aggregating, also called bagging, is a machine learning ensemble metaalgorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach. 
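The procedure can be sketched as: draw bootstrap resamples of the training data, fit one model per resample, and average the predictions. Everything named below is an assumption for illustration: the helper `bagged_predictor`, the deliberately high-variance one-nearest-neighbour base learner, the number of models, the fixed seed and the toy data.

```python
import random
import statistics


def fit_one_nearest_neighbour(sample):
    """Return a predictor that answers with the target of the closest training x."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict


def bagged_predictor(data, fit, n_models=25, seed=0):
    """Fit n_models base learners on bootstrap resamples and average them."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: draw len(data) points with replacement.
        resample = [rng.choice(data) for _ in data]
        models.append(fit(resample))

    def predict(x):
        # Aggregate by averaging; a majority vote would be used for classification.
        return statistics.fmean([model(x) for model in models])

    return predict


# Toy regression data on y = x^2, purely for demonstration.
training = [(x / 10, (x / 10) ** 2) for x in range(11)]
bagged = bagged_predictor(training, fit_one_nearest_neighbour)
```

Averaging over resamples smooths the step-like predictions of the individual nearest-neighbour models, which is the variance reduction referred to above.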
Bootstrap CUSUM Test  Cumulative sum (CUSUM) statistics are widely used in change point inference and identification. This paper studies the two problems for high-dimensional mean vectors based on the supremum norm of the CUSUM statistics. For the problem of testing for the existence of a change point in a sequence of independent observations generated from the mean-shift model, we introduce a Gaussian multiplier bootstrap to approximate critical values of the CUSUM test statistics in high dimensions. The proposed bootstrap CUSUM test is fully data-dependent and it has strong theoretical guarantees under arbitrary dependence structures and mild moment conditions. Specifically, we show that with a boundary removal parameter the bootstrap CUSUM test enjoys uniform validity in size under the null and it achieves the minimax separation rate under sparse alternatives when the dimension $p$ can be larger than the sample size $n$. Once a change point is detected, we estimate the change point location by maximizing the supremum norm of the generalized CUSUM statistics at two different weighting scales. The first estimator is based on the covariance stationary CUSUM statistics at each data point, which is consistent in estimating the location at the nearly parametric rate $n^{-1/2}$ for sub-exponential observations. The second estimator is based on nonstationary CUSUM statistics, which assign less weight to the boundary data points. In the latter case, we show that it achieves the nearly best possible rate of convergence on the order $n^{-1}$. In both cases, the dimension impacts the rate of convergence only through logarithmic factors, and therefore consistency of the CUSUM location estimators is possible when $p$ is much larger than $n$. 
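To make the statistic concrete, the sketch below computes a sup-norm CUSUM for a mean-shift sequence using the textbook scaling sqrt(s(n-s)/n) times the difference between the mean of the first s observations and the mean of the rest, maximised over split points and coordinates. This is an assumption about the form of the statistic, not a transcription of the paper: the Gaussian multiplier bootstrap for critical values, the boundary removal parameter and the alternative weighting scales are all omitted, and the name `cusum_sup_norm` and the toy data are hypothetical.

```python
import math


def cusum_sup_norm(observations):
    """Return (statistic, location) for the sup-norm CUSUM of a vector sequence."""
    n = len(observations)
    p = len(observations[0])
    totals = [sum(row[j] for row in observations) for j in range(p)]
    running = [0.0] * p
    best_stat, best_split = 0.0, None
    for s in range(1, n):
        for j in range(p):
            running[j] += observations[s - 1][j]
        scale = math.sqrt(s * (n - s) / n)
        # Largest absolute scaled mean difference over the p coordinates.
        stat = max(
            abs(scale * (running[j] / s - (totals[j] - running[j]) / (n - s)))
            for j in range(p)
        )
        if stat > best_stat:
            best_stat, best_split = stat, s
    return best_stat, best_split


# Toy data: the second coordinate shifts from 0 to 3 after the sixth observation.
series = [(0.0, 0.0)] * 6 + [(0.0, 3.0)] * 4
statistic, location = cusum_sup_norm(series)
```

On this noise-free example the maximiser is the split after the sixth observation; with real data the statistic would be compared against a bootstrap critical value before a change point is declared.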
Bootstrap for Rapid Inference on Spatial Covariances (BRISC) 
Saha and Datta (2018) <doi:10.1002/sta4.184> BRISC 
Bootstrap Lasso + Partial Ridge (LPR) 
For high-dimensional sparse linear models, how to construct confidence intervals for coefficients remains a difficult question. The main reason is the complicated limiting distributions of common estimators such as the Lasso. Several confidence interval construction methods have been developed, and Bootstrap Lasso+OLS is notable for its simple technicality, good interpretability, and comparable performance with other more complicated methods. However, Bootstrap Lasso+OLS depends on the beta-min assumption, a theoretic criterion that is often violated in practice. In this paper, we introduce a new method called Bootstrap Lasso+Partial Ridge (LPR) to relax this assumption. LPR is a two-stage estimator: first using Lasso to select features and subsequently using Partial Ridge to refit the coefficients. Simulation results show that Bootstrap LPR outperforms Bootstrap Lasso+OLS when there exist small but nonzero coefficients, a common situation violating the beta-min assumption. For such coefficients, compared to Bootstrap Lasso+OLS, confidence intervals constructed by Bootstrap LPR have on average 50% larger coverage probabilities. Bootstrap LPR also has on average 35% shorter confidence interval lengths than the de-sparsified Lasso methods, regardless of whether linear models are misspecified. Additionally, we provide theoretical guarantees of Bootstrap LPR under appropriate conditions and implement it in the R package ‘HDCI.’ 
Bootstrap Percolation  For most people, the name percolation probably relates to brewing coffee, where the water passes through the coffee powder. This example is only one of many systems in which the percolation phenomenon exists. To understand it a bit more, one can think of a mixture of glass and metal balls in a jar. Up to a certain percentage of metal balls the mixture behaves as an insulator; that is, there is no group of touching metal balls that reaches from one side of the jar to the other. Above a certain percentage, called the percolation threshold Pc, such a group exists and the mixture behaves as a conductor. The percolation threshold is defined as the probability below which no infinite cluster is found in the infinite system. A group of touching metal balls is called a cluster, and the group that reaches from one end to the other is called the spanning cluster. 
Bootstrap-Enhanced Least Absolute Shrinkage Operator (Bolasso) 
Using the bootstrap and intersecting the supports, we actually get a consistent model estimate, without the consistency condition required by the regular Lasso. We refer to this new procedure as the Bolasso (bootstrap-enhanced least absolute shrinkage operator). Finally, our Bolasso framework could be seen as a voting scheme applied to the supports of the bootstrap Lasso estimates; however, our procedure may rather be considered as a consensus combination scheme, as we keep the (largest) subset of variables on which all regressors agree in terms of variable selection, which is in our case provably consistent and also removes the need for a potential additional hyperparameter. 
Bootstrapping  In statistics, bootstrapping is a method for assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error or some other such measure) to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods. Generally, it falls in the broader class of resampling methods. 
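A minimal sketch of the resampling idea, assuming the measure of accuracy of interest is the standard error of a statistic: resample the data with replacement many times, recompute the statistic on each resample, and take the spread of those values. The function name bootstrap_se, the toy sample, and the choice of 2000 resamples are illustrative assumptions.

```python
import random
import statistics

def bootstrap_se(sample, statistic, n_resamples=2000, seed=0):
    """Estimate the standard error of `statistic` by nonparametric bootstrap."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        # Draw n observations with replacement from the original sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    # The spread of the resampled statistics approximates the standard error.
    return statistics.stdev(estimates)

sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
se_mean = bootstrap_se(sample, statistics.mean)
se_median = bootstrap_se(sample, statistics.median)
```

The same function works unchanged for the median or any other statistic that maps a list of numbers to a number, which is the main practical appeal of the method.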
Bordered Blocked Diagonal Form (BBDF) 
This paper presents a distributed simulation based method for harmonic resonance assessment (HRA) in multiarea largescale power systems. Further consideration is devoted to the early harmonic frequencyscan formulation to shape them into a Bordered Blocked Diagonal Form (BBDF), which is suitable for parallel processing. The proposed algorithm (BBDF) allows operator of each area of an interconnected system to independently conduct the HRA. A largechange sensitivity based approach is then handled in a secure platform to apply the effects of whole network to each single area. The introduced decentralized HRA is capable to find the exact values as those of the interconnected system through TCP/IP communication media. The developed method is successfully implemented in an existing software package and applied to IEEE 14bus harmonic test system, followed by a discussion on results. 
BorderPeeling Clustering  In this paper, we present a novel nonparametric clustering technique, which is based on an iterative algorithm that peels off layers of points around the clusters. Our technique is based on the notion that each latent cluster is comprised of layers that surround its core, where the external layers, or border points, implicitly separate the clusters. Analyzing the Knearest neighbors of the points makes it possible to identify the border points and associate them with points of inner layers. Our clustering algorithm iteratively identifies border points, peels them, and separates the latent clusters. We show that the peeling process adapts to the local density and successfully separates adjacent clusters. A notable quality of the BorderPeeling algorithm is that it does not require any parameter tuning in order to outperform stateoftheart finelytuned nonparametric clustering methods, including MeanShift and DBSCAN. We further assess our technique on highdimensional datasets that vary in size and characteristics. In particular, we analyze the space of deep features that were trained by a convolutional neural network. 
Borealis  A generalized global update algorithm for Boolean optimization problems. Optimization problems with Boolean variables that fall into the nondeterministic polynomial (NP) class are of fundamental importance in computer science, mathematics, physics and industrial applications. Most notably, solving constraintsatisfaction problems, which are related to spinglasslike Hamiltonians in physics, remains a difficult numerical task. As such, there has been great interest in designing efficient heuristics to solve these computationally difficult problems. Inspired by parallel tempering Monte Carlo in conjunction with the rejectionfree isoenergetic cluster algorithm developed for Ising spin glasses, we present a generalized global update optimization heuristic that can be applied to different NPcomplete problems with Boolean variables. The global cluster updates allow for a widespread sampling of phase space, thus considerably speeding up optimization. By carefully tuning the pseudotemperature (needed to randomize the configurations) of the problem, we show that the method can efficiently tackle optimization problems with overconstraints or on topologies with a large sitepercolation threshold. We illustrate the efficiency of the heuristic on paradigmatic optimization problems, such as the maximum satisfiability problem and the vertex cover problem. 
Born-Again Network (BAN) 
Knowledge distillation (KD) consists of transferring knowledge from one machine learning model (the teacher) to another (the student). Commonly, the teacher is a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student’s compactness. We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers. Surprisingly, these Born-Again Networks (BANs) outperform their teachers significantly, both on computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate state-of-the-art performance on the CIFAR-10 (3.5%) and CIFAR-100 (15.5%) datasets, by validation error. Additional experiments explore two distillation objectives: (i) Confidence-Weighted by Teacher Max (CWTM) and (ii) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating a role of the teacher outputs on both predicted and non-predicted classes. We present experiments with students of various capacities, focusing on the underexplored case where students overpower teachers. Our experiments show significant advantages from transferring knowledge between DenseNets and ResNets in either direction. 
Boruta  Machine learning methods are often used to classify objects described by hundreds of attributes; in many applications of this kind a great fraction of attributes may be totally irrelevant to the classification problem. Even more, usually one cannot decide a priori which attributes are relevant. In this paper we present an improved version of the algorithm for identification of the full set of truly important variables in an information system. It is an extension of the random forest method which utilises the importance measure generated by the original algorithm. It compares, in the iterative fashion, the importances of original attributes with importances of their randomised copies. We analyse performance of the algorithm on several examples of synthetic data, as well as on a biologically important problem, namely on identification of the sequence motifs that are important for aptameric activity of short RNA sequences. Boruta 
BoTorch  BoTorch (pronounced like ‘blowtorch’) is a library for Bayesian Optimization research built on top of PyTorch, and is part of the PyTorch ecosystem. Bayesian Optimization (BayesOpt) is an established technique for sequential optimization of costlytoevaluate blackbox functions. It can be applied to a wide variety of problems, including hyperparameter optimization for machine learning algorithms, A/B testing, as well as many scientific and engineering problems. BoTorch is best used in tandem with Ax, Facebook’s opensource adaptive experimentation platform, which provides an easytouse interface for defining, managing and running sequential experiments, while handling (meta)data management, transformations, and systems integration. Users who just want an easytouse suite for Bayesian Optimization should start with Ax. 
Bottleneck Attention Module (BAM) 
Recent advances in deep neural networks have been developed via architecture search for stronger representational power. In this work, we focus on the effect of attention in general deep neural networks. We propose a simple and effective attention module, named Bottleneck Attention Module (BAM), that can be integrated with any feedforward convolutional neural networks. Our module infers an attention map along two separate pathways, channel and spatial. We place our module at each bottleneck of models where the downsampling of feature maps occurs. Our module constructs a hierarchical attention at bottlenecks with a number of parameters and it is trainable in an endtoend manner jointly with any feedforward models. We validate our BAM through extensive experiments on CIFAR100, ImageNet1K, VOC 2007 and MS COCO benchmarks. Our experiments show consistent improvement in classification and detection performances with various models, demonstrating the wide applicability of BAM. The code and models will be publicly available. 
Bottleneck Simulator  Deep reinforcement learning has recently shown many impressive successes. However, one major obstacle towards applying such methods to realworld problems is their lack of dataefficiency. To this end, we propose the Bottleneck Simulator: a modelbased reinforcement learning method which combines a learned, factorized transition model of the environment with rollout simulations to learn an effective policy from few examples. The learned transition model employs an abstract, discrete (bottleneck) state, which increases sample efficiency by reducing the number of model parameters and by exploiting structural properties of the environment. We provide a mathematical analysis of the Bottleneck Simulator in terms of fixed points of the learned policy, which reveals how performance is affected by four distinct sources of error: an error related to the abstract space structure, an error related to the transition model estimation variance, an error related to the transition model estimation bias, and an error related to the transition model class bias. Finally, we evaluate the Bottleneck Simulator on two natural language processing tasks: a text adventure game and a realworld, complex dialogue response selection task. On both tasks, the Bottleneck Simulator yields excellent performance beating competing approaches. 
BottleNet  Recent studies have shown the latency and energy consumption of deep neural networks can be significantly improved by splitting the network between the mobile device and cloud. This paper introduces a new deep learning architecture, called BottleNet, for reducing the feature size needed to be sent to the cloud. Furthermore, we propose a training method for compensating for the potential accuracy loss due to the lossy compression of features before transmitting them to the cloud. BottleNet achieves on average 30x improvement in endtoend latency and 40x improvement in mobile energy consumption compared to the cloudonly approach with negligible accuracy loss. 
Boulevard  This paper examines a novel gradient boosting framework for regression. We regularize gradient boosted trees by introducing subsampling and employ a modified shrinkage algorithm so that at every boosting stage the estimate is given by an average of trees. The resulting algorithm, titled Boulevard, is shown to converge as the number of trees grows. We also demonstrate a central limit theorem for this limit, allowing a characterization of uncertainty for predictions. A simulation study and real world examples provide support for both the predictive accuracy of the model and its limiting behavior. 
Boundary Attack++  Decisionbased adversarial attack studies the generation of adversarial examples that solely rely on output labels of a target model. In this paper, decisionbased adversarial attack was formulated as an optimization problem. Motivated by zerothorder optimization, we develop Boundary Attack++, a family of algorithms based on a novel estimate of gradient direction using binary information at the decision boundary. By switching between two types of projection operators, our algorithms are capable of optimizing $L_2$ and $L_\infty$ distances respectively. Experiments show Boundary Attack++ requires significantly fewer model queries than Boundary Attack. We also show our algorithm achieves superior performance compared to stateoftheart whitebox algorithms in attacking adversarially trained models on MNIST. 
Boundary Equilibrium Generative Adversarial Networks (BEGAN) 
We propose a new equilibrium enforcing method paired with a loss derived from the Wasserstein distance for training autoencoder based Generative Adversarial Networks. This method balances the generator and discriminator during training. Additionally, it provides a new approximate convergence measure, fast and stable training and high visual quality. We also derive a way of controlling the tradeoff between image diversity and visual quality. We focus on the image generation task, setting a new milestone in visual quality, even at higher resolutions. This is achieved while using a relatively simple model architecture and a standard training procedure. TensorflowBEGAN: Boundary Equilibrium Generative Adversarial Networks 
Boundary Optimizing Network (BON) 
Despite all the success that deep neural networks have seen in classifying certain datasets, the challenge of finding optimal solutions that generalize well still remains. In this paper, we propose the Boundary Optimizing Network (BON), a new approach to generalization for deep neural networks when used for supervised learning. Given a classification network, we propose to use a collaborative generative network that produces new synthetic data points in the form of perturbations of original data points. In this way, we create a data support around each original data point which prevents decision boundaries from passing too close to the original data points, i.e. prevents overfitting. To prevent catastrophic forgetting during training, we propose to use a variation of Memory Aware Synapses to optimize the generative networks. On the Iris dataset, we show that the BON algorithm creates better decision boundaries when compared to a network regularized by the popular dropout scheme. 
BoundarySensitive Network (BSN) 
Temporal action proposal generation is an important yet challenging problem, since temporal proposals with rich action content are indispensable for analysing realworld videos with long duration and high proportion irrelevant content. This problem requires methods not only generating proposals with precise temporal boundaries, but also retrieving proposals to cover truth action instances with high recall and high overlap using relatively fewer proposals. To address these difficulties, we introduce an effective proposal generation method, named BoundarySensitive Network (BSN), which adopts ‘local to global’ fashion. Locally, BSN first locates temporal boundaries with high probabilities, then directly combines these boundaries as proposals. Globally, with BoundarySensitive Proposal feature, BSN retrieves proposals by evaluating the confidence of whether a proposal contains an action within its region. We conduct experiments on two challenging datasets: ActivityNet1.3 and THUMOS14, where BSN outperforms other stateoftheart temporal action proposal generation methods with high recall and high temporal precision. Finally, further experiments demonstrate that by combining existing action classifiers, our method significantly improves the stateoftheart temporal action detection performance. 
Bounded Dijkstra (BD) 
The shortest path (SP) and shortest paths tree (SPT) problems arise both as direct applications and as subroutines of overlay algorithms solving more complex problems such as the constrained shortest path (CSP) or the constrained minimum Steiner tree (CMST) problems. Often, such algorithms do not use the result of an SP subroutine if its total cost is greater than a given bound. For example, for delay-constrained problems, paths resulting from a least-delay SP run and whose delay is greater than the delay constraint of the original problem are not used by the overlay algorithm to construct its solution. As a result of the existence of these bounds, and because the Dijkstra SP algorithm discovers paths in increasing order of cost, we can terminate the SP search earlier, i.e., once it is known that paths with a greater total cost will not be considered by the overlay algorithm. This early termination reduces the runtime of the SP subroutine, thereby reducing the runtime of the overlay algorithm without impacting its final result. We refer to this adaptation of Dijkstra for centralized implementations as bounded Dijkstra (BD). On the example of CSP algorithms, we confirm the usefulness of BD by showing that it can reduce the runtime of some algorithms by 75% on average. 
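The early-termination idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is an adjacency dictionary with non-negative edge costs, and the name bounded_dijkstra and the toy graph are hypothetical.

```python
import heapq

def bounded_dijkstra(graph, source, bound):
    """Return {node: cost} for nodes whose shortest-path cost is <= bound.

    graph maps each node to a list of (neighbor, non-negative edge cost).
    """
    best = {source: 0}
    settled = {}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > bound:
            # Entries are popped in increasing cost order, so every
            # remaining path also exceeds the bound: stop early.
            break
        if node in settled:
            continue  # stale heap entry for an already settled node
        settled[node] = cost
        for neighbor, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return settled

graph = {
    "s": [("a", 2), ("b", 5)],
    "a": [("b", 1), ("c", 7)],
    "b": [("c", 3)],
    "c": [],
}
within_4 = bounded_dijkstra(graph, "s", bound=4)
within_10 = bounded_dijkstra(graph, "s", bound=10)
```

With a bound of 4 the search stops before settling node c, whose shortest-path cost is 6; with a bound of 10 the result matches an unbounded Dijkstra run on this graph.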
Bounded Fuzzy Possibilistic Method (BFPM) 
This paper introduces the Bounded Fuzzy Possibilistic Method (BFPM) by addressing several issues that previous clustering/classification methods have not considered. In fuzzy clustering, an object’s membership values should sum to 1. Hence, any object may obtain full membership in at most one cluster. Possibilistic clustering methods remove this restriction. However, BFPM differs from previous fuzzy and possibilistic clustering approaches by allowing the membership function to take larger values with respect to all clusters. Furthermore, in BFPM, a data object can have full membership in multiple clusters or even in all clusters. BFPM relaxes the boundary conditions (restrictions) in membership assignment. The proposed methodology satisfies the necessity of obtaining full memberships and overcomes the issues with conventional methods in dealing with overlapping. Analysing the objects’ movements from their own cluster to another (mutation) is also proposed in this paper. BFPM has been applied in different domains in geometry, set theory, anomaly detection, risk management, disease diagnosis, and other disciplines. Validity and comparison indexes have also been used to evaluate the accuracy of BFPM. BFPM has been evaluated in terms of accuracy, fuzzification constant (different norms), objects’ movement analysis, and covering diversity. The promising results show the importance of considering the proposed methodology in learning methods to track the behaviour of data objects, in addition to obtaining accurate results. 
Bounded-Abstention Method With two Constraints of Reject Rates (BA2) 
Abstaining classification aims to reject classifying easily misclassified examples, so it is an effective approach to increase classification reliability and reduce the misclassification risk in cost-sensitive applications. In such applications, different types of errors (false positive or false negative) usually have unequal costs. And the error costs, which depend on specific applications, are usually unknown. However, current abstaining classification methods either do not distinguish the error types, or they need the cost information of misclassification and rejection, which are realized in the framework of cost-sensitive learning. In this paper, we propose a bounded-abstention method with two constraints of reject rates (BA2), which performs abstaining classification when error costs are unequal and unknown. BA2 aims to obtain the optimal area under the ROC curve (AUC) by constraining the reject rates of the positive and negative classes respectively. Specifically, we construct the receiver operating characteristic (ROC) curve, and stepwise search for the optimal reject thresholds from both ends of the curve, until the two constraints are satisfied. Experimental results show that BA2 obtains higher AUC and lower total cost than the state-of-the-art abstaining classification methods. Meanwhile, BA2 achieves controllable reject rates of the positive and negative classes. 
BoundedInformationRate Variational Autoencoder (BIRVAE) 
This paper introduces a new member of the family of Variational Autoencoders (VAE) that constrains the rate of information transferred by the latent layer. The latent layer is interpreted as a communication channel, the information rate of which is bound by imposing a preset signaltonoise ratio. The new constraint subsumes the mutual information between the input and latent variables, combining naturally with the likelihood objective of the observed data as used in a conventional VAE. The resulting BoundedInformationRate Variational Autoencoder (BIRVAE) provides a meaningful latent representation with an information resolution that can be specified directly in bits by the system designer. The rate constraint can be used to prevent overtraining, and the method naturally facilitates quantisation of the latent variables at the set rate. Our experiments confirm that the BIRVAE has a meaningful latent representation and that its performance is at least as good as stateoftheart competing algorithms, but with lower computational complexity. 
Bowtie  Bowtie is a library for writing dashboards in Python. No need to know web frameworks or JavaScript, focus on building functionality in Python. Interactively explore your data in new ways! Deploy and share with others! 
BoxLevel Tracking for Video Object Segmentation (BoLTVOS) 
We approach video object segmentation (VOS) by splitting the task into two subtasks: bounding box level tracking, followed by bounding box segmentation. Following this paradigm, we present BoLTVOS (BoxLevel Tracking for video object segmentation), which consists of an RCNN detector conditioned on the firstframe bounding box to detect the object of interest, a temporal consistency rescoring algorithm, and a Box2Seg network that converts bounding boxes to segmentation masks. BoLTVOS performs VOS using only the firstframe bounding box without the mask. We evaluate our approach on DAVIS 2017 and YouTubeVOS, and show that it outperforms all methods that do not perform firstframe finetuning. We further present BoLTVOSft, which learns to segment the object in question using the firstframe mask while it is being tracked, without increasing the runtime. BoLTVOSft outperforms PReMVOS, the previously best performing VOS method on DAVIS 2016 and YouTubeVOS, while running up to 45 times faster. Our bounding box tracker also outperforms all previous shortterm and longterm trackers on the bounding box level tracking datasets OTB 2015 and LTB35. 
Box–Muller Transform  The Box–Muller transform (by George Edward Pelham Box and Mervin Edgar Muller 1958) is a pseudo-random number sampling method for generating pairs of independent, standard, normally distributed (zero expectation, unit variance) random numbers, given a source of uniformly distributed random numbers. It is commonly expressed in two forms. The basic form as given by Box and Muller takes two samples from the uniform distribution on the interval (0, 1] and maps them to two standard, normally distributed samples. The polar form takes two samples from a different interval, [−1, +1], and maps them to two normally distributed samples without the use of sine or cosine functions. The Box–Muller transform was developed as a more computationally efficient alternative to the inverse transform sampling method. The Ziggurat algorithm gives an even more efficient method. 
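The basic form can be sketched in a few lines. Given two uniform samples $u_1 \in (0, 1]$ and $u_2$, it returns $z_0 = \sqrt{-2 \ln u_1}\,\cos(2\pi u_2)$ and $z_1 = \sqrt{-2 \ln u_1}\,\sin(2\pi u_2)$. The function name box_muller and the use of Python's random module as the uniform source are assumptions made for this illustration.

```python
import math
import random

def box_muller(rng):
    """Return one pair of independent standard normal samples (basic form)."""
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1]
    # and the logarithm below is always finite.
    u1 = 1.0 - rng.random()
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

rng = random.Random(0)
samples = [z for _ in range(20000) for z in box_muller(rng)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((z - sample_mean) ** 2 for z in samples) / (len(samples) - 1)
```

With 40,000 samples the empirical mean and variance should land close to 0 and 1 respectively, which gives a quick sanity check of the transform.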
Boxplot  In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms boxandwhisker plot and boxandwhisker diagram. Outliers may be plotted as individual points. 
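The quantities a box plot draws can be computed directly. The sketch below assumes two conventions that the definition above does not fix: quartiles computed with the "inclusive" interpolation method, and Tukey's common rule that whiskers reach the most extreme points within 1.5 times the interquartile range, with anything beyond plotted as an outlier. The name boxplot_stats is hypothetical.

```python
import statistics

def boxplot_stats(values, whisker=1.5):
    """Compute quartiles, whisker ends and outliers for one group of data."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 30])
```

Plotting libraries differ in their quartile conventions, so values computed this way may not match a given tool's output exactly.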
Brain2Text  Nowadays, the Internet represents a vast informational space that is growing exponentially, and the problem of searching for relevant data is more essential than ever. The algorithm proposed in the article makes it possible to perform natural language queries on the content of a document and get comprehensive, meaningful answers. The problem is partially solved for English, as SQuAD contains enough data to learn on, but there is no such dataset in Russian, so the methods currently used by scientists are not applicable to Russian. The Brain2 framework copes with this problem: it stands out for its ability to be applied to small datasets and does not require impressive computing power. The algorithm is illustrated on the text of the Sberbank of Russia Strategy and assumes the use of a neuromodel consisting of 65 million synapses. The trained model is able to construct word-by-word answers to questions based on a given text. The existing limitations are its current inability to identify synonyms, pronoun relations and allegories. Nevertheless, the results of the conducted experiments showed the high capacity and generalisation ability of the suggested approach. 
BrainTorrent  Access to sufficient annotated data is a common challenge in training deep neural networks on medical images. As annotating data is expensive and timeconsuming, it is difficult for an individual medical center to reach large enough sample sizes to build their own, personalized models. As an alternative, data from all centers could be pooled to train a centralized model that everyone can use. However, such a strategy is often infeasible due to the privacysensitive nature of medical data. Recently, federated learning (FL) has been introduced to collaboratively learn a shared prediction model across centers without the need for sharing data. In FL, clients are locally training models on sitespecific datasets for a few epochs and then sharing their model weights with a central server, which orchestrates the overall training process. Importantly, the sharing of models does not compromise patient privacy. A disadvantage of FL is the dependence on a central server, which requires all clients to agree on one trusted central body, and whose failure would disrupt the training process of all clients. In this paper, we introduce BrainTorrent, a new FL framework without a central server, particularly targeted towards medical applications. BrainTorrent presents a highly dynamic peertopeer environment, where all centers directly interact with each other without depending on a central body. We demonstrate the overall effectiveness of FL for the challenging task of whole brain segmentation and observe that the proposed serverless BrainTorrent approach does not only outperform the traditional serverbased one but reaches a similar performance to a model trained on pooled data. 
Branch Convolutional Neural Network (BCNN) 
Convolutional Neural Network (CNN) image classifiers are traditionally designed to have sequential convolutional layers with a single output layer. This is based on the assumption that all target classes should be treated equally and exclusively. However, some classes can be more difficult to distinguish than others, and classes may be organized in a hierarchy of categories. At the same time, a CNN is designed to learn internal representations that abstract from the input data based on its hierarchical layered structure. So it is natural to ask if an inverse of this idea can be applied to learn a model that can predict over a classification hierarchy using multiple output layers in decreasing order of class abstraction. In this paper, we introduce a variant of the traditional CNN model named the Branch Convolutional Neural Network (BCNN). A BCNN model outputs multiple predictions ordered from coarse to fine along the concatenated convolutional layers corresponding to the hierarchical structure of the target classes, which can be regarded as a form of prior knowledge on the output. To learn with BCNNs a novel training strategy, named the Branch Training strategy (BTstrategy), is introduced which balances the strictness of the prior with the freedom to adjust parameters on the output layers to minimize the loss. In this way we show that CNN based models can be forced to learn successively coarse to fine concepts in the internal layers at the output stage, and that hierarchical prior knowledge can be adopted to boost CNN models’ classification performance. Our models are evaluated to show that the BCNN extensions improve over the corresponding baseline CNN on the benchmark datasets MNIST, CIFAR10 and CIFAR100. 
Branched Autoencoder NET (BAENET) 
We treat shape cosegmentation as a representation learning problem and introduce BAENET, a branched autoencoder network, for the task. The unsupervised BAENET is trained with all shapes in an input collection using a shape reconstruction loss, without groundtruth segmentations. Specifically, the network takes an input shape and encodes it using a convolutional neural network, whereas the decoder concatenates the resulting feature code with a point coordinate and outputs a value indicating whether the point is inside/outside the shape. Importantly, the decoder is branched: each branch learns a compact representation for one commonly recurring part of the shape collection, e.g., airplane wings. By complementing the shape reconstruction loss with a label loss, BAENET is easily tuned for oneshot learning. We show unsupervised, weakly supervised, and oneshot learning results by BAENET, demonstrating that using only a couple of exemplars, our network can generally outperform stateoftheart supervised methods trained on hundreds of segmented shapes. 
Branched MultiTask Network  In the context of deep learning, neural networks with multiple branches have been used that each solve different tasks. Such ramified networks typically start with a number of shared layers, after which different tasks branch out into their own sequence of layers. As the number of possible network configurations is combinatorially large, prior work has often relied on ad hoc methods to determine the level of layer sharing. This work proposes a novel method to assess the relatedness of tasks in a principled way. We base the relatedness of a task pair on the usefulness of a set of features of one task for the other, and vice versa. The resulting task affinities are used for the automated construction of a branched multitask network in which deeper layers gradually grow more taskspecific. Our multitask network outperforms the stateoftheart on CelebA. Additionally, the layer sharing schemes devised by our method outperform common multitask learning models which were constructed ad hoc. We include additional experiments on Cityscapes and SUN RGBD to illustrate the wide applicability of our approach. Code and trained models for this paper are made available https://…/SimonVandenhende 
Breadth-first Search (BFS) 
In graph theory, breadth-first search (BFS) is a strategy for searching in a graph when search is limited to essentially two operations: (a) visit and inspect a node of a graph; (b) gain access to visit the nodes that neighbor the currently visited node. The BFS begins at a root node and inspects all the neighboring nodes. Then for each of those neighbor nodes in turn, it inspects their neighbor nodes which were unvisited, and so on. Compare BFS with the equivalent, but more memory-efficient, iterative deepening depth-first search and contrast with depth-first search. 
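The two operations described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the adjacency-list dictionary `graph` and the function name `bfs_order` are assumptions made for the example.

```python
from collections import deque

def bfs_order(graph, root):
    """Return the nodes reachable from root in the order BFS visits them."""
    visited = {root}
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)                    # (a) visit and inspect the node
        for neighbor in graph.get(node, []):  # (b) reach the neighboring nodes
            if neighbor not in visited:       # only unvisited neighbors are queued
                visited.add(neighbor)
                queue.append(neighbor)
    return order

# Hypothetical example graph as an adjacency list.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
    "F": [],
}
print(bfs_order(graph, "A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

The first-in, first-out queue is what makes the traversal breadth-first: all nodes at distance one from the root are inspected before any node at distance two.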
Break Down Plot  Break Down Plots are inspired by waterfall plots created by the ‘xgboostExplainer’ package (see <https://…/xgboostExplainer> ). The idea behind Break Down Plots is to decompose the model prediction for a single observation. Break Down Plots show the contribution of every variable present in the model. Such plots will work for binary classifiers and general regression models. breakDown 
Breakout  A breakout is typically characterized by two steady states and an intermediate transition period. Broadly speaking, breakouts have two flavors: 1. Mean shift: A sudden jump in the time series corresponds to a mean shift. A sudden jump in CPU utilization from 40% to 60% would exemplify a mean shift. 2. Ramp up: A gradual increase in the value of the metric from one steady state to another constitutes a ramp up. A gradual increase in CPU utilization from 40% to 60% would exemplify a ramp up. 
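The two flavors can be illustrated with synthetic series that move between the same two steady states. The helper names (`mean_shift`, `ramp_up`, `largest_step`) and the series lengths are assumptions made for this sketch; it only generates examples and does not implement a breakout detector.

```python
def mean_shift(before, after, n_steady):
    """Two steady states joined by a sudden jump."""
    return [before] * n_steady + [after] * n_steady

def ramp_up(before, after, n_steady, n_ramp):
    """Two steady states joined by n_ramp equally spaced intermediate values."""
    step = (after - before) / (n_ramp + 1)
    ramp = [before + step * (i + 1) for i in range(n_ramp)]
    return [before] * n_steady + ramp + [after] * n_steady

def largest_step(series):
    """Largest change between consecutive observations."""
    return max(abs(b - a) for a, b in zip(series, series[1:]))

shift = mean_shift(40, 60, n_steady=3)        # CPU utilization jumps 40% -> 60%
ramp = ramp_up(40, 60, n_steady=3, n_ramp=3)  # CPU utilization climbs 40% -> 60%
print(shift, largest_step(shift))  # [40, 40, 40, 60, 60, 60] 20
print(ramp, largest_step(ramp))    # [40, 40, 40, 45.0, 50.0, 55.0, 60, 60, 60] 5.0
```

Both series start and end at the same levels; what separates them is how the total change of 20 points is spread over the transition period.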
Bregman Distance  A Framework for Covariate Balance using Bregman Distances 
Bridge Sampling  (Bennett, 1976; Meng & Wong, 1996), a reliable and relatively straightforward sampling method that allows researchers to obtain the marginal likelihood for models of varying complexity. 
BridgeNet  Age estimation is an important yet very challenging problem in computer vision. Existing methods for age estimation usually apply a divide-and-conquer strategy to deal with heterogeneous data caused by the non-stationary aging process. However, the facial aging process is also a continuous process, and the continuity relationship between different components has not been effectively exploited. In this paper, we propose BridgeNet for age estimation, which aims to mine the continuous relation between age labels effectively. The proposed BridgeNet consists of local regressors and gating networks. Local regressors partition the data space into multiple overlapping subspaces to tackle heterogeneous data, and gating networks learn continuity-aware weights for the results of local regressors by employing the proposed bridge-tree structure, which introduces bridge connections into tree models to enforce the similarity between neighbor nodes. Moreover, these two components of BridgeNet can be jointly learned in an end-to-end way. We show experimental results on the MORPH II, FG-NET and Chalearn LAP 2015 datasets and find that BridgeNet outperforms the state-of-the-art methods. 
BriskStream  We introduce BriskStream, an in-memory data stream processing system (DSPS) specifically designed for modern shared-memory multicore architectures. BriskStream’s key contribution is an execution plan optimization paradigm, namely RLAS, which takes the relative location (i.e., NUMA distance) of each pair of producer-consumer operators into consideration. We propose a branch-and-bound-based approach with three heuristics to resolve the resulting nontrivial optimization problem. The experimental evaluations demonstrate that BriskStream yields much higher throughput and better scalability than existing DSPSs on multicore architectures when processing different types of workloads. 
Broadcasting Convolutional Network  While convolutional neural networks (CNNs) are widely used for handling spatio-temporal scenes, their inherent structure limits their ability to reason about relations among spatial features, an issue raised consistently in many studies. In this paper, we propose Broadcasting Convolutional Networks (BCN) that allow global receptive fields to share spatial information. BCNs are simple network modules that collect effective spatial features, embed location information and broadcast them to the entire feature maps without much additional computational cost. This method gains great improvements in feature localization problems by efficiently extending the receptive fields, and can easily be implemented within any CNN structure. We further utilize BCN to propose Multi-Relational Networks (multiRN) that greatly improve existing Relation Networks (RNs). In pixel-based relation reasoning problems, multiRN with BCNs implanted extends the concept of ‘pairwise relations’ from conventional RNs to ‘multiple relations’ by relating each object with multiple objects at once and not in pairs. This yields O(n) complexity for n objects, a vast computational gain over RNs, which take O(n^2). Through experiments, BCNs are shown to be useful on relation reasoning problems, owing to their efficient handling of spatial information. 
Broyden–Fletcher–Goldfarb–Shanno Algorithm (BFGS) 
In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems. The BFGS method approximates Newton’s method, a class of hill-climbing optimization techniques that seeks a stationary point of a (preferably twice continuously differentiable) function. For such problems, a necessary condition for optimality is that the gradient be zero. Newton’s method and the BFGS methods are not guaranteed to converge unless the function has a quadratic Taylor expansion near an optimum. Newton’s method uses both the first and second derivatives of the function, whereas BFGS, as a quasi-Newton method, builds an approximation of the second-derivative (Hessian) information from successive gradient evaluations. BFGS has nevertheless proven to have good performance even for non-smooth optimizations. 
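A compact pure-Python sketch of the iteration may help make this concrete. It is an illustration under stated assumptions rather than a production implementation: it keeps an approximation `H` of the inverse Hessian, uses a simple backtracking (Armijo) line search instead of a full Wolfe search, and is tried on a hypothetical convex quadratic. The names `bfgs`, `f` and `grad` are chosen for the example.

```python
def bfgs(f, grad, x0, tol=1e-6, max_iter=200):
    """Minimize f from x0 using gradients only; H approximates the inverse Hessian."""
    n = len(x0)
    x = list(x0)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # start from identity
    g = grad(x)
    for _ in range(max_iter):
        if max(abs(gi) for gi in g) < tol:  # stationary point: gradient is (nearly) zero
            break
        p = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]  # direction -H g
        # Backtracking line search: halve the step until f decreases sufficiently.
        t, fx = 1.0, f(x)
        slope = sum(gi * pi for gi, pi in zip(g, p))
        while t > 1e-12 and f([xi + t * pi for xi, pi in zip(x, p)]) > fx + 1e-4 * t * slope:
            t *= 0.5
        x_new = [xi + t * pi for xi, pi in zip(x, p)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]  # change in position
        y = [a - b for a, b in zip(g_new, g)]  # change in gradient
        sy = sum(si * yi for si, yi in zip(s, y))
        if sy > 1e-12:  # curvature condition; skip the update if it fails
            rho = 1.0 / sy
            Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
            yHy = sum(yi * hyi for yi, hyi in zip(y, Hy))
            # Expanded form of H <- (I - rho s y^T) H (I - rho y s^T) + rho s s^T
            for i in range(n):
                for j in range(n):
                    H[i][j] += ((1.0 + rho * yHy) * rho * s[i] * s[j]
                                - rho * (Hy[i] * s[j] + s[i] * Hy[j]))
        x, g = x_new, g_new
    return x

# Hypothetical test problem with its minimum at (1, -3).
def f(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 3.0) ** 2

def grad(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)]

x_min = bfgs(f, grad, [0.0, 0.0])
print(x_min)
```

Only `f` and `grad` are evaluated; no second derivative is ever computed, which is the practical difference from Newton's method.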
BRPC  An industrial-grade RPC framework used throughout Baidu, with 1,000,000+ instances (not counting clients) and thousands of kinds of services, called ‘baidu-rpc’ inside Baidu. Only the C++ implementation is open-sourced right now. 
Bubble Generative Adversarial Network (BubGAN) 
Bubble segmentation and size detection algorithms have been developed in recent years for their high efficiency and accuracy in measuring bubbly two-phase flows. In this work, we propose an architecture called bubble generative adversarial networks (BubGAN) for the generation of realistic synthetic images which could be further used as training or benchmarking data for the development of advanced image processing algorithms. The BubGAN is trained initially on a labeled bubble dataset consisting of ten thousand images. By learning the distribution of these bubbles, the BubGAN can generate more realistic bubbles compared to the conventional models used in the literature. The trained BubGAN is conditioned on bubble feature parameters and has full control of bubble properties in terms of aspect ratio, rotation angle, circularity and edge ratio. A million-bubble dataset is pre-generated using the trained BubGAN. One can then assemble realistic bubbly flow images using this dataset and the associated image processing tool. These images contain detailed bubble information and therefore do not require additional manual labeling. This is more useful compared with the conventional GAN, which generates images without labeling information. The tool could be used to provide benchmarking and training data for existing image processing algorithms and to guide the future development of bubble detecting algorithms. 
BubbleNet  Semi-supervised video object segmentation has made significant progress on real and challenging videos in recent years. The current paradigm for segmentation methods and benchmark datasets is to segment objects in video provided a single annotation in the first frame. However, we find that segmentation performance across the entire video varies dramatically when selecting an alternative frame for annotation. This paper addresses the problem of learning to suggest the single best frame across the video for user annotation; this is, in fact, never the first frame of video. We achieve this by introducing BubbleNets, a novel deep sorting network that learns to select frames using a performance-based loss function that enables the conversion of expansive amounts of training examples from already existing datasets. Using BubbleNets, we are able to achieve an 11% relative improvement in segmentation performance on the DAVIS benchmark without any changes to the underlying method of segmentation. 
BUbiNG  BUbiNG is an open-source, fully distributed Java crawler; a single BUbiNG agent, using sizeable hardware, can crawl several thousand pages per second while respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG job distribution is based on modern high-speed protocols so as to achieve very high throughput. 
Bucketing  We study the effect of impairment on stochastic multi-armed bandits and develop new ways to mitigate it. The impairment effect is the phenomenon where an agent only accrues reward for an action if they have played it at least a few times in the recent past. It is practically motivated by repetition and recency effects in domains such as advertising (here consumer behavior may require repeat actions by advertisers) and vocational training (here actions are complex skills that can only be mastered with repetition to get a payoff). Impairment can be naturally modelled as a temporal constraint on the strategy space, and we provide two novel algorithms that achieve sublinear regret, each working with different assumptions on the impairment effect. We introduce a new notion called bucketing in our algorithm design, and show how it can effectively address impairment as well as a broader class of temporal constraints. Our regret bounds explicitly capture the cost of impairment and show that it scales (sub)linearly with the degree of impairment. Our work complements recent work on modeling delays and corruptions, and we provide experimental evidence supporting our claims. 
Bucketization  
Budding Perceptron  Traditionally, deep learning algorithms update the network weights whereas the network architecture is chosen manually, using a process of trial and error. In this work, we propose two novel approaches that automatically update the network structure while also learning its weights. The novelty of our approach lies in our parameterization, where the depth, or additional complexity, is encapsulated continuously in the parameter space through control parameters. We propose two methods: in tunnel networks, this selection is done at the level of a hidden unit, and in budding perceptrons, this is done at the level of a network layer; updating this control parameter introduces either another hidden unit or another hidden layer. We show the effectiveness of our methods on the synthetic two-spirals data and on two real data sets of MNIST and MIRFLICKR, where we see that our proposed methods, with the same set of hyperparameters, can correctly adjust the network complexity to the task complexity. 
Budget Aware Object Detection (BAOD) 
We study the problem of object detection from a novel perspective in which annotation budget constraints are taken into consideration, appropriately coined Budget Aware Object Detection (BAOD). When provided with a fixed budget, we propose a strategy for building a diverse and informative dataset that can be used to optimally train a robust detector. We investigate both optimization and learning-based methods to sample which images to annotate and what type of annotation (strongly or weakly supervised) to annotate them with. We adopt a hybrid supervised learning framework to train the object detector from both these types of annotation. We conduct a comprehensive empirical study showing that a handcrafted optimization method outperforms other selection techniques including random sampling, uncertainty sampling and active learning. By combining an optimal image/annotation selection scheme with hybrid supervised learning to solve the BAOD problem, we show that one can achieve the performance of a strongly supervised detector on PASCAL VOC 2007 while saving 12.8% of its original annotation budget. Furthermore, when 100% of the budget is used, it surpasses this performance by 2.0 mAP percentage points. 
Buffered Stochastic Variational Inference (BSVI) 
The recognition network in deep latent variable models such as variational autoencoders (VAEs) relies on amortized inference for efficient posterior approximation that can scale up to large datasets. However, this technique has also been demonstrated to select suboptimal variational parameters, often resulting in considerable additional error called the amortization gap. To close the amortization gap and improve the training of the generative model, recent works have introduced an additional refinement step that applies stochastic variational inference (SVI) to improve upon the variational parameters returned by the amortized inference model. In this paper, we propose Buffered Stochastic Variational Inference (BSVI), a new refinement procedure that makes use of SVI’s sequence of intermediate variational proposal distributions and their corresponding importance weights to construct a new generalized importance-weighted lower bound. We demonstrate empirically that training variational autoencoders with BSVI consistently outperforms SVI, yielding an improved training procedure for VAEs. 
Bumping  Bumping is a simple algorithm that can help your classifier escape from a local minimum. The idea behind bumping is that we can break the symmetry of the problem (or escape the local minimum) by training a decision tree on a random subsample. This is similar to bagging. The hope is that in the subsample there will be a preferred split so the tree can pick it. We fit several trees on different bootstrap samples (sampling with replacement) and choose the one with the best performance on the full training set as the winner. The more rounds of bumping we do, the more likely we are to escape. It costs more CPU time as well, though. 
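The procedure can be sketched in plain Python. To keep the example self-contained, a one-split decision stump stands in for the decision tree; that substitution, the toy data and all function names are assumptions of this sketch. It shows the mechanics of bumping (fit on bootstrap samples, pick the winner on the full training set), not its benefit on a hard problem.

```python
import random

def fit_stump(points, labels):
    """Fit a one-split stump: the (feature, threshold, polarity) with the fewest errors."""
    best = None
    for feature in range(len(points[0])):
        for threshold in sorted({p[feature] for p in points}):
            for polarity in (0, 1):
                errors = sum(
                    (polarity if p[feature] <= threshold else 1 - polarity) != y
                    for p, y in zip(points, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, polarity)
    return best[1:]

def predict(stump, point):
    feature, threshold, polarity = stump
    return polarity if point[feature] <= threshold else 1 - polarity

def accuracy(stump, points, labels):
    return sum(predict(stump, p) == y for p, y in zip(points, labels)) / len(points)

def bump(points, labels, rounds=20, seed=0):
    """Fit one stump per bootstrap sample; return the best (score, stump) and all scores."""
    rng = random.Random(seed)
    n = len(points)
    candidates = []
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample with replacement
        stump = fit_stump([points[i] for i in idx], [labels[i] for i in idx])
        # Every candidate is scored on the FULL training set, not on its own sample.
        candidates.append((accuracy(stump, points, labels), stump))
    return max(candidates, key=lambda c: c[0]), [c[0] for c in candidates]

# Hypothetical toy data: the label is 1 when the first feature exceeds 0.5.
points = [(0.1, 0.2), (0.4, 0.9), (0.35, 0.3), (0.8, 0.7),
          (0.9, 0.1), (0.6, 0.8), (0.2, 0.6), (0.7, 0.4)]
labels = [0, 0, 0, 1, 1, 1, 0, 1]

(best_score, best_stump), scores = bump(points, labels)
print(best_score, best_stump)
```

Unlike bagging, no averaging takes place: a single fitted model is returned, so the result stays as interpretable as one tree.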
Bumps Chart  Bump charts got their name from ‘bumps race’, a term used to refer to a boat race where each boat tries to ‘bump’ the one in front and move up the chart. Bump charts have become quite common of late and are typically used to represent changes in the position of a given number of competing entities over a fixed time duration. 
Bundle Generation Network  Product bundling, offering a combination of items to customers, is one of the marketing strategies commonly used in online e-commerce and offline retailers. A high-quality bundle generalizes frequent items of interest, and diversity across bundles boosts the user experience and eventually increases transaction volume. In this paper, we formalize the personalized bundle list recommendation as a structured prediction problem and propose a bundle generation network (BGN), which decomposes the problem into quality/diversity parts by the determinantal point processes (DPPs). BGN uses a typical encoder-decoder framework with a proposed feature-aware softmax to alleviate the inadequate representation of traditional softmax, and integrates the masked beam search and DPP selection to produce a high-quality and diversified bundle list with an appropriate bundle size. We conduct extensive experiments on three public datasets and one industrial dataset, including two generated from co-purchase records and the other two extracted from real-world online bundle services. BGN significantly outperforms the state-of-the-art methods in terms of quality, diversity and response time over all datasets. In particular, BGN improves the precision of the best competitors by 16% on average while maintaining the highest diversity on four datasets, and yields a 3.85x improvement of response time over the best competitors in the bundle list recommendation problem. 
Burning Number  We introduce a new graph parameter called the burning number, inspired by contact processes on graphs such as graph bootstrap percolation, and graph searching paradigms such as Firefighter. The burning number measures the speed of the spread of contagion in a graph; the lower the burning number, the faster the contagion spreads. We provide a number of properties of the burning number, including characterizations and bounds. The burning number is computed for several graph classes, and is derived for the graphs generated by the Iterated Local Transitivity model for social networks. 
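For small graphs the parameter can be computed by brute force. The sketch below assumes the usual covering formulation, which is not spelled out in the entry above: the burning number is the smallest k for which sources x_1, ..., x_k exist such that every vertex lies within distance k - i of some x_i, because the source lit in round i has k - i rounds left to spread. The function names and the path-graph helper are hypothetical, and the search is exponential in k, so it is only meant for toy inputs.

```python
from collections import deque
from itertools import product

def distances_from(graph, source):
    """Hop distances from source to every reachable node (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def burning_number(graph):
    """Smallest k such that some sequence of k sources burns the whole graph."""
    nodes = list(graph)
    dist = {u: distances_from(graph, u) for u in nodes}
    for k in range(1, len(nodes) + 1):
        for sources in product(nodes, repeat=k):
            # The source chosen in round i (1-based) spreads for k - i further rounds.
            if all(
                any(dist[s].get(v, float("inf")) <= k - i
                    for i, s in enumerate(sources, start=1))
                for v in nodes
            ):
                return k
    return len(nodes)

def path_graph(n):
    """Adjacency lists of the path 0 - 1 - ... - (n - 1)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

print(burning_number(path_graph(4)))  # 2
print(burning_number(path_graph(5)))  # 3
```

On the four-vertex path, one source placed next to an end burns three vertices within two rounds and the second source takes the remaining end, so two rounds suffice; with five vertices two rounds can reach at most four, so three are needed.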
Business Analysis Body of Knowledge (BABOK) 
A Guide to the Business Analysis Body of Knowledge (BABOK) is the written guide to the collection of business analysis knowledge reflecting current best practice, providing a framework that describes the areas of knowledge, with associated activities and tasks and techniques required. According to Capability Maturity Model Integration, organisations interested in process improvement need to adopt industry standards from the Business Analysis Body of Knowledge (and other associated references) to lift their project delivery from the ad hoc to the managed level. 
Business Analytics  Business analytics (BA) refers to the skills, technologies, applications and practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning. Business analytics focuses on developing new insights and understanding of business performance based on data and statistical methods. In contrast, business intelligence traditionally focuses on using a consistent set of metrics to both measure past performance and guide business planning, which is also based on data and statistical methods. 
Business Function Library (BFL) 
The Business Function Library (BFL) is one of the SAP Application Function Libraries (AFL). It contains pre-built, parameter-driven functions in the financial area. The functions are implemented in C++. This library helps you develop compound business algorithms that are fully compliant with the SAP HANA calculation engine. It offers you the flexibility and efficiency to develop HANA-based applications with incredible performance. 
Business Intelligence (BI) 
Business intelligence (BI) is a set of theories, methodologies, architectures, and technologies that transform raw data into meaningful and useful information for business purposes. BI can handle enormous amounts of unstructured data to help identify, develop and otherwise create new opportunities. BI, in simple words, makes interpreting voluminous data friendly. Making use of new opportunities and implementing an effective strategy can provide a competitive market advantage and long-term stability. 
Business Intelligence Competency Centers (BICC) 
A Business Intelligence Competency Center (BICC) is a cross-functional organizational team that has defined tasks, roles, responsibilities and processes for supporting and promoting the effective use of Business Intelligence (BI) across an organization. As early as 2001, Gartner, an information technology research and advisory company, started advocating that companies need a BICC to develop and focus resources to be successful using business intelligence. Since then, the BICC concept has been further refined through practical implementations in organizations that have implemented BI and analytical software. In practice, the term ‘BICC’ is not well integrated into the nomenclature of business or public sector organizations and there is a large degree of variance in the organizational design for BICCs. Nevertheless, the popularity of the BICC concept has caused the creation of units that focus on ensuring the use of the information for decision-making from BI software and increasing the return on investment (ROI) of BI. A BICC coordinates the activities and resources to ensure that a fact-based approach to decision making is systematically implemented throughout an organization. It has responsibility for the governance structure for BI and analytical programs, projects, practices, software, and architecture. It is responsible for building the plans, priorities, infrastructure, and competencies that the organization needs to take forward-looking strategic decisions by using the BI and analytical software capabilities. A BICC’s influence transcends that of a typical business unit, playing a crucial central role in the organizational change and strategic process. Accordingly, the BICC’s purpose is to empower the entire organization to coordinate BI from all units. 
Through centralization, it ‘…ensures that information and best practices are communicated and shared through the entire organization so that everyone can benefit from successes and lessons learned.’ The BICC also plays an important organizational role facilitating interaction among the various cultures and units within the organization. Knowledge transfer, enhancement of analytic skills, coaching and training are central to the mandate of the BICC. A BICC should be pivotal in ensuring a high degree of information consumption and a ROI for BI. 
Butterfly  Unsupervised domain adaptation (UDA) trains with clean labeled data in the source domain and unlabeled data in the target domain to classify target-domain data. However, in real-world scenarios, it is hard to acquire fully clean labeled data in the source domain due to the expensive labeling cost. This brings us a new but practical adaptation called wildly-unsupervised domain adaptation (WUDA), which aims to transfer knowledge from noisy labeled data in the source domain to unlabeled data in the target domain. To tackle WUDA, we present a robust one-step approach called Butterfly, which trains four networks. Specifically, two networks are jointly trained on noisy labeled data in the source domain and pseudo-labeled data in the target domain (i.e., data in the mixture domain). Meanwhile, the other two networks are trained on pseudo-labeled data in the target domain. By using the dual-checking principle, Butterfly can obtain high-quality target-specific representations. We conduct experiments to demonstrate that Butterfly significantly outperforms other baselines on simulated and real-world WUDA tasks in most cases. 
ButterflyNet  Deep networks, especially Convolutional Neural Networks (CNNs), have been successfully applied in various areas of machine learning as well as to challenging problems in other scientific and engineering fields. This paper introduces ButterflyNet, a low-complexity CNN with structured hard-coded weights and sparse across-channel connections, which aims at an optimal hierarchical function representation of the input signal. Theoretical analysis of the approximation power of ButterflyNet to the Fourier representation of input data shows that the error decays exponentially as the depth increases. Due to the ability of ButterflyNet to approximate Fourier and local Fourier transforms, the result can be used as an approximation upper bound for CNNs in a large class of problems. The analysis results are validated in numerical experiments on the approximation of a 1D Fourier kernel and on solving a 2D Poisson’s equation. 
Byzantine Gradient Descent  We consider the problem of distributed statistical machine learning in adversarial settings, where some unknown and time-varying subset of working machines may be compromised and behave arbitrarily to prevent an accurate model from being learned. This setting captures the potential adversarial attacks faced by Federated Learning, a modern machine learning paradigm that is proposed by Google researchers and has been intensively studied for ensuring user privacy. Formally, we focus on a distributed system consisting of a parameter server and $m$ working machines. Each working machine keeps $N/m$ data samples, where $N$ is the total number of samples. The goal is to collectively learn the underlying true model parameter of dimension $d$. In classical batch gradient descent methods, the gradients reported to the server by the working machines are aggregated via simple averaging, which is vulnerable to a single Byzantine failure. In this paper, we propose a Byzantine gradient descent method based on the geometric median of means of the gradients. We show that our method can tolerate $q \le (m-1)/2$ Byzantine failures, and the parameter estimate converges in $O(\log N)$ rounds with an estimation error of $\sqrt{d(2q+1)/N}$, hence approaching the optimal error rate $\sqrt{d/N}$ in the centralized and failure-free setting. The total computational complexity of our algorithm is of $O((Nd/m) \log N)$ at each working machine and $O(md + kd \log^3 N)$ at the central server, and the total communication cost is of $O(m d \log N)$. We further provide an application of our general results to the linear regression problem. A key challenge in the above problem is that Byzantine failures create arbitrary and unspecified dependency among the iterations and the aggregated gradients. We prove that the aggregated gradient converges uniformly to the true gradient function. 
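The aggregation step can be sketched as follows: the reported gradients are split into batches, each batch is averaged, and the geometric median of the batch means replaces the plain average. Several details are assumptions of this sketch rather than statements about the authors' method: the strided batch assignment, the Weiszfeld iteration used to approximate the geometric median, the iteration count, and the toy gradients.

```python
import math

def mean(vectors):
    """Coordinate-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def geometric_median(points, iters=200, eps=1e-12):
    """Approximate the point minimizing the summed Euclidean distance (Weiszfeld iteration)."""
    z = mean(points)
    for _ in range(iters):
        # Re-weight each point by the inverse of its distance to the current estimate.
        weights = [1.0 / max(math.dist(z, p), eps) for p in points]
        total = sum(weights)
        z = [sum(w * p[i] for w, p in zip(weights, points)) / total
             for i in range(len(z))]
    return z

def median_of_means(gradients, k):
    """Split the gradients into k batches, average each, take the geometric median."""
    batches = [gradients[i::k] for i in range(k)]  # assumes len(gradients) >= k
    return geometric_median([mean(b) for b in batches])

# Hypothetical reports: 8 honest machines near (1, -2) and 2 Byzantine machines.
honest = [[1.0 + 0.01 * j, -2.0 - 0.01 * j] for j in range(8)]
byzantine = [[1000.0, 1000.0], [1000.0, 1000.0]]
gradients = byzantine + honest

naive = mean(gradients)                  # dragged far away by the two bad reports
robust = median_of_means(gradients, k=5)
print(naive)
print(robust)
```

With five batches, the two Byzantine reports contaminate at most two batch means, so the uncontaminated means are in the majority and the geometric median stays close to them, whereas the simple average is pulled far from the honest gradients.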
Byzantine Stochastic Gradient Descent (BSGD) 
This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily and adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds $\varepsilon$-approximate minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional mini-batch SGD needs $T = O\big( \frac{1}{\varepsilon^2 m} \big)$ iterations, but cannot tolerate Byzantine failures. Further, we provide a lower bound showing that, up to logarithmic factors, our algorithm is information-theoretically optimal both in terms of sampling complexity and time complexity. 
Byzantine-Robust Stochastic Aggregation (RSA) 
In this paper, we propose a class of robust stochastic subgradient methods for distributed learning from heterogeneous datasets in the presence of an unknown number of Byzantine workers. The Byzantine workers, during the learning process, may send arbitrary incorrect messages to the master due to data corruptions, communication failures or malicious attacks, and consequently bias the learned model. The key to the proposed methods is a regularization term incorporated into the objective function so as to robustify the learning task and mitigate the negative effects of Byzantine attacks. The resultant subgradient-based algorithms are termed Byzantine-Robust Stochastic Aggregation methods, justifying our acronym RSA used henceforth. In contrast to most of the existing algorithms, RSA does not rely on the assumption that the data are independent and identically distributed (i.i.d.) on the workers, and hence fits a wider class of applications. Theoretically, we show that: i) RSA converges to a near-optimal solution with the learning error dependent on the number of Byzantine workers; ii) the convergence rate of RSA under Byzantine attacks is the same as that of the stochastic gradient descent method, which is free of Byzantine attacks. Numerically, experiments on real datasets corroborate the competitive performance of RSA and a complexity reduction compared to the state-of-the-art alternatives. 