S3Pool  Feature pooling layers (e.g., max pooling) in convolutional neural networks (CNNs) serve the dual purpose of providing increasingly abstract representations as well as yielding computational savings in subsequent convolutional layers. We view the pooling operation in CNNs as a twostep procedure: first, a pooling window (e.g., $2\times 2$) slides over the feature map with stride one which leaves the spatial resolution intact, and second, downsampling is performed by selecting one pixel from each nonoverlapping pooling window in an often uniform and deterministic (e.g., topleft) manner. Our starting point in this work is the observation that this regularly spaced downsampling arising from nonoverlapping windows, although intuitive from a signal processing perspective (which has the goal of signal reconstruction), is not necessarily optimal for \emph{learning} (where the goal is to generalize). We study this aspect and propose a novel pooling strategy with stochastic spatial sampling (S3Pool), where the regular downsampling is replaced by a more general stochastic version. We observe that this general stochasticity acts as a strong regularizer, and can also be seen as doing implicit data augmentation by introducing distortions in the feature maps. We further introduce a mechanism to control the amount of distortion to suit different datasets and architectures. To demonstrate the effectiveness of the proposed approach, we perform extensive experiments on several popular image classification benchmarks, observing excellent improvements over baseline models. Experimental code is available at https://…/s3pool. 
SaberLDA  Latent Dirichlet Allocation (LDA) is a popular tool for analyzing discrete count data such as text and images. Applications require LDA to handle both large datasets and a large number of topics. Though distributed CPU systems have been used, GPUbased systems have emerged as a promising alternative because of the high computational power and memory bandwidth of GPUs. However, existing GPUbased LDA systems cannot support a large number of topics because they use algorithms on dense data structures whose time and space complexity is linear to the number of topics. In this paper, we propose SaberLDA, a GPUbased LDA system that implements a sparsityaware algorithm to achieve sublinear time complexity and scales well to learn a large number of topics. To address the challenges introduced by sparsity, we propose a novel data layout, a new warpbased sampling kernel, and an efficient sparse count matrix updating algorithm that improves locality, makes efficient utilization of GPU warps, and reduces memory consumption. xperiments show that SaberLDA can learn from billionstokenscale data with up to 10,000 topics, which is almost two orders of magnitude larger than that of the previous GPUbased systems. With a single GPU card, SaberLDA is able to earn 10,000 topics from a dataset of billions of tokens in a few hours, which is only achievable with clusters with tens of machines before. 
Sac2Vec  Network representation learning (also known as information network embedding) has been the central piece of research in social and information network analytics for the last couple of years. An information network can be viewed as a linked structure of a set of entities. A set of linked web pages and documents, a set of users in a social network are common examples of information network. Typically a node in the information network is formed with a unique id, some content information and the links to its direct neighbors. Information network representation techniques traditionally use only link structure of the network. But the textual or other types of content in each node plays an important role to understand the underlying semantics of the network. In this paper, we propose Sac2Vec, a network representation technique using structure and content. It is a multilayered graph approach which uses a random walk to generate the node embedding. Our approach is simple and computationally fast, yet able to use the content as a complement to structure and the viceversa. Experimental evaluations on three real world publicly available datasets show the merit of our approach compared to stateoftheart algorithms in the domain. 
SAFE  This paper presents a practical approach for detecting nonstationarity in time series prediction. This method is called SAFE and works by monitoring the evolution of the spectral contents of time series through a distance function. This method is designed to work in combination with stateoftheart machine learning methods in real time by informing the online predictors to perform necessary adaptation when a nonstationarity presents. We also propose an algorithm to proportionally include some past data in the adaption process to overcome the Catastrophic Forgetting problem. To validate our hypothesis and test the effectiveness of our approach, we present comprehensive experiments in different elements of the approach involving artificial and realworld datasets. The experiments show that the proposed method is able to significantly save computational resources in term of processor or GPU cycles while maintaining high prediction performances. 
Safe PolicyModel Iteration (SPMI) 
In many realworld problems, there is the possibility to configure, to a limited extent, some environmental parameters to improve the performance of a learning agent. In this paper, we propose a novel framework, Configurable Markov Decision Processes (ConfMDPs), to model this new type of interaction with the environment. Furthermore, we provide a new learning algorithm, Safe PolicyModel Iteration (SPMI), to jointly and adaptively optimize the policy and the environment configuration. After having introduced our approach and derived some theoretical results, we present the experimental evaluation in two explicative problems to show the benefits of the environment configurability on the performance of the learned policy. 
Safe Reinforcement Learning  Safe Reinforcement Learning can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes. We categorize and analyze two approaches of Safe Reinforcement Learning. The first is based on the modification of the optimality criterion, the classic discounted finite/infinite horizon, with a safety factor. The second is based on the modification of the exploration process through the incorporation of external knowledge or the guidance of a risk metric. We use the proposed classification to survey the existing literature, as well as suggesting future directions for Safe Reinforcement Learning. 
SafePredict  SafePredict is a novel metaalgorithm that works with any base prediction algorithm for online data to guarantee an arbitrarily chosen correctness rate, $1\epsilon$, by allowing refusals. Allowing refusals means that the metaalgorithm may refuse to emit a prediction produced by the base algorithm on occasion so that the error rate on nonrefused predictions does not exceed $\epsilon$. The SafePredict error bound does not rely on any assumptions on the data distribution or the base predictor. When the base predictor happens not to exceed the target error rate $\epsilon$, SafePredict refuses only a finite number of times. When the error rate of the base predictor changes through time SafePredict makes use of a weightshifting heuristic that adapts to these changes without knowing when the changes occur yet still maintains the correctness guarantee. Empirical results show that (i) SafePredict compares favorably with stateofthe art confidence based refusal mechanisms which fail to offer robust error guarantees; and (ii) combining SafePredict with such refusal mechanisms can in many cases further reduce the number of refusals. Our software (currently in Python) is included in the supplementary material. 
SAFFRON  In the online false discovery rate (FDR) problem, one observes a possibly infinite sequence of $p$values $P_1,P_2,\dots$, each testing a different null hypothesis, and an algorithm must pick a sequence of rejection thresholds $\alpha_1,\alpha_2,\dots$ in an online fashion, effectively rejecting the $k$th null hypothesis whenever $P_k \leq \alpha_k$. Importantly, $\alpha_k$ must be a function of the past, and cannot depend on $P_k$ or any of the later unseen $p$values, and must be chosen to guarantee that for any time $t$, the FDR up to time $t$ is less than some predetermined quantity $\alpha \in (0,1)$. In this work, we present a powerful new framework for online FDR control that we refer to as SAFFRON. Like older alphainvesting (AI) algorithms, SAFFRON starts off with an error budget, called alphawealth, that it intelligently allocates to different tests over time, earning back some wealth on making a new discovery. However, unlike older methods, SAFFRON’s threshold sequence is based on a novel estimate of the alpha fraction that it allocates to true null hypotheses. In the offline setting, algorithms that employ an estimate of the proportion of true nulls are called adaptive methods, and SAFFRON can be seen as an online analogue of the famous offline StoreyBH adaptive procedure. Just as StoreyBH is typically more powerful than the BenjaminiHochberg (BH) procedure under independence, we demonstrate that SAFFRON is also more powerful than its nonadaptive counterparts, such as LORD and other generalized alphainvesting algorithms. Further, a monotone version of the original AI algorithm is recovered as a special case of SAFFRON, that is often more stable and powerful than the original. Lastly, the derivation of SAFFRON provides a novel template for deriving new online FDR rules. 
SAFPooling  Major winning Convolutional Neural Networks (CNNs), such as VGGNet, ResNet, DenseNet, \etc, include tens to hundreds of millions of parameters, which impose considerable computation and memory overheads. This limits their practical usage in training and optimizing for realworld applications. On the contrary, lightweight architectures, such as SqueezeNet, are being proposed to address this issue. However, they mainly suffer from low accuracy, as they have compromised between the processing power and efficiency. These inefficiencies mostly stem from following an adhoc designing procedure. In this work, we discuss and propose several crucial design principles for an efficient architecture design and elaborate intuitions concerning different aspects of the design procedure. Furthermore, we introduce a new layer called {\it SAFpooling} to improve the generalization power of the network while keeping it simple by choosing best features. Based on such principles, we propose a simple architecture called {\it SimpNet}. We empirically show that SimpNet provides a good tradeoff between the computation/memory efficiency and the accuracy solely based on these primitive but crucial principles. SimpNet outperforms the deeper and more complex architectures such as VGGNet, ResNet, WideResidualNet \etc, on several wellknown benchmarks, while having 2 to 25 times fewer number of parameters and operations. We obtain stateoftheart results (in terms of a balance between the accuracy and the number of involved parameters) on standard datasets, such as CIFAR10, CIFAR100, MNIST and SVHN. The implementations are available at \href{url}{https://…/SimpNet}. 
SAGA  In this paper, we propose a unified framework and an algorithm for the problem of group recommendation where a fixed number of items or alternatives can be recommended to a group of users. The problem of group recommendation arises naturally in many real world contexts, and is closely related to the budgeted social choice problem studied in economics. We frame the group recommendation problem as choosing a subgraph with the largest group consensus score in a completely connected graph defined over the item affinity matrix. We propose a fast greedy algorithm with strong theoretical guarantees, and show that the proposed algorithm compares favorably to the stateoftheart group recommendation algorithms according to commonly used relevance and coverage performance measures on benchmark dataset. 
Saliency Detection  Within our line of sight there are always things that stand out more than others. If you find yourself gazing over a city from a height for example, you may be drawn to a nearby skyscraper, a flashing light or even a red coat someone is wearing below. Saliency is the aspect of any stimulus that makes it stand out from the crowd. The reason a particular stimulus has such salience may be due to contrast i.e. a white line on a black background or as a result of emotional or cognitive factors. For example, we may hone in on something because we are actively looking for it or because it triggers something in our past or memory. Saliency is most commonly discussed in relation to the visual system but it is employed by every perceptual system such as sound and touch. If we are hungry the smell of a favourite food may be highly salient for example. The mechanisms by which humans grant certain stimuli more attentional focus than others probably holds root in our evolutionary past. Our limited cognitive resources require a way to identify the most relevant stimuli for learning and or survival. The world is full of stimuli everywhere you turn and we cannot attend to all of these at once. How does our visual system know where to focus? 
Saliency Methods  Saliency methods aim to explain the predictions of deep neural networks. These methods lack reliability when the explanation is sensitive to factors that do not contribute to the model prediction. 
Saliency Prediction  
Same Place Different Time (SPDT) 

Sammon Mapping  Sammon mapping or Sammon projection is an algorithm that maps a highdimensional space to a space of lower dimensionality () by trying to preserve the structure of interpoint distances in highdimensional space in the lowerdimension projection. It is particularly suited for use in exploratory data analysis. The method was proposed by John W. Sammon in 1969. It is considered a nonlinear approach as the mapping cannot be represented as a linear combination of the original variables as possible in techniques such as principal component analysis, which also makes it more difficult to use for classification applications. 
Sample Size Determination (SSD) 
Sample size determination is the act of choosing the number of observations or replicates to include in a statistical sample. The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample. In practice, the sample size used in a study is determined based on the expense of data collection, and the need to have sufficient statistical power. In complicated studies there may be several different sample sizes involved in the study: for example, in a survey sampling involving stratified sampling there would be different sample sizes for each population. In a census, data are collected on the entire population, hence the sample size is equal to the population size. In experimental design, where a study may be divided into different treatment groups, there may be different sample sizes for each group. ssd 
Sample Size Optimization (SSO) 
Finding the minimal sample size for a query regarding given error constraints. MISS: Finding Optimal Sample Sizes for Approximate Analytics 
Sample, Explore, Modify, Model and Assess (SEMMA) 
SEMMA is an acronym that stands for Sample, Explore, Modify, Model and Assess. It is a list of sequential steps developed by SAS Institute Inc., one of the largest producers of statistics and business intelligence software. It guides the implementation of data mining applications. Although SEMMA is often considered to be a general data mining methodology, SAS claims that it is “rather a logical organisation of the functional tool set of” one of their products, SAS Enterprise Miner, “for carrying out the core tasks of data mining”. 
Sample, Operation, Attribute, and Parameter Dimensions (SOAP) 
The computational requirements for training deep neural networks (DNNs) have grown to the point that it is now standard practice to parallelize training. Existing deep learning systems commonly use data or model parallelism, but unfortunately, these strategies often result in suboptimal parallelization performance. In this paper, we define a more comprehensive search space of parallelization strategies for DNNs called SOAP, which includes strategies to parallelize a DNN in the Sample, Operation, Attribute, and Parameter dimensions. We also propose FlexFlow, a deep learning framework that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine. To accelerate this search, FlexFlow introduces a novel execution simulator that can accurately predict a parallelization strategy’s performance and is three orders of magnitude faster than prior approaches that have to execute each strategy. We evaluate FlexFlow with six realworld DNN benchmarks on two GPU clusters and show that FlexFlow can increase training throughput by up to 3.8x over stateoftheart approaches, even when including its search time, and also improves scalability. 
Sampled Weighted MinHashing (SWMH) 
We present Sampled Weighted MinHashing (SWMH), a randomized approach to automatically mine topics from largescale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term cooccurrence and agglomerates highly overlapping interpartition cells to produce the mined topics. While other approaches define a topic as a probabilistic distribution over a vocabulary, SWMH topics are ordered subsets of such vocabulary. Interestingly, the topics mined by SWMH underlie themes from the corpus at different levels of granularity. We extensively evaluate the meaningfulness of the mined topics both qualitatively and quantitatively on the NIPS (1.7 K documents), 20 Newsgroups (20 K), Reuters (800 K) and Wikipedia (4 M) corpora. Additionally, we compare the quality of SWMH with Online LDA topics for document representation in classification. 
Sampling Clustering  We propose an efficient graphbased divisive cluster analysis approach called sampling clustering. It constructs a lite informative dendrogram by recursively dividing a graph into subgraphs. In each recursive call, a graph is sampled first with a set of vertices being removed to disconnect latent clusters, then condensed by adding edges to the remaining vertices to avoid graph fragmentation caused by vertex removals. We also present some sampling and condensing methods and discuss the effectiveness in this paper. Our implementations run in linear time and achieve outstanding performance on various types of datasets. Experimental results show that they outperform stateoftheart clustering algorithms with significantly less computing resources requirements. 
Sampling Error  In statistics, sampling error is incurred when the statistical characteristics of a population are estimated from a subset, or sample, of that population. Since the sample does not include all members of the population, statistics on the sample, such as means and quantiles, generally differ from parameters on the entire population. For example, if one measures the height of a thousand individuals from a country of one million, the average height of the thousand is typically not the same as the average height of all one million people in the country. Since sampling is typically done to determine the characteristics of a whole population, the difference between the sample and population values is considered a sampling error. Exact measurement of sampling error is generally not feasible since the true population values are unknown; however, sampling error can often be estimated by probabilistic modeling of the sample. 
Sampling Importance Resampling (SIR) 
Sampling Importance Resampling allows us to sample from the posterior distribution, p(thetadata) where p(thetadata)?L(theta;data)×p(theta) by resampling from a series of draws from the prior, p(theta). Denote one of those n draws from the prior distribution, p(theta), as thetai. Then draw i from the prior sample is drawn with replacement into the posterior sample with probability qi … 
Samsara  Apache Mahout introduces a new math environment we call Samsara, for its theme of universal renewal. It reflects a fundamental rethinking of how scalable machine learning algorithms are built and customized. MahoutSamsara is here to help people create their own math while providing some offtheshelf algorithm implementations. At its core are general linear algebra and statistical operations along with the data structures to support them. You can use is as a library or customize it in Scala with Mahoutspecific extensions that look something like R. MahoutSamsara comes with an interactive shell that runs distributed operations on a Spark cluster. This make prototyping or task submission much easier and allows users to customize algorithms with a whole new degree of freedom. 
Sankey Diagram  Sankey diagrams are a specific type of flow diagram, in which the width of the arrows is shown proportionally to the flow quantity. They are typically used to visualize energy or material or cost transfers between processes. 
SAP HANA (HANA) 
SAP HANA has completely transformed the database industry by combining database, data processing, and application platform capabilities in a single inmemory platform. The platform also provides libraries for predictive, planning, text processing, spatial, and business analytics – all on the same architecture. This makes it possible for applications and analytics to be rethought without information processing latency, and senseandresponse solutions can work on massive quantities of realtime data for immediate answers without building preaggregates. Simply put – this makes SAP HANA the platform for building and deploying nextgeneration, realtime applications and analytics. 
SAP River  River is a programming model and a programming language where you define your application (Data Model, Queries & Business Logic) and upon deployment every runtime artifact is deployed onto a DB (such as HANA) and a runtime container (such as XS to run the JavaScript which handles the business logic side). SAP River is an easy way to make SAP HANA Applications. Develop and test an application backend, in a matter of minutes, that runs on SAP HANA – SAP’s inmemory database and application platform. SAP River is a new way of developing native applications on SAP HANA. River consists of a language, a programming model and a set of tools, which allow the developer to focus on the business intent of the application, and largely ignore issues of implementation and optimization. These aspects are taken care of automatically by the language tools, which choose, on compilation, the most appropriate runtime context for each part of the application. River allows a developer to specify the data model, the application business logic as well as access control, all in a single integrated specification. River is compatible with existing SAP HANA objects, like tables, views, stored procedures and XSJS procedures. River code is in fact crosscompiled into these same native runtime objects, which are automatically exposed via an OData API. The result is a simpler development process, increased developer productivity, and application code that is easier to understand and to maintain. 
Sapphire  RDF data in the linked open data (LOD) cloud is very valuable for many different applications. In order to unlock the full value of this data, users should be able to issue complex queries on the RDF datasets in the LOD cloud. SPARQL can express such complex queries, but constructing SPARQL queries can be a challenge to users since it requires knowing the structure and vocabulary of the datasets being queried. In this paper, we introduce Sapphire, a tool that helps users write syntactically and semantically correct SPARQL queries without prior knowledge of the queried datasets. Sapphire interactively helps the user while typing the query by providing autocomplete suggestions based on the queried data. After a query is issued, Sapphire provides suggestions on ways to change the query to better match the needs of the user. We evaluated Sapphire based on performance experiments and a user study and showed it to be superior to competing approaches. 
SAQL  Recently, advanced cyber attacks, which consist of a sequence of steps that involve many vulnerabilities and hosts, compromise the security of many wellprotected businesses. This has led to the solutions that ubiquitously monitor system activities in each host (big data) as a series of events, and search for anomalies (abnormal behaviors) for triaging risky events. Since fighting against these attacks is a timecritical mission to prevent further damage, these solutions face challenges in incorporating expert knowledge to perform timely anomaly detection over the largescale provenance data. To address these challenges, we propose a novel streambased query system that takes as input, a realtime event feed aggregated from multiple hosts in an enterprise, and provides an anomaly query engine that queries the event feed to identify abnormal behaviors based on the specified anomalies. To facilitate the task of expressing anomalies based on expert knowledge, our system provides a domainspecific query language, SAQL, which allows analysts to express models for (1) rulebased anomalies, (2) timeseries anomalies, (3) invariantbased anomalies, and (4) outlierbased anomalies. We deployed our system in NEC Labs America comprising 150 hosts and evaluated it using 1.1TB of real system monitoring data (containing 3.3 billion events). Our evaluations on a broad set of attack behaviors and microbenchmarks show that our system has a low detection latency (<2s) and a high system throughput (110,000 events/s; supporting ~4000 hosts), and is more efficient in memory utilization than the existing streambased complex event processing systems. 
SASiam  Observing that Semantic features learned in an image classification task and Appearance features learned in a similarity matching task complement each other, we build a twofold Siamese network, named SASiam, for realtime object tracking. SASiam is composed of a semantic branch and an appearance branch. Each branch is a similaritylearning Siamese network. An important design choice in SASiam is to separately train the two branches to keep the heterogeneity of the two types of features. In addition, we propose a channel attention mechanism for the semantic branch. Channelwise weights are computed according to the channel activations around the target position. While the inherited architecture from SiamFC \cite{SiamFC} allows our tracker to operate beyond realtime, the twofold design and the attention mechanism significantly improve the tracking performance. The proposed SASiam outperforms all other realtime trackers by a large margin on OTB2013/50/100 benchmarks. 
SAX Transformation  ➘ “Symbolic Aggregate Approximation” TSMining 
sbAbI  In this study, we investigate the limits of the current state of the art AI system for detecting buffer overflows and compare it with current static analysis tools. To do so, we developed a code generator, sbAbI, capable of producing an arbitrarily large number of code samples of controlled complexity. We found that the static analysis engines we examined have good precision, but poor recall on this dataset, except for a sound static analyzer that has good precision and recall. We found that the state of the art AI system, a memory network modeled after Choi et al. [1], can achieve similar performance to the static analysis engines, but requires an exhaustive amount of training data in order to do so. Our work points towards future approaches that may solve these problems; namely, using representations of code that can capture appropriate scope information and using deep learning methods that are able to perform arithmetic operations. 
SBGSketch  Applications in various domains rely on processing graph streams, e.g., communication logs of a cloudtroubleshooting system, roadnetwork traffic updates, and interactions on a social network. A labeledgraph stream refers to a sequence of streamed edges that form a labeled graph. Labelaware applications need to filter the graph stream before performing a graph operation. Due to the large volume and high velocity of these streams, it is often more practical to incrementally build a lossycompressed version of the graph, and use this lossy version to approximately evaluate graph queries. Challenges arise when the queries are unknown in advance but are associated with filtering predicates based on edge labels. Surprisingly common, and especially challenging, are labeledgraph streams that have highly skewed label distributions that might also vary over time. This paper introduces SelfBalanced Graph Sketch (SBGSketch, for short), a graphical sketch for summarizing and querying labeledgraph streams that can cope with all these challenges. SBGSketch maintains synopsis for both the edge attributes (e.g., edge weight) as well as the topology of the streamed graph. SBGSketch allows efficient processing of graphtraversal queries, e.g., reachability queries. Experimental results over a variety of real graph streams show SBGSketch to reduce the estimation errors of stateoftheart methods by up to 99%. 
Scala  Scala is an objectfunctional programming language for general software applications. Scala has full support for functional programming and a very strong static type system. This allows programs written in Scala to be very concise and thus smaller in size than other generalpurpose programming languages. Many of Scala’s design decisions were inspired by criticism of the shortcomings of Java. Scala source code is intended to be compiled to Java bytecode, so that the resulting executable code runs on a Java virtual machine. Java libraries may be used directly in Scala code and vice versa (language interoperability). Like Java, Scala is objectoriented, and uses a curlybrace syntax reminiscent of the C programming language. Unlike Java, Scala has many features of functional programming languages like Scheme, Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation, and pattern matching. It also has an advanced type system supporting algebraic data types, covariance and contravariance, higherorder types, and anonymous types. Other features of Scala not present in Java include operator overloading, optional parameters, named parameters, raw strings, and no checked exceptions. The name Scala is a portmanteau of ‘scalable’ and ‘language’, signifying that it is designed to grow with the demands of its users. 
Scalable Advanced Massive Online Analysis (SAMOA) 
SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. It provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm, S4, and Samza. samoa is written in Java, is open source, and is available at http://samoaproject.net under the Apache Software License version 2.0. http://samoa.incubator.apache.org http://samoaproject.net https://…ableadvancedmassiveonlineanalysis.pdf 
Scalable Bayesian Multirelational Factorization with Side Information using MCMC (Macau) 
We propose Macau, a powerful and flexible Bayesian factorization method for heterogeneous data. Our model can factorize any set of entities and relations that can be represented by a relational model, including tensors and also multiple relations for each entity. Macau can also incorporate side information, specifically entity and relation features, which are crucial for predicting sparsely observed relations. Macau scales to millions of entity instances, hundred millions of observations, and sparse entity features with millions of dimensions. To achieve the scale up, we specially designed sampling procedure for entity and relation features that relies primarily on noise injection in linear regressions. We show performance and advanced features of Macau in a set of experiments, including challenging drugprotein activity prediction task. 
Scalable Generalized Dynamic Topic Model  Dynamic topic models (DTMs) model the evolution of prevalent themes in literature, online media, and other forms of text over time. DTMs assume that word cooccurrence statistics change continuously and therefore impose continuous stochastic process priors on their model parameters. These dynamical priors make inference much harder than in regular topic models, and also limit scalability. In this paper, we present several new results around DTMs. First, we extend the class of tractable priors from Wiener processes to the generic class of Gaussian processes (GPs). This allows us to explore topics that develop smoothly over time, that have a longterm memory or are temporally concentrated (for event detection). Second, we show how to perform scalable approximate inference in these models based on ideas around stochastic variational inference and sparse Gaussian processes. This way we can train a rich family of DTMs to massive data. Our experiments on several largescale datasets show that our generalized model allows us to find interesting patterns that were not accessible by previous approaches. 
Scalable Logo SelfcoLearning (SL^2) 
Existing logo detection methods usually consider a small number of logo classes and limited images per class with a strong assumption of requiring tedious object bounding box annotations, therefore not scalable to realworld dynamic applications. In this work, we tackle these challenges by exploring the webly data learning principle without the need for exhaustive manual labelling. Specifically, we propose a novel incremental learning approach, called Scalable Logo SelfcoLearning (SL^2), capable of automatically selfdiscovering informative training images from noisy web data for progressively improving model capability in a crossmodel colearning manner. Moreover, we introduce a very large (2,190,757 images of 194 logo classes) logo dataset ‘WebLogo2M’ by an automatic web data collection and processing method. Extensive comparative evaluations demonstrate the superiority of the proposed SL^2 method over the stateoftheart strongly and weakly supervised detection models and contemporary webly data learning approaches. 
Scalable Machine Learning  Scalability has become one of those core concept slash buzzwords of Big Data. It’s all about scaling out, web scale, and so on. In principle, the idea is to be able to take one piece of code and then throw any number of computers at it to make it fast. The terms “scalable” and “large scale” have been used in machine learning circles long before there was Big Data. There had always been certain problems which lead to a large amount of data, for example in bioinformatics, or when dealing with large number of text documents. So finding learning algorithms, or more generally data analysis algorithms which can deal with a very large set of data was always a relevant question. 
Scalable Online Learning (SOL) 
SOL is an opensource library for scalable online learning algorithms, and is particularly suitable for learning with highdimensional data. The library provides a family of regular and sparse online learning algorithms for largescale binary and multiclass classification tasks with high efficiency, scalability, portability, and extensibility. SOL was implemented in C++, and provided with a collection of easytouse commandline tools, python wrappers and library calls for users and developers, as well as comprehensive documents for both beginners and advanced users. SOL is not only a practical machine learning toolbox, but also a comprehensive experimental platform for online learning research. Experiments demonstrate that SOL is highly efficient and scalable for largescale machine learning with highdimensional data. 
Scalding  Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away lowlevel Hadoop details. Scalding is comparable to Pig, but offers tight integration with Scala, bringing advantages of Scala to your MapReduce jobs. 
Scale Adaptive Dictionary Learning (SADL) 
Dictionary learning has been widely used in many image processing tasks. In most of these methods, the number of basis vectors is either set by experience or coarsely evaluated empirically. In this paper we propose a new Scale Adaptive Dictionary Learning (SADL) framework, which jointly estimates suitable scales and corresponding atoms in an adaptive fashion according to the training data, without the need of prior information. We design an atom counting function and develop a reliable numerical scheme to solve the challenging optimization problem. Extensive experiments on texture and video datasets demonstrate quantitatively and visually that our method can estimate the scale, without damaging the sparse reconstruction ability. 
Scale Invariant Probabilistic Neural Network (SPNN) 
Proposed by Specht (1990) <doi:10.1016/08936080(90)90049q>. spnn 
Scaled Cayley Orthogonal Recurrent Neural Network (scoRNN) 
Recurrent Neural Networks (RNNs) are designed to handle sequential data but suffer from vanishing or exploding gradients. Recent work on Unitary Recurrent Neural Networks (uRNNs) have been used to address this issue and in some cases, exceed the capabilities of Long ShortTerm Memory networks (LSTMs). We propose a simpler and novel update scheme to maintain orthogonal recurrent weight matrices without using complex valued matrices. This is done by parametrizing with a skewsymmetric matrix using the Cayley transform. Such a parametrization is unable to represent matrices with negative one eigenvalues, but this limitation is overcome by scaling the recurrent weight matrix by a diagonal matrix consisting of ones and negative ones. The proposed training scheme involves a straightforward gradient calculation and update step. In several experiments, the proposed scaled Cayley orthogonal recurrent neural network (scoRNN) achieves superior results with fewer trainable parameters than other unitary RNNs. 
Scaled ExponentiallyRegularized Linear Unit (SERLU) 
Recently, selfnormalizing neural networks (SNNs) have been proposed with the intention to avoid batch or weight normalization. The key step in SNNs is to properly scale the exponential linear unit (referred to as SELU) to inherently incorporate normalization based on central limit theory. SELU is a monotonically increasing function, where it has an approximately constant negative output for large negative input. In this work, we propose a new activation function to break the monotonicity property of SELU while still preserving the selfnormalizing property. Differently from SELU, the new function introduces a bumpshaped function in the region of negative input by regularizing a linear function with a scaled exponential function, which is referred to as a scaled exponentiallyregularized linear unit (SERLU). The bumpshaped function has approximately zero response to large negative input while being able to push the output of SERLU towards zero mean statistically. To effectively combat overfitting, we develop a socalled shiftdropout for SERLU, which includes standard dropout as a special case. Experimental results on MNIST, CIFAR10 and CIFAR100 show that SERLUbased neural networks provide consistently promising results in comparison to other 5 activation functions including ELU, SELU, Swish, Leakly ReLU and ReLU. 
Scaled Sparse Linear Regression  Scaled sparse linear regression jointly estimates the regression coefficients and noise level in a linear model. It chooses an equilibrium with a sparse regression method by iteratively estimating the noise level via the mean residual square and scaling the penalty in proportion to the estimated noise level. The iterative algorithm costs little beyond the computation of a path or grid of the sparse regression estimator for penalty levels above a proper threshold. For the scaled lasso, the algorithm is a gradient descent in a convex minimization of a penalized joint loss function for the regression coefficients and noise level. Under mild regularity conditions, we prove that the scaled lasso simultaneously yields an estimator for the noise level and an estimated coefficient vector satisfying certain oracle inequalities for prediction, the estimation of the noise level and the regression coefficients. These inequalities provide sufficient conditions for the consistency and asymptotic normality of the noise level estimator, including certain cases where the number of variables is of greater order than the sample size. Parallel results are provided for the least squares estimation after model selection by the scaled lasso. Numerical results demonstrate the superior performance of the proposed methods over an earlier proposal of joint convex minimization. 
ScaleFree Identifier Network (sfIN) 
We propose scalefree Identifier Network(sfIN), a novel model for event identification in documents. In general, sfIN first encodes a document into multiscale memory stacks, then extracts special events via conducting multiscale actions, which can be considered as a special type of sequence labelling. The design of large scale actions makes it more efficient processing a long document. The whole model is trained with both supervised learning and reinforcement learning. 
Scaling Invariable Benford Distance  For the first time, we introduce ‘Scaling invariable Benford distance’ and ‘Benford cyclic graph’, which can be used to analyze any data set. Using the quantity and the graph, we analyze some date sets with common distributions, such as normal, exponent, etc., find that different data set has a much different value of ‘Scaling invariable Benford distance’ and different figure feature of ‘Benford cyclic graph’. We also explore the influence of data size on ‘Scaling invariable Benford distance’, and find that it firstly reduces with data size increasing, then approximate to a fixed value when the size is large enough. 
Scattering Transforms (ScatterNets) 
Scattering Transforms (or ScatterNets) introduced by Mallat are a promising start into creating a welldefined feature extractor to use for pattern recognition and image classification tasks. They are of particular interest due to their architectural similarity to Convolutional Neural Networks (CNNs), while requiring no parameter learning and still performing very well (particularly in constrained classification tasks). In this paper we visualize what the deeper layers of a ScatterNet are sensitive to using a ‘DeScatterNet’. We show that the higher orders of ScatterNets are sensitive to complex, edgelike patterns (checkerboards and rippled edges). These complex patterns may be useful for texture classification, but are quite dissimilar from the patterns visualized in second and third layers of Convolutional Neural Networks (CNNs) – the current state of the art Image Classifiers. We propose that this may be the source of the current gaps in performance between ScatterNets and CNNs (83% vs 93% on CIFAR10 for ScatterNet+SVM vs ResNet). We then use these visualization tools to propose possible enhancements to the ScatterNet design, which show they have the power to extract features more closely resembling CNNs, while still being welldefined and having the invariance properties fundamental to ScatterNets. 
ScatterNet Hybrid Deep Learning (SHDL) 
The paper proposes the ScatterNet Hybrid Deep Learning (SHDL) network that extracts invariant and discriminative image representations for object recognition. SHDL framework is constructed with a multilayer ScatterNet frontend, an unsupervised learning middle, and a supervised learning backend module. Each layer of the SHDL network is automatically designed as an explicit optimization problem leading to an optimal deep learning architecture with improved computational performance as compared to the more usual deep network architectures. SHDL network produces the stateoftheart classification performance against unsupervised and semisupervised learning (GANs) on two image datasets. Advantages of the SHDL network over supervised methods (NIN, VGG) are also demonstrated with experiments performed on training datasets of reduced size. 
Scatterplot Smoothing  In statistics, several scatterplot smoothing methods are available to fit a function through the points of a scatterplot to best represent the relationship between the variables. Scatterplots may be smoothed by fitting a line to the data points in a diagram. This line attempts to display the nonrandom component of the association between the variables in a 2D scatter plot. Smoothing attempts to separate the nonrandom behaviour in the data from the random fluctuations, removing or reducing these fluctuations, and allows prediction of the response based value of the explanatory variable. 
SceneLSTM  We develop a human movement trajectory prediction system that incorporates the scene information (SceneLSTM) as well as human movement trajectories (Pedestrian movement LSTM) in the prediction process within static crowded scenes. We superimpose a twolevel grid structure (scene is divided into grid cells each modeled by a sceneLSTM, which are further divided into smaller subgrids for finer spatial granularity) and explore common human trajectories occurring in the grid cell (e.g., making a right or left turn onto sidewalks coming out of an alley; or standing still at bus/train stops). Two coupled LSTM networks, Pedestrian movement LSTMs (one per target) and the corresponding SceneLSTMs (one per gridcell) are trained simultaneously to predict the next movements. We show that such common path information greatly influences prediction of future movement. We further design a scene data filter that holds important nonlinear movement information. The scene data filter allows us to select the relevant parts of the information from the grid cell’s memory relative to a target’s state. We evaluate and compare two versions of our method with the Linear and several existing LSTMbased methods on five crowded video sequences from the UCY [1] and ETH [2] datasets. The results show that our method reduces the location displacement errors compared to related methods and specifically about 80% reduction compared to social interaction methods. 
Scheduled Auxiliary Control (SACX) 
We propose Scheduled Auxiliary Control (SACX), a new learning paradigm in the context of Reinforcement Learning (RL). SACX enables learning of complex behaviors – from scratch – in the presence of multiple sparse reward signals. To this end, the agent is equipped with a set of general auxiliary tasks, that it attempts to learn simultaneously via offpolicy RL. The key idea behind our method is that active (learned) scheduling and execution of auxiliary policies allows the agent to efficiently explore its environment – enabling it to excel at sparse reward RL. Our experiments in several challenging robotic manipulation settings demonstrate the power of our approach. 
Scheduling Theory  A branch of applied mathematics (a division of operations research) concerned with mathematical formulations and solution methods of problems of optimal ordering and coordination in time of certain operations. Scheduling theory includes questions on the development of optimal schedules (Gantt charts, graphs) for performing finite (or repetitive) sets of operations. The area of application of results in scheduling theory include management, production, transportation, computer systems, construction, etc. The problems that scheduling theory deals with are usually formulated as optimization problems for a process of processing a finite set of jobs in a system with limited resources. A finite set of jobs is what distinguishes scheduling models from similar models in queueing theory, where basically infinite flows of activities are considered. In all other respects the starting points of the two theories are close. In scheduling theory, the time of arrival for every job into the system is specified. Within the system the job has to pass several processing stages, depending on the conditions of the problem. For every stage, feasible sets of resources are given, as well as the processing time depending on the resources used. The possibility of interruptions in the processing of certain jobs (socalled preemptions) can also be stipulated. Constraints on the processing sequence are usually described by a transitive antireflexive binary relation. Algorithms for the evaluation of characteristics of large partially ordered sets of jobs constitute the essence of the part of scheduling theory called network analysis (cf. Network model; Network planning). Sometimes, in scheduling models durations of readjustments are specified that are necessary when one job in process is replaced by another, as well as certain other conditions. 
Schelling’s Model of Segregation  In 1971, the American economist Thomas Schelling created an agentbased model that might help explain why segregation is so difficult to combat. His model of segregation showed that even when individuals (or ‘agents’) didn’t mind being surrounded or living by agents of a different race, they would still choose to segregate themselves from other agents over time! Although the model is quite simple, it gives a fascinating look at how individuals might selfsegregate, even when they have no explicit desire to do so. 
Scientific Data Mining  Scientific data mining is defined as data mining applied to scientific problems, rather than database marketing, finance, or businessdriven applications. Scientific data mining distinguishes itself in the sense that the nature of the datasets is often very different from traditional marketdriven data mining applications. The datasets now might involve vast amounts of precise and continuous data, and accounting for underlying system nonlinearities can be extremely challenging from a machine learning point of view. 
Scientific Information Extractor (SciIE) 
We introduce a multitask setup of identifying and classifying entities, relations, and coreference clusters in scientific articles. We create SciERC, a dataset that includes annotations for all three tasks and develop a unified framework called Scientific Information Extractor (SciIE) for with shared span representations. The multitask setup reduces cascading errors between tasks and leverages crosssentence relations through coreference links. Experiments show that our multitask model outperforms previous models in scientific information extraction without using any domainspecific features. We further show that the framework supports construction of a scientific knowledge graph, which we use to analyze information in scientific literature. 
scikit  scikitlearn: Machine Learning in Python · Simple and efficient tools for data mining and data analysis · Accessible to everybody, and reusable in various contexts · Built on NumPy, SciPy, and matplotlib · Open source, commercially usable – BSD license 
ScikitMultiflow  Scikitmultiflow is a multioutput/multilabel and stream data mining framework for the Python programming language. Conceived to serve as a platform to encourage democratization of stream learning research, it provides multiple state of the art methods for stream learning, stream generators and evaluators. scikitmultiflow builds upon popular open source frameworks including scikitlearn, MOA and MEKA. Development follows the FOSS principles and quality is enforced by complying with PEP8 guidelines and using continuous integration and automatic testing. ScikitMultiflow: A Multioutput Streaming Framework 
SciTokens  The management of security credentials (e.g., passwords, secret keys) for computational science workflows is a burden for scientists and information security officers. Problems with credentials (e.g., expiration, privilege mismatch) cause workflows to fail to fetch needed input data or store valuable scientific results, distracting scientists from their research by requiring them to diagnose the problems, rerun their computations, and wait longer for their results. In this paper, we introduce SciTokens, open source software to help scientists manage their security credentials more reliably and securely. We describe the SciTokens system architecture, design, and implementation addressing use cases from the Laser Interferometer GravitationalWave Observatory (LIGO) Scientific Collaboration and the Large Synoptic Survey Telescope (LSST) projects. We also present our integration with widelyused software that supports distributed scientific computing, including HTCondor, CVMFS, and XrootD. SciTokens uses IETFstandard OAuth tokens for capabilitybased secure access to remote scientific data. The access tokens convey the specific authorizations needed by the workflows, rather than generalpurpose authentication impersonation credentials, to address the risks of scientific workflows running on distributed infrastructure including NSF resources (e.g., LIGO Data Grid, Open Science Grid, XSEDE) and public clouds (e.g., Amazon Web Services, Google Cloud, Microsoft Azure). By improving the interoperability and security of scientific workflows, SciTokens 1) enables use of distributed computing for scientific domains that require greater data protection and 2) enables use of more widely distributed computing resources by reducing the risk of credential abuse on remote systems. 
Score Function  In statistics, the score, score function, efficient score or informant indicates how sensitively a likelihood function L(theta,X) depends on its parameter theta. Explicitly, the score for theta is the gradient of the loglikelihood with respect to theta. The score plays an important role in several aspects of inference. For example: · in formulating a test statistic for a locally most powerful test; · in approximating the error in a maximum likelihood estimate; · in demonstrating the asymptotic sufficiency of a maximum likelihood estimate; · in the formulation of confidence intervals; · in demonstrations of the CramérRao inequality. The score function also plays an important role in computational statistics, as it can play a part in the computation of maximum likelihood estimates. 
Scoring Rule  In decision theory, a score function, or scoring rule, measures the accuracy of probabilistic predictions. It is applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete outcomes. The set of possible outcomes can be either binary or categorical in nature, and the probabilities assigned to this set of outcomes must sum to one (where each individual probability is in the range of 0 to 1). A score can be thought of as either a measure of the “calibration” of a set of probabilistic predictions, or as a “cost function” or “loss function”. If a cost is levied in proportion to a proper scoring rule, the minimal expected cost corresponds to reporting the true set of probabilities. Proper scoring rules are used in meteorology, finance, and pattern classification where a forecaster or algorithm will attempt to minimize the average score to yield refined, calibrated probabilities (i.e. accurate probabilities). Various scoring rules have also been used to assess the predictive accuracy of football forecast models. 
ScottKnott  ScottKnott is an hierarchical clustering algorithm used in the application of ANOVA, when the researcher is comparing treatment means, with a very important characteristic: it does not present any overlapping in its grouping results. We wrote a code, in R, that performs this algorithm starting from vectors, matrix, data.frame, aov or aov.list objects. The results are presented with letters representing groups, as well as through graphics using different colors to differentiate distinct groups. This R package, named ScottKnott is the main topic of this article. ScottKnott,ScottKnottESD 
SCOUT  Finding the right cloud configuration for workloads is an essential step to ensure good performance and contain running costs. A poor choice of cloud configuration decreases application performance and increases running cost significantly. While Bayesian Optimization is effective and applicable to any workloads, it is fragile because performance and workload are hard to model (to predict). In this paper, we propose a novel method, SCOUT. The central insight of SCOUT is that using prior measurements, even those for different workloads, improves search performance and reduces search cost. At its core, SCOUT extracts search hints (inference of resource requirements) from lowlevel performance metrics. Such hints enable SCOUT to navigate through the search space more efficiently—only spotlight region will be searched. We evaluate SCOUT with 107 workloads on Apache Hadoop and Spark. The experimental results demonstrate that our approach finds better cloud configurations with a lower search cost than state of the art methods. Based on this work, we conclude that (i) lowlevel performance information is necessary for finding the right cloud configuration in an effective, efficient and reliable way, and (ii) a search method can be guided by historical data, thereby reducing cost and improving performance. 
ScoutBot  ScoutBot is a dialogue interface to physical and simulated robots that supports collaborative exploration of environments. The demonstration will allow users to issue unconstrained spoken language commands to ScoutBot. ScoutBot will prompt for clarification if the user’s instruction needs additional input. It is trained on humanrobot dialogue collected from WizardofOz experiments, where robot responses were initiated by a human wizard in previous interactions. The demonstration will show a simulated ground robot (Clearpath Jackal) in a simulated environment supported by ROS (Robot Operating System). 
Scree Plot  A scree plot is a graphical display of the variance of each component in the dataset which is used to determine how many components should be retained in order to explain a high percentage of the variation in the data. The plot shows the variance for the first component and then for the subsequent components, it shows the additional variance that each component is adding. 
ScreenerNet  We propose to learn a curriculum or a syllabus for supervised learning with deep neural networks. Specifically, we learn weights for each sample in training by an attached neural network, called ScreenerNet, to the original network and jointly train them in an endtoend fashion. We show the networks augmented with our ScreenerNet achieve early convergence with better accuracy than the stateoftheart rulebased curricular learning methods in extensive experiments using three popular vision datasets including MNIST, CIFAR10 and Pascal VOC2012, and a Cartpole task using Deep Qlearning. 
SCvx  This paper presents the SCvx algorithm, a successive convexification algorithm designed to solve nonconvex optimal control problems with global convergence and superlinear convergencerate guarantees. The proposed algorithm handles nonlinear dynamics and nonconvex state and control constraints by linearizing them about the solution of the previous iterate, and solving the resulting convex subproblem to obtain a solution for the current iterate. Additionally, the algorithm incorporates several safeguarding techniques into each convex subproblem, employing virtual controls and virtual buffer zones to avoid artificial infeasibility, and a trust region to avoid artificial unboundedness. The procedure is repeated in succession, thus turning a difficult nonconvex optimal control problem into a sequence of numerically tractable convex subproblems. Using fast and reliable Interior Point Method (IPM) solvers, the convex subproblems can be computed quickly, making the SCvx algorithm well suited for realtime applications. Analysis is presented to show that the algorithm converges both globally and superlinearly, guaranteeing the local optimality of the original problem. The superlinear convergence is obtained by exploiting the structure of optimal control problems, showcasing the superior convergence rate that can be obtained by leveraging specific problem properties when compared to generic nonlinear programming methods. Numerical simulations are performed for an illustrative nonconvex quadrotor motion planning example problem, and corresponding results obtained using Sequential Quadratic Programming (SQP) solver are provided for comparison. Our results show that the convergence rate of the SCvx algorithm is indeed superlinear, and surpasses that of the SQPbased method by converging in less than half the number of iterations. 
SE3Nets  We introduce SE3Nets, which are deep networks designed to model rigid body motion from raw point cloud data. Based only on pairs of depth images along with an action vector and point wise data associations, SE3Nets learn to segment effected object parts and predict their motion resulting from the applied force. Rather than learning point wise flow vectors, SE3Nets predict SE3 transformations for different parts of the scene. Using simulated depth data of a table top scene and a robot manipulator, we show that the structure underlying SE3Nets enables them to generate a far more consistent prediction of object motion than traditional flow based networks. 
Seaborn  Seaborn is a Python visualization library based on matplotlib. It provides a highlevel interface for drawing attractive statistical graphics. 
Search Partition Analysis (SPAN) 
spanr 
Search Session Markov Decision Process (SSMDP) 
In ecommerce platforms such as Amazon and TaoBao, ranking items in a search session is a typical multistep decisionmaking problem. Learning to rank (LTR) methods have been widely applied to ranking problems. However, such methods often consider different ranking steps in a session to be independent, which conversely may be highly correlated to each other. For better utilizing the correlation between different ranking steps, in this paper, we propose to use reinforcement learning (RL) to learn an optimal ranking policy which maximizes the expected accumulative rewards in a search session. Firstly, we formally define the concept of search session Markov decision process (SSMDP) to formulate the multistep ranking problem. Secondly, we analyze the property of SSMDP and theoretically prove the necessity of maximizing accumulative rewards. Lastly, we propose a novel policy gradient algorithm for learning an optimal ranking policy, which is able to deal with the problem of high reward variance and unbalanced reward distribution of an SSMDP. Experiments are conducted in simulation and TaoBao search engine. The results demonstrate that our algorithm performs much better than online LTR methods, with more than 40% and 30% growth of total transaction amount in the simulation and the real application, respectively. 
SEARNN  We propose SEARNN, a novel training algorithm for recurrent neural networks (RNNs) inspired by the ‘learning to search’ (L2S) approach to structured prediction. RNNs have been widely successful in structured prediction applications such as machine translation or parsing, and are commonly trained using maximum likelihood estimation (MLE). Unfortunately, this training loss is not always an appropriate surrogate for the test error: by only maximizing the ground truth probability, it fails to exploit the wealth of information offered by structured losses. Further, it introduces discrepancies between training and predicting (such as exposure bias) that may hurt test performance. Instead, SEARNN leverages testalike search space exploration to introduce globallocal losses that are closer to the test error. We demonstrate improved performance over MLE on three different tasks: OCR, spelling correction and text chunking. Finally, we propose a subsampling strategy to enable SEARNN to scale to large vocabulary sizes. 
Seasonal ARIMA (SARIMA) 
Often time series possess a seasonal component that repeats every s observations. For monthly observations s = 12 (12 in 1 year), for quarterly observations s = 4 (4 in 1 year). In order to deal with seasonality, ARIMA processes have been generalized: SARIMA models have then been formulated. 
Seasonal Decomposition of Time Series by Loess (STL) 
Decompose a time series into seasonal, trend and irregular components using loess. 
Seasonal Hybrid ESD (SHESD) 
The primary algorithm, Seasonal Hybrid ESD (SHESD), builds upon the Generalized ESD test for detecting anomalies. SHESD can be used to detect both global and local anomalies. This is achieved by employing time series decomposition and using robust statistical metrics, viz., median together with ESD. In addition, for long time series such as 6 months of minutely data, the algorithm employs piecewise approximation. This is rooted to the fact that trend extraction in the presence of anomalies is nontrivial for anomaly detection. AnomalyDetection 
SecondOrder Convolutional Neural Networks  Convolutional Neural Networks (CNNs) have been successfully applied to many computer vision tasks, such as image classification. By performing linear combinations and elementwise nonlinear operations, these networks can be thought of as extracting solely firstorder information from an input image. In the past, however, secondorder statistics computed from handcrafted features, e.g., covariances, have proven highly effective in diverse recognition tasks. In this paper, we introduce a novel class of CNNs that exploit secondorder statistics. To this end, we design a series of new layers that (i) extract a covariance matrix from convolutional activations, (ii) compute a parametric, secondorder transformation of a matrix, and (iii) perform a parametric vectorization of a matrix. These operations can be assembled to form a Covariance Descriptor Unit (CDU), which replaces the fullyconnected layers of standard CNNs. Our experiments demonstrate the benefits of our new architecture, which outperform the firstorder CNNs, while relying on up to 90% fewer parameters. 
SecondOrder Pooling  
SecureStreams  The growing adoption of distributed data processing frameworks in a wide diversity of application domains challenges endtoend integration of properties like security, in particular when considering deployments in the context of largescale clusters or multitenant Cloud infrastructures. This paper therefore introduces SecureStreams, a reactive middleware framework to deploy and process secure streams at scale. Its design combines the highlevel reactive dataflow programming paradigm with Intel’s lowlevel software guard extensions (SGX) in order to guarantee privacy and integrity of the processed data. The experimental results of SecureStreams are promising: while offering a fluent scripting language based on Lua, our middleware delivers high processing throughput, thus enabling developers to implement secure processing pipelines in just few lines of code. 
Sedano  We present Sedano, a system for processing and indexing a continuous stream of businessrelated news. Sedano defines pipelines whose stages analyze and enrich news items (e.g., newspaper articles and press releases). News data coming from several content sources are stored, processed and then indexed in order to be consumed by Atoka, our business intelligence product. Atoka users can retrieve news about specific companies, filtering according to various facets. Sedano features both an entitylinking phase, which finds mentions of companies in news, and a classification phase, which classifies news according to a set of business events. Its flexible architecture allows Sedano to be deployed on commodity machines while being scalable and faulttolerant. 
Seemingly Unrelated Regression (SUR) 
In econometrics, the seemingly unrelated regressions (SUR) or seemingly unrelated regression equations (SURE) model, proposed by Arnold Zellner in (1962), is a generalization of a linear regression model that consists of several regression equations, each having its own dependent variable and potentially different sets of exogenous explanatory variables. Each equation is a valid linear regression on its own and can be estimated separately, which is why the system is called seemingly unrelated, although some authors suggest that the term seemingly related would be more appropriate, since the error terms are assumed to be correlated across the equations. The model can be estimated equationbyequation using standard ordinary least squares (OLS). Such estimates are consistent, however generally not as efficient as the SUR method, which amounts to feasible generalized least squares with a specific form of the variancecovariance matrix. Two important cases when SUR is in fact equivalent to OLS are when the error terms are in fact uncorrelated between the equations (so that they are truly unrelated) and when each equation contains exactly the same set of regressors on the righthandside. The SUR model can be viewed as either the simplification of the general linear model where certain coefficients in matrix B {\displaystyle \mathrm {B} } \Beta are restricted to be equal to zero, or as the generalization of the general linear model where the regressors on the righthandside are allowed to be different in each equation. The SUR model can be further generalized into the simultaneous equations model, where the righthand side regressors are allowed to be the endogenous variables as well. 
Seglearn  Seglearn is an opensource python package for machine learning time series or sequences using a sliding window segmentation approach. The implementation provides a flexible pipeline for tackling classification, regression, and forecasting problems with multivariate sequence and contextual data. This package is compatible with scikitlearn and is listed under scikitlearn Related Projects. The package depends on numpy, scipy, and scikitlearn. Seglearn is distributed under the BSD 3Clause License. Documentation includes a detailed API description, user guide, and examples. Unit tests provide a high degree of code coverage. 
Segmental Recurrent Neural Network (SRNN) 
We introduce segmental recurrent neural networks (SRNNs) which define, given an input sequence, a joint probability distribution over segmentations of the input and labelings of the segments. Representations of the input segments (i.e., contiguous subsequences of the input) are computed by encoding their constituent tokens using bidirectional recurrent neural nets, and these ‘segment embeddings’ are used to define compatibility scores with output labels. These local compatibility scores are integrated using a global semiMarkov conditional random field. Both fully supervised training – in which segment boundaries and labels are observed – as well as partially supervised training – in which segment boundaries are latent – are straightforward. Experiments on handwriting recognition and joint Chinese word segmentation/POS tagging show that, compared to models that do not explicitly represent segments such as BIO tagging schemes and connectionist temporal classification (CTC), SRNNs obtain substantially higher accuracies. 
Segmented Linear Regression  Segmented linear regression with two segments separated by a breakpoint can be useful to quantify an abrupt change of the response function (Yr) of a varying influential factor (x). The breakpoint can be interpreted as a critical, safe, or threshold value beyond or below which (un)desired effects occur. The breakpoint can be important in decision making. cowbell 
Segmented Regression  Segmented regression, also known as piecewise regression or ‘brokenstick regression’, is a method in regression analysis in which the independent variable is partitioned into intervals and a separate line segment is fit to each interval. Segmented regression analysis can also be performed on multivariate data by partitioning the various independent variables. Segmented regression is useful when the independent variables, clustered into different groups, exhibit different relationships between the variables in these regions. The boundaries between the segments are breakpoints. Segmented linear regression is segmented regression whereby the relations in the intervals are obtained by linear regression. segmented 
Segmentlevel POlariTy annotations (SPOT) 
We consider the task of finegrained sentiment analysis from the perspective of multiple instance learning (MIL). Our neural model is trained on document sentiment labels, and learns to predict the sentiment of text segments, i.e. sentences or elementary discourse units (EDUs), without segmentlevel supervision. We introduce an attentionbased polarity scoring method for identifying positive and negative text snippets and a new dataset which we call SPOT (as shorthand for Segmentlevel POlariTy annotations) for evaluating MILstyle sentiment models like ours. Experimental results demonstrate superior performance against multiple baselines, whereas a judgement elicitation study shows that EDUlevel opinion extraction produces more informative summaries than sentencebased alternatives. 
SegReg  In statistics and data analysis the application software SegReg is a free and userfriendly tool for linear segmented regression analysis to determine the breakpoint where the relation between the dependent variable and the independent variable changes abruptly. Originally the method was developed for the analysis of the influence of soil salinity and depth of the watertable on growth of agricultural crops. However, it can be used for many other types of phenomena and relations, for example: · the change of nutrient contents in plants with time · the number of negative indicator responses at 30% upstream riparian harvest · phosphorus and flow duration on the Saline River 
Seldon  Seldon is an opensource predictive analytics platform. Our proven machine learning algorithms and highly scalable platform serve recommendations to hundreds of millions of people across some of the world’s leading media and ecommerce brands. Seldon VM contains the entire platform preconfigured in a virtual machine for you to get started quickly, to test Seldon with your service data and a movie recommender demo. 
Selective Clustering Annotated Using Modes of Projections (SCAMP) 
Selective clustering annotated using modes of projections (SCAMP) is a new clustering algorithm for data in $\mathbb{R}^p$. SCAMP is motivated from the point of view of nonparametric mixture modeling. Rather than maximizing a classification likelihood to determine cluster assignments, SCAMP casts clustering as a search and selection problem. One consequence of this problem formulation is that the number of clusters is $\textbf{not}$ a SCAMP tuning parameter. The search phase of SCAMP consists of finding subcollections of the data matrix, called candidate clusters, that obey shape constraints along each coordinate projection. An extension of the dip test of Hartigan and Hartigan (1985) is developed to assist the search. Selection occurs by scoring each candidate cluster with a preference function that quantifies prior belief about the mixture composition. Clustering proceeds by selecting candidates to maximize their total preference score. SCAMP concludes by annotating each selected cluster with labels that describe how clusterlevel statistics compare to certain datasetlevel quantities. SCAMP can be run multiple times on a single data matrix. Comparison of annotations obtained across iterations provides a measure of clustering uncertainty. Simulation studies and applications to real data are considered. A C++ implementation with R interface is $\href{https://…/scamp}{available\ online}$. 
SelectScript  We introduce a new declarative language called SELECTSCRIPT. As its name suggests, it is a scripting language inspired primarily by SQL and its relational algebra. It is intended to be used for complex queries on different kinds of world models. Scripts can be dynamically generated and executed, or embedded into the code of foreign programming languages. A first interpreter was therefore developed for Python. Adapting the ideas of languageoriented programming, which enables developers to create their own domainspecific language, we developed a language stub that can be easily adapted and extended to comply with any (discrete) robotic world model or robotic simulator. We will further show how simple SELECTstatements can be used to extract any kind of valuable information in various return formats, thereby going beyond traditional SQL capabilities. Reasoning in complex environments with the SelectScript declarative language 
Self Driving Data Curation  Past. Data curation – the process of discovering, integrating, and cleaning data – is one of the oldest data management problems. Unfortunately, it is still the most time consuming and least enjoyable work of data scientists. So far, successful data curation stories are mainly adhoc solutions that are either domainspecific (for example, ETL rules) or taskspecific (for example, entity resolution). Present. The power of current data curation solutions are not keeping up with the ever changing data ecosystem in terms of volume, velocity, variety and veracity, mainly due to the high human cost, instead of machine cost, needed for providing the adhoc solutions mentioned above. Meanwhile, deep learning is making strides in achieving remarkable successes in areas such as image recognition, natural language processing, and speech recognition. This is largely due to its ability to understanding features that are neither domainspecific nor taskspecific. Future. Data curation solutions need to keep the pace with the fastchanging data ecosystem, where the main hope is to devise domainagnostic and taskagnostic solutions. To this end, we start a new research project, called AutoDC, to unleash the potential of deep learning towards selfdriving data curation. We will discuss how different deep learning concepts can be adapted and extended to solve various data curation problems. We showcase some lowhanging fruits about the early encounters between deep learning and data curation happening in AutoDC. We believe that the directions pointed out by this work will not only drive AutoDC towards democratizing data curation, but also serve as a cornerstone for researchers and practitioners to move to a new realm of data curation solutions. 
Self Exciting Point Process (SEPP) 
A point process N is called selfexciting if cov(N(s,t),N(t,u))>0 for s<t<u where here, cov denotes the covariance of the two quantities. Intuitively, a process is selfexciting if the occurrence of past points makes the occurrence of future points more probable. SpatioTemporal Modeling with R: Point process prediction for mortals 
Self Organising Deltoids  selforganising deltoids dimension squeezing algorithm. This is a simple algorithm that tries to find reasonable positions in mdimensional space for a set of points in n dimensions (where m is smaller than n). It’s main usage is to visualise ndimensional data in 2 dimensions, but any dimensionality can be choosen.The algorithm simply takes a set of points in Ndimensions, and then gradually squeezes out the excess dimensions using the errors in internode distances to arrange the nodes in the reduced dimensional space. 
Self OtherModeling (SOM) 
We consider the multiagent reinforcement learning setting with imperfect information in which each agent is trying to maximize its own utility. The reward function depends on the hidden state (or goal) of both agents, so the agents must infer the other players’ hidden goals from their observed behavior in order to solve the tasks. We propose a new approach for learning in these domains: Self OtherModeling (SOM), in which an agent uses its own policy to predict the other agent’s actions and update its belief of their hidden state in an online manner. We evaluate this approach on three different tasks and show that the agents are able to learn better policies using their estimate of the other players’ hidden states, in both cooperative and adversarial settings. 
SelfAdaptive NeuroFuzzy Inference System (SANFIS) 
This paper presents a selfadaptive neurofuzzy inference system (SANFIS) that is capable of selfadapting and selforganizing its internal structure to acquire a parsimonious rulebase for interpreting the embedded knowledge of a system from the given training data set. A connectionist topology of fuzzy basis functions with their universal approximation capability is served as a fundamental SANFIS architecture that provides an elasticity to be extended to all existing fuzzy models whose consequent could be fuzzy term sets, fuzzy singletons, or functions of linear combination of input variables. Without a priori knowledge of the distribution of the training data set, a novel mappingconstrained agglomerative clustering algorithm is devised to reveal the true cluster configuration in a single pass for an initial SANFIS construction, estimating the location and variance of each cluster. Subsequently, a fast recursive linear/nonlinear leastsquares algorithm is performed to further accelerate the learning convergence and improve the system performance. Good generalization capability, fast learning convergence and compact comprehensible knowledge representation summarize the strength of SANFIS. Computer simulations for the Iris, Wisconsin breast cancer, and wine classifications show that SANFIS achieves significant improvements in terms of learning convergence, higher accuracy in recognition, and a parsimonious architecture. 
SelfAdaptive Neurofuzzy System (SANFS) 

SelfAdaptive Systems (SAS) 
Self Adaptive Software evaluates its own behavior and changes behavior when the evaluation indicates that it is not accomplishing what the software is intended to do, or when better functionality or performance is possible. 
SelfAttention Based Sequential Model (SASRec) 
Sequential dynamics are a key feature of many modern recommender systems, which seek to capture the `context’ of users’ activities on the basis of actions they have performed recently. To capture such patterns, two approaches have proliferated: Markov Chains (MCs) and Recurrent Neural Networks (RNNs). Markov Chains assume that a user’s next action can be predicted on the basis of just their last (or last few) actions, while RNNs in principle allow for longerterm semantics to be uncovered. Generally speaking, MCbased methods perform best in extremely sparse datasets, where model parsimony is critical, while RNNs perform better in denser datasets where higher model complexity is affordable. The goal of our work is to balance these two goals, by proposing a selfattention based sequential model (SASRec) that allows us to capture longterm semantics (like an RNN), but, using an attention mechanism, makes its predictions based on relatively few actions (like an MC). At each time step, SASRec seeks to identify which items are `relevant’ from a user’s action history, and use them to predict the next item. Extensive empirical studies show that our method outperforms various stateoftheart sequential models (including MC/CNN/RNNbased approaches) on both sparse and dense datasets. Moreover, the model is an order of magnitude more efficient than comparable CNN/RNNbased models. Visualizations on attention weights also show how our model adaptively handles datasets with various density, and uncovers meaningful patterns in activity sequences. 
SelfAttention Generative Adversarial Network (SAGAN) 
In this paper, we propose the SelfAttention Generative Adversarial Network (SAGAN) which allows attentiondriven, longrange dependency modeling for image generation tasks. Traditional convolutional GANs generate highresolution details as a function of only spatially local points in lowerresolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other. Furthermore, recent work has shown that generator conditioning affects GAN performance. Leveraging this insight, we apply spectral normalization to the GAN generator and find that this improves training dynamics. The proposed SAGAN achieves the stateoftheart results, boosting the best published Inception score from 36.8 to 52.52 and reducing Frechet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset. Visualization of the attention layers shows that the generator leverages neighborhoods that correspond to object shapes rather than local regions of fixed shape. 
SelfAttentive Neural Collaborative Filtering (SANCF) 
The dominant, stateoftheart collaborative filtering (CF) methods today mainly comprises neural models. In these models, deep neural networks, e.g.., multilayered perceptrons (MLP), are often used to model nonlinear relationships between user and item representations. As opposed to shallow models (e.g., factorizationbased models), deep models generally provide a greater extent of expressiveness, albeit at the expense of impaired/restricted information flow. Consequently, the performance of most neural CF models plateaus at 34 layers, with performance stagnating or even degrading when increasing the model depth. As such, the question of how to train really deep networks in the context of CF remains unclear. To this end, this paper proposes a new technique that enables training neural CF models all the way up to 20 layers and beyond. Our proposed approach utilizes a new hierarchical selfattention mechanism that learns introspective intrafeature similarity across all the hidden layers of a standard MLP model. All in all, our proposed architecture, SANCF (SelfAttentive Neural Collaborative Filtering) is a densely connected selfmatching model that can be trained up to 24 layers without plateauing, achieving wide performance margins against its competitors. On several popular benchmark datasets, our proposed architecture achieves up to an absolute improvement of 23%58% and 1.3x to 2.8x fold improvement in terms of nDCG@10 and Hit Ratio (HR@10) scores over several strong neural CF baselines. 
SelfConcordant Regularization in Bandit Learning (SCRiBLe) 
SCRiBLe (SelfConcordant Regularization in Bandit Learning) created by Abernethy et. al.\cite{abernethy}. The SCRiBLe setup and algorithm yield a $O(\sqrt{T})$ regret bound and polynomial run time complexity bound on the dimension of the input space. 
SelfControlled Case Series Model (SCCS) 
The selfcontrolled case series (SCCS) method is an alternative study method for investigating the association between a transient exposure and an adverse event. The method was developed to study adverse reactions to vaccines. The method uses only cases, no separate controls are required as the cases act as their own controls. Each case’s given observation time is divided into control and risk periods. Risk periods are defined during or after the exposure. Then the method finds a relative incidence, that is, the incidence in risk periods relative to the incidence in control periods. Timevarying confounding factors such as age can be allowed for by dividing up the observation period further into age categories. An advantage of the method is that confounding factors that do not vary with time, such as genetics, location, socioeconomic status are controlled for implicitly. SCCS 
SelfExciting Model of Information Cascades (SEISMIC) 
Here we focus on predicting the final size of an information cascade spreading through a network. We develop a statistical model based on the theory of selfexciting point processes. A point process indexed by time is called a counting process when it counts the number of instances (reshares, in our case) over time. In contrast to homogeneous Poisson processes which assume constant intensity over time, selfexciting processes assume that all the previous instances (i.e., reshares) influence the future evolution of the process. Selfexciting point processes are frequently used to model ‘rich get richer’ phenomena. They are ideal for modeling information cascades in networks because every new reshare of a post not only increases its cumulative reshare count by one, but also exposes new followers who may further reshare the post. We develop SEISMIC (SelfExciting Model of Information Cascades) for predicting the total number of reshares of a given post. In our model, each post is fully characterized by its infectiousness which measures the reshare probability. We allow the infectiousness to vary freely over time in agreement with the observation that the infectiousness can drop as the content gets stale. Moreover, our model is able to identify at each time point whether the cascade is in the supercritical or subcritical state, based on whether its infectiousness is above or below a critical threshold. A cascade in the supercritical state is going through an ‘explosion’ period and its final size cannot be predicted accurately at the current time. On the contrary, a cascade is tractable if it is in subcritical state. In this case, we are able to predict its ultimate popularity accurately by modeling the future cascading behavior by a Galton Watson tree. Our SEISMIC approach makes several contributions: Generative model: SEISMIC imposes no parametric assumptions and requires no expensive feature engineering. Moreover, as complete social network structure may be hard to obtain, SEISMIC assumes minimal knowledge of the network: The only required input is the time history of reshares and the degrees of the resharing nodes. seismic 
SelfExciting Point Process Model  seismic 
SelfImitation Learning (SIL) 
This paper proposes SelfImitation Learning (SIL), a simple offpolicy actorcritic algorithm that learns to reproduce the agent’s past good decisions. This algorithm is designed to verify our hypothesis that exploiting past good experiences can indirectly drive deep exploration. Our empirical results show that SIL significantly improves advantage actorcritic (A2C) on several hard exploration Atari games and is competitive to the stateoftheart countbased exploration methods. We also show that SIL improves proximal policy optimization (PPO) on MuJoCo tasks. 
Selfless Sequential Learning  Sequential learning studies the problem of learning tasks in a sequence with restricted access to only the data of the current task. In the setting with a fixed model capacity, the learning process should not be selfish and account for later tasks to be added and therefore aim at utilizing a minimum number of neurons, leaving enough capacity for future needs. We explore different regularization strategies and activation functions that could lead to less interference between the different tasks. We show that learning a sparse representation is more beneficial for sequential learning than encouraging parameter sparsity regardless of their corresponding neurons. We particularly propose a novel regularizer that encourages representation sparsity by means of neural inhibition. It results in few active neurons which in turn leaves more free neurons to be utilized by upcoming tasks. We combine our regularizer with stateoftheart lifelong learning methods that penalize changes on important previously learned parts of the network. We show that increased sparsity translates in a performance improvement on the different tasks that are learned in a sequence. 
SelfNormalizing Neural Network  Deep Learning has revolutionized vision via convolutional neural networks (CNNs) and natural language processing via recurrent neural networks (RNNs). However, success stories of Deep Learning with standard feedforward neural networks (FNNs) are rare. FNNs that perform well are typically shallow and, therefore cannot exploit many levels of abstract representations. We introduce selfnormalizing neural networks (SNNs) to enable highlevel abstract representations. While batch normalization requires explicit normalization, neuron activations of SNNs automatically converge towards zero mean and unit variance. The activation function of SNNs are ‘scaled exponential linear units’ (SELUs), which induce selfnormalizing properties. Using the Banach fixedpoint theorem, we prove that activations close to zero mean and unit variance that are propagated through many network layers will converge towards zero mean and unit variance — even under the presence of noise and perturbations. This convergence property of SNNs allows to (1) train deep networks with many layers, (2) employ strong regularization, and (3) to make learning highly robust. Furthermore, for activations not close to unit variance, we prove an upper and lower bound on the variance, thus, vanishing and exploding gradients are impossible. We compared SNNs on (a) 121 tasks from the UCI machine learning repository, on (b) drug discovery benchmarks, and on (c) astronomy tasks with standard FNNs and other machine learning methods such as random forests and support vector machines. SNNs significantly outperformed all competing FNN methods at 121 UCI tasks, outperformed all competing methods at the Tox21 dataset, and set a new record at an astronomy data set. The winning SNN architectures are often very deep. Implementations are available at: github.com/bioinfjku/SNNs. 
SelfOrganizing Map (SOM) 
A selforganizing map (SOM) or selforganizing feature map (SOFM) is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a lowdimensional (typically twodimensional), discretized representation of the input space of the training samples, called a map. Selforganizing maps are different from other artificial neural networks in the sense that they use a neighborhood function to preserve the topological properties of the input space. This makes SOMs useful for visualizing lowdimensional views of highdimensional data, akin to multidimensional scaling. The model was first described as an artificial neural network by the Finnish professor Teuvo Kohonen, and is sometimes called a Kohonen map or network. 
SelfOrganizing Systems  The term Selforganizing Systems refers to a class of systems that are able to change their internal structure and their function in response to external circumstances. By selforganization it is understood that elements of a system are able to manipulate or organize other elements of the same system in a way that stabilizes either structure or function of the whole against external fluctuations. Over the last decades a variety of features have been identified as typical for selforganizing systems. Not all of these features are present in all systems able to selforganize. Selforganizing systems are dynamic, nondeterministic, open, exist far from equilibrium and sometimes employ autocatalytic amplification of fluctuations. Often, they are characterized by multiple timescales of their internal and/or external interactions, they possess a hierarchy of structural and/or functional levels and they are able to react to external input in a variety of ways. Many selforganizing systems are nonteleological, i.e. they do not have a specific purpose except their own existence. As a consequence, selfmaintenance is an important function of many selforganizing systems. Most of these systems are complex and use reduncancy to achieve resilience against external pertubation tendencies. Selforganizing systems have been discovered in nature, both in the nonliving (galaxies, stars) and the living world (cells, organisms, ecosystems), they have been found in manmade systems (societies, economies), and they have been identified in the world of ideas (world views, scientific believes, norm systems). Self Organizing System 
SelfPaced Learning (SPL) 
It is known that Boosting can be interpreted as a gradient descent technique to minimize an underlying loss function. Specifically, the underlying loss being minimized by the traditional AdaBoost is the exponential loss, which is proved to be very sensitive to random noise/outliers. Therefore, several Boosting algorithms, e.g., LogitBoost and SavageBoost, have been proposed to improve the robustness of AdaBoost by replacing the exponential loss with some designed robust loss functions. In this work, we present a new way to robustify AdaBoost, i.e., incorporating the robust learning idea of Selfpaced Learning (SPL) into Boosting framework. Specifically, we design a new robust Boosting algorithm based on SPL regime, i.e., SPLBoost, which can be easily implemented by slightly modifying offtheshelf Boosting packages. Extensive experiments and a theoretical characterization are also carried out to illustrate the merits of the proposed SPLBoost. 
SelfPaced MultiTask Clustering (SPMTC) 
Multitask clustering (MTC) has attracted a lot of research attentions in machine learning due to its ability in utilizing the relationship among different tasks. Despite the success of traditional MTC models, they are either easy to stuck into local optima, or sensitive to outliers and noisy data. To alleviate these problems, we propose a novel selfpaced multitask clustering (SPMTC) paradigm. In detail, SPMTC progressively selects data examples to train a series of MTC models with increasing complexity, thus highly decreases the risk of trapping into poor local optima. Furthermore, to reduce the negative influence of outliers and noisy data, we design a soft version of SPMTC to further improve the clustering performance. The corresponding SPMTC framework can be easily solved by an alternating optimization method. The proposed model is guaranteed to converge and experiments on real data sets have demonstrated its promising results compared with stateoftheart multitask clustering methods. 
SelfPaced Sparse Coding (SPSC) 
Sparse coding (SC) is attracting more and more attention due to its comprehensive theoretical studies and its excellent performance in many signal processing applications. However, most existing sparse coding algorithms are nonconvex and are thus prone to becoming stuck into bad local minima, especially when there are outliers and noisy data. To enhance the learning robustness, in this paper, we propose a unified framework named SelfPaced Sparse Coding (SPSC), which gradually include matrix elements into SC learning from easy to complex. We also generalize the selfpaced learning schema into different levels of dynamic selection on samples, features and elements respectively. Experimental results on realworld data demonstrate the efficacy of the proposed algorithms. 
SelfService Semantic Suite (S4) 
The SelfService Semantic Suite (S4) provides a set of services for lowcost, ondemand text analytics and metadata management on the cloud. S4 provides the following services: · Text analytics services for news, Life Science and social media that allow you to extract valuable meaning and insights used to manage your business · Ondemand, fast and reliable access to Linked Datasets, such as DBpedia, Freebase and GeoNames. These datasets provide facts you can use to enhance your semantic analysis. · A selfmanaged or fullymanaged scalable RDF database available asaservice, so that you can search and update semantic facts loaded from Linked Open Data or your own documents 
SelfTaught Learning  Selftaught learning is a new paradigm in machine learning introduced by Stanford researchers in 2007. ‘Selftaught Learning: Transfer Learning from Unlabeled Data’. Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR.</ref>. The full paper can be found here. It builds on ideas from existing supervised, semisupervised and transfer learning algorithms. The differences between these methods depend on the usage of labeled and unlabeled data (Figure 1): • Supervised Learning – All data is labeled and of the same type (shares the same class labels). • Semisupervised learning – Only some of the data is labeled but it is all of the same class. One drawback is that acquiring unlabeled data of the same class is often difficult and/or expensive. • Transfer learning – All data is labeled but some is of another type (i.e. has class labels that do not apply to data set that we wish to classify). Selftaught learning combines the latter two ideas. It uses labeled data belonging to the desired classes and unlabeled data from other, somehow similar, classes. It is important to emphasize that the unlabeled data need not belong to the class labels we wish to assign, as long as it is related. This fact distinguishes it from semisupervised learning. Since it uses unlabeled data from new classes, it can be thought of as semisupervised transfer learning. Selftaught learning is a technique that uses a large number of unlabeled data as source samples to improve the task performance on target samples. Compared with other transfer learning techniques, selftaught learning can be applied to a broader set of scenarios due to the loose restrictions on source data. However, knowledge transferred from source samples that are not sufficiently related to the target domain may negatively influence the target learner, which is referred to as negative transfer. Autoencoder Based Sample Selection for SelfTaught Learning 
SelfTaught Support Vector Machine  In this paper, a new approach for classification of target task using limited labeled target data as well as enormous unlabeled source data is proposed which is called selftaught learning. The target and source data can be drawn from different distributions. In the previous approaches, covariate shift assumption is considered where the marginal distributions p(x) change over domains and the conditional distributions p(yx) remain the same. In our approach, we propose a new objective function which simultaneously learns a common space T(.) where the conditional distributions over domains p(T(x)y) remain the same and learns robust SVM classifiers for target task using both source and target data in the new representation. Hence, in the proposed objective function, the hidden label of the source data is also incorporated. We applied the proposed approach on Caltech256, MSRC+LMO datasets and compared the performance of our algorithm to the available competing methods. Our method has a superior performance to the successful existing algorithms. 
Semantic Adversarial Deep Learning  Fueled by massive amounts of data, models produced by machinelearning (ML) algorithms, especially deep neural networks, are being used in diverse domains where trustworthiness is a concern, including automotive systems, finance, health care, natural language processing, and malware detection. Of particular concern is the use of ML algorithms in cyberphysical systems (CPS), such as selfdriving cars and aviation, where an adversary can cause serious consequences. However, existing approaches to generating adversarial examples and devising robust ML algorithms mostly ignore the semantics and context of the overall system containing the ML component. For example, in an autonomous vehicle using deep learning for perception, not every adversarial example for the neural network might lead to a harmful consequence. Moreover, one may want to prioritize the search for adversarial examples towards those that significantly modify the desired semantics of the overall system. Along the same lines, existing algorithms for constructing robust ML algorithms ignore the specification of the overall system. In this paper, we argue that the semantics and specification of the overall system has a crucial role to play in this line of research. We present preliminary research results that support this claim. 
Semantic Analysis  Semantic analysis may refer to: · Semantic analysis (compilers) · Semantic analysis (machine learning) · Semantic analysis (knowledge representation) · Semantic analysis (linguistics) 
Semantic Analysis Approach for Recommendation (SAR) 
Recommendation system is a common demand in daily life and matrix completion is a widely adopted technique for this task. However, most matrix completion methods lack semantic interpretation and usually result in weaksemantic recommendations. To this end, this paper proposes a {\bf S}emantic {\bf A}nalysis approach for {\bf R}ecommendation systems \textbf{(SAR)}, which applies a twolevel hierarchical generative process that assigns semantic properties and categories for user and item. SAR learns semantic representations of users/items merely from user ratings on items, which offers a new path to recommendation by semantic matching with the learned representations. Extensive experiments demonstrate SAR outperforms other stateoftheart baselines substantially. 
Semantic Analytics  Semantic analytics is the use of ontologies to analyze content in web resources. This field of research combines text analytics and Semantic Web technologies like RDF. 
Semantic Analytics Visualization (SAV) 
In this paper we present a new tool for semantic analytics through 3D visualization called ‘Semantic Analytics Visualization’ (SAV). It has the capability for visualizing ontologies and metadata including annotated webdocuments, images, and digital media such as audio and video clips in a synthetic threedimensional semiimmersive environment. More importantly, SAV supports visual semantic analytics, whereby an analyst can interactively investigate complex relationships between heterogeneous information. The tool is built using Virtual Reality technology which makes SAV a highly interactive system. The backend of SAV consists of a Semantic Analytics system that supports query processing and semantic association discovery. Using a virtual laser pointer, the user can select nodes in the scene and either play digital media, display images, or load annotated web documents. SAV can also display the ranking of web documents as well as the ranking of paths (sequences of links). SAV supports dynamic specification of subqueries of a given graph and displays the results based on ranking information, which enables the users to find, analyze and comprehend the information presented quickly and accurately. 
Semantic Correspondences Convolutional Neural Network (SCNet) 
This paper addresses the problem of establishing semantic correspondences between images depicting different instances of the same object or scene category. Previous approaches focus on either combining a spatial regularizer with handcrafted features, or learning a correspondence model for appearance only. We propose instead a convolutional neural network architecture, called SCNet, for learning a geometrically plausible model for semantic correspondence. SCNet uses region proposals as matching primitives, and explicitly incorporates geometric consistency in its loss function. It is trained on image pairs obtained from the PASCAL VOC 2007 keypoint dataset, and a comparative evaluation on several standard benchmarks demonstrates that the proposed approach substantially outperforms both recent deep learning architectures and previous methods based on handcrafted features. 
Semantic Differential  Semantic differential is a type of a rating scale designed to measure the connotative meaning of objects, events, and concepts. The connotations are used to derive the attitude towards the given object, event or concept. Osgood’s semantic differential was an application of his more general attempt to measure the semantics or meaning of words, particularly adjectives, and their referent concepts. The respondent is asked to choose where his or her position lies, on a scale between two bipolar adjectives (for example: “AdequateInadequate”, “GoodEvil” or “ValuableWorthless”). Semantic differentials can be used to measure opinions, attitudes and values on a psychometrically controlled scale. 
Semantic Entity Retrieval Toolkit (SERT) 
Unsupervised learning of lowdimensional, semantic representations of words and entities has recently gained attention. In this paper we describe the Semantic Entity Retrieval Toolkit (SERT) that provides implementations of our previously published entity representation models. The toolkit provides a unified interface to different representation learning algorithms, finegrained parsing configuration and can be used transparently with GPUs. In addition, users can easily modify existing models or implement their own models in the framework. After model training, SERT can be used to rank entities according to a textual query and extract the learned entity/word representation for use in downstream algorithms, such as clustering or recommendation. 
Semantic Evaluation (SemEval) 
SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems; it evolved from the Senseval word sense evaluation series. The evaluations are intended to explore the nature of meaning in language. While meaning is intuitive to humans, transferring those intuitions to computational analysis has proved elusive. This series of evaluations is providing a mechanism to characterize in more precise terms exactly what is necessary to compute in meaning. As such, the evaluations provide an emergent mechanism to identify the problems and solutions for computations with meaning. These exercises have evolved to articulate more of the dimensions that are involved in our use of language. They began with apparently simple attempts to identify word senses computationally. They have evolved to investigate the interrelationships among the elements in a sentence (e.g., semantic role labeling), relations between sentences (e.g., coreference), and the nature of what we are saying (semantic relations and sentiment analysis). The purpose of the SemEval exercises and SENSEVAL is to evaluate semantic analysis systems. ‘Semantic Analysis’ refers to a formal analysis of meaning, and ‘computational’ refer to approaches that in principle support effective implementation. The first three evaluations, Senseval1 through Senseval3, were focused on word sense disambiguation, each time growing in the number of languages offered in the tasks and in the number of participating teams. Beginning with the fourth workshop, SemEval2007 (SemEval1), the nature of the tasks evolved to include semantic analysis tasks outside of word sense disambiguation. Triggered by the conception of the *SEM conference, the SemEval community had decided to hold the evaluation workshops yearly in association with the *SEM conference. It was also the decision that not every evaluation task will be run every year, e.g. none of the WSD tasks were running in the SemEval2012 workshop. 
Semantic Feature Engine  We propose learning flexible but interpretable functions that aggregate a variablelength set of permutationinvariant feature vectors to predict a label. We use a deep lattice network model so we can architect the model structure to enhance interpretability, and add monotonicity constraints between inputsandoutputs. We then use the proposed set function to automate the engineering of dense, interpretable features from sparse categorical features, which we call semantic feature engine. Experiments on realworld data show the achieved accuracy is similar to deep sets or deep neural networks, and is easier to debug and understand. 
Semantic Integration  Semantic integration is the process of interrelating information from diverse sources, for example calendars and to do lists, email archives, presence information (physical, psychological, and social), documents of all sorts, contacts (including social graphs), search results, and advertising and marketing relevance derived from them. In this regard, semantics focuses on the organization of and action upon information by acting as an intermediary between heterogeneous data sources, which may conflict not only by structure but also context or value. 
Semantic Labeling  Semantic labeling is the process of mapping attributes in data sources to classes in an ontology and is a necessary step in heterogeneous data integration. 
Semantic Layer  A semantic layer is a business representation of corporate data that helps end users access data autonomously using common business terms. Developed and patented by Business Objects, it maps complex data into familiar business terms such as product, customer, or revenue to offer a unified, consolidated view of data across the organization. By using common business terms, rather than data language, to access, manipulate, and organize information, it simplifies the complexity of business data. These business terms are stored as objects in a universe, accessed through business views. Universes enable business users to access and analyze data stored in a relational database and OLAP cubes. This is claimed to be core business intelligence (BI) technology that frees users from IT while ensuring correct results. Business Views is a multitier system that is designed to enable companies to build comprehensive and specific business objects that help report designers and end users access the information they require. Business Views is intended to enable people to add the necessary business context to their data islands and link them into a single organized Business View for their organization. Semantic layer maps tables to classes and columns to objects. 
Semantic Learning Machine (SLM) 
In iterative supervised learning algorithms it is common to reach a point in the search where no further induction seems to be possible with the available data. If the search is continued beyond this point, the risk of overfitting increases significantly. Following the recent developments in inductive semantic stochastic methods, this paper studies the feasibility of using information gathered from the semantic neighborhood to decide when to stop the search. Two semantic stopping criteria are proposed and experimentally assessed in Geometric Semantic Genetic Programming (GSGP) and in the Semantic Learning Machine (SLM) algorithm (the equivalent algorithm for neural networks). The experiments are performed on realworld highdimensional regression datasets. The results show that the proposed semantic stopping criteria are able to detect stopping points that result in a competitive generalization for both GSGP and SLM. This approach also yields computationally efficient algorithms as it allows the evolution of neural networks in less than 3 seconds on average, and of GP trees in at most 10 seconds. The usage of the proposed semantic stopping criteria in conjunction with the computation of optimal mutation/learning steps also results in small trees and neural networks. 
Semantic Lexicon  A semantic lexicon is a dictionary of words labeled with semantic classes so associations can be drawn between words that have not previously been encountered: it is a dictionary with a semantic network. 
Semantic Matching  Semantic matching is a technique used in computer science to identify information which is semantically related. Given any two graphlike structures, e.g. classifications, taxonomies database or XML schemas and ontologies, matching is an operator which identifies those nodes in the two structures which semantically correspond to one another. For example, applied to file systems it can identify that a folder labeled ‘car’ is semantically equivalent to another folder ‘automobile’ because they are synonyms in English. This information can be taken from a linguistic resource like WordNet. In the recent years many of them have been offered. SMatch is an example of a semantic matching operator. It works on lightweight ontologies, namely graph structures where each node is labeled by a natural language sentence, for example in English. These sentences are translated into a formal logical formula (according to an artificial unambiguous language) codifying the meaning of the node taking into account its position in the graph. For example, in case the folder ‘car’ is under another folder ‘red’ we can say that the meaning of the folder ‘car’ is ‘red car’ in this case. This is translated into the logical formula ‘red AND car’. The output of SMatch is a set of semantic correspondences called mappings attached with one of the following semantic relations: disjointness (⊥), equivalence (≡), more specific (⊑) and less specific (⊒). In our example the algorithm will return a mapping between ‘car’ and ‘automobile’ attached with an equivalence relation. Information semantically matched can also be used as a measure of relevance through a mapping of nearterm relationships. Such use of SMatch technology is prevalent in the career space where it is used to gauge depth of skills through relational mapping of information found in applicant resumes. Semantic matching represents a fundamental technique in many applications in areas such as resource discovery, data integration, data migration, query translation, peer to peer networks, agent communication, schema and ontology merging. It using is also being investigated in other areas such as event processing. In fact, it has been proposed as a valid solution to the semantic heterogeneity problem, namely managing the diversity in knowledge. Interoperability among people of different cultures and languages, having different viewpoints and using different terminology has always been a huge problem. Especially with the advent of the Web and the consequential information explosion, the problem seems to be emphasized. People face the concrete problem to retrieve, disambiguate and integrate information coming from a wide variety of sources. Semantic Matching Against a Corpus: New Applications and Methods 
Semantic Memory  Semantic memory refers to a portion of longterm memory that processes ideas and concepts that are not drawn from personal experience. Semantic memory includes things that are common knowledge, such as the names of colors, the sounds of letters, the capitals of countries and other basic facts acquired over a lifetime. The concept of semantic memory is fairly new. It was introduced in 1972 as the result of collaboration between Endel Tulving of the University of Toronto and Wayne Donaldson of the University of New Brunswick on the impact of organization in human memory. Tulving outlined the separate systems of conceptualization of episodic and semantic memory in his book, ‘Elements of Episodic Memory.’ He noted that semantic and episodic differ in how they operate and the types of information they process. 
Semantic Network  A semantic network, or frame network is a knowledge base that represents semantic relations between concepts in a network. This is often used as a form of knowledge representation. It is a directed or undirected graph consisting of vertices, which represent concepts, and edges, which represent semantic relations between concepts, mapping or connecting semantic fields. Typical standardized semantic networks are expressed as semantic triples. Semantic networks are in use in various Natural Language Processing applications. A Knowledge Grapg is a Semantic Network. 
Semantic Parsing  Semantic parsing can be defined as the process of mapping natural language sentences into a machine interpretable, formal representation of its meaning. 
Semantic Role Labeling (SRL) 
In the field of artificial intelligence, Semantic role labeling, sometimes also called shallow semantic parsing, is a process in natural language processing that assigns labels to words or phrases in a sentence that indicate their semantic role in the sentence, such as that of an agent, goal, or result. It consists of the detection of the semantic arguments associated with the predicate or verb of a sentence and their classification into their specific roles. For example, given a sentence like ‘Mary sold the book to John’, the task would be to recognize the verb ‘to sell’ as representing the predicate, ‘Mary’ as representing the seller (agent), ‘the book’ as representing the goods (theme), and ‘John’ as representing the recipient. This is an important step towards making sense of the meaning of a sentence. A semantic analysis of this sort is at a lowerlevel of abstraction than a syntax tree, i.e. it has more categories, thus groups fewer clauses in each category. For instance, ‘the book belongs to me’ would need two labels such as ‘possessed’ and ‘possessor’ whereas ‘the book was sold to John’ would need two other labels such as ‘goal’ (or ‘theme’) and ‘receiver’ (or ‘recipient’) even though these two clauses would be very similar as far as ‘subject’ and ‘object’ functions are concerned. Towards SemiSupervised Learning for Deep Semantic Role Labeling 
Semantic Tagging  Semantic Tagging is the process of associating an element from an ontology with some document, usually a computer file or website. Semantic tagging serves the goal of describing a document in order to facilitate better retrieval later on. Semantic tagging also helps to integrate the tagged document with other resources that are also related to the same ontology. Semantic tagging is a special kind of annotation. What can we learn from Semantic Tagging? 
Semantic Textual Similarity (STS) 
The goal of the’ Semantic Textual Similarity (STS)’ task is to create a unified framework for the evaluation of semantic textual similarity modules and to characterize their impact on NLP applications.’ STS measures the degree of semantic equivalence. We are proposing the STS task as an attempt at creating a unified framework that allows for an extrinsic evaluation of multiple semantic components that otherwise have historically tended to be evaluated independently and without characterization of impact on NLP applications. STS is related to both Textual Entailment (TE) and – Paraphrase, but differs in a number of ways and it is more directly applicable to a number of NLP tasks. ‘ STS is ‘ different from TE inasmuch as it assumes bidirectional graded equivalence between the pair of textual snippets. In the case of TE the equivalence is directional, e.g. a car is a vehicle, but a vehicle is not necessarily a car. STS also differs from both TE and Paraphrase in that, rather than being a binary yes/no decision (e.g. a vehicle is not a car), STS is a graded similarity notion (e.g. a vehicle and a car are more similar than a wave and a car). This graded bidirectional nature of STS is useful for NLP tasks such as MT evaluation, information extraction, question answering, and summarization. Current textual similarity systems are limited in the scope of similarity they can address, mostly lexical and syntactic similarity. Some other linguistic phenomena have rarely been addressed in isolated efforts, e.g. metaphorical or idiomatic language (John spilled his guts to Mary, vs. John told Mary all about his stories/life), scoping and underspecification (Every representative of the company saw every sample), sentences where the structure is very divergent (The annihilation of Rome in 2000 BC was incurred by an insurgency of the slaves. Vs. The slaves’ revolution 2 millennia before Christ destroyed the capital of the Roman Empire.), and various modality phenomena such as committed belief, permission or negation. The STS task would like to foster joint research efforts on these, to date, fragmented areas. http://…/Thesis_Screen.pdf 
Semantic Vector Network (SeVeN) 
We present SeVeN (Semantic Vector Networks), a hybrid resource that encodes relationships between words in the form of a graph. Different from traditional semantic networks, these relations are represented as vectors in a continuous vector space. We propose a simple pipeline for learning such relation vectors, which is based on word vector averaging in combination with an ad hoc autoencoder. We show that by explicitly encoding relational information in a dedicated vector space we can capture aspects of word meaning that are complementary to what is captured by word embeddings. For example, by examining clusters of relation vectors, we observe that relational similarities can be identified at a more abstract level than with traditional word vector differences. Finally, we test the effectiveness of semantic vector networks in two tasks: measuring word similarity and neural text categorization. SeVeN is available at bitbucket.org/luisespinosa/seven. 
Semantic Vector Spaces (SVS) 
svs 
Semantic View Selection  An understanding of the nature of objects could help robots to solve both highlevel abstract tasks and improve performance at lowerlevel concrete tasks. Although deep learning has facilitated progress in image understanding, a robot’s performance in problems like object recognition often depends on the angle from which the object is observed. Traditionally, robot sorting tasks rely on a fixed topdown view of an object. By changing its viewing angle, a robot can select a more semantically informative view leading to better performance for object recognition. In this paper, we introduce the problem of semantic view selection, which seeks to find good camera poses to gain semantic knowledge about an observed object. We propose a conceptual formulation of the problem, together with a solvable relaxation based on clustering. We then present a new image dataset consisting of around 10k images representing various views of 144 objects under different poses. Finally we use this dataset to propose a first solution to the problem by training a neural network to predict a ‘semantic score’ from a top view image and camera pose. The views predicted to have higher scores are then shown to provide better clustering results than fixed topdown views. 
Semantics  The investigation of interpretations of a logical calculus (a formal axiomatic theory), of the study of the sense and meaning of constructions in formal language theory, and of the methods of understanding its logical connectives and formulas. Semantics studies the precise description and definition of such concepts as ‘truth’ , ‘definability’ , ‘denotation’ , at least in the context of a formal language. In a slightly narrower sense, by the semantics of a formalized language one means a system of agreements that determine the understanding of the formulas of the language, and that define the conditions for these formulas to be true. The semantics of logical connectives in classical and intuitionistic logic has an extensional nature: that is, the truth of a complex statement is determined only by the truth character of the expressions that form it. In other classical logics – for example, relevance logics – the meaningful content of concepts can be taken into account (such logics are called intensional). E.g., in logics of this kind not all true expressions are necessarily equivalent. 
Semblance  Kernel methods provide a principled approach for detecting nonlinear relations using well understood linear algorithms. In exploratory data analyses when the underlying structure of the data’s probability space is unclear, the choice of kernel is often arbitrary. Here, we present a novel kernel, Semblance, on a probability feature space. The advantage of Semblance lies in its distribution free formulation and its ability to detect niche features by placing greater emphasis on similarity between observation pairs that fall at the tail ends of a distribution, as opposed to those that fall towards the mean. We prove that Semblance is a valid Mercer kernel and illustrate its applicability through simulations and real world examples. 
SemiAutoregressive Transformer (SAT) 
Existing approaches to neural machine translation are typically autoregressive models. While these models attain stateoftheart translation quality, they are suffering from low parallelizability and thus slow at decoding long sequences. In this paper, we propose a novel model for fast sequence generation — the semiautoregressive Transformer (SAT). The SAT keeps the autoregressive property in global but relieves in local and thus are able to produce multiple successive words in parallel at each time step. Experiments conducted on EnglishGerman and ChineseEnglish translation tasks show that the SAT achieves a good balance between translation quality and decoding speed. On WMT’14 EnglishGerman translation, the SAT achieves 5.58$\times$ speedup while maintaining 88\% translation quality, significantly better than the previous nonautoregressive methods. When produces two words at each time step, the SAT is almost lossless (only 1\% degeneration in BLEU score). 
Semidefinite Programming (SDP) 
In semidefinite programming (SDP), some of the most commonly used preprocessing techniques for exploiting sparsity result in nontrivial numerical issues. We show that further preprocessing, based on the so called facial reduction, can resolve the issues. In computational experiments on SDP instances from the SDPLib, a benchmark, and structured instances from polynomial and binary quadratic optimisation, we show that combining the twostep preprocessing with a standard interiorpoint method outperforms the interior point method, with or without the traditional preprocessing, by a considerable margin. 
SemiImplicit Variational Inference (SIVI) 
Semiimplicit variational inference (SIVI) is introduced to expand the commonly used analytic variational distribution family, by mixing the variational parameter with a flexible distribution. This mixing distribution can assume any density function, explicit or not, as long as independent random samples can be generated via reparameterization. Not only does SIVI expand the variational family to incorporate highly flexible variational distributions, including implicit ones that have no analytic density functions, but also sandwiches the evidence lower bound (ELBO) between a lower bound and an upper bound, and further derives an asymptotically exact surrogate ELBO that is amenable to optimization via stochastic gradient ascent. With a substantially expanded variational family and a novel optimization algorithm, SIVI is shown to closely match the accuracy of MCMC in inferring the posterior in a variety of Bayesian inference tasks. 
semiMapReduce  Graph problems are troublesome when it comes to MapReduce. Typically, to be able to design algorithms that make use of the advantages of MapReduce, assumptions beyond what the model imposes, such as the {\em density} of the input graph, are required. In a recent shift, a simple and robust model of MapReduce for graph problems, where the space per machine is set to be $O(V)$ has attracted considerable attention. We term this model {\em semiMapReduce}, or in short, semiMPC, and focus on its computational power. In this short note, we show through a set of simulation methods that semiMPC is, perhaps surprisingly, almost equivalent to the congested clique model of distributed computing. However, semiMPC, in addition to round complexity, incorporates another practically important dimension to optimize: the number of machines. Furthermore, we show that algorithms in other distributed computing models, such as CONGEST, can be simulated to run in the same number of rounds of semiMPC while also using an optimal number of machines. We later show the implications of these simulation methods by obtaining improved algorithms for these models using the recent algorithms that have been developed. 
SemiOrthogonal NonNegative Matrix Factorization (semiorthogonal NMF) 
Nonnegative Matrix Factorization (NMF) is a popular clustering and dimension reduction method by decomposing a nonnegative matrix into the product of two lower dimension matrices composed of basis vectors. In this paper, we propose a semiorthogonal NMF method that enforces one of the matrices to be orthogonal with mixed signs, thereby guarantees the rank of the factorization. Our method preserves strict orthogonality by implementing the Cayley transformation to force the solution path to be exactly on the Stiefel manifold, as opposed to the approximated orthogonality solutions in existing literature. We apply a line search update scheme along with an SVDbased initialization which produces a rapid convergence of the algorithm compared to other existing approaches. In addition, we present formulations of our method to incorporate both continuous and binary design matrices. Through various simulation studies, we show that our model has an advantage over other NMF variations regarding the accuracy of the factorization, rate of convergence, and the degree of orthogonality while being computationally competitive. We also apply our method to a textmining data on classifying triage notes, and show the effectiveness of our model in reducing classification error compared to the conventional bagofwords model and other alternative matrix factorization approaches. 
SemiSupervised Active Clustering (SSAC) 
We propose a framework for SemiSupervised Active Clustering framework (SSAC), where the learner is allowed to interact with a domain expert, asking whether two given instances belong to the same cluster or not. We study the query and computational complexity of clustering in this framework. We consider a setting where the expert conforms to a centerbased clustering with a notion of margin. We show that there is a trade off between computational complexity and query complexity; We prove that for the case of $k$means clustering (i.e., when the expert conforms to a solution of $k$means), having access to relatively few such queries allows efficient solutions to otherwise NP hard problems. In particular, we provide a probabilistic polynomialtime (BPP) algorithm for clustering in this setting that asks $O\big(k^2\log k + k\log n)$ samecluster queries and runs with time complexity $O\big(kn\log n)$ (where $k$ is the number of clusters and $n$ is the number of instances). The success of the algorithm is guaranteed for data satisfying margin conditions under which, without queries, we show that the problem is NP hard. We also prove a lower bound on the number of queries needed to have a computationally efficient clustering algorithm in this setting. Approximate Correlation Clustering Using SameCluster Queries 
SemiSupervised Deep Kernel Learning (SSDKL) 
Large amounts of labeled data are typically required to train deep learning models. For many realworld problems, however, acquiring additional data can be expensive or even impossible. We present semisupervised deep kernel learning (SSDKL), a semisupervised regression model based on minimizing predictive variance in the posterior regularization framework. SSDKL combines the hierarchical representation learning of neural networks with the probabilistic modeling capabilities of Gaussian processes. By leveraging unlabeled data, we show improvements on a diverse set of realworld regression tasks over supervised deep kernel learning and semisupervised methods such as VAT and mean teacher adapted for regression. 
SemiSupervised GAN (SSGAN) 
We introduce a new model for building conditional generative models in a semisupervised setting to conditionally generate data given attributes by adapting the GAN framework. The proposed semisupervised GAN (SSGAN) model uses a pair of stacked discriminators to learn the marginal distribution of the data, and the conditional distribution of the attributes given the data respectively. In the semisupervised setting, the marginal distribution (which is often harder to learn) is learned from the labeled + unlabeled data, and the conditional distribution is learned purely from the labeled data. Our experimental results demonstrate that this model performs significantly better compared to existing semisupervised conditional GAN models. 
SemiSupervised Learning  Semisupervised learning deals with the problem of how, if possible, to take advantage of a huge amount of not classified data, to perform classification, in situations when, typically, the labelled data are few. Even though this is not always possible (it depends on how useful is to know the distribution of the unlabelled data in the inference of the labels), several algorithm have been proposed recently. A new algorithm is proposed, that under almost neccesary conditions, attains asymptotically the performance of the best theoretical rule, when the size of unlabeled data tends to infinity. The set of necessary assumptions, although reasonables, show that semiparametric classification only works for very well conditioned problems. 
SemiSupervised Multimodal Hashing  Retrieving nearest neighbors across correlated data in multiple modalities, such as imagetext pairs on Facebook and videotag pairs on YouTube, has become a challenging task due to the huge amount of data. Multimodal hashing methods that embed data into binary codes can boost the retrieving speed and reduce storage requirement. As unsupervised multimodal hashing methods are usually inferior to supervised ones, while the supervised ones requires too much manually labeled data, the proposed method in this paper utilizes a part of labels to design a semisupervised multimodal hashing method. It first computes the transformation matrices for data matrices and label matrix. Then, with these transformation matrices, fuzzy logic is introduced to estimate a label matrix for unlabeled data. Finally, it uses the estimated label matrix to learn hashing functions for data in each modality to generate a unified binary code matrix. Experiments show that the proposed semisupervised method with 50% labels can get a medium performance among the compared supervised ones and achieve an approximate performance to the best supervised method with 90% labels. With only 10% labels, the proposed method can still compete with the worst compared supervised one. 
SemiSupervised Novelty Detection (SSND) 
A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semisupervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to NeymanPearson classification. Unlike the inductive approach, semisupervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study. We also show that SSND provides distributionfree, learningtheoretic solutions to two well known problems in hypothesis testing. First, our results provide a general solution to the general twosample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard pvalue approach to multiple testing under the socalled random effects model. Unlike standard rejection regions based on thresholded pvalues, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions 
SEmisupervised VErification Network (SEVEN) 
Verification determines whether two samples belong to the same class or not, and has important applications such as face and fingerprint verification, where thousands or millions of categories are present but each category has scarce labeled examples, presenting two major challenges for existing deep learning models. We propose a deep semisupervised model named SEmisupervised VErification Network (SEVEN) to address these challenges. The model consists of two complementary components. The generative component addresses the lack of supervision within each category by learning general salient structures from a large amount of data across categories. The discriminative component exploits the learned general features to mitigate the lack of supervision within categories, and also directs the generative component to find more informative structures of the whole data manifold. The two components are tied together in SEVEN to allow an endtoend training of the two components. Extensive experiments on four verification tasks demonstrate that SEVEN significantly outperforms other stateoftheart deep semisupervised techniques when labeled data are in short supply. Furthermore, SEVEN is competitive with fully supervised baselines trained with a larger amount of labeled data. It indicates the importance of the generative component in SEVEN. 
SemTK  The relatively recent adoption of Knowledge Graphs as an enabling technology in multiple highprofile artificial intelligence and cognitive applications has led to growing interest in the Semantic Web technology stack. Many semanticsrelated tools, however, are focused on serving experts with a deep understanding of semantic technologies. For example, triplification of relational data is available but there is no open source tool that allows a user unfamiliar with OWL/RDF to import data into a semantic triple store in an intuitive manner. Further, many tools require users to have a working understanding of SPARQL to query data. Casual users interested in benefiting from the power of Knowledge Graphs have few tools available for exploring, querying, and managing semantic data. We present SemTK, the Semantics Toolkit, a userfriendly suite of tools that allow both expert and nonexpert semantics users convenient ingestion of relational data, simplified query generation, and more. The exploration of ontologies and instance data is performed through SPARQLgraph, an intuitive webbased user interface in SemTK understandable and navigable by a lay user. The open source version of SemTK is available at http://semtk.research.ge.com. 
SenGen  We present a new topic model that generates documents by sampling a topic for one whole sentence at a time, and generating the words in the sentence using an RNN decoder that is conditioned on the topic of the sentence. We argue that this novel formalism will help us not only visualize and model the topical discourse structure in a document better, but also potentially lead to more interpretable topics since we can now illustrate topics by sampling representative sentences instead of bag of words or phrases. We present a variational autoencoder approach for learning in which we use a factorized variational encoder that independently models the posterior over topical mixture vectors of documents using a feedforward network, and the posterior over topic assignments to sentences using an RNN. Our preliminary experiments on two different datasets indicate early promise, but also expose many challenges that remain to be addressed. 
Sensitivity  Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as classification function. Sensitivity (also called the true positive rate, or the recall rate in some fields) measures the proportion of actual positives which are correctly identified as such (e.g. the percentage of sick people who are correctly identified as having the condition). Specificity (sometimes called the true negative rate) measures the proportion of negatives which are correctly identified as such (e.g. the percentage of healthy people who are correctly identified as not having the condition). These two measures are closely related to the concepts of type I and type II errors. A perfect predictor would be described as 100% sensitive (i.e. predicting all people from the sick group as sick) and 100% specific (i.e. not predicting anyone from the healthy group as sick); however, theoretically any predictor will possess a minimum error bound known as the Bayes error rate. 
Sensitivity Analysis  Sensitivity analysis is the study of how the uncertainty in the output of a mathematical model or system (numerical or otherwise) can be apportioned to different sources of uncertainty in its inputs. A related practice is uncertainty analysis, which has a greater focus on uncertainty quantification and propagation of uncertainty. Ideally, uncertainty and sensitivity analysis should be run in tandem. Sensitivity analysis can be useful for a range of purposes, including Testing the robustness of the results of a model or system in the presence of uncertainty. Increased understanding of the relationships between input and output variables in a system or model. Uncertainty reduction: identifying model inputs that cause significant uncertainty in the output and should therefore be the focus of attention if the robustness is to be increased (perhaps by further research). Searching for errors in the model (by encountering unexpected relationships between inputs and outputs). Model simplification – fixing model inputs that have no effect on the output, or identifying and removing redundant parts of the model structure. Enhancing communication from modelers to decision makers (e.g. by making recommendations more credible, understandable, compelling or persuasive). Finding regions in the space of input factors for which the model output is either maximum or minimum or meets some optimum criterion (, ). In case of calibrating models with large number of parameters, a primary sensitivity test can ease the calibration stage by focusing on the sensitive parameters. Not knowing the sensitivity of parameters can result in time being uselessly spent on nonsensitive ones. Taking an example from economics, in any budgeting process there are always variables that are uncertain. Future tax rates, interest rates, inflation rates, headcount, operating expenses and other variables may not be known with great precision. Sensitivity analysis answers the question, ‘if these variables deviate from expectations, what will the effect be (on the business, model, system, or whatever is being analyzed), and which variables are causing the largest deviations?’ reval 
Sensor Transformation Attention Networks  Recent work on encoderdecoder models for sequencetosequence mapping has shown that integrating both temporal and spatial attention mechanisms into neural networks increases the performance of the system substantially. In this work, we report on the application of an attentional signal not on temporal and spatial regions of the input, but instead as a method of switching among inputs themselves. We evaluate the particular role of attentional switching in the presence of dynamic noise in the sensors, and demonstrate how the attentional signal responds dynamically to changing noise levels in the environment to achieve increased performance on both audio and visual tasks in three commonlyused datasets: TIDIGITS, Wall Street Journal, and GRID. Moreover, the proposed sensor transformation network architecture naturally introduces a number of advantages that merit exploration, including ease of adding new sensors to existing architectures, attentional interpretability, and increased robustness in a variety of noisy environments not seen during training. Finally, we demonstrate that the sensor selection attention mechanism of a model trained only on the small TIDIGITS dataset can be transferred directly to a preexisting larger network trained on the Wall Street Journal dataset, maintaining functionality of switching between sensors to yield a dramatic reduction of error in the presence of noise. 
SentEval  We introduce SentEval, a toolkit for evaluating the quality of universal sentence representations. SentEval encompasses a variety of tasks, including binary and multiclass classification, natural language inference and sentence similarity. The set of tasks was selected based on what appears to be the community consensus regarding the appropriate evaluations for universal sentence representations. The toolkit comes with scripts to download and preprocess datasets, and an easy interface to evaluate sentence encoders. The aim is to provide a fairer, less cumbersome and more centralized way for evaluating sentence representations. 
Sentic Computing  Sentic computing is a multidisciplinary approach to natural language processing and understanding at the crossroads between affective computing, information extraction, and commonsense computing, which exploits both computer and social sciences to better interpret and process information on the Web. In sentic computing, whose term derives from the Latin ‘sentire’ (root of words such as sentiment and sentience) and ‘sensus’ (as in commonsense), the analysis of natural language is based on affective ontologies and commonsense reasoning tools, which enable the analysis of text not only at document, page or paragraphlevel, but also at sentence, clause, and conceptlevel. In particular, sentic computing involves the use of AI and Semantic Web techniques, for knowledge representation and inference; mathematics, for carrying out tasks such as graph mining and multidimensionality reduction; linguistics, for discourse analysis and pragmatics; psychology, for cognitive and affective modeling; sociology, for understanding social network dynamics and social influence; finally ethics, for understanding related issues about the nature of mind and the creation of emotional machines. jumping NLP curves Sentic computing adopts the bagofconcepts model in stead of simply counting word cooccurrence frequencies in text. Working at conceptlevel entails preserving the meaning carried by multiword expressions such as ‘cloud computing’, which represent semantic atoms that should never be broken down into single words. In the bagofwords model, for example, the concept ‘cloud computing’ would be split into ‘computing’ and ‘cloud’, which may wrongly activate concepts related to the weather and, hence, compromise categorization accuracy. http://…/senticcomputing.pdf http://…/9789400750692 
Sentient Enterprise  The continued explosion of data and the continued evolution of analytics capabilities might usher in the next analytics revolution beyond the Intelligent Enterprise. The evolution of analytics capabilities towards an ideal state that is called ‘The Sentient Enterprise’. The Sentient Enterprise is an enterprise that can listen to data, conduct analysis and make autonomous decisions at massive scale in realtime. The Sentient Enterprise can listen to data to sense microtrends. It can act as one organism without being impeded by information silos. It can make autonomous decisions with little or no human intervention. It is always evolving, with emergent intelligence that becomes progressively more sophisticated. http://…/1317004 
Seq2Seq2Sentiment  Multimodal machine learning is a core research area spanning the language, visual and acoustic modalities. The central challenge in multimodal learning involves learning representations that can process and relate information from multiple modalities. In this paper, we propose two methods for unsupervised learning of joint multimodal representations using sequence to sequence (Seq2Seq) methods: a \textit{Seq2Seq Modality Translation Model} and a \textit{Hierarchical Seq2Seq Modality Translation Model}. We also explore multiple different variations on the multimodal inputs and outputs of these seq2seq models. Our experiments on multimodal sentiment analysis using the CMUMOSI dataset indicate that our methods learn informative multimodal representations that outperform the baselines and achieve improved performance on multimodal sentiment analysis, specifically in the Bimodal case where our model is able to improve F1 Score by twelve points. We also discuss future directions for multimodal Seq2Seq methods. 
Sequence and Set Similarity Measure (S3M) 
In many data mining applications, both classification and clustering algorithms require a distance/similarity measure. The central problem in similarity based clustering/classification comprising sequential data is deciding an appropriate similarity metric. The existing metrics like Euclidean, Jaccard, Cosine, and so forth do not exploit the sequential nature of data explicitly. In this paper, the authors propose a similarity preserving function called Sequence and Set Similarity Measure (S3M) that captures both the order of occurrence of items in sequences and the constituent items of sequences. 
Sequence Graph Transform (SGT) 
A ubiquitous presence of sequence data across fields, like, web, healthcare, bioinformatics, text mining, etc., has made sequence mining a vital research area. However, sequence mining is particularly challenging because of absence of an accurate and fast approach to find (dis)similarity between sequences. As a measure of (dis)similarity, mainstream data mining methods like kmeans, kNN, regression, etc., have proved distance between data points in a euclidean space to be most effective. But a distance measure between sequences is not obvious due to their unstructuredness — arbitrary strings of arbitrary length. We, therefore, propose a new function, called as Sequence Graph Transform (SGT), that extracts sequence features and embeds it in a finitedimensional euclidean space. It is scalable due to a low computational complexity and has a universal applicability on any sequence problem. We theoretically show that SGT can capture both short and long patterns in sequences, and provides an accurate distancebased measure of (dis)similarity between them. This is also validated experimentally. Finally, we show its real world application for clustering, classification, search and visualization on different sequence problems. 
Sequence Mixed Graphs (SMG) 
A mixed graph can be seen as a type of digraph containing some edges (two opposite arcs). Here we introduce the concept of sequence mixed graphs, which is a generalization of both sequence graphs and iterated line digraphs. These structures are proven to be useful in the problem of constructing dense graphs or digraphs, and this is related to the degree/diameter problem. Thus, our generalized approach gives rise to graphs that have also good ratio order/diameter. Moreover, we propose a general method for obtaining a sequence mixed digraph by identifying some vertices of a certain iterated line digraph. As a consequence, some results about distancerelated parameters (mainly, the diameter and the average distance) of sequence mixed graphs are presented. 
Sequential Adaptive Nonlinear Modeling of Vector Time Series (SLANTS) 
We propose a method for adaptive nonlinear sequential modeling of vectortime series data. Data is modeled as a nonlinear function of past values corrupted by noise, and the underlying nonlinear function is assumed to be approximately expandable in a spline basis. We cast the modeling of data as finding a good fit representation in the linear span of multidimensional spline basis, and use a variant of l1penalty regularization in order to reduce the dimensionality of representation. Using adaptive filtering techniques, we design our online algorithm to automatically tune the underlying parameters based on the minimization of the regularized sequential prediction error. We demonstrate the generality and flexibility of the proposed approach on both synthetic and realworld datasets. Moreover, we analytically investigate the performance of our algorithm by obtaining both bounds of the prediction errors, and consistency results for variable selection. 
Sequential Analysis  In statistics, sequential analysis or sequential hypothesis testing is statistical analysis where the sample size is not fixed in advance. Instead data is evaluated as it is collected, and further sampling is stopped in accordance with a predefined stopping rule as soon as significant results are observed. Thus a conclusion may sometimes be reached at a much earlier stage than would be possible with more classical hypothesis testing or estimation, at consequently lower financial and/or human cost. 
Sequential Backward Selection (SBS) 
The Sequential Backward Selection (SBS) algorithm is very similar to the Sequential Fortward Selection (SFS). The only difference is that we start with the complete feature set instead of the “null set” and remove features sequentially until we reach the number of desired features k. Note that features are never added back once they were removed, which (similar to SFS) is one of the biggest downsides of this algorithm. 
Sequential Bagging on Regression (SQB) 
Methodology: Remove one observation. Training the rest of data that are sampled without replacement and given this observation’s input, predict the response back. Replicate this N times and for each response, take a sample from these replicates with replacement. Average each responses of sample and again replicate this step N time for each observation. Approximate these N new responses and generate another N responses y’. Training these y’ and predict to have N responses of each testing observation. The average of N is the final prediction. Each observation will do the same. SQB 
Sequential Bayesian Additive Regression Trees (sBART) 
➚ “Bayesian Additive Regression Tree” sbart 
Sequential Copying Network (SeqCopyNet) 
Copying mechanism shows effectiveness in sequencetosequence based neural network models for text generation tasks, such as abstractive sentence summarization and question generation. However, existing works on modeling copying or pointing mechanism only considers single word copying from the source sentences. In this paper, we propose a novel copying framework, named Sequential Copying Networks (SeqCopyNet), which not only learns to copy single words, but also copies sequences from the input sentence. It leverages the pointer networks to explicitly select a subspan from the source side to target side, and integrates this sequential copying mechanism to the generation process in the encoderdecoder paradigm. Experiments on abstractive sentence summarization and question generation tasks show that the proposed SeqCopyNet can copy meaningful spans and outperforms the baseline models. 
Sequential Deactivation (SDA) 
We introduce a new neural network model, together with a tractable and monotone online learning algorithm. Our model describes feedforward networks for classification, with one output node for each class. The only nonlinear operation is rectification using a ReLU function with a bias. However, there is a rectifier on every edge rather than at the nodes of the network. There are also weights, but these are positive, static, and associated with the nodes. Our ‘rectified wire’ networks are able to represent arbitrary Boolean functions. Only the bias parameters, on the edges of the network, are learned. Another departure in our approach, from standard neural networks, is that the loss function is replaced by a constraint. This constraint is simply that the value of the output node associated with the correct class should be zero. Our model has the property that the exact normminimizing parameter update, required to correctly classify a training item, is the solution to a quadratic program that can be computed with a few passes through the network. We demonstrate a training algorithm using this update, called sequential deactivation (SDA), on MNIST and some synthetic datasets. Upon adopting a natural choice for the nodal weights, SDA has no hyperparameters other than those describing the network structure. Our experiments explore behavior with respect to network size and depth in a family of sparse expander networks. 
Sequential Dynamical System (SDS) 
Sequential dynamical systems (SDSs) are a class of graph dynamical systems. They are discrete dynamical systems which generalize many aspects of for example classical cellular automata, and they provide a framework for studying asynchronous processes over graphs. The analysis of SDSs uses techniques from combinatorics, abstract algebra, graph theory, dynamical systems and probability theory. 
Sequential Floating Backward Selection (SFBS) 
Just as in the Sequential Floating Forward Selection (SFFS) algorithm, we have a conditional step: Here, we start with the whole feature subset and exclude features sequentially. Only if adding one of the previously excluded features back to a new feature subset improves the performance (assessed by the criterion function), we add it back in the Conditional Inclusion step. 
Sequential Floating Forward Selection (SFFS) 
The Sequential Floating Forward Selection (SFFS) algorithm can be considered as extension of the simpler Sequential Fortward Selection (SFS) algorithm. In constrast to SFS, the SFFS algorithm can remove features once they were included, so that a larger number of feature subset combinations can be sampled. It is important to emphasize that the removal of included features is conditional, which makes it different from the +L R algorithm. The Conditional Exclusion in SFFS only occurs if the resulting feature subset is assessed as “better” by the criterion function after removal of a particular feature. 
Sequential Forward Selection (SFS) 
The Sequential Fortward Selection (SFS) is one of the simplest and probably fastest Feature Selection algorithms. Let’s summarize its mechanics in words: SFS starts with an empty feature subset and sequentially adds features from the whole input feature space to this subset until the subset reaches a desired (userspecified) size. For every iteration (= inclusion of a new feature), the whole feature subset is evaluated (expect for the features that are already included in the new subset). The evaluation is done by the socalled criterion function which assesses the feature that leads to the maximum performance improvement of the feature subset if it is included. Note that included features are never removed, which is one of the biggest downsides of this algorithm. 
Sequential Input Selection Algorithm (SISAL) 
In time series prediction, making accurate predictions is often the primary goal. At the same time, interpretability of the models would be desirable. For the latter goal, we have devised a sequential input selection algorithm (SISAL) to choose a parsimonious, or sparse, set of input variables. Our proposed algorithm is a sequential backward selection type algorithm based on a crossvalidation resampling procedure. Our strategy is to use a filter approach in the prediction: first we select a sparse set of inputs using linear models and then the selected inputs are used in the nonlinear prediction conducted with multilayerperceptron networks. Furthermore, we perform a sensitivity analysis by quantifying the importance of the individual input variables in the nonlinear models using a method based on partial derivatives. Experiments are done with the Santa Fe laser data set that exhibits very nonlinear behavior and a data set in a problem of electricity load prediction. The results in the prediction problems of varying difficulty highlight the range of applicability of our proposed algorithm. In summary, our SISAL yields accurate and parsimonious prediction models giving insight to the original problem. sisal 
Sequential Match Network  We study response selection for multiturn conversation in retrieval based chatbots. Existing works either ignores relationships among utterances, or misses important information in context when matching a response with a highly abstract context vector finally. We propose a new session based matching model to address both problems. The model first matches a response with each utterance on multiple granularities, and distills important matching information from each pair as a vector with convolution and pooling operations. The vectors are then accumulated in a chronological order through a recurrent neural network (RNN) which models the relationships among the utterances. The final matching score is calculated with the hidden states of the RNN. Empirical study on two public data sets shows that our model can significantly outperform the stateoftheart methods for response selection in multiturn conversation. 
Sequential Monte Carlo (SMC) 
Multisample objectives improve over singlesample estimates by giving tighter variational bounds and more accurate estimates of posterior uncertainty. However, these multisample techniques scale poorly, in the sense that the number of samples required to maintain the same quality of posterior approximation scales exponentially in the number of latent dimensions. One approach to addressing these issues is sequential Monte Carlo (SMC). However for many problems SMC is prohibitively slow because the resampling steps imposes an inherently sequential structure on the computation, which is difficult to effectively parallelise on GPU hardware. We developed tensor MonteCarlo to address these issues. In particular, whereas the usual multisample objective draws $K$ samples from a joint distribution over all latent variables, we draw $K$ samples for each of the $n$ individual latent variables, and form our bound by averaging over all $K^n$ combinations of samples from each individual latent. While this sum over exponentially many terms might seem to be intractable, in many cases it can be efficiently computed by exploiting conditional independence structure. In particular, we generalise and simplify classical algorithms such as message passing by noting that these sums can be computed can be written in an extremely simple, general form: a series of tensor innerproducts which can be depicted graphically as reductions of a factor graph. As such, we can straightforwardly combine summation over discrete variables with importance sampling over importance sampling over continuous variables. 
Sequential Network Transfer  We study the problem of adapting neural sentence embedding models to the domain of human activities to capture their relations in different dimensions. We introduce a novel approach, Sequential Network Transfer, and show that it largely improves the performance on all dimensions. We also extend this approach to other semantic similarity datasets, and show that the resulting embeddings outperform traditional transfer learning approaches in many cases, achieving stateoftheart results on the Semantic Textual Similarity (STS) Benchmark. To account for the improvements, we provide some interpretation of what the networks have learned. Our results suggest that Sequential Network Transfer is highly effective for various sentence embedding models and tasks. 
Sequential Offsetted Regression (SOR) 
SOR 
Sequential Parameter Optimization Toolbox (SPOT) 
The performance of optimization algorithms relies crucially on their parameterizations. Finding good parameter settings is called algorithm tuning. Using a simple simulated annealing algorithm, we will demonstrate how optimization algorithms can be tuned using the sequential parameter optimization toolbox (SPOT). SPOT provides several tools for automated and interactive tuning. The underling concepts of the SPOT approach are explained. This includes key techniques such as exploratory fitness landscape analysis and response surface methodology. Many examples illustrate how SPOT can be used for understanding the performance of algorithms and gaining insight into algorithm’s behavior. Furthermore, we demonstrate how SPOT can be used as an optimizer and how a sophisticated ensemble approach is able to combine several meta models via stacking. 
Sequential PAttern Discovery using Equivalence classes (SPADE) 
In this paper we present SPADE, a new algorithm for fast discovery of Sequential Patterns. The existing solutions to this problem make repeated database scans, and use complex hash structures which have poor locality. SPADE utilizes combinatorial properties to decompose the original problem into smaller subproblems, that can be independently solved in mainmemory using efficient lattice search techniques, and using simple join operations. All sequences are discovered in only three database scans. Experiments showthat SPADE outperforms the best previous algorithm by a factor of two, and by an order of magnitude with some preprocessed data. It also has linear scalability with respect to the number of inputsequences, and a number of other database parameters. http://…/SPADE 
Sequential Pattern Mining  Sequential Pattern mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence. It is usually presumed that the values are discrete, and thus time series mining is closely related, but usually considered a different activity. Sequential pattern mining is a special case of structured data mining. There are several key traditional computational problems addressed within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members. In general, sequence mining problems can be classified as string mining which is typically based on string processing algorithms and itemset mining which is typically based on association rule learning. 
Sequential Principal Curves Analysis (SPCA) 
This work includes all the technical details of the Sequential Principal Curves Analysis (SPCA) in a single document. SPCA is an unsupervised nonlinear and invertible feature extraction technique. The identified curvilinear features can be interpreted as a set of nonlinear sensors: the response of each sensor is the projection onto the corresponding feature. Moreover, it can be easily tuned for different optimization criteria; e.g. infomax, error minimization, decorrelation; by choosing the right way to measure distances along each curvilinear feature. Even though proposed in and shown to work in multiple modalities in , the SPCA framework has its original roots in the nonlinear ICA algorithm in. Later on, the SPCA philosophy for nonlinear generalization of PCA originated substantially faster alternatives at the cost of introducing different constraints in the model. Namely, the Principal Polynomial Analysis (PPA) , and the Dimensionality Reduction via Regression (DRR). This report illustrates the reasons why we developed such family and is the appropriate technical companion for the missing details in. 
Sequential Probability Distribution  fanplot 
Sequential Probability Ratio Test (SPRT) 
The sequential probability ratio test (SPRT) is a specific sequential hypothesis test, developed by Abraham Wald. Neyman and Pearson’s 1933 result inspired Wald to reformulate it as a sequential analysis problem. The NeymanPearson lemma, by contrast, offers a rule of thumb for when all the data is collected (and its likelihood ratio known). While originally developed for use in quality control studies in the realm of manufacturing, SPRT has been formulated for use in the computerized testing of human examinees as a termination criterion. SPRT 
Sequential Subspace Optimization Boosting (SEBOOST) 
We present SEBOOST, a technique for boosting the performance of existing stochastic optimization methods. SEBOOST applies a secondary optimization process in the subspace spanned by the last steps and descent directions. The method was inspired by the SESOP optimization method for largescale problems, and has been adapted for the stochastic learning framework. It can be applied on top of any existing optimization method with no need to tweak the internal algorithm. We show that the method is able to boost the performance of different algorithms, and make them more robust to changes in their hyperparameters. As the boosting steps of SEBOOST are applied between large sets of descent steps, the additional subspace optimization hardly increases the overall computational burden. We introduce two hyperparameters that control the balance between the baseline method and the secondary optimization process. The method was evaluated on several deep learning tasks, demonstrating promising results. 
Serial Correlation  
Serial Dependence Diagrams (SDD) 
SDD 
SERKET  To realize humanlike robot intelligence, a largescale cognitive architecture is required for robots to understand the environment through a variety of sensors with which they are equipped. In this paper, we propose a novel framework named Serket that enables the construction of a largescale generative model and its inference easily by connecting submodules to allow the robots to acquire various capabilities through interaction with their environments and others. We consider that largescale cognitive models can be constructed by connecting smaller fundamental models hierarchically while maintaining their programmatic independence. Moreover, connected modules are dependent on each other, and parameters are required to be optimized as a whole. Conventionally, the equations for parameter estimation have to be derived and implemented depending on the models. However, it becomes harder to derive and implement those of a larger scale model. To solve these problems, in this paper, we propose a method for parameter estimation by communicating the minimal parameters between various modules while maintaining their programmatic independence. Therefore, Serket makes it easy to construct largescale models and estimate their parameters via the connection of modules. Experimental results demonstrated that the model can be constructed by connecting modules, the parameters can be optimized as a whole, and they are comparable with the original models that we have proposed. 
Service Mining  Traditional service marketing and service science attempted to help companies understand what customers think and how companies dealt with problems. However, a holistic framework and viewpoint to explore services differently is needed. Service mining provides a different perspective into the services industry. Professionals and practitioners also need various mindsets to investigate and analyze the evidence from services. According to the concept of service science, certain areas are involved such as economics, management, computer science, and engineering. This book provides a novel concept to combine the areas of social science and computer science in services. Service mining is a holistic concept covering a service’s lifecycle from design, experience, recover to retain. Traditionally, the value of mining is to discover unknown and potential patterns from big data. Service mining focuses on the amount of data generated from the value cocreation process and features of services. The goal of service mining is to analyze any step in the service’s lifecycle and help enterprises reexamine each one. Companies can also utilize appropriate marketing or management methods to adjust biases and revise the errors of services. Service mining is defined as ‘a systematical process including service discovery, service experience, service recovery and service retention to discover unique patterns and exceptional values within the existing service pool’. The goal of service mining is similar to data mining, text mining or web mining. All aim to ‘detect something new’ from the base being mined. Service mining targets the service pool. What distinguishes service mining from data or text mining is the concept service itself. Data is generally considered factual; text, though more nuanced in that words carry connotations, has a primary denotative quality which conveys meaning that text miners and the consumers of the mined text agree upon. Service, however, is trickier. It is a process of establishing a value proposition; and the value it represents is the joint creation of the provider and the customer, each of which offers a different perception in constructing the value proposition. Moreover, in the concept of service mining, the mining target is not only the traditional categories of services but also ITbased services. Under the big umbrella of service science, service mining is considered to be a branch of it. 
Service With Delay Problem  In this paper, we introduce the online service with delay problem. In this problem, there are $n$ points in a metric space that issue service requests over time, and a server that serves these requests. The goal is to minimize the sum of distance traveled by the server and the total delay in serving the requests. This problem models the fundamental tradeoff between batching requests to improve locality and reducing delay to improve response time, that has many applications in operations management, operating systems, logistics, supply chain management, and scheduling. Our main result is to show a polylogarithmic competitive ratio for the online service with delay problem. This result is obtained by an algorithm that we call the preemptive service algorithm. The salient feature of this algorithm is a process called preemptive service, which uses a novel combination of (recursive) time forwarding and spatial exploration on a metric space. We hope this technique will be useful for related problems such as reordering buffer management, online TSP, vehicle routing, etc. We also generalize our results to $k > 1$ servers. 
SetExpander  We present SetExpander, a corpusbased system for expanding a seed set of terms into amore complete set of terms that belong to the same semantic class. SetExpander implements an iterative endtoend workflow. It enables users to easily select a seed set of terms, expand it, view the expanded set, validate it, reexpand the validated set and store it, thus simplifying the extraction of domainspecific finegrained semantic classes. SetExpander has been used successfully in reallife use cases including integration into an automated recruitment system and an issues and defects resolution system. A video demo of SetExpander is available at https://…open?id=1e545bB87Autsch36DjnJHmq3HWfSd1Rv (some images were blurred for privacy reasons) 
Seven Pillars of the Causal Revolution  What you can do with a causal model that you could not do without? Pillar 1: Encoding Causal Assumptions – Transparency and Testability Pillar 2: Docalculus and the control of confounding Pillar 3: The Algorithmization of Counterfactuals Pillar 4: Mediation Analysis and the Assessment of Direct and Indirect Effects Pillar 5: External Validity and Sample Selection Bias Pillar 6: Missing Data Pillar 7: Causal Discovery 
SGAN  The Generative Adversarial Networks (GANs) have demonstrated impressive performance for data synthesis, and are now used in a wide range of computer vision tasks. In spite of this success, they gained a reputation for being difficult to train, what results in a timeconsuming and humaninvolved development process to use them. We consider an alternative training process, named SGAN, in which several adversarial ‘local’ pairs of networks are trained independently so that a ‘global’ supervising pair of networks can be trained against them. The goal is to train the global pair with the corresponding ensemble opponent for improved performances in terms of mode coverage. This approach aims at increasing the chances that learning will not stop for the global pair, preventing both to be trapped in an unsatisfactory local minimum, or to face oscillations often observed in practice. To guarantee the latter, the global pair never affects the local ones. The rules of SGAN training are thus as follows: the global generator and discriminator are trained using the local discriminators and generators, respectively, whereas the local networks are trained with their fixed local opponent. Experimental results on both toy and realworld problems demonstrate that this approach outperforms standard training in terms of better mitigating mode collapse, stability while converging and that it surprisingly, increases the convergence speed as well. 
ShakeShake Regularization  The method introduced in this paper aims at helping deep learning practitioners faced with an overfit problem. The idea is to replace, in a multibranch network, the standard summation of parallel branches with a stochastic affine combination. Applied to 3branch residual networks, shakeshake regularization improves on the best single shot published results on CIFAR10 and CIFAR100 by reaching test errors of 2.86% and 15.85%. Experiments on architectures without skip connections or Batch Normalization show encouraging results and open the door to a large set of applications. Code is available at https://…/shakeshake. 
Shampoo  Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structureaware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates on a single dimension, contracting over the remaining dimensions. We establish convergence guarantees in the stochastic convex setting, the proof of which builds upon matrix trace inequalities. Our experiments with stateoftheart deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. Although it involves a more complex update rule, Shampoo’s runtime per step is comparable to that of simple gradient methods such as SGD, AdaGrad, and Adam. 
SHAnnon DEcay (SHADE) 
Regularization is a big issue for training deep neural networks. In this paper, we propose a new informationtheorybased regularization scheme named SHADE for SHAnnon DEcay. The originality of the approach is to define a prior based on conditional entropy, which explicitly decouples the learning of invariant representations in the regularizer and the learning of correlations between inputs and labels in the data fitting term. Our second contribution is to derive a stochastic version of the regularizer compatible with deep learning, resulting in a tractable training scheme. We empirically validate the efficiency of our approach to improve classification performances compared to standard regularization schemes on several standard architectures. 
ShannonHartley Theorem  In information theory, the ShannonHartley theorem tells the maximum rate at which information can be transmitted over a communications channel of a specified bandwidth in the presence of noise. It is an application of the noisychannel coding theorem to the archetypal case of a continuoustime analog communications channel subject to Gaussian noise. The theorem establishes Shannon’s channel capacity for such a communication link, a bound on the maximum amount of errorfree digital data (that is, information) that can be transmitted with a specified bandwidth in the presence of the noise interference, assuming that the signal power is bounded, and that the Gaussian noise process is characterized by a known power or power spectral density. The law is named after Claude Shannon and Ralph Hartley. 
SHapley Additive exPlanation (SHAP) 
Understanding why a model makes a certain prediction can be as crucial as the prediction’s accuracy in many applications. However, the highest accuracy for large modern datasets is often achieved by complex models that even experts struggle to interpret, such as ensemble or deep learning models, creating a tension between accuracy and interpretability. In response, various methods have recently been proposed to help users interpret the predictions of complex models, but it is often unclear how these methods are related and when one method is preferable over another. To address this problem, we present a unified framework for interpreting predictions, SHAP (SHapley Additive exPlanations). SHAP assigns each feature an importance value for a particular prediction. Its novel components include: (1) the identification of a new class of additive feature importance measures, and (2) theoretical results showing there is a unique solution in this class with a set of desirable properties. The new class unifies six existing methods, notable because several recent methods in the class lack the proposed desirable properties. Based on insights from this unification, we present new methods that show improved computational performance and/or better consistency with human intuition than previous approaches. Demystifying BlackBox Models with SHAP Value Analysis 
Sharding  Sharding is the process of splitting up your data so it resides in different tables or often different physical databases. Sharding is helpful when you have some specific set of data that outgrows either storage or reasonable performance within a single database. 
Shared Learning Framework  Deep Reinforcement Learning has been able to achieve amazing successes in a variety of domains from video games to continuous control by trying to maximize the cumulative reward. However, most of these successes rely on algorithms that require a large amount of data to train in order to obtain results on par with humanlevel performance. This is not feasible if we are to deploy these systems on real world tasks and hence there has been an increased thrust in exploring data efficient algorithms. To this end, we propose the Shared Learning framework aimed at making $Q$ensemble algorithms dataefficient. For achieving this, we look into some principles of transfer learning which aim to study the benefits of information exchange across tasks in reinforcement learning and adapt transfer to learning our value function estimates in a novel manner. In this paper, we consider the special case of transfer between the value function estimates in the $Q$ensemble architecture of BootstrappedDQN. We further empirically demonstrate how our proposed framework can help in speeding up the learning process in $Q$ensembles with minimum computational overhead on a suite of Atari 2600 Games. 
ShareLaTeX  An easy to use, online, collaborative LaTeX editor. 
Shark  SHARK is a fast, modular, featurerich opensource C++ machine learning library. It provides methods for linear and nonlinear optimization, kernelbased learning algorithms, neural networks, and various other machine learning techniques (see the feature list below). It serves as a powerful toolbox for real world applications as well as research. Shark depends on Boost and CMake. It is compatible with Windows, Solaris, MacOS X, and Linux. Shark is licensed under the permissive GNU Lesser General Public License. RcppShark 
Shark  Shark is an open source distributed SQL query engine for Hadoop data. It brings stateoftheart performance and advanced analytics to Hive users. By running on Spark, Shark can call complex analytics functions like machine learning right from SQL. Or call Shark inside your Spark jobs to load Hive data. 
Sharpness  It is wellknown that, without restricting treatment effect heterogeneity, instrumental variable (IV) methods only identify ‘local’ effects among compliers, i.e., those subjects who take treatment only when encouraged by the IV. Local effects are controversial since they seem to only apply to an unidentified subgroup; this has led many to denounce these effects as having little policy relevance. However, we show that such pessimism is not always warranted: it is possible in some cases to accurately predict who compliers are, and obtain tight bounds on more generalizable effects in identifiable subgroups. We propose methods for doing so and study their estimation error and asymptotic properties, showing that these tasks can in theory be accomplished even with very weak IVs. We go on to introduce a new measure of IV quality called ‘sharpness’, which reflects the variation in compliance explained by covariates, and captures how well one can identify compliers and obtain tight bounds on identifiable subgroup effects. We develop an estimator of sharpness, and show that it is asymptotically efficient under weak conditions. Finally we explore finitesample properties via simulation, and apply the methods to study canvassing effects on voter turnout. We propose that sharpness should be presented alongside strength to assess IV quality. 
Sheffield Elicitation Framework (SHELF) 
The SHeffield ELicitation Framework (SHELF) is a package of documents, templates and software to carry out elicitation of probability distributions for uncertain quantities from a group of experts. Elicitation is increasingly important for quantifying expert knowledge in situations where hard data are sparse. This is often the context in which difficult policy decisions are made. It is generally important to elicit from a group of experts, rather than a single expert, in order to synthesise the range of knowledge and opinions of the expert community. (However, SHELF may be used for a single expert with only trivial modification.) SHELF 
Shewhart Control Chart  Control charts, also known as Shewhart charts (after Walter A. Shewhart) or processbehavior charts, are a statistical process control tool used to determine if a manufacturing or business process is in a state of control. If analysis of the control chart indicates that the process is currently under control (i.e., is stable, with variation only coming from sources common to the process), then no corrections or changes to process control parameters are needed or desired. In addition, data from the process can be used to predict the future performance of the process. If the chart indicates that the monitored process is not in control, analysis of the chart can help determine the sources of variation, as this will result in degraded process performance. A process that is stable but operating outside desired (specification) limits (e.g., scrap rates may be in statistical control but above desired limits) needs to be improved through a deliberate effort to understand the causes of current performance and fundamentally improve the process. The control chart is one of the seven basic tools of quality control. Typically control charts are used for timeseries data, though they can be used for data that have logical comparability (i.e. you want to compare samples that were taken all at the same time, or the performance of different individuals); however the type of chart used to do this requires consideration. 
ShiftCNN  In this paper we introduce ShiftCNN, a generalized lowprecision architecture for inference of multiplierless convolutional neural networks (CNNs). ShiftCNN is based on a poweroftwo weight representation and, as a result, performs only shift and addition operations. Furthermore, ShiftCNN substantially reduces computational cost of convolutional layers by precomputing convolution terms. Such an optimization can be applied to any CNN architecture with a relatively small codebook of weights and allows to decrease the number of product operations by at least two orders of magnitude. The proposed architecture targets custom inference accelerators and can be realized on FPGAs or ASICs. Extensive evaluation on ImageNet shows that the stateoftheart CNNs can be converted without retraining into ShiftCNN with less than 1% drop in accuracy when the proposed quantization algorithm is employed. RTL simulations, targeting modern FPGAs, show that power consumption of convolutional layers is reduced by a factor of 4 compared to conventional 8bit fixedpoint architectures. 
Shogun  Shogun is and opensource machine learning library that offers a wide range of efficient and unified machine learning methods. 
SHOPPER  We develop SHOPPER, a sequential probabilistic model of market baskets. SHOPPER uses interpretable components to model the forces that drive how a customer chooses products; in particular, we designed SHOPPER to capture how items interact with other items. We develop an efficient posterior inference algorithm to estimate these forces from largescale data, and we analyze a large dataset from a major chain grocery store. We are interested in answering counterfactual queries about changes in prices. We found that SHOPPER provides accurate predictions even under price interventions, and that it helps identify complementary and substitutable pairs of products. 
Shortest Dependency Path – Long Short Term Memory (SDPLSTM) 
Relation classification is an important research arena in the field of natural language processing (NLP). In this paper, we present SDPLSTM, a novel neural network to classify the relation of two entities in a sentence. Our neural architecture leverages the shortest dependency path (SDP) between two entities; multichannel recurrent neural networks, with long short term memory (LSTM) units, pick up heterogeneous information along the SDP. Our proposed model has several distinct features: (1) The shortest dependency paths retain most relevant information (to relation classification), while eliminating irrelevant words in the sentence. (2) The multichannel LSTM networks allow effective information integration from heterogeneous sources over the dependency paths. (3) A customized dropout strategy regularizes the neural network to alleviate overfitting. We test our model on the SemEval 2010 relation classification task, and achieve an $F_1$score of 83.7\%, higher than competing methods in the literature. 
Shortest Path Faster Algorithm (SPFA) 
The Shortest Path Faster Algorithm (SPFA) is an improvement of the BellmanFord algorithm which computes singlesource shortest paths in a weighted directed graph. The algorithm is believed to work well on random sparse graphs and is particularly suitable for graphs that contain negativeweight edges. However, the worstcase complexity of SPFA is the same as that of BellmanFord, so for graphs with nonnegative edge weights Dijkstra’s algorithm is preferred. The SPFA algorithm was published in 1994 by Fanding Duan. 
Shortest Probability Interval (SPIn) 
SPIn 
ShortTime Fourier Transform (STFT) 
Towards Fine Grained Network Flow Prediction 
ShotgunWSD  In this paper, we present a novel unsupervised algorithm for word sense disambiguation (WSD) at the document level. Our algorithm is inspired by a widelyused approach in the field of genetics for whole genome sequencing, known as the Shotgun sequencing technique. The proposed WSD algorithm is based on three main steps. First, a bruteforce WSD algorithm is applied to short context windows (up to 10 words) selected from the document in order to generate a short list of likely sense configurations for each window. In the second step, these local sense configurations are assembled into longer composite configurations based on suffix and prefix matching. The resulted configurations are ranked by their length, and the sense of each word is chosen based on a voting scheme that considers only the top k configurations in which the word appears. We compare our algorithm with other stateoftheart unsupervised WSD algorithms and demonstrate better performance, sometimes by a very large margin. We also show that our algorithm can yield better performance than the Most Common Sense (MCS) baseline on one data set. Moreover, our algorithm has a very small number of parameters, is robust to parameter tuning, and, unlike other bioinspired methods, it gives a deterministic solution (it does not involve random choices). 
Shrinkage  In statistics, shrinkage has two meanings: · In relation to the general observation that, in regression analysis, a fitted relationship appears to perform less well on a new data set than on the data set used for fitting. In particular the value of the coefficient of determination ‘shrinks’. This idea is complementary to overfitting and, separately, to the standard adjustment made in the coefficient of determination to compensate for the subjunctive effects of further sampling, like controlling for the potential of new explanatory terms improving the model by chance: that is, the adjustment formula itself provides ‘shrinkage.’ But the adjustment formula yields an artificial shrinkage, in contrast to the first definition. · To describe general types of estimators, or the effects of some types of estimation, whereby a naive or raw estimate is improved by combining it with other information (). The term relates to the notion that the improved estimate is at a reduced distance from the value supplied by the ‘other information’ than is the raw estimate. In this sense, shrinkage is used to regularize illposed inference problems. A common idea underlying both of these meanings is the reduction in the effects of sampling variation. 
Shrinkage Estimator  In statistics, a shrinkage estimator is an estimator that, either explicitly or implicitly, incorporates the effects of shrinkage. In loose terms this means that a naive or raw estimate is improved by combining it with other information. The term relates to the notion that the improved estimate is made closer to the value supplied by the ‘other information’ than the raw estimate. In this sense, shrinkage is used to regularize illposed inference problems. 
Shrunken Centroids Regularized Discriminant Analysis (SCRDA) 
In this paper, we introduce a modified version of linear discriminant analysis, called ‘shrunken centroids regularized discriminant analysis’ (SCRDA). This method generalizes the idea of ‘nearest shrunken centroids’ (NSC) into the classical discriminant analysis. The SCRDA method is specially designed for classification problems in high dimension low sample size situations, for example, microarray data. Through both simulated data and real life data, it is shown that this method performs very well in multivariate classification problems, often outperforms the PAM method and can be as competitive as the SVM classifiers. It is also suitable for feature elimination purpose and can be used as gene selection method. The open source R package for SCRDA is available and will be added to the R libraries in the near future. 
Shuffled Graph  Shuffled Graphs are graphs with latent vertex labels. 
ShuffleNet  We introduce an extremely computation efficient CNN architecture named ShuffleNet, designed specially for mobile devices with very limited computing power (e.g., 10150 MFLOPs). The new architecture utilizes two proposed operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior performance of ShuffleNet over other structures, e.g. lower top1 error (absolute 6.7\%) than the recent MobileNet system on ImageNet classification under the computation budget of 40 MFLOPs. On an ARMbased mobile device, ShuffleNet achieves \textasciitilde 13$\times$ actual speedup over AlexNet while maintaining comparable accuracy. 
ShuffleNet V2  Currently, the neural network architecture design is mostly guided by the \emph{indirect} metric of computation complexity, i.e., FLOPs. However, the \emph{direct} metric, e.g., speed, also depends on the other factors such as memory access cost and platform characterics. Thus, this work proposes to evaluate the direct metric on the target platform, beyond only considering FLOPs. Based on a series of controlled experiments, this work derives several practical \emph{guidelines} for efficient network design. Accordingly, a new architecture is presented, called \emph{ShuffleNet V2}. Comprehensive ablation experiments verify that our model is the stateoftheart in terms of speed and accuracy tradeoff. 
shuttleNet  Despite a lot of research efforts devoted in recent years, how to efficiently learn longterm dependencies from sequences still remains a pretty challenging task. As one of the key models for sequence learning, recurrent neural network (RNN) and its variants such as long short term memory (LSTM) and gated recurrent unit (GRU) are still not powerful enough in practice. One possible reason is that they have only feedforward connections, which is different from biological neural network that is typically composed of both feedforward and feedback connections. To address the problem, this paper proposes a biologicallyinspired RNN structure, called shuttleNet, by introducing loop connections in the network and utilizing parameter sharing to prevent overfitting. Unlike the traditional RNNs, the cells of shuttleNet are loop connected to mimic the brain’s feedforward and feedback connections. The structure is then stretched in the depth dimension to generate a deeper model with multiple information flow paths, while the parameters are shared so as to prevent shuttleNet from being overfitting. The attention mechanism is then applied to select the best information path. The extensive experiments are conducted on two datasets for action recognition: UCF101 and HMDB51. We find that our model can outperform LSTMs and GRUs remarkably. Even only replacing the LSTMs with our shuttleNet in a CNNRNN network, we can still achieve the stateoftheart performance on both datasets. 
Siamese Capsule Network  Capsule Networks have shown encouraging results on \textit{defacto} benchmark computer vision datasets such as MNIST, CIFAR and smallNORB. Although, they are yet to be tested on tasks where (1) the entities detected inherently have more complex internal representations and (2) there are very few instances per class to learn from and (3) where pointwise classification is not suitable. Hence, this paper carries out experiments on face verification in both controlled and uncontrolled settings that together address these points. In doing so we introduce \textit{Siamese Capsule Networks}, a new variant that can be used for pairwise learning tasks. The model is trained using contrastive loss with $\ell_2$normalized capsule encoded pose features. We find that \textit{Siamese Capsule Networks} perform well against strong baselines on both pairwise learning datasets, yielding best results in the fewshot learning setting where image pairs in the test set contain unseen subjects. 
Siamese Deep Forest (SDF) 
A Siamese Deep Forest (SDF) is proposed in the paper. It is based on the Deep Forest or gcForest proposed by Zhou and Feng and can be viewed as a gcForest modification. It can be also regarded as an alternative to the wellknown Siamese neural networks. The SDF uses a modified training set consisting of concatenated pairs of vectors. Moreover, it defines the class distributions in the deep forest as the weighted sum of the tree class probabilities such that the weights are determined in order to reduce distances between similar pairs and to increase them between dissimilar points. We show that the weights can be obtained by solving a quadratic optimization problem. The SDF aims to prevent overfitting which takes place in neural networks when only limited training data are available. The numerical experiments illustrate the proposed distance metric method. 
Siamese Deep Neural Network  Siamese neural network is a class of neural network architectures that contain two or more identical subnetworks. identical here means they have the same configuration with the same parameters and weights. Parameter updating is mirrored across both subnetworks. Siamese NNs are popular among tasks that involve finding similarity or a relationship between two comparable things. Some examples are paraphrase scoring, where the inputs are two sentences and the output is a score of how similar they are; or signature verification, where figure out whether two signatures are from the same person. Generally, in such tasks, two identical subnetworks are used to process the two inputs, and another module will take their outputs and produce the final output. The picture below is from Bromley et al (1993). They proposed a Siamese architecture for the signature verification task. 
Siamese Survival Prognosis Network (SSPN) 
Survival analysis in the presence of multiple possible adverse events, i.e., competing risks, is a pervasive problem in many industries (healthcare, finance, etc.). Since only one event is typically observed, the incidence of an event of interest is often obscured by other related competing events. This nonidentifiability, or inability to estimate true causespecific survival curves from empirical data, further complicates competing risk survival analysis. We introduce Siamese Survival Prognosis Network (SSPN), a novel deep learning architecture for estimating personalized risk scores in the presence of competing risks. SSPN circumvents the nonidentifiability problem by avoiding the estimation of causespecific survival curves and instead determines pairwise concordant timedependent risks, where longer event times are assigned lower risks. Furthermore, SSPN is able to directly optimize an approximation to the Cdiscrimination index, rather than relying on wellknown metrics which are unable to capture the unique requirements of survival analysis with competing risks. 
Sibyl  A system for large scale supervised machine learning. Sibyl is an important research project underway at Google that implements machine learning primitives at scale and is widely used within Google. Large scale machine learning is playing an increasingly important role in improving the quality and monetization of Internet properties. A small number of techniques, such as regression, have proven to be widely applicable across Internet properties and applications. 
sigma.js  Sigma is a JavaScript library dedicated to graph drawing. It makes easy to publish networks on Web pages, and allows developers to integrate network exploration in rich Web applications. 
SigmaConnection Graphs (sigmaCG) 
➚ “Causal Modeling Framework of Modular Structural Causal Models” 
SigmaDelta Networks  Deep neural networks can be obscenely wasteful. When processing video, a convolutional network expends a fixed amount of computation for each frame with no regard to the similarity between neighbouring frames. As a result, it ends up repeatedly doing very similar computations. To put an end to such waste, we introduce SigmaDelta networks. With each new input, each layer in this network sends a discretized form of its change in activation to the next layer. Thus the amount of computation that the network does scales with the amount of change in the input and layer activations, rather than the size of the network. We introduce an optimization method for converting any pretrained deep network into an optimally efficient SigmaDelta network, and show that our algorithm, if run on the appropriate hardware, could cut at least an order of magnitude from the computational cost of processing video data. 
Sigmoid Function  A sigmoid function is a mathematical function having an “S” shape (sigmoid curve). Often, sigmoid function refers to the special case of the logistic function. 
SignalR  ASP.NET SignalR is a new library for ASP.NET developers that makes it incredibly simple to add realtime web functionality to your applications. What is “realtime web” functionality? It’s the ability to have your serverside code push content to the connected clients as it happens, in realtime. You may have heard of WebSockets, a new HTML5 API that enables bidirectional communication between the browser and server. SignalR will use WebSockets under the covers when it’s available, and gracefully fallback to other techniques and technologies when it isn’t, while your application code stays the same. SignalR also provides a very simple, highlevel API for doing server to client RPC (call JavaScript functions in your clients’ browsers from serverside .NET code) in your ASP.NET application, as well as adding useful hooks for connection management, e.g. connect/disconnect events, grouping connections, authorization. 
SignaltoNoise Ratio (SNR) 
Signaltonoise ratio (abbreviated SNR) is a measure used in science and engineering that compares the level of a desired signal to the level of background noise. It is defined as the ratio of signal power to the noise power, often expressed in decibels. A ratio higher than 1:1 (greater than 0 dB) indicates more signal than noise. While SNR is commonly quoted for electrical signals, it can be applied to any form of signal (such as isotope levels in an ice core or biochemical signaling between cells). The signaltonoise ratio, the bandwidth, and the channel capacity of a communication channel are connected by the ShannonHartley theorem. Signaltonoise ratio is sometimes used informally to refer to the ratio of useful information to false or irrelevant data in a conversation or exchange. For example, in online discussion forums and other online communities, offtopic posts and spam are regarded as ‘noise’ that interferes with the ‘signal’ of appropriate discussion. 
Signed Heterogeneous Information Network Embedding (SHINE) 
In online social networks people often express attitudes towards others, which forms massive sentiment links among users. Predicting the sign of sentiment links is a fundamental task in many areas such as personal advertising and public opinion analysis. Previous works mainly focus on textual sentiment classification, however, text information can only disclose the ‘tip of the iceberg’ about users’ true opinions, of which the most are unobserved but implied by other sources of information such as social relation and users’ profile. To address this problem, in this paper we investigate how to predict possibly existing sentiment links in the presence of heterogeneous information. First, due to the lack of explicit sentiment links in mainstream social networks, we establish a labeled heterogeneous sentiment dataset which consists of users’ sentiment relation, social relation and profile knowledge by entitylevel sentiment extraction method. Then we propose a novel and flexible endtoend Signed Heterogeneous Information Network Embedding (SHINE) framework to extract users’ latent representations from heterogeneous networks and predict the sign of unobserved sentiment links. SHINE utilizes multiple deep autoencoders to map each user into a lowdimension feature space while preserving the network structure. We demonstrate the superiority of SHINE over stateoftheart baselines on link prediction and node recommendation in two realworld datasets. The experimental results also prove the efficacy of SHINE in cold start scenario. 
SignificanceOffset Convolutional Neural Network  We propose ‘SignificanceOffset Convolutional Neural Network’, a deep convolutional network architecture for multivariate time series regression. The model is inspired by standard autoregressive (AR) models and gating mechanisms used in recurrent neural networks. It involves an ARlike weighting system, where the final predictor is obtained as a weighted sum of subpredictors while the weights are datadependent functions learnt through a convolutional network.The architecture was designed for applications on asynchronous time series with low signaltonoise ratio and hence is evaluated on such datasets: a hedge fund proprietary dataset of over2 million quotes for a credit derivative index andan artificially generated noisy autoregressive series. The proposed architecture achieves promising results compared to convolutional and recurrent neural networks. The code for the numerical experiments and the architecture implementation will be shared online to make the research reproducible. 
SilanderMyllymaki  bnstruct 
Silent Choir  The cost of communication is a substantial factor affecting the scalability of many distributed applications. Every message sent can incur a cost in storage, computation, energy and bandwidth. Consequently, reducing the communication costs of distributed applications is highly desirable. The best way to reduce message costs is by communicating without sending any messages whatsoever. This paper initiates a rigorous investigation into the use of silence in synchronous settings, in which processes can fail. We formalize sufficient conditions for information transfer using silence, as well as necessary conditions for particular cases of interest. This allows us to identify message patterns that enable communication through silence. In particular, a pattern called a {\em silent choir} is identified, and shown to be central to information transfer via silence in failureprone systems. The power of the new framework is demonstrated on the {\em atomic commitment} problem (AC). A complete characterization of the tradeoff between message complexity and round complexity in the synchronous model with crash failures is provided, in terms of lower bounds and matching protocols. In particular, a new messageoptimal AC protocol is designed using silence, in which processes decide in~3 rounds in the common case. This significantly improves on the best previously known messageoptimal AC protocol, in which decisions were performed in $\Theta(n)$ rounds. 
Silhouette  Silhouette refers to a method of interpretation and validation of clusters of data. The technique provides a succinct graphical representation of how well each object lies within its cluster. It was first described by Peter J. Rousseeuw in 1986. cluster 
SimDex  We present SimDex, a new technique for serving exact topK recommendations on matrix factorization models that measures and optimizes for the similarity between users in the model. Previous serving techniques presume a high degree of similarity (e.g., L2 or cosine distance) among users and/or items in MF models; however, as we demonstrate, the most accurate models are not guaranteed to exhibit high similarity. As a result, bruteforce matrix multiply outperforms recent proposals for topK serving on several collaborative filtering tasks. Based on this observation, we develop SimDex, a new technique for serving matrix factorization models that automatically optimizes serving based on the degree of similarity between users, and outperforms existing methods in both the highsimilarity and lowsimilarity regimes. SimDexfirst measures the degree of similarity among users via clustering and uses a costbased optimizer to either construct an index on the model or defer to blocked matrix multiply. It leverages highly efficient linear algebra primitives in both cases to deliver predictions either from its index or from bruteforce multiply. Overall, SimDex runs an average of 2x and up to 6x faster than highly optimized baselines for the most accurate models on several popular collaborative filtering datasets. 
Simhash Algorithm  Most hash functions are used to separate and obscure data, so that similar data hashes to very different keys. We propose to use hash functions for the opposite purpose: to detect similarities between data. Detecting similar files and classifying documents is a wellstudied problem, but typically involves complex heuristics and/or O(n 2 ) pairwise comparisons. Using a hash function that hashed similar files to similar values, file similarity could be determined simply by comparing presorted hash key values. The challenge is to find a similarity hash that minimizes false positives. We have implemented a family of similarity hash functions with this intent. We have further enhanced their performance by storing the auxiliary data used to compute our hash keys. This data is used as a second filter after a hash key comparison indicates that two files are potentially similar. We use these tests to explore the notion of ‘similarity.’ GitXiv 
Similar Unlabelled Classification (SU Classification) 
One of the biggest bottlenecks in supervised learning is its high labeling cost. To overcome this problem, we propose a new weaklysupervised learning setting called SU classification, where only similar (S) data pairs (two examples belong to the same class) and unlabeled (U) data are needed, instead of fullysupervised data. We show that an unbiased estimator of the classification risk can be obtained only from SU data, and its empirical risk minimizer achieves the optimal parametric convergence rate. Finally, we demonstrate the effectiveness of the proposed method through experiments. 
Similarity Ensemble Approach (SEA) 
SEA is based on the idea that two targets are similar if the ligand sets of a target are similar to one another. The similarity of two ligand sets is computed by the sum of ligand pair similarities that exceed a certain threshold. The ligand pair similarity is measured by Tanimoto similarity. To correct for size or chemical composition bias a correction technique is intrudiced, which is based on the similarity obtained from randomly drawn ligand sets is. This leads to zscores for similarity between the sets. It is argued that the zscores conform an extreme value distribution. Using this extreme value distribution the probability that a compound is active on a certain target is calculated by assuming that one of the two ligand sets consists only of the compound to predict. We implemented the SEA method efficiently for using it on a multicore supercomputer, enabling us to compare it to the other target prediction methods. 
Similarity Flooding  Matching elements of two data schemas or two data instances plays a key role in data warehousing, ebusiness, or even biochemical applications. In this paper we present a matching algorithm based on a fixpoint computation that is usable across different scenarios. The algorithm takes two graphs (schemas, catalogs, or other data structures) as input, and produces as output a mapping between corresponding nodes of the graphs. Depending on the matching goal, a subset of the mapping is chosen using filters. After our algorithm runs, we expect a human to check and if necessary adjust the results. As a matter of fact, we evaluate the ‘accuracy’ of the algorithm by counting the number of needed adjustments. We conducted a user study, in which our accuracy metric was used to estimate the labor savings that the users could obtain by utilizing our algorithm to obtain an initial matching. Finally, we illustrate how our matching algorithm is deployed as one of several highlevel operators in an implemented testbed for managing information models and mappings. 
SimilarityBased Imbalanced Classification (SBIC) 
When the training data in a twoclass classification problem is overwhelmed by one class, most classification techniques fail to correctly identify the data points belonging to the underrepresented class. We propose Similaritybased Imbalanced Classification (SBIC) that learns patterns in the training data based on an empirical similarity function. To take the imbalanced structure of the training data into account, SBIC utilizes the concept of absent data, i.e. data from the minority class which can help better find the boundary between the two classes. SBIC simultaneously optimizes the weights of the empirical similarity function and finds the locations of absent data points. As such, SBIC uses an embedded mechanism for synthetic data generation which does not modify the training dataset, but alters the algorithm to suit imbalanced datasets. Therefore, SBIC uses the ideas of both major schools of thoughts in imbalanced classification: Like costsensitive approaches SBIC operates on an algorithm level to handle imbalanced structures; and similar to synthetic data generation approaches, it utilizes the properties of unobserved data points from the minority class. The application of SBIC to imbalanced datasets suggests it is comparable to, and in some cases outperforms, other commonly used classification techniques for imbalanced datasets. 
SimilarityFirst Search Seriation (SFS) 
SFS 
Simple Competitive Learning (SCL) 

Simple Logging Facade for Java (SLF4J) 
The Simple Logging Facade for Java (SLF4J) serves as a simple facade or abstraction for various logging frameworks (e.g. java.util.logging, logback, log4j) allowing the end user to plug in the desired logging framework at deployment time. Before you start using SLF4J, we highly recommend that you read the twopage SLF4J user manual. Note that SLF4Jenabling your library implies the addition of only a single mandatory dependency, namely slf4japi.jar. If no binding is found on the class path, then SLF4J will default to a nooperation implementation. In case you wish to migrate your Java source files to SLF4J, consider our migrator tool which can migrate your project to use the SLF4J API in just a few minutes. In case an externallymaintained component you depend on uses a logging API other than SLF4J, such as commons logging, log4j or java.util.logging, have a look at SLF4J’s binarysupport for legacy APIs. 
Simple Probabilistic Inverse (SPI) 
Spectral topic modeling algorithms operate on matrices/tensors of word cooccurrence statistics to learn topicspecific word distributions. This approach removes the dependence on the original documents and produces substantial gains in efficiency and provable topic inference, but at a cost: the model can no longer provide information about the topic composition of individual documents. Recently Thresholded Linear Inverse (TLI) is proposed to map the observed words of each document back to its topic composition. However, its linear characteristics limit the inference quality without considering the important prior information over topics. In this paper, we evaluate Simple Probabilistic Inverse (SPI) method and novel Prioraware Dual Decomposition (PADD) that is capable of learning documentspecific topic compositions in parallel. Experiments show that PADD successfully leverages topic correlations as a prior, notably outperforming TLI and learning quality topic compositions comparable to Gibbs sampling on various data. 
Simple Temporal Point Process (SPP) 
A simple temporal point process (SPP) is an important class of time series, where the sample realization of the process is solely composed of the times at which events occur. Particular examples of point process data are neuronal spike patterns or spike trains, and a large number of distance and similarity metrics for those data have been proposed. A marked point process (MPP) is an extension of a simple temporal point process, in which a certain vector valued mark is associated with each of the temporal points in the SPP. Analyses of MPPs are of practical importance because instances of MPPs include recordings of natural disasters such as earthquakes and tornadoes. mmpp 
Simplex Algorithm  In mathematical optimization, Dantzig’s simplex algorithm (or simplex method) is a popular algorithm for linear programming. 
Simplex Model  
Simplified Probabilistic Linear Discriminant Analysis (SPLDA) 

Simplified Shotgun Stochastic Search (S5) 
In p >> n settings, full posterior sampling using existing Markov chain Monte Carlo (MCMC) algorithms is highly inefficient and often not feasible from a practical perspective. To overcome this problem, we propose a scalable stochastic search algorithm that is called the Simplified Shotgun Stochastic Search (S5) and aimed at rapidly explore interesting regions of model space and finding the maximum a posteriori(MAP) model. Also, the S5 provides an approximation of posterior probability of each model (including the marginal inclusion probabilities). BayesS5 
SimpNet  Major winning Convolutional Neural Networks (CNNs), such as VGGNet, ResNet, DenseNet, \etc, include tens to hundreds of millions of parameters, which impose considerable computation and memory overheads. This limits their practical usage in training and optimizing for realworld applications. On the contrary, lightweight architectures, such as SqueezeNet, are being proposed to address this issue. However, they mainly suffer from low accuracy, as they have compromised between the processing power and efficiency. These inefficiencies mostly stem from following an adhoc designing procedure. In this work, we discuss and propose several crucial design principles for an efficient architecture design and elaborate intuitions concerning different aspects of the design procedure. Furthermore, we introduce a new layer called {\it SAFpooling} to improve the generalization power of the network while keeping it simple by choosing best features. Based on such principles, we propose a simple architecture called {\it SimpNet}. We empirically show that SimpNet provides a good tradeoff between the computation/memory efficiency and the accuracy solely based on these primitive but crucial principles. SimpNet outperforms the deeper and more complex architectures such as VGGNet, ResNet, WideResidualNet \etc, on several wellknown benchmarks, while having 2 to 25 times fewer number of parameters and operations. We obtain stateoftheart results (in terms of a balance between the accuracy and the number of involved parameters) on standard datasets, such as CIFAR10, CIFAR100, MNIST and SVHN. The implementations are available at \href{url}{https://…/SimpNet}. 
Simpson’s Paradox  In probability and statistics, Simpson’s paradox, or the YuleSimpson effect, is a paradox in which a trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data. This result is often encountered in socialscience and medicalscience statistics, and is particularly confounding when frequency data are unduly given causal interpretations. Simpson’s Paradox disappears when causal relations are brought into consideration. Many statisticians believe that the mainstream public should be informed of the counterintuitive results in statistics such as Simpson’s paradox. http://…/confounding.html 
SimRank  SimRank is a general similarity measure, based on a simple and intuitive graphtheoretic model. SimRank is applicable in any domain with objecttoobject relationships, that measures similarity of the structural context in which objects occur, based on their relationships with other objects. Effectively, SimRank is a measure that says “two objects are considered to be similar if they are referenced by similar objects.” 
Simulated Annealing (SA) 
Simulated annealing (SA) is a generic probabilistic metaheuristic for the global optimization problem of locating a good approximation to the global optimum of a given function in a large search space. It is often used when the search space is discrete (e.g., all tours that visit a given set of cities). For certain problems, simulated annealing may be more efficient than exhaustive enumeration – provided that the goal is merely to find an acceptably good solution in a fixed amount of time, rather than the best possible solution. 
simultaneous Coherent Structure Coloring (sCSC) 
Existing methods that aim to automatically cluster data into physically meaningful subsets typically require assumptions regarding the number, size, or shape of the coherent subgroups. We present a new method, simultaneous Coherent Structure Coloring (sCSC), which accomplishes the task of unsupervised clustering without a priori guidance regarding the underlying structure of the data. To illustrate the versatility of the method, we apply it to frontier physics problems at vastly different temporal and spatial scales: in a theoretical model of geophysical fluid dynamics, in laboratory measurements of vortex ring formation and entrainment, and in atomistic simulation of the Protein G system. The theoretical flow involves sparse sampling of nonequilibrium dynamics, where this new technique can find and characterize the structures that govern fluid transport using two orders of magnitude less data than required by existing methods. Application of the method to empirical measurements of vortex formation leads to the discovery of a well defined region in which vortex ring entrainment occurs, with potential implications ranging from flow control to cardiovascular diagnostics. Finally, the protein folding example demonstrates a datarich application governed by equilibrium dynamics, where the technique in this manuscript automatically discovers the hierarchy of distinct processes that govern protein folding and clusters protein configurations accordingly. We anticipate straightforward translation to many other fields where existing analysis tools, such as kmeans and traditional hierarchical clustering, require ad hoc assumptions on the data structure or lack the interpretability of the present method. The method is also potentially generalizable to fields where the underlying processes are less accessible, such as genomics and neuroscience. 
Simultaneous MeanVariance Regression  We propose simultaneous meanvariance regression for the linear estimation and approximation of conditional mean functions. In the presence of heteroskedasticity of unknown form, our method accounts for varying dispersion in the regression outcome across the support of conditioning variables by using weights that are jointly determined with mean regression parameters. Simultaneity generates outcome predictions that are guaranteed to improve over ordinary leastsquares prediction error, with corresponding parameter standard errors that are automatically valid. Under shape misspecification of the conditional mean and variance functions, we establish existence and uniqueness of the resulting approximations and characterize their formal interpretation. We illustrate our method with numerical simulations and two empirical applications to the estimation of the relationship between economic prosperity in 1500 and today, and demand for gasoline in the United States. 
Simultaneous Perturbation Stochastic Approximation (SPSA) 
This manuscript presents the following: (1) an improved version of the Binary Simultaneous Perturbation Stochastic Approximation (SPSA) Method for feature selection in machine learning (Aksakalli and Malekipirbazari, Pattern Recognition Letters, Vol. 75, 2016) based on nonmonotone iteration gains computed via the Barzilai and Borwein (BB) method, (2) its adaptation for feature ranking, and (3) comparison against popular methods on public benchmark datasets. The improved method, which we call SPSAFSR, dramatically reduces the number of iterations required for convergence without impacting solution quality. SPSAFSR can be used for feature ranking and feature selection both for classification and regression problems. After a review of the current stateoftheart, we discuss our improvements in detail and present three sets of computational experiments: (1) comparison of SPSAFS as a (wrapper) feature selection method against sequential methods as well as genetic algorithms, (2) comparison of SPSAFS as a feature ranking method in a classification setting against random forest importance, chisquared, and information main methods, and (3) comparison of SPSAFS as a feature ranking method in a regression setting against minimum redundancy maximum relevance (MRMR), RELIEF, and linear correlation methods. The number of features in the datasets we use range from a few dozens to a few thousands. Our results indicate that SPSAFS converges to a good feature set in no more than 100 iterations and therefore it is quite fast for a wrapper method. SPSAFS also outperforms popular feature selection as well as feature ranking methods in majority of test cases, sometimes by a large margin, and it stands as a promising new feature selection and ranking method. 
Simultaneous Validation Over an Organized set of Hypotheses (SVOOSH) 

Since Cosine Crow Search Algorithm (SCCSA) 
This paper presents a novel hybrid algorithm named Since Cosine Crow Search Algorithm. To propose the SCCSA, two novel algorithms are considered including Crow Search Algorithm (CSA) and Since Cosine Algorithm (SCA). The advantages of the two algorithms are considered and utilize to design an efficient hybrid algorithm which can perform significantly better in various benchmark functions. The combination of concept and operators of the two algorithms enable the SCCSA to make an appropriate tradeoff between exploration and exploitation abilities of the algorithm. To evaluate the performance of the proposed SCCSA, seven wellknown benchmark functions are utilized. The results indicated that the proposed hybrid algorithm is able to provide very competitive solution comparing to other stateoftheart meta heuristics. 
SineCosine Algorithm (SCA) 
This paper proposes a novel populationbased optimization algorithm called Sine Cosine Algorithm (SCA) for solving optimization problems. The SCA creates multiple initial random candidate solutions and requires them to fluctuate outwards or towards the best solution using a mathematical model based on sine and cosine functions. Several random and adaptive variables also are integrated to this algorithm to emphasize exploration and exploitation of the search space in different milestones of optimization. The performance of SCA is benchmarked in three test phases. Firstly, a set of wellknown test cases including unimodal, multimodal, and composite functions are employed to test exploration, exploitation, local optima avoidance, and convergence of SCA. Secondly, several performance metrics (search history, trajectory, average fitness of solutions, and the best solution during optimization) are used to qualitatively observe and confirm the performance of SCA on shifted twodimensional test functions. Finally, the crosssection of an aircraft’s wing is optimized by SCA as a real challenging case study to verify and demonstrate the performance of this algorithm in practice. The results of test functions and performance metrics prove that the algorithm proposed is able to explore different regions of a search space, avoid local optima, converge towards the global optimum, and exploit promising regions of a search space during optimization effectively. The SCA algorithm obtains a smooth shape for the airfoil with a very low drag, which demonstrates that this algorithm can highly be effective in solving real problems with constrained and unknown search spaces. Note that the source codes of the SCA algorithm are publicly available at http://…/SCA.html. 
Single Document Summarization  Single document summarization is the task of producing a shorter version of a document while preserving its principal information content. Ranking Sentences for Extractive Summarization with Reinforcement Learning 
Single Hiddenlayer Feedforward Neural Network (SLFN) 
Training and predict functions for Single Hiddenlayer Feedforward Neural Networks (SLFN) using the Extreme Learning Machine (ELM) algorithm. The ELM algorithm differs from the traditional gradientbased algorithms for very short training times (it doesn’t need any iterative tuning, this makes learning time very fast) and there is no need to set any other parameters like learning rate, momentum, epochs, etc. This is a reimplementation of the ‘elmNN’ package using ‘RcppArmadillo’ after the ‘elmNN’ package was archived. For more information, see ‘Extreme learning machine: Theory and applications’ by GuangBin Huang, QinYu Zhu, CheeKheong Siew (2006), Elsevier B.V, <doi:10.1016/j.neucom.2005.12.126>. elmNNRcpp 
Single Image SuperResolution  Deep Learning for Single Image SuperResolution: A Brief Review 
Single Index Latent Variable Models (SILVar) 
A semiparametric, nonlinear regression model in the presence of latent variables is introduced. These latent variables can correspond to unmodeled phenomena or unmeasured agents in a complex networked system. This new formulation allows joint estimation of certain nonlinearities in the system, the direct interactions between measured variables, and the effects of unmodeled elements on the observed system. The particular form of the model is justified, and learning is posed as a regularized maximum likelihood estimation. This leads to classes of structured convex optimization problems with a ‘sparse plus lowrank’ flavor. Relations between the proposed model and several common model paradigms, such as those of Robust Principal Component Analysis (PCA) and Vector Autoregression (VAR), are established. Particularly in the VAR setting, the lowrank contributions can come from broad trends exhibited in the time series. Details of the algorithm for learning the model are presented. Experiments demonstrate the performance of the model and the estimation algorithm on simulated and real data. 
Single Shot Multibox Detetor (SSD) 
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. Our SSD model is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stage and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, MS COCO, and ILSVRC datasets confirm that SSD has comparable accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. Compared to other single stage methods, SSD has much better accuracy, even with a smaller input image size. For input, SSD achieves 72.1% mAP on VOC2007 test at 58 FPS on a Nvidia Titan X and for input, SSD achieves 75.1% mAP, outperforming a comparable state of the art Faster RCNN model. Code is available at this https URL . 
SingleLinkage Clustering  Singlelinkage clustering is one of several methods of agglomerative hierarchical clustering. In the beginning of the process, each element is in a cluster of its own. The clusters are then sequentially combined into larger clusters, until all elements end up being in the same cluster. At each step, the two clusters separated by the shortest distance are combined. The definition of ‘shortest distance’ is what differentiates between the different agglomerative clustering methods. In singlelinkage clustering, the link between two clusters is made by a single element pair, namely those two elements (one in each cluster) that are closest to each other. The shortest of these links that remains at any step causes the fusion of the two clusters whose elements are involved. The method is also known as nearest neighbour clustering. The result of the clustering can be visualized as a dendrogram, which shows the sequence of cluster fusion and the distance at which each fusion took place. 
Singular Spectrum Analysis (SSA) 
In time series analysis, singular spectrum analysis (SSA) is a nonparametric spectral estimation method. It combines elements of classical time series analysis, multivariate statistics, multivariate geometry, dynamical systems and signal processing. Its roots lie in the classical Karhunen (1946)Loève (1945, 1978) spectral decomposition of time series and random fields and in the Mañé (1981)Takens (1981) embedding theorem. SSA can be an aid in the decomposition of time series into a sum of components, each having a meaningful interpretation. The name “singular spectrum analysis” relates to the spectrum of eigenvalues in a singular value decomposition of a covariance matrix, and not directly to a frequency domain decomposition. Rssa,spectral.methods 
Singular Value Decomposition (SVD) 
In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix, with many useful applications in signal processing and statistics. 
Singular Vector Canonical Correlation Analysis (SVCCA) 
We introduce a technique based on the singular vector canonical correlation analysis (SVCCA) for measuring the generality of neural network layers across a continuouslyparametrized set of tasks. We illustrate this method by studying generality in neural networks trained to solve parametrized boundary value problems based on the Poisson partial differential equation. We find that the first hidden layer is general, and that deeper layers are successively more specific. Next, we validate our method against an existing technique that measures layer generality using transfer learning experiments. We find excellent agreement between the two methods, and note that our method is much faster, particularly for continuouslyparametrized problems. Finally, we visualize the general representations of the first layers, and interpret them as generalized coordinates over the input domain. 
Singularity  Singularity is a container platform focused on supporting ‘Mobility of Compute’. Mobility of Compute encapsulates the development to compute model where developers can work in an environment of their choosing and creation, and when the developer needs additional compute resources, this environment can easily be copied and executed on other platforms. Additionally, as the primary use case for Singularity is targeted towards computational portability. Many of the barriers to entry of other container solutions do not apply to Singularity, making it an ideal solution for users (both computational and noncomputational) and HPC centers. 
Site Reliability Engineering (SRE) 
Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies that to IT operations problems. The main goals are to create ultrascalable and highly reliable software systems. According to Ben Treynor, founder of Google’s Site Reliability Team, SRE is ‘what happens when a software engineer is tasked with what used to be called operations.’ Site Reliability Engineering was created at Google around 2003 when Ben Treynor was hired to lead a team of seven software engineers to run a production environment. The team was tasked to make Google’s sites run smoothly, efficiently, and more reliably. Early on, Google’s largescale systems required the company to come up with new paradigms on how to manage such large systems and at the same time introduce new features continuously but at a very highquality end user experience. The SRE footprint at Google is now larger than 1500 engineers. Many products have small to medium sized SRE teams supporting them, though by far not all products have SREs. The SRE processes that have been honed over the years are being used by other, mainly large scale, companies that are also starting to implement this paradigm. ServiceNow, Microsoft, Apple, Twitter, Facebook, Dropbox, Amazon, Target, Dell Technologies, IBM, Xero, Oracle, Zalando, Acquia, VMware and GitHub have all put together SRE teams. 
Skellam Distribution  The Skellam distribution is the discrete probability distribution of the difference n_1n_2 of two statistically independent random variables N_1 and N_2 each having Poisson distributions with different expected values \mu_1 and \mu_2. It is useful in describing the statistics of the difference of two images with simple photon noise, as well as describing the point spread distribution in sports where all scored points are equal, such as baseball, hockey and soccer. The distribution is also applicable to a special case of the difference of dependent Poisson random variables, but just the obvious case where the two variables have a common additive random contribution which is cancelled by the differencing: see Karlis & Ntzoufras (2003) for details and an application. skellam 
Sketch, Shingle, & Hashing (SSH) 
Similarity search on time series is a frequent operation in largescale datadriven applications. Sophisticated similarity measures are standard for time series matching, as they are usually misaligned. Dynamic Time Warping or DTW is the most widely used similarity measure for time series because it combines alignment and matching at the same time. However, the alignment makes DTW slow. To speed up the expensive similarity search with DTW, branch and bound based pruning strategies are adopted. However, branch and bound based pruning are only useful for very short queries (low dimensional time series), and the bounds are quite weak for longer queries. Due to the loose bounds branch and bound pruning strategy boils down to a bruteforce search. To circumvent this issue, we design SSH (Sketch, Shingle, & Hashing), an efficient and approximate hashing scheme which is much faster than the stateoftheart branch and bound searching technique: the UCR suite. SSH uses a novel combination of sketching, shingling and hashing techniques to produce (probabilistic) indexes which align (near perfectly) with DTW similarity measure. The generated indexes are then used to create hash buckets for sublinear search. Our results show that SSH is very effective for longer time sequence and prunes around 95% candidates, leading to the massive speedup in search with DTW. Empirical results on two largescale benchmark time series data show that our proposed method can be around 20 times faster than the stateoftheart package (UCR suite) without any significant loss in accuracy. 
Sketched Subspace Clustering (SketchSC) 
The immense amount of daily generated and communicated data presents unique challenges in their processing. Clustering, the grouping of data without the presence of groundtruth labels, is an important tool for drawing inferences from data. Subspace clustering (SC) is a relatively recent method that is able to successfully classify nonlinearly separable data in a multitude of settings. In spite of their high clustering accuracy, SC methods incur prohibitively high computational complexity when processing large volumes of highdimensional data. Inspired by random sketching approaches for dimensionality reduction, the present paper introduces a randomized scheme for SC, termed SketchSC, tailored for large volumes of highdimensional data. SketchSC accelerates the computationally heavy parts of stateoftheart SC approaches by compressing the data matrix across both dimensions using random projections, thus enabling fast and accurate largescale SC. Performance analysis as well as extensive numerical tests on real data corroborate the potential of SketchSC and its competitive performance relative to stateoftheart scalable SC approaches. 
Skew Logistic Distribution  A random variable X is said to have Azzalini’s skewlogistic distribution if its pdf is f(x)=2g(x)G(lambda*x), where g(·) and G(·), respectively, denote the pdf and cdf of the logistic distribution. glogis,sld 
Skill2vec  Unsupervise learned word embeddings have seen tremendous success in numerous Natural Language Processing (NLP) tasks in recent years. The main contribution of this paper is to develop a technique called Skill2vec, which applies machine learning techniques in recruitment to enhance the search strategy to find the candidates who possess the right skills. Skill2vec is a neural network architecture which inspired by Word2vec, developed by Mikolov et al. in 2013, to transform a skill to a new vector space. This vector space has the characteristics of calculation and present their relationship. We conducted an experiment using AB testing in a recruitment company to demonstrate the effectiveness of our approach. 
Skilled Experience Catalogue (SEC) 
In this paper, we introduce a skillbalancing mechanism for adversarial nonplayer characters (NPCs), called Skilled Experience Catalogue (SEC). The objective of this mechanism is to approximately match the skill level of an NPC to an opponent in realtime. We test the technique in the context of a FirstPerson Shooter (FPS) game. Specifically, the technique adjusts a reinforcement learning NPC’s proficiency with a weapon based on its current performance against an opponent. Firstly, a catalogue of experience, in the form of stored learning policies, is built up by playing a series of training games. Once the NPC has been sufficiently trained, the catalogue acts as a timeline of experience with incremental knowledge milestones in the form of stored learning policies. If the NPC is performing poorly, it can jump to a later stage in the learning timeline to be equipped with more informed decisionmaking. Likewise, if it is performing significantly better than the opponent, it will jump to an earlier stage. The NPC continues to learn in realtime using reinforcement learning but its policy is adjusted, as required, by loading the most suitable milestones for the current circumstances. 
SkipGram Model  A technique where by ngrams are still stored to model language, but they allow for tokens to be skipped. 
Sklar’s Omega  The statistical measurement of agreement is important in a number of fields, e.g., content analysis, education, computational linguistics, biomedical imaging. We propose Sklar’s Omega, a Gaussian copulabased framework for measuring intracoder, intercoder, and intermethod agreement as well as agreement relative to a gold standard. We demonstrate the efficacy and advantages of our approach by applying it to both simulated and experimentally observed datasets, including data from two medical imaging studies. Application of our proposed methodology is supported by our opensource R package, sklarsomega, which is available for download from the Comprehensive R Archive Network. 
SLAQ  Training machine learning (ML) models with large datasets can incur significant resource contention on shared clusters. This training typically involves many iterations that continually improve the quality of the model. Yet in exploratory settings, better models can be obtained faster by directing resources to jobs with the most potential for improvement. We describe SLAQ, a cluster scheduling system for approximate ML training jobs that aims to maximize the overall job quality. When allocating cluster resources, SLAQ explores the qualityruntime tradeoffs across multiple jobs to maximize systemwide quality improvement. To do so, SLAQ leverages the iterative nature of ML training algorithms, by collecting quality and resource usage information from concurrent jobs, and then generating highlytailored qualityimprovement predictions for future iterations. Experiments show that SLAQ achieves an average quality improvement of up to 73% and an average delay reduction of up to 44% on a large set of ML training jobs, compared to resource fairness schedulers. 
Slate Markov Decision Processes (slateMDP) 
Many realworld problems come with action spaces represented as feature vectors. Although highdimensional control is a largely unsolved problem, there has recently been progress for modest dimensionalities. Here we report on a successful attempt at addressing problems of dimensionality as high as $2000$, of a particular form. Motivated by important applications such as recommendation systems that do not fit the standard reinforcement learning frameworks, we introduce Slate Markov Decision Processes (slateMDPs). A SlateMDP is an MDP with a combinatorial action space consisting of slates (tuples) of primitive actions of which one is executed in an underlying MDP. The agent does not control the choice of this executed action and the action might not even be from the slate, e.g., for recommendation systems for which all recommendations can be ignored. We use deep Qlearning based on feature representations of both the state and action to learn the value of whole slates. Unlike existing methods, we optimize for both the combinatorial and sequential aspects of our tasks. The new agent’s superiority over agents that either ignore the combinatorial or sequential longterm value aspect is demonstrated on a range of environments with dynamics from a realworld recommendation system. Further, we use deep deterministic policy gradients to learn a policy that for each position of the slate, guides attention towards the part of the action space in which the value is the highest and we only evaluate actions in this area. The attention is used within a sequentially greedy procedure leveraging submodularity. Finally, we show how introducing riskseeking can dramatically imporve the agents performance and ability to discover more far reaching strategies. 
Slice Finder  As machine learning (ML) systems become democratized, it becomes increasingly important to help users easily debug their models. However, current data tools are still primitive when it comes to helping users trace model performance problems all the way to the data. We focus on the particular problem of slicing data to identify subsets of the validation data where the model performs poorly. This is an important problem in model validation because the overall model performance can fail to reflect that of the smaller subsets, and slicing allows users to analyze the model performance on a more granularlevel. Unlike general techniques (e.g., clustering) that can find arbitrary slices, our goal is to find interpretable slices (which are easier to take action compared to arbitrary subsets) that are problematic and large. We propose Slice Finder, which is an interactive framework for identifying such slices using statistical techniques. Applications include diagnosing model fairness and fraud detection, where identifying slices that are interpretable to humans is crucial. 
Sliced Inverse Regression (SIR) 
Sliced inverse regression (SIR) is a tool for dimension reduction in the field of multivariate statistics. In statistics, regression analysis is a popular way of studying the relationship between a response variable y and its explanatory variable x _ {\displaystyle {\underline {x}}} {\underline {x}}, which is a pdimensional vector. There are several approaches which come under the term of regression. For example parametric methods include multiple linear regression; nonparametric techniques include local smoothing. With highdimensional data (as p grows), the number of observations needed to use local smoothing methods escalates exponentially. Reducing the number of dimensions makes the operation computable. Dimension reduction aims to show only the most important directions of the data. SIR uses the inverse regression curve, E ( x _  y ) {\displaystyle E({\underline {x}}\,\,y)} E({\underline {x}}\,\,y) to perform a weighted principal component analysis, with which one identifies the effective dimension reducing directions. Sliced Inverse Regression for Dimension Reduction 
Sliced Recurrent Neural Network (SRNN) 
Recurrent neural networks have achieved great success in many NLP tasks. However, they have difficulty in parallelization because of the recurrent structure, so it takes much time to train RNNs. In this paper, we introduce sliced recurrent neural networks (SRNNs), which could be parallelized by slicing the sequences into many subsequences. SRNNs have the ability to obtain highlevel information through multiple layers with few extra parameters. We prove that the standard RNN is a special case of the SRNN when we use linear activation functions. Without changing the recurrent units, SRNNs are 136 times as fast as standard RNNs and could be even faster when we train longer sequences. Experiments on six largescale sentiment analysis datasets show that SRNNs achieve better performance than standard RNNs. 
Sliced Wasserstein Distance  Generative Modeling using the Sliced Wasserstein Distance 
SlicedWasserstein Autoencoder (SWAE) 
In this paper we study generative modeling via autoencoders while using the elegant geometric properties of the optimal transport (OT) problem and the Wasserstein distances. We introduce SlicedWasserstein Autoencoders (SWAE), which are generative models that enable one to shape the distribution of the latent space into any samplable probability distribution without the need for training an adversarial network or defining a closedform for the distribution. In short, we regularize the autoencoder loss with the slicedWasserstein distance between the distribution of the encoded training samples and a predefined samplable distribution. We show that the proposed formulation has an efficient numerical solution that provides similar capabilities to Wasserstein Autoencoders (WAE) and Variational Autoencoders (VAE), while benefiting from an embarrassingly simple implementation. 
SlideNet  This work tackles the automatic finegrained slide quality assessment problem for digitized direct smears test using the Gram staining protocol. Automatic quality assessment can provide useful information for the pathologists and the whole digital pathology workflow. For instance, if the system found a slide to have a low staining quality, it could send a request to the automatic slide preparation system to remake the slide. If the system detects severe damage in the slides, it could notify the experts that manual microscope reading may be required. In order to address the quality assessment problem, we propose a deep neural network based framework to automatically assess the slide quality in a semantic way. Specifically, the first step of our framework is to perform dense finegrained region classification on the whole slide and calculate the region distribution histogram. Next, our framework will generate assessments of the slide quality from various perspectives: staining quality, information density, damage level and which regions are more valuable for subsequent highmagnification analysis. To make the information more accessible, we present our results in the form of a heat map and text summaries. Additionally, in order to stimulate research in this direction, we propose a novel dataset for slide quality assessment. Experiments show that the proposed framework outperforms recent related works. 
Sliding Convolutional Attention Network (SCAN) 
Scene text recognition has drawn great attentions in the community of computer vision and artificial intelligence due to its challenges and wide applications. Stateoftheart recurrent neural networks (RNN) based models map an input sequence to a variable length output sequence, but are usually applied in a black box manner and lack of transparency for further improvement, and the maintaining of the entire past hidden states prevents parallel computation in a sequence. In this paper, we investigate the intrinsic characteristics of text recognition, and inspired by human cognition mechanisms in reading texts, we propose a scene text recognition method with sliding convolutional attention network (SCAN). Similar to the eye movement during reading, the process of SCAN can be viewed as an alternation between saccades and visual fixations. Compared to the previous recurrent models, computations over all elements of SCAN can be fully parallelized during training. Experimental results on several challenging benchmarks, including the IIIT5k, SVT and ICDAR 2003/2013 datasets, demonstrate the superiority of SCAN over stateoftheart methods in terms of both the model interpretability and performance. 
Sliding Line Point Regression (SLPR) 
Traditional text detection methods mostly focus on quadrangle text. In this study we propose a novel method named sliding line point regression (SLPR) in order to detect arbitraryshape text in natural scene. SLPR regresses multiple points on the edge of text line and then utilizes these points to sketch the outlines of the text. The proposed SLPR can be adapted to many object detection architectures such as Faster RCNN and RFCN. Specifically, we first generate the smallest rectangular box including the text with region proposal network (RPN), then isometrically regress the points on the edge of text by using the vertically and horizontally sliding lines. To make full use of information and reduce redundancy, we calculate xcoordinate or ycoordinate of target point by the rectangular box position, and just regress the remaining ycoordinate or xcoordinate. Accordingly we can not only reduce the parameters of system, but also restrain the points which will generate more regular polygon. Our approach achieved competitive results on traditional ICDAR2015 Incidental Scene Text benchmark and curve text detection dataset CTW1500. 
Sliding Suffix Tree  We consider a sliding window over a stream of characters from some finite alphabet. The user wants to perform deterministic substring matching on the current sliding window content and obtain positions of the matches. We present an indexed version of the sliding window based on a suffix tree. The data structure has optimal time queries $\Theta(m+occ)$ and amortized constant time updates, where $m$ is the length of the query string and $occ$ the number of occurrences. 
Sliding Window Discrete Fourier Transform (SWDFT) 
This paper introduces a new tool for timeseries analysis: the Sliding Window Discrete Fourier Transform (SWDFT). The SWDFT is especially useful for timeseries with local intime periodic components. We define a 5parameter model for noiseless local periodic signals, then study the SWDFT of this model. Our study illustrates several key concepts crucial to analyzing timeseries with the SWDFT, in particular Aliasing, Leakage, and Ringing. We also show how these ideas extend to R > 1 local periodic components, using the linearity property of the Fourier transform. Next, we propose a simple procedure for estimating the 5 parameters of our local periodic signal model using the SWDFT. Our estimation procedure speeds up computation by using a trigonometric identity that linearizes estimation of 2 of the 5 parameters. We conclude with a very small Monte Carlo simulation study of our estimation procedure under different levels of noise. swdft 
SLIM  Model interpretability is an increasingly important component of practical machine learning. Some of the most common forms of interpretability systems are examplebased, local, and global explanations. One of the main challenges in interpretability is designing explanation systems that can capture aspects of each of these explanation types, in order to develop a more thorough understanding of the model. We address this challenge in a novel model called SLIM that uses local linear modeling techniques along with a dual interpretation of random forests (both as a supervised neighborhood approach and as a feature selection method). SLIM has two fundamental advantages over existing interpretability systems. First, while it is effective as a blackbox explanation system, SLIM itself is a highly accurate predictive model that provides faithful self explanations, and thus sidesteps the typical accuracyinterpretability tradeoff. Second, SLIM provides both example based and local explanations and can detect global patterns, which allows it to diagnose limitations in its local explanations. 
SlimNet  Deep neural networks have achieved increasingly accurate results on a wide variety of complex tasks. However, much of this improvement is due to the growing use and availability of computational resources (e.g use of GPUs, more layers, more parameters, etc). Most stateoftheart deep networks, despite performing well, overparameterize approximate functions and take a significant amount of time to train. With increased focus on deploying deep neural networks on resource constrained devices like smart phones, there has been a push to evaluate why these models are so resource hungry and how they can be made more efficient. This work evaluates and compares three distinct methods for deep model compression and acceleration: weight pruning, low rank factorization, and knowledge distillation. Comparisons on VGG nets trained on CIFAR10 show that each of the models on their own are effective, but that the true power lies in combining them. We show that by combining pruning and knowledge distillation methods we can create a compressed network 85 times smaller than the original, all while retaining 96% of the original model’s accuracy. 
SLING  SLING, an experimental system for parsing natural language text directly into a representation of its meaning as a semantic frame graph. The output frame graph directly captures the semantic annotations of interest to the user, while avoiding the pitfalls of pipelined systems by not running any intermediate stages, additionally preventing unnecessary computation. SLING uses a specialpurpose recurrent neural network model to compute the output representation of input text through incremental editing operations on the frame graph. The frame graph, in turn, is flexible enough to capture many semantic tasks of interest (more on this below). SLING’s parser is trained using only the input words, bypassing the need for producing any intermediate annotations (e.g. dependency parses). 
Slopegraphs  An overview of Edward Tufte’s “slopegraphs”; their history; good and bad examples; when to use slopegraphs; slopegraph best practices. (from Charlie Park) 
Slow Feature Analysis (SFA) 
Slow feature analysis (SFA) is an unsupervised learning algorithm for extracting slowly varying features from a quickly varying input signal. It has been successfully applied, e.g., to the selforganization of complexcell receptive fields, the recognition of whole objects invariant to spatial transformations, the selforganization of placecells, extraction of driving forces, and to nonlinear blind source separation. Theoretical Analysis of the Optimal Free Responses of GraphBased SFA for the Design of Training Graphs 
Slow Intelligence System (SIS) 
In this talk I will introduce the concept of slow intelligence. Not all intelligent systems have fast intelligence. There are a surprisingly large number of intelligent systems, quasiintelligent systems and semiintelligent systems that have slow intelligence. Such slow intelligence systems are often neglected in mainstream research on intelligent systems, but they are really worthy of our attention and emulation. I will discuss the general characteristics of slow intelligence systems and then concentrate on evolutionary query processing for distributed multimedia systems as an example of artificial slow intelligence systems. 
Sluice Networks  Multitask learning is partly motivated by the observation that humans bring to bear what they know about related problems when solving new ones. Similarly, deep neural networks can profit from related tasks by sharing parameters with other networks. However, humans do not consciously decide to transfer knowledge between tasks (and are typically not aware of the transfer). In machine learning, it is hard to estimate if sharing will lead to improvements; especially if tasks are only loosely related. To overcome this, we introduce Sluice Networks, a general framework for multitask learning where trainable parameters control the amount of sharing — including which parts of the models to share. Our framework goes beyond and generalizes over previous proposals in enabling hard or soft sharing of all combinations of subspaces, layers, and skip connections. We perform experiments on three task pairs from natural language processing, and across seven different domains, using data from OntoNotes 5.0, and achieve up to 15% average error reductions over common approaches to multitask learning. We analyze when the architecture is particularly helpful, as well as its ability to fit noise. We show that a) label entropy is predictive of gains in sluice networks, confirming findings for hard parameter sharing, and b) while sluice networks easily fit noise, they are robust across domains in practice. 
Small Area Estimation (SAE) 
Small area estimation is any of several statistical techniques involving the estimation of parameters for small subpopulations, generally used when the subpopulation of interest is included in a larger survey. The term ‘small area’ in this context generally refers to a small geographical area such as a county. It may also refer to a ‘small domain’, i.e. a particular demographic within an area. If a survey has been carried out for the population as a whole (for example, a nation or statewide survey), the sample size within any particular small area may be too small to generate accurate estimates from the data. To deal with this problem, it may be possible to use additional data (such as census records) that exists for these small areas in order to obtain estimates. http://…areaestimation101oldmaterialsposted saeSim,sae2,sae 
Small Sample Learning (SSL) 
As a promising area in artificial intelligence, a new learning paradigm, called Small Sample Learning (SSL), has been attracting prominent research attention in the recent years. In this paper, we aim to present a survey to comprehensively introduce the current techniques proposed on this topic. Specifically, current SSL techniques can be mainly divided into two categories. The first category of SSL approaches can be called ‘concept learning’, which emphasizes learning new concepts from only few related observations. The purpose is mainly to simulate human learning behaviors like recognition, generation, imagination, synthesis and analysis. The second category is called ‘experience learning’, which usually coexists with the large sample learning manner of conventional machine learning. This category mainly focuses on learning with insufficient samples, and can also be called small data learning in some literatures. More extensive surveys on both categories of SSL techniques are introduced and some neuroscience evidences are provided to clarify the rationality of the entire SSL regime, and the relationship with human learning process. Some discussions on the main challenges and possible future research directions along this line are also presented. 
Smallify  As neural networks become widely deployed in different applications and on different hardware, it has become increasingly important to optimize inference time and model size along with model accuracy. Most current techniques optimize model size, model accuracy and inference time in different stages, resulting in suboptimal results and computational inefficiency. In this work, we propose a new technique called Smallify that optimizes all three of these metrics at the same time. Specifically we present a new method to simultaneously optimize network size and model performance by neuronlevel pruning during training. Neuronlevel pruning not only produces much smaller networks but also produces dense weight matrices that are amenable to efficient inference. By applying our technique to convolutional as well as fully connected models, we show that Smallify can reduce network size by 35X with a 6X improvement in inference time with similar accuracy as models found by traditional training techniques. 
Smart Data  
Smart Mining for Deep Metric Learning  To solve deep metric learning problems and producing feature embeddings, current methodologies will commonly use a triplet model to minimise the relative distance between samples from the same class and maximise the relative distance between samples from different classes. Though successful, the training convergence of this triplet model can be compromised by the fact that the vast majority of the training samples will produce gradients with magnitudes that are close to zero. This issue has motivated the development of methods that explore the global structure of the embedding and other methods that explore hard negative/positive mining. The effectiveness of such mining methods is often associated with intractable computational requirements. In this paper, we propose a novel deep metric learning method that combines the triplet model and the global structure of the embedding space. We rely on a smart mining procedure that produces effective training samples for a low computational cost. In addition, we propose an adaptive controller that automatically adjusts the smart mining hyperparameters and speeds up the convergence of the training process. We show empirically that our proposed method allows for fast and more accurate training of triplet ConvNets than other competing mining methods. Additionally, we show that our method achieves new stateoftheart embedding results for CUB2002011 and Cars196 datasets. 
Smart ‘Predict, then Optimize’ (SPO) 
Many realworld analytics problems involve two significant challenges: prediction and optimization. Due to the typically complex nature of each challenge, the standard paradigm is to predict, then optimize. By and large, machine learning tools are intended to minimize prediction error and do not account for how the predictions will be used in a downstream optimization problem. In contrast, we propose a new and very general framework, called Smart ‘Predict, then Optimize’ (SPO), which directly leverages the optimization problem structure, i.e., its objective and constraints, for designing successful analytics tools. A key component of our framework is the SPO loss function, which measures the quality of a prediction by comparing the objective values of the solutions generated using the predicted and observed parameters, respectively. Training a model with respect to the SPO loss is computationally challenging, and therefore we also develop a surrogate loss function, called the SPO+ loss, which upper bounds the SPO loss, has desirable convexity properties, and is statistically consistent under mild conditions. We also propose a stochastic gradient descent algorithm which allows for situations in which the number of training samples is large, model regularization is desired, and/or the optimization problem of interest is nonlinear or integer. Finally, we perform computational experiments to empirically verify the success of our SPO framework in comparison to the standard predictthenoptimize approach. 
Smart System  Smart systems incorporate functions of sensing, actuation, and control in order to describe and analyze a situation, and make decisions based on the available data in a predictive or adaptive manner, thereby performing smart actions. In most cases the ‘smartness’ of the system can be attributed to autonomous operation based on closed loop control, energy efficiency, and networking capabilities. 
SMarTplan  Smart factories are on the verge of becoming the new industrial paradigm, wherein optimization permeates all aspects of production, from concept generation to sales. To fully pursue this paradigm, flexibility in the production means as well as in their timely organization is of paramount importance. AI is planning a major role in this transition, but the scenarios encountered in practice might be challenging for current tools. Task planning is one example where AI enables more efficient and flexible operation through an online automated adaptation and rescheduling of the activities to cope with new operational constraints and demands. In this paper we present SMarTplan, a task planner specifically conceived to deal with realworld scenarios in the emerging smart factory paradigm. Including both specialpurpose and generalpurpose algorithms, SMarTplan is based on current automated reasoning technology and it is designed to tackle complex application domains. In particular, we show its effectiveness on a logistic scenario, by comparing its specialized version with the general purpose one, and extending the comparison to other stateoftheart task planners. 
SmartTable  We introduce SmartTable, an online spreadsheet application that is equipped with intelligent assistance capabilities. With a focus on relational tables, describing entities along with their attributes, we offer assistance in two flavors: (i) for populating the table with additional entities (rows) and (ii) for extending it with additional entity attributes (columns). We provide details of our implementation, which is also released as open source. The application is available at http://smarttable.cc. 
Smooth Additive Quantile Regression Model  qgam 
Smooth Imitation Learning (SIL) 
In Smooth Imitation Learning for online sequence prediction is the goal is to train a policy that can smoothly imitate demonstrated behavior in a dynamic and continuous environment in response to online, sequential context input. 
Smooth Neighbors on Teacher Graph (SNTG) 
The paper proposes an inductive semisupervised learning method, called Smooth Neighbors on Teacher Graphs (SNTG). At each iteration during training, a graph is dynamically constructed based on predictions of the teacher model, i.e., the implicit selfensemble of models. Then the graph serves as a similarity measure with respect to which the representations of ‘similar’ neighboring points are learned to be smooth on the low dimensional manifold. We achieve stateoftheart results on semisupervised learning benchmarks. The error rates are 9.89%, 3.99% for CIFAR10 with 4000 labels, SVHN with 500 labels, respectively. In particular, the improvements are significant when the labels are scarce. For nonaugmented MNIST with only 20 labels, the error rate is reduced from previous 4.81% to 1.36%. Our method is also effective under noisy supervision and shows robustness to incorrect labels. 
Smoothly Clipped Absolute Deviation (SCAD) 
Variable selection is fundamental to highdimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on (0,inf), and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions. http://…/SCAD_Documentation.pdf 
SMTM  Manually labeling documents is tedious and expensive, but it is essential for training a traditional text classifier. In recent years, a few dataless text classification techniques have been proposed to address this problem. However, existing works mainly center on singlelabel classification problems, that is, each document is restricted to belonging to a single category. In this paper, we propose a novel Seedguided Multilabel Topic Model, named SMTM. With a few seed words relevant to each category, SMTM conducts multilabel classification for a collection of documents without any labeled document. In SMTM, each category is associated with a single categorytopic which covers the meaning of the category. To accommodate with multilabeled documents, we explicitly model the category sparsity in SMTM by using spike and slab prior and weak smoothing prior. That is, without using any threshold tuning, SMTM automatically selects the relevant categories for each document. To incorporate the supervision of the seed words, we propose a seedguided biased GPU (i.e., generalized Polya urn) sampling procedure to guide the topic inference of SMTM. Experiments on two public datasets show that SMTM achieves better classification accuracy than stateoftheart alternatives and even outperforms supervised solutions in some scenarios. 
Snake  A regularized optimization problem over a large unstructured graph is studied, where the regularization term is tied to the graph geometry. Typical regularization examples include the total variation and the Laplacian regularizations over the graph. When applying the proximal gradient algorithm to solve this problem, there exist quite affordable methods to implement the proximity operator (backward step) in the special case where the graph is a simple path without loops. In this paper, an algorithm, referred to as ‘Snake’, is proposed to solve such regularized problems over general graphs, by taking benefit of these fast methods. The algorithm consists in properly selecting random simple paths in the graph and performing the proximal gradient algorithm over these simple paths. This algorithm is an instance of a new general stochastic proximal gradient algorithm, whose convergence is proven. Applications to trend filtering and graph inpainting are provided among others. Numerical experiments are conducted over large graphs. 
Snap Machine Learning (Snap ML) 
We describe an efficient, scalable machine learning library that enables very fast training of generalized linear models. We demonstrate that our library can remove the training time as a bottleneck for machine learning workloads, opening the door to a range of new applications. For instance, it allows more agile development, faster and more finegrained exploration of the hyperparameter space, enables scaling to massive datasets and makes frequent retraining of models possible in order to adapt to events as they occur. Our library, named Snap Machine Learning (Snap ML), combines recent advances in machine learning systems and algorithms in a nested manner to reflect the hierarchical architecture of modern distributed systems. This allows us to effectively leverage available network, memory and heterogeneous compute resources. On a terabytescale publicly available dataset for clickthroughrate prediction in computational advertising, we demonstrate the training of a logistic regression classifier in 1.53 minutes, a 46x improvement over the fastest reported performance. 
Snapshot Ensembles  Ensembles of neural networks are known to be much more robust and accurate than individual networks. However, training multiple deep networks for model averaging is computationally expensive. In this paper, we propose a method to obtain the seemingly contradictory goal of ensembling multiple neural networks at no additional training cost. We achieve this goal by training a single neural network, converging to several local minima along its optimization path and saving the model parameters. To obtain repeated rapid convergence, we leverage recent work on cyclic learning rate schedules. The resulting technique, which we refer to as Snapshot Ensembling, is simple, yet surprisingly effective. We show in a series of experiments that our approach is compatible with diverse network architectures and learning tasks. It consistently yields lower error rates than stateoftheart single models at no additional training cost, and compares favorably with traditional network ensembles. On CIFAR10 and CIFAR100 our DenseNet Snapshot Ensembles obtain error rates of 3.4% and 17.4% respectively. Snapshot Ensembles in Keras 
SneakPeek  Nowadays, eye tracking is the most used technology to detect areas of interest. This kind of technology requires specialized equipment recording user’s eyes. In this paper, we propose SneakPeek, a different approach to detect areas of interest on images displayed in web pages based on the zooming and panning actions of the users through the image. We have validated our proposed solution with a group of test subjects that have performed a test in our online prototype. Being this the first iteration of the algorithm, we have found both good and bad results, depending on the type of image. In specific, SneakPeek works best with medium/big objects in medium/big sized images. The reason behind it is the limitation on detection when smartphone screens keep getting bigger and bigger. SneakPeek can be adapted to any website by simply adapting the controller interface for the specific case. 
SNIPER  We present SNIPER, an algorithm for performing efficient multiscale training in instance level visual recognition tasks. Instead of processing every pixel in an image pyramid, SNIPER processes context regions around groundtruth instances (referred to as chips) at the appropriate scale. For background sampling, these contextregions are generated using proposals extracted from a region proposal network trained with a short learning schedule. Hence, the number of chips generated per image during training adaptively changes based on the scene complexity. SNIPER only processes 30% more pixels compared to the commonly used single scale training at 800×1333 pixels on the COCO dataset. But, it also observes samples from extreme resolutions of the image pyramid, like 1400×2000 pixels. As SNIPER operates on resampled low resolution chips (512×512 pixels), it can have a batch size as large as 20 on a single GPU even with a ResNet101 backbone. Therefore it can benefit from batchnormalization during training without the need for synchronizing batchnormalization statistics across GPUs. SNIPER brings training of instance level recognition tasks like object detection closer to the protocol for image classification and suggests that the commonly accepted guideline that it is important to train on high resolution images for instance level visual recognition tasks might not be correct. Our implementation based on FasterRCNN with a ResNet101 backbone obtains an mAP of 47.6% on the COCO dataset for bounding box detection and can process 5 images per second with a single GPU. 
Snoogle  Embedding small devices into everyday objects like toasters and coffee mugs creates a wireless network of objects. These embedded devices can contain a description of the underlying objects, or other user defined information. In this paper, we present Snoogle, a search engine for such a network. A user can query Snoogle to find a particular mobile object, or a list of objects that fit the description. Snoogle uses information retrieval techniques to index information and process user queries, and Bloom filters to reduce communication overhead. Security and privacy protections are also engineered into Snoogle to protect sensitive information. We have implemented a prototype of Snoogle using offtheshelf sensor motes, and conducted extensive experiments to evaluate the system performance. 
Snoogle  Snoogle is a graphical, SWRLbased ontology mapper to assist in the task of OWL ontology alignment. It allows users to visualize ontologies and then draw mappings from one to another on a graphical canvas. Users draw mappings as they see them in their head, and then Snoogle turns these mappings into SWRL/RDF or SWRL/XML for use in a knowledge base. 
Snowdoop  Hadoop made it convenient to process data in very large distributed databases, and also convenient to create them, using the Hadoop Distributed File System. But eventually word got out that Hadoop is slow, and very limited in available data operations. Both of those shortcomings are addressed to a large extent by the new kid on the block, Spark. Spark is apparently much faster than Hadoop, sometimes dramatically so, due to strong caching ability and a wider variety of available operations. But even Spark su ers a very practical problem, shared by the others mentioned above: All of these systems are complicated. There is a considerable amount of con guration to do, worsened by dependence on infrastructure software such as Java or MPI, and in some cases by interface software such as rJava. Some of this requires systems knowledge that many R users may lack. And once they do get these systems set up, they may be required to design algorithms with world views quite different from R, even though technically they are coding in R. So, do we really need all that complicated machinery? Hadoop and Spark provide e cient dis tributed sort operations, but if one’s application does not depend on sorting, we have a costbene t issue here. Here is an alternative, more of a general approach rather than a package, which I call ‘Snowdoop.’ (The name alludes to the fact that it uses the section of the parallel package derived from the old snow package.) The idea is to retain the notion of chunking les into distributed minifiles, but (a) do this on one’s own, and (b) the process those les using ordinary R code, not fancy new functions like Hadoop and Spark require. https://…pdateonsnowdoopamapreducealternative 
Sobel Operator  The Sobel operator, sometimes called Sobel Filter, is used in image processing and computer vision, particularly within edge detection algorithms, and creates an image which emphasizes edges and transitions. It is named after Irwin Sobel, who presented the idea of an ‘Isotropic 3×3 Image Gradient Operator’ at a talk at the Stanford Artificial Intelligence Project (SAIP) in 1968. Technically, it is a discrete differentiation operator, computing an approximation of the gradient of the image intensity function. At each point in the image, the result of the Sobel operator is either the corresponding gradient vector or the norm of this vector. The Sobel operator is based on convolving the image with a small, separable, and integer valued filter in horizontal and vertical direction and is therefore relatively inexpensive in terms of computations. On the other hand, the gradient approximation that it produces is relatively crude, in particular for high frequency variations in the image. The Kayyali operator for edge detection is another operator generated from Sobel operator. 
Sobol Indices  Sobol indices are a widespread quantitative measure for variancebased global sensitivity analysis, but computing and utilizing them remains challenging for highdimensional systems. 
Sobolev GAN  ➘ “Sobolev Integral Probability Metric” 
Sobolev Integral Probability Metric (IPM) 
We propose a new Integral Probability Metric (IPM) between distributions: the Sobolev IPM. The Sobolev IPM compares the mean discrepancy of two distributions for functions (critic) restricted to a Sobolev ball defined with respect to a dominant measure $\mu$. We show that the Sobolev IPM compares two distributions in high dimensions based on weighted conditional Cumulative Distribution Functions (CDF) of each coordinate on a leave one out basis. The Dominant measure $\mu$ plays a crucial role as it defines the support on which conditional CDFs are compared. Sobolev IPM can be seen as an extension of the one dimensional VonMises Cram\’er statistics to high dimensional distributions. We show how Sobolev IPM can be used to train Generative Adversarial Networks (GANs). We then exploit the intrinsic conditioning implied by Sobolev IPM in text generation. Finally we show that a variant of Sobolev GAN achieves competitive results in semisupervised learning on CIFAR10, thanks to the smoothness enforced on the critic by Sobolev GAN which relates to Laplacian regularization. 
Sobolev Training  At the heart of deep learning we aim to use neural networks as function approximators – training them to produce outputs from inputs in emulation of a ground truth function or data creation process. In many cases we only have access to inputoutput pairs from the ground truth, however it is becoming more common to have access to derivatives of the target output with respect to the input – for example when the ground truth function is itself a neural network such as in network compression or distillation. Generally these target derivatives are not computed, or are ignored. This paper introduces Sobolev Training for neural networks, which is a method for incorporating these target derivatives in addition the to target values while training. By optimising neural networks to not only approximate the function’s outputs but also the function’s derivatives we encode additional information about the target function within the parameters of the neural network. Thereby we can improve the quality of our predictors, as well as the dataefficiency and generalization capabilities of our learned function approximation. We provide theoretical justifications for such an approach as well as examples of empirical evidence on three distinct domains: regression on classical optimisation datasets, distilling policies of an agent playing Atari, and on largescale applications of synthetic gradients. In all three domains the use of Sobolev Training, employing target derivatives in addition to target values, results in models with higher accuracy and stronger generalisation. 
Social Influence Maximization Problem (SIM Problem) 
➘ “Target Set Selection in a Social Network” 
Social Network Analysis (SNA) 
Social network analysis (SNA) is a strategy for investigating social structures through the use of network and graph theories. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties or edges (relationships or interactions) that connect them. Examples of social structures commonly visualized through social network analysis include social media networks, friendship and acquaintance networks, kinship, disease transmission,and sexual relationships. These networks are often visualized through sociograms in which nodes are represented as points and ties are represented as lines. Social network analysis has emerged as a key technique in modern sociology. It has also gained a significant following in anthropology, biology, communication studies, economics, geography, history, information science, organizational studies, political science, social psychology, development studies, and sociolinguistics and is now commonly available as a consumer tool. 
Social WiFi  Many retailers offer WiFi to attract and retain customers. Now some retailers hope to get more from wireless networking via Social WiFi, in which customers get free connectivity by logging in to the retailer’s network using their credentials from a social network account, such as Facebook. The user gets free wireless connectivity. The retailer gets access to customer data for marketing purposes. For example, the retailer could use the data to tailor offers to the customer, such as an instore coupon for a favorite brand. 
Sockeye  We describe Sockeye (version 1.12), an opensource sequencetosequence toolkit for Neural Machine Translation (NMT). Sockeye is a productionready framework for training and applying models as well as an experimental platform for researchers. Written in Python and built on MXNet, the toolkit offers scalable training and inference for the three most prominent encoderdecoder architectures: attentional recurrent neural networks, selfattentional transformers, and fully convolutional networks. Sockeye also supports a wide range of optimizers, normalization and regularization techniques, and inference improvements from current NMT literature. Users can easily run standard training recipes, explore different model settings, and incorporate new ideas. In this paper, we highlight Sockeye’s features and benchmark it against other NMT toolkits on two language arcs from the 2017 Conference on Machine Translation (WMT): EnglishGerman and LatvianEnglish. We report competitive BLEU scores across all three architectures, including an overall best score for Sockeye’s transformer implementation. To facilitate further comparison, we release all system outputs and training scripts used in our experiments. The Sockeye toolkit is free software released under the Apache 2.0 license. 
SOCRATES  A distributed semantic graph processing system that provides locality control, indexing, graph query, and parallel processing capabilities is presented. 
Socratic Learning  Modern machine learning techniques, such as deep learning, often use discriminative models that require large amounts of labeled data. An alternative approach is to use a generative model, which leverages heuristics from domain experts to train on unlabeled data. Domain experts often prefer to use generative models because they ‘tell a story’ about their data. Unfortunately, generative models are typically less accurate than discriminative models. Several recent approaches combine both types of model to exploit their strengths. In this setting, a misspecified generative model can hurt the performance of subsequent discriminative training. To address this issue, we propose a framework called Socratic learning that automatically uses information from the discriminative model to correct generative model misspecification. Furthermore, this process provides users with interpretable feedback about how to improve their generative model. We evaluate Socratic learning on realworld relation extraction tasks and observe an immediate improvement in classification accuracy that could otherwise require several weeks of effort by domain experts. 
Soft Computing (SC) 
Soft computing is a term applied to a field within computer science which is characterized by the use of inexact solutions to computationally hard tasks such as the solution of NPcomplete problems, for which there is no known algorithm that can compute an exact solution in polynomial time. Soft computing differs from conventional (hard) computing in that, unlike hard computing, it is tolerant of imprecision, uncertainty, partial truth, and approximation. In effect, the role model for soft computing is the human mind. 
Soft Filter Pruning (SFP) 
This paper proposed a Soft Filter Pruning (SFP) method to accelerate the inference procedure of deep Convolutional Neural Networks (CNNs). Specifically, the proposed SFP enables the pruned filters to be updated when training the model after pruning. SFP has two advantages over previous works: (1) Larger model capacity. Updating previously pruned filters provides our approach with larger optimization space than fixing the filters to zero. Therefore, the network trained by our method has a larger model capacity to learn from the training data. (2) Less dependence on the pretrained model. Large capacity enables SFP to train from scratch and prune the model simultaneously. In contrast, previous filter pruning methods should be conducted on the basis of the pretrained model to guarantee their performance. Empirically, SFP from scratch outperforms the previous filter pruning methods. Moreover, our approach has been demonstrated effective for many advanced CNN architectures. Notably, on ILSCRC2012, SFP reduces more than 42% FLOPs on ResNet101 with even 0.2% top5 accuracy improvement, which has advanced the stateoftheart. Code is publicly available on GitHub: https://…/softfilterpruning 
Soft KMeans (Fuzzy CMeans) 

Soft Locality Preserving Map (SLPM) 
For image recognition, an extensive number of methods have been proposed to overcome the highdimensionality problem of feature vectors being used. These methods vary from unsupervised to supervised, and from statistics to graphtheory based. In this paper, the most popular and the stateoftheart methods for dimensionality reduction are firstly reviewed, and then a new and more efficient manifoldlearning method, named Soft Locality Preserving Map (SLPM), is presented. Furthermore, feature generation and sample selection are proposed to achieve better manifold learning. SLPM is a graphbased subspacelearning method, with the use of kneighbourhood information and the class information. The key feature of SLPM is that it aims to control the level of spread of the different classes, because the spread of the classes in the underlying manifold is closely connected to the generalizability of the learned subspace. Our proposed manifoldlearning method can be applied to various pattern recognition applications, and we evaluate its performances on facial expression recognition. Experiments on databases, such as the Bahcesehir University Multilingual Affective Face Database (BAUM2), the Extended CohnKanade (CK+) Database, the Japanese Female Facial Expression (JAFFE) Database, and the Taiwanese Facial Expression Image Database (TFEID), show that SLPM can effectively reduce the dimensionality of the feature vectors and enhance the discriminative power of the extracted features for expression recognition. Furthermore, the proposed featuregeneration method can improve the generalizability of the underlying manifolds for facial expression recognition. 
Soft Multivariate Truncated Normal Distribution (soft tMVN) 
We propose a new distribution, called the soft tMVN distribution, which provides a smooth approximation to the truncated multivariate normal (tMVN) distribution with linear constraints. An efficient blocked Gibbs sampler is developed to sample from the soft tMVN distribution in high dimensions. We provide theoretical support to the approximation capability of the soft tMVN and provide further empirical evidence thereof. The soft tMVN distribution can be used to approximate simulations from a multivariate truncated normal distribution with linear constraints, or itself as a prior in shapeconstrained problems. 
Soft Topographic Vector Quantization (STVQ) 
We have developed an algorithm (STVQ) for the optimization of neighbourhood preserving maps by applying deterministic annealing to an energy function for topographic vector quantization. The combinatorial optimization problem is solved by introducing temperature dependent fuzzy assignments of data points to cluster centers and applying an EMtype algorithm at each temperature while annealing. The annealing process exhibits phase transitions in the cluster representation for which we calcul ate critical modes and temperatures expressed in terms of the neighbourhood function and the covariance matrix of the data. In particular, phase transitions corresponding to the automatic selection of feature dimensions are explored analytically and numer ically for finite temperatures. Results are related to those obtained earlier for Kohonen’s SOMalgorithm which can be derived as an approximation to STVQ. The deterministic annealing approach makes it possible to use the neighbourhood function solely to encode desired neighbourhood relations. The working of the annealing process is visualized by showing the effects of ‘heating’ on the topological structure of a twodimensional map of the plane. 
SoftGuided AdaptivelyDropped Neural Network (SGAD) 
Deep neural networks (DNNs) have been proven to have many redundancies. Hence, many efforts have been made to compress DNNs. However, the existing model compression methods treat all the input samples equally while ignoring the fact that the difficulties of various input samples being correctly classified are different. To address this problem, DNNs with adaptive dropping mechanism are well explored in this work. To inform the DNNs how difficult the input samples can be classified, a guideline that contains the information of input samples is introduced to improve the performance. Based on the developed guideline and adaptive dropping mechanism, an innovative softguided adaptivelydropped (SGAD) neural network is proposed in this paper. Compared with the 32 layers residual neural networks, the presented SGAD can reduce the FLOPs by 77% with less than 1% drop in accuracy on CIFAR10. 
SoftRobust ActorCritic algorithm (SRAC) 
Robust Reinforcement Learning aims to derive an optimal behavior that accounts for model uncertainty in dynamical systems. However, previous studies have shown that by considering the worst case scenario, robust policies can be overly conservative. Our \textit{softrobust} framework is an attempt to overcome this issue. In this paper, we present a novel SoftRobust ActorCritic algorithm (SRAC). It learns an optimal policy with respect to a distribution over an uncertainty set and stays robust to model uncertainty but avoids the conservativeness of robust strategies. We show convergence of the SRAC and test the efficiency of our approach on different domains by comparing it against regular learning methods and their robust formulations. 
SoftTarget Regularization  Deep neural networks are learning models with a very high capacity and therefore prone to overfitting. Many regularization techniques such as Dropout, DropCon nect, and weight decay all attempt to solve the problem of overfitting by reducing the capacity of their respective models (Srivastava et al., 2014), (Wan et al., 2013), (Krogh & Hertz, 1992). In this paper we introduce a new form of regularization that guides the learning problem in a way that reduces overfitting without sacrificing the capacity of the model. The mistakes that models make in early stages of training carry information about the learning problem. By adjusting the labels of the current epoch of training through a weighted average of the real labels, and an exponential average of the past softtargets we achieved a regularization scheme as powerful as Dropout without necessarily reducing the capacity of the model, and simplified the complexity of the learning problem. SoftTarget regularization proved to be an effective tool in various neural network architectures. 
Solid  Solid (derived from ‘social linked data’) is a proposed set of conventions and tools for building decentralized Web applications based on Linked Data principles. Solid is modular and extensible. It relies as much as possible on existing W3C standards and protocols. 
Sonnet  It’s now nearly a year since DeepMind made the decision to switch the entire research organisation to using TensorFlow (TF). It’s proven to be a good choice – many of our models learn significantly faster, and the builtin features for distributed training have hugely simplified our code. Along the way, we found that the flexibility and adaptiveness of TF lends itself to building higher level frameworks for specific purposes, and we’ve written one for quickly building neural network modules with TF. We are actively developing this codebase, but what we have so far fits our research needs well, and we’re excited to announce that today we are open sourcing it. We call this framework Sonnet. 
Soundex  Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms (in part because it is a standard feature of popular database software such as DB2, PostgreSQL, MySQL, Ingres, MS SQL Server and Oracle) and is often used (incorrectly) as a synonym for “phonetic algorithm”. Improvements to Soundex are the basis for many modern phonetic algorithms. 
Sounding Board  We present Sounding Board, a social chatbot that won the 2017 Amazon Alexa Prize. The system architecture consists of several components including spoken language processing, dialogue management, language generation, and content management, with emphasis on usercentric and contentdriven design. We also share insights gained from largescale online logs based on 160,000 conversations with realworld users. 
spaCy  spaCy, aIndustrialstrength NLP, is a library for advanced natural language processing in Python and Cython. spaCy is built on the very latest research, but it isn’t researchware. It was designed from day 1 to be used in real products. You can buy a commercial license, or you can use it under the AGPL. Features: · Labelled dependency parsing (91.8% accuracy on OntoNotes 5) · Named entity recognition (82.6% accuracy on OntoNotes 5) · Partofspeech tagging (97.1% accuracy on OntoNotes 5) · Easy to use word vectors · All strings mapped to integer IDs · Export to numpy data arrays · Alignment maintained to original string, ensuring easy mark up calculation · Range of easytouse orthographic features. · No preprocessing required. spaCy takes raw text as input, warts and newlines and all. Github 
Spaghetti Plot  A spaghetti plot (also known as a spaghetti chart, spaghetti diagram, or spaghetti model) is a method of viewing data to visualize possible flows through systems. Flows depicted in this manner appear like noodles, hence the coining of this term. This method of statistics was first used to track routing through factories. Visualizing flow in this manner can reduce inefficiency within the flow of a system. In regards to animal populations and weather buoys drifting through the ocean, they are drawn to study distribution and migration patterns. Within meteorology, these diagrams can help determine confidence in a specific weather forecast, as well as positions and intensities of high and low pressure systems. They are composed of deterministic forecasts from atmospheric models or their various ensemble members. Within medicine, they can illustrate the effects of drugs on patients during drug trials. 
SPARCML  One of the main drivers behind the rapid recent advances in machine learning has been the availability of efficient system support. This comes both through hardware acceleration, but also in the form of efficient software frameworks and programming models. Despite significant progress, scaling computeintensive machine learning workloads to a large number of compute nodes is still a challenging task, with significant latency and bandwidth demands. In this paper, we address this challenge, by proposing SPARCML, a general, scalable communication layer for machine learning applications. SPARCML is built on the observation that many distributed machine learning algorithms either have naturally sparse communication patters, or have updates which can be sparsified in a structured way for improved performance, without any convergence or accuracy loss. To exploit this insight, we design and implement a set of communication efficient protocols for sparse input data, in conjunction with efficient machine learning algorithms which can leverage these primitives. Our communication protocols generalize standard collective operations, by allowing processes to contribute sparse input data vectors, of heterogeneous sizes. We call these operations sparseinput collectives, and present efficient practical algorithms with strong theoretical bounds on their running time and communication cost. Our generic communication layer is enriched with additional features, such support for nonblocking (asynchronous) operations, and support for lowprecision data representations. We validate our algorithmic results experimentally on a range of largescale machine learning applications and target architectures, showing that we can leverage sparsity for order ofmagnitude runtime savings, compared to stateofthe art methods and frameworks. 
Spark Python API (PySpark) 
The Spark Python API (PySpark) exposes the Spark programming model to Python. To learn the basics of Spark, we recommend reading through the Scala programming guide first; it should be easy to follow even if you don’t know Scala. This guide will show how to use the Spark features described there in Python. PySpark & Scikitlearn = Sparkitlearn 
Sparkle  Spark is an inmemory analytics platform that targets commodity server environments today. It relies on the Hadoop Distributed File System (HDFS) to persist intermediate checkpoint states and final processing results. In Spark, immutable data are used for storing data updates in each iteration, making it inefficient for long running, iterative workloads. A nondeterministic garbage collector further worsens this problem. Sparkle is a library that optimizes memory usage in Spark. It exploits large shared memory to achieve better data shuffling and intermediate storage. Sparkle replaces the current TCP/IPbased shuffle with a shared memory approach and proposes an offheap memory store for efficient updates. We performed a series of experiments on scaleout clusters and scaleup machines. The optimized shuffle engine leveraging shared memory provides 1.3x to 6x faster performance relative to Vanilla Spark. The offheap memory store along with the sharedmemory shuffle engine provides more than 20x performance increase on a probabilistic graph processing workload that uses a largescale realworld hyperlink graph. While Sparkle benefits at most from running on large memory machines, it also achieves 1.6x to 5x performance improvements over scale out cluster with equivalent hardware setting. 
SparkNet  Training deep networks is a timeconsuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widelypopular batchprocessing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communicationintensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multidimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very highlatency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster’s communication overhead, and we benchmark our system’s performance on the ImageNet dataset. 
SPARQL Protocol and RDF Query Language (SPARQL) 
SPARQL (pronounced “sparkle”, a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language, that is, a semantic query language for databases, able to retrieve and manipulate data stored in Resource Description Framework format. It was made a standard by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium, and is recognized as one of the key technologies of the semantic web. On 15 January 2008, SPARQL 1.0 became an official W3C Recommendation, and SPARQL 1.1 in March, 2013. SPARQL allows for a query to consist of triple patterns, conjunctions, disjunctions, and optional patterns. Implementations for multiple programming languages exist. “SPARQL will make a huge difference” making the web machinereadable according to Sir Tim BernersLee in a May 2006 interview. There exist tools that allow one to connect and semiautomatically construct a SPARQL query for a SPARQL endpoint, for example ViziQuer. In addition, there exist tools that translate SPARQL queries to other query languages, for example to SQL and to XQuery. 
Sparse Coding  The sparse code is when each item is encoded by the strong activation of a relatively small set of neurons. For each item to be encoded, this is a different subset of all available neurons. As a consequence, sparseness may be focused on temporal sparseness (‘a relatively small number of time periods are active’) or on the sparseness in an activated population of neurons. In this latter case, this may be defined in one time period as the number of activated neurons relative to the total number of neurons in the population. This seems to be a hallmark of neural computations since compared to traditional computers, information is massively distributed across neurons. A major result in neural coding from Olshausen et al. is that sparse coding of natural images produces waveletlike oriented filters that resemble the receptive fields of simple cells in the visual cortex. The capacity of sparse codes may be increased by simultaneous use of temporal coding, as found in the locust olfactory system. Given a potentially large set of input patterns, sparse coding algorithms (e.g. Sparse Autoencoder) attempt to automatically find a small number of representative patterns which, when combined in the right proportions, reproduce the original input patterns. The sparse coding for the input then consists of those representative patterns. For example, the very large set of English sentences can be encoded by a small number of symbols (i.e. letters, numbers, punctuation, and spaces) combined in a particular order for a particular sentence, and so a sparse coding for English would be those symbols. ➚ “Dictionary Learning” More Algorithms for Provable Dictionary Learning 
Sparse Distributed Representations (SDR) 
Sparse Distributed Representations are binary representations of data comprised of many bits with a small percentage of the bits active (1’s). The bits in these representations have semantic meaning and that meaning is distributed across the bits. https://…distributedrepresentationssparsecodes 
Sparse Generalized Linear Models  glmgraph 
SParse Interpretable Neural Embeddings (SPINE) 
Prediction without justification has limited utility. Much of the success of neural models can be attributed to their ability to learn rich, dense and expressive representations. While these representations capture the underlying complexity and latent trends in the data, they are far from being interpretable. We propose a novel variant of denoising ksparse autoencoders that generates highly efficient and interpretable distributed word representations (word embeddings), beginning with existing word representations from stateoftheart methods like GloVe and word2vec. Through large scale human evaluation, we report that our resulting word embedddings are much more interpretable than the original GloVe and word2vec embeddings. Moreover, our embeddings outperform existing popular word embeddings on a diverse suite of benchmark downstream tasks. 
Sparse Linear Isotonic Model  In machine learning and data mining, linear models have been widely used to model the response as parametric linear functions of the predictors. To relax such stringent assumptions made by parametric linear models, additive models consider the response to be a summation of unknown transformations applied on the predictors; in particular, additive isotonic models (AIMs) assume the unknown transformations to be monotone. In this paper, we introduce sparse linear isotonic models (SLIMs) for highdimensional problems by hybridizing ideas in parametric sparse linear models and AIMs, which enjoy a few appealing advantages over both. In the highdimensional setting, a twostep algorithm is proposed for estimating the sparse parameters as well as the monotone functions over predictors. Under mild statistical assumptions, we show that the algorithm can accurately estimate the parameters. Promising preliminary experiments are presented to support the theoretical results. 
Sparse Linear Method (SLIM) 
This paper focuses on developing effective and efficient algorithms for topN recommender systems. A novel Sparse LInear Method (SLIM) is proposed, which generates topN recommendations by aggregating from user purchase/rating profiles. A sparse aggregation coefficient matrix W is learned from SLIM by solving an l1norm and l2norm regularized optimization problem. W is demonstrated to produce highquality recommendations and its sparsity allows SLIM to generate recommendations very fast. A comprehensive set of experiments is conducted by comparing the SLIM method and other stateoftheart topN recommendation methods. The experiments show that SLIM achieves significant improvements both in run time performance and recommendation quality over the best existing methods. GitXiv slimrec 
Sparse Matrix / Sparsity  In numerical analysis, a sparse matrix is a matrix in which most of the elements are zero. By contrast, if most of the elements are nonzero, then the matrix is considered dense. The fraction of zero elements (nonzero elements) in a matrix is called the sparsity (density). Sparsity is in the general sense: variable selection, total variation regularization, polynomial trend filtering, and others. http://…/demo_glm.html 
Sparse MatrixBased SPARQL (SMbased SPARQL,gSMat) 
Resource Description Framework (RDF) has been widely used to represent information on the web, while SPARQL is a standard query language to manipulate RDF data. Given a SPARQL query, there often exist many joins which are the bottlenecks of efficiency of query processing. Besides, the real RDF datasets often reveal strong data sparsity, which indicates that a resource often only relates to a few resources even the number of total resources is large. In this paper, we propose a sparse matrixbased (SMbased) SPARQL query processing approach over RDF datasets which con siders both join optimization and data sparsity. Firstly, we present a SMbased storage for RDF datasets to lift the storage efficiency, where valid edges are stored only, and then introduce a predicate based hash index on the storage. Secondly, we develop a scalable SMbased join algorithm for SPARQL query processing. Finally, we analyze the overall cost by accumulating all intermediate results and design a query plan generated algorithm. Besides, we extend our SMbased join algorithm on GPU for parallelizing SPARQL query processing. We have evaluated our approach compared with the stateoftheart RDF engines over benchmark RDF datasets and the experimental results show that our proposal can significantly improve SPARQL query processing with high scalability. 
Sparse Multiprototype Linear Learner (SMaLL) 
We present a new machine learning technique for training small resourceconstrained predictors. Our algorithm, the Sparse Multiprototype Linear Learner (SMaLL), is inspired by the classic machine learning problem of learning $k$DNF Boolean formulae. We present a formal derivation of our algorithm and demonstrate the benefits of our approach with a detailed empirical study. 
Sparse Principal Component Analysis (SPCA) 
Sparse principal component analysis (sparse PCA) is a specialised technique used in statistical analysis and, in particular, in the analysis of multivariate data sets. It extends the classic method of principal component analysis (PCA) for the reduction of dimensionality of data by adding sparsity constraint on the input variables. Ordinary principal component analysis (PCA) uses a vector space transform to reduce multidimensional data sets to lower dimensions. It finds linear combinations of input variables, and transforms them into new variables (called principal components) that correspond to directions of maximal variance in the data. The number of new variables created by these linear combinations is usually much lower than the number of input variables in the original dataset, while still explaining most of the variance present in the data. A particular disadvantage of ordinary PCA is that the principal components are usually linear combinations of all input variables. Sparse PCA overcomes this disadvantage by finding linear combinations that contain just a few input variables. 
Sparse Shrink  Nowadays, it is still difficult to adapt Convolutional Neural Network (CNN) based models for deployment on embedded devices. The heavy computation and large memory footprint of CNN models become the main burden in real application. In this paper, we propose a ‘Sparse Shrink’ algorithm to prune an existing CNN model. By analyzing the importance of each channel via sparse reconstruction, the algorithm is able to prune redundant feature maps accordingly. The resulting pruned model thus directly saves computational resource. We have evaluated our algorithm on CIFAR100. As shown in our experiments, we can reduce 56.77% parameters and 73.84% multiplication in total with only minor decrease in accuracy. These results have demonstrated the effectiveness of our ‘Sparse Shrink’ algorithm. 
Sparse SpaceTime Model  Inspired by Kalikowtype decompositions, we introduce a new stochastic model of infinite neuronal networks, for which we establish oracle inequalities for Lasso methods and restricted eigenvalue properties for the associated Gram matrix with high probability. These results hold even if the network is only partially observed. The main argument rely on the fact that concentration inequalities can easily be derived whenever the transition probabilities of the underlying process admit a sparse spacetime representation. 
Sparse Spatial Generalized Linear Mixed Model (SGLMM) 
(reparameterizations of traditional models) ngspatial 
Sparse Weighted Canonical Correlation Analysis (SWCCA) 
Given two data matrices $X$ and $Y$, sparse canonical correlation analysis (SCCA) is to seek two sparse canonical vectors $u$ and $v$ to maximize the correlation between $Xu$ and $Yv$. However, classical and sparse CCA models consider the contribution of all the samples of data matrices and thus cannot identify an underlying specific subset of samples. To this end, we propose a novel sparse weighted canonical correlation analysis (SWCCA), where weights are used for regularizing different samples. We solve the $L_0$regularized SWCCA ($L_0$SWCCA) using an alternating iterative algorithm. We apply $L_0$SWCCA to synthetic data and realworld data to demonstrate its effectiveness and superiority compared to related methods. Lastly, we consider also SWCCA with different penalties like LASSO (Least absolute shrinkage and selection operator) and Group LASSO, and extend it for integrating more than three data matrices. 
Sparsely Connected Convolutional Network (SparseNet) 
Residual learning with skip connections permits training ultradeep neural networks and obtains superb performance. Building in this direction, DenseNets proposed a dense connection structure where each layer is directly connected to all of its predecessors. The densely connected structure leads to better information flow and feature reuse. However, the overly dense skip connections also bring about the problems of potential risk of overfitting, parameter redundancy and large memory consumption. In this work, we analyze the feature aggregation patterns of ResNets and DenseNets under a uniform aggregation view framework. We show that both structures densely gather features from previous layers in the network but combine them in their respective ways: summation (ResNets) or concatenation (DenseNets). We compare the strengths and drawbacks of these two aggregation methods and analyze their potential effects on the networks’ performance. Based on our analysis, we propose a new structure named SparseNets which achieves better performance with fewer parameters than DenseNets and ResNets. 
SparseNet  Deep neural networks have made remarkable progresses on various computer vision tasks. Recent works have shown that depth, width and shortcut connections of networks are all vital to their performances. In this paper, we introduce a method to sparsify DenseNet which can reduce connections of a Llayer DenseNet from O(L^2) to O(L), and thus we can simultaneously increase depth, width and connections of neural networks in a more parameterefficient and computationefficient way. Moreover, an attention module is introduced to further boost our network’s performance. We denote our network as SparseNet. We evaluate SparseNet on datasets of CIFAR(including CIFAR10 and CIFAR100) and SVHN. Experiments show that SparseNet can obtain improvements over the stateoftheart on CIFAR10 and SVHN. Furthermore, while achieving comparable performances as DenseNet on these datasets, SparseNet is x2.6 smaller and x3.7 faster than the original DenseNet. 
SparseStep Regression  The SparseStep algorithm is presented for the estimation of a sparse parameter vector in the linear regression problem. The algorithm works by adding an approximation of the exact counting norm as a constraint on the model parameters and iteratively strengthening this approximation to arrive at a sparse solution. Theoretical analysis of the penalty function shows that the estimator yields unbiased estimates of the parameter vector. An iterative majorization algorithm is derived which has a straightforward implementation reminiscent of ridge regression. In addition, the SparseStep algorithm is compared with similar methods through a rigorous simulation study which shows it often outperforms existing methods in both model fit and prediction accuracy. sparsestep 
Sparsity Oriented Importance Learning (SOIL) 
Sparsity Oriented Importance Learning (SOIL) provides an objective and informative profile of variable importances for high dimensional regression and classification models. SOIL 
Spatial Position Model  SpatialPosition 
Spatial Random Sampling (SRS) 
Random column sampling is not guaranteed to yield data sketches that preserve the underlying structures of the data and may not sample sufficiently from lesspopulated data clusters. Also, adaptive sampling can often provide accurate low rank approximations, yet may fall short of producing descriptive data sketches, especially when the cluster centers are linearly dependent. Motivated by that, this paper introduces a novel randomized column sampling tool dubbed Spatial Random Sampling (SRS), in which data points are sampled based on their proximity to randomly sampled points on the unit sphere. The most compelling feature of SRS is that the corresponding probability of sampling from a given data cluster is proportional to the surface area the cluster occupies on the unit sphere, independently from the size of the cluster population. Although it is fully randomized, SRS is shown to provide descriptive and balanced data representations. The proposed idea addresses a pressing need in data science and holds potential to inspire many novel approaches for analysis of big data. 
Spatial Sign Correlation  A new robust correlation estimator based on the spatial sign covariance matrix (SSCM) is proposed. We derive its asymptotic distribution and influence function at elliptical distributions. Finite sample and robustness properties are studied and compared to other robust correlation estimators by means of numerical simulations. sscor 
Spatial Sign Covariance Matrix (SSCM) 
The robust estimation of multivariate location and shape is one of the most challenging problems in statistics and crucial in many application areas. The objective is to find highly efficient, robust, computable and affine equivariant location and covariance matrix estimates. In this paper three different concepts of multivariate sign and rank are considered and their ability to carry information about the geometry of the underlying distribution (or data cloud) are discussed. New techniques for robust covariance matrix estimation based on different sign and rank concepts are proposed and algorithms for computing them outlined. In addition, new tools for evaluating the qualitative and quantitative robustness of a covariance estimator are proposed. The use of these tools is demonstrated on two rank based covariance matrix estimates. Finally, to illustrate the practical importance of the problem, a signal processing example where robust covariance matrix estimates are needed is given. The Spatial Sign Covariance Matrix With Unknown Location ➘ “Spatial Sign Correlation” sscor 
Spatial Simulated Annealing (SSA) 
Spatial simulated annealing uses slight perturbations of previous sampling designs and a random search technique to solve spatial optimization problems. Candidate measurement locations are iteratively moved around and optimized by minimizing the mean universal kriging variance. The approach relies on a known, prespecified model for underlying spatial variation. ➚ “Simulated Annealing” spsann 
Spatial Statistics  Spatial analysis or spatial statistics includes any of the formal techniques which study entities using their topological, geometric, or geographic properties. The phrase properly refers to a variety of techniques, many still in their early development, using different analytic approaches and applied in fields as diverse as astronomy, with its studies of the placement of galaxies in the cosmos, to chip fabrication engineering, with its use of ‘place and route’ algorithms to build complex wiring structures. The phrase is often used in a more restricted sense to describe techniques applied to structures at the human scale, most notably in the analysis of geographic data. The phrase is even sometimes used to refer to a specific technique in a single area of research, for example, to describe geostatistics. 
Spatial Stochastic Frontier Analysis (SSFA) 
Spatial Stochastic Frontier Analysis (SSFA) is an original method for controlling the spatial heterogeneity in Stochastic Frontier Analysis (SFA) models by splitting the inefficiency term into three terms: the first one related to spatial peculiarities of the territory in which each single unit operates, the second one related to the specific production features and the third one representing the error term. 
Spatial Transformer GAN  We address the problem of finding realistic geometric corrections to a foreground object such that it appears natural when composited into a background image. To achieve this, we propose a novel Generative Adversarial Network (GAN) architecture that utilizes Spatial Transformer Networks (STNs) as the generator, which we call Spatial Transformer GANs (STGANs). STGANs seek image realism by operating in the geometric warp parameter space. In particular, we exploit an iterative STN warping scheme and propose a sequential training strategy that achieves better results compared to naive training of a single generator. One of the key advantages of STGAN is its applicability to highresolution images indirectly since the predicted warp parameters are transferable between reference frames. We demonstrate our approach in two applications: (1) visualizing how indoor furniture (e.g. from product images) might be perceived in a room, (2) hallucinating how accessories like glasses would look when matched with real portraits. 
Spatial Transformer Introspective Neural Network (STINN) 
Natural images contain many variations such as illumination differences, affine transformations, and shape distortions. Correctly classifying these variations poses a long standing problem. The most commonly adopted solution is to build largescale datasets that contain objects under different variations. However, this approach is not ideal since it is computationally expensive and it is hard to cover all variations in one single dataset. Towards addressing this difficulty, we propose the spatial transformer introspective neural network (STINN) that explicitly generates samples with the unseen affine transformation variations in the training set. Experimental results indicate STINN achieves classification accuracy improvements on several benchmark datasets, including MNIST, affNIST, SVHN and CIFAR10. We further extend our method to cross dataset classification tasks and fewshot learning problems to verify our method under extreme conditions and observe substantial improvements from experiment results. 
Spatial Transformer Network (STN) 
Robotic grasp detection task is still challenging, particularly for novel objects. With the recent advance of deep learning, there have been several works on detecting robotic grasp using neural networks. Typically, regression based grasp detection methods have outperformed classification based detection methods in computation complexity with excellent accuracy. However, classification based robotic grasp detection still seems to have merits such as intermediate step observability and straightforward back propagation routine for endtoend training. In this work, we propose a novel classification based robotic grasp detection method with multiplestage spatial transformer networks (STN). Our proposed method was able to achieve stateoftheart performance in accuracy with real time computation. Additionally, unlike other regression based grasp detection methods, our proposed method allows partial observation for intermediate results such as grasp location and orientation for a number of grasp configuration candidates. 
Spatially Compact Semantic Scan (SCSS) 
Many methods have been proposed for detecting emerging events in text streams using topic modeling. However, these methods have shortcomings that make them unsuitable for rapid detection of locally emerging events on massive text streams. We describe Spatially Compact Semantic Scan (SCSS) that has been developed specifically to overcome the shortcomings of current methods in detecting new spatially compact events in text streams. SCSS employs alternating optimization between using semantic scan to estimate contrastive foreground topics in documents, and discovering spatial neighborhoods with high occurrence of documents containing the foreground topics. We evaluate our method on Emergency Department chief complaints dataset (ED dataset) to verify the effectiveness of our method in detecting realworld disease outbreaks from freetext ED chief complaint data. 
SPAtially Related Convolutional Neural Network (SPARCNN) 
The ability to accurately detect and classify objects at varying pixel sizes in cluttered scenes is crucial to many Navy applications. However, detection performance of existing stateof theart approaches such as convolutional neural networks (CNNs) degrade and suffer when applied to such cluttered and multiobject detection tasks. We conjecture that spatial relationships between objects in an image could be exploited to significantly improve detection accuracy, an approach that had not yet been considered by any existing techniques (to the best of our knowledge) at the time the research was conducted. We introduce a detection and classification technique called Spatially Related Detection with Convolutional Neural Networks (SPARCNN) that learns and exploits a probabilistic representation of interobject spatial configurations within images from training sets for more effective region proposals to use with stateoftheart CNNs. Our empirical evaluation of SPARCNN on the VOC 2007 dataset shows that it increases classification accuracy by 8% when compared to a region proposal technique that does not exploit spatial relations. More importantly, we obtained a higher performance boost of 18.8% when task difficulty in the test set is increased by including highly obscured objects and increased image clutter. 
SpatialTemporalSpectral Framework Based on a Deep Convolutional Neural Network (STSCNN) 
Because of the internal malfunction of satellite sensors and poor atmospheric conditions such as thick cloud, the acquired remote sensing data often suffer from missing information, i.e., the data usability is greatly reduced. In this paper, a novel method of missing information reconstruction in remote sensing images is proposed. The unified spatialtemporalspectral framework based on a deep convolutional neural network (STSCNN) employs a unified deep convolutional neural network combined with spatialtemporalspectral supplementary information. In addition, to address the fact that most methods can only deal with a single missing information reconstruction task, the proposed approach can solve three typical missing information reconstruction tasks: 1) dead lines in Aqua MODIS band 6; 2) the Landsat ETM+ Scan Line Corrector (SLC)off problem; and 3) thick cloud removal. It should be noted that the proposed model can use multisource data (spatial, spectral, and temporal) as the input of the unified framework. The results of both simulated and realdata experiments demonstrate that the proposed model exhibits high effectiveness in the three missing information reconstruction tasks listed above. 
Spatiotemporal Intrinsic Mode Decomposition (STIMD) 
We propose a new solution to the blind source separation problem that factors mixed timeseries signals into a sum of spatiotemporal modes, with the constraint that the temporal components are intrinsic mode functions (IMF’s). The key motivation is that IMF’s allow the computation of meaningful Hilbert transforms of nonstationary data, from which instantaneous timefrequency representations may be derived. Our spatiotemporal intrinsic mode decomposition (STIMD) method leverages spatial correlations to generalize the extraction of IMF’s from onedimensional signals, commonly performed using the empirical mode decomposition (EMD), to multidimensional signals. Further, this datadriven method enables futurestate prediction. We demonstrate STIMD on several synthetic examples, comparing it to common matrix factorization techniques, namely singular value decomposition (SVD), independent component analysis (ICA), and dynamic mode decomposition (DMD). We show that STIMD outperforms these methods at reconstruction and extracting interpretable modes. Next, we apply STIMD to analyze two realworld datasets, gravitational wave data and neural recordings from the rodent hippocampus. 
spatstat  spatstat is an R package for spatial statistics with a strong focus on analysing spatial point patterns in 2D (with some support for 3D and very basic support for spacetime). 
SPCALDA  A new reducedrank LDA method which works for high dimensional multiclass data. 
SpCoSLAM 2.0  In this paper, we propose a novel online learning algorithm, SpCoSLAM 2.0 for spatial concepts and lexical acquisition with higher accuracy and scalability. In previous work, we proposed SpCoSLAM as an online learning algorithm based on the Rao–Blackwellized particle filter. However, this conventional algorithm had problems such as the decrease of the estimation accuracy due to the influence of the early stages of learning as well as the increase of the computational complexity with the increase of the training data. Therefore, we first develop an improved algorithm by introducing new techniques such as rejuvenation. Next, we develop a scalable algorithm to reduce the calculation time while maintaining a higher accuracy than the conventional algorithm. In the experiment, we evaluate and compare the estimation accuracy and calculation time of the proposed algorithm, conventional online algorithm, and batch learning. The experimental results demonstrate that the proposed algorithm not only exceeds the accuracy of the conventional algorithm but also capable of achieving an accuracy comparable to that of batch learning. In addition, the proposed algorithm showed that the calculation time does not depend on the amount of training data and becomes constant for each step with the scalable algorithm. 
Specificity  Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as classification function. Sensitivity (also called the true positive rate, or the recall rate in some fields) measures the proportion of actual positives which are correctly identified as such (e.g. the percentage of sick people who are correctly identified as having the condition). Specificity (sometimes called the true negative rate) measures the proportion of negatives which are correctly identified as such (e.g. the percentage of healthy people who are correctly identified as not having the condition). These two measures are closely related to the concepts of type I and type II errors. A perfect predictor would be described as 100% sensitive (i.e. predicting all people from the sick group as sick) and 100% specific (i.e. not predicting anyone from the healthy group as sick); however, theoretically any predictor will possess a minimum error bound known as the Bayes error rate. 
Spectral Clustering  In multivariate statistics and the clustering of data, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset. In application to image segmentation, spectral clustering is known as segmentationbased object categorization. 
Spectral Clustering using Deep Neural Networks (SpectralNet) 
Spectral clustering is a leading and popular technique in unsupervised data analysis. Two of its major limitations are scalability and generalization of the spectral embedding (i.e., outofsampleextension). In this paper we introduce a deep learning approach to spectral clustering that overcomes the above shortcomings. Our network, which we call SpectralNet, learns a map that embeds input data points into the eigenspace of their associated graph Laplacian matrix and subsequently clusters them. We train SpectralNet using a procedure that involves constrained stochastic optimization. Stochastic optimization allows it to scale to large datasets, while the constraints, which are implemented using a specialpurpose output layer, allow us to keep the network output orthogonal. Moreover, the map learned by SpectralNet naturally generalizes the spectral embedding to unseen data points. To further improve the quality of the clustering, we replace the standard pairwise Gaussian affinities with affinities leaned from unlabeled data using a Siamese network. Additional improvement can be achieved by applying the network to code representations produced, e.g., by standard autoencoders. Our endtoend learning procedure is fully unsupervised. In addition, we apply VC dimension theory to derive a lower bound on the size of SpectralNet. Stateoftheart clustering results are reported on the Reuters dataset. Our implementation is publicly available at https://…/SpectralNet . 
Spectral Convolution Networks  Previous research has shown that computation of convolution in the frequency domain provides a significant speedup versus traditional convolution network implementations. However, this performance increase comes at the expense of repeatedly computing the transform and its inverse in order to apply other network operations such as activation, pooling, and dropout. We show, mathematically, how convolution and activation can both be implemented in the frequency domain using either the Fourier or Laplace transformation. The main contributions are a description of spectral activation under the Fourier transform and a further description of an efficient algorithm for computing both convolution and activation under the Laplace transform. By computing both the convolution and activation functions in the frequency domain, we can reduce the number of transforms required, as well as reducing overall complexity. Our description of a spectral activation function, together with existing spectral analogs of other network functions may then be used to compose a fully spectral implementation of a convolution network. 
Spectral Graph Clustering (SGC) 
➘ “Spectral Clustering” 
Spectral Inference Network  We present Spectral Inference Networks, a framework for learning eigenfunctions of linear operators by stochastic optimization. Spectral Inference Networks generalize Slow Feature Analysis to generic symmetric operators, and are closely related to Variational Monte Carlo methods from computational physics. As such, they can be a powerful tool for unsupervised representation learning from video or pairs of data. We derive a training algorithm for Spectral Inference Networks that addresses the bias in the gradients due to finite batch size and allows for online learning of multiple eigenfunctions. We show results of training Spectral Inference Networks on problems in quantum mechanics and feature learning for videos on synthetic datasets as well as the Arcade Learning Environment. Our results demonstrate that Spectral Inference Networks accurately recover eigenfunctions of linear operators, can discover interpretable representations from video and find meaningful subgoals in reinforcement learning environments. 
Spectral Normalization  One of the challenges in the study of generative adversarial networks is the instability of its training. In this paper, we propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator. Our new normalization technique is computationally light and easy to incorporate into existing implementations. We tested the efficacy of spectral normalization on CIFAR10, STL10, and ILSVRC2012 dataset, and we experimentally confirmed that spectrally normalized GANs (SNGANs) is capable of generating images of better or equal quality relative to the previous training stabilization techniques. 
SpectralPruning  The model size of deep neural network is getting larger and larger to realize superior performance in complicated tasks. This makes it difficult to implement deep neural network in small edgecomputing devices. To overcome this problem, model compression methods have been gathering much attention. However, there have been only few theoretical backgrounds that explain what kind of quantity determines the compression ability. To resolve this issue, we develop a new theoretical framework for model compression, and propose a new method called {\it SpectralPruning} based on the theory. Our theoretical analysis is based on the observation such that the eigenvalues of the covariance matrix of the output from nodes in the internal layers often shows rapid decay. We define ‘degree of freedom’ to quantify an intrinsic dimensionality of the model by using the eigenvalue distribution and show that the compression ability is essentially controlled by this quantity. Along with this, we give a generalization error bound of the compressed model. Our proposed method is applicable to wide range of models, unlike the existing methods, e.g., ones possess complicated branches as implemented in SegNet and ResNet. Our method makes use of both ‘input’ and ‘output’ in each layer and is easy to implement. We apply our method to several datasets to justify our theoretical analyses and show that the proposed method achieves the stateoftheart performance. 
SPECTRE  Distributed Complex Event Processing (DCEP) is a paradigm to infer the occurrence of complex situations in the surrounding world from basic events like sensor readings. In doing so, DCEP operators detect event patterns on their incoming event streams. To yield high operator throughput, data parallelization frameworks divide the incoming event streams of an operator into overlapping windows that are processed in parallel by a number of operator instances. In doing so, the basic assumption is that the different windows can be processed independently from each other. However, consumption policies enforce that events can only be part of one pattern instance; then, they are consumed, i.e., removed from further pattern detection. That implies that the constituent events of a pattern instance detected in one window are excluded from all other windows as well, which breaks the data parallelism between different windows. In this paper, we tackle this problem by means of speculation: Based on the likelihood of an event’s consumption in a window, subsequent windows may speculatively suppress that event. We propose the SPECTRE framework for speculative processing of multiple dependent windows in parallel. Our evaluations show an up to linear scalability of SPECTRE with the number of CPU cores. 
Spectrum Anomaly Detector with Interpretable FEatures (SAIFE) 
Detecting anomalous behavior in wireless spectrum is a demanding task due to the sheer complexity of the electromagnetic spectrum use. Wireless spectrum anomalies can take a wide range of forms from the presence of an unwanted signal in a licensed band to the absence of an expected signal, which makes manual labeling of anomalies difficult and suboptimal. We present, Spectrum Anomaly Detector with Interpretable FEatures (SAIFE), an Adversarial Autoencoder (AAE) based anomaly detector for wireless spectrum anomaly detection using Power Spectral Density (PSD) data which achieves good anomaly detection and localization in an unsupervised setting. In addition, we investigate the model’s capabilities to learn interpretable features such as signal bandwidth, class and center frequency in a semisupervised fashion. Along with anomaly detection the model exhibits promising results for lossy PSD data compression up to 120X and semisupervised signal classification accuracy close to 100% on three datasets just using 20% labeled samples. Finally the model is tested on data from one of the distributed Electrosense sensors over a long term of 500 hours showing its anomaly detection capabilities. 
Speculative Query Planning (SpecQP) 
Organisations store huge amounts of data from multiple heterogeneous sources in the form of Knowledge Graphs (KGs). One of the ways to query these KGs is to use SPARQL queries over a database engine. Since SPARQL follows exact match semantics, the queries may return too few or no results. Recent works have proposed query relaxation where the query engine judiciously replaces a query predicate with similar predicates using weighted relaxation rules mined from the KG. The space of possible relaxations is potentially too large to fully explore and users are typically interested in only topk results, so such query engines use topk algorithms for query processing. However, they may still process all the relaxations, many of whose answers do not contribute towards topk answers. This leads to computation overheads and delayed response times. We propose SpecQP, a query planning framework that speculatively determines which relaxations will have their results in the topk answers. Only these relaxations are processed using the topk operators. We, therefore, reduce the computation overheads and achieve faster response times without adversely affecting the quality of results. We tested SpecQP over two datasets – XKG and Twitter, to demonstrate the efficiency of our planning framework at reducing runtimes with reasonable accuracy for query engines supporting relaxations. 
Speech Analytics  Speech analytics is the process of analyzing recorded calls to gather information, brings structure to customer interactions and exposes information buried in customer contact center interactions with an enterprise. Although it often includes elements of automatic speech recognition, where the identities of spoken words or phrases are determined, it may also include analysis of one or more of the following: the topic(s) being discussed the emotional character of the speech the amount and locations of speech versus nonspeech (e.g. call hold time or periods of silence) One use of speech analytics applications is to spot spoken keywords or phrases, either as realtime alerts on live audio or as a postprocessing step on recorded speech. This technique is also known as audio mining. Other uses include categorization of speech, for example in the contact center environment, to identify calls from unsatisfied customers. Speech analytics in contact centers can be used to extract critical business intelligence that would otherwise be lost. By analyzing and categorizing recorded phone conversations between companies and their customers, useful information can be discovered relating to strategy, product, process, operational issues and contact center agent performance. This information gives decisionmakers insight into what customers really think about their company so that they can quickly react. In addition, speech analytics can automatically identify areas in which contact center agents may need additional training or coaching, and can automatically monitor the customer service provided on calls. 
Speech2Vec  In this paper, we propose a novel deep neural network architecture, Speech2Vec, for learning fixedlength vector representations of audio segments excised from a speech corpus, where the vectors contain semantic information pertaining to the underlying spoken words, and are close to other vectors in the embedding space if their corresponding underlying spoken words are semantically similar. The proposed model can be viewed as a speech version of Word2Vec. Its design is based on a RNN EncoderDecoder framework, and borrows the methodology of skipgrams or continuous bagofwords for training. Learning word embeddings directly from speech enables Speech2Vec to make use of the semantic information carried by speech that does not exist in plain text. The learned word embeddings are evaluated and analyzed on 13 widely used word similarity benchmarks, and outperform word embeddings learned by Word2Vec from the transcriptions. 
Speed as a Supervisor (SaaS) 
We introduce the SaaS Algorithm for semisupervised learning, which uses learning speed during stochastic gradient descent in a deep neural network to measure the quality of an iterative estimate of the posterior probability of unknown labels. Training speed in supervised learning correlates strongly with the percentage of correct labels, so we use it as an inference criterion for the unknown labels, without attempting to infer the model parameters at first. Despite its simplicity, SaaS achieves stateoftheart results in semisupervised learning benchmarks. 
Spherical Paragraph Model (SPM) 
Representing texts as fixedlength vectors is central to many language processing tasks. Most traditional methods build text representations based on the simple BagofWords (BoW) representation, which loses the rich semantic relations between words. Recent advances in natural language processing have shown that semantically meaningful representations of words can be efficiently acquired by distributed models, making it possible to build text representations based on a better foundation called the BagofWordEmbedding (BoWE) representation. However, existing text representation methods using BoWE often lack sound probabilistic foundations or cannot well capture the semantic relatedness encoded in word vectors. To address these problems, we introduce the Spherical Paragraph Model (SPM), a probabilistic generative model based on BoWE, for text representation. SPM has good probabilistic interpretability and can fully leverage the rich semantics of words, the word cooccurrence information as well as the corpuswide information to help the representation learning of texts. Experimental results on topical classification and sentiment analysis demonstrate that SPM can achieve new stateoftheart performances on several benchmark datasets. 
SpiderCNN  Deep neural networks have enjoyed remarkable success for various vision tasks, however it remains challenging to apply CNNs to domains lacking a regular underlying structures such as 3D point clouds. Towards this we propose a novel convolutional architecture, termed SpiderCNN, to efficiently extract geometric features from point clouds. SpiderCNN is comprised of units called SpiderConv, which extend convolutional operations from regular grids to irregular point set that can be embedded in R^n, by parametrizing a family of convolutional filters. We elaborately design the filter as a product of simple step function that captures local geodesic information and a Taylor polynomial that ensures the expressiveness. SpiderCNN inherits the multiscale hierarchical architecture from the classical CNNs, which allows it to extract semantic deep features. Experiments on ModelNet40 demonstrate that SpiderCNN achieves thestateofthe art accuracy 92.4% on standard benchmarks, and shows competitive performance on segmentation task. 
SpikeandSlab LASSO  SSLASSO 
SpikeProp  For a network of spiking neurons that encodes information in the timing of individual spike times, we derive a supervised learning rule, SpikeProp, akin to traditional errorbackpropagation. With this algorithm, we demonstrate how networks of spiking neurons with biologically reasonable action potentials can perform complex nonlinear classification in fast temporal coding just as well as ratecoded networks. We perform experiments for the classical XOR problem, when posed in a temporal setting, as well as for a number of other benchmark datasets. Comparing the (implicit) number of spiking neurons required for the encoding of the interpolated XOR problem, the trained networks demonstrate that temporal coding is a viable code for fast neural information processing, and as such requires less neurons than instantaneous ratecoding. Furthermore, we find that reliable temporal computation in the spiking networks was only accomplished when using spike response functions with a time constant longer than the coding interval, as has been predicted by theoretical considerations. 
SpikeTriggered nonNegative Matrix Factorization (STNMF) 
Neurons in sensory systems often pool inputs over arrays of presynaptic cells, giving rise to functional subunits inside a neuron´s receptive field. The organization of these subunits provides a signature of the neuron´s presynaptic functional connectivity and determines how the neuron integrates sensory stimuli. Here we introduce the method of spiketriggered nonnegative matrix factorization for detecting the layout of subunits within a neuron´s receptive field. The method only requires the neuron´s spiking responses under finely structured sensory stimulation and is therefore applicable to large populations of simultaneously recorded neurons. Applied to recordings from ganglion cells in the salamander retina, the method retrieves the receptive fields of presynaptic bipolar cells, as verified by simultaneous bipolar and ganglion cell recordings. The identified subunit layouts allow improved predictions of ganglion cell responses to natural stimuli and reveal shared bipolar cell input into distinct types of ganglion cells. Characterizing Neuronal Circuits with Spiketriggered Nonnegative Matrix Factorization 
Spiking Neural Network (SNN) 
Spiking Neural Nets (SNNs) (also sometimes called Oscillatory NNs) are being developed from an examination of the fact that neurons do not constantly communicate with one another but rather in spikes of signals. We all have heard of alpha waves in the brain and these oscillations are only one manifestation of the irregular cyclic and spiking nature of communication among neurons. So if individual neurons are activated only under specific circumstances in which the electrical potential exceeds a specific threshold, a spike, what might be the implication for designing neural nets? For one, there is the fundamental question of whether information is being encoded in the rate, amplitude, or even latency of the spikes. It appears this is so. The SNNs that have been demonstrated thus far show the following characteristics: · They can be developed with far fewer layers. If nodes only fire in response to a spike (actually a train of spikes) then one spiking neuron could replace many hundreds of hidden units on a sigmoidal NN. · There are implications for energy efficiency. SNNs should require much lower power than CNNs. · You could in theory route spikes like data packets further reducing layers. It’s tempting to say this reduces complexity and it’s true that layers go away, but are replaced by the complexity of interpreting and directing basically noisy spike trains. · Training SNNs does not rely on gradient descent functions as do CNNs. Gradient descent which looks at the performance of the overall network can be led astray by unusual conditions at a layer like a nondifferentiable activation function. The current and typical way to train SNNs is some variation on ‘Spike Timing Dependent Plasticity’ and is based on the timing, amplitude, or latency of the spike train. 
Spinnaker  Spinnaker is an open source, multicloud continuous delivery platform for releasing software changes with high velocity and confidence. 
Spirtes Glymour Scheines Algorithm (SGS) 
A.) Form the complete undirected graph H on the vertex set V. B.) For each pair of vertices A and B, if there exists a subset S of V such that A and B are dseparated given S, remove the edge between A and B from H. C.) Let K be the undirected graph resulting from step B). For each triple of vertices A B, and C such that the pair A and B and the pair B and C are each adjacent in K (written as A – B – C) but the pair A and C are not adjacent in K, orient A – B – C as A > B < C if and only if there is no subset S of {B} È V that dseparates A and C. D.) repeat · If A > B, B and C are adjacent, A and C are not adjacent, and there is no arrowhead at B, then orient B – C as B > C. · If there is a directed path from A to B, and an edge between A and B, then orient A – B as A > B. until no more edges can be oriented. 
SplitApplyCombine Strategy  In a splitapplycombine strategy you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. 
Splitted Isotonic Regression (ISOSPLIT) 
A limitation of many clustering algorithms is the requirement to tune adjustable parameters for each application or even for each dataset. Some algorithms require an \emph{a priori} estimate of the number of clusters while densitybased techniques usually require a scale parameter. Other parametric methods, such as mixture modeling, make assumptions about the underlying cluster distributions. Here we introduce a nonparametric clustering method that does not involve tunable parameters and only assumes that clusters are unimodal, in the sense that they have a single point of maximal density when projected onto any line, and that clusters are separated from one another by a separating hyperplane of relatively lower density. The technique uses a nonparametric algorithm—isotonic regression—as the kernel operation repeated at every iteration. We carry out a rigorous hypothesis test for whether pairs of clusters should be merged based upon Monte Carlo sampling of a statistic. We compare the method against kmeans++, DBSCAN, and Gaussian mixture algorithms and show in simulations that it performs better than these standard methods in many situations. The algorithm’s utility is also demonstrated in the context of ‘spike sorting’ of neural electrical recordings. The source code for the algorithm is freely available. 
SPLOM Chart  The scatterplot matrix, known acronymically as SPLOM, is a relatively uncommon graphical tool that uses multiple scatterplots to determine the correlation (if any) between a series of variables. These scatterplots are then organized into a matrix, making it easy to look at all the potential correlations in one place. SPLOMs, invented by John Hartigan in 1975, allow data aficionados to quickly realize any interesting correlations between parameters in the data set. 
Spoken Dialogue System (SDS) 
A spoken dialog system is a computer system able to converse with a human with voice. It has two essential components that do not exist in a text dialog system: a speech recognizer and a texttospeech module. In can be further distinguished from command and control speech systems that can respond to requests but do not attempt to maintain continuity over time. 
Spontaneous Clustering  We propose a new method for clustering based on the local minimization of the \gammadivergence, which we call the spontaneous clustering. The greatest advantage of the proposed method is that it automatically detects the number of clusters that adequately reflect the data structure. In contrast, exiting methods such as Kmeans, fuzzy cmeans, and model based clustering need to prescribe the number of clusters. We detect all the local minimum points of the \gammadivergence, which are defined as the centers of clusters. A necessary and sufficient condition for the \gammadivergence to have the local minimum points is also derived in a simple setting. A simulation study and a real data analysis are performed to compare our proposal with existing methods. 
Spotlight Analysis  New name for an old way of interpreting an interaction between a continuous and a categorical grouping variable in a regression model. The basic idea of spotlight analysis is to compare the mean satisfaction score of the two groups at specific values of the continuous covariate. 
Spotting anomalies with Privileged Information (SPI) 
We introduce a new unsupervised anomaly detection ensemble called SPI which can harness privileged information – data available only for training examples but not for (future) test examples. Our ideas build on the Learning Using Privileged Information (LUPI) paradigm pioneered by Vapnik et al. [19,17], which we extend to unsupervised learning and in particular to anomaly detection. SPI (for Spotting anomalies with Privileged Information) constructs a number of frames/fragments of knowledge (i.e., density estimates) in the privileged space and transfers them to the anomaly scoring space through ‘imitation’ functions that use only the partial information available for test examples. Our generalization of the LUPI paradigm to unsupervised anomaly detection shepherds the field in several key directions, including (i) domain knowledgeaugmented detection using expert annotations as PI, (ii) fast detection using computationallydemanding data as PI, and (iii) early detection using ‘historical future’ data as PI. Through extensive experiments on simulated and real datasets, we show that augmenting privileged information to anomaly detection significantly improves detection performance. We also demonstrate the promise of SPI under all three settings (iiii); with PI capturing expert knowledge, computationally expensive features, and future data on three real world detection tasks. 
spray  spray is an opensource toolkit for building REST/HTTPbased integration layers on top of Scala and Akka. Being asynchronous, actorbased, fast, lightweight, modular and testable it’s a great way to connect your Scala applications to the world. 
Spreading Activation  Spreading activation is a method for searching associative networks, neural networks, or semantic networks. The search process is initiated by labeling a set of source nodes (e.g. concepts in a semantic network) with weights or “activation” and then iteratively propagating or “spreading” that activation out to other nodes linked to the source nodes. Most often these “weights” are real values that decay as activation propagates through the network. When the weights are discrete this process is often referred to as marker passing. Activation may originate from alternate paths, identified by distinct markers, and terminate when two alternate paths reach the same node. 
Spreadmart  A spreadmart (spreadsheet data mart) is a situation in which a company’s employees has inconsistent views of corporate data because each department relies on the data from their own spreadsheets. 
Spring for Apache Hadoop  Spring for Apache Hadoop simplifies developing Apache Hadoop by providing a unified configuration model and easy to use APIs for using HDFS, MapReduce, Pig, and Hive. It also provides integration with other Spring ecosystem project such as Spring Integration and Spring Batch enabling you to develop solutions for big data ingest/export and Hadoop workflow orchestration. 
SPSAFSR  This manuscript presents the following: (1) an improved version of the Binary Simultaneous Perturbation Stochastic Approximation (SPSA) Method for feature selection in machine learning (Aksakalli and Malekipirbazari, Pattern Recognition Letters, Vol. 75, 2016) based on nonmonotone iteration gains computed via the Barzilai and Borwein (BB) method, (2) its adaptation for feature ranking, and (3) comparison against popular methods on public benchmark datasets. The improved method, which we call SPSAFSR, dramatically reduces the number of iterations required for convergence without impacting solution quality. SPSAFSR can be used for feature ranking and feature selection both for classification and regression problems. After a review of the current stateoftheart, we discuss our improvements in detail and present three sets of computational experiments: (1) comparison of SPSAFS as a (wrapper) feature selection method against sequential methods as well as genetic algorithms, (2) comparison of SPSAFS as a feature ranking method in a classification setting against random forest importance, chisquared, and information main methods, and (3) comparison of SPSAFS as a feature ranking method in a regression setting against minimum redundancy maximum relevance (MRMR), RELIEF, and linear correlation methods. The number of features in the datasets we use range from a few dozens to a few thousands. Our results indicate that SPSAFS converges to a good feature set in no more than 100 iterations and therefore it is quite fast for a wrapper method. SPSAFS also outperforms popular feature selection as well as feature ranking methods in majority of test cases, sometimes by a large margin, and it stands as a promising new feature selection and ranking method. 
Spyre  Spyre is a Web Application Framework for providing a simple user interface for Python data projects. Spyre runs on the minimalist python web framework, cherrypy, with jinja2 templating. At it’s heart, spyre is about data and data visualization, so you’ll also need pandas and matplotlib. 
SQLite  SQLite is a software library that implements a selfcontained, serverless, zeroconfiguration, transactional SQL database engine. SQLite is a relational database management system contained in a C programming library. In contrast to other database management systems, SQLite is not a separate process that is accessed from the client application, but an integral part of it. SQLite is ACIDcompliant and implements most of the SQL standard, using a dynamically and weakly typed SQL syntax that does not guarantee the domain integrity. SQLite is a popular choice as embedded database for local/client storage in application software such as web browsers. It is arguably the most widely deployed database engine, as it is used today by several widespread browsers, operating systems, and embedded systems, among others. SQLite has bindings to many programming languages. The source code for SQLite is in the public domain. RSQLite 
SQLScript  The motivation for SQLScript is to embed dataintensive application logic into the database. As of today, applications only offload very limited functionality into the database using SQL, most of the application logic is normally executed in an application server. This has the effect that data to be operated upon needs to be copied from the database into the application server and vice versa. When executing data intensive logic, this copying of data is very expensive in terms of processor and data transfer time. Moreover, when using an imperative language like ABAP or JAVA for processing data, developers tend to write algorithms which follow a one tuple at a time semantics (for example looping over rows in a table). However, these algorithms are hard to optimize and parallelize compared to declarative setoriented languages such as SQL. The SAP HANA database is optimized for modern technology trends and takes advantage of modern hardware, for example, by having data residing in mainmemory and allowing massiveparallelization on multicore CPUs. The goal of the SAP HANA database is to optimally support application requirements by leveraging such hardware. To this end, the SAP HANA database exposes a very sophisticated interface to the application consisting of many languages. The expressiveness of these languages far exceeds that attainable with OpenSQL. The set of SQL extensions for the SAP HANA database that allow developers to push data intensive logic into the database is called SQLScript. Conceptually SQLScript is related to stored procedures as defined in the SQL standard, but SQLScript is designed to provide superior optimization possibilities. SQLScript should be used in cases where other modeling constructs of SAP HANA, for example analytic views or attribute views are not sufficient. For more information on how to best exploit the different view types, see ‘Exploit Underlying Engine’. The set of SQL extensions are the key to avoiding massive data copies to the application server and for leveraging sophisticated parallel execution strategies of the database. SQLScript addresses the following problems: ● Decomposing an SQL query can only be done using views. However when decomposing complex queries using views, all intermediate results are visible and must be explicitly typed. Moreover SQL views cannot be parameterized which limits their reuse. In particular they can only be used like tables and embedded into other SQL statements. ● SQL queries do not have features to express business logic (for example a complex currency conversion). As a consequence such a business logic cannot be pushed down into the database (even if it is mainly based on standard aggregations like SUM(Sales), etc.). ● An SQL query can only return one result at a time. As a consequence the computation of related result sets must be split into separate, usually unrelated, queries. ● As SQLScript encourages developers to implement algorithms using a setoriented paradigm and not using a one tuple at a time paradigm, imperative logic is required, for example by iterative approximation algorithms. Thus it is possible to mix imperative constructs known from stored procedures with declarative ones. 
SquaredLoss Mutual Information (SMI) 
➘ “SquaredLoss Mutual Information Regularization” 
SquaredLoss Mutual Information Regularization (SMIR) 
We propose squaredloss mutual information regularization (SMIR) for multiclass probabilistic classi cation, following the information maximization principle. SMIR is convex under mild conditions and thus improves the nonconvexity of mutual information regularization. It offers all of the following four abilities to semisupervised algorithms: Analytical solution, outofsample/multiclass classification, and probabilistic output. Furthermore, novel generalization error bounds are derived. Experiments show SMIR compares favorably with stateoftheart methods. 
SqueezeSegNet  The recent researches in Deep Convolutional Neural Network have focused their attention on improving accuracy that provide significant advances. However, if they were limited to classification tasks, nowadays with contributions from Scientific Communities who are embarking in this field, they have become very useful in higher level tasks such as object detection and pixelwise semantic segmentation. Thus, brilliant ideas in the field of semantic segmentation with deep learning have completed the state of the art of accuracy, however this architectures become very difficult to apply in embedded systems as is the case for autonomous driving. We present a new Deep fully Convolutional Neural Network for pixelwise semantic segmentation which we call SqueezeSegNet. The architecture is based on EncoderDecoder style. We use a SqueezeNetlike encoder and a decoder formed by our proposed squeezedecoder module and upsample layer using downsample indices like in SegNet and we add a deconvolution layer to provide final multichannel feature map. On datasets like Camvid or Citystates, our net gets SegNetlevel accuracy with less than 10 times fewer parameters than SegNet. 
SSIMLayer  Deeper convolutional neural networks provide more capacity to approximate complex mapping functions. However, increasing network depth imposes difficulties on training and increases model complexity. This paper presents a new nonlinear computational layer of considerably high capacity to the deep convolutional neural network architectures. This layer performs a set of comprehensive convolution operations that mimics the overall function of the human visual system (HVS) via focusing on learning structural information in its input. The core of its computations is evaluating the components of the structural similarity metric (SSIM) in a setting that allows the kernels to learn to match structural information. The proposed SSIMLayer is inherently nonlinear and hence, it does not require subsequent nonlinear transformations. Experiments conducted on CIFAR10 benchmark demonstrates that the SSIMLayer provides better convergence than the traditional convolutional layer, bypasses the need for nonlinear transformations and shows more robustness against noise perturbations and adversarial attacks. 
Stability  A learning system is said to be stable if no pattern in the training data changes its category after a finite number of learning iterations. 
Stable Marriage Problem (SMP) 
In mathematics, economics, and computer science, the stable marriage problem (also stable matching problem or SMP) is the problem of finding a stable matching between two equally sized sets of elements given an ordering of preferences for each element. A matching is a mapping from the elements of one set to the elements of the other set. A matching is stable whenever it is not the case that both: 1. some given element A of the first matched set prefers some given element B of the second matched set over the element to which A is already matched, and 2. B also prefers A over the element to which B is already matched In other words, a matching is stable when there does not exist any match (A, B) by which both A and B are individually better off than they would be with the element to which they are currently matched. The stable marriage problem is commonly stated in terms of heterosexual marriages and binary genders: ‘Given n men and n women, where each person has ranked all members of the opposite sex in order of preference, marry the men and women together such that there are no two people of opposite sex who would both rather have each other than their current partners. When there are no such pairs of people, the set of marriages is deemed stable.’ Algorithms for finding solutions to the stable marriage problem have applications in a variety of realworld situations, perhaps the best known of these being in the assignment of graduating medical students to their first hospital appointments. In 2012, the Nobel Prize in Economics was awarded to Lloyd S. Shapley and Alvin E. Roth ‘for the theory of stable allocations and the practice of market design.’ 
Stacked Area Plot  Stacked Area Graphs work in the same way as simple Area Graphs do, except for the use of multiple data series that start each point from the point left by the previous data series. The entire graph represents the total of all the data plotted. Stacked Area Graphs also use area to convey whole numbers, so they do not work for negative values. Overall, they are useful for comparing multiple variables changing over an interval. areaplot 
Stacked Autoencoders  A stacked autoencoder is a neural network consisting of multiple layers of sparse autoencoders in which the outputs of each layer is wired to the inputs of the successive layer. The greedy layerwise approach for pretraining a deep network works by training each layer in turn. In this page, you will find out how autoencoders can be “stacked” in a greedy layerwise fashion for pretraining (initializing) the weights of a deep network. 
Stacked Deconvolutional Network (SDN) 
Recent progress in semantic segmentation has been driven by improving the spatial resolution under Fully Convolutional Networks (FCNs). To address this problem, we propose a Stacked Deconvolutional Network (SDN) for semantic segmentation. In SDN, multiple shallow deconvolutional networks, which are called as SDN units, are stacked one by one to integrate contextual information and guarantee the fine recovery of localization information. Meanwhile, interunit and intraunit connections are designed to assist network training and enhance feature fusion since the connections improve the flow of information and gradient propagation throughout the network. Besides, hierarchical supervision is applied during the upsampling process of each SDN unit, which guarantees the discrimination of feature representations and benefits the network optimization. We carry out comprehensive experiments and achieve the new stateoftheart results on three datasets, including PASCAL VOC 2012, CamVid, GATECH. In particular, our best model without CRF postprocessing achieves an intersectionoverunion score of 86.6% in the test set. 
Stacked Denoising Autoencoder (SdA) 
A stacked denoising autoencoder is to a denoising autoencoder what a deepbelief network is to a restricted Boltzmann machine. A key function of SDAs, and deep learning more generally, is unsupervised pretraining, layer by layer, as input is fed through. Once each layer is pretrained to conduct feature selection and extraction on the input from the preceding layer, a second stage of supervised finetuning can follow. A word on stochastic corruption in SDAs: Denoising autoencoders shuffle data around and learn about that data by attempting to reconstruct it. The act of shuffling is the noise, and the job of the network is to recognize the features within the noise that will allow it to classify the input. When a network is being trained, it generates a model, and measures the distance between that model and the benchmark through a loss function. Its attempts to minimize the loss function involve resampling the shuffled inputs and rereconstructing the data, until it finds those inputs which bring its model closest to what it has been told is true. The serial resamplings are based on a generative model to randomly provide data to be processed. This is known as a Markov Chain, and more specifically, a Markov Chain Monte Carlo algorithm that steps through the data set seeking a representative sampling of indicators that can be used to construct more and more complex features. 
Stacked Generalization (Blending) 
Stacking (sometimes called stacked generalization) involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data, then a combiner algorithm is trained to make a final prediction using all the predictions of the other algorithms as additional inputs. If an arbitrary combiner algorithm is used, then stacking can theoretically represent any of the ensemble techniques described in this article, although in practice, a singlelayer logistic regression model is often used as the combiner. Stacking typically yields performance better than any single one of the trained models. It has been successfully used on both supervised learning tasks (regression) and unsupervised learning (density estimation). It has also been used to estimate bagging’s error rate. It has been reported to outperform Bayesian modelaveraging. The two topperformers in the Netflix competition utilized blending, which may be considered to be a form of stacking. 
Stacked Generative Adversarial Networks (SGAN) 
In this paper we aim to leverage the powerful bottomup discriminative representations to guide a topdown generative model. We propose a novel generative model named Stacked Generative Adversarial Networks (SGAN), which is trained to invert the hierarchical representations of a discriminative bottomup deep network. Our model consists of a topdown stack of GANs, each trained to generate ‘plausible’ lowerlevel representations, conditioned on higherlevel representations. A representation discriminator is introduced at each feature hierarchy to encourage the representation manifold of the generator to align with that of the bottomup discriminative network, providing intermediate supervision. In addition, we introduce a conditional loss that encourages the use of conditional information from the layer above, and a novel entropy loss that maximizes a variational lower bound on the conditional entropy of generator outputs. To the best of our knowledge, the entropy loss is the first attempt to tackle the conditional model collapse problem that is common in conditional GANs. We first train each GAN of the stack independently, and then we train the stack endtoend. Unlike the original GAN that uses a single noise vector to represent all the variations, our SGAN decomposes variations into multiple levels and gradually resolves uncertainties in the topdown generative process. Experiments demonstrate that SGAN is able to generate diverse and highquality images, as well as being more interpretable than a vanilla GAN. 
Stacked Kernel Network (SKN) 
Kernel methods are powerful tools to capture nonlinear patterns behind data. They implicitly learn high (even infinite) dimensional nonlinear features in the Reproducing Kernel Hilbert Space (RKHS) while making the computation tractable by leveraging the kernel trick. Classic kernel methods learn a single layer of nonlinear features, whose representational power may be limited. Motivated by recent success of deep neural networks (DNNs) that learn multilayer hierarchical representations, we propose a Stacked Kernel Network (SKN) that learns a hierarchy of RKHSbased nonlinear features. SKN interleaves several layers of nonlinear transformations (from a linear space to a RKHS) and linear transformations (from a RKHS to a linear space). Similar to DNNs, a SKN is composed of multiple layers of hidden units, but each parameterized by a RKHS function rather than a finitedimensional vector. We propose three ways to represent the RKHS functions in SKN: (1)nonparametric representation, (2)parametric representation and (3)random Fourier feature representation. Furthermore, we expand SKN into CNN architecture called Stacked Kernel Convolutional Network (SKCN). SKCN learning a hierarchy of RKHSbased nonlinear features by convolutional operation with each filter also parameterized by a RKHS function rather than a finitedimensional matrix in CNN, which is suitable for image inputs. Experiments on various datasets demonstrate the effectiveness of SKN and SKCN, which outperform the competitive methods. 
Staircase Network  Language recognition system is typically trained directly to optimize classification error on the target language labels, without using the external, or metainformation in the estimation of the model parameters. However labels are not independent of each other, there is a dependency enforced by, for example, the language family, which affects negatively on classification. The other external information sources (e.g. audio encoding, telephony or video speech) can also decrease classification accuracy. In this paper, we attempt to solve these issues by constructing a deep hierarchical neural network, where different levels of metainformation are encapsulated by attentive prediction units and also embedded into the training progress. The proposed method learns auxiliary tasks to obtain robust internal representation and to construct a variant of attentive units within the hierarchical model. The final result is the structural prediction of the target language and a closely related language family. The algorithm reflects a ‘staircase’ way of learning in both its architecture and training, advancing from the fundamental audio encoding to the language family level and finally to the target language level. This process not only improves generalization but also tackles the issues of imbalanced class priors and channel variability in the deep neural network model. Our experimental findings show that the proposed architecture outperforms the stateoftheart ivector approaches on both small and big language corpora by a significant margin. 
Stan  Stan is a probabilistic programming language implementing full Bayesian statistical inference wit MCMC sampling (NUTS, HMC) and penalized maximum likelihood estimation wit Optimization (LBFGS. Stan is coded in C++ and runs on all major platforms (Linux, Mac, Windows). Stan is freedomrespecting, opensource software (new BSD core, GPLv3 interfaces). http://…/hellostan.html rstan 
Standard Methodology for Analytical Models (SMAM) 
In this document, the Standard Methodology for Analytical Models (SMAM) is described. The most frequent used methodology is the Cross Industrial Standard Processes for Data Mining (CRISPDM), which has several shortcomings that translate into frequent friction points with the business when practitioners start building analytical models. 
Stanford DAWN Project  Despite incredible recent advances in machine learning, building machine learning applications remains prohibitively timeconsuming and expensive for all but the besttrained, bestfunded engineering organizations. This expense comes not from a need for new and improved statistical models but instead from a lack of systems and tools for supporting endtoend machine learning application development, from data preparation and labeling to productionization and monitoring. In this document, we outline opportunities for infrastructure supporting usable, endtoend machine learning applications in the context of the nascent DAWN (Data Analytics for What’s Next) project at Stanford. 
Stanford Question Answering Dataset (SQuAD) 
Stanford Question Answering Dataset (SQuAD) is a new reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ questionanswer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets. 
StarCraft II Learning Environment (SC2LE) 
This paper introduces SC2LE (StarCraft II Learning Environment), a reinforcement learning environment based on the StarCraft II game. This domain poses a new grand challenge for reinforcement learning, representing a more difficult class of problems than considered in most prior work. It is a multiagent problem with multiple players interacting; there is imperfect information due to a partially observed map; it has a large action space involving the selection and control of hundreds of units; it has a large state space that must be observed solely from raw input feature planes; and it has delayed credit assignment requiring longterm strategies over thousands of steps. We describe the observation, action, and reward specification for the StarCraft II domain and provide an open source Pythonbased interface for communicating with the game engine. In addition to the main game maps, we provide a suite of minigames focusing on different elements of StarCraft II gameplay. For the main game maps, we also provide an accompanying dataset of game replay data from human expert players. We give initial baseline results for neural networks trained from this data to predict game outcomes and player actions. Finally, we present initial baseline results for canonical deep reinforcement learning agents applied to the StarCraft II domain. On the minigames, these agents learn to achieve a level of play that is comparable to a novice player. However, when trained on the main game, these agents are unable to make significant progress. Thus, SC2LE offers a new and challenging environment for exploring deep reinforcement learning algorithms and architectures. 
StarKOSR  Motivated by many practical applications in logistics and mobilityasaservice, we study the topk optimal sequenced routes (KOSR) querying on large, general graphs where the edge weights may not satisfy the triangle inequality, e.g., road network graphs with travel times as edge weights. The KOSR querying strives to find the topk optimal routes (i.e., with the topk minimal total costs) from a given source to a given destination, which must visit a number of vertices with specific vertex categories (e.g., gas stations, restaurants, and shopping malls) in a particular order (e.g., visiting gas stations before restaurants and then shopping malls). To efficiently find the topk optimal sequenced routes, we propose two algorithms PruningKOSR and StarKOSR. In PruningKOSR, we define a dominance relationship between two partiallyexplored routes. The partiallyexplored routes that can be dominated by other partiallyexplored routes are postponed being extended, which leads to a smaller searching space and thus improves efficiency. In StarKOSR, we further improve the efficiency by extending routes in an A* manner. With the help of a judiciously designed heuristic estimation that works for general graphs, the cost of partially explored routes to the destination can be estimated such that the qualified complete routes can be found early. In addition, we demonstrate the high extensibility of the proposed algorithms by incorporating Hop Labeling, an effective label indexing technique for shortest path queries, to further improve efficiency. Extensive experiments on multiple realworld graphs demonstrate that the proposed methods significantly outperform the baseline method. Furthermore, when k=1, StarKOSR also outperforms the stateoftheart method for the optimal sequenced route queries. 
STARTS  Although researchers in clinical psychology routinely gather data in which many individuals respond at multiple times, there is not a standard way to analyze such data. A new approach for the analysis of such data is described. It is proposed that a person’s current standing on a variable is caused by 3 sources of variance: a term that does not change (trait), a term that changes (state), and a random term (error). It is shown how structural equation modeling can be used to estimate such a model. An extended example is presented in which the correlations between variables are quite different at the trait, state, and error levels. (PsycINFO Database Record (c) 2016 APA, all rights reserved) STARTS 
Stata  Stata is a complete, integrated statistical software package that provides everything you need for data analysis, data management, and graphics. With both a pointandclick interface and a powerful, intuitive command syntax, Stata is fast, accurate, and easy to use. All analyses can be reproduced and documented for publication and review. Version control ensures statistical programs will continue to produce the same results no matter when you wrote them. 
State Space Model (SSM) 
State space model (SSM) refers to a class of probabilistic graphical model (Koller and Friedman, 2009) that describes the probabilistic dependence between the latent state variable and the observed measurement. The state or the measurement can be either continuous or discrete. The term “state space” originated in 1960s in the area of control engineering (Kalman, 1960). SSM provides a general framework for analyzing deterministic and stochastic dynamical systems that are measured or observed through a stochastic process. The SSM framework has been successfully applied in engineering, statistics, computer science and economics to solve a broad range of dynamical systems problems. Other terms used to describe SSMs are hidden Markov models (HMMs) (Rabiner, 1989) and latent process models. The most well studied SSM is the Kalman filter, which defines an optimal algorithm for inferring linear Gaussian systems. 
StateActionRewardStateAction (SARSA) 
SARSA (StateActionRewardStateAction) is an algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning. It was introduced in a technical note where the alternative name SARSA was only mentioned as a footnote. This name simply reflects the fact that the main function for updating the Qvalue depends on the current state of the agent “S1”, the action the agent chooses “A1”, the reward “R” the agent gets for choosing this action, the state “S2” that the agent will now be in after taking that action, and finally the next action “A2” the agent will choose in its new state. Taking every letter in the quintuple (st, at, rt, st+1, at+1) yields the word SARSA. 
Stated Preference Method  The term “Stated Preference Methods” refers to a family of techniques which use individual respondents´ statements about their preferences in a set of options to estimate utility functions. The options are typically descriptions of situations or contexts constructed by the researcher. By their nature, stated preference methods require purposedesigned surveys for their collection of data. “Contingent Valuation” is often referred to as a stated preference model. http://…/spmur 
StateDenoised Recurrent Neural Network (SDRNN) 
Recurrent neural networks (RNNs) are difficult to train on sequence processing tasks, not only because input noise may be amplified through feedback, but also because any inaccuracy in the weights has similar consequences as input noise. We describe a method for denoising the hidden state during training to achieve more robust representations thereby improving generalization performance. Attractor dynamics are incorporated into the hidden state to `clean up’ representations at each step of a sequence. The attractor dynamics are trained through an auxillary denoising loss to recover previously experienced hidden states from noisy versions of those states. This statedenoised recurrent neural network {SDRNN} performs multiple steps of internal processing for each external sequence step. On a range of tasks, we show that the SDRNN outperforms a generic RNN as well as a variant of the SDRNN with attractor dynamics on the hidden state but without the auxillary loss. We argue that attractor dynamics—and corresponding connectivity constraints—are an essential component of the deep learning arsenal and should be invoked not only for recurrent networks but also for improving deep feedforward nets and intertask transfer. 
Stationary Process  In mathematics and statistics, a stationary process (or strict(ly) stationary process or strong(ly) stationary process) is a stochastic process whose joint probability distribution does not change when shifted in time. Consequently, parameters such as the mean and variance, if they are present, also do not change over time and do not follow any trends. Stationarity is used as a tool in time series analysis, where the raw data is often transformed to become stationary; for example, economic data are often seasonal and/or dependent on a nonstationary price level. An important type of nonstationary process that does not include a trendlike behavior is the cyclostationary process. Note that a “stationary process” is not the same thing as a “process with a stationary distribution”. Indeed there are further possibilities for confusion with the use of “stationary” in the context of stochastic processes; for example a “timehomogeneous” Markov chain is sometimes said to have “stationary transition probabilities”. Besides, all stationary Markov random processes are timehomogeneous. 
STATISTICA  STATISTICA is a statistics and analytics software package developed by StatSoft. STATISTICA provides data analysis, data management, statistics, data mining, and data visualization procedures. STATISTICA product categories include Enterprise (for use across a site or organization), WebBased (for use with a server and web browser), Concurrent Network Desktop, and SingleUser Desktop. 
Statistical Analysis System (SAS) 
SAS (Statistical Analysis System; not to be confused with SAP) is a software suite developed by SAS Institute for advanced analytics, business intelligence, data management, and predictive analytics. SAS was developed at North Carolina State University from 1966 until 1976, when SAS Institute was incorporated. SAS was further developed in the 1980s and 1990s with the addition of new statistical procedures, additional components and the introduction of JMP. A pointandclick interface was added in version 9 in 2004. A social media analytics product was added in 2010. 
Statistical Archetypal Analysis (SAA) 
Statistical Archetypal Analysis (SAA) is introduced for the dimensional reduction of a collection of probability distributions known via samples. Applications include medical diagnosis from clinical data in the form of distributions (such as distributions of blood pressure or heart rates from different patients), the analysis of climate data such as temperature or wind speed at different locations, and the study of bifurcations in stochastic dynamical systems. Distributions can be embedded into a Hilbert space with a suitable metric, and then analyzed similarly to feature vectors in Euclidean space. However, most dimensional reduction techniques –such as Principal Component Analysis– are not interpretable for distributions, as neither the components nor the reconstruction of input data by components are themselves distributions. To obtain an interpretable result, Archetypal Analysis (AA) is extended to distributions, requiring the components to be mixtures of the input distributions and approximating the input distributions by mixtures of components. 
Statistical Classification  In machine learning and statistics, classification is the problem of identifying to which of a set of categories (subpopulations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email into “spam” or “nonspam” classes or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.). In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance. 
Statistical Data and Metadata eXchange (SDMX) 
SDMX is an initiative to foster standards for the exchange of statistical information. It started in 2001 and aims at fostering standards for Statistical Data and Metadata eXchange (SDMX). The SDMX message formats have two basic expressions, SDMXML (using XML syntax) and SDMXEDI (using EDIFACT syntax and based on the GESMES/TS statistical message). The standards also include additional specifications (e.g. registry specification, web services). Version 1.0 of the SDMX standard has been recognised as an ISO standard in 2005. The latest version of the standard – SDMX 2.1 – has been released in April 2011. In 2013 SDMX was approved by ISO as an International Standard (ISO 17369:2013). rsdmx 
Statistical Decision Theory  
Statistical Disclosure Control (SDC) 
The purpose of statistical disclosure control is to make as small as possible the risk of releasing confidential information whilst maximising the access to useful, high quality data. Statistical disclosure control (SDC) covers a range of ways of changing data which are used to control the risk of an intruder finding out confidential information about a person or unit (such as a household or business). Laws protect the confidentiality of data about living people and there is also a range of legislation for specific types of data, for example the census. Many surveys also carry a confidentiality assurance, which is an agreement between the respondent and the data collector about how the collected data will be used. In the last ten years there has been a large increase in the electronic storage of data and wider access to information on the internet, including data for small geographical areas. At the same time, computing expertise and access to computers with a large amount of processing power have also increased. This means that data publishers need to take increased steps so that released microdata (data held as individual records) and tabulations do not reveal any identifiable or disclosive information about a person , household or business. Statistical Disclosure Control: Protecting sensitive information Statistical disclosure control Introduction to Statistical Disclosure Control sdcTarget 
Statistical Disclosure Limitation (SDL) 
The Statistical Disclosure Limitation (SDL) problem involves modifying a data set in such a manner that statistical analysis on the modified data is reasonably close to that performed on the original data, while preserving the privacy of individuals in the data set. For instance, we might have a medical data set on which we want to allow researchers to do their statistical analyses but not violate the privacy of the patients in the study. 
Statistical Distance  In statistics, probability theory, and information theory, a statistical distance quantifies the distance between two statistical objects, which can be two random variables, or two probability distributions or samples, or the distance can be between an individual sample point and a population or a wider sample of points. A distance between populations can be interpreted as measuring the distance between two probability distributions and hence they are essentially measures of distances between probability measures. Where statistical distance measures relate to the differences between random variables, these may have statistical dependence, and hence these distances are not directly related to measures of distances between probability measures. Again, a measure of distance between random variables may relate to the extent of dependence between them, rather than to their individual values. Statistical distance measures are mostly not metrics and they need not be symmetric. Some types of distance measures are referred to as (statistical) divergences. 
Statistical Engineering  Several authors, including the American Statistician (ASA), have noted the challenges facing statisticians when attacking large, complex, unstructured problems, as opposed to welldefined textbook problems. Clearly, the standard paradigm of selecting the one ‘correct’ statistical method for such problems is not sufficient; a new paradigm is needed. Statistical engineering has been proposed as a discipline that can provide a viable paradigm to attack such problems, used in conjunction with sound statistical science. Of course, in order to develop as a true discipline, statistical engineering needs a welldeveloped theory, not just a formal definition and successful case studies. This article documents and disseminates the current state of the underlying theory of statistical engineering. Our purpose is to provide a vehicle for applied statisticians to further enhance the practice of statistics, and for academics so interested to continue development of the underlying theory of statistical engineering. 
Statistical Inference  In statistics, statistical inference is the process of drawing conclusions from data that are subject to random variation, for example, observational errors or sampling variation. Initial requirements of such a system of procedures for inference and induction are that the system should produce reasonable answers when applied to welldefined situations and that it should be general enough to be applied across a range of situations. Inferential statistics are used to test hypotheses and make estimations using sample data. 
Statistical Learning  ➘ “Statistical Learning Theory” 
Statistical Learning Theory (SLT) 
Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the problem of finding a predictive function based on data. Statistical learning theory has led to successful applications in fields such as computer vision, speech recognition, bioinformatics and baseball. 
Statistical Model  A statistical model is a formalization of relationships between variables in the form of mathematical equations. A statistical model describes how one or more random variables are related to one or more other variables. The model is statistical as the variables are not deterministically but stochastically related. In mathematical terms, a statistical model is frequently thought of as a pair where is the set of possible observations and the set of possible probability distributions on. It is assumed that there is a distinct element of which generates the observed data. Statistical inference enables us to make statements about which element(s) of this set are likely to be the true one. Most statistical tests can be described in the form of a statistical model. For example, the Student’s ttest for comparing the means of two groups can be formulated as seeing if an estimated parameter in the model is different from 0. Another similarity between tests and models is that there are assumptions involved. Error is assumed to be normally distributed in most models. Peter Norvig 
Statistical Power  The power of a statistical test is the probability that it correctly rejects the null hypothesis when the null hypothesis is false (i.e. the probability of not committing a Type II error). It can be equivalently thought of as the probability of correctly accepting the alternative hypothesis when the alternative hypothesis is true – that is, the ability of a test to detect an effect, if the effect actually exists. 
Statistical Process Control (SPC) 
Statistical process control (SPC) is a method of quality control which uses statistical methods. SPC is applied in order to monitor and control a process. Monitoring and controlling the process ensures that it operates at its full potential. At its full potential, the process can make as much conforming product as possible with a minimum (if not an elimination) of waste (rework or scrap). SPC can be applied to any process where the “conforming product” (product meeting specifications) output can be measured. Key tools used in SPC include control charts; a focus on continuous improvement; and the design of experiments. An example of a process where SPC is applied is manufacturing lines. 
Statistical Protocol IDentification (SPID) 
Identifying which application layer protocol is being used within a network communication session is important when assigning Quality of Service priorities as well as when conducting network security monitoring. Currently most protocol identification is performed through signature matching algorithms that rely on strings or regular expressions as signatures. This report presents a protocol identification scheme called the Statistical Protocol Identification (SPID) algorithm, which reliably identifies the application layer protocol by using statistical measurements of flow data as well as application layer data. The SPID algorithm utilises KullbackLeibler divergence measurements to compare probability vectors created from observed network traffic to probability vectors of known protocols. 
Statistical Ranking Color Scheme (SRCS) 
The problem of comparing a new solution method against existing ones to find statistically significant differences arises very often in sciences and engineering. When the problem instance being solved is defined by several parameters, assessing a number of methods with respect to many problem configurations simultaneously becomes a hard task. Some visualization technique is required for presenting a large number of statistical significance results in an easily interpretable way. Here we review an existing colorbased approach called Statistical Ranking Color Scheme (SRCS) for displaying the results of multiple pairwise statistical comparisons between several methods assessed separately on a number of problem configurations. We introduce an R package implementing SRCS, which performs all the pairwise statistical tests from user data and generates customizable plots. We demonstrate its applicability on two examples from the areas of dynamic optimization and machine learning, in which several algorithms are compared on many problem instances, each defined by a combination of parameters. 
Statistical Recurrent Unit (SRU) 
Sophisticated gated recurrent neural network architectures like LSTMs and GRUs have been shown to be highly effective in a myriad of applications. We develop an ungated unit, the statistical recurrent unit (SRU), that is able to learn long term dependencies in data by only keeping moving averages of statistics. The SRU’s architecture is simple, ungated, and contains a comparable number of parameters to LSTMs; yet, SRUs perform favorably to more sophisticated LSTM and GRU alternatives, often outperforming one or both in various tasks. We show the efficacy of SRUs as compared to LSTMs and GRUs in an unbiased manner by optimizing respective architectures’ hyperparameters in a Bayesian optimization scheme for both synthetic and realworld tasks. 
Statistical Relational Learning (SRL) 
Statistical relational learning (SRL) is a subdiscipline of artificial intelligence and machine learning that is concerned with models of domains that exhibit both uncertainty (which can be dealt with using statistical methods) and complex, relational structure. Typically, the knowledge representation formalisms developed in SRL use (a subset of) firstorder logic to describe relational properties of a domain in a general manner (universal quantification) and draw upon probabilistic graphical models (such as Bayesian networks or Markov networks) to model the uncertainty; some also build upon the methods of inductive logic programming. Significant contributions to the field have been made since the late 1990s. As is evident from the characterization above, the field is not strictly limited to learning aspects; it is equally concerned with reasoning (specifically probabilistic inference) and knowledge representation. Therefore, alternative terms that reflect the main foci of the field include statistical relational learning and reasoning (emphasizing the importance of reasoning) and firstorder probabilistic languages (emphasizing the key properties of the languages with which models are represented). http://…/9783319136431 
Statistical Theory  The theory of statistics provides a basis for the whole range of techniques, in both study design and data analysis, that are used within applications of statistics. The theory covers approaches to statisticaldecision problems and to statistical inference, and the actions and deductions that satisfy the basic principles stated for these different approaches. Within a given approach, statistical theory gives ways of comparing statistical procedures; it can find a best possible procedure within a given context for given statistical problems, or can provide guidance on the choice between alternative procedures. Apart from philosophical considerations about how to make statistical inferences and decisions, much of statistical theory consists of mathematical statistics, and is closely linked to probability theory, to utility theory, and to optimization. 
Statistical Transformer Network (StaTN) 
We generalise Spatial Transformer Networks (STN) by replacing the parametric transformation of a fixed, regular sampling grid with a deformable, statistical shape model which is itself learnt. We call this a Statistical Transformer Network (StaTN). By training a network containing a StaTN endtoend for a particular task, the network learns the optimal nonrigid alignment of the input data for the task. Moreover, the statistical shape model is learnt with no direct supervision (such as landmarks) and can be reused for other tasks. Besides training for a specific task, we also show that a StaTN can learn a shape model using generic loss functions. This includes a loss inspired by the minimum description length principle in which an appearance model is also learnt from scratch. In this configuration, our model learns an active appearance model and a means to fit the model from scratch with no supervision at all, even identity labels. 
Statistics  Statistics is the study of the collection, organization, analysis, interpretation and presentation of data. It deals with all aspects of data including the planning of data collection in terms of the design of surveys and experiments. When analyzing data, it is possible to use one or both of statistics methodologies: descriptive and inferential statistics in the analysis data. 
statnet  statnet is a suite of software packages for network analysis that implement recent advances in the statistical modeling of networks. The analytic framework is based on Exponential family Random Graph Models (ergm). statnet provides a comprehensive framework for ergmbased network modeling, including tools for model estimation, model evaluation, modelbased network simulation, and network visualization. This broad functionality is powered by a central Markov chain Monte Carlo (MCMC) algorithm. statnetWeb 
Statues Algorithm  We present here a new probabilistic inference algorithm that gives exact results in the domain of discrete probability distributions. This algorithm, named the Statues algorithm, calculates the marginal probability distribution on probabilistic models defined as direct acyclic graphs. These models are made up of welldefined primitives that allow to express, in particular, joint probability distributions, Bayesian networks, discrete Markov chains, conditioning and probabilistic arithmetic. The Statues algorithm relies on a variable binding mechanism based on the generator construct, a special form of coroutine; being related to the enumeration algorithm, this new algorithm brings important improvements in terms of efficiency, which makes it valuable in regard to other exact marginalization algorithms. After introduction of several definitions, primitives and compositional rules, we present in details the Statues algorithm. Then, we briefly discuss the interest of this algorithm compared to others and we present possible extensions. Finally, we introduce Lea and MicroLea, two Python libraries implementing the Statues algorithm, along with several use cases. 
Stein Points  An important task in computational statistics and machine learning is to approximate a posterior distribution $p(x)$ with an empirical measure supported on a set of representative points $_{i=1}^n$. This paper focuses on methods where the selection of points is essentially deterministic, with an emphasis on achieving accurate approximation when $n$ is small. To this end, we present `Stein Points’. The idea is to exploit either a greedy or a conditional gradient method to iteratively minimise a kernel Stein discrepancy between the empirical measure and $p(x)$. Our empirical results demonstrate that Stein Points enable accurate approximation of the posterior at modest computational cost. In addition, theoretical results are provided to establish convergence of the method. 
Stein Variational Autoencoder  A new method for learning variational autoencoders is developed, based on an application of Stein’s operator. The framework represents the encoder as a deep nonlinear function through which samples from a simple distribution are fed. One need not make parametric assumptions about the form of the encoder distribution, and performance is further enhanced by integrating the proposed encoder with importance sampling. Example results are demonstrated across multiple unsupervised and semisupervised problems, including semisupervised analysis of the ImageNet data, demonstrating the scalability of the model to large datasets. 
Stein´s Paradox  Stein’s example (or phenomenon or paradox), in decision theory and estimation theory, is the phenomenon that when three or more parameters are estimated simultaneously, there exist combined estimators more accurate on average (that is, having lower expected meansquared error) than any method that handles the parameters separately. An intuitive explanation is that optimizing for the meansquared error of a combined estimator is not the same as optimizing for the errors of separate estimators of the individual parameters. In practical terms, if the combined error is in fact of interest, then a combined estimator should be used, even if the underlying parameters are independent; this occurs in channel estimation in telecommunications, for instance (different factors affect overall channel performance). On the other hand, if one is instead interested in estimating an individual parameter, then using a combined estimator does not help and is in fact worse. 
Steiner Tree Reoptimization  New algorithms for Steiner tree reoptimization 
Stein’s Unbiased Risk Estimate (SURE) 
Stein’s unbiased risk estimate (SURE) is an unbiased estimator of the meansquared error of ‘a nearly arbitrary, nonlinear biased estimator.’ In other words, it provides an indication of the accuracy of a given estimator. This is important since the true meansquared error of an estimator is a function of the unknown parameter to be estimated, and thus cannot be determined exactly. The technique is named after its discoverer, Charles Stein. On an improvement of LASSO by scaling 
Stemming  Stemming is the term used in linguistic morphology and information retrieval to describe the process for reducing inflected (or sometimes derived) words to their word stem, base or root form – generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation. Stemming programs are commonly referred to as stemming algorithms or stemmers. 
Stepwise Regression  In statistics, stepwise regression includes regression models in which the choice of predictive variables is carried out by an automatic procedure. Usually, this takes the form of a sequence of Ftests or ttests, but other techniques are possible, such as adjusted Rsquare, Akaike information criterion, Bayesian information criterion, Mallows’s Cp, PRESS, or false discovery rate. The frequent practice of fitting the final selected model followed by reporting estimates and confidence intervals without adjusting them to take the model building process into account has led to calls to stop using stepwise model building altogether or to at least make sure model uncertainty is correctly reflected. StepReg 
STNOCR  Detecting and recognizing text in natural scene images is a challenging, yet not completely solved task. In recent years several new systems that try to solve at least one of the two subtasks (text detection and text recognition) have been proposed. In this paper we present STNOCR, a step towards semisupervised neural networks for scene text recognition, that can be optimized endtoend. In contrast to most existing works that consist of multiple deep neural networks and several preprocessing steps we propose to use a single deep neural network that learns to detect and recognize text from natural images in a semisupervised way. STNOCR is a network that integrates and jointly learns a spatial transformer network, that can learn to detect text regions in an image, and a text recognition network that takes the identified text regions and recognizes their textual content. We investigate how our model behaves on a range of different tasks (detection and recognition of characters, and lines of text). Experimental results on public benchmark datasets show the ability of our model to handle a variety of different tasks, without substantial changes in its overall network structure. 
Stochastic  In probability theory, a purely stochastic system is one whose state is nondeterministic (i.e., “random”) so that the subsequent state of the system is determined probabilistically. Any system or process that must be analyzed using probability theory is stochastic at least in part. Stochastic systems and processes play a fundamental role in mathematical models of phenomena in many fields of science, engineering, and economics. Stochastic comes from a Greek word, which means “aim”. It also denotes a target stick; the pattern of arrows around a target stick stuck in a hillside is representative of what is stochastic. 
Stochastic Activation Pruning (SAP) 
Neural networks are known to be vulnerable to adversarial examples. Carefully chosen perturbations to real images, while imperceptible to humans, induce misclassification and threaten the reliability of deep learning systems in the wild. To guard against adversarial examples, we take inspiration from game theory and cast the problem as a minimax zerosum game between the adversary and the model. In general, for such games, the optimal strategy for both players requires a stochastic policy, also known as a mixed strategy. In this light, we propose Stochastic Activation Pruning (SAP), a mixed strategy for adversarial defense. SAP prunes a random subset of activations (preferentially pruning those with smaller magnitude) and scales up the survivors to compensate. We can apply SAP to pretrained networks, including adversarially trained models, without finetuning, providing robustness against adversarial examples. Experiments demonstrate that SAP confers robustness against attacks, increasing accuracy and preserving calibration. 
Stochastic Answer Network (SAN) 
We propose a stochastic answer network (SAN) to explore multistep inference strategies in Natural Language Inference. Rather than directly predicting the results given the inputs, the model maintains a state and iteratively refines its predictions. Our experiments show that SAN achieves the stateoftheart results on three benchmarks: Stanford Natural Language Inference (SNLI) dataset, MultiGenre Natural Language Inference (MultiNLI) dataset and Quora Question Pairs dataset. 
Stochastic Approximation of Expectation Maximization (SAEM) 
The SAEM algorithm: – computes the maximum likelihood estimator of the population parameters, without any approximation of the model (linearisation, quadrature approximation,…), using the Stochastic Approximation Expectation Maximization (SAEM) algorithm, – provides standard errors for the maximum likelihood estimator – estimates the conditional modes, the conditional means and the conditional standard deviations of the individual parameters, using the HastingsMetropolis algorithm. Several applications of SAEM in agronomy, animal breeding and PKPD analysis have been published by members of the Monolix group (http://group.monolix.org ). CensSpatial,saemix 
Stochastic Average Adjusted Gradient (SAAG) 
Big Data problems in Machine Learning have large number of data points or large number of features, or both, which make training of models di cult because of high computational complexities of single iteration of learning algorithms. To solve such learning problems, Stochastic Approximation o ers an optimization approach to make complexity of each it eration independent of number of data points by taking only one data point or minibatch of data points during each iteration and thereby helping to solve problems with large num ber of data points. Similarly, Coordinate Descent o ers another optimization approach to make iteration complexity independent of the number of features/coordinates/variables by taking only one feature or block of features, instead of all, during an iteration and thereby helping to solve problems with large number of features. In this paper, an op timization framework, namely, Batch Block Optimization Framework has been developed to solve big data problems using the best of Stochastic Approximation as well as the best of Coordinate Descent approaches, independent of any solver. This framework is used to solve strongly convex and smooth empirical risk minimization problem with gradient de scent (as a solver) and two novel Stochastic Average Adjusted Gradient methods have been proposed to reduce variance in minibatch and blockcoordinate setting of the developed framework. Theoretical analysis prove linear convergence of the proposed methods and empirical results with bench marked datasets prove the superiority of proposed methods against existing methods. SAAGs: Biased Stochastic Variance Reduction Methods 
Stochastic Average Gradient Algorithm (SAGA) 
In this work we introduce a new optimisation method called SAGA in the spirit of SAG, SDCA, MISO and SVRG, a set of recently proposed incremental gradient algorithms with fast linear convergence rates. SAGA improves on the theory behind SAG and SVRG, with better theoretical convergence rates, and has support for composite objectives where a proximal operator is used on the regulariser. Unlike SDCA, SAGA supports nonstrongly convex problems directly, and is adaptive to any inherent strong convexity of the problem. We give experimental results showing the effectiveness of our method. 
Stochastic Average Gradient Descent (SAGD) 
We develop and analyze a new algorithm for empirical risk minimization, which is the key paradigm for training supervised machine learning models. Our method—SAGD—is based on a probabilistic interpolation of SAGA and gradient descent (GD). In particular, in each iteration we take a gradient step with probability $q$ and a SAGA step with probability $1q$. We show that, surprisingly, the total expected complexity of the method (which is obtained by multiplying the number of iterations by the expected number of gradients computed in each iteration) is minimized for a nontrivial probability $q$. For example, for a well conditioned problem the choice $q=1/(n1)^2$, where $n$ is the number of data samples, gives a method with an overall complexity which is better than both the complexity of GD and SAGA. We further generalize the results to a probabilistic interpolation of SAGA and minibatch SAGA, which allows us to compute both the optimal probability and the optimal minibatch size. While the theoretical improvement may not be large, the practical improvement is robustly present across all synthetic and real data we tested for, and can be substantial. Our theoretical results suggest that for this optimal minibatch size our method achieves linear speedup in minibatch size, which is of key practical importance as minibatch implementations are used to train machine learning models in practice. This is the first time linear speedup in minibatch size is obtained for a variance reduced gradienttype method by directly solving the primal empirical risk minimization problem. 
Stochastic Batch Normalization  In this work, we investigate Batch Normalization technique and propose its probabilistic interpretation. We propose a probabilistic model and show that Batch Normalization maximazes the lower bound of its marginalized loglikelihood. Then, according to the new probabilistic model, we design an algorithm which acts consistently during train and test. However, inference becomes computationally inefficient. To reduce memory and computational cost, we propose Stochastic Batch Normalization — an efficient approximation of proper inference procedure. This method provides us with a scalable uncertainty estimation technique. We demonstrate the performance of Stochastic Batch Normalization on popular architectures (including deep convolutional architectures: VGGlike and ResNets) for MNIST and CIFAR10 datasets. 
Stochastic Block Model (SBM) 
The stochastic block model is a generative model for random graphs. This model tends to produce graphs containing communities, subsets characterized by being connected with one another with particular edge densities. For example, edges may be more common within communities than between communities. The stochastic block model is important in statistics, machine learning, and network science, where it serves as a useful benchmark for the task of recovering community structure in graph data. 
Stochastic Computation Graph (SCG) 
Stochastic computation graphs are directed acyclic graphs that encode the dependency structure of computation to be performed. The graphical notation generalizes directed graphical models. 
Stochastic Computing based Deep Convolutional Neural Networks (SCDCNN) 
With recent advancing of Internet of Things (IoTs), it becomes very attractive to implement the deep convolutional neural networks (DCNNs) onto embedded/portable systems. Presently, executing the softwarebased DCNNs requires highperformance server clusters in practice, restricting their widespread deployment on the mobile devices. To overcome this issue, considerable research efforts have been conducted in the context of developing highlyparallel and specific DCNN hardware, utilizing GPGPUs, FPGAs, and ASICs. Stochastic Computing (SC), which uses bitstream to represent a number within [1, 1] by counting the number of ones in the bitstream, has a high potential for implementing DCNNs with high scalability and ultralow hardware footprint. Since multiplications and additions can be calculated using AND gates and multiplexers in SC, significant reductions in power/energy and hardware footprint can be achieved compared to the conventional binary arithmetic implementations. The tremendous savings in power (energy) and hardware resources bring about immense design space for enhancing scalability and robustness for hardware DCNNs. This paper presents the first comprehensive design and optimization framework of SCbased DCNNs (SCDCNNs). We first present the optimal designs of function blocks that perform the basic operations, i.e., inner product, pooling, and activation function. Then we propose the optimal design of four types of combinations of basic function blocks, named feature extraction blocks, which are in charge of extracting features from input feature maps. Besides, weight storage methods are investigated to reduce the area and power/energy consumption for storing weights. Finally, the whole SCDCNN implementation is optimized, with feature extraction blocks carefully selected, to minimize area and power/energy consumption while maintaining a high network accuracy level. 
Stochastic Configuration Networks (SCN) 
This paper contributes to a development of randomized methods for neural networks. The proposed learner model is generated incrementally by stochastic configuration (SC) algorithms, termed as Stochastic Configuration Networks (SCNs). In contrast to the existing randomised learning algorithms for single layer feedforward neural networks (SLFNNs), we randomly assign the input weights and biases of the hidden nodes in the light of a supervisory mechanism, and the output weights are analytically evaluated in either constructive or selective manner. As fundamentals of SCNbased data modelling techniques, we establish some theoretical results on the universal approximation property. Three versions of SC algorithms are presented for regression problems (applicable for classification problems as well) in this work. Simulation results concerning both function approximation and real world data regression indicate some remarkable merits of our proposed SCNs in terms of less human intervention on the network size setting, the scope adaptation of random parameters, fast learning and sound generalization. 
Stochastic Conjugate Gradient Algorithm with Variance Reduction (CGVR) 
Conjugate gradient methods are a class of important methods for solving linear equations and nonlinear optimization. In our work, we propose a new stochastic conjugate gradient algorithm with variance reduction (CGVR) and prove its linear convergence with the Fletcher and Revves method for strongly convex and smooth functions. We experimentally demonstrate that the CGVR algorithm converges faster than its counterparts for six largescale optimization problems that may be convex, nonconvex or nonsmooth, and its AUC (Area Under Curve) performance with $L2$regularized $L2$loss is comparable to that of LIBLINEAR but with significant improvement in computational efficiency. 
Stochastic Conjunctive Normal Form (SCNF) 
Probabilistic Boolean Networks (PBNs) have been previously proposed so as to gain insights into complex dynamical systems. However, identification of large networks and of the underlying discrete Markov Chain which describes their temporal evolution, still remains a challenge. In this paper, we introduce an equivalent representation for the PBN, the Stochastic Conjunctive Normal Form (SCNF), which paves the way to a scalable learning algorithm and helps predict longrun dynamic behavior of largescale systems. Moreover, SCNF allows its efficient sampling so as to statistically infer multistep transition probabilities which can provide knowledge on the activity levels of individual nodes in the long run. 
Stochastic Cubic Regularization  This paper proposes a stochastic variant of a classic algorithm—the cubicregularized Newton method [Nesterov and Polyak 2006]. The proposed algorithm efficiently escapes saddle points and finds approximate local minima for general smooth, nonconvex functions in only $\mathcal{\tilde{O}}(\epsilon^{3.5})$ stochastic gradient and stochastic Hessianvector product evaluations. The latter can be computed as efficiently as stochastic gradients. This improves upon the $\mathcal{\tilde{O}}(\epsilon^{4})$ rate of stochastic gradient descent. Our rate matches the bestknown result for finding local minima without requiring any delicate acceleration or variancereduction techniques. 
Stochastic Decorrelation Loss (SDL) 
Multiview learning aims to learn an embedding space where multiple views are either maximally correlated for crossview recognition, or decorrelated for latent factor disentanglement. A key challenge for deep multiview representation learning is scalability. To correlate or decorrelate multiview signals, the covariance of the whole training set should be computed which does not fit well with the minibatch based training strategy, and moreover (de)correlation should be done in a way that is free of SVDbased computation in order to scale to contemporary layer sizes. In this work, a unified approach is proposed for efficient and scalable deep multiview learning. Specifically, a minibatch based Stochastic Decorrelation Loss (SDL) is proposed which can be applied to any network layer to provide soft decorrelation of the layer’s activations. This reveals the connection between deep multiview learning models such as Deep Canonical Correlation Analysis (DCCA) and Factorisation Autoencoder (FAE), and allows them to be easily implemented. We further show that SDL is superior to other decorrelation losses in terms of efficacy and scalability. 
Stochastic Differential Equation (SDE) 
A stochastic differential equation (SDE) is a differential equation in which one or more of the terms is a stochastic process, resulting in a solution which is itself a stochastic process. SDEs are used to model diverse phenomena such as fluctuating stock prices or physical systems subject to thermal fluctuations. Typically, SDEs incorporate random white noise which can be thought of as the derivative of Brownian motion (or the Wiener process); however, it should be mentioned that other types of random fluctuations are possible, such as jump processes. 
Stochastic Dual Coordinate Ascent (SCDA) 

Stochastic Frontier Analysis (SFA) 
Stochastic frontier analysis (SFA) is a method of economic modeling. It has its starting point in the stochastic production frontier models. ssfa 
Stochastic Frontier Models  Stochastic frontier models allow to analyse technical inefficiency in the framework of production functions. Production units (firms, regions, countries, etc.) are assumed to produce according to a common technology, and reach the frontier when they produce the maximum possible output for a given set of inputs. Inefficiencies can be due to structural problems or market imperfections and other factors which cause countries to produce below their maximum attainable output. Over time, production units can become less inefficient and catch up to the frontier. It is also possible that the frontier shifts, indicating technical progress. In addition, production units can move along the frontier by changing input quantities. Finally, there can be some combinations of these three effects. The stochastic frontier method allows to decompose growth into changes in input use, changes in technology and changes in efficiency, thus extending the widely used growth accounting method. 
Stochastic Gradient Descent (SGD) 
Stochastic gradient descent is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions. sgd 
Stochastic Gradient Langevin Dynamics (SGLD) 
One way to avoid overfitting in machine learning is to use model parameters distributed according to a Bayesian posterior given the data, rather than the maximum likelihood estimator. Stochastic gradient Langevin dynamics (SGLD) is one algorithm to approximate such Bayesian posteriors for large models and datasets. SGLD is a standard stochastic gradient descent to which is added a controlled amount of noise, specifically scaled so that the parameter converges in law to the posterior distribution [WT11, TTV16]. The posterior predictive distribution can be approximated by an ensemble of samples from the trajectory. Choice of the variance of the noise is known to impact the practical behavior of SGLD: for instance, noise should be smaller for sensitive parameter directions. Theoretically, it has been suggested to use the inverse Fisher information matrix of the model as the variance of the noise, since it is also the variance of the Bayesian posterior [PT13, AKW12, GC11]. But the Fisher matrix is costly to compute for large dimensional models. Here we use the easily computed Fisher matrix approximations for deep neural networks from [MO16, Oll15]. The resulting natural Langevin dynamics combines the advantages of Amari’s natural gradient descent and Fisherpreconditioned Langevin dynamics for large neural networks. Smallscale experiments on MNIST show that Fisher matrix preconditioning brings SGLD close to dropout as a regularizing technique. 
Stochastic Gradient Markov Chain Monte Carlo (SGMCMC) 
MetaLearning for Stochastic Gradient MCMC 
Stochastic Model Predictive Control (SMPC) 
Model predictive control (MPC) has demonstrated exceptional success for the highperformance control of complex systems. The conceptual simplicity of MPC as well as its ability to effectively cope with the complex dynamics of systems with multiple inputs and outputs, input and state/output constraints, and conflicting control objectives have made it an attractive multivariable constrained control approach. This article gives an overview of the main developments in the area of stochastic model predictive control (SMPC) in the past decade and provides the reader with an impression of the different SMPC algorithms and the key theoretical challenges in stochastic predictive control without undue mathematical complexity. The general formulation of a stochastic OCP is first presented, followed by an overview of SMPC approaches for linear and nonlinear systems. Suggestions of some avenues for future research in this rapidly evolving field concludes the article. 
Stochastic Multidimensional Scaling  Multidimensional scaling (MDS) is a popular dimensionality reduction techniques that has been widely used for network visualization and cooperative localization. However, the traditional stress minimization formulation of MDS necessitates the use of batch optimization algorithms that are not scalable to largesized problems. This paper considers an alternative stochastic stress minimization framework that is amenable to incremental and distributed solutions. A novel linearcomplexity stochastic optimization algorithm is proposed that is provably convergent and simple to implement. The applicability of the proposed algorithm to localization and visualization tasks is also expounded. Extensive tests on synthetic and real datasets demonstrate the efficacy of the proposed algorithm. 
Stochastic Neural Network  Stochastic neural networks are a type of artificial neural networks, which is a tool of artificial intelligence. They are built by introducing random variations into the network, either by giving the network’s neurons stochastic transfer functions, or by giving them stochastic weights. This makes them useful tools for optimization problems, since the random fluctuations help it escape from local minima. Stochastic neural networks that are built by using stochastic transfer functions are often called Boltzmann machines. 
Stochastic Optimization (SO) 
Stochastic optimization (SO) methods are optimization methods that generate and use random variables. For stochastic problems, the random variables appear in the formulation of the optimization problem itself, which involve random objective functions or random constraints, for example. Stochastic optimization methods also include methods with random iterates. Some stochastic optimization methods use random iterates to solve stochastic problems, combining both meanings of stochastic optimization. Stochastic optimization methods generalize deterministic methods for deterministic problems. http://…/9783662462133 
Stochastic Ordering  In probability theory and statistics, a stochastic order quantifies the concept of one random variable being ‘bigger’ than another. These are usually partial orders, so that one random variable A may be neither stochastically greater than, less than nor equal to another random variable B. Many different orders exist, which have different applications. An Introduction to Stochastic Orders 
Stochastic Partial Differential Equation (SPDE) 
Stochastic partial differential equations (SPDEs) are similar to ordinary stochastic differential equations. They are essentially partial differential equations that have random forcing terms and coefficients. They can be exceedingly difficult to solve. However, they have strong connections with quantum field theory and statistical mechanics. 
Stochastic PathIntegrated Differential EstimatoR (SPIDER) 
In this paper, we propose a new technique named Stochastic PathIntegrated Differential EstimatoR (SPIDER), which can be used to track many deterministic quantities of interest with significantly reduced computational cost. Combining SPIDER with the method of normalized gradient descent, we propose two new algorithms, namely SPIDERSFO and SPIDERSSO, that solve nonconvex stochastic optimization problems using stochastic gradients only. We provide sharp errorbound results on their convergence rates. Specially, we prove that the SPIDERSFO and SPIDERSSO algorithms achieve a recordbreaking $\tilde{O}(\epsilon^{3})$ gradient computation cost to find an $\epsilon$approximate firstorder and $(\epsilon, O(\epsilon^{0.5}))$approximate secondorder stationary point, respectively. In addition, we prove that SPIDERSFO nearly matches the algorithmic lower bound for finding stationary point under the gradient Lipschitz assumption in the finitesum setting. 
Stochastic Process  In probability theory, a stochastic process, or sometimes random process (widely used) is a collection of random variables, representing the evolution of some system of random values over time. This is the probabilistic counterpart to a deterministic process (or deterministic system). Instead of describing a process which can only evolve in one way (as in the case, for example, of solutions of an ordinary differential equation), in a stochastic or random process there is some indeterminacy: even if the initial condition (or starting point) is known, there are several (often infinitely many) directions in which the process may evolve. In the simple case of discrete time, as opposed to continuous time, a stochastic process involves a sequence of random variables and the time series associated with these random variables (for example, see Markov chain, also known as discretetime Markov chain). One approach to stochastic processes treats them as functions of one or several deterministic arguments (inputs; in most cases this will be the time parameter) whose values (outputs) are random variables: nondeterministic (single) quantities which have certain probability distributions. Random variables corresponding to various times (or points, in the case of random fields) may be completely different. The main requirement is that these different random quantities all take values in the same space (the codomain of the function). Although the random values of a stochastic process at different times may be independent random variables, in most commonly considered situations they exhibit complicated statistical correlations. Familiar examples of processes modeled as stochastic time series include stock market and exchange rate fluctuations, signals such as speech, audio and video, medical data such as a patient’s EKG, EEG, blood pressure or temperature, and random movement such as Brownian motion or random walks. Examples of random fields include static images, random terrain (landscapes), wind waves or composition variations of a heterogeneous material. A generalization, the random field, is defined by letting the variables’ parameters be members of a topological space instead of limited to real values representing time. 
Stochastic Programming  In the field of mathematical optimization, stochastic programming is a framework for modeling optimization problems that involve uncertainty. Whereas deterministic optimization problems are formulated with known parameters, real world problems almost invariably include some unknown parameters. When the parameters are known only within certain bounds, one approach to tackling such problems is called robust optimization. Here the goal is to find a solution which is feasible for all such data and optimal in some sense. Stochastic programming models are similar in style but take advantage of the fact that probability distributions governing the data are known or can be estimated. The goal here is to find some policy that is feasible for all (or almost all) the possible data instances and maximizes the expectation of some function of the decisions and the random variables. More generally, such models are formulated, solved analytically or numerically, and analyzed in order to provide useful information to a decisionmaker. 
StochAstic Recursive grAdient algoritHm (SARAH) 
In this paper, we propose a StochAstic Recursive grAdient algoritHm (SARAH), as well as its practical variant SARAH+, as a novel approach to the finitesum minimization problems. Different from the vanilla SGD and other modern stochastic methods such as SVRG, S2GD, SAG and SAGA, SARAH admits a simple recursive framework for updating stochastic gradient estimates; when comparing to SAG/SAGA, SARAH does not require a storage of past gradients. The linear convergence rate of SARAH is proven under strong convexity assumption. We also prove a linear convergence rate (in the strongly convex case) for an inner loop of SARAH, the property that SVRG does not possess. Numerical experiments demonstrate the efficiency of our algorithm. 
Stochastic SelfOrganising Map (SOM) 
SOMbrero 
Stochastic Simulation Algorithm (SSA) 
Stochastic simulation is a simulation that operates with variables that can change with certain probability. Stochastic means that particular factors (values) are variable or random. With a stochastic model we create a projection which is based on a set of random values. Outputs are recorded and the projection is repeated with a new set of random (variable) values. Previous steps are repeated until a reasonable amount of data is gathered (thousandfold, millionfold, ..). In the end, the distribution of the outputs shows the most probable estimates as well as a frame of expectations (outlier values dividing those we still can expect from the ones we should not). 
Stochastic Stratified Average Gradient (SSAG) 
SGD (Stochastic Gradient Descent) is a popular algorithm for large scale optimization problems due to its low iterative cost. However, SGD can not achieve linear convergence rate as FGD (Full Gradient Descent) because of the inherent gradient variance. To attack the problem, minibatch SGD was proposed to get a tradeoff in terms of convergence rate and iteration cost. In this paper, a general CVI (ConvergenceVariance Inequality) equation is presented to state formally the interaction of convergence rate and gradient variance. Then a novel algorithm named SSAG (Stochastic Stratified Average Gradient) is introduced to reduce gradient variance based on two techniques, stratified sampling and averaging over iterations that is a key idea in SAG (Stochastic Average Gradient). Furthermore, SSAG can achieve linear convergence rate of $\mathcal {O}((1\frac{\mu}{8CL})^k)$ at smaller storage and iterative costs, where $C\geq 2$ is the category number of training data. This convergence rate depends mainly on the variance between classes, but not on the variance within the classes. In the case of $C\ll N$ ($N$ is the training data size), SSAG’s convergence rate is much better than SAG’s convergence rate of $\mathcal {O}((1\frac{\mu}{8NL})^k)$. Our experimental results show SSAG outperforms SAG and many other algorithms. 
Stochastic Variance Reduced Gradient (SVRG) 

StochasticNet  Deep neural networks is a branch in machine learning that has seen a meteoric rise in popularity due to its powerful abilities to represent and model highlevel abstractions in highly complex data. One area in deep neural networks that is ripe for exploration is neural connectivity formation. A pivotal study on the brain tissue of rats found that synaptic formation for specific functional connectivity in neocortical neural microcircuits can be surprisingly well modeled and predicted as a random formation. Motivated by this intriguing finding, we introduce the concept of StochasticNet, where deep neural networks are formed via stochastic connectivity between neurons. Such stochastic synaptic formations in a deep neural network architecture can potentially allow for efficient utilization of neurons for performing specific tasks. To evaluate the feasibility of such a deep neural network architecture, we train a StochasticNet using three image datasets. Experimental results show that a StochasticNet can be formed that provides comparable accuracy and reduced overfitting when compared to conventional deep neural networks with more than two times the number of neural connections. 
Stone’s Paradox  In technical jargon, he shows that ‘a finitely additive measure on the free group with two generators is nonconglomerable.’ In English: even for a simple problem with a discrete parameters space, flat priors can lead to surprises. https://…/theflatlandparadox 
Stop Words  In computing, stop words are words which are filtered out before or after processing of natural language data (text). There is not one definite list of stop words which all tools use and such a filter is not always used. Some tools specifically avoid removing them to support phrase search. Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as ‘The Who’, ‘The The’, or ‘Take That’. Other search engines remove some of the most common wordsincluding lexical words, such as “want”from a query in order to improve performance. 
Stopping Time  In probability theory, in particular in the study of stochastic processes, a stopping time (also Markov time) is a specific type of “random time”: a random variable whose value is interpreted as the time at which a given stochastic process exhibits a certain behavior of interest. A stopping time is often defined by a stopping rule, a mechanism for deciding whether to continue or stop a process on the basis of the present position and past events, and which will almost always lead to a decision to stop at some finite time. 
Strang’s Diagram  A diagram that shows actions of A, an m×n matrix, as linear transformations from the space R^m to R^n. The diagram helps to understand the fundamental concepts of Linear Algebra in terms of the four subspaces by visually illustrating the actions of A on all these subspaces. 
StrassenNets  A large fraction of the arithmetic operations required to evaluate deep neural networks (DNNs) are due to matrix multiplications, both in convolutional and fully connected layers. Matrix multiplications can be cast as $2$layer sumproduct networks (SPNs) (arithmetic circuits), disentangling multiplications and additions. We leverage this observation for endtoend learning of lowcost (in terms of multiplications) approximations of linear operations in DNN layers. Specifically, we propose to replace matrix multiplication operations by SPNs, with widths corresponding to the budget of multiplications we want to allocate to each layer, and learning the edges of the SPNs from data. Experiments on CIFAR10 and ImageNet show that this method applied to ResNet yields significantly higher accuracy than existing methods for a given multiplication budget, or leads to the same or higher accuracy compared to existing methods while using significantly fewer multiplications. Furthermore, our approach allows finegrained control of the tradeoff between arithmetic complexity and accuracy of DNN models. Finally, we demonstrate that the proposed framework is able to rediscover Strassen’s matrix multiplication algorithm, i.e., it can learn to multiply $2 \times 2$ matrices using only $7$ multiplications instead of $8$. 
Strategic Object Oriented Reinforcement Learning (SOORL) 
Humans learn to play video games significantly faster than stateoftheart reinforcement learning (RL) algorithms. Inspired by this, we introduce strategic object oriented reinforcement learning (SOORL) to learn simple dynamics model through automatic model selection and perform efficient planning with strategic exploration. We compare different exploration strategies in a modelbased setting in which exact planning is impossible. Additionally, we test our approach on perhaps the hardest Atari game Pitfall! and achieve significantly improved exploration and performance over prior methods. 
Stratified pCenter Problem (SpCP) 
This work presents an extension of the pcenter problem. In this new model, called Stratified pCenter Problem (SpCP), the demand is concentrated in a set of sites and the population of these sites is divided into different strata depending on the kind of service that they require. The aim is to locate p centers to cover the different types of services demanded minimizing the weighted average of the largest distances associated with each of the different strata. In addition, it is considered that more than one stratum can be present at each site. Different formulations, valid inequalities and preprocessings are developed and compared for this problem. An application of this model is presented in order to implement a heuristic approach based on the Sample Average Approximation method (SAA) for solving the probabilistic pcenter problem in an efficient way. 
Stream Processing  Stream processing is a computer programming paradigm, related to SIMD (single instruction, multiple data), that allows some applications to more easily exploit a limited form of parallel processing. Such applications can use multiple computational units, such as the FPUs on a GPU or field programmable gate arrays (FPGAs), without explicitly managing allocation, synchronization, or communication among those units. The stream processing paradigm simplifies parallel software and hardware by restricting the parallel computation that can be performed. Given a set of data (a stream), a series of operations (kernel functions) is applied to each element in the stream. Uniform streaming, where one kernel function is applied to all elements in the stream, is typical. Kernel functions are usually pipelined, and local onchip memory is reused to minimize external memory bandwidth. Since the kernel and stream abstractions expose data dependencies, compiler tools can fully automate and optimize onchip management tasks. Stream processing hardware can use scoreboarding, for example, to launch DMAs at runtime, when dependencies become known. The elimination of manual DMA management reduces software complexity, and the elimination of hardware caches reduces the amount of the area not dedicated to computational units such as ALUs. During the 1980s stream processing was explored within dataflow programming. An example is the language SISAL (Streams and Iteration in a Single Assignment Language). 
StreamFlow  StreamFlow is a stream processing tool designed to rapidly build and monitor processing workflows. The ultimate goal of StreamFlow is to make working with stream processing frameworks such as Apache Storm easier, faster, and with “enterprise” like management functionality. StreamFlow provides a graphical user interface for nondevelopers such as data scientists, analysts, or operational users to rapidly build scalable data flows and analytics. The following image is a screenshot of this topology builder. 
Streamgraph  A streamgraph, or stream graph, is a type of stacked area graph which is displaced around a central axis, resulting in a flowing, organic shape. Streamgraphs were developed by Lee Byron and popularized by their use in a February 2008 New York Times article on movie box office revenues. 
Streaming Ensemble Algorithm (SEA) 
Ensemble methods have recently garnered a great deal of attention in the machine learning community. Techniques such as Boosting and Bagging have proven to be highly effective but require repeated resampling of the training data, making them inappropriate in a data mining context. The methods presented in this paper take advantage of plentiful data, building separate classifiers on sequential chunks of training points. These classifiers are combined into a fixedsize ensemble using a heuristic replacement strategy. The result is a fast algorithm for largescale or streaming data that classifies as well as a single decision tree built on all the data, requires approximately constant memory, and adjusts quickly to concept drift. 
Streaming Platform LAnguage Shell (SPLASH) 
Stream Processing LAnguage SHell (SPLASH) is a scripting language that brings extensibility to CCL (Continuous Computation Language), allowing you to create custom operators and functions that go beyond standard SQL. CCL is the primary event processing language of the Event Stream Processor. ESP projects are defined in CCL. 
Streaming Processing Engines (SPE) 
Apache Storm, Apache S4 
Streaming Variational Bayes (SVB) 
We present SDABayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a userspecified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two largescale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data – a case where SVI may be applied – and in the streaming setting, where SVI does not apply. 
Streamulus  Streamulus is a C++ library that makes it very easy to process event streams. You need to write code that handles a single event and the library turns this code into a data structure that handles infinite streams of such events. The stream operators you write can have side effects and they can maintain an internal state. 
Strict Very Fast Decision Tree (SVFDT) 
Dealing with memory and time constraints are current challenges when learning from data streams with a massive amount of data. Many algorithms have been proposed to handle these difficulties, among them, the Very Fast Decision Tree (VFDT) algorithm. Although the VFDT has been widely used in data stream mining, in the last years, several authors have suggested modifications to increase its performance, putting aside memory concerns by proposing memorycostly solutions. Besides, most data stream mining solutions have been centred around ensembles, which combine the memory costs of their weak learners, usually VFDTs. To reduce the memory cost, keeping the predictive performance, this study proposes the Strict VFDT (SVFDT), a novel algorithm based on the VFDT. The SVFDT algorithm minimises unnecessary tree growth, substantially reducing memory usage and keeping competitive predictive performance. Moreover, since it creates much more shallow trees than VFDT, SVFDT can achieve a shorter processing time. Experiments were carried out comparing the SVFDT with the VFDT in 11 benchmark data stream datasets. This comparison assessed the tradeoff between accuracy, memory, and processing time. Statistical analysis showed that the proposed algorithm obtained similar predictive performance and significantly reduced processing time and memory use. Thus, SVFDT is a suitable option for data stream mining with memory and time limitations, recommended as a weak learner in ensemblebased solutions. 
String Attractors  String attractors are combinatorial objects recently introduced to unify all known dictionary compression techniques in a single theory. 
String Distance Algorithms  String Distance Algorithms is about calculation of the distance between two strings. E.g. “WhatIs” compared to “wahtis” to identify how similar to strings are to get a fuzzy interpretation and so e.g. try to get rid of typos etc. 
Strong Exponential Time Hypothesis (SETH) 
In computational complexity theory, the exponential time hypothesis is an unproven computational hardness assumption that was formulated by Impagliazzo & Paturi (1999). The hypothesis states that 3SAT (or any of several related NPcomplete problems) cannot be solved in subexponential time in the worst case.[1] The exponential time hypothesis, if true, would imply that P ? NP, but it is a stronger statement. It can be used to show that many computational problems are equivalent in complexity, in the sense that if one of them has a subexponential time algorithm then they all do. 
Strongly Connected Component  In the mathematical theory of directed graphs, a graph is said to be strongly connected if every vertex is reachable from every other vertex. The strongly connected components of an arbitrary directed graph form a partition into subgraphs that are themselves strongly connected. It is possible to test the strong connectivity of a graph, or to find its strongly connected components, in linear time. 
Strongly Hierarchical Factorization Machine  Highorder parametric models that include terms for feature interactions are applied to various data min ing tasks, where ground truth depends on interactions of features. However, with sparse data, the high dimensional parameters for feature interactions often face three issues: expensive computation, difficulty in parameter estimation and lack of structure. Previous work has proposed approaches which can partially re solve the three issues. In particular, models with fac torized parameters (e.g. Factorization Machines) and sparse learning algorithms (e.g. FTRLProximal) can tackle the first two issues but fail to address the third. Regarding to unstructured parameters, constraints or complicated regularization terms are applied such that hierarchical structures can be imposed. However, these methods make the optimization problem more challeng ing. In this work, we propose Strongly Hierarchical Factorization Machines and ANOVA kernel regression where all the three issues can be addressed without making the optimization problem more difficult. Ex perimental results show the proposed models signifi cantly outperform the stateoftheart in two data min ing tasks: coldstart user response time prediction and stock volatility prediction. 
Structural Agnostic Model (SAM) 
We present the Structural Agnostic Model (SAM), a framework to estimate endtoend nonacyclic causal graphs from observational data. In a nutshell, SAM implements an adversarial game in which a separate model generates each variable, given real values from all others. In tandem, a discriminator attempts to distinguish between the joint distributions of real and generated samples. Finally, a sparsity penalty forces each generator to consider only a small subset of the variables, yielding a sparse causal graph. SAM scales easily to hundreds variables. Our experiments show the stateoftheart performance of SAM on discovering causal structures and modeling interventions, in both acyclic and nonacyclic graphs. 
Structural Causal Model (SCM) 

Structural Data  Structural data holds information about the relationship between events. Some key concerns include the following: 1) how these relationships should be conveyed from the user to the computer and from the computer back to the user; 2) how the data should be configured to allow for systemic evaluation; 3) how to effectively store large amounts while allowing for rapid access; 4) how to reliably transfer the details between computers regardless of operating platform; 5) how to process the data to achieve different strategic outcomes; 6) how to arrange the data in order to express the relevance between events; and 7) how to ensure current design does not limit future application. 
Structural Equation Modeling (SEM) 
Structural equation modelling (SEM) is a statistical technique for testing and estimating causal relations using a combination of statistical data and qualitative causal assumptions. This definition of SEM was articulated by the geneticist Sewall Wright (1921), the economist Trygve Haavelmo (1943) and the cognitive scientist Herbert A. Simon (1953), and formally defined by Judea Pearl (2000) using a calculus of counterfactuals. Structural equation models (SEM) allow both confirmatory and exploratory modeling, meaning they are suited to both theory testing and theory development. Confirmatory modeling usually starts out with a hypothesis that gets represented in a causal model. The concepts used in the model must then be operationalized to allow testing of the relationships between the concepts in the model. The model is tested against the obtained measurement data to determine how well the model fits the data. The causal assumptions embedded in the model often have falsifiable implications which can be tested against the data. With an initial theory, SEM can be used inductively by specifying a corresponding model and using data to estimate the values of free parameters. Often the initial hypothesis requires adjustment in light of model evidence. SEM can be used purely for exploration; this would usually be a technique similar to exploratory factor analysis, a technique commonly used in psychometrics. Among the strengths of SEM is the ability to construct latent variables: variables that are not measured directly, but are estimated in the model from several measured variables, each of which is predicted to ‘tap into’ the latent variables. This allows the modeler to explicitly capture the unreliability of measurement in the model, which in theory allows the structural relations between latent variables to be accurately estimated. Factor analysis, path analysis and regression all represent special cases of SEM. In SEM, the qualitative causal assumptions are represented by the missing variables in each equation, as well as vanishing covariances among some error terms. These assumptions are testable in experimental studies and must be confirmed judgmentally in observational studies. http://…/SEM sparseSEM,OpenMx,EffectLiteR,metaSEM 
Structural ExpectationMaximization Algorithm  In recent years there has been a flurry of works on learning probabilistic belief networks. Current state of the art methods have been shown to be successful for two learning scenarios: learning both network structure and parameters from complete data, and learning parameters for a fixed network from incomplete data – that is, in the presence of missing values or hidden variables. However, no method has yet been demonstrated to effectively learn network structure fromincomplete data. In this paper, we propose a new method for learning network structure from incomplete data. This method is based on an extension of the ExpectationMaximization (EM) algorithm for model selection problems that performs search for the best structure inside the EM procedure. We prove the convergence of this algorithm, and adapt it for learning belief networks. We then describe how to learn networks in two scenarios: when the data contains missing values, and in the presence of hidden variables. We provide experimental results that show the effectiveness of our procedure in both scenarios. The Bayesian Structural EM Algorithm bnstruct 
Structural Graph Convolutional Neural Network (SGCNN) 
The digitalization of automation engineering generates large quantities of engineering data that is interlinked in knowledge graphs. Classifying and clustering subgraphs according to their functionality is useful to discover functionally equivalent engineering artifacts that exhibit different graph structures. This paper presents a new graph learning algorithm designed to classify engineering data artifacts — represented in the form of graphs — according to their structure and neighborhood features. Our Structural Graph Convolutional Neural Network (SGCNN) is capable of learning graphs and subgraphs with a novel graph invariant convolution kernel and downsampling/pooling algorithm. On a realistic engineeringrelated dataset, we show that SGCNN is capable of achieving ~91% classification accuracy. 
Structural Hamming Distance (SHD) 
In simple terms, this is the number of edge instertion, deletions or flips in order to transform one graph to another graph. ➚ “Hamming Distance” ➘ “Structural Intervention Distance” 
Structural Intervention Distance (SID) 
Causal inference relies on the structure of a graph, often a directed acyclic graph (DAG). Different graphs may result in different causal inference statements and different intervention distributions. To quantify such differences, we propose a (pre) distance between DAGs, the structural intervention distance (SID). The SID is based on a graphical criterion only and quantifies the closeness between two DAGs in terms of their corresponding causal inference statements. It is therefore wellsuited for evaluating graphs that are used for computing interventions. Instead of DAGs it is also possible to compare CPDAGs, completed partially directed acyclic graphs that represent Markov equivalence classes. Since it differs significantly from the popular Structural Hamming Distance (SHD), the SID constitutes a valuable additional measure. SID 
Structural Learning and Integrative DEcomposition (SLIDE) 
The increased availability of the multiview data (data on the same samples from multiple sources) has led to strong interest in models based on lowrank matrix factorizations. These models represent each data view via shared and individual components, and have been successfully applied for exploratory dimension reduction, association analysis between the views, and further learning tasks such as consensus clustering. Despite these advances, there remain significant challenges in modeling partiallyshared components, and identifying the number of components of each type (shared/partiallyshared/individual). In this work, we formulate a novel linked component model that directly incorporates partiallyshared structures. We call this model SLIDE for Structural Learning and Integrative DEcomposition of multiview data. We prove the existence of SLIDE decomposition and explicitly characterize the identifiability conditions. The proposed model fitting and selection techniques allow for joint identification of the number of components of each type, in contrast to existing sequential approaches. In our empirical studies, SLIDE demonstrates excellent performance in both signal estimation and component selection. We further illustrate the methodology on the breast cancer data from The Cancer Genome Atlas repository. 
Structural Maxent Model  We present a new class of density estimation models, Structural Maxent models, with feature functions selected from a union of possibly very complex subfamilies and yet benefiting from strong learning guarantees. The design of our models is based on a new principle supported by uniform convergence bounds and taking into consideration the complexity of the different subfamilies composing the full set of features. We prove new datadependent learning bounds for our models, expressed in terms of the Rademacher complexities of these subfamilies. We also prove a duality theorem, which we use to derive our Structural Maxent algorithm. We give a full description of our algorithm, including the details of its derivation, and report the results of several experiments demonstrating that its performance improves on that of existing L1norm regularized Maxent algorithms. We further similarly define conditional Structural Maxent models for multiclass classification problems. These are conditional probability models also making use of a union of possibly complex feature subfamilies. We prove a duality theorem for these models as well, which reveals their connection with existing binary and multiclass deep boosting algorithms. 
Structural Similarity Metric (SSIM) 
➚ “SSIMLayer” 
Structural Topic Model (STM) 
The Structural Topic Model (STM) allows researchers to estimate a topic model which includes documentlevel metadata. Statistical models of text have become increasingly popular in statistics and com puter science as a method of exploring large document collections. Social scientists often want to move beyond exploration, to measurement and experimentation, and make inference about social and political processes that drive discourse and content. In this paper, we develop a model of text data that supports this type of substantive re search. Our approach is to posit a hierarchical mixed membership model for analyzing topical content of documents, in which mixing weights are parameterized by observed covariates. In this model, topical prevalence and topical content are speci ed as a sim ple generalized linear model on an arbitrary number of documentlevel covariates, such as news source and time of release, enabling researchers to introduce elements of the experimental design that informed document collection into the model, within a gen erally applicable framework. We demonstrate the proposed methodology by analyzing a collection of news reports about China, where we allow the prevalence of topics to evolve over time and vary across newswire services. Our methods help quantify the e ect of news wire source on both the frequency and nature of topic coverage. All the methods we describe are available as part of the open source R package stm. stm,stmCorrViz,stmBrowser 
Structural Vector Autoregression (SVAR) 
svars 
Structure Learning  Recovering a graph that represents the statistical data dependency among nodes for a set of data samples generated by nodes, which provides the basic structure to perform an inference task, such as MAP (maximum a posteriori). This problem is referred to as structure learning. Learning Data Dependency with Communication Cost 
Structured Attention Networks  Attention networks have proven to be an effective approach for embedding categorical inference within a deep neural network. However, for many tasks we may want to model richer structural dependencies without abandoning endtoend training. In this work, we experiment with incorporating richer structural distributions, encoded using graphical models, within deep networks. We show that these structured attention networks are simple extensions of the basic attention procedure, and that they allow for extending attention beyond the standard softselection approach, such as attending to partial segmentations or to subtrees. We experiment with two different classes of structured attention networks: a linearchain conditional random field and a graphbased parsing model, and describe how these models can be practically implemented as neural network layers. Experiments show that this approach is effective for incorporating structural biases, and structured attention networks outperform baseline attention models on a variety of synthetic and real tasks: tree transduction, neural machine translation, question answering, and natural language inference. We further find that models trained in this way learn interesting unsupervised hidden representations that generalize simple attention. 
Structured Control Net (SCN) 
In recent years, Deep Reinforcement Learning has made impressive advances in solving several important benchmark problems for sequential decision making. Many control applications use a generic multilayer perceptron (MLP) for nonvision parts of the policy network. In this work, we propose a new neural network architecture for the policy network representation that is simple yet effective. The proposed Structured Control Net (SCN) splits the generic MLP into two separate submodules: a nonlinear control module and a linear control module. Intuitively, the nonlinear control is for forwardlooking and global control, while the linear control stabilizes the local dynamics around the residual of global control. We hypothesize that this will bring together the benefits of both linear and nonlinear policies: improve training sample efficiency, final episodic reward, and generalization of learned policy, while requiring a smaller network and being generally applicable to different training methods. We validated our hypothesis with competitive results on simulations from OpenAI MuJoCo, Roboschool, Atari, and a custom 2D urban driving environment, with various ablation and generalization tests, trained with multiple blackbox and policy gradient training methods. The proposed architecture has the potential to improve upon broader control tasks by incorporating problem specific priors into the architecture. As a case study, we demonstrate much improved performance for locomotion tasks by emulating the biological central pattern generators (CPGs) as the nonlinear part of the architecture. 
Structured Factored Inference (SFI) 
Reasoning on large and complex realworld models is a computationally difficult task, yet one that is required for effective use of many AI applications. A plethora of inference algorithms have been developed that work well on specific models or only on parts of general models. Consequently, a system that can intelligently apply these inference algorithms to different parts of a model for fast reasoning is highly desirable. We introduce a new framework called structured factored inference (SFI) that provides the foundation for such a system. Using models encoded in a probabilistic programming language, SFI provides a sound means to decompose a model into submodels, apply an inference algorithm to each submodel, and combine the resulting information to answer a query. Our results show that SFI is nearly as accurate as exact inference yet retains the benefits of approximate inference methods. 
Structured Generative Adversarial Network (SGAN) 
We study the problem of conditional generative modeling based on designated semantics or structures. Existing models that build conditional generators either require massive labeled instances as supervision or are unable to accurately control the semantics of generated samples. We propose structured generative adversarial networks (SGANs) for semisupervised conditional generative modeling. SGAN assumes the data x is generated conditioned on two independent latent variables: y that encodes the designated semantics, and z that contains other factors of variation. To ensure disentangled semantics in y and z, SGAN builds two collaborative games in the hidden space to minimize the reconstruction error of y and z, respectively. Training SGAN also involves solving two adversarial games that have their equilibrium concentrating at the true joint data distributions p(x, z) and p(x, y), avoiding distributing the probability mass diffusely over data space that MLEbased methods may suffer. We assess SGAN by evaluating its trained networks, and its performance on downstream tasks. We show that SGAN delivers a highly controllable generator, and disentangled representations; it also establishes startoftheart results across multiple datasets when applied for semisupervised image classification (1.27%, 5.73%, 17.26% error rates on MNIST, SVHN and CIFAR10 using 50, 1000 and 4000 labels, respectively). Benefiting from the separate modeling of y and z, SGAN can generate images with high visual quality and strictly following the designated semantic, and can be extended to a wide spectrum of applications, such as style transfer. 
Structured Gradient Tree Boosting (SGTB) 
We present a gradienttreeboostingbased structured learning model for jointly disambiguating named entities in a document. Gradient tree boosting is a widely used machine learning algorithm that underlies many topperforming natural language processing systems. Surprisingly, most works limit the use of gradient tree boosting as a tool for regular classification or regression problems, despite the structured nature of language. To the best of our knowledge, our work is the first one that employs the structured gradient tree boosting (SGTB) algorithm for collective entity disambiguation. By defining global features over previous disambiguation decisions and jointly modeling them with local features, our system is able to produce globally optimized entity assignments for mentions in a document. Exact inference is prohibitively expensive for our globally normalized model. To solve this problem, we propose Bidirectional Beam Search with Gold path (BiBSG), an approximate inference algorithm that is a variant of the standard beam search algorithm. BiBSG makes use of global information from both past and future to perform better local search. Experiments on standard benchmark datasets show that SGTB significantly improves upon published results. Specifically, SGTB outperforms the previous stateoftheart neural system by near 1\% absolute accuracy on the popular AIDACoNLL dataset. 
Structured Learning  Structured prediction is a generalization of the standard paradigms of supervised learning, classification and regression. All of these can be thought of finding a function that minimizes some loss over a training set. The differences are in the kind of functions that are used and the losses. In classification, the target domain are discrete class labels, and the loss is usually the 01 loss, i.e. counting the misclassifications. In regression, the target domain is the real numbers, and the loss is usually mean squared error. In structured prediction, both the target domain and the loss are more or less arbitrary. This means the goal is not to predict a label or a number, but a possibly much more complicated object like a sequence or a graph. In structured prediction, we often deal with finite, but large output spaces Y. This situation could be dealt with using classification with a very large number of classes. The idea behind structured prediction is that we can do better than this, by making use of the structure of the output space. 
Structured Set Matching Network (SSMN) 
Diagrams often depict complex phenomena and serve as a good test bed for visual and textual reasoning. However, understanding diagrams using natural image understanding approaches requires large training datasets of diagrams, which are very hard to obtain. Instead, this can be addressed as a matching problem either between labeled diagrams, images or both. This problem is very challenging since the absence of significant color and texture renders local cues ambiguous and requires global reasoning. We consider the problem of oneshot part labeling: labeling multiple parts of an object in a target image given only a single source image of that category. For this settoset matching problem, we introduce the Structured Set Matching Network (SSMN), a structured prediction model that incorporates convolutional neural networks. The SSMN is trained using global normalization to maximize local match scores between corresponding elements and a global consistency score among all matched elements, while also enforcing a matching constraint between the two sets. The SSMN significantly outperforms several strong baselines on three label transfer scenarios: diagramtodiagram, evaluated on a new diagram dataset of over 200 categories; imagetoimage, evaluated on a dataset built on top of the Pascal Part Dataset; and imagetodiagram, evaluated on transferring labels across these datasets. 
Structured Stein Variational Inference for Continuous Graphical Models  We propose a novel distributed inference algorithm for continuous graphical models by extending Stein variational gradient descent (SVGD) to leverage the Markov dependency structure of the distribution of interest. The idea is to use a set of local kernel functions over the Markov blanket of each node, which alleviates the problem of the curse of high dimensionality and simultaneously yields a distributed algorithm for decentralized inference tasks. We justify our method with theoretical analysis and show that the use of local kernels can be viewed as a new type of localized approximation that matches the target distribution on the conditional distributions of each node over its Markov blanket. Our empirical results demonstrate that our method outperforms a variety of baselines including standard MCMC and particle message passing methods. 
Structured Sufficient Dimension Reduction (sSDR) 
sSDR 
Structured Support Vector Machine  The structured support vector machine is a machine learning algorithm that generalizes the Support Vector Machine (SVM) classifier. Whereas the SVM classifier supports binary classification, multiclass classification and regression, the structured SVM allows training of a classifier for general structured output labels. As an example, a sample instance might be a natural language sentence, and the output label is an annotated parse tree. Training a classifier consists of showing pairs of correct sample and output label pairs. After training, the structured SVM model allows one to predict for new sample instances the corresponding output label; that is, given a natural language sentence, the classifier can produce the most likely parse tree. ➘ “Support Vector Machine” 
Structured Uncertainty Prediction Network  This paper is the first work to propose a network to predict a structured uncertainty distribution for a reconstructed image. Our novel model learns to predict a full Gaussian covariance matrix for each reconstruction, which permits efficient sampling and likelihood evaluation. We demonstrate that our model can accurately reconstruct ground truth correlated residual distributions for synthetic datasets and generate plausible high frequency samples for real face images. We also illustrate the use of these predicted covariances for structure preserving image denoising. 
STTM  Along with the emergence and popularity of social communications on the Internet, topic discovery from short texts becomes fundamental to many applications that require semantic understanding of textual content. As a rising research field, short text topic modeling presents a new and complementary algorithmic methodology to supplement regular text topic modeling, especially targets to limited word cooccurrence information in short texts. This paper presents the first comprehensive opensource package, called STTM, for use in Java that integrates the stateoftheart models of short text topic modeling algorithms, benchmark datasets, and abundant functions for model inference and evaluation. The package is designed to facilitate the expansion of new methods in this research field and make evaluations between the new approaches and existing ones accessible. STTM is opensourced at https://…/STTM. 
Style Memory  Deep networks have shown great performance in classification tasks. However, the parameters learned by the classifier networks usually discard stylistic information of the input, in favour of information strictly relevant to classification. We introduce a network that has the capacity to do both classification and reconstruction by adding a ‘style memory’ to the output layer of the network. We also show how to train such a neural network as a deep multilayer autoencoder, jointly minimizing both classification and reconstruction losses. The generative capacity of our network demonstrates that the combination of stylememory neurons with the classifier neurons yield good reconstructions of the inputs when the classification is correct. We further investigate the nature of the style memory, and how it relates to composing digits and letters. Finally, we propose that this architecture enables the bidirectional flow of information used in predictive coding, and that such bidirectional networks can help mitigate against being fooled by ambiguous or adversarial input. 
Subadditivity  In mathematics, subadditivity is a property of a function that states, roughly, that evaluating the function for the sum of two elements of the domain always returns something less than or equal to the sum of the function’s values at each element. There are numerous examples of subadditive functions in various areas of mathematics, particularly norms and square roots. Additive maps are special cases of subadditive functions. 
SubGram  Skipgram (word2vec) is a recent method for creating vector representations of words (‘distributed word representations’) using a neural network. The representation gained popularity in various areas of natural language processing, because it seems to capture syntactic and semantic information about words without any explicit supervision in this respect. We propose SubGram, a refinement of the Skipgram model to consider also the word structure during the training process, achieving large gains on the Skipgram original test set. 
SUbgraph Robust REpresentAtion Learning (SURREAL) 
The success of graph embeddings or node representation learning in a variety of downstream tasks, such as node classification, link prediction, and recommendation systems, has led to their popularity in recent years. Representation learning algorithms aim to preserve local and global network structure by identifying node neighborhood notions. However, many existing algorithms generate embeddings that fail to properly preserve the network structure, or lead to unstable representations due to random processes (e.g., random walks to generate context) and, thus, cannot generate to multigraph problems. In this paper, we propose a robust graph embedding using connection subgraphs algorithm, entitled: SURREAL, a novel, stable graph embedding algorithmic framework. SURREAL learns graph representations using connection subgraphs by employing the analogy of graphs with electrical circuits. It preserves both local and global connectivity patterns, and addresses the issue of highdegree nodes. Further, it exploits the strength of weak ties and metadata that have been neglected by baselines. The experiments show that SURREAL outperforms stateoftheart algorithms by up to 36.85% on multilabel classification problem. Further, in contrast to baselines, SURREAL, being deterministic, is completely stable. 
subgraph2vec  In this paper, we present subgraph2vec, a novel approach for learning latent representations of rooted subgraphs from large graphs inspired by recent advancements in Deep Learning and Graph Kernels. These latent representations encode semantic substructure dependencies in a continuous vector space, which is easily exploited by statistical models for tasks such as graph classification, clustering, link prediction and community detection. subgraph2vec leverages on local information obtained from neighbourhoods of nodes to learn their latent representations in an unsupervised fashion. We demonstrate that subgraph vectors learnt by our approach could be used in conjunction with classifiers such as CNNs, SVMs and relational data clustering algorithms to achieve significantly superior accuracies. Also, we show that the subgraph vectors could be used for building a deep learning variant of WeisfeilerLehman graph kernel. Our experiments on several benchmark and largescale realworld datasets reveal that subgraph2vec achieves significant improvements in accuracies over existing graph kernels on both supervised and unsupervised learning tasks. Specifically, on two realworld program analysis tasks, namely, code clone and malware detection, subgraph2vec outperforms stateoftheart kernels by more than 17% and 4%, respectively. 
Subjective Bayesian Trust (SBT) 
This paper is concerned with trust modeling for networked computing systems. Of particular interest to this paper is the observation that trust is a subjective notion that is invisible, implicit and uncertain in nature, therefore it may be suitable for being expressed by subjective probabilities and then modeled on the basis of Bayesian principle. In spite of a few attempts to model trust in the Bayesian paradigm, the field lacks a global comprehensive overview of Bayesian methods and their theoretical connections to other alternatives. This paper presents a study to fill in this gap. It provides a comprehensive review and analysis of the literature, showing that a large deal of existing work, whether or not proposed based on Bayesian principle, can cast into a general Bayesian paradigm termed subjective Bayesian trust (SBT) theory here. The SBT framework can thus act as a general theoretical infrastructure for comparing or analyzing theoretical ties among existing trust models, and for developing novel models. The aim of this study is twofold. One is to gain insights about Bayesian philosophy in modeling trust. The other is to drive current research step ahead in seeking a highlevel, abstract way of modeling and evaluating trust. 
SubjectVerbObject Semantic Suffix Tree Clustering (SVOSSTC) 
In recent years, situation awareness has been recognised as a critical part of effective decision making, in particular for crisis management. One way to extract value and allow for better situation awareness is to develop a system capable of analysing a dataset of multiple posts, and clustering consistent posts into different views or stories (or, world views). However, this can be challenging as it requires an understanding of the data, including determining what is consistent data, and what data corroborates other data. Attempting to address these problems, this article proposes SubjectVerbObject Semantic Suffix Tree Clustering (SVOSSTC) and a system to support it, with a special focus on Twitter content. The novelty and value of SVOSSTC is its emphasis on utilising the SubjectVerbObject (SVO) typology in order to construct semantically consistent world views, in which individuals—particularly those involved in crisis response—might achieve an enhanced picture of a situation from social media data. To evaluate our system and its ability to provide enhanced situation awareness, we tested it against existing approaches, including human data analysis, using a variety of realworld scenarios. The results indicated a noteworthy degree of evidence (e.g., in cluster granularity and meaningfulness) to affirm the suitability and rigour of our approach. Moreover, these results highlight this article’s proposals as innovative and practical system contributions to the research field. 
Sublinear Algorithm  
Submanifold Sparse Convolutional Network  Convolutional network are the defacto standard for analysing spatiotemporal data such as images, videos, 3D shapes, etc. Whilst some of this data is naturally dense (for instance, photos), many other data sources are inherently sparse. Examples include penstrokes forming on a piece of paper, or (colored) 3D point clouds that were obtained using a LiDAR scanner or RGBD camera. Standard ‘dense’ implementations of convolutional networks are very inefficient when applied on such sparse data. We introduce a sparse convolutional operation tailored to processing sparse data that differs from prior work on sparse convolutional networks in that it operates strictly on submanifolds, rather than ‘dilating’ the observation with every layer in the network. Our empirical analysis of the resulting submanifold sparse convolutional networks shows that they perform on par with stateoftheart methods whilst requiring substantially less computation. 
SubMatrix Selection SingularValue Decomposition (SMSSVD) 
High throughput biomedical measurements normally capture multiple overlaid biologically relevant signals and often also signals representing different types of technical artefacts like e.g. batch effects. Signal identification and decomposition are accordingly main objectives in statistical biomedical modeling and data analysis. Existing methods, aimed at signal reconstruction and deconvolution, in general, are either supervised, contain parameters that need to be estimated or present other types of ad hoc features. We here introduce SubMatrix Selection SingularValue Decomposition (SMSSVD), a parameterfree unsupervised signal decomposition and dimension reduction method, designed to reduce noise, adaptively for each lowranksignal in a given data matrix, and represent the signals in the data in a way that enable unbiased exploratory analysis and reconstruction of multiple overlaid signals, including identifying groups of variables that drive different signals. The Submatrix Selection Singular Value Decomposition (SMSSVD) method produces a denoised signal decomposition from a given data matrix. The SMSSVD method guarantees orthogonality between signal components in a straightforward manner and it is designed to make automation possible. We illustrate SMSSVD by applying it to several real and synthetic datasets and compare its performance to golden standard methods like PCA (Principal Component Analysis) and SPC (Sparse Principal Components, using Lasso constraints). The SMSSVD is computationally efficient and despite being a parameterfree method, in general, outperforms existing statistical learning methods. A Julia implementation of SMSSVD is openly available on GitHub (https://…/SMSSVD.jl ). 
Subsample Winner Algorithm (SWA) 
subsamp 
Subsampled Double Bootstrap (SDB) 
Bayesian Bootstraps for Massive Data 
Subsampled Turbulence Removal Network  We present a deeplearning approach to restore a sequence of turbulencedistorted video frames from turbulent deformations and spacetime varying blurs. Instead of requiring a massive training sample size in deep networks, we purpose a training strategy that is based on a new data augmentation method to model turbulence from a relatively small dataset. Then we introduce a subsampled method to enhance the restoration performance of the presented GAN model. The contributions of the paper is threefold: first, we introduce a simple but effective data augmentation algorithm to model the turbulence in real life for training in the deep network; Second, we firstly purpose the Wasserstein GAN combined with $\ell_1$ cost for successful restoration of turbulencecorrupted video sequence; Third, we combine the subsampling algorithm to filter out strongly corrupted frames to generate a video sequence with better quality. 
SUBSCALE  Rapid growth of high dimensional datasets in recent years has created an emergent need to extract the knowledge underlying them. Clustering is the process of automatically finding groups of similar data points in the space of the dimensions or attributes of a dataset. Finding clusters in the high dimensional datasets is an important and challenging data mining problem. Data group together differently under different subsets of dimensions, called subspaces. Quite often a dataset can be better understood by clustering it in its subspaces, a process called subspace clustering. But the exponential growth in the number of these subspaces with the dimensionality of data makes the whole process of subspace clustering computationally very expensive. There is a growing demand for efficient and scalable subspace clustering solutions in many Big data application domains like biology, computer vision, astronomy and social networking. Apriori based hierarchical clustering is a promising approach to find all possible higher dimensional subspace clusters from the lower dimensional clusters using a bottomup process. However, the performance of the existing algorithms based on this approach deteriorates drastically with the increase in the number of dimensions. Most of these algorithms require multiple database scans and generate a large number of redundant subspace clusters, either implicitly or explicitly, during the clustering process. In this paper, we present SUBSCALE, a novel clustering algorithm to find nontrivial subspace clusters with minimal cost and it requires only k database scans for a kdimensional data set. Our algorithm scales very well with the dimensionality of the dataset and is highly parallelizable. We present the details of the SUBSCALE algorithm and its evaluation in this paper. 
Subspace Clustering (SC) 

Subspace Outlier Degree (SOD) 
(see Definition 1) 
Subspace Outlier Detection (SOD) 
http://…ent.cgi?article=7340&context=ecuworks HighDimOut 
SubspaceCUSUM  We consider the sequential changepoint detection problem of detecting changes that are characterized by a subspace structure. Such changes are frequent in highdimensional streaming data altering the form of the corresponding covariance matrix. In this work we present a SubspaceCUSUM procedure and demonstrate its firstorder asymptotic optimality properties for the case where the subspace structure is unknown and needs to be simultaneously estimated. To achieve this goal we develop a suitable analytical methodology that includes a proper parameter optimization for the proposed detection scheme. Numerical simulations corroborate our theoretical findings. 
Substationarity  Substationarity is a new concept, which has never been studied in the literature. It means that the distribution can only be invariant under location shifts within a linear subspace of the domain. Theoretically, substationarity is a concept between stationarity and nonstationarity, but it belongs to nonstationarity. 
Substitute Teacher Network  Learning through experience is timeconsuming, inefficient and often bad for your cortisol levels. To address this problem, a number of recently proposed teacherstudent methods have demonstrated the benefits of private tuition, in which a single model learns from an ensemble of more experienced tutors. Unfortunately, the cost of such supervision restricts good representations to a privileged minority. Unsupervised learning can be used to lower tuition fees, but runs the risk of producing networks that require extracurriculum learning to strengthen their CVs and create their own LinkedIn profiles. Inspired by the logo on a promotional stress ball at a local recruitment fair, we make the following three contributions. First, we propose a novel almost no supervision training algorithm that is effective, yet highly scalable in the number of student networks being supervised, ensuring that education remains affordable. Second, we demonstrate our approach on a typical use case: learning to bake, developing a method that tastily surpasses the current state of the art. Finally, we provide a rigorous quantitive analysis of our method, proving that we have access to a calculator. Our work calls into question the longheld dogma that life is the best teacher. 
Substochastic Monte Carlo (SSMC) 
In this paper we introduce and formalize Substochastic Monte Carlo (SSMC) algorithms. These algorithms, originally intended to be a better classical foil to quantum annealing than simulated annealing, prove to be worthy optimization algorithms in their own right. In SSMC, a population of walkers is initialized according to a known distribution on an arbitrary search space and varied into the solution of some optimization problem of interest. The first argument of this paper shows how an existing classical algorithm, ‘GoWithTheWinners’ (GWW), is a limiting case of SSMC when restricted to binary search and particular driving dynamics. Although limiting to GWW, SSMC is more general. We show that (1) GWW can be efficiently simulated within the SSMC framework, (2) SSMC can be exponentially faster than GWW, (3) by naturally incorporating structural information, SSMC can exponentially outperform the quantum algorithm that first inspired it, and (4) SSMC exhibits desirable search features in general spaces. Our approach combines ideas from genetic algorithms (GWW), theoretical probability (FlemingViot processes), and quantum computing. Not only do we demonstrate that SSMC is often more efficient than competing algorithms, but we also hope that our results connecting these disciplines will impact each independently. An implemented version of SSMC has previously enjoyed some success as a competitive optimization algorithm for Max$k$SAT. 
Subword Regularization  Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations are possible even with the same vocabulary. The question addressed in this paper is whether it is possible to harness the segmentation ambiguity as a noise to improve the robustness of NMT. We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements especially on low resource and outofdomain settings. 
Successor Representation (SR) 
Here we propose using the successor representation (SR) to accelerate learning in a constructive knowledge system based on general value functions (GVFs). In realworld settings like robotics for unstructured and dynamic environments, it is infeasible to model all meaningful aspects of a system and its environment by hand due to both complexity and size. Instead, robots must be capable of learning and adapting to changes in their environment and task, incrementally constructing models from their own experience. GVFs, taken from the field of reinforcement learning (RL), are a way of modeling the world as predictive questions. One approach to such models proposes a massive network of interconnected and interdependent GVFs, which are incrementally added over time. It is reasonable to expect that new, incrementally added predictions can be learned more swiftly if the learning process leverages knowledge gained from past experience. The SR provides such a means of separating the dynamics of the world from the prediction targets and thus capturing regularities that can be reused across multiple GVFs. As a primary contribution of this work, we show that using SRbased predictions can improve sample efficiency and learning speed in a continual learning setting where new predictions are incrementally added and learned over time. We analyze our approach in a gridworld and then demonstrate its potential on data from a physical robot arm. 
Sufficient Dimension Reduction (SDR) 
A theory of sufficient dimension reduction (SDR) is developed from an optimizational perspective. In our formulation of the problem, instead of dealing with raw data, we assume that our ground truth includes a mapping ${\mathbf f}: {\mathbb R}^n\rightarrow {\mathbb R}^m$ and a probability distribution function $p$ over ${\mathbb R}^n$, both given analytically. We formulate SDR as a problem of finding a function ${\mathbf g}: {\mathbb R}^k\rightarrow {\mathbb R}^m$ and a matrix $P\in {\mathbb R}^{k\times n}$ such that ${\mathbb E}_{{\mathbf x}\sim p({\mathbf x})} \left{\mathbf f}({\mathbf x}) – {\mathbf g}(P{\mathbf x})\right^2$ is minimal. It turns out that the latter problem allows a reformulation in the dual space, i.e. instead of searching for ${\mathbf g}(P{\mathbf x})$ we suggest searching for its Fourier transform. First, we characterize all tempered distributions that can serve as the Fourier transform of such functions. The reformulation in the dual space can be interpreted as a problem of finding a $k$dimensional linear subspace $S$ and a tempered distribution ${\mathbf t}$ supported in $S$ such that ${\mathbf t}$ is ‘close’ in a certain sense to the Fourier transform of ${\mathbf f}$. Instead of optimizing over generalized functions with a $k$dimensional support, we suggest minimizing over ordinary functions but with an additional term $R$ that penalizes a strong distortion of the support from any $k$dimensional linear subspace. For a specific case of $R$, we develop an algorithm that can be formulated for functions given in the initial form as well as for their Fourier transforms. Eventually, we report results of numerical experiments with a discretized version of the latter algorithm. sSDR 
Sufficient Factor Broadcasting (SFB) 
Matrixparametrized models, including multiclass logistic regression and sparse coding, are used in machine learning (ML) applications ranging from computer vision to computational biology. When these models are applied to largescale ML problems starting at millions of samples and tens of thousands of classes, their parameter matrix can grow at an unexpected rate, resulting in high parameter synchronization costs that greatly slow down distributed learning. To address this issue, we propose a Sufficient Factor Broadcasting (SFB) computation model for efficient distributed learning of a large family of matrixparameterized models, which share the following property: the parameter update computed on each data sample is a rank1 matrix, i.e., the outer product of two ‘sufficient factors’ (SFs). By broadcasting the SFs among worker machines and reconstructing the update matrices locally at each worker, SFB improves communication efficiency — communication costs are linear in the parameter matrix’s dimensions, rather than quadratic — without affecting computational correctness. We present a theoretical convergence analysis of SFB, and empirically corroborate its efficiency on four different matrixparametrized ML models. 
Sufficient Statistic  In statistics, a statistic is sufficient with respect to a statistical model and its associated unknown parameter if “no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter”. In particular, a statistic is sufficient for a family of probability distributions if the sample from which it is calculated gives no additional information than does the statistic, as to which of those probability distributions is that of the population from which the sample was taken. 
SufficientComponent Cause Model (SCC) 
In 1976 Ken Rothman, who is a member of the epidemiology faculty at BUSPH, proposed a conceptual model of causation known as the ‘sufficientcomponent cause model’ in an attempt to provide a practical view of causation which also had a sound theoretical basis. The model has similarities to the ‘web of causation’ theory described above, but is more developed in the sense that it simultaneously provides a general model for the conditions necessary to cause (and prevent) disease in a single individual and for the epidemiological study of the causes of disease among groups of individuals. The SufficientComponent Cause Model causalpie 
Suffix Bidirectional LSTM (Suffix BiLSTM) 
Recurrent neural networks have become ubiquitous in computing representations of sequential data, especially textual data in natural language processing. In particular, Bidirectional LSTMs are at the heart of several neural models achieving stateoftheart performance in a wide variety of tasks in NLP. We propose a general and effective improvement to the BiLSTM model which encodes each suffix and prefix of a sequence of tokens in both forward and reverse directions. We call our model Suffix BiLSTM or SuBiLSTM. Using an extensive set of experiments, we demonstrate that using SuBiLSTM instead of a BiLSTM in existing base models leads to improvements in performance in learning general sentence representations, text classification, textual entailment and named entity recognition. We achieve new stateoftheart results for finegrained sentiment classification and question classification using SuBiLSTM. 
Suffix Tree (PAT Tree) 
In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed ‘trie’ (digital tree) containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations. The construction of such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. Suffix trees also provide one of the first lineartime solutions for the longest common substring problem. These speedups come at a cost: storing a string’s suffix tree typically requires significantly more space than storing the string itself. 
Sugeno Integral  In mathematics, the Sugeno integral, named after M. Sugeno, is a type of integral with respect to a fuzzy measure. http://…/Ayub_Khan_2009.pdf kappalab 
Suggestion Mining  We propose a formal definition for the task of suggestion mining in the context of a wide range of open domain applications. Human perception of the term suggestion is subjective and this effects the preparation of hand labeled datasets for the task of suggestion mining. Existing work either lacks a formal problem definition and annotation procedure, or provides domain and application specific definitions. Moreover, many previously used manually labeled datasets remain proprietary. We first present an annotation study, and based on our observations propose a formal task definition and annotation procedure for creating benchmark datasets for suggestion mining. With this study, we also provide publicly available labeled datasets for suggestion mining in multiple domains. 
Suite of Fast Incremental Algorithms for Machine Learning (sofiaml) 
The suite of fast incremental algorithms for machine learning (sofiaml) can be used for training models for classification, regression, ranking, or combined regression and ranking. Several different techniques are available. This release is intended to aid researchers and practitioners who require fast methods for classification and ranking on large, sparse data sets. Supported classification, regression, and ranking learners include: · Pegasos SVM · Stochastic Gradient Descent (SGD) SVM · PassiveAggressive Perceptron · Perceptron with Margins · ROMMA · Logistic Regression (with Pegasos Projection) This package provides a commandline utility for training models and using them to predict on new data, and also exposes an API for model training and prediction that can be used in new applications. The underlying libraries for data sets, weight vectors, and example vectors are also provided for researchers wishing to use these classes to implement other algorithms. 
Sum of Powered Score (SPU) 
aSPU 
Sum Product Networks (SPN) 
SumProduct Networks (SPNs) are recently introduced deep tractable probabilistic models by which several kinds of inference queries can be answered exactly and in a tractable time. Up to now, they have been largely used as black box density estimators, assessed only by comparing their likelihood scores only. In this paper we explore and exploit the inner representations learned by SPNs. We do this with a threefold aim: first we want to get a better understanding of the inner workings of SPNs; secondly, we seek additional ways to evaluate one SPN model and compare it against other probabilistic models, providing diagnostic tools to practitioners; lastly, we want to empirically evaluate how good and meaningful the extracted representations are, as in a classic Representation Learning framework. In order to do so we revise their interpretation as deep neural networks and we propose to exploit several visualization techniques on their node activations and network outputs under different types of inference queries. To investigate these models as feature extractors, we plug some SPNs, learned in a greedy unsupervised fashion on image datasets, in supervised classification learning tasks. We extract several embedding types from node activations by filtering nodes by their type, by their associated feature abstraction level and by their scope. In a thorough empirical comparison we prove them to be competitive against those generated from popular feature extractors as Restricted Boltzmann Machines. Finally, we investigate embeddings generated from random probabilistic marginal queries as means to compare other tractable probabilistic models on a common ground, extending our experiments to Mixtures of Trees. 
Summary Receiver Operating Characteristic (SROC) 

SumProduct Graphical Model (SPGM) 
This paper introduces a new probabilistic architecture called SumProduct Graphical Model (SPGM). SPGMs combine traits from SumProduct Networks (SPNs) and Graphical Models (GMs): Like SPNs, SPGMs always enable tractable inference using a class of models that incorporate context specific independence. Like GMs, SPGMs provide a highlevel model interpretation in terms of conditional independence assumptions and corresponding factorizations. Thus, the new architecture represents a class of probability distributions that combines, for the first time, the semantics of graphical models with the evaluation efficiency of SPNs. We also propose a novel algorithm for learning both the structure and the parameters of SPGMs. A comparative empirical evaluation demonstrates competitive performances of our approach in density estimation. 
SumProduct Network (SPN) 
The key limiting factor in graphical model inference and learning is the complexity of the partition function. We thus ask the question: what are general conditions under which the partition function is tractable The answer leads to a new kind of deep architecture, which we call sumproduct networks (SPNs). SPNs are directed acyclic graphs with variables as leaves, sums and products as internal nodes, and weighted edges. We show that if an SPN is complete and consistent it represents the partition function and all marginals of some graphical model, and give semantics to its nodes. Essentially all tractable graphical models can be cast as SPNs, but SPNs are also strictly more general. We then propose learning algorithms for SPNs, based on backpropagation and EM. Experiments show that inference and learning with SPNs can be both faster and more accurate than with standard deep networks. For example, SPNs perform image completion better than stateoftheart deep networks for this task. SPNs also have intriguing potential connections to the architecture of the cortex. 
SumProductQuotient Network  We present a novel tractable generative model that extends SumProduct Networks (SPNs) and significantly boost their power. We call it SumProductQuotient Networks (SPQNs), whose core concept is to incorporate conditional distributions into the model by direct computation using quotient nodes, e.g. $P(AB){=}\frac{P(A,B)}{P(B)}$. We provide sufficient conditions for the tractability of SPQNs that generalize and relax the decomposable and complete tractability conditions of SPNs. These relaxed conditions give rise to an exponential boost to the expressive efficiency of our model, i.e. we prove that there are distributions which SPQNs can compute efficiently but require SPNs to be of exponential size. Thus, we narrow the gap in expressivity between tractable graphical models and other Neural Networkbased generative models. 
Sunburst Chart  A ring chart, also known as a sunburst chart or a multilevel pie chart, is used to visualize hierarchical data, depicted by concentric circles. The circle in the centre represents the root node, with the hierarchy moving outward from the center. A segment of the inner circle bears a hierarchical relationship to those segments of the outer circle which lie within the angular sweep of the parent segment. 
Superadditivity  In mathematics, a sequence { an }, n ≥ 1, is called superadditive if it satisfies the inequality a_{n+m} > a_n+a_m, for all m and n. The major reason for the use of superadditive sequences is the following lemma due to Michael Fekete. 
Superhighway Construction  Crossdomain collaborative filtering (CF) aims to alleviate data sparsity in singledomain CF by leveraging knowledge transferred from related domains. Many traditional methods focus on enriching compared neighborhood relations in CF directly to address the sparsity problem. In this paper, we propose superhighway construction, an alternative explicit relationenrichment procedure, to improve recommendations by enhancing crossdomain connectivity. Specifically, assuming partially overlapped items (users), superhighway bypasses multihop interdomain paths between crossdomain users (items, respectively) with direct paths to enrich the crossdomain connectivity. The experiments conducted on a realworld crossregion music dataset and a crossplatform movie dataset show that the proposed superhighway construction significantly improves recommendation performance in both target and source domains. 
SuperNeurons  Going deeper and wider in neural architectures improves the accuracy, while the limited GPU DRAM places an undesired restriction on the network design domain. Deep Learning (DL) practitioners either need change to less desired network architectures, or nontrivially dissect a network across multiGPUs. These distract DL practitioners from concentrating on their original machine learning tasks. We present SuperNeurons: a dynamic GPU memory scheduling runtime to enable the network training far beyond the GPU DRAM capacity. SuperNeurons features 3 memory optimizations, \textit{Liveness Analysis}, \textit{Unified Tensor Pool}, and \textit{CostAware Recomputation}, all together they effectively reduce the networkwide peak memory usage down to the maximal memory usage among layers. We also address the performance issues in those memory saving techniques. Given the limited GPU DRAM, SuperNeurons not only provisions the necessary memory for the training, but also dynamically allocates the memory for convolution workspaces to achieve the high performance. Evaluations against Caffe, Torch, MXNet and TensorFlow have demonstrated that SuperNeurons trains at least 3.2432 deeper network than current ones with the leading performance. Particularly, SuperNeurons can train ResNet2500 that has $10^4$ basic network layers on a 12GB K40c. 
SuperPCA  As an unsupervised dimensionality reduction method, principal component analysis (PCA) has been widely considered as an efficient and effective preprocessing step for hyperspectral image (HSI) processing and analysis tasks. It takes each band as a whole and globally extracts the most representative bands. However, different homogeneous regions correspond to different objects, whose spectral features are diverse. It is obviously inappropriate to carry out dimensionality reduction through a unified projection for an entire HSI. In this paper, a simple but very effective superpixelwise PCA approach, called SuperPCA, is proposed to learn the intrinsic lowdimensional features of HSIs. In contrast to classical PCA models, SuperPCA has four main properties. (1) Unlike the traditional PCA method based on a whole image, SuperPCA takes into account the diversity in different homogeneous regions, that is, different regions should have different projections. (2) Most of the conventional feature extraction models cannot directly use the spatial information of HSIs, while SuperPCA is able to incorporate the spatial context information into the unsupervised dimensionality reduction by superpixel segmentation. (3) Since the regions obtained by superpixel segmentation have homogeneity, SuperPCA can extract potential lowdimensional features even under noise. (4) Although SuperPCA is an unsupervised method, it can achieve competitive performance when compared with supervised approaches. The resulting features are discriminative, compact, and noise resistant, leading to improved HSI classification performance. Experiments on three public datasets demonstrate that the SuperPCA model significantly outperforms the conventional PCA based dimensionality reduction baselines for HSI classification. The Matlab source code is available at https://…/SuperPCA. 
SuperPivot  We present SuperPivot, an analysis method for lowresource languages that occur in a superparallel corpus, i.e., in a corpus that contains an order of magnitude more languages than parallel corpora currently in use. We show that SuperPivot performs well for the crosslingual analysis of the linguistic phenomenon of tense. We produce analysis results for more than 1000 languages, conducting – to the best of our knowledge – the largest crosslingual computational study performed to date. We extend existing methodology for leveraging parallel corpora for typological analysis by overcoming a limiting assumption of earlier work: We only require that a linguistic feature is overtly marked in a few of thousands of languages as opposed to requiring that it be marked in all languages under investigation. 
Superpixel Sampling Network (SSN) 
Superpixels provide an efficient low/midlevel representation of image data, which greatly reduces the number of image primitives for subsequent vision tasks. Existing superpixel algorithms are not differentiable, making them difficult to integrate into otherwise endtoend trainable deep neural networks. We develop a new differentiable model for superpixel sampling that leverages deep networks for learning superpixel segmentation. The resulting ‘Superpixel Sampling Network’ (SSN) is endtoend trainable, which allows learning taskspecific superpixels with flexible loss functions and has fast runtime. Extensive experimental analysis indicates that SSNs not only outperform existing superpixel algorithms on traditional segmentation benchmarks, but can also learn superpixels for other tasks. In addition, SSNs can be easily integrated into downstream deep networks resulting in performance improvements. 
Superpixels  Superpixels group perceptually similar pixels to create visually meaningful entities while heavily reducing the number of primitives. As of these properties, superpixel algorithms have received much attention since their naming in 2003. By today, publicly available and wellunderstood superpixel algorithms have turned into standard tools in lowlevel vision. As such, and due to their quick adoption in a wide range of applications, appropriate benchmarks are crucial for algorithm selection and comparison. Until now, the rapidly growing number of algorithms as well as varying experimental setups hindered the development of a unifying benchmark. We present a comprehensive evaluation of 28 stateoftheart superpixel algorithms utilizing a benchmark focussing on fair comparison and designed to provide new and relevant insights. To this end, we explicitly discuss parameter optimization and the importance of strictly enforcing connectivity. Furthermore, by extending wellknown metrics, we are able to summarize algorithm performance independent of the number of generated superpixels, thereby overcoming a major limitation of available benchmarks. Furthermore, we discuss runtime, robustness against noise, blur and affine transformations, implementation details as well as aspects of visual quality. Finally, we present an overall ranking of superpixel algorithms which redefines the stateoftheart and enables researchers to easily select appropriate algorithms and the corresponding implementations which themselves are made publicly available as part of our benchmark at davidstutz.de/projects/superpixelbenchmark/. 
SuperSpike  A vast majority of computation in the brain is performed by spiking neural networks. Despite the ubiquity of such spiking, we currently lack an understanding of how biological spiking neural circuits learn and compute invivo, as well as how we can instantiate such capabilities in artificial spiking circuits insilico. Here we revisit the problem of supervised learning in temporally coding multilayer spiking neural networks. First, by using a surrogate gradient approach, we derive SuperSpike, a nonlinear voltagebased three factor learning rule capable of training multilayer networks of deterministic integrateandfire neurons to perform nonlinear computations on spatiotemporal spike patterns. Second, inspired by recent results on feedback alignment, we compare the performance of our learning rule under different credit assignment strategies for propagating output errors to hidden units. Specifically, we test uniform, symmetric and random feedback, finding that simpler tasks can be solved with any type of feedback, while more complex tasks require symmetric feedback. In summary, our results open the door to obtaining a better scientific understanding of learning and computation in spiking neural networks by advancing our ability to train them to solve nonlinear problems involving transformations between different spatiotemporal spiketime patterns. 
Supervised Cadre Model (SCM) 
A Precision EnvironmentWide Association Study of Hypertension via Supervised Cadre Models 
Supervised Fuzzy Partitioning (SFP) 
Centroidbased methods including kmeans and fuzzy cmeans (FCM) are known as effective and easytoimplement approaches to clustering purposes in many areas of application. However, these algorithms cannot be directly applied to supervised tasks. We propose a generative model extending centroidbased clustering approaches to be applicable to classification and regression problems. Given an arbitrary loss function, our approach, termed supervised fuzzy partitioning (SFP), incorporates labels information into its objective function through a surrogate term penalizing the risk. We also fuzzify the partition and assign weights to features alongside entropybased regularization terms, enabling the method to capture more complex data structure, to identify significant features, and to yield better performance facing highdimensional data. An iterative algorithm based on block coordinate descent (BCD) scheme was formulated to efficiently find a local optimizer. The results show that the SFP performance in classification and supervised dimensionality reduction on synthetic and realworld datasets is competitive with stateoftheart algorithms such as random forest and SVM. Our method has a major advantage over such methods in that it not only leads to a flexible model but also uses the loss function in training phase without compromising computational efficiency. 
Supervised Node Saliency  The autoencoder is an artificial neural network model that learns hidden representations of unlabeled data. With a linear transfer function it is similar to the principal component analysis (PCA). While both methods use weight vectors for linear transformations, the autoencoder does not come with any indication similar to the eigenvalues in PCA that are paired with the eigenvectors. We propose a novel supervised node saliency (SNS) method that ranks the hidden nodes by comparing class distributions of latent representations against a fixed reference distribution. The latent representations of a hidden node can be described using a onedimensional histogram. We apply normalized entropy difference (NED) to measure the ‘interestingness’ of the histograms, and conclude a property for NED values to identify a good classifying node. By applying our methods to real data sets, we demonstrate the ability of SNS to explain what the trained autoencoders have learned. 
Supervised Policy Update  We propose a new sampleefficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU optimizes over the proximal policy space to find a nonparameterized policy. It then solves a supervised regression problem to convert the nonparameterized policy to a parameterized policy, from which it draws new samples. There is significant flexibility in setting the labels in the supervised regression problem, with different settings corresponding to different underlying optimization problems. We develop a methodology for finding an optimal policy in the nonparameterized policy space, and show how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be addressed by this methodology. In terms of sample efficiency, our experiments show SPU can outperform PPO for simulated robotic locomotion tasks. 
Supervised Quantile Normalisation (SUQUAN) 
Quantile normalisation is a popular normalisation method for data subject to unwanted variations such as images, speech, or genomic data. It applies a monotonic transformation to the feature values of each sample to ensure that after normalisation, they follow the same target distribution for each sample. Choosing a ‘good’ target distribution remains however largely empirical and heuristic, and is usually done independently of the subsequent analysis of normalised data. We propose instead to couple the quantile normalisation step with the subsequent analysis, and to optimise the target distribution jointly with the other parameters in the analysis. We illustrate this principle on the problem of estimating a linear model over normalised data, and show that it leads to a particular lowrank matrix regression problem that can be solved efficiently. We illustrate the potential of our method, which we term SUQUAN, on simulated data, images and genomic data, where it outperforms standard quantile normalisation. 
Supervised Tensor Learning (STL) 

Supplier’s Declaration of Conformity (SDoC) 
The accuracy and reliability of machine learning algorithms are an important concern for suppliers of artificial intelligence (AI) services, but considerations beyond accuracy, such as safety, security, and provenance, are also critical elements to engender consumers’ trust in a service. In this paper, we propose a supplier’s declaration of conformity (SDoC) for AI services to help increase trust in AI services. An SDoC is a transparent, standardized, but often not legally required, document used in many industries and sectors to describe the lineage of a product along with the safety and performance testing it has undergone. We envision an SDoC for AI services to contain purpose, performance, safety, security, and provenance information to be completed and voluntarily released by AI service providers for examination by consumers. Importantly, it conveys productlevel rather than componentlevel functional testing. We suggest a set of declaration items tailored to AI and provide examples for two fictitious AI services. 
Support  Support is defined on itemsets and gives the proportion of transactions which contain X. It is used as a measure of significance (importance) of an itemset. Since it basically uses the count of transactions it is often called a frequency constraint. An itemset with a support greater then a set minimum support threshold, supp(X)>σ, is called a frequent or large itemset. Supports main feature is that it possesses the downward closure property (antimonotonicity) which means that all sub sets of a frequent set are also frequent. This property (actually, the fact that no super set of a infrequent set can be frequent) is used to prune the search space (usually thought of as a lattice or tree of item sets with increasing size) in levelwise algorithms (e.g., the Apriori algorithm). The disadvantage of support is the rare item problem. Items that occur very infrequently in the data set are pruned although they would still produce interesting and potentially valuable rules. The rare item problem is important for transaction data which usually have a very uneven distribution of support for the individual items (typical is a powerlaw distribution where few items are used all the time and most item are rarely used). 
Support Neighbor (SN) 
Person reidentification (reID) has recently been tremendously boosted due to the advancement of deep convolutional neural networks (CNN). The majority of deep reID methods focus on designing new CNN architectures, while less attention is paid on investigating the loss functions. Verification loss and identification loss are two types of losses widely used to train various deep reID models, both of which however have limitations. Verification loss guides the networks to generate feature embeddings of which the intraclass variance is decreased while the interclass ones is enlarged. However, training networks with verification loss tends to be of slow convergence and unstable performance when the number of training samples is large. On the other hand, identification loss has good separating and scalable property. But its neglect to explicitly reduce the intraclass variance limits its performance on reID, because the same person may have significant appearance disparity across different camera views. To avoid the limitations of the two types of losses, we propose a new loss, called support neighbor (SN) loss. Rather than being derived from data sample pairs or triplets, SN loss is calculated based on the positive and negative support neighbor sets of each anchor sample, which contain more valuable contextual information and neighborhood structure that are beneficial for more stable performance. To ensure scalability and separability, a softmaxlike function is formulated to push apart the positive and negative support sets. To reduce intraclass variance, the distance between the anchor’s nearest positive neighbor and furthest positive sample is penalized. Integrating SN loss on top of Resnet50, superior reID results to the stateoftheart ones are obtained on several widely used datasets. 
Support Tensor Machine (STM) 

Support Tensor Train Machine (STTM) 
There has been growing interest in extending traditional vectorbased machine learning techniques to their tensor forms. An example is the support tensor machine (STM) that utilizes a rankone tensor to capture the data structure, thereby alleviating the overfitting and curse of dimensionality problems in the conventional support vector machine (SVM). However, the expressive power of a rankone tensor is restrictive for many realworld data. To overcome this limitation, we introduce a support tensor train machine (STTM) by replacing the rankone tensor in an STM with a tensor train. Experiments validate and confirm the superiority of an STTM over the SVM and STM. 
Support Vector Data Description (SVDD) 
Data domain description concerns the characterization of a data set. A good description covers all target data but includes no superfluous space. The boundary of a dataset can be used to detect novel data or outliers. We will present the Support Vector Data Description (SVDD) which is inspired by the Support Vector Classifier. It obtains a spherically shaped boundary around a dataset and analogous to the Support Vector Classifier it can be made flexible by using other kernel functions. The method is made robust against outliers in the training set and is capable of tightening the description by using negative examples. We show characteristics of the Support Vector Data Descriptions using artificial and real data. Sampling Method for Fast Training of Support Vector Data Description 
Support Vector Machine (SVM) 
In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a nonprobabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. SwarmSVM 
Support Vector Machines Plus (SVM+) 
See (Vladimir et. al, 2009, <doi:10.1016/j.neunet.2009.06.042>) for theoretical details and see (Li et. al, 2016, <https://…/svmplus_matlab> ) for implementation details in ‘MATLAB’. svmplus 
SupportNet  A plain welltrained deep learning model often does not have the ability to learn new knowledge without forgetting the previously learned knowledge, which is known as the catastrophic forgetting. Here we propose a novel method, SupportNet, to solve the catastrophic forgetting problem in class incremental learning scenario efficiently and effectively. SupportNet combines the strength of deep learning and support vector machine (SVM), where SVM is used to identify the support data from the old data, which are fed to the deep learning model together with the new data for further training so that the model can review the essential information of the old data when learning the new information. Two powerful consolidation regularizers are applied to ensure the robustness of the learned model. Comprehensive experiments on various tasks, including enzyme function prediction, subcellular structure classification and breast tumor classification, show that SupportNet drastically outperforms the stateoftheart incremental learning methods and even reaches similar performance as the deep learning model trained from scratch on both old and new data. Our program is accessible at: https://…/SupportNet 
Sure Thing Principle (STP) 
In 1954, Jim Savage introduced the Sure Thing Principle to demonstrate that preferences among actions could constitute an axiomatic basis for a Bayesian foundation of statistical inference. Here, we trace the history of the principle, discuss some of its nuances, and evaluate its significance in the light of modern understanding of causal reasoning. The surething principle (STP) was introduced by L.T. Savage using the following story: ‘A businessman contemplates buying a certain piece of property. He considers the outcome of the next presidential election relevant. So, to clarify the matter to himself, he asks whether he would buy if he knew that the Democratic candidate were going to win, and decides that he would. Similarly, he considers whether he would buy if he knew that the Republican candidate were going to win, and again finds that he would. Seeing that he would buy in either event, he decides that he should buy, even though he does not know which event obtains, or will obtain, as we would ordinarily say.’ 
Surface Network (SN) 
We study datadriven representations for threedimensional triangle meshes, which are one of the prevalent objects used to represent 3D geometry. Recent works have developed models that exploit the intrinsic geometry of manifolds and graphs, namely the Graph Neural Networks (GNNs) and its spectral variants, which learn from the local metric tensor via the Laplacian operator. Despite offering excellent sample complexity and builtin invariances, intrinsic geometry alone is invariant to isometric deformations, making it unsuitable for many applications. To overcome this limitation, we propose several upgrades to GNNs to leverage extrinsic differential geometry properties of threedimensional surfaces, increasing its modeling power. In particular, we propose to exploit the Dirac operator, whose spectrum detects principal curvature directions — this is in stark contrast with the classical Laplace operator, which directly measures mean curvature. We coin the resulting model the \emph{Surface Network (SN)}. We demonstrate the efficiency and versatility of SNs on two challenging tasks: temporal prediction of mesh deformations under nonlinear dynamics and generative models using a variational autoencoder framework with encoders/decoders given by SNs. 
Surrogate Variable Analysis (SVA) 
Modern highthroughput molecular biology experiments measure data for thousands of related features and seek to rank those features for association with some variables of experimental or clinical importance. The process of ranking features for association with primary variables is complicated by genetic, environmental, and technical factors that influence hundreds or thousands of features at a time. In highdimensional experiments these factors are often unknown, unmeasured, or incapable of being tractably modeled. Consistent patterns of variation across features due to unmeasured or unmodeled factors can confound the relationship between the primary variables and the measured features. In this thesis we provide a statistical framework for modeling largescale noise dependence caused by unmeasured or unmodeled factors in highthroughput data. We argue that estimating the sources of noise dependence is more appropriate than estimating the pairwise covariance between all features when the number of features is large. A direct connection is made with the wellstudied problem of multiple testing dependence, which typically focuses on the distribution of Pvalues from multiple testing procedures. We introduce the concept of surrogate variables, estimable linear combinations of the true unmeasured or unmodeled factors causing noise dependence, that can be included when modeling the relationship between the primary variables and the feature level data. We also propose algorithms for estimating surrogate variables based on principal component analysis of relevant subsets of features. Under certain conditions accounting for the estimated surrogate variables asymptotically corrects the ranking and error rate estimation in highthroughput data analysis. We also discuss pathological situations when surrogate variables can not be estimated. To illustrate the power of this approach, we apply our estimates of the surrogate variables to improve reproducibility in a large clinical gene expression study of trauma related outcomes. 
Survival Analysis  Survival analysis is a branch of statistics which deals with analysis of time duration to until one or more events happen, such as death in biological organisms and failure in mechanical systems. This topic is called reliability theory or reliability analysis in engineering, and duration analysis or duration modeling in economics or event history analysis in sociology. Survival analysis attempts to answer questions such as: what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival? Book: Survival Analysis Book: Handbook of Survival Analysis https://…/survival_analysis.html https://…/Survival Analysis with Plotly survsim,survAccuracyMeasures,BSGW,flexPM 
SurvivalCRPS  Personalized probabilistic forecasts of time to event (such as mortality) can be crucial in decision making, especially in the clinical setting. Inspired by ideas from the meteorology literature, we approach this problem through the paradigm of maximizing sharpness of prediction distributions, subject to calibration. In regression problems, it has been shown that optimizing the continuous ranked probability score (CRPS) instead of maximum likelihood leads to sharper prediction distributions while maintaining calibration. We introduce the SurvivalCRPS, a generalization of the CRPS to the time to event setting, and present rightcensored and intervalcensored variants. To holistically evaluate the quality of predicted distributions over time to event, we present the SurvivalAUPRC evaluation metric, an analog to area under the precisionrecall curve. We apply these ideas by building a recurrent neural network for mortality prediction, using an Electronic Health Record dataset covering millions of patients. We demonstrate significant benefits in models trained by the SurvivalCRPS objective instead of maximum likelihood. 
Sutte Indicator  
SVDbased NMF Initialization  Due to the iterative nature of most nonnegative matrix factorization (\textsc{NMF}) algorithms, initialization is a key aspect as it significantly influences both the convergence and the final solution obtained. Many initialization schemes have been proposed for NMF, among which one of the most popular class of methods are based on the singular value decomposition (SVD). However, these SVDbased initializations do not satisfy a rather natural condition, namely that the error should decrease as the rank of factorization increases. In this paper, we propose a novel SVDbased \textsc{NMF} initialization to specifically address this shortcoming by taking into account the SVD factors that were discarded to obtain a nonnegative initialization. This method, referred to as nonnegative SVD with lowrank correction (NNSVDLRC), allows us to significantly reduce the initial error at a negligible additional computational cost using the lowrank structure of the discarded SVD factors. NNSVDLRC has two other advantages compared to previous SVDbased initializations: (1) it provably generates sparse initial factors, and (2) it is faster as it only requires to compute a truncated SVD of rank $\lceil r/2 + 1 \rceil$ where $r$ is the factorization rank of the sought NMF decomposition (as opposed to a rank$r$ truncated SVD for other methods). We show on several standard dense and sparse data sets that our new method competes favorably with stateoftheart SVDbased initializations for NMF. 
Svensson’s Method  Svensson’s Method is a rankinvariant nonparametric method for the analysis of ordered scales which measures the level of change both from systematic and individual aspects. For the details, please refer to Svensson E. Analysis of systematic and random differences between paired ordinal categorical data [dissertation]. Stockholm: Almqvist & Wiksell International; 1993. svenssonm 
svg.js  A lightweight library for manipulating and animating SVG. GitHub 
svgpanzoom.js  JavaScript library that enables panning and zooming of an SVG in an HTML document, with mouse events or custom JavaScript hooks. svg.panzoom.js svgPanZoom 
SWAG  Given a partial description like ‘she opened the hood of the car,’ humans can reason about the situation and anticipate what might come next (‘then, she examined the engine’). In this paper, we introduce the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning. We present SWAG, a new dataset with 113k multiple choice questions about a rich spectrum of grounded situations. To address the recurring challenges of the annotation artifacts and human biases found in many existing datasets, we propose Adversarial Filtering (AF), a novel procedure that constructs a debiased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data. To account for the aggressive adversarial filtering, we use stateoftheart language models to massively oversample a diverse set of potential counterfactuals. Empirical results demonstrate that while humans can solve the resulting inference problems with high accuracy (88%), various competitive models struggle on our task. We provide comprehensive analysis that indicates significant opportunities for future research. 
Swapout  We describe Swapout, a new stochastic training method, that outperforms ResNets of identical network structure yielding impressive results on CIFAR10 and CIFAR100. Swapout samples from a rich set of architectures including dropout, stochastic depth and residual architectures as special cases. When viewed as a regularization method swapout not only inhibits coadaptation of units in a layer, similar to dropout, but also across network layers. We conjecture that swapout achieves strong regularization by implicitly tying the parameters across layers. When viewed as an ensemble training method, it samples a much richer set of architectures than existing methods such as dropout or stochastic depth. We propose a parameterization that reveals connections to exiting architectures and suggests a much richer set of architectures to be explored. We show that our formulation suggests an efficient training method and validate our conclusions on CIFAR10 and CIFAR100 matching state of the art accuracy. Remarkably, our 32 layer wider model performs similar to a 1001 layer ResNet model. 
Swift  A probabilistic program defines a probability measure over its semantic structures. One common goal of probabilistic programming languages (PPLs) is to compute posterior probabilities for arbitrary models and queries, given observed evidence, using a generic inference engine. Most PPL inference engines – even the compiled ones – incur significant runtime interpretation overhead, especially for contingent and openuniverse models. This paper describes Swift, a compiler for the BLOG PPL. Swiftgenerated code incorporates optimizations that eliminate interpretation overhead, maintain dynamic dependencies efficiently, and handle memory management for possible worlds of varying sizes. Experiments comparing Swift with other PPL engines on a variety of inference problems demonstrate speedups ranging from 12x to 326x. 
Swish  The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widelyused activation function is the Rectified Linear Unit (ReLU). Although various alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose a new activation function, named Swish, which is simply $f(x) = x \cdot \text{sigmoid}(x)$. Our experiments show that Swish tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top1 classification accuracy on ImageNet by 0.9% for Mobile NASNetA and 0.6% for InceptionResNetv2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network. 
Switchable Temporal Propagation Network  Videos contain highly redundant information between frames. Such redundancy has been extensively studied in video compression and encoding, but is less explored for more advanced video processing. In this paper, we propose a learnable unified framework for propagating a variety of visual properties of video images, including but not limited to color, high dynamic range (HDR), and segmentation information, where the properties are available for only a few keyframes. Our approach is based on a temporal propagation network (TPN), which models the transitionrelated affinity between a pair of frames in a purely datadriven manner. We theoretically prove two essential factors for TPN: (a) by regularizing the global transformation matrix as orthogonal, the ‘style energy’ of the property can be well preserved during propagation; (b) such regularization can be achieved by the proposed switchable TPN with bidirectional training on pairs of frames. We apply the switchable TPN to three tasks: colorizing a grayscale video based on a few color keyframes, generating an HDR video from a low dynamic range (LDR) video and a few HDR frames, and propagating a segmentation mask from the first frame in videos. Experimental results show that our approach is significantly more accurate and efficient than the stateoftheart methods. 
Switching Neural Network (SNN) 
A new connectionist model, called Switching Neural Network (SNN), for the solution of classification problems is presented. SNN includes a first layer containing a particular kind of A/D converters, called latticizers, that suitably transform input vectors into binary strings. Then, the subsequent two layers of an SNN realize a positive Boolean function that solve in a lattice domain the original classification problem. Every function realized by an SNN can be written in terms of intelligible rules. Training can be performed by adopting a proper method for positive Boolean function reconstruction, called Shadow Clustering (SC). Simulation results obtained on the StatLog benchmark show the good quality of the SNNs trained with SC. 
SwitchOut  In this work, we examine methods for data augmentation for textbased tasks such as neural machine translation (NMT). We formulate the design of a data augmentation policy with desirable properties as an optimization problem, and derive a generic analytic solution. This solution not only subsumes some existing augmentation schemes, but also leads to an extremely simple data augmentation strategy for NMT: randomly replacing words in both the source sentence and the target sentence with other random words from their corresponding vocabularies. We name this method SwitchOut. Experiments on three translation datasets of different scales show that SwitchOut yields consistent improvements of about 0.5 BLEU, achieving better or comparable performances to strong alternatives such as word dropout (Sennrich et al., 2016a). Code to implement this method is included in the appendix. 
SWP Operator  The sweep operator as defined in (Dempster, 1969), commonly referred to as the SWP operator, is a useful tool for a computational statistician working with covariance matrices. In particular, the SWP operator allows a statistician to quickly regress all variables against one specified variable, obtaining OLS estimates for regression coefficients and variances in a single application. Subsequent applications of the SWP operator allows for regressing against more variables. 
Sybase IQ  SAP Sybase IQ is a highly optimized analytics server designed specifically to deliver superior performance for missioncritical business intelligence, analytics and data warehousing solutions on any standard hardware and operating system. 
Sylvester Normalizing Flows  Variational inference relies on flexible approximate posterior distributions. Normalizing flows provide a general recipe to construct flexible variational posteriors. We introduce Sylvester normalizing flows, which can be seen as a generalization of planar flows. Sylvester normalizing flows remove the wellknown singleunit bottleneck from planar flows, making a single transformation much more flexible. We compare the performance of Sylvester normalizing flows against planar flows and inverse autoregressive flows and demonstrate that they compare favorably on several datasets. 
Symbiosis  The 20th century paradigm of paper forms and typewriters lives on in most of today’s User Interfaces. This kind of UI is adequate for repeatable tasks, but not for highly dynamic, situationdriven activities. The ubiquity of new devices with amazing capabilities has opened the door for a completely new way of working with computers: Combining the respective strengths of human and computer by means of frictionless interaction. 
SymbolConcept Association Network (SCAN) 
The natural world is infinitely diverse, yet this diversity arises from a relatively small set of coherent properties and rules, such as the laws of physics or chemistry. We conjecture that biological intelligent systems are able to survive within their diverse environments by discovering the regularities that arise from these rules primarily through unsupervised experiences, and representing this knowledge as abstract concepts. Such representations possess useful properties of compositionality and hierarchical organisation, which allow intelligent agents to recombine afinite set of conceptual building blocks into an exponentially large set of useful new concepts. This paper describes SCAN (SymbolConcept Association Network), a new framework for learning such concepts in the visual domain. We first use the previously published betaVAE (Higgins et al., 2017a) architecture to learn a disentangled representation of the latent structure of the visual world, before training SCAN to extract abstract concepts grounded in such disentangled visual primitives through fast symbol association. Our approach requires very few pairings between symbols and images and makes no assumptions about the choice of symbol representations.Once trained, SCAN is capable of multimodal bidirectional inference, generating a diverse set of image samples from symbolic descriptions and vice versa. It also allows for traversal and manipulation of the implicit hierarchy of compositional visual concepts through symbolic instructions and learnt logical recombination operations. Such manipulations enable SCAN to invent and learn novel visual concepts through recombination of the few learnt concepts. 
Symbolic Aggregate Approximation (SAX) 
While there are literally hundreds of papers on discretizing (symbolizing, tokenizing, quantizing) time series, none of the techniques allows a distance measure that lower bounds a distance measure defined on the original time series. For this reason, the generic time series data mining approach illustrated in Table 1 is of little utility, since the approximate solution to problem created in main memory may be arbitrarily dissimilar to the true solution that would have been obtained on the original data. If, however, one had a symbolic approach that allowed lower bounding of the true distance, one could take advantage of the generic time series data mining model, and of a host of other algorithms, definitions and data structures which are only defined for discrete data, including hashing, Markov models, and suffix trees. This is exactly the contribution of this paper. We call our symbolic representation of time series SAX (Symbolic Aggregate approXimation), and define it in the next section…. TSMining 
Symbolic Computation  In mathematics and computer science, computer algebra, also called symbolic computation or algebraic computation is a scientific area that refers to the study and development of algorithms and software for manipulating mathematical expressions and other mathematical objects. Although, properly speaking, computer algebra should be a subfield of scientific computing, they are generally considered as distinct fields because scientific computing is usually based on numerical computation with approximate floating point numbers, while symbolic computation emphasizes exact computation with expressions containing variables that have not any given value and are thus manipulated as symbols (therefore the name of symbolic computation). 
Symbolic Data  Any data taking care on the variation inside classes of standard observation: The data descriptions of the units are called ‘symbolic’ when they are more complex than standard ones due to the fact that they contain internal variation and are structured. http://…=10.1.1.95.4237&rep=rep1&type=pdf 
Symbolic Data Analysis (SDA) 
Symbolic data analysis (SDA) is an extension of standard data analysis where symbolic data tables are used as input and symbolic objects are outputted as a result. The data units are called symbolic since they are more complex than standard ones, as they not only contain values or categories, but also include internal variation and structure. SDA is based on four spaces: the space of individuals, the space of concepts, the space of descriptions, and the space of symbolic objects. The space of descriptions models individuals, while the space of symbolic objects models concepts. An Introduction to Symbolic Data Analysis and the Sodas Software 
Symbolic Multidimensional Scaling  Symbolic multidimensional scaling aims to present relations between objects treated as hypercubes in multidimensional space. To allow interpretation and graphical representation of the results usually twodimensional space is used. Most of symbolic multidimensional scaling methods require interval dissimilarity matrix as input. This matrix can be obtained from n judges, opinions or from dissimilaritymeasure for intervalvalued variables that produces intervalvalued dissimilarities. smds 
Symbolic Multidimensional Scaling of Interval Dissimilarities (SymScal) 
Multidimensional scaling aims at reconstructing dissimilarities between pairs of objects by distances in a low dimensional space. However, in some cases the dissimilarity itself is unknown, but the range of the dissimilarity is given. Such fuzzy data fall in the wider class of symbolic data (Bock & Diday, 2000). Denoeux and Masson (2002) have proposed to model an interval dissimilarity by a range of the distance defined as the minimum and maximum distance between two rectangles representing the objects. In this paper, we provide a new algorithm called SymScal that is based on iterative majorization. The advantage is that each iteration is guaranteed to improve the solution until no improvement is possible. In a simulation study, we investigate the quality of this algorithm. We discuss the use of SymScal on empirical dissimilarity intervals of sounds. 
Symbolic Reinforcement Learning with Common Sense (SRL+CS) 
Deep Reinforcement Learning (deep RL) has made several breakthroughs in recent years in applications ranging from complex control tasks in unmanned vehicles to game playing. Despite their success, deep RL still lacks several important capacities of human intelligence, such as transfer learning, abstraction and interpretability. Deep Symbolic Reinforcement Learning (DSRL) seeks to incorporate such capacities to deep Qnetworks (DQN) by learning a relevant symbolic representation prior to using Qlearning. In this paper, we propose a novel extension of DSRL, which we call Symbolic Reinforcement Learning with Common Sense (SRL+CS), offering a better balance between generalization and specialization, inspired by principles of common sense when assigning rewards and aggregating Qvalues. Experiments reported in this paper show that SRL+CS learns consistently faster than Qlearning and DSRL, achieving also a higher accuracy. In the hardest case, where agents were trained in a deterministic environment and tested in a random environment, SRL+CS achieves nearly 100% average accuracy compared to DSRL’s 70% and DQN’s 50% accuracy. To the best of our knowledge, this is the first case of near perfect zeroshot transfer learning using Reinforcement Learning. 
Symmetrical Distillation Network (SDN) 
Texttoimage synthesis aims to automatically generate images according to text descriptions given by users, which is a highly challenging task. The main issues of texttoimage synthesis lie in two gaps: the heterogeneous and homogeneous gaps. The heterogeneous gap is between the highlevel concepts of text descriptions and the pixellevel contents of images, while the homogeneous gap exists between synthetic image distributions and real image distributions. For addressing these problems, we exploit the excellent capability of generic discriminative models (e.g. VGG19), which can guide the training process of a new generative model on multiple levels to bridge the two gaps. The highlevel representations can teach the generative model to extract necessary visual information from text descriptions, which can bridge the heterogeneous gap. The midlevel and lowlevel representations can lead it to learn structures and details of images respectively, which relieves the homogeneous gap. Therefore, we propose Symmetrical Distillation Networks (SDN) composed of a source discriminative model as ‘teacher’ and a target generative model as ‘student’. The target generative model has a symmetrical structure with the source discriminative model, in order to transfer hierarchical knowledge accessibly. Moreover, we decompose the training process into two stages with different distillation paradigms for promoting the performance of the target generative model. Experiments on two widelyused datasets are conducted to verify the effectiveness of our proposed SDN. 
Syn2Real  Unsupervised transfer of object recognition models from synthetic to real data is an important problem with many potential applications. The challenge is how to ‘adapt’ a model trained on simulated images so that it performs well on realworld data without any additional supervision. Unfortunately, current benchmarks for this problem are limited in size and task diversity. In this paper, we present a new largescale benchmark called Syn2Real, which consists of a synthetic domain rendered from 3D object models and two realimage domains containing the same object categories. We define three related tasks on this benchmark: closedset object classification, openset object classification, and object detection. Our evaluation of multiple stateoftheart methods reveals a large gap in adaptation performance between the easier closedset classification task and the more difficult openset and detection tasks. We conclude that developing adaptation methods that work well across all three tasks presents a significant future challenge for syn2real domain transfer. 
SyncGAN  Generative adversarial network (GAN) has achieved impressive success on crossdomain generation, but it faces difficulty in crossmodal generation due to the lack of a common distribution between heterogeneous data. Most existing methods of conditional based crossmodal GANs adopt the strategy of onedirectional transfer and have achieved preliminary success on texttoimage transfer. Instead of learning the transfer between different modalities, we aim to learn a synchronous latent space representing the crossmodal common concept. A novel network component named synchronizer is proposed in this work to judge whether the paired data is synchronous/corresponding or not, which can constrain the latent space of generators in the GANs. Our GAN model, named as SyncGAN, can successfully generate synchronous data (e.g., a pair of image and sound) from identical random noise. For transforming data from one modality to another, we recover the latent code by inverting the mappings of a generator and use it to generate data of different modality. In addition, the proposed model can achieve semisupervised learning, which makes our model more flexible for practical applications. 
SyntaxDirected Variational Autoencoder (SDVAE) 
Deep generative models have been enjoying success in modeling continuous data. However it remains challenging to capture the representations for discrete structures with formal grammars and semantics, e.g., computer programs and molecular structures. How to generate both syntactically and semantically correct data still remains largely an open problem. Inspired by the theory of compiler where the syntax and semantics check is done via syntaxdirected translation (SDT), we propose a novel syntaxdirected variational autoencoder (SDVAE) by introducing stochastic lazy attributes. This approach converts the offline SDT check into onthefly generated guidance for constraining the decoder. Comparing to the stateoftheart methods, our approach enforces constraints on the output space so that the output will be not only syntactically valid, but also semantically reasonable. We evaluate the proposed model with applications in programming language and molecules, including reconstruction and program/molecule optimization. The results demonstrate the effectiveness in incorporating syntactic and semantic constraints in discrete generative models, which is significantly better than current stateoftheart approaches. 
Synthesis Using Geometrically Aligned Randomwalks (SUGAR) 
Many generative models attempt to replicate the density of their input data. However, this approach is often undesirable, since data density is highly affected by sampling biases, noise, and artifacts. We propose a method called SUGAR (Synthesis Using Geometrically Aligned Randomwalks) that uses a diffusion process to learn a manifold geometry from the data. Then, it generates new points evenly along the manifold by pulling randomly generated points into its intrinsic structure using a diffusion kernel. SUGAR equalizes the density along the manifold by selectively generating points in sparse areas of the manifold. We demonstrate how the approach corrects sampling biases and artifacts, while also revealing intrinsic patterns (e.g. progression) and relations in the data. The method is applicable for correcting missing data, finding hypothetical data points, and learning relationships between data features. 
Synthesizing What I Mean (SWIM) 
Modern programming frameworks come with large libraries, with diverse applications such as for matching regular expressions, parsing XML files and sending email. Programmers often use search engines such as Google and Bing to learn about existing APIs. In this paper, we describe SWIM, a tool which suggests code snippets given APIrelated natural language queries such as ‘generate md5 hash code’. We translate user queries into the APIs of interest using clickthrough data from the Bing search engine. Then, based on patterns learned from opensource code repositories, we synthesize idiomatic code describing the use of these APIs. We introduce \emph{structured call sequences} to capture APIusage patterns. Structured call sequences are a generalized form of method call sequences, with ifbranches and whileloops to represent conditional and repeated API usage patterns, and are simple to extract and amenable to synthesis. We evaluated SWIM with 30 common C# APIrelated queries received by Bing. For 70% of the queries, the first suggested snippet was a relevant solution, and a relevant solution was present in the top 10 results for all benchmarked queries. The online portion of the workflow is also very responsive, at an average of 1.5 seconds per snippet. 
Synthetic Gradient (SG) 
Artifical Neural Network are a particular class of learning system modeled after biological neural functions with an interesting penchant for Hebbian learning, that is ‘neurons that wire together, fire together’. However, unlike their natural counterparts, artificial neural networks have a close and stringent coupling between the modules of neurons in the network. This coupling or locking imposes upon the network a strict and inflexible structure that prevent layers in the network from updating their weights until a full feedforward and backward pass has occurred. Such a constraint though may have sufficed for a while, is now no longer feasible in the era of verylargescale machine learning, coupled with the increased desire for parallelization of the learning process across multiple computing infrastructures. To solve this problem, synthetic gradients (SG) with decoupled neural interfaces (DNI) are introduced as a viable alternative to the backpropagation algorithm. This paper performs a speed benchmark to compare the speed and accuracy capabilities of SGDNI as over to a standard neural interface using multilayer perceptron MLP. SGDNI shows good promise, in that it not only captures the learning problem, it is also over 3fold faster due to it asynchronous learning capabilities. 
Syntree2Vec  Word embeddings aims to map sense of the words into a lower dimensional vector space in order to reason over them. Training embeddings on domain specific data helps express concepts more relevant to their use case but comes at a cost of accuracy when data is less. Our effort is to minimise this by infusing syntactic knowledge into the embeddings. We propose a graph based embedding algorithm inspired from node2vec. Experimental results have shown that our algorithm improves the syntactic strength and gives robust performance on meagre data. 
SySeVR  The detection of software vulnerabilities (or vulnerabilities for short) is an important problem that has yet to be tackled, as manifested by many vulnerabilities reported on a daily basis. This calls for machine learning methods to automate vulnerability detection. Deep learning is attractive for this purpose because it does not require human experts to manually define features. Despite the tremendous success of deep learning in other domains, its applicability to vulnerability detection is not systematically understood. In order to fill this void, we propose the first systematic framework for using deep learning to detect vulnerabilities. The framework, dubbed Syntaxbased, Semanticsbased, and Vector Representations (SySeVR), focuses on obtaining program representations that can accommodate syntax and semantic information pertinent to vulnerabilities. Our experiments with 4 software products demonstrate the usefulness of the framework: we detect 15 vulnerabilities that are not reported in the National Vulnerability Database. Among these 15 vulnerabilities, 7 are unknown and have been reported to the vendors, and the other 8 have been ‘silently’ patched by the vendors when releasing newer versions of the products. 
Syslog  Syslog has been around for a number of decades and provides a protocol used for transporting event messages between computer systems and software applications. The protocol utilizes a layered architecture, which allows the use of any number of transport protocols for transmission of syslog messages. It also provides a message format that allows vendorspecific extensions to be provided in a structured way. Syslog is now standardized by the IETF in RFC 5424 (since 2009), but has been around since the 80’s and for many years served as the de facto standard for logging without any authoritative published specification. Best practices often promote storing log messages on a centralized server that can provide a correlated view on all the log data generated by different system components. Otherwise, analyzing each log file separately and then manually linking each related log message is extremely timeconsuming. As a result, forwarding local log messages to a remote log analytics server/service via Syslog has been commonly adopted as a standard industrial logging solution. 
System G  Motivated by the need to extract knowledge and value frominterconnected data, graph analytics on big data is a veryactive area of research in both industry and academia. Tosupport graph analytics efficiently a large number of in memory graph libraries, graph processing systems and graphdatabases have emerged. Projects in each of these categories focus on particular aspects such as static versus dynamic graphs, off line versus on line processing, small versuslarge graphs, etc. While there has been much advance in graph processingin the past decades, there is still a need for a fast graph processing, using a cluster of machines with distributed storage. In this paper, we discuss a novel distributed graph database called System G designed for efficient graph data storage andprocessing on modern computing architectures. In particular we describe a single node graph database and a runtimeand communication layer that allows us to compose a distributed graph database from multiple single node instances. From various industry requirements, we find that fast insertions and large volume concurrent queries are critical partsof the graph databases and we optimize our database forsuch features. We experimentally show the efficiency of System G for storing data and processing graph queries onstateoftheart platforms. 
Systematic Compositionality  Systematic compositionality is the ability to recombine meaningful units with regular and predictable outcomes, and it’s seen as key to humans’ capacity for generalization in language. 
Systemic Net  Beyond Bags of Words: Inferring Systemic Nets 
Systems Of Insight  Systems of insight are the business discipline and technology to harness insights and turn data into action. Systems of insight deliver what big data cannot – effective action through insights driven software; after all that’s the only thing firms really care about. 
Advertisements