Hourly Forecasting of Emergency Department Arrivals: Time Series Analysis

Background: The stochastic behavior of patient arrivals at an emergency department (ED) complicates ED management. More than 50% of hospital EDs tend to operate beyond their normal capacity and eventually fail to deliver high-quality care. To address the concern of stochastic ED arrivals, much research has been done using yearly, monthly, and weekly time-series forecasting. Aim: Our research team believes that hourly time-series forecasting of the load can improve ED management by predicting the arrivals of future patients and can thus support strategic decisions in terms of quality enhancement. Methods: Our research does not involve any human subjects, only ED admission data from January 2014 to August 2017 retrieved from the UnityPoint Health database. Autoregressive integrated moving average (ARIMA), Holt-Winters, TBATS, and neural network methods were implemented to forecast hourly ED patient arrivals. Findings: ARIMA (3,0,0) (2,1,0) was selected as the best-fit model, with the minimum Akaike information criterion and Schwarz Bayesian criterion. The model was stationary and passed the Box-Ljung correlation test and the Jarque-Bera test for normality. The mean error (ME) and root mean square error (RMSE) were selected as performance measures. An ME of 1.001 and an RMSE of 1.55 were obtained. Conclusions: ARIMA can be used to provide hourly forecasts for ED arrivals and can be utilized as a decision support system in the healthcare industry. Application: This technique can be implemented in hospitals worldwide to predict ED patient arrivals.
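
The two performance measures named in the abstract above can be sketched as follows. This is a minimal illustration, not the study's code: the arrival counts are hypothetical, and the sign convention (actual minus forecast) is an assumption, since the abstract does not state one.

```python
import math


def mean_error(actual, forecast):
    """Mean error (ME): average signed error, here taken as actual - forecast."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return sum(errors) / len(errors)


def root_mean_square_error(actual, forecast):
    """Root mean square error (RMSE): square root of the mean squared error."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical hourly arrival counts and forecasts (not data from the study).
actual = [4, 6, 5, 9, 7, 3]
forecast = [5, 5, 6, 7, 7, 4]

me = mean_error(actual, forecast)
rmse = root_mean_square_error(actual, forecast)
print(f"ME={me:.3f} RMSE={rmse:.3f}")
```

Because ME averages signed errors, over- and under-forecasts can cancel out, so it mainly indicates bias; RMSE penalizes large misses and is the better indicator of typical error size.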


A Comprehensive guide to Bayesian Convolutional Neural Network with Variational Inference

Artificial Neural Networks are connectionist systems that perform a given task by learning from examples without prior knowledge about the task. This is done by finding an optimal point estimate for the weights in every node. Generally, networks using point estimates as weights perform well with large datasets, but they fail to express uncertainty in regions with little or no data, leading to overconfident decisions. In this paper, a Bayesian Convolutional Neural Network (BayesCNN) using Variational Inference is proposed, which introduces a probability distribution over the weights. Furthermore, the proposed BayesCNN architecture is applied to tasks such as Image Classification, Image Super-Resolution, and Generative Adversarial Networks. The results are compared to point-estimate-based architectures on the MNIST, CIFAR-10, and CIFAR-100 datasets for the Image Classification task, on the BSD300 dataset for the Image Super-Resolution task, and on the CIFAR-10 dataset again for the Generative Adversarial Network task. BayesCNN is based on Bayes by Backprop, which derives a variational approximation to the true posterior. We therefore introduce the idea of applying two convolutional operations, one for the mean and one for the variance. Our proposed method not only achieves performance equivalent to frequentist inference in identical architectures but also incorporates a measurement of uncertainty and regularisation. It further eliminates the use of dropout in the model. Moreover, we predict how certain the model prediction is based on the epistemic and aleatoric uncertainties and empirically show how the uncertainty can decrease, allowing the decisions made by the network to become more deterministic as the training accuracy increases. Finally, we propose ways to prune the Bayesian architecture and to make it more computationally and time efficient.
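
One way to read "two convolutional operations, one for the mean and one for the variance" is sketched below. It assumes independent Gaussian weights, so that each output is itself Gaussian, with a mean given by convolving the input with the weight means and a variance given by convolving the squared input with the weight variances. The 1-D setting, the function names, and the numbers are illustrative assumptions, not the paper's implementation.

```python
import math
import random


def conv1d(signal, kernel):
    """Valid 1-D cross-correlation, the 'convolution' used in CNN layers."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def bayes_conv1d(signal, weight_mu, weight_sigma, rng):
    """Sample the output of a conv layer whose weights are N(mu, sigma^2)."""
    # First convolution: mean of the output activations.
    out_mean = conv1d(signal, weight_mu)
    # Second convolution: variance of the output activations
    # (squared inputs convolved with weight variances).
    out_var = conv1d([x * x for x in signal], [s * s for s in weight_sigma])
    # Reparameterised sample: mean + std * standard normal noise.
    sample = [m + math.sqrt(v) * rng.gauss(0.0, 1.0)
              for m, v in zip(out_mean, out_var)]
    return out_mean, out_var, sample


signal = [1.0, 2.0, 3.0, 4.0]   # hypothetical input
weight_mu = [0.5, -0.5]         # hypothetical variational means
weight_sigma = [0.1, 0.2]       # hypothetical variational std deviations
out_mean, out_var, sample = bayes_conv1d(
    signal, weight_mu, weight_sigma, random.Random(0))
print(out_mean, out_var, sample)
```

Sampling activations rather than individual weights needs only one noise draw per output, which is the usual motivation for this formulation.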


Autoencoders and Generative Adversarial Networks for Anomaly Detection for Sequences

We introduce synthetic oversampling in anomaly detection for multi-feature sequence datasets based on autoencoders and generative adversarial networks. The first approach considers the use of an autoencoder in conjunction with standard oversampling methods to generate synthetic data that captures the sequential nature of the data. A different model uses generative adversarial networks to generate structure preserving synthetic data for the minority class. We also use generative adversarial networks on the majority class as an outlier detection method for novelty detection. We show that the use of generative adversarial network based synthetic data improves classification model performance on a variety of sequence data sets.


Supervised Transfer Learning for Product Information Question Answering

Popular e-commerce websites such as Amazon offer community question answering systems in which users pose product-related questions and experienced customers may voluntarily provide answers. In this paper, we show that the large volume of existing community question answering data can be beneficial when building a system for answering questions related to product facts and specifications. Our experimental results demonstrate that the performance of a model for answering questions related to products listed on the Home Depot website can be improved by a large margin via a simple transfer learning technique from an existing large-scale Amazon community question answering dataset. Transfer learning can result in an increase of about 10% in accuracy in the experimental setting where we restrict the size of the target-task data used for training. As an application of this work, we integrate the best-performing model trained in this work into a mobile-based shopping assistant and show its usefulness.


High Fidelity Vector Space Models of Structured Data

Machine learning systems regularly deal with structured data in real-world applications. Unfortunately, such data has been difficult to faithfully represent in a way that most machine learning techniques would expect, i.e. as a real-valued vector of a fixed, pre-specified size. In this work, we introduce a novel approach that compiles structured data into a satisfiability problem which has in its set of solutions at least (and often only) the input data. The satisfiability problem is constructed from constraints which are generated automatically a priori from a given signature, thus trivially allowing for a bag-of-words-esque vector representation of the input to be constructed. The method is demonstrated in two areas, automated reasoning and natural language processing, where it is shown to be near-perfect in producing vector representations of natural-language sentences and first-order logic clauses that can be translated back to their original, structured input forms.


Hybrid Rebeca: Modeling and Analyzing of Cyber-Physical Systems

In cyber-physical systems such as automotive systems, components such as sensors, actuators, and controllers communicate asynchronously with each other. The actor model of computation supports the modeling of distributed, asynchronously communicating systems. We propose the Hybrid Rebeca language to support the modeling of cyber-physical systems. Hybrid Rebeca is an extension of the actor-based language Rebeca. In this extension, physical actors are introduced as new computational entities to encapsulate physical behaviors. To support various means of communication among the entities, the network is explicitly modeled as an entity separate from the actors. We derive hybrid automata as the basis for the analysis of Hybrid Rebeca models. We demonstrate the applicability of our approach through a case study in the domain of automotive systems. We use the SpaceEx framework for the analysis of the case study.


D${}^3$TW: Discriminative Differentiable Dynamic Time Warping for Weakly Supervised Action Alignment and Segmentation

We address weakly-supervised action alignment and segmentation in videos, where only the order of occurring actions is available during training. We propose Discriminative Differentiable Dynamic Time Warping (D${}^3$TW), which is the first discriminative model for weak ordering supervision. This allows us to bypass the degenerated sequence problem usually encountered in previous work. The key technical challenge for discriminative modeling with weak-supervision is that the loss function of the ordering supervision is usually formulated using dynamic programming and is thus not differentiable. We address this challenge by continuous relaxation of the min-operator in dynamic programming and extend the DTW alignment loss to be differentiable. The proposed D${}^3$TW innovatively solves sequence alignment with discriminative modeling and end-to-end training, which substantially improves the performance in weakly supervised action alignment and segmentation tasks. We show that our model outperforms the current state-of-the-art across three evaluation metrics in two challenging datasets.
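
The continuous relaxation of the min-operator mentioned above can be illustrated with a log-sum-exp "soft minimum" inside the standard DTW recursion. This is a generic sketch under that assumption: the smoothing parameter `gamma`, the cost matrix, and the function names are hypothetical, and the discriminative part of the loss is not shown.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    scaled = [-v / gamma for v in values]
    top = max(scaled)
    return -gamma * (top + math.log(sum(math.exp(s - top) for s in scaled)))


def dtw(cost, min_op):
    """DTW alignment cost over a pairwise cost matrix, with a pluggable min."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Best predecessor among the three allowed DTW moves.
            r[i][j] = cost[i - 1][j - 1] + min_op(
                [r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]])
    return r[n][m]


# Hypothetical frame-to-action cost matrix.
cost = [[0.0, 1.0, 2.0],
        [1.0, 0.0, 1.0],
        [2.0, 1.0, 0.0]]

hard = dtw(cost, min)
soft = dtw(cost, lambda v: soft_min(v, gamma=1.0))
nearly_hard = dtw(cost, lambda v: soft_min(v, gamma=0.01))
print(hard, soft, nearly_hard)
```

The soft minimum never exceeds the true minimum and approaches it as `gamma` shrinks, so the relaxed loss stays close to the original while having gradients with respect to every entry of the cost matrix.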


The Universal model and prior: multinomial GLMs

This paper generalises the exponential family GLM to allow arbitrary distributions for the response variable. This is achieved by combining the model-assisted regression approach from survey sampling with the GLM scoring algorithm, weighted by random draws from the posterior Dirichlet distribution of the support point probabilities of the multinomial distribution. The generalisation provides fully Bayesian analyses from the posterior sampling, without MCMC. Several examples using published GLM data sets are given. The approach can be extended widely: an example of a GLMM extension is given.


What do Language Representations Really Represent?

A neural language model trained on a text corpus can be used to induce distributed representations of words, such that similar words end up with similar representations. If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations. We show that this holds even when the multilingual corpus has been translated into English, by picking up the faint signal left by the source languages. However, just as it is a thorny problem to separate semantic from syntactic similarity in word representations, it is not obvious what type of similarity is captured by language representations. We investigate correlations and causal relationships between language representations learned from translations on the one hand, and genetic, geographical, and several levels of structural similarity between languages on the other. Of these, structural similarity is found to correlate most strongly with language representation similarity, while genetic relationships, a convenient benchmark used for evaluation in previous work, appear to be a confounding factor. Apart from implications about translation effects, we see this more generally as a case where NLP and linguistic typology can interact and benefit one another.


Change Detection and Notification of Webpages: A Survey

The majority of currently available webpages are dynamic in nature and change frequently. New content gets added to webpages, and existing content gets updated or deleted. Hence, people find it useful to be alerted to changes in webpages that contain information valuable to them. In the current context, keeping track of these webpages and getting alerts about different changes has become significantly challenging. Change Detection and Notification (CDN) systems were introduced to automate this monitoring process and notify users when changes occur in webpages. This survey classifies and analyzes different aspects of CDN systems and the different techniques used for each aspect. Furthermore, the survey highlights current challenges and areas of improvement within the field of research.


Is it Time to Swish? Comparing Deep Learning Activation Functions Across NLP tasks

Activation functions play a crucial role in neural networks because they are the nonlinearities which have been attributed to the success story of deep learning. One of the currently most popular activation functions is ReLU, but several competitors have recently been proposed or ‘discovered’, including LReLU functions and swish. While most works compare newly proposed activation functions on few tasks (usually from image classification) and against few competitors (usually ReLU), we perform the first large-scale comparison of 21 activation functions across eight different NLP tasks. We find that a largely unknown activation function performs most stably across all tasks, the so-called penalized tanh function. We also show that it can successfully replace the sigmoid and tanh gates in LSTM cells, leading to a 2 percentage point (pp) improvement over the standard choices on a challenging NLP task.
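
For orientation, the activation functions named in the abstract above can be written out as below. The definitions follow the forms commonly given in the literature; the leaky slope of 0.01 and the penalty factor of 0.25 are assumed defaults, not values confirmed by the abstract.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """LReLU: small linear slope for negative inputs (slope is an assumption)."""
    return x if x > 0 else slope * x


def swish(x):
    """Swish: the input scaled by its own sigmoid."""
    return x * sigmoid(x)


def penalized_tanh(x, penalty=0.25):
    """Penalized tanh: tanh with the negative branch scaled down
    (penalty=0.25 is an assumed default)."""
    return math.tanh(x) if x > 0 else penalty * math.tanh(x)


for value in (-2.0, 0.0, 2.0):
    print(value, relu(value), leaky_relu(value),
          round(swish(value), 4), round(penalized_tanh(value), 4))
```

Unlike ReLU, penalized tanh is bounded on both sides, which is one plausible reason it could stand in for the saturating gates of an LSTM cell.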


Transfer Representation Learning with TSK Fuzzy System

Transfer learning can address learning tasks on unlabeled data in the target domain by leveraging plenty of labeled data from a different but related source domain. A core issue in transfer learning is to learn a shared feature space in which the distributions of the data from the two domains are matched. This learning process can be called transfer representation learning (TRL). Feature transformation methods are crucial to the success of TRL. The most commonly used feature transformation method in TRL is kernel-based nonlinear mapping to a high-dimensional space followed by linear dimensionality reduction. However, kernel functions lack interpretability and are difficult to select. To this end, the TSK fuzzy system (TSK-FS) is combined with transfer learning, and a more intuitive and interpretable modeling method, called transfer representation learning with TSK-FS (TRL-TSK-FS), is proposed in this paper. Specifically, TRL-TSK-FS realizes TRL from two aspects. On the one hand, the data in the source and target domains are transformed into a fuzzy feature space in which the distribution distance between the two domains is minimized. On the other hand, discriminant information and geometric properties of the data are preserved by linear discriminant analysis and principal component analysis. In addition, the proposed method has another advantage: the nonlinear transformation is realized by constructing a fuzzy mapping with the antecedent part of the TSK-FS instead of kernel functions, which are difficult to select. Extensive experiments are conducted on text and image datasets. The results clearly show the superiority of the proposed method.


Dirichlet Variational Autoencoder

This paper proposes the Dirichlet Variational Autoencoder (DirVAE), which uses a Dirichlet prior for a continuous latent variable that exhibits the characteristics of categorical probabilities. To infer the parameters of DirVAE, we utilize the stochastic gradient method by approximating the Gamma distribution, which is a component of the Dirichlet distribution, with the inverse Gamma CDF approximation. Additionally, we reshape the component collapsing issue by investigating two problem sources, decoder weight collapsing and latent value collapsing, and we show that DirVAE has no component collapsing, while Gaussian VAE exhibits decoder weight collapsing and Stick-Breaking VAE shows latent value collapsing. The experimental results show that 1) DirVAE models the latent representation with the best log-likelihood compared to the baselines; and 2) DirVAE produces more interpretable latent values with none of the collapsing issues from which the baseline models suffer. Also, we show that the latent representation learned by DirVAE achieves the best classification accuracy in the semi-supervised and supervised classification tasks on MNIST, OMNIGLOT, and SVHN compared to the baseline VAEs. Finally, we demonstrate that DirVAE-augmented topic models show better performance in most cases.
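
The link between the Gamma and Dirichlet distributions used above can be sketched as follows: independent Gamma draws, normalised to sum to one, form a Dirichlet sample. The inverse-CDF formula below is the standard small-value approximation of the Gamma CDF, which makes each draw a differentiable function of the shape parameter. Whether it matches the paper's exact formulation is an assumption, and it is only accurate for small shape values.

```python
import math
import random


def approx_inverse_gamma_cdf(u, alpha, beta=1.0):
    """Approximate inverse CDF of Gamma(shape=alpha, rate=beta).

    Near zero the Gamma CDF behaves like (beta * x)**alpha / (alpha * Gamma(alpha));
    inverting that expression gives the formula below.
    """
    return (u * alpha * math.gamma(alpha)) ** (1.0 / alpha) / beta


def sample_dirichlet(alphas, rng):
    """Draw one approximate Dirichlet sample by normalising Gamma draws."""
    # 1 - rng.random() lies in (0, 1], which avoids an all-zero draw.
    draws = [approx_inverse_gamma_cdf(1.0 - rng.random(), a) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
alphas = [0.3, 0.3, 0.3]  # hypothetical concentration parameters
latent = sample_dirichlet(alphas, rng)
print(latent, sum(latent))
```

Because the uniform noise `u` is independent of `alpha`, gradients can flow through the sample to the concentration parameters, which is what stochastic gradient training of such a model requires.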


Fast Newton Method for Sparse Logistic Regression

Sparse logistic regression has developed tremendously over the last two decades, from its origin in the \ell_1-regularized version of Tibshirani (1996) to the sparsity-constrained models of Bahmani, Raj, and Boufounos (2013) and Plan and Vershynin (2013). This paper addresses sparsity-constrained logistic regression through the classical Newton method. We begin by analysing its first-order optimality condition to obtain a strong \tau-stationary point for some \tau>0. This point enables us to derive an equivalent stationary equation system that can be solved efficiently by the Newton method. The proposed method, FNSLR, an abbreviation for fast Newton method for sparse logistic regression, enjoys very low computational complexity, a local quadratic convergence rate, and termination within finitely many steps. Numerical experiments on random and real data demonstrate its superior performance against seven state-of-the-art solvers.
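
For readers unfamiliar with the setting, the sparsity-constrained problem and a stationarity condition of the kind the abstract refers to can be written as below. The notation is an assumption, not taken from the paper: samples a_i with labels y_i in {0, 1}, sparsity level s, and \mathcal{P}_s denoting projection onto the set of s-sparse vectors (keeping the s largest entries in magnitude). The paper's "strong" variant of \tau-stationarity may impose additional requirements.

```latex
\[
  \min_{x \in \mathbb{R}^p} \; f(x) := \frac{1}{n} \sum_{i=1}^{n}
    \Big[ \log\big(1 + e^{\langle a_i, x \rangle}\big) - y_i \langle a_i, x \rangle \Big]
  \quad \text{s.t.} \quad \|x\|_0 \le s
\]
\[
  x \in \mathcal{P}_s\big(x - \tau \nabla f(x)\big), \qquad \tau > 0
\]
```

The second line is a fixed-point condition: a projected gradient step of size \tau leaves x unchanged, and rewriting it as a system of equations on the support of x is what makes a Newton-type solver applicable.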


Causal mediation analysis for stochastic interventions

Mediation analysis in causal inference has traditionally focused on binary treatment regimes and deterministic interventions, and a decomposition of the average treatment effect in terms of direct and indirect effects. In this paper we present an analogous decomposition of the \textit{population intervention effect}, defined through stochastic interventions. Population intervention effects provide a generalized framework in which a variety of interesting causal contrasts can be defined, including effects for continuous and categorical exposures. We show that identification of direct and indirect effects for the population intervention effect requires weaker assumptions than its average treatment effect counterpart. In particular, identification of direct effects is guaranteed in experiments that randomize the treatment and the mediator. We discuss various estimators of the direct and indirect effects, including substitution, re-weighted, and efficient estimators based on flexible regression techniques. Our efficient estimator is asymptotically linear under a condition requiring n^{1/4}-consistency of certain regression functions. We perform a simulation study in which we assess the finite-sample properties of our proposed estimators. We present the results of an illustrative study where we assess the effect of participation in a sports team on BMI among children, using mediators such as exercise habits, daily consumption of snacks, and overweight status.


Sentiment Analysis of Czech Texts: An Algorithmic Survey

In the area of online communication, commerce and transactions, analyzing sentiment polarity of texts written in various natural languages has become crucial. While there have been a lot of contributions in resources and studies for the English language, ‘smaller’ languages like Czech have not received much attention. In this survey, we explore the effectiveness of many existing machine learning algorithms for sentiment analysis of Czech Facebook posts and product reviews. We report the sets of optimal parameter values for each algorithm and the scores in both datasets. We finally observe that support vector machines are the best classifier and efforts to increase performance even more with bagging, boosting or voting ensemble schemes fail to do so.


A Potential Outcomes Approach to Answer Reviewing in Multiple-Choice Exams

Does reviewing previous answers during multiple-choice exams help examinees increase their final score? This article formalizes the question using a rigorous causal framework, the potential outcomes framework. Viewing examinees’ reviewing status as a treatment and their final score as an outcome, the article first explains the challenges of identifying the causal effect of answer reviewing in regular exam-taking settings. In addition to the incapability of randomizing the treatment selection (reviewing status) and the lack of other information to make this selection process ignorable, the treatment variable itself is not fully known to researchers. Looking at examinees’ answer sheet data, it is unclear whether an examinee who did not change his or her answer on a specific item reviewed it but retained the initial answer (treatment condition) or chose not to review it (control condition). Despite such challenges, however, the article develops partial identification strategies and shows that the sign of the answer reviewing effect can be reasonably inferred. By analyzing a statewide math assessment data set, the article finds that reviewing initial answers is generally beneficial for examinees.


Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Transformer networks have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. As a solution, we propose a novel neural architecture, \textit{Transformer-XL}, that enables Transformer to learn dependency beyond a fixed length without disrupting temporal coherence. Concretely, it consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the problem of context fragmentation. As a result, Transformer-XL learns dependency that is about 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformer during evaluation. Additionally, we improve the state-of-the-art (SoTA) results of bpc/perplexity from 1.06 to 0.99 on enwiki8, from 1.13 to 1.08 on text8, from 20.5 to 18.3 on WikiText-103, from 23.7 to 21.8 on One Billion Word, and from 55.3 to 54.5 on Penn Treebank (without finetuning). Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.
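
The bookkeeping behind a segment-level recurrence mechanism can be illustrated with a toy sketch: each segment is processed together with a cached memory of earlier positions, so the visible context is not cut at segment boundaries. Here plain token ids stand in for cached hidden states, and `segment_len` and `mem_len` are hypothetical parameter names; the attention computation and the positional encoding scheme are not shown.

```python
def segments_with_memory(tokens, segment_len, mem_len):
    """Yield (segment, context) pairs, where context = cached memory + segment."""
    memory = []
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        # The current segment can look back into the cached memory.
        context = memory + segment
        yield segment, context
        # Keep only the most recent mem_len positions for the next segment.
        memory = context[-mem_len:] if mem_len > 0 else []


tokens = list(range(6))  # hypothetical token ids
contexts = [ctx for _, ctx in
            segments_with_memory(tokens, segment_len=2, mem_len=3)]
print(contexts)
```

With `mem_len=0` every segment sees only itself, which corresponds to the fragmented fixed-length setting the abstract describes as the limitation being addressed.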


A Constructive Approach for One-Shot Training of Neural Networks Using Hypercube-Based Topological Coverings

In this paper we present a novel constructive approach for training deep neural networks using geometric approaches. We show that a topological covering can be used to define a class of distributed linear matrix inequalities, which in turn directly specify the shape and depth of a neural network architecture. The key insight is a fundamental relationship between linear matrix inequalities and their ability to bound the shape of data, and the rectified linear unit (ReLU) activation function employed in modern neural networks. We show that unit cover geometry and cover porosity are two design variables in cover-constructive learning that play a critical role in defining the complexity of the model and the generalizability of the resulting neural network classifier. In the context of cover-constructive learning, these findings underscore the age-old trade-off between model complexity and overfitting (as quantified by the number of elements in the data cover) and generalizability on test data. Finally, we benchmark our algorithm on the Iris, MNIST, and Wine datasets and show that the constructive algorithm is able to train a deep neural network classifier in one shot, achieving equal or superior levels of training and test classification accuracy with reduced training time.