Practical and theoretical aspects of mixture-of-experts modeling: An overview

Mixture-of-experts (MoE) models are a powerful paradigm for modeling data arising from complex data generating processes (DGPs). In this article, we demonstrate how different MoE models can be constructed to approximate the underlying DGPs of arbitrary types of data. Due to the probabilistic nature of MoE models, we propose the maximum quasi-likelihood (MQL) approach as a method for estimating MoE model parameters from data, and we provide conditions under which MQL estimators are consistent and asymptotically normal. The blockwise minorization-maximization (blockwise-MM) algorithm framework is proposed as an all-purpose method for constructing algorithms for obtaining MQL estimators. An example derivation of a blockwise-MM algorithm is provided. We then present a method for constructing information criteria for estimating the number of components in MoE models and provide justification for the classic Bayesian information criterion (BIC). We explain how MoE models can be used to conduct classification, clustering, and regression and illustrate these applications via two worked examples.
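As a rough illustration of the BIC-based choice of the number of components mentioned above, here is a minimal sketch using the standard definition BIC = -2 log L + k log n. The function names, the candidate fits, and their log-likelihood values are hypothetical and do not come from the article; the maximized (quasi-)log-likelihoods are assumed to have been obtained elsewhere, for example from a blockwise-MM fit.

```python
import math


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Standard BIC: -2 * log-likelihood + (free parameters) * log(sample size)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def select_n_components(fits: dict, n_obs: int):
    """Pick the candidate number of components with the smallest BIC.

    `fits` maps a candidate component count to a tuple of
    (maximized log-likelihood, number of free parameters).
    """
    scores = {g: bic(ll, k, n_obs) for g, (ll, k) in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fitted values for 1, 2, and 3 components (illustration only).
fits = {1: (-520.0, 3), 2: (-480.0, 8), 3: (-478.0, 13)}
best, scores = select_n_components(fits, n_obs=200)
print(best, scores)
```

In this made-up example, the third component improves the log-likelihood only slightly, so the penalty term outweighs the gain and the two-component model is selected.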

Cloud-based data streams optimization

Many modern applications of sensor networks and transaction analysis require real-time processing of their stream data sets. These data streams vary continuously over time. Current stream processing approaches focus on only one of two optimization perspectives: they either propose optimization techniques for data stream processing regardless of the processing environment, or they improve the processing environment only. In this paper, a brief survey of recent approaches to data stream processing from both optimization perspectives is presented, along with their shortcomings. Then, an innovative and integrative framework is proposed; it is referred to as the continuous query optimization based on multiple plans (CQOMP) for data streams over the cloud environment. CQOMP combines the two optimization perspectives and provides optimized processing of stream clusters using multiple split query plans. Each plan is constructed for a cluster of data with the most similar characteristics, and it processes stream tuples over the cloud. We also propose a novel algorithm called the optimized multiple plans (OMP) for processing data stream clusters on cloud computing. The OMP algorithm efficiently divides data streams and generates optimized multiple split plans. Each plan processes a group of data streams on the cloud. We present the experimental results of the OMP solution compared to alternative state-of-the-art data stream approaches. The experiments show the efficiency and the scalability of the combined OMP algorithm on different cloud environments: the real Amazon cloud environment and the simulated Windows Azure cloud environment.

How to Access Datasets in R

Have you spent hours pulling your hair out trying to figure out how to access datasets in R? Once imported into a variable, columns from a dataset (e.g., a CSV file) can be very tricky to access. Sometimes columns contain spaces, funky characters, or other inconsistencies. Here are some examples of how to access the data from CSV and JSON datasets.

A dynamic-adversarial mining approach to the security of machine learning

Operating in a dynamic real-world environment requires a forward-thinking and adversarial-aware design for classifiers, beyond fitting the model to the training data. In such scenarios, it is necessary to design classifiers that are (a) harder to evade, (b) able to more easily detect changes in the data distribution over time, and (c) able to retrain and recover from model degradation. While most works in the security of machine learning have concentrated on the evasion resistance problem (a), there is little work in the areas of reacting to attacks (b) and (c). Additionally, while streaming data research concentrates on the ability to react to changes in the data distribution, it often takes an adversarial-agnostic view of the security problem. This makes such approaches vulnerable to adversarial activity that is aimed at evading the concept drift detection mechanism itself. In this paper, we analyze the security of machine learning from a dynamic and adversarial-aware perspective. The existing techniques of restrictive one-class classifier models, complex learning-based ensemble models, and randomization-based ensemble models are shown to be myopic, as they approach security as a static task. These methodologies are ill-suited for a dynamic environment, as they leak excessive information to an adversary who can subsequently launch attacks that are indistinguishable from the benign data. Based on empirical vulnerability analysis against a sophisticated adversary, a novel feature importance hiding approach for classifier design is proposed. The proposed design ensures that future attacks on classifiers can be detected and recovered from. The proposed work provides motivation, by serving as a blueprint, for future work in the area of dynamic-adversarial mining, which combines lessons learned from streaming data mining, adversarial learning, and cybersecurity.

Kaggle data science survey data analysis using Highcharter

In this article, you will explore the data from the Kaggle data science survey conducted in 2017. Kaggle ran this worldwide survey to learn about the state of data science and machine learning. The survey received over 16,000 responses, and one can learn a ton about who is working with data, what’s happening at the cutting edge of machine learning across industries, and how new data scientists can best break into the field.

A Basic Recipe for Machine Learning

Ever since wrapping up the three Deep Learning courses by Andrew Ng, I’ve been meaning to write down some of the gems that he highlighted throughout the courses. One of the nice ones that I felt needed to be written down is his general recipe for approaching a deep learning algorithm/model.