Clustering of nonstationary data streams: A survey of fuzzy partitional methods

Data streams have arisen as a relevant research topic during the past decade. They are real-time, incremental in nature, temporally ordered, massive, contain outliers, and the objects in a data stream may evolve over time (concept drift). Clustering is often one of the earliest and most important steps in the streaming data analysis workflow. A comprehensive literature is available about stream data clustering; however, less attention is devoted to the fuzzy clustering approach, even though the nonstationary nature of many data streams makes it especially appealing. This survey discusses relevant data stream clustering algorithms focusing mainly on fuzzy methods, including their treatment of outliers and concept drift and shift.
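To make the idea of fuzzy partitioning concrete, here is a minimal sketch of the standard fuzzy c-means membership formula. It is not taken from the survey and does not model any streaming behavior. It assumes 1-D points, fixed cluster centers, and a fuzzifier `m` greater than 1; the function name `fuzzy_memberships` is hypothetical.

```python
def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy c-means style membership of a 1-D point in each cluster.

    m > 1 is the fuzzifier; the returned memberships are non-negative
    and sum to 1.
    """
    dists = [abs(point - c) for c in centers]
    # A point sitting exactly on a center belongs fully to that center
    # (shared equally if several centers coincide with the point).
    if 0.0 in dists:
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1))
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]
```

Unlike a hard assignment, a point lying between two centers receives a graded membership in both. This is one reason fuzzy methods can be attractive when cluster boundaries move over time.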

Anomaly Detection in R – The Tidy Way

In data science, as important as it is to find patterns that repeat, it is equally important to find anomalies that break them. This matters especially when we are working with time series data, that is, data recorded over time.

Deep Learning from first principles in Python, R and Octave – Part 6

In this 6th instalment of ‘Deep Learning from first principles in Python, R and Octave-Part6’, I look at a couple of different initialization techniques used in Deep Learning, L2 regularization and the ‘dropout’ method. Specifically, I implement “He initialization” & “Xavier Initialization”.
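As a rough illustration of the two initialization schemes named above, here is a minimal Python sketch. It is not the post's own implementation. It assumes the common formulation in which weights are drawn from a zero-mean Gaussian with standard deviation sqrt(2 / fan_in) for He initialization and sqrt(1 / fan_in) for Xavier initialization; other Xavier variants exist, for example ones based on fan_in + fan_out. The function names `he_init` and `xavier_init` are hypothetical.

```python
import math
import random


def _gaussian_matrix(fan_out, fan_in, std, rng):
    """Return a fan_out x fan_in nested list of N(0, std^2) samples."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def he_init(fan_out, fan_in, rng=None):
    """He initialization (assumed variant): std = sqrt(2 / fan_in)."""
    rng = rng or random.Random()
    return _gaussian_matrix(fan_out, fan_in, math.sqrt(2.0 / fan_in), rng)


def xavier_init(fan_out, fan_in, rng=None):
    """Xavier initialization (assumed variant): std = sqrt(1 / fan_in)."""
    rng = rng or random.Random()
    return _gaussian_matrix(fan_out, fan_in, math.sqrt(1.0 / fan_in), rng)
```

Both functions scale the spread of the weights by the number of inputs to the layer, so that activations neither blow up nor vanish as depth grows. The larger factor in the He variant is usually motivated by ReLU activations.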

News from the R Consortium

The R Consortium has been quite busy lately, so I thought I’d take a moment to bring you up to speed on some recent news.

Key Algorithms and Statistical Models for Aspiring Data Scientists

As a data scientist who has been in the profession for several years now, I am often approached for career advice or guidance in course selection related to machine learning by students and career switchers on LinkedIn and Quora. Some questions revolve around educational paths and program selection, but many questions focus on what sort of algorithms or models are common in data science today. With a glut of algorithms from which to choose, it’s hard to know where to start. Courses may include algorithms that aren’t typically used in industry today, and courses may exclude very useful methods that aren’t trending at the moment. Software-based programs may exclude important statistical concepts, and mathematically-based programs may skip over some of the key topics in algorithm design.

When Do We Trust Machines?

We propose a ‘trust heatmap’ framework, show how trust in machines depends on two key elements (their error rate and the cost of mistakes), and examine the automation frontier.

Social network analysis: An overview

Social network analysis (SNA) is a core pursuit in the analysis of social networks today. In addition to the usual statistical techniques of data analysis, these networks are investigated using SNA measures. SNA helps in understanding the dependencies between social entities in the data, and in characterizing their behaviors and their effect on the network as a whole and over time. Therefore, this article attempts to provide a succinct overview of SNA in diverse topological networks (static, temporal, and evolving networks) and perspectives (ego-networks). As one of the primary applications of SNA is in networked data mining, we provide a brief overview of network mining models as well; by this, we present the readers with a concise guided tour from analysis to mining of networks.

An overview on the evolution and adoption of deep learning applications used in the industry

With continuous improvements in performance over the years, microprocessors now possess the capabilities of the supercomputers of an earlier decade. Furthermore, the continuous increase in packaging density on silicon, together with General Purpose Graphics Processing Unit (GPGPU) enhancements, has led to the utilization of deep learning (DL) techniques, which had lost steam during the last decade. A GPGPU is a parallel programming setup using a combination of GPUs and CPUs that can manipulate large matrices. Interestingly, GPUs were created for faster graphics processing but found their way into scientific computing. DL is a subset of the artificial intelligence (AI) domain and falls specifically under the set of machine learning (ML) techniques that are based on learning data representations rather than task-specific algorithms. It has been observed that the accuracy and practicality of deploying DL at massive scale were restricted by the technological issues of executing DL-based AI models, with extremely long training sessions running into weeks. DL applications can solve problems of very large order. Computer vision/image processing was one of the early successes, and DL is becoming quite a sensation in many other areas, such as natural language processing (NLP) with state-of-the-art real-time translation capabilities, automatic game playing, optical character recognition (especially of handwritten text), and so on. This overview traverses the evolution and successful adoption of DL in various industry verticals.