The Fundamentals of Data Science

The easiest thing for people within the big data / analytics / data science disciplines is to say “I do data science”. However, when it comes to data science fundamentals, we need to ask the following critical questions: What really is “data”, what are we trying to do with data, and how do we apply scientific principles to achieve our goals with data?
• What is Data?
• The Goal of Data Science
• The Scientific Method

• Probability and Statistics
• Decision Theory
• Estimation Theory
• Coordinate Systems
• Linear Transformations
• Effects of Computation on Data
• Prototype Coding / Programming
• Graph Theory
• Algorithms
• Machine Learning

Connecting the dots for a Deep Learning App …

Our day-to-day activities are filled with emotions and sentiments. Ever wondered how computers can identify these sentiments?

Dot 0: Deep Learning in Sentiment Analysis
Dot 1: Data Preparation
Dot 2: Baseline Model
Dot 3: Experimentation of Different Model Architectures
Dot 4: CNN-GRU Architecture
Dot 5: App

Gradient boosting in R

Boosting is another well-known ensemble learning technique. Unlike bagging, where the aim is to reduce the high variance of learners by averaging many models fitted on bootstrap samples drawn with replacement from the training data, so as to avoid overfitting, boosting is not primarily concerned with reducing variance. Another major difference is that in bagging the models are generated independently of each other and carry equal weight, whereas boosting is a sequential process in which each new model is added to improve a little on the previous one. Put simply, each model added to the mix is meant to improve on the performance of the previous collection of models. In boosting, the models are combined by weighted averaging.
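The sequential idea described above can be sketched in a few lines. The linked post works in R; this is a minimal Python illustration instead, assuming squared-error loss, one-dimensional inputs and decision stumps as the weak learners. The names fit_stump, boost and predict, and the learning_rate parameter, are hypothetical and not taken from the post.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals.

    Returns (threshold, left_mean, right_mean), chosen to minimise
    the squared error of the residuals.
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Build the ensemble sequentially: each stump fits what is still unexplained."""
    base = sum(ys) / len(ys)  # start from the mean of the targets
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if x <= t else rm)
                 for x, p in zip(xs, preds)]
    return (base, learning_rate, stumps)


def predict(model, x):
    """Weighted sum of all stumps on top of the base prediction."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lm if x <= t else rm for t, lm, rm in stumps)
```

Calling boost(xs, ys) and then predict(model, x) shows the contrast with bagging: the stumps cannot be fitted independently, because each one depends on the predictions of all earlier stumps.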

Computer Age Statistical Inference: Algorithms, Evidence and Data Science

The twenty-first century has seen a breathtaking expansion of statistical methodology, both in scope and in influence. “Big data,” “data science,” and “machine learning” have become familiar terms in the news, as statistical methods are brought to bear upon the enormous data sets of modern science and commerce. How did we get here? And where are we going? This book takes us on a journey through the revolution in data analysis following the introduction of electronic computation in the 1950s. Beginning with classical inferential theories – Bayesian, frequentist, Fisherian – individual chapters take up a series of influential topics: survival analysis, logistic regression, empirical Bayes, the jackknife and bootstrap, random forests, neural networks, Markov chain Monte Carlo, inference after model selection, and dozens more. The book integrates methodology and algorithms with statistical inference, and ends with speculation on the future direction of statistics and data science.

The t-distribution: a key statistical concept discovered by a beer brewery

In this post we will look at two probability distributions you will encounter almost every time you do data science, statistics, or machine learning.

Ebook Launch: The Ultimate Guide to Basic Data Cleaning

Imagine that you have a room filled with dozens of sleeping cats, and you want to know how many cats there are. It would also be good to know some basic insights about your new cat colony — for example, what colors the cats are and whether any of them have extra long tails. This doesn’t seem too difficult, right? Just go around the room and check out each cat. Now imagine that the room is also filled with dozens of birds and flying squirrels, and all the cats are hyped up on catnip. It’s hard enough to stick your head in the room without getting smacked by a flying animal; counting the cats is now out of the question, let alone checking out their tails.

Data science without borders

Wes McKinney makes the case for a shared infrastructure for data science.

Linear Congruential Generator in R

A linear congruential generator (LCG) is a type of pseudorandom number generator (PRNG) that produces sequences of random-like numbers. Random number generation plays a large role in many applications, ranging from cryptography to Monte Carlo methods. LCGs are among the oldest and best-known methods for generating random numbers, primarily because they are comparatively easy to implement, fast, and need little memory. Other methods, such as the Mersenne Twister, are much more common in practical use today.
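The recurrence behind an LCG is x_{n+1} = (a * x_n + c) mod m. The linked post implements it in R; the following is a minimal Python sketch of the same recurrence. The function names lcg and lcg_uniform are hypothetical, and the default constants are an assumption (the widely cited values a = 1664525, c = 1013904223, m = 2^32), not values taken from the post.

```python
from typing import Iterator, List


def lcg(seed: int, a: int = 1664525, c: int = 1013904223,
        m: int = 2 ** 32) -> Iterator[int]:
    """Yield the sequence x_{n+1} = (a * x_n + c) mod m, starting after the seed.

    The default constants are an assumption for illustration only.
    """
    x = seed % m
    while True:
        x = (a * x + c) % m
        yield x


def lcg_uniform(seed: int, n: int, a: int = 1664525, c: int = 1013904223,
                m: int = 2 ** 32) -> List[float]:
    """Return n pseudorandom floats in [0, 1) by scaling each state by m."""
    gen = lcg(seed, a, c, m)
    return [next(gen) / m for _ in range(n)]
```

With the toy parameters a = 5, c = 3, m = 16 and seed 7, the first four states are 6, 1, 8, 11, which is easy to check by hand. The only state kept between draws is a single integer, which is why the method needs so little memory.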

Calculating a fuzzy kmeans membership matrix with R and Rcpp

Suppose that we have performed K-means clustering in R and are satisfied with our results, but later realize that it would also be useful to have a membership matrix. Of course, it would be easier to repeat the clustering using one of the fuzzy k-means functions available in R (fanny, for example), but since that is a slightly different implementation the results could also differ, and for some reason we do not want them to change. Knowing the equation, we can construct this matrix on our own after using the kmeans function.
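The excerpt does not state the equation. A common choice, assumed here, is the standard fuzzy c-means membership u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)), where d_ij is the distance from point i to centroid j and m > 1 is the fuzzifier. The linked post uses R and Rcpp; the sketch below is plain Python, and the name fuzzy_membership and its parameters are hypothetical.

```python
import math


def fuzzy_membership(points, centers, m=2.0):
    """Membership matrix (one row per point, one column per centroid).

    Assumes the standard fuzzy c-means formula with Euclidean distance;
    the centroids would come from an earlier k-means run.
    """
    if m <= 1:
        raise ValueError("the fuzzifier m must be greater than 1")
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        on_center = [d == 0.0 for d in dists]
        if any(on_center):
            # The point coincides with a centroid: give it full membership
            # there (shared equally if several centroids coincide).
            share = 1.0 / sum(on_center)
            row = [share if z else 0.0 for z in on_center]
        else:
            # Equivalent form: d_ij^(-e) / sum_k d_ik^(-e)
            inv = [d ** -exponent for d in dists]
            total = sum(inv)
            row = [v / total for v in inv]
        memberships.append(row)
    return memberships
```

Each row sums to 1, and the hard k-means assignments stay untouched because the centroids are taken as given. A larger m spreads the memberships more evenly across clusters.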

Practical Guide to Principal Component Methods in R

Although there are several good books on principal component methods (PCMs) and related topics, we felt that many of them are either too theoretical or too advanced. This book provides solid practical guidance on summarizing, visualizing and interpreting the most important information in large multivariate data sets, using principal component methods in R.