Building a Successful Data Lake: An Information Strategy Foundation

As the Strata Data Conference begins this week in New York, it’s interesting to see how the big data proponents have all been able to rally around the data lake as a concept, with a side of Artificial Intelligence and Machine Learning to make it interesting.


A Gentle Introduction to Handling a Non-Stationary Time Series in Python

What do these applications have in common: predicting the electricity consumption of a household for the next three months, estimating traffic on roads at certain periods, and predicting the price at which a stock will trade on the New York Stock Exchange? They all fall under the concept of time series data! You cannot accurately predict any of these results without the ‘time’ component. And as more and more data is generated in the world around us, time series forecasting is becoming an ever more critical technique for a data scientist to master. But time series is a complex topic with multiple facets at play simultaneously.


Everything you need to know about AutoML and Neural Architecture Search

AutoML and Neural Architecture Search (NAS) are the new kings of the deep learning castle. They’re the quick and dirty way of getting great accuracy for your machine learning task without much work. Simple and effective; it’s what we want AI to be all about! So how does it work? How do you use it? What options do you have to harness that power today? Here’s everything you need to know about AutoML and NAS.


Aggregating Data with Apache Spark

Data engineering is core to any big data analytics project. A key function data engineers often perform is aggregating large amounts of data to create various groupings for many different uses in data science. However, as data volumes and complexities increase, the act of performing various forms of aggregations gets more challenging.


If not Notebooks, then what? Look to Literate Programming

No video of Joel’s talk is available yet, but you can guess the theme from that opening slide, and walking through the slides conveys the message well, I think. Yihui Xie, author and creator of the rmarkdown package, provides a detailed summary and response to Joel’s talk, where he lists Joel’s main critiques of Notebooks: …


Getting started with deep learning in R

There are good reasons to get into deep learning: Deep learning has been outperforming the respective ‘classical’ techniques in areas like image recognition and natural language processing for a while now, and it has the potential to bring interesting insights even to the analysis of tabular data. For many R users interested in deep learning, the hurdle is not so much the mathematical prerequisites (as many have a background in statistics or empirical sciences), but rather how to get started in an efficient way. This post will give an overview of some materials that should prove useful. If you don’t have that background in statistics or similar, we will also present a few helpful resources to catch up with ‘the math’.


Paper Summary: Unsupervised Deep Embedding for Clustering Analysis

One of the most important aspects of clustering is the means of measuring distance (or dissimilarity). For k-means we use the Euclidean distance between points. However, another important aspect is the feature space in which those measurements are performed. K-means clustering in raw pixel space is ineffective, so the authors wanted to answer the question: can we use a data-driven approach to solve for the feature space and cluster memberships jointly?


Understanding K-means Clustering in Machine Learning

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known, or labelled, outcomes. AndreyBu, who has more than 5 years of machine learning experience and currently teaches people his skills, says that ‘the objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.’ A cluster refers to a collection of data points aggregated together because of certain similarities. You’ll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of the cluster. Every data point is allocated to one of the clusters so as to reduce the within-cluster sum of squares. In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest cluster, while keeping the within-cluster sum of squared distances as small as possible. The ‘means’ in K-means refers to averaging of the data; that is, finding the centroid.
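To make the assign-then-average loop concrete, here is a minimal sketch in plain Python. It is not taken from the linked article: the function names (`kmeans`, `squared_distance`, `mean_point`), the random choice of starting centroids, and the stopping rule are all illustrative assumptions, and production code would normally use a library implementation instead.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points (the centroid)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, max_iter=100, seed=0):
    """Hypothetical helper: basic K-means (Lloyd-style iteration).

    Assumption: starting centroids are k distinct points sampled at random.
    Returns (centroids, labels, within-cluster sum of squares).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean_point(members) if members else centroids[j])
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    labels = assign(points, centroids)
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, wcss


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, labels, wcss = kmeans(data, k=2)
    print(centroids, labels, wcss)
```

Each pass first assigns every point to its nearest centroid and then moves each centroid to the mean of its members; neither step can increase the within-cluster sum of squares, which is why the loop settles. The result can still depend on the starting centroids, so running it with several seeds and keeping the lowest sum of squares is a common precaution.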


3 facts about time series forecasting that surprise experienced machine learning practitioners

Time series forecasting is something of a dark horse in the field of data science: it is one of the most applied data science techniques in business, used extensively in finance, in supply chain management and in production and inventory planning, and it has a well-established theoretical grounding in statistics and dynamic systems theory. Yet it retains something of an outsider status compared to more recent and popular machine learning topics such as image recognition and natural language processing, and it gets little or no treatment in introductory courses on data science and machine learning.