The Beautiful Duality of TDA

One of my favorite things about Topological Data Analysis (TDA) is how malleable it is: its methods are both general and precise. These properties might sound incompatible, but perhaps I can explain the combination best by describing how I used to approach data analysis problems before I started working on TDA.


Learning Analytics

For years, education has been one model that fits all: a fixed curriculum, developed from experience, is delivered to a group of students in the hope that it gets across to every student with the same effectiveness. Each student is then graded on his or her understanding of that curriculum. Now imagine a classroom where it is not just the student who is learning about the world, but where the classroom is also learning about the student. Each day is a new learning experience for both the classroom and the student. The adaptive classroom creates and delivers content tailored to the needs of each individual student, based on what it has learned about that student.


Deep Learning in a Nutshell: Core Concepts

This post is the first in a series I'll be writing for Parallel Forall that aims to provide an intuitive and gentle introduction to deep learning. It covers the most important deep learning concepts, aiming to convey an understanding of each concept rather than its mathematical and theoretical details. While mathematical terminology is sometimes necessary and can further understanding, these posts use analogies and images wherever possible to provide easily digestible bits that add up to an intuitive overview of the field. I wrote the series in a glossary style so it can also be used as a reference for deep learning concepts. Part 1 focuses on introducing the main concepts of deep learning. Future posts will provide historical background and delve into the training procedures, algorithms, and practical tricks used in training deep learning models.


Use Box Plots to Assess the Distribution and to Identify the Outliers in Your Dataset

After you check the distribution of the data by plotting a histogram, the second thing to do is to look for outliers. Identifying outliers is important because an association you find in your analysis might be explained by their presence. The best tool for identifying outliers is the box plot. Through box plots we find the minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum of a continuous variable. The function to build a box plot is boxplot().
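As a minimal sketch with base R (using the built-in mtcars dataset as a stand-in, since the original post's data isn't shown here), the plot and the statistics behind it can be produced like this:

    # Draw a box plot of a continuous variable (mtcars$mpg as a stand-in)
    data(mtcars)
    boxplot(mtcars$mpg, main = "Miles per gallon", ylab = "mpg")

    # The five summary statistics the box plot is built from
    fivenum(mtcars$mpg)   # minimum, lower quartile, median, upper quartile, maximum

    # Points drawn beyond the whiskers (more than 1.5 * IQR from the box)
    boxplot.stats(mtcars$mpg)$out

Note that the whiskers of a default boxplot() extend only to the most extreme points within 1.5 times the interquartile range of the box; anything further out is flagged individually as a potential outlier.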


5 Best Machine Learning APIs for Data Science

Machine Learning APIs make it easy for developers to build predictive applications. Here we review 5 important Machine Learning APIs: IBM Watson, Microsoft Azure Machine Learning, Google Prediction API, Amazon Machine Learning API, and BigML.


Beginner's Guide: Apache Spark Machine Learning Scenario With A Large Input Dataset

What if you want to create a machine learning model, but realize that your input dataset doesn't fit in your computer's memory? Usually you would use distributed computing tools like Hadoop and Apache Spark to run that computation on a cluster with many machines. However, Apache Spark can also process your data on a local machine in standalone mode, and can even build models when the input dataset is larger than the amount of memory your computer has. In this blog post, I'll show you an end-to-end scenario with Apache Spark in which we create a binary classification model from a 34.6-gigabyte input dataset. Run this scenario on your laptop (yes, yours, with its 4-8 gigabytes of memory and 50+ gigabytes of disk space) to test it.
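The post's own walkthrough presumably drives Spark directly; as a rough sketch of the same idea from R, the sparklyr package can run Spark in local mode and fit a logistic regression without caching the full dataset in memory. The file path and column names below are placeholders, not the post's actual dataset:

    library(sparklyr)

    sc <- spark_connect(master = "local")

    # memory = FALSE registers the file without caching it in RAM,
    # so Spark can spill to disk when the data exceeds available memory
    events <- spark_read_csv(sc, name = "events",
                             path = "file:///data/events.csv",  # placeholder path
                             memory = FALSE)

    # Binary classification with Spark MLlib's logistic regression
    model <- ml_logistic_regression(events, label ~ feature1 + feature2)

    spark_disconnect(sc)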


Data-Planet Statistical Datasets

Data-Planet Statistical Datasets provides easy access to an extensive repository of standardized and structured statistical data, with more than 25 billion data points from more than 70 source organizations.


Accessing Bitcoin Data with R

I am not yet a Bitcoin advocate. Nevertheless, I am impressed with the amount of Bitcoin activity and the progress that advocates are making toward having Bitcoin recognized as a legitimate currency. Right now, I am mostly interested in the technology behind Bitcoin and the possibility of working with some interesting data sets. A good bit of historical data is available on sites like bitstamp.net and bitcoincharts.com, and most of it is easily accessible from R with just a little data munging. In this post, I present some code that may be helpful to someone who wants to get started working with Bitcoin data in R. Transaction data is available in a JSON file from bitstamp.net here. This can easily be read with the fromJSON() function from the RJSONIO package and put into a data frame with the help of do.call().
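A minimal sketch of that munging step (the Bitstamp transactions URL is my guess at the endpoint the post links to, and the record fields may differ):

    library(RJSONIO)

    # Fetch recent transactions as text, then parse the JSON (assumed endpoint)
    con <- url("https://www.bitstamp.net/api/transactions/")
    txt <- paste(readLines(con, warn = FALSE), collapse = "")
    close(con)
    raw <- fromJSON(txt)

    # Bind the list of transaction records row-wise into a data frame
    trades <- do.call(rbind,
                      lapply(raw, function(x) as.data.frame(x, stringsAsFactors = FALSE)))
    head(trades)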


Multiple legends for the same aesthetic

Enrico is a colleague of mine at Quantide. Some days ago, he asked me how to get two different legends for several coloured lines.
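The post works through its own solution; as one possible sketch of the idea, the ggnewscale package lets ggplot2 start a fresh colour scale partway through the plot, so each group of lines gets its own legend (the data frames below are made up for illustration):

    library(ggplot2)
    library(ggnewscale)   # provides new_scale_colour()

    d1 <- data.frame(x = 1:10, y = cumsum(rnorm(10)), grp = "series A")
    d2 <- data.frame(x = 1:10, y = cumsum(rnorm(10)), grp = "series B")

    ggplot() +
      geom_line(data = d1, aes(x, y, colour = grp)) +
      scale_colour_manual(name = "First legend", values = c("series A" = "steelblue")) +
      new_scale_colour() +   # layers added after this get their own colour scale
      geom_line(data = d2, aes(x, y, colour = grp)) +
      scale_colour_manual(name = "Second legend", values = c("series B" = "firebrick"))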