PowerBI adds support for R

In the latest update released on November 20, PowerBI has added support for R. The desktop edition of Microsoft’s data visualization and reporting tool now allows you to run an R script to generate data; the resulting data frames from the script can then be used for data visualization or any other activities within Power BI. This PowerBI Support article provides the details. Simply select the new ‘Execute R Script’ option within the Other section of the ‘Get Data’ dialog, and paste in the R script you want to run.

Data discretization: taxonomy and big data challenge

Discretization of numerical data is one of the most influential data preprocessing tasks in knowledge discovery and data mining. The purpose of attribute discretization is to find concise data representations as categories that are adequate for the learning task while retaining as much of the information in the original continuous attribute as possible. In this article, we present an updated overview of discretization techniques in conjunction with a complete taxonomy of the leading discretizers. Despite the great impact of discretization as a data preprocessing technique, few elementary approaches have been developed in the literature for Big Data. The purpose of this article is twofold: first, to present a comprehensive taxonomy of discretization techniques to help practitioners in the use of the algorithms; second, to demonstrate that standard discretization methods can be parallelized in Big Data platforms such as Apache Spark, boosting both performance and accuracy. We thus propose a distributed implementation of one of the most well-known discretizers based on Information Theory, obtaining better results than those produced by the entropy minimization discretizer proposed by Fayyad and Irani. Our scheme goes beyond a simple parallelization and is intended to be the first to face the Big Data challenge.
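To make the idea of entropy-based discretization concrete, here is a minimal single-machine sketch of its core step: choosing the one cut point on a continuous attribute that minimizes the weighted class entropy of the two resulting bins. It is not the article's distributed implementation, and it omits the recursive splitting and stopping criterion a full discretizer would need. The function names `entropy` and `best_cut` are hypothetical, chosen for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut_point, weighted_entropy) for the best binary split.

    Illustrative only: tries every boundary between distinct sorted
    values and keeps the one with the lowest weighted class entropy.
    Returns (None, inf) if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_point, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if score < best_score:
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint as cut
            best_score = score
    return best_point, best_score


# Toy data (made up for this sketch): two well-separated classes.
cut, score = best_cut([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
```

Applying `best_cut` recursively to each resulting bin, with a rule for when to stop splitting, would give a multi-interval discretizer. The brute-force recomputation of entropy at every boundary is quadratic and is kept only for readability.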

Building Analytics at Simple

Early in 2014, Simple was a mid-stage startup with only a single analytics-focused employee. When we wanted to answer a question about customer behavior or business performance, we would have to query production databases. Everybody in the company wanted to make informed decisions, from engineering to product strategy to business development to customer relations, so it was clear that we needed to build a data warehouse and a team to support it.

The Black Friday Puzzle – Understanding Markov Chains

This week we celebrate Thanksgiving and Black Friday with a fun puzzle and our first look at Markov Chains!

What is Refinery?

Refinery is an open source platform for the massive analysis of large unstructured document collections using the latest state-of-the-art topic models. The goal of Refinery is to simplify this process within an intuitive web-based interface. What makes Refinery unique is that it’s meant to be run locally, thus bypassing the need to secure document collections over the internet. Refinery was developed by myself and Ben Swanson at MIT Media Lab. It was also the recipient of the Knight Prototype Award in 2014.

Flavour of Physics Winner’s Interview: 3rd place, Josef Slavicek

The Flavour of Physics competition challenged Kagglers to identify a rare decay phenomenon (τ⁻ → μ⁺μ⁻μ⁻ or τ → 3μ) to help establish proof of ‘new physics’. The competition was an exciting opportunity for the community to collaborate with scientists from the LHCb experiment at CERN. 706 data scientists on 673 teams participated, grappling with the complex subject matter and unusual competition setup. Josef Slavicek finished 3rd with the help of XGBoost (and a lucky typo). Below he explains the competition design, its potential weaknesses, and his wild path to the top of the leaderboard.

Bot or Not: an end-to-end data analysis in Python

In this post I want to discuss an Internet phenomenon known as bots, specifically Twitter bots. I’m focusing on Twitter bots primarily because they’re fun and funny, but also because Twitter happens to provide a rich and comprehensive API that allows users to access information about the platform and how it’s used. In short, it makes for a compelling demonstration of Python’s prowess for data analysis work, and also of its areas of relative weakness.

Arabesque Distributed Graph Mining Platform

Arabesque provides an elegant solution to the difficult problem of graph mining that lets a user easily express graph algorithms and efficiently distribute the computation.

Using Apache SparkR to Power Shiny Applications: Part I

The objective of this blog post is to demonstrate how to use Apache SparkR to power Shiny applications. I have been curious about what the use cases for a “Shiny-SparkR” application would be and how to develop and deploy such an app.

Scaling data.table using index

R can handle fairly big data on a single machine: 2B (2E9) rows and a couple of columns require about 100 GB of memory. That is already large enough to make performance a concern. In this post I’m going to discuss the scalability of filter queries.

Estimating the exponent of discrete power law data

Suppose you have data from a discrete power law with exponent α. That is, the probability of an outcome n is proportional to n^(-α). How can you recover α? A naive approach would be to gloss over the fact that you have discrete data and use the MLE (maximum likelihood estimator) for continuous data. That does a very poor job [1]. The discrete case needs its own estimator.
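As a rough illustration of the difference, the sketch below compares the two estimators on synthetic data. It rests on several assumptions that are not in the post itself. Outcomes start at n = 1, so the normalizing constant is the Riemann zeta function ζ(α). The discrete estimate is found by numerically maximizing the log-likelihood -α Σ log x_i - N log ζ(α). The "continuous" MLE is taken to be the usual closed form 1 + N / Σ log x_i with minimum value 1. The function names, the search bounds, and the synthetic data set are all hypothetical.

```python
import math


def zeta(a, terms=1000):
    """Approximate the Riemann zeta function for a > 1.

    Sums the first terms-1 terms directly, then adds an
    Euler-Maclaurin correction for the remaining tail.
    """
    n = terms
    head = sum(k ** -a for k in range(1, n))
    tail = n ** (1 - a) / (a - 1) + 0.5 * n ** -a + a * n ** (-a - 1) / 12
    return head + tail


def discrete_mle(samples, lo=1.01, hi=10.0, iters=60):
    """Maximize the discrete power-law log-likelihood over alpha.

    Uses ternary search, which relies on the log-likelihood being
    concave in alpha. The bounds lo and hi are assumed, not derived.
    """
    n = len(samples)
    sum_log = sum(math.log(x) for x in samples)

    def loglik(a):
        return -a * sum_log - n * math.log(zeta(a))

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if loglik(m1) < loglik(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


def continuous_mle(samples):
    """Closed-form MLE for a continuous power law with minimum value 1."""
    return 1 + len(samples) / sum(math.log(x) for x in samples)


# Synthetic data: counts proportional to n^(-2.5), rounded to integers.
true_alpha = 2.5
samples = [n for n in range(1, 1001)
           for _ in range(round(1e5 * n ** -true_alpha / zeta(true_alpha)))]

alpha_discrete = discrete_mle(samples)
alpha_naive = continuous_mle(samples)
print(f"discrete MLE: {alpha_discrete:.3f}, continuous MLE: {alpha_naive:.3f}")
```

Because the synthetic counts are rounded, the sample is slightly truncated in the tail, so even the discrete estimate will not match the true exponent exactly. The data set is deterministic rather than random purely to keep the sketch reproducible.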