EU Data Innovation Day 2018: Building a Social Contract for Data

Join the Center for Data Innovation for a series of conversations about rethinking rights and responsibilities in the data society and how policymakers can enable responsible data sharing to improve the economy and quality of life.

Highly interpretable, sklearn-compatible classifier and regressor based on simplified decision trees

Implementation of a simple, greedy optimization approach to simplifying decision trees for better interpretability and readability. It produces small decision trees, which makes trained classifiers easily interpretable to human experts, and is competitive with state-of-the-art classifiers such as random forests or SVMs. It frequently outperforms Bayesian Rule Lists in terms of accuracy and computational complexity, and Logistic Regression in terms of interpretability. Note that a feature selection method is highly advisable on large datasets, as the runtime depends directly on the number of features.
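
To give a rough feel for what "greedy simplification" of a decision tree can look like, here is a minimal pure-Python sketch. It is not the package's actual algorithm or API. The tuple-based tree representation, the function names (`predict`, `accuracy`, `collapsed_variants`, `simplify`), and the `tolerance` parameter are all assumptions made for illustration. The idea shown is simply to keep collapsing the one internal node whose removal hurts training accuracy least, until any further collapse would cost more than the allowed tolerance.

```python
from collections import Counter

# Hypothetical representation (an assumption, not the package's format):
# a tree is either a leaf (a plain class label) or a tuple
# (feature_index, threshold, left_subtree, right_subtree).


def predict(tree, x):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if x[feature] <= threshold else right
    return tree


def accuracy(tree, X, y):
    """Fraction of samples in X that the tree labels correctly."""
    return sum(predict(tree, x) == label for x, label in zip(X, y)) / len(y)


def collapsed_variants(tree, X, y):
    """Yield every tree obtained by replacing exactly one internal node
    with a leaf that predicts the majority label of the samples reaching it."""
    if not isinstance(tree, tuple):
        return  # leaves cannot be simplified further
    feature, threshold, left, right = tree
    if y:
        yield Counter(y).most_common(1)[0][0]
    goes_left = [x[feature] <= threshold for x in X]
    X_left = [x for x, g in zip(X, goes_left) if g]
    y_left = [t for t, g in zip(y, goes_left) if g]
    X_right = [x for x, g in zip(X, goes_left) if not g]
    y_right = [t for t, g in zip(y, goes_left) if not g]
    for variant in collapsed_variants(left, X_left, y_left):
        yield (feature, threshold, variant, right)
    for variant in collapsed_variants(right, X_right, y_right):
        yield (feature, threshold, left, variant)


def simplify(tree, X, y, tolerance=0.0):
    """Greedily collapse nodes while accuracy stays within `tolerance`
    of the original tree's accuracy on (X, y)."""
    baseline = accuracy(tree, X, y)
    while True:
        best = max(
            collapsed_variants(tree, X, y),
            key=lambda t: accuracy(t, X, y),
            default=None,
        )
        if best is None or accuracy(best, X, y) < baseline - tolerance:
            return tree
        tree = best


# Toy data: one feature, two classes, and a tree with a redundant split.
X = [(1,), (2,), (3,), (4,)]
y = [0, 0, 1, 1]
tree = (0, 2.5, (0, 1.5, 0, 0), 1)
small = simplify(tree, X, y)
```

In this toy case the inner split on the left branch is removed because both of its leaves predict the same class, while the root split is kept because collapsing it would lower accuracy. Each pass re-evaluates every candidate tree on the full dataset, which keeps the sketch short but would be slow on real data.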

Gainers and Losers in Gartner 2018 Magic Quadrant for Data Science and Machine Learning Platforms

We compare the Gartner 2018 Magic Quadrant for Data Science and Machine Learning Platforms with its 2017 version and identify notable changes for leaders and challengers, including IBM, SAS, RapidMiner, KNIME, Alteryx, and Domino.

Comparing production-grade NLP libraries: Accuracy, performance, and scalability

This is the third and final installment in this blog series comparing two leading open source natural language processing software libraries: John Snow Labs’ NLP for Apache Spark and Explosion AI’s spaCy. In the previous two parts, we walked through the code for training tokenization and part-of-speech models, running them on a benchmark data set, and evaluating the results. In this part, we compare the accuracy and performance of both libraries on this and additional benchmarks, and provide recommendations on which use cases fit each library best.

Big, fast, easy data with KSQL

Modern businesses have data at their core, and this data is changing continuously at a rapid pace, with increasing volumes. Stream processing allows businesses to harness this torrent of information in real time, and tens of thousands of companies like Netflix, Uber, Airbnb, PayPal, and The New York Times use Apache Kafka as the streaming platform of choice to reshape their industries. Whether you are booking a hotel or a flight, taking a cab, playing a video game, reading a newspaper, shopping online, or wiring money, many of these daily activities are powered by Kafka behind the scenes. However, the world of stream processing still has a very high barrier to entry. Today’s most popular stream processing technologies, including Apache Kafka’s Streams API, still require the user to write code in programming languages such as Java or Scala. This hard requirement on coding skills is preventing many companies from unlocking the benefits of stream processing to their full effect. But thankfully, now there is a better way.

Using R to Reason & Test Theory: A Case Study from the Field of Reading Education

This past week I was preparing slides for a reading assessment class with a lecture focused on the Visual Word Form Area [VWFA] (Cohen et al., 2000). This area of the brain is hypothesized to see words (plus morphemes and likely smaller chunks) as shapes or picture forms, and it may serve as a connecting link between the visual and language portions of the brain.

New releases: Microsoft R Client 3.4.3, Microsoft ML Server 9.3

An update to Microsoft R Client, Microsoft’s distribution of open source R with additional proprietary packages — including RevoScaleR (for data analysis at scale) and MicrosoftML (for machine learning) — is now available. Microsoft R Client 3.4.3 updates the R engine to R 3.4.3, and (on Linux) now supports deploying computations to a remote SQL Server with the sqlrutils package.