How to track and visualize data lineage

Data lineage is about tracking the flow of information. It is necessary to guarantee the quality, usability and security of your data. For large organizations, it is also a key compliance requirement. With Linkurious, it is possible to use a graph-based approach to solve these challenges.

Basic recommendation engine using R

In our day-to-day life, we come across a large number of recommendation engines: Facebook's engine for friend suggestions and for suggesting similar pages to like, or YouTube's engine suggesting videos similar to our previous searches and preferences. In today’s blog post I will explain how to build a basic recommender system.

Data Science Learning Resources

Different people learn in different ways. Some learn best by taking a class, others learn best by reading and following along in a book or online tutorial, and still others prefer to watch a video on the topic they want to learn.

Your best references to do your job or get started in data science.

Data Science for Losers, Part 2 – Addendum

This should have been the third part of the Loser’s article series but as you may know I’m trying very hard to keep the overall quality as low as possible. This, of course, implies missing parts, misleading explanations, irrational examples and an awkward English syntax (it’s actually German syntax covered by English-like semantics). And that’s why we now have to go through this addendum and not the real Part Three about using Apache Spark with IPython.

Data Science for Losers, Part 2

In the first article we learned a bit about Data Science for Losers. The most important message, in my opinion, is that patterns are everywhere but many of them can’t be immediately recognized. This is one of the reasons why we’re digging deep holes in our databases, data warehouses, and other silos. In this article we’ll use a few more methods from Pandas’ DataFrames and generate plots. We’ll also create pivot tables and query an MS SQL database via ODBC. SQLAlchemy will be our helper in this case, and we’ll see that even Losers like us can easily merge and filter SQL tables without touching the SQL syntax. No matter the task, you always need a powerful toolset in the first place, like the Anaconda Distribution, which we’ll be using here. Our data sources will be things like JSON files containing reddit comments or SQL databases like Northwind. Many ’90s kids used Northwind to learn SQL.
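To give a rough idea of what merging, filtering and pivoting without SQL can look like, here is a minimal sketch. It is not taken from the article: the two small tables are made-up stand-ins for Northwind-style data, and all column names and values are assumptions chosen for illustration. In the article's setting the DataFrames would presumably be loaded from the database rather than typed in by hand.

```python
import pandas as pd

# Hypothetical stand-ins for two Northwind-style tables (made-up rows).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": ["A", "B", "A", "C"],
    "year": [1996, 1996, 1997, 1997],
    "amount": [100.0, 250.0, 75.0, 300.0],
})
customers = pd.DataFrame({
    "customer_id": ["A", "B", "C"],
    "country": ["Germany", "France", "Germany"],
})

# Rough equivalent of an SQL inner join on customer_id.
merged = orders.merge(customers, on="customer_id", how="inner")

# Rough equivalent of a WHERE clause: keep orders of at least 100.
large = merged[merged["amount"] >= 100.0]

# Pivot table: total order amount per country (rows) and year (columns).
pivot = merged.pivot_table(
    index="country",
    columns="year",
    values="amount",
    aggfunc="sum",
    fill_value=0,
)

print(large)
print(pivot)
```

The join, the filter and the aggregation are each a single DataFrame call, which is the kind of SQL-free workflow the teaser describes.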

Data Science for Losers

Presumably, I’m not the only coder having a dirty little secret: I sucked at math when I was in school. Today, I think I rejected math because of our wrecked educational system. When it comes to math, biology and physics we’re heading for a total disaster. So many lost talents. However, I’ve survived, somehow…even without having a talent in any of the named scientific fields. And even became a software developer. But, that’s another story and much much dirtier. Now, let’s jump over a few decades and start playing with IPython & Pandas.

Thinking Deeply about IoT Analytics

This post discusses different aspects of IoT analytics solutions, pointing out challenges that you need to think about while building or choosing one. Big data has solved many IoT analytics challenges, especially system challenges related to large-scale data management, learning, and data visualization. However, significant thinking and work are still required to match IoT use cases to analytics systems.

Visual Information Theory

Information theory gives us precise language for describing a lot of things. How uncertain am I? How much does knowing the answer to question A tell me about the answer to question B? How similar is one set of beliefs to another? I’ve had informal versions of these ideas since I was a young child, but information theory crystallizes them into precise, powerful ideas. These ideas have an enormous variety of applications, from the compression of data, to quantum physics, to machine learning, and vast fields in between.

Where does that 2 come from in the likelihood ratio test?

Tests, Power and Significance

parallelsugar: An implementation of mclapply for Windows

An easy way to run R code in parallel on a multicore system is with the mclapply() function. Unfortunately, mclapply() does not work on Windows machines because the mclapply() implementation relies on forking and Windows does not support forking.

A preview of using Revolution R Enterprise inside SQL Server

Although the functionality of using R directly inside SQL Server will only be part of SQL Server 2016, Microsoft announced earlier this year that SQL Server 2016 will include Revolution Analytics.