Causal Information Theory – Formal Introduction of Key Concepts

Let's ground information in the interventionist account of causation: an informative cause is a specific difference maker (Griffiths and Stotz 2013).
Causes enable prediction and control. But they are profligate.
How can we quantify how much variables ‘influence’ (Woodward 2010, p. 305) their putative effects?
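One way to make "how much influence" concrete is to measure the mutual information between a cause whose value is set by intervention and its effect. This measure is an assumption of the sketch below, not something the text above states. The function name, the intervention distribution, and the conditional probability tables are all hypothetical.

```python
from math import log2


def causal_mutual_information(p_cause, p_effect_given_cause):
    """Mutual information (in bits) between cause and effect.

    p_cause[i] is the probability with which an intervention sets the
    cause to value i; p_effect_given_cause[i][j] is the probability of
    effect value j once the cause has been set to value i.
    """
    n_effects = len(p_effect_given_cause[0])
    # Marginal distribution of the effect under these interventions.
    p_effect = [
        sum(p_c * row[e] for p_c, row in zip(p_cause, p_effect_given_cause))
        for e in range(n_effects)
    ]
    total = 0.0
    for p_c, row in zip(p_cause, p_effect_given_cause):
        for e, p_e_given_c in enumerate(row):
            # Zero-probability terms contribute nothing to the sum.
            if p_c > 0 and p_e_given_c > 0:
                total += p_c * p_e_given_c * log2(p_e_given_c / p_effect[e])
    return total


# A cause that fully determines a binary effect (a specific difference maker).
specific = causal_mutual_information([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])

# A "cause" whose value makes no difference to the effect.
unspecific = causal_mutual_information([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]])
```

Under these toy tables the first variable carries one bit about its effect and the second carries none, which matches the intuition that only the first is a difference maker.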

What Is UIMA?

UIMA stands for Unstructured Information Management Architecture and is a component architecture and software framework implementation for the analysis of unstructured content like text, video and audio data. Unstructured information represents the largest, most current and fastest growing source of information available to businesses and governments. The motivation to develop such a framework was to build a common platform for unstructured analytics, to foster reuse of analysis components and to reduce duplication of analysis development. The pluggable architecture of UIMA lets you easily plug in your own analysis components and combine them with others. A full analysis task of a solution using unstructured analytics, like search or government intelligence applications, is often not a monolithic thing but a multi-stage process where different modules need to build on each other to get a powerful analysis chain. In some cases, annotators from different specialized vendors may also need to work together to produce the results needed. The UIMA application interested in such analysis results does not need to know the details of how annotators work together to create the results. The UIMA framework takes care of the integration and orchestration of multiple annotators. So the major goal of UIMA is to transform unstructured information into structured information by orchestrating analysis engines to detect entities or relations, and thus to build the bridge between the unstructured and the structured world.
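The multi-stage idea can be illustrated with a toy pipeline in which each annotator adds structured annotations to a shared document and later annotators build on earlier ones. This is a minimal sketch in plain Python and does not use the UIMA API. Every class and function name here is hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # e.g. "token" or "name_candidate"
    begin: int  # character offset where the span starts
    end: int    # character offset where the span ends
    text: str


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


def token_annotator(doc):
    """First stage: mark every word as a token."""
    for match in re.finditer(r"\w+", doc.text):
        doc.annotations.append(
            Annotation("token", match.start(), match.end(), match.group())
        )


def name_candidate_annotator(doc):
    """Second stage: builds on tokens; flags capitalized, non-initial words."""
    tokens = [a for a in doc.annotations if a.kind == "token"]
    for tok in tokens:
        if tok.text[0].isupper() and tok.begin > 0:
            doc.annotations.append(
                Annotation("name_candidate", tok.begin, tok.end, tok.text)
            )


def run_pipeline(text, annotators):
    """Orchestrate the annotators in order over one shared document."""
    doc = Document(text)
    for annotate in annotators:
        annotate(doc)
    return doc


doc = run_pipeline(
    "Yesterday Alice met Bob in Paris.",
    [token_annotator, name_candidate_annotator],
)
names = [a.text for a in doc.annotations if a.kind == "name_candidate"]
```

The calling code only sees the final annotations. It does not need to know which stage produced them or how the stages depend on each other, which is the separation the paragraph above describes.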

What is a Data Hub? Shucks – Piccadilly Circus!

It seems that introducing a data hub strategy a couple of years ago was both good and bad. For some clients we put a name to something that makes total sense and something they had been working on for years. For other clients we have just opened up Pandora's Box and introduced more complexity. For another set of clients we have just birthed a new toy that might indeed solve all their problems – whatever they are. For some vendors, we even gave a breath of fresh air to some pretty tired old marketing messages. Welcome to the life of an analyst.

Who Owns Your Data Lake?

Over the past few weeks of client interactions, it has become more common for Chief Data Officers (CDOs) to ‘own’ the data lake initiative in enterprises. The data warehouse and mart environments stay with the CIO's team. This creates a massive amount of tension within the organization and leads to competition between groups that should be collaborating. It also leads to data lake failures.

A Hands-On Guide to Automated Feature Engineering using Featuretools in Python

Anyone who has participated in machine learning hackathons and competitions can attest to how crucial feature engineering can be. It is often the difference between getting into the top 10 of the leaderboard and finishing outside the top 50! I have been a huge advocate of feature engineering ever since I realized its immense potential. But it can be a slow and arduous process when done manually. I have to spend time brainstorming over which features to come up with, and analyze their usability from different angles. Now, this entire FE process can be automated and I'm going to show you how in this article. We will be using the Python feature engineering library called Featuretools to do this. But before we get into that, we will first look at the basic building blocks of FE, understand them with intuitive examples, and then finally dive into the awesome world of automated feature engineering using the BigMart Sales dataset.

Watson – Time to Prune the ML Tree?

IBM's Watson QAM (Question Answering Machine), famous for its 2011 Jeopardy win, was supposed to bring huge payoffs in healthcare. Instead, both IBM and its Watson Healthcare customers are rapidly paring back these projects, which have largely failed to pay off. Watson was the first big out-of-the-box commercial application in ML/AI. Has it become obsolete?

Statistical Tests – When to use Which?

For a person from a non-statistical background, the most confusing aspect of statistics is always the fundamental statistical tests and when to use which. This blog post is an attempt to mark out the differences between the most common tests, the use of the null hypothesis in these tests, and the conditions under which a particular test should be used.

Market Basket Analysis using R

Learn about Market Basket Analysis & the APRIORI Algorithm that works behind it. You’ll see how it is helping retailers boost business by predicting what items customers buy together.
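The "items bought together" idea is usually scored with support, confidence and lift, the measures that Apriori-style rule mining reports. The sketch below computes only those three measures for a single rule. It is not the Apriori algorithm itself, and the transactions and function names are made up for illustration.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), baskets)
    confidence = both / support(antecedent, baskets)
    # Lift above 1 means the items co-occur more often than if independent.
    lift = confidence / support(consequent, baskets)
    return both, confidence, lift


rule_support, rule_confidence, rule_lift = rule_metrics(
    {"diapers"}, {"beer"}, transactions
)
```

For these baskets the rule "diapers -> beer" has a lift above 1, meaning the two items appear together more often than their individual frequencies would suggest.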

Analyzing Complexity of Code through Python

Get introduced to Asymptotic Analysis. Learn more about the complexity of algorithms as well as asymptotic notation, such as Big O, Big Θ, and Big Ω notation, along with examples of complexity in different algorithms.
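A quick way to see asymptotic growth is to count the basic steps two algorithms take as the input size grows. The sketch below compares linear search, which is O(n) in the worst case, with binary search on sorted input, which is O(log n). The function names and input sizes are chosen purely for illustration.

```python
def linear_search_steps(items, target):
    """Count the comparisons linear search makes before finding target."""
    steps = 0
    for value in items:
        steps += 1
        if value == target:
            break
    return steps


def binary_search_steps(items, target):
    """Count the loop iterations binary search makes on a sorted list."""
    lo, hi = 0, len(items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            break
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


# Worst case for linear search: the target is the last element.
results = {}
for n in (1_000, 10_000, 100_000):
    data = list(range(n))
    results[n] = (
        linear_search_steps(data, n - 1),
        binary_search_steps(data, n - 1),
    )

for n, (linear, binary) in results.items():
    print(f"n={n}: linear={linear} steps, binary={binary} steps")
```

Growing the input tenfold multiplies the linear count by ten, while the binary count only increases by a few steps.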

How to Operationalize Machine Learning and Data Science Projects

The democratization of machine learning platforms is proliferating analytical assets and models. The challenge now is to deploy and operationalize at scale. Data and analytics leaders must establish operational tactics and strategies to secure and systematically monetize data science efforts.

Docker Cheat Sheet

This comprehensive cheat sheet will assist Docker users, experienced and new, in getting containers up-and-running quickly. We list commands that will allow users to install, build, ship and run Docker containers.

UX Design Guide for Data Scientists and AI Products

When looking up UX design strategies for AI products, I found little to no relevant material. Among the few I found, most were either too domain specific or completely focused on visual designs of web UIs. The best articles I have come across on this subject matter were Vladimir Shapiro's ‘UX for AI: Trust as a Design Challenge’ and Dávid Pásztor's ‘AI UX: 7 Principles of Designing Good AI Products’. Realizing that there is a legitimate knowledge gap between UX Designers and Data Scientists, I have decided to attempt addressing the needs from the Data Scientist's perspective. Hence, my assumption is that the readers have some basic understanding of data science. For UX Designers with little to no data science background, I have avoided the use of complex mathematics and programming (though I do encourage reading Michael Galarnyk's ‘How to Build a Data Science Portfolio’ and my ‘Data Science Interview Guide’).

Basic Statistics in Python: Probability

When studying statistics, you will inevitably have to learn about probability. It is easy to lose yourself in the formulas and theory behind probability, but it has essential uses in both working and daily life. We’ve previously discussed some basic concepts in descriptive statistics; now we’ll explore how statistics relates to probability.

Exploring correlations in R with corrr

Moving to corrr – the first package I ever created. It started when I was a postgrad student studying individual differences in decision making. My research data was responses to test batteries. My statistical bread and butter was regression-based techniques like multiple regression, path analysis, factor analysis (EFA and CFA), and structural equation modelling. I spent a lot of time exploring correlation matrices to make model decisions, and diagnose poor fits or unexpected results! If you need proof, check out some of the correlation tables published in my academic papers like ‘Individual Differences in Decision Making Depend on Cognitive Abilities, Monitoring and Control’.