The Art of Story Telling in Data Science and how to create data stories?

The idea of storytelling is fascinating: to take an idea or an incident and turn it into a story. It brings the idea to life and makes it more interesting. This happens in our day-to-day life. Whether we narrate a funny incident or our findings, stories have always been the “go-to” way to draw interest from listeners and readers alike. For instance, when we talk about how one of our friends got scolded by a teacher, we tend to narrate the incident from the beginning so that a flow is maintained. Let’s take an example of the most common driving distractions by gender. There are two ways to tell this.

An overview of emerging pattern mining in supervised descriptive rule discovery: taxonomy, empirical study, trends, and prospects

Emerging pattern mining is a data mining task that aims to discover discriminative patterns, which can describe emerging behavior with respect to a property of interest. In recent years, the description of datasets has become an interesting field due to the easy acquisition of knowledge by the experts. In this review, we will focus on the descriptive point of view of the task. We collect the existing approaches that have been proposed in the literature and group them together in a taxonomy in order to obtain a general vision of the task. A complete empirical study demonstrates the suitability of the approaches presented. This review also presents future trends and emerging prospects within pattern mining and the benefits of knowledge extracted from emerging patterns.

Have You Heard About Unsupervised Decision Trees

Unless you’re involved in anomaly detection you may never have heard of Unsupervised Decision Trees. It’s a very interesting approach to decision trees that on the surface doesn’t sound possible but in practice is the backbone of modern intrusion detection.

Introduction to batch processing – MapReduce

Today, the volume of data is often too big for a single server – node – to process. Therefore, there was a need to develop code that runs on multiple nodes. Writing distributed systems is an endless array of problems, so people developed multiple frameworks to make our lives easier. MapReduce is a framework that allows the user to write code that is executed on multiple nodes without having to worry about fault tolerance, reliability, synchronization or availability.
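
The programming model can be sketched in a single process. The following is a minimal illustration, not a distributed implementation: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the word-count task are assumptions chosen for the example, and a real framework would run the map and reduce steps on different nodes and handle failures itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle step: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster each document (or split) would be mapped on its own node.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data on many nodes"])
```

The user only writes the map and reduce functions; everything between them (grouping, distribution, retries) is what the framework takes off the user's hands.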

Super Fast String Matching in Python

Traditional approaches to string matching such as the Jaro-Winkler or Levenshtein distance measure are too slow for large datasets. Using TF-IDF with N-Grams as terms to find similar strings transforms the problem into a matrix multiplication problem, which is computationally much cheaper. Using this approach made it possible to search for near duplicates in a set of 663,000 company names in 42 minutes using only a dual-core laptop.
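
A rough sketch of the idea, using only the Python standard library: each name is turned into a TF-IDF-weighted vector of character trigrams, and similarity is the dot product of the normalized vectors. The helper names, the smoothed IDF formula, and the sample company names are assumptions made for this illustration; the original post's speed comes from doing the same computation as a sparse matrix multiplication, which this toy version does not attempt.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Split a string into overlapping character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(names, n=3):
    """Return one L2-normalized TF-IDF vector (as a dict) per name."""
    grams_per_name = [Counter(ngrams(name, n)) for name in names]
    # Document frequency: in how many names does each n-gram occur?
    doc_freq = Counter(gram for grams in grams_per_name for gram in grams)
    total = len(names)
    vectors = []
    for grams in grams_per_name:
        # Smoothed IDF (an assumption; other weightings work as well).
        weights = {
            gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, tf in grams.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({g: w / norm for g, w in weights.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two normalized sparse vectors, i.e. cosine similarity."""
    if len(u) > len(v):
        u, v = v, u
    return sum(weight * v.get(gram, 0.0) for gram, weight in u.items())


names = ["Acme Corporation", "ACME Corp.", "Globex Industries"]
vecs = tfidf_vectors(names)
sim_close = cosine(vecs[0], vecs[1])  # near duplicates share many trigrams
sim_far = cosine(vecs[0], vecs[2])    # unrelated names share none here
```

Stacking these vectors as rows of a sparse matrix and multiplying it by its own transpose yields all pairwise similarities at once, which is the step that replaces pairwise edit-distance comparisons.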

The First Rule of Data: Do No Harm

At the Strata Big Data Conference in New York, one of the major themes was the responsibility that data scientists have to do their best to prevent the biases and prejudices that exist in society from creeping into data and the way algorithms are built.

Planet: Understanding the Amazon from Space, 1st Place Winner’s Interview

In our recent Planet: Understanding the Amazon from Space competition, Planet challenged the Kaggle community to label satellite images from the Amazon basin, in order to better track and understand causes of deforestation.

Random Forests(r), Explained

Random Forest is one of the most popular and powerful ensemble methods used in machine learning today. This post is an introduction to the algorithm and provides a brief overview of its inner workings.
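
As a toy illustration of the ensemble idea only: the sketch below trains each "tree" on a bootstrap sample with a random subset of features and combines them by majority vote. To stay short it uses one-split stumps in place of full decision trees, so it is not a complete random forest; the function names and the tiny dataset are assumptions made for the example.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature_indices):
    """Find the single (feature, threshold) split with the fewest errors."""
    majority = Counter(labels).most_common(1)[0][0]
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in feature_indices:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no split possible: always predict the majority class
        return lambda row: majority
    _, f, threshold, left_label, right_label = best
    return lambda row: left_label if row[f] <= threshold else right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_features = len(rows[0])
    k = max(1, int(n_features ** 0.5))  # features considered per tree
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        features = rng.sample(range(n_features), k)        # random feature subset
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample], features))
    return forest


def predict(forest, row):
    """Majority vote over all trees in the ensemble."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]


rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
low_prediction = predict(forest, (1.5, 1.5))
high_prediction = predict(forest, (8.5, 8.5))
```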

Natural Stupidity is more Dangerous than Artificial Intelligence

Do you know what’s more dangerous than artificial intelligence? Natural stupidity. In this article, I will explore natural stupidity in more detail and show how our current technology (driven by narrow artificial intelligence) is making us collectively dumber. We’ve all had the experience of using a GPS to guide us around an unfamiliar place, only to realize later that we have no recollection of the route and cannot get to that place again without the aid of a GPS. Not only is our directional instinct diminished through lack of use, but so are our own memories. We’ve all experienced losing our ability to recall due to our overuse of Google. We now remember how to search for something rather than the details of that something. The framework that I often use to explore intuition is the Cognitive Bias Codex found on Wikipedia. It’s a massive list of biases; however, to get an overview of it, there are four high-level categories that are the drivers of these biases. These are “Too Much Information”, “Not Enough Meaning”, “Need to Act Fast” and “What Should we Remember?”.

Enabling data science for the majority

As practitioners who build data science tools, we seem to have a rather myopic obsession with the challenges faced by the Googles, Amazons, and Facebooks of the world—companies with massive and mature data analytics ecosystems, supported by experienced systems engineers, and used by data scientists who are capable programmers. However, these companies represent a tiny fraction of the “big data” universe. It’s helpful to think of them as the “1% of big data”: the minority whose struggles are not often what the rest of the “big data” world faces. Yet, they occupy the majority of discourse around how to utilize the latest tools and technologies in the industry.

How to manage Docker containers in Kubernetes with Java

In Containerizing Continuous Delivery in Java we explored the fundamentals of packaging and deploying Java applications within Docker containers. This was only the first step in creating production-ready, container-based systems. Running containers at any real-world scale requires a container orchestration and scheduling platform, and although many exist (e.g., Docker Swarm, Apache Mesos, and AWS ECS), the most popular is Kubernetes. Kubernetes is used in production at many organizations, and is now hosted by the Cloud Native Computing Foundation (CNCF). In this article, we will take the simple Java-based e-commerce shop that we previously packaged within Docker containers and run it on Kubernetes.

Planning for AI

What you need to know before committing to AI.

Writing Julia functions in R with examples

The Julia programming language is growing fast, and its efficiency and speed are now well known. Even though I think R is the best language for Data Science, sometimes we just need more. Modelling is an important part of Data Science, and sometimes you may need to implement your own algorithms or adapt existing models to your problems. If performance is not essential and the complexity of your problem is small, R alone is enough. However, if you need to run the same model several times on large datasets and the available implementations are not suited to your problem, you will need to go beyond R. Fortunately, you can go beyond R in R, which is great because you can do your analysis in R and call complex models from elsewhere. The book “Extending R” by John Chambers presents interfaces in R for C++, Julia and Python. The last two are in the XRJulia and XRPython packages, which are very straightforward.

How we built a Shiny App for 700 users?

Olga’s talk was entitled ‘How we built a Shiny App for 700 users?’ She went over the main challenges associated with scaling a Shiny application, and the methods we used to resolve them. The talk was partly in the form of a case study based on Appsilon’s experience.

From Power Calculations to P-Values: A/B Testing at Stack Overflow

If you hang out on Meta Stack Overflow, you may have noticed news from time to time about A/B tests of various features here at Stack Overflow. We use A/B testing to compare a new version to a baseline for a design, a machine learning model, or practically any feature of what we do here at Stack Overflow; these tests are part of our decision-making process. Which version of a button, predictive model, or ad is better? We don’t have to guess blindly, but instead we can use tests as part of our decision-making toolkit.
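
To make the p-value side of such a comparison concrete, here is a minimal sketch of a two-sided, two-proportion z-test written with the Python standard library. It is a generic textbook test, not necessarily the procedure Stack Overflow uses, and the function name and the conversion counts in the usage line are made up for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare the conversion rates of baseline A and variant B.

    Returns the z statistic and the two-sided p-value under the null
    hypothesis that both versions share the same underlying rate.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: the best estimate of the shared rate if the null is true.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 200 of 2000 users convert on A, 260 of 2000 on B.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
```

A small p-value suggests the observed difference would be unlikely if both versions performed the same. A power calculation done before the test determines how many users are needed for that conclusion to be reliable.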