Lessons from Bayesian disease diagnosis: Don’t over-interpret the Bayes factor, VERSION 2

A primary example of Bayes’ rule is for disease diagnosis (or illicit drug screening). The example is invoked routinely to explain the importance of prior probabilities. Here’s one version of it: Suppose a diagnostic test has a 97% detection rate and a 5% false alarm rate. Suppose a person selected at random tests positive. What is the probability that the person has the disease? It might seem that the odds of having the disease are 0.97/0.05 (i.e., the detection rate over the false alarm rate), which corresponds to a probability of about 95%. But Bayesians know that such an answer is not appropriate because it ignores the prior probability of the disease, which presumably is very rare. Suppose the prior probability of having the disease is 1%. Then Bayes’ rule implies that the posterior probability of having the disease is only about 16%, even after testing positive! This type of example is presented over and over again in introductory expositions (e.g., pp. 103-105 of DBDA2E), emphatically saying not to use only the detection and false alarm rates, and always to incorporate the prior probabilities of the conditions.
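The arithmetic in this example is easy to check. Here is a minimal Python sketch, assuming the 97% detection rate, 5% false alarm rate, and 1% prior quoted above; the function names are illustrative and do not come from the original post.

```python
def probability_from_bayes_factor(detection_rate, false_alarm_rate):
    """Treat the Bayes factor as if it were the posterior odds (ignores the prior)."""
    odds = detection_rate / false_alarm_rate
    return odds / (1.0 + odds)


def posterior_probability(prior, detection_rate, false_alarm_rate):
    """P(disease | positive test) from Bayes' rule."""
    true_positive = detection_rate * prior
    false_positive = false_alarm_rate * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


# Numbers taken from the example in the text.
naive = probability_from_bayes_factor(0.97, 0.05)
posterior = posterior_probability(0.01, 0.97, 0.05)

print(f"Ignoring the prior: {naive:.2f}")      # about 0.95
print(f"With a 1% prior:    {posterior:.2f}")  # about 0.16
```

The two functions differ only in whether the prior odds multiply the Bayes factor, which is exactly the step the naive answer skips.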

How to create a Twitter Sentiment Analysis using R and Shiny

Every time you release a product or service, you want feedback from users so you know what they like and what they don’t. Sentiment Analysis can help you. I will show you how to create a simple application in R and Shiny to perform Twitter Sentiment Analysis in real-time. I use RStudio. We will be able to see whether users liked our products or not. Also, we will create a wordcloud to find out why they liked them or why not.

Commuting between districts and cities in New Zealand

At this year’s New Zealand Statisticians Association conference I gave a talk on Modelled Territorial Authority Gross Domestic Product. One thing I’d talked about was the impact on the estimates of people residing in one Territorial Authority (district or city) but working in another one. This was important because data on earnings by place of residence formed a crucial step in those particular estimates of modelled GDP, which needs to be based on place of production. I had a slide to visualise the “commuting patterns”, which I’d prepared for that talk but isn’t used elsewhere, and thought I’d share it and a web version here on this blog. The web version is the one in the frame above this text. It’s designed to be interacted with – try hovering over circles, or picking them up and dragging them around.

The Star Wars social network

Some of us are looking forward to Christmas, and some of us are looking forward to the new film in the Star Wars franchise, The Force Awakens. Meanwhile, I decided to look at the whole 6-movie cycle from a quantitative point of view and extract the Star Wars social networks, both within each film and across the whole Star Wars universe. Looking at the social network structure reveals some surprising differences between the original trilogy and the prequels. If you’re interested in technical details of how I extracted the data, head down to the How I did the analysis section. But let’s start with some visualizations.

Tour of Real-World Machine Learning Problems

Real-world examples make the abstract description of machine learning concrete. In this post you will go on a tour of real-world machine learning problems. You will see how machine learning can actually be used in fields like education, science, technology and medicine. Each machine learning problem listed also includes a link to the publicly available dataset. This means that if a particular machine learning problem interests you, you can download the dataset and start practicing immediately.

Data Management

Preparing data for analysis often requires creating new variables, merging datasets, or splitting a large dataset into smaller subsets. We also cover how to identify missing values and perform other data manipulation tasks.

Simulating Uncertain Decisions With Python and Petersburg

In the 17th century, salon mathematicians wrestled with a puzzle called the problem of points. In this puzzle, two players are playing some game but must quit early, and therefore have to divide the remaining pot fairly. Over the centuries, many people proposed different solutions, with no consensus on the right one. Eventually, in 1654, Blaise Pascal proposed that the value of a future gain is directly proportional to the chance of getting it. More formally, the value of a gamble is the sum of each potential outcome multiplied by its probability of occurring. This quantity is known today as the expected value.
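The definition translates directly into a few lines of code. The sketch below uses plain Python rather than the Petersburg library, and the gamble it evaluates is a made-up illustration, not an example from the original post.

```python
import math


def expected_value(outcomes):
    """Sum of value * probability over (value, probability) pairs."""
    total_probability = sum(p for _, p in outcomes)
    if not math.isclose(total_probability, 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(value * p for value, p in outcomes)


# Hypothetical gamble: win 100 with probability 0.3, otherwise lose 20.
gamble = [(100, 0.3), (-20, 0.7)]
gamble_ev = expected_value(gamble)

# A fair six-sided die: each face has probability 1/6.
die = [(face, 1 / 6) for face in range(1, 7)]
die_ev = expected_value(die)

print(f"gamble: {gamble_ev:.2f}")
print(f"die:    {die_ev:.2f}")
```

Checking that the probabilities sum to 1 catches the most common mistake when listing outcomes by hand, which is forgetting a branch.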

Analyzing networks of characters in ‘Love Actually’

Every Christmas Eve, my family watches Love Actually. Objectively it’s not a particularly, er, good movie, but it’s well-suited for a holiday tradition. (Vox has got my back here). Even on the eighth or ninth viewing, it’s impressive what an intricate network of characters it builds. This got me wondering how we could visualize the connections quantitatively, based on how often characters share scenes. So last night, while my family was watching the movie, I loaded up RStudio, downloaded a transcript, and started analyzing.

Data Mining for Predictive Social Network Analysis

Social networks, in one form or another, have existed since people first began to interact. Indeed, put two or more people together and you have the foundation of a social network. It is therefore no surprise that, in today’s Internet-everywhere world, online social networks have become entirely ubiquitous. Within this world of online social networks, a particularly fascinating phenomenon of the past decade has been the explosive growth of Twitter, often described as “the SMS of the Internet”. Launched in 2006, Twitter rapidly gained global popularity and has become one of the ten most visited websites in the world. As of May 2015, Twitter boasts 302 million active users who are collectively producing 500 million Tweets per day. And these numbers are continually growing. Given this enormous volume of social media data, analysts have come to recognize Twitter as a virtual treasure trove of information for data mining, social network analysis, and information for sensing public opinion trends and groundswells of support for (or opposition to) various political and social initiatives. Twitter Trend Topics in particular are becoming increasingly recognized as a valuable proxy for measuring public opinion.

Free Data Mining Programs for Everyday use

If you want to get started quickly with data analysis, here is my advice on free software programs that I use every day for data analysis, statistics and data mining.
• R – a software environment for statistical computing, written largely in C, Fortran and R. Script oriented.
Pros: widely used, simple, extensive documentation.
Cons: fewer options for graphics compared to competitors, no multi-threading, scripting features are limited compared to full-featured programming languages (such as Python, C++ or Java).
• DMelt – a mathematical software written in Java and based on GNU libraries supported by the DMelt team.
Pros: support for many languages, Java, Python/Jython, Groovy, Ruby, Octave. Multi-threading. Extensive documentation, 2D/3D graphics and hundreds of code examples.
Cons: Hard to bind with CPython. Many advanced topics of DMelt documentation are proprietary.
• Weka – A Java environment for data mining.
Pros: advanced GUI, good documentation.
Cons: support for data visualization is less advanced compared to alternative programs. Scripting support is limited.
• Orange – visualization and analysis for novices and experts.
Pros: Advanced GUI and graphics. Good documentation. Support for CPython.
Cons: Less choice for scripting compared to alternative statistical packages.

Top Algorithm, Data Science, Big Data and Machine Learning Experts

Resolving Skewness

A fundamental assumption in many predictive models is that the predictors are normally distributed. A normal distribution is un-skewed, meaning it is roughly symmetric: the probability of falling to the right of the mean equals the probability of falling to the left of it.
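To make "un-skewed" measurable, one common choice is the moment-based skewness coefficient, which is zero for a perfectly symmetric sample and positive when the right tail is longer. The sketch below is an illustration in plain Python. The tiny datasets are invented, and the log transform at the end is an assumption about one typical remedy, not necessarily the method the original post uses.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Invented data: one symmetric sample, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 4, 20]

# Assumed remedy: a log transform pulls in the long right tail.
log_transformed = [math.log(x) for x in right_skewed]

print(f"symmetric:       {skewness(symmetric):.2f}")
print(f"right-skewed:    {skewness(right_skewed):.2f}")
print(f"log-transformed: {skewness(log_transformed):.2f}")
```

On this data the log transform reduces the skewness but does not remove it, so it is worth re-checking the coefficient after any transformation.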

Unusual Big Data Use Cases

• Campaign Analytics
• Traffic and Diagnostics
• Efficient Fleet Management
• Intelligent News Discovery