**5 Best Python Libraries For Data Science**

• Numpy

• Scipy

• Pandas

• Matplotlib

• Scikit-learn

**Sampling from a Discrete Distribution**

You are given an n-sided die where side i has probability pi of being rolled. What is the most efficient data structure for simulating rolls of the die?’ This data structure could be used for many purposes. For starters, you could use it to simulate rolls of a fair, six-sided die by assigning probability 16 to each of the sides of the die, or a to simulate a fair coin by simulating a two-sided die where each side has probability 12 of coming up. You could also use this data structure to directly simulate the total of two fair six-sided dice being thrown by having an 11-sided die (whose faces were 2, 3, 4, …, 12), where each side was appropriately weighted with the probability that this total would show if you used two fair dice. However, you could also use this data structure to simulate loaded dice. For example, if you were playing craps with dice that you knew weren’t perfectly fair, you might use the data structure to simulate many rolls of the dice to see what the optimal strategy would be. You could also consider simulating an imperfect roulette wheel in the same way. Outside the domain of game-playing, you could also use this data structure in robotics simulations where sensors have known failure rates. For example, if a range sensor has a 95% chance of giving the right value back, a 4% chance of giving back a value that’s too small, and a 1% chance of handing back a value that’s too large, you could use this data structure to simulate readings from the sensor by generating a random outcome and simulating the sensor reading in that case. The answer I received on Stack Overflow impressed me for two reasons. First, the solution pointed me at a powerful technique called the alias method that, under certain reasonable assumptions about the machine model, is capable of simulating rolls of the die in O(1) time after a simple preprocessing step. Second, and perhaps more surprisingly, this algorithm has been known for decades, but I had not once encountered it! Considering how much processing time is dedicated to simulation, I would have expected this technique to be better- known. A few quick Google searches turned up a wealth of information on the technique, but I couldn’t find a single site that compiled together the intuition and explanation behind the technique. This writeup is my attempt to give a quick survey of various approaches for simulating a loaded die, ranging from simple techniques that are highly impractical to the very optimized and efficient alias method. My hope here is to capture different intuitions about the problem and how each highlights some new aspect of simulating loaded dice. For each approach, my goal is to explore the motivating idea, core algorithm, correctness proof, and runtime analysis (in terms of time, memory, and randomness required).

**Interactive Maps for John Snow’s Cholera Data**

This week, in Istanbul, for the second training on data science, we’ve been discussing classification and regression models, but also visualisation. Including maps. And we did have a brief introduction to the leaflet package …

**Using closures as objects in R**

For more and more clients we have been using a nice coding pattern taught to us by Garrett Grolemund in his book Hands-On Programming with R: make a function that returns a list of functions. This turns out to be a classic functional programming techique: use closures to implement objects (terminology we will explain). It is a pattern we strongly recommend, but with one caveat: it can leak references similar to the manner described in here. Once you work out how to stomp out the reference leaks the ‘function that returns a list of functions’ pattern is really strong. We will discuss this programming pattern and how to use it effectively.

**Pre-CRAN waffle update – isotype pictograms**

It seems Ruben C. Arslan had the waffle idea about the same time I did. Apart from some extra spiffy XKCD-like styling, one other thing his waffling routines allowed for was using FontAwesome icons. When you use an icon vs a block, you are really making a basic version of isotype pictograms. They can add a dimension to the story you’re trying to tell without using any words. I’ve added two parameters to a pre-release CRAN version that I’d like folks to kick the tyres on a bit. Said parameters are use_glyph- which is either FALSE or a character string for a FontAwesome icon (more on that in a bit) — and glyph_size — which is a numeric value for the font size since it won’t scale when the graphic resizes.

**Two things to stop saying about null hypotheses**

There is a currently fashionable way of describing Bayes factors that resonates with experimental psychologists. I hear it often, particularly as a way to describe a particular use of Bayes factors. For example, one might say, ‘I needed to prove the null, so I used a Bayes factor,’ or ‘Bayes factors are great because with them, you can prove the null.’ I understand the motivation behind this sort of language but please: stop saying one can ‘prove the null’ with Bayes factors.