Python: Simplifying the Creation of a Stop Word List With defaultdict
I’ve been playing around with topic models again and recently read a paper by David Mimno which suggested the following heuristic for working out which words should go onto the stop list:

Why is Data now called “Big”?: Part 1
Ever since Man could count we have used data to make sense of the world around us, measuring phenomena through the correcting lens of statistics and facts. The amount of data we’ve traditionally been able to collect and store has been comparatively small, and data has been notoriously difficult to handle once there is too much of it, which is why population censuses are still conducted only rarely. A traditional data analyst’s stock-in-trade was the classic database table, with its neat rows and columns, from which one could extrapolate meaningful insight, albeit from a comparatively small sample that was sort of representative.

Quality of Things
Measurement owes its existence to Earth; estimation of quantity to measurement; calculation to estimation of quantity; balancing of chances to calculation; and victory to balancing of chances.