Who’s Using Big Data in 2015? 3 Big Data Use Cases
Big Data isn’t specific to any one industry; its usefulness is essentially universal. Using Big Data analytics to learn key information about your market and consumer behavior is just as important to financial firms as it is to fast food chains. However, while broad in its application, Big Data can certainly be more valuable to some companies than to others.

Four Different Types of Regression Residuals
When we estimate a regression model, the differences between the actual and “predicted” values for the dependent variable (over the sample) are termed the “residuals”.
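As a concrete illustration, here is a minimal sketch of that definition in Python (assuming NumPy; the toy data and variable names are mine, not from the article): fit a simple linear regression, then take actual minus predicted values.

```python
import numpy as np

# Toy data: a noisy linear relationship between x and y.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

# Fit a simple linear regression and compute the fitted ("predicted") values.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# Ordinary residuals: actual minus predicted values of the dependent variable.
residuals = y - y_hat
print(residuals[:5])
```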

There are no easy charts
Every chart, even if the dataset is small, deserves care. Long-time reader zbicyclist submits the following, which illustrates this point well.

How the JavaScript Heatmap Implementation Works
A heatmap is a powerful way to visualize data. Given a matrix of data, each value is represented by a color. The heatmap algorithm is computationally expensive: for each pixel of the grid you need to compute its color from a set of known values. As you can imagine, it is not feasible to implement it on the client side, because map rendering would be really slow.
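To make that cost concrete, here is a minimal brute-force sketch in Python (my own illustration, not the library’s actual server-side code): every pixel of the grid accumulates a distance-weighted contribution from every known sample, so the work grows as O(pixels × points).

```python
import numpy as np

def heatmap_grid(points, values, width, height, bandwidth=10.0):
    """Brute-force heatmap: each grid pixel accumulates a Gaussian-weighted
    contribution from every known (x, y) sample. Cost is O(width * height * n)."""
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    grid = np.zeros((height, width))
    for (px, py), v in zip(points, values):
        dist2 = (xs - px) ** 2 + (ys - py) ** 2
        grid += v * np.exp(-dist2 / (2.0 * bandwidth ** 2))
    return grid  # map these intensities onto a color ramp to render the heatmap

# Example: three known samples rendered onto a 100x80 grid.
points = [(20, 30), (60, 40), (80, 10)]
values = [1.0, 0.5, 2.0]
grid = heatmap_grid(points, values, width=100, height=80)
```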

5 Steps to Transition Your Career To Analytics
Analytics job skills sought by employers can be broadly classified into seven buckets. Watch for these requirements in the job description, along with some common verbiage to describe each skill:
1. Analytics Skills: “Passion for data analysis supported by personal and professional experiences”, “Strong familiarity with statistical concepts”
2. Tool Skills: “Comfort with Excel and other standard productivity tools”, “Technical skills such as SQL, Python, R/SAS/Stata a plus”, “Knowledge of Omniture, Google Analytics required”
3. Education: “Bachelor’s degree in a quantitative or technical field”
4. Problem-Solving Skills: “Troubleshoot and prevent technical issues”, “Creatively solve problems while operating in a dynamic environment”
5. Communication Skills: “Immaculate written and verbal communication”
6. Functional Background: “Previous work experience in performance marketing, media buying, lead generation, or related spaces”
7. Industry/Work Experience: “Previous work experience in financial services, consulting or other quant/data driven fields”, “Up to 3 years prior work experience”

Quickcheck: Randomized unit testing for R
Hadley Wickham’s testthat package has been a boon for R package authors, making it easy to write tests that verify your code is working correctly, and alerting you when changes you make inadvertently break things.

Network structure and dynamics in online social systems
I rarely work with social network data, but I’m familiar with the standard problems confronting data scientists who work in this area. These include questions pertaining to network structure, viral content, and the dynamics of information cascades.

Here’s Waldo: Computing the optimal search strategy for finding Waldo
As I found myself unexpectedly snowed in this weekend, I decided to take on a weekend project for fun. While searching for something to catch my fancy, I ran across an old Slate article claiming to have found a foolproof strategy for finding Waldo in the classic “Where’s Waldo?” book series. Now, I’m no Waldo-spotting expert, but even I could tell that the strategy proposed there is far from perfect.

Shazam It! Music Processing, Fingerprinting, and Recognition
You hear a familiar song in the club or the restaurant. You listened to this song a thousand times long ago, and its sentimentality really touches your heart. You desperately want to hear it tomorrow, but you can’t remember its name! Fortunately, in our amazing futuristic world, you have a phone with music recognition software installed, and you are saved. You can relax, because the software has told you the name of the song, and you know that you can hear it again and again until it becomes a part of you…or you get sick of it.

Practical Data Science in Python
This notebook accompanies my talk on “Data Science with Python”. Questions & comments welcome @RadimRehurek. The goal of this talk is to demonstrate some high level, introductory concepts behind (text) machine learning. The concepts are accompanied by concrete code examples in this notebook, which you can run yourself on your own computer (after installing IPython; see below). The talk audience is expected to have some basic programming knowledge (though not necessarily Python) and some basic introductory data mining background. This is not an “advanced talk” for machine learning experts. The code examples build a working, executable prototype: an app to classify phone SMS messages in English (well, the “SMS kind” of English…) as either “spam” or “ham” (=not spam).
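For a flavor of what such a prototype looks like, here is a minimal sketch of the general approach (assuming scikit-learn; this is not the notebook’s actual code and uses a toy corpus in place of a real labelled SMS dataset): bag-of-words features feeding a Naive Bayes classifier already give a working spam/ham classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny toy corpus standing in for a real labelled SMS dataset.
messages = [
    "WINNER!! Claim your free prize now",
    "Are we still meeting for lunch today?",
    "URGENT: your account has been selected for a cash reward",
    "ok, see you at 7",
]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["free cash prize, claim now"]))   # likely 'spam'
print(model.predict(["see you at lunch tomorrow"]))    # likely 'ham'
```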

A Machine Learning Result
Learning to use any of the dozens of popular machine learning algorithms effectively requires mastering many details and dealing with all kinds of practical issues. With all of this to consider, it might not be apparent to someone coming to machine learning from a background other than computer science or applied math that there are some easily accessible and very useful “theoretical” results. In this post, we will look at an upper bound result carefully described in section 2.2.1 of Schapire and Freund’s book “Boosting: Foundations and Algorithms”. (This book, published in 2012, is destined to be a classic. In the course of developing a thorough treatment of boosting algorithms, Schapire and Freund provide a compact introduction to the foundations of machine learning and relate the boosting algorithm to some exciting ideas from game theory, optimization, and information geometry.)
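To give a feel for the kind of bound involved, here is the standard textbook statement for a finite hypothesis class (stated from the general result, which may differ in detail from the exact form in section 2.2.1): if a hypothesis h drawn from a finite class H is consistent with m i.i.d. training examples, then with probability at least 1 − δ over the sample,

```latex
% Generalization bound for a finite hypothesis class H,
% assuming h is consistent with the m training examples:
\operatorname{err}(h) \;\le\; \frac{1}{m}\left(\ln|H| + \ln\frac{1}{\delta}\right)
```

In words: the true error shrinks linearly in the sample size and only logarithmically in the size of the hypothesis class and the confidence parameter.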

Why the Future of Digital Asset Management Hinges on Big Data
One of the popular applications of cloud computing is digital asset management, or DAM for short. For the uninitiated, DAM refers to the process of storing, cataloguing, searching, and delivering digital files, mainly audio, video, images, and office documents. DAM is an extremely critical element of businesses, like media and journalism, that deal with lots of content. A typical media house, especially one that deals with video, owns digital assets running to dozens of terabytes. In terms of data volume, it is common for these media houses to own north of a hundred thousand files. Tagging every file, along with storing and retrieving them all, is a lot of work.