So, You Need a Statistically Significant Sample?
The most common way to determine the necessary sample size is through a prospective power analysis, a classic technique for deriving the sample size needed to detect an effect of a given size at a given significance level and statistical power. The purpose of this post is to demystify power analysis for those who are new to data science, and to provide some tips that make it easier to determine the necessary sample size.
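To make the idea concrete, here is a minimal sketch of a prospective power calculation for comparing the means of two equal-sized groups. It is not taken from the post itself: the function name `sample_size_per_group`, the default significance level (0.05), and the default power (0.80) are illustrative assumptions, and it uses the normal approximation rather than the exact t-distribution.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    effect_size is the standardized difference (Cohen's d). The defaults for
    alpha and power are conventional choices, assumed here for illustration.
    Uses the normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")

    std_normal = NormalDist()  # mean 0, standard deviation 1
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value, two-sided test
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # Smaller effects need far larger samples: n scales with 1 / d^2.
    for d in (0.2, 0.5, 0.8):
        print(f"d = {d}: n per group = {sample_size_per_group(d)}")
```

Under this approximation a medium effect (d = 0.5) needs about 63 observations per group. An exact t-based calculation gives a slightly larger number, so treat the result as a lower bound when planning.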
Respecting Real-World Decision Making and Rejecting Models That Do Not: No MaxDiff or Best-Worst Scaling
As this link illustrates, Sawtooth’s MaxDiff provides an instructive example of reification in marketing research. What is the contribution of ‘clean bathrooms’ when selecting a fast food restaurant? When using the drive-thru window, the cleanliness of the bathrooms is never considered, yet that is not how we answer that self-report question, either on a rating scale or in a best-worst choice exercise. Actual usage never enters the equation. Instead, the wording of the question invites us to enter a Platonic world of ideals inhabited by abstract concepts of ‘clean bathrooms’ and ‘reasonable prices’, where everything can be aligned on a stable and context-free utility scale. We ‘trade off’ the semantic meanings of these terms, with the format of the question shaping our response. Such is the nature of self-reports (see especially the Self-Reports paper from 1999).
The 22 Skills of a Data Scientist…
• Algorithms (ex: computational complexity, CS theory) DD, DR
• Back-End Programming (ex: Java/Rails/Objective-C) DC, DD
• Bayesian/Monte-Carlo Statistics (ex: MCMC, BUGS) DD, DR
• Big and Distributed Data (ex: Hadoop, Map/Reduce) DB, DC, DD
• Business (ex: management, business development, budgeting) DB
• Classical Statistics (ex: general linear model, ANOVA) DB, DC, DR
• Data Manipulation (ex: regexes, R, SAS, web scraping) DC, DR
• Graphical Models (ex: social networks, Bayes networks) DD, DR
• Machine Learning (ex: decision trees, neural nets, SVM, clustering) DC, DD
• Math (ex: linear algebra, real analysis, calculus) DD, DR
• Optimization (ex: linear, integer, convex, global) DD, DR
• Product Development (ex: design, project management) DB
• Science (ex: experimental design, technical writing/publishing) DC, DR
• Simulation (ex: discrete, agent-based, continuous) DD, DR
• Spatial Statistics (ex: geographic covariates, GIS) DC, DR
• Structured Data (ex: SQL, JSON, XML) DC, DD
• Surveys and Marketing (ex: multinomial modeling) DC, DR
• Systems Administration (ex: *nix, DBA, cloud tech.) DC, DD
• Temporal Statistics (ex: forecasting, time-series analysis) DC, DR
• Unstructured Data (ex: NoSQL, text mining) DC, DD
• Visualisation (ex: statistical graphics, mapping, web-based data-viz) DC, DR
How to determine the quality and correctness of classification models? Part 2 – Quantitative quality indicators
• Basic quantitative quality indicators
• Derived quality indicators
• Example: how to select the appropriate indicator
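The outline above does not say which indicators the post covers, so the following is only a sketch under assumptions: it treats the four confusion-matrix counts as the "basic" indicators and a handful of common ratios (accuracy, precision, recall, specificity, F1) as the "derived" ones. The function names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def derived_indicators(tp, fp, tn, fn):
    """Compute common ratios from the four basic counts.

    A ratio with a zero denominator is reported as NaN rather than raising.
    """
    def ratio(num, den):
        return num / den if den else float("nan")

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy labels, invented purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
indicators = derived_indicators(*counts)
print(counts)
print(indicators)
```

Which indicator is appropriate depends on the cost of each error type: recall matters most when missed positives are expensive, precision when false alarms are.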
A quick, incomplete comparison of ggplot2 & rbokeh plotting idioms
I set aside a bit of time to give rbokeh a try and figured I’d share a short piece of code that shows how to make the ‘same’ chart in both ggplot2 and rbokeh.