New Toy: SAS® University Edition
So I started using SAS® University Edition, which is a FREE version of SAS® software. Again, it’s FREE, and that’s the main reason why I want to relearn the language. The software was announced on March 24, 2014, and the download became available in May of that year. For that, I salute Dr. Jim Goodnight. At least we can now learn SAS® without paying the expensive price tag, which matters especially for a single user like me. The software requires a 64-bit processor and a virtual machine, on top of which it runs. To install, just follow the instructions in this video. Although the installation in the video is done on Windows, it also works on Mac. Below is a screenshot of my SAS® Studio running on Safari.

Python: Getting Started with Data Analysis
Analysis with Programming has recently been syndicated to Planet Python. As my first post as a contributing blog on that site, I would like to share how to get started with data analysis in Python.

spaCy: Industrial-strength NLP
spaCy is a new library for text processing in Python and Cython. I wrote it because I think small companies are terrible at natural language processing (NLP). Or rather: small companies are using terrible NLP technology. To do great NLP, you have to know a little about linguistics, a lot about machine learning, and almost everything about the latest research. The people who fit this description seldom join small companies. Most are broke – they’ve just finished grad school. If they don’t want to stay in academia, they join Google, IBM, etc.

Python: Notebook Gallery
Links to the best IPython and Jupyter Notebooks

Hierarchical log odds model example
I am working through Bayesian Approaches to Clinical Trials and Health-Care Evaluation (David J. Spiegelhalter, Keith R. Abrams, Jonathan P. Myles) (referred to as BACTHCE from here on). In chapter three I saw an example (3.13) where I wanted to do the calculations myself. The example concerns a hierarchical model of death rates which is calculated via a normal approximation of odds ratios.
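The normal approximation mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reproduction of example 3.13: the function names are hypothetical, the log odds ratio variance uses the standard large-sample formula (sum of reciprocal cell counts), and the between-trial variance tau2 is treated as known rather than estimated.

```python
import math

def log_odds_ratio(deaths_t, n_t, deaths_c, n_c):
    """Return (estimate, variance) of the log odds ratio for one trial.

    Uses the large-sample normal approximation; all four cells of the
    2x2 table must be non-zero.
    """
    a, b = deaths_t, n_t - deaths_t   # treatment arm: deaths, survivors
    c, d = deaths_c, n_c - deaths_c   # control arm: deaths, survivors
    estimate = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return estimate, variance

def pooled_mean(estimates, variances, tau2):
    """Precision-weighted mean of trial effects for an ASSUMED known tau2."""
    weights = [1 / (v + tau2) for v in variances]
    mean = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    return mean, 1 / sum(weights)

def shrink(estimate, variance, mu, tau2):
    """Posterior mean of one trial effect, conditional on mu and tau2.

    The observed estimate is pulled toward the overall mean mu by the
    factor b = variance / (variance + tau2).
    """
    b = variance / (variance + tau2)
    return b * mu + (1 - b) * estimate
```

With tau2 = 0 every trial is shrunk completely to the pooled mean, and as tau2 grows each trial keeps more of its own estimate, which is the basic trade-off in a hierarchical model of this kind.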

9 Generic Big Data Use Cases to Apply in Your Organization
Big Data means something different for every organization and every industry. What Big Data can do for your organization depends on the type of company, the amount of data that you have, the industry that you are in and a whole lot of other variables. Whenever I advise organizations on their Big Data strategy, this is the main problem: there are so many different possibilities that it is often a struggle to find the right use case to develop into a Proof of Concept. That’s why I have developed the Big Data Use Case framework, to help organizations understand the different possibilities of Big Data and what it can do for their business. The framework divides 9 generic Big Data use cases into three different pillars:
• Your Customers;
• Your Product;
• Your Organization.
For each pillar there are three Big Data use cases that can be defined, which are relevant for all organizations across all industries.

PELTing a Competing Changepoint Algorithm
This post will demonstrate the PELT algorithm from the changepoint package, a competing algorithm to the Twitter package’s breakout detection algorithm. While neither of these algorithms produces satisfactory results, one change point location approximation algorithm that makes no distributional assumptions shows potentially promising results.

Best Time to Learn Linear Algebra is Now!
Linear Algebra is a crucial prerequisite for many things, including Statistics, Data Mining, Machine Learning, Computer Vision, Image Processing and many others, so it’s very important to know the basics of Linear Algebra to understand more advanced concepts. For example, it’s really helpful for our IT4BI studies, especially for the specialization at TU Berlin. And the best time to learn Linear Algebra or refresh your knowledge of it is right now! At this moment there are a couple of nice MOOCs that have just started, and a few more are about to start in the near future. Even if you don’t join right now, they should be available later as self-paced versions. Additionally, I would like to include my favorite video courses on Linear Algebra, which are also for learning at your own speed with no deadlines.

Random Correlations
I’ve noticed that people frequently misuse data to find correlations between seemingly unrelated data sets and then infer a relationship. While they’ll generally volunteer that they haven’t proven causality, they frequently claim that there must be some relationship for the p-value to be so low. I built a toy to try to show the error in this. Essentially, you can take almost any real life data and infer a relationship. Here I took a number of data sets from Quandl to show that there are relationships with very low p-values just by chance.
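The effect is easy to reproduce without any real data. The sketch below is not the toy from the post: it replaces the Quandl data sets with simulated independent random walks (an assumption made purely for illustration), and it approximates the p-value with the Fisher z-transform rather than an exact t-test. All function names are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transform (normal approximation).

    This treats the n observations as independent, which is exactly the
    assumption that trending series violate.
    """
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def random_walk(length, rng):
    """Cumulative sum of independent standard normal steps."""
    level, path = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        path.append(level)
    return path

def count_spurious(num_series=20, length=100, alpha=0.001, seed=0):
    """Count pairs of unrelated walks whose naive p-value falls below alpha."""
    rng = random.Random(seed)
    series = [random_walk(length, rng) for _ in range(num_series)]
    hits = pairs = 0
    for i in range(num_series):
        for j in range(i + 1, num_series):
            pairs += 1
            r = pearson(series[i], series[j])
            if approx_p_value(r, length) < alpha:
                hits += 1
    return hits, pairs

if __name__ == "__main__":
    hits, pairs = count_spurious()
    print(f"{hits} of {pairs} unrelated pairs look 'significant'")
```

Every series here is generated independently, so any pair that passes the threshold is a false positive by construction. Comparing many pairs and ignoring the trend within each series are both enough to produce them.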

Understanding the Google Analytics Cohort Report
A very common data analysis technique is called Cohort Analysis. A Cohort is simply a segment of users which is based on a date. For example, a cohort could be all users based on their Acquisition Date (in Google Analytics this is really the Date of First Session). Another cohort might be all users that completed their first transaction during a specific time period. This is a very common cohort used in ecommerce businesses. You’ll commonly hear ecommerce companies talk about the performance of new customers acquired during the holiday shopping season. This is simply a cohort: all customers whose first transaction occurred between Thanksgiving and Christmas (or some day before Christmas). I’ve written about cohorts before. But to be honest, a lot of other analytics tools have been hard on Google Analytics for its lack of Cohort functionality, and that was well deserved! For a long time the only way to do cohort analysis in Google Analytics was via segmentation, but that was really a hack. Now Google Analytics has a real Cohort report that makes it much easier to perform cohort analysis.
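To make the definition concrete, here is a minimal Python sketch of an acquisition-date cohort. It does not use the Google Analytics API or mirror how the Cohort report is computed internally. It assumes a hypothetical list of (user, session date) pairs and hypothetical function names, and simply groups users by the date of their first session and measures how many return on later days.

```python
from collections import defaultdict
from datetime import date

def first_session_dates(sessions):
    """Map each user to the date of their first session (their cohort)."""
    first_seen = {}
    for user, day in sessions:
        if user not in first_seen or day < first_seen[user]:
            first_seen[user] = day
    return first_seen

def retention_table(sessions):
    """Return {(cohort_date, days_since_acquisition): share of cohort active}."""
    first_seen = first_session_dates(sessions)

    # Collect the distinct users active at each offset within each cohort.
    active = defaultdict(set)
    for user, day in sessions:
        cohort = first_seen[user]
        offset = (day - cohort).days
        active[(cohort, offset)].add(user)

    # Cohort size is the number of users acquired on that date.
    cohort_sizes = defaultdict(int)
    for cohort in first_seen.values():
        cohort_sizes[cohort] += 1

    return {
        key: len(users) / cohort_sizes[key[0]]
        for key, users in active.items()
    }

if __name__ == "__main__":
    # Hypothetical sessions, for illustration only.
    sessions = [
        ("a", date(2015, 1, 1)),
        ("b", date(2015, 1, 1)),
        ("a", date(2015, 1, 2)),
        ("c", date(2015, 1, 2)),
    ]
    for (cohort, offset), share in sorted(retention_table(sessions).items()):
        print(cohort, f"day {offset}", f"{share:.0%}")
```

Swapping the first session date for the first transaction date gives the ecommerce cohort described above without changing the rest of the logic.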

Programming for Data Science the Polyglot approach: Python + R + SQL
In this post, I discuss a possible new approach to teaching Programming for Data Science. Programming for Data Science is focussed on the R vs. Python question. Everyone seems to have a view, including the venerable Nature journal (Programming – Pick up Python). Here, I argue that we should look beyond the Python vs. R debate and teach R, Python and SQL together. To do this, we need to look at the big picture first (the problem we are solving in Data Science) and then see how that problem is broken down and solved by different approaches. In doing so, we can more easily master multiple approaches and then even combine them if needed.