Rexer Analytics Survey Results
Rexer Analytics has released preliminary results showing the usage of various data science tools. I’ve added the results to my continuously-updated article, The Popularity of Data Analysis Software. For your convenience, the new section is repeated below.
Quick Guide to learn Statistics for R Users (with Titanic Data Set)
Many people are keen to pursue a career as a data scientist. And why shouldn’t they be? After all, it comes with the pride of holding the sexiest job of this century. But to become one, you must master statistics in depth. Statistics lies at the heart of data science. Think of it as the first brick laid to build a monument: you can’t build a great monument without a strong foundation. Experts say, ‘If you struggle with deciphering statistical coefficients, you’ll have a tough time ahead in data science’. Don’t worry! I will help you decipher some of these statistical mysteries in this post. I will use R to perform the calculations, but you can use any tool you prefer.
A Step-by-Step Plan for Getting Your Company Started with Predictive Analytics – Part 2
In Part 1 we covered our fundamental strategy for getting started: find a single skill set your initial team can focus on to create the fastest returns. We suggested that this skill set was predictive modeling and laid out four steps for implementing that Phase 1. Here, in the final Part 2, we show the remaining four phases and lay out an implementation sequence and the rationale for following this path.
Python: Creating interactive crime maps with Folium
Folium is a powerful Python library that helps you create several types of Leaflet maps. Because Folium’s output is interactive, the library is very useful for dashboard building. To get an impression, just zoom and click around on the next map. The Folium GitHub repository contains many other examples.
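As a rough illustration of the kind of map described above, here is a minimal sketch in Python. The incident records, their coordinates and the helper names (`INCIDENTS`, `map_center`, `build_crime_map`) are made up for this example and do not come from the original post; only `folium.Map`, `folium.Marker` and `add_to` are Folium calls.

```python
# Hypothetical incident records: coordinates and categories are invented
# purely for illustration, not taken from any real crime data set.
INCIDENTS = [
    {"lat": 37.7749, "lon": -122.4194, "category": "Theft"},
    {"lat": 37.7793, "lon": -122.4193, "category": "Vandalism"},
    {"lat": 37.7680, "lon": -122.4300, "category": "Burglary"},
]


def map_center(records):
    """Return [lat, lon] averaged over all records, used to center the map."""
    lats = [r["lat"] for r in records]
    lons = [r["lon"] for r in records]
    return [sum(lats) / len(lats), sum(lons) / len(lons)]


def build_crime_map(records, zoom_start=13):
    """Build an interactive Leaflet map with one clickable marker per record."""
    import folium  # third-party dependency: pip install folium

    fmap = folium.Map(location=map_center(records), zoom_start=zoom_start)
    for r in records:
        # The popup text appears when the marker is clicked.
        folium.Marker(
            location=[r["lat"], r["lon"]],
            popup=r["category"],
        ).add_to(fmap)
    return fmap
```

Calling `build_crime_map(INCIDENTS).save("crime_map.html")` should write a standalone HTML file that can be opened in a browser or embedded in a dashboard; the file name is arbitrary.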
pdc: An R Package for Complexity-Based Clustering of Time Series
Permutation distribution clustering is a complexity-based approach to clustering time series. The dissimilarity of time series is formalized as the squared Hellinger distance between the permutation distributions of the embedded time series. The resulting distance measure has linear time complexity, is invariant to phase and monotonic transformations, and robust to outliers. A probabilistic interpretation allows the determination of the number of significantly different clusters. An entropy-based heuristic relieves the user of the need to choose the parameters of the underlying time-delayed embedding manually and, thus, makes it possible to regard the approach as parameter-free. This approach is illustrated with examples on empirical data.
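To make the dissimilarity measure concrete, here is a small Python sketch of the idea; it is not the pdc package itself. It assumes the common convention H²(P, Q) = ½ Σ (√pᵢ − √qᵢ)² for the squared Hellinger distance, and it fixes the embedding dimension `m` and `delay` by hand rather than using the entropy-based heuristic. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def permutation_distribution(series, m=3, delay=1):
    """Relative frequencies of ordinal patterns in a time-delay embedding."""
    span = (m - 1) * delay
    if len(series) <= span:
        raise ValueError("series is too short for this embedding")
    counts = Counter()
    for start in range(len(series) - span):
        window = [series[start + k * delay] for k in range(m)]
        # Ordinal pattern: window positions ordered by value
        # (ties are broken by position because sorted() is stable).
        pattern = tuple(sorted(range(m), key=lambda k: window[k]))
        counts[pattern] += 1
    total = sum(counts.values())
    return {pattern: c / total for pattern, c in counts.items()}


def squared_hellinger(p, q):
    """Squared Hellinger distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(
        (sqrt(p.get(k, 0.0)) - sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )


def pd_dissimilarity(a, b, m=3, delay=1):
    """Dissimilarity of two series via their permutation distributions."""
    return squared_hellinger(
        permutation_distribution(a, m, delay),
        permutation_distribution(b, m, delay),
    )
```

Because only the rank order inside each window matters, a strictly increasing transformation of a series (for example cubing every value) leaves its permutation distribution unchanged, which is the invariance to monotonic transformations mentioned above. For a fixed `m`, each series is processed in a single pass. A matrix of pairwise dissimilarities could then be passed to any standard hierarchical clustering routine.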
Programmatically create interactive Powerpoint slides with R
PowerPoint is a powerful application for creating presentations, and allows you to include all sorts of text, pictures, animations and interactivity to create a compelling story. Most of the time you’ll use the PowerPoint application to create slides, but if you want to include data or charts in your slides, you may want to automate the slide creation process in the interests of reproducibility. By using the R language with the PowerPoint API, you can recreate your slides in an instant whenever your data changes.