p-hacking, or cheating on a p-value
Yesterday evening, I discovered some interesting slides on False-Positives, p-Hacking, Statistical Power, and Evidential Value, via @UCBITSS’s post on Twitter. More precisely, there was this slide on how to cheat (because that’s basically what it is) to get a ‘good’ model (by targeting the p-value).
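To see why targeting the p-value amounts to cheating, here is a minimal simulation sketch. It is not taken from the slides: the z-test with known unit variance, the sample sizes, and the function names are all assumptions made for illustration. Every simulated study has no real effect, yet reporting only the best of several outcome measures pushes the false-positive rate well above the nominal 5%.

```python
import random
from statistics import NormalDist, mean


def p_value_two_sided(sample):
    """Two-sided z-test of H0: mean = 0, assuming known unit variance."""
    z = mean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_outcomes, n_studies=2000, n_obs=30, alpha=0.05, seed=1):
    """Share of null studies declared 'significant' when only the smallest
    p-value among n_outcomes independent outcome measures is reported."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # All data are pure noise, so any 'significant' result is a false positive.
        p_min = min(
            p_value_two_sided([rng.gauss(0, 1) for _ in range(n_obs)])
            for _ in range(n_outcomes)
        )
        hits += p_min < alpha
    return hits / n_studies


honest = false_positive_rate(1)    # one pre-specified outcome
hacked = false_positive_rate(10)   # pick the best of ten outcomes
print(f"one outcome:          {honest:.3f}")
print(f"best of ten outcomes: {hacked:.3f}")
```

In theory the first rate should sit near 0.05 and the second near 1 - 0.95^10, or about 0.40; the simulated values will vary a little around those figures.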
DataPyR is an attempt to create a comprehensive curated collection of any and every possible useful resource for Python, R and data science.
The core Python packages you need to know for data science
SciPy (pronounced ‘Sigh Pie’) is a Python-based ecosystem of open-source software for mathematics, science, and engineering. In particular, these are some of the core packages:
• NumPy – Base N-dimensional array package
• SciPy library – Fundamental library for scientific computing
• Matplotlib – Comprehensive 2D Plotting
• IPython – Enhanced Interactive Console
• SymPy – Symbolic mathematics
• pandas – Data structures & analysis
Visualizations: Comparing Tableau, SPSS, R, Excel, Matlab, JS, Python, SAS
Identifying the OS from R
Sometimes a bit of R code needs to know what operating system it’s running on. Here’s a short account of where you can find this information and a little function to wrap the answer up neatly. Operating systems are a platform issue, so let’s start with the constants in the list .Platform. For Windows the OS.type is just ‘windows’, but confusingly it’s ‘unix’ for Unix, Linux, and my Mac OSX laptop. To be fair that’s because OSX is built on a base of tweaked BSD Unix. But it does seem that .Platform won’t distinguish OSX from a more traditional Unix or Linux machine.
The Fundamental Theorem of Linear Algebra
‘Strang’s diagram’ is a diagram that shows the action of A, an m×n matrix, as a linear transformation from the space R^n to R^m. The diagram helps to understand the fundamental concepts of Linear Algebra in terms of the four subspaces by visually illustrating the action of A on each of these subspaces.
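The four subspaces can be checked numerically on a small case. The sketch below is an illustration only: the matrix A and the helper names (rref, null_space, dot) are assumptions, and exact fractions are used instead of a numerical library so that the orthogonality checks hold exactly. For an m×n matrix of rank r, the row space and null space live in R^n with dimensions r and n - r, while the column space and left null space live in R^m with dimensions r and m - r.

```python
from fractions import Fraction


def rref(rows):
    """Return (reduced row echelon form, pivot column indices), exact arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    pivots, r = [], 0
    for c in range(len(m[0])):
        pivot_row = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot_row is None:
            continue  # no pivot in this column, so it is a free column
        m[r], m[pivot_row] = m[pivot_row], m[r]
        m[r] = [x / m[r][c] for x in m[r]]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                factor = m[i][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
        if r == len(m):
            break
    return m, pivots


def null_space(rows):
    """Basis of {x : A x = 0}, one basis vector per free column."""
    reduced, pivots = rref(rows)
    n = len(rows[0])
    basis = []
    for free in (c for c in range(n) if c not in pivots):
        vec = [Fraction(0)] * n
        vec[free] = Fraction(1)
        for row, p in zip(reduced, pivots):
            vec[p] = -row[free]
        basis.append(vec)
    return basis


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical 2x3 example of rank 1 (second row is twice the first).
A = [[1, 2, 3],
     [2, 4, 6]]
A_T = [list(col) for col in zip(*A)]
m, n = len(A), len(A[0])

rank = len(rref(A)[1])
N = null_space(A)      # null space of A, a subspace of R^n
LN = null_space(A_T)   # left null space of A, a subspace of R^m

print("rank:", rank)
print("dim null space:", len(N), "= n - r =", n - rank)
print("dim left null space:", len(LN), "= m - r =", m - rank)
# Null space is orthogonal to the row space; left null space to the column space.
print(all(dot(row, x) == 0 for row in A for x in N))
print(all(dot(col, y) == 0 for col in A_T for y in LN))
```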
Comparing all the treatments
This story didn’t get into the local media, but I’m writing about it because it illustrates the benefit of new statistical methods, something that’s often not visible to outsiders.
Spotting Potential Battles in F1 Races
Over the last couple of races, I’ve started trying to review a variety of battlemaps for various drivers in each race. Prompted by an email request for more info around the battlemaps, I generated a new sketch charting the on-track gaps between each driver and the lap leader for each lap of the race (How the F1 Canadian Grand Prix Race Evolved on Track). Colour is used to distinguish cars on the lead lap from lapped drivers. For lapped drivers, a count of how many laps they are behind the leader is displayed. I additionally overplot a highlight for a specified driver, and add a mark that shows the on-track position of the leader of the next lap, along with their driver code.
GPU-Accelerated R in the Cloud with Teraproc Cluster-as-a-Service
The examples in this post build on the excellent work of Mr. Chi Yau available at r-tutor.com. Chi is the author of the CRAN open-source rpud package as well as rpudplus, R libraries that make it easy for developers to harness the power of GPUs without programming directly in CUDA C++. To learn more about R and parallel programming with GPUs you can download Chi’s e-book. For illustration purposes, I’ll focus on an example involving distance calculations and hierarchical clustering, but you can use the rpud package to accelerate a variety of applications.
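For readers unfamiliar with the workload, here is a plain CPU sketch of the two steps mentioned: a pairwise distance matrix followed by hierarchical clustering. It is not the rpud API and involves no GPU; the single-linkage rule, the function names, and the sample points are all assumptions chosen for illustration. The nested loops over pairs of points are where the cost grows quickly, which is the kind of work a GPU can take on.

```python
from math import dist


def distance_matrix(points):
    """Euclidean distance between every pair of points (O(n^2) work)."""
    return [[dist(p, q) for q in points] for p in points]


def single_linkage(points, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    whose closest members are nearest, until n_clusters remain."""
    d = distance_matrix(points)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = min(d[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Made-up 2D points: three near the origin, two near (5, 5).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
clusters = single_linkage(points, 2)
print(clusters)
```

The result lists point indices per cluster, so the two well-separated groups should come out as indices 0 to 2 and 3 to 4.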