R Journal – Volume 7/1, June 2015
• Peptides: A Package for Data Mining of Antimicrobial Peptides
• fanplot: An R Package for Visualising Sequential Distributions
• sparkTable: Generating Graphical Tables for Websites and Documents with R
• rdrobust: An R Package for Robust Inference in Regression-Discontinuity Designs
• Frames2: A Package for Estimation in Dual Frame Surveys
• The Complex Multivariate Gaussian Distribution
• sae: An R Package for Small Area Estimation
• showtext: Using System Fonts in R Graphics
• Correspondence Analysis on Generalised Aggregated Lexical Table (CA-GALT) in the FactoMineR Package
• Implementing Persistent O(1) Stacks and Queues in R
• R as an Environment for Reproducible Analysis of DNA Amplification Experiments
• The gridGraphics Package
• fslr: Connecting the FSL Software with R
• Identifying Complex Causal Dependencies in Configurational Data with Coincidence Analysis
• Manipulation of Discrete Random Variables with discreteRV
• Estimability Tools for Package Developers
• R Foundation News
• Changes on CRAN
• Changes in R
• News from the Bioconductor Project
6 Tips for Being an Awesome Data Scientist
Here’s our list of the non-technical aspects of being a great Data Scientist:
• Excellent Communication Skills
• Knows THE Data, Not Just About Data
• Crunch Outside the Box
• Knows When to Stick to the Basics
• Don’t Reinvent the Wheel
• Have a Process in Place
Difference between Machine Learning & Statistical Modeling
In this article, I will try to bring out the difference between the two to the best of my understanding. I encourage more seasoned folks in this industry to add to this article.
How to score data in Hadoop/Hive in a flash
Any reference to Hadoop implies a huge amount of data. The intent of the data is, of course, to derive insights that will help businesses stay competitive. ‘Scoring’ the data is a common exercise in determining, e.g., customer churn, fraud detection, risk mitigation, etc. It is one of the slowest analytics activities, especially when a very large data set is involved. There are various fast scoring products on the market, but they are very specialized and/or provided by a single vendor, usually requiring the entire scoring process to be done using its tool set. This poses a problem for those who build their scoring models using tools other than those of the scoring engine vendor.
Python: Seaborn: statistical data visualization
Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
Hot news detection using Wikipedia
In this post, I show how we can use the number of Wikipedia article page views to determine whether a news story is hot. This approach is fully data-driven and does not need any human supervision.
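To make the page-view idea concrete, here is a minimal sketch of one possible rule: flag the most recent day as hot when its view count sits well above the recent baseline. The z-score rule, the `is_hot` name, and the `threshold` default are all assumptions made for illustration; the post itself may use a different method, and the counts in the usage lines are made up.

```python
from statistics import mean, stdev
from typing import Sequence


def is_hot(daily_views: Sequence[int], threshold: float = 3.0) -> bool:
    """Return True if the latest day's views spike above the baseline.

    Hypothetical rule: the last value is compared against the mean and
    standard deviation of all earlier values (a simple z-score test).
    """
    if len(daily_views) < 3:
        raise ValueError("need at least two baseline days plus the latest day")

    *baseline, latest = daily_views
    mu = mean(baseline)
    sigma = stdev(baseline)

    # A perfectly flat baseline has no spread, so any increase counts.
    if sigma == 0:
        return latest > mu

    return (latest - mu) / sigma > threshold


# Made-up daily page-view counts for two imaginary articles.
quiet_article = [100, 110, 90, 105, 95, 102]
spiking_article = [100, 110, 90, 105, 95, 900]
print(is_hot(quiet_article), is_hot(spiking_article))
```

A rule like this needs no labelled examples, which fits the "no human supervision" claim, but the threshold would have to be tuned on real traffic.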
Geographic visualization with R’s ggmap
Have you ever crunched some numbers on data that involved spatial locations? If the answer is no, then boy are you missing out! So much spatial data to analyze and so little time. Since your time is precious, you know that attempting to create spatial plots in languages like Matlab or applications like Excel can be a tedious, long process. Thankfully there are a number of new R libraries being created to make spatial data visualization a more enjoyable endeavor. Of these new options, one useful package is ggmap.
Visualising High-Dimensional Data
Visualising data is important for aiding intuition & good understanding, but high-dimensional datasets can be hard to display. Here we demonstrate techniques to tackle the issue.
Code is Beautiful!
At QuantifiedCode, we help Python developers to write better code through data. Engaging visualizations of source code are an important element for this. With this project, we want to provide an open source collection of beautiful and engaging code visualizations.
Random Forest Classifiers as a Web Service in PHP
In this post, we’ll walk through all of the code necessary to export a random forest classifier from R and use it to make real-time online predictions in a PHP script.
Exploring SparkR
A colleague from work asked me to investigate Spark and R, so the most obvious thing to do was to look into SparkR. I installed Scala, Hadoop, Spark and SparkR… not sure Hadoop is needed for this… but I wanted to have the full picture. Anyway… I came across a piece of code that reads lines from a file and counts how many lines contain an ‘a’ and how many contain a ‘b’…
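For readers who want to see the logic of that line-counting example without a cluster, here is a minimal sketch in plain Python. It is not the SparkR code the post refers to; the `count_lines_containing` helper and the sample lines are hypothetical, and the same two filters would run in parallel across partitions in Spark.

```python
from typing import Iterable, Tuple


def count_lines_containing(
    lines: Iterable[str], first: str = "a", second: str = "b"
) -> Tuple[int, int]:
    """Count the lines that contain `first` and the lines that contain `second`."""
    first_count = 0
    second_count = 0
    for line in lines:
        # A line containing both characters is counted once in each total.
        if first in line:
            first_count += 1
        if second in line:
            second_count += 1
    return first_count, second_count


# Hypothetical in-memory stand-in for the lines of a text file.
sample = ["apache spark", "hadoop", "big data", "r"]
num_a, num_b = count_lines_containing(sample)
print(f"Lines with a: {num_a}, lines with b: {num_b}")
```

To run it on a real file, pass an open file handle instead of the list, for example `with open(path) as fh: count_lines_containing(fh)`, where `path` is whatever file you want to scan.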