Top 20 R packages by popularity
1. Rcpp: Seamless R and C++ Integration
2. ggplot2: An Implementation of the Grammar of Graphics
3. stringr: Simple, Consistent Wrappers for Common String Operations
4. plyr: Tools for Splitting, Applying and Combining Data
5. digest: Create Cryptographic Hash Digests of R Objects
6. reshape2: Flexibly Reshape Data: A Reboot of the Reshape Package
7. colorspace: Color Space Manipulation
8. RColorBrewer: ColorBrewer Palettes
9. manipulate: Interactive Plots for RStudio
10. scales: Scale Functions for Visualization
11. labeling: Axis Labeling
12. proto: Prototype object-based programming
13. munsell: Munsell colour system
14. gtable: Arrange grobs in tables
15. dichromat: Color Schemes for Dichromats
16. mime: Map Filenames to MIME Types
17. RCurl: General network (HTTP/FTP/…) client interface for R
18. bitops: Bitwise Operations
19. zoo: S3 Infrastructure for Regular and Irregular Time Series
20. knitr: A General-Purpose Package for Dynamic Report Generation in R
An Attempt to Understand Boosting Algorithm(s)
Boosting is a machine learning ensemble meta-algorithm that primarily reduces bias, and also variance, in supervised learning. The term also covers a family of machine learning algorithms that convert weak learners into strong ones.
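To make "converting weak learners to strong ones" concrete, here is a minimal sketch of AdaBoost with one-dimensional threshold stumps. AdaBoost is only one member of the boosting family, and the excerpt does not say which algorithms the post covers, so the choice of algorithm, the toy data and all function names below are assumptions made for illustration.

```python
import math

def stump_predict(x, thr, sign):
    # A stump is a weak learner: it predicts `sign` on one side of the threshold.
    return sign if x <= thr else -sign

def fit_stump(X, y, w):
    """Return (weighted_error, threshold, sign) of the best stump under weights w."""
    best = None
    for thr in sorted(set(X)):
        for sign in (1, -1):
            err = sum(wi for wi, xi, yi in zip(w, X, y)
                      if stump_predict(xi, thr, sign) != yi)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(X, y, rounds):
    """Fit `rounds` stumps, reweighting the examples after each one."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        err, thr, sign = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, thr, sign))
        # Misclassified examples gain weight, correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, thr, sign))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    # The strong learner is the sign of the weighted vote of all stumps.
    score = sum(alpha * stump_predict(x, thr, sign) for alpha, thr, sign in ensemble)
    return 1 if score >= 0 else -1

def training_errors(ensemble, X, y):
    return sum(predict(ensemble, xi) != yi for xi, yi in zip(X, y))

# Toy data (made up): no single threshold separates the two classes.
X = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

one_stump = adaboost(X, y, rounds=1)
three_stumps = adaboost(X, y, rounds=3)
print(training_errors(one_stump, X, y), training_errors(three_stumps, X, y))
```

On this toy set a single stump cannot avoid misclassifying some points, while the weighted vote of three stumps fits the training data. The reweighting step is what forces each new stump to focus on the examples the previous ones got wrong.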
If you reduce the Bayesian posterior to a point estimate and an interval, you can compare Bayesian and frequentist results, but in doing so you discard useful information and lose what I think is the most important advantage of Bayesian methods: the ability to use posterior distributions as inputs to other analyses and decision-making processes.
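As a small illustration of using a posterior distribution as the input to a decision, the sketch below pushes posterior samples through a downstream profit calculation. Everything in it is hypothetical and not taken from the quoted post: the data (12 successes in 40 trials), the uniform Beta(1, 1) prior, and the revenue and cost figures.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: 12 successes in 40 trials, with a uniform Beta(1, 1) prior.
# The posterior for the success rate is then Beta(1 + 12, 1 + 28).
successes, trials = 12, 40
posterior = [rng.betavariate(1 + successes, 1 + trials - successes)
             for _ in range(20000)]

def profit(rate, revenue=10.0, cost=2.5):
    # Hypothetical downstream quantity that depends on the unknown rate.
    return revenue * rate - cost

# Point-estimate route: collapse the posterior to its mean, then decide.
point_estimate = sum(posterior) / len(posterior)
profit_at_point = profit(point_estimate)

# Distribution route: carry every posterior sample through the calculation.
profits = [profit(r) for r in posterior]
prob_profitable = sum(p > 0 for p in profits) / len(profits)

print(f"profit at point estimate: {profit_at_point:.2f}")
print(f"P(profit > 0): {prob_profitable:.2f}")
```

The point estimate alone only says whether the expected profit is positive. Carrying the full set of samples through the calculation also shows how likely a loss is, which is the information a point estimate and an interval throw away.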
Kaggle Bike Sharing Demand Prediction – How I got in top 5 percentile of participants?
In the Kaggle knowledge competition Bike Sharing Demand, participants are asked to forecast bike rental demand for the bike sharing program in Washington, D.C., based on historical usage patterns in relation to weather, time and other data. With these bike sharing systems, people rent a bike at one location and return it to the same or a different location as needed. People can rent a bike through a membership (mostly regular users) or on demand (mostly casual users). The process is controlled by a network of automated kiosks across the city.
Facebook IV Winner’s Interview: 1st place, Peter Best (aka fakeplastictrees)
Peter Best (aka fakeplastictrees) took 1st place in Human or Robot?, our fourth Facebook recruiting competition. Finishing ahead of 984 other data scientists, Peter ignored early results from the public leaderboard and stuck to his own methodology (which involved removing select bots from the training set). In this blog, he shares what led to this winning approach and how the competition helped him grow as a data scientist.
KDD Cup 2015: The story of how I built hundreds of predictive models… and got so close, yet so far away from 1st place!
The challenge from the KDD Cup this year was to use their data relating to student enrollment in online MOOCs to predict who would drop out vs who would stay.
Working with the RStudio CRAN logs
The installr package has some really nice functions for working with the daily package download logs for the RStudio CRAN mirror, which RStudio graciously makes available at http://…/. The following code uses the download_RStudio_CRAN_data() function to download a month’s worth of .gz compressed daily log files into the test3 directory and then uses read_RStudio_CRAN_data() to read all of these files into a data frame. (The portion of the status output provided shows the files being read in one at a time.) Next, the function most_downloaded_packages() calculates that the top six downloads for the month were Rcpp, stringr, ggplot2, stringi, magrittr and plyr.
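The installr code itself is not reproduced in this excerpt. As an illustration of the counting step only, here is a minimal Python sketch of a "most downloaded" tally over log rows. It does not use or mirror the installr API: the function name most_downloaded and the sample rows are made up, and the sketch assumes only that each log row carries a package column.

```python
import csv
import io
from collections import Counter

def most_downloaded(log_text, n=6):
    """Return the n most frequent values of the `package` column as (name, count) pairs."""
    reader = csv.DictReader(io.StringIO(log_text))
    counts = Counter(row["package"] for row in reader)
    return counts.most_common(n)

# Made-up sample rows; only the columns this sketch needs are shown.
sample_log = """date,package
2015-07-01,Rcpp
2015-07-01,stringr
2015-07-01,Rcpp
2015-07-02,ggplot2
2015-07-02,stringr
2015-07-02,Rcpp
"""

for name, count in most_downloaded(sample_log, n=2):
    print(name, count)
```

Each row in a download log represents one download, so ranking packages by popularity reduces to counting rows per package and sorting by that count.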