Code Profiling in R: A Review of Existing Methods and an Introduction to Package GUIProfiler

Code analysis tools are crucial to understanding program behavior. Profiling tools use time measurements taken during the execution of a program to gain this understanding and thus help optimize the code. In this paper, we review the packages available for profiling R code and show the advantages and disadvantages of each. In addition, we present GUIProfiler, a package that fulfills some unmet needs. GUIProfiler mimics the behavior of the MATLAB profiler: it generates an HTML report showing the time spent on each line of the profiled code (with the slowest code highlighted) and the relationships between different functions. If the package is used within the RStudio environment, the user can navigate across the bottlenecks in the code and open the editor to modify the lines where the most time is spent. It is also possible to edit the code using Notepad++ (a free editor for Windows) by simply clicking on the corresponding line. The graphical user interface makes it easy to identify the specific lines that slow down the code. The integration with RStudio and the generation of an HTML report make GUIProfiler a very convenient tool for code optimization.

Lifetime Lessons: 20 Things Every Data Scientist Must Know Today

I’ve spent close to a decade in data science & analytics now. Over this period, I have learnt new ways of working on data sets and creating interesting stories. However, before I could succeed, I failed numerous times. Success doesn’t come easy!

Plotly.js Open-Source Announcement

Today, Plotly is announcing that we have open-sourced plotly.js, the core technology and JavaScript graphing library behind Plotly’s products (MIT license). It’s all out there and free. Any developer can now integrate Plotly’s library into their own applications unencumbered. Plotly.js supports 20 chart types, including 3D plots, geographic maps, and statistical charts like density plots, histograms, box plots, and contour plots.

Mining Georeferenced Data: Location-based Services and the Sharing Economy

A hands-on guide to using Python to collect, analyse and mine geo-referenced data from location-based services (e.g. Foursquare, Twitter) and the sharing economy (Uber, Airbnb, etc.). The code is easier to follow alongside the slides from the original presentation at the PyData NYC conference.
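As a rough illustration of the kind of operation such a guide builds on, the sketch below filters geo-referenced points by distance from a center using the haversine formula. It is not taken from the presentation: the function names, the `lat`/`lon` record layout, and the sample coordinates are all assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(points, center, radius_km):
    """Keep the points (dicts with hypothetical 'lat'/'lon' keys) within radius_km of center."""
    lat0, lon0 = center
    return [p for p in points if haversine_km(lat0, lon0, p["lat"], p["lon"]) <= radius_km]


# Hypothetical check-in records, made up for illustration only.
checkins = [
    {"name": "a", "lat": 0.0, "lon": 0.0},
    {"name": "b", "lat": 0.01, "lon": 0.0},
    {"name": "c", "lat": 1.0, "lon": 0.0},
]
nearby = within_radius(checkins, center=(0.0, 0.0), radius_km=5.0)
```

Real data collected from a location-based service would replace the `checkins` list; the distance filter itself would stay the same.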

Practical Natural Language Processing for Determining Wifi Quality in Hostels

I was planning my trip to Amsterdam in January and was looking through hostels on Hostel World, filtering for different features and amenities. One amenity I knew I would need was free wifi, both because I wanted to do some programming from the hostel and because life demands it in general. While there are a ton of hostels that offer free wifi, I’ve definitely been on the short end of the stick, where the quality of the wifi has been unmentionably bad. This probably goes for hotels as well as hostels, but hostels are generally cheaper and offer less in the way of complimentary services. That got me thinking about creating an application that could judge the quality of wifi from reviews. On a whim, I decided to build a scraper/API for Hostel World that could find the reviews that mention wifi and other useful amenities. Instead of meticulously scanning through hundreds of reviews, I could just scrape the reviews, parse out keywords, and assign sentiment scores to each review.
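The last two steps, parsing out keywords and assigning sentiment scores, can be sketched roughly as below. This is a minimal illustration and not the author's implementation: the tiny word lists, the function names, and the sample reviews are all made up for the example, and a real version would use scraped reviews and a proper sentiment lexicon or model.

```python
import re

# Hypothetical word lists, chosen only for this illustration.
KEYWORDS = {"wifi", "wi-fi", "internet"}
POSITIVE = {"fast", "great", "reliable", "good", "strong"}
NEGATIVE = {"slow", "bad", "terrible", "unreliable", "weak", "spotty"}


def tokenize(text):
    """Lowercase the text and split it into words, keeping hyphenated forms like 'wi-fi'."""
    return re.findall(r"[a-z]+(?:-[a-z]+)?", text.lower())


def mentions_wifi(text):
    """True if the review contains any of the wifi-related keywords."""
    return any(token in KEYWORDS for token in tokenize(text))


def sentiment_score(text):
    """Count of positive words minus count of negative words in the review."""
    tokens = tokenize(text)
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def score_wifi_reviews(reviews):
    """Return (review, score) pairs for the reviews that mention wifi."""
    return [(r, sentiment_score(r)) for r in reviews if mentions_wifi(r)]


# Made-up reviews standing in for scraped ones.
sample_reviews = [
    "The wifi was fast and reliable.",
    "Great location but the Wi-Fi was slow and spotty.",
    "Lovely staff and clean rooms.",
]
scored = score_wifi_reviews(sample_reviews)
```

A word-count score like this ignores negation ("not fast") and scores the whole review rather than just the sentence about wifi, so it is only a starting point.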

The Data-Driven Weekly #1.2

Last week saw a number of exciting announcements in the big data and machine learning space. They show that there are still lots of problems to solve in 1) working with and deriving insights from big data, and 2) integrating those insights into business processes.

Generating SVG for Web Pages with the gridSVG Package

This document describes several different techniques for including SVG images within a web page and points out the important SVG attributes that control the final appearance of the SVG image within the web page. The document then describes how to control those attributes when generating SVG images with the ‘gridSVG’ package for R.

A Look at SparkSQL

SparkSQL, as the name suggests, is a way to query Apache Spark using SQL. Apache Spark makes it easy to run complex queries across many nodes, something that’s rather difficult with conventional RDBMSs like MySQL.