7 Must Watch Documentaries on Statistics and Machine Learning

We live in an era of technological advancement that is largely driven by data. A decade ago, the use of data was limited to statistical measures and studies. Now everything has changed: data influences our lives at both the personal and the professional level. The combination of data, programming and mathematics has completely transformed the way business is done. Even fields like Wall Street trading, casino gambling and sports are largely driven by data mining methods.


PubMed search Shiny App using RISmed

In part one of a series of tutorials, we will develop a Shiny App for performing analysis of academic text from PubMed. There’s no shortage of great tutorials for developing a Shiny App using R, including Shiny’s own tutorial. Here at DataScience+ we have a perfect introduction by Teja Kodali and a more in-depth development by J.S. Ramos. Here I will focus on the basics of making PubMed queries using the RISmed package and demonstrate how easily you can share any of your R functionality using Shiny. The finished App is available online if you want to see it in action and follow along. In this introductory tutorial, we’ll get a taste of what we can accomplish, try to cover all the basics, and hopefully streamline some potential time-sinks.


Profile Likelihood


Unbiased metrics of friends’ influence in multi-level networks

The spreading of information is of crucial importance for the modern information society. While we still receive information from mass media and other non-personalized sources, online social networks and the influence of friends have become important personalized sources of information. This calls for metrics to measure the influence of users on the behavior of their friends. We demonstrate that the currently existing metrics of friends’ influence are biased by the presence of highly popular items in the data, and as a result can lead to an illusion of friends’ influence where there is none. We correct for this bias and develop three metrics that make it possible to distinguish the influence of friends from the effects of item popularity, and apply the metrics to real datasets. We use a simple network model based on the influence of friends and preferential attachment to illustrate the performance of our metrics at different levels of friends’ influence.
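The popularity bias described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the function `naive_influence`, the random friendship graph, and the "adoption preceded by a friend's adoption" score are all hypothetical and are not the metrics developed in the paper. Users adopt an item independently of their friends, so any apparent influence is an artifact.

```python
import random

def naive_influence(num_users, popularity, friend_prob=0.05, seed=0):
    """Share of adoptions preceded by a friend's adoption.

    Hypothetical naive score: there is NO real influence in this model,
    because every user adopts independently with probability `popularity`.
    """
    rng = random.Random(seed)

    # Assumed random friendship graph (each pair linked with friend_prob).
    friends = {u: set() for u in range(num_users)}
    for u in range(num_users):
        for v in range(u + 1, num_users):
            if rng.random() < friend_prob:
                friends[u].add(v)
                friends[v].add(u)

    # Independent adoption at a random time in [0, 1).
    adopted_at = {}
    for u in range(num_users):
        if rng.random() < popularity:
            adopted_at[u] = rng.random()

    if not adopted_at:
        return 0.0

    # Count adopters who had at least one friend adopt earlier.
    preceded = sum(
        1
        for u, t in adopted_at.items()
        if any(v in adopted_at and adopted_at[v] < t for v in friends[u])
    )
    return preceded / len(adopted_at)

popular_score = naive_influence(300, popularity=0.5)
niche_score = naive_influence(300, popularity=0.05)
print(f"popular item: {popular_score:.2f}, niche item: {niche_score:.2f}")
```

In this setup the popular item should score noticeably higher than the niche one even though nobody influences anybody, which is the kind of illusion a popularity-corrected metric needs to remove.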


A job board for people and companies looking to hire R users

The vision for this site is to give one more tool for people and companies to find R users and promote collaboration. The site allows R users (and companies) to post “job” listings on all things related to R.
Examples of who should care about this:
1) Companies looking to hire people with R skills.
2) R users who are looking for a job.
3) R freelancers who wish to invite people to get in touch with them.
4) R users who are looking to find partners for their work. This includes people who want help with making an R package from their new statistical algorithm, or speeding up their R code. Just post a “job” with the details.


How to Search for Census Data from R

In my course Learn to Map Census Data in R I provide people with a handful of interesting demographics to analyze. This is convenient for teaching, but people often want to search for other demographic statistics. To address that, today I will work through an example of starting with a simple demographic question and using R to answer it. Here is my question: I used to live in Japan, and to this day I still enjoy practicing Japanese with native speakers. If I wanted to move from San Francisco to a part of the country that has more Japanese people, where should I move?


An industrial big data pipeline for data-driven analytics maintenance applications in large-scale smart manufacturing facilities

The term smart manufacturing refers to a future-state of manufacturing, where the real-time transmission and analysis of data from across the factory creates manufacturing intelligence, which can be used to have a positive impact across all aspects of operations. In recent years, many initiatives and groups have been formed to advance smart manufacturing, with the most prominent being the Smart Manufacturing Leadership Coalition (SMLC), Industry 4.0, and the Industrial Internet Consortium. These initiatives comprise industry, academic and government partners, and contribute to the development of strategic policies, guidelines, and roadmaps relating to smart manufacturing adoption. In turn, many of these recommendations may be implemented using data-centric technologies, such as Big Data, Machine Learning, Simulation, Internet of Things and Cyber Physical Systems, to realise smart operations in the factory. Given the importance of machine uptime and availability in smart manufacturing, this research centres on the application of data-driven analytics to industrial equipment maintenance. The main contributions of this research are a set of data and system requirements for implementing equipment maintenance applications in industrial environments, and an information system model that provides a scalable and fault tolerant big data pipeline for integrating, processing and analysing industrial equipment data. These contributions are considered in the context of highly regulated large-scale manufacturing environments, where legacy (e.g. automation controllers) and emerging instrumentation (e.g. internet-aware smart sensors) must be supported to facilitate initial smart manufacturing efforts.


Deep Learning with MXNetR

Deep learning has been an active field of research for some years, with breakthroughs in areas such as image and language understanding. However, there has not yet been a good deep learning package in R that offers state-of-the-art deep learning models and real GPU support for fast training of these models.
In this post, we introduce MXNetR, an R package that brings fast GPU computation and state-of-the-art deep learning to the R community. MXNet allows you to flexibly configure state-of-the-art deep learning models backed by a fast CPU and GPU back end. This post will cover the following topics:
• Train your first neural network in five minutes
• Use MXNet for Handwritten Digits Classification Competition
• Classify real world images using state-of-the-art deep learning models


Why is Apache Spark So Hot?

Spark, when compared to MapReduce, offers greater flexibility. MapReduce only offers two operations: Map and Reduce, whereas Spark offers more than 80 high-level operations. MapReduce’s inefficient handling of iterative algorithms as well as interactive analytic tools served as the motivation for developing alternatives. Spark excels at programming models involving iterations, interactivity, streaming and more.
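To make the two-operation model concrete, here is a minimal word-count sketch in plain Python. It uses neither Hadoop nor Spark; the sample `lines`, the `merge` helper, and the variable names are assumptions made for illustration only. The point is that a map step plus a reduce step is enough for a simple count, while anything further (filtering, sorting) has to be bolted on as separate steps, which is where a richer operation set becomes convenient.

```python
from functools import reduce

# Hypothetical input: one string per record.
lines = ["spark is fast", "mapreduce is batch", "spark is flexible"]

# Map step: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Reduce step: fold the pairs into per-word totals.
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(merge, mapped, {})

# Extra steps beyond map and reduce: keep repeated words, sort by count.
repeated = sorted(
    ((w, c) for w, c in word_counts.items() if c > 1),
    key=lambda item: (-item[1], item[0]),
)
print(repeated)
```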


Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance

An open-source exploration of the city’s neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data