How to Search PubMed with the RISmed Package in R

In the last tutorial, we developed a simple Shiny app in R to collect and analyze PubMed data. There was some interest in learning more about RISmed itself, so I’ll back up a little and present some of the core functionality of the RISmed package.

Tutorial – Build a simple Machine Learning Model using AzureML

How difficult is it to build a machine learning model in R or Python? For beginners, it’s a Herculean task. For intermediates and experts, it’s just a matter of system capacity, problem understanding and a little time. Machine learning models sometimes face the issue of system incompatibility, especially when the data set is huge. In such cases, either the model takes longer to compute or the system crashes. Hence, for beginners and experts alike, machine learning poses its own challenges. The good news is that machine learning has become a lot easier in the last few years. As a beginner, you can kick-start your machine learning journey with Microsoft AzureML. In this article, I’ll share the information you need to get started with machine learning, and I walk through a step-by-step tutorial for creating a machine learning model with this software. The speed of computation on Microsoft AzureML is comparable to R or Python, so I’d say it’s worth trying for experts too.

See sklearn trees with D3

The decision trees from scikit-learn are very easy to train and predict with, but it’s not easy to see the rules they learn. The code below makes it easier to see inside sklearn classification trees.
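The code from the linked post is not reproduced in this digest. As a rough stand-in, the sketch below uses scikit-learn's built-in `export_text` helper to print the learned rules as plain text. It is not the D3 visualization the post describes, and the iris dataset and `max_depth=2` are assumptions chosen only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the iris dataset stands in for whatever data the post used.
iris = load_iris()

# A shallow tree keeps the printed rule set small enough to read.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text renders each split as an indented "feature <= threshold" rule,
# with the predicted class at every leaf.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indentation level in the printed text corresponds to one level of the tree, so the path from the top of the output to a leaf reads as a single if/else rule.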

Infinite Dimensional Word Embeddings

Word embeddings have been huge for the NLP community ever since Tomas Mikolov’s 2013 paper (it’s gotten over 1000 citations in 2 years!). The basic idea is to learn a $z$-dimensional parameter vector $w_i \in \mathbb{R}^z$ associated with each word in a vocabulary. This is important as part of a machine learning pipeline which extracts features for language models, text classification, etc. Not unlike advertisements of discrete Bayesian nonparametric models, Nalisnick and Ravi propose the Infinite Skip-Gram model, which allows the dimension $z$ of the word vectors to grow arbitrarily.
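To make the fixed-dimension setup concrete, here is a minimal sketch of an embedding table: one vector of length z per vocabulary word. The vocabulary, the value of `z`, and the random initial values are all illustrative assumptions. Nothing is trained here, and this is neither the skip-gram objective nor the Infinite Skip-Gram model.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
z = 5                   # embedding dimension, chosen up front (assumed value)
vocabulary = ["bird", "sparrow", "engine"]  # hypothetical toy vocabulary

# One z-dimensional vector per word; random numbers stand in for learned parameters.
embeddings = {word: [rng.gauss(0.0, 1.0) for _ in range(z)] for word in vocabulary}

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

w_bird = embeddings["bird"]
print(len(w_bird))  # every vector has exactly z entries
print(cosine_similarity(w_bird, embeddings["sparrow"]))
```

In a model like this, `z` is a hyperparameter fixed before training. The proposal summarized above instead lets the effective dimension grow with the data.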

Overfitting, Regularization, and Hyperparameters

One of the goals of machine learning is generalizability. A model that only works on the exact data it was trained on is effectively useless. Let’s say you’re tasked with creating a bird-recognition system. If you train a model to recognize pictures of birds, and it gets 100% accuracy on the 130 pictures of 10 classes of birds you showed it, is it a good model?
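One way to see why perfect training accuracy says little by itself is to build a deliberately bad model. The sketch below is an extreme, made-up case: the features and labels are random noise, and the "model" simply memorizes the 130 training examples. The data, the lookup-table model, and the fallback rule are all assumptions for illustration, not the article's bird-recognition system.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable
NUM_CLASSES = 10

def make_examples(n):
    """Return n (features, label) pairs whose labels are pure noise."""
    return [
        (tuple(rng.random() for _ in range(4)), rng.randrange(NUM_CLASSES))
        for _ in range(n)
    ]

train = make_examples(130)      # mirrors the 130 pictures in the text
held_out = make_examples(1000)  # data the model has never seen

# "Training" is just memorizing every example in a lookup table.
memory = dict(train)
fallback = Counter(label for _, label in train).most_common(1)[0][0]

def predict(features):
    # Exact match if seen during training, otherwise guess the most common label.
    return memory.get(features, fallback)

def accuracy(examples):
    return sum(predict(x) == y for x, y in examples) / len(examples)

train_accuracy = accuracy(train)
held_out_accuracy = accuracy(held_out)
print(f"training accuracy: {train_accuracy:.2f}")
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

The memorizer scores perfectly on the examples it has stored and falls to roughly chance level on new ones, which is why a model should be judged on data held out from training.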

Bandit Algorithms for Bullying: Getting More Lunch Money

Let me tell you a story about Browning McClementine. Browning was a pretty typical, acne-faced teenager who grew up in a very small town in Washington, not far from Rainier National Forest. His high school only had ten students in it. Across from the town’s only bikini espresso shack, which amply satisfied the needs of passing hiker convoys, was the town’s only gas station. Long story short, Browning would pathetically frequent this espresso shack at least once a week in a vain attempt to flirt with Cindy Carrots, his longtime crush.

14 Great Machine Learning, Data Science, R, DataViz Cheat Sheets

1. Machine Learning on GitHub
2. Supervised Learning on GitHub
3. Cheat Sheet: Data Visualization with R
4. Cheat Sheet: Data Visualisation in Python
5. scikit-learn Algorithm Cheat Sheet
6. Vincent Granville’s Data Science Cheat Sheet – Basic
7. Vincent Granville’s Data Science Cheat Sheet – Advanced
8. Cheat Sheet – 10 Machine Learning Algorithms & R Commands
9. Microsoft Azure Machine Learning: Algorithm Cheat Sheet
10. Cheat Sheet – Algorithm for Supervised and Unsupervised Learning
11. Machine Learning and Predictive Analytics, on Dzone
12. ML Algorithm Cheat Sheet by Laurence Diane
13. CheatSheet: Data Exploration using Pandas in Python
14. 24 Data Science, R, Python, Excel, and Machine Learning Cheat Sheets

Whom should we sense in “social sensing” – analyzing which users work best for social media now-casting

In this paper, we ask “How does social sensing actually work?” or, more precisely, “Whom should we sense, and whom not, for optimal results?”. We investigate how different sampling strategies affect the performance of now-casting of two common offline indices: flu activity and unemployment rate. We show that now-casting can be improved by (1) applying user filtering techniques and (2) selecting users with complete profiles. We also find that, using the right type of user groups, now-casting performance does not degrade, even when drastically reducing the size of the dataset. More fundamentally, we describe which types of users contribute most to the accuracy by asking if “babblers are better”. We conclude the paper by providing guidance on how to select better user groups for more accurate now-casting.

Flavour of Physics Technical Write-Up: 1st place, Go Polar Bears

Vlad Mironov and Alexander Guschin of team Go Polar Bears took first place in the CERN LHCb experiment Flavour of Physics competition. Their model was best able to identify a rare decay phenomenon (τ- → μ+μ-μ- or τ → 3μ) to help establish proof of ‘new physics’. Below they share the technical highlights of their approach and solution.

Emojis in ggplot graphics

R user David Lawrence Miller has created an extension for R’s ggplot2 package that allows you to use emojis as plotting symbols. The emoGG package (currently only available on GitHub) adds the geom_emoji geom to ggplot2, which uses an emoji code to identify the plotting symbol.

Estimating mixed graphical models

Determining conditional independence relationships through undirected graphical models is a key component in the statistical analysis of complex observational data in a wide variety of disciplines. In many situations one seeks to estimate the underlying graphical model of a dataset that includes variables of different domains. As an example, take a typical dataset in the social, behavioral and medical sciences, where one is interested in interactions, for example between gender or country (categorical), frequencies of behaviors or experiences (count) and the dose of a drug (continuous). Other examples are Internet-scale marketing data or high-throughput sequencing data.

Interactive association rules exploration app

In a previous post, I wrote about what I use association rules for and mentioned a Shiny application I developed to explore and visualize rules. This post is about that app. The app is mainly a wrapper around the arules and arulesViz packages developed by Michael Hahsler.