It’s a complete tutorial on data wrangling or manipulation with R. This tutorial covers one of the most powerful R packages for data wrangling, i.e. dplyr. This package was written by the most popular R programmer, Hadley Wickham, who has written many useful R packages such as ggplot2, tidyr etc. It’s one of the most popular R packages to date. This post includes several examples and tips on how to use the dplyr package for cleaning and transforming data.
Picking an analytic platform when first starting out in data science almost always means working with what we’re most comfortable with. But as organizations grow larger, there is a need for standardization and for selecting one, or a few, analytic tools.
A friend asked me whether I could create a loop that runs multiple regression models. She wanted to evaluate the association between 100 dependent variables (outcomes) and 100 independent variables (exposures), which means 10,000 regression models. Regression models with multiple dependent (outcome) and independent (exposure) variables are common in genetics.
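The looping idea can be sketched in a few lines. This is only an illustration in Python using the standard library, not the code from the post: the helper names `fit_simple_ols` and `fit_all_pairs` and the toy data are assumptions, and each model here is a one-predictor least-squares fit with no covariates or p-values.

```python
from itertools import product


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_all_pairs(outcomes, exposures):
    """Fit one model per (outcome, exposure) pair and key the results by name."""
    results = {}
    for (out_name, y), (exp_name, x) in product(outcomes.items(), exposures.items()):
        results[(out_name, exp_name)] = fit_simple_ols(x, y)
    return results


# Hypothetical toy data: 2 outcomes x 2 exposures gives 4 models.
# With 100 of each, the same loop would produce 10,000 models.
exposures = {"x1": [1.0, 2.0, 3.0, 4.0], "x2": [0.0, 1.0, 0.0, 1.0]}
outcomes = {"y1": [2.0, 4.0, 6.0, 8.0], "y2": [1.0, 3.0, 1.0, 3.0]}

models = fit_all_pairs(outcomes, exposures)
for (out_name, exp_name), (intercept, slope) in sorted(models.items()):
    print(f"{out_name} ~ {exp_name}: intercept={intercept:.3f}, slope={slope:.3f}")
```

Keying the results by the (outcome, exposure) pair makes it straightforward to tabulate or filter the fitted models afterwards.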
In this winners’ interview, first-place team Not-So-Random-Anymore discusses how their simple yet diverse feature sets helped them choose a stable winning ensemble robust to overfitting. Plus, domain experience along with lessons learned from past competitions contributed to the winning approach by Andriy Temko, Alexandre Barachant, Feng Li, and Gilberto Titericz Jr.
Regression is arguably the workhorse of statistics. Despite its popularity, however, it may also be the most misunderstood. Why? The answer might surprise you: There is no such thing as Regression. Rather, there are a large number of statistical methods that are called Regression or grounded on its fundamental idea: ‘Dependent Variable = Constant + Slope*Independent Variable + Error’. The Dependent Variable is something you want to predict or explain. In a Marketing Research context it might be Purchase Interest measured on a 0-10 rating scale. The Independent Variable is what you use to explain or predict the Dependent Variable. Continuing our consumer survey example, this could be a rating on an attribute such as Ease of Use using a 0-10 scale.
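To make the equation ‘Dependent Variable = Constant + Slope*Independent Variable + Error’ concrete, here is a minimal Python sketch using ordinary least squares. The ratings are invented for illustration and do not come from any real survey; the variable names are assumptions as well.

```python
# Hypothetical 0-10 ratings from five respondents (made-up numbers).
ease_of_use = [2, 4, 5, 7, 9]        # Independent Variable
purchase_interest = [3, 4, 6, 7, 9]  # Dependent Variable

n = len(ease_of_use)
mean_x = sum(ease_of_use) / n
mean_y = sum(purchase_interest) / n

# Least-squares estimates of the Slope and the Constant.
sxx = sum((x - mean_x) ** 2 for x in ease_of_use)
sxy = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(ease_of_use, purchase_interest))
slope = sxy / sxx
constant = mean_y - slope * mean_x

# The Error is whatever the fitted line leaves unexplained for each respondent.
errors = [y - (constant + slope * x)
          for x, y in zip(ease_of_use, purchase_interest)]

for x, y, e in zip(ease_of_use, purchase_interest, errors):
    # Each observed rating decomposes exactly as Constant + Slope * x + Error.
    print(f"{y} = {constant:.3f} + {slope:.3f} * {x} + ({e:+.3f})")
```

With a fitted constant, the least-squares errors sum to zero (up to floating-point rounding), which is one quick sanity check on the fit.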
Deep learning is a recent trend in machine learning that models highly non-linear representations of data. In recent years, deep learning has gained tremendous momentum and prevalence in a variety of applications (Wikipedia 2016a). Among these are image and speech recognition, driverless cars, natural language processing and many more. Interestingly, the majority of mathematical concepts for deep learning have been known for decades. However, it is only through several recent developments that the full potential of deep learning has been unleashed (Nair and Hinton 2010; Srivastava et al. 2014).
In two previous blog posts I discussed some techniques for visualizing relationships involving two or three variables and a large number of cases. In this tutorial I will extend that discussion to show some techniques that can be used on large datasets and complex multivariate relationships involving three or more variables.
Dual y-axes: yes or no? What about if one of them is also reversed, i.e. values increase from the top of the chart to the bottom? Judging by this StackOverflow question, hydrologists are fond of both of these things. It asks whether ggplot2 can be used to generate a “rainfall hyetograph and streamflow hydrograph”, which looks like this …
The survey package is one of R’s best tools for those working in the social sciences. For many, it saves you from needing to use commercial software for research that uses survey data. However, it lacks one function that many academic researchers often need to report in publications: correlations. The svycor function in jtools (more info) helps to fill that gap. An initial note, however, is necessary. The basic method behind this feature comes from a response to a question about calculating correlations with the survey package written by Thomas Lumley, the survey package author—he has not seen (to my knowledge) or endorsed this function. All that is good about this function should be attributed to Dr. Lumley; all that is wrong with it should be attributed to me (Jacob). With that said, let’s look at an example. First, we need to get a survey.design object. This one is built into the survey package.