Using Excel versus using R

Here is a video I made showing how R should not be considered “scarier” than Excel to analysts. One of the takeaway points: it is easier to email R procedures than Excel procedures.


Confidence Regions for Parameters in the Simplex

Consider here the case where, in some parametric inference problem, the parameter theta is a point in the simplex …


RcppParallel: Getting R and C++ to work (some more) in parallel

The new best hope is (and has been) parallel processing. Even our smartphones have multiple cores, and most if not all retail PCs now possess two, four or more cores. Real computers, aka somewhat decent servers, can be had with 24, 32 or more cores as well, and all that is before we even consider GPU coprocessors or other upcoming changes.


Top 100 Big Data Experts to Follow

Maptive gives us another list of top Big Data Influencers to check out, including data-driven reasons as to why individuals are included.


Scheduling R Markdown Reports via Email

R Markdown is an amazing tool that allows you to blend bits of R code with ordinary text and produce well-formatted data analysis reports very quickly. You can export the final report in many formats, such as HTML, PDF or MS Word, which makes it easy to share with others. And of course, you can modify or update it with fresh data very easily.


word2vec, LDA, and introducing a new hybrid algorithm: lda2vec

Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems need to identify a medical professional when she writes that she ‘used to wear scrubs to work’, and distill ‘taking a trip’ into a Fix for vacation clothing. Applied appropriately, word vectors are dramatically more meaningful and more flexible than current techniques and let computers peer into text in a fundamentally new way. I’ll try to convince you that word vectors give us a simple and flexible platform for understanding text while speaking about word2vec and LDA, and introducing our hybrid algorithm, lda2vec.


Programming with R

Learn how to program in R, starting with simple loops and functions and then continuing with building Shiny apps and R packages for effective data analysis or data visualization.


Exploratory Data Analysis for Spark and Scala

Spark library for doing exploratory data analysis on your data set in a scalable way.


Hitchhikers Guide to Azure Machine Learning Studio

Learn Azure ML Studio through this brief hands-on tutorial. This step-by-step guide will help you get a quick-start and grasp the basics of this Predictive Modeling tool.


List of unsolved problems in statistics

There are many longstanding unsolved problems in mathematics for which a solution has not yet been found. The unsolved problems in statistics are generally of a different flavor; according to John Tukey, ‘difficulties in identifying problems have delayed statistics far more than difficulties in solving problems.’ A list of ‘one or two open problems’ (in fact 22 of them) was given by David Cox.


Statistics roadmap

Julia Computing has recently received funding from the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative. One of the main components of this project is to improve the statistics and data science functionality available in the Julia ecosystem. I thought it would be useful to set out our current plans in this regard. These may change as the work develops, but they represent what we think are the most potentially useful contributions to the community. First, I wish to acknowledge the excellent work that has been done so far. The breadth of functionality already available in Julia’s statistical ecosystem is remarkable for such a young language. In particular, some of the packages, such as Distributions.jl, MixedModels.jl and Gadfly.jl, are genuinely some of the most cutting-edge software in their respective domains. The overall aim of this work is to build a robust, flexible and high-performance platform to allow such excellent work to continue to develop.


stplanr 0.1.1

This short post, by myself and package co-author Richard Ellison, describes how stplanr can be used for transport research with a few simple examples from the package documentation. We hope that stplanr is of use to transport researchers and practitioners worldwide and encourage contributions to the development version hosted on GitHub.


A simple ANOVA

I was browsing Davies’ Design and Analysis of Industrial Experiments (second edition, 1967), published for ICI in times when industry did that kind of thing. It is quite an applied book. On page 107 there is an example where the variance of a process is estimated.


First step on GIS with R

The PM 2.5 checker written in R has been working nicely for me. I put a shortcut icon for this small script on my desktop PC, to check the air pollution when I go for a run. The Ibaraki Prefecture Agency has been increasing the number of monitoring points for PM 2.5 concentration from China. One problem is that I cannot picture all the observation locations exactly.


Creating Calendars for Futures Expiration

Lately I have been doing calendar analysis of various markets (futures contracts). It is not an overly complicated task, but it has a few interesting angles, and since I haven’t seen anything similar on the Net, here we go.


ggtern 2.0 now available

Recently ggplot2 received a major makeover with the release of version 2.0, and in the spirit of improvement, I thought ggtern should also get an overhaul, so after a few hundred hours of code review, here is what has changed:


S-shaped data: Smoothing with quasibinomial distribution

S-shaped data can be found in many applications. Such data can be approximated with the cumulative distribution function of the logistic distribution, which is the logistic function, i.e., the inverse of the logit. To demonstrate this, in this short example, after generating synthetic data, we will fit a quasibinomial regression model to different observations.
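
The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the linked post (which uses R): the synthetic data, the parameter values TRUE_B0 and TRUE_B1, and the helper fit_quasibinomial are all hypothetical. The fit uses iteratively reweighted least squares with a logit link; the quasibinomial part is the dispersion, estimated here from the Pearson residuals.

```python
import math
import random


def logistic(z):
    """Logistic function: maps any real z to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Logit (log-odds): the inverse of the logistic function."""
    return math.log(p / (1.0 - p))


def fit_quasibinomial(x, y, iterations=25):
    """Fit E[y] = logistic(b0 + b1 * x) by iteratively reweighted least squares.

    y holds proportions in [0, 1]. Returns (b0, b1, dispersion).
    Hypothetical helper written for this sketch only.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            mu = logistic(eta)
            w = mu * (1.0 - mu)          # binomial variance function = IRLS weight
            z = eta + (yi - mu) / w      # working response
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += w * z
            s_wxz += w * xi * z
        # Solve the 2x2 weighted normal equations for (b0, b1).
        det = s_w * s_wxx - s_wx * s_wx
        b0, b1 = ((s_wxx * s_wz - s_wx * s_wxz) / det,
                  (s_w * s_wxz - s_wx * s_wz) / det)
    # Dispersion: Pearson chi-square divided by residual degrees of freedom.
    mus = [logistic(b0 + b1 * xi) for xi in x]
    pearson = sum((yi - mu) ** 2 / (mu * (1.0 - mu)) for yi, mu in zip(y, mus))
    return b0, b1, pearson / (len(x) - 2)


# Synthetic S-shaped data (assumed values, for illustration only).
TRUE_B0, TRUE_B1 = 0.5, 1.0
rng = random.Random(42)
xs = [-5.0 + 0.1 * i for i in range(101)]
ys = [min(1.0, max(0.0, logistic(TRUE_B0 + TRUE_B1 * xi) + rng.gauss(0.0, 0.02)))
      for xi in xs]

b0_hat, b1_hat, dispersion = fit_quasibinomial(xs, ys)
print("intercept:", b0_hat, "slope:", b1_hat, "dispersion:", dispersion)
```

The coefficient estimates are the same as those of an ordinary binomial fit; what the quasibinomial family adds is the freely estimated dispersion, which is why it suits proportions whose scatter does not match binomial variance.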


10 Weather and Geography Charts Made in Python or R

Below are 10 charts made in R or Python by Plotly users on weather, maps and geography.


Kaggle: Walmart Trip Type Classification

Walmart Trip Type Classification was my first real foray into the world of Kaggle and I’m hooked. I previously dabbled in What’s Cooking but that was as part of a team and the team didn’t work out particularly well. As a learning experience the competition was second to none. My final entry put me at position 155 out of 1061 entries which, although not a stellar performance by any means, is just inside the top 15% and I’m pretty happy with that. Below are a few notes on the competition.