Three Things About Data Science You Won’t Find In the Books
In case you haven’t heard yet, Data Science is all the rage. Courses, posts, and schools are springing up everywhere. However, every time I look at one of those offerings, I see that a lot of emphasis is put on specific learning algorithms. Of course, understanding how logistic regression or deep learning works is cool, but once you start working with data, you find out that other things are equally important, or maybe even more so.
Causal Modeling for Data Science
A few weeks ago, I received a Facebook link for lecture notes ‘Research Design for Causal Inference’, from the Harvard graduate class, Government 2001, ‘Advanced Quantitative Research Methodology’, taught by professor Gary King. …
What is new in the vtreat library?
The Win-Vector LLC vtreat library is a library we supply (under a GPL license) for automating the simple, domain-independent part of variable cleaning and preparation. The idea is that you supply (in R) an example general data.frame to vtreat’s designTreatmentsC method (for single-class categorical targets) or designTreatmentsN method (for numeric targets), and vtreat returns a data structure that can be used to prepare data frames for training and scoring.
RStudio v0.99 Preview: More Editor Enhancements
We’ve blogged previously about various improvements we’ve made to the source editor in RStudio v0.99 including enhanced code completion, snippets, diagnostics, and an improved Vim mode. Besides these larger scale features we’ve made lots of smaller improvements that we also wanted to highlight. You can try out all of these features now in the RStudio v0.99 preview release.
Introduction to Support Vector Machines
This tutorial introduces Support Vector Machines (SVMs), a powerful supervised learning algorithm used to draw a boundary between clusters of data.
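To make the idea of "drawing a boundary" concrete, here is a minimal sketch of a linear SVM in plain Python. It is not the tutorial's own code: it assumes a linear kernel, trains by simple subgradient descent on the regularized hinge loss, and uses a made-up toy dataset. The names train_linear_svm and predict, the learning rate, and the regularization strength are all illustrative assumptions.

```python
def train_linear_svm(points, labels, lr=0.1, lam=0.01, epochs=100):
    """Fit a linear boundary w.x + b = 0 by subgradient descent on the hinge loss.

    Labels must be +1 or -1. lr, lam and epochs are illustrative defaults,
    not tuned values.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # L2 regularization: shrink the weights a little on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # The point is inside the margin or misclassified: push the boundary away.
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy data (hypothetical): two well-separated clusters on either side of the origin.
points = [(1.0, 1.0), (-1.0, -1.0), (2.0, 1.0), (-2.0, -1.0),
          (1.0, 2.0), (-1.0, -2.0), (2.0, 2.0), (-2.0, -2.0)]
labels = [1, -1, 1, -1, 1, -1, 1, -1]

w, b = train_linear_svm(points, labels)
```

After training, predict(w, b, (3.0, 3.0)) should land on the positive side and predict(w, b, (-3.0, -2.5)) on the negative side. In practice, an established SVM library is a better choice than a hand-rolled loop like this one.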
Ultimate resource for understanding & creating data visualization
There are 3 fundamental changes driving the penetration of data science in industry:
• Data generation and storage have become very cheap. Every smartphone comes with numerous sensors that continuously generate data, and companies continue to store it for future use.
• Computing power has become cheap.
• Numerous tools are available to help analysts expedite their work. Compared to 3 years ago, there are tools for faster data collection and cleaning, evolving algorithms, and visualization tools.
These forces have also changed the process flow for an analyst. Effective data visualization is a more critical component of the data science process flow than ever before.
The tensor renaissance in data science
The natural question is: why use tensors when (large) matrices can already be challenging to work with? Proponents are quick to point out that tensors can model more complex relationships.
Paper to Digital in 200+ languages
Many of the world’s important sources of information – books, newspapers, magazines, pamphlets, and historical documents – are not digital. Unlike digital documents, these paper-based sources of information are difficult to search through or edit, or worse, completely inaccessible to some people. Part of the solution is scanning, getting a digital image of the page, but raw image pixels aren’t yet recognized as textual content from the computer’s point of view.
5 Reasons Data Analytics in the Cloud Will Take Center Stage in 2015
1. Cloud provides a more flexible deployment model for powerful open source software
2. Cloud makes analytical tools simpler to learn and easier to use
3. Cloud enables experts to tackle hard analytics problems through collaboration
4. A growing ecosystem of cloud-native business applications needs a centralized platform for analysis
5. Cloud is the best place to effectively deploy an entire data pipeline
Cloud Bigtable Beta
Google Cloud Bigtable offers you a fast, fully managed, massively scalable NoSQL database service that’s ideal for web, mobile, and Internet of Things applications requiring terabytes to petabytes of data. Unlike comparable market offerings, Cloud Bigtable doesn’t require you to sacrifice speed, scale, or cost efficiency when your applications grow. Cloud Bigtable has been battle-tested at Google for more than 10 years – it’s the database driving major applications such as Google Analytics and Gmail.
Profiling Top Kagglers: KazAnova Currently #2 in the World
First up is KazAnova — Marios Michailidis — the current number 2 out of nearly 300,000 data scientists. Marios is a PhD student in machine learning at UCL and a senior data scientist at dunnhumby (organizer of the Kaggle competitions ‘Shopper Challenge’ and ‘Product Launch Challenge’).
A Link Between topicmodels LDA and LDAvis
Carson Sievert and Kenny Shirley have put together the really nice LDAvis R package. It provides a Shiny-based interactive interface for exploring the output from Latent Dirichlet Allocation topic models. If you’ve never used it, I highly recommend checking out their XKCD example (this paper also has some nice background). LDAvis doesn’t fit topic models; it just visualises the output. As such, it is agnostic about what package you use to fit your LDA topic model. They have a useful example of how to use output from the lda package. I wanted to use LDAvis with output from the topicmodels package, which works really nicely with texts preprocessed using the tm package. The trick is extracting the information LDAvis requires from the model and placing it into a specifically structured JSON-formatted object.
Digging up embedded plots
The following multi-panel graph, which graces the cover of the most recent issue of the Journal of Computational and Graphical Statistics (JCGS, Vol 24, Num 1, March 2015), is from the paper by Grolemund and Wickham entitled Visualizing Complex Data With Embedded Plots. The four plots are noteworthy for a couple of reasons:
1. They present a superb example of how an embedded plot, with its additional set of axes, can pack more information into the same area required for a traditional scatter plot or heatmap.
2. They provide clear and prominent testimony to the dreadful toll of civilian casualties from the war in Afghanistan.
Mendelian randomization inspires a randomized trial design for multiple drugs simultaneously
The basic idea behind Mendelian Randomization is the following. In a simple, randomly mating population, Mendel’s laws tell us that at any genomic locus (a measured spot in the genome) the allele (the genetic material you inherit) is assigned at random. At the chromosome level this is very close to true due to properties of meiosis (here is an example of how this looks in very cartoonish form in yeast).
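The random assignment described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not in the original post: a single locus with two alleles labelled "A" and "a", two heterozygous parents, and each parent passing on one of its two alleles with equal probability. The function name simulate_offspring and the seed are hypothetical.

```python
import random
from collections import Counter


def simulate_offspring(n_children, seed=0):
    """Simulate genotypes of children from two heterozygous (Aa) parents.

    Each parent passes on one of its two alleles uniformly at random,
    which is the random assignment Mendelian Randomization relies on.
    Returns the observed frequency of each genotype.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_children):
        from_mother = rng.choice("Aa")
        from_father = rng.choice("Aa")
        # Sort so that "aA" and "Aa" are counted as the same genotype.
        genotype = "".join(sorted(from_mother + from_father))
        counts[genotype] += 1
    return {g: counts[g] / n_children for g in ("AA", "Aa", "aa")}


freqs = simulate_offspring(100_000)
```

With a large number of simulated children, the frequencies in freqs should come out close to the classic 1/4, 1/2, 1/4 split for AA, Aa, and aa, regardless of anything else about the child. That independence is what lets genotype stand in for a randomized treatment assignment.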