Standard deviation vs Standard error

I am often asked (i.e. more than twice) by colleagues whether they should plot/use the standard deviation or the standard error. Here is a small post trying to clarify the meaning of these two metrics and when to use them, with some R code examples.
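To make the distinction concrete, here is a minimal sketch (not taken from the original post) using a hypothetical sample: the standard deviation describes the spread of the observations, while the standard error describes the uncertainty of the estimated mean.

```r
# Standard deviation: spread of the observations themselves.
# Standard error: uncertainty of the estimated mean (shrinks as the sample grows).
set.seed(42)
x <- rnorm(100, mean = 10, sd = 2)   # hypothetical sample

sd(x)                    # standard deviation
sd(x) / sqrt(length(x))  # standard error of the mean
```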


Scholar indices (h-index and g-index) in PubMed with RISmed

Scholar indices are intended to measure the contributions of authors to their fields of research. Jorge E. Hirsch suggested the h-index in 2005 as an author-level metric intended to measure both the productivity and citation impact of the publications of an author. An author has index h if h of his or her N papers have at least h citations each, and the other (N-h) papers have no more than h citations each.
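The definitions translate almost directly into code. Below is a hedged sketch of the h-index and g-index computations on a hypothetical vector of citation counts; in the original post those counts come from PubMed queries via RISmed.

```r
# h-index: the largest h such that h papers have at least h citations each.
h_index <- function(citations) {
  citations <- sort(citations, decreasing = TRUE)
  sum(citations >= seq_along(citations))
}

# g-index: the largest g such that the top g papers together have >= g^2 citations.
g_index <- function(citations) {
  citations <- sort(citations, decreasing = TRUE)
  sum(cumsum(citations) >= seq_along(citations)^2)
}

cites <- c(10, 8, 5, 4, 3)   # hypothetical citation counts for one author
h_index(cites)               # 4
g_index(cites)               # 5
```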


Control: The “Uncle Fester” of the Data Science Family (part 1–The Knowledge Pyramid)

Data Science is young, and so there are many views of what the pieces of the field are and how they fit together. Most of those views are a hodgepodge, lacking cohesion: a ‘bag of tools’ approach for the most part. Amongst other issues, this makes it difficult to describe to the consumers of data science what we, as data scientists, simulation scientists, and computational scientists, do.


Control: The “Uncle Fester” of the Data Science Family (part 2–Optimization)

Part 1 of this series provided a taxonomy of data science that aligns well with the Data/Information/Knowledge/Wisdom Pyramid. We broke data science into three complementary components:
1. Data Fusion: Transforms observations of the world into estimates of variables of enterprise interest.
2. Analytics: Applies business rules to variables of interest, giving answers to enterprise questions.
3. Control: Uses the values of enterprise variables and answers to enterprise questions to inform decisions and actions to further enterprise goals.
For that discussion, we treated control and optimization as the same thing and used the word ‘control’ as shorthand for both. But we promised, at the end, to illuminate the distinctions between optimization and control, starting with a more structured discussion of optimization. That is our subject here.


Image Processing + Machine Learning in R: Denoising Dirty Documents Tutorial Series

Colin Priest finished 2nd in the Denoising Dirty Documents playground competition on Kaggle. He blogged about his experience in an excellent tutorial series that walks through a number of image processing and machine learning approaches to cleaning up noisy images of text. The series starts with linear regression, but quickly moves on to GBMs, CNNs, and deep neural networks. You’ll learn techniques like adaptive thresholding, Canny edge detection, and applying median filters along the way. You’ll also use stacking, engineer a key feature, and create a strong final ensemble with the different models you’ve created throughout the series.
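As a rough, hedged illustration of two of the steps mentioned above (not Colin Priest’s actual pipeline), here is what median filtering and adaptive thresholding might look like with the EBImage package; noisy_page.png is a hypothetical input file.

```r
# Two of the image-cleaning steps mentioned above, sketched with EBImage.
# "noisy_page.png" is a hypothetical input file.
library(EBImage)

img <- channel(readImage("noisy_page.png"), "gray")   # greyscale, values in [0, 1]

smoothed <- medianFilter(img, size = 3)                      # suppress speckle noise
cleaned  <- thresh(smoothed, w = 15, h = 15, offset = 0.05)  # adaptive thresholding

display(cleaned)
```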


Microsoft’s new Data Science Virtual Machine

Earlier this week, Andrie showed you how to set up and provision your own virtual machine (VM) to run R and RStudio in Azure. Another option is to use the new Microsoft Data Science Virtual Machine, a pre-configured instance that includes a suite of tools useful to data scientists, including:
• Revolution R Open (performance-enhanced R)
• Anaconda Python
• Visual Studio Community Edition
• Power BI Desktop (with R capabilities)
• SQL Server Express (with R integration)
• Azure SDK (including the ability to run R experiments)


My note on multiple testing

It’s not a shame to write a note on something that (probably) everyone knows, something you thought you knew but are actually not 100% sure about. Multiple testing is such a piece in my knowledge map.
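As a small illustration of why multiple testing matters, here is a hedged sketch using simulated null p-values and base R’s p.adjust():

```r
# 100 tests where the null is always true: at alpha = 0.05 we expect ~5
# false positives unless we correct for multiple testing.
set.seed(1)
pvals <- replicate(100, t.test(rnorm(20), rnorm(20))$p.value)

sum(pvals < 0.05)                                   # uncorrected "discoveries"
sum(p.adjust(pvals, method = "bonferroni") < 0.05)  # family-wise error control
sum(p.adjust(pvals, method = "BH") < 0.05)          # false discovery rate control
```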


A First Attempt At Applying Ensemble Filters

This post will outline a first, failed attempt at applying the ensemble filter methodology to come up with a weighting process on SPY that should, in theory, shift gradually between conviction in a bull market, conviction in a bear market, and anywhere in between. This is a follow-up to this blog post.


Learn to Build Powerful Machine Learning Models with Amazon Service

After using Azure ML last week, I received multiple emails asking me to publish a tutorial on Amazon’s ML. Thankfully, some of my meetings got postponed and I got time to write this. Here is some more good news: I present a tool that makes things even simpler. It removes all the guesswork you had to do in Azure ML when choosing models and splits. Obviously, I am talking about the Amazon ML tool. Unfortunately, this time you won’t get a trial pack but have to create your account by giving up your credit card information. However, the tool is free to use and your credit card information is used only if you breach the free tier. In this article, I’ve demonstrated a step-by-step tutorial on building a machine learning model with Amazon. I’ve also shared a video tutorial at the end of this article. Let’s build our first machine learning model with the Amazon ML tool.


Most Popular Open Source Projects by Google on GitHub

We ranked the most popular Google projects on GitHub by number of stars.
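For readers who want to reproduce a ranking like this, here is a hedged sketch using the public GitHub API via jsonlite; it is not necessarily how the original ranking was built, and it only covers the first page of results.

```r
# Sketch: rank repositories in the "google" GitHub organisation by stars.
# Unauthenticated API calls are rate-limited and return at most 100 repos per page.
library(jsonlite)

repos <- fromJSON("https://api.github.com/orgs/google/repos?per_page=100")
ranked <- repos[order(-repos$stargazers_count), c("name", "stargazers_count")]
head(ranked, 10)   # top 10 by star count (first page of results only)
```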


Introducing d3-shape

Say you’re building a new tool for studying data. What’s the most natural representation for specifying a visualization? A configurable chart? Abstract operators and coordinate systems? Graphical marks and visual encodings? Each abstraction offers its own advantages. For exploratory visualization, you may favor speed (efficiency) so that you can quickly test different views to discover patterns. For explanatory visualization, as in graphics for a wide audience, you may favor greater control over the output (expressiveness) to communicate insights more effectively. Regardless of the approach you choose, to implement your tool, you’ll need to actually draw something to the screen. And that means generating geometric shapes that represent data.


Dawkins on Saying “statistically, … “

If, then, it were true that the possession of a Y chromosome had a causal influence on, say, musical ability or fondness for knitting, what would this mean? It would mean that, in some specified population and in some specified environment, an observer in possession of information about an individual’s sex would be able to make a statistically more accurate prediction as to the person’s musical ability than an observer ignorant of the person’s sex. The emphasis is on the word “statistically”, and let us throw in an “other things being equal” for good measure. The observer might be provided with some additional information, say on the person’s education or upbringing, which would lead him to revise, or even reverse, his prediction based on sex. If females are statistically more likely than males to enjoy knitting, this does not mean that all females enjoy knitting, nor even that a majority do.


Principal Component Analysis

We review the two essentials of principal component analysis (“PCA”): 1) The principal components of a set of data points are the eigenvectors of the correlation matrix of these points in feature space. 2) Projecting the data onto the subspace spanned by the first k of these — listed in descending eigenvalue order — provides the best possible k-dimensional approximation to the data, in the sense of captured variance.
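Those two points translate into a few lines of base R. The sketch below uses a hypothetical standardized data matrix and checks the hand-rolled projection against prcomp():

```r
# PCA "by hand" on a hypothetical standardized data matrix X.
set.seed(7)
X <- scale(matrix(rnorm(200 * 5), ncol = 5))   # 200 points, 5 features

eig <- eigen(cor(X))                  # eigenvectors, descending eigenvalue order
k <- 2
scores <- X %*% eig$vectors[, 1:k]    # projection onto the first k components
cumsum(eig$values) / sum(eig$values)  # cumulative share of variance captured

# The built-in routine gives the same projection (up to sign):
pr <- prcomp(X, center = TRUE, scale. = TRUE)
head(cbind(scores, pr$x[, 1:k]))
```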


Beyond Beta: Relationships between partialPlot(), ICEbox(), and predcomps()

Machine learning models generally outperform standard regression models in terms of predictive performance. However, these models tend to be poor in explaining how they achieve a particular result. This post will discuss three methods used to peek inside ‘black box’ models and the connections between them.
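As a quick, hedged illustration of the first two tools (not the data or models used in the post), here is how partialPlot() and ICEbox::ice() might be applied to a random forest fit on the Boston housing data:

```r
# Hedged illustration on the Boston housing data (not the post's data).
library(randomForest)
library(ICEbox)
library(MASS)   # Boston dataset

set.seed(1)
rf <- randomForest(medv ~ ., data = Boston)

# Partial dependence: the averaged effect of lstat on predicted medv
partialPlot(rf, pred.data = Boston, x.var = "lstat")

# Individual conditional expectation: one curve per observation
ice_lstat <- ice(rf, X = Boston[, names(Boston) != "medv"],
                 y = Boston$medv, predictor = "lstat")
plot(ice_lstat)
```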


Jupyter, Zeppelin, Beaker: The Rise of the Notebooks

Standard software development practices for web, SaaS, and industrial environments tend to focus on maintainability, code quality, robustness, and performance. Scientific programming in data science is more concerned with exploration, experimentation, making demos, collaborating, and sharing results. It is this very need for experiments, explorations, and collaborations that is addressed by notebooks for scientific computing. Notebooks are collaborative web-based environments for data exploration and visualization — the perfect toolbox for data science. In your favorite browser, you can run code, create figures, explain your thought process and publish your results. Notebooks help create reproducible, shareable, collaborative computational narratives. The idea of computer notebooks has been around for a long time, starting with the early days of Matlab and Mathematica in the mid-to-late-80s. Fast forward 15 years: IPython was just a toddler of a few hundred lines of code when SageMath became available as a free and open source environment for scientific computing. The past few years have seen the rise of IPython and its evolution into the Jupyter Project, as well as the emergence of new notebooks, Beaker and Zeppelin. In this article we look at what distinguishes these notebooks and how mature they are.


Exploring Virtual Reality Data Visualization with Gear VR

With the release of the Gear VR virtual reality headset by Samsung and Oculus, it feels like the future is here. It’s easy to see how a number of industries are going to be disrupted by this new media format over the next few years, including video gaming, film, and marketing – imagine an architect letting you tour around a design instead of just showing you a blueprint. But what about data science? The applications are much less clear than in entertainment and marketing, but it’s likely that virtual reality will enable some interesting new data visualizations that 2D images, even interactive ones, don’t provide.


The Topology Underlying the Brand Logo Naming Game: Unidimensional or Local Neighborhoods?

You can find the app on iTunes and Google Play. It’s a game of trivial pursuits – here’s the logo, now tell me the brand. Each item is scored as right or wrong, and the players must take it all very seriously, for there is a Facebook page with cheat sheets for improving one’s total score. What would a psychometrician make of such a game based on brand logo knowledge? Are we measuring one’s level of consumerism (‘a preoccupation with and an inclination toward buying consumer goods’)? Everyone knows the most popular brands, but only the most involved are familiar with the logos of less publicized products. The question for psychometrics is whether one can explain which logos you identify correctly by knowing only your level of consumption.


How to Classify Images with TensorFlow

Prior to joining Google, I spent a lot of time trying to get computers to recognize objects in images. At Jetpac my colleagues and I built mustache detectors to recognize bars full of hipsters, blue sky detectors to find pubs with beer gardens, and dog detectors to spot canine-friendly cafes. At first, we used the traditional computer vision approaches that I’d used my whole career, writing a big ball of custom logic to laboriously recognize one object at a time. For example, to spot sky I’d first run a color detection filter over the whole image looking for shades of blue, and then look at the upper third. If it was mostly blue, and the lower portion of the image wasn’t, then I’d classify that as probably a photo of the outdoors.
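As a toy, hedged sketch of that hand-written heuristic (not the author’s actual code), here is how the “mostly blue upper third” check might look in R; photo.jpg and the 0.5 thresholds are made-up illustrative choices.

```r
# Toy version of the hand-written "blue sky" heuristic described above.
# "photo.jpg" and the 0.5 thresholds are made-up illustrative choices.
library(jpeg)

img <- readJPEG("photo.jpg")   # H x W x 3 array for an RGB image, values in [0, 1]
h <- dim(img)[1]

blueish <- img[, , 3] > img[, , 1] & img[, , 3] > img[, , 2]   # blue dominates R and G
top  <- mean(blueish[1:floor(h / 3), ])         # share of blue pixels, upper third
rest <- mean(blueish[(floor(h / 3) + 1):h, ])   # share of blue pixels, lower part

if (top > 0.5 && rest < 0.5) print("probably an outdoor photo with sky")
```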