This post shows how to serialize a C++ object into an R raw vector object—the base type used by the internal R serialization—and how to deserialize it.
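The post's mechanics are specific to C++ and R's raw vectors, but the underlying move is language-agnostic: flatten an object's fields into a contiguous byte buffer, then read them back. A minimal sketch of that idea using Python's `struct` module (the two-field record here is a made-up example, not taken from the post):

```python
import struct

# Hypothetical record standing in for a C++ struct:
# an int32 id and a double value, packed little-endian.
FORMAT = "<id"  # '<' = little-endian, 'i' = int32, 'd' = float64

def serialize(record_id, value):
    """Pack the fields into a raw byte buffer (the analog of an R raw vector)."""
    return struct.pack(FORMAT, record_id, value)

def deserialize(raw):
    """Recover the fields from the byte buffer."""
    return struct.unpack(FORMAT, raw)

raw = serialize(42, 3.5)
print(len(raw))          # 12: 4 bytes for the int32 + 8 for the double
print(deserialize(raw))  # (42, 3.5)
```

On the R side, a raw vector plays the role of this byte buffer.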
Deep Learning has had a huge impact on computer science, making it possible to explore new frontiers of research and to develop amazingly useful products that millions of people use every day. Our internal deep learning infrastructure DistBelief, developed in 2011, has allowed Googlers to build ever larger neural networks and scale training to thousands of cores in our datacenters. We’ve used it to demonstrate that concepts like “cat” can be learned from unlabeled YouTube images, to improve speech recognition in the Google app by 25%, and to build image search in Google Photos. DistBelief also trained the Inception model that won ImageNet’s Large Scale Visual Recognition Challenge in 2014, and drove our experiments in automated image captioning as well as DeepDream.
One thing that I’ve given a lot of thought to recently is the process that I use to decide whether I trust an R package or not. Kasper Hansen took a break from trolling me on Twitter to talk about how he trusts packages on Github less than packages that are on CRAN and particularly Bioconductor. He makes a couple of points that I think are very relevant.
Lots of analysts misinterpret the term ‘boosting’ used in data science. Let me provide an interesting explanation of this term. Boosting grants power to machine learning models to improve their accuracy of prediction. Boosting algorithms are among the most widely used algorithms in data science competitions. The winners of our last hackathons agree that they try boosting algorithms to improve the accuracy of their models. In this article, I will explain how boosting algorithms work in a very simple manner. I’ve also shared the Python code below. I’ve skipped the intimidating mathematical derivations used in boosting, because they wouldn’t have allowed me to explain this concept in simple terms.
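As a taste of the idea before the article's own Python code, here is a minimal AdaBoost-style sketch in pure Python (the data, thresholds, and round count are toy values chosen for illustration): decision stumps are fitted one after another, and each round increases the weight of the points the previous stumps got wrong, so later stumps focus on the hard cases.

```python
import math

# Toy 1-D data: a pattern that no single threshold separates perfectly.
X = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [1, 1, -1, -1, 1, 1, -1, -1]

def stump_predict(x, thr, sign):
    """A decision stump: predict `sign` above the threshold, `-sign` below."""
    return sign if x > thr else -sign

def fit_stump(X, y, w):
    """Pick the threshold/sign pair with the lowest weighted error."""
    best = None
    for thr in [0.05, 0.15, 0.25, 0.35, 0.5, 0.65, 0.75, 0.85, 0.95]:
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if stump_predict(xi, thr, sign) != yi)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(X, y, rounds=5):
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        err, thr, sign = fit_stump(X, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, sign))
        # Re-weight: boost the weight of the misclassified points.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, thr, sign))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all the stumps."""
    score = sum(a * stump_predict(x, thr, sign) for a, thr, sign in ensemble)
    return 1 if score >= 0 else -1

ensemble = adaboost(X, y)
preds = [predict(ensemble, xi) for xi in X]
print(preds)  # [1, 1, -1, -1, 1, 1, -1, -1]
```

The final prediction is a weighted vote of the stumps; here the ensemble classifies a pattern that no single stump can, which is the whole point of boosting weak learners.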
Regression is one of the – maybe even the single – most important fundamental tools for statistical analysis in quite a large number of research areas. It forms the basis of many of the fancy statistical methods currently en vogue in the social sciences. Multilevel analysis and structural equation modeling are perhaps the most widespread and most obvious extensions of regression analysis that are applied in a large chunk of current psychological and educational research. The reason for this is that the framework under which regression can be put is both simple and flexible. Another great thing is that it is easy to do in R and that there are a lot – a lot – of helper functions for it.
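As a minimal illustration of how simple the underlying machinery is (the post itself works in R), here is simple linear regression fitted from the closed-form least-squares estimates in plain Python, on data simulated with made-up coefficients:

```python
import random

random.seed(0)

# Simulated data: y = 2 + 3*x + noise (the coefficients are illustrative).
n = 200
x = [random.uniform(0, 10) for _ in range(n)]
y = [2 + 3 * xi + random.gauss(0, 1) for xi in x]

# Closed-form least-squares estimates for simple regression:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
x_bar = sum(x) / n
y_bar = sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

print(round(intercept, 1), round(slope, 1))  # close to 2 and 3
```

Multilevel models and structural equation models generalize exactly this machinery, which is why the regression framework carries so far.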
Blabr – ‘Scientific computing for the web’ – is a tool for creating interactive tables, plots, sliders, code, etc. in the browser as a blab (web lab). This page shows how to embed a blab in your own site. Presently there are two options: (a) embed a single layout box; or (b) embed a complete blab.
Variable importance graphs are a great tool for seeing which variables matter in a model. Since we usually use them with random forests, they appear to work well on (very) large datasets. The problem with large datasets is that many features are ‘correlated’, and in that case the values in variable importance plots are hard to interpret and compare.
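The splitting effect can be sketched with a toy permutation-importance experiment: two features share one signal, a small linear model stands in for the random forest, and shuffling a column measures how much the error grows. Everything below (data, model, numbers) is illustrative and not from the original post:

```python
import random

random.seed(1)
n = 300

# Two strongly correlated features built from a shared component; the
# response depends on both. All values here are made up for illustration.
common = [random.gauss(0, 1) for _ in range(n)]
x1 = [c + random.gauss(0, 0.3) for c in common]
x2 = [c + random.gauss(0, 0.3) for c in common]
y = [a + b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]

def ols2(x1, x2, y):
    """Fit y = b0 + b1*x1 + b2*x2 by solving the 2x2 normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

def mse(b, x1, x2, y):
    b0, b1, b2 = b
    return sum((yi - (b0 + b1 * a + b2 * c)) ** 2
               for a, c, yi in zip(x1, x2, y)) / len(y)

b = ols2(x1, x2, y)
base = mse(b, x1, x2, y)

def importance(which):
    """Permutation importance: shuffle one column, measure the MSE increase."""
    p1, p2 = x1[:], x2[:]
    random.shuffle(p1 if which == 1 else p2)
    return mse(b, p1, p2, y) - base

imp1, imp2 = importance(1), importance(2)
print(round(imp1, 2), round(imp2, 2))
```

Both columns carry the same shared signal, so the credit is split between them and neither importance value reflects that signal on its own, which is the interpretation problem the post describes.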
Variety, Velocity, Volume and Veracity are the four Vs for Big Data. Most of the technologies available have shown how to treat the Volume. However, due to the increasing number of streaming data sources, the Velocity problem is more relevant than ever before. Moreover, Veracity and especially Variety problems have increased the difficulty of the challenge. This course focuses on two aspects of the Big Data problem, Velocity and Variety, and it shows how with streaming data and semantic technologies it is possible to enable efficient and effective stream processing for advanced application development.
I’ve just been looking at the historical relationship between the London Interbank Offered Rate (LIBOR) and government bond yields. LIBOR data can be found at Quandl and comes in CSV format, so it’s pretty simple to digest. The bond data can be sourced from the US Department of the Treasury. It comes as XML and requires a little more work.
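A sketch of the two ingestion steps in Python, using miniature stand-in snippets (the column and tag names below are illustrative, not the real Quandl or Treasury schemas):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Miniature stand-ins for the two feeds; the field names are made up.
libor_csv = """Date,Rate
2015-01-02,0.2562
2015-01-05,0.2556
"""

bond_xml = """<entries>
  <entry><date>2015-01-02</date><yield>2.12</yield></entry>
  <entry><date>2015-01-05</date><yield>2.04</yield></entry>
</entries>"""

# CSV: already tabular, one reader call is enough.
libor = {row["Date"]: float(row["Rate"])
         for row in csv.DictReader(io.StringIO(libor_csv))}

# XML: walk the tree and pull the fields out by tag, hence "a little
# more work" than the CSV case.
bonds = {e.findtext("date"): float(e.findtext("yield"))
         for e in ET.fromstring(bond_xml).iter("entry")}

# Join the two series on date to study the relationship.
merged = {d: (libor[d], bonds[d]) for d in libor if d in bonds}
print(merged)
```

Once both series are keyed by date, the historical comparison reduces to an ordinary merged time series.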
Cybersecurity is a domain that really likes surveys, or at the very least it has many folks within it who like to conduct and report on surveys. One recent survey on threat intelligence is in its second year, so it sets about comparing answers across years. Rather than go into the many technical/statistical issues with this survey, I’d like to focus on alternate ways to visualize the comparison across years.
Sometimes we force our categories to be mutually exclusive and exhaustive even as the boundaries are blurring rapidly. Of course, I am speaking of cluster analysis and whether it makes sense to force everyone into one and only one of a set of discrete boxes. Diversity is diverse and requires a more expressive representation than is possible in a game of twenty questions. ‘Is it this or that?’ is inadequate when it is a little of this and a lot of that.
Inference is THE big idea of statistics. This is where people come unstuck. Most people can accept the use of summary descriptive statistics and graphs. They can understand why data is needed. They can see that the way a sample is taken may affect how things turn out. They often understand the need for control groups. Most statistical concepts or ideas are readily explainable. But inference is a tricky, tricky idea. Well actually – it doesn’t need to be tricky, but the way it is generally taught makes it tricky.
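One way to make the idea concrete is a small simulation: if we could repeatedly sample from a known population, we would see that sample means vary around the truth in a predictable way, and that predictability is exactly what inference leans on. A pure-Python sketch with made-up population values:

```python
import random
import statistics

random.seed(0)

# A known population (values are illustrative): inference asks what we
# can say about its mean from a single modest sample.
population = [random.gauss(50, 10) for _ in range(100_000)]
true_mean = statistics.mean(population)

# Draw many samples and watch how the sample mean varies.
sample_means = [statistics.mean(random.sample(population, 50))
                for _ in range(1000)]

spread = statistics.stdev(sample_means)
print(round(true_mean, 1), round(spread, 2))
```

The sample means cluster around the true mean with a spread of about 10/sqrt(50) ≈ 1.4. In practice we only ever get one sample, but because that spread is predictable, a single sample can still support a statement about the population, and that is the inferential leap students find tricky.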