This question was recently posted on Quora: What are the best data science podcasts? Users recommend the following ones:
There are many topics that you won’t learn in statistics classes. Some, such as U-statistics, stochastic geometry, fractal compression, and stochastic differential equations, are for post-graduate students, so it is OK not to find them in statistics curricula. Others, like computational complexity and L^1 metrics (to replace R-squared and other outlier-sensitive L^2 metrics such as traditional variance), should be included, in my opinion. But the classic statistics curriculum is almost written in stone. You can buy any college textbook – hundreds of them are devoted to statistics, and dozens are published each year – and they pretty much cover the same topics. The material has barely changed in decades, even though the theory was built well before the age of big data and modern computers.
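To make the outlier-sensitivity point concrete, here is a minimal sketch comparing an L^2 measure of spread (variance) with an L^1 counterpart. The choice of mean absolute deviation from the median as the L^1 metric, and the toy data, are assumptions made for illustration; the original post does not specify which L^1 metric it has in mind.

```python
import statistics

def variance(xs):
    """Population variance: mean squared deviation from the mean (an L^2 metric)."""
    m = statistics.fmean(xs)
    return statistics.fmean([(x - m) ** 2 for x in xs])

def mean_abs_dev(xs):
    """Mean absolute deviation from the median (an L^1 metric)."""
    med = statistics.median(xs)
    return statistics.fmean([abs(x - med) for x in xs])

# Hypothetical data: the second sample replaces one value with an outlier.
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
dirty = [1.0, 2.0, 3.0, 4.0, 50.0]

# How much does each metric inflate when the outlier is introduced?
variance_ratio = variance(dirty) / variance(clean)
mad_ratio = mean_abs_dev(dirty) / mean_abs_dev(clean)

print(f"variance grows by a factor of {variance_ratio:.1f}")
print(f"mean absolute deviation grows by a factor of {mad_ratio:.1f}")
```

Because the L^2 metric squares each deviation, a single extreme value dominates it far more than it dominates the L^1 metric.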
A substantial amount of scientific research is funded by investigator-initiated grants. A researcher has an idea, writes it up and sends a proposal to a funding agency. The agency then elicits help from a group of peers to evaluate competing proposals. Grants are awarded to the most highly ranked ideas. The percentage awarded depends on how much funding gets allocated to these types of proposals. At the NIH, the largest funder of these types of grants, the success rate recently fell below 20% from a high above 35%. Part of the reason these percentages have fallen is to make room for large collaborative projects. Large projects seem to be increasing, and not just at the NIH. In Europe, for example, the Human Brain Project has an estimated cost of over US$1 billion over 10 years. To put this in perspective, 1 billion dollars can fund over 500 NIH R01s. R01 is the NIH mechanism most appropriate for investigator-initiated proposals.
Fraud has a significant impact on organizations of all sorts and sizes. Estimating the size of the impact in terms of financial losses is difficult and the resulting figures are typically rather sensitive to the underlying assumptions. Yet, calculating the returns of investing in a powerful fraud detection system can and definitely should be done to evaluate whether the system is delivering value to the organization as well as to quantify how much value. This article briefly introduces a formula to calculate ROI in this setting, indicating different sources of costs and benefits to be taken into account.
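As a rough illustration of the kind of calculation involved, the sketch below uses the generic textbook definition of ROI, (benefits - costs) / costs. It is not necessarily the formula the article introduces, and every cost and benefit category and figure shown is hypothetical.

```python
def roi(benefits, costs):
    """Generic return on investment: net benefit divided by total cost."""
    total_benefit = sum(benefits.values())
    total_cost = sum(costs.values())
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return (total_benefit - total_cost) / total_cost

# Hypothetical yearly figures for a fraud detection system.
benefits = {
    "prevented_fraud_losses": 500_000,
    "recovered_amounts": 100_000,
}
costs = {
    "software_and_infrastructure": 200_000,
    "investigation_staff": 100_000,
}

fraud_system_roi = roi(benefits, costs)
print(f"ROI: {fraud_system_roi:.0%}")
```

Since the result is sensitive to the underlying assumptions, it can help to rerun the calculation with pessimistic and optimistic estimates of each entry.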
Referring to existing illustrations helps novice drawers to realize their ideas. In order to find such helpful references from a large image collection, we first build a semantic vector representation of illustrations by training convolutional neural networks. As the proposed vector space correctly reflects the semantic meanings of illustrations, users can efficiently search for references with similar attributes.
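Searching a vector space for similar items can be sketched as a nearest-neighbour lookup. The example below assumes cosine similarity as the measure and uses tiny hand-written vectors with made-up names; in the described system the vectors would come from the trained convolutional networks, and the similarity measure actually used is not stated here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, collection, k=2):
    """Return the names of the k vectors most similar to the query."""
    ranked = sorted(
        collection.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Hypothetical 3-dimensional embeddings of illustrations.
illustrations = {
    "cat_sketch": [0.9, 0.1, 0.0],
    "dog_sketch": [0.8, 0.3, 0.1],
    "castle": [0.0, 0.2, 0.9],
}

results = nearest([1.0, 0.1, 0.0], illustrations, k=2)
print(results)
```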
Given the HTML of ~337k websites served to users of StumbleUpon, identify the paid content disguised as real content.
Neural networks are generating a lot of excitement, as they are quickly proving to be a promising and practical form of machine intelligence. At Fast Forward Labs, we just finished a project researching and building systems that use neural networks for image analysis, as shown in our toy application Pictograph. Our companion deep learning report explains this technology in depth and explores applications and opportunities across industries.
OVA and AVA stand for One vs. All and All vs. All, two approaches to classification problems with more than 2 classes. To illustrate the idea, I’ll use the UCI Vertebral Column data and Letter Recognition Data, and analyze them using my regtools package.
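The structural difference between the two schemes is the set of binary problems each one creates. The sketch below only enumerates those problems and does not fit any model or use regtools; the helper names are hypothetical. It uses the 26 capital letters as class labels, matching the number of classes in the Letter Recognition data.

```python
from itertools import combinations
import string

def one_vs_all_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, "rest") for c in classes]

def all_vs_all_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

letters = list(string.ascii_uppercase)

ova = one_vs_all_tasks(letters)
ava = all_vs_all_tasks(letters)

print(len(ova), "binary problems under One vs. All")
print(len(ava), "binary problems under All vs. All")
```

With k classes, One vs. All yields k problems and All vs. All yields k(k-1)/2, so the gap widens quickly as the number of classes grows.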
The ‘safemode’ package provides a safemode() function that creates a “safe mode” session in R. In “safe mode”, all symbols have an “age” (a last-modified time stamp) and a set of dependent symbols, and a warning is issued whenever a symbol is used in an expression and its age exceeds the age of any of its dependents (i.e., there is a warning whenever a “stale” symbol is used in an expression).
Common table expressions (CTEs, or “WITH clauses”) are a syntactic feature in SQL that makes it easier to write and use subqueries. They act as views or temporary tables that are only available during the lifetime of a single query. A more sophisticated feature is the “recursive CTE”, which is a common table expression that can call itself, providing a convenient syntax for recursive queries. This is very useful, for example, in following paths of links from record to record, as in graph traversal. This capability is supported in Postgres and Microsoft SQL Server (Oracle has similar capabilities with a different syntax), but not in MySQL. Perhaps surprisingly, it is supported in SQLite, and since SQLite is the default backend for sqldf, this gives R users a convenient way to experiment with recursive CTEs.
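A minimal sketch of following links with a recursive CTE in SQLite is shown below. It drives SQLite through Python's built-in sqlite3 module rather than through sqldf in R, and the `links` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical edge list: each row is a link from one record to another.
conn.executescript("""
CREATE TABLE links (src TEXT, dst TEXT);
INSERT INTO links VALUES ('a', 'b'), ('b', 'c'), ('c', 'd'), ('x', 'y');
""")

# The CTE starts from node 'a' and repeatedly joins back onto itself,
# adding every node reachable by following links. UNION (rather than
# UNION ALL) discards duplicates, so the recursion also stops on cycles.
query = """
WITH RECURSIVE reachable(node) AS (
    SELECT 'a'
    UNION
    SELECT links.dst
    FROM links
    JOIN reachable ON links.src = reachable.node
)
SELECT node FROM reachable ORDER BY node;
"""

nodes = [row[0] for row in conn.execute(query)]
print(nodes)
conn.close()
```

The same `WITH RECURSIVE` statement can be passed as a query string to any SQLite front end, since the recursion is handled entirely by the database engine.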
In this beginner’s level tutorial, you’ll learn how to install Shiny Server on an AWS cloud instance, and how to configure the firewall. It will take just a few minutes!
Visualising missing data is important when analysing a dataset. I wanted to make a plot of the presence/absence in a dataset. One package, Amelia, provides a function to do this, but I don’t like the way it looks. So I made a ggplot version of what it did.
A couple of weeks ago I shared news that this site had made it onto the shortlist in the ‘Best Dataviz Website’ category for the 2015 Kantar Information is Beautiful Awards. I am enormously surprised and equally delighted to have discovered that visualisingdata.com has been announced as the winner! The ceremony took place in London this (Wednesday) evening but I am stranded over in Geneva on business so could not attend. In my absence, having been given a slight heads up by the organisers, I had time to prepare some form of award acceptance speech! Rather than send through a script of words or record a video I decided during my flights this morning to compile a quick rudimentary infographic – or infographankyou as I have termed it (sorry) – that expresses my surprise, delight and appreciation.