Query Autofiltering Revisited
In a previous blog post, I introduced the concept of ‘query autofiltering’: the process of using the metadata (information about information) that a search engine has indexed to infer what the user is attempting to find. Much of the information used for faceted search can also be used this way, but by employing this knowledge up front, at ‘query time’, we can answer questions right away and far more precisely than we could otherwise. A word about ‘precision’ here: precision means having fewer ‘false positives’, unintended responses that creep into a result set because they share some words with the best answers. Search applications with well-tuned relevancy will bring the best results to the top of the result list, but it is common for other responses, which we call ‘noise hits’, to come back as well. In the previous post, I explained why the search engine will often ‘do the wrong thing’ when multiple terms are used and why this frustrates users: they add more information to their query to make it less ambiguous, and the responses often do not reward that extra effort; in many cases, the response has more noise hits simply because the query has more words.
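To make the idea concrete, here is a minimal sketch of the core move in Python. The facet fields and vocabularies below are hypothetical stand-ins; a real implementation would pull them from the index itself. The sketch just shows query terms that match known facet values being lifted out of the free-text query and turned into field filters:

```python
# A minimal sketch of query autofiltering: scan the raw query for values
# that already exist in facet fields, and turn those matches into filters.
# The field names and vocabulary here are hypothetical, not from the post.
FACET_VALUES = {
    "color": {"red", "blue", "black"},
    "brand": {"acme", "globex"},
}

def autofilter(query: str):
    """Split a free-text query into residual search terms and field filters."""
    terms, filters = [], {}
    for token in query.lower().split():
        for field, values in FACET_VALUES.items():
            if token in values:
                filters.setdefault(field, []).append(token)
                break
        else:
            # Token matched no facet vocabulary; keep it as a search term.
            terms.append(token)
    return " ".join(terms), filters

# "red acme sofa" -> ("sofa", {"color": ["red"], "brand": ["acme"]})
print(autofilter("red acme sofa"))
```

In a Solr application, each recognized filter would typically become an fq (filter query) clause, leaving only the residual terms to be scored for relevance.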

Introducing the News Lab
It’s hard to think of a more important source of information in the world than quality journalism. At its best, news communicates truth to power, keeps societies free and open, and leads to more informed decision-making by people and leaders. In the past decade, better technology and an open Internet have led to a revolution in how news is created, distributed, and consumed. And given Google’s mission to ensure quality information is accessible and useful everywhere, we want to help ensure that innovation in news leads to a more informed, more democratic world. That’s why we’ve created the News Lab, a new effort at Google to empower innovation at the intersection of technology and media. Our mission is to collaborate with journalists and entrepreneurs to help build the future of media. And we’re tackling this in three ways: through ensuring our tools are made available to journalists around the world (and that newsrooms know how to use them); by getting helpful Google data sets in the hands of journalists everywhere; and through programs designed to build on some of the biggest opportunities that exist in the media industry today.

10 reasons why I love data and analytics
1. Data and analytics allow us to make informed decisions – and to stop guessing.
2. Who likes to argue?
3. Businesses need to make trade-offs.
4. It’s exciting.
5. It satisfies curiosity.
6. It’s mysterious.
7. It can be applied to many different domains.
8. What about money?
9. It makes so much sense.
10. Data Scientist is the sexiest job of the 21st century!

Journal of Statistical Software – Vol. 65
• The R Package groc for Generalized Regression on Orthogonal Components
• CompPD: A MATLAB Package for Computing Projection Depth
• simPH: An R Package for Illustrating Estimates from Cox Proportional Hazard Models Including for Interactive and Nonlinear Effects
• kml and kml3d: R Packages to Cluster Longitudinal Data
• The VGAM Package for Capture-Recapture Data Using the Conditional Likelihood
• frbs: Fuzzy Rule-Based Systems for Classification and Regression in R
• DTR: An R Package for Estimation and Comparison of Survival Outcomes of Dynamic Treatment Regimes
• PCovR: An R Package for Principal Covariates Regression
• Mann-Whitney Type Tests for Microarray Experiments: The R Package gMWT
• remote: Empirical Orthogonal Teleconnections in R
• DiceDesign and DiceEval: Two R Packages for Design and Analysis of Computer Experiments
• ergm.graphlets: A Package for ERG Modeling Based on Graphlet Statistics
• LazySorted: A Lazily, Partially Sorted Python List
• MF Calculator: A Web-Based Application for Analyzing Similarity
• An Improved Evaluation of Kolmogorov’s Distribution

Trading Moving Averages with Less Whipsaws
Using a simple moving average to time markets has been a successful strategy over a very long period of time. Nothing to write home about, but it cuts the drawdown of a buy-and-hold roughly in half while sacrificing less than 1% of the CAGR in the process. In short: simple yet effective.
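Here is a minimal sketch of that baseline strategy, assuming daily closing prices in a pandas Series. The 200-day window and the long-or-cash rule are common defaults for this kind of test, not necessarily the exact parameters the author used:

```python
import pandas as pd

def sma_timing(close: pd.Series, window: int = 200) -> pd.DataFrame:
    """Hold the asset when it closes above its SMA, otherwise sit in cash."""
    sma = close.rolling(window).mean()
    # Trade on the prior day's signal to avoid look-ahead bias.
    signal = (close > sma).astype(float).shift(1).fillna(0.0)
    daily_ret = close.pct_change().fillna(0.0)
    return pd.DataFrame({
        "buy_hold": (1.0 + daily_ret).cumprod(),
        "sma_timing": (1.0 + signal * daily_ret).cumprod(),
    })

def max_drawdown(equity: pd.Series) -> float:
    """Largest peak-to-trough decline of an equity curve, as a negative fraction."""
    return float((equity / equity.cummax() - 1.0).min())
```

Comparing max_drawdown across the two equity curves is how one would check the roughly halved drawdown the post reports; reducing whipsaws then comes down to filtering the raw above/below-SMA signal.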

Data Science 101: Choosing the Right NoSQL Database
NoSQL covers a wide range of database technologies that were developed in response to surging volumes of stored data. Relational databases struggle to scale to these volumes and to adapt quickly to changing schemas; NoSQL databases have become popular because they trade some relational guarantees for horizontal scalability and schema flexibility. The presentation below covers the following topics to help you choose the right NoSQL database for your application (a small sketch of the eventual consistency behind BASE follows the topic list):
1. Traditional databases
2. Challenges with traditional databases
3. CAP Theorem
4. NoSQL to the rescue
5. A BASE system
6. Choose the right NoSQL database
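To illustrate the BASE idea (Basically Available, Soft state, Eventually consistent) from item 5, here is a toy two-replica key-value store in plain Python. Real systems use far more sophisticated anti-entropy protocols, and the last-writer-wins rule here is just one possible conflict policy:

```python
import time

class Replica:
    """A toy key-value replica: writes apply locally, then propagate later."""
    def __init__(self, name):
        self.name, self.store, self.log = name, {}, []

    def put(self, key, value):
        # Tag each write with a timestamp for a last-writer-wins policy.
        self.store[key] = (time.time(), value)
        self.log.append((key, self.store[key]))

    def sync_from(self, other):
        # Anti-entropy pass: adopt any entry with a newer timestamp.
        for key, stamped in other.log:
            if key not in self.store or stamped[0] > self.store[key][0]:
                self.store[key] = stamped

a, b = Replica("a"), Replica("b")
a.put("cart", ["book"])       # write lands on replica a only
print(b.store.get("cart"))    # None: b is temporarily stale ("soft state")
b.sync_from(a)                # background sync converges the replicas
print(b.store.get("cart"))    # now reflects the write ("eventually consistent")
```

The window where the two replicas disagree is exactly the availability-over-consistency trade-off the CAP theorem describes, and it is the behavior to weigh when choosing a NoSQL database.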