The Google Knowledge Graph
Explore your search: With a carousel at the top of the results page, you can get a more complete picture of what you’re curious about. Explore collections from the Knowledge Graph and browse lists of related items that help you research a topic faster and in more depth than before.
A remedy for your health-related questions: health info in the Knowledge Graph
Think of the last time you searched on Google for health information. Maybe you heard a news story about gluten-free diets and pulled up the Google app to ask, “What is celiac disease?” Maybe a co-worker shook your hand and later found out she had pink eye, so you looked up “pink eye” to see whether it’s contagious. Or maybe you were worried about a loved one—like I was, recently, when my infant son Veer fell off a bed in a hotel in rural Vermont, and I was concerned that he might have a concussion. I wasn’t able to search and quickly find the information I urgently needed (and I work at Google!).
Predictive Analytics or Data Science?
I caught up with an old grad school friend a few weeks back. He’s a top-notch statistician who’s built a successful career working in the quant departments of large insurance and health care companies. With only a little simplification, I’d characterize his role over the last 20 years as that of a predictive modeling expert. His work is primarily “big iron”, revolving around Teradata, Oracle, and SAS. Besides being a senior statistician, he’s also a more-than-capable data integration and statistical programmer. In the past few years especially, we’ve had “discussions” on the differences between data science (DS) and statistics/machine learning as disciplines. He characterizes DS as little more than a trumped-up moniker marketed by the newest analytics generation to brand themselves with a sexy statistics job title, for work that’s indistinguishable from what he’s been doing for years.
Why you should start by learning data visualization and manipulation
One of the biggest issues that comes up when I talk to people who want to get started learning data science is the following: “I don’t know where to get started!” Recently, I argued that R is the best programming language to learn when you’re getting started with data science. While this helps you select a programming language, it still doesn’t tell you which skills to focus on. Just like selecting a programming language, selecting the skills to start with can be overwhelming. Again, I want to be direct: learn data visualization first, and then learn data manipulation.
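To make that concrete, here is a minimal sketch of both skills in R, using dplyr for manipulation and ggplot2 for visualization; the package choices and the built-in mtcars data set are my illustration, not prescriptions from the post:

library(dplyr)
library(ggplot2)

# Manipulation: summarize fuel economy by cylinder count.
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))

# Visualization: plot the summary as a bar chart.
ggplot(mpg_by_cyl, aes(x = factor(cyl), y = avg_mpg)) +
  geom_bar(stat = "identity") +
  labs(x = "Cylinders", y = "Average miles per gallon")

A handful of verbs like group_by and summarize, plus a basic plot, already cover a surprising share of day-to-day analysis work, which is the argument for starting here.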
10 things statistics taught us about big data analysis
1. If the goal is prediction accuracy, average many prediction models together.
2. When testing many hypotheses, correct for multiple testing (see the sketch after this list).
3. When you have data measured over space, distance, or time, you should smooth.
4. Before you analyze your data with computers, be sure to plot it.
5. Interactive analysis is the best way to really figure out what is going on in a data set.
6. Know what your real sample size is.
7. Unless you ran a randomized trial, potential confounders should keep you up at night.
8. Define a metric for success up front.
9. Make your code and data available and have smart people check it.
10. Problem first, not solution backward.
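As a quick illustration of point 2, base R’s p.adjust makes multiple-testing correction a one-liner; the simulated p-values below are made up purely for illustration:

set.seed(1)
pvals <- c(runif(95), runif(5, 0, 0.001))   # 95 null tests, 5 strong signals

raw_hits <- sum(pvals < 0.05)                           # naive 0.05 threshold
bh_hits  <- sum(p.adjust(pvals, method = "BH") < 0.05)  # Benjamini-Hochberg FDR

c(raw = raw_hits, adjusted = bh_hits)   # compare naive vs. corrected discovery counts

The naive threshold lets false positives through in proportion to the number of tests; the Benjamini-Hochberg adjustment controls the false discovery rate instead.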
Extending Agile and DevOps to Big Data
An agile environment is one that’s adaptive and promotes evolutionary development and continuous improvement. It fosters flexibility and champions fast failures. Perhaps most importantly, it helps software development teams build and deliver optimal solutions as rapidly as possible. That’s because in today’s competitive market, full of tech-savvy customers who expect new apps and app updates every day and copious amounts of data to work with, IT teams can no longer respond to IT requests with months-long development cycles. It doesn’t matter whether the request comes from a product manager mapping the next rev’s upgrade or a data scientist asking for a new analytics model.
Journal of Statistical Software – Vol. 63
• Software for Spatial Statistics
• micromap: A Package for Linked Micromaps
• micromapST: Exploring and Communicating Geospatial Patterns in US State Data
• RgoogleMaps and loa: Unleashing R Graphics Power on Map Tiles
• plotKML: Scientific Visualization of Spatio-Temporal Data
• ads Package for R: A Fast Unbiased Implementation of the K-function Family for Studying Spatial Point Patterns in Irregular-Shaped Sampling Windows
• ecp: An R Package for Nonparametric Multiple Change Point Analysis of Multivariate Data
• Analysis, Simulation and Prediction of Multivariate Random Fields with Package RandomFields
• Analysis of Random Fields Using CompRandFld
• Parallelizing Gaussian Process Calculations in R
• Pitfalls in the Implementation of Bayesian Hierarchical Modeling of Areal Count Data: An Illustration Using BYM and Leroux Models
Scheduling R Tasks via Windows Task Scheduler
This post will allow you to impress your boss with your strong work ethic by enabling Windows R users to schedule late-night tasks. Picture it: your boss gets an email at 1:30 in the morning with the latest company data as a beautiful report. Linux and Mac users can do this rather easily via cron; Windows users can do it via the Task Scheduler, which can also be driven from the command line.
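As a hedged sketch, you can register such a task from R itself by shelling out to schtasks, the Task Scheduler’s command-line interface; the task name and script path below are hypothetical, and Rscript.exe is assumed to be on the PATH:

# Build a schtasks command that runs an R script every night at 1:30 AM.
cmd <- paste(
  "schtasks /create",
  "/tn nightly_report",                              # task name (made up)
  "/tr \"Rscript.exe C:\\reports\\send_report.R\"",  # what the task runs
  "/sc daily /st 01:30"                              # schedule: daily at 1:30 AM
)
system(cmd)

Run once from an R console with administrator rights, this leaves a persistent scheduled task behind; the script itself would handle building and emailing the report.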
2015 Survey of Data Scientists Reveals Strategic Insights
• “Data science” is a new term for something that’s been around for a while.
• Messy, disorganized data is the number one obstacle holding data scientists back.
• There are not enough data scientists.
• Data scientists want more support from their companies.
• Data scientists use a diverse toolkit dominated by open source.
• The most in-demand data science skill set is programming and coding.
Enhancing R for Distributed Computing
Over the last two decades, R has established itself as the most-used open source tool in data analysis. R’s greatest strength is its user community, which has collectively contributed thousands of packages that extend R’s use in everything from cancer research to graph analysis. But as users in these and many other areas embrace distributed computing, we need to ensure that R continues to be easy for people to write, share, and contribute code with. When it comes to distributed computing, though, while R has many packages that provide parallelism constructs, it has no standardized API. Each package has its own syntax, its own parallelism techniques, and its own set of supported operating systems. Unfortunately, this makes it difficult for users to write distributed programs for themselves, or to make contributions that extend easily to other scenarios.
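For a sense of what one of those package-specific idioms looks like, here is a minimal example using base R’s parallel package; other packages such as foreach, snow, or Rmpi each expose a different API for the same underlying idea, which is exactly the fragmentation the post describes:

library(parallel)

cl <- makeCluster(2)                          # launch two local worker processes
res <- parLapply(cl, 1:10, function(x) x^2)   # farm the computation out to the workers
stopCluster(cl)                               # shut the workers down

unlist(res)   # 1 4 9 ... 100

Swapping this code to a different backend means rewriting the cluster setup, the apply call, and the teardown, which is the portability problem a standardized API would solve.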