Graph databases are powering mission-critical applications

While most people associate graphs with social media analysis, a wide range of applications, including recommendations, fraud detection, IT operations, and security, are routinely framed using graphs. This variety of use cases has given rise to many interesting tools for storing, managing, visualizing, and analyzing massive graphs. The important thing to note is that graph databases are not limited to reporting and analytics; they are also being used to power mission-critical applications. In this episode of the O’Reilly Data Show, I sat down with Emil Eifrem, CEO and co-founder of Neo Technology. We talked about the early days of NoSQL, applications of graph databases, cloud computing, and company culture in the U.S. and Sweden.


Confidence intervals: What they are and are not

Over at the Psychonomic Society Featured Content blog, there are several new articles outlining some of our work on confidence intervals, published previously in Psychonomic Bulletin & Review. In a three-part series, Stephan Lewandowsky and Alexander Etz lay out our case for why confidence intervals are not what people think they are. I’ve written enough about confidence intervals lately, so I’ll just link you to their articles.
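As a quick aside that is not taken from those articles: the textbook 95% coverage guarantee is a statement about the interval-constructing procedure over repeated samples, not about any single computed interval. A small base-R simulation (the setup and numbers below are mine, purely for illustration) makes that concrete:

# Coverage of the standard 95% t-interval over many repeated samples.
# The long-run coverage is a property of the procedure, not of one interval.
set.seed(42)
true_mean <- 10
n <- 25
reps <- 10000

covers <- replicate(reps, {
  x  <- rnorm(n, mean = true_mean, sd = 3)
  ci <- t.test(x)$conf.int          # standard 95% t-interval
  ci[1] <= true_mean && true_mean <= ci[2]
})

mean(covers)  # close to 0.95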


Just When You Thought You Understood RDBMS and NoSQL Databases – This Happens

Just when you thought you had a handle on the database market, it fragments again. Here’s an overview to help you keep up.


The 5 Ways Data Scientists Keep Learning After College

1. Go to events and join communities
2. Focus on asking the right question, not how to use the right tool
3. Participate in Kaggle competitions
4. Take online courses
5. Keep reading books, blogs, and articles


Wikipedia Deploys AI to Expand Its Ranks of Human Editors

Aaron Halfaker just built an artificial intelligence engine designed to automatically analyze changes to Wikipedia. Wikipedia is the online encyclopedia anyone can edit. In crowdsourcing the creation of an encyclopedia, the not-for-profit website forever changed the way we get information. It’s among the ten most-visited sites on the Internet, and it has swept tomes like World Book and Encyclopedia Britannica into the dustbin of history. But it’s not without flaws. If anyone can edit Wikipedia, anyone can mistakenly add bogus information. And anyone can vandalize the site, purposefully adding bogus information. Halfaker, a senior research scientist at the Wikimedia Foundation, the organization that oversees Wikipedia, built his AI engine as a way of identifying such vandalism.


Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet

Training deep neural nets can take precious time and resources. By leveraging an existing distributed batch processing framework, SparkNet can train neural nets quickly and efficiently.
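The sketch below is my own toy illustration of the data-parallel pattern that, as I understand the SparkNet paper, the system is built on: each worker runs a few steps of SGD on its own data partition, and a driver periodically averages the parameters. It uses plain R and a linear model rather than Spark and a deep net, so nothing here is SparkNet’s actual API.

# Toy simulation of data-parallel training with periodic parameter averaging
# (my own illustration, not SparkNet code): each "worker" runs mini-batch SGD
# on its shard, then the "driver" averages the resulting parameter vectors.
set.seed(1)
n <- 4000; p <- 10; n_workers <- 4
X <- matrix(rnorm(n * p), n, p)
beta_true <- rnorm(p)
y <- X %*% beta_true + rnorm(n, sd = 0.1)

shards <- split(seq_len(n), rep(seq_len(n_workers), length.out = n))

local_sgd <- function(w, Xs, ys, steps = 50, lr = 0.01) {
  for (i in seq_len(steps)) {
    idx  <- sample(nrow(Xs), 32)                       # mini-batch
    grad <- t(Xs[idx, ]) %*% (Xs[idx, ] %*% w - ys[idx]) / length(idx)
    w    <- w - lr * grad
  }
  w
}

w <- matrix(0, p, 1)                                   # parameters held by the driver
for (round in 1:20) {
  # "broadcast" w, train locally on each shard, then average the results
  local_w <- lapply(shards, function(ix) local_sgd(w, X[ix, ], y[ix, , drop = FALSE]))
  w <- Reduce(`+`, local_w) / n_workers
}
sqrt(mean((w - beta_true)^2))                          # parameters recovered reasonably well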


Sentiment Analysis 101

Sentiment analysis refers to the use of natural language processing, text analysis, and computational linguistics to ascertain the attitude of a speaker or writer toward a specific topic. Basically, it helps determine whether a text expresses positive, negative, or neutral sentiment. Sentiment analysis is an excellent way to discover how people, particularly consumers, feel about a particular topic, product, or idea. Its origins can be traced to the 1950s, when it was primarily applied to written paper documents. Today, however, sentiment analysis is widely used to mine subjective information from content on the Internet, including texts, tweets, blogs, social media, news articles, reviews, and comments. This is done using a variety of techniques, including NLP, statistics, and machine learning methods. Organizations then use the mined information to identify new opportunities and better tailor their message to their target demographics. The Obama Administration even uses sentiment analysis to predict public response to its policy announcements.
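To make the positive/negative/neutral idea concrete, here is a deliberately naive lexicon-based scorer in R. It is only an illustration (the word lists and function name are made up for the example), not how any production sentiment system works.

# Toy lexicon-based sentiment scoring: count positive and negative words
# and label the text by the sign of the difference.
positive_words <- c("good", "great", "excellent", "love", "happy")
negative_words <- c("bad", "terrible", "awful", "hate", "angry")

score_sentiment <- function(text) {
  words <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  score <- sum(words %in% positive_words) - sum(words %in% negative_words)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

score_sentiment("I love this product, it is excellent")           # "positive"
score_sentiment("The service was terrible and the food was bad")  # "negative"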


What can global temperature data tell us?

Debates about anthropogenic climate change often centre around data on changes in global temperatures over the last few decades. There are good scientific reasons to look at this data, but it also plays a prominent role in political advocacy, sometimes fairly, sometimes not so fairly. This is the first in a series of posts in which I’ll discuss what this data can and cannot tell us, and examine some recent papers concerning whether or not there has been a “pause” in global warming over the last 10 to 20 years, and if so, what it might mean.


Feature Selection with caret’s Genetic Algorithm Option

If there is anything that experienced machine learning practitioners are likely to agree on, it would be the importance of careful and thoughtful feature engineering. The judicious selection of which predictor variables to include in a model often has a more beneficial effect on overall classifier performance than the choice of the classification algorithm itself. This is one reason why classification algorithms that automatically include feature selection, such as glmnet, gbm, or random forests, top the list of “go to” algorithms for many practitioners. There are occasions, however, when you find yourself committed, for one reason or another, to a classifier that doesn’t automatically narrow down the list of predictor variables, and some sort of automated feature selection might seem like a good idea. If you are an R user, the caret package offers a whole lot of machinery that might be helpful. Caret offers both filter methods and wrapper methods, including recursive feature elimination, genetic algorithms (GAs), and simulated annealing. In this post, we will have a look at a small experiment with caret’s GA option. But first, a little background.
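For readers who want to see roughly what this looks like in code, here is a minimal sketch of calling caret’s gafs() on simulated data. The argument names follow my reading of the caret documentation (check ?gafs and ?gafsControl before running), and the GA settings are deliberately tiny because this search is expensive.

# Minimal sketch of genetic-algorithm feature selection with caret's gafs().
# rfGA uses random-forest fitness functions, so the randomForest package
# must be installed. Settings are kept small purely for illustration.
library(caret)

set.seed(10)
dat <- twoClassSim(200, noiseVars = 10)   # simulated data with extra noise predictors
x <- dat[, names(dat) != "Class"]
y <- dat$Class

ctrl <- gafsControl(functions = rfGA,     # random-forest based GA helpers
                    method = "cv", number = 3)

ga_fit <- gafs(x = x, y = y,
               iters = 5,                 # GA generations (tiny, for illustration)
               popSize = 10,
               gafsControl = ctrl)

ga_fit$optVariables                       # predictors retained by the GA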


Estimating the mean and standard deviation from the median and the range

While preparing the data for a meta-analysis, I ran into the problem that a few of my sources did not report the outcome of interest as means and standard deviations, but rather as medians and ranges of values. After looking around, I found this interesting paper, which derived (and validated through simple simulations) formulas that can be used to convert the median/range into a mean and a variance in a distribution-free fashion.
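For reference, the approximations most often cited in this literature (e.g. Hozo et al., 2005; I have not verified that this is the exact paper meant here) estimate the mean as (min + 2*median + max)/4 and, for moderate-to-large samples, the standard deviation as range/4. A small R helper along those lines:

# Rough sketch based on the commonly cited approximations (check the paper for
# the sample-size dependent refinements of the SD rule):
#   mean ~ (min + 2*median + max) / 4
#   sd   ~ (max - min) / 4           # for moderate-to-large samples
estimate_mean_sd <- function(min_val, median_val, max_val) {
  est_mean <- (min_val + 2 * median_val + max_val) / 4
  est_sd   <- (max_val - min_val) / 4
  c(mean = est_mean, sd = est_sd)
}

estimate_mean_sd(min_val = 12, median_val = 20, max_val = 34)
# mean   sd
# 21.5  5.5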


Hack the Proton. A data-crunching game from the Beta and Bit series

I’ve prepared a short console-based, data-driven R game named “The Proton Game”. The goal of the player is to infiltrate Slawomir Pietraszko’s account on a Proton server. To do this, you have to solve four data-based puzzles. The game can be played by beginners as well as heavy users of R. A survey of people who completed the beta version shows that the game provides around 15 minutes of fun for experienced R users and up to around 60 minutes for people who are just starting to program in R. More details about the beta-version results are presented in the plot at the bottom of the post.