Fraud detection in retail with graph analysis
Fraud detection is all about connecting the dots. This post shows how to use graph analysis to identify stolen credit cards and fake identities.
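To make "connecting the dots" concrete, here is a minimal sketch of the idea in plain Python: accounts that share an identifying attribute (a phone number, a shipping address) are linked, and connected components of the resulting graph surface candidate fraud rings. The account names and attributes are entirely made up for illustration; this is not the post's own code.

```python
from collections import defaultdict, deque

# Toy accounts with the attributes they used. Accounts sharing a phone
# or address get linked; large connected components suggest fraud rings.
accounts = {
    "acct1": {"phone": "555-0100", "addr": "1 Elm St"},
    "acct2": {"phone": "555-0100", "addr": "9 Oak Ave"},
    "acct3": {"phone": "555-0101", "addr": "9 Oak Ave"},
    "acct4": {"phone": "555-0199", "addr": "7 Pine Rd"},
}

# Build an undirected graph: an edge whenever two accounts share a value.
adj = defaultdict(set)
by_value = defaultdict(list)
for acct, attrs in accounts.items():
    for key, val in attrs.items():
        by_value[(key, val)].append(acct)
for linked in by_value.values():
    for a in linked:
        for b in linked:
            if a != b:
                adj[a].add(b)

def component(start):
    """Breadth-first search for all accounts reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

ring = component("acct1")  # acct1, acct2, acct3 are chained together
```

Here `acct1` and `acct3` share no attribute directly, yet both land in the same component via `acct2`; that transitive linking is what graph analysis adds over row-by-row rule checks.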

R Functions for Exploratory Analysis, Data Frame Merging & Map Displays
Given below is a list of R functions for quickly exploring the key attributes of a data set. The example data set covers car prices and insurance prices.
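The post's R function list (head, summary, merge and friends) lives in the linked article; as a rough stdlib-Python analogue, here is the same kind of quick exploration and merge on a hypothetical miniature of the car-price / insurance-price data:

```python
import csv
import io
import statistics

# Hypothetical miniature data set, inlined for a self-contained example.
cars_csv = """car_id,price
1,20000
2,35000
3,15000
"""
insurance_csv = """car_id,premium
1,900
2,1400
3,700
"""
cars = list(csv.DictReader(io.StringIO(cars_csv)))
ins = {row["car_id"]: row for row in csv.DictReader(io.StringIO(insurance_csv))}

# Join the two tables on car_id, analogous to R's merge().
merged = [{**c, **ins[c["car_id"]]} for c in cars]

# Quick numeric summary of a key attribute, analogous to R's summary().
prices = [int(c["price"]) for c in cars]
summary = {"min": min(prices), "mean": statistics.mean(prices), "max": max(prices)}
```

The map-display side of the post has no tidy stdlib analogue, so it is omitted here.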

Prediction intervals for Random Forests
An important but often overlooked aspect of applied machine learning is intervals for predictions, be they confidence or prediction intervals. For classification tasks, beginning practitioners often conflate probability with confidence: a probability of 0.5 is taken to mean that we are uncertain about the prediction, while a prediction of 1.0 means we are absolutely certain of the outcome. But two concepts are being mixed up here. A prediction of 0.5 could mean that we have learned very little about a given instance, because we observed few or no data points like it. Or it could be that we have a lot of data and the response is fundamentally uncertain, like flipping a coin.
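The distinction can be seen numerically with the spread of per-model estimates across a bootstrap ensemble (a stand-in here for per-tree estimates in a random forest; this sketch and its toy data are illustrative, not the post's method). Both scenarios below give a point estimate near 0.5, but only the data-poor one gives a wide interval:

```python
import random
import statistics

random.seed(0)

# Two ways to get a predicted probability near 0.5:
#   * plenty of data on a genuinely random outcome (a fair coin), or
#   * almost no data at all.
coin_flips = [random.randint(0, 1) for _ in range(1000)]  # lots of data
sparse = [1, 0]                                           # two observations

def bootstrap_estimates(data, n_models=200):
    """Each 'model' estimates the positive rate on a bootstrap resample."""
    ests = []
    for _ in range(n_models):
        resample = [random.choice(data) for _ in data]
        ests.append(sum(resample) / len(resample))
    return ests

def interval(ests, lo=0.05, hi=0.95):
    """Empirical 90% interval from the spread of per-model estimates."""
    s = sorted(ests)
    return s[int(lo * len(s))], s[int(hi * len(s))]

coin_lo, coin_hi = interval(bootstrap_estimates(coin_flips))
sparse_lo, sparse_hi = interval(bootstrap_estimates(sparse))
# Both point estimates hover near 0.5; the sparse-data interval is far wider.
```

A prediction of 0.5 with a tight interval says "genuinely uncertain outcome"; the same 0.5 with a wide interval says "we simply don't know yet".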

Logistic Regression using Theano
IPython notebook
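The notebook expresses the model as a Theano computation graph; for readers without Theano installed, the same model can be sketched in plain Python, trained by batch gradient descent on the cross-entropy loss (toy 1-D data invented for the example):

```python
import math

# Toy 1-D data: label is 1 when x > 0.
data = [(x / 10.0, 1 if x > 0 else 0) for x in range(-50, 50)]

w, b, lr = 0.0, 0.0, 0.5

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for _ in range(500):
    gw = gb = 0.0
    for x, y in data:
        err = sigmoid(w * x + b) - y  # gradient of the cross-entropy loss
        gw += err * x
        gb += err
    w -= lr * gw / len(data)  # batch gradient-descent update
    b -= lr * gb / len(data)

def predict(x):
    return sigmoid(w * x + b) > 0.5
```

Theano's contribution is that the gradient computed by hand in `err` above is derived symbolically and compiled, which is what makes the approach scale to larger models.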

Statistical and Mathematical Functions with DataFrames in Spark
In this blog post, we walk through some of the important functions, including:
1. Random data generation
2. Summary and descriptive statistics
3. Sample covariance and correlation
4. Cross tabulation (a.k.a. contingency table)
5. Frequent items
6. Mathematical functions
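As a rough illustration of items 1–5 above, here is a plain-Python sketch of the same operations on a toy data set (Spark's own DataFrame API is not shown or assumed here; the data is invented):

```python
import random
from collections import Counter

random.seed(42)

# 1. Random data generation
xs = [random.uniform(0, 10) for _ in range(1000)]
ys = [2 * x + random.gauss(0, 1) for x in xs]

# 2. Summary and descriptive statistics
def mean(v):
    return sum(v) / len(v)

# 3. Sample covariance and correlation
def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

corr = cov(xs, ys) / (cov(xs, xs) * cov(ys, ys)) ** 0.5

# 4. Cross tabulation of two categorical columns
color = [random.choice(["red", "blue"]) for _ in range(1000)]
size = [random.choice(["S", "L"]) for _ in range(1000)]
crosstab = Counter(zip(color, size))

# 5. Frequent items: values appearing in more than 40% of rows
freq = [v for v, c in Counter(color).items() if c > 0.4 * len(color)]
```

In Spark these run as distributed jobs over partitioned DataFrames; the single-machine versions above just pin down what each operation computes.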

• HDFS
• MapReduce
• Avro
• Apache Thrift
• Hive and Hue
• Pig
• Jaql
• Sqoop
• Oozie
• ZooKeeper
• HBase
• Cassandra
• Flume
• Mahout
• Fuse
• Whirr
• Giraph
• Chukwa
• Drill
• Impala (Cloudera)

A Multilingual Corpus of Automatically Extracted Relations from Wikipedia
In Natural Language Processing, relation extraction is the task of assigning a semantic relationship between a pair of arguments. As an example, a relationship between the phrases ‘Ottawa’ and ‘Canada’ is ‘is the capital of’. These extracted relations could be used in a variety of applications ranging from Question Answering to building databases from unstructured text.
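To make the task concrete, here is a deliberately naive pattern-based extractor for a single relation type (real systems, including the corpus described here, learn such patterns from Wikipedia at scale rather than hand-coding one regex):

```python
import re

# One hand-written pattern for the 'is the capital of' relation.
PATTERN = re.compile(r"(?P<arg1>[A-Z]\w+) is the capital of (?P<arg2>[A-Z]\w+)")

def extract(text):
    """Return (arg1, relation, arg2) triples found in `text`."""
    return [(m.group("arg1"), "is the capital of", m.group("arg2"))
            for m in PATTERN.finditer(text)]

triples = extract("Ottawa is the capital of Canada. Paris is the capital of France.")
```

Each triple is exactly the kind of record that can populate a database built from unstructured text, or back a question-answering lookup.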

Parallelism, R, and OpenMP
On the wrathematics blog.

Generalized linear models, abridged
Generalized linear models (GLMs) are indispensable tools in the data science toolbox. They are applicable to many real-world problems involving continuous, yes/no, count and survival data (and more). The models themselves are intuitive and can be used for inference and prediction. A few very high-quality free and open-source software implementations are available (in particular within R), as are several first-rate commercial ones (Revolution Analytics, SAS, Stata). Despite the wide applicability of GLMs and the availability of high-quality reference implementations, we’ve found it hard to find good high-performance free and open-source implementations geared to solving large problems. Moreover, we’ve had trouble finding succinct references that deal with the core implementation details that could help us build our own high-performance implementations. This note grew out of our own desire to better understand how to go about solving generalized linear models in practice. We highlight aspects of GLM implementations that we find particularly interesting. We present some reference implementations stripped down to illuminate core ideas, often in just a few lines of code. Finally, we discuss details that enable the development of effective distributed parallel implementations suitable for solving large-scale problems. We develop our ideas in our favorite language, R, but they are easily adapted to other languages. Python and Boost/C++ are particularly well-equipped for good GLM implementations, in our opinion.
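In the spirit of the note's stripped-down reference implementations, here is a few-line IRLS (iteratively reweighted least squares) fit of a logistic-regression GLM, written in Python rather than the note's R, with one covariate plus an intercept so the 2x2 weighted normal equations can be solved directly (the data is a small made-up example):

```python
import math

# Toy non-separable data: (x, y) pairs with overlapping labels.
data = [(-2, 0), (-1, 0), (-1, 1), (0, 0), (0, 1), (1, 0), (1, 1), (2, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b0 = b1 = 0.0
for _ in range(25):  # IRLS typically converges in a handful of steps
    sw = swx = swxx = swz = swxz = 0.0
    for x, y in data:
        eta = b0 + b1 * x          # linear predictor
        mu = sigmoid(eta)          # mean via the logit link's inverse
        w = mu * (1 - mu)          # GLM working weights
        z = eta + (y - mu) / w     # working response
        sw += w; swx += w * x; swxx += w * x * x
        swz += w * z; swxz += w * x * z
    # Solve the 2x2 weighted least-squares normal equations.
    det = sw * swxx - swx * swx
    b0 = (swxx * swz - swx * swxz) / det
    b1 = (sw * swxz - swx * swz) / det
```

The whole algorithm is a loop around one weighted least-squares solve; swapping the link and variance functions in `mu`, `w`, and `z` gives the other GLM families, which is why high-performance implementations concentrate on the inner least-squares step.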

Air Pollution (PM10 and PM2.5) in Different Cities using Interactive Charts
Gardiner Harris, a South Asia correspondent for The New York Times, shared a personal story of his son’s breathing troubles in New Delhi, India, in a recent dispatch titled Holding Your Breath in India. In this post, I use data from the World Health Organization’s website to identify and map cities where the air quality is worse than acceptable levels, as measured by the annual mean concentration of particulate matter (PM10 and PM2.5); a link to the data was provided in the New York Times article. I use many packages from the RStudio, Ramnath Vaidyanathan and Kenton Russel team, among others, in this process. The code for this entire post can be found on GitHub at http://…/Airpollutionpm.

How To Analyze Data: Seven Modern Remakes Of The Most Famous Graphs Ever Made
Graphs can be beautiful, powerful tools. Graphs help us explore and explain the world. For hundreds of years, humans have used graphs to tell stories with data. To pay homage to the history of data visualization and to the power of graphs, we’ve recreated the most iconic graphs ever made.