Introducing practical and robust anomaly detection in a time series
Both last year and this year, we saw a spike in the number of photos uploaded to Twitter on Christmas Eve, Christmas and New Year’s Eve (in other words, an anomaly occurred in the corresponding time series). Today, we’re announcing AnomalyDetection, our open-source R package that automatically detects anomalies like these in big data in a practical and robust way.
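For readers who want to try it, here is a minimal sketch based on the interface described in the package README; the hourly count series and the injected spikes below are simulated, not Twitter data.

```r
# devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

# Simulated hourly photo-upload counts with a few hand-injected spikes
set.seed(1)
counts <- data.frame(
  timestamp = seq(as.POSIXct("2014-12-01", tz = "UTC"), by = "hour", length.out = 24 * 14),
  count     = rpois(24 * 14, lambda = 100)
)
counts$count[c(50, 170, 300)] <- 400

# Detect anomalies in the series (argument names per the package README)
res <- AnomalyDetectionTs(counts, max_anoms = 0.02, direction = "both", plot = TRUE)
res$anoms   # the timestamps and values flagged as anomalous
```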

Become Data Literate
Sign up to receive a new dataset and fun problems every two weeks. Improving your sense for data is as easy as trying our problems! The first one is available now!

Deep Net Highlights from 2014
Looking back at all the literature that was published on the topic of deep learning in 2014 is quite overwhelming. At times it has felt that nearly every week a new paper from some heavyweights was being circulated via arXiv. Trying to do a review of all that work would be a fool’s errand, but I thought it might be worthwhile to summarize some of the threads that I found particularly interesting. Below I group papers by broader topic.

Talk to R
Here’s a neat demo from Yihui Xie: you can talk to this R graph and customize it with voice commands.

7 Interesting Big Data and Analytics Trends for 2015
Here’s a list of what Timo Elliott finds the most interesting trends in analytics in 2015.
1. More Magic
2. Datafication
3. Multipolar Analytics
4. Fluid Analysis
5. Community
6. Analytic Ecosystems
7. Data Privacy

Creating a custom soil attribute plot using ggmap
I love creating spatial data visualizations in R. With the ggmap package, I can easily download satellite imagery which serves as a base layer for the data I want to represent. In the code below, I show you how to visualize sampled soil attributes among 16 different rice fields in Uruguay.
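A minimal sketch of the general approach (not the post's actual code); the coordinates and soil attribute values below are made up, and recent ggmap versions may require a Google Maps API key for satellite imagery.

```r
library(ggmap)
library(ggplot2)

# Hypothetical sampled soil attribute (pH) at 16 field locations
soil <- data.frame(lon = runif(16, -57.8, -57.6),
                   lat = runif(16, -33.3, -33.1),
                   ph  = runif(16, 5, 7))

# Satellite base layer, then overlay the sampled points coloured by pH
base <- get_map(location = c(lon = -57.7, lat = -33.2), zoom = 11, maptype = "satellite")
ggmap(base) +
  geom_point(data = soil, aes(x = lon, y = lat, colour = ph), size = 3)
```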

Random Test/Train Split is not Always Enough
Most data science projects are well served by a random test/train split. In our book Practical Data Science with R we strongly advise preparing data and including enough variables so that data is exchangeable, and scoring classifiers using a random test/train split.
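For context, a random test/train split of the kind being discussed takes only a couple of lines of base R; this sketch is illustrative and is not taken from the book.

```r
# Simple 70/30 random test/train split on a built-in data set
set.seed(42)
n         <- nrow(mtcars)
train_idx <- sample.int(n, size = floor(0.7 * n))
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

fit <- lm(mpg ~ wt + hp, data = train)
mean((predict(fit, test) - test$mpg)^2)   # hold-out mean squared error
```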

Kalman filter example visualised with R
At the last Cologne R user meeting Holger Zien gave a great introduction to dynamic linear models (dlm). One special case of a dlm is the Kalman filter, which I will discuss in this post in more detail. I kind of used it earlier when I measured the temperature in my room. Over the last week I came across the wonderful quantitative economic modelling site quant-econ.net, designed and written by Thomas J. Sargent and John Stachurski. The site not only provides access to their lecture notes, including the Kalman filter, but also code in Python and Julia. I will take their example of the Kalman filter and go through it with R. I particularly liked their visuals of the various steps of the Kalman filter.
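As a taste of what the post works through, here is a minimal scalar Kalman filter in base R (a local-level model with a random-walk state); it is a generic sketch, not the quant-econ example itself.

```r
# State:       x_t = x_{t-1} + w_t,  w_t ~ N(0, Q)
# Observation: y_t = x_t     + v_t,  v_t ~ N(0, R)
set.seed(1)
n <- 100
Q <- 0.01; R <- 0.5
x <- cumsum(rnorm(n, sd = sqrt(Q)))   # hidden state
y <- x + rnorm(n, sd = sqrt(R))       # noisy observations

x_est <- numeric(n); P <- 1; x_prev <- 0
for (t in 1:n) {
  x_pred <- x_prev                         # predict step
  P_pred <- P + Q
  K      <- P_pred / (P_pred + R)          # Kalman gain
  x_prev <- x_pred + K * (y[t] - x_pred)   # update step
  P      <- (1 - K) * P_pred
  x_est[t] <- x_prev
}

plot(y, col = "grey", pch = 19, cex = 0.5)
lines(x, col = "blue"); lines(x_est, col = "red")
```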

MapReduce simply explained
Big Data is a set of technologies that lets users store data and run computations across multiple machines as if they were a single entity. Think of it as a poor man’s supercomputer.
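The programming model itself is easy to sketch on a single machine; the classic word-count example below uses base R's Map/Reduce-style functions purely as an illustration of the idea, not as a distributed implementation.

```r
docs <- c("the cat sat", "the dog sat", "the cat ran")

# Map phase: each document independently emits its own word counts
mapped <- lapply(strsplit(docs, " "), table)

# Reduce phase: merge the per-document counts into global counts by key (word)
word_count <- Reduce(function(a, b) {
  words <- union(names(a), names(b))
  out   <- setNames(rep(0, length(words)), words)
  out[names(a)] <- out[names(a)] + a
  out[names(b)] <- out[names(b)] + b
  out
}, mapped)

word_count
```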

Precision, Recall, AUCs and ROCs
I occasionally like to look at the ongoing Kaggle competitions to see what kind of data problems people are interested in (and the discussion boards are a good place to find out what techniques are popular). Each competition includes a way of scoring the submissions, based on the type of problem. An interesting one that I’ve seen for a number of classification problems is the area under the Receiver Operating Characteristic (ROC) curve, sometimes shortened to the ROC score or AUC (Area Under the Curve). In this post, I want to discuss some interesting properties of this scoring system, and its relation to another similar measure – precision/recall.
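As a quick reference, the AUC can be computed directly from the rank-sum (Mann-Whitney) identity, and precision/recall from a thresholded confusion table; the scores and labels below are simulated.

```r
set.seed(1)
labels <- rbinom(200, 1, 0.4)          # hypothetical true classes
scores <- labels + rnorm(200)          # hypothetical classifier scores

# AUC via the rank-sum identity: P(score of a random positive > score of a random negative)
r     <- rank(scores)
n_pos <- sum(labels == 1); n_neg <- sum(labels == 0)
auc   <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Precision and recall at a fixed threshold
pred      <- scores > 0.5
precision <- sum(pred & labels == 1) / sum(pred)
recall    <- sum(pred & labels == 1) / sum(labels == 1)
c(auc = auc, precision = precision, recall = recall)
```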

Text Analysis 101; A Basic Understanding for Business Users: Document Classification
The automatic classification of documents is an example of how Machine Learning (ML) and Natural Language Processing (NLP) can be leveraged to enable machines to better understand human language. By classifying text, we are aiming to assign one or more classes or categories to a document or piece of text, making it easier to manage and sort the documents. Manually categorizing and grouping text sources can be extremely laborious and time-consuming, especially for publishers, news sites, blogs or anyone who deals with a lot of content.

Probable Points and Credible Intervals, Part 2: Decision Theory
This is a continuation of Probable Points and Credible Intervals, a series of posts on Bayesian point and interval estimates. In Part 1 we looked at these estimates as graphical summaries, useful when it’s difficult to plot the whole posterior in good way. Here I’ll instead look at points and intervals from a decision theoretical perspective, in my opinion the conceptually cleanest way of characterizing what these constructs are.
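A compact way to see the decision-theoretic view is that each point estimate minimises the expected posterior loss under some loss function; the sketch below (not from the post) recovers the posterior mean under squared loss and the posterior median under absolute loss from a simulated posterior sample.

```r
set.seed(1)
post <- rgamma(10000, shape = 3, rate = 1)        # hypothetical posterior sample

# Expected posterior loss of a candidate point estimate
exp_loss <- function(point, loss) mean(loss(post, point))

grid     <- seq(0, 10, by = 0.01)
best_sq  <- grid[which.min(sapply(grid, exp_loss, loss = function(x, p) (x - p)^2))]
best_abs <- grid[which.min(sapply(grid, exp_loss, loss = function(x, p) abs(x - p)))]

c(best_sq, mean(post))      # squared loss minimiser is (approximately) the posterior mean
c(best_abs, median(post))   # absolute loss minimiser is (approximately) the posterior median
```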

R: Numeric Representation of Date Time
I’ve been playing around with date times in R recently and I wanted to derive a numeric representation for a given value to make it easier to see the correlation between time and another variable.
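Two common numeric encodings, seconds since the Unix epoch and decimal hour of day, can be derived with base R alone; this small sketch is illustrative rather than the post's code.

```r
t <- as.POSIXct("2015-01-09 10:30:00", tz = "UTC")

as.numeric(t)                          # seconds since 1970-01-01 (Unix epoch)
as.numeric(format(t, "%H")) +
  as.numeric(format(t, "%M")) / 60     # decimal hour of day: 10.5
```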

Lessons from next-generation data wrangling tools
… some of the lessons these companies have learned extend beyond data preparation.
1. Scalability ~ data variety and size
2. Empower domain experts
3. Consider DSLs and visual interfaces
4. Intelligence and automation
5. Don’t forget about replication

Doing Scatterplots in R
In this lesson, we see how to use qplot to create a simple scatterplot.
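For reference, the basic call looks like this (using a built-in data set rather than the lesson's data):

```r
library(ggplot2)
qplot(wt, mpg, data = mtcars,
      xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```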

Common Pitfalls in Machine Learning
Over the past few years I have worked on numerous different machine learning problems. Along the way I have fallen foul of many pitfalls, some subtle and some not so subtle, when building models. Falling into these pitfalls often means that a model you think is great actually performs terribly in real life. If business decisions are being made based on your models, you want them to be right! (A short sketch of pitfall 11, feature selection leakage, follows the list.)
1. Traditional Overfitting
2. More training data
3. Simpler Predictor Function
4. Regularisation
5. Integrate over many predictors
6. Parameter Tweak Overfitting
7. Choice of measure
8. Resampling Bias and Variance
9. Bad Statistics
10. Information Leakage
11. Feature Selection Leakage
12. Detection
13. Label Randomisation
14. Human-loop overfitting
15. Non-Stationary Distributions
16. Sampling
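As promised above, here is a small sketch of pitfall 11, feature selection leakage: the data are pure noise, yet selecting features on the full data set before cross-validation makes a simple nearest-centroid classifier look far better than chance. The setup is illustrative and not taken from the original post.

```r
set.seed(1)
n <- 100; p <- 1000
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)                 # labels carry no signal at all

# The leak: pick the 20 features most correlated with y using ALL rows
sel <- order(abs(cor(x, y)), decreasing = TRUE)[1:20]

# Leave-one-out CV performed only AFTER the leaky selection
acc <- sapply(1:n, function(i) {
  mu1  <- colMeans(x[-i, sel][y[-i] == 1, , drop = FALSE])
  mu0  <- colMeans(x[-i, sel][y[-i] == 0, , drop = FALSE])
  pred <- sum((x[i, sel] - mu1)^2) < sum((x[i, sel] - mu0)^2)
  pred == (y[i] == 1)
})
mean(acc)   # typically well above 0.5, even though there is nothing to learn
```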

New R Package: metricsgraphics
metricsgraphics is an ‘htmlwidget’ interface to the MetricsGraphics.js D3 chart library. The current htmlwidget wrapper is minimally functional: it does not yet support metricsgraphics histograms and provides only nascent support for metricsgraphics’ best feature – time series charts.
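A minimal sketch of a time series chart, assuming the mjs_plot()/mjs_line() interface shown in the package README; the data frame here is made up.

```r
library(metricsgraphics)
library(magrittr)

df <- data.frame(date  = seq(as.Date("2014-01-01"), by = "day", length.out = 100),
                 value = cumsum(rnorm(100)))

mjs_plot(df, x = date, y = value) %>%
  mjs_line()
```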

Using rvest to Scrape an HTML Table
I recently had the need to scrape a table from Wikipedia. Normally, I’d probably cut and paste it into a spreadsheet, but I figured I’d give Hadley’s rvest package a go.
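The core pattern looks roughly like this; the Wikipedia page here is a hypothetical stand-in, not necessarily the table from the post, and the calls assume the current rvest API.

```r
library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)")
tbl  <- html_table(html_nodes(page, "table")[[1]])   # first table on the page as a data frame
head(tbl)
```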

How is decision making changing in retail as a result of data and analytics?
Retail executives say they would use data and analytics more for big decision making if the quality, accuracy or completeness of the data were higher (44%). Yet for all of the challenges, 66% have already changed their organisations’ approach to decision making as a result of data and analytics, and another 32% plan to do so.

Build Intelligent Business Assistants on Graph Engine in SAP HANA
Cognitive Computing represents a fundamental shift in the programming world. Deep dive into the concept of Cognitive Computing, and discover how SAP technologies enable you to tap into Big Data from a new perspective.

7 Traps to Avoid Being Fooled by Statistical Randomness
Randomness is all around us. Its existence sends fear into the hearts of predictive analytics specialists everywhere — if a process is truly random, then it is not predictable, in the analytic sense of that term. Randomness refers to the absence of patterns, order, coherence, and predictability in a system.

10 Easy Steps to a Complete Understanding of SQL
Too many programmers think SQL is a bit of a beast. It is one of the few declarative languages out there, and as such, behaves in an entirely different way from imperative, object-oriented, or even functional languages (although, some say that SQL is also somewhat functional).

Japan’s birth rate problem is way worse than anyone imagined
Japan’s population shrank by its largest amount on record in 2014. Roughly 1.001 million people were born and 1.269 million people died last year, leaving the country with 268,000 fewer people overall.

Latest MOOCs on Data Science
As a programmer stepping into the world of data science, I’m following some Massive Open Online Courses (MOOCs) on various provider websites.

Three Types Of Analytic Talent You Need
Three more fundamental talents which, if missing or out of balance, you should go after immediately. We can call these capabilities “Experiencers,” “Optimizers” and “Builders”.

Looking Forward: Big Data in 2015
In 2015 we are going to see some significant progress in the automation of insight discovery through Machine Learning and soft AI technologies. There will not be broad operationalization of either of these approaches, but there will be material progress by some key leaders. This in turn will set the stage for a very interesting 2016 as those early leaders leverage their data infrastructure and start to do real competitive damage to the rest of the field.
#1 – This is the year that enterprises begin to adopt Machine Learning
#2 – Better Models Matter
#3 – Better Models Do Not Mean True AI
#4 – Soft AI Will Make A Difference in 2015
#5 – Automation Makes a Splash

9 tips for effective data mining
1. Think carefully about which projects you take on.
2. Use as much data as you can from as many places as possible.
3. Don’t just use internal customer data.
4. Have a clear sampling strategy.
5. Always use a holdout sample.
6. Spend time on ‘throwaway’ modelling.
7. Refresh your model regularly.
8. Make sure your insights are meaningful to other people.
9. Use your model in the real world.

Understanding Linear Regression
Although Linear Regression is arguably one of the most popular analytical techniques, I believe it isn’t understood well. Several fundamental assumptions are violated during application. The objective of this note is to provide an overview of the assumptions and possible fixes.
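One quick way to check several of those assumptions is base R's diagnostic plots for a fitted model; this is a generic sketch, not the note's own example.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

par(mfrow = c(2, 2))
plot(fit)   # residuals vs fitted (linearity), Q-Q (normality),
            # scale-location (constant variance), residuals vs leverage (influence)
```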

How to analyze smartphone sensor data with R and the BreakoutDetection package
Yesterday, Jörg wrote a blog post on Data Storytelling with Smartphone sensor data. Here’s a practical approach to analyzing smartphone sensor data with R. In this example I will be using the accelerometer smartphone data that Datarella provided in its Data Fiction competition.
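As a rough flavour of the tooling, here is a minimal sketch assuming the breakout() interface described in the BreakoutDetection README; the accelerometer-like series is simulated, not the Datarella data used in the post.

```r
# devtools::install_github("twitter/BreakoutDetection")
library(BreakoutDetection)

# Simulated signal with a single mean shift half way through
set.seed(1)
acc <- c(rnorm(200, mean = 0), rnorm(200, mean = 3))

res <- breakout(acc, min.size = 30, method = "multi", plot = TRUE)
res$loc   # estimated breakout location(s), per the README's return value
```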

Interactive visualizations with R – a minireview
In this post I have reviewed some of the most common interactive visualization packages in R with simple example plots along with some comments and experiences. Here are the packages included:
- ggplot2 – one of the best static visualization packages in R
- ggvis – interactive plots from the makers of ggplot2 (see the sketch after this list)
- rCharts – R interface to multiple JavaScript charting libraries
- plotly – convert ggplot2 figures to interactive plots easily
- googleVis – use Google Chart Tools from R
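The ggvis sketch referenced in the list above: a scatterplot with an interactively adjustable smoother, using a built-in data set rather than the review's examples.

```r
library(ggvis)

mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points() %>%
  layer_smooths(span = input_slider(0.3, 1, value = 0.75, label = "span"))
```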

Lockheed Martin Introduces StreamFlow Open Source Platform for Analytics
Software developers at Lockheed Martin have designed a platform to make big data analysis easier for developers and non-developers and are open sourcing the project on GitHub.
The StreamFlow software project is designed to make working with Apache Storm, a free and open source distributed real-time computation system, easier and more productive. A Storm application ingests significant amounts of data through the use of topologies, directed graphs of spouts and bolts that define how data flows between processing steps. These topologies organize the data streams into understandable pipelines.

Multivariate analysis of death rate on the map of Europe
Eurostat has information on death rates by cause and NUTS 2 region. I am trying to get this displayed visually on the map. To get there I map all causes to three dimensions via a principal components analysis. These three dimensions are subsequently translated into RGB colors and placed on the map of Europe.
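The PCA-to-RGB idea can be sketched in a few lines of base R; the example below uses a built-in data set instead of the Eurostat mortality data.

```r
pc     <- prcomp(USArrests, scale. = TRUE)
scores <- pc$x[, 1:3]                  # first three principal components

# Rescale each component to [0, 1] and combine into one colour per observation
rescaled <- apply(scores, 2, function(v) (v - min(v)) / (max(v) - min(v)))
cols     <- rgb(rescaled[, 1], rescaled[, 2], rescaled[, 3])

head(cols)   # these colours would then fill the corresponding map regions
```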

Debunking Big Data Myths. Again
It seems like almost every business-oriented article features something to do with big data. Most of these opinions claim it’s the key to revolutionizing the way business is done. Yet, despite our deeper understanding, and improved analytical tools, there are still a lot of misconceptions about big data.
It’s important to note that myths change with understanding. As we learn more, certain misunderstandings will fade away, but also give rise to new questions and concerns. The following are some of the more current myths surrounding big data.

Some Applications of Item Response Theory in R
The typical introduction to item response theory (IRT) positions the technique as a form of curve fitting. We believe that a latent continuous variable is responsible for the observed dichotomous or polytomous responses to a set of items (e.g., multiple choice questions on an exam or rating scales from a survey). Literally, once I know your latent score, I can predict your observed responses to all the items. Our task is to estimate that function with one, two or three parameters after determining that the latent trait is unidimensional. In the process of measuring individuals, we gather information about the items. Those one, two or three parameters are assessments of each item’s difficulty, discriminability and sensitivity to noise or guessing.
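For readers new to the notation, the three item parameters define the item characteristic curve; the sketch below plots a three-parameter logistic (3PL) curve directly, without any IRT package.

```r
# 3PL item characteristic curve: a = discrimination, b = difficulty, c = guessing
icc_3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

theta <- seq(-4, 4, by = 0.1)
plot(theta, icc_3pl(theta, a = 1.5, b = 0, c = 0.2), type = "l",
     xlab = "latent trait", ylab = "P(correct response)")
```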

Fundamental methods of Data Science: Classification, Regression And Similarity Matching
In this post I will be discussing the 3 fundamental methods in data science. These methods are basis for extracting useful knowledge from data, and also serve as a foundation for many well known algorithms in data science. I won’t be getting into the mathematical details of these methods; rather I am going to focus on how these methods are used to solve data centric business problems.

Extended Kalman filter example in R
Last week’s post about the Kalman filter focused on the derivation of the algorithm. Today I will continue with the extended Kalman filter (EKF), which can also deal with nonlinearities. According to Wikipedia the EKF has been considered the de facto standard in the theory of nonlinear state estimation, navigation systems and GPS.
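The distinguishing step of the EKF is linearising the state and observation functions around the current estimate; here is a minimal scalar sketch with made-up nonlinear functions, not the post's example.

```r
f <- function(x) x + 0.1 * sin(x)     # hypothetical state transition
h <- function(x) x^2                  # hypothetical observation function
Q <- 0.01; R <- 0.1                   # process / observation noise variances

ekf_step <- function(x_est, P, y) {
  # Predict, linearising f around the current estimate
  F_jac  <- 1 + 0.1 * cos(x_est)      # df/dx
  x_pred <- f(x_est)
  P_pred <- F_jac * P * F_jac + Q
  # Update, linearising h around the prediction
  H_jac  <- 2 * x_pred                # dh/dx
  K      <- P_pred * H_jac / (H_jac * P_pred * H_jac + R)
  list(x = x_pred + K * (y - h(x_pred)),
       P = (1 - K * H_jac) * P_pred)
}

ekf_step(x_est = 1, P = 0.5, y = 1.2)
```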

Fast Non-Standard Data Structures for Python
Python provides great built-in types like dict, list, tuple and set; there are also array, collections, heapq modules in the standard library; this article is an overview of external lesser known packages with fast C/C++ based data structures usable from Python.

R: Vectorising All the Things
…so I thought I’d try and vectorise some of the other functions I’ve written recently and show the two versions.
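The general idea is simple: replace an explicit element-by-element loop with a single vectorised call. A trivial sketch (not one of the post's own functions):

```r
x <- rnorm(1e6)

# Loop version
squared_loop <- numeric(length(x))
for (i in seq_along(x)) squared_loop[i] <- x[i]^2

# Vectorised version: one call over the whole vector, and much faster
squared_vec <- x^2

all.equal(squared_loop, squared_vec)
```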

Simulated Annealing Feature Selection
As previously mentioned, caret has two new feature selection routines based on genetic algorithms (GA) and simulated annealing (SA). The help pages for the two new functions give a detailed account of the options, syntax etc. The package already has functions to conduct feature selection using simple filters as well as recursive feature elimination (RFE). RFE can be very effective. It initially ranks the features, then removes them in sequence, starting with the least important variables. It is a greedy method since it never backtracks to reevaluate a subset. Basically, it points itself in a single direction and proceeds. The main question is how far to go in that direction.
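A minimal sketch of how a call might look, assuming caret's safs()/safsControl() interface with the built-in random-forest helper functions (rfSA); the data are simulated with twoClassSim() and the settings are deliberately tiny.

```r
library(caret)

set.seed(1)
dat <- twoClassSim(100)                # simulated two-class data

ctrl   <- safsControl(functions = rfSA, method = "cv", number = 3)
sa_fit <- safs(x = dat[, names(dat) != "Class"], y = dat$Class,
               iters = 10, safsControl = ctrl)
sa_fit
```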

Secure your Shiny apps (against SQL injection)
Shiny takes inputs from UI elements and sends them to the server, where the application can access them as R variables. While Shiny has security measures in place, as in any typical web application, it remains the developer’s responsibility to sanitize the inputs before using them. For example, Shiny has no way to protect you if you are using an input in a SQL query such as select … from … where field = ‘input’. Someone manipulating the websocket communication can craft a specially-formatted input that forces the database to execute a query it is not supposed to run, termed an SQL injection. This might give an attacker access to private data or the ability to do other nefarious things, and it is a common security issue.
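One common mitigation (not necessarily the one the post describes) is to let the database driver escape the value rather than pasting the input into the query string, for example with DBI's sqlInterpolate():

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "users", data.frame(name = c("alice", "bob"), score = c(1, 2)))

user_input <- "alice' OR '1'='1"   # a hypothetical malicious input from the Shiny UI

# The ?name placeholder is escaped safely instead of being concatenated into the SQL
sql <- sqlInterpolate(con, "SELECT * FROM users WHERE name = ?name", name = user_input)
dbGetQuery(con, sql)               # returns no rows; the injected quote is neutralised

dbDisconnect(con)
```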

(R Programming) Plotting with Color: qplot
In this lesson, let’s see how to use qplot to map symbol colour to a categorical variable.

(R Programming) Plotting with Color Part 2: qplot
In the last lesson, we saw how to use qplot to map symbol colour to a categorical variable. Now we see how to control symbol colours and create legend titles.
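For reference, mapping colour to a categorical variable and renaming the legend looks like this with a built-in data set:

```r
library(ggplot2)

qplot(wt, mpg, data = mtcars, colour = factor(cyl)) +
  labs(colour = "Cylinders")
```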

9 ways Data Science can improve your business
1. Reducing customer attrition (a.k.a. ‘churn’)
2. Acquiring new customers: Lead scoring
3. Cross-selling products
4. Optimizing products and pricing
5. Increasing engagement
6. Predicting demand
7. Automating tasks
8. Making enterprise apps predictive
9. Predictive maintenance

New Toy: SAS University Edition
So I started using SAS University Edition, which is a FREE version of SAS software. Again, it’s FREE, and that’s the main reason why I want to relearn the language. The software was announced on March 24, 2014 and the download became available in May of that year. And for that, I salute Dr. Jim Goodnight. At least we can learn SAS without paying the expensive price tag, especially for a single user like me. The software runs inside a virtual machine and requires a 64-bit processor. To install, just follow the instructions in this video. Although the installation in the video is done in Windows, it also works on Mac. Below is a screenshot of my SAS Studio running in Safari.

Temporal Databases: Why you should care and how to get started
A temporal database is a database with built-in support for handling data involving time.

The Problem with Recommenders
Recommender systems select products for customers based on past experience. Sometimes the product choices are called “items” or “content”, and other nouns are substituted for the customer. The most familiar case is that of a retail business: many customers with many products. Amazon, Netflix and Pandora famously bet their businesses on this technology. When you see suggestions for other products, a recommender is at work. For newly minted data scientists, the subject is covered in basic training. For marketers, these systems are a holy grail, a pearl without price, the silver bullet of digital engagement and personalization. Customers are expressing themselves with choices. On some websites, the customers contribute reviews, provide ratings and share advice. The recommender system is a mediator in a very personal action.

Deep Learning can be easily fooled
It is almost impossible for human eyes to label the images below as anything but abstract art. However, researchers found that deep neural networks will label them as familiar objects with 99.99% confidence. The generality of DNNs is called into question again.

Statistics, Predictive Modeling & Data Mining Info Kit
Describe. Compare. Predict. Uncover hidden relationships in your data for more informed decision making. You’ve already mastered data collection and analysis; it’s now time to take the next step. Find out how to challenge assumptions, spot patterns and reveal potential solutions to problems that otherwise would not be visible. The statistical discovery paradigm of JMP offers an intrinsic synergy between visualization and modeling. No matter the shape and size of your data, so long as it fits in memory, JMP will allow you to get the most from it, whatever your current level of statistical expertise. Register for our complimentary info kit to learn more.

R: Feature Engineering for a Linear Model
I previously wrote about a linear model I created to predict how many people would RSVP ‘yes’ to a meetup event and, having not found much correlation between any of my independent variables and RSVPs, I was a bit stuck. As luck would have it, I bumped into Antonios at a meetup a month ago and he offered to take a look at what I’d tried so far and give me some tips on how to progress.

A Brief Overview of Deep Learning
Deep Learning is really popular these days. Big and small companies are getting into it and making money off it. It’s hot. There is some substance to the hype, too: large deep neural networks achieve the best results on speech recognition, visual object recognition, and several language related tasks, such as machine translation and language modeling.