An Introduction to Change Points (packages: ecp and BreakoutDetection)
A forewarning, this post is me going out on a limb, to say the least. In fact, it’s a post/project requested from me by Brian Peterson, and it follows a new paper that he’s written on how to thoroughly replicate research papers. While I’ve replicated results from papers before (with FAA and EAA, for instance), this is a first for me in terms of what I’ll be doing here. In essence, it is a thorough investigation into the paper “Leveraging Cloud Data to Mitigate User Experience from ‘Breaking Bad’”, and follows the process from the aforementioned paper. So, here we go.

Interpreting Interactions when Main Effects are Not Significant
If you have a significant interaction effect and non-significant main effects, would you interpret the interaction effect? It’s a question I get pretty often, and it has a more straightforward answer than most. There is really only one situation possible in which an interaction is significant, but the main effects are not: a cross-over interaction.
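A small numeric sketch may help show why a cross-over pattern produces this result. The cell means below are hypothetical (they are not from the linked post) and describe a 2x2 design in which the effect of factor B reverses direction across the levels of factor A, so the marginal means cancel while the interaction contrast does not.

```python
# Hypothetical cell means for a 2x2 design (factor A by factor B).
# The values are invented for illustration only.
cell_means = {
    ("a1", "b1"): 10.0, ("a1", "b2"): 20.0,
    ("a2", "b1"): 20.0, ("a2", "b2"): 10.0,
}

def marginal_mean(factor_index, level):
    """Average the cell means over the other factor."""
    values = [m for key, m in cell_means.items() if key[factor_index] == level]
    return sum(values) / len(values)

# Main effects: difference between the marginal means of each factor.
main_effect_a = marginal_mean(0, "a2") - marginal_mean(0, "a1")
main_effect_b = marginal_mean(1, "b2") - marginal_mean(1, "b1")

# Interaction contrast: the effect of B at a1 minus the effect of B at a2.
effect_b_at_a1 = cell_means[("a1", "b2")] - cell_means[("a1", "b1")]
effect_b_at_a2 = cell_means[("a2", "b2")] - cell_means[("a2", "b1")]
interaction = effect_b_at_a1 - effect_b_at_a2

print(main_effect_a, main_effect_b, interaction)
```

Both main effects come out as zero because the lines cross, while the interaction contrast is large. This only illustrates the pattern in the means; it is not a significance test.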

Consider this: Big Data and the Analytics Data Store
Since the publication of the article Aligning Big Data, which basically laid out a draft view of DW 3.0 Information Supply Framework and placed Big Data within a larger framework, I have been asked on a number of occasions recently to go into a little more detail with regards to the Analytics Data Store (ADS) component. This is an initial response to those requests.

The 5 Essential Skills Any Data Scientist Needs
• Business skills
• Analytical skills
• Computer science
• Statistics/mathematics
• Creativity

R: A beautiful story about NYC weather
Inspired by Tufte’s classic visualization of New York City weather in 2013, Brad Boehmke used the R language to create a similar story about the weather in Dayton, Ohio in 2014.

How to reliably access network resources in R
It’s frustrating when an application unexpectedly dies due to a network timeout or unavailability of a network resource. Veterans of distributed systems know not to rely on network-based resources, such as web services or databases, since they can be unpredictable. So what is a data scientist supposed to do when she must use these resources in her analysis/application?
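One common answer is to wrap the network call in a retry loop with exponential backoff. The linked post works in R; the sketch below is a Python analogy, and the names with_retries and flaky_fetch are hypothetical helpers invented for this example, not part of any library.

```python
import time

def with_retries(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(); on a connection error or timeout, wait and try again.

    The delay doubles after each failed attempt (0.5s, 1s, 2s, ...).
    The last failure is re-raised so the caller still sees the error.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

# Simulated resource that fails twice before answering (illustration only).
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated outage")
    return "payload"

# Record the delays instead of actually sleeping, to keep the demo fast.
delays = []
result = with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Passing the sleep function as a parameter is a design choice that keeps the wrapper easy to exercise without waiting; in real use the default time.sleep applies.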

Easy error propagation in R
In a previous post I demonstrated how to use R’s simple built-in symbolic engine to generate Jacobian and (pseudo)-Hessian matrices that make non-linear optimization perform much more efficiently. Another related application is Gaussian error propagation.
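For readers unfamiliar with the technique: for uncorrelated inputs, Gaussian error propagation estimates the variance of f as the sum over inputs of (partial derivative of f with respect to x_i)^2 times sigma_i^2. The linked post uses R's symbolic derivatives; the Python sketch below swaps in central finite differences instead, and the function name propagate_error and the example numbers are assumptions made for illustration.

```python
import math

def propagate_error(f, values, sigmas, h=1e-6):
    """First-order Gaussian error propagation for uncorrelated inputs.

    Partial derivatives are approximated numerically with central
    differences (a stand-in for the symbolic derivatives used in R).
    """
    variance = 0.0
    for i, sigma in enumerate(sigmas):
        up = list(values)
        down = list(values)
        up[i] += h
        down[i] -= h
        partial = (f(*up) - f(*down)) / (2 * h)
        variance += (partial * sigma) ** 2
    return math.sqrt(variance)

# Hypothetical measurement: area = x * y with x = 2.0 +/- 0.1, y = 3.0 +/- 0.2.
sigma_area = propagate_error(lambda x, y: x * y, [2.0, 3.0], [0.1, 0.2])
print(sigma_area)
```

For this product the analytic answer is sqrt((3 * 0.1)^2 + (2 * 0.2)^2) = 0.5, which the numeric version reproduces to within rounding.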

Building and deploying large-scale machine learning pipelines
There are many algorithms with implementations that scale to large data sets (this list includes matrix factorization, SVM, logistic regression, LASSO, and many others). In fact, machine learning experts are fond of pointing out: if you can pose your problem as a simple optimization problem then you’re almost done. Of course, in practice, most machine learning projects can’t be reduced to simple optimization problems. Data scientists have to manage and maintain complex data projects, and the analytic problems they need to tackle usually involve specialized machine learning pipelines. Decisions at one stage affect things that happen downstream, so interactions between parts of a pipeline are an area of active research.

A first look at Spark
Apache Spark, the open-source cluster computing framework originally developed in the AMPLab at UC Berkeley and now championed by Databricks, is rapidly moving from the bleeding edge of data science to the mainstream. Interest in Spark, demand for training and overall hype are on a trajectory to match the frenzy surrounding Hadoop in recent years. Next month’s Strata + Hadoop World conference, for example, will offer three serious Spark training sessions: Apache Spark Advanced Training, SparkCamp and Spark developer certification, with additional Spark-related talks on the schedule. It is only a matter of time before Spark becomes a big deal in the R world as well.

Four Famous Laws & How You Can Visualize Them
Famous laws, theorems, and observations are often named after the person who proposed the idea. These “eponymous laws” can be graphed. We do so in this post, using our free, web-based product.

Practical introduction to Shiny – workshop write-up
I recently delivered a workshop on a practical introduction to shiny, an R package that enables development, testing and deployment of interactive web applications. Delivered at the University of Sydney’s Institute for Transport and Logistics Studies (ITLS), it was designed for people who are a) fairly new to R (which can seem intimidating) and b) completely new to shiny. This article provides resources for people wanting to apply shiny to real-world applications and some context which explains the motivations behind running the workshop. The pdf tutorial, example code to create and modify your own apps and a place to contribute to this free teaching resource is available at the following GitHub repository:

Dark Data: The Mysterious Force that Holds the Corporate Universe Together
The total amount of data in every organization is far, far greater than anyone including, most crucially, their Information Technology group knows about. Moreover this “missing” data that can’t be seen and currently can’t be made use of is also the very stuff that holds the organization together. This is what we call “Dark Data.”

Data Science 101: Random Forests
The Random forests machine learning algorithm is a popular ensemble method used by many data scientists to achieve good predictive performance in the classification regime. Fully understanding the nuances of this statistical learning technique is paramount to getting the most out of this algorithm – unfortunately, this means math. The presentation below is from machine learning course CPSC 540 at The University of British Columbia, and takes a detailed view of Random forests by Dr. Nando de Freitas, adjunct professor at UBC Computer Science and a full-time professor at Oxford. If you want to excel at data science you need to master Random forests and this lecture is a great resource for this purpose.

RSS feeds for statistics and related journals
I’ve now resurrected the collection of research journals that I follow, and set it up as a shared collection in feedly. So anyone can easily subscribe to all of the same journals, or select a subset of them, to follow on feedly.

Microsoft Acquires Revolution Analytics To Bolster Its Analytics Services
Microsoft today announced that it has acquired Revolution Analytics, an open-source analytics company with a strong focus on the highly popular R programming language for statistical computing. Microsoft says that it made this acquisition “to help more companies use the power of R and data science to unlock big data insights with advanced analytics.” The two companies did not disclose the financial details of the transaction.

Revolution Analytics joins Microsoft
On behalf of the entire Revolution Analytics team I am excited to announce that Revolution Analytics is joining forces with Microsoft to bring R to even more enterprises. Microsoft announced today that it will acquire Revolution Analytics.

Microsoft Buying Revolution Analytics For Deeper Data Analysis
Microsoft’s acquisition of Revolution Analytics, an R-language-focused advanced analytics firm, will bring customers tools for prediction and big-data analytics.

Microsoft to acquire Revolution Analytics: Now this gets interesting…
I’m excited about Microsoft acquiring Revolution R. Microsoft has created business friendly tools for many decades and this is a great opportunity for them to bring the power of R to common users. Waiting to hear more from Microsoft…

Assertive R programming in dplyr/magrittr pipelines
assertr works by adding two new verbs to the pipeline, verify and assert, and a couple of predicate functions. Early on in the pipeline, you make certain assertions about how the data should look. If the data conform to these assertions, then we go on with the pipeline. If not, the verbs produce errors that terminate any further pipeline computations. The benefit of the verbs, over the truth assurance functions already in R (like stopifnot) is that they needn’t interrupt the flow of the pipeline.
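The idea of checks that pass the data through unchanged can be sketched outside R as well. The Python below is only an analogy and is not the assertr API: the function names verify and assert_column, and the small cars table, are hypothetical. Each check either raises an error or returns its input, so checks can be chained ahead of the real computation.

```python
def verify(rows, condition, message="verification failed"):
    """Check a condition on the whole table; return the table unchanged."""
    if not condition(rows):
        raise ValueError(message)
    return rows

def assert_column(rows, predicate, column):
    """Check a predicate on every value of one column; return the table unchanged."""
    bad = [row[column] for row in rows if not predicate(row[column])]
    if bad:
        raise ValueError(f"column {column!r} has values failing the check: {bad}")
    return rows

# Hypothetical data for illustration.
cars = [
    {"mpg": 21.0, "cyl": 6},
    {"mpg": 22.8, "cyl": 4},
    {"mpg": 18.7, "cyl": 8},
]

# Checks come first; the summary only runs if both pass.
checked = assert_column(verify(cars, lambda rs: len(rs) > 2), lambda v: v > 0, "mpg")
mean_mpg = sum(row["mpg"] for row in checked) / len(checked)
print(mean_mpg)
```

If a check fails, the ValueError stops everything downstream, which mirrors the behaviour the post describes for pipelines.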

Predictive Analytics for the Masses – Data Science-as-a-Service
Historically, the cost and effort associated with data science have placed it beyond the reach of most firms. Curating a robust data repository, deploying sophisticated analytics packages, and building a team of data scientists are daunting challenges that few companies are well equipped to address. Fortunately, DSaaS (data science-as-a-service) mitigates many of these issues. A new generation of applications, born in the cloud and driven by data science, will allow virtually any company to subscribe to predictive services that span a broad range of disciplines from sales and marketing to supply chain optimization. This movement will expose firms to new levels of insight that will fuel the next era of business productivity. Outlined below are three aspects of DSaaS that will make data science far more accessible than it has been historically.

Text Analysis 101: Document Classification
Document classification is an example of Machine Learning (ML) in the form of Natural Language Processing (NLP). By classifying text, we are aiming to assign one or more classes or categories to a document, making it easier to manage and sort.

Building credibility for your analytics team—and why it matters
Here are a few ways to build credibility for your analytics team—and ensure you keep it.
• Start Small
Takeaway: Early mistakes can be deadly. Start simple, win often, and build towards more challenging problems.
• Know your audience
Takeaway: Effective analysis is as much about communication as it is about data. Communicate in a way that connects with your audience.
• Don’t be a House
Takeaway: Your data is a conversation starter, not a diagnosis.
• Be Transparent
Takeaway: Show, don’t tell, how you got to your answers in ways everyone can understand.
• Everyone does it differently

Text Analysis 101; A Basic Understanding for Business Users: Clustering and Unsupervised Methods
For this blog, we will focus on Unsupervised Document Classification. Unsupervised ML techniques differ from supervised in that they do not require a training dataset and in the case of documents, the categories are not known in advance. For example, let’s say we have a large number of emails that we want to analyze as part of an eDiscovery Process. We may have no idea what the emails are about or what topics they deal with and we want to automatically find out what are the most common topics present in the dataset. Unsupervised techniques such as Clustering can be used to automatically discover groups of similar documents within a collection of documents.

Convolutional neural networks
Neural networks have been around for a number of decades now and have seen their ups and downs. Recently they’ve proved to be extremely powerful for image recognition problems. Or, rather, a particular type of neural network called a convolutional neural network has proved very effective. In this post, I want to build off of the series of posts I wrote about neural networks a few months ago, plus some ideas from my post on digital images, to explain the difference between a convolutional neural network and a classical (is that the right term?) neural network.

The Unreasonable Effectiveness Of Deep Learning

Financial Firms Embrace Predictive Analytics
For financial markets firms, efficiency is becoming as important a differentiator as speed and scale. As a result, firms are delving deeper into predictive analytics to realize faster time to value and improve operational performance and decision outcomes.

How to Create and Publish R package on CRAN
Step-by-Step Guide

Building [Security] Dashboards w/R & Shiny + shinydashboard
Jay & I cover dashboards in Chapter 10 of Data-Driven Security (the book) but have barely mentioned them on the blog. That’s about to change with a new series on building dashboards using the all-new shinydashboard framework developed by RStudio. While we won’t duplicate the full content from the book, we will show different types of dashboards along with the R code used to generate them.

A Statistician’s ‘Big Tent’ View on Big Data and Data Science

Introducing the Knowledge Graph: things, not strings
Search is a lot about discovery—the basic human need to learn and broaden your horizons. But searching still requires a lot of hard work by you, the user. So today I’m really excited to launch the Knowledge Graph, which will help you discover new information quickly and easily. Take a query like [taj mahal]. For more than four decades, search has essentially been about matching keywords to queries. To a search engine the words [taj mahal] have been just that—two words. But we all know that [taj mahal] has a much richer meaning. You might think of one of the world’s most beautiful monuments, or a Grammy Award-winning musician, or possibly even a casino in Atlantic City, NJ. Or, depending on when you last ate, the nearest Indian restaurant. It’s why we’ve been working on an intelligent model—in geek-speak, a “graph”—that understands real-world entities and their relationships to one another: things, not strings. The Knowledge Graph enables you to search for things, people or places that Google knows about—landmarks, celebrities, cities, sports teams, buildings, geographical features, movies, celestial objects, works of art and more—and instantly get information that’s relevant to your query. This is a critical first step towards building the next generation of search, which taps into the collective intelligence of the web and understands the world a bit more like people do.

List of Good Free Programming and Data Resources

Online Learning Perceptron
Let’s take a look at the perceptron: the simplest artificial neuron. This article goes from a concept devised in 1943 to a Kaggle competition in 2015. It shows that a single artificial neuron can get 0.95 AUC on an NLP sentiment analysis task (predicting if a movie review is positive or negative).
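As a rough illustration of how a perceptron learns online (one example at a time, updating only on mistakes), here is a minimal Python sketch. The toy word-count data and the function names are hypothetical and far simpler than the Kaggle task mentioned above; this is the textbook update rule, not the article's code.

```python
def train_perceptron(examples, epochs=10, learning_rate=1.0):
    """Online perceptron: update weights only when a prediction is wrong."""
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:  # label is 0 or 1
            activation = bias + sum(w * x for w, x in zip(weights, features))
            prediction = 1 if activation > 0 else 0
            error = label - prediction  # -1, 0 or +1
            if error:
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
    return weights, bias

def predict(weights, bias, features):
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else 0

# Toy reviews as counts of the words ["good", "bad"]; 1 = positive review.
reviews = [([1, 0], 1), ([0, 1], 0), ([2, 0], 1), ([0, 2], 0)]
weights, bias = train_perceptron(reviews)
predictions = [predict(weights, bias, features) for features, _ in reviews]
print(weights, bias, predictions)
```

On this separable toy set the weights settle after two updates; real text data would need a feature-hashing or vocabulary step first.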

Docker and Enabling Analytic Workflows
It’s been a few months now since I presented on the ways in which Mango are using Docker at EARL 2014 and a lot’s happened since then, so I thought I’d take this opportunity to post a brief overview of the current state of Docker in the R/Data Science communities.

Why you should learn R first for data science
Over and over, when talking with people who are starting to learn data science, there’s a frustration that comes up: “I don’t know which programming language to start with.” And it’s not just programming languages, it’s also software systems like Tableau, SPSS, etc. There is an ever-widening range of tools and programming languages and it’s difficult to know which one to select.

SAP Launches Big Data Rapid-Deployment Solution powered by SAP HANA
SAP SE (NYSE: SAP) today announced the launch of the SAP HANA Big Data Intelligence rapid-deployment solution to simplify and accelerate Big Data initiatives for business. SAP Rapid Deployment solutions simplify the implementation of SAP solutions in the cloud, on-premise or in hybrid landscapes with predictable business outcomes at lower cost. They accelerate the deployment of new solutions with key technology and business features to help companies solve business problems and go live fast. Emerging technology themes like the Internet of Things (IoT) and the Smart Factory, along with the continued expansion of enterprise technology systems and the growing magnitude of customer interaction data, are opening new opportunities for companies to innovate closer to markets and customers.

List of genetic algorithm applications
This is a list of Genetic Algorithm (GA) applications

Don’t use Bandit Algorithms – they probably won’t work for you
Several years back, I wrote an article advocating in favor of using bandit algorithms. In retrospect, the article I wrote was incorrect, and I should have phrased it differently. I made no mathematical mistakes in the article. Every fact I said is true. But the implications of this article and the way it has been interpreted by others is deeply wrong, and I’m going to take the opportunity now to correct what I said. Before I get into the details, I want to make a particular point very clear. If you don’t understand all the math in this post, don’t use bandit algorithms at all. Just use ordinary A/B tests to make product decisions. Go use optimizely or visual website optimizer to set up and deliver your A/B test, ignore the statistics they give you, and use Evan Miller’s Awesome A/B Tools to determine when to stop the test. Go read all his articles on the topic, particularly How Not to Run an A/B Test. Note also that in this post, I’m discussing fairly standard published bandit algorithms such as Thompson Sampling and UCB1. There are techniques to resolve many of the issues I discuss within the bandit framework – unfortunately, many of these techniques are not published, and most of the ones I know of I’ve had to figure out myself.

Data Science 101: Machine Learning – The Basics
The next installment of insideBIGDATA’s Data Science 101 series comes from our friends over at LinkedIn. For our readers wanting to get their feet wet with machine learning, this “LinkedIn Tech Talks” episode by Ron Bekkerman introduces this burgeoning area. Enjoy Ron’s unique perspective on the subject!

Check your return types when modeling in R
Just a warning: double check your return types in R, especially when using different modeling packages.

Clustering idea for very large datasets
The idea is to start by sampling 1% (or less) of the 100,000,000 entries, and perform clustering on these pairs of keywords, to create a “seed” or “baseline” cluster structure.

Goodness of fit test in R
As a data scientist, occasionally, you receive a dataset and you would like to know what the generative distribution for that dataset is. In this post, I aim to show how we can answer that question in R. To do that let’s make an arbitrary dataset that we sample from a Gamma distribution. To make the problem a little more interesting, let’s add Gaussian noise to simulate measurement noise …

Computing Recommendations at Extreme Scale with Apache Flink and Google Compute Engine
We implemented the popular Alternating Least Squares (ALS) algorithm for matrix factorization on top of Apache Flink’s Scala API. ALS is an iterative algorithm that alternatingly assumes one of the factor matrices as fixed and computes the other matrix, minimizing the Root Mean Square Error (RMSE) of the solution over the iterations.
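To make the alternating step concrete, here is a deliberately tiny sketch in plain Python rather than Flink's Scala API. It assumes a rank-1 factorization with no regularization, and the function name als_rank1 and the toy ratings are hypothetical. With one factor vector held fixed, each entry of the other has a closed-form least-squares solution over the observed ratings.

```python
import math

def als_rank1(ratings, n_users, n_items, iterations=20):
    """Rank-1 ALS on observed entries only.

    ratings maps (user, item) -> observed value. Returns the two factor
    vectors and the RMSE on observed entries after each iteration.
    """
    user = [1.0] * n_users
    item = [1.0] * n_items
    rmse_history = []
    for _ in range(iterations):
        # Fix the item factors and solve for each user factor.
        for i in range(n_users):
            num = sum(r * item[j] for (u, j), r in ratings.items() if u == i)
            den = sum(item[j] ** 2 for (u, j) in ratings if u == i)
            if den > 0:
                user[i] = num / den
        # Fix the user factors and solve for each item factor.
        for j in range(n_items):
            num = sum(r * user[i] for (i, k), r in ratings.items() if k == j)
            den = sum(user[i] ** 2 for (i, k) in ratings if k == j)
            if den > 0:
                item[j] = num / den
        squared = [(r - user[i] * item[j]) ** 2 for (i, j), r in ratings.items()]
        rmse_history.append(math.sqrt(sum(squared) / len(squared)))
    return user, item, rmse_history

# Toy data: an exactly rank-1 matrix with the (0, 2) entry held out.
true_user = [1.0, 2.0, 3.0]
true_item = [2.0, 1.0, 4.0]
ratings = {(i, j): true_user[i] * true_item[j]
           for i in range(3) for j in range(3) if (i, j) != (0, 2)}

user, item, rmse_history = als_rank1(ratings, 3, 3)
predicted_missing = user[0] * item[2]
print(rmse_history[0], rmse_history[-1], predicted_missing)
```

The held-out entry is what a recommender would predict; a production version would use higher rank, regularization and distributed updates, none of which are shown here.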

Introducing OData
Our world is awash in data. Vast amounts exist today, and more is created every year. Yet data has value only if it can be used, and it can be used only if it can be accessed by applications and the people who use them. Allowing this kind of broad access to data is the goal of the Open Data Protocol, commonly called just OData. This paper provides an introduction to OData, describing what it is and how it can be applied. The goal is to illustrate why OData is important and how your organization might use it.

Parallel Programming with GPUs and R
You’ve heard that graphics processing units (GPUs) can bring big increases in computational speed. While GPUs cannot speed up work in every application, the fact is that in many cases they can indeed provide very rapid computation. In this tutorial, we’ll see how this is done, both in passive ways (you write only R), and in more direct ways, where you write C/C++ code and interface it to R.

Sequence of shopping carts in-depth analysis with R – Sequence of events
This is the third part of the sequence of shopping carts in-depth analysis. We processed initial data in the required format, did the exploratory analysis and started the in-depth analysis in the first post. Finally, we used cluster analysis for creating customer segments in the second post. As I mentioned in the first post, the sequence can be presented as either a state or an event. We dealt with sequences of states until then, which helped us to find some patterns in customers’ behavior, including time lapses between purchases, and to create the dummy variable ‘nopurch’ for customers who left us with high probability. Here, we will focus on analyzing sequences of events that can be helpful as well. We will cover how to find patterns of events. For instance, we will find events that occur systematically together and in the same order, relationships with customers’ characteristics (typical differences in event sequences between men and women), and association rules between event subsequences.

BI & Analytic Trends for Business Value
Key business value trends include:
1) Operational BI becomes more sophisticated
2) Storyboarding becomes best practice for BI design
3) Embedded BI within business processes explodes
4) In-memory, columnar becomes BI’s preferred destination
5) Spreadsheets become accepted part of BI Portfolio
6) Use of external data continues to expand
7) BI applications created faster for business people
8) BI expands into small and midmarket enterprises
9) Data blending and wrangling come out of shadows

7500 companies hiring data scientists
This is an update to our December 2013 article: 6000 companies hiring data scientists. Microsoft and IBM still dominate, but we’ve seen some shift over the last 12 months:
• Accenture, Google and Cognizant are gaining traction, but overall, rankings for top companies have barely changed
• The top 20 companies now amount to 7.5% of data scientists, versus 10% in December 2013. This is a significant change, proving that data science adoption is exploding faster in small companies.
• We now have 7,500 companies in our listing, versus 6,000 in December: a 25% growth over the last 12 months.

In-Store Analytics for Success
In this special guest feature, Jim Shea of First Insight shares his perspectives for how data analytics is transforming the retail industry past several key pain points. Jim Shea is Chief Marketing Officer at First Insight, a solution provider empowering companies to drive new product success by introducing the right products at the right price. Jim leads marketing communications, demand generation, product management and business development.

Become a Big Data Expert with Free, Comprehensive Hadoop Online Training Courses
Hadoop On-Demand Training offers full-length courses on a range of Hadoop technologies for developers, data analysts and administrators. Designed in a format that meets your convenience, availability and flexibility needs, these courses will lead you on the path to becoming a certified Hadoop professional.

Word2Vec is based on an approach from Lawrence Berkeley National Lab
Google silently did something revolutionary on Thursday. It open sourced a tool called word2vec, prepackaged deep-learning software designed to understand the relationships between words with no human guidance. Just input a textual data set and let underlying predictive models get to work learning.