Sessionizing Log Data Using dplyr [Follow-up]
Last week, I wrote a blog post showing how to sessionize log data using standard SQL. The main idea of that post is that if your analytics platform supports window functions (like Postgres and Hive do), you can make quick work of sessionizing logs. Here’s the winning query:
Using and interpreting different contrasts in linear models in R
When building a regression model with a categorical variable that has more than two levels (e.g. “Cold”, “Freezing”, “Warm”), R internally applies a transformation to be able to compute the regression coefficients. What R does is turn your categorical variable into a set of contrasts; the number of contrasts is the number of levels in the variable (3 in the example above) minus 1. Here I will present three ways to set the contrasts; depending on your research question and your variables, one might be more appropriate than the others.
Plan to Quit? Big Data Might Tell Your Boss Before You Do
Big Data lets Netflix make personalized movie suggestions, helps advertisers better target consumers and can now inform your boss that you are about to quit. That’s according to Workday, a company which says its new software, called Workday Insight Applications, gives bosses insight beyond their own powers of perception by gathering data and sending notifications to management that an employee may be on their way out. The program is currently undergoing several months of testing by VMware, also a Palo Alto–based software company. VMware is helping Workday to discover what, if any, changes could be made to the product.
How Google “Translates” Pictures into Words Using Vector Space Mathematics
Google engineers have trained a machine-learning algorithm to write picture captions using the same techniques it developed for language translation.
Python Packages For Data Mining
In the very next post I am going to get your hands wet solving an interesting data-mining problem using the Python programming language. So in this post I am going to explain some powerful Python weapons (packages).
Running R in Parallel (the easy way)
Like a lot of folks, I have a love/hate relationship with R. One topic that I’ve seen people struggle with is parallel computing, or more directly, “How do I process data in R when I run out of RAM?” But fear not! R actually has some easy-to-use parallelization packages!
Some basics for base graphics
K-means clustering is not a free lunch
I recently came across this question on Cross Validated, and I thought it offered a great opportunity to use R and ggplot2 to explore, in depth, the assumptions underlying the k-means algorithm.
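As background to that discussion: k-means minimizes within-cluster squared Euclidean distance, which is why it implicitly assumes roughly spherical, similarly sized clusters. A bare-bones sketch of Lloyd’s algorithm (illustrative code, not the post’s) makes that objective explicit:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and mean updates,
    minimizing within-cluster squared Euclidean distance."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared distance)
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Recompute each center as the mean of its cluster
        centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, k=2)
```

Data that violates the spherical/equal-size assumptions (elongated or differently sized clusters) is exactly where this objective leads k-means astray, which is the point the post explores.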
Learn Statistics and R online from Harvard
Harvard University is offering a free 5-week online course on Statistics and R for the Life Sciences on the edX platform. The course promises you will learn the basics of statistical inference and the basics of using R scripts to conduct reproducible research. You’ll just need a background in basic math and programming to follow along and complete homework in the R language.
A heatmap is a powerful way to visualize data: given a matrix of data, each value is represented by a color. The heatmap algorithm is computationally expensive: for each pixel of the grid you need to compute its color from a set of known values. As you can imagine, it is not feasible to implement it on the client side, because map rendering would be really slow.
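One common way to compute a pixel’s value from a set of known samples is inverse-distance weighting; an illustrative sketch (the post’s actual interpolation method may differ):

```python
def idw(known, px, py, power=2):
    """Inverse-distance weighting: a pixel's value is the weighted average
    of the known samples, weighted by 1 / distance**power."""
    num = den = 0.0
    for x, y, value in known:
        d2 = (x - px) ** 2 + (y - py) ** 2
        if d2 == 0:
            return value  # pixel coincides with a known sample
        w = 1.0 / d2 ** (power / 2)
        num += w * value
        den += w
    return num / den

samples = [(0, 0, 10.0), (10, 0, 0.0)]
mid = idw(samples, 5, 0)  # equidistant from both samples: plain average
```

Doing this for every pixel of a large grid is exactly the cost the paragraph describes, which is why it belongs on the server side.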
Summarize Opinions with a Graph
How does the saying go? Opinions are like bellybuttons, everybody’s got one? So let’s say you have an opinion that NoSQL is not for you. Maybe you read my blog and think this Graph Database stuff is great for recommendation engines and path finding and maybe some other stuff, but you’ve got really hard problems and it can’t help you. I am going to try to show you that a graph database can help you solve your really hard problems if you can frame your problem in terms of a graph. Did I say “you”? I meant anybody, especially Ph.D. students. One trick is to search for “graph based approach to” and your problem.
Natural Language Analytics made simple and visual with Neo4j
I was really impressed by the blog post on Summarizing Opinions with a Graph by Max. The blog post explains a really interesting approach by Kavita Ganesan, which uses a graph representation of the sentences of review content to extract the most significant statements about a product.
Lambda Architecture for Big Data
An increasing number of systems are being built to handle the Volume, Velocity and Variety of Big Data, and hopefully help gain new insights and make better business decisions. Here, we will look at ways to deal with Big Data’s Volume and Velocity simultaneously, within a single architecture solution.
Data Acceleration: Turning Technology Into Solutions
In September, I described a range of components that are crucial to building a high-speed data architecture, such as big data platforms, complex event processing, ingestion, in-memory databases, cache clusters and appliances. These components cannot function in isolation. In this article, I outline four fundamental combinations of these components to create solutions that enable data movement, processing and interactivity at high speed. Think in terms of technology stacks. These stacks share ‘common layers’ – the points at which data enters and leaves the data architecture – and are designed to deliver the same outcome: the most effective exploitation of big data for the enterprise.
Aligning Big Data
In order to bring some semblance of simplicity, coherence and integrity to the Big Data debate I am sharing an evolving model for pervasive information architecture and management. This is an overview of the realignment and placement of Big Data into a more generalized architectural framework, an architecture that integrates data warehousing (DW 2.0), business intelligence and statistical analysis. The model is currently referred to as the DW 3.0 Information Supply Framework, or DW 3.0 for short.
Data Science 101: Using Statistics to Predict AB Testing
The Wikipedia fundraising team was performing up to 100 AB tests per week. It wasn’t enough to find the gains needed. The team needed to use statistics to interpret their AB tests accurately, and also to estimate the smallest acceptable sample sizes so they could increase testing frequency. They were not comfortable trusting methods proposed by other practitioners or academics who could not prove that their methods would work accurately for their data.
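For a sense of the sample-size side of the problem, here is the textbook two-proportion formula in Python (an illustrative calculation, not necessarily the method the Wikipedia team settled on; the 10% → 12% donation rates are made up):

```python
from math import sqrt, ceil
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for detecting a difference between
    two conversion rates with a two-sided z-test (standard textbook formula)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. ~1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)            # e.g. ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Detecting a lift from a 10% to a 12% rate needs roughly 3,800 samples per arm:
n = sample_size_two_proportions(0.10, 0.12)
```

The practical consequence is the one the post hints at: the smaller the effect you want to detect, the larger the sample each test must wait for.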
Why Topological Data Analysis Works
Topological data analysis has been very successful in discovering information in many large and complex data sets. In this post, I would like to discuss the reasons why it is an effective methodology. One of the key messages around topological data analysis is that data has shape and the shape matters. Although it may appear to be a new message, in fact it describes something very familiar.
Multiple Comparisons with BayesFactor, Part 1
One of the most frequently-asked questions about the BayesFactor package is how to do multiple comparisons; that is, given that some effect exists across factor levels or means, how can we test whether two specific effects are unequal. In the next two posts, I’ll explain how this can be done in two cases: in Part 1, I’ll cover tests for equality, and in Part 2 I’ll cover tests for specific order-restrictions.
Introducing: Orthogonal Nonlinear Least-Squares Regression in R
With this post I want to introduce my newly bred ‘onls’ package which conducts Orthogonal Nonlinear Least-Squares Regression (ONLS):
The Changing Nature of Predictive Analytics in the Enterprise
Today, an increasing number of institutional clients are looking for solutions, strategies and roadmaps to implement Big Data and Predictive Analytics initiatives within their own organizations. While the exact nature of the solutions and recommendations may differ from client to client, based on a number of factors, like the industry they operate in, the size of their operations and business model, there are common threads that can be applied to their needs. While looking for these common threads, I came across an interesting white paper titled “Standards in Predictive Analytics” by James Taylor (CEO, Decision Management Solutions) in which he shares his thoughts on the subject. This blog post summarizes some of the key points that the author makes, along with some of my own thoughts from my engagements with both mid and large sized clients, within and outside the US, which I hope you find useful.
A bit more on testing
If you liked Nina Zumel’s article on the limitations of Random Test/Train splits you might want to check out her recent article on predictive analytics product evaluation hosted by our friends at Fliptop.
SAS PROC MCMC example in R: Logistic Regression Random-Effects Model
In this post I will run the SAS example Logistic Regression Random-Effects Model in four R-based solutions: JAGS, Stan, MCMCpack and LaplacesDemon. To quote the SAS manual: ‘The data are taken from Crowder (1978). The Seeds data set is a 2 x 2 factorial layout, with two types of seeds, O. aegyptiaca 75 and O. aegyptiaca 73, and two root extracts, bean and cucumber. You observe r, which is the number of germinated seeds, and n, which is the total number of seeds. The independent variables are seed and extract.’ The point of this exercise is to demonstrate a random effect: each observation has a random effect associated with it. In contrast, the other parameters have non-informative priors. As such, the models are not complex.
Top SlideShare Presentations on Big Data, updated
REST APIs and crawling offer two different ways to gather big data presentations from SlideShare, but they provide different results and lead to a very different view of the data. We examine why and find a useful data science lesson.
Introduction to python for data mining
Python is a great language for data mining. It has a lot of great libraries for exploring, modeling, and visualizing data. To get started I would recommend downloading the Anaconda Package. It comes with most of the libraries you will need and provides an IDE and package manager.
Python: Learn data science in your browser. For free.
Start analyzing data immediately. No signup required!
How to Predict Where the Next Disaster Will Strike?
In Geospatial Intelligence they gave a weird assignment: one needs to mark the location on the world map where the next international natural disaster will occur O_o. This is not an easy task by any means, and the lecturer suggested using one’s ‘gut feeling’ if one’s knowledge is insufficient (I suppose it is close to impossible to find someone who can make such a prediction taking into account all the types of disasters). A link to the International Disasters Database was given, though, so I accepted the challenge (to make a data-driven prediction). To predict the exact location of the next disaster one would need a lot of data – far more than you can get out of that database – so my goal was to make a prediction at the country level. (BTW, the graphs from my post about disasters seem to be based on the data from this database – I saw one of them at that site.)
Fun with GitHub’s map tools
After discovering GitHub’s map visualization feature I needed to give it a shot on the only GPS dataset I had available, my runs from RunKeeper. Unfortunately, the RunKeeper files were in GPX while GitHub expects either geojson or topojson. A short Python script later and I was able to convert the GPX data into geojson. The other hiccup I encountered was that the generated geojson file was too large for GitHub to visualize. My 232 runs contained 162,071 latitude/longitude pairs, which turned into a 4MB file – not massive, but large enough for GitHub to refuse to visualize it. The simplest solution was to generate multiple files, but that made it impossible to see all my runs on a single map. The other solution was to see if converting to topojson would reduce the file size. That helped, but I wasn’t able to find the right balance between compression and quality and ended up with a hybrid approach – two files, one per running year, each in topojson.
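The GPX-to-GeoJSON step is only a few lines with the standard library; a hypothetical minimal version (not the author’s script, and RunKeeper’s actual GPX fields may differ):

```python
import json
import xml.etree.ElementTree as ET

GPX_NS = "{http://www.topografix.com/GPX/1/1}"

def gpx_to_geojson(gpx_text):
    """Collect all track points from a GPX document into a single
    GeoJSON LineString (note GeoJSON order is [longitude, latitude])."""
    root = ET.fromstring(gpx_text)
    coords = [
        [float(pt.get("lon")), float(pt.get("lat"))]
        for pt in root.iter(GPX_NS + "trkpt")
    ]
    return json.dumps({
        "type": "Feature",
        "geometry": {"type": "LineString", "coordinates": coords},
        "properties": {},
    })

sample = ('<gpx xmlns="http://www.topografix.com/GPX/1/1"><trk><trkseg>'
          '<trkpt lat="40.0" lon="-105.0"/><trkpt lat="40.1" lon="-105.1"/>'
          '</trkseg></trk></gpx>')
geojson = gpx_to_geojson(sample)
```
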
Image Classification with Convolutional Neural Networks – my attempt at the NDSB Kaggle Competition
Every year during the holidays I find myself a little technical project outside of my normal SAP world to try out something new, or look into innovation topics that may not have immediate relevance to my job, but might in the longer term. This vacation period – after focusing in 2014 a lot on Big Data and predictive analytics topics – I decided to take a look at deep learning and neural networks.
R in Business Intelligence
In my opinion R is fully capable (even more so than pandas) of serving as an engine for BI-related processes. R naturally has a broad range of statistical tools available (multiple repositories with thousands of packages), having been developed for decades. I will skip this enormous feature of R and just focus on a simple BI case of extraction, transformation, loading and presentation. Below are listed the packages which directly address the steps in a basic BI process.
Probabilistic Techniques, Data Streams and Online Learning
I look forward to 2015 as the year when randomized algorithms, probabilistic techniques and data structures become more pervasive and mainstream. The primary driving factors for this will be the growing prevalence of big data and the necessity to process it in near real time using minimal (or constant) memory bandwidth. You are given data streams where you may see each data point only once in your lifetime, and you need to churn out analytics from them in real time. You cannot afford to store all of them in a database on disk, since that would incur an unrealistic performance penalty for serving queries in real time. And you cannot afford to store all the information in memory, even if you add RAM at will. You need to find clever ways to optimize your storage, and employ algorithms and data structures that use sublinear space yet deliver information in real time.
Count-Min Sketch – A Data Structure for Stream Mining Applications
In today’s age of Big Data, streaming is one of the techniques for low-latency computing. Besides the batch processing infrastructure of the map/reduce paradigm, we are seeing a plethora of ways in which streaming data is processed in near real time to cater to some specific kinds of applications. Libraries like Storm, Samza and Spark belong to this genre and are starting to get their share of the user base in the industry today. This post is not about Spark, Storm or Samza. It’s about a data structure which is one of the relatively new entrants in the domain of stream processing, which is simple to implement, but has already proved to be of immense use in serving a certain class of queries over huge streams of data. I have been doing some reading about the application of such structures and thought of sharing it with the readers of my blog.
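To make the idea concrete, here is a toy count-min sketch in Python (an illustrative implementation, not from the post): d hash rows of w counters, where a query takes the minimum across rows and can only overestimate the true count.

```python
import hashlib

class CountMinSketch:
    """Count-min sketch: `depth` hash rows of `width` counters. An update
    increments one counter per row; a query takes the minimum over the rows,
    which overestimates (never underestimates) the true count."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-ish hash per row, derived from a salted SHA-256
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for word in ["spark", "storm", "spark", "samza", "spark"]:
    cms.add(word)
```

The appeal for stream processing is that memory is fixed (width × depth counters) no matter how many distinct items flow past, at the cost of a bounded overestimation error.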
Streaming Data Analysis and Online Learning by John Myles White
In this talk, “Streaming Data Analysis and Online Learning,” John Myles White of Facebook surveys some basic methods for analyzing data in a streaming manner. He focuses on using stochastic gradient descent (SGD) to fit models to data sets that arrive in small chunks, discussing some basic implementation issues and demonstrating the effectiveness of SGD for problems like linear and logistic regression as well as matrix factorization. He also describes how these methods allow ML systems to adapt to user data in real-time. This talk was recorded at the New York Open Statistical Programming meetup at Knewton.
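The core of the SGD approach can be sketched in a few lines (a toy example, not White’s code): fit y = w·x + b from mini-batches that arrive one at a time and are never stored.

```python
import random

def sgd_linear(stream, lr=0.05, w=0.0, b=0.0):
    """One SGD pass over a stream of (x, y) mini-batches, fitting y = w*x + b
    by stepping along the gradient of the batch's mean squared error."""
    for batch in stream:
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = random.Random(1)

def batches(n_batches=400, size=10):
    # Simulated stream: noisy samples of y = 2x + 1, arriving in small chunks
    for _ in range(n_batches):
        xs = [rng.uniform(-1, 1) for _ in range(size)]
        yield [(x, 2 * x + 1 + rng.gauss(0, 0.05)) for x in xs]

w, b = sgd_linear(batches())
```

Each chunk is consumed and discarded, which is exactly what makes the approach viable for data that arrives continuously.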
How Not To Run An A/B Test
If you run A/B tests on your website and regularly check ongoing experiments for significant results, you might be falling prey to what statisticians call repeated significance testing errors. As a result, even though your dashboard says a result is statistically significant, there’s a good chance that it’s actually insignificant. This note explains why.
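The effect is easy to reproduce by simulation; a sketch under the null hypothesis (no real difference), comparing how often “significance” appears when you peek ten times versus only once at the end:

```python
import random
from statistics import NormalDist

def significant(successes, n, p0=0.5, alpha=0.05):
    """Two-sided one-sample z-test for a proportion against p0."""
    se = (p0 * (1 - p0) / n) ** 0.5
    z = abs(successes / n - p0) / se
    return z > NormalDist().inv_cdf(1 - alpha / 2)

rng = random.Random(42)
trials, peeks, per_peek = 1000, 10, 200
peeking_hits = final_hits = 0
for _ in range(trials):
    successes = n = 0
    any_peek_significant = False
    for _ in range(peeks):
        successes += sum(rng.random() < 0.5 for _ in range(per_peek))
        n += per_peek
        if significant(successes, n):
            any_peek_significant = True  # a peeker would stop the test here
    peeking_hits += any_peek_significant
    final_hits += significant(successes, n)  # checking once, at the end
```

Even though the true rate never changes, the peeking false-positive rate should come out several times higher than the nominal 5%, which is the repeated-significance-testing error the note explains.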
What is A/B Testing?
A/B testing is a simple way to test changes to your page against the current design and determine which ones produce positive results. It is a method to validate that any new design or change to an element on your webpage is improving your conversion rate before you make that change to your site code.
Communicating Risk and Uncertainty
David Spiegelhalter gave a fascinating talk on Communicating Risk and Uncertainty to the Public & Policymakers at the Grantham Institute of Imperial College London last Tuesday. In a very engaging way David gave many examples and anecdotes from his career in academia and advisory work. I believe his talk will be published on the Grantham Institute’s YouTube channel, so I will only share a few highlights and thoughts that stuck in my mind here.
The Role Ontology plays in Big Data
Ontology claims to be to applications what Google was to the web. Instead of integrating the many different enterprise applications within an organization to obtain, for example, a 360-degree view of customers, Ontology enables users to search a schematic model of all data within the applications. They extract relevant data from a source application, such as a CRM system, big data applications, files, warranty documents etc. These extracted semantics are linked into a search graph instead of a schema to give users the results needed.
Big data: the great exemplar of digital changes
Numerous changes and innovations have come to life recently. The pace of the digital revolution is hard to imagine, and it keeps increasing. There is no doubt that most of the approaching digital changes are potentially disruptive to older habits, businesses and beliefs. Unconditionally, they are changing the former way of life on the globe. They push the whole of humanity into something very new and completely unknown. In my opinion the majority of digital innovations carry similar features, as they address similar issues and evoke the same questions. Big Data is one of them.
R: Visualizing Bivariate Shrinkage
Here is a snippet to reproduce a similar bivariate shrinkage plot in ggplot2, adding a color coded probability density surface and contours for the estimated multivariate normal distribution of random effects, using the same sleep study data that Bates used.
Confidence vs. Credibility Intervals
Tomorrow, for the final lecture of my Mathematical Statistics course, I will try to illustrate – using Monte Carlo simulations – the difference between classical statistics and the Bayesian approach.
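The frequentist half of that comparison is itself a short Monte Carlo exercise: draw many samples and check how often the 95% confidence interval actually contains the true mean. A sketch (using z ≈ 1.96 rather than the exact t quantile, so coverage will be slightly below nominal):

```python
import random
from statistics import mean, stdev

def ci_covers(rng, true_mu=10.0, sigma=2.0, n=30, z=1.96):
    """Draw one sample of size n and check whether the approximate 95%
    confidence interval mean ± z * s / sqrt(n) contains true_mu."""
    xs = [rng.gauss(true_mu, sigma) for _ in range(n)]
    half_width = z * stdev(xs) / n ** 0.5
    return abs(mean(xs) - true_mu) <= half_width

rng = random.Random(7)
coverage = sum(ci_covers(rng) for _ in range(2000)) / 2000
```

The frequentist claim is about this long-run coverage frequency; the Bayesian credible interval makes a probability statement about the parameter given one observed sample, which is the contrast the lecture illustrates.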
The High Cost of Maintaining Machine Learning Systems
Google researchers warn of the massive ongoing costs for maintaining machine learning systems. We examine how to minimize the technical debt.
Can noise help separate causation from correlation?
How to tell correlation from causation is one of the key problems in data science and Big Data. New Additive Noise Models methods can do it with over 65% accuracy, opening new breakthrough possibilities.
Eigenvectors and eigenvalues Explained Visually
Eigenvalues/vectors are instrumental to understanding electrical circuits, mechanical systems, ecology and even Google’s PageRank algorithm. Let’s see if visualization can make these ideas more intuitive.
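A small example of the intuition: power iteration repeatedly applies a matrix to a vector, and the vector settles onto the dominant eigenvector; this is the same mechanism behind PageRank. (Illustrative code, not from the post.)

```python
def power_iteration(matrix, steps=100):
    """Repeatedly apply the matrix and renormalize; the vector converges to
    the eigenvector of the largest-magnitude eigenvalue."""
    v = [1.0] * len(matrix)
    for _ in range(steps):
        v = [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    # Rayleigh quotient v^T A v estimates the corresponding eigenvalue
    av = [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]
    eigenvalue = sum(a * x for a, x in zip(av, v))
    return eigenvalue, v

# [[2, 1], [1, 2]] has eigenvalues 3 and 1; the dominant eigenvector is (1, 1)/sqrt(2)
val, vec = power_iteration([[2.0, 1.0], [1.0, 2.0]])
```
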
Visualizing Representations: Deep Learning and Human Beings
In a previous post, we explored techniques for visualizing high-dimensional data. Trying to visualize high dimensional data is, by itself, very interesting, but my real goal is something else. I think these techniques form a set of basic building blocks to try and understand machine learning, and specifically to understand the internal operations of deep neural networks.
A brief introduction to caretEnsemble
caretEnsemble is a package for making ensembles of caret models. You should already be somewhat familiar with the caret package before trying out caretEnsemble. caretEnsemble has 3 primary functions: caretList, caretEnsemble and caretStack. caretList is used to build lists of caret models on the same training data, with the same re-sampling parameters. caretEnsemble and caretStack are used to create ensemble models from such lists of caret models. caretEnsemble uses greedy optimization to create a simple linear blend of models and caretStack uses a caret model to combine the outputs from several component caret models.
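The greedy step can be illustrated in miniature (a language-agnostic sketch in Python, not caretEnsemble’s code): repeatedly add, with replacement, whichever model’s predictions most reduce the blended error, so that pick counts become fractional weights.

```python
def greedy_blend(predictions, truth, rounds=50):
    """Greedy ensemble selection: the blend is the average of the chosen
    models' predictions; models may be picked repeatedly, which yields
    fractional weights summing to one."""
    def rmse(pred):
        return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)) ** 0.5

    chosen = []
    for _ in range(rounds):
        # Try adding each model and keep the one that gives the lowest RMSE
        best = min(
            predictions,
            key=lambda name: rmse([
                (sum(predictions[m][i] for m in chosen) + predictions[name][i])
                / (len(chosen) + 1)
                for i in range(len(truth))
            ]),
        )
        chosen.append(best)
    return {m: chosen.count(m) / len(chosen) for m in set(chosen)}

# Hypothetical held-out predictions from two component models:
truth = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "model_a": [1.1, 2.1, 2.9, 4.2],  # accurate model
    "model_b": [0.5, 2.8, 3.5, 3.0],  # noisier model
}
weights = greedy_blend(predictions, truth)
```

Note that the noisy model can still earn a small positive weight when its errors partly cancel the good model’s, which is the intuition behind blending rather than just picking the single best model.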
Principal Component Analysis in 3 Simple Steps
Principal Component Analysis (PCA) is a simple yet popular and useful linear transformation technique that is used in numerous applications, such as stock market predictions, the analysis of gene expression data, and many more. In this tutorial, we will see that PCA is not just a “black box,” and we are going to unravel its internals in 3 basic steps.
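For the 2-D case those steps fit in a few lines of plain Python (an illustrative sketch; the tutorial works through the general case): center the data, build the covariance matrix, then take the eigenvector of its largest eigenvalue, which for a symmetric 2x2 matrix has a closed form.

```python
from math import atan2, cos, sin

def pca_first_axis_2d(points):
    """PCA on 2-D data in three steps: (1) center the data, (2) form the 2x2
    covariance matrix, (3) take the eigenvector of its largest eigenvalue."""
    n = len(points)
    # Step 1: center
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Step 2: covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Step 3: for a symmetric 2x2 matrix, the major axis makes angle
    # theta = atan2(2*sxy, sxx - syy) / 2 with the x-axis
    theta = atan2(2 * sxy, sxx - syy) / 2
    return (cos(theta), sin(theta))

# Points lying near the line y = x: the first PC should point along (1, 1)/sqrt(2)
axis = pca_first_axis_2d([(0, 0), (1, 1.1), (2, 1.9), (3, 3.05), (4, 4.0)])
```

Projecting the centered data onto this axis gives the first principal component scores; the same three steps generalize to any dimension via a full eigendecomposition.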
Big Data Projects: How to Choose NoSQL Databases
So, you’ve succumbed to the buzz and now you’re looking around trying to make heads or tails of the mass amounts of information out there hyped up as “big data.” Or perhaps you’re even ready to start your own internal project to get your existing applications on the bandwagon. In either case, terrific! Your decision is a good one. Unfortunately, now comes the flurry of potentially overwhelming questions:
• Where do I start?
• What are my expectations?
• What does big data mean to my company?
• What does big data mean in the context of our applications?
• How do I assess my application needs?
• How do I know or determine if big data solutions will work for us?
R: Seasonal Periods
I have two large time series data sets. One is sampled at one-second intervals and the other at one-minute intervals. The length of each time series is 180 days. I’m using R (3.1.1) for forecasting the data. I’d like to know the value of the “frequency” argument in the ts() function in R for each data set. Since most of the examples and cases I’ve seen so far are for months, or days at the most, it is quite confusing for me when dealing with equally spaced seconds or minutes. According to my understanding, the “frequency” argument is the number of observations per season. So what is the “season” in the case of seconds/minutes? My guess is that since there are 86,400 seconds and 1,440 minutes in a day, these should be the values for the “freq” argument. Is that correct?