Real-Time Analytics and the Internet of Things are a Perfect Match
The Internet of Things is here, and it is here to stay. With billions of devices already connected and another trillion coming our way in the next decade, organizations should prepare themselves for a data flood: a flood that can generate real-time insights into how the organization is performing. Using a plethora of data sources, such as smart meters, in-store sensors, medical devices, automotive sensors, and wearables, organizations can start to optimize their business.
8 Reasons Apache Spark is So Hot
1. Spark replaces MapReduce.
2. Spark can use HDFS.
3. Spark can use YARN.
4. Spark can be deployed, but not fully monitored or managed.
5. Spark enables analytics workflows.
6. Spark uses memory differently.
7. Spark uptake is significant.
8. Spark’s results are jaw-droppingly impressive.
How to Use R for Connecting Neo4j and Tableau (A Recommendation Use Case)
The year is just a little more than two months old, and we got good news from Tableau: beta testing for version 9.0 has started. But it looks like one of my most favored features didn't make it into the first release: the Tableau Web Data Connector (it's mentioned in Christian Chabot's keynote at 01:18:00, which you can find here). The connector can be used to access REST APIs from within Tableau. Instead of waiting for the unknown release containing the Web Data Connector, I will show in this post how you can use the current version of Tableau together with R to build your own "Web Data Connector". Specifically, we connect to an instance of the graph database Neo4j using Neo4j's REST API. And that is not the only good news: our approach creates a live connection to the "REST API data source", going beyond any attempt that utilizes Tableau's Data Extract API, which produces static tde files that have to be loaded into Tableau.
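As a taste of the approach, here is a minimal sketch (not the post's actual code) of how one might query Neo4j's transactional Cypher endpoint from R; it assumes a local Neo4j 2.x instance on the default port, and the helper name and query are illustrative.

```r
# A minimal sketch: POST a Cypher statement to Neo4j's transactional REST
# endpoint and parse the JSON response (assumes Neo4j 2.x on localhost:7474).
library(httr)
library(jsonlite)

query_neo4j <- function(cypher,
                        url = "http://localhost:7474/db/data/transaction/commit") {
  body <- list(statements = list(list(statement = cypher)))
  resp <- POST(url, body = toJSON(body, auto_unbox = TRUE),
               content_type_json(), accept_json())
  stop_for_status(resp)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}

res <- query_neo4j("MATCH (n) RETURN count(n) AS n_nodes")  # illustrative query
res$results  # columns and rows of the Cypher result
```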
The Value of Data, Part 2: Building Valuable Datasets
Data is incredibly valuable. It helps create superior products, it forms a barrier to entry, and it can be directly monetized. This post is the second in a 3-part series about making data a core part of a startup’s business plan.
The Data Engineering Ecosystem
An Interactive Map
When not to use Gaussian Mixture Model (EM clustering)
A widely used generative unsupervised clustering method is the Gaussian Mixture Model (GMM), also known as "EM clustering". The idea of GMM is very simple: a given dataset is modeled as a mixture of multiple multivariate Gaussians. In other words, the idea of EM clustering is that there are K clusters, and the points in the j-th cluster follow a normal distribution with mean µj and covariance matrix Σj. Each point xi in the dataset has a soft assignment to the K clusters, proportional to πj N(xi | µj, Σj), where πj is the mixing weight of the j-th cluster. One can convert this soft probabilistic assignment into a hard membership by picking the most likely cluster (the cluster with the highest assignment probability). Like other clustering algorithms, GMM makes assumptions about the format and shape of the data, and if those criteria are not met its performance can drop significantly. The point of this post is to investigate the performance of EM clustering in the following scenarios:
• Non-Gaussian dataset: as is clear from the formulation, GMM assumes an underlying Gaussian generative distribution. However, many practical datasets do not satisfy this assumption. I study the effect of non-Gaussian data in two cases:
1. The number of clusters is known
2. The number of clusters is unknown
• Uneven cluster sizes: when clusters do not have even sizes, there is a high chance that the small cluster gets dominated by the large one. For this post I am using the R EMCluster package (a minimal usage sketch follows below).
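For readers unfamiliar with the package, here is a minimal sketch, not taken from the post, of running EM clustering with EMCluster on simulated data with uneven cluster sizes; the data and settings are purely illustrative.

```r
# A minimal sketch: fit a K = 2 Gaussian mixture via EM on simulated data
# with one large and one small cluster (500 vs. 50 points).
library(EMCluster)
library(MASS)  # for mvrnorm

set.seed(42)
big   <- mvrnorm(500, mu = c(0, 0), Sigma = diag(2))
small <- mvrnorm(50,  mu = c(4, 4), Sigma = 0.5 * diag(2))
x <- rbind(big, small)

# Initialize and fit the mixture, then harden the soft assignments
emobj <- init.EM(x, nclass = 2)
cl <- assign.class(x, emobj, return.all = FALSE)

# Compare recovered labels against the true generating clusters
table(Assigned = cl$class, True = rep(c(1, 2), c(500, 50)))
```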
Supervised Classification, beyond the logistic
In our data-science class, after discussing the limitations of logistic regression, e.g. the fact that its decision boundary is a straight line, we mentioned possible natural extensions. Let us consider our (now) standard dataset …
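To make the idea concrete, here is a minimal sketch on simulated data, rather than the post's dataset, of one such natural extension: a logistic model with quadratic terms, whose decision boundary is no longer a straight line.

```r
# A minimal sketch: adding quadratic terms to a logistic regression yields a
# curved decision boundary (illustrative simulated data, circular frontier).
set.seed(1)
n  <- 200
x1 <- runif(n, -2, 2)
x2 <- runif(n, -2, 2)
y  <- as.numeric(x1^2 + x2^2 + rnorm(n, sd = 0.5) < 2)

fit <- glm(y ~ poly(x1, 2) + poly(x2, 2), family = binomial)

# The 0.5 probability contour traces the (non-linear) decision boundary
s <- seq(-2, 2, length = 101)
grid <- expand.grid(x1 = s, x2 = s)
p <- predict(fit, newdata = grid, type = "response")
contour(s, s, matrix(p, 101, 101), levels = 0.5, xlab = "x1", ylab = "x2")
points(x1, x2, col = y + 1)
```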
Supervised Classification, discriminant analysis
Another popular technique for classification (or at least one that used to be popular) is (linear) discriminant analysis, introduced by Ronald Fisher in 1936. Consider the same dataset as in our previous post …
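As a quick illustration, and not the post's own code, here is a minimal linear discriminant analysis with MASS::lda, run on the iris dataset for convenience.

```r
# A minimal sketch: fit an LDA classifier and inspect its confusion matrix
# (iris used purely for illustration).
library(MASS)

fit  <- lda(Species ~ ., data = iris)
pred <- predict(fit, iris)
table(Predicted = pred$class, Actual = iris$Species)
```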
Why the Ban on P-Values? And What Now?
Just recently, the editors of the academic journal Basic and Applied Social Psychology decided to ban p-values: that's right, the nexus of inferential decision making… gone! This has created quite a fuss among anyone who relies on significance testing and p-values to do research (especially those, presumably, in social psychology who were hoping to submit a paper to that journal any time soon). The Royal Statistical Society even shared six interesting letters from academics reacting to the decision.
Getting Data From One Online Source
Why might one need to fetch data from a URL?
• You want to share your code with someone who isn’t familiar with R and you want to avoid the inevitable explanation of how to change the file path at the beginning of the file. (“Make sure you only use forward slashes!”)
• The data at the URL is constantly changing and you want your analysis to use the latest each time you run it.
• You want the code to just work when it’s run from another machine with another directory tree.
• You want to post a completely repeatable analysis on your blog and you don’t want it to begin with “go to http://www.blahblahblah.com, download this data, and load it into R”.
Whatever your reason may be, it's a neat trick, but not one I use so often that I can rattle off the code from memory. So here's my template. I hope it can help someone else.
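The post's actual template isn't reproduced here, but a minimal version, with a placeholder URL, might look like this:

```r
# A minimal sketch: read a CSV straight from the web into a data frame
# (the URL is a hypothetical placeholder, not a real data source).
url <- "http://www.example.com/data.csv"
dat <- read.csv(url, stringsAsFactors = FALSE)

# If the direct read fails (e.g. https on older R versions), download to a
# temporary file first and read from there.
tmp <- tempfile(fileext = ".csv")
download.file(url, tmp)
dat <- read.csv(tmp, stringsAsFactors = FALSE)
str(dat)
```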
Text bashing in R for SQL
Fairly often, a coworker who is strong in Excel but weak in writing code will come to me for help pulling specific details about customers out of their datasets. Sometimes the reason is to call, email, or snail-mail a survey; other times it is to do some classification grouping on the customers. Whatever the reason, the coworker has a list of ID numbers and needs help getting something out of a SQL database. When it isn't as simple as adding quotes and commas to the cells in Excel before copying all the IDs into the WHERE clause of a very basic SELECT statement, I often fall back on R and let it do the work of putting together the SELECT statement and querying the data.
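Here is a minimal sketch of that kind of helper, not the post's own code; the table, column, and DSN names are hypothetical, and RODBC is assumed as the database interface.

```r
# A minimal sketch: paste a vector of IDs into an IN clause and run the query
# (table, column, and DSN names are hypothetical).
library(RODBC)

ids <- c(1001, 1002, 1003)  # the ID list handed over from Excel
# For character IDs, quote them instead: paste0("'", ids, "'", collapse = ", ")
in_clause <- paste(ids, collapse = ", ")
sql <- sprintf("SELECT * FROM customers WHERE customer_id IN (%s)", in_clause)

con    <- odbcConnect("warehouse")  # assumed ODBC DSN
result <- sqlQuery(con, sql)
odbcClose(con)
```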
Visualising a Classification in High Dimension
So far, when discussing classification, we've been playing with my toy dataset (actually, I should not claim it's mine; it is inspired by the one used in the introduction of Boosting, by Robert Schapire and Yoav Freund). But in real life, there are more observations, and more explanatory variables. With more than two explanatory variables, it starts to be more complicated to visualise. For instance, consider …
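One common workaround, though not necessarily the route the post takes, is to project the data onto its first two principal components and colour points by class; a minimal sketch on the iris dataset:

```r
# A minimal sketch: PCA projection of a 4-dimensional dataset onto two
# components, coloured by class (iris used purely for illustration).
pca <- prcomp(iris[, 1:4], scale. = TRUE)
plot(pca$x[, 1:2], col = as.integer(iris$Species), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)
```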
Beautiful HTML tables of linear models
In this blog post I'd like to show some (old and) new features of the sjt.lm function from my sjPlot package. These features are currently only implemented in the development snapshot on GitHub; a package update is planned to be submitted to CRAN soon. There are two major new features I added to this function: comparing models with different predictors (e.g. stepwise regression) and automatic grouping of categorical predictors. The examples below demonstrate these features. The sjt.lm function prints results and summaries of linear models as HTML tables, which can be viewed in the RStudio Viewer pane or a web browser, or easily exported to office applications.
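As a taste, here is a minimal sketch of the kind of call involved, using toy models on mtcars rather than the post's examples, and assuming the sjPlot development snapshot from GitHub is installed:

```r
# A minimal sketch: print two linear models side by side as an HTML table
# (toy models; assumes the sjPlot development snapshot).
library(sjPlot)

fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)  # a model with different predictors

sjt.lm(fit1, fit2)  # renders an HTML comparison table in the Viewer pane
```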
R Basics
For exploring data and doing open-ended statistical analysis on it, nothing beats the R language. Over the years, this open-source tool has come to dominate the way we do analysis and visualization. It has attracted a rich and varied collection of third-party libraries that have given it remarkable versatility. But how do you get started? Casimir explains how to get started and become familiar with the way the language works.
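As a flavour of the very first steps such an introduction typically covers (commands chosen here for illustration):

```r
# A few first commands of the kind an R introduction usually starts with
x <- c(1, 2, 3, 4, 5)     # create a numeric vector
mean(x)                   # basic descriptive statistics
summary(x)
plot(x, x^2, type = "b")  # a quick base-graphics plot
help(mean)                # the built-in documentation system
```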