Revolution R now available with SQL Server Community Preview

The new SQL Server 2016 is now available as part of the Community Technology Preview program, and as presaged it embeds connectivity with the R language and the big-data statistical algorithms of Revolution R Enterprise. SQL Server 2016 includes two ways of calling R. The first is by embedding R code directly in a SQL stored procedure, which can then be called by other applications to embed charts or tables created by R using data in the database.

An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples

Machine Learning (ML) is coming into its own, with a growing recognition that ML can play a key role in a wide range of critical applications, such as data mining, natural language processing, image recognition, and expert systems. ML provides potential solutions in all these domains and more, and is set to be a pillar of our future civilization.

Time Series Analysis using R – forecast package

In today’s blog post, we shall look at time series analysis using the R package forecast. The objective of the post is to explain the different methods available in the forecast package that can be applied to time series analysis and forecasting.
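As a taste of what the package offers, here is a minimal sketch using the built-in AirPassengers series (this example is illustrative and not taken from the post itself):

```r
library(forecast)

# Fit an ARIMA model with automatic order selection,
# then forecast 24 months ahead with prediction intervals.
fit <- auto.arima(AirPassengers)
fc  <- forecast(fit, h = 24)
plot(fc)
```

The `auto.arima()` function handles differencing and order selection automatically, which is why it is often the first thing people reach for in this package.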

7 tools in every data scientist’s toolbox

1. Tree based methods
2. Linear (regularized) models
3. Quantifying confidence: hypothesis testing, confidence- and prediction intervals
4. Resampling methods: bootstrapping, cross validation, Monte Carlo
5. Finding hidden groups: (centroid-based) clustering
6. Feature selection
7. Measuring performance: metrics, loss functions, measures of relevance
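To pick one of these, the resampling methods in point 4 are easy to demonstrate; here is a minimal bootstrap sketch in base R (the data are simulated for illustration):

```r
# Bootstrap a 95% percentile confidence interval for the mean, base R only.
set.seed(42)
x <- rnorm(100, mean = 5)                      # a sample of 100 observations

# Resample with replacement 2000 times, computing the mean each time.
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

# The 2.5% and 97.5% quantiles of the bootstrap distribution form the interval.
ci <- quantile(boot_means, c(0.025, 0.975))
ci
```

The same resampling idea underlies cross-validation and Monte Carlo methods: replace an analytic distribution with one generated by repeated sampling.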

The New Data Engineering Ecosystem: Trends and Rising Stars

Over the past few years, there has been an exponential increase in the amount of data available to individuals, companies, and the general public. This has spurred a surge of new ‘Big Data’ technologies – from distributed processing frameworks like Hadoop and Spark, to a wide variety of NoSQL databases – that help us understand the data around us. Each tool has its own strengths and weaknesses, and there is no ‘one-size-fits-all’ solution for every use case. Understanding these technologies and how they fit together can be one of the biggest challenges for someone new to the field of data.

26 Things I Learned in the Deep Learning Summer School

1. The need for distributed representations
2. Local minima are not a problem in high dimensions
3. Derivatives derivatives derivatives
4. Weight initialisation strategy
5. Neural net training tricks
6. Gradient checking
7. Motion tracking
8. Syntax or no syntax? (aka, “is syntax a thing?”)
9. Distributed vs Distributional
10. The state of dependency parsing
11. Theano
12. Nvidia Digits
13. Fuel
14. Multimodal linguistic regularities
15. Taylor series approximation
16. Computational intensity
17. Minibatches
18. Training on adversarial examples
19. Everything is language modelling
20. SMT had a rough start
21. The State of Neural Machine Translation
22. MetaMind classifier demo
23. Optimising gradient updates
24. Theano profiling
25. Adversarial nets framework
26. arXiv.org numbering
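To illustrate one item from the list, gradient checking (point 6) compares analytic derivatives against central finite differences; a minimal sketch (the function here is a toy example, not from the summer school material):

```r
# Gradient check: analytic gradient of f(x) = sum(x^2) is 2x.
f <- function(x) sum(x^2)
grad_analytic <- function(x) 2 * x

# Central-difference numerical gradient, one coordinate at a time.
numeric_grad <- function(f, x, eps = 1e-6) {
  sapply(seq_along(x), function(i) {
    e <- rep(0, length(x)); e[i] <- eps
    (f(x + e) - f(x - e)) / (2 * eps)
  })
}

x   <- c(1.5, -0.3, 2.0)
err <- max(abs(numeric_grad(f, x) - grad_analytic(x)))
err  # should be tiny; a large value would signal a bug in the gradient code
```

The same check, applied layer by layer, is a standard way to debug hand-written backpropagation code.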

Must Read Books for Beginners on Big Data, Hadoop and Apache Spark

In this article, I’ve listed some of the best books (as I perceive them) on Big Data, Hadoop and Apache Spark. These books are a must for beginners keen to build a successful career in big data. Books demand discipline and persistence. I had neither, until I picked up a book and read it cover to cover. If you haven’t done it yet, it’s your turn now. The books listed here comprise all the knowledge essential to take your first step in big data. Technologies like Hadoop and Apache Spark are in huge demand across the world. Companies have data, and they even have the technologies, but they don’t have the skilled manpower to work with them.

Digital Watch IG Barometer

The IG Barometer methodology presents a quantitative summary of the main developments in the IG arena based on computational text- and data-mining approaches. The IG Barometer is based on statistical modeling of large collections of textual documents. These collections – called text corpora – are obtained by querying various online media sources with IG-specific keywords and search terms, retrieving the most relevant IG news, articles, papers, etc. Thus, the IG Barometer reflects the status of the debate as represented in the media – not human expert judgment on particular IG issues.
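As a toy illustration of the kind of computational text analysis involved (the documents and keywords below are invented for this sketch and are not from the Barometer):

```r
# A tiny hypothetical corpus of retrieved headlines.
docs <- c("net neutrality debate intensifies",
          "ICANN discusses domain governance",
          "net neutrality rules under review")

# Tokenize into lowercase words and count term frequencies,
# a crude stand-in for the statistical modeling the Barometer describes.
words <- unlist(strsplit(tolower(docs), "\\s+"))
sort(table(words), decreasing = TRUE)[1:5]
```

Real pipelines of this kind typically go further, using topic models or similar methods over much larger corpora, but the basic unit is still counts of terms across retrieved documents.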

Graphical Modeling Mimics Selective Attention: Customer Satisfaction Ratings

As shown by the eye-tracking lines and circles, there is more on the above screenshot than we can process simultaneously. Visual perception takes time, and we must track where the eye focuses by recording sequence and duration. The ‘50% off’ and the menu items seem to draw the most attention, suggesting that the viewers were not men. But what if the screen contained a correlation matrix? The 23 mobile phone customer satisfaction ratings from an earlier post will serve as an illustration. The R code to access the data, calculate the correlation matrix and produce the graph can be found at the end of the original blog entry.
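The post’s own code is linked at its end; as a stand-in, here is a sketch of the general approach using simulated ratings (the qgraph package draws a correlation matrix as a network, and its `minimum` argument hides weak edges so that only the stronger correlations compete for attention):

```r
library(qgraph)

# Simulated stand-in for a ratings data frame (5 items, 200 respondents).
set.seed(1)
ratings <- as.data.frame(matrix(sample(1:10, 200 * 5, replace = TRUE),
                                ncol = 5))

# Correlation matrix, then a network plot with weak edges suppressed.
R <- cor(ratings)
qgraph(R, layout = "spring", minimum = 0.2)
```

With real satisfaction data, items that cluster together in the graph tend to reflect the same underlying driver of satisfaction.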

Controlling a Remote R Session from a Local One

Say you have an Amazon EC2 instance running and you want to control the R session running there from your local R session. At heart, this is not a new idea for the R community: you can already control remote R sessions easily with Shiny or RStudio Server, for instance. But now you can also try the experimental remoter package, available on GitHub. So while this isn’t really tackling an unsolved problem, I think this approach, for better or worse, is unique.
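Based on the package’s documented interface, usage looks roughly like the following sketch (the IP address is a placeholder, and the exact prompts may differ since the package is experimental):

```r
# On the remote EC2 instance, start an R session and run:
#   remoter::server()
#
# Then, from the local R session, connect to it
# (replace the address with your instance's public IP):
#   remoter::client("54.0.0.1")
#
# Commands typed locally are now executed in the remote session;
# type exit() at the remoter prompt to disconnect.
```

The appeal over RStudio Server is that you stay inside your ordinary local R console rather than switching to a browser-based IDE.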

Mapping with ggplot: Create a nice choropleth map in R

This post shows how to use ggplot to create a choropleth map from shapefiles, change the legend attributes, scale, colors, and titles, and finally export the map to an image file.
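The general recipe, in the `fortify()`-based style of ggplot2 at the time, looks roughly like this sketch; the shapefile path, layer name, and `value` column are placeholders, not from the post:

```r
library(ggplot2)
library(rgdal)   # provides readOGR() for reading shapefiles

# Read the shapefile (placeholder dsn/layer) and flatten it to a data frame
# of polygon vertices that ggplot can draw.
shp <- readOGR(dsn = "shapefiles", layer = "regions")
shp@data$id <- rownames(shp@data)
df <- fortify(shp, region = "id")
df <- merge(df, shp@data, by = "id")   # attach attributes, e.g. a 'value' column

# Draw polygons filled by value, with a customized legend and title,
# then export to an image file.
ggplot(df, aes(long, lat, group = group, fill = value)) +
  geom_polygon(colour = "grey50") +
  scale_fill_gradient(low = "white", high = "darkred", name = "Rate") +
  coord_equal() +
  labs(title = "Example choropleth")
ggsave("map.png", width = 8, height = 6)
```

Newer ggplot2 workflows use `sf` objects with `geom_sf()` instead, but the fortify-and-polygon approach is what posts of this vintage typically show.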

Piecewise linear trends

I prepared the following notes for a consulting client, and I thought they might be of interest to some other people too.
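The notes themselves are not reproduced here, but a common way to fit a piecewise linear trend in R is to add a “hinge” term at a known knot; a sketch with simulated data (the knot location is assumed known for illustration):

```r
# Simulate a trend whose slope changes at t = 60.
set.seed(7)
t <- 1:100
y <- 0.5 * t + 1.5 * pmax(t - 60, 0) + rnorm(100, sd = 5)

# pmax(t - tau, 0) is zero before the knot and rises linearly after it,
# so its coefficient estimates the CHANGE in slope at the knot.
tau <- 60
fit <- lm(y ~ t + pmax(t - tau, 0))
coef(fit)  # intercept, pre-knot slope, change in slope after the knot
```

With an unknown knot, one would typically compare fits over a grid of candidate `tau` values or use a segmented-regression package.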