Feeding Graph databases – a third use-case for modern log management platforms.

The motivation for collecting log data in a modern log management platform usually falls into two distinct categories: the ability to gain insight from the collected data (for purposes such as alerting, reporting, trending, root-cause analysis, security operations, etc.) and compliance (for some organizations, a combination of both). An emerging third use-case is linked-data analysis. Let me explain …

Sentiment Analysis on Donald Trump using R and Tableau

Recently, presidential candidate Donald Trump has become increasingly controversial. In particular, he has faced strong criticism for his provocative call to temporarily bar Muslims from entering the US. One of the many uses of social media analytics is sentiment analysis, where we evaluate whether posts on a specific issue are positive or negative. We can integrate R and Tableau for text data mining in social media analytics, machine learning, predictive modeling, etc., by taking advantage of the numerous R packages and compelling Tableau visualizations.
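To make the idea concrete, here is a minimal lexicon-based sentiment sketch in Python (the post itself uses R and Tableau); the word lists and example posts are tiny, made-up illustrations, not a real sentiment lexicon:

```python
# Hypothetical mini-lexicon; real analyses use curated lexicons or trained models
POSITIVE = {"great", "love", "win", "support"}
NEGATIVE = {"ban", "criticism", "controversial", "bad"}

def sentiment_score(text):
    """Count positive words minus negative words in a post."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

posts = [
    "Great rally, love the energy!",
    "His ban proposal drew strong criticism.",
]
scores = [sentiment_score(p) for p in posts]
print(scores)  # → [2, -2]
```

A positive score suggests a positive post, a negative score a negative one; aggregating these scores over many tweets is what produces the trend lines one would then visualize in Tableau.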

Google scholar scraping with rvest package

In this post, I will show how to scrape Google Scholar. In particular, we will use the ‘rvest’ R package to scrape the Google Scholar account of my PhD advisor. We will see his coauthors, how many times they have been cited, and their affiliations. “rvest, inspired by libraries like beautiful soup, makes it easy to scrape (or harvest) data from html web pages”, wrote Hadley Wickham on the RStudio Blog. Since it is designed to work with magrittr, we can express complex operations as elegant pipelines composed of simple, easily understood pieces of code.
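The same harvest-from-HTML idea can be sketched in Python with the standard library alone; the HTML snippet and the CSS class name below are made-up stand-ins for a real Scholar coauthor list, and a real scraper would fetch the page first:

```python
from html.parser import HTMLParser

# Hypothetical snippet mimicking a coauthor list on a profile page
HTML = """
<div class="coauthor"><a href="#">Jane Doe</a></div>
<div class="coauthor"><a href="#">John Smith</a></div>
"""

class CoauthorParser(HTMLParser):
    """Collect the link text (coauthor names) from the page."""
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link and data.strip():
            self.names.append(data.strip())

p = CoauthorParser()
p.feed(HTML)
print(p.names)  # → ['Jane Doe', 'John Smith']
```

rvest (like beautiful soup) replaces this hand-rolled state machine with declarative selectors, which is exactly why it composes so well into magrittr pipelines.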

New Year Resolutions for a Data Scientist

Beginner Level
1. Start with a programming language. Either R or Python
2. Learn Statistics and Mathematics
3. Enroll in one MOOC at a time (Most Difficult)
4. Engage, Discover and Socialize in Industry
Intermediate Level
1. Understand and Build your Machine Learning Skills
2. Focus on Ensemble and Boosting Algorithms
3. Explore Spark, NoSQL and other Big Data Tools
4. Educate Community Members
5. Participate in Data Science Competitions
Advanced Level
1. Build a Deep Learning Model
2. Give Back to Community
3. Explore Reinforcement Learning
4. Rank in Top 50 on Kaggle

Attention and Memory in Deep Learning and NLP

A recent trend in Deep Learning is Attention Mechanisms. In an interview, Ilya Sutskever, now the research director of OpenAI, mentioned that Attention Mechanisms are one of the most exciting advancements, and that they are here to stay. That sounds exciting. But what are Attention Mechanisms? Attention Mechanisms in Neural Networks are (very) loosely based on the visual attention mechanism found in humans. Human visual attention is well-studied, and while there exist different models, all of them essentially come down to being able to focus on a certain region of an image with “high resolution” while perceiving the surrounding image in “low resolution”, and then adjusting the focal point over time.
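The core computation behind most attention mechanisms is just a softmax-weighted average: score each position against a query, turn the scores into weights, and blend the values. A minimal pure-Python sketch (toy vectors, not a real network):

```python
import math

def softmax(xs):
    """Numerically stable softmax: weights are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    # Score each key by its dot product with the query, then take a
    # softmax-weighted average of the values: high-scoring positions get
    # "high resolution", the rest fade into "low resolution" background.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# The query matches the first key, so the first value dominates the output
out = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0], [0.0]])
print(out)
```

Because the weights are differentiable, the network can learn where to "look", which is what makes the mechanism trainable end to end.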

bayes.js: A Small Library for Doing MCMC in the Browser

Bayesian data analysis is cool, Markov chain Monte Carlo is the cool technique that makes Bayesian data analysis possible, and wouldn’t it be coolness if you could do all of this in the browser? That was what I thought, at least, and I’ve now made bayes.js: A small JavaScript library that implements an adaptive MCMC sampler and a couple of probability distributions, and that makes it relatively easy to implement simple Bayesian models in JavaScript.
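bayes.js is JavaScript, but the sampler at its heart is easy to sketch in a few lines of any language. Here is a plain random-walk Metropolis sampler in Python (non-adaptive, unlike bayes.js, and the standard-normal target is just a demo):

```python
import math
import random

random.seed(1)

def metropolis(log_post, init, n_samples, scale=1.0):
    """Random-walk Metropolis: propose a Gaussian step, accept with
    probability min(1, posterior ratio), otherwise stay put."""
    x = init
    lp = log_post(x)
    samples = []
    for _ in range(n_samples):
        prop = x + random.gauss(0.0, scale)
        lp_prop = log_post(prop)
        if math.log(random.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
        samples.append(x)
    return samples

# Target: standard normal, log density -x^2/2 (up to a constant)
draws = metropolis(lambda x: -0.5 * x * x, 0.0, 20000, scale=2.0)
kept = draws[2000:]  # drop burn-in
mean = sum(kept) / len(kept)
var = sum((d - mean) ** 2 for d in kept) / len(kept)
print(mean, var)
```

The chain's mean and variance settle near 0 and 1, the moments of the target; an adaptive sampler like the one in bayes.js additionally tunes `scale` on the fly.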

Deep learning: Turning data into artificial intelligence

A sense of excitement is brewing in Silicon Valley’s artificial intelligence (AI) community. The race to develop AI is on, and news of major breakthroughs is capturing our imagination as the technology steadily works its way into products we use every day. The race is unlike those we’re familiar with. In traditional markets, a company’s intellectual property is its comparative advantage. The formula for Coca-Cola, for instance, is worth millions and has never been made public. But in the AI community, intellectual property is far less tangible. Keeping AI methodology and technology secret is frowned upon and will earn a disqualification in most deep learning competitions (a popular pastime for enthusiasts and industry experts). Instead, the comparative advantage major tech companies, such as Google, Facebook and Microsoft, have over their peers is the enormous trove of information sitting in their data centres. From these data, they extract useful insights into the human condition and reach ever closer to AI. The process they use to extract those insights is known as deep learning.

Topic Modeling in R

As part of my Twitter data analysis series, I have so far completed movie review analysis and document classification using R. Today we will be discovering topics in tweets, i.e. mining the tweet data to uncover underlying topics, an approach known as topic modeling.
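The workhorse behind most topic modeling is Latent Dirichlet Allocation. As a rough illustration of what the R packages do under the hood, here is a tiny collapsed Gibbs sampler for LDA in Python; the four toy "tweets", the two-topic setting, and the hyperparameters are all made up for the demo:

```python
import random
from collections import defaultdict

random.seed(0)

# Toy, pre-tokenised "tweets"; a real run would use a cleaned tweet corpus
docs = [["rain", "weather", "radar"], ["movie", "review", "film"],
        ["rain", "storm", "weather"], ["film", "movie", "actor"]]
K, alpha, beta = 2, 0.1, 0.01
V = len({w for d in docs for w in d})

# Count tables for collapsed Gibbs sampling
ndk = [[0] * K for _ in docs]               # topic counts per document
nkw = [defaultdict(int) for _ in range(K)]  # word counts per topic
nk = [0] * K                                # total tokens per topic
z = []                                      # topic assignment of each token
for di, d in enumerate(docs):
    zd = []
    for w in d:
        k = random.randrange(K)             # random initial assignment
        zd.append(k)
        ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    z.append(zd)

for _ in range(200):                        # Gibbs sweeps
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            k = z[di][wi]                   # remove token from the counts
            ndk[di][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
            # sample a new topic in proportion to (doc-topic) * (topic-word)
            weights = [(ndk[di][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                       for j in range(K)]
            r = random.random() * sum(weights)
            for j, wt in enumerate(weights):
                r -= wt
                if r <= 0:
                    break
            z[di][wi] = j                   # restore counts with the new topic
            ndk[di][j] += 1; nkw[j][w] += 1; nk[j] += 1

top_words = [sorted(nkw[k], key=nkw[k].get, reverse=True)[:2] for k in range(K)]
print(top_words)
```

The top words per topic are the "discovered topics"; in R the same result comes from packages such as topicmodels, which the post builds on.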

12 Useful Pandas Techniques in Python for Data Manipulation

Python is fast becoming the preferred language for data scientists – and for good reasons. It provides the larger ecosystem of a programming language and the depth of good scientific computation libraries. If you are starting to learn Python, have a look at a learning path on Python. Among its scientific computation libraries, I found Pandas to be the most useful for data science operations. Pandas, along with Scikit-learn, provides almost the entire stack needed by a data scientist. This article focuses on providing 12 ways for data manipulation in Python. I’ve also shared some tips & tricks which will allow you to work faster. I would recommend that you look at the code for data exploration before going ahead. To help you understand better, I’ve taken a data set to perform these operations and manipulations.
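As a taste of the kind of techniques the article covers, here is a sketch of two common Pandas manipulations, missing-value imputation and a pivot table; the column names and values are made up for the demo:

```python
import pandas as pd

# Toy data with a missing loan amount
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "LoanAmount": [100.0, None, 150.0, 120.0],
})

# Technique 1: fill missing values with the column mean
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())

# Technique 2: pivot table of mean loan amount per gender
pivot = df.pivot_table(values="LoanAmount", index="Gender", aggfunc="mean")
print(pivot)
```

Chaining a handful of such one-liners is usually all the "data manipulation" a modeling pipeline needs before handing the frame to Scikit-learn.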

24 Uses of Statistical Modeling (Part II)

13. Simulations
14. Churn Analysis
15. Inventory management
16. Optimum Bidding
17. Optimum Pricing
18. Indexation
19. Search Engines
20. Cross-Selling
21. Clinical trials
22. Multivariate Testing
23. Queuing Systems
24. Supply Chain Optimization

Regression With Splines: Should we Care About non-Significant Components?

Following this morning’s course, I got a very interesting question from a student of mine. The question was about having non-significant components in a spline regression. Should we consider a model with a small number of knots and all components significant, or one with a (much) larger number of knots and a lot of non-significant components? My initial intuition was to prefer the second alternative, as with autoregressive models in R. When we fit an AR(6) model, it is not really a big deal if most coefficients (all but the last one) are not significant; it won’t affect the forecast much. So here, it might be the same. With a larger number of knots, we should be able to capture small bumps that we would never capture with a smaller number.
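The intuition is easy to check numerically. Here is a hedged sketch in Python (the post itself works in R) fitting a linear spline in the truncated power basis with few vs. many knots; the sine target, noise level, and knot placements are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, x.size)  # wiggly truth plus noise

def spline_design(x, knots):
    """Intercept, slope, and one hinge max(x - k, 0) per knot."""
    cols = [np.ones_like(x), x] + [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)

def rss(knots):
    """Residual sum of squares of the least-squares spline fit."""
    X = spline_design(x, knots)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ coef) ** 2))

rss_few = rss(np.linspace(1, 9, 3))        # 3 knots: stiff fit
rss_many = rss(np.linspace(0.5, 9.5, 15))  # 15 knots: captures the bumps
print(rss_few, rss_many)
```

The many-knot fit has a much smaller residual sum of squares even though most of its hinge coefficients would individually test as non-significant, which is exactly the trade-off the question is about.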

How Much Did It Rain? II, Winner’s Interview: 1st place, PuPa (aka Aaron Sim)

Aaron Sim took first place in our recent How Much Did It Rain? II competition. The goal of the challenge was to predict a set of hourly rainfall levels from sequences of weather radar measurements. Aaron and his research lab supervisor were in the midst of developing deep learning tools for their own research when the competition was launched. There was sufficient overlap in the statistical tools and datasets to make the competition a great ground for testing their approach on a new dataset. In this blog, Aaron shares his background, competition experience and methodology, and biggest takeaways (hint: Kaggle competitions are anything but covert). To read a more detailed technical analysis, take a look at his personal blog post on GitHub.

Extract Google Trends Data with Python

Anyone who has regularly worked with Google Trends data has had to deal with the slightly tedious task of grabbing keyword-level data and reformatting the spreadsheet provided by Google. After looking for a seamless way to pull the data, I came upon the PyTrends library on GitHub, and sought to put together some quick user-defined functions to manage the task of pulling daily and weekly trends data.

Multidimensional Scaling with R (from “Mastering Data Analysis with R”)

Feature extraction tends to be one of the most important steps in machine learning and data science projects, so I decided to republish a related short section from my intermediate book on how to analyze data with R. The 9th chapter is dedicated to traditional dimension reduction methods, such as Principal Component Analysis, Factor Analysis and Multidimensional Scaling; the introductory examples below focus on the latter.
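For readers who want to see the mechanics behind `cmdscale`-style classical MDS, here is a from-scratch sketch in Python with numpy (the book uses R): double-center the squared distance matrix and take the top eigenvectors. The five 2-D points are made up for the demo:

```python
import numpy as np

# Made-up points; in practice you would start from a distance matrix alone
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [3, 2]], dtype=float)
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1][:2]    # keep the two largest eigenvalues
coords = vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))

# For true Euclidean distances in 2-D, the configuration is recovered
# exactly (up to rotation/reflection), so pairwise distances match
D_hat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.allclose(D, D_hat))
```

This is the same computation R's `cmdscale` performs; the payoff is that MDS needs only dissimilarities, not the original features, which is what makes it useful as a dimension reduction and visualization tool.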

Some programming language theory in R

Let’s take a break from statistics and data science to think a bit about programming language theory, and how the theory relates to the programming language used in the R analysis platform (the language is technically called “S”, but we are going to just call the whole analysis system “R”).