Introductory guide to Information Retrieval using kNN and KDTree

I love cricket as much as I love data science. A few years back (on 16 November 2013, to be precise), my favorite cricketer, Sachin Tendulkar, retired from international cricket. I spent that entire day reading articles and blogs about him on the web, and by the end of the day I had read close to 50 of them. Interestingly, while I was reading these articles, none of the websites suggested articles to me outside of Sachin or cricket. Was it a coincidence? Surely not. I was being suggested the next article based on what I was currently reading. The technique behind this process is known as “Information Retrieval”. In this article, I will take you through the basics of Information Retrieval and two common algorithms used to implement it, kNN and KD-Tree. By the end of this article, you will be able to create your own information retrieval system, which can be implemented in any digital library or search engine. Let’s get going!
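To make the end goal concrete before we dive in, here is a minimal sketch of such a retrieval system in Python with scikit-learn: documents become TF-IDF vectors, and a KD-Tree-backed k-nearest-neighbor search returns the articles closest to a query. The toy corpus, query, and parameter choices below are illustrative assumptions, not part of any real system.

```python
# Minimal sketch: article retrieval with kNN over TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = [
    "Sachin Tendulkar retires from international cricket",
    "Deep learning frameworks compared for beginners",
    "Tendulkar farewell test match in Mumbai",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

# KD-Trees require dense input; fine for a toy corpus, costly at scale.
knn = NearestNeighbors(n_neighbors=2, algorithm="kd_tree")
knn.fit(X.toarray())

query = vectorizer.transform(["cricket retirement"]).toarray()
distances, indices = knn.kneighbors(query)
for dist, idx in zip(distances[0], indices[0]):
    print(f"{dist:.3f}  {corpus[idx]}")
```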


America’s ‘Retail Apocalypse’ Is Really Just Beginning

The so-called retail apocalypse has become so ingrained in the U.S. that it now has the distinction of its own Wikipedia entry. The industry’s response to that kind of doomsday description has included blaming the media for hyping the troubles of a few well-known chains as proof of a systemic meltdown. There is some truth to that. In the U.S., retailers announced more than 3,000 store openings in the first three quarters of this year.


Recommender Engine – Under The Hood

Many of us are bombarded with recommendations in our day-to-day lives, be it on e-commerce sites or social media. Some of these recommendations look relevant, but others provoke a range of emotions, from confusion to anger. There are basically two types of recommender systems: content-based and collaborative filtering. Both have their pros and cons, depending on the context in which you want to use them.
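As a rough illustration of the collaborative filtering side (the content-based side resembles the TF-IDF retrieval sketch earlier), here is a toy item-item recommender in Python; the ratings matrix is entirely made up.

```python
# Toy item-item collaborative filtering: items are deemed similar when
# the same users rated them similarly. All ratings below are invented.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items; 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
])

# Similarity between item columns.
item_sim = cosine_similarity(ratings.T)

# Recommend for user 0: score unrated items by similarity to rated ones.
user = ratings[0]
scores = item_sim @ user
scores[user > 0] = -np.inf  # do not re-recommend items already rated
print("Recommended item index:", int(np.argmax(scores)))
```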


Spark DataFrames: Exploring Chicago Crimes

This is the second blog post in the Spark tutorial series, designed to help big data enthusiasts prepare for Apache Spark certifications from companies such as Cloudera, Hortonworks, and Databricks. The first one is here. If you want to learn/master Spark with Python, or if you are preparing for a Spark certification to show your skills in big data, these articles are for you. Previously, we visualized thefts in Chicago using the ggplot2 package in R. In this tutorial, we will analyze crime data from the City of Chicago's open data portal. The dataset reflects reported incidents of crime (with the exception of murders, where data exists for each victim) that have occurred in the City of Chicago since 2001.
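For readers who want to see the shape of such an analysis, here is a minimal PySpark sketch; the file name and the "Primary Type" column are assumptions based on the public Chicago crimes dataset, not code from the tutorial itself.

```python
# Minimal PySpark sketch: load the crimes CSV and count incidents by type.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ChicagoCrimes").getOrCreate()

# File path and schema options are placeholders.
crimes = spark.read.csv("Chicago_Crimes.csv", header=True, inferSchema=True)

# Reported incidents by crime type, most frequent first.
(crimes.groupBy("Primary Type")
       .count()
       .orderBy("count", ascending=False)
       .show(10))
```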


Top 10 Deep Learning Frameworks

1. TensorFlow
2. Keras
3. Caffe
4. Torch
5. PyTorch
6. Deeplearning4j
7. MXNet
8. Microsoft Cognitive Toolkit
9. deeplearn.js
10. BigDL


GDPR: Achieving Compliance, Earning Trust

Following the privacy rules set out in the EU’s GDPR isn’t just about compliance; it shows customers and others that they can trust your company.


Compressing information through the information bottleneck during deep learning

I read an article in Quanta Magazine (“New Theory Cracks Open the Black Box of Deep Learning”) about a talk (see “18: Information Theory of Deep Learning” on YouTube) given a month or so ago by Professor Naftali (Tali) Tishby, on his theory that deep learning convolutional neural networks (CNNs) exhibit an “information bottleneck” during training. This bottleneck compresses the information present in, for example, an image, so that the network ends up working only with the relevant information. Tishby and his researchers took a simple AI problem (like recognizing a dog) and trained a deep learning CNN to perform the task. At the start of training, the nodes in the CNN's input layer were all connected to the next layer, those were all connected to the layer after that, and so on until the output layer. Essentially, the researchers found that during training, the CNN went from recognizing all features of an image to, over time, recognizing (processing?) only the relevant features of the image once successfully trained.
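For reference, the information bottleneck from Tishby's earlier work can be written as a trade-off between compressing the input X into an internal representation T and preserving information about the target Y, with β controlling the trade-off. This standard formulation comes from the information bottleneck literature, not from the Quanta article itself.

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

Here I(·;·) denotes mutual information: a small I(X;T) means T has compressed away most of X, while a large I(T;Y) means T still predicts Y well.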


Information Retrieval Document Search Using Vector Space Model in R

In this post, we learn how to build a basic search engine, or document retrieval system, using the vector space model. This approach is widely used in information retrieval systems. Given a set of documents and a search query, we need to retrieve the relevant documents that are most similar to the query.
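The post implements this in R; as a language-neutral illustration, here is the same vector space model idea sketched in Python, ranking documents by cosine similarity to the query. The documents and query are placeholders.

```python
# Vector space model sketch: rank documents by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "information retrieval with vector space models",
    "survival analysis for business analytics",
    "document search engines rank results by similarity",
]
query = ["vector space document search"]

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(docs)
query_vec = vectorizer.transform(query)

# Cosine similarity of the query against every document, best first.
scores = cosine_similarity(query_vec, doc_vecs).ravel()
for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.3f}  {docs[i]}")
```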


Survival Analysis for Business Analytics

We compare survival analysis to other predictive techniques and provide examples of how it can produce business value, with a focus on the Kaplan-Meier and Cox regression methods, which have been underutilized in business analytics.
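As a rough sketch of what these two methods look like in practice, here is a minimal Python example using the lifelines library; the tiny churn-style dataset is invented for illustration.

```python
# Kaplan-Meier and Cox regression with lifelines; the data is made up.
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter

df = pd.DataFrame({
    "tenure_months": [3, 12, 7, 24, 18, 5, 30, 9],
    "churned":       [1,  0, 1,  0,  1, 1,  0, 0],  # 1 = event observed
    "monthly_spend": [20, 55, 25, 80, 40, 15, 90, 60],
})

# Kaplan-Meier: non-parametric estimate of the survival curve.
km = KaplanMeierFitter()
km.fit(df["tenure_months"], event_observed=df["churned"])
print(km.survival_function_)

# Cox regression: effect of covariates on the hazard of churning.
cox = CoxPHFitter()
cox.fit(df, duration_col="tenure_months", event_col="churned")
cox.print_summary()
```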


October Kaggle Dataset Publishing Awards Winners’ Interview

This interview features the stories and backgrounds of the October winners of our $10,000 Datasets Publishing Award: Zeeshan-ul-hassan Usmani, Etienne Le Quéré, and Felipe Antunes.


Quick Round-Up – Visualising Flows Using Network and Sankey Diagrams in Python and R

Got some data relating to how students move from one module to another. Rows are student ID, module code, and presentation date. The flow is not well behaved: students may take multiple modules in one presentation, and modules may be taken in any order (which means there are loops…).
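For the Python side of the round-up, a minimal Sankey sketch with plotly might look like the following; the module names and student counts are invented stand-ins for the real data.

```python
# Minimal Sankey diagram with plotly; labels and values are placeholders.
import plotly.graph_objects as go

labels = ["Module A", "Module B", "Module C"]
fig = go.Figure(go.Sankey(
    node=dict(label=labels, pad=20),
    link=dict(
        source=[0, 0, 1, 2],    # indices into labels
        target=[1, 2, 2, 0],    # 2 -> 0 closes a loop, as in the data
        value=[30, 12, 18, 5],  # number of students on each path
    ),
))
fig.show()
```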


Customer Analytics: Using Deep Learning With Keras To Predict Customer Churn

Customer churn is a problem that all companies need to monitor, especially those that depend on subscription-based revenue streams. The simple fact is that most organizations have data that can be used to target these individuals and to understand the key drivers of churn. And we now have Keras for deep learning available in R (yes, in R!!); our model predicted customer churn with 82% accuracy. We're super excited about this article because we use the new keras package to build an Artificial Neural Network (ANN) on the IBM Watson Telco Customer Churn Data Set! As with most business problems, it's equally important to explain which features drive the model, which is why we'll use the lime package for explainability; we cross-checked the LIME results with a correlation analysis using the corrr package. We're not done yet: we also use three new packages to assist with machine learning (ML): recipes for preprocessing, rsample for sampling data, and yardstick for model metrics. These are relatively new additions to CRAN, developed by Max Kuhn at RStudio (creator of the caret package). It seems that R is quickly developing ML tools that rival Python's. Good news if you're interested in applying deep learning in R! We are, so let's get going!!
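The post builds its ANN with the keras package in R; for orientation, a comparable minimal sketch in Python Keras might look like this. The layer sizes, dropout rates, and stand-in data are illustrative guesses, not the post's actual architecture.

```python
# Minimal churn ANN sketch in Python Keras; data below is random filler.
import numpy as np
from tensorflow import keras

n_features = 35                               # e.g. one-hot encoded columns
X_train = np.random.rand(1000, n_features)    # stand-in for real features
y_train = np.random.randint(0, 2, size=1000)  # 1 = churned

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(n_features,)),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(1, activation="sigmoid"),  # churn probability
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
```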