Concept of Cluster Analysis in Data Science

What does your business do with the huge volumes of data it collects daily? Gathering, analyzing and reporting this information, and surfacing the most important findings, can be demanding and time-consuming, but clustering can support all of it. Clustering helps businesses manage their data better – image segmentation, grouping web pages, market segmentation, and information retrieval are four examples. For retail businesses, data clustering helps with customer shopping behaviour, sales campaigns, and customer retention. In the insurance industry, clustering is regularly employed in fraud detection, risk factor identification and customer retention efforts. In banking, it is used for customer segmentation, credit scoring and analyzing customer profitability. In this blog, we will understand cluster analysis in detail. We will also implement cluster analysis in Python and visualise the results at the end!
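To make the idea concrete before the full walkthrough, here is a minimal sketch of k-means, the workhorse of cluster analysis, in plain Python. The toy data and parameters are invented for illustration; a real analysis would typically use a library such as scikit-learn:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: repeatedly assign points to their nearest
    centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the centroid with the smallest squared distance to p
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster went empty
                centroids[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centroids, clusters

# Two obvious groups: one around (0, 0), one around (10, 10)
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
```

Each iteration alternates between assigning points and recomputing centroids – the same loop a library implementation runs, just without the optimizations.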

Foundations of Data Science

Computer science as an academic discipline began in the 1960s. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered finite automata, regular expressions, context-free languages, and computability. In the 1970s, the study of algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place and the focus is more on applications. There are many reasons for this change. The merging of computing and communications has played an important role. The enhanced ability to observe, collect, and store data in the natural sciences, in commerce, and in other fields calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks as central aspects of daily life presents both opportunities and challenges for theory.


If you are sampling data generated from a physical phenomenon, you will get noise. Noise can be added to the signal by the sensor measuring it, or it can be inherent to the stochasticity of the process that generates the data. I recently had to handle one such noisy data stream generated by a vehicle engine and needed to figure out a way to filter out the noise. Due to the physical nature of the signal generation process, the sampling frequency was not constant, thereby precluding any frequency-based noise filtering technique. I needed to find a way to filter out the noise and recreate the signal for further processing.
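With frequency-domain methods off the table, one simple time-domain option is a sliding median, which suppresses spikes without assuming a constant sampling rate. A minimal sketch (the window size and data below are made up for illustration, not the actual engine signal):

```python
def rolling_median(values, window=3):
    """Smooth a noisy sequence with a centered sliding median.
    It operates sample-by-sample on positions, not on time,
    so it does not require a constant sampling frequency."""
    half = window // 2
    out = []
    for i in range(len(values)):
        # sort the neighbourhood around sample i and take its middle element
        chunk = sorted(values[max(0, i - half):i + half + 1])
        out.append(chunk[len(chunk) // 2])
    return out

noisy = [1.0, 1.1, 9.0, 1.0, 0.9, 1.0, 1.1]  # one spike at index 2
smooth = rolling_median(noisy, window=3)     # spike is suppressed
```

Unlike a moving average, the median ignores outliers entirely instead of smearing them across neighbouring samples.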

Behind The Models: Cholesky Decomposition

André-Louis Cholesky is a bit of an oddity among mathematicians: his work was published posthumously, after he was killed in battle during WWI. He discovered the linear algebra method that carries his name through his work as a military map maker around the turn of the 20th century, but it remains an efficient trick that fuels many machine learning models. This article will discuss the mathematical underpinnings of the method and show two applications: linear regression and Monte-Carlo simulation.
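As a taste of the method itself, here is the Cholesky–Banachiewicz recurrence in plain Python: it factors a symmetric positive-definite matrix A into L·Lᵀ with L lower-triangular, which is the trick behind solving the normal equations in linear regression and generating correlated samples in Monte-Carlo simulation. The 2×2 matrix below is just a toy example:

```python
import math

def cholesky(A):
    """Cholesky-Banachiewicz: factor a symmetric positive-definite
    matrix A into L such that A = L * L^T, with L lower-triangular."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                # diagonal entry: square root of what remains
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                # off-diagonal entry: divide by the already-known diagonal
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

A = [[4.0, 2.0], [2.0, 3.0]]
L = cholesky(A)  # L = [[2.0, 0.0], [1.0, sqrt(2)]]
```

Because L is triangular, systems like L·y = b can then be solved by simple forward substitution, which is where the efficiency comes from.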

Implementing Knowledge Graphs in Enterprises – Some Tips and Trends

1. Don’t put the cart before the horse: realize that efficient data preparation (and thus interoperable standards) and data quality, especially in the enterprise environment, are a basic requirement for all applications of artificial intelligence.
2. The development of competences and experts in the field of artificial intelligence must take place at least in parallel with every technological decision, not at the end of the implementation of an AI strategy. Outsourcing must not be part of this strategy.
3. ‘Don’t boil the ocean’ – but small, agile, consecutive pilot projects alone are not enough to develop an AI strategy either. Parallel to the pilot phase, a more far-reaching strategy should be developed together with the management to promote cross-departmental, process-independent and data-driven decision-making and activities.
4. Projects based on knowledge graphs are more multidisciplinary than many may think. Accordingly, teams must be developed that cover expertise in database technologies, IT security, user experience, data visualization, knowledge modeling, data governance and compliance, etc. The specification and management of expectations at the beginning of new initiatives is therefore of utmost importance.
5. Make the difference clear: graph technologies are not just a slightly better search technology. Knowledge graphs can be used to address a large number of serious data management problems, and that should be the focus right from the start!

Deep Learning Explainability: Hints from Physics

Nowadays, artificial intelligence is present in almost every part of our lives. Smartphones, social media feeds, recommendation engines, online ad networks, and navigation tools are some examples of AI-based applications that already affect us every day. Deep learning in areas such as speech recognition, autonomous driving, machine translation, and visual object recognition has been systematically improving the state of the art for a while now. However, the reasons that make deep neural networks (DNNs) so powerful are only heuristically understood, i.e. we know only from experience that we can achieve excellent results by using large datasets and following specific training protocols. Recently, one possible explanation was proposed, based on a remarkable analogy between a physics-based conceptual framework called the renormalization group (RG) and a type of neural network known as a restricted Boltzmann machine (RBM).

A Complete Machine Learning Project Walk-Through in Python: Part One

Reading through a data science book or taking a course, it can feel like you have the individual pieces but don’t quite know how to put them together. Taking the next step and solving a complete machine learning problem can be daunting, but persevering and completing a first project will give you the confidence to tackle any data science problem. This series of articles will walk through a complete machine learning solution with a real-world dataset to let you see how all the pieces come together.

Estimators, Loss Functions, Optimizers – Core of ML Algorithms

In order to understand how a machine learning algorithm learns from data to predict an outcome, it is essential to understand the underlying concepts involved in training an algorithm. I assume you have a basic understanding of machine learning and also basic knowledge of probability and statistics. If not, please go through my earlier posts here and here. This post involves some theory and maths, so bear with me; once you read to the end, it will make complete sense as we connect the dots.
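As a preview of how the three pieces fit together, here is a deliberately tiny sketch, with data and hyperparameters invented for illustration: the estimator is a line y = w·x + b, the loss is mean squared error, and the optimizer is plain gradient descent:

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b by minimizing mean squared error
    with vanilla gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # gradients of MSE = (1/n) * sum((w*x + b - y)^2) w.r.t. w and b
        grad_w = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        # optimizer step: move against the gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Data generated from y = 2x + 1; gradient descent should recover w=2, b=1
w, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

Swap the estimator for a deep network, the loss for cross-entropy, and the optimizer for Adam, and the structure of the training loop stays exactly the same.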

What’s Linear About Logistic Regression

There are already plenty of amazing articles and videos on Logistic Regression, but it was a struggle for me to understand the connection between the probabilities and the linearity of Logistic Regression, so I figured I would document it here for myself and for those who might be going through the same thing. This will also shed some light on where the ‘Logistic’ part of Logistic Regression comes from!
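The punchline can be stated in a few lines of Python: the model's probability output p = σ(w·x + b) is non-linear in x, but taking the log-odds log(p / (1 − p)) recovers the linear part exactly. The numbers below are arbitrary:

```python
import math

def sigmoid(z):
    """Squash a real number into (0, 1) - the model's probability output."""
    return 1 / (1 + math.exp(-z))

def log_odds(p):
    """The logit function: inverse of the sigmoid."""
    return math.log(p / (1 - p))

w, b, x = 0.7, -1.2, 3.0
p = sigmoid(w * x + b)           # non-linear in x
recovered = log_odds(p)          # equals w*x + b, the linear part
assert abs(recovered - (w * x + b)) < 1e-9
```

So Logistic Regression is a linear model in log-odds space; the sigmoid merely translates that linear score into a probability.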

Decentralizing AI: Dreamers vs. Pragmatists

The decentralization of artificial intelligence (AI) is one of the most fascinating technology trends of the moment and one that can become the foundation of a sustainable path towards artificial intelligence. The emergence of trends such as federated learning, blockchain technologies and secure, encrypted computations has provided a viable technological path for the creation of decentralized AI applications. However, most of today’s applications of decentralized AI remain highly theoretical exercises or are constrained to very self-contained use cases. Despite the obvious benefits of the decentralization of machine knowledge, the path to its practical implementation is not trivial and it might very well not happen. Today, I would like to provide a pragmatic perspective on a possible path towards the adoption of decentralized AI technologies based on the current realities of the AI and blockchain ecosystems.

Understanding FAISS

A few weeks back, I stumbled upon FAISS – Facebook’s library for similarity search over very large datasets. My interest was piqued, and a few hours of digging around on the internet led me to a treasure trove of knowledge. In this post, I hope to pen down (or rather type out) a few basic concepts associated with the library.
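Before getting into the library itself, it helps to see the baseline FAISS accelerates: exact nearest-neighbour search by exhaustive scan, which is conceptually what its flat index (faiss.IndexFlatL2) computes; the other index types trade a little accuracy for far better speed and memory. A toy sketch in plain Python, with vectors invented for illustration:

```python
def nearest(query, vectors, k=2):
    """Exact nearest-neighbour search by brute force: compute the squared
    L2 distance from the query to every stored vector, keep the k closest.
    Returns the indices of the k nearest vectors, closest first."""
    def dist2(v):
        return sum((a - b) ** 2 for a, b in zip(query, v))
    ranked = sorted(range(len(vectors)), key=lambda i: dist2(vectors[i]))
    return ranked[:k]

db = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (9.0, 1.0)]
nearest((0.0, 0.1), db, k=2)  # → [0, 2]
```

This scan is O(n·d) per query, which is exactly why approximate indexes (quantization, inverted lists, graph-based search) exist for billion-scale datasets.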

Comparing Word Embeddings

We apply word embeddings because they have been shown to improve the quality of results in NLP / ML / AI tasks. A somewhat naïve conception is that they work by broadening the narrow path laid out by a word, sentence or document with a context averaged over a huge corpus of text. Commonly, word embeddings are generated using various implementations of Word2Vec or GloVe. For those who want to dig deeper, there exists a plethora of articles describing the mathematics behind their internal workings and underlying algorithms. In the following, let’s have a closer look at the continuous bag of words (CBOW) and skip-gram (SG) models implemented in Word2Vec. We already know that in training CBOW we try to predict a word from its context, while in training SG we try to predict the context words. As a consequence, CBOW fares better with big training data and more frequent words, while SG also works well with smaller amounts of training data and better represents the less frequent words. Moreover, SG takes much more time (roughly ‘window size’ times as long, by my estimate) to train than CBOW.
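Whichever model produced them, embeddings are usually compared with cosine similarity. A minimal sketch in plain Python, with made-up 3-dimensional vectors standing in for real trained embeddings (which would typically have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity: the angle-based measure commonly used
    to compare two word embeddings, ignoring their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical toy vectors: 'king' and 'queen' should point in
# similar directions, 'banana' in a different one.
king   = (0.9, 0.8, 0.1)
queen  = (0.85, 0.82, 0.15)
banana = (0.1, 0.2, 0.95)
assert cosine(king, queen) > cosine(king, banana)
```

Because cosine ignores vector length, a frequent word and a rare word can still be recognized as semantically close – one reason it is preferred over raw Euclidean distance for embeddings.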

Genetic algorithm vs. Backtracking: N-Queen Problem

A few months ago, I became familiar with genetic algorithms. I started to read about them and was pretty amazed. One of the most famous problems solved by genetic algorithms is the n-queen problem. I implemented my own genetic solver, plus the famous old backtracking solver, using Python 3. I implemented a Chess class (backtracking solver) and a GeneticChess class (genetic solver). Both classes have an attribute board, which is a two-dimensional list. Each row of the list is filled with N zeros.
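For reference, the backtracking half can be condensed to a few lines. This is a generic sketch, not the author's Chess class, and it represents a solution as one column index per row rather than a 2-D board of zeros and ones:

```python
def solve_n_queens(n):
    """Backtracking: place one queen per row, trying columns left to
    right; undo and try the next column whenever a placement is attacked.
    Returns a list where entry i is the queen's column in row i, or None."""
    cols = [None] * n

    def safe(row, col):
        # check the rows already filled for column or diagonal conflicts
        for r in range(row):
            c = cols[r]
            if c == col or abs(c - col) == row - r:
                return False
        return True

    def place(row):
        if row == n:
            return True  # all rows filled: solution found
        for col in range(n):
            if safe(row, col):
                cols[row] = col
                if place(row + 1):
                    return True
        return False  # no column works: caller will backtrack

    return cols if place(0) else None

solve_n_queens(4)  # → [1, 3, 0, 2]
```

The genetic solver explores the same search space stochastically, evolving boards toward fewer attacking pairs instead of enumerating placements exhaustively.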

How to Deploy Machine Learning Models

The deployment of machine learning models is the process of making your models available in production environments, where they can provide predictions to other software systems. It is only once models are deployed to production that they start adding value, making deployment a crucial step. However, there is complexity in the deployment of machine learning models. This post aims, at the very least, to make you aware of where this complexity comes from, and I’m also hoping it will provide you with useful tools and heuristics to combat it. If it’s code, step-by-step tutorials and example projects you are looking for, you might be interested in the Udemy course ‘Deployment of Machine Learning Models’. This post is not aimed at beginners, but don’t worry – you should be able to follow along if you are prepared to follow the various links. Let’s dive in…