Data-Science Observability For Executives
Observability for data science (DS) is a new and emerging field, sometimes mentioned in tandem with MLOps or AIOps. Young startups are developing new offerings to address the lack of monitoring and alerting for everything data science. However, these are mostly aimed at data scientists and engineers, who are, of course, the first personas to feel the pain of managing multiple models. I will argue that data-science observability should instead be aimed at decision-makers such as high- and mid-level managers: the people who are responsible for spending, funding, and managing the data-science operation, and who, most importantly, are accountable for its impact on the company's clients, business, product, sales, and, let's not forget, the company's bottom line.
An Introduction to Discretization Techniques for Data Scientists
Discretization is the process through which we transform continuous variables, models, or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that span the range of our desired variable/model/function.
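As a quick illustration (not from the article itself), here is how equal-width and equal-frequency binning look in pandas, using a made-up column of ages:

```python
import numpy as np
import pandas as pd

# Hypothetical continuous variable: 1,000 ages drawn uniformly from 0-100.
ages = pd.Series(np.random.uniform(0, 100, size=1000))

# Equal-width binning: split the range into 5 contiguous intervals.
equal_width = pd.cut(ages, bins=5)

# Equal-frequency (quantile) binning: each bin holds roughly 200 observations.
equal_freq = pd.qcut(ages, q=5)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```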
Traditional AI vs. Modern AI
The evolution of Artificial Intelligence and the new wave of ‘Future AI’
Automate a Data Science Workflow – Movie Reviewer Sentiment Analysis
I’m very resistant to point-and-click solutions, and I think my resistance is in good faith and for good reasons. We’ve waited a long time for drag-and-drop solutions for web apps, mobile apps, and a whole host of other things. But fundamentally, I think a solution like Knime is perfect for letting the user introduce just the right amount of flexibility and simplification. For me, boxing up all the steps of a data-science workflow has brought a whole new level of organization to my projects.
Catch Me if You Can: Outlier Detection
Outlier detection is an interesting data-mining task used quite extensively to detect anomalies in data. Outliers are points that exhibit significantly different properties from the majority of the points. Outlier detection therefore has very interesting applications, such as credit card fraud detection (suspicious transactions), traffic management (drunk/rash driving), or network intrusion detection (hacks). Due to the time-critical nature of these applications, there is a need for scalable outlier detection techniques. In this project, we will aim to detect outliers in a taxi dataset (Beijing) using a technique that relies only on spatio-temporal characteristics and scales to very large datasets. We will be using the geo-coordinates and timestamps collected by the GPS units on these taxis.
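To make the idea concrete, here is a minimal sketch of one spatio-temporal check: flagging segments whose implied speed is physically implausible. The trajectory and the 150 km/h threshold are made up, and the project's actual technique may differ:

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

# Hypothetical single-taxi trajectory: timestamp plus GPS coordinates.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 08:00", "2024-01-01 08:05",
         "2024-01-01 08:10", "2024-01-01 08:15"]),
    "lat": [39.90, 39.91, 40.50, 39.93],   # third point jumps ~65 km away
    "lon": [116.40, 116.41, 116.42, 116.43],
}).sort_values("timestamp")

# Implied speed between consecutive GPS fixes.
dist = haversine_km(df["lat"], df["lon"],
                    df["lat"].shift(), df["lon"].shift())
hours = df["timestamp"].diff().dt.total_seconds() / 3600
df["speed_kmh"] = dist / hours

# Flag segments faster than any plausible taxi.
df["outlier"] = df["speed_kmh"] > 150
print(df[["timestamp", "speed_kmh", "outlier"]])
```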
BERT Visualization in Embedding Projector
This story shows how to visualize pre-trained BERT embeddings in TensorFlow’s TensorBoard Embedding Projector. It uses around 50 unique sentences and their BERT embeddings generated with TensorFlow Hub BERT models.
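For a sense of the mechanics, the Embedding Projector accepts plain TSV files: one of tab-separated vectors, one of matching labels. The sketch below assumes the BERT sentence vectors have already been computed (a random array stands in for them here):

```python
import numpy as np

# Placeholders: `embeddings` would be an (n_sentences, 768) array of BERT
# sentence vectors computed elsewhere (e.g., via a TensorFlow Hub model),
# and `sentences` the matching raw strings.
sentences = ["The cat sat on the mat.", "Stocks fell sharply today."]
embeddings = np.random.rand(len(sentences), 768)  # stand-in for real vectors

# One row of tab-separated floats per vector.
with open("vectors.tsv", "w") as f:
    for vec in embeddings:
        f.write("\t".join(str(x) for x in vec) + "\n")

# One label per row, in the same order as the vectors.
with open("metadata.tsv", "w") as f:
    for s in sentences:
        f.write(s + "\n")
```

Both files can then be uploaded directly at projector.tensorflow.org, or logged via TensorBoard's projector plugin.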
Multi-Label Text Classification
In the case of binary classification, we just ask a yes/no question. If there are multiple possible answers and only one can be chosen, it’s multiclass classification. In our example we can’t really select only one label; I would say all of them match the photo. The goal of multi-label classification is to assign a set of relevant labels to a single instance. However, most widely known algorithms are designed for single-label classification problems. This article describes four approaches to multi-label classification available in the scikit-multilearn library and introduces a sample analysis.
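As a taste of the library, here is one of the four approaches, Binary Relevance, which trains an independent binary classifier per label; the toy data is made up:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from skmultilearn.problem_transform import BinaryRelevance

# Toy data: 6 instances, 4 features, 3 possible labels (multi-hot targets).
X = np.random.rand(6, 4)
y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 0],
              [0, 1, 1]])

# Binary Relevance: fit one independent GaussianNB per label column.
clf = BinaryRelevance(classifier=GaussianNB())
clf.fit(X, y)
print(clf.predict(X).toarray())  # predictions come back as a sparse matrix
```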
Feature correlations: data leakage, confounded features and other things that can make your Deep Learning model fail
As the plot the boss is showing seems to suggest, the more employees have shaved heads, the more the company’s sales increase. If you were that boss, would you take the same action on your employees? Probably not. You recognize that there is no causality between the two sets of events, and that their behaviour is similar just by chance. More plainly: the shaved heads do not cause the sales. So we have just spotted the existence of at least two possible categories of correlations: without and with causality. We also agreed that only the second one is interesting, while the other is useless, if not misleading. But let’s dive deeper.
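A tiny experiment shows how easily such causality-free correlations arise: two completely independent random walks (standing in for "shaved heads" and "sales") often correlate strongly, even though their step-to-step changes do not:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two completely independent random walks.
heads = np.cumsum(rng.normal(size=200))
sales = np.cumsum(rng.normal(size=200))

# Trending series frequently show large correlations despite having no
# causal (or even statistical) relationship between their increments.
print(np.corrcoef(heads, sales)[0, 1])                     # often far from 0
print(np.corrcoef(np.diff(heads), np.diff(sales))[0, 1])   # near 0
```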
Talking with BERT
The growth of knowledge and research around language models has been amazing in the past few years. For BERT especially, we have seen some incredible uses of this massive pre-trained language model on tasks like text classification, prediction, and question answering. I’ve recently written about research into some of the limitations of BERT on certain language tasks. I also did some testing of my own, creating a question-answering system to get a feel for how it could be used. It has been great to see, and to try in practice, some of the many capabilities of language models.
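For readers who want to try the same kind of experiment, here is a minimal question-answering sketch using the Hugging Face transformers library (not necessarily the author's setup); the default pipeline model is a BERT-style model fine-tuned for extractive QA:

```python
from transformers import pipeline

# Downloads a default pre-trained question-answering model on first use.
qa = pipeline("question-answering")

context = ("BERT is a pre-trained language model released by Google in "
           "2018. It is widely used for text classification and question "
           "answering.")
result = qa(question="Who released BERT?", context=context)
print(result["answer"], result["score"])  # extracted span and confidence
```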
SQL vs noSQL: Two Approaches to ETL Applications
In this article I will explore the differences between SQL and noSQL ETL pipelines. The focus will be on the transform and load steps: that is, what happens once the data has been read into the application. The extraction part is simple; it involves reading files and some basic data wrangling. By comparing how the datasets are divided post-extraction, you can see how the choice of database impacts the application architecture. You can also identify the strengths and weaknesses of choosing particular databases.
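As a rough sketch of the architectural difference (the record and table names are made up, and serialized JSON stands in for a real document store):

```python
import json
import sqlite3

# Illustrative record after extraction.
record = {"id": 1, "user": "alice", "tags": ["etl", "demo"]}

# SQL-style load: normalize the nested list into a child table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, user TEXT)")
con.execute("CREATE TABLE tags (user_id INTEGER, tag TEXT)")
con.execute("INSERT INTO users VALUES (?, ?)", (record["id"], record["user"]))
con.executemany("INSERT INTO tags VALUES (?, ?)",
                [(record["id"], t) for t in record["tags"]])

# noSQL-style load: store the whole document as-is (serialized JSON here,
# standing in for an insert into a document store such as MongoDB).
document = json.dumps(record)
print(document)
```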
Why Is the Kernelized Support Vector Machine (SVM) ML’s Most Beautiful Algorithm?
Machine learning has more than a few beautiful algorithms that are helping data scientists and researchers transform business models and societies altogether. It offers both supervised and unsupervised algorithms that can be trained according to the requirements of the problem. Even though a majority of new-age machine learning applications are moving towards deep learning theories and neural networks, a lot can be done with existing algorithms. One class of such beautiful machine learning algorithms is the support vector machine. Even though people have used them less since the advent of neural networks, they still have a lot of scope in research and in answering complex problems. The beauty of support vector machines lies in the fact that they can be reproduced using maximum likelihood estimates and understood in terms of a Bayesian classifier: tweaks to the optimal Bayesian classifier, with an appropriate assumption of a prior, can give you a classic support vector machine. Support vector machines are a supervised learning algorithm when it comes to classification, but as research progresses, their scope for unsupervised clustering methods is being explored.
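To see the kernel trick in action, here is a short scikit-learn example on a non-linearly-separable toy dataset:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A toy problem no linear separator can solve well.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps points into a high-dimensional space
# where a linear separator corresponds to a curved boundary here.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```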
A Creative Approach Towards Feature Selection
Feature selection is one of the most important parts of feature engineering. We reduce the number of features so that we can interpret the model better, make training less computationally stressful, remove redundant effects, and help the model generalise better. In some cases feature selection becomes extremely important, because otherwise the input dimensional space is too big, making it difficult for the model to train.
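As a simple baseline (not necessarily the article's creative approach), scikit-learn's univariate selection keeps the k features with the strongest ANOVA F-scores:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Keep the 10 features whose univariate F-score against the target is highest.
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)
print(selector.get_support(indices=True))  # indices of the kept features
```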
Exploratory Data Analysis… A Topic That Is Neglected in Data Science Projects
Exploratory Data Analysis (EDA), an approach developed by John Tukey in the 1970s, is the first step in your data analysis process. In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. As the name itself suggests, it is the step in which we explore the data set.
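A typical first pass at EDA in pandas might look like the following, assuming a hypothetical file named data.csv:

```python
import pandas as pd

df = pd.read_csv("data.csv")

print(df.shape)          # rows and columns
df.info()                # dtypes and non-null counts (prints directly)
print(df.describe())     # summary statistics for numeric columns
print(df.isna().sum())   # missing values per column
print(df.head())         # a quick look at the raw records
```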
A Comprehensive Guide To Data Imputation
In the real world, missing data is a nearly inevitable problem. Only a special few can swerve it, usually through large investments in data collection. This issue is crucial because the way we handle missing data has a direct impact on our findings, and it also eats into time management. Therefore, it should always be a priority to handle missing data properly, which can be much harder than it seems. The difficulty arises as we realize that not all missing data is created equal just because it all looks the same (a blank space), and that different types of missing data must be handled differently. In this article, we review the types of missing data, as well as basic and advanced methods to tackle them.
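As an example of the basic end of the spectrum, scikit-learn's SimpleImputer replaces each missing value with a column statistic; the toy matrix is made up:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with missing entries encoded as NaN.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Mean imputation: fill each missing value with its column mean.
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))
```

More advanced methods (multivariate or model-based imputation) exist precisely because a single column statistic ignores how the missingness arose.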
Build Data Pipelines with Apache Airflow
Originally created at Airbnb in 2014, Airflow is an open-source data orchestration framework that allows developers to programmatically author, schedule, and monitor data pipelines. Airflow experience is one of the most in-demand technical skills for Data Engineering (another is Oozie), as it is listed as a requirement in many Data Engineer job postings. In this blog post, I will explain Airflow’s core concepts and workflow creation, with source code examples to help you create your first data pipeline.
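As a preview, a minimal Airflow 2.x DAG might look like the following; the task names and schedule are made up for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting...")


def load():
    print("loading...")


# A DAG groups tasks and defines when and in what order they run.
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # run extract before load
```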