Evaluating Machine Learning Models for Fairness and Bias

Evaluating machine learning models for bias is becoming an increasingly common focus across industries and among data researchers. Model fairness is a relatively new subfield of machine learning. In the past, the study of discrimination emerged from analyzing human-driven decisions and the rationale behind them. Now that we rely on predictive ML models to make decisions in industries such as insurance and banking, we need strategies to ensure the fairness of those models and to detect any discriminatory behaviour in their predictions.
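As a concrete illustration (my own sketch, not taken from the article), one simple fairness check is demographic parity: comparing the rate of positive predictions across groups. A minimal example on made-up decision data:

```python
import numpy as np

# Hypothetical binary decisions (1 = approved) for two demographic groups
group = np.array([0] * 50 + [1] * 50)
pred = np.array([1] * 30 + [0] * 20 + [1] * 20 + [0] * 30)

# Demographic parity compares positive-prediction rates across the groups
rate_g0 = pred[group == 0].mean()    # 30/50 = 0.6
rate_g1 = pred[group == 1].mean()    # 20/50 = 0.4
parity_gap = abs(rate_g0 - rate_g1)  # 0.2; a gap of 0 would mean parity
print(parity_gap)
```

A nonzero gap does not prove discrimination by itself, but it is the kind of signal such evaluations surface for further investigation.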

Generating Synthetic Classification Data using Scikit

Data generators help us create data with different distributions and profiles to experiment on. If you are testing various algorithms and want to find out which one works in which cases, these generators can produce case-specific data on which to test them. For example, suppose you want to check whether gradient boosting trees can do well given just 100 data points and 2 features. You could search for a dataset with 100 data points, or use a dataset you are already working on. But how would you know whether the classifier was a good choice, given that you have so little data, and that cross-validation and testing still leave a fair chance of overfitting? Instead, you could use generated data and see what usually works well in such a case: a boosting algorithm or a linear model.
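As a sketch of the idea (assuming scikit-learn's `make_classification`, which the article's title points to), here is the exact 100-point, 2-feature scenario described above:

```python
from sklearn.datasets import make_classification

# 100 samples, 2 features, binary target: the small-data scenario
# described above for stress-testing gradient boosting trees
X, y = make_classification(
    n_samples=100,
    n_features=2,
    n_informative=2,
    n_redundant=0,    # with only 2 features, both must be informative
    random_state=42,
)
print(X.shape, y.shape)  # (100, 2) (100,)
```

Changing `n_samples`, `n_features`, or the class separation lets you probe exactly the regime you care about before committing to an algorithm.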

Speed up predictions on low-power devices using Neural Compute Stick and OpenVINO

The Neural Compute Stick, by Intel, can accelerate TensorFlow neural network inference on the edge, improving performance by a factor of 10.
In this article, we will explore the procedure required to:
1. Convert a TensorFlow model to an NCS-compatible one, using Intel's OpenVINO Toolkit
2. Install a light version of OpenVINO on a Raspberry Pi, to run inference on board
3. Test and deploy the converted model on the Raspberry Pi

The Analytics Translator: The Must-Have Role for AI-Driven Organizations

» Many organizations have not seen a return on their investment after developing their data and AI capabilities.
» It’s imperative to account for all phases of an AI solution’s life cycle: find the right business problems to solve in the Ideation phase, discover whether there is a viable business model in the Experimentation phase, and scale up in the Industrialization phase.
» Actively involving the business in every step of the process and putting them in the driver’s seat is a critical element to success with data and AI.
» The analytics translator enables the execution of your company’s AI strategy by finding the right use cases, liaising between business and data experts, and embedding AI solutions into your organization.
» To be successful, an analytics translator needs deep business understanding.

Towards Automatic Text Summarization: Extractive Methods

For anyone who has done academic writing, summarization – the task of producing a concise and fluent summary while preserving the key information and overall meaning – was, if not a nightmare, then a constant challenge bordering on guesswork about what the professor would find important. Though the basic idea looks simple – find the gist, cut out opinion and detail, and write a couple of perfect sentences – the task inevitably ended in toil and turmoil.
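To make the "find the gist" idea concrete, here is a minimal frequency-based extractive sketch (my own illustration, not the article's code): score each sentence by the average corpus frequency of its words, then keep the top scorers in their original order.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Keep the n_sentences whose words are most frequent in the text,
    returned in original order -- a crude extractive 'find the gist'."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))

    def score(s):
        words = re.findall(r'\w+', s.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]), reverse=True)[:n_sentences]
    return ' '.join(sentences[i] for i in sorted(top))

text = ("Summarization keeps key information. "
        "Summarization removes opinion and detail. "
        "My professor liked coffee.")
print(extractive_summary(text))
```

Real extractive methods covered in the article are more sophisticated (graph-based ranking, embeddings, and so on), but they share this skeleton: score sentences, select the best, preserve order.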

Clean a complex dataset for modelling with recommendation algorithms

Recently I wanted to learn something new and challenged myself to carry out an end-to-end Market Basket Analysis. To continue to challenge myself, I’ve decided to put the results of my efforts before the eyes of the data science community. And what better forum for my first ever series of posts than one of my favourite data science blogs!

Unit Tests in R

I am collecting here some notes on testing in R. There seems to be a general (false) impression among non-R-core developers that to run tests, R package developers need a test management system such as RUnit or testthat, and a further false impression that testthat is the only R test management system. This is not true: R itself has a capable testing facility in ‘R CMD check’ (a command that triggers R’s checks from outside any given integrated development environment).

How to Automatically Determine the Number of Clusters in your Data – and more

Determining the number of clusters when performing unsupervised clustering is a tricky problem. Many data sets don’t exhibit well-separated clusters, and two human beings asked to visually tell the number of clusters from a chart are likely to provide two different answers. Sometimes clusters overlap, and large clusters contain sub-clusters, making the decision difficult. For instance, how many clusters do you see in the picture below? What is the optimum number of clusters? No one can tell with certainty: not AI, not a human being, not an algorithm.
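The article's own method isn't reproduced here, but a common baseline it improves upon is the elbow heuristic: run k-means for increasing k and watch where the within-cluster sum of squares (WSS) stops improving. A self-contained sketch with a tiny NumPy k-means on deliberately well-separated data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two clearly separated blobs -> the 'true' number of clusters is 2
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(5.0, 0.3, (50, 2))])

def kmeans_wss(X, k, iters=25, seed=0):
    """Basic k-means; returns the within-cluster sum of squares."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

wss = {k: kmeans_wss(X, k) for k in (1, 2, 3)}
# WSS collapses going from k=1 to k=2, then barely improves: the elbow is at 2
print(wss)
```

On messy real data the elbow is often ambiguous, which is precisely the problem the article addresses with automatic approaches.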

Visually explore Probability Distributions with vistributions

We are happy to introduce the vistributions package, a set of tools for visually exploring probability distributions.

Zotero hacks: unlimited synced storage and its smooth use with rmarkdown

Here is a slightly refreshed translation of my 2015 blog post, originally published on the Russian blog platform habr.com. The post shows how to organize a personal academic library of unlimited size for free. It is a funny case of a self-written manual that I have come back to multiple times myself, and have referred friends to many more times, even non-Russian speakers who had to use Google Translate and infer the rest from the screenshots. Finally, I decided to translate it, adding some basic information on how to use Zotero with rmarkdown.

Variance decomposition and price segmentation in Insurance

On the poor performance of classifiers in insurance models

Each time we have a case study with real data in my actuarial courses, students are surprised to have a hard time getting a ‘good’ model, and they are always surprised by the low AUC when trying to model the probability of claiming a loss, dying, committing fraud, etc. And each time, I keep saying, ‘yes, I know, and that’s what we expect, because there is a lot of randomness in insurance’. To be more specific, I decided to run some simulations and compute AUCs to see what’s going on. And because I don’t want to waste time fitting models, we will assume each time that we have a perfect model. I want to show that the upper bound on the AUC is actually quite low! So it’s not a modeling issue, it is a fundamental issue in insurance!
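To see the point numerically, here is a small simulation in the spirit of the post (in Python rather than the author's R, with an arbitrary uniform spread of claim probabilities): even a model that knows each policyholder's true claim probability exactly gets a modest AUC, because the outcome itself is random.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
# True (and known!) claim probabilities -- a plausible but arbitrary range
p = rng.uniform(0.02, 0.12, size=n)
y = rng.random(n) < p            # actual claims drawn from those probabilities

# The 'perfect model' scores each risk with its true probability p.
# AUC via the rank-sum (Mann-Whitney) formula:
ranks = np.empty(n)
ranks[np.argsort(p)] = np.arange(1, n + 1)
n_pos = y.sum()
n_neg = n - n_pos
auc = (ranks[y].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(round(auc, 3))  # roughly 0.62 -- far from 1, despite a perfect model
```

The narrower the spread of true probabilities, the closer this upper bound falls to 0.5, which is exactly the "randomness in insurance" point above.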

5 Amazing Deep Learning Frameworks Every Data Scientist Must Know! (with Illustrated Infographic)

Table of Contents
1. What is a Deep Learning Framework?
2. TensorFlow
3. Keras
4. PyTorch
5. Caffe
6. Deeplearning4j
7. Comparing these Deep Learning Frameworks

Better Parallelization with Numba

Based on a geocoordinate problem posed on Stack Overflow, I implemented solutions using Numba: 500x faster on multiple cores, and 7,500x faster on a GPU (RTX 2070).
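The post's exact code isn't reproduced here; as an illustration of the pattern, here is a haversine distance kernel using Numba's `njit(parallel=True)` and `prange` (with a plain-Python fallback so the sketch still runs if Numba isn't installed):

```python
import math
import numpy as np

try:
    from numba import njit, prange           # JIT-compile and parallelize the loop
except ImportError:                          # fallback: run as ordinary Python
    prange = range
    def njit(*args, **kwargs):
        return args[0] if args and callable(args[0]) else (lambda f: f)

@njit(parallel=True)
def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between paired coordinate arrays."""
    n = lat1.shape[0]
    out = np.empty(n)
    for i in prange(n):                      # iterations spread across cores
        rlat1 = math.radians(lat1[i])
        rlat2 = math.radians(lat2[i])
        dlat = rlat2 - rlat1
        dlon = math.radians(lon2[i] - lon1[i])
        a = (math.sin(dlat / 2.0) ** 2
             + math.cos(rlat1) * math.cos(rlat2) * math.sin(dlon / 2.0) ** 2)
        out[i] = 2.0 * 6371.0 * math.asin(math.sqrt(a))
    return out

# Paris to London: roughly 344 km
d = haversine_km(np.array([48.8566]), np.array([2.3522]),
                 np.array([51.5074]), np.array([-0.1278]))
print(d[0])
```

The explicit per-element loop looks un-Pythonic, but it is exactly the shape Numba compiles well; the speedups quoted above come from replacing vectorized NumPy broadcasting with a compiled parallel loop.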

Why Does Artificial Intelligence Need to Breathe on Blockchain?

Consider placing an AI bot on a blockchain and initiating a phase of deep learning: what would the end result be? Would it be detrimental to the survival of the human race, or would it lead to a never-ending loop that removes third parties from transactions, making it easier for everyone to procure goods and services? In theory, blockchain and AI combine to create a foundation that can foster change in current methods of transacting. Adoption has been sluggish, however, and the determinants of the adoption rate have more to do with human adaptability within the financial culture, along with the complexity of conventional ways of transacting.