Transfer learning & The art of using Pre-trained Models in Deep Learning

Neural networks are a different breed of model compared to other supervised machine learning algorithms. Why do I say so? There are multiple reasons, but the most prominent is the cost of running the algorithms on hardware.

Building Trust in Machine Learning Models (using LIME in Python)

More and more companies are now aware of the power of data. Machine Learning models are increasing in popularity and are now being used to solve a wide variety of business problems. Having said that, there is always a trade-off between a model’s accuracy and its interpretability. In general, to improve accuracy, data scientists have to resort to complicated algorithms like Bagging, Boosting, Random Forests, etc., which are “black-box” methods. It is no wonder that many of the winning entries in Kaggle or Analytics Vidhya competitions use algorithms like XGBoost, where there is no requirement to explain to a business user how the predictions are generated. On the other hand, in a business setting, simpler and more interpretable models like Linear Regression, Logistic Regression, Decision Trees, etc. are used even if their predictions are less accurate. This situation has got to change: the trade-off between accuracy and interpretability is not acceptable. We need to find ways to use powerful black-box algorithms even in a business setting and still be able to explain the logic behind the predictions intuitively to a business user. With increased trust in predictions, organisations will deploy machine learning models more extensively within the enterprise. The question is: “How do we build trust in Machine Learning models?”

A Big Data Cheat Sheet: From Narrow AI to General AI

By now you must have noticed that Artificial Intelligence (AI) is a buzzword even outside the tech industry. Dozens, maybe hundreds, of startups globally present themselves as AI companies. In other discussions, you might come across claims that humanity has not yet discovered AI. How can both of these claims be true?
Let’s take a deeper dive into what AI actually means in 2017 and the significance of related terms such as:
• Narrow AI
• Artificial General Intelligence (AGI)
• Conscious AI

The machine learning paradox

To train a machine learning system, you start with a lot of training data: millions of photos, for example. You divide that data into a training set and a test set. You use the training set to ‘train’ the system so it can identify those images correctly. Then you use the test set to see how well the training works: how good is it at labeling a different set of images? The process is essentially the same whether you’re dealing with images, voices, medical records, or something else. It’s essentially the same whether you’re using the coolest and trendiest deep learning algorithms, or whether you’re using simple linear regression. But there’s a fundamental limit to this process, pointed out in Understanding Deep Learning Requires Rethinking Generalization. If you train your system so it’s 100% accurate on the training set, it will always do poorly on the test set and on any real-world data. It doesn’t matter how big (or small) the training set is, or how careful you are. 100% accuracy means that you’ve built a system that has memorized the training set, and such a system is unlikely to identify anything that it hasn’t memorized. A system that works in the world can’t be completely accurate on the training data, but by the same token, it will never be perfectly accurate in the real world, either.
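
The split-and-evaluate procedure described above can be sketched in a few lines of Python. This is only a minimal illustration on synthetic data, not something taken from the article: the data generator, the 20% label noise, the 80/20 split, and both toy models (`memorizer` and `threshold_rule`) are assumptions made up for the example.

```python
import random


def make_data(n, seed):
    """Synthetic points: the label is 1 when x > 0.5, then 20% of labels are flipped."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:  # label noise (assumed rate, for illustration only)
            label = 1 - label
        data.append((x, label))
    return data


def accuracy(predict, data):
    """Fraction of (x, label) pairs that predict() labels correctly."""
    return sum(predict(x) == label for x, label in data) / len(data)


data = make_data(1000, seed=0)
train, test = data[:800], data[800:]  # 80/20 split (assumed ratio)


def memorizer(x):
    """Return the label of the closest training point (1-nearest neighbour)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def threshold_rule(x):
    """A simple rule that ignores the noise in individual training labels."""
    return int(x > 0.5)


for name, model in [("memorizer", memorizer), ("threshold rule", threshold_rule)]:
    print(f"{name}: train={accuracy(model, train):.2f} test={accuracy(model, test):.2f}")
```

In a noisy setting like this one, the memorizer scores perfectly on the training set because every training point is its own nearest neighbour, but it has memorized the flipped labels too, so its test score should come out lower. The threshold rule typically scores about the same on both sets.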

A tidy model pipeline with twidlr and broom

@drsimonj here to show you how to go from data in a data.frame to a tidy data.frame of model output by combining twidlr and broom in a single, tidy model pipeline.

A Partial Remedy to the Reproducibility Problem

Several years ago, John Ioannidis jolted the scientific establishment with an article titled “Why Most Published Research Findings Are False.” He had concerns about inattention to statistical power, multiple inference issues and so on. Most people had already been aware of all this, of course, but that conversation opened the floodgates, and many more issues were brought up, such as hidden lab-to-lab variability. In addition, there is the occasional revelation of outright fraud.

Shiny: data presentation with an extra

Shiny is an application based on R/RStudio which enables an interactive exploration of data through a dashboard with drop-down lists and checkboxes—programming-free. The apps can be useful for both the data analyst and the public.

Bland-Altman/Tukey Mean-Difference Plots using ggplot2

A very useful data visualisation tool in science, particularly in medical and sports settings, is the Bland-Altman/Tukey Mean-Difference plot. When comparing two sets of measurements of the same variable made by different instruments, we often need to determine whether the instruments agree. Correlation and linear regression can tell us something about the bivariate relationship between the two sets of measurements: we can identify the strength, form and direction of the relationship, but this approach is not recommended for comparative analyses. The Bland-Altman plot was first used in 1983 by J. M. Bland and D. G. Altman, who applied it to medical statistics. The Tukey Mean-Difference Plot was one of many exploratory data visualisation tools created by John Tukey who, interestingly, also created the beloved boxplot.
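
The quantities behind such a plot are simple to compute. The sketch below is in plain Python rather than the ggplot2 code the article itself uses. It derives the per-pair means (x-axis) and differences (y-axis), plus the mean difference and the conventional limits of agreement at the mean difference ± 1.96 standard deviations. The function name `bland_altman` and the sample readings are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev


def bland_altman(a, b):
    """Compute the values plotted in a Bland-Altman / mean-difference plot."""
    if len(a) != len(b):
        raise ValueError("both instruments must measure the same items")
    means = [(x + y) / 2 for x, y in zip(a, b)]  # x-axis: mean of each pair
    diffs = [x - y for x, y in zip(a, b)]        # y-axis: difference of each pair
    bias = mean(diffs)                           # average disagreement
    sd = stdev(diffs)                            # sample standard deviation
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "lower": bias - 1.96 * sd,  # lower limit of agreement
        "upper": bias + 1.96 * sd,  # upper limit of agreement
    }


# Hypothetical readings of the same four items from two instruments.
instrument_a = [10.0, 12.0, 14.0, 16.0]
instrument_b = [9.0, 12.5, 13.0, 16.5]

result = bland_altman(instrument_a, instrument_b)
print(result["bias"], result["lower"], result["upper"])
```

Plotting `diffs` against `means`, with horizontal lines at `bias`, `lower` and `upper`, gives the usual picture: if most points fall between the two limits and the bias is close to zero, the instruments broadly agree.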

Evaluate your model with R Exercises

There was a time when statisticians had to crunch numbers manually when they wanted to fit their data to a model. Since this process was so long, those statisticians usually did a lot of preliminary work, researching other models that had worked in the past or looking for studies in other scientific fields, like psychology or sociology, that could inform their model, with the goal of maximizing their chance of building a relevant model. Then they would create a model and an alternative model and choose the one that seemed more efficient. Now that even an average computer gives us incredible computing power, it’s easy to build multiple models and choose the one that best fits the data. Even though it is better to have good prior knowledge of the process you are trying to analyze and of other models used in the past, coming to a conclusion using mostly the data helps you avoid bias and create better models.
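
As a rough sketch of what "build multiple models and choose the one that best fits the data" can look like, the Python snippet below fits a constant model and a straight line by least squares, then compares them by mean squared error on held-out points. The exercises themselves are in R. The data values, the two candidate models, and the choice of held-out error as the criterion are all assumptions made for illustration.

```python
from statistics import mean


def fit_constant(xs, ys):
    """Model that always predicts the mean of the training targets."""
    c = mean(ys)
    return lambda x: c


def fit_linear(xs, ys):
    """Simple least-squares line: y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    errors = [(model(x) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Hypothetical measurements, roughly following y = 2x + 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]

# Fit on every other point and score on the points held out.
train_x, train_y = xs[0::2], ys[0::2]
valid_x, valid_y = xs[1::2], ys[1::2]

candidates = {"constant": fit_constant, "linear": fit_linear}
scores = {
    name: mse(fit(train_x, train_y), valid_x, valid_y)
    for name, fit in candidates.items()
}
best_name = min(scores, key=scores.get)
print(scores, "->", best_name)
```

Scoring on held-out points rather than on the training points is a deliberate choice here: on the training data alone, the more flexible model can never look worse, so it would be picked regardless of whether it actually generalizes.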