Exploring Categorical Data With Inspectdf

I often find myself viewing and reviewing dataframes throughout the course of an analysis, and a substantial amount of time can be spent rewriting the same code to do this. inspectdf is an R package designed to make common exploratory tools a bit more useful and easy to use. In particular, it’s very powerful to be able to quickly see the contents of categorical features. In this article, we’ll summarise how to use the inspect_cat() function from inspectdf for summarising and visualising categorical columns.

Text Classification in Python

This article is the first of a series in which I will cover the whole process of developing a machine learning project. In this article we focus on training a supervised learning text classification model in Python. The motivation behind writing these articles is the following: as a learning data scientist who has been working with data science tools and machine learning models for a fair amount of time, I’ve found that many articles on the internet, books and the literature in general focus strongly on the modeling part. That is, we are given a certain dataset (with the labels already assigned if it is a supervised learning problem), try several models and obtain a performance metric. And the process ends there.

A new Tool to your Toolkit, KL Divergence at Work

In my previous post, we got a thorough understanding of Entropy, Cross-Entropy, and KL-Divergence in an intuitive way and also by calculating their values through examples. In case you missed it, please go through it once before proceeding to the finale. In this post, we will apply these concepts and check the results in a real dataset. Also, it will give us good intuition on how to use these concepts in modeling various day-to-day machine learning problems. So, let’s get started.
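As a quick refresher on the three quantities mentioned above, here is a minimal sketch in plain Python. It is not taken from the post: the function names and the two example distributions are illustrative assumptions, and it assumes every probability in q is strictly positive.

```python
import math

def entropy(p):
    """Entropy H(p) in bits of a discrete distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL divergence D_KL(p || q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions: p is the "true" one, q is a model's estimate.
p = [0.5, 0.5]
q = [0.9, 0.1]

print("H(p)       =", entropy(p))
print("H(p, q)    =", cross_entropy(p, q))
print("KL(p || q) =", kl_divergence(p, q))
```

The three values are linked by KL(p || q) = H(p, q) - H(p), so the divergence is zero exactly when the model distribution matches the true one.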

A Language, not a Letter: Learning Statistics in R

This online collection of tutorials was created by graduate students in psychology as a resource for other experimental psychologists interested in using R for statistical analyses and graphics. Each chapter was created to provide an overview of how to code a particular topic in the R language. Who is this book for? This book was designed for psychologists already familiar with the statistics they need to utilize, but who have zero experience programming and working in R. Many of the authors of these tutorials had never used R prior to taking the course in which this collection of tutorials was created. In one semester, they were able to gain enough proficiency in R to independently create one of the tutorials included here.

What the Evidence Shows About the Impact of the GDPR After One Year

The General Data Protection Regulation (GDPR), the new privacy law for the European Union (EU), went into effect on May 25, 2018. One year later, there is mounting evidence that the law has not produced its intended outcomes; moreover, the unintended consequences are severe and widespread. This article documents the challenges associated with the GDPR, including the various ways in which the law has impacted businesses, digital innovation, the labor market, and consumers. Specifically, the evidence shows that the GDPR:
• Negatively affects the EU economy and businesses
• Drains company resources
• Hurts European tech startups
• Reduces competition in digital advertising
• Is too complicated for businesses to implement
• Fails to increase trust among users
• Negatively impacts users’ online access
• Is too complicated for consumers to understand
• Is not consistently implemented across member states
• Strains resources of regulators

modelDown is now on CRAN!

The modelDown package turns classification or regression models into static HTML websites. With one command you can convert one or more models into a website with visual and tabular model summaries, such as model performance, feature importance, single feature response profiles and basic model audits. modelDown uses DALEX explainers, so it’s model agnostic (feel free to combine random forest with glm) and easy to extend and parameterise.

How your smartphone tells your story: A dive into Android activity data

I was really excited when Google announced their Digital Wellbeing program back in May 2018, especially Dashboard. It tracks all your app interactions on the phone and even helps you to limit app usage by setting time restrictions on different apps. But as of October 2018, Google still hasn’t rolled out that feature to all Android P users, and it is in beta even for Pixel users. So I decided to check out my own statistics with the data available at hand.

Causal vs. Statistical Inference

Causal inference, or the problem of causality in general, has received a lot of attention in recent years. The question is simple: is correlation enough for inference? I am going to state the following: the more informed uninformed person is going to pose an argument that looks like this: causation is nothing else than really strong correlation. I hate to break it to you if this is your opinion, but no, it is not; it is most certainly not. I can see that it is relatively easy to be convinced that it is, but once we start thinking about it a bit, we easily come to the realization that it is not. If you are still convinced otherwise after reading this article, please contact me for further discussion, because I would be interested in your line of thought.

How To Run Jupyter Notebooks in the Cloud

When starting out in data science, DevOps tasks are the last thing you should be worrying about. Trying to master all (or most) aspects of data science requires a tremendous amount of time and practice. Nevertheless, if you should happen to attend a boot camp or some other type of school, it is very likely that you are going to have to complete group projects sooner or later. However, coordinating these without any DevOps knowledge can prove to be quite the challenge. How do we share code? How do we deal with very expensive computations? How do we make sure everyone is using the same environment? Questions like these can easily stall the progress of any data science project.

Code free Data Science with Microsoft Azure Machine Learning Studio

In recent weeks, months and even years, a lot of tools have emerged that promise to make the field of data science more accessible. This isn’t an easy task considering the complexity of most parts of the data science and machine learning pipeline. Nonetheless, many libraries and tools, including Keras, FastAI and Weka, have made it significantly easier to create a data science project by providing us with an easy-to-use high-level interface and a lot of prebuilt components.

Creating New Scripts with StyleGAN

I applied StyleGAN to images of Unicode characters to see if it could invent new characters.

Exploratory Data Analysis Tutorial in Python

One of the most important skills that every data scientist must master is the ability to explore data properly. Thorough exploratory data analysis (EDA) is essential to ensure the integrity of your gathered data and of the analysis you perform. The example used in this tutorial is an exploratory analysis of historical SAT and ACT data to compare participation and performance between SAT and ACT exams in different states. By the end of this tutorial, we will have gained data-driven insight into potential issues regarding standardized testing in the United States. The focus of this tutorial is to demonstrate the exploratory data analysis process, as well as to provide an example for Python programmers who want to practice working with data. For this analysis, I examined and manipulated available CSV data files containing data about the SAT and ACT for both 2017 and 2018 in a Jupyter Notebook. Exploring data through well-constructed visualizations and descriptive statistics is a great way to become acquainted with the data you’re working with and formulate hypotheses based on your observations.

Hypothesis Testing – An Introduction

Hypotheses are our assumptions about the data, which may or may not be true. In this post we’ll discuss the statistical process of evaluating the truth of a hypothesis; this process is known as hypothesis testing. Most statistical analysis has its genesis in comparing two types of distributions: the population distribution and the sample distribution. Let’s understand these terms through an example: suppose we want to statistically test our hypothesis that, on average, the performance of students in a standard aptitude test has improved in the last decade. We’re given a dataset containing the marks (maximum marks = 250) of 100 randomly selected students who appeared in the exam in 2009 and 2019.
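To make the example concrete, here is a minimal sketch of one way such a comparison could be run in plain Python. It is not the post’s own method: the marks are synthetic (the means and spread are made-up assumptions), the two years are treated as independent samples, Welch’s t statistic is used, and the p-value relies on a normal approximation that is only reasonable because each sample has 100 observations.

```python
import random
from statistics import NormalDist, mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference mean(sample_b) - mean(sample_a)."""
    standard_error = (variance(sample_a) / len(sample_a)
                      + variance(sample_b) / len(sample_b)) ** 0.5
    return (mean(sample_b) - mean(sample_a)) / standard_error

def one_sided_p_value(t_stat):
    """P(T >= t_stat) under the null, using a normal approximation (large samples)."""
    return 1.0 - NormalDist().cdf(t_stat)

# Synthetic marks out of 250 for 100 students per year (assumed, not real data).
rng = random.Random(0)
marks_2009 = [min(250.0, max(0.0, rng.gauss(150, 30))) for _ in range(100)]
marks_2019 = [min(250.0, max(0.0, rng.gauss(160, 30))) for _ in range(100)]

# Null hypothesis: mean marks did not improve between 2009 and 2019.
t_stat = welch_t(marks_2009, marks_2019)
p_value = one_sided_p_value(t_stat)
print("t statistic:", t_stat)
print("one-sided p-value:", p_value)
```

A small p-value would count as evidence against the null hypothesis of no improvement. If the same students had been measured in both years, a paired test would be the more appropriate choice.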

An ‘Equation-to-Code’ Machine Learning Project Walk-Through – Part 3 SGD

A detailed explanation of how to implement Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent from scratch in Python.
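For readers who want a feel for the idea before opening the walk-through, the sketch below fits a one-variable linear model with mini-batch gradient descent in plain Python. It is an illustration under assumptions, not the article’s code: the function name, the mean-squared-error loss, the hyperparameters and the noise-free toy data are all hypothetical choices. Setting batch_size to 1 gives plain SGD.

```python
import random

def minibatch_sgd(xs, ys, lr=0.1, epochs=300, batch_size=10, seed=0):
    """Fit y ~ w * x + b by minimising mean squared error with mini-batch updates."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Gradients of the batch's mean squared error with respect to w and b.
            grad_w = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2.0 * sum(errors) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy, noise-free data generated from y = 2x + 1 (an assumed example).
xs = [i / 50 for i in range(50)]
ys = [2.0 * x + 1.0 for x in xs]

w_mb, b_mb = minibatch_sgd(xs, ys, batch_size=10)   # mini-batch gradient descent
w_sgd, b_sgd = minibatch_sgd(xs, ys, batch_size=1)  # stochastic gradient descent
print("mini-batch:", w_mb, b_mb)
print("SGD:       ", w_sgd, b_sgd)
```

The only difference between the two variants is how many samples contribute to each gradient estimate: smaller batches mean cheaper but noisier updates.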

This AI Can Detect Image Manipulation, And Might Well Be Our Savior

Researchers from Adobe and UC Berkeley have unveiled an interesting way to combat the spread of image manipulation – using AI to spot edited photos. The AI was trained to recognize instances where the Face-Aware Liquify feature of Photoshop was used to edit images. The feature enables you to easily tweak and exaggerate facial features, for example, widening the eyes or literally turning a frown into a smile. This particular feature is popular when it comes to editing faces, and was chosen because the effects can be extremely subtle. The results were astonishing. While humans could spot the edits just 53% of the time (only a little over chance), the AI sometimes achieved results as high as 99%. Part of the reason the AI performs so much better than the human eye is that it can also access low-level image data, as opposed to simply relying on visual cues. So, why is this important?

Beginner’s Guide to BERT for Multi-classification Task

The purpose of this article is to provide a step-by-step tutorial on how to use BERT for a multi-classification task. BERT (Bidirectional Encoder Representations from Transformers) is a new method of pre-training language representations by Google that aims to solve a wide range of Natural Language Processing tasks. The model is based on an unsupervised, deeply bidirectional system and achieved state-of-the-art results when it was first released to the public in 2018.

The W3H of AlexNet, VGGNet, ResNet, Inception

In this tutorial, I will quickly go through the details of four famous CNN architectures and how they differ from each other by explaining their W3H (When, Why, What and How).

A Comprehensive guide on handling Missing Values

Most real-world data contains missing values. They occur for many reasons, such as observations not being recorded or data being corrupted.