Introduction to Geospatial Data in Python

In this tutorial, you will get to know the two packages that are popular to work with geospatial data: geopandas and Shapely. Then you will apply these two packages to read in the geospatial data using Python and plotting the trace of Hurricane Florence from August 30th to September 18th.

Visualizations for correlation matrices in R

When working with data it is helpful to build a correlation matrix to describe data and the associations between variables. In this article, you learn how to use visualizations for correlation matrices in R.

Building a Question-Answering System from Scratch – Part 1

As my Masters is coming to an end, I wanted to work on an interesting NLP project where I can use all the techniques(not exactly) I have learned at USF. With the help of my professors and discussions with the batch mates, I decided to build a question-answering model from scratch. I am using the Stanford Question Answering Dataset (SQuAD). The problem is pretty famous with all the big companies trying to jump up at the leaderboard and using advanced techniques like attention based RNN models to get the best accuracy. All the GitHub repositories that I found related to SQuAD by other people have also used RNNs.

Introducing gratia

I use generalized additive models (GAMs) in my research work. I use them a lot! Simon Wood’s mgcv package is an excellent set of software for specifying, fitting, and visualizing GAMs for very large data sets. Despite recently dabbling with brms, mgcv is still my go-to GAM package. The only down-side to mgcv is that it is not very tidy-aware and the ggplot-verse may as well not exist as far as it is concerned. This in itself is no bad thing, though as someone who uses mgcv a lot but also prefers to do my plotting with ggplot2, this lack of awareness was starting to hurt. So, I started working on something to help bridge the gap between these two separate worlds that I inhabit. The fruit of that labour is gratia, and development has progressed to the stage where I am ready to talk a bit more about it. gratia is an R package for working with GAMs fitted with gam(), bam() or gamm() from mgcv or gamm4() from the gamm4 package, although functionality for handling the latter is not yet implement. gratia provides functions to replace the base-graphics-based plot.gam() and gam.check() that mgcv provides with ggplot2-based versions. Recent changes have also resulted in gratia being much more tidyverse aware and it now (mostly) returns outputs as tibbles. In this post I wanted to give a flavour of what is currently possible with gratia and outline what still needs to be implemented.

RStudio 1.2 Preview: Plumber Integration

The RStudio 1.2 Preview Release makes it even easier to create RESTful Web APIs in R using the plumber package. plumber is a package that converts your existing R code to a web API using a handful of special one-line comments.

Zombies & Model Rot (with ML Engine + DataStore)

So you’ve deployed your machine learning model to the cloud and all of your apps and services are able to fetch predictions from it, nice! You can leave that model alone to do its thing forever… maybe not. Most machine learning models are modeling something about this world, and this world is constantly changing. Either change with it, or be left behind!

up – the Ultimate Plumber

up is the Ultimate Plumber, a tool for writing Linux pipes in a terminal-based UI interactively, with instant live preview of command results. The main goal of the Ultimate Plumber is to help interactively and incrementally explore textual data in Linux, by making it easier to quickly build complex pipelines, thanks to a fast feedback loop. This is achieved by boosting any typical Linux text-processing utils such as grep, sort, cut, paste, awk, wc, perl, etc., etc., by providing a quick, interactive, scrollable preview of their results.

Generative Adversarial Networks – Paper Reading Road Map

This summer, I have worked on Generative Adversarial Networks (GANs) through my research internship. At first, I did not know much about this model, so the very first weeks of my internship included a lot of paper reading. To help the others who want to learn more about the technical sides of GANs, I wanted to share some papers I have read in the order that I read them. For a less technical introduction to Generative Adversarial Networks, have a look at here. Before reading these papers, I recommend you to revise the basics of deep learning if you are not familiar with them. I also believe that the mathematics behind some of these papers can be very difficult, so you can skip those parts if you don’t feel comfortable with them.

Beyond Reinforcement Learning: Teaching AI to Solve Super-Human Tasks by Amplifying Weak Teachers

Most of the relevant tasks that humans perform during the course of long-term period are hard to specify in objectives. Consider processes such as achieving a scientific breakthrough, designing an economic policy or assembling an NBA-championship winning team. Those processes are hard to model in discrete goals and its performance evaluation is the result of constant experimentation and adjustment. From the complexity standpoint, those tasks can be considered ‘beyond human scale'(BHS) in the sense that it can seldom be planned by a single human given their massive observation space. In the machine learning space, we rely on training signals for guiding the learning process of an algorithm. However, BHS tasks don’t have an objectively training signal given their continuous nature and complexity. How can we train machine learning systems to solve BHS tasks? This is the subject of a new paper published by researchers of artificial intelligence(AI) powerhouse OpenAI. The paper proposes a method called Iterated Amplification to progressively build a training signal for complex problems by combining solutions for easier problems.

Intuitive Machine Learning for Engineers

A platform for discovering, sharing, and discussing easy to use and pre-trained machine learning models.

Introducing ReviewNB: Visual Diff for Jupyter Notebooks

I’m excited to announce ReviewNB – Jupyter Notebook Diff for GitHub, a tool that helps in version controlling Jupyter Notebooks. I have been working on this for the past few months and pretty happy with how it turned out.

Calculating Gradient Descent Manually

Here’s our problem. We have a neural network with just one layer (for simplicity’s sake) and a loss function. That one layer is a simple fully-connected layer with only one neuron, numerous weights w?, w?, w?…, a bias b, and a ReLU activation. Our loss function is the commonly used Mean Squared Error (MSE). Knowing our network and our loss function, how can we tweak the weights and biases to minimize the loss?

LSTM Neural Network for Industry 4.0 Condition-based Monitoring

Industry 4.0 is a current trend. This concept aims at creating the so-called smart industry, where elements of the system communicate and cooperate with each other and with humans in an online fashion through the Internet of Services [1]. It is surely one of the biggest endeavors industries will face, in order to keep their response time competitive in the market, while offering good-quality products and services to different and demanding customers.

How to Do Data Science in your Company to Get The Most Out of it.

Young companies soak up Lean Development, Customer Centricity, and Data Driven Design from the beginning. However, the way to utilize data up to its full potential is still very difficult. This step is even more difficult for traditional enterprises. Equipped with multiples of a startup’s ressources, the use of data seems much more difficult. Due to neccessary changes in the company culture and in the technological repertoire, data initiatives get burried without success (This is still the case even though simple data driven decision making has been shown to increase stock value by 5-6% years ago according to Brynjolfsson 2011).

Finding the Cost Function of Neural Networks

Terence Parr and Jeremy Howard’s paper, The Matrix Calculus You Need for Deep Learning, provided a lot of insights into how deep learning libraries like Tensorflow or PyTorch really worked. Without understanding the math behind deep learning, all we are doing is writing a few lines of abstract code?-?build a model, compile it, train it, evaluate it?-?without really learning to appreciate all the complex intricacies that support all these functions. This (and the next few) articles will be the reflections of what I learned from Terence and Jeremy’s paper. This article will introduce the problem, which we will solve in the upcoming articles. I’ll explain most of the math and add in some of my insights, but for more information, definitely check out the original paper.

Demystifying Statistical Analysis 1: A Handy Cheat Sheet

The choice of statistical analysis to use is mostly governed by the type of variables in a dataset, the number of variables that the analysis needs to be conducted on, and the number of levels/categories within a variable. As long as we have a good understanding of the data that we are dealing with, the selection of statistical analysis to use should not be too intimidating. I have always wondered why most statistics textbooks do not contain a table that consolidates all the different statistical analyses at a glance, how they relate to one another and when to use each of them. A quick search on the internet uncovers a number of similar cheat sheets, but none of them presents the information in a way that was intuitive to me.

Inforence: effective fault localization based on information-theoretic analysis and statistical causal inference

In this paper, a novel approach, Inforence, is proposed to isolate the suspicious codes that likely contain faults. Inforence employs a feature selection method, based on mutual information, to identify those bug-related statements that may cause the program to fail. Because the majority of a program faults may be revealed as undesired joint effect of the program statements on each other and on program termination state, unlike the state-of-the-art methods, Inforence tries to identify and select groups of interdependent statements which altogether may affect the program failure. The interdependence amongst the statements is measured according to their mutual effect on each other and on the program termination state. To provide the context of failure, the selected bug-related statements are chained to each other, considering the program static structure. Eventually, the resultant cause-effect chains are ranked according to their combined causal effect on program failure. To validate Inforence, the results of our experiments with seven sets of programs include Siemens suite, gzip, grep, sed, space, make and bash are presented. The experimental results are then compared with those provided by different fault localization techniques for the both single-fault and multi-fault programs. The experimental results prove the outperformance of the proposed method compared to the state-of-the-art techniques.

What is Model-Agnostic Meta-learning (MAML) ?

Artificial intelligence is trying to learn how to learn from the way humans learn. We can quickly and easily recognize a new object from just seeing one or few pictures of it, or even from only reading about it without having ever seen it before. We can learn quickly a new skill as well as master many different tasks. This seems easy for the human intelligence but for machines, it is quite a challenge to overcome. Berkeley AI Research Lab published a research paper introducing Model-Agnostic Meta-learning MAML which is a simple solution, yet so powerful and game-changing in meta-learning (or learning to learn). The objective of this article is to give you a clear understanding of this approach.