Docker for Data Science

Docker is a tool that simplifies the installation process for software engineers. Coming from a statistics background, I used to care very little about how to install software and would occasionally spend a few days trying to resolve system configuration issues. Enter the godsend, Docker almighty. Think of Docker as a light virtual machine (I apologise to the Docker gurus for using that term). Generally, someone writes a *Dockerfile* that builds a *Docker Image*, which contains most of the tools and libraries that you need for a project. You can use this as a base and add any other dependencies that your project requires. Its underlying philosophy is that if it works on my machine, it will work on yours.

Enterprise AI: Learning from the evolution of Robotic Process Automation

In this post, we learn from the evolution of RPA and explore the wider implications for Enterprise AI, i.e. the deployment of Artificial Intelligence to the Enterprise.

What Is AI? – Artificial Intelligence For Beginners

Do you remember the first time that you saw R2D2 and C3P0 from Star Wars? These two robots exhibited human-like behavior as they interacted with people and the world around them. How about when the whole world was subject to machine control in The Matrix? That’s a pretty frightening concept. These movies, like many others, have their own depictions of what Artificial Intelligence looks like and what it means to us as a society. The term Artificial Intelligence has been popularized in books and movies to depict futuristic settings where machines take over the world, or live with us side by side as if they were humans. But did you know that Artificial Intelligence is actually here with us, today? That we as a society and a culture are already embracing Artificial Intelligence in our daily lives, every day? We no longer live in a world where Artificial Intelligence is just a story in our books, or on the big screen when we go to the movies. But if Artificial Intelligence is already here, what is it really like? How does it work? How does Artificial Intelligence actually compare to the fictional representations of AI in our culture?

Infrastructure and Development for Data Science

There is certainly no single infrastructure that suits all requirements perfectly. The standard development, acceptance, and production infrastructure works well, but you have to be aware that the requirements a Data Science team has on its environment differ from those of a software development team, and adjust your infrastructure accordingly. This includes giving the Data Scientists more freedom for ad-hoc tasks and not restricting them with software development rules when those rules are not necessary. Otherwise we limit the productivity of our Data Science team.

Web Scraping for Data Science with Python

For those who are not familiar with programming or the deeper workings of the web, web scraping often looks like a black art: the ability to write a program that sets off on its own to explore the Internet and collect data is seen as a magical and exciting ability to possess. In Web Scraping for Data Science with Python, we set out to provide a concise though thorough and modern guide to web scraping, using Python as our programming language. In addition, this book is written with a data science audience in mind.

What is the downside of deep reinforcement learning? When shouldn’t it be used?

Reinforcement learning describes the set of learning problems where an agent must take actions in an environment in order to maximize some defined reward function. Unlike in supervised deep learning, large amounts of labeled data with the correct input-output pairs are not explicitly presented. Most of the learning happens online, i.e. as the agent actively interacts with its environment over several iterations, it eventually begins to learn the policy describing which actions to take to maximize the reward.
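
To make the agent, environment, and reward loop concrete, here is a minimal sketch using tabular Q-learning, which is one simple reinforcement learning method and not the deep variant the linked post discusses. The five-cell corridor environment, the `step` and `train` functions, and all hyperparameter values are assumptions made up for this illustration.

```python
import random

N_STATES = 5          # hypothetical corridor with cells 0..4; cell 4 is the goal
ACTIONS = (0, 1)      # 0 = step left, 1 = step right

def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values purely from interaction, with no labeled pairs."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy choice; ties are broken at random.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            nxt, reward, done = step(state, action)
            # Q-learning update: move the estimate toward reward + discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q

q_table = train()
# Greedy policy for the non-terminal cells 0..3.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

The only feedback the agent ever sees is the reward at the goal cell. In this toy setting the learned greedy policy should settle on stepping right in every cell.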

How self-service data avoids the dangers of “shadow analytics”

Without the proper cataloging, curation, and security that self-service data platforms allow, companies are left vulnerable to cybersecurity threats and misinformation.

archivist: An R Package for Managing, Recording and Restoring Data Analysis Results

Everything that exists in R is an object (Chambers 2016). This article examines what would be possible if we kept copies of all R objects that have ever been created: not only the objects but also their properties, meta-data, relations with other objects, and information about the context in which they were created. We introduce archivist, an R package designed to improve the management of results of data analysis. Key functionalities of this package include: (i) management of local and remote repositories which contain R objects and their meta-data (objects’ properties and relations between them); (ii) archiving R objects to repositories; (iii) sharing and retrieving objects (and their pedigree) by their unique hooks; (iv) searching for objects with specific properties or relations to other objects; (v) verification of an object’s identity and the context of its creation. The presented archivist package extends, in combination with packages such as knitr and the function Sweave, the reproducible research paradigm by creating new ways to retrieve and validate previously calculated objects. These new features give a variety of opportunities such as: sharing R objects within reports or articles; adding hooks to R objects in table or figure captions; interactive exploration of object repositories; caching function calls with their results; retrieving an object’s pedigree (i.e., information about how the object was created); automated tracking of the performance of considered models; restoring R packages to the state in which the object was archived.

The British Ecological Society’s Guide to Reproducible Science

A Guide to Reproducible Code covers all the basic tools and information you will need to start making your code more reproducible. We focus on R and Python, but many of the tips apply to any programming language. Anna Krystalli introduces some ways to organise files on your computer and to document your workflows. Laura Graham writes about how to make your code more reproducible and readable. François Michonneau explains how to write reproducible reports. Tamora James breaks down the basics of version control. Finally, Mike Croucher describes how to archive your code. We have also included a selection of helpful tips from other scientists.

When two trends fuse: PyTorch and recommender systems

In the last few years, we have experienced the resurgence of neural networks owing to the availability of large data sets, increased computational power, innovation in model building via deep learning, and, most importantly, open source software libraries that ease use for non-researchers. In 2016, the rapid rise of the TensorFlow library for building deep learning models allowed application developers to take state-of-the-art models and put them into production. Deep learning-based neural network research and application development is currently a very fast-moving field. As such, in 2017 we have seen the emergence of the deep learning library PyTorch. At the same time, researchers in the field of recommendation systems continue to pioneer new ways to increase performance as the number of users and items increases. In this post, we will discuss the rise of PyTorch, and how its flexibility and native Python integration make it an ideal tool for building recommender systems.

Fundamentals of Deep Learning – Introduction to Recurrent Neural Networks

Let me open this article with a question: “working love learning we on deep”, did this make any sense to you? Not really. Now read this one: “We love working on deep learning”. Made perfect sense! A little jumble in the words made the sentence incoherent. Well, can we expect a neural network to make sense of it? Not really! If the human brain was confused about what it meant, I am sure a neural network is going to have a tough time deciphering such text. There are many such tasks in everyday life which get completely disrupted when their sequence is disturbed. For instance: language, as we saw earlier, where the sequence of words defines their meaning; time series data, where time defines the occurrence of events; and genome sequence data, where every sequence has a different meaning. There are many such cases wherein the sequence of information determines the event itself. If we are trying to use such data for any reasonable output, we need a network which has access to some prior knowledge about the data to completely understand it. This is where recurrent neural networks come into play. In this article I assume that you have a basic understanding of neural networks; in case you need a refresher, please go through this article before you proceed.
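
As a rough illustration of why order matters to a recurrent model, the sketch below runs a single scalar recurrence, h_t = tanh(w_x * x_t + w_h * h_(t-1)), over a sequence and over the same sequence reversed. This is a toy, untrained recurrence and not the network from the linked article; the function names and the weights `w_x` and `w_h` are arbitrary assumptions chosen for the example.

```python
import math

def recurrent_final_state(sequence, w_x=0.5, w_h=0.8):
    """Carry a hidden state through the sequence; each step sees the previous state."""
    h = 0.0
    for x in sequence:
        h = math.tanh(w_x * x + w_h * h)
    return h

def order_free_summary(sequence):
    """A summary with no memory of position: the plain sum of the inputs."""
    return sum(sequence)

sequence = [1.0, 2.0, 3.0]
shuffled = list(reversed(sequence))

forward_state = recurrent_final_state(sequence)
reversed_state = recurrent_final_state(shuffled)

print("recurrent:", forward_state, reversed_state)
print("order-free:", order_free_summary(sequence), order_free_summary(shuffled))
```

The order-free summary cannot tell the two sequences apart, while the recurrent state ends up different because each step depends on everything that came before it.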

Building a simple Sales Revenue Dashboard with R Shiny & ShinyDashboard

One of the beautiful gifts that R has (and that Python misses) is the Shiny package. Shiny is an R package that makes it easy to build interactive web apps straight from R. Building a dashboard becomes inevitable wherever data is available, since dashboards are good at helping a business draw insights from its existing data. In this post, we will see how to leverage Shiny to build a simple Sales Revenue Dashboard.

Understanding common misconceptions about p-values

A p-value is the probability of the observed, or more extreme, data, under the assumption that the null hypothesis is true. The goal of this blog post is to understand what this means and, perhaps more importantly, what this doesn’t mean. People often misunderstand p-values, but with a little help and some dedicated effort, we should be able to explain these misconceptions. Below is my attempt, but if you prefer a more verbal explanation, I can recommend Greenland et al. (2016).
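
The definition can be turned into a small calculation. The sketch below is not taken from the linked post: it assumes a made-up experiment of 20 coin flips with 15 heads, takes "the coin is fair" as the null hypothesis, and treats "more extreme" as one-sided (15 or more heads). The function name `binomial_p_value` is hypothetical.

```python
from math import comb

def binomial_p_value(n_flips, n_heads, p_null=0.5):
    """One-sided p-value: P(X >= n_heads) for X ~ Binomial(n_flips, p_null)."""
    return sum(
        comb(n_flips, k) * p_null**k * (1 - p_null) ** (n_flips - k)
        for k in range(n_heads, n_flips + 1)
    )

# Probability of seeing 15 or more heads in 20 flips if the coin really is fair.
p_value = binomial_p_value(n_flips=20, n_heads=15)
print(f"one-sided p-value: {p_value:.4f}")  # one-sided p-value: 0.0207
```

Note what the number is conditioned on: it is the probability of data at least this extreme given that the null hypothesis is true, not the probability that the null hypothesis is true given the data.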

Essential Guide to keep up with AI/ML/CV/UNameIt

These fields are booming these days. In order not to become rusty, one has to constantly follow the updates. Here is the essential guide on how to keep up with the important news/papers/discussions/tutorials. This guide is by no means an exhaustive one so contributions are truly welcome. I believe you have a lot to add!