Extracting Data from Multiple Data Sources. If you’re in tech, chances are you have data to manage. Data grows fast, gets more complex and harder to manage as your company scales. The management wants to extract insights from the data they have, but they do not have the technical skills to do that. They then hire you, the friendly neighbourhood data scientist/engineer/analyst or whatever title you want to call yourself nowadays ( does not matter you’ll be doing all the work anyways ) to do exactly just that. You soon realise that in order for you to provide insights, you need to have some kind of visualisation to explain your findings and monitor them over time. For these data to be up to date, you need to extract, transform, load them into your preferred database from multiple data sources in a fixed time interval ( hourly , daily, weekly, monthly). Companies have workflow management systems for tasks like these. They enable us to create and schedule jobs internally. Everyone has their own preference, but I would say many of the old WMS tend to be inefficient, hard to scale, lack of a good UI, no strategy for retry, not to mention terrible logs that makes troubleshooting / debugging a nightmare. It causes unnecessary wastage of effort and time. Here is an example of a traditional crontab used to schedule jobs.
There are different loss functions available for different objectives. In this mini blog, I will take you through some of the very frequently used loss functions, with a set of examples. This blog is designed by keeping the Keras and Tensorflow framework in the mind.
Tensorflow inspires developers to experiment with their exciting AI ideas in almost any domain that comes to mind. There are three well known factors in the ML community that make up a good Deep Neural Network model do magical things.
• Model Architecture
• High quality training data
• Sufficient Compute Capacity
• Model Architecture
• High quality training data
• Sufficient Compute Capacity
When we recently held a 24h hackathon aimed at helping an NGO fight human trafficking, the composition of the competing teams could not have been more different: There were a number of teams with a heavy background in analytics consulting. And then there was my team: One data engineer, and three designers. The goal of the event was to generate insights from data provided by the client. The game was on. The client brief was the nail. Each team had its hammer. It might have been hard to see what such a design heavy group had to do with what seemed like an obvious data analytics challenge, but we did not despair. Instead, the challenge allowed us to illustrate some valuable lessons about how combining data- and design-driven approaches can generate unique insights that previously had been overlooked. The key to getting there was in looking for the stories that came to the surface at the intersection of analytics and ethnography.
Football( soccer for Americans) is by far the most popular sport in the world. With over 4 billion fans worldwide, football has proven to transcend generations, geo-political rivalries and even war conflicts. That passion has transferred into the video game space, in which games like FIFA regularly ranked among the most popular videogames worldwide. Despite its popularity, football is one of the games that have proven resilient artificial intelligence(AI) techniques. The complexity of environments like FIFA often result on a nightmare for AI algorithms. Recently, researchers from the Google Brain team open sourced Google Research Football, a new environment that leverages reinforcement learning to teach AI agents how to master the most popular sport in the world. The principles behind Google Research Football were outlined in a research paper that accompanied the release.
Foreword: The following post is intended to be an introduction with some math. Although it includes math, I’m by no means an expert. I struggled to learn the basics of probabilistic programming, and hopefully this helps someone else make a little more sense out of the world. If you find any errors, please drop a comment and help us learn. Cheers! In data science we are often interested in understanding how data was generated, mainly because it allows us to answer questions about new and/or incomplete observations. Given a (often noisy) dataset though, there are an infinite number of possible model structures that could have generated the observed data. But not all model structures are made equally – that is, certain model structures are more realistic than others based on our assumptions of the world. It is up to the modeler to choose the one that best describes the world they are observing.
Correlation does not imply causation’. We hear this sentence over and over again. But what does that actually mean? This small analysis uncovers this topic with the help of R and simple regressions, focusing on how alcohol impacts health.
In the last few years, the machine-learning community has blossomed, applying the technology to challenges like food security and health care.
Every single successful data science project revolves around finding accurate correlations between the input and target variables. However more than often, we oversee how crucial correlation analysis is. It is recommended to perform correlation analysis before and after data gathering and transformation phases of a data science project. This article focuses on the important role correlations play in the data science projects and concentrates on the real world FinTech examples. Lastly it explains how we can model the correlations the right way.
This repository contains examples of popular machine learning algorithms implemented in Python with mathematics behind them being explained. Each algorithm has interactive Jupyter Notebook demo that allows you to play with training data, algorithms configurations and immediately see the results, charts and predictions right in your browser. In most cases the explanations are based on this great machine learning course by Andrew Ng. The purpose of this repository is not to implement machine learning algorithms by using 3rd party library one-liners but rather to practice implementing these algorithms from scratch and get better understanding of the mathematics behind each algorithm. That’s why all algorithms implementations are called ‘homemade’ and not intended to be used for production.
In the past couple of years, big companies hurried to publish their machine learning based time series libraries. For example, Facebook released Prophet, Amazon released Gluon Time Series, Microsoft released Time Series Insights and Google released Tensorflow time series. This popularity shows that machine learning based time series prediction is in high demand. This article introduces the recently released Tensorflow time series library from Google. This library uses probabilistic models to describe time series. In my 7 years’ experience in algorithmic trading, I’ve established the habit of analysing the inner workings of time series libraries. So I digged into the source code of this library to understand the Tensorflow team’s take on time series modelling.
The ability to answer factoid questions is a key feature of any dialogue system. Formally speaking, to give an answer based on the document collection covering wide range of topics is called open-domain question answering (ODQA). The ODQA task combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer span from those articles). An ODQA system can be used in many applications. Chatbots apply ODQA to answer user requests, while the business-oriented Natural Language Processing (NLP) solutions leverage ODQA to answer questions based on internal corporate documentation. The picture below shows a typical dialogue with an ODQA system.
MCMC is a pretty hard topic to wrap your head around but examples do help a lot. Last time I wrote an article explaining MCMC methods intuitively. In that article, I showed how MCMC chains could be used to simulate from a random variable whose distribution is partially known i.e. we don’t know the normalizing constant. I also told how MCMC can be used to Solve problems with a large state space. But didn’t give an example. I will provide some real-world use cases in this post. If you don’t really appreciate MCMC till now, I hope I will be able to pique your interest by the end of this blog post. This post is about understanding MCMC Methods with the help of some Computer Science problems.
End-to-end (E2E) learning refers to training a possibly complex learning system represented by a single model (specifically a Deep Neural Network) that represents the complete target system, bypassing the intermediate layers usually present in traditional pipeline designs.