Airflow and superQuery

‘What is the cost?’ A question asked so frequently in the tech world that everyone at a small start-up shudders slightly when it comes up, because the answer is invariably: ‘We’re not sure.’ One of the best tools for scheduling workflows in the data engineering world is Apache Airflow. It has taken many a business out of the inflexible cron-scheduling doldrums and set them riding the big data waves on the high seas of Directed Acyclic Graphs (DAGs). Of course, this means that large globs of data are being moved into and out of databases, and with this glorious movement often come unavoidable costs. One such database, a supercomputer if you will, is Google BigQuery. The flagship of the Google Cloud offering, it processes data at the petabyte scale and lets you worry less about the power of the database infrastructure and more about the quality of your analysis and the data flow problems in need of solving. One key factor to consider with BigQuery is how exposed any individual or organisation is to runaway costs for scanning data on the platform. Even the savviest data engineers will tell you in angst about the times they scanned across data they didn’t really want to and pushed their business’s monthly analysis bill over the budget.
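To make the cost question concrete: BigQuery’s on-demand model bills per byte scanned (the long-advertised rate is $5 per TB, though you should check current pricing), and the API’s dry-run mode reports a query’s `totalBytesProcessed` before you commit to running it. Here is a back-of-the-envelope sketch in plain Python, with a hypothetical scheduled task that scans a 200 GB table daily:

```python
# Rough BigQuery on-demand cost estimate, assuming the published
# rate of $5 per TB scanned (an assumption; verify current pricing).
TB = 1024 ** 4
PRICE_PER_TB = 5.00

def scan_cost(bytes_scanned: int) -> float:
    """Estimated on-demand cost in USD for scanning `bytes_scanned` bytes."""
    return bytes_scanned / TB * PRICE_PER_TB

# A hypothetical daily DAG task scanning a 200 GB table, 30 runs a month:
daily = scan_cost(200 * 1024 ** 3)
print(f"per run: ${daily:.2f}, per month: ${daily * 30:.2f}")
# → per run: $0.98, per month: $29.30
```

The real lever is feeding `scan_cost` the bytes reported by a dry run rather than the full table size, since partition filters and column selection can shrink the scan dramatically.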

Top 5 Reasons to Move Enterprise Data Science Off the Laptop and to the Cloud

We live in a world that is inundated with data. Data science and machine learning (ML) techniques have come to the rescue in helping enterprises analyze and make sense of these large volumes of data. Enterprises have hired data scientists – people who apply scientific methods to data to build mathematical software models – to generate insights or predictions that enable data-driven business decisions. Typically, data scientists are experts in statistical analysis and mathematical modeling who are proficient in programming languages such as R or Python.

R 3.5.3 now available

The R Core Team announced yesterday the release of R 3.5.3, and updated binaries for Windows and Linux are now available (with Mac sure to follow soon). This update fixes three minor bugs (to the functions writeLines, setClassUnion, and stopifnot), but you might want to upgrade just to avoid the ‘package built under R 3.5.3’ warnings you might otherwise get for new CRAN packages in the future.

‘X affects Y’. What does that even mean?

In my last post I gave an intuitive demonstration of what causal inference is and how it differs from classic ML. After receiving some feedback I realized that while the post was easy to digest, some confusion remained. In this post I’ll delve a bit deeper into what the ‘causal’ in Causal Inference actually means.

What is the Difference Between AI and Machine Learning?

Artificial Intelligence and Machine Learning have empowered our lives to a large extent. The advancements made in this space have revolutionized our society and continue to make it a better place to live in. The two terms are often used in the same context, which leads to confusion. AI is the broader concept of a machine making smart decisions, whereas Machine Learning is a sub-field of AI in which the machine makes decisions by learning patterns from input data. In this blog, we will dissect each term and understand how Artificial Intelligence and Machine Learning are related to each other.

The importance of Graphing Your Data – Anscombe’s Clever Quartet!

Francis Anscombe’s seminal paper ‘Graphs in Statistical Analysis’ (American Statistician, 1973) effectively makes the case that looking at summary statistics alone is insufficient to identify the relationship between variables. He demonstrates this with four constructed data sets (Anscombe’s quartet) that have nearly identical summary statistics: the same means and variances for x and y, the same correlation between x and y, and the same coefficients for the linear regression of y on x. (Some less widely reported statistics, such as kurtosis or least absolute deviations/median regression, would have revealed differences between the data sets.) Yet despite these underlying differences, any analysis that did not graph the data would likely miss the mark, because the four sets look radically different when plotted.
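A quick sketch of the point, using the quartet’s published values and nothing but plain Python: all four data sets print essentially identical means and correlations.

```python
# The four x/y pairs of Anscombe's quartet (values from the 1973 paper).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def mean(v):
    return sum(v) / len(v)

def pearson_r(x, y):
    """Pearson correlation, computed from raw sums (no libraries needed)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

for name, (x, y) in quartet.items():
    print(name, round(mean(x), 2), round(mean(y), 2), round(pearson_r(x, y), 3))
```

Every row shows a mean x of 9.0, a mean y of about 7.50, and a correlation of about 0.816, yet a scatter plot of each pair tells four completely different stories.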

R and labelled data: Using quasiquotation to add variable and value labels

Labelling data is typically a task for end-users, applied in their own scripts or functions rather than in packages. However, sometimes it can be useful for both end-users and package developers to have a flexible way to add variable and value labels to their data. In such cases, quasiquotation is helpful. This vignette demonstrates how to use quasiquotation in sjlabelled to label your data.

Unsupervised Classification Project: Building a Movie Recommender with Clustering Analysis and K-Means

The goal of this project is to find similarities within groups of people in order to build a movie recommender system. We are going to analyze a dataset from the Netflix database to explore the characteristics people share in their taste in movies, based on how they rate them.

Dockerizing Python Flask app and Conda environment

Use Docker to package your Python Flask app together with your Conda environment. This post describes how to dockerize a Python Flask app and recreate your Conda Python environment inside the image. Suppose you are developing a Python Flask app, and you have set up a Conda virtual environment on your local machine to run it. Now you want to put the app in a Docker image. Wouldn’t it be nice if you could export your current Conda environment as a .yml file, describing which Python version your application uses and which Python libraries are required to run it, and then use that exported .yml file to build an equivalent environment in a Docker image and run your Flask app there? The following describes exactly how to accomplish all of the above.
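A minimal sketch of the idea, assuming the environment was exported with `conda env export -n myenv > environment.yml` and the Flask entry point lives in `app.py` (both names are placeholders; adjust to your project):

```dockerfile
# Base image with Conda preinstalled
FROM continuumio/miniconda3

WORKDIR /app

# Recreate the exported Conda environment inside the image
COPY environment.yml .
RUN conda env create -f environment.yml

# Copy the application code
COPY . .

# Run the Flask app inside the recreated environment
CMD ["conda", "run", "-n", "myenv", "python", "app.py"]
```

Building with `docker build -t flask-conda .` and running with `docker run -p 5000:5000 flask-conda` then gives you the same environment in the container that you had locally.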

Virtual, Headless, and Distributed (Oh My!)

This post empowers the Pythonista with a complete framework to explore the world of data on the internet, all from behind randomized proxy servers in a fast, parallelized sequence, while protecting your company’s immutable IP from curious eyes and other potential trolls. With this new outlet, the reader is asked to take all due care: do not abuse the privilege of these acquired ghost-ninja skills, and do not tax any such services inappropriately or unethically. The user takes all responsibility for implementing the attached code (of course) and for all risks associated with running it.
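The core of the randomized-proxy trick can be sketched in a few lines; the pool below is hypothetical (in practice it would come from a paid rotating-proxy service or a vetted list), and the dict format shown is the one the popular `requests` library expects for its `proxies=` argument:

```python
import random

# Hypothetical proxy endpoints; replace with a real, vetted pool.
PROXY_POOL = [
    "http://proxy-a.example.com:8080",
    "http://proxy-b.example.com:8080",
    "http://proxy-c.example.com:8080",
]

def random_proxies() -> dict:
    """Pick one proxy at random and return it in the mapping format
    that requests-style HTTP clients accept for proxied traffic."""
    proxy = random.choice(PROXY_POOL)
    return {"http": proxy, "https": proxy}

# Usage sketch (no request is made here):
#   requests.get(url, proxies=random_proxies(), timeout=10)
```

Calling `random_proxies()` before each request is what spreads the traffic across the pool, so no single exit IP, least of all your company’s, accumulates the whole crawl.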

Getting started with NLP using the PyTorch framework

PyTorch is one of the most popular Deep Learning frameworks; it is based on Python and backed by Facebook. In this article we will look into the classes that PyTorch provides to help with Natural Language Processing (NLP).
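As a taste of the preprocessing those classes expect, here is a plain-Python sketch of building a vocabulary and encoding text as the integer indices that a layer such as `torch.nn.Embedding` consumes (the corpus and reserved tokens are illustrative; PyTorch itself is not needed to run this step):

```python
# Build a token-to-index vocabulary: the usual first step before
# feeding text to an embedding layer such as torch.nn.Embedding.
corpus = ["the cat sat", "the dog sat down"]

vocab = {"<pad>": 0, "<unk>": 1}   # reserved indices for padding/unknowns
for sentence in corpus:
    for token in sentence.split():
        if token not in vocab:
            vocab[token] = len(vocab)

def encode(sentence: str) -> list:
    """Map a sentence to integer indices, sending unseen words to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in sentence.split()]

print(encode("the cat ran"))   # 'ran' is out of vocabulary → [2, 3, 1]
```

An embedding layer sized with `len(vocab)` rows then turns each of these indices into a dense vector, which is where the PyTorch classes covered in the article take over.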

It’s OK to use spreadsheets in data science

With all the great sophisticated data tools that exist out there these days, it’s easy to think that spreadsheets are too primitive for use in serious data science work. The fact that there’s literally 20+ years of literature cautioning people about the evils of spreadsheets makes it sound like a ‘real data professional’ should know better than to use such antiquated things. But the spreadsheet is probably the greatest Swiss army chainsaw for the sorts of ugly data work that no one ever wants to admit they have to do every day. In an ideal world spreadsheets wouldn’t be necessary, but when there’s a combination of tech debt, time pressure, poor data quality, and stakeholders who don’t know anything but spreadsheets, they’re invaluable.

Image-to-Image Translation

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image. It can be applied to a wide range of applications, such as collection style transfer, object transfiguration, season transfer and photo enhancement.

EM Algorithm Explained in One Picture

The EM algorithm finds maximum-likelihood estimates for model parameters when you have incomplete data. The ‘E-Step’ computes probabilities for the assignment of data points, based on the current hypothesized probability density functions; the ‘M-Step’ then re-estimates the model parameters using those assignment probabilities. The cycle repeats until the parameters stabilize.
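A minimal sketch of that cycle, assuming a two-component 1-D Gaussian mixture with a fixed, known standard deviation (a simplification: real implementations also re-estimate the variances, and the data here is a hypothetical toy sample):

```python
import math

# Toy 1-D data drawn from two well-separated clusters (around 1 and 5)
data = [0.8, 1.0, 1.1, 1.3, 4.7, 5.0, 5.1, 5.4]
mu = [0.0, 6.0]            # initial mean guesses
weights = [0.5, 0.5]       # initial mixing proportions
SIGMA = 1.0                # fixed, known standard deviation (a simplification)

def pdf(x, m):
    """Normal density with mean m and standard deviation SIGMA."""
    return math.exp(-((x - m) ** 2) / (2 * SIGMA ** 2)) / math.sqrt(2 * math.pi * SIGMA ** 2)

for _ in range(50):
    # E-step: responsibility of each component for each data point
    resp = []
    for x in data:
        w = [weights[k] * pdf(x, mu[k]) for k in range(2)]
        total = sum(w)
        resp.append([wk / total for wk in w])
    # M-step: re-estimate means and mixing proportions from responsibilities
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        weights[k] = nk / len(data)

print(mu)  # the means settle near the two cluster centres (~1.05 and ~5.05)
```

After a handful of iterations the responsibilities become nearly binary and the means stop moving, which is the ‘parameters stabilize’ stopping condition in miniature.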

Top 10 Artificial Intelligence Trends in 2019

1. Automation of DevOps to achieve AIOps
2. The Emergence of More Machine Learning Platforms
3. Augmented Reality
4. Agent-Based Simulations
5. IoT
6. AI Optimized Hardware
7. Natural Language Generation
8. Streaming Data Platforms
9. Driverless Vehicles
10. Conversational BI and Analytics