Distilled News

Learning in Graphs with Python (Part 3)

Concepts, applications, and examples with Python. Graphs are becoming central to machine learning these days, whether you’d like to understand the structure of a social network by predicting potential connections, detect fraud, understand the behavior of a car rental service’s customers, or make real-time recommendations.

Technical Deep Dive: Random Forests

Random Forests are one of the most popular machine learning models used by data scientists today. How they are actually implemented and the variety of use cases they can be applied to are often overlooked. While this article will focus on the inner workings of Random Forests, we’ll start off by exploring the main problems this model solves.

Early stopping in polynomial regression

Using a deep learning technique to fight overfitting for a simple linear regression model. I was testing an example from the scikit-learn site that demonstrates the problems of underfitting and overfitting and how we can use linear regression with polynomial features to approximate nonlinear functions. Below is a modified version of this code.
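The article's own code is not reproduced in this digest. As an independent illustration of the idea, here is a minimal sketch of early stopping applied to a polynomial fit trained by gradient descent. The toy target function, the degree, the learning rate and the patience value are all assumptions chosen for the example.

```python
import numpy as np

def make_data(n, seed):
    """Noisy samples of cos(1.5*pi*x) on [0, 1] (an assumed toy target)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n)
    y = np.cos(1.5 * np.pi * x) + rng.normal(0.0, 0.1, n)
    return x, y

def fit_with_early_stopping(x_train, y_train, x_val, y_val,
                            degree=9, lr=0.05, max_epochs=5000, patience=50):
    """Gradient descent on polynomial features, stopped when validation MSE stalls."""
    # Design matrices with columns x^0 .. x^degree; inputs are assumed to lie in [0, 1].
    a_train = np.vander(x_train, degree + 1, increasing=True)
    a_val = np.vander(x_val, degree + 1, increasing=True)
    w = np.zeros(degree + 1)
    best_w, best_val, wait = w.copy(), np.inf, 0
    for epoch in range(max_epochs):
        # One full-batch gradient step on the training mean squared error.
        residual = a_train @ w - y_train
        w = w - lr * (2.0 / len(y_train)) * (a_train.T @ residual)
        val_mse = float(np.mean((a_val @ w - y_val) ** 2))
        if val_mse < best_val:
            # Validation error improved: remember these weights, reset the counter.
            best_w, best_val, wait = w.copy(), val_mse, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_w, best_val, epoch + 1
```

The function returns the weights from the best validation epoch rather than the last one, which is what keeps a flexible model from drifting into an overfit.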

5 Weird Ways to Use Data Science

So, I decided to collect 5 cases when data science is used for nothing but fun. Let’s get started, shall we?
1. Game of Thrones Deaths in Season 8, Data Science’s Angle
2. Predicting the outcome of sports game: Syracuse over Michigan State
3. Taylor Swift detector developed with Swift
4. Game of Wines – the ML and data science based detector of wine quality
5. Who’s killing the Academy Awards Game? – predicted by data science

Enabling developers and organizations to use differential privacy

Whether you’re a city planner, a small business owner, or a software developer, gaining useful insights from data can help make services work better and answer important questions. But, without strong privacy protections, you risk losing the trust of your citizens, customers, and users. Differentially-private data analysis is a principled approach that enables organizations to learn from the majority of their data while simultaneously ensuring that those results do not allow any individual’s data to be distinguished or re-identified. This type of analysis can be implemented in a wide variety of ways and for many different purposes. For example, if you are a health researcher, you may want to compare the average amount of time patients remain admitted across various hospitals in order to determine if there are differences in care. Differential privacy is a high-assurance, analytic means of ensuring that use cases like this are addressed in a privacy-preserving manner. Today, we’re rolling out the open-source version of the differential privacy library that helps power some of Google’s core products. To make the library easy for developers to use, we’re focusing on features that can be particularly difficult to execute from scratch, like automatically calculating bounds on user contributions. It is now freely available to any organization or developer that wants to use it.
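To make the hospital-stay example concrete, here is a minimal sketch of a differentially private mean using the Laplace mechanism. It is not the API of the library announced above. The function name, the clipping bounds and the assumption that the number of records is public are all illustrative choices.

```python
import numpy as np

def dp_bounded_mean(values, lower, upper, epsilon, rng):
    """Laplace-mechanism mean of values clipped to [lower, upper].

    Assumes the record count n is public, so replacing one record changes
    the clipped mean by at most (upper - lower) / n.
    """
    clipped = np.clip(np.asarray(values, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    # Noise scale grows with sensitivity and shrinks as epsilon (privacy loss) grows.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(clipped.mean() + noise)

# Hypothetical lengths of stay in days; the 40-day outlier is clipped to 30.
stays = [2, 3, 5, 40, 4]
private_mean = dp_bounded_mean(stays, lower=0, upper=30, epsilon=1.0,
                               rng=np.random.default_rng(0))
```

Clipping is what bounds each patient's contribution. Without it, a single extreme record could shift the mean arbitrarily and no finite amount of noise would hide it.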

Enhancing Static Plots with Animations

This post aims to introduce you to animating ggplot2 visualisations in R using the gganimate package by Thomas Lin Pedersen. The post will visualise the theoretical winnings I would’ve had, had I followed the simple model to predict (or tip as it’s known in Australia) winners in the AFL that I explained in this post. The data used in the analysis was collected from the AFL Tables website as part of a larger series I wrote about on AFL crowds.

Model Evaluation in the Land of Deep Learning

Applications for machine learning and deep learning have become increasingly accessible. For example, Keras provides APIs with TensorFlow backend that enable users to build neural networks without being fluent with TensorFlow. Despite the ease of building and testing models, deep learning has suffered from a lack of interpretability; deep learning models are considered black boxes to many users. In a talk at ODSC West in 2018, Pramit Choudhary explained the importance of model evaluation and interpretability in deep learning and some cutting edge techniques for addressing it.

Comparison of Lightweight Document Classification Models

Document Classification: The task of assigning labels to large bodies of text. In this case the task is to classify news articles into different labels, such as sport or politics. The data set used wasn’t ideally suited for deep learning, having only low thousands of examples, but this is far from an unrealistic case outside larger firms.

Building a Natural Language Processing Pipeline

Copenhagen is the capital and most populous city of Denmark. It sits on the coastal islands of Zealand and Amager. It’s linked to Malmo in southern Sweden by the Oresund Bridge. Indre By, the city’s historic centre, contains Frederiksstaden, an 18th-century rococo district, home to the royal family’s Amalienborg Palace. Nearby is Christiansborg Palace and the Renaissance-era Rosenborg Castle, surrounded by gardens and home to the crown jewels.

Heart disease Classification with Apache beam and Tensorflow Transform

Machine learning models include the step of preprocessing or feature engineering before the data is actually trainable. Feature Engineering includes normalizing and scaling data, encoding categorical values as numerical values, forming vocabularies, and binning of continuous numerical values. Distributed frameworks like Google Cloud Dataflow or Apache Spark are often well known for applying large scale data preprocessing. To remove the inconsistency between training and serving ML models from different environments Google has come up with tf.Transform, a library for TensorFlow that ensures consistency of the feature engineering steps during model training and serving.

Philosophy and the practice of Bayesian statistics

A substantial school in the philosophy of science identifies Bayesian inference with inductive inference and even rationality as such, and seems to be strengthened by the rise and practical success of Bayesian statistics. We argue that the most successful forms of Bayesian statistics do not actually support that particular philosophy but rather accord much better with sophisticated forms of hypothetico-deductivism. We examine the actual role played by prior distributions in Bayesian models, and the crucial aspects of model checking and model revision, which fall outside the scope of Bayesian confirmation theory. We draw on the literature on the consistency of Bayesian updating and also on our experience of applied work in social science. Clarity about these matters should benefit not just philosophy of science, but also statistical practice. At best, the inductivist view has encouraged researchers to fit and compare models without checking them; at worst, theorists have actively discouraged practitioners from performing model checking because it does not fit into their framework.


DeepPrivacy is a fully automatic anonymization technique for images. This repository contains the source code for the paper ‘DeepPrivacy: A Generative Adversarial Network for Face Anonymization’, published at ISVC 2019.

What’s going on on PyPI

Scanning all newly published packages on PyPI, I know that the quality is often quite bad. I try to filter out the worst ones and list here the ones which might be worth a look, worth following, or might inspire you in some way.

Data Science Tools. pychamp is a data science tool intended to ease data science practices.
• configparser : It can be used for parsing configuration file format `json` and `ini`.
• connection : It can be used for creating `connection URL`, `connection engine` and for `executing` any type of sql `queries`.
• features_selection : It can be used for selecting features using `Backward Elimination`, `VIF` and `Features Importance`.
• sampling : It can be used for different types of sampling operations such as `SMOTE`, `SMOTENC` and `ADASYN` for both categorical and numerical features.
• stats : It can be used for `Confidence and Prediction Interval`, `IQR outlier removal` and `Summary Statistics`.
• viz : It can be used for visualization.
• net : It can be used for sending mail currently.
• model : Different types of regression, classification and clustering models can be used.
• eda : It handles data types and missing values.

Python Package for exploratory data analysis in Data Science

PYthon Simple Neural Network – PYSNN is python3 lib for machine learning

The Python document processor

Toolkit for recommender systems. Toolkit for building recommender systems
• Provide CLI interface for running recommendation algorithms
• Contains abstractions you can leverage to build custom recommenders

TopicNet is a module for topic modelling using ARTM algorithm

ATOM is an AutoML package. ATOM is a python package for exploration of ML problems. With just a few lines of code, you can compare the performance of multiple machine learning models on a given dataset, providing a quick insight into which algorithms perform best for the task at hand. Furthermore, ATOM contains a variety of plotting functions to help you analyze the models’ performances.

CAnonical Time-series Features, see description and license on GitHub.

Python implementation of causal trees with validation

Framework to help test Google Cloud Dataflows. This framework aims to help test Google Cloud Platform dataflows in an end-to-end way.

A Python Toolbox for Algorithmic Fairness, Accountability and Transparency

OBA Sparql Manager

Modern Scientific Document Processing Framework. SciWING is a modern framework from WING-NUS to facilitate Scientific Document Processing. It is built on PyTorch and believes in modularity from ground up and easy to use interface. SciWING includes many pre-trained models for fundamental tasks in Scientific Document Processing for practitioners.

Genetic Algorithm, Particle Swarm Optimization, Simulated Annealing, Ant Colony Algorithm in Python

Gaussian & Binomial Distributions

If you did not already know

Apache Avro
Apache Avro is a data serialization system. …

Feudal Multi-agent Hierarchies (FMH)
We investigate how reinforcement learning agents can learn to cooperate. Drawing inspiration from human societies, in which successful coordination of many individuals is often facilitated by hierarchical organisation, we introduce Feudal Multi-agent Hierarchies (FMH). In this framework, a ‘manager’ agent, which is tasked with maximising the environmentally-determined reward function, learns to communicate subgoals to multiple, simultaneously-operating, ‘worker’ agents. Workers, which are rewarded for achieving managerial subgoals, take concurrent actions in the world. We outline the structure of FMH and demonstrate its potential for decentralised learning and control. We find that, given an adequate set of subgoals from which to choose, FMH performs, and particularly scales, substantially better than cooperative approaches that use a shared reward function. …

Knowledge-Guided Generative Adversarial Network (KG-GAN)
Generative adversarial networks (GANs) learn to mimic training data that represents the underlying true data distribution. However, GANs suffer when the training data lacks quantity or diversity and therefore cannot represent the underlying distribution well. To improve the performance of GANs trained on under-represented training data distributions, this paper proposes KG-GAN (Knowledge-Guided Generative Adversarial Network) to fuse domain knowledge with the GAN framework. KG-GAN trains two generators; one learns from data while the other learns from knowledge. To achieve KG-GAN, domain knowledge is formulated as a constraint function to guide the learning of the second generator. We validate our framework on two tasks: fine-grained image generation and hair recoloring. Experimental results demonstrate the effectiveness of KG-GAN. …

Relay
Frameworks for writing, compiling, and optimizing deep learning (DL) models have recently enabled progress in areas like computer vision and natural language processing. Extending these frameworks to accommodate the rapidly diversifying landscape of DL models and hardware platforms presents challenging tradeoffs between expressiveness, composability, and portability. We present Relay, a new intermediate representation (IR) and compiler framework for DL models. The functional, statically-typed Relay IR unifies and generalizes existing DL IRs and can express state-of-the-art models. Relay’s expressive IR required careful design of the type system, automatic differentiation, and optimizations. Relay’s extensible compiler can eliminate abstraction overhead and target new hardware platforms. The design insights from Relay can be applied to existing frameworks to develop IRs that support extension without compromising on expressivity, composability, and portability. Our evaluation demonstrates that the Relay prototype can already provide competitive performance for a broad class of models running on CPUs, GPUs, and FPGAs. …

Distilled News

Document Embedding Techniques

Word embeddings – the mapping of words into numerical vector spaces – have proved to be an incredibly important method for natural language processing (NLP) tasks in recent years, enabling various machine learning models that rely on vector representation as input to enjoy richer representations of text input. These representations preserve more semantic and syntactic information on words, leading to improved performance in almost every imaginable NLP task.
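One simple baseline for turning word vectors into a document vector is to average them. The sketch below assumes a toy, hand-written vocabulary of 3-dimensional vectors (real embeddings would come from a trained model) and is not necessarily one of the techniques the article covers.

```python
import numpy as np

# Toy 3-dimensional word vectors, invented for illustration only.
WORD_VECTORS = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.9, 0.1, 0.0]),
    "stock": np.array([0.0, 1.0, 0.0]),
    "market": np.array([0.0, 0.9, 0.1]),
}

def embed_document(text):
    """Average the vectors of all known words; unknown words are skipped."""
    tokens = [t for t in text.lower().split() if t in WORD_VECTORS]
    if not tokens:
        return np.zeros(3)
    return np.mean([WORD_VECTORS[t] for t in tokens], axis=0)

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

pets = embed_document("cat dog")
finance = embed_document("stock market")
# A document about pets should sit closer to "dog" than a finance document does.
closer_to_dog = cosine(pets, WORD_VECTORS["dog"]) > cosine(finance, WORD_VECTORS["dog"])
```

Averaging ignores word order, so it works best as a cheap baseline against which richer document embeddings can be compared.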

BERT is changing the NLP landscape

BERT is changing the NLP landscape and making chatbots much smarter by enabling computers to better understand speech and respond intelligently in real-time. What Makes BERT so Amazing?
• BERT is a contextual model.
• BERT enables transfer learning.
• BERT can be fine-tuned cheaply and quickly.

Introduction to Neural Networks and Their Key Elements (Part-B) – Hyper-Parameters

In the previous story (Part A) we discussed the structure and three main building blocks of a neural network. This story will take you through the elements which really make neural networks a useful force and separate them from the rest of the machine learning algorithms. Previously we discussed Units/Neurons, Weights/Parameters & Biases; today we will discuss Hyper-Parameters.

Tutorial on Variational Graph Auto-Encoders

Graphs are applicable to many real-world datasets such as social networks, citation networks, chemical graphs, etc. The growing interest in graph-structured data increases the amount of research in graph neural networks. Variational autoencoders (VAEs) embodied the success of variational Bayesian methods in deep learning and have inspired a wide range of ongoing research. Variational graph autoencoder (VGAE) applies the idea of VAE to graph-structured data, which significantly improves predictive performance on a number of citation network datasets such as Cora and Citeseer. I searched on the internet and have yet to see a detailed tutorial on VGAE. In this article, I will briefly talk about traditional autoencoders and variational autoencoders. Furthermore, I will discuss the idea of applying VAE to graph-structured data (VGAE).

Automate Hyperparameter Tuning for your models

When we create our machine learning models, a common task that falls on us is how to tune them. People end up taking different manual approaches. Some of them work, and some don’t, and a lot of time is spent in anticipation and running the code again and again. So that brings us to the quintessential question: Can we automate this process?
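One of the simplest ways to automate tuning is random search over a parameter space. The sketch below uses only the standard library. The objective is a made-up stand-in for a validation loss, and the parameter names `lr` and `reg` are hypothetical. The article itself may use a dedicated tuning library instead.

```python
import random

def random_search(objective, space, n_trials=50, seed=0):
    """Sample each parameter uniformly from its (low, high) range and keep
    the configuration with the lowest objective value."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_objective(params):
    """Stand-in for a validation loss; its minimum is at lr=0.1, reg=1.0."""
    return (params["lr"] - 0.1) ** 2 + (params["reg"] - 1.0) ** 2

space = {"lr": (0.0, 1.0), "reg": (0.0, 2.0)}
best_params, best_score = random_search(toy_objective, space, n_trials=200)
```

In practice `objective` would train a model with the sampled settings and return its validation error, which is where nearly all of the running time goes.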

Is the pain worth it?: Can Rcpp speed up Passing Bablok Regression?

R dogma is that for loops are bad because they are slow, but this is not the case in C++. I had never programmed a line of C++ as of last week but my beloved firstborn started university last week and is enrolled in a C++ intro course, so I thought I would try to learn some and see if it would speed up Passing Bablok regression.

What it is really like to develop a model for a real-world business case

Have you ever taken part in a Kaggle competition? If you are studying, or have studied machine learning it is fairly likely that at some point you will have entered one. It is definitely a great way to put your model building skills into practice and I spent quite a bit of time on Kaggle when I was studying.

A Breakthrough for A.I. Technology: Passing an 8th-Grade Science Test

On Wednesday, the Allen Institute for Artificial Intelligence, a prominent lab in Seattle, unveiled a new system that passed the test with room to spare. It correctly answered more than 90 percent of the questions on an eighth-grade science test and more than 80 percent on a 12th-grade exam.

The Anthropologist of Artificial Intelligence

How do new scientific disciplines get started? For Iyad Rahwan, a computational social scientist with self-described ‘maverick’ tendencies, it happened on a sunny afternoon in Cambridge, Massachusetts, in October 2017. Rahwan and Manuel Cebrian, a colleague from the MIT Media Lab, were sitting in Harvard Yard discussing how to best describe their preferred brand of multidisciplinary research. The rapid rise of artificial intelligence technology had generated new questions about the relationship between people and machines, which they had set out to explore. Rahwan, for example, had been exploring the question of ethical behavior for a self-driving car – should it swerve to avoid an oncoming SUV, even if it means hitting a cyclist? – in his Moral Machine experiment.

Getting Started With Text Preprocessing for Machine Learning & NLP

Based on some recent conversations, I realized that text preprocessing is a severely overlooked topic. A few people I spoke to mentioned inconsistent results from their NLP applications only to realize that they were not preprocessing their text or were using the wrong kind of text preprocessing for their project. With that in mind, I thought of shedding some light on what text preprocessing really is, the different techniques of text preprocessing and a way to estimate how much preprocessing you may need. For those interested, I’ve also made some text preprocessing code snippets in Python for you to try. Now, let’s get started!
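The article's snippets are not included in this digest. As a rough illustration of what a basic preprocessing step can look like, here is a minimal sketch that lowercases text, strips ASCII punctuation and optionally drops stopwords. The tiny stopword list is an assumption made for the example.

```python
import string

# A deliberately tiny stopword list; real projects would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}

def preprocess(text, remove_stopwords=True):
    """Lowercase, remove ASCII punctuation, split on whitespace,
    and optionally drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

tokens = preprocess("The Cat is ON the mat!!")
```

Whether to remove stopwords depends on the task, which is why it is a flag here: they are often noise for topic classification but can matter for sentiment or sequence models.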

Introducing Neural Structured Learning in TensorFlow

We are excited to introduce Neural Structured Learning in TensorFlow, an easy-to-use framework that both novice and advanced developers can use for training neural networks with structured signals. Neural Structured Learning (NSL) can be applied to construct accurate and robust models for vision, language understanding, and prediction in general.

2018 in Review: 10 AI Failures

• Chinese billionaire’s face identified as jaywalker
• Uber self-driving car kills a pedestrian
• IBM Watson comes up short in healthcare
• Amazon AI recruiting tool is gender biased
• DeepFakes reveals AI’s unseemly side
• Google Photo confuses skier and mountain
• LG robot Cloi gets stagefright at its unveiling
• Boston Dynamics robot blooper
• AI World Cup 2018 predictions almost all wrong
• Startup claims to predict IQ from faces

Uber has troves of data on how people navigate cities. Urban planners have begged, pleaded, and gone to court for access. Will they ever get it?

As the deputy director for technology, data, and analysis at the San Francisco County Transportation Authority, Castiglione spends his days manipulating models of the Bay Area and its 7 million residents. From wide-sweeping ridership and traffic data to deep dives into personal travel choices via surveys, his models are able to estimate the number of people who will disembark at a specific train platform at a certain time of day and predict how that might change if a new housing development is built nearby, or if train-frequency is increased. The models are exceedingly complex, because people are so complex. ‘Think about the travel choices you’ve made in the last week, or the last year,’ Castiglione says. ‘How do you time your trips? What tradeoffs do you make? What modes of transportation do you use? How do those choices change from day to day?’ He has the deep voice of an NPR host and the demeanor of a patient professor. ‘The models are complex but highly rational,’ he says.

Visualizing SVM with Python

In my previous article, I introduced the idea behind the classification algorithm Support Vector Machine. Here, I’m going to show you a practical application in Python of what I’ve been explaining, and I will do so by using the well-known Iris dataset. Following the same structure as that article, I will first deal with linearly separable data, then I will move on to non-linearly separable data, so that you can appreciate the power of SVM, which lies in the so-called Kernel Trick.
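The kernel trick can be checked numerically on a small case. The sketch below is an independent illustration, not the article's code. It shows that the degree-2 polynomial kernel on 2-D inputs equals an ordinary dot product in an explicit 3-D feature space, so a linear separator in that space corresponds to a non-linear one in the original space.

```python
import numpy as np

def poly_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return float(np.dot(x, y) ** 2)

def feature_map(x):
    """Explicit feature map for 2-D input whose dot product equals poly_kernel."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
via_kernel = poly_kernel(x, y)                              # computed in 2-D
via_features = float(feature_map(x) @ feature_map(y))       # computed in 3-D
```

Both values agree, but the kernel never builds the higher-dimensional vectors. That saving is what makes kernels such as RBF practical, since their implicit feature space is infinite-dimensional.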

How to generate neural network confidence intervals with Keras

Whether we’re predicting water levels, queue lengths or bike rentals, at HAL24K we do a lot of regression, with everything from random forests to recurrent neural networks. And as good as our models are, we know they can never be perfect. Therefore, whenever we provide our customers with predictions, we also like to include a set of confidence intervals: what range around the prediction will the actual value fall within, with (e.g.) 80% confidence?

If you did not already know

semopy
Structural equation modelling (SEM) is a multivariate statistical technique for estimating complex relationships between observed and latent variables. Although numerous SEM packages exist, each of them has limitations. Some packages are not free or open-source; the most popular package not having this disadvantage is lavaan, but it is written in the R language, which is behind current mainstream tendencies, making it harder to incorporate into developmental pipelines (e.g. bioinformatical ones). Thus we developed the Python package semopy to satisfy those criteria. The paper provides detailed examples of package usage and explains its inner clockworks. Moreover, we developed the unique generator of SEM models to extensively test SEM packages and demonstrated that semopy significantly outperforms lavaan in execution time and accuracy. …

Causaltoolbox
Estimating heterogeneous treatment effects has become extremely important in many fields and often life changing decisions for individuals are based on these estimates, for example choosing a medical treatment for a patient. In recent years, a variety of techniques for estimating heterogeneous treatment effects, each making subtly different assumptions, have been suggested. Unfortunately, there are no compelling approaches that allow identification of the procedure that has assumptions that hew closest to the process generating the data set under study and researchers often select just one estimator. This approach risks making inferences based on incorrect assumptions and gives the experimenter too much scope for p-hacking. A single estimator will also tend to overlook patterns other estimators would have picked up. We believe that the conclusion of many published papers might change had a different estimator been chosen and we suggest that practitioners should evaluate many estimators and assess their similarity when investigating heterogeneous treatment effects. We demonstrate this by applying 32 different estimation procedures to an emulated observational data set; this analysis shows that different estimation procedures may give starkly different estimates. We also provide an extensible R package which makes it straightforward for practitioners to apply our analysis to their data. …

Gradient Episodic Memory
One major obstacle towards AI is the poor ability of models to solve new problems quicker, and without forgetting previously acquired knowledge. To better understand this issue, we study the problem of continual learning, where the model observes, once and one by one, examples concerning a sequence of tasks. First, we propose a set of metrics to evaluate models learning over a continuum of data. These metrics characterize models not only by their test accuracy, but also in terms of their ability to transfer knowledge across tasks. Second, we propose a model for continual learning, called Gradient Episodic Memory (GEM) that alleviates forgetting, while allowing beneficial transfer of knowledge to previous tasks. Our experiments on variants of the MNIST and CIFAR-100 datasets demonstrate the strong performance of GEM when compared to the state-of-the-art. …

AlignFlow
Given unpaired data from multiple domains, a key challenge is to efficiently exploit these data sources for modeling a target domain. Variants of this problem have been studied in many contexts, such as cross-domain translation and domain adaptation. We propose AlignFlow, a generative modeling framework for learning from multiple domains via normalizing flows. The use of normalizing flows in AlignFlow allows for a) flexibility in specifying learning objectives via adversarial training, maximum likelihood estimation, or a hybrid of the two methods; and b) exact inference of the shared latent factors across domains at test time. We derive theoretical results for the conditions under which AlignFlow guarantees marginal consistency for the different learning objectives. Furthermore, we show that AlignFlow guarantees exact cycle consistency in mapping datapoints from one domain to another. Empirically, AlignFlow can be used for data-efficient density estimation given multiple data sources and shows significant improvements over relevant baselines on unsupervised domain adaptation. …

Document worth reading: “Deconstructing Blockchains: A Comprehensive Survey on Consensus, Membership and Structure”

It is no exaggeration to say that since the introduction of Bitcoin, blockchains have become a disruptive technology that has shaken the world. However, the rising popularity of the paradigm has led to a flurry of proposals addressing variations and/or trying to solve problems stemming from the initial specification. This added considerable complexity to the current blockchain ecosystems, amplified by the absence of detail in many accompanying blockchain whitepapers. Through this paper, we set out to explain blockchains in a simple way, taming that complexity through the deconstruction of the blockchain into three simple, critical components common to all known systems: membership selection, consensus mechanism and structure. We propose an evaluation framework with insight into system models, desired properties and analysis criteria, using the decoupled components as criteria. We use this framework to provide clear and intuitive overviews of the design principles behind the analyzed systems and the properties achieved. We hope our effort will help clarify the current state of blockchain proposals and provide directions to the analysis of future proposals.

Distilled News

Building a Backend System for Artificial Intelligence

Let’s explore the challenges involved in building a backend system to store and retrieve high-dimensional data vectors, typical to modern systems that use ‘artificial intelligence’ – image recognition, text comprehension, document search, music recommendations, …
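At its core, such a backend has to store vectors and return the ones closest to a query. The sketch below is a brute-force, in-memory version using cosine similarity. The class name `VectorIndex` and its methods are hypothetical. Systems of the kind the article describes would typically replace the linear scan with an approximate nearest-neighbour index.

```python
import numpy as np

class VectorIndex:
    """In-memory store of unit-normalised vectors with brute-force search."""

    def __init__(self, dim):
        self.dim = dim
        self.ids = []
        self.vectors = np.empty((0, dim))

    def add(self, item_id, vector):
        v = np.asarray(vector, dtype=float)
        if v.shape != (self.dim,):
            raise ValueError(f"expected a vector of length {self.dim}")
        norm = np.linalg.norm(v)
        if norm == 0.0:
            raise ValueError("cannot index an all-zero vector")
        self.ids.append(item_id)
        # Store normalised vectors so a dot product equals cosine similarity.
        self.vectors = np.vstack([self.vectors, v / norm])

    def search(self, query, k=3):
        """Return up to k (id, cosine similarity) pairs, most similar first."""
        q = np.asarray(query, dtype=float)
        q = q / np.linalg.norm(q)
        scores = self.vectors @ q
        top = np.argsort(-scores)[:k]
        return [(self.ids[i], float(scores[i])) for i in top]

index = VectorIndex(dim=2)
index.add("a", [1.0, 0.0])
index.add("b", [0.0, 1.0])
index.add("c", [1.0, 1.0])
results = index.search([1.0, 0.1], k=2)
```

Each query costs time proportional to the number of stored vectors times their dimension. That linear scan is the scaling problem that motivates specialised indexes for large collections.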

Automatic GPUs

A reproducible R / Python approach to getting up and running quickly on GCloud with GPUs in Tensorflow.

Classification algorithm for non-time series data

One of the critical problems of ‘identification’, be it in NLP (speech or text) or in solving an image puzzle from pieces like a jigsaw, is to understand the words, or pieces of data, and the context. The words or pieces individually don’t carry any meaning; tying them together gives an idea of the context. The data itself has patterns, and is broadly classified as either sequential (time-series) data or non-time-series data, which is largely non-sequential or arbitrary. Sentiment analysis of text reports, documents and journals, novels & classics follows a time-series pattern, in the sense that the words themselves follow an order governed by the grammar and the language dictionary. So do stock-price prediction problems, which depend on the predictions of the previous time period and on socio-economic conditions.

Calendar Heatmaps in ggplot

Calendar heatmaps are a neglected, but valuable, way of representing time series data. Their chief advantage is in allowing the viewer to visually process trends in categorical or continuous data over a period of time, while relating these values to their month, week, and weekday context – something that simple line plots do not efficiently allow for. If you are displaying data on staffing levels, stock returns (as we will do here), on-time performance for transit systems, or any other one dimensional data, a calendar heatmap can do wonders for helping your stakeholders note patterns in the interaction between those variables and their calendar context. In this post, I will use stock data in the form of daily closing prices for the SPY – SPDR S&P 500 ETF, the most popular exchange traded fund in the world. ETFs are growing in popularity, so much so that there’s even a podcast devoted entirely to them. For the purposes of this blog post, it’s not necessary to have any familiarity with ETFs or stocks in general. Some knowledge of tidyverse packages and basic R will be helpful, though.

Getting Machine Learning Models Ready For Production

As a Scientist, it’s incredibly satisfying to be given the freedom to experiment by applying new research and rapidly prototyping. This satisfaction can be sustained quite well in a lab environment but can diminish quickly in a corporate environment. This is because of the underlying commercial value motive which science is driven by in a business setting – if it doesn’t add business value to employees or customers, there’s no place for it! Business value, however, goes beyond just being a nifty experiment which shows potential value to employees or customers. In the context of Machine Learning models, the only [business] valuable models, are models in Production! In this blog post, I will take you through the journey which my team and I went through in taking Machine Learning models to Production and some important lessons learnt along the way.

Adversarial Examples – Rethinking the Definition

Adversarial examples are a large obstacle for a variety of machine learning systems to overcome. Their existence shows the tendency of models to rely on unreliable features to maximize performance, which if perturbed, can cause misclassifications with potentially catastrophic consequences. The informal definition of an adversarial example is an input that has been modified in a way that is imperceptible to humans, but is misclassified by a machine learning system whereas the original input was correctly classified.

Data Science is Boring (Part 1)

My boring days of deploying Machine Learning and how I cope.

Parsing Text for Emotion Terms: Analysis & Visualization Using R: Updated Analysis

The motivation for an updated analysis: The first publication of Parsing text for emotion terms: analysis & visualization Using R, published in May 2017, used the function get_sentiments(‘nrc’) that was made available in the tidytext package. Very recently, the nrc lexicon was dropped from the tidytext package and hence the R code in the original publication failed to run. The NRC emotion terms are also available in the lexicon package.

R Neural Network

In the previous four posts I have used multiple linear regression, decision trees, random forest, gradient boosting, and support vector machine to predict MPG for 2019 vehicles. It was determined that svm produced the best model. In this post I am going to use the neuralnet package to fit a neural network to the cars_19 dataset.

Kubernetes: A simple overview

This overview covers the basics of Kubernetes: what it is and what you need to keep in mind before applying it within your organization. The information in this piece is curated from material available on the O’Reilly online learning platform and from interviews with Kubernetes experts.

Quickly understanding process mining by analyzing event logs with Celonis Snap

‘Data is the new oil.’, ‘Our company needs to become more efficient.’, ‘Can we optimize this process?’, ‘Our processes are too complicated.’ – sentences you have heard very often and maybe cannot stand to hear anymore. That is understandable, but there are some real-world benefits that stem from the technologies and discussions behind the super trend of (Big) Data. One of the emerging technologies in this field is, in more ways than one, directly linked to the sentences above. It is process mining. Maybe you have heard of it. Maybe you have not. Harvard Business Review thinks ‘[…] you should be exploring process mining’.

Fine-grained Sentiment Analysis (Part 3): Fine-tuning Transformers

Hands-on transfer learning using a pretrained transformer in PyTorch. This is Part 3 of a series on fine-grained sentiment analysis in Python. Parts 1 and 2 covered the analysis and explanation of six different classification methods on the Stanford Sentiment Treebank fine-grained (SST-5) dataset. In this post, we’ll look at how to improve on past results by building a transformer-based model and applying transfer learning, a powerful method that has been dominating NLP task leaderboards lately.

Industrializing AI & Machine Learning Applications with Kubeflow

Enable data scientists to make scaling and production-ready ML products.

Building and Labeling Image Datasets for Data Science Projects

Using standardized datasets is great for benchmarking new models/pipelines or for competitions. But for me at least, a lot of the fun of data science comes when you get to apply things to a project of your own choosing. One of the key parts of this process is building a dataset, and there are a lot of ways to build image datasets. For certain things I have legitimately just taken screenshots, like when I was sick and built a facial recognition dataset using season 4 of the Flash and annotated it with labelimg. Another route I have taken is downloading a bunch of images by hand, then displaying the images and labeling them in an Excel spreadsheet… For certain projects you might just have to take a bunch of pictures with your phone, as was the case when I made my dice counter. These days I have figured out a few more tricks which make this process a bit easier, and I am working on improving things along the way.

Introducing IceCAPS: Microsoft’s Framework for Advanced Conversation Modeling

The new open source framework that brings multi-task learning to conversational agents. Neural conversation systems and disciplines such as natural language processing (NLP) have seen significant advancements over the last few years. However, most of the current NLP stacks are designed for simple dialogs based on one or two sentences. Structuring more sophisticated conversations that factor in aspects such as personalities or context remains an open challenge. Recently, Microsoft Research unveiled IceCAPS, an open source framework for advanced conversation modeling.

What’s going on on PyPI

Having scanned all newly published packages on PyPI, I know that the quality is often quite bad. I try to filter out the worst ones and list here those that might be worth a look, worth following, or might inspire you in some way.

Side Channel Attack Assisted with Machine Learning

Use Scikit Learn models in Flutter. Easily transpile scikit-learn models to native Dart code aimed at Flutter. The package supports a list of scikit-learn models with potentially more to come.

Slalom GGP library for DataOps automation

Watson TTS Implementation. This Module is designed to convert Text to Speech format. It will generate Wav file for any Text-String passed to the module.


ML classifier package

Enterprise Machine-Learning and Predictive Analytics. Vivid Code is a pioneering software framework for next generation data analysis applications that interconnects collaborative data science with automated machine learning. Based on the **Cloud-Assisted Meta programming** (CAMP) paradigm, the framework allows the usage of Currently Best Fitting (CBF) algorithms. Before code interpretation or compilation, the concrete algorithms that implement the CBF specifications are automatically chosen from local and public catalog servers that host and deploy them. The specification consists of a unique algorithm category, a data domain, and a metric, which defines the meaning of *Best Fitting* within the respective algorithm and data context. An example is the average prediction accuracy within a fixed set of gold standard samples of the data domain (e.g. Latin handwriting samples, spoken word samples, TCGA gene expression data, etc.).

Alyeska /al-ee-EHS-kah/ n. A Data Pipeline Toolkit

Python SDK for a Chinese semantic understanding service

Python package to explore the color of language. compsyn is a package which provides a novel methodology to explore relationships between words and abstract concepts through color. The work arose from a collaboration between the contributors at the Santa Fe Institute’s Complex System Summer School 2019.

deep learning framework from zero

Kubeflow Fairing Python SDK. Python SDK for Kubeflow Fairing components.

Markdown to Jupyter Notebook converter.

Python client for ML Pipelines. Python client for the BitGN Machine Learning Pipelines project.

Memory-efficient probabilistic counter, namely the Morris counter
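For readers unfamiliar with the idea, here is a minimal sketch of how a Morris counter works. It is not the listed package's API: the class name `MorrisCounter` and its methods are assumptions made for illustration. The counter stores only a small exponent and increments it with probability 2^-exponent, so the estimate 2^exponent - 1 tracks the true count on average while using far fewer bits.

```python
import random


class MorrisCounter:
    """Approximate counter that stores only a small exponent (illustrative)."""

    def __init__(self, seed=None):
        self.exponent = 0
        self._rng = random.Random(seed)

    def increment(self):
        """Record one event: bump the exponent with probability 2**-exponent."""
        if self._rng.random() < 2.0 ** -self.exponent:
            self.exponent += 1

    def estimate(self):
        """Return the estimate of the number of events seen so far."""
        return 2 ** self.exponent - 1


counter = MorrisCounter(seed=0)
empty_estimate = counter.estimate()   # no events recorded yet

counter.increment()
first_estimate = counter.estimate()   # the first event is always recorded

for _ in range(10_000):
    counter.increment()
approximate_count = counter.estimate()
```

A single counter has high variance, so the estimate can be off by a large factor. Averaging several independent counters is a common way to tighten it, at the cost of some of the memory savings.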