Neural Network Exchange Format (NNEF)

NNEF reduces machine learning deployment fragmentation by enabling a rich mix of neural network training tools and inference engines to be used by applications across a diverse range of devices and platforms. The goal of NNEF is to enable data scientists and engineers to easily transfer trained networks from their chosen training framework into a wide variety of inference engines. A stable, flexible and extensible standard that equipment manufacturers can rely on is critical for the widespread deployment of neural networks onto edge devices, and so NNEF encapsulates a complete description of the structure, operations and parameters of a trained neural network, independent of the training tools used to produce it and the inference engine used to execute it.

Pandas Tutorial 3: Important Data Formatting Methods (merge, sort, reset_index, fillna)

This is the third episode of my pandas tutorial series. In this one I'll show you four data formatting methods that you might use a lot in data science projects. These are: merge, sort, reset_index and fillna! Of course, there are many others, and at the end of the article, I'll link to a pandas cheat sheet where you can find every function and method you could ever need. Okay! Let's get started!
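
As a rough illustration of the four methods named above, here is a minimal sketch on two made-up tables. The data and column names are hypothetical and are not taken from the tutorial. Note that in current pandas the sorting method is called `sort_values`.

```python
import pandas as pd

# Hypothetical toy tables; names and values are invented for illustration.
animals = pd.DataFrame({
    "animal": ["lion", "zebra", "tiger", "giraffe"],
    "food_kg": [10, 4, 9, 6],
})
diets = pd.DataFrame({
    "animal": ["lion", "zebra", "tiger"],
    "diet": ["meat", "grass", "meat"],
})

# merge: a left join keeps every row of `animals`, even without a match in `diets`.
merged = animals.merge(diets, how="left", on="animal")

# fillna: replace the missing diet that the left join produced for "giraffe".
merged["diet"] = merged["diet"].fillna("unknown")

# sort_values: order rows by a column, largest first.
ranked = merged.sort_values(by="food_kg", ascending=False)

# reset_index: renumber rows 0..n-1 after sorting and drop the old index.
ranked = ranked.reset_index(drop=True)

print(ranked)
```

Passing `drop=True` to `reset_index` avoids keeping the old row labels as an extra column, which is usually what you want after a sort.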

Setting up your AI Dev Environment in 5 Minutes

Whether you’re a novice data science enthusiast setting up TensorFlow for the first time, or a seasoned AI engineer working with terabytes of data, getting your libraries, packages, and frameworks installed is always a struggle. Learn how datmo, an open source python package, helps you get started in minutes.
0. Prerequisites
1. Install datmo
2. Initialize a datmo project
3. Start environment setup
4. Select System Drivers (CPU or GPU)
5. Select an environment
6. Select a language version (if applicable)
7. Launch your workspace

Datmo – An open source model tracking and reproducibility tool for developers.

Workflow tools to help you experiment, deploy, and scale. By data scientists, for data scientists. Datmo is an open source model tracking and reproducibility tool for developers.
Features:
• One command environment setup (languages, frameworks, packages, etc)
• Tracking and logging for model config and results
• Project versioning (model state tracking)
• Experiment reproducibility (re-run tasks)
• Visualize + export experiment history

data.table is Really Good at Sorting

The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes.

In-brief: splashr update + High Performance Scraping with splashr, furrr & TeamHG-Memex’s Aquarium

The development version of splashr now supports authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium and the former is an R interface to it. Unlike xml2::read_html(), splashr renders a URL exactly as a browser does (because it uses a virtual browser) and can return far more than just the HTML from a web page. Splash does need to be running, and it's best to use it in a Docker container.

IML and H2O: Machine Learning Model Interpretability And Feature Explanation

Model interpretability is critical to businesses. If you want to use high performance models (GLM, RF, GBM, Deep Learning, H2O, Keras, xgboost, etc), you need to learn how to explain them. With machine learning interpretability growing in importance, several R packages designed to provide this capability are gaining in popularity. We analyze the IML package in this article. In recent blog posts we assessed LIME for model agnostic local interpretability functionality and DALEX for both local and global machine learning explanation plots. This post examines the iml package (short for Interpretable Machine Learning) to assess its functionality in providing machine learning interpretability to help you determine if it should become part of your preferred machine learning toolbox. We again utilize the high performance machine learning library, h2o, implementing three popular black-box modeling algorithms: GLM (generalized linear models), RF (random forest), and GBM (gradient boosted machines). For those that want a deep dive into model interpretability, the creator of the iml package, Christoph Molnar, has put together a free book: Interpretable Machine Learning. Check it out.

PCA revisited

Principal component analysis (PCA) is a dimensionality reduction technique which might come in handy when building a predictive model or in the exploratory phase of your data analysis. It is often the case that when it would be most handy you have forgotten it exists, but let's neglect this aspect for now 😉

Delayed Impact of Fair Machine Learning

‘Delayed impact of fair machine learning’ (https://…/1803.04383 ) won a best paper award at ICML this year. It's not an easy read (at least it wasn't for me), but fortunately it's possible to appreciate the main results without following all of the proof details. The central question is how to ensure fair treatment across demographic groups in a population when using a score-based machine learning model to decide who gets an opportunity (e.g. is offered a loan) and who doesn't. Most recently we looked at the equal opportunity and equalized odds models. The underlying assumption of the fairness models studied is, of course, that the fairness criteria promote the long-term well-being of the groups they aim to protect. The big result in this paper is that you can easily end up ‘killing them with kindness’ instead. The potential for this to happen exists when there is a feedback loop in place in the overall system. By overall system here, I mean the human system of which the machine learning model is just a small part. Using the loan/no-loan decision that is a popular study vehicle in fairness papers, we need to consider not just (for example) the opportunity that someone in a disadvantaged group has to qualify for a loan, but also what happens in the future as a result of that loan being made. If the borrower eventually defaults, then they will also see a decline in their credit score, which will make it harder for the borrower to obtain additional loans in the future. A successful lending event, on the other hand, may increase the credit score for the borrower.
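
The feedback loop described above can be made concrete with a small toy calculation. This is only a sketch under stated assumptions: the score gain on repayment, the score loss on default, and the repayment probabilities are illustrative numbers chosen for the example, and the function names are hypothetical. It is not the paper's model.

```python
def expected_score_change(p_repay, gain=75.0, loss=150.0):
    """Expected credit score change for one borrower who receives a loan.

    `gain` and `loss` are assumed, illustrative values: the score rises by
    `gain` on repayment and falls by `loss` on default.
    """
    return p_repay * gain - (1.0 - p_repay) * loss


def mean_group_change(repay_probs, threshold):
    """Average expected score change across a whole group when everyone whose
    repayment probability is at least `threshold` gets a loan.

    People who are not offered a loan keep their score (change of zero).
    """
    total = sum(expected_score_change(p) for p in repay_probs if p >= threshold)
    return total / len(repay_probs)


# Hypothetical repayment probabilities for the members of one group.
group = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Compare a strict, a moderate, and a very lenient lending threshold.
for threshold in (0.85, 0.65, 0.35):
    print(threshold, round(mean_group_change(group, threshold), 2))
```

With these assumed numbers, moderately loosening the threshold raises the group's average expected score, while loosening it much further pushes the average below zero, because borrowers who are likely to default lose more than successful borrowers gain.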

What is a CapsNet or Capsule Network?

What is a Capsule Network? What is a Capsule? Is CapsNet better than a Convolutional Neural Network (CNN)? In this article I will talk about all the above questions about CapsNet or Capsule Network released by Hinton.

Predicting Employee Churn in Python

Analyze employee churn. Find out why employees are leaving the company and learn to predict who will leave the company.
In the past, most of the focus was on ‘rates’ such as the attrition rate and the retention rate. HR managers compute past rates and try to predict future rates using data warehousing tools. These rates present the aggregate impact of churn, but this is only half the picture. Another approach is to focus on individual records in addition to the aggregate.
There are lots of case studies available on customer churn. In customer churn, you can predict which customers will stop buying and when. Employee churn is similar to customer churn, but it focuses on the employee rather than the customer. Here, you can predict which employees will leave and when. Employee churn is expensive, and incremental improvements will give significant results. Predicting it helps in designing better retention plans and improving employee satisfaction. In this tutorial, you are going to cover the following topics:
• Employee Churn Analysis
• Data loading and understanding features
• Exploratory data analysis and Data visualization
• Cluster analysis
• Building a prediction model using Gradient Boosting Tree
• Evaluating model performance
• Conclusion

Unveiling Mathematics behind XGBoost

This article is targeted at people who found it difficult to grasp the original paper. Follow me till the end, and I assure you that you will at least get a sense of what is happening underneath the revolutionary machine learning model. Let's get started.

Time series intervention analysis with fuel prices

Look into whether the regional fuel tax in Auckland has led to changes in fuel prices in other regions of New Zealand.

OBDA – Ontology-Based Data Access – A Data Management Paradigm

Ontology-based Data Access (OBDA) is a new paradigm, based on the use of knowledge representation and reasoning techniques, for governing the resources (data, meta-data, services, processes, etc.) of modern information systems. The challenges of data governance:
• Accessing Data
In large organizations data sources are commonly re-shaped by corrective maintenance and to adapt to application requirements, and applications are changed to meet new requirements. The result is that the data stored in different sources and the processes operating over them tend to be redundant, mutually inconsistent, and obscure for large classes of users. So, accessing data means interacting with IT experts who know where the data are and what they mean in the various contexts, and can therefore translate the information need expressed by the user into appropriate queries. This process can be both expensive and time-consuming.
• Data Quality
Data quality is cited often as a critical factor in delivering high value information services. But how can we check data quality, and how can we decide if it is good if we do not have a clear understanding of the semantics that data should bring? Moreover, how can we judge the quality of external data of business partners, clients, or even public sources, that we connect to? Data quality is also crucial for opening data to external organisations, to favor new business opportunities, or even to the public, which we are seeing more of nowadays in the age of Open Data.
• Process Specification
Information systems are key assets for business organisations, which rely not only on data, but also, for instance, on processes and services. Designing and managing processes is an important aspect of information systems, but deciding what a process should do is hard without a clear idea of which data the process will access, and how it will possibly change it. The difficulties come from various factors, including the lack of modelling languages and tools for describing process and data holistically. The problems related to the semantics of data make this task even harder.
• Three-Level Architecture
The key idea of OBDA is to provide users with access to the information in their data sources through a three-level architecture, constituted by the ontology, the sources, and the mapping between the two, where the ontology is a formal description of the domain of interest, and is the heart of the system. Through this architecture, OBDA provides a semantic end-to-end connection between users and data sources, allowing users to directly query data spread across multiple distributed sources, through the familiar vocabulary of the ontology: the user formulates SPARQL queries over the ontology which are transformed, through the mapping layer, into SQL queries over the underlying relational databases.
• The Ontology layer in the architecture is the means for pursuing a declarative approach to information integration, and, more generally, to data governance. The domain knowledge base of the organization is specified through a formal and high level description of both its static and dynamic aspects, represented by the ontology. By making the representation of the domain explicit, we gain re-usability of the acquired knowledge, which is not achieved when the global schema is simply a unified description of the underlying data sources.
• The Mapping layer connects the Ontology layer with the Data Source layer by defining the relationships between the domain concepts on the one hand and the data sources on the other hand. These mappings are not only used for the operation of the information system, but can also be a significant asset for documentation purposes in cases where the information about data is spread across separate pieces of documentation that are often difficult to access and rarely conform to common standards.
• The Data Source layer is constituted by the existing data sources of the organization.
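
To make the three layers more tangible, here is a deliberately tiny sketch of how a query phrased in ontology vocabulary could be rewritten into SQL through a mapping. Everything in it is an assumption made for illustration: the table `emp_tbl`, the terms `Employee`, `hasName` and `worksIn`, and the `rewrite_to_sql` helper are all hypothetical. It also skips SPARQL parsing and reasoning entirely, both of which a real OBDA system would perform.

```python
import sqlite3

# Data Source layer: a hypothetical legacy table with cryptic column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_tbl (e_id INTEGER, e_name TEXT, dept_cd TEXT)")
conn.executemany(
    "INSERT INTO emp_tbl VALUES (?, ?, ?)",
    [(1, "Ada", "RD"), (2, "Grace", "HR")],
)

# Mapping layer: ties each ontology term to the place where its data lives.
# All term, table, and column names here are invented for the example.
MAPPING = {
    "Employee": {"table": "emp_tbl", "key": "e_id"},
    "hasName": {"table": "emp_tbl", "column": "e_name"},
    "worksIn": {"table": "emp_tbl", "column": "dept_cd"},
}


def rewrite_to_sql(concept, properties):
    """Translate 'every instance of `concept` with these properties' into SQL.

    Stands in for a SPARQL query such as:
      SELECT ?name ?dept WHERE { ?e a :Employee ; :hasName ?name ; :worksIn ?dept }
    """
    source = MAPPING[concept]
    columns = []
    for prop in properties:
        target = MAPPING[prop]
        # This toy version only supports properties stored in the concept's table.
        if target["table"] != source["table"]:
            raise ValueError(f"{prop} needs a join, which this sketch omits")
        columns.append(target["column"])
    return f"SELECT {', '.join(columns)} FROM {source['table']} ORDER BY {source['key']}"


# The user asks in ontology terms; the mapping produces SQL for the source.
sql = rewrite_to_sql("Employee", ["hasName", "worksIn"])
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

The point of the design is that the user only ever writes `Employee`, `hasName` and `worksIn`. If the underlying table is renamed or split, only the mapping changes, not the queries.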

Data Scientist guide for getting started with Docker

Docker is an increasingly popular way to create and deploy applications through virtualization, but can it be useful for data scientists? This guide should help you quickly get started.