How to Discover and Classify Metadata using Apache Atlas on Amazon EMR

The boundaries of the enterprise are becoming diffused. You have data on the network, on the endpoint, and in the cloud. Enabling visibility into your data flows is a critical first step to understanding which data is at risk of theft or misuse. You need to know what data you have, where it’s located, and why that data exists in order to properly protect it. This is where data discovery and data classification come into play.

Specialized tools for machine learning development and model governance are becoming essential

A few years ago, we started publishing articles (see ‘Related resources’ at the end of this post) on the challenges facing data teams as they start taking on more machine learning (ML) projects. Along the way, we described a new job role and title – machine learning engineer – focused on creating data products and making data science work in production, a role that was beginning to emerge in the San Francisco Bay Area two years ago. At that time, there weren’t any popular tools aimed at solving the problems facing teams tasked with putting machine learning into practice. About 10 months ago, Databricks announced MLflow, a new open source project for managing machine learning development (full disclosure: Ben Lorica is an advisor to Databricks). We thought that given the lack of clear open source alternatives, MLflow had a decent chance of gaining traction, and this has proven to be the case. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow.

Symbolic Regression, Genetic Programming… or if Kepler had R

A problem researchers often face is that they have a set of data and need to find some functional form, e.g. some kind of physical law, for it. The standard approach is to try different functional forms, like linear, quadratic or polynomial functions with higher-order terms. Also possible is a Fourier analysis with trigonometric functions. But all of those approaches are quite limited, and it would be nice if we had a program to do this for us and come up with a functional form that is both accurate and parsimonious… well, your prayers were heard! This approach is called symbolic regression (also sometimes called free-form regression), a special case of what is called genetic programming. The idea is to give the algorithm a grammar which defines some basic functional building blocks (like addition, subtraction, multiplication, logarithms, trigonometric functions and so on) and then try different combinations in an evolutionary process which keeps the better terms and recombines them into even better-fitting terms. In the end we want to have a nice formula which captures the dynamics of the system without overfitting the noise. The package with which you can do such magic is the gramEvol package (on CRAN).
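
As a rough illustration of the idea (this is not the gramEvol API, and it is written in Python rather than R), the sketch below replaces the evolutionary search with a brute-force enumeration of a tiny, made-up grammar and scores every candidate against approximate orbital data for six planets. The grammar, the function names and the error measure are all assumptions chosen for illustration.

```python
import itertools
import math

# Approximate semi-major axis (AU) and orbital period (years).
# Rounded textbook values, used here only as illustrative data.
PLANETS = [
    (0.387, 0.241),    # Mercury
    (0.723, 0.615),    # Venus
    (1.000, 1.000),    # Earth
    (1.524, 1.881),    # Mars
    (5.203, 11.862),   # Jupiter
    (9.539, 29.457),   # Saturn
]


def candidates(depth):
    """Enumerate (name, function) pairs for a toy grammar:
    expr ::= a | sqrt(expr) | (expr * expr) | (expr + expr)
    up to the given nesting depth."""
    if depth == 0:
        return [("a", lambda a: a)]
    smaller = candidates(depth - 1)
    exprs = list(smaller)
    for name, f in smaller:
        exprs.append((f"sqrt({name})", lambda a, f=f: math.sqrt(f(a))))
    for (n1, f1), (n2, f2) in itertools.product(smaller, repeat=2):
        exprs.append((f"({n1} * {n2})", lambda a, f1=f1, f2=f2: f1(a) * f2(a)))
        exprs.append((f"({n1} + {n2})", lambda a, f1=f1, f2=f2: f1(a) + f2(a)))
    return exprs


def relative_error(f):
    """Mean squared relative error between predicted and observed periods."""
    return sum(((f(a) - t) / t) ** 2 for a, t in PLANETS) / len(PLANETS)


def best_expression(depth=2):
    """Return (name, error) of the best fit; ties go to the shorter formula."""
    scored = [(relative_error(f), len(name), name) for name, f in candidates(depth)]
    error, _, name = min(scored)
    return name, error


if __name__ == "__main__":
    name, error = best_expression()
    print(f"best candidate: T = {name}, mean squared relative error = {error:.2e}")
```

On this data the search should settle on a candidate equivalent to a times sqrt(a), which is Kepler's third law (T² = a³) in these units. A genuine genetic-programming run would not enumerate everything; it would mutate and recombine promising candidates so that much larger grammars remain tractable.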

Automating ethics

We are surrounded by systems that make ethical decisions: systems approving loans, trading stocks, forwarding news articles, recommending jail sentences, and much more. They act for us or against us, but almost always without our consent or even our knowledge. In recent articles, I’ve suggested the ethics of artificial intelligence itself needs to be automated. But my suggestion ignores the reality that ethics has already been automated: merely claiming to make data-based recommendations without taking anything else into account is an ethical stance. We need to do better, and the only way to do better is to build ethics into those systems. This is a problematic and troubling position, but I don’t see any alternative. The problem with data ethics is scale. Scale brings a fundamental change to ethics, and not one that we’re used to taking into account. That’s important, but it’s not the point I’m making here. The sheer number of decisions that need to be made means that we can’t expect humans to make those decisions. Every time data moves from one site to another, from one context to another, from one intent to another, there is an action that requires some kind of ethical decision.

DeepMind and Google: the battle to control artificial intelligence

One afternoon in August 2010, in a conference hall perched on the edge of San Francisco Bay, a 34-year-old Londoner called Demis Hassabis took to the stage. Walking to the podium with the deliberate gait of a man trying to control his nerves, he pursed his lips into a brief smile and began to speak: ‘So today I’m going to be talking about different approaches to building…’ He stalled, as though just realising that he was stating his momentous ambition out loud. And then he said it: ‘AGI’. AGI stands for artificial general intelligence, a hypothetical computer program that can perform intellectual tasks as well as, or better than, a human. AGI will be able to complete discrete tasks, such as recognising photos or translating languages, which are the single-minded focus of the multitude of artificial intelligences (AIs) that inhabit our phones and computers. But it will also add, subtract, play chess and speak French. It will also understand physics papers, compose novels, devise investment strategies and make delightful conversation with strangers. It will monitor nuclear reactions, manage electricity grids and traffic flow, and effortlessly succeed at everything else. AGI will make today’s most advanced AIs look like pocket calculators.

The Responsible Machine Learning Principles

The Responsible Machine Learning Principles are a practical framework put together by domain experts. Their purpose is to provide guidance for technologists to develop machine learning systems responsibly. The eight principles are summarised below.
1. Human augmentation – I commit to assess the impact of incorrect predictions and, when reasonable, design systems with human-in-the-loop review processes.
2. Bias evaluation – I commit to continuously develop processes that allow me to understand, document and monitor bias in development and production.
3. Explainability by justification – I commit to develop tools and processes to continuously improve transparency and explainability of machine learning systems where reasonable.
4. Reproducible operations – I commit to develop the infrastructure required to enable a reasonable level of reproducibility across the operations of ML systems.
5. Displacement strategy – I commit to identify and document relevant information so that business change processes can be developed to mitigate the impact towards workers being automated.
6. Practical accuracy – I commit to develop processes to ensure my accuracy and cost metric functions are aligned to the domain-specific applications.
7. Trust by privacy – I commit to build and communicate processes that protect and handle data with stakeholders that may interact with the system directly and/or indirectly.
8. Data risk awareness – I commit to develop and improve reasonable processes and infrastructure to ensure data and model security are being taken into consideration during the development of machine learning systems.

The emergence of number and syntax units in LSTM language models

Recent work has shown that LSTMs trained on a generic language modeling objective capture syntax-sensitive generalizations such as long-distance number agreement. We have however no mechanistic understanding of how they accomplish this remarkable feat. Some have conjectured it depends on heuristics that do not truly take hierarchical structure into account. We present here a detailed study of the inner mechanics of number tracking in LSTMs at the single neuron level. We discover that long-distance number information is largely managed by two “number units”. Importantly, the behaviour of these units is partially controlled by other units independently shown to track syntactic structure. We conclude that LSTMs are, to some extent, implementing genuinely syntactic processing mechanisms, paving the way to a more general understanding of grammatical encoding in LSTMs.

So you want to deploy multiple containers running different R models?

If you followed the first part of this tutorial series, you have achieved the following things: running your R code with a plumber API inside a Docker container, and activating HTTP Basic authentication and SSL encryption via the NGINX reverse proxy. For a single-container deployment the approach we have seen is absolutely feasible, but once you repeat this process you will notice the redundancy. Why should we install and configure NGINX for every single container instance we are running in our environment? The logical thing to do is to configure this common task of authentication and encryption only once for all deployed containers. This can be achieved by utilizing Docker Swarm (https://…/swarm).


A curated, but probably biased and incomplete, list of awesome machine learning interpretability resources. If you want to contribute to this list (and please do!) read over the contribution guidelines, send a pull request, or contact @jpatrickhall. An incomplete, imperfect blueprint for a more human-centered, lower-risk machine learning. The resources in this repository can be used to do many of these things today. The resources in this repository should not be considered legal compliance advice.

Structural Time Series modeling in TensorFlow Probability

In this post, we introduce tfp.sts, a new library in TensorFlow Probability for forecasting time series using structural time series models. Although predictions of future events are necessarily uncertain, forecasting is a critical part of planning for the future. Website owners need to forecast the number of visitors to their site in order to provision sufficient hardware resources, as well as predict future revenue and costs. Businesses need to forecast future demands for consumer products to maintain sufficient inventory of their products. Power companies need to forecast demand for electricity, to make informed purchases of energy contracts and to construct new power plants. Methods for forecasting time series can also be applied to infer the causal impact of a feature launch or other intervention on user engagement metrics, to infer the current value of difficult-to-observe quantities like the unemployment rate from more readily available information, as well as to detect anomalies in time series data.

Teaching machines to reason about what they see

Researchers combine statistical and symbolic artificial intelligence techniques to speed learning and improve transparency.

Extracting and Learning Semantics from Social Web Data

In this thesis, we extract semantic information from both tagging data created by users of social tagging systems and human navigation data in different semantic-driven social web systems. Our main goal is to construct high-quality and robust vector representations of words which can then be used to measure the relatedness of semantic concepts. First, we show that navigation in the social media systems Wikipedia and BibSonomy is driven by a semantic component. After this, we discuss and extend methods to model the semantic information in tagging data as low-dimensional vectors. Furthermore, we show that tagging pragmatics influences different facets of tagging semantics. We then investigate the usefulness of human navigational paths in several different settings on Wikipedia and BibSonomy for measuring semantic relatedness. Finally, we propose a metric-learning based algorithm to adapt pre-trained word embeddings to datasets containing human judgment of semantic relatedness. This work contributes to the field of studying semantic relatedness between words by proposing methods to extract semantic relatedness from web navigation, to learn high-quality and low-dimensional word representations from tagging data, and to learn semantic relatedness from any kind of vector representation by exploiting human feedback. Applications lie first and foremost in ontology learning for the Semantic Web, but also in semantic search and query expansion.

Statistical methods and machine learning in weather and climate modeling

Predicting future weather or climates requires modeling a hugely complex and chaotic system, the atmosphere. Remarkable progress has been made over the last decades but stubborn errors remain. Two main contributors to these errors are the parameterization of subgrid processes such as clouds and the uncertainty that comes from the chaotic evolution of the atmosphere. In this thesis, four studies are presented that tackle these issues with statistical and machine learning approaches.

A review of spline function procedures in R

With progress on both the theoretical and the computational fronts, the use of spline modelling has become an established tool in statistical regression analysis. An important issue in spline modelling is the availability of user-friendly, well-documented software packages. Following the idea of the STRengthening Analytical Thinking for Observational Studies initiative to provide users with guidance documents on the application of statistical methods in observational research, the aim of this article is to provide an overview of the most widely used spline-based techniques and their implementation in R.