Use Amazon Glue Component to Store Metadeta

Amazon Web Services (AWS) was the first significant player to offer reasonably priced cloud infrastructure and services, and it continues to be the single largest vendor in the cloud market. With AWS, businesses have access to extremely durable storage, cost-effective compute power, high-performing databases and more without the hassle of provisioning and managing infrastructure. AWS services are available without any up-front investments, and you pay for only what you use.
ETL is one of the primary tasks in the analytics industry. You can do ETL in AWS in a few different ways:
1.AWS Glue
2.Data pipeline
3.A custom solution, e.g. a Docker
In this blog, we are going to focus the ETL part through AWS Glue. We will also look at components of AWS Glue. Furthermore, we will try to play with AWS glue a bit in order to understand it in depth.

What is ROC?

ROC curve is used to find out the accuracy of classifiers. In a data science interview, different questions might directly come up around ROC curves and AUC score. There are also some interview questions around classification where ROC curves could help the interviewee decide between different classifiers and make more educated judgements around what would work better on a particular problem or dataset. They are one of the most handy tools in a Data Scientist’s toolkit. The article aims to understand the basic concept of ROC and how to use them in the context of a problem and eventually, a data science interview.

Causability and explainabilty of artificial intelligence in medicine

Explainable artificial intelligence (AI) is attracting much interest in medicine. Technically, the problem of explainability is as old as AI itself and classic AI represented comprehensible retraceable approaches. However, their weakness was in dealing with uncertainties of the real world. Through the introduction of probabilistic learning, applications became increasingly successful, but increasingly opaque. Explainable AI deals with the implementation of transparency and traceability of statistical black-box machine learning methods, particularly deep learning (DL). We argue that there is a need to go beyond explainable AI. To reach a level of explainable medicine we need causability. In the same way that usability encompasses measurements for the quality of use, causability encompasses measurements for the quality of explanations. In this article, we provide some necessary definitions to discriminate between explainability and causability as well as a use-case of DL interpretation and of human explanation in histopathology. The main contribution of this article is the notion of causability, which is differentiated from explainability in that causability is a property of a person, while explainability is a property of a system.

Top 10 Coding Mistakes Made by Data Scientists

A data scientist is a ‘person who is better at statistics than any software engineer and better at software engineering than any statistician’. Many data scientists have a statistics background and little experience with software engineering. I’m a senior data scientist ranked top 1% on Stackoverflow for python coding and work with a lot of (junior) data scientists. Here is my list of 10 common mistakes I frequently see.
1. Don’t share data referenced in code
2. Hardcode inaccessible paths
3. Mix data with code
4. Git commit data with source code
5. Write functions instead of DAGs
6. Write for loops
7. Don’t write unit tests
8. Don’t document code
9. Save data as csv or pickle
10. Use jupyter notebooks

The Need for Analytic and Algorithm Governance is Growing

Two articles in the last week show the fallacy of the idea of perfect information. In the Economist this week there is ‘How Price-bots can conspire against consumers – and how trustbusters might thwart them’. This article reports on an illicit cooperation between price bots setting gas prices in the US. In today’s Wall Street Journal there is an article titled, ‘To Set Prices, Stores turn to Algorithms’ that reports on a similar story in Germany. The point of the message is that when algorithms are charged with maximizing certain objectives, they might accidentally ‘collude’ (more precisely, behave as if they were colluding since they cannot) and the result has been, so the observations states, a rise in prices. How can this be? Why do I note this today? In yesterday’s US print edition of the Wall Street Journal there was a fine insert with the title and focus: Artificial Intelligence. It is a quick snapshot on the state of the market of AI. One startling item concerns an innovation being worked on that seems to suggest that those of us that are ill with heart disease, but might not yet know it, might be helped simply by talking to a specific app trained with AI to watch for certain heart disease markets. Yes, markers that can be found in ones’ voice!

Comment on Frameworks for Causal Inference

Rose compares three frameworks for causal inference: Potential Outcomes, DAGS, and Campbell’s Framework. I argue that Potential Outcomes is more appropriately thought of as a conceptual analysis of the notion of a causal effect and offer some objections to that analysis. I also contend that DAGS allow for a more precise definition of the concept of confounding variable than the one more typically seen in the social sciences. And I argue that what Rose regards as limitations of Potential Outcomes and DAGS aren’t really limitations. Both Potential Outcomes and DAGS were developed for certain purposes. Rose’s claim about limitations amounts to expecting each to do something it wasn’t designed to do.

Azure Data Studio in One Picture

Azure Data Studio is a cross-platform database tool for data professionals using the Microsoft family of on-premises and cloud data platforms on Windows, MacOS, and Linux. Previously released under the preview name SQL Operations Studio, Azure Data Studio offers a modern editor experience with Intellisense, code snippets, source control integration, and an integrated terminal. It is engineered with the data platform user in mind, with built in charting of query result sets and customizable dashboards.

Picasso: A Sparse Learning Library for High Dimensional Data Analysis in R and Python

We describe a new library named picasso, which implements a unified framework of pathwise coordinate optimization for a variety of sparse learning problems (e.g., sparse linear regression, sparse logistic regression, sparse Poisson regression and scaled sparse linear regression) combined with efficient active set selection strategies. Besides, the library allows users to choose different sparsity-inducing regularizers, including the convex l1 l1 , nonvoncex MCP and SCAD regularizers. The library is coded in \texttt{C++} and has user-friendly R and Python wrappers. Numerical experiments demonstrate that picasso can scale up to large problems efficiently.

Optimal Policies for Observing Time Series and Related Restless Bandit Problems

The trade-off between the cost of acquiring and processing data, and uncertainty due to a lack of data is fundamental in machine learning. A basic instance of this trade-off is the problem of deciding when to make noisy and costly observations of a discrete-time Gaussian random walk, so as to minimise the posterior variance plus observation costs. We present the first proof that a simple policy, which observes when the posterior variance exceeds a threshold, is optimal for this problem. The proof generalises to a wide range of cost functions other than the posterior variance. It is based on a new verification theorem by Nino-Mora that guarantees threshold structure for Markov decision processes, and on the relation between binary sequences known as Christoffel words and the dynamics of discontinuous nonlinear maps, which frequently arise in physics, control and biology. This result implies that optimal policies for linear-quadratic-Gaussian control with costly observations have a threshold structure. It also implies that the restless bandit problem of observing multiple such time series, has a well-defined Whittle index policy. We discuss computation of that index, give closed-form formulae for it, and compare the performance of the associated index policy with heuristic policies.

NetSDM: Semantic Data Mining with Network Analysis

Semantic data mining (SDM) is a form of relational data mining that uses annotated data together with complex semantic background knowledge to learn rules that can be easily interpreted. The drawback of SDM is a high computational complexity of existing SDM algorithms, resulting in long run times even when applied to relatively small data sets. This paper proposes an effective SDM approach, named NetSDM, which first transforms the available semantic background knowledge into a network format, followed by network analysis based node ranking and pruning to significantly reduce the size of the original background knowledge. The experimental evaluation of the NetSDM methodology on acute lymphoblastic leukemia and breast cancer data demonstrates that NetSDM achieves radical time efficiency improvements and that learned rules are comparable or better than the rules obtained by the original SDM algorithms.

Kernels for Sequentially Ordered Data

We present a novel framework for learning with sequential data of any kind, such as multivariate time series, strings, or sequences of graphs. The main result is a ‘sequentialization’ that transforms any kernel on a given domain into a kernel for sequences in that domain. This procedure preserves properties such as positive definiteness, the associated kernel feature map is an ordered variant of sample (cross-)moments, and this sequentialized kernel is consistent in the sense that it converges to a kernel for paths if sequences converge to paths (by discretization). Further, classical kernels for sequences arise as special cases of this method. We use dynamic programming and low-rank techniques for tensors to provide efficient algorithms to compute this sequentialized kernel.

Neural Architecture Search: A Survey

Deep Learning has enabled remarkable progress over the last years on a variety of tasks, such as image recognition, speech recognition, and machine translation. One crucial aspect for this progress are novel neural architectures. Currently employed architectures have mostly been developed manually by human experts, which is a time-consuming and error-prone process. Because of this, there is growing interest in automated \emph{neural architecture search} methods. We provide an overview of existing work in this field of research and categorize them according to three dimensions: search space, search strategy, and performance estimation strategy.

Deep Reinforcement Learning for Swarm Systems

Recently, deep reinforcement learning (RL) methods have been applied successfully to multi-agent scenarios. Typically, the observation vector for decentralized decision making is represented by a concatenation of the (local) information an agent gathers about other agents. However, concatenation scales poorly to swarm systems with a large number of homogeneous agents as it does not exploit the fundamental properties inherent to these systems: (i) the agents in the swarm are interchangeable and (ii) the exact number of agents in the swarm is irrelevant. Therefore, we propose a new state representation for deep multi-agent RL based on mean embeddings of distributions, where we treat the agents as samples and use the empirical mean embedding as input for a decentralized policy. We define different feature spaces of the mean embedding using histograms, radial basis functions and neural networks trained end-to-end. We evaluate the representation on two well-known problems from the swarm literature in a globally and locally observable setup. For the local setup we furthermore introduce simple communication protocols. Of all approaches, the mean embedding representation using neural network features enables the richest information exchange between neighboring agents, facilitating the development of complex collective strategies.

Tunability: Importance of Hyperparameters of Machine Learning Algorithms

Modern supervised machine learning algorithms involve hyperparameters that have to be set before running them. Options for setting hyperparameters are default values from the software package, manual configuration by the user or configuring them for optimal predictive performance by a tuning procedure. The goal of this paper is two-fold. Firstly, we formalize the problem of tuning from a statistical point of view, define data-based defaults and suggest general measures quantifying the tunability of hyperparameters of algorithms. Secondly, we conduct a large-scale benchmarking study based on 38 datasets from the OpenML platform and six common machine learning algorithms. We apply our measures to assess the tunability of their parameters. Our results yield default values for hyperparameters and enable users to decide whether it is worth conducting a possibly time consuming tuning strategy, to focus on the most important hyperparameters and to choose adequate hyperparameter spaces for tuning.

Analysis of spectral clustering algorithms for community detection: the general bipartite setting

We consider spectral clustering algorithms for community detection under a general bipartite stochastic block model (SBM). A modern spectral clustering algorithm consists of three steps: (1) regularization of an appropriate adjacency or Laplacian matrix (2) a form of spectral truncation and (3) a kmeans type algorithm in the reduced spectral domain. We focus on the adjacency-based spectral clustering and for the first step, propose a new data-driven regularization that can restore the concentration of the adjacency matrix even for the sparse networks. This result is based on recent work on regularization of random binary matrices, but avoids using unknown population level parameters, and instead estimates the necessary quantities from the data. We also propose and study a novel variation of the spectral truncation step and show how this variation changes the nature of the misclassification rate in a general SBM. We then show how the consistency results can be extended to models beyond SBMs, such as inhomogeneous random graph models with approximate clusters, including a graphon clustering problem, as well as general sub-Gaussian biclustering. A theme of the paper is providing a better understanding of the analysis of spectral methods for community detection and establishing consistency results, under fairly general clustering models and for a wide regime of degree growths, including sparse cases where the average expected degree grows arbitrarily slowly.

Robust Frequent Directions with Application in Online Learning

The frequent directions (FD) technique is a deterministic approach for online sketching that has many applications in machine learning. The conventional FD is a heuristic procedure that often outputs rank deficient matrices. To overcome the rank deficiency problem, we propose a new sketching strategy called robust frequent directions (RFD) by introducing a regularization term. RFD can be derived from an optimization problem. It updates the sketch matrix and the regularization term adaptively and jointly. RFD reduces the approximation error of FD without increasing the computational cost. We also apply RFD to online learning and propose an effective hyperparameter-free online Newton algorithm. We derive a regret bound for our online Newton algorithm based on RFD, which guarantees the robustness of the algorithm. The experimental studies demonstrate that the proposed method outperforms state-of-the-art second order online learning algorithms.

Musings on missing data

As I started to develop simulations to highlight key issues, I found myself getting bogged down in the data generation process. Once I realized I needed to be systematic about thinking how to generate various types of missingness, I thought maybe DAGs would help to clarify some of the issues (I’ve written a bit about DAGS before and provided some links to some good references). I figured that I probably wasn’t the first to think of this, and a quick search confirmed that there is indeed a pretty rich literature on the topic. I first found this blog post by Jake Westfall, which, in addition to describing many of the key issues that I want to address here, provides some excellent references, including this paper by Daniel et al and this one by Mohan et al.