Data Lineage Analysis google
“Data lineage is defined as a data life cycle that includes the data’s origins and where it moves over time.” It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources. It also enables replaying specific portions or inputs of the dataflow for step-wise debugging or regenerating lost output. In fact, database systems have used such information, called data provenance, to address similar validation and debugging challenges already.
Data provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins. The generated evidence supports essential forensic activities such as data-dependency analysis, error/compromise detection and recovery, and auditing and compliance analysis. “Lineage is a simple type of why provenance.” …


Deep Temporal Clustering (DTC) google
Unsupervised learning of time series data, also known as temporal clustering, is a challenging problem in machine learning. Here we propose a novel algorithm, Deep Temporal Clustering (DTC), to naturally integrate dimensionality reduction and temporal clustering into a single end-to-end learning framework, fully unsupervised. The algorithm utilizes an autoencoder for temporal dimensionality reduction and a novel temporal clustering layer for cluster assignment. Then it jointly optimizes the clustering objective and the dimensionality reduction objec tive. Based on requirement and application, the temporal clustering layer can be customized with any temporal similarity metric. Several similarity metrics and state-of-the-art algorithms are considered and compared. To gain insight into temporal features that the network has learned for its clustering, we apply a visualization method that generates a region of interest heatmap for the time series. The viability of the algorithm is demonstrated using time series data from diverse domains, ranging from earthquakes to spacecraft sensor data. In each case, we show that the proposed algorithm outperforms traditional methods. The superior performance is attributed to the fully integrated temporal dimensionality reduction and clustering criterion. …

Knowledge-Based Distant Regularization Framework google
Exploiting the appropriate inductive bias based on the knowledge of data is essential for achieving good performance in statistical machine learning. In practice, however, the domain knowledge of interest often provides information on the relationship of data attributes only distantly, which hinders direct utilization of such domain knowledge in popular regularization methods. In this paper, we propose the knowledge-based distant regularization framework, in which we utilize the distant information encoded in a knowledge graph for regularization of probabilistic model estimation. In particular, we propose to impose prior distributions on model parameters specified by knowledge graph embeddings. As an instance of the proposed framework, we present the factor analysis model with the knowledge-based distant regularization. We show the results of preliminary experiments on the improvement of the generalization capability of such model. …

Advertisements