Temporal representation for mining scientific data provenance


Chen P., Plale B., Aktaş M. S.

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF GRID COMPUTING AND ESCIENCE, vol.36, pp.363-378, 2014 (SCI-Expanded)

Abstract

Provenance of digital scientific data is a distinct piece of metadata about a data object. It can serve, for instance, as a "ground truth" for determining the cause of an execution failure, or it can explain a particular result to a researcher intending to reuse a data object. Provenance can quickly grow voluminous and be quite feature rich, requiring new structures and concepts that support data mining. We propose a representation of data provenance using logical time that reduces the feature space of the provenance. The temporal representation supports clustering, classification, and association rule mining. This paper studies the full utility of the temporal representation through an empirical evaluation and an identification of the data mining algorithms that are most effective when applied to the proposed representation. The evaluation is carried out against a multi-gigabyte semi-synthetic provenance dataset built from a range of scientific workflows, and against a real one-month provenance dataset gathered from a satellite instrument. Analyzing the results with two clustering metrics, purity and Normalized Mutual Information (NMI), we determine that the k-means algorithm gives the best clustering with the proposed temporal representation, while still yielding provenance-useful information. (C) 2013 Elsevier B.V. All rights reserved.
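
For readers unfamiliar with the two clustering metrics named in the abstract, the following is a minimal sketch of how purity and NMI are commonly computed from ground-truth labels and cluster assignments. It is not taken from the paper: the function names are illustrative, and the NMI normalization by the arithmetic mean of the two entropies is an assumption, since the abstract does not state which normalization the authors use.

```python
from collections import Counter
from math import log


def purity(labels_true, labels_pred):
    """Fraction of points that belong to the majority class of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, Counter())[t] += 1
    majority_total = sum(max(c.values()) for c in clusters.values())
    return majority_total / len(labels_true)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two label entropies.

    Assumption: arithmetic-mean normalization; other variants exist.
    """
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    mi = sum(
        (c / n) * log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    # If both labelings are constant, both entropies are zero; treat as agreement.
    return mi / denom if denom > 0 else 1.0
```

Both scores lie in the range 0 to 1. Purity rewards clusters dominated by a single class but can be inflated by using many small clusters, which is one reason it is usually reported together with NMI.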