Temporal representation for mining scientific data provenance

Chen, Peng; Plale, Beth; Aktaş, Mehmet

doi:10.1016/j.future.2013.09.032

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF GRID COMPUTING AND ESCIENCE, vol.36, pp.363-378, 2014 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 36
Publication Date: 2014
DOI Number: 10.1016/j.future.2013.09.032
Journal Name: FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF GRID COMPUTING AND ESCIENCE
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
Pages: pp.363-378
Keywords: Provenance, Temporal representation, Data mining
Yıldız Technical University Affiliated: Yes

Abstract

Provenance of digital scientific data is a distinct piece of metadata about a data object. It can serve as "ground truth" for determining the cause of an execution failure, for instance, or can explain a particular result to a researcher intending to reuse a data object. Provenance can quickly grow voluminous and feature-rich, requiring new structures and concepts that support data mining. We propose a representation of data provenance using logical time that reduces the feature space of the provenance. The temporal representation supports clustering, classification, and association rule mining. This paper studies the utility of the temporal representation through an empirical evaluation and identifies the data mining algorithms that are most effective when applied to the proposed representation. The evaluation is carried out against a multi-gigabyte semi-synthetic provenance dataset built from a range of scientific workflows, and against a real one-month provenance dataset gathered from a satellite instrument. By analyzing the results with two clustering metrics, purity and Normalized Mutual Information (NMI), we determine that the k-means algorithm gives the best clustering with the proposed temporal representation, while still yielding provenance-useful information. © 2013 Elsevier B.V. All rights reserved.
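
The abstract names purity and NMI as its clustering metrics without defining them. The sketch below shows one common way to compute both from a list of reference class labels and a list of cluster assignments. It is an illustration only, not the paper's code: the function names are hypothetical, and normalizing mutual information by the arithmetic mean of the two entropies is an assumption, since the abstract does not say which NMI variant the authors used.

```python
import math
from collections import Counter


def purity(true_labels, cluster_ids):
    """Fraction of items that belong to the majority class of their cluster."""
    pairs = Counter(zip(cluster_ids, true_labels))
    best = {}  # cluster id -> size of its largest single-class group
    for (cluster, _label), count in pairs.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())


def nmi(true_labels, cluster_ids):
    """Mutual information divided by the mean of the two entropies.

    Assumption: arithmetic-mean normalization; other variants exist.
    """
    n = len(true_labels)
    joint = Counter(zip(cluster_ids, true_labels))
    cluster_sizes = Counter(cluster_ids)
    class_sizes = Counter(true_labels)
    mi = sum(
        (k / n) * math.log(k * n / (cluster_sizes[c] * class_sizes[t]))
        for (c, t), k in joint.items()
    )
    denom = (entropy(true_labels) + entropy(cluster_ids)) / 2
    # Both partitions trivial (single class, single cluster): treat as agreement.
    return mi / denom if denom > 0 else 1.0


if __name__ == "__main__":
    # Made-up labels for illustration; not data from the paper.
    classes = ["a", "a", "a", "b", "b", "b"]
    clusters = [0, 0, 1, 1, 1, 1]
    print("purity:", purity(classes, clusters))
    print("NMI:", nmi(classes, clusters))
```

Purity rewards clusters dominated by a single class but can be inflated by using many small clusters, which is one reason it is often reported alongside NMI, a score that accounts for the sizes of both partitions.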