On the Provenance Extraction Techniques from Large Scale Log Files: A Case Study for the Numerical Weather Prediction Models

Tufek A., AKTAŞ M. S.

Workshops held at the 26th International Conference on Parallel and Distributed Computing, Euro-Par 2020, Virtual, Online, 24 - 25 August 2020, vol.12480 LNCS, pp.249-260

  • Publication Type: Conference Paper / Full Text
  • Volume: 12480 LNCS
  • Doi Number: 10.1007/978-3-030-71593-9_20
  • City: Virtual, Online
  • Page Numbers: pp.249-260
  • Keywords: Machine learning-based provenance extraction, Numerical weather prediction models, Provenance, Provenance analysis, Weather forecast models
  • Yıldız Technical University Affiliated: Yes


Severe meteorological events increasingly highlight the importance of fast and accurate weather forecasting. Various Numerical Weather Prediction (NWP) models are run worldwide, on either a local or a global scale, to predict future weather. Depending on the input parameters and the size of the forecast domain, however, an NWP model typically takes hours to finish a complete run. Provenance information is therefore of central importance for detecting unexpected events that may develop during model execution and for taking the necessary action as early as possible. In addition, the need to share scientific data and results among researchers highlights the importance of data quality and reliability. In this study, we develop a framework for tracking the Weather Research and Forecasting (WRF) model and for generating, storing, and analyzing provenance data. We develop a machine-learning-based log parser so that the proposed system is dynamic and can adapt to different data and rules. The proposed system makes numerical weather forecast workflows easier to manage and understand by providing provenance graphs. By analyzing these graphs, potentially faulty situations that occur during the execution of WRF can be traced to their root causes. The proposed system has been evaluated and shown to perform well even under a high-frequency provenance information flow.
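
The idea of tracing a faulty result back to its root causes through a provenance graph can be illustrated with a minimal sketch. This is not the authors' implementation: the log format, the step and file names, and the helper functions below are all hypothetical, and a simple regular expression stands in for the machine-learning-based parser described in the abstract. The edge labels borrow the "used" and "wasGeneratedBy" relations from the W3C PROV vocabulary, which is an assumption about how such a graph might be modelled.

```python
import re
from collections import defaultdict

# Hypothetical log format (NOT actual WRF output):
#   "<time> <activity> used=<a,b> generated=<c,d>"
LINE_RE = re.compile(
    r"^(?P<time>\S+) (?P<activity>\S+) "
    r"used=(?P<used>\S*) generated=(?P<generated>\S*)$"
)


def parse_line(line):
    """Turn one log line into a provenance record, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None

    def split(field):
        return [item for item in field.split(",") if item]

    return {
        "time": m["time"],
        "activity": m["activity"],
        "used": split(m["used"]),
        "generated": split(m["generated"]),
    }


def build_graph(records):
    """Map each node to the set of nodes it directly depends on."""
    depends_on = defaultdict(set)
    for rec in records:
        for entity in rec["generated"]:
            depends_on[entity].add(rec["activity"])  # entity wasGeneratedBy activity
        for entity in rec["used"]:
            depends_on[rec["activity"]].add(entity)  # activity used entity
    return depends_on


def trace_roots(depends_on, node):
    """Walk upstream from node and return the nodes that have no dependencies."""
    seen, stack, roots = set(), [node], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parents = depends_on.get(current, set())
        if not parents:
            roots.add(current)
        stack.extend(parents)
    return roots


# Hypothetical two-step workflow log.
LOG = """\
t1 preprocess used=raw_input generated=init_conditions
t2 simulate used=init_conditions,config generated=forecast
"""

records = [r for r in map(parse_line, LOG.splitlines()) if r is not None]
graph = build_graph(records)
print(sorted(trace_roots(graph, "forecast")))  # ['config', 'raw_input']
```

In this toy example, a suspicious "forecast" entity is traced back through the "simulate" and "preprocess" activities to the two original inputs, which is the kind of root-cause question a provenance graph is meant to answer.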