Data assimilation with missing data in nonstationary environments for probabilistic machine learning models

In this study, we further develop the data assimilation framework proposed for probabilistic Machine Learning (ML) models, named Probabilistic Optimal Interpolation (POI), in nonstationary environments with missing data, which are common in real-world situations. The dataset is based on a multi-scale...

Bibliographic Details
Main Authors: Wei, Yuying, Law, Adrian Wing-Keung, Yang, Chun
Other Authors: School of Civil and Environmental Engineering
Format: Article
Language: English
Published: 2024
Subjects: Engineering::Civil engineering, Data Assimilation, Missing Data
Online Access: https://hdl.handle.net/10356/173067
Institution: Nanyang Technological University
id sg-ntu-dr.10356-173067
record_format dspace
spelling sg-ntu-dr.10356-173067
2024-01-12T15:34:14Z
Data assimilation with missing data in nonstationary environments for probabilistic machine learning models
Wei, Yuying
Law, Adrian Wing-Keung
Yang, Chun
School of Civil and Environmental Engineering
Interdisciplinary Graduate School (IGS)
School of Mechanical and Aerospace Engineering
Environmental Process Modelling Centre
Nanyang Environment and Water Research Institute
Engineering::Civil engineering
Data Assimilation
Missing Data
In this study, we further develop the data assimilation framework proposed for probabilistic Machine Learning (ML) models, named Probabilistic Optimal Interpolation (POI), in nonstationary environments with missing data, which are common in real-world situations. The dataset is based on a multi-scale Lorenz 96 chaotic system. Three types of nonstationary environments (i.e., trend, heteroscedasticity, and random walk) are introduced in the dataset. In addition, the test datasets are masked with different missingness rates to evaluate the POI performance under scenarios with missing values. This study utilizes several filters to identify background noise for observation covariance initialization, and the covariance is updated during real-time data assimilation specifically for nonstationary environments. The results show that heteroscedastic noise can be identified well, whereas random-walk noise is very difficult to analyze. Overall, the results show that the POI implementation can lead to reduced uncertainty, but POI performance can also be significantly affected by the limited accuracy of the ML models in nonstationary environments. The impact of missing values is then examined and compared between stationary and nonstationary environments. Both prediction and POI updates are more accurate with smaller missingness rates, as expected, and whether or not POI is bypassed at missing points does not affect the overall performance significantly. Finally, input evolution can perform well with POI under high noise levels and missingness rates in stationary environments, but it always yields worse results in nonstationary environments and thus is not recommended.
National Research Foundation (NRF)
Public Utilities Board (PUB)
Submitted/Accepted version
This research / project is supported by the National Research Foundation, Singapore, and PUB, Singapore’s National Water Agency under its RIE2025 Urban Solutions and Sustainability (USS) (Water) Centre of Excellence (CoE) Programme, awarded to Nanyang Environment & Water Research Institute (NEWRI), Nanyang Technological University, Singapore (NTU).
2024-01-10T07:01:12Z
2024-01-10T07:01:12Z
2023
Journal Article
Wei, Y., Law, A. W. & Yang, C. (2023). Data assimilation with missing data in nonstationary environments for probabilistic machine learning models. Journal of Computational Science, 74, 102151-. https://dx.doi.org/10.1016/j.jocs.2023.102151
1877-7503
https://hdl.handle.net/10356/173067
10.1016/j.jocs.2023.102151
2-s2.0-85174034277
74
102151
en
Journal of Computational Science
© 2023 Elsevier B.V. All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at http://doi.org/10.1016/j.jocs.2023.102151.
application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Civil engineering
Data Assimilation
Missing Data
spellingShingle Engineering::Civil engineering
Data Assimilation
Missing Data
Wei, Yuying
Law, Adrian Wing-Keung
Yang, Chun
Data assimilation with missing data in nonstationary environments for probabilistic machine learning models
description In this study, we further develop the data assimilation framework proposed for probabilistic Machine Learning (ML) models, named Probabilistic Optimal Interpolation (POI), in nonstationary environments with missing data, which are common in real-world situations. The dataset is based on a multi-scale Lorenz 96 chaotic system. Three types of nonstationary environments (i.e., trend, heteroscedasticity, and random walk) are introduced in the dataset. In addition, the test datasets are masked with different missingness rates to evaluate the POI performance under scenarios with missing values. This study utilizes several filters to identify background noise for observation covariance initialization, and the covariance is updated during real-time data assimilation specifically for nonstationary environments. The results show that heteroscedastic noise can be identified well, whereas random-walk noise is very difficult to analyze. Overall, the results show that the POI implementation can lead to reduced uncertainty, but POI performance can also be significantly affected by the limited accuracy of the ML models in nonstationary environments. The impact of missing values is then examined and compared between stationary and nonstationary environments. Both prediction and POI updates are more accurate with smaller missingness rates, as expected, and whether or not POI is bypassed at missing points does not affect the overall performance significantly. Finally, input evolution can perform well with POI under high noise levels and missingness rates in stationary environments, but it always yields worse results in nonstationary environments and thus is not recommended.
author2 School of Civil and Environmental Engineering
author_facet School of Civil and Environmental Engineering
Wei, Yuying
Law, Adrian Wing-Keung
Yang, Chun
format Article
author Wei, Yuying
Law, Adrian Wing-Keung
Yang, Chun
author_sort Wei, Yuying
title Data assimilation with missing data in nonstationary environments for probabilistic machine learning models
title_sort data assimilation with missing data in nonstationary environments for probabilistic machine learning models
publishDate 2024
url https://hdl.handle.net/10356/173067
_version_ 1789483106571386880