Data-driven rainfall prediction at a regional scale: a case study with Ghana

Indrajit Kalita (*, a), Lucia Vilallonga (b), Yves Atchade (a, b)
(a) Faculty of Computing and Data Sciences, Boston University, Boston, USA
(b) Department of Mathematics and Statistics, Boston University, Boston, USA

*Indicates corresponding author

Summary

Accurate rainfall forecasting is essential for agriculture, water resource management, and disaster preparedness. Numerical weather prediction (NWP) models, even state-of-the-art models, are known to struggle to produce skillful rainfall forecasts in tropical regions of Africa. See for example this study.

Over the last decade or so, the increased availability of large-scale meteorological datasets and the development of powerful machine learning models have opened up new opportunities for weather forecasting.

As a proof of concept, focusing on Ghana in West Africa (see map below), we use these tools to develop models that forecast 24h rainfall at 12h and 18h lead times. The models that we obtain noticeably outperform the state-of-the-art NWP model of the European Centre for Medium-Range Weather Forecasts (ECMWF).

Area of Interest (AOI)

This is the Area of Interest (AOI) for our study.

Data sets

We collect data over Ghana from the following sources, covering June 1st, 2000 to September 30th, 2021. Additional variables used include the time of the year and the latitude and longitude coordinates.

  • GPM-IMERG Precipitation Data: Satellite data providing global rainfall measurements since June 2000 at 30min temporal resolution and 0.1 deg spatial resolution. We collect daily rainfall measurements over Ghana, regridded to 64x64 images. We treat these images as ground truth.
  • ERA5 Meteorological Variables: a re-analysis database of atmospheric and meteorological data from ECMWF. As predictors, we collect 55 variables from ERA5 (wind, temperature, humidity, pressure, etc.) at 12h and 18h before the 24h rainfall window, regridded to 64x64 images.
  • TIGGE Forecast Data: ECMWF's actual rainfall forecasts at 18-hour lead-time, regridded to 64x64 images. Used here for comparison.

Methodology

We collect a dataset \( \{({\bf x}_{t-h},{\bf y}_t),\; 1\leq t\leq N\}\) as described above, where \( {\bf x}_{t-h}\in \mathbb{R}^{57\times 64\times 64}\) is the ERA5 input collected h=12 hours or h=18 hours before date t, and \( {\bf y}_t\in \mathbb{R}^{64\times 64}\) is the GPM-IMERG rainfall data for date t. We fit the regression model \[ {\bf y}_t = {\cal F}_W({\bf x}_{t-h}) + \epsilon_t,\] where \( {\cal F}_W\) is a U-Net model, a type of neural network architecture (depicted below), to predict the rainfall images from meteorological images.

Block Diagram of DL Architecture
Figure 1: Block diagram of the proposed DL architecture for regional precipitation forecasting.

After training, we evaluate our models on a test dataset by comparing their predictions with actual rainfall amounts as obtained from GPM-IMERG. Specifically, if \( \widehat{W}\) is the estimated model parameter and given a new test data point \( {\bf x}_{t'-h} \), we predict the rainfall image across the AOI at time \(t'\) using \[\hat {\bf y}_{t'} = \mathcal{F}_{\widehat{W}}\left({\bf x}_{t'-h}\right).\]

Models Compared

We evaluate and compare the following models:

  • UNET12: Our U-Net model that uses the input variables at 6PM to predict 24h rainfall starting 6AM next day.
  • UNET18: Our U-Net model that uses the input variables at midnight to predict 24h rainfall starting 30h later.
  • NWP: The 18-hour lead-time predictions from the ECMWF model, obtained from the TIGGE database.
  • HYB: An ensemble model that takes a weighted average of the predictions from UNET12 and NWP.
  • CLIM: A reference model based on climatological averages.
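The ensemble of UNET12 and NWP can be illustrated with a minimal sketch. It assumes a single scalar weight applied at every pixel; the weight value, the function name `blend`, and the 2x2 toy maps are hypothetical and are not taken from the study, which does not state how the weights are chosen.

```python
from typing import List

Image = List[List[float]]  # one rainfall map (rows x columns), values in mm


def blend(unet: Image, nwp: Image, weight: float) -> Image:
    """Pixel-wise weighted average: weight * unet + (1 - weight) * nwp."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        [weight * u + (1.0 - weight) * n for u, n in zip(u_row, n_row)]
        for u_row, n_row in zip(unet, nwp)
    ]


# Toy 2x2 maps standing in for the 64x64 forecasts (hypothetical values).
unet_pred = [[2.0, 0.0], [4.0, 1.0]]
nwp_pred = [[0.0, 0.0], [8.0, 3.0]]
hybrid = blend(unet_pred, nwp_pred, weight=0.5)  # weight 0.5 is an assumption
```

In practice the weight would presumably be tuned on held-out data; that step is not shown here.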

Comparison of mean absolute errors

We first evaluate performance using the mean absolute error over the test dataset: \[ {\rm MAE} = \frac{1}{|\mathcal{D}'|}\sum_{t'}\|\hat {\bf y}_{t'} - {\bf y}_{t'}\|_1.\]
Model      MAE     Sd MAE
CLIM       3.90    1.15
NWP        3.92    1.00
UNET_18    3.81    1.22
UNET_12    3.74    1.13
HYB        3.69    1.01
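As a rough illustration of the MAE formula above, the following sketch averages the L1 norm of the prediction error over a set of test images. The function names and toy data are hypothetical. The `per_pixel` flag is an optional normalisation by the number of pixels; it is not part of the formula as written and is included only as an assumption about how one might report errors in mm per pixel.

```python
from statistics import mean
from typing import List, Sequence

Image = List[List[float]]  # one rainfall map (rows x columns), values in mm


def l1_error(pred: Image, truth: Image) -> float:
    """L1 norm of the difference between two images (sum over all pixels)."""
    return sum(
        abs(p - t)
        for p_row, t_row in zip(pred, truth)
        for p, t in zip(p_row, t_row)
    )


def mae(preds: Sequence[Image], truths: Sequence[Image], per_pixel: bool = False) -> float:
    """Average of the per-image L1 errors over the test set."""
    errors = [l1_error(p, t) for p, t in zip(preds, truths)]
    if per_pixel:
        # Assumed normalisation: divide each image error by its pixel count.
        n_pixels = len(truths[0]) * len(truths[0][0])
        errors = [e / n_pixels for e in errors]
    return mean(errors)


# Two toy 2x2 test images (hypothetical values).
toy_preds = [[[1.0, 2.0], [0.0, 0.0]], [[0.0, 0.0], [4.0, 0.0]]]
toy_truths = [[[0.0, 2.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
toy_mae = mae(toy_preds, toy_truths)
toy_mae_per_pixel = mae(toy_preds, toy_truths, per_pixel=True)
```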

Comparison using the CRPS

If \(\hat{y}_{it'}\) is the prediction of \(y_{it'}\) at a pixel \(i\in[64]\times [64]\), we also use a methodology developed here to estimate the conditional cumulative distribution function (cdf) of \(y_{it'}\) given \(\hat{y}_{it'}\). This allows us to produce confidence intervals for the predictions, and also to compare different methods using the continuous ranked probability score (CRPS) \[{\rm CRPS}(F,y) = \int_{-\infty}^{+\infty}\left(F(u) - \textbf{1}_{\{y\leq u\}}\right)^2 du.\]

If \(\hat{F}_{i,t'}^{(m)}(\cdot)\) denotes the estimated cdf for method \(m\) at time \(t'\) and pixel \(i\), the CRPS of model \(m\) at pixel \(i\) is defined as \[ {\rm CRPS}(i,m) = \frac{1}{|\mathcal{D}^{''}|}\sum_{t'\in\mathcal{D}^{''}} {\rm CRPS}(\hat{F}_{i,t'}^{(m)}, y_{i,t'}).\] Using the CLIM prediction as a reference, we compute the pixel-by-pixel CRPS skill of model \(m\) as \[ \rm{Skill}(i,m) = \frac{ {\rm CRPS}(i,{\rm CLIM}) - {\rm CRPS}(i,m)}{{\rm CRPS}(i,{\rm CLIM})}. \]
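To make the CRPS and skill definitions concrete, here is a minimal sketch under one assumption: the estimated cdf is represented by a finite sample, so that \(F\) is an empirical cdf. In that case the CRPS integral reduces to the standard identity \(E|X-y| - \tfrac{1}{2}E|X-X'|\), with \(X, X'\) drawn independently from \(F\). How the study actually represents the estimated cdf is not stated here, and the function names are hypothetical.

```python
from typing import Sequence


def crps_empirical(samples: Sequence[float], y: float) -> float:
    """CRPS of the empirical cdf of `samples` against the observation y.

    Uses CRPS(F, y) = E|X - y| - 0.5 * E|X - X'| for X, X' ~ F.
    """
    n = len(samples)
    term_obs = sum(abs(x - y) for x in samples) / n
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


def crps_skill(crps_model: float, crps_reference: float) -> float:
    """Relative improvement over the reference; positive means better."""
    return (crps_reference - crps_model) / crps_reference


# Toy check: a two-point distribution on {0, 1} and an observation at 0.5.
toy_crps = crps_empirical([0.0, 1.0], 0.5)
```

For the toy case the integral can be checked by hand: the integrand equals 0.25 on the interval from 0 to 1 and zero elsewhere, so the CRPS is 0.25. The double loop is quadratic in the sample size, which is acceptable for a sketch but would need a sorted-sample formula at scale.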

The maps below show the CRPS values (first row) and the CRPS skill scores (second row) across the AOI. Positive skill values indicate better performance than the CLIM model. We see that our model performs noticeably better than the NWP.

Mean Absolute Error (MAE)
Skill Map

CRPS values (first row) and CRPS skill scores (second row) across the AOI. Positive skill values indicate better performance than the CLIM model.

Forecasting rainy days

We also compare the models in their ability to correctly predict whether the upcoming day is rainy (total rainfall \(>0.5mm\)). In the dataset collected, 34.9% of all pixels have rainfall above 0.5mm, and 46.1% of rainy pixels (i.e., pixels exceeding 0mm) exceed 0.5mm.

Given a threshold level \(\tau\), and a model \(m\in\{ \rm{CLIM, UNET_{12}, UNET_{18}, NWP, HYB}\}\), its precision and recall at pixel \(i\) are defined respectively as \[ \mathcal{P}_i(m) = \frac{\sum_{t'\in\mathcal{D}^{''}}\textbf{1}_{\left\{|\hat y_{it'}|>\tau\right\}}\textbf{1}_{\left\{|y_{it'}|>\tau\right\}}}{\sum_{t'\in\mathcal{D}^{''}}\textbf{1}_{\{|\hat y_{it'}|>\tau\}}},\] and \[ \mathcal{R}_i(m) = \frac{\sum_{t'\in\mathcal{D}^{''}}\textbf{1}_{\{|\hat y_{it'}|>\tau\}}\textbf{1}_{\{|y_{it'}|>\tau\}}}{\sum_{t'\in\mathcal{D}^{''}}\textbf{1}_{\{|y_{it'}|>\tau\}}}. \] The figures below show the precision values (first row) and recall values (second row) across the area at threshold \(\tau = 0.5mm\). Again our method comes out on top in the comparison.

Precision in predicting rainy days
Recall in predicting rainy days

Precision values (first row) and recall values (second row) across the AOI at threshold \(\tau = 0.5mm\).
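The precision and recall definitions above can be sketched for a single pixel as follows. The function name and the toy time series are hypothetical. Returning `None` when a denominator is zero is a design choice made for this sketch, since the formulas leave that case undefined.

```python
from typing import Optional, Sequence, Tuple


def precision_recall(
    preds: Sequence[float], truths: Sequence[float], tau: float
) -> Tuple[Optional[float], Optional[float]]:
    """Precision and recall at one pixel over the test dates, at threshold tau."""
    pred_pos = [abs(p) > tau for p in preds]   # forecast exceeds tau
    true_pos = [abs(t) > tau for t in truths]  # observation exceeds tau
    hits = sum(1 for p, t in zip(pred_pos, true_pos) if p and t)
    n_pred = sum(pred_pos)
    n_true = sum(true_pos)
    precision = hits / n_pred if n_pred else None  # undefined if nothing is forecast
    recall = hits / n_true if n_true else None     # undefined if nothing is observed
    return precision, recall


# Toy series of forecasts and observations (mm) at one pixel.
toy_forecast = [0.0, 1.2, 0.7, 3.0, 0.2]
toy_observed = [0.0, 2.0, 0.1, 0.6, 0.9]
toy_precision, toy_recall = precision_recall(toy_forecast, toy_observed, tau=0.5)
```

In the toy series, three dates are forecast above 0.5mm and three are observed above 0.5mm, with two dates in common, so both precision and recall equal 2/3. The same function applies unchanged at any other threshold.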

Forecasting heavy rains

We also compare the models in their ability to correctly predict upcoming heavy rainfall (total rainfall \(>10mm\)). We compute the same precision and recall defined above at threshold \(\tau=10mm\). Such an amount of rainfall in 24h is relatively rare in the area: about 13.2% of rainy days record more than 10mm.

  • The figures below show the precision values (first row) and recall values (second row) across the area.
  • Again our method performs best. However, none of the methods performs at a level that would be acceptable in practical use.
  • Given the devastating effects that heavy rainfall can have, more research is needed to improve the prediction of these tail events.

Precision in predicting heavy rains
Recall in predicting heavy rains

Precision values (first row) and recall values (second row) across the AOI at threshold \(\tau = 10mm\).

Model interpretation

We also develop a statistical methodology to probe the relative importance of the meteorological variables used as input in our model, leading to useful insights into the factors driving precipitation in Ghana.

Important Variables
  • The results show that the most important predictive variable in the U-Net model is the space-time variable. This is hardly surprising since rainfall in Ghana is strongly seasonal, with seasons that vary with latitude.
  • Evaporation drives rainfall, and our method indeed highlights specific humidity (𝑞925), relative humidity (𝑟950) and total column water vapor (𝑡𝑐𝑤𝑣) as important inputs.
  • The variable wind (𝑢300) also appears important. This is possibly related to the Tropical Easterly Jet (TEJ), which plays an important role in the West African monsoon.
  • Our methodology also highlights several convection-related variables: convective inhibition (𝑐𝑖𝑛), K-index (𝑘𝑥) and the convective available potential energy (𝑐𝑎𝑝𝑒) as key input variables.