LUMI Drives Efficiency in Converting Handwritten Weather Records to Digital 

Aug. 18, 2023 -- The Swedish Meteorological and Hydrological Institute (SMHI) has secured 1,280,000 GPU core hours on LUMI, courtesy of the EuroHPC JU Development Call.

The LUMI supercomputer. Image courtesy of CSC.

SMHI, a leading authority with a global perspective, is dedicated to forecasting weather, water, and climate changes. The institute stands as a beacon for applied research, providing forecasts, alerts, and decision-making support through a foundation of scientific expertise, advanced technology, and extensive weather and climate data.

Technical/Scientific Challenge

Most meteorological organizations globally, SMHI included, have vast collections of observational data stored in paper format over the years. This project aims to develop and perfect a machine learning model capable of processing various tabular formats, translating handwritten text, and generating machine-readable files (Figure 1).

Such an advancement would streamline and speed up the current manual process of digitizing paper archives. Through this initiative, SMHI seeks to digitize many historical weather observations, enhancing our understanding of the climate, particularly in terms of extreme weather events.

Proposed Solution

'Dawsonia' is a table-detection and handwritten text recognition (HTR) project focused on recovering data from historical weather journals. It specializes in digitizing handwritten numeric entries laid out in tables.

Figure 1: Example of two pages from a scanned weather journal dated January 1, 1927.

Unlike traditional optical character recognition (OCR), which simply reads printed text, Dawsonia also recovers the structural organization of text within tables and converts the handwritten entries to digital format (Figure 2).

SMHI's strategy revolves around blending image processing with machine learning to accomplish this mission. They've orchestrated the digitization pipeline (Figure 3) using Python, capitalizing on reputable open-source scientific libraries such as scikit-image and TensorFlow.
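To make the image-processing half of such a pipeline more concrete, here is a minimal sketch of one classic way to locate table cells: find rows and columns that are almost entirely ink (the ruling lines) and treat the gaps between them as cells. This is an illustrative stand-in, not Dawsonia's actual method. The function names, the `min_fill` threshold, and the toy page are all assumptions, and it uses plain Python lists where a real pipeline would more likely work on scanned images with a library such as scikit-image.

```python
def find_ruling_lines(binary_image, min_fill=0.8):
    """Return (rows, cols) whose ink coverage suggests a table ruling line.

    binary_image: list of equal-length rows, 1 = ink, 0 = background.
    min_fill: assumed threshold; fraction of a row/column that must be ink.
    """
    height = len(binary_image)
    width = len(binary_image[0])
    row_fill = [sum(row) / width for row in binary_image]
    col_fill = [
        sum(binary_image[r][c] for r in range(height)) / height
        for c in range(width)
    ]
    rows = [r for r, fill in enumerate(row_fill) if fill >= min_fill]
    cols = [c for c, fill in enumerate(col_fill) if fill >= min_fill]
    return rows, cols


def cells_between(rows, cols):
    """Turn ruling-line positions into (top, bottom, left, right) boxes.

    Boxes are half-open: image[top:bottom] and row[left:right] select
    the cell interior, excluding the ruling lines themselves.
    """
    boxes = []
    for top, bottom in zip(rows, rows[1:]):
        for left, right in zip(cols, cols[1:]):
            # Skip adjacent line pixels that enclose no interior.
            if bottom - top > 1 and right - left > 1:
                boxes.append((top + 1, bottom, left + 1, right))
    return boxes


def make_toy_page(height=7, width=9, line_rows=(0, 3, 6), line_cols=(0, 4, 8)):
    """Build a tiny synthetic 'scan' containing a 2x2 table (hypothetical data)."""
    return [
        [1 if (r in line_rows or c in line_cols) else 0 for c in range(width)]
        for r in range(height)
    ]


page = make_toy_page()
rows, cols = find_ruling_lines(page)
print("ruling rows:", rows, "ruling cols:", cols)
print("cells:", cells_between(rows, cols))
```

Each box could then be cropped and handed to a handwriting recognizer. Real scans are skewed, faded, and noisy, so a simple coverage threshold like this would need deskewing and more robust line detection before it became useful.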

With resources allocated on LUMI, the team is preparing to refine its digitization workflow. The challenge lies in balancing complexity across the tasks within the pipeline, specifically table detection and HTR. To fine-tune the model parameters and expand the training dataset, the team will run several rounds of training and evaluation, where GPUs will prove invaluable.

Business Impact

Figure 2: Tables detected from the scan of Figure 1.

Though still in its early stages, Dawsonia has already attracted interest from international meteorological agencies and from within SMHI, particularly from teams working with paper documents structured in tables. If successful, this venture could accelerate the digitization of countless similar archives.

Benefits

High-performance computing (HPC) greatly speeds up the project's development and testing phases. For example, training the current HTR neural network took 11 hours on an 8-core CPU cluster, whereas the same training run finished in one hour on a LUMI GPU.
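The arithmetic behind that comparison can be sketched in a few lines. The two training times are the figures reported above. The number of runs in the sweep (`N_RUNS`) is a purely hypothetical value chosen for illustration, and the sketch assumes the runs execute one after another.

```python
CPU_HOURS_PER_RUN = 11  # reported: 8-core CPU cluster
GPU_HOURS_PER_RUN = 1   # reported: LUMI GPU


def speedup(cpu_hours, gpu_hours):
    """Ratio of CPU training time to GPU training time."""
    return cpu_hours / gpu_hours


def sweep_wall_time(hours_per_run, n_runs):
    """Wall-clock hours for n_runs training runs executed sequentially."""
    return hours_per_run * n_runs


N_RUNS = 20  # hypothetical size of a hyperparameter sweep, not from the article

print(f"Speed-up: {speedup(CPU_HOURS_PER_RUN, GPU_HOURS_PER_RUN):.0f}x")
print(f"CPU sweep: {sweep_wall_time(CPU_HOURS_PER_RUN, N_RUNS)} h")
print(f"GPU sweep: {sweep_wall_time(GPU_HOURS_PER_RUN, N_RUNS)} h")
```

Under these assumptions, a 20-run sweep shrinks from 220 hours (over nine days) to 20 hours, which is the difference between a handful of experiments per month and several per week.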

Figure 3: Digitization pipeline.

Having a GPU readily available significantly speeds up the tweaking of the model's hyperparameters. Furthermore, it offers the flexibility to experiment with other openly accessible neural networks. When adapting these networks to new tasks (a process known as 'transfer learning'), GPUs are crucial.
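For readers unfamiliar with transfer learning, the toy below illustrates the basic idea: keep the weights of an existing network fixed ("frozen") and train only a small new output layer for the new task. It is a conceptual sketch in plain Python, not TensorFlow code and not the project's model. The frozen weights, the data, and the function names are all invented for illustration.

```python
import math

# Stand-in for a layer pretrained on some other task. These weights are
# made up and are never updated during training below.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def frozen_features(x):
    """Apply the frozen linear layer followed by a ReLU."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)))
        for row in FROZEN_WEIGHTS
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=200, lr=0.5):
    """Fit a new logistic-regression head on top of the frozen features."""
    head = [0.0] * len(FROZEN_WEIGHTS)
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            feats = frozen_features(x)
            pred = sigmoid(sum(h * f for h, f in zip(head, feats)) + bias)
            error = pred - y  # gradient of the log loss w.r.t. the logit
            # Only the head and its bias change; FROZEN_WEIGHTS stay fixed.
            head = [h - lr * error * f for h, f in zip(head, feats)]
            bias -= lr * error
    return head, bias


def predict(head, bias, x):
    feats = frozen_features(x)
    score = sigmoid(sum(h * f for h, f in zip(head, feats)) + bias)
    return 1 if score >= 0.5 else 0


# Hypothetical "new task": label is 1 when the first value exceeds the second.
samples = [(2.0, 0.0), (3.0, 1.0), (0.0, 2.0), (1.0, 3.0), (2.5, 0.5), (0.5, 2.5)]
labels = [1, 1, 0, 0, 1, 0]

head, bias = train_head(samples, labels)
print([predict(head, bias, x) for x in samples])
```

In a real workflow the frozen part would be a large pretrained network with millions of parameters, and even training only the new layers involves many passes over image data, which is where GPU acceleration matters.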

Information was provided by Ashwin Mohanan, scientific programmer at SMHI.


Source: ENCCS

EnterpriseAI