
From Paper to Pixel: Evaluating the Best Techniques for Digitising Handwritten Texts

A Comparative Dive into OCR, Transformer Models, and Prompt Engineering-based Ensemble Techniques

By: Sohrab Sani and Diego Capozzi

Organisations have long grappled with the tedious and expensive task of digitising historical handwritten documents. Previously, Optical Character Recognition (OCR) techniques, such as AWS Textract (TT) [1] and Azure Form Recognizer (FR) [2], have led the charge. Although these options are widely available, they have many downsides: they are pricey, require lengthy data processing/cleaning and can yield suboptimal accuracy. Recent Deep Learning advancements in image segmentation and Natural Language Processing that utilise transformer-based architectures have enabled the development of OCR-free techniques, such as the Document Understanding Transformer (Donut) [3] model.

In this study, we’ll compare OCR and Transformer-based techniques for this digitisation process with our custom dataset, which was created from a series of handwritten forms. Benchmarking for this relatively simple task is intended to lead towards more complex applications on longer, handwritten documents. To increase accuracy, we also explored using an ensemble approach by utilising prompt engineering with the gpt-3.5-turbo Large Language Model (LLM) to combine the outputs of TT and the fine-tuned Donut model.

The code for this work can be viewed in this GitHub repository. The dataset is available on our Hugging Face repository here.

Table of Contents:

· Dataset creation
· Methods
Azure Form Recognizer (FR)
AWS Textract (TT)
Ensemble Method: TT, Donut, GPT
· Measurement of Model Performance
· Results
· Additional Considerations
Donut model training
Prompt engineering variability
· Conclusion
· Next Steps
· References
· Acknowledgements

Dataset creation

This study created a custom dataset from 2100 handwritten-form images from the NIST Special Database 19 dataset [4]. Figure 1 provides a sample image of one of these forms. The final collection includes 2099 forms. To curate this dataset, we cropped the top section of each NIST form, targeting the DATE, CITY, STATE, and ZIP CODE (now referred to as “ZIP”) keys highlighted within the red box [Figure 1]. This approach launched the benchmarking process with a relatively simple text-extraction task, enabling us to then select and manually label the dataset quickly. At the time of writing, we are unaware of any publicly available datasets with labelled images of handwritten forms that could be used for JSON key-field text extractions.

Figure 1. Example form from the NIST Special Database 19 dataset. The red box identifies the cropping process, which selects only the DATE, CITY, STATE, and ZIP fields in this form. (Image by the authors)

We manually extracted values for each key from the documents and double-checked these for accuracy. In total, 68 forms were discarded for containing at least one illegible character. Characters from the forms were recorded exactly as they appeared, regardless of spelling errors or formatting inconsistencies.

So that the Donut model could be fine-tuned to handle missing data, we added 67 empty forms, enabling training on empty fields. Missing values within the forms are represented as a “None” string in the JSON output.

Figure 2a displays a sample form from our dataset, while Figure 2b shares the corresponding JSON that is then linked to that form.

Figure 2. (a) Example image from the dataset; (b) Extracted data in a JSON format. (Image by the authors)

Table 1 provides a breakdown of variability within the dataset for each key. From most to least variable, the order is ZIP, CITY, DATE, and STATE. All dates were within the year 1989, which may have reduced overall DATE variability. Additionally, although there are only 50 US states, STATE variability was increased due to different acronyms or case-sensitive spellings that were used for individual state entries.

Table 1. Summary Statistics of the Dataset. (Image by the authors)

Table 2 summarises character lengths for various attributes of our dataset.

Table 2. Summary of character length and distribution. (Image by the authors)

The above data shows that CITY entries possessed the longest character length while STATE entries had the shortest. The median values for each entry closely follow their respective means, indicating a relatively uniform distribution of character lengths around the average for each category.

After annotating the data, we split it into three subsets: training, validation, and testing, with respective sample sizes of 1400, 199, and 500. Here is a link to the notebook that we used for this.
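As a rough illustration of the split (the actual code lives in the linked notebook; the seed and helper name below are hypothetical), a deterministic shuffle-and-slice is enough:

```python
import random

def split_dataset(samples, seed=0):
    """Shuffle deterministically, then slice into train/validation/test."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    return samples[:1400], samples[1400:1599], samples[1599:]

train, val, test = split_dataset(range(2099))
print(len(train), len(val), len(test))  # 1400 199 500
```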


Methods

We will now expand on each method that we tested and link these to the relevant Python code, which contains more details. The methods are first described individually (FR, TT and Donut) and then in combination, as the TT+GPT+Donut ensemble approach.

Azure Form Recognizer (FR)

Figure 3 depicts the workflow for extracting handwritten text from our form images using Azure FR:

1. Store the images: This could be on a local drive or another solution, such as an S3 bucket or Azure Blob Storage.
2. Azure SDK: A Python script loads each image from storage and transfers it to the FR API via the Azure SDK.
3. Post-processing: Using an off-the-shelf method means that the final output often needs refining. Here are the 21 extracted keys that require further processing:

['DATE', 'CITY', 'STATE', "'DATE", 'ZIP', 'NAME', 'E ZIP', '·DATE', '.DATE', 'NAMR', 'DATE®', 'NAMA', '_ZIP', '.ZIP', 'print the following shopsataca i', '-DATE', 'DATE.', 'No.', 'NAMN', 'STATEnZIP']

Some keys have extra dots or underscores, which require removal. Due to the close positioning of the text within the forms, there are numerous instances where extracted values are mistakenly associated with incorrect keys. These issues are then addressed to a reasonable extent.
4. Save the result: Save the result to a storage space in a pickle format.

Figure 3. Visualisation of the Azure FR workflow. (Image by the authors)
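A minimal sketch of the SDK call and the key clean-up, assuming the azure-ai-formrecognizer package and the prebuilt-document model (the helper names and the exact strip rules here are illustrative, not our production script):

```python
def analyze_form(image_path: str, endpoint: str, key: str) -> dict:
    # Deferred imports so the post-processing helper below runs without the SDK.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(image_path, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-document", document=f)
    result = poller.result()
    # Map each detected key to its value (None when no value was detected)
    return {kv.key.content: (kv.value.content if kv.value else None)
            for kv in result.key_value_pairs}

def clean_key(raw_key: str) -> str:
    """Rule-based clean-up: strip stray dots, underscores, dashes and symbols."""
    return raw_key.strip(" ._-·®")

print(clean_key(".DATE"), clean_key("_ZIP"), clean_key("DATE®"))  # DATE ZIP DATE
```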

AWS Textract (TT)

Figure 4 depicts the workflow for extracting handwritten text from our form images by using AWS TT:

1. Store the images: The images are stored in an S3 bucket.
2. SageMaker Notebook: A Notebook instance facilitates interaction with the TT API, executes the post-processing cleaning script, and saves the outcomes.
3. TT API: This is the off-the-shelf OCR-based text-extraction API provided by AWS.
4. Post-processing: Using an off-the-shelf method means that the final output often needs refining. TT produced a dataset with 68 columns, more than the 21 columns from the FR approach. This is mostly due to the detection of additional text in the images thought to be fields. These issues are then addressed during the rule-based post-processing.
5. Save the result: The refined data is stored in an S3 bucket in a pickle format.

Figure 4. Visualisation of the TT workflow. (Image by the authors)
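Steps 3 and 4 can be sketched as follows, assuming boto3; the Textract call and the column-filtering helper are illustrative rather than our exact cleaning rules:

```python
def textract_form(bucket: str, name: str) -> dict:
    # Deferred import: the API call is sketched, not executed here.
    import boto3

    client = boto3.client("textract")
    return client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": name}},
        FeatureTypes=["FORMS"],
    )

TARGET_KEYS = {"DATE", "CITY", "STATE", "ZIP"}

def keep_target_columns(raw: dict) -> dict:
    """Reduce the many raw TT columns to the four target fields."""
    cleaned = {}
    for key, value in raw.items():
        normalised = key.strip(" ._-·®").upper()
        if normalised in TARGET_KEYS:
            cleaned[normalised] = value
    return cleaned

print(keep_target_columns({"-DATE": "8/28/89", "No.": "x", "ZIP ": "42171"}))
# {'DATE': '8/28/89', 'ZIP': '42171'}
```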


Donut

In contrast to the off-the-shelf OCR-based approaches, which cannot adapt to specific data input through custom fields and/or model retraining, this section delves into refining the OCR-free approach using the Donut model, which is based on a transformer architecture.

First, we fine-tuned the Donut model with our data before applying the model to our test images to extract the handwritten text in a JSON format. In order to re-train the model efficiently and curb potential overfitting, we employed the EarlyStopping module from PyTorch Lightning. With a batch size of 2, the training terminated after 14 epochs. Here are more details for the fine-tuning process of the Donut model:

· We allocated 1,400 images for training, 199 for validation, and the remaining 500 for testing.
· We used naver-clova-ix/donut-base as our foundation model, which is accessible on Hugging Face.
· This model was then fine-tuned on a Quadro P6000 GPU with 24GB of memory.
· The entire training time was approximately 3.5 hours.
· For more intricate configuration details, refer to the train_nist.yaml file in the repository.
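The early-stopping setup described above can be sketched as below; the monitored quantity name and the patience value are assumptions on our part (the exact configuration is in train_nist.yaml):

```python
def build_trainer(max_epochs: int = 30):
    # Deferred imports: a sketch only, not the exact training script.
    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import EarlyStopping

    early_stop = EarlyStopping(
        monitor="val_metric",  # validation metric tracked during fine-tuning
        patience=3,            # assumed patience value
        mode="min",
    )
    return Trainer(max_epochs=max_epochs,
                   callbacks=[early_stop],
                   accelerator="gpu",
                   devices=1)
```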

This model can also be downloaded from our Hugging Face space repository.
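Inference with the fine-tuned checkpoint follows the standard Donut recipe from the transformers library; the checkpoint name and the task token below are placeholders, not our actual identifiers:

```python
def donut_extract(image) -> dict:
    # Deferred imports; "our-org/donut-nist" and "<s_nist>" are placeholders.
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    processor = DonutProcessor.from_pretrained("our-org/donut-nist")
    model = VisionEncoderDecoderModel.from_pretrained("our-org/donut-nist")

    pixel_values = processor(image, return_tensors="pt").pixel_values
    decoder_input_ids = processor.tokenizer(
        "<s_nist>", add_special_tokens=False, return_tensors="pt"
    ).input_ids

    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )
    sequence = processor.batch_decode(outputs)[0]
    # Convert the generated token sequence into the output JSON
    return processor.token2json(sequence)
```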

Ensemble Method: TT, Donut, GPT

A variety of ensembling methods were explored, and the combination of TT, Donut and GPT performed the best, as explained below.

Once the JSON outputs were obtained by the individual application of TT and Donut, these were used as inputs to a prompt that was then passed on to GPT. The aim was to use GPT to take the information within these JSON inputs, combine it with contextual GPT information and create a new/cleaner JSON output with enhanced content reliability and accuracy [Table 3]. Figure 5 provides a visual overview of this ensembling approach.

Figure 5. Visual description of the ensembling method that combines TT, Donut and GPT. (Image by the authors)
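A simplified version of this call, assuming the openai Python package of the time (the prompt wording here is illustrative, not Prompt ver1 itself):

```python
import json

def ensemble_clean(tt_json: dict, donut_json: dict) -> dict:
    # Deferred import: a sketch of the gpt-3.5-turbo call, not executed here.
    import openai

    prompt = (
        "Two systems extracted these fields from a handwritten form.\n"
        f"Textract output: {json.dumps(tt_json)}\n"
        f"Donut output: {json.dumps(donut_json)}\n"
        "Combine them into one corrected JSON with the keys "
        'DATE, CITY, STATE and ZIP, using "None" for missing values.'
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce output variability
    )
    return json.loads(response["choices"][0]["message"]["content"])
```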

The creation of the appropriate GPT prompt for this task was iterative and required the introduction of ad-hoc rules. The tailoring of the GPT prompt to this task — and possibly the dataset — is an aspect of this study that requires exploration, as noted in the Additional Considerations section.

Measurement of Model Performance

This study measured model performance mainly by using two distinct accuracy measures:

· Field-Level-Accuracy (FLA)
· Character-Based-Accuracy (CBA)

Additional quantities, such as Coverage and Cost, were also measured to provide relevant contextual information. All metrics are described below.


Field-Level-Accuracy (FLA)

This is a binary measurement: if all of the characters of the keys within the predicted JSON match those in the reference JSON, then the FLA is 1; if, however, just one character does not match, then the FLA is 0.

Consider the examples:

JSON1 = {'DATE': '8/28/89', 'CITY': 'Murray', 'STATE': 'KY', 'ZIP': '42171'}
JSON2 = {'DATE': '8/28/89', 'CITY': 'Murray', 'STATE': 'KY', 'ZIP': '42071'}

Comparing JSON1 and JSON2 using FLA results in a score of 0 due to the ZIP mismatch. However, comparing JSON1 with itself provides an FLA score of 1.
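FLA reduces to a one-line check; the helper below is our illustration, not code from the study's repository:

```python
def field_level_accuracy(predicted: dict, reference: dict) -> int:
    """1 if every field matches the reference exactly, else 0."""
    return int(all(predicted.get(key) == value
                   for key, value in reference.items()))

JSON1 = {"DATE": "8/28/89", "CITY": "Murray", "STATE": "KY", "ZIP": "42171"}
JSON2 = {"DATE": "8/28/89", "CITY": "Murray", "STATE": "KY", "ZIP": "42071"}

print(field_level_accuracy(JSON2, JSON1))  # 0 (ZIP mismatch)
print(field_level_accuracy(JSON1, JSON1))  # 1
```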


Character-Based-Accuracy (CBA)

This accuracy measure is computed as follows:

1. Determining the Levenshtein edit distance for each corresponding value pair.
2. Obtaining a normalised score by summing up all of the distances and dividing by the total combined string length of the values.
3. Converting this score into a percentage.

The Levenshtein edit-distance between two strings is the number of changes needed to transform one string into another. This involves counting substitutions, insertions, or deletions. For example, transforming “marry” into “Murray” would require two substitutions and one insertion, resulting in a total of three changes. These modifications can be made in various sequences, but at least three actions are necessary. For this computation, we employed the edit_distance function from the NLTK library.
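For intuition, here is a minimal pure-Python edit distance (the study itself calls NLTK's edit_distance), verified on the "marry" to "Murray" example:

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic programming: prev[j] = edit distance between a-so-far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("marry", "Murray"))  # 3 (two substitutions, one insertion)
```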

Below is a code snippet illustrating the implementation of the described algorithm. The function accepts two JSON inputs and returns an accuracy percentage.

from nltk.metrics.distance import edit_distance

def dict_distance(dict1: dict,
                  dict2: dict) -> float:

    distance_list = []
    character_length = []

    for key in dict1:
        # Levenshtein distance between the two values for this key
        distance_list.append(edit_distance(dict1[key], dict2[key]))

        # Normalise by the longer of the two value strings
        if len(dict1[key]) > len(dict2[key]):
            character_length.append(len(dict1[key]))
        else:
            character_length.append(len(dict2[key]))

    accuracy = 100 - sum(distance_list) / sum(character_length) * 100

    return accuracy

To better understand the function, let’s see how it performs in the following examples:

JSON1 = {'DATE': '8/28/89', 'CITY': 'Murray', 'STATE': 'KY', 'ZIP': '42171'}
JSON2 = {'DATE': 'None', 'CITY': 'None', 'STATE': 'None', 'ZIP': 'None'}
JSON3 = {'DATE': '8/28/89', 'CITY': 'Murray', 'STATE': 'None', 'ZIP': 'None'}

· dict_distance(JSON1, JSON1): 100%. There is no difference between JSON1 and itself, so we obtain a perfect score of 100%.
· dict_distance(JSON1, JSON2): 0%. Every character in JSON2 would need alteration to match JSON1, yielding a 0% score.
· dict_distance(JSON1, JSON3): 59%. Every character in the STATE and ZIP values of JSON3 must be changed to match JSON1, resulting in an accuracy score of 59%.

Hereafter, CBA refers to its average value over the analysed image sample. Both accuracy measurements are strict, since they require all characters, including their case, to match. FLA is particularly conservative due to its binary nature, which makes it blind to partially correct cases. CBA is less conservative and can identify partially correct instances, but it remains case-sensitive, which may matter more or less depending on whether the aim is to recover the content of the text or to preserve its exact written form. Overall, we chose these stringent measurements for a conservative approach, since we prioritised text-extraction correctness over text semantics.


Coverage

This quantity is defined as the fraction of form images whose fields have all been extracted into the output JSON. It is helpful for monitoring the overall ability to extract all fields from the forms, independent of their correctness. If Coverage is very low, it flags that certain fields are systematically being left out of the extraction process.
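Coverage can be measured with a small helper like the following (our illustration; only the key names come from the dataset):

```python
REQUIRED_KEYS = ("DATE", "CITY", "STATE", "ZIP")

def coverage(outputs: list) -> float:
    """Fraction of output JSONs containing every required field."""
    complete = sum(all(key in output for key in REQUIRED_KEYS)
                   for output in outputs)
    return complete / len(outputs)

outputs = [
    {"DATE": "8/28/89", "CITY": "Murray", "STATE": "KY", "ZIP": "42171"},
    {"DATE": "1/2/89", "CITY": "Salem"},  # STATE and ZIP never extracted
]
print(coverage(outputs))  # 0.5
```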


Cost

This is a simple estimate of the cost incurred by applying each method to the entire test dataset. It does not capture the GPU cost of fine-tuning the Donut model.


Results

We assessed the performance of all methods on the test dataset of 500 samples. The results are summarised in Table 3.

When using FLA, we observe that the more traditional OCR-based methods, FR and TT, perform similarly, with relatively low accuracies (FLA ~37%). While not ideal, this may be due to FLA's stringent requirements. Alternatively, when using the CBA Total, which is the average CBA value when accounting for all JSON keys together, the performances of both TT and FR are far more acceptable, yielding values above 77%. In particular, TT (CBA Total = 89.34%) outperforms FR by ~15%. This behaviour is preserved when focusing on the CBA values measured for the individual form fields, notably in the DATE and CITY categories [Table 3], and when measuring the FLA and CBA Totals over the entire sample of 2099 images (TT: FLA = 40.06%, CBA Total = 86.64%; FR: FLA = 35.64%, CBA Total = 78.57%). While the Cost of applying these two models is the same, TT is better positioned to extract all of the form fields, with Coverage values approximately 9% higher than FR's.

Table 3. Performance metric values calculated over the test dataset. CBA Total and CBA key (key = Date, City, State, Zip) are sample averages and account, respectively, for the JSON keys altogether and individually. (Image by the authors)

Quantifying the performance of these more traditional OCR-based models provided us with a benchmark that we then used to evaluate the advantages of using a purely Donut approach versus using one in combination with TT and GPT. We begin this by using TT as our benchmark.

The benefits of utilising this approach are shown through improved metrics from the Donut model that was fine-tuned on a sample size of 1400 images and their corresponding JSON. Compared to the TT results, this model’s global FLA of 54% and CBA Total of 95.23% constitute a 38% and 6% improvement, respectively. The most significant increase was seen in the FLA, demonstrating that the model can accurately retrieve all form fields for over half of the test sample.

The CBA increase is notable, given the limited number of images used for fine-tuning the model. The Donut model's benefits are evidenced by the improved overall values in Coverage and the key-based CBA metrics, which increased by between 2% and 24%. Coverage reached 100%, indicating that the model can extract text from all form fields, which reduces the post-processing work involved in productionising such a model.

Based on this task and dataset, these results illustrate that using a fine-tuned Donut model produces results that are superior to those produced by an OCR model. Lastly, ensembling methods were explored to assess if additional improvements could continue to be made.

The performance of the ensemble of TT and fine-tuned Donut, powered by gpt-3.5-turbo, reveals that improvements are possible if specific metrics, such as FLA, are chosen. All of the metrics for this model (excluding CBA State and Coverage) show an increase, ranging between ~0.2% and ~10%, compared to those for our fine-tuned Donut model. The only performance degradation is seen in the CBA State, which decreases by ~3% when compared to the value measured for our fine-tuned Donut model. This may be owed to the GPT prompt that was used, which can be further fine-tuned to improve this metric. Finally, the Coverage value remains unchanged at 100%.

When compared to the other individual fields, Date extraction (see CBA Date) yielded higher efficiency. This was likely due to the limited variability in the Date field since all Dates originated in 1989.

If the performance requirements are considerably conservative, then the 10% increase in FLA is significant and may merit the higher cost of building and maintaining a more complex infrastructure. This should also consider the source of variability introduced by the LLM prompt modification, which is noted in the Additional Considerations section. However, if the performance requirements are less stringent, then the CBA metric improvements yielded by this ensemble method may not merit the additional cost and effort.

Overall, our study shows that while individual OCR-based methods — namely FR and TT — have their strengths, the Donut model, fine-tuned on 1400 samples only, easily surpasses their accuracy benchmark. Furthermore, ensembling TT and a fine-tuned Donut model by a gpt-3.5-turbo prompt further increases accuracy when measured by the FLA metric. Additional Considerations must also be made concerning the fine-tuning process of the Donut model and the GPT prompt, which are now explored in the following section.

Additional Considerations

Donut model training

To improve the accuracy of the Donut model, we experimented with three training approaches, each aimed at improving inference accuracy while preventing overfitting to the training data. Table 4 displays a summary of our results.

Table 4. Summary of the Donut model fine-tuning. (Image by the authors)

1. The 30-Epoch Training: We trained the Donut model for 30 epochs using a configuration provided in the Donut GitHub repository. This training session lasted for approximately 7 hours and resulted in an FLA of 50.0%. The CBA values for different categories varied, with CITY achieving a value of 90.55% and ZIP achieving 98.01%. However, we noticed that the model started overfitting after the 19th epoch when we examined the val_metric.

2. The 19-Epoch Training: Based on insights gained during the initial training, we fine-tuned the model for only 19 epochs. Our results showed a significant improvement in FLA, which reached 55.8%. The overall CBA, as well as key-based CBAs, showed improved accuracy values. Despite these promising metrics, we detected a hint of overfitting, as indicated by the val_metric.

3. The 14-Epoch Training: To further refine our model and curb potential overfitting, we employed the EarlyStopping module from PyTorch Lightning. This approach terminated the training after 14 epochs. This resulted in an FLA of 54.0%, and CBAs were comparable, if not better, than the 19-epoch training.

When comparing the outputs from these three training sessions, although the 19-epoch training yielded a marginally better FLA, the CBA metrics in the 14-epoch training were overall superior. Additionally, the val_metric reinforced our apprehension regarding the 19-epoch training, indicating a slight inclination towards overfitting.

In conclusion, we deduced that the model that was fine-tuned over 14 epochs using EarlyStopping was both the most robust and the most cost-efficient.

Prompt engineering variability

We worked on two prompt engineering approaches (ver1 & ver2) to improve data extraction efficiency by ensembling a fine-tuned Donut model and our results from TT. After training the model for 14 epochs, Prompt ver1 yielded superior results with an FLA of 59.6% and higher CBA metrics for all keys [Table 5]. In contrast, Prompt ver2 experienced a decline, with its FLA dropping to 54.4%. A detailed look at the CBA metrics indicated that accuracy scores for every category in ver2 were slightly lower when compared to those of ver1, highlighting the significant difference this change made.

Table 5. Summary of results: extracting handwritten text from forms. (Image by the authors)

During our manual labelling process of the dataset, we utilised the results of TT and FR, and developed Prompt ver1 while annotating the text from the forms. Despite being intrinsically identical to its predecessor, Prompt ver2 was slightly modified. Our primary goal was to refine the prompt by eliminating empty lines and redundant spaces that were present in Prompt ver1.

In summary, our experimentation highlighted the nuanced impact of seemingly minor adjustments. While Prompt ver1 showcased a higher accuracy, the process of refining and simplifying it into Prompt ver2, paradoxically, led to a reduction in performance across all metrics. This highlights the intricate nature of prompt engineering and the need for meticulous testing before finalising a prompt for use.

Prompt ver1 is available in this Notebook, and the code for Prompt ver2 can be seen here.


Conclusion

We created a benchmark dataset for text extraction from images of handwritten forms containing four fields (DATE, CITY, STATE, and ZIP). These forms were manually annotated into a JSON format. We used this dataset to assess the performances of OCR-based models (FR and TT) and a Donut model, which was then fine-tuned using our dataset. Lastly, we employed an ensemble model that we built through prompt engineering by utilising an LLM (gpt-3.5-turbo) with TT and our fine-tuned Donut model outputs.

We found that TT performed better than FR and used this as a benchmark to evaluate prospective improvements that could be generated by a Donut model in isolation or in combination with TT and GPT, which is the ensemble approach. As displayed by the model performance metrics, this fine-tuned Donut model showed clear accuracy improvements that justify its adoption over OCR-based models. The ensemble model displayed significant improvement of FLA but comes at a higher cost and therefore, can be considered for usage in cases with stricter performance requirements. Despite employing the consistent underlying model, gpt-3.5-turbo, we observed notable differences in the output JSON form when minor changes in the prompt were made. Such unpredictability is a significant drawback when using off-the-shelf LLMs in production. We are currently developing a more compact cleaning process based on an open-source LLM to address this issue.

Next Steps

· The price column in Table 3 shows that the OpenAI API call was the most expensive cognitive service used in this work. Thus, to minimise costs, we are working on fine-tuning an LLM for a seq2seq task by utilising methods such as full fine-tuning, prompt tuning [5] and QLoRA [6].
· For privacy reasons, the name box on the images in the dataset is covered by a black rectangle. We are working on updating this by adding random first and last names to the dataset, which would increase the number of data-extraction fields from four to five.
· In the future, we plan to increase the complexity of the text-extraction task by extending this study to text extraction from entire forms or other more extensive documents.
· Investigate Donut model hyperparameter optimisation.


References

[1] Amazon Textract, AWS Textract
[2] Form Recognizer, Azure Form Recognizer (now Document Intelligence)
[3] G. Kim, T. Hong, M. Yim, J. Nam, J. Park, J. Yim, W. Hwang, S. Yun, D. Han and S. Park, OCR-free Document Understanding Transformer (2022), European Conference on Computer Vision (ECCV)
[4] P. Grother and K. Hanaoka, NIST Handwritten Forms and Characters Database (NIST Special Database 19) (2016), DOI: http://doi.org/10.18434/T4H01C
[5] B. Lester, R. Al-Rfou and N. Constant, The Power of Scale for Parameter-Efficient Prompt Tuning (2021), arXiv:2104.08691
[6] T. Dettmers, A. Pagnoni, A. Holtzman and L. Zettlemoyer, QLoRA: Efficient Finetuning of Quantized LLMs (2023), arXiv:2305.14314


Acknowledgements

We would like to thank our colleague, Dr. David Rodrigues, for his continuous support and discussions surrounding this project. We would also like to thank Kainos for their support.

From Paper to Pixel: Evaluating the Best Techniques for Digitising Handwritten Texts was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
