A joint study of deep learning-based methods for identity document image binarization and its influence on attribute recognition

Text recognition has benefited considerably from deep learning research, as have the preprocessing methods included in its workflow. Identity documents are critical in the field of document analysis and should be thoroughly researched in relation to this workflow. We propose to examine the link between deep learning-based binarization and recognition algorithms for this type of document on the MIDV-500 and MIDV-2020 datasets. We provide a series of experiments to illustrate the relation between the quality of the captured images and the binarization results, as well as the influence of the binarization output on final recognition performance. We show that deep learning-based binarization solutions are affected by the capture quality, which implies that they still need significant improvements. We also show that proper binarization results can improve the performance of many recognition methods. Our retrained U-Net-bin outperformed all other binarization methods, and the best result in recognition was obtained by Paddle Paddle OCR v2.


Introduction
Document image analysis and recognition is a rapidly growing domain that simultaneously relies on image processing techniques, pattern recognition approaches, and computer optics principles. The handbook [1] provides a gentle introduction to the subject. One of the latest surveys of document image recognition problems and existing solutions is presented in the paper [2]. Among the set of document types being analyzed, identity (ID) documents play a special role. They are used to confirm their owner's identity in many scenarios: usage of government services, banking, access granting, or travelling. The scope and context of their processing, along with the corresponding recognition system design, are addressed in works [3, 4].
A typical ID document type can be defined by its "template": a set of features shared by every document sample of this type. The list of common features includes static (known in advance) textual or graphic elements, the information about their relative location on the document, a set of keypoints and their descriptors, physical sizes and many others [3]. A set of ID document attributes, which vary from one sample to another and thus determine the identity, is known as the document's "content". Document number, surname, date of birth and the owner's photo are all examples of these attributes. In ID document recognition we are mainly interested in the "content", and the scope of this paper is limited to the recognition of printed textual attributes placed in the positions predefined by the corresponding "template". Within this context, we consider two important stages of the typical document processing pipeline (DPP): DIB (document image binarization) and OCR (optical character recognition). An OCR module is a common consumer of the binarization outcome. Here, a properly binarized document image can greatly simplify the recognition process. Some modern OCR modules are able to deal not only with binary images but also with color or grayscale ones. This variation in DPP is displayed in Fig. 1. Given such a variation, it is important to assess the influence of the binarization stage on the accuracy of OCR modules in DPP for ID documents.
Real ID documents contain a lot of personal data, which makes the task of creating a comprehensive publicly available dataset very difficult, thus limiting evaluation and benchmarking. Such a dataset, named MIDV-500 [5], was published in 2018. Later, its successors, the MIDV-2019 and the MIDV-2020 [6], became available. They consist of video clips, scanned images, photos and templates of unique mock ID documents captured in various conditions. The ground truth is provided for some problems including the OCR one, but it is not available for the DIB.
Computer Optics, 2023, Vol. 47(4) DOI: 10.18287/2412-6179-CO-1207
Both DIB and OCR problems have greatly benefited from the latest achievements in supervised learning, and most recent algorithms are based on deep learning (DL) techniques. However, their applicability to ID document analysis is not established. Thus, the goal of this work is to assess the accuracy of modern OCR modules with and without a preliminary binarization stage, within the field of ID document analysis on MIDV datasets. At the same time, the visual quality of captured ID images can vary a lot, and it is also important to assess its impact on the final recognition accuracy.
This work is an extension of the study [7] with additional experimental details and insights. The contributions can be summarized as follows: (a) an experimental analysis of the accuracy of OCR modules over the binarization outcome on the MIDV-500 and MIDV-2020 subsets; (b) an experimental analysis of the influence of input image quality on the performance of the reviewed modules on the MIDV-500; (c) an analysis of the accuracy of a U-Net based solution retrained with domain specific data from the MIDV-500; (d) a manual pixel-wise annotation of ID document templates from the MIDV-500; (e) a manual image quality annotation for the MIDV-500.
The remainder of this paper is organized as follows. Section 1 provides information about related work. Section 2 describes the set of algorithms surveyed in this work. Section 3 presents the proposed pipeline analysis and the experimental design and results. Section 4 gives the conclusion.

Related work

Document image binarization
DIB algorithms have been studied for a long time, and many surveys and comparative reviews have been published [8][9]. Nevertheless, most of them are focused on classic methods such as Otsu, Niblack, Wolf, Nick, Sauvola and many others. In most cases, their parameters should be carefully tuned to get an appropriate result on target data. However, recent advances in machine learning, especially DL, have revolutionized the domain and given rise to a multitude of methods based on end-to-end pixel classification using trained artificial neural networks (ANNs). The performance and limitations of such methods are still being studied. For instance, the recent review [10] takes some of them into account. Therefore, the goal of this paper is to address not classical, but DL-based methods of DIB.
The training process requires consistent pixel-wise ground truth annotations (PWGT). A number of such annotated datasets have been collected and published.
Since 2009, the Document Image Binarization Contest (DIBCO) has regularly provided such annotated data for benchmarking and tracking DIB progress. Other examples of relevant datasets are the Palm Leaf Manuscripts and the Persian Heritage Image Binarization Dataset. All these datasets are mainly focused on historical document analysis, so they contain a lot of handwritten data [11]. ID documents, on the contrary, contain mostly printed texts with well-defined characteristics, so the set of problems is different from those seen on historical document images. Several issues can negatively interfere with the binarization and the subsequent OCR output, for instance, special security objects and marks, diversity of colors and backgrounds, printing methods, diversity of sources and other special characteristics that depend on the issuing country. Thus, it is important to evaluate whether the trained solutions are applicable to these documents or not. The performance of ANNs is heavily influenced by the training data. As shown in [12], their application to images with minimal similarity to the data from the training dataset might produce extremely poor outcomes.

Evaluation of binarization method performance
The performance of a DIB method can be evaluated using different strategies, depending on the final objective. These strategies are mainly divided into two groups, "direct" and "indirect", based on the presence of PWGT. For the first group, the presence of such well-established ground truth is essential, but its creation is a very resource-consuming procedure which is mostly performed in a semi-manual way by domain experts. Moreover, classification results may vary from one expert to another. Many aspects of PWGT creation for binarization needs are covered in paper [13]. The main performance metrics using PWGT are examined in work [14].
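To make the "direct" strategy concrete, the sketch below computes two pixel-wise measures between a binarization output and its PWGT. The choice of F-measure and PSNR is an assumption made for illustration (both are widely used in binarization contests); the text above does not name the specific metrics examined in [14], and the function names are hypothetical.

```python
import numpy as np


def f_measure(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pixel-wise F-measure; nonzero (True) pixels mark text in both images."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # text pixels found correctly
    fp = np.logical_and(pred, ~gt).sum()   # background marked as text
    fn = np.logical_and(~pred, gt).sum()   # text pixels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))


def psnr(pred: np.ndarray, gt: np.ndarray) -> float:
    """PSNR between two binary images with pixel values in {0, 1}."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(1.0 / mse))
```

Both functions require a PWGT image of the same size as the binarization output, which is exactly the resource that is missing for most ID document data.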
The "indirect" group relies on the evaluation of the visual appearance of the binary document or of its recognition performance [15,16]. This approach was popular before datasets with PWGT like DIBCO appeared [17,18] and it is still employed in some scenarios [15]. In such a case, the final result depends on the OCR method used, and in fact the performance of the pair "binarization method × OCR method" is evaluated. To better understand the behavior of each binarization method, several OCR methods should be considered.
When the binarization outcome is fixed, the common way to evaluate the performance of a single OCR method is to calculate Levenshtein-based metrics between the obtained results and the textual ground truth [16]. Some insights about the experimental evaluation of OCR method performance are presented in paper [19]. For the task of ID document binarization the "indirect" strategy seems to be a better choice for two reasons: (a) the recognition quality is the real final objective for the majority of applications; (b) there is no relevant dataset with well-established PWGT for this document class.

Document image quality analysis
Image quality assessment is another important task in the area of image processing with a lot of applications [20]. The quality of ID document images can suffer from multiple distortions, especially when capturing conditions are not controlled. The photometric quality and geometry of the document image are affected by the presence of specular light, shape and motion distortions, defocusing and many others [21]. These factors normally lead to poor document analysis results and strongly affect the recognition stage [22]. Thus, a common step in an ID document recognition system is to evaluate the quality score of every input image. This score helps to filter out evidently bad images, choose the best image from a video stream and increase the reliability of the final recognition result [23]. Taking into account only the images with high scores improves the system's performance in terms of speed and recognition quality. The maximum level of distortion acceptable for a recognition algorithm can be determined by the method from paper [24].

Algorithm selection
Several document binarization contests have been held, providing quantitative evaluations of various binarization methods, including DL-based ones. The higher the method's ranking, the more interest it attracts. However, most participants do not publish their methods or make it difficult to reproduce results under different conditions. Despite this, some methods, such as U-Net-bin [25] and Gallego's autoencoder [26], are top-ranked and provide their solutions and training procedures publicly. For this reason, we selected these methods, along with ROBIN [27] and the popular Otsu method [28] as a non-DL baseline as used in [5,12,29], for our study.
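For reference, the non-DL baseline can be sketched in a few lines. The code below is a minimal NumPy illustration of global Otsu thresholding (choosing the gray level that maximizes the between-class variance); it is not the implementation used in the experiments, and the assumption that text is darker than the background, as well as the function names, are ours.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the gray level t that maximizes the between-class variance.

    `gray` must be a uint8 image; pixels <= t form one class, pixels > t the other.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                   # probability of the class "<= t"
    mu = np.cumsum(prob * np.arange(256))     # cumulative first moment
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)
    valid = denom > 0                         # skip thresholds with an empty class
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))


def otsu_text_mask(gray: np.ndarray) -> np.ndarray:
    """Boolean text mask, assuming dark text on a lighter background."""
    return gray <= otsu_threshold(gray)
```

Because a single global threshold is used for the whole image, the result depends on the gray-level histogram being roughly bimodal, which holds more often for clean templates than for frames captured in uncontrolled conditions.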
As for recognition methods, their choice is also based on participation in contests and benchmarks, as well as their relevance and recency. The availability of source code and pre-trained models, which allows for performance evaluations, was also a significant factor in the selection process.
Semantic Reasoning Networks (SRN) [30]. It is a four-stage DL-based system that took first place in the ICDAR 2013 competition. It creates a 2D feature map by combining a ResNet50 backbone network with a Feature Pyramid Network and two transformer units. The authors developed a novel attention mechanism called Parallel Visual Attention, which surpasses previous attention mechanisms in terms of efficiency.
Paddle Paddle OCR v2 (PPOCR2) [31]. This framework uses the same architecture as SRN and is a new version of it, focusing on improving the training process using novel strategies like Collaborative Mutual Learning (CML) and new data augmentation techniques. As a backbone it uses the new LCNet network, which is a modification of MobileNet v1.

Self-Attention Text Recognition Network (SATRN) [32]. It is an autoencoder influenced by Transformers, which exploits the 2D spatial dependencies of characters in a text image.
Baek et al. [33]. This approach presents four stages in which the authors combine text normalization, feature extraction (ResNet), sequence modeling (BiLSTM) and character sequence prediction.
ResNet CTC. It is a mix of DL approaches that includes a ResNet backbone as a feature extractor and a Connectionist Temporal Classification (CTC) module that uses the features to predict the text's characters.
ResNet FC. A straightforward DL technique including a ResNet backbone for feature extraction and a fully connected layer for character prediction.
CSTR [34]. This is a classification-based approach that incorporates a two-stage framework: a core network based on the classification perspective network, and a second-stage prediction network based on separated convolutions with global average pooling.
Tesseract [35]. It is a popular open-source engine which is commonly used as a baseline for recognition accuracy evaluation in competitions [29] and surveys [36]. We used version 4.1.1, which utilizes an LSTM (long short-term memory) ANN for recognition, making it suitable for this study as it focuses on DL approaches. Additionally, the engine's sensitivity to image preprocessing techniques can help in the analysis of the binarization step's influence.
All the provided models were trained mostly on synthetic images. The global architecture of their networks follows the same structure. In the simplest framework they employ a feature extractor with a prediction layer, and then they add a sequence modeling stage, some attention mechanisms, or some angle correction steps. Only the Tesseract, PPOCR2, SRN and SATRN models are trained to recognize punctuation characters (back and forward slashes, dots, commas, hyphens and other symbols), which are regularly present in ID document images, thus affecting the recognition accuracy.
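As an illustration of the prediction stage in a CTC-based recognizer such as ResNet CTC, the sketch below shows greedy (best-path) CTC decoding: take the most likely class per frame, merge consecutive repeats, and drop the blank symbol. This is a generic sketch of the technique, not code from any of the evaluated models; the blank index, the alphabet mapping and the function name are assumptions.

```python
import numpy as np


def ctc_greedy_decode(scores: np.ndarray, alphabet: str, blank: int = 0) -> str:
    """Greedy CTC decoding.

    scores:   array of shape (T, C) with per-frame class scores.
    alphabet: characters for classes 1..C-1 (class `blank` is the CTC blank,
              assumed here to be index 0).
    """
    best = scores.argmax(axis=1)          # most likely class for every frame
    chars = []
    prev = blank
    for k in best:
        k = int(k)
        # emit a character only when it is not blank and not a repeat
        if k != blank and k != prev:
            chars.append(alphabet[k - 1])
        prev = k
    return "".join(chars)
```

A blank between two identical classes keeps both characters, which is how CTC represents doubled letters. An alphabet without punctuation classes cannot emit slashes, dots or hyphens regardless of image quality, which is why the coverage of the trained character set matters for ID document fields.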

Experimental analysis
To objectively measure the influence of DIB on the task of textual field recognition in the ID document analysis domain, we designed a set of four experiments. With these experiments, we expect to address the following questions: 1) Is the binarization stage relevant within the text recognition pipeline for ID documents? 2) Is it possible to improve text recognition accuracy for ID documents using a binarization algorithm retrained with domain specific data? 4) How and to what extent do image quality variations influence the recognition accuracy? In this section, these experiments are carried out and their results and discussions are presented.
We used the MIDV-500 [5] and the MIDV-2020 [6] datasets, which are among the few available datasets that focus on ID documents. The MIDV-500 dataset comprises 50 document types, each with a corresponding "template": the best quality document image sample that is not affected by any capture problems. The set of templates is denoted as $T_{500}$. The dataset also contains 10 videos for each type of document, of 30 frames each, captured with two different devices in 5 scenarios with varying conditions. A ground truth annotation is provided for every textual attribute. It consists of a bounding rectangle and the text string.
The MIDV-2020 dataset is organized similarly to the MIDV-500 but contains more data. It comprises 100 templates for each of ten of the document types in the MIDV-500, for a total of 1000 unique templates of ten kinds. This set of templates is denoted as $T_{2020}$.
The basic scheme of every proposed experiment $\mathcal{E}$ is as follows: the set of binarization algorithms $\mathbb{B}_{\mathcal{E}}$ is applied to a subset of images from MIDV-500 or MIDV-2020 denoted as $I_{\mathcal{E}}$. From the binary outcome of every $B \in \mathbb{B}_{\mathcal{E}}$, textual field images are extracted according to the provided ground truth. The sets of retrieved textual fields $\mathcal{F}_B$ are the input for every recognition algorithm $R \in \mathbb{R}_{\mathcal{E}}$. The recognition error $E$ is evaluated for every algorithm $R$ and set of processed fields $\mathcal{F}_B$. The set of all recognition algorithms is denoted as $\mathbb{R}$.
In this work, we also use two special binarization methods, $B_{gt}$ and $B_{id}$. The first one is employed to evaluate the optimal result that can be achieved with binarization, using the PWGT of the image set $I_{\mathcal{E}}$ (note that such PWGT is not available for either MIDV-500 or MIDV-2020). $B_{id}$ preserves the original image and is used to evaluate the recognition error in the absence of binarization, helping determine the necessity of this step. The set of all binarization algorithms is denoted as $\mathbb{B}$.
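The scheme above can be summarized as a pair of nested loops over binarization and recognition algorithms. The following is a minimal sketch under stated assumptions: images are NumPy arrays, field annotations are `(x, y, width, height)` boxes with their text, and every name (`run_experiment`, `identity_binarizer`, `crop`) is hypothetical and not part of the MIDV tooling. Image resizing before binarization is omitted.

```python
from typing import Callable, Dict, List, Tuple

import numpy as np

Image = np.ndarray
Box = Tuple[int, int, int, int]            # x, y, width, height
Field = Tuple[Box, str]                    # bounding box and ground-truth text
Binarizer = Callable[[Image], Image]
Recognizer = Callable[[Image], str]


def identity_binarizer(image: Image) -> Image:
    """Plays the role of B_id: the image is passed through unchanged."""
    return image


def crop(image: Image, box: Box) -> Image:
    x, y, w, h = box
    return image[y:y + h, x:x + w]


def run_experiment(
    images: List[Image],
    annotations: List[List[Field]],
    binarizers: Dict[str, Binarizer],
    recognizers: Dict[str, Recognizer],
    distance: Callable[[str, str], float],
) -> Dict[Tuple[str, str], float]:
    """Mean recognition error for every (binarizer, recognizer) pair."""
    errors: Dict[Tuple[str, str], float] = {}
    for b_name, binarize in binarizers.items():
        # Binarize each full image once, then cut out the annotated fields.
        fields = []
        for image, image_fields in zip(images, annotations):
            binary = binarize(image)
            for box, text in image_fields:
                fields.append((crop(binary, box), text))
        for r_name, recognize in recognizers.items():
            dists = [distance(recognize(f), g) for f, g in fields]
            errors[(b_name, r_name)] = sum(dists) / len(dists)
    return errors
```

Binarizing the whole image before cropping mirrors the order described in the text: fields are extracted from the binary outcome rather than binarized one by one.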
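In this work, the error of an OCR module $R$ is evaluated over a given dataset $\mathcal{D} = \{(f, g) \mid f \in \mathcal{F},\ g \in \mathcal{G},\ |\mathcal{F}| = |\mathcal{G}| = N\}$. Here, $\mathcal{F}$ and $\mathcal{G}$ represent the sets of textual field images and the corresponding ground truth. The recognition result $r = R(f)$ of the textual field $f \in \mathcal{F}$ is a string, as is the corresponding value $g \in \mathcal{G}$. To compare $r$ with $g$, a string matching approach based on the calculation of the Levenshtein distance $L_{dist}(r, g)$ is used. It is known as the "normalized Levenshtein distance" (Eq. 1) and is described in detail in work [37].

$$V(r, g) = \frac{2 \cdot L_{dist}(r, g)}{|r| + |g| + L_{dist}(r, g)}. \qquad (1)$$

The overall error of recognition algorithm $R$ over the dataset $\mathcal{D}$ is denoted as $E(R, \mathcal{D})$. It is calculated as the mean value of all the distances $V(r, g)$ (Eq. 2).

$$E(R, \mathcal{D}) = \frac{1}{N} \sum_{(f, g) \in \mathcal{D}} V(R(f), g). \qquad (2)$$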
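The error measure can be sketched in plain Python as follows. The sketch assumes that the normalized distance takes the form $2 L_{dist} / (|r| + |g| + L_{dist})$ and that the overall error is the arithmetic mean over all fields; the function names are hypothetical.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance with unit costs (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def normalized_distance(r: str, g: str) -> float:
    """Normalized Levenshtein distance in [0, 1] (assumed form, see lead-in)."""
    d = levenshtein(r, g)
    if d == 0:
        return 0.0                         # also covers two empty strings
    return 2.0 * d / (len(r) + len(g) + d)


def mean_error(results: Sequence[str], truths: Sequence[str]) -> float:
    """Mean normalized distance over paired recognition results and ground truth."""
    if len(results) != len(truths) or not results:
        raise ValueError("results and truths must be non-empty and of equal length")
    total = sum(normalized_distance(r, g) for r, g in zip(results, truths))
    return total / len(results)
```

For example, "kitten" and "sitting" are three edits apart, which gives a normalized distance of 6 / 16 = 0.375, while an empty recognition result against any non-empty ground truth gives the maximum value of 1.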

Experiment 1: Impact of binarization algorithms in the text recognition pipeline
The goal of the first experiment is to establish the real effect of the DIB stage within the context of the recognition process on the MIDV-500 dataset.

Fig. 2. Description of the first experiment
In this experiment, the image data is a subset of $T_{500}$, with ten documents written in non-Latin alphabets excluded. The best character size for each DIB algorithm, as determined in [7], is used, and all the template images are accordingly resized before being processed by the binarization algorithms. The set of filtered and preprocessed templates is $\hat{T}_{500}$, with the binarization set $\mathbb{B}_1 = \mathbb{B} \cup \{B_{id}, B_{gt}\}$ and the recognition set $\mathbb{R}_1 = \mathbb{R}$, where $\mathbb{B}$ and $\mathbb{R}$ are the sets of all binarization and recognition algorithms. The full pipeline is illustrated in Fig. 2.
To use $B_{gt}$, we created PWGT for all 50 templates from the MIDV-500 dataset (Fig. 9b). For every template this annotation represents a binary image that separates the background from the texts and any attributes containing printed or handwritten characters, including signatures. During this process, texts over any support (ink, printed, sublimated, optically variable, etc.), in any alphabet, and with different sizes, colors and typefaces were taken into account. This PWGT is not restricted to training and evaluating binarization algorithms; it also extends the dataset for future experiments and research such as signature detection, segmentation and recognition.
The annotation was performed by multiple specialists in order to obtain more variability in the resulting data, given that some pixels can be classified differently by different persons. Some of the templates presented low resolution, complex backgrounds, overlapping texts and zones with text occlusions (because of security and information printings like photos, watermarks and seals). The annotation is freely available online at ftp://smartengines.com/midv-500-extra-annotations.
The results of the experiment are presented in Tab. 1. These results show that all recognition algorithms perform better on non-preprocessed images or on ground truth sources. The poor performance of binarization in the text recognition pipeline suggests either that current DL-based binarization results are not good enough or that current DL-based OCR algorithms are already good enough without preprocessing of their input. However, when the same recognition algorithms were run on PWGT (Tab. 1, column $B_{gt}$), lower error rates were obtained compared to the original images, suggesting that binarization can improve recognition accuracy and that the binarization algorithms used have room for improvement.
It can be observed that the Otsu algorithm consistently outperforms the other binarization methods, particularly for the uniform backgrounds and high-quality templates found in ID documents. The DL-based methods, such as the Gallego autoencoder, also show promising results, but in some cases are not able to reach the level of performance of Otsu or of non-binarized images. Additionally, the PPOCR2 and SRN recognition algorithms stand out as the top performers among the recognition methods tested.

Experiment 2: Retraining binarization network on specific domain data
The goal of this experiment is to establish the effect of the DIB within the context of the whole recognition process on a subset of the MIDV-2020 dataset after retraining one of the binarization solutions using domain data taken from the MIDV-500 dataset.
Here, the image data is a subset of $T_{2020}$. Since there are two document types filled entirely in non-Latin alphabets, the 200 corresponding templates are excluded from the evaluation, for a total of 800 images. The filtered set of templates is designated as $\hat{T}_{2020}$, and the binarization set contains $B_{id}$, Otsu, $B_U$ and $B_U^R$. Here, $B_U$ is the original U-Net-bin method and $B_U^R$ is its retrained version, which used some domain data from the newly annotated PWGT for MIDV-500. The choice of Otsu stems from its results in the first experiment and the fact that it is still a common baseline for the task of binarization. Unfortunately, $T_{2020}$ is too large, so PWGT preparation would be too resource-consuming. Thus, $B_{gt}$ is not available for this experiment. The set of recognition algorithms contains only the recognizer with the lowest error according to the first experiment: $\mathbb{R}_2 = \{R_{PPOCR2}\}$.
We now describe the retraining process of the U-Net based binarization solution with domain specific data from the MIDV-500 dataset, in order to contrast it with the original solution trained on general image data. For document template binarization we used a DL approach provided by the DIBCO-2017 competition winners and based on the U-Net model. Compared to the DIBCO-2017 challenge, binarization of ID documents is easier than that of arbitrary historical handwritten documents. There is also a difference in the amount and variability of training data: in the case of MIDV-500, the number of training images is half as large, and some of them are similar in many respects. Thus, to reduce the effect of overfitting and improve performance, we reduced the number of trainable parameters in the network by halving the number of filters in all convolutional layers.
The model was trained from scratch using MIDV-500 templates as a training set along with their newly annotated PWGT, $B_{gt}$. The document types that also appear in the MIDV-2020 test set were removed from the training data to avoid a biased estimate. MIDV-500 contains 50 template images of different quality and resolution, which differ from those of the MIDV-2020 test set. To overcome this, we scaled the template images to widths of 930, 1100, 1400, 1800 and 2160 pixels (the original aspect ratio was preserved). The obtained images were sliced into 128×128 grayscale patches with a step size of 64, and also with random shifts from 0 to 32 pixels. During training, we used the following augmentations on the fly: (a) cutting out a region with a size of 0.6 to 0.9 of the patch size followed by scaling to the initial size, probability 0.1; (b) random rotations of 90 or 180 degrees, probability 0.1; (c) autocontrast, probability 0.2; (d) adding lines, probability 0.05; (e) Gaussian noise, probability 0.15. The U-Net model was trained for 80 epochs using the SGD optimizer (learning rate 1e-6, momentum 0.99) and batch size 128.
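The patch preparation described above can be sketched as follows. This is an illustrative NumPy sketch, not the training code: how the random shift is combined with the regular grid, the noise standard deviation (`NOISE_SIGMA`) and all function names are assumptions, and only augmentations (b) and (e) are shown.

```python
import numpy as np

WIDTHS = (930, 1100, 1400, 1800, 2160)  # target template widths from the text
NOISE_SIGMA = 8.0                       # hypothetical value, not given in the text


def scaled_size(width: int, height: int, target_width: int) -> tuple:
    """New (width, height) for a target width, preserving the aspect ratio."""
    return target_width, max(1, round(height * target_width / width))


def slice_patches(gray, patch=128, step=64, max_shift=32, rng=None):
    """Cut patch x patch windows on a regular grid with a random shift per window."""
    if rng is None:
        rng = np.random.default_rng()
    height, width = gray.shape
    patches = []
    for y in range(0, height - patch + 1, step):
        for x in range(0, width - patch + 1, step):
            dy, dx = rng.integers(0, max_shift + 1, size=2)
            yy = min(y + int(dy), height - patch)  # keep the window inside the image
            xx = min(x + int(dx), width - patch)
            patches.append(gray[yy:yy + patch, xx:xx + patch])
    if not patches:
        return np.empty((0, patch, patch), dtype=gray.dtype)
    return np.stack(patches)


def augment(patch, mask, rng):
    """Apply augmentations (b) and (e) with the probabilities given in the text."""
    if rng.random() < 0.1:                          # (b) rotate by 90 or 180 degrees
        k = int(rng.choice([1, 2]))
        patch, mask = np.rot90(patch, k), np.rot90(mask, k)
    if rng.random() < 0.15:                         # (e) additive Gaussian noise
        noisy = patch.astype(np.float32) + rng.normal(0.0, NOISE_SIGMA, patch.shape)
        patch = np.clip(noisy, 0, 255).astype(np.uint8)
    return patch, mask
```

Geometric transforms are applied to the image patch and its PWGT patch together, while photometric ones such as noise touch only the image, so that the pixel-wise labels stay aligned with the input.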
The data annotated in the MIDV-2020 is more detailed than in the MIDV-500, even for the same kind of document. The MIDV-2020 contains some extra annotated fields, and some of them are difficult to binarize, for instance, holographic texts and vertically oriented texts. In this work, this set of fields is called "problematic" and its complement "regular". Let us denote the set of all textual fields as $\mathcal{F}_B^{All}$ and the set of "regular" textual fields as $\mathcal{F}_B^{Reg}$. In this experiment, the error is measured over these two groups of fields.
The results of this experiment are shown in Tab. 2. The first row corresponds to the set of all fields, the second one only to the "regular" fields. It can be observed that, although the retrained network still does not outperform the non-binarized images, as in the first experiment, Otsu is no longer the method with the best binarization results, which may indicate that, for this domain, training on specific data gives better results than general purpose algorithms.

Experiment 3: Binarization and recognition of individual ID document fields
To further investigate the behaviour of the PPOCR2 recognizer jointly with the retrained version of the U-Net-bin binarizer, another experiment was designed. The recognition error is evaluated for every previously binarized field of each document. It sheds light on how recognition errors behave inside a single document type.
The Finnish ID templates from the MIDV-2020 dataset, denoted as $T_{2020}^{Fin}$, were chosen as input image data for this experiment. This document type is indicative since it simultaneously contains two kinds of problematic fields: holographic and vertically oriented (see Fig. 4a). Finally, $\mathbb{B}_3 = \{B_U^R\}$ and $\mathbb{R}_3 = \{R_{PPOCR2}\}$. In this experiment, the resulting measurements are aggregated over field types. As observed in Tab. 3, the difference in recognition error between problematic and regular fields is significant. It means that there is room for improvement in the binarization and recognition algorithms in these special cases, which are common in this domain. This behaviour also supports the results shown in the second row of Tab. 2, which displays a lower recognition error if these fields are not taken into account.

Experiment 4: Influence of image capture quality on the document attributes recognition
Previous experiments were performed on ID document images with the best possible quality. However, in real-world applications, ID document analysis is often performed using a video stream as input, resulting in varying image quality. A method for a priori quality estimation is desirable in these circumstances. Currently, there is no universal solution to this problem, but there are domain-specific methods such as document image quality assessment (IQA) [20] and ID document IQA [38]. However, the goal of this experiment is not to evaluate the quality of the image itself, but to illustrate its influence on the entire DPP, that is, on the binarization and recognition steps, using the data and methodology established earlier in this work.

Fig. 5. Description of the fourth experiment
For this experiment, the set of all video frames from the MIDV-500 dataset, denoted as $F_{500}$, was used. We conducted an evaluation based on qualitative values determined by expert personnel, aided by algorithms for focus analysis and detection of specular light. Given this evaluation, four document image quality groups are established, which are used to assess the recognition error. All the frames from $F_{500}$ are divided into these four groups: (a) "good": without any visible problem and close to template image quality (Fig. 6); (b) "average": with almost no defects on their fields (Fig. 6); (c) "bad": with very low photometric quality, but readable with effort (Fig. 6); (d) "discard": with unreadable fields due to motion blur, occlusion or specular light (Fig. 4). From the initial 15000 frames, 5476 were discarded, 2294 were labeled as "bad", 3160 as "average", and 4070 as "good". The obtained dataset is denoted as $F_{500}^{G}$ and this annotation is available at ftp://smartengines.com/midv-500-extra-annotations. Recognition error is calculated for every group using the same methodology as in the previous experiments. According to the proposed annotation, all non-discarded frames were fed to the top three ranked recognition algorithms from the first experiment. Additionally, all the frames were binarized using the Otsu algorithm, since it was the best binarization algorithm in the MIDV-500 template experiments. The pipeline is presented in Fig. 13.
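The per-group evaluation amounts to averaging field-level distances separately for each quality label and skipping discarded frames. The sketch below illustrates this bookkeeping; the record layout and the names `GROUP_SIZES` and `error_by_group` are hypothetical, while the group sizes are the ones reported above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Number of MIDV-500 frames per quality group, as reported in the text.
GROUP_SIZES = {"good": 4070, "average": 3160, "bad": 2294, "discard": 5476}


def error_by_group(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Mean recognition distance per quality group.

    Each record is (quality label of the frame, distance for one field).
    Fields from frames labeled "discard" are not evaluated.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, dist in records:
        if label == "discard":
            continue
        sums[label] += dist
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}
```

The four group sizes add up to the 15000 frames of the dataset (50 document types, 10 videos each, 30 frames per video), so 9524 frames remain after the "discard" group is removed.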
The results are presented in Tab. 4. The results measured for the templates from the first experiment are added as a baseline reference. As expected, there is a direct relationship between the quality of the frames and the recognition error. The best case scenario is given by the templates (first three rows), which depict the best possible capture quality. The other rows indicate how the recognition error behaves when the quality is degraded in video streams. The first row corresponds to the binarization ground truth, representing the ideal binarization output. Even for algorithms like PPOCR2, which obtained the best results in Experiment 1 without binarization, using an appropriate binarization process could improve the results on images with lower quality or captured in the wild. In this context, even the best-quality images have room for improvement in order to achieve the results obtained for the binarization ground truth.
Although the Otsu method obtained good results in the previous experiments, in this one the error rates rose significantly even for the "Good" quality batch, which visually is not so distant from the templates. This is consistent with the well-known drawback of Otsu's algorithm when dealing with cluttered or non-uniform backgrounds, as well as its dependence on illumination, focus and photometric quality. This algorithm is not recommended for processing ID documents captured from video streams.
The overall best recognition algorithm for video frames with lower quality was PPOCR2, which maintains good results even for the "Average" batch.
This analysis can provide a basis for future workflows to include IQA as an intermediate step in the recognition pipeline. Also, the negative impact of bad quality samples on the recognition process can justify the need for more robust binarization methods to handle the artifacts of these real scenarios.

Conclusions
In this paper, we performed a comparative joint study of the DIB and OCR stages within the ID document analysis domain. This subject has received little attention in the literature. We conducted our experiments on the ID document image datasets MIDV-500 and MIDV-2020.
Two new ground-truth annotations were obtained: one related to the quality of ID document images captured from a video stream, and another one that is a pixel-wise ground truth for 50 good document images. A model trained specifically for ID document binarization was obtained, which could serve as a baseline for future studies.
We could observe that all recognition algorithms seem to behave better on non-binarized images, except when the input was the binary ground truth of the image. This means that if binarization algorithms improve their results, they can be helpful for the recognition task. A valuable observation in all the experiments was that, for this domain, the PPOCR2 algorithm outperformed all the other evaluated methods in terms of recognition rate. It was shown that the Otsu algorithm outperformed all DL methods in many cases when using the image templates, but the domain-specific retrained U-Net network obtained lower error rates than Otsu. As expected, it also improved on the rates obtained by the same network pretrained on general data.
We conducted a study of recognition by field within each document (instead of the global document image). We found that holographic and vertically oriented fields are the ones with the greatest influence on the drop in recognition rates. This may indicate that this kind of field requires special attention in this research domain.
For ID documents captured from video streams, we measured how the quality of frames affects the recognition rates. For the good and average image quality groups there is still room for improvement if a good binarization could be obtained from them, since the best recognition methods degrade their results with respect to those obtained over the binarization ground truth and image templates (not affected by quality artifacts). In this case, Otsu's algorithm obtained worse error rates, thus we recommend using DL-based solutions. As future research we plan to develop new binarization methods that are more robust to the quality issues present in video environments, as well as new methods that take into account the problematic fields we studied.

Fig. 3. Original template and different binarization outputs: a) template from MIDV, b) prepared ground truth, c) Otsu output, d) Gallego output, e) U-Net-bin output, f) ROBIN output

Tab. 1. Recognition error E over binarization outcomes for 500