Algorithm for choosing the best frame in a video stream in the task of identity document recognition

During document recognition in a video stream captured with a mobile device camera, the image quality of the document varies greatly from frame to frame. Sometimes the recognition system is required not only to recognize all the specified attributes of the document, but also to select a final document image of the best quality. This is necessary, for example, for archiving or providing various services; in some countries it can be required by law. In this case, the recognition system needs to assess the quality of frames in the video stream and choose the “best” frame. In this paper we consider a solution to this problem in which the “best” frame is one where all specified attributes are present in a readable form in the document image. The method was tuned on a private dataset and then tested on documents from the open MIDV-2019 dataset. A practically applicable result was obtained for use in recognition systems. This work was partially supported by the Russian Foundation for Basic Research (projects 17-29-03161 and 18-07-01387).


Introduction
The steady growth in the quality and reliability of automated text recognition algorithms over the past decade has led to an increase in demand for input and verification systems for various text documents [1,2]. The classic source of document images in such systems is a specialized flatbed scanner. However, with the development of modern mobile devices, input systems using small-format cameras are gradually replacing traditional systems.
Images obtained from specialized scanners are characterized by fairly uniform illumination of the document, high image resolution and the absence of projective distortions (see fig. 1a). At the same time, images of documents received from cameras of mobile devices can have a number of defects that are absent when working with a scanner: flares (see fig. 1d), "blurring" of the document area (see fig. 1c), the document not being completely present in the frame (see fig. 1e), etc. [3]. In this case, using several frames from the video stream and then combining the recognition results obtained on those frames can significantly increase the recognition quality [4,5].
In some cases, the recognition system is required to select one "good" frame from the video stream (see fig. 1b), which will either be shown to the operator to check the correctness of recognition, saved in a special database, or used to provide services (for example, issuing SIM cards). Hereinafter, such a frame will mean a document image in which all the text attributes of the document with the owner's data are present in a readable form.
In this paper we will consider the solution to the problem of evaluating the "goodness" of frames in a video stream and choosing the "best" frame. The possibility of cutting off "bad" frames based on the recognition results is investigated, provided that each given attribute corresponds to a certain recognition result. Since flare or camera focusing errors can lead to unpredictable recognition results [6], to assess the "goodness" of the frame the document image is additionally checked for flares and evaluated for blurs.

Related work
First of all, we note the works that directly analyze the quality of text areas in the document image. These methods can be conditionally divided into two groups: those that directly analyze the parameters of the font (typeface anatomy), and those that determine the readability of the font by indirect signs, for example, by the quality of recognition of OCR systems.
An example of the first group is the work [8], which uses the analysis of luminance gradients within zones containing individual characters or groups of characters. Another example is the work [9]. It describes three groups of features calculated for each symbol: morphological, anti-aliasing artifacts, and a group of spatial features that describe geometric distortions of the image. In [10] the assessment of the image quality of the document is the weighted sum of the image clarity assessment and the font parameters assessment. The latter, in turn, is the sum of three estimates: the number of dark specks around the text, imitating the speckle structure, the estimate of the inter-letter space and the estimate of the size of the inter-letter space to the total size of the letter, which were proposed in [11] and adapted by the authors of the article for their own document format. The second group of methods includes the work [12], which proposes a method for calculating the image quality assessment based on calculating the maps of the mean square deviations of the brightness gradient calculated on the text areas of the image. The calculated estimates were further correlated with the accuracy of the OCR systems on the same images. Another example is the work [13], where the image quality of the document was assessed using deep learning methods. For this, the input image using binarization methods was divided into sections containing text information of equal size, and the neural network was trained in such a way that the predicted quality of each section correlated with the recognition accuracy.
In addition to directly determining the quality of the text areas of the document, there is a line of work in the literature in which the entire document image is analyzed. In [14], a neural network model is proposed that receives an image as input and returns a quality value. The model was trained on the following data: a pair (input image, target image) and the value of the quality score for this pair.
A number of methods have also been proposed that calculate a document quality score based on some sharpness score. In the work [15], two values are used to assess the sharpness: the maximum gradient and the standard deviation of the gradient calculated for the entire image. The first value characterizes the sharpest part of the frame, the second shows how uniform the image is as a whole. The estimate proposed in [16] is based on measuring the width of the gradient transition that forms the boundaries of objects in the image: the sharper the image, the narrower the gradient transitions, and the lower this indicator.
Most of the works use their own internal datasets, which makes it difficult to compare different approaches. Therefore, it is necessary to mention the existence of open datasets [17-19] containing document images. The main purpose of their creation and application is both to make it possible to correctly compare the quality of different OCR systems on the same dataset, and to compare different approaches to assessing image quality with each other.
As can be seen from the review, to determine the readability of textual information in the document image, either explicitly specified signs of text degradation (thinning of letter strokes or gaps in symbols) are used, or machine learning methods, which themselves formulate features based on a training sample. Taking into account that the concept of character readability is formalized rather poorly (as is the concept of a high-quality image of a document in general), the latter will require a significantly larger amount of training data. It should also be noted that none of the methods mentioned above take into account that the document image may not be entirely in the frame: a situation is then possible in which the document image has a high quality rating but does not contain all the necessary details.
In this work, the result of document recognition is analyzed to check for the presence of all specified attributes on the document image in the frame. It is assumed that if a frame contains the document image with all the specified attributes in a readable text form, then the recognition network's confidence in its response will tend to 1.0 for each specified attribute. Therefore, in this work, the possibility of determining the readability of a symbol by the confidence of the recognition system in its answer will be investigated. This will allow us not to explicitly set a list of possible reasons for the poor readability of a single character, and will also reduce the total number of recognition networks in the document analysis system, which is especially important in conditions of recognition on low-power computing processors (for example, smartphones or tablets).

Task formulation
Let a sequence of projectively corrected and recognized images be obtained as a result of recognizing a sequence of frames: $I = (I_1, \dots, I_N) \in \mathcal{I}^N$ of length $N$ (with $I(j) = I_j$), where each $I(j)$ contains $M_j$ fields (document attributes). Let $p$ denote the array of frame indices. The function for choosing the best image in the sequence is $Q : \mathcal{I}^N \to \{I_q, q_{score}\}$, where the best-frame estimate $q_{score}$ takes one of the given values: $q_{score} \in \{\text{"good"}, \text{"bad"}\}$.
Note that the choice of the best image does not depend on the order of images in the sequence: for any permutation $\pi$ of the index array $p$, $Q(I(\pi(p))) \equiv Q(I(p))$.
Thus, the goal of this article is to construct a function Q for choosing the best image in the sequence.

Algorithm for choosing the best frame
The general scheme of the proposed decision-making algorithm is given below in the form of pseudocode. The algorithm receives as input a sequence of projectively corrected images of the document (projective correction is achieved using specialized algorithms such as [20,21]) and the results of recognition of all specified attributes on each image. Each such result contains the coordinates of the bounding rectangle of the text field (for more information on this topic, see [22]), the field recognition result, and the neural network's confidence in its answer. The result of the algorithm is the best image from the input sequence and its evaluation in the form "good"/"bad". The assessment is carried out through a sequential analysis of three frame quality indicators, calculated by analyzing the confidences of the recognizer in its answer, searching for flare in the document image and evaluating the "blur" of the document image.
Algorithm: Best frame choosing.
Input: N recognized images I_1, ..., I_N, thresholds T_CS, T_FS.
Output: best image I_x, its grade as "good"/"bad".
accepted = [], rejected = []
for each image I_i, i = 1, ..., N:
    compute CS_i, FS_i, DS_i
    if CS_i > T_CS and FS_i > T_FS then:
        add (I_i, DS_i) to accepted
        sort accepted by DS_i
    else:
        add (I_i, DS_i) to rejected
        sort rejected by DS_i
if size(accepted) > 0 then:
    return accepted[0], "good"
else:
    return rejected[0], "bad"
Document images for which the recognition system did not find all the attributes in the frame are not allowed to enter the module.
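The selection procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `ScoredFrame` container and the `choose_best_frame` function are hypothetical names, the three scores are taken as already computed, and a larger DS_i is assumed to mean a sharper frame.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ScoredFrame:
    frame_id: str  # hypothetical handle standing in for the image I_i
    cs: float      # CS_i: recognizer confidence score
    fs: float      # FS_i: flare score
    ds: float      # DS_i: sharpness score


def choose_best_frame(frames: List[ScoredFrame],
                      t_cs: float, t_fs: float) -> Tuple[ScoredFrame, str]:
    """Return the selected frame and its grade, "good" or "bad"."""
    if not frames:
        raise ValueError("at least one recognized frame is required")
    # A frame is accepted only if it passes both threshold checks.
    accepted = [f for f in frames if f.cs > t_cs and f.fs > t_fs]
    rejected = [f for f in frames if not (f.cs > t_cs and f.fs > t_fs)]
    pool, grade = (accepted, "good") if accepted else (rejected, "bad")
    # Assumption: a larger DS_i means a sharper frame, so the first element
    # of the sorted list in the pseudocode is the frame with the maximum DS_i.
    return max(pool, key=lambda f: f.ds), grade
```

Taking the maximum over the chosen pool gives the same frame as sorting that pool by DS_i and returning its first element, without re-sorting after every insertion.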
Next, the algorithms for calculating the three mentioned frame quality indicators will be considered in sections 2.3, 2.4, 2.5, and then, in section 3, all parameters and threshold values of the algorithm are adjusted.

Recognizer confidence analysis
Let there be a symbol image $x \in X$ in the string $X$ and a finite set of classes $C_1, \dots, C_M$ of size $M$, also called the recognition alphabet.
The neural network implements the classifying function $A(x)$, which assigns the vector of alternatives $a = ((v_1, p_1), \dots, (v_M, p_M))$ to the image $x$ in such a way that $p_1 \ge p_2 \ge \dots \ge p_M$, where $p_k$ is an estimate of the recognition object $x$ belonging to the class $C_{v_k}$ and $v_k$ is the index of the class with the $k$-th largest estimate. The recognition result of the symbol $x$ is the class $C_{v_1}$ with the maximal confidence $p_1$ [23]. Let us define the recognition result of the string $X$ as $((v_1, p_1), \dots, (v_{l_X}, p_{l_X}))$, where $(v_i, p_i)$ is the result of recognizing the $i$-th character in a string $l_X$ characters long.
We will define the current frame as "good" if the following condition is met: $\psi_{X \in X'}(\varphi(X)) > \tau$, where $\varphi(X)$ is some statistical function over the result of recognizing the string $X$, $\psi$ is a statistical function over $X'$, $X'$ is the set of recognized lines on the document, and $\tau$ is an experimentally selected threshold.
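A minimal sketch of this check is given below, assuming that each recognized string is represented only by the list of top-1 confidences of its characters. The function names `frame_confidence` and `is_good_by_confidence` and the parameters `string_stat` and `document_stat` are hypothetical; the two statistics and the threshold are left as parameters, with the mean used as an illustrative default.

```python
from statistics import mean
from typing import Callable, Sequence

Stat = Callable[[Sequence[float]], float]


def frame_confidence(strings: Sequence[Sequence[float]],
                     string_stat: Stat = mean,
                     document_stat: Stat = mean) -> float:
    """Aggregate character confidences per string, then over all strings."""
    if not strings or any(len(s) == 0 for s in strings):
        raise ValueError("every recognized string needs at least one character")
    per_string = [string_stat(s) for s in strings]
    return document_stat(per_string)


def is_good_by_confidence(strings: Sequence[Sequence[float]],
                          threshold: float,
                          string_stat: Stat = mean,
                          document_stat: Stat = mean) -> bool:
    """The frame passes if the aggregated confidence exceeds the threshold."""
    return frame_confidence(strings, string_stat, document_stat) > threshold
```

For example, `is_good_by_confidence([[0.99, 0.98, 0.97], [0.95, 0.93]], 0.9)` passes, while a single poorly recognized character can pull a short string, and with it the frame, below the threshold.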

Image assessment for flares
We will consider a flare as a local spot with sharp edges and maximum possible brightness, that appears on laminated documents [24]. In this case, the value of image quality will be defined as the minimum among all estimates calculated as the ratio of the area of the attribute zone to the area of the flare that falls into the attribute zone. Thus, it is enough to overlap the area of one attribute with a flare to affect the quality assessment of the entire document. On the other hand, a flare is allowed if it does not interfere with the reading of the document attributes and does not affect the quality of recognition. However, it should be noted that if the flare is located in close proximity to the field, it is no longer possible to determine whether it covers part of the field or not without expert judgment (see fig. 2). Therefore, the requirement for the location of flares must be tighter: if the distance between the flare and the field in the direction of the text is less than one printed character, it is considered that the flare overlaps the field. In other words, the algorithm is applied to the widened bounding rectangles.
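The widening of bounding rectangles can be illustrated with the sketch below. It assumes horizontal text and estimates the width of one printed character as the field width divided by the number of recognized characters; this estimate, the `Box` layout and the `widen_box` name are assumptions made for illustration, not details given in the text.

```python
import math
from typing import Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def widen_box(box: Box, n_chars: int, image_width: int) -> Box:
    """Extend a field box by one estimated character width on both sides."""
    left, top, right, bottom = box
    if n_chars < 1 or right <= left:
        raise ValueError("box must be non-empty and contain at least one character")
    # Assumption: all characters in the field have roughly the same width.
    char_width = math.ceil((right - left) / n_chars)
    return (max(0, left - char_width), top,
            min(image_width, right + char_width), bottom)
```

The result is clipped to the image, so a field close to the document edge is widened only as far as the image allows.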
For segmentation of the document image into flare / non-flare, binarization with a threshold $T_{bin}$ is used (the choice of thresholds is described in section 3 on algorithm setup). Then, for each text field zone in the image, bounded by the corresponding rectangle, the corresponding zone of the flare mask is considered:
1. for each one-pixel-wide strip $i$ of the field across the direction of the text, the ratio $S_i$ of the flare area in this strip to the area of the strip, measured in pixels, is calculated: $S_1, \dots, S_{width}$;
2. the maximum of the calculated ratios is found: $S_{max} = \max\{S_i : i \in [1, width]\}$;
3. using the threshold $T_{flare}$, the flare score $FS_{field}$ for the field is calculated.
To estimate the sharpness score, a modified version of the algorithm described in [25] was used, with the following steps:
1. get a one-channel image $I_{1ch}$;
2. calculate gradient maps in two orthogonal directions: vertically, $G_H$, and horizontally, $G_V$;
3. calculate the given quantile $T_q$ for each direction: $Q_{T_q}(G_H)$ and $Q_{T_q}(G_V)$;
4. select the smallest of the obtained values: $q_{sharp} = \min(Q_{T_q}(G_H), Q_{T_q}(G_V))$.
The choice of the $T_q$ threshold is described in section 3.4 on algorithm setup.
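The sharpness steps listed above can be sketched in pure Python as follows. The gradient operator and the quantile definition are assumptions (simple absolute finite differences and a nearest-rank quantile); the modified algorithm of [25] may differ in both respects, and `sharpness_score` is a hypothetical name.

```python
from typing import List, Sequence

Image = List[List[float]]  # one-channel image: rows of pixel intensities


def _quantile(values: Sequence[float], q: float) -> float:
    """Nearest-rank style quantile of a non-empty sequence, 0 <= q <= 1."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def sharpness_score(image: Image, q: float = 0.95) -> float:
    """Minimum over two directions of the q-quantile of gradient magnitudes."""
    height, width = len(image), len(image[0])
    if height < 2 or width < 2:
        raise ValueError("image must be at least 2x2 pixels")
    # G_H in the text: differences taken along the vertical direction.
    g_h = [abs(image[y + 1][x] - image[y][x])
           for y in range(height - 1) for x in range(width)]
    # G_V in the text: differences taken along the horizontal direction.
    g_v = [abs(image[y][x + 1] - image[y][x])
           for y in range(height) for x in range(width - 1)]
    return min(_quantile(g_h, q), _quantile(g_v, q))
```

Taking the minimum of the two quantiles means that an image blurred along only one direction, for example by camera shift, still receives a low score.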

Algorithm parameters setting on a training dataset
The parameters of each module responsible for classifying a frame as good or bad were adjusted by constructing the ROC curve for that module, comparing the areas under the curves (area under the curve, AUC) and choosing the thresholds corresponding to the optimal FPR / TPR ratio.
To configure the final algorithm, an internal closed dataset of Arabic ID documents (identifier ARE-BO-01001 in the PRADO [26] database) was selected, containing 1535 images from 26 video clips. Each image was marked as good or bad: there were 723 "bad" and 812 "good" images in total. Images were marked as follows: if at least one field was unreadable in the document image after localizing the document area and correcting its projective distortions, the document was marked as "bad".
In addition, the approach of [27] was chosen as a baseline for comparison with the proposed algorithm.
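As an illustration of this tuning step, the sketch below builds ROC points and the trapezoidal AUC from per-frame scores and good/bad labels. It is a generic construction, not the authors' tooling; `roc_points` and `auc` are hypothetical names, and a frame is assumed to be classified as "good" when its score exceeds the threshold.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float],
               labels: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points as the threshold sweeps from high to low."""
    positives = sum(1 for is_good in labels if is_good)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both good and bad frames are required")
    pairs = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Frames with equal scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Each point corresponds to one candidate threshold, so a module's operating threshold can be read off the curve once a trade-off between FPR and TPR has been fixed.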

Recognizer confidence analysis
Several statistical functions $\varphi$ over the character confidences of a recognized string $X$ were considered in the paper, including the mean and the median. For $\psi$, similar functions were taken, but they were computed over the values $\varphi(X)$, $X \in X'$.
By enumerating all possible combinations of $\varphi$ and $\psi$, ROC curves were constructed to select the most appropriate classifier.
The ROC curves constructed for the internal dataset of Arabic IDs are shown in fig. 3a. As can be seen from the graphs, the best classifier turned out to be $\varphi$ = mean with $\psi$ = mean. Special attention should also be paid to the behavior of the ROC curve when the median is used (see fig. 3b). It can be seen from the graphs that a change in the threshold $\tau$ by 0.1 can lead to a sharp change in the FPR / TPR ratio, which is inconvenient when setting up the algorithm and choosing the optimal threshold. Based on the results of the experiments, $\varphi$ = mean, $\psi$ = mean and $\tau$ = 0.9 were chosen to assess the frame quality (see fig. 3a).

Image assessment for flares
To determine the binarization threshold $T_{bin}$, candidate values of $T_{bin}$ were enumerated with a step of 5 (for the range of values of the original image [0, 255]), and for each of them a ROC curve was constructed over the cutoff thresholds $T_{flare}$. Fig. 4a shows the four curves with the maximum area; the rest are omitted for clarity. As can be seen from the graph, the maximum values are reached at $T_{bin}$ thresholds equal to 235, 240 and 245. The middle value of 240 was chosen for the algorithm. Fig. 4b shows the ROC curve for this threshold separately.
For the flare estimation, $T_{bin}$ = 240 and $T_{flare}$ = 0.33 were chosen.
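The flare segmentation and the strip-wise ratios described in the flare assessment section can be sketched as follows, using the chosen binarization threshold of 240. The sketch assumes horizontal text, so strips are one-pixel-wide columns, and it treats pixels strictly brighter than the threshold as flare; both points, as well as the names `flare_mask` and `max_strip_flare_ratio`, are assumptions. It stops at $S_{max}$: how $S_{max}$ and $T_{flare}$ combine into the field score is not reproduced here.

```python
from typing import List, Tuple

Gray = List[List[int]]           # grayscale image with values in [0, 255]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), exclusive ends


def flare_mask(gray: Gray, t_bin: int = 240) -> List[List[int]]:
    """Binarize the image: 1 marks a flare pixel, 0 a non-flare pixel."""
    return [[1 if pixel > t_bin else 0 for pixel in row] for row in gray]


def max_strip_flare_ratio(mask: List[List[int]], box: Box) -> float:
    """Largest share of flare pixels among the one-pixel-wide field columns."""
    left, top, right, bottom = box
    if right <= left or bottom <= top:
        raise ValueError("field box must be non-empty")
    strip_height = bottom - top
    ratios = [sum(mask[y][x] for y in range(top, bottom)) / strip_height
              for x in range(left, right)]
    return max(ratios)
```

Because the maximum over strips is used, a narrow flare that fully covers a single character column dominates the result even if the rest of the field is clean.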

Sharpness analysis
To check the performance of the algorithm, the following was done. A sequence of frames [28] was taken, on which certain conditions of blur were reproduced: camera shift in different directions in combination with a slow shutter speed, focusing error, and document capture at an angle at low apertures (uneven sharpness across the frame). On each frame, the document was localized and projective distortions were corrected. Within the series, the images were sorted by the degree of sharpness: out of 25 frames, the first 15 images were the sharpest, after which the sharpness gradually decreased with increasing frame number. Graphs of the sharpness estimate versus frame number were built for different quantiles (fig. 5b).
Based on the results of the experiments, the following conclusions were drawn:
1. The contrast of the image significantly affects the absolute value of the sharpness score. An image can be visually sharper but, due to low contrast, have a lower sharpness score (for example, as in fig. 6). In this algorithm, the value of the sharpness score is not normalized in any way and is not tied to the image contrast. This is done because the proposed algorithm does not need an "absolute" sharpness score: the sharpness scores of images of the same document, taken in the same sequence under similar conditions, are compared with each other, and within the task under consideration large changes in contrast within the video stream are not assumed.
2. Flare of a relatively small size (less than 5 % of the frame area) does not affect the value of the score. Fig. 5a shows graphs of the sharpness score for two series, with and without flares. All images within the series are visually sharp; the series differ only in the presence or absence of flare. As can be seen from the graphs, the range of values for the two series is the same.
3. Visual assessment of sharpness is in better agreement with the calculated value of the sharpness score for the 95 % quantile than for other values of the quantile (fig. 5b).
Fig. 6. A blurry but high-contrast image (left, sharpness score 0.055) may have a higher sharpness score than a visually sharper but low-contrast image (right, sharpness score 0.052)

Comparison with other approaches
To compare the proposed algorithm with other approaches, a ROC curve was constructed for frame evaluation only by the sharpness assessment proposed in [27]. As you can see from the graph in fig. 7, the area under the ROC curve for evaluating the frame quality based on the sharpness assessment is less than the area for the recognizer confidence.

The result of setting the algorithm on the training dataset
Let us introduce the terminology: True Positive (TP) is the number of correct images on which the zone was found correctly, True Negative (TN) is the number of incorrect images that were correctly rejected, False Positive (FP) is the number of correct images on which the target zone was found incorrectly or incorrect images that were accepted, and False Negative (FN) is the number of correct images that were mistakenly rejected. The results on the described dataset after adjustment are presented in Table 1.
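For reference, the standard quality metrics derived from these four counts can be computed as in the sketch below. The `classification_metrics` name is hypothetical and the counts used in the usage note are made up for illustration; they are not results from the paper.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall and accuracy from confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one image is required")
    return {
        # Share of accepted images that were accepted correctly.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of correct images that were accepted.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of all images that were classified correctly.
        "accuracy": (tp + tn) / total,
    }
```

With the illustrative counts `classification_metrics(80, 90, 10, 20)`, recall is 0.8 and accuracy is 0.85.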

Experimental results
The fine-tuned algorithm was tested on the following documents from the MIDV-2019 reference dataset: German, Spanish, Slovak, Turkish and Czech ID documents (folders "14_deu_id_new", "20_esp_id_new", "42_svk_id", "43_tur_id" and "10_cze_id"), as well as Algerian passports (folder "18_dza_passport") and Italian driving licenses (folder "30_ita_drvlic"). A total of 256 images were used, of which 157 were "good" and 99 "bad". The images were marked up in the same way, and the markup can be downloaded from [28].
The results for this dataset are presented in Table 2. The results of the algorithm with individual parts disabled are also presented.
An experiment was also carried out in which a system configured to recognize a new type of German ID was given "good" images of old German IDs as input. Even in the case of an erroneous linking of documents, the result of their recognition was ultimately assessed as "bad". Thus, even if "good" images of documents are submitted for recognition, but not of the type for which the recognition system is configured, the proposed algorithm will reject them. It should be noted that the average number of frames in a clip when recognizing from a video stream is 4-8 frames.
With the obtained precision value of 91.6 % for frame-by-frame evaluation, we can assume that the method makes it possible to select the best frame in the video stream with high accuracy.

Further research
In further work the authors plan to improve the algorithm for flare detection: first, to use an adaptive flare threshold (this is especially needed for black-and-white document copies) [29], and second, to use a clustering approach to determine whether a flare or just a white part of the document was found.
It is also necessary to expand the amount of data, that is, to increase the number of document types both for setting up the algorithm and for testing.

Conclusion
This work considered the problem of choosing the best frame and its assessment. The main factors for assessing the quality of the frame were the confidence of the neural network's response to the recognized text, as well as the presence of flares in the document image and the defocus degree of the frame. The choice of the best image is proposed to be considered as the problem of ranking images by quality.
The reference markup for a part of the MIDV-2019 open dataset has been prepared and made publicly available.
A practical method is proposed for choosing the best frame when recognizing a document in a video stream and the results of its application on the selected dataset are obtained: accuracy is 88.7 %. Also, the proposed method can be considered suitable for verifying the correctness of the input data and settings of the recognition system: if the system receives a document unfamiliar to it, the recognition result of such a document will correspond to the system's low confidence in the response and all frames in the stream will be marked as "bad". This situation can serve as a signal to the user of the system about the occurrence of an emergency situation.