Weighted combination of per-frame recognition results for text recognition in a video stream

The scope of automated document recognition has expanded, and as a result, recognition techniques that do not require specialized equipment have become more relevant. Among such techniques, document recognition using mobile devices is of particular interest. However, it is not always possible to ensure controlled capturing conditions and, consequently, high quality of input images. Unlike specialized scanners, mobile cameras allow using a video stream as input, thus obtaining several images of the recognized object captured with various characteristics. In this case, the problem of combining the information from multiple input frames arises. In this paper, we propose a weighting model for the process of combining the per-frame recognition results, two approaches to the weighted combination of the text recognition results, and two weighting criteria. The effectiveness of the proposed approaches is tested using datasets of identity documents captured with a mobile device camera in different conditions, including perspective distortion of the document image and low lighting conditions. The experimental results show that the weighted combination can improve the quality of the text recognition result in a video stream, and that the per-character weighting method with input image focus estimation as the base criterion achieves the best results on the datasets analyzed.


Document recognition in uncontrolled conditions
Nowadays text object recognition is widely used not only in government and business processes but also in everyday life [1,2]. One of the first problems in which optical character recognition (OCR) technologies found their application was automatic data entry. A few decades ago such problems required special equipment, knowledge of the fonts used, scanned image characteristics, etc. Today the scope of application of such technologies has expanded, and document recognition is increasingly carried out in uncontrolled capturing conditions. For instance, automatic personal data entry can be done without specialized equipment, for example, when opening a bank account using a mobile application or when buying and registering SIM cards in a self-service mode [3]. Apart from the automatic input of personal data, text object recognition is essential in electronic document management systems, where it saves time, reduces expenses, and saves natural resources [4]. The development of hardware, such as personal mobile devices, has made it possible to expand the applicability of OCR technologies to recognizing text in natural scenes and to use these technologies in such cases as driver assistance systems [5], assistance for people with visual impairments [6], online translators [7], government photo and video recording systems [8,9], and many more. Along with the applicability of text recognition technologies, the requirements for the quality and reliability of recognition results are increasing [10]. Besides, more and more cases require the possibility to use "improvised means" for recognition, with input images captured using a smartphone camera or a web camera [11,12].
An important area of application of OCR technologies is document recognition [13]. The translation of paper documents into electronic form allows them to be processed and indexed quickly and conveniently. A separate important subsection of document recognition is identity document recognition [14]. These technologies have found application in filling out various registration forms [15], identifying a person in security systems [10], filling out personal and sensitive information [16,17], etc. In many of these applications recognition errors are extremely costly. Improving the recognition quality of identity documents captured using mobile devices is an important topic, and this paper will primarily consider identity documents as the target object for recognition.
Unlike images obtained with special scanners, for which it is possible to set up the lighting conditions beforehand, ensure the immobility of the recognized object and the recording matrix, etc., the frames received from a mobile camera can have low quality, contain highlights on the reflective surfaces of the object, or be out of focus or blurry, and the target object can have strong projective distortions [13,18,19]. These difficulties arise especially often when capture is performed in uncontrolled conditions [20]. Lighting problems can decrease text image quality and make the text difficult to recognize. Uneven illumination can lead to sharp differences in brightness and to the appearance of false borders, which complicate per-character text segmentation [21]. To avoid highlights or shadows on the surface of the recognized document or the occlusion of recognized text fields by holograms and other security elements, the user can rotate the document during capture, so some projective distortions of text objects may occur [22,23]. This, as well as a complicated, cluttered, non-homogeneous background, can complicate the localization of the document in the image [24,25]. Fig. 1 shows examples of document images with various types of distortions. Despite the additional difficulties associated with the usage of mobile devices for recognition, the advantage of mobile cameras in comparison with scanners is that they provide not a single image of the recognized object but a video stream, which makes it possible to get frames captured with different illumination, at different angles, and with different focus characteristics, thus reducing sporadic errors of an OCR system [16].

Scope
After an image of the recognized document is obtained, the recognition process usually involves such stages as preprocessing of input images, text field localization, segmentation of the string image into characters, which are then submitted for recognition, and post-processing of the recognition results. Some of these steps may be absent. For example, in recognition systems where the text is analyzed in an end-to-end way [26], per-character segmentation is not required.
The purpose of input images preprocessing is to improve the accuracy of text detection and recognition. This stage includes, for example, contrasting, colored background removal [27], binarization [28], as well as removing image defects (noise, glare, overlaying holograms) using various types of filtering and the use of morphological operations [29].
The stage of document localization involves the precise detection of the document boundaries in an image. If the document has a fixed layout of fields, the document localization allows us to simplify and increase the accuracy of the fields localization and, as a consequence, text recognition quality. A common approach to document localization problem is to find the vanishing points using the straight lines present in an image (for example, document edges, baselines of text fields). In conditions of weak projective distortions, an approach based on the generalized Viola-Jones method [30] is also applicable. An approach based on the key points search is more robust to various kinds of image distortions [31]. Methods based on the fast Hough transform [14,32], the RANSAC algorithm, or the least-squares method are used to search for straight lines in an image or to refine the search for feature points.
Algorithms for segmentation of the found fields into individual characters can be based on the analysis of the horizontal projection [32] or can combine character candidate recognition with dynamic programming to determine the optimal set of cuts [33].
Approaches to the classification of individual symbols include pattern matching algorithms [34], support vector machine (SVM) based algorithms [32], artificial neural networks (ANN), and many others.

Text recognition in a video stream
When a video stream is used as input data for recognition, the problem arises of choosing methods for combining information obtained from different frames of a video sequence. The methods of combining per-frame information can be divided into two groups: methods that rely on image combination to obtain a higher quality object representation, and methods that combine the extracted text recognition results. The first group includes methods for selecting the most informative frame [35,36], "super-resolution" methods that create a higher quality image based on several low-resolution frames [37-39], methods for tracking and combining images of a recognized object on a sequence of frames [40,41], and methods of blur compensation that replace blurred areas in one frame with their clearer counterparts taken from other frames or use deep learning methods [42]. Also, for a better reconstruction of a recognized document image, it is possible to use the data obtained from various sensors of the recording device, such as an accelerometer or a gyroscope. However, for modern mobile devices, the error in their measurements can be quite significant and prevent using this data for image reconstruction [43]. The second group of methods involves combining the results of individual image recognition. The methods of the first group, which involve combination on the level of input images, could be time-consuming, sensitive to geometrical distortions between frames, and poorly scalable with regard to video sequences of arbitrary lengths. Thus, in this paper, we will consider methods of the second group, i.e. the combination of the recognition results obtained for the individual frames.
A distinctive feature of text object recognition is that a text string is a composite object, i.e. it consists of multiple components (characters). Text recognition algorithms that analyze the text in an end-to-end way are more applicable for recognizing strings that are difficult to segment into characters (such as text written in Arabic script) or for recognizing large texts, where the majority of the words occur frequently and there are fewer limitations on the processing speed [2,44]. In a more general case, in particular with regard to identity document recognition systems, the text recognition result is considered as a concatenation of character classification results. Such a representation implies a preliminary per-character text segmentation procedure, i.e. the process of splitting the image of a text string into the images of individual characters. With such a text representation, the model of per-frame recognition results combination has to deal with strings obtained for different frames, which in the case of segmentation errors have different lengths, and the combination algorithm needs to be able to account for that. One of the combination approaches that allows variable-length input strings is the ROVER method (Recognizer Output Voting Error Reduction) [45]. This method was originally created to improve the quality of speech recognition by combining the recognition results received from different systems. It includes two stages. At the first stage, all the combined recognition results are aligned by inserting an empty character in an optimal way and combined into a single transition network. At the second stage, the best recognition result for each element of the composite object is selected using a voting procedure. The voting procedure can be considered as the task of combining classifiers, and such classifier ensemble models as the rules of sum, product, maximum, median, etc. [46-48] can be used as an extension of the voting procedure in ROVER. Thus, using the ROVER method to combine the recognition results obtained from several frames allows producing correct recognition results even if in some frames the text field was incorrectly segmented into characters.
The combination algorithms for per-frame text recognition could be further improved by introducing weights of the input results. If a predictor could be constructed that estimates the validity of the recognition result, such a predictor can be used for weighting the per-frame results in the combination. This could include zero-valued weights for "rejecting" some of the per-frame results which could spoil the overall combined result, or selecting and combining only a few "best" results. The question, however, arises of which predictor to use to maximize the quality of the final result. The goal of this paper is to consider the weighting problem, investigate the functions of the input images or input recognition results which could be used as the quality predictors, and propose the model and methods for the weighted combination of per-frame text recognition results.
The paper is structured as follows. Section 1 provides a detailed description of the difficulties that can arise at different stages of document recognition. Section 2 sets out the problem statement for the per-frame recognition results combination. In Sections 3 and 4 a general weighting model and weighting criteria are proposed, respectively. In Section 5 an approach to a weighted combination that takes into account the peculiarities of individual character recognition is described. Section 6 describes the performed experimental evaluation. Section 7 provides an analysis of the obtained results. Finally, conclusions are drawn and the possible topics for future work are proposed in Section 8.

Error analysis
Text field recognition errors can be caused both by the physical difficulty of reading the entire field (when the data cannot be fully recognized even by a human) and by errors that have arisen at various stages of document recognition. Figure 2 shows examples of documents in which some parts of the data cannot be read. The first type of cause is the occlusion of a text by a highlight or a holographic security element. If a highlight or a hologram appears as a bright spot that completely occludes a part of the text, then most likely this part will be completely discarded during text localization. If the highlight occludes the character partially (for example, if it is the edge of the highlighted region), then this could lead to a single character classification error: the character becomes similar to another one (for example, a partially occluded "B" or "8" may become similar to "3"). If this problem is not present on all frames of the video stream or the occluded areas are different in different frames, then combining the recognition results of individual frames can yield the correct final result even if there are no correctly recognized frames. Thus, the problem of combining text strings of different lengths arises.

Fig. 2. Examples of images in which part of the data is difficult to read: (a) with defocus, (b) with an occlusion by a highlight
Image defocus or blur can significantly complicate the text segmentation into separate characters and the character classification, to the point that the text becomes unreadable. An example of identity document text strings with per-character segmentation errors is shown in Fig. 3. In contrast to the occlusion by highlights and holograms, blur and defocus often affect not individual characters but entire text strings. Fig. 4 shows an example of extracted document text field images, most of which are unreadable due to defocus; even though one frame (frame number 3) was correctly recognized, the combination result was spoiled by irrelevant recognition results of low-quality frames (see Table 1).

Fig. 3. Examples of per-character segmentation errors. Recognition results: (a) "SPECINAEN"
Fig. 4. Field images from a video clip in which low-quality frame recognition results spoil the combined recognition result
Recognition errors can be caused not only by an incorrect classification of individual characters but also by errors of document localization or of determining its orientation in an image, for example, due to a complicated background. If the localization of the text fields of the document is based on the assumption of a fixed geometric layout [49], even a slight deviation of the found document boundaries from their actual position can lead to a noticeable distortion of local parts of the document and incorrect localization or cropping of the text field (Fig. 5). Even if the document fields are adequately found, an incorrectly found quadrangle of document boundaries leads to text distortions and, as a result, to errors at the further stages of recognition. Serious errors in document search lead to incorrect localization of text fields and the appearance of recognition results that are far from the true values.
Incorrect classification of correctly segmented characters may be caused, for example, by the similarity of some characters and the poor quality of the input images, as well as by a complicated document background. Examples of misclassification of individual characters are shown in Fig. 6. In this case, if recognition errors are sporadic (i.e. are not present on all frames), combining the recognition results of individual frames can also improve the recognition quality, because correct recognition results for individual parts of the text field can be obtained from different frames.

Problem statement
In this section, we will present a problem statement for the weighted text recognition results combination.
When talking about text recognition results, we mean the recognition results of a text string composed of characters from a fixed finite alphabet. The recognition result $x$ of an individual character may be viewed as a sequence of membership estimations for each character class and represented as a vector:
$$x = (x_1, x_2, \ldots, x_K), \tag{1}$$
where $K$ is the number of character classes (i.e. the size of the alphabet) and $x_k$ is the membership estimation for the class $k$, which can take a real value in the range from 0.0 to 1.0. The recognition result $X$ of a text string can be represented as a matrix:
$$X = (x_{jk}), \quad j = 1, \ldots, M, \quad k = 1, \ldots, K, \tag{2}$$
where $M$ is the length of the recognized string (in terms of the number of characters) and $x_{jk}$ is the membership estimation for the $j$-th symbol with regard to the $k$-th class. Such a recognition result representation is commonly used at the text recognition post-processing stage to construct algorithms for correcting recognition errors based on a priori information about the syntactic and semantic structure of the recognized data [50].
If the recognition result is represented as a matrix of membership estimations, the ROVER method can be generalized [51] as follows. Firstly, the set of possible classes is expanded by the "empty" class (with the class number $k = 0$), whose membership estimation is zero for every character of a frame recognition result. In terms of the matrix, this corresponds to adding a zero-valued column at the beginning of the matrix. The distance between the recognition results of two characters $x^1$ and $x^2$ is then determined by the character-level distance function (3). Using this metric, the two text string recognition results can be aligned with each other so as to minimize the total pairwise distance between characters.
At the voting stage of the ROVER method, the membership estimations for the combined recognition result $r$ of two matching characters of the aligned strings can be calculated as the weighted average of the membership estimations for $x^1$ and $x^2$:
$$r_k = \frac{w(x^1)\, x^1_k + w(x^2)\, x^2_k}{w(x^1) + w(x^2)}, \tag{4}$$
where $w(x^1)$ and $w(x^2)$ are the weights with which the recognition results are included in the combination.
Consider the problem of a weighted combination of text field recognition results in a video stream as follows: an object $X$ is recognized in a sequence of $N$ frames $I_1(X), I_2(X), \ldots, I_N(X)$, which yields the per-frame recognition results $X_1, X_2, \ldots, X_N$. The combination function $R^{(N)}$ takes as input the recognition results of the sequence of frames and their weights and outputs the combined recognition result (which will be treated as the recognition result of the whole video sequence). With a fixed sequence of frames $I_1(X), I_2(X), \ldots, I_N(X)$ and a fixed combination function $R^{(N)}$, the task is to assign weights $w_1, w_2, \ldots, w_N$ to the frame recognition results $X_1, X_2, \ldots, X_N$ so as to minimize the expected distance between the combined result $R^{(N)}(X_1, X_2, \ldots, X_N, w_1, w_2, \ldots, w_N)$ and the correct value $X^*$.
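The alignment and weighted voting steps can be illustrated with a minimal Python sketch for two strings. It rests on assumptions that are not taken from the text: the character distance is taken to be half of the L1 distance between membership vectors, the "empty" character is taken to have estimation 1.0 for class 0, and the function names (char_distance, align, combine) are hypothetical. Each character is a list of K + 1 membership estimations, with class 0 reserved for the "empty" class.

```python
def char_distance(x1, x2):
    # Assumption: half of the L1 distance between membership vectors.
    return 0.5 * sum(abs(p - q) for p, q in zip(x1, x2))


def empty_char(size):
    # Assumption: the "empty" character puts estimation 1.0 on class 0.
    return [1.0] + [0.0] * (size - 1)


def align(a, b):
    """Pair the characters of a and b, padding with empty characters,
    so that the total pairwise distance is minimal.
    At least one of the two strings must be non-empty."""
    e = empty_char(len((a or b)[0]))
    n, m = len(a), len(b)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    step = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                d = char_distance(a[i - 1], b[j - 1])
                options.append((cost[i - 1][j - 1] + d, "match"))
            if i > 0:
                d = char_distance(a[i - 1], e)
                options.append((cost[i - 1][j] + d, "skip_b"))
            if j > 0:
                d = char_distance(e, b[j - 1])
                options.append((cost[i][j - 1] + d, "skip_a"))
            cost[i][j], step[i][j] = min(options)
    # Walk back from the end of both strings to recover the pairing.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if step[i][j] == "match":
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif step[i][j] == "skip_b":
            pairs.append((a[i - 1], e))
            i -= 1
        else:
            pairs.append((e, b[j - 1]))
            j -= 1
    return pairs[::-1]


def combine(a, w_a, b, w_b):
    """Weighted average of the aligned characters of two strings."""
    total = w_a + w_b
    return [[(w_a * p + w_b * q) / total for p, q in zip(x1, x2)]
            for x1, x2 in align(a, b)]


# Classes: 0 = empty, 1 = "A", 2 = "B".
A, B = [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
combined = combine([A, B], 1.0, [B], 3.0)
```

In this toy input the first frame reads "AB" and the second reads "B" with three times the weight, so the unmatched "A" is averaged with an empty character and class 0 receives the larger share of its combined estimation.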

Weighting model
A low-quality recognition result of an individual frame can decrease the quality of the combined result. Therefore, one of the questions is which strategy is better: a weighted combination of several recognition results or the selection of the single best result. This issue was considered in [52] with regard to individual character recognition. In the context of text field recognition on identity documents and bank cards in a video sequence, it was shown that in the absence of localization and segmentation errors (i.e. of cases when a document was found incorrectly or text fields were incorrectly split into characters), the strategy of combining several of the most "competent" classifiers according to the product rule (the product of membership estimations for each class) or a voting procedure shows the best result. However, it is not clear whether such a strategy is applicable in the case of a full text string recognition problem. Unlike the individual character combination problem, in the case of text string recognition a correctly recognized single frame could be absent, but, at the same time, combining the recognition results can give the correct result (for example, in the case of a "sliding" highlight). Therefore, even in the absence of localization and segmentation errors, taking into account the recognition results from all frames may turn out to be essential.
To generalize and unify the combination approach and the selection of the best frame, the weighting model can be specified as follows. Let us order the recognition results (the order being an element of $S_N$, the set of permutations of length $N$) according to a non-decreasing value of the basic weighting function $w$, and set the cut-off threshold $t \in \{1, \ldots, N\}$. The weight $w_i$ of a per-frame result is then equal to its basic weighting function value if the result is among the $t$ results with the highest values of $w$ in this order, and to zero otherwise. This weighting model generalizes the selection of the single best result according to the quality predictor $w$ (if the threshold value is $t = 1$), the full weighted combination of all input samples (with $t = N$), and the weighted combination of a few best frame results. Given such a weighting model, the task is now to determine the best combination strategy, the best weighting criterion $w$, and the threshold $t$.
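The thresholded weighting model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name threshold_weights is hypothetical, the base weights are assumed to be already computed for each frame, and ties between equal base weights are broken in favor of the earlier frame, which the text does not specify.

```python
def threshold_weights(base_weights, t):
    """Keep the base weights of the t best results and zero out the rest."""
    n = len(base_weights)
    if not 1 <= t <= n:
        raise ValueError("t must be in {1, ..., N}")
    # Frame indices sorted by decreasing base weight; sorted() is stable,
    # so equal weights keep the earlier frame first (an assumption).
    order = sorted(range(n), key=lambda i: -base_weights[i])
    kept = set(order[:t])
    return [w if i in kept else 0.0 for i, w in enumerate(base_weights)]


base = [0.2, 0.9, 0.5, 0.7]
best_only = threshold_weights(base, 1)   # selection of the single best frame
best_half = threshold_weights(base, 2)   # weighted combination of the best 50 %
all_frames = threshold_weights(base, 4)  # full weighted combination
```

With t = 1 only the frame with the highest base weight keeps a non-zero weight, and with t = N the weights are returned unchanged, which matches the two limiting cases described above.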

Weighting criteria
In this paper, we considered two weighting criteria. The first is a focus estimation $F(I_i(X))$, calculated using the algorithm proposed in [53]. This criterion was also used to control the input frame quality in video stream document recognition systems [54]. First, the values of the image gradients are calculated in four directions (vertical, horizontal, and two diagonals). The focus estimation is then taken as the minimum over the four directions:
$$F(I_i(X)) = \min\{q(G^{V}(I_i(X))),\ q(G^{H}(I_i(X))),\ q(G^{D_1}(I_i(X))),\ q(G^{D_2}(I_i(X)))\}, \tag{7}$$
where $G^{V}$, $G^{H}$, $G^{D_1}$, and $G^{D_2}$ denote the gradient images in the vertical, horizontal, and two diagonal directions, and $q(G)$ is the 0.95-quantile of the gradient image $G$. It was assumed that the weighting method based on the text field image focus estimation would reduce the significance of frames in which the image of the recognized field is of poor quality due to defocus, smears, and blur, which can lead to errors in text localization and per-character segmentation, as well as to low quality of individual character recognition.
The second weighting criterion used for recognition quality estimation was the a posteriori recognition confidence $Q(X)$, where $X$ is the text recognition result (2). The confidence value of the text string recognition result is calculated as the minimal value of the highest membership estimation across all string character classification results:
$$Q(X) = \min_{j = 1, \ldots, M}\ \max_{k = 1, \ldots, K} x_{jk}. \tag{8}$$
This weighting criterion was based on the assumption that with correct recognition of the text field, the "best" membership estimations will have a higher value than with an incorrect recognition, when the classifier cannot determine the recognized character with high confidence.
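A rough Python sketch of a focus estimation of this kind is given below. It is not the algorithm of [53]: the gradients are assumed to be absolute differences between neighboring pixels, the quantile uses the nearest-rank definition, the image is a plain list of rows of grayscale values, and all function names are hypothetical.

```python
import math


def quantile(values, q):
    # Nearest-rank quantile (an assumption; other definitions interpolate).
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def directional_gradients(image, dr, dc):
    """Absolute differences between each pixel and its (dr, dc) neighbor."""
    rows, cols = len(image), len(image[0])
    return [abs(image[r + dr][c + dc] - image[r][c])
            for r in range(rows) for c in range(cols)
            if 0 <= r + dr < rows and 0 <= c + dc < cols]


def focus_estimation(image, q=0.95):
    """Minimum over four directions of the q-quantile of the gradients."""
    # Vertical, horizontal, and the two diagonal directions.
    directions = [(1, 0), (0, 1), (1, 1), (1, -1)]
    return min(quantile(directional_gradients(image, dr, dc), q)
               for dr, dc in directions)


# A blocky checkerboard has sharp edges in all four directions,
# while a smooth ramp changes slowly everywhere.
sharp = [[255 if (r // 2 + c // 2) % 2 else 0 for c in range(8)] for r in range(8)]
smooth = [[2 * r + c for c in range(8)] for r in range(8)]
```

Taking the minimum over directions means that an image must contain strong edges in every direction to receive a high estimation, so a frame blurred along a single direction is still penalized.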
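The confidence criterion translates directly into code. The following Python sketch assumes that the string result is a non-empty list of rows, one per character, each row holding the membership estimations of that character; the function name confidence is hypothetical.

```python
def confidence(string_result):
    """Minimum over characters of the highest membership estimation."""
    return min(max(char) for char in string_result)


# Three characters over a three-class alphabet; the second is uncertain.
X = [
    [0.90, 0.10, 0.00],
    [0.40, 0.35, 0.25],
    [0.10, 0.10, 0.80],
]
q_full = confidence(X)                # limited by the uncertain character
q_short = confidence([X[0], X[2]])    # same string with that character lost
```

Because the value is determined by the single least confident character, a result in which an uncertain character is missing altogether can receive a higher confidence than the complete string.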

Per-character weighting
The proposed weighting model and weighting criteria were evaluated in [55] for the problem of per-frame combination of identity document text field recognition results. It has been shown that the weighted combination actually improves the recognition quality. However, the result of the text string recognition depends on the results of individual character classification, and in some cases the quality of the character images in the same frame can vary greatly (for example, in the case of a highlight, partial defocus, mechanical occlusion of a part of the text string, etc.). Moreover, the weighting criterion may not always correctly represent the quality of recognition of individual characters (for example, the confidence value criterion in the case of incorrect per-character segmentation). An additional question therefore arises: how correct is it to assign the combination weights based on the characteristics calculated over the entire text string? It is thus sensible to introduce a weighting model that considers each individual character with its own weight.
The ROVER method in this case needs to be modified. Before adding the recognition result $X_i$ to the combination result, weights $w(x^i_1), w(x^i_2), \ldots$ are calculated for the individual characters of the string recognition result $X_i$. For the weighting criterion based on focus estimation, the character weight is calculated as the focus estimation of the character image submitted to the recognition module. For the confidence value weighting criterion, the character weight can be assigned simply as the highest membership estimation of this character. For the "empty" character, the weight coincides with the weight with which the text string recognition result as a whole is included in the combined result, i.e. $F(I_i(X))$ or $Q(X_i)$. In the first step, the text string recognition result $X_1$ of the first frame is stored as the combined result $R$, with the corresponding combined character weights $w(r_1), w(r_2), \ldots, w(r_M)$ and the full result weight $W_R$ taken from the weights of the first result. At the next steps, when adding the recognition result $X_i$ with weight $w_i$ to the combined result $R$, alignment is performed so as to minimize the total pairwise distance between characters, calculated according to the character recognition results distance function (3). After the combined result and the new frame result are aligned, their characters are combined according to the combination rule (4).
If the character $x^i_j$ of the added recognition result was not matched with any combined result character during alignment, then it is combined with an empty symbol with weight $W_R$. If the character $r_j$ of the combined result did not match any character of the per-frame result, then it is combined with an empty symbol with weight $w_i$. When combining two characters with weights $W_1$ and $W_2$, the weight of the resulting combined character is determined as $W_1 + W_2$. After the text string recognition results are combined, the updated weight of the new combined result is calculated as $W_R + w_i$. The diagram of the modified ROVER algorithm is shown in Fig. 7.
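One step of the per-character combination can be sketched in Python as follows. The sketch assumes that the alignment has already been performed and is given as a list of pairs, where each element is a (membership vector, weight) tuple or None for an unmatched position. The function names and the representation of the "empty" symbol as a vector with estimation 1.0 for class 0 are assumptions made for illustration.

```python
def combine_char(x1, w1, x2, w2):
    """Weighted average of two membership vectors; the weights add up."""
    total = w1 + w2
    merged = [(w1 * p + w2 * q) / total for p, q in zip(x1, x2)]
    return merged, total


def combine_aligned(pairs, W_R, w_i, num_classes):
    """Combine aligned characters of the combined result R and a new frame.

    pairs: list of (r, x); r and x are (vector, weight) tuples or None.
    W_R:   full weight of the combined result R.
    w_i:   full weight of the new per-frame result.
    """
    # Assumption: the empty symbol has estimation 1.0 for class 0.
    empty = [1.0] + [0.0] * num_classes
    result = []
    for r, x in pairs:
        if r is None:
            # New character without a counterpart in R.
            r = (empty, W_R)
        if x is None:
            # Character of R missing from the new frame.
            x = (empty, w_i)
        result.append(combine_char(r[0], r[1], x[0], x[1]))
    return result, W_R + w_i


# Classes: 0 = empty, 1 = "A", 2 = "B".
A, B = [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
pairs = [((A, 1.0), None), ((B, 2.0), (B, 2.0))]
combined, new_weight = combine_aligned(pairs, W_R=2.0, w_i=3.0, num_classes=2)
```

In this example the low-weight character "A" of the combined result has no counterpart in the new frame, so it is averaged with an empty symbol that carries the full weight of that frame, and the empty class receives the larger share of the estimation.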

Experimental evaluation
Full string weighting
After the definition of the weighting model, weighting criteria, and the algorithm for the combination of text string recognition results with per-character weighting, we can proceed to the experimental evaluation.
Previous work [55] described experiments performed on the MIDV-500 [16] dataset. This dataset contains 500 video clips of identity documents captured with mobile cameras without strong distortion. However, it is important to evaluate the quality of the proposed method and weighting criteria in various conditions, including challenging ones. Therefore, experiments were also performed on the MIDV-2019 dataset [22], which contains 200 video clips of identity documents. A feature of the MIDV-2019 dataset is that its video clips were captured in low lighting conditions (subset MIDV-2019-L) and with strong projective distortions of the document image (subset MIDV-2019-D). Each video clip contains 30 frames, but only the frames on which the document is fully visible were considered; if the resulting clip had fewer than 30 frames, the frames were repeated in a loop, following the experimental procedure set up in other papers using this dataset [51]. Four field types were analyzed: document numbers, numeric dates, Latin name components, and machine-readable zone lines. The fields were recognized using the method described in [56]. The comparison with the correct text field values was case-insensitive, and the letter "O" was considered identical to the digit "0". The Normalized Generalized Levenshtein Distance [57] was used as a metric function for the set of text string recognition results.
At the first stage, for each basic weighting function we considered five weighted combination strategies: combination without weighting (i.e. using a constant value as the basic weighting function and the threshold parameter t = N), choosing the single best result (threshold parameter t = 1), and weighted combinations of the 3 best frames (threshold parameter t = 3), of the best 50 % (threshold parameter t = N/2), and of all frames (threshold parameter t = N). Fig. 8 shows the rate at which the error of the combined text recognition result decreases after the addition of new per-frame results for the various approaches to weighted combination using the focus estimation (7) and the recognition result confidence value (8), as measured on all analyzed field groups of the MIDV-500 dataset. Such plotted rates can be viewed as performance profiles [58] for the process of text recognition in a video stream as an anytime algorithm (i.e. an algorithm whose results increase in quality over time).
It can be seen that the weighted combination improves the quality of recognition, and when the focus estimation is used as a weighting criterion, noticeable improvements are achieved regardless of the number of combined frames. When the confidence value of the recognition result is used for weighting, the selection of the few best frames to combine does not improve the recognition, in particular at the later stages of the process, i.e. with a higher number of combined per-frame results. Figs. 9 and 10 demonstrate similar performance profiles for recognition results on the MIDV-2019 dataset, for subsets with low lighting conditions and with strong projective distortions respectively.
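The comparison procedure can be sketched in Python as follows. The sketch assumes unit costs for insertion, deletion, and substitution, in which case the normalized generalized Levenshtein distance reduces to 2d / (|a| + |b| + d), where d is the ordinary Levenshtein distance. Whether the experiments used unit costs is not stated in the text, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance with unit insertion, deletion, and substitution costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalize(text):
    # Case-insensitive comparison; the letter "O" is treated as the digit "0".
    return text.upper().replace("O", "0")


def normalized_levenshtein(a, b):
    """Normalized distance in [0, 1] between two recognized strings."""
    a, b = normalize(a), normalize(b)
    if not a and not b:
        return 0.0
    d = levenshtein(a, b)
    return 2.0 * d / (len(a) + len(b) + d)


same = normalized_levenshtein("John0", "JOHNO")
different = normalized_levenshtein("kitten", "sitting")
```

Under these assumptions the distance is 0 for strings that differ only in case or in "O" versus "0", and it reaches 1 when one of the strings is empty and the other is not.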

Fig. 9. Performance profiles for weighted combination based on focus estimation (top) and confidence value (bottom) for the MIDV-2019-L dataset
According to the experimental results, it can be seen that on both datasets, weighting according to focus estimation criterion allows achieving a higher recognition quality than when using confidence value as a weighting criterion. It can also be noticed that, on average, the best result is achieved by a weighted combination of the best 50 % of frames. Figures 11, 12, and 13 show comparative profiles for combining the best 50 % frames for different combining strategies.

Per-character weighting
At the second stage, experiments with the per-character weighting model were performed. The main attention was paid to using the focus estimation as the base weighting function, which gave the best results in the previous experiments. Fig. 14 shows performance profiles for combination without weighting, weighted combination of all frames, and weighted combination of the best half of the frames, both for the combination with weighting of the entire text field and for the per-character weighting modification. Fig. 15 shows performance profiles for clips with low lighting conditions, and Fig. 16 presents similar plots for clips with strong projective distortions. From the results of the experiments, it follows that per-character weighting improves the quality of text string recognition in a video stream regardless of the capture conditions of the clips.

Discussion
From the results of the first series of experiments, it can be seen that for clips without significant projective distortions (Figs. 8 and 9), the weighted combination allows to improve the recognition precision, and with using focus estimation as a weighting criterion noticeable improvements are achieved regardless of the number of combined frames. For frames with strong projective distortions of the document, the image quality can be notice-ably different for different parts of the text strings. Therefore, a predictor constructed over the entire text field may not fully adequately reflect the quality of the recognition. This is especially important for long text fields, such as the machine-readable zone. Fig. 17 shows an example of a document and its machine-readable zone lines, with visibly uneven image quality. The results of the conducted experiments also show that for longer video streams (i.e. with a higher number of frames) the selection of a fixed small number of the best frames for weighting combination using a confidence value as a weighting criterion does not improve recognition (the blue and red performance profiles for the confidence value criterion in Figures 8 -10 are above the gray performance profile corresponding to the baseline). The cases when the weighting is performed using the focus estimation show quality improvement, but at the same time, the weighting according to the confidence value does not seem to be robust for clips in which some of the characters were lost due to per-character segmentation errors, clips with low quality of the original image or with highlights. Highlights lower the frame focus estimation criterion as computed according to the method (7), whereas for the confidence value, on the contrary, the loss of several characters of the text line may even increase the weight of the text string. 
On the other hand, a predictor built on the character membership estimations could better reflect text field localization errors: one can expect that for an incorrectly localized field the maximal membership estimations of the characters will be lower than for a correctly localized text field. It should be noted that this case was not considered in the performed experiments, since the document and text field coordinates were taken from the ground truth; additional investigation is needed to analyze the impact of localization errors.
Thus, weighted combination using confidence values can potentially discard unreliable recognition results (for example, from frames with strong defocus or incorrect text field localization), but does not work well when some of the characters are lost. The influence of this can be seen in the fact that, according to the results of the experiments on all datasets, weighting according to the focus estimation achieves a higher recognition quality than using the confidence value as a weighting criterion.
Also, it can be noticed that, on average, the best results are obtained using a weighted combination of the best three frames and a weighted combination of the best 50% of frames. This can be explained by the fact that some frames obtained using a mobile camera in uncontrolled conditions are recognized with poor quality. The weighting model makes it possible to cut off the least significant per-frame results, which could worsen the overall combined recognition result, while still retaining the benefit of a weighted combination over the selection of a single best per-frame result.
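The cut-off strategy discussed above can be illustrated with a minimal Python sketch. It is not the combination algorithm used in the experiments: it assumes that all per-frame results are strings of equal length (so that no alignment is needed), that each result already carries a single precomputed weight, and that the combination is a plain position-wise weighted vote. The names `select_best` and `weighted_vote` and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def select_best(results, fraction=0.5):
    """Keep the top `fraction` of (text, weight) pairs by weight (at least one)."""
    keep = max(1, math.ceil(len(results) * fraction))
    return sorted(results, key=lambda r: r[1], reverse=True)[:keep]


def weighted_vote(results):
    """Position-wise weighted vote over equal-length strings (simplifying assumption)."""
    length = len(results[0][0])
    if any(len(text) != length for text, _ in results):
        raise ValueError("this sketch requires equal-length strings")
    combined = []
    for i in range(length):
        scores = defaultdict(float)
        for text, weight in results:
            scores[text[i]] += weight  # each frame votes with its weight
        combined.append(max(scores, key=scores.get))
    return "".join(combined)


# Hypothetical per-frame results: (recognized text, frame weight).
sample_frames = [
    ("IVANOV", 0.9), ("IVAN0V", 0.8), ("IVANOV", 0.7),
    ("1VAN0V", 0.2), ("1VAN0V", 0.15), ("1VAN0V", 0.1), ("1VAN0V", 0.05),
]

best_half = select_best(sample_frames, fraction=0.5)
unweighted = [(text, 1.0) for text, _ in sample_frames]

print(weighted_vote(unweighted))  # combination without weighting
print(weighted_vote(best_half))   # weighted combination of the best 50%
```

In this toy data the unweighted vote over all seven frames is dominated by the four poorly recognized frames, whereas the weighted combination of the best half is not; real per-frame results would additionally require alignment of strings of different lengths.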
To take into account the possible unevenness of the text field recognition quality, the second series of experiments was carried out. In these experiments, in addition to the total weight for the text field, we calculated weights for each individual character. The results of the experiments showed that per-character weighting increases the recognition quality when compared with a full-string weighting model. It should be noted that for clips without strong projective distortions (Figs. 14 and 15), for which the full-string weighted combination already made it possible to improve the recognition quality, per-character weighting only slightly improves the result. However, for clips with strong projective distortions (Fig. 16), per-character weighting significantly increases the precision of the combined result. Moreover, this improvement occurs regardless of the number of combined per-frame results. From this, we can conclude that for clips with strong projective distortions, if the weighted combination is performed using the focus estimation criterion, it is important to account for the local features of the text field. This can be explained in particular by the fact that, as mentioned earlier, for such clips the image quality (and hence the character classification precision) of the text line may be uneven.
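The difference between full-string and per-character weighting can be illustrated with a small Python sketch. It rests on several assumptions that are not taken from the experiments: the per-frame strings are already aligned and of equal length, the per-character weights are given (for example, as local focus scores), and the full-string weight is approximated here by the mean of the character weights. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def combine_per_character(frames):
    """Weighted vote where every character votes with its own weight.

    frames: list of (text, char_weights) with equal-length texts (assumption).
    """
    length = len(frames[0][0])
    combined = []
    for i in range(length):
        scores = defaultdict(float)
        for text, char_weights in frames:
            scores[text[i]] += char_weights[i]
        combined.append(max(scores, key=scores.get))
    return "".join(combined)


def combine_full_string(frames):
    """Baseline: all characters of a frame share one weight.

    The shared weight is the mean of the character weights, which is only
    a stand-in for a quality estimate computed over the whole field image.
    """
    uniform = [
        (text, [sum(weights) / len(weights)] * len(weights))
        for text, weights in frames
    ]
    return combine_per_character(uniform)


# Hypothetical frames with uneven quality along the string:
# the first is sharp on the left, the second is sharp on the right.
uneven_frames = [
    ("ABXX", [0.9, 0.9, 0.2, 0.2]),
    ("YYCD", [0.1, 0.1, 0.9, 0.9]),
]

print(combine_full_string(uneven_frames))
print(combine_per_character(uneven_frames))
```

With a single weight per frame, the frame with the higher average quality wins at every position, including the part of the string where it is blurred; with per-character weights, each position is taken from the frame that is locally sharper.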
Thus, from the results of the conducted experiments, it can be concluded that evaluating the image quality of the recognized field and using it when combining the per-frame recognition results makes it possible to improve the recognition precision of the text in a video sequence. The best result was achieved using the focus estimation as a weighting criterion. Since the input frames obtained using a mobile device camera in uncontrolled conditions may not be of very high quality, the best combination result is obtained using the strategy of combining 50% of the highest scoring frames. Such an approach makes it possible simultaneously to cut off the outliers (lowest quality input frames) and to accumulate useful information from multiple well-recognized frames. Since the image quality can be uneven, the use of per-character weighting, which accounts for the local features of the recognized text field, makes it possible to additionally improve the recognition quality, in particular if the text is captured with significant projective distortions.

Conclusion
The paper considered the problem of combining text recognition results in a video stream to improve recognition quality. A weighting model and two weighting criteria were proposed: a focus estimation of the text image and an a posteriori text string recognition confidence value. The experiments were carried out on two open datasets containing video clips of identity documents captured with a mobile camera in various conditions.
The results of the first series of experiments have shown that a weighted combination of the recognition results of individual frames can improve the overall recognition quality in the absence of strong projective distortions of the text image. The combination of the best 50% of input frames weighted using an image focus estimation was shown to increase the precision of text recognition in a video stream, as such an approach both filters the low-quality outliers and accumulates information from multiple input frames. However, in the case of uneven image quality, in particular on clips with high projective distortion of the recognized text, assigning weights based on characteristics calculated over the entire text string image could be inadequate. Therefore, a per-character weighting procedure was proposed. Experimental results show that per-character weighting improves the recognition accuracy for all types of the analyzed video clips, including the clips with sliding highlights, long text strings, and the clips with high projective distortions.
Thus, it can be concluded that the combination of the best 50% of frames with per-character weighting according to the input image focus estimation can be applied in video stream recognition systems to increase the precision of the text recognition result.
In future research, we plan to explore other possible weighting criteria for the per-frame recognition results, to explore the possibility of using deep learning methods to solve the problem of combining recognition results in a video stream, and to evaluate the proposed weighting model and methods in other domains of application, such as road scene object recognition.