Tiny CNN for feature point description for document analysis: approach and dataset

In this paper, we study the problem of feature point description in the context of document analysis and template matching. Our study shows that task-specific training data is required, especially if we are to train a lightweight neural network that will be usable on devices with limited computational resources. We construct and provide a dataset together with a method for training-patch retrieval. We prove the effectiveness of this data by training a lightweight neural network and show how it performs on both document and general patch matching. The training was done on the provided dataset in comparison with the HPatches training dataset; for testing we use the HPatches testing framework and two publicly available datasets with various documents pictured on complex backgrounds: MIDV-500 and MIDV-2019.


Introduction
Image description is an important part of modern computer vision. Algorithms that build a representation of an object are required in many areas, from image tagging and annotation for medical [1] or other purposes [2] to face verification [3]. The purpose of these methods is to transform an object (an image, an image patch, a signal) into a vector of values. The essential property of these methods is to yield comparable vectors: the distance between them must be small for representations of the same or similar objects and big for representations of different objects. In document understanding and recognition these algorithms also play an important role. They are used for document template matching [4], forensic checks [5], and even for character recognition [6].
Two main types of descriptors are used: binary and floating-point. The main advantage of binary vectors is that the distance between them can be calculated much faster. Another advantage is compactness: each value needs only one bit of storage. Unfortunately, many algorithms (including neural networks) produce floating-point values and therefore cannot be used directly. Floating-point vectors are used as well: while their comparison takes more time and their storage consumes more space, this type of descriptor is still viable because it allows us to employ more algorithms, for example neural networks, which naturally produce floating-point values. Obtaining a binary neural network descriptor is possible but requires additional effort [7], [8] and is a separate problem.
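To make the speed argument concrete, here is a minimal illustration (not from the paper) of the two distance computations: a binary descriptor packed into an integer is compared with a single XOR plus popcount, while a floating-point descriptor needs a full Euclidean norm.

```python
import numpy as np

def hamming_distance(a: int, b: int) -> int:
    """Binary descriptors packed into integers: XOR the bit strings
    and count the differing bits (popcount)."""
    return bin(a ^ b).count("1")

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Floating-point descriptors: L2 norm of the difference vector."""
    return float(np.linalg.norm(a - b))

# A 256-bit binary descriptor occupies 32 bytes and is compared with a
# few word-sized XOR/popcount operations; a 128-dimensional float32
# descriptor occupies 512 bytes and needs 128 multiply-adds.
print(hamming_distance(0b10110100, 0b10010110))  # -> 2
print(euclidean_distance(np.array([0.0, 3.0]), np.array([4.0, 0.0])))  # -> 5.0
```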

Fig. 1. Two pictures of planes: the same kind of object but completely different images
There is an ambiguity in the task of image description. In paper [9] the authors show that published results comparing different descriptors are inconsistent. The problem here is even more complex because a set of different tasks is solved using similar methods and algorithms. For example, in [10] the authors train an image-to-vector neural network which clusters object representations by their class. It is important that even though the idea of this kind of neural network is the same as for feature point description, the meaning is completely different. In the first case, two images of a plane should transform into close vectors; in the second case, the final result must depend on the similarity of the images regardless of the pictured object type, as demonstrated in Fig. 1. The "image description" problem also exists in an even more general form [11].
Existing datasets for descriptor training are mostly focused on outdoor pictures (for example [12]) and/or are too complex to be used for training lightweight neural networks. Moreover, these datasets contain distortions which make them ineffective for document feature point description. To solve this problem, we introduce a new training dataset that is suitable for training a document feature point descriptor but can also be used for multiple purposes.
In our experiments, we show that a very lightweight neural network trained on this dataset can achieve competitive results on both document and general image patches.
These kinds of neural networks are typically called metric neural networks. Several loss functions are used to train such networks. One of them is the triplet loss, which has been known for a long time [13]. Although this loss function is widely used, the authors of many papers introduce modifications of the triplet loss for different purposes [14]. Some of them went further to a quadruplet loss [15].
To summarise, in this paper we introduce a new training dataset and show how to use it to train a lightweight universal neural network descriptor. The dataset and a method for training data retrieval are provided for public use.

Training dataset

Dataset creation
The training dataset consists of five parts which are collected with three different methods: synthetic generation, capturing with a camera, and direct patch generation.
The first part of the dataset contains synthetically generated images with text lines. In document matching, templates are usually matched by feature point descriptors. These points are often located on static texts, so the final descriptor must evaluate image patches with different letters as different even if these letters are located in the same places. The method described in [16] is used for printing the text onto manually selected backgrounds.
The second part of the dataset aims at general purposes and consists of various textures collected from walls and other random surfaces with a 3D texture. These texture images introduce various shapes and their shades. To ensure the difference between the images, we fix the camera and vary the light source position, as demonstrated in Fig. 2.

The third part consists of patches that were generated directly. They are blurry images with intensity peaks in random locations. To achieve this, we generated pictures with a white background and several black dots, then applied Gaussian blur and the Fast Hough Transform [17], as shown in Fig. 3.
In contrast with the rest of the dataset, these images introduce shapes that are not usually present in text strings or in the wild. Still, these images are perfectly valid for a patch matching task, and therefore a reasonable amount of them increases the quality of the trained algorithm in general.
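The generation of this third part can be sketched as follows. This is an illustrative reimplementation, not the paper's script: the image size, dot count, and blur sigma are assumptions, and the Fast Hough Transform step from [17] is omitted for brevity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter  # assumes SciPy is available

def generate_dot_patch(size=64, n_dots=5, sigma=3.0, seed=None):
    """White background with several black dots, then Gaussian blur.
    The paper additionally applies a Fast Hough Transform [17]."""
    rng = np.random.default_rng(seed)
    img = np.full((size, size), 255.0)
    ys = rng.integers(0, size, n_dots)
    xs = rng.integers(0, size, n_dots)
    img[ys, xs] = 0.0  # intensity peaks at random locations
    return gaussian_filter(img, sigma=sigma)

patch = generate_dot_patch(seed=0)
print(patch.shape)  # -> (64, 64)
```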
The next part is similar to the text strings, but instead of letters we used hieroglyphs. This part covers a big variety of small objects which are not present in the standard set of symbols.
The final part of the dataset contains images of barcodes instead of letters as they have a lot of small details that the trained algorithm is expected to differentiate.
Together, these parts of the dataset contain 85 images of sizes from 1150×388 to 2000×6048. Most of them were duplicated and processed with a graphical editor. By doing this we ensure geometrical matching between images with the same content and introduce some visual effects which the final descriptor must tolerate.

Patches retrieval
To create the final training set of patches, we process the dataset in a special way (scripts for patch retrieval will be available along with the dataset). Since our neural network is designed to take as input a grayscale picture of size 32×32, we convert the pictures to grayscale and retrieve image patches from all possible positions with a small overlap. This increases the number of classes without mixing them up. To diversify our data and extend the number of classes in the final training data, we perform additional steps: we add different scales and rotations. Also, we invert some of the images to further extend the variety of the classes. The exact values of these parameters can be found in the retrieval script. The final training data contains 325176 classes and 534860 patches. The distribution of images per class is shown in Table 1. It is by design that many classes contain a single image, as we will later employ special data augmentation.
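The sliding-window retrieval described above can be sketched as follows; the exact stride, scale, and rotation values here are assumptions, while the real ones live in the paper's retrieval script.

```python
import numpy as np

def extract_patches(gray, patch=32, stride=28):
    """Slide a 32x32 window over a grayscale image; a stride slightly
    below the patch size gives the small overlap between positions,
    and each position becomes a separate class."""
    h, w = gray.shape
    patches = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            patches.append(gray[y:y + patch, x:x + patch])
    return np.stack(patches)

grid = extract_patches(np.zeros((100, 100), dtype=np.uint8))
print(grid.shape)  # -> (9, 32, 32): 3 window positions per axis
```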

Table 1. Patches-per-class distribution
The dataset parts and the number of classes they yield are summarised in Table 2.

Table 2. Dataset parts and the number of classes they yield

Source        Classes
Text lines    265384
Photos        38916
FHT images    10000
Hieroglyphs   7140
Barcodes      3736

Since the designed dataset is created mostly for feature point description on documents, the text lines part is the biggest one in our experimental setup, but the dataset can be rebalanced using the provided source code.

Neural network

Architecture
The neural network architecture was created with an extremely small number of trainable parameters. The architecture has a dimensionality reduction layer (layer 6); this idea appears in different forms in autoencoders [18], SqueezeNet [19], MobileNets [20], and others. Other than that, the neural network is quite simple: the first layer has a 4×4 window to obtain a noticeable initial receptive field, and after that the extracted features are gradually transformed into the final vector with convolutional and fully connected (FC) layers. Table 3 details the architecture.
where a > 0. This will later allow us to evaluate the output value bounds. The resulting neural network has only 3.9×10^4 trainable parameters, which is very small; for comparison, the HardNet [21] neural networks have more than 10^6 parameters. Only 2.4×10^5 additions and 2.5×10^5 multiplications are required to evaluate the result, which makes this neural network suitable for devices with low computational power, such as smartphones, unmanned vehicles, and others.

Batch generation
To train our neural network we used patches from the dataset. To generate a training batch, we randomly choose 8192 of them. After that, for every patch we also randomly choose one positive example (i.e. from the same class; if there was only one patch in the class, we take the same patch) and one negative example (i.e. from a different random class). While the current batch is processed by the training framework on the GPU, we generate the next one on the CPU. This part can be improved with various triplet generation techniques like hard mining [22, 23], but this is not a topic of the current paper, so we use the simplest random selection.
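The random triplet selection described above can be sketched like this (a simplified single-process version; the paper overlaps CPU batch generation with GPU training):

```python
import random
from collections import defaultdict

def make_triplet_batch(labels, batch_size=8192, rng=None):
    """Random triplet selection: for each anchor pick one positive from
    the same class (the anchor itself when the class has a single patch)
    and one negative from a different random class.
    `labels` maps patch index -> class id."""
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, c in labels.items():
        by_class[c].append(idx)
    classes = list(by_class)
    anchors = rng.sample(list(labels), min(batch_size, len(labels)))
    triplets = []
    for a in anchors:
        c = labels[a]
        pos_pool = [i for i in by_class[c] if i != a] or [a]
        pos = rng.choice(pos_pool)
        neg_class = rng.choice([k for k in classes if k != c])
        neg = rng.choice(by_class[neg_class])
        triplets.append((a, pos, neg))
    return triplets
```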

Augmentation
Since our dataset does not have (and was not designed to have, see Table 1) many patches for every class, the augmentation part is essential. In our experiments we used an online augmentation system [24]. Image distortions were different for the anchor/positive elements and for the negative element of a triplet. For the anchor/positive elements we carefully selected transformations which should not make initially similar patches different: monotonic brightness changes, blur, additive noise, random crop and scale, and motion blur. For the negative elements, the list of applied transformations was extended with opening and closing morphological operations, grid addition, and highlights. We select a random transformation from the list, apply it to the image with the current probability (initially 0.95), then multiply the probability by a factor of 0.85 and repeat the procedure until the list is empty. The probability reduction is needed to prevent over-augmentation of the data. In other words, for every image the transformations {T} are shuffled, and the i-th of them is applied with probability 0.95 · 0.85^i.
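The decaying-probability schedule can be sketched as follows (the transformation list itself is dataset-specific and omitted here):

```python
import random

def augment(image, transforms, p0=0.95, decay=0.85, rng=None):
    """Shuffle the transformation list and apply the i-th transformation
    (0-based) with probability p0 * decay**i; the decay prevents
    over-augmentation of the data."""
    rng = rng or random.Random()
    order = list(transforms)
    rng.shuffle(order)
    p = p0
    for t in order:
        if rng.random() < p:
            image = t(image)
        p *= decay
    return image
```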

Loss function and training
After generation and augmentation, the batch is passed to the neural network training framework. All the networks were trained for approximately 5000 batches (see Fig. 4 for a convergence plot) before testing. For initial randomization we used the Xavier method [25]. All the neural networks were trained with a standard triplet loss function with margin α = 1.5.

The convergence plot in Fig. 4 demonstrates interesting behaviour. The blue plot shows the loss, which decreases during training as it should. The orange plot shows the fraction of triplets that were considered solved, i.e. produced a zero gradient. The green plot shows the fraction of triplets where the distance between the anchor and the positive element was less than α/2. We can see that the number of such triplets is decreasing, which implies that the clusters of same-class representations grow bigger with time.
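For reference, the standard triplet loss with the margin used here (α = 1.5) can be written as a short sketch:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=1.5):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + alpha).
    A triplet with zero loss is 'solved' and yields a zero gradient."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + alpha)
```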

Experiments
To prove the effectiveness of our dataset and method, we performed four experiments. First, we trained a neural network on the HPatches training data [9] and on our training data with the original triplet loss function. All neural networks were trained for approximately 5000 batches (each consisting of 8192 triplets) with the described augmentation.
For testing purposes we used three datasets: HPatches, to check the validity of the resulting descriptor in general, and two open datasets containing documents: MIDV-500 [26] and MIDV-2019 [27].
Fig. 5 shows some images from the used datasets. While HPatches is a dataset of general image patches mostly containing outdoor images, the MIDV-500 and MIDV-2019 datasets contain document images. The latter introduces heavier projective distortions and is considered harder. Both datasets have various complex backgrounds and are challenging for the task.

Results
In Tables 4, 5, and 6 we show the results obtained using the HPatches testing framework [9]. It can be seen that in some cases of patch verification (see Tables 4 and 5) the network trained on our data was even better. In the retrieval task the situation is better still (see Table 6). Overall, the neural network trained on our data shows comparable results.

Discussion
The suggested neural network has an interesting property: the activation function before the last layer limits its output values. This allows us to evaluate the lower and upper bounds of the possible neural network output in each dimension. Let W_ij and b_j be the weights and biases of the last layer, and let the preceding activation limit its output to [-a, a]. Then the lower bound L_j and the upper bound U_j for every dimension j can be calculated with equations (3):

L_j = b_j - a·Σ_i |W_ij|,   U_j = b_j + a·Σ_i |W_ij|.   (3)

Even though we do not use this property in our current work, it can be very useful for output quantization and reduction of the descriptor size.
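Assuming the activation before the final fully connected layer bounds every input coordinate to [-a, a], the per-dimension output bounds from equations (3) can be computed as a short sketch (the toy weights below are illustrative, not the paper's):

```python
import numpy as np

def output_bounds(W, b, a):
    """Per-dimension output bounds of the final FC layer y = x @ W + b
    when every input coordinate lies in [-a, a]:
    L_j = b_j - a * sum_i |W_ij|,  U_j = b_j + a * sum_i |W_ij|."""
    spread = a * np.abs(W).sum(axis=0)
    return b - spread, b + spread

W = np.array([[1.0, -2.0], [0.5, 1.0]])  # toy weights: 2 inputs, 2 outputs
b = np.array([0.0, 1.0])
low, high = output_bounds(W, b, a=1.0)   # bounds per output dimension
```

Knowing these bounds in advance is what makes a fixed-point quantization of the descriptor output straightforward.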
Another interesting point: even though triplet training needs multiple examples per class to construct anchor-positive pairs, our data contains many classes with a single image. It may seem that this is a disadvantage of the dataset, but in fact it is the opposite. With this data distribution we can carefully choose the augmentation for the anchor and positive images and control which transformations should be tolerated and which should not.
Finally, since most of the data is randomly generated, we cannot be 100% sure that there are no images in different classes that are actually very similar. But our analysis shows that the probability of this is very low. Furthermore, with over 3×10^5 classes, the chance that two identical images will be selected incorrectly is negligible.

Conclusion
In this paper, we showed that feature point description is different for documents and for outdoor images. The comparison of neural networks trained on the general and on our special dataset clearly shows the gap in quality. The main purpose of our dataset is to provide the necessary information for describing image patches containing letters. Additional images make the training data applicable not only to document image patch matching but to other purposes as well. We also demonstrated that a very lightweight neural network can still be used for the task, which makes this kind of algorithm applicable on devices with limited computational resources.
For future work we plan to enhance the dataset in two main directions: evaluate what type of data is still missing and add new images, and improve the patch retrieval mechanism to use the already present data even more efficiently. We also plan to quantize the neural network and its output from 32 bits per value down to 8 bits per value, which should be possible without (or with minimal) quality loss given the neural network properties described above. We also plan to study the possibility of output dimension reduction to further shrink the descriptor size.

Fig. 2: Process of gathering the second part of the dataset.