Building detection by local region features in SAR images

Buildings are difficult to detect in SAR images, where their basic features are shadows. SAR shadows have many different representations. As a result, a convolutional neural network cannot be applied directly to building detection. In this article we analyse the properties of SAR shadows for different building types. Each region of interest (ROI) prepared for training is then corrected according to its own SAR shadow properties. The reconstructed ROIs are fed into a modified YOLO network, which detects buildings with better quality.


Introduction
Buildings are among the largest discrete objects that need to be detected in satellite images. Although optical images easily capture detailed ground surface information, this approach is limited by weather conditions. In contrast, synthetic aperture radar (SAR) sensing is independent of weather and daylight conditions and is thus more suitable for promptly mapping built-up areas. Automatic recognition of buildings in satellite images is an important and rather difficult task.
Detecting and highlighting buildings in satellite images is useful for many applications: building maps, creating territory development plans, finding illegally constructed objects, etc. The main difficulty of SAR image analysis lies in the large number of different structures to be recognized. The task is complicated by the varied shape, color characteristics and size of the objects.
Many algorithms have been developed to recognize buildings in optical satellite images. Most of them are based on the analysis of object shape, texture, shadow, boundary, etc. [1,2]. Recently, neural network methods have been used for building extraction in optical satellite images [3]. The training dataset consists of original satellite images and their masks. The masks are binary images or region contours in which the pixel value corresponds to the object class. However, only a few neural networks give high-quality results, for example fully convolutional networks (FCN) [18], Mask R-CNN [19] and CNNs [20]. Such methods have been successfully employed in computer vision and remote sensing for optical image classification, but few have been applied to SAR images.
Backscattering echoes from buildings in high-resolution SAR images carry information on the three-dimensional shapes of the buildings. Several researchers have reconstructed and detected flat-roof and gable-roof buildings from SAR images [4-8]. However, automatically detecting buildings in very high-resolution (VHR) SAR images remains a highly challenging problem.
In this paper we propose an approach to detecting buildings in SAR images using neural networks. The training areas are selected by automatically determining a buffer zone around each building whose shadow is detected from the shadow shape and the illumination direction. This improves the recognition of building regions. The experiments were performed on a set of high-resolution SAR images.

Characteristics of SAR images
SAR image formation is based on how different surface types reflect and scatter the emitted radar signal [9]. The total intensity of the reflected signal (the pixel gray value) is affected by the flatness and regularity of the objects. As a result of processing, the "raw" SAR data are converted into a gray image. The gray value of a pixel in a SAR image is not affected by lighting, chemical composition (except salt and ice) or the temperature of objects; it depends on three factors: the SAR system, the SAR processing and the properties of the object. An object can therefore be classified by its properties, such as geometry, dielectric properties and so on. Volumetric objects (for example, vegetation or similar cover) correspond to an average gray level and texture, smooth surfaces (for example, a calm water surface) shift the brightness towards dark values, and buildings appear bright. The dielectric properties of the material affect the intensity of the reflected signal. The difference in the coefficient values for different materials makes it possible to identify materials in SAR images.
An important physical parameter of radar imagery is polarization. Polarization of radiation is a property inherent in radar systems. With linear polarization, the planes can be oriented horizontally (H) or vertically (V). A vertically polarized wave interacts with vertical structures in the building, while a horizontally polarized wave penetrates through them, and vice versa (fig. 1).
Since SAR imaging is performed at a significant off-nadir angle, the direction of decorative elements on buildings is a key factor (fig. 2). The intensity of the signal reflected from the roof structure of one and the same building therefore differs with direction. The moisture of the building material, which determines its dielectric constant, is also an important factor [10].

Characteristics of buildings in SAR images
Buildings in SAR images have their own characteristics that allow them to be detected. Each type of building has its own shadow characteristics, so the geometry of the shadow serves as a pattern for building recognition.
The preparation of the mask or region plays a very important role in the quality of detection by a neural network. The size of the mask depends on the size of the shadow. Therefore, each building should be covered by a mask that changes according to the shadow pattern.
To solve the problem of building classification and mapping with SAR, it is necessary to study the nature of the radar signal scattering intensity, in the same way that the spectral reflectivity of objects is studied when processing optoelectronic images. For a given territory, a set of training samples was laid out on the basis of the available data. For each sample falling entirely within a separate building region, the brightness profile was calculated.
Combining SAR polarization types improves the detail of objects in the image. There are many types of buildings, and each type is represented differently in every combination of polarization channels. The common properties of buildings are collected in the sum of all polarization channels.
As a result, the basic feature for building detection in a SAR image is the shadow. The structure of the shadow depends on the features of the building. The shadow is formed by the location of building details with conductive properties, the angle and speed of the satellite motion, and the reflections of waves from the corners of the construction elements of the building.
The paper [9] describes the SAR building shadow from a geometrical point of view and defines an evaluation function based on the ratio of exponentially weighted averages (ROEWA), which is used for matching the predicted structure against the observed SAR image.
The projection of the three-dimensional building onto the two-dimensional slant image plane influences the shadow and produces effects such as layover, shadowing and foreshortening. In addition, specular reflections occur between closely spaced buildings in urban regions. The paper [9] proposes dense alignment of the building images along the radar line of sight (RLOS) to compensate for multi-path reflection effects [11], as in fig. 3.
The shadow profiles of buildings in a SAR image are divided into two basic types: flat-roof and gable-roof buildings. For example, the single-bounce return of an isolated one-floor flat-roof building consists of reflections from the ground, the front wall and the roof. The structure of a one-storey building with a flat roof is not complex. For a multistory building, the brightness profile has a more complex structure, as in fig. 4. Azimuthal and lateral resolutions are defined differently. In this case the brightness representation differs, but the structural elements of the SAR shadow are preserved.
The material used in building construction influences its brightness in a SAR image. Today there are many materials characterized by different dielectric properties, and as a result there are different types of representation in SAR images. A metal building, such as the one in fig. 6, is characterized by very high brightness. Fig. 7 shows a building whose material has low conductivity, which leads to low brightness.

SAR image preprocessing
Common SAR image processing consists of seven basic steps: creation, calibration, multilooking, selection of region of interest (ROI), indexing, detection and classification [12].
The first step is data creation. It was found empirically that the dual polarization VV+VH [6] gives more accurate results than the dual polarization HH+HV. Therefore, we use only the VV+VH combination. In this step a separate processing branch is created for each polarization type. The second step is calibration. Calibration radiometrically corrects the SAR image so that the pixel values truly represent the radar backscatter of the reflecting surface. The corrections are performed by the SNAP software, which automatically determines from the image metadata which corrections need to be applied. Calibration is essential for quantitative analysis of SAR images.
Multilooking produces a product with a nominal image pixel size. It averages pixels in range and azimuth, which improves the radiometric resolution but degrades the spatial resolution. As a result, the image has approximately square pixels. This corresponds to the conversion from slant range to ground range. Multilooking is an optional step and is not required when the image is terrain-corrected, but we include it in the common processing scheme.
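The averaging described above can be illustrated with a minimal NumPy sketch. It assumes a single-band intensity image with rows along azimuth and columns along range; the function name and the look factors are hypothetical and do not reproduce the SNAP implementation.

```python
import numpy as np

def multilook(intensity, looks_range, looks_azimuth):
    """Average non-overlapping blocks (rows = azimuth, columns = range)."""
    rows, cols = intensity.shape
    # Drop incomplete blocks at the image edge (an assumption of this sketch).
    rows -= rows % looks_azimuth
    cols -= cols % looks_range
    cropped = intensity[:rows, :cols]
    blocks = cropped.reshape(rows // looks_azimuth, looks_azimuth,
                             cols // looks_range, looks_range)
    return blocks.mean(axis=(1, 3))

image = np.arange(24, dtype=float).reshape(4, 6)
looked = multilook(image, looks_range=3, looks_azimuth=2)
print(looked.shape)  # (2, 2)
```

Choosing different look factors in range and azimuth is what brings the output pixel close to square.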
Speckle reduction removes the specific noise caused by random constructive and destructive interference during image formation. The value of a particular pixel is obtained by adding a set of values from the antenna sensors. Speckle noise appears as graininess caused by chaotic alternation of light and dark pixels, and its presence makes radar images difficult to analyze. Speckle filters reduce the amount of speckle at the cost of blurred features. For RADARSAT-2 images in vertical-horizontal polarization, the Refined Lee filter performed best for analyzing buildings in the countryside.
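To show the principle of adaptive speckle filtering, the sketch below implements the basic Lee filter, not the Refined Lee filter used in this work (the refined variant adds edge-aligned windows). The window size, the number of looks and the synthetic test image are assumptions chosen for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def lee_filter(intensity, window=5, looks=1.0):
    """Basic Lee filter for an intensity image (odd window size assumed)."""
    pad = window // 2
    padded = np.pad(intensity, pad, mode="reflect")
    windows = sliding_window_view(padded, (window, window))
    local_mean = windows.mean(axis=(-2, -1))
    local_var = windows.var(axis=(-2, -1))
    # Squared coefficient of variation of fully developed speckle.
    noise_cv2 = 1.0 / looks
    # Estimated variance of the speckle-free signal, clipped at zero.
    signal_var = np.maximum(
        (local_var - noise_cv2 * local_mean ** 2) / (1.0 + noise_cv2), 0.0)
    weight = signal_var / np.maximum(local_var, 1e-12)
    # Flat areas get the local mean; strong structures keep the pixel value.
    return local_mean + weight * (intensity - local_mean)

rng = np.random.default_rng(0)
speckled = rng.exponential(1.0, size=(64, 64))  # synthetic one-look speckle
filtered = lee_filter(speckled, window=5, looks=1.0)
```

The weight stays between 0 and 1, so homogeneous areas are smoothed strongly while bright point-like structures such as building corners are less affected.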
The order in which the return signal values are recorded during radar sensing of the earth's surface depends on the direction of motion of the satellite and the look direction. As a result, the original images are not always correctly oriented relative to the cardinal directions. Terrain correction removes the geometric distortions of the SAR image. It includes geocoding and orthocorrection.
Geocoding is the coordinate referencing of the original or converted radar image without removing the distortions caused by the relief. This transformation converts an image from slant range or ground range geometry into a map coordinate system.
Orthocorrection includes not only coordinate referencing but also the elimination of terrain-related distortions. As a rule, geocoding and orthocorrection of radar images are performed using orbital data. The transformation uses a digital elevation model (DEM) to correct inherent SAR geometry effects such as shadow, foreshortening and layover. For terrain correction we used the range-Doppler algorithm.
The SAR image is then usually contrast-enhanced for better representation. This operation corresponds to expressing the intensity through a logarithmic transformation. After this transformation, SAR images can be analyzed and compared, because all distortions should have been corrected.
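A common form of the logarithmic transformation is the conversion of linear backscatter to decibels, followed by a linear stretch for display. The sketch below assumes this form; the floor value and the display range of -25 dB to 0 dB are hypothetical choices, not values taken from this work.

```python
import numpy as np

def to_decibels(intensity, floor=1e-6):
    """Convert linear backscatter to dB, clamping to avoid log of zero."""
    return 10.0 * np.log10(np.maximum(intensity, floor))

def stretch_to_unit(db, low_db=-25.0, high_db=0.0):
    """Linearly map a dB range to [0, 1] for display (range is assumed)."""
    return np.clip((db - low_db) / (high_db - low_db), 0.0, 1.0)

backscatter = np.array([1.0, 0.1, 0.01])
db = to_decibels(backscatter)
display = stretch_to_unit(db)
```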
ROI selection is an important step that depends on the main task. It is necessary because a SAR satellite image is usually very large and cannot be loaded whole within the limits of computer memory. We use the traditional approach of cutting the image into tiles.
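Cutting a large scene into tiles can be sketched as the generation of pixel windows. The tile size and the optional overlap below are hypothetical parameters; the tile size actually used is not stated in the text.

```python
def tile_windows(height, width, tile, overlap=0):
    """Return (top, left, bottom, right) windows that cover the image."""
    step = tile - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the tile size")
    windows = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            # Edge tiles are clipped to the image border.
            bottom = min(top + tile, height)
            right = min(left + tile, width)
            windows.append((top, left, bottom, right))
    return windows

windows = tile_windows(height=1000, width=1500, tile=512)
print(len(windows))  # 6
```

An overlap between neighbouring tiles is useful when a building with its shadow may otherwise be split by a tile border.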
After that, we create an additional image channel (band) for vegetation. This step is called indexing [13]. A vegetation index is a mathematical combination or transformation of spectral bands that accentuates the spectral properties of green plants so that they appear distinct from other image features [14]. The radar vegetation index (RVI) is calculated for every pixel from the backscattering coefficients [15], where γ0 (gamma-nought) denotes the radiometrically and geometrically corrected SAR backscattering coefficient for each polarization combination. As a result, we use three channels for analysis: VV, VH and RVI.
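The RVI expression itself did not survive in the text. A form often used for dual-polarization VV+VH data is RVI = 4 γ0_VH / (γ0_VV + γ0_VH); whether this is exactly the expression of [15] is an assumption of the sketch below, as are the function and parameter names.

```python
import numpy as np

def radar_vegetation_index(gamma0_vv, gamma0_vh, eps=1e-12):
    """Dual-pol RVI in its commonly used form (assumed, see lead-in).

    Inputs are linear (not dB) gamma-nought values of equal shape.
    """
    return 4.0 * gamma0_vh / (gamma0_vv + gamma0_vh + eps)

vv = np.array([[0.30, 0.20]])
vh = np.array([[0.10, 0.02]])
rvi = radar_vegetation_index(vv, vh)
```

The index must be computed before any logarithmic transformation, because the ratio is defined on linear backscatter values.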

Building definition
The last two steps of building detection are detection and classification. Both are complicated procedures with many implementations.
Detection is the process of partitioning a SAR image into multiple regions (connected sets of pixels that correspond to objects). Its goal is to simplify the representation of the image into something more meaningful and easier to analyze.
Classification assigns visual content to the segmented regions. It is the final step of building detection in a SAR image. These two steps can be combined through the use of a neural network.
For our research we use images from the RADARSAT-2 satellite. The resolution of these images is about 1.7 meters per pixel. Pixel values are stored as floating-point numbers. Some functions are implemented with the free SNAP (Sentinel Application Platform) software from ESA and with QGIS.

Modification of dataset
The preparation of the data set is the most important step for the quality of the neural network. We used 500 samples for the training set and 110 samples for testing. We tested many types of CNN, and the YOLO v3 network based on Darknet showed the best results.
The YOLO model we selected requires a strictly formatted input array, namely a set of images with buildings. We needed an interface that can accept any image, normalize it and feed it into the neural network, so we developed one. For normalization it uses TensorFlow, which works much faster than the other solutions we tested (native Python, NumPy, OpenCV).
Our source data are sets of building polygons that were prepared on color satellite images (fig. 9). This marking of houses is unsatisfactory on its own, because a sample for the YOLO network must be a rectangle. To compensate for the shape and dimensions of such a sample, we developed an algorithm based on region growing. It consists of two basic steps. The first step defines a rectangular bounding box for every building contour: the minimum and maximum x and y coordinates of the contour give the upper-left and lower-right corners of the rectangle. In the second step, these corner points are used as seed coordinates for the region growing algorithm. The algorithm processes only the gradient of the VH-polarized image. Every corner is shifted in the diagonal direction to increase the box area. The shift step is nine pixels; this number can be changed for other image scales. The local gradient value is accumulated along the lines of motion. If this value does not exceed a threshold, the shifting stops and the box coordinates return to their previous position. The threshold depends on the type of building and was defined by the user in our case. Growing the box in this way makes it bound the building together with its SAR shadow, as in fig. 10. Every new image is accompanied by an annotation txt file in the same directory and with the same name, which contains the object number and object coordinates on this image, one object per line: <object-class> <x> <y> <width> <height>.
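The two steps of the box-growing algorithm can be sketched as follows. This is a simplified reading of the description above, not the authors' implementation: the gradient is accumulated along the diagonal path of each corner, only the upper-left and lower-right corners are moved, and the function names, the threshold value and the synthetic gradient image are hypothetical.

```python
import numpy as np

def bounding_box(polygon):
    """Step 1: axis-aligned box (x_min, y_min, x_max, y_max) of a contour."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def grow_corner(gradient, x, y, dx, dy, step, threshold, max_steps):
    """Shift one corner diagonally while the path carries enough gradient."""
    height, width = gradient.shape
    for _ in range(max_steps):
        path = [(x + dx * k, y + dy * k) for k in range(1, step + 1)]
        if any(not (0 <= px < width and 0 <= py < height) for px, py in path):
            break
        accumulated = sum(gradient[py, px] for px, py in path)
        if accumulated <= threshold:
            break  # keep the previous corner position
        x, y = path[-1]
    return x, y

def grow_box(gradient, box, step=9, threshold=1.0, max_steps=10):
    """Step 2: grow the box outwards from its two defining corners."""
    x0, y0, x1, y1 = box
    x0, y0 = grow_corner(gradient, x0, y0, -1, -1, step, threshold, max_steps)
    x1, y1 = grow_corner(gradient, x1, y1, 1, 1, step, threshold, max_steps)
    return x0, y0, x1, y1

# Synthetic VH gradient: non-zero only around a building and its shadow.
gradient = np.zeros((100, 100))
gradient[30:70, 30:70] = 1.0
box = bounding_box([(45, 40), (60, 48), (52, 60), (40, 50)])
grown = grow_box(gradient, box, step=9, threshold=1.0)
```

In this toy example the box expands by one nine-pixel step on each side and then stops, because the next step would cross almost no gradient.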
This annotation was checked by the user and classified into the following classes:
- F1C: flat roof, one floor, concrete or brick material;
- G1C: gable roof, one floor, concrete or brick material;
- C1C: complex structure, one floor, concrete or brick material;
- F2C: flat roof, two or three floors, concrete or brick material;
- G2C: gable roof, two or three floors, concrete or brick material;
- C2C: complex structure, two or three floors, concrete or brick material;
- F3C: flat roof, multistory building, concrete or brick material;
- G3C: gable roof, multistory building, concrete or brick material;
- C3C: complex structure, multistory building, concrete or brick material;
- wood*: wood construction;
- metal*: metal construction.
In addition, data augmentation by rotation by 15 degrees is applied to every sample. Classes marked with "*" have fewer samples.
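An annotation line for one object can be produced as in the sketch below. It assumes the usual Darknet convention, in which the class is written as an integer index and the box as centre coordinates, width and height normalized by the image size. The order of the classes in the list is hypothetical.

```python
# Class order is an assumption; only the class names come from the text.
CLASS_NAMES = ["F1C", "G1C", "C1C", "F2C", "G2C", "C2C",
               "F3C", "G3C", "C3C", "wood", "metal"]

def yolo_annotation_line(class_name, box, image_width, image_height):
    """Format one object as '<object-class> <x> <y> <width> <height>'."""
    x0, y0, x1, y1 = box
    class_id = CLASS_NAMES.index(class_name)
    x_center = (x0 + x1) / 2 / image_width
    y_center = (y0 + y1) / 2 / image_height
    width = (x1 - x0) / image_width
    height = (y1 - y0) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

line = yolo_annotation_line("G1C", (100, 200, 300, 400), 1000, 1000)
print(line)  # 1 0.200000 0.300000 0.200000 0.200000
```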

Modification of dataset images
The detection problems are connected with the properties of materials and the complexity of building construction. We therefore modify the VV and VH images as in fig. 12. Wood and metal buildings are identified by checking the area of regions with uniform brightness: high brightness for metal buildings and low brightness for wood constructions. Such objects are detected by a brightness threshold. For metal buildings it is a narrow range of the highest brightness values, about five percent of the brightness histogram. For wood buildings it is a narrow range of the lowest brightness values, about ten percent of the brightness histogram. These regions are also selected by area, within a range defined from the image resolution. Afterwards, the metal and wood classes are excluded from the YOLO processing results.
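The brightness thresholding can be illustrated with histogram quantiles. The sketch assumes that metal candidates lie in the brightest five percent of the histogram and wood candidates in the darkest ten percent, as the high and low brightness described above suggests. The subsequent selection of regions by area is omitted, and all names are hypothetical.

```python
import numpy as np

def brightness_masks(image, metal_fraction=0.05, wood_fraction=0.10):
    """Return boolean masks of metal and wood candidate pixels."""
    metal_threshold = np.quantile(image, 1.0 - metal_fraction)
    wood_threshold = np.quantile(image, wood_fraction)
    metal_mask = image >= metal_threshold
    wood_mask = image <= wood_threshold
    return metal_mask, wood_mask

image = np.arange(100, dtype=float).reshape(10, 10)
metal_mask, wood_mask = brightness_masks(image)
```

In practice the masks would still have to be split into connected regions and filtered by area before a region is accepted as a building.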

Modified YOLO CNN for Building detection
Many detection systems repurpose classifiers or localizers to perform detection, using R-CNN, VGGNet, ResNet, Inception and so on. They apply the model to an image at selected scales and separate locations. As a rule, such algorithms use very high-resolution images.
YOLO applies a single neural network to the full image. The network divides the image into regions and predicts bounding boxes and probabilities for each region. Such a box can include parts of buildings of different classes together with their SAR shadows. The bounding boxes are weighted by the predicted probabilities. We use YOLO v3, a multi-object detector, as the basic method for building detection. In our solution the three types of images (VV, VH and RVI) are merged by concatenation. A concatenate layer takes as input a list of tensors, all of the same shape except along the concatenation axis, and returns a single tensor that is the concatenation of all inputs along that axis. To bring multiple inputs to the same shape, an additional dense layer can be added to the VV image input and to the result of the merging. Outputs of the same shape are used for every concatenation layer and for the input. The basic YOLO architecture is shown in figure 11.
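At the data level, merging the three images by concatenation amounts to stacking them along a channel axis. The NumPy sketch below shows only this operation, under the assumption that the channel axis is the last one; it does not reproduce the network layers.

```python
import numpy as np

def stack_channels(vv, vh, rvi):
    """Concatenate three single-band images into one (H, W, 3) array."""
    if not (vv.shape == vh.shape == rvi.shape):
        raise ValueError("all channels must have the same shape")
    # Add a trailing channel axis to each band, then join along it.
    return np.concatenate([vv[..., None], vh[..., None], rvi[..., None]],
                          axis=-1)

vv = np.zeros((4, 5))
vh = np.ones((4, 5))
rvi = np.full((4, 5), 2.0)
stacked = stack_channels(vv, vh, rvi)
print(stacked.shape)  # (4, 5, 3)
```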
The following settings are used for the subsequent layers of the network. Every training iteration uses 384 target images, processed in 8 parts of 384 / 8 = 48 images each; the batch is divided by 8 to decrease GPU VRAM requirements. The number of categories for detection is 9, and the number of filters is set to 48 for every prediction scale. Weights are saved every 100 iterations during the first 1,000 iterations and every 1,000 iterations thereafter. The results of building detection for eleven classes of buildings are given in Table 1. The test dataset contains satellite imagery of one district of China. All images are divided into training and test sets. We estimate the quality of the results by accuracy.
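The arithmetic behind these settings can be checked with the usual Darknet YOLO v3 convention, in which the convolutional layer before each prediction scale has (classes + 5) × 3 filters for three anchors per scale. Under this convention 48 filters correspond to eleven classes, while nine classes would give 42. The text does not state which class count was actually configured, so the sketch only reproduces the formula.

```python
def yolo_v3_filters(num_classes, anchors_per_scale=3):
    """Filters before a YOLO v3 prediction layer (Darknet convention)."""
    # Each anchor predicts 4 box offsets, 1 objectness score and the classes.
    return (num_classes + 5) * anchors_per_scale

def images_per_pass(batch, subdivisions):
    """Images processed on the GPU at once when the batch is subdivided."""
    if batch % subdivisions:
        raise ValueError("batch must be divisible by subdivisions")
    return batch // subdivisions

print(yolo_v3_filters(11))     # 48
print(yolo_v3_filters(9))      # 42
print(images_per_pass(384, 8)) # 48
```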
Accuracy is the percentage of regions inside the detected boxes that are correctly classified as buildings in the predicted and reference images [16]. In the common case it is the ratio of the number of correct predictions to the total number of boxes. Precision can be interpreted as the proportion of true building boxes among all positive boxes detected by the algorithm. Recall estimates the share of relevant objects that the algorithm detected [17]. The detection results are good for the following building classes: F1C, G1C, C1C, F2C, G2C, C2C, F3C, G3C. The detection results are poor for the wood, C3C and metal classes.
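The three measures can be computed from box counts as in the sketch below. The counts in the example are hypothetical and are not results of this work.

```python
def accuracy(correct_predictions, total_boxes):
    """Ratio of correct predictions to the total number of boxes."""
    return correct_predictions / total_boxes if total_boxes else 0.0

def precision_recall(true_positive, false_positive, false_negative):
    """Precision and recall from detection counts."""
    detected = true_positive + false_positive
    relevant = true_positive + false_negative
    precision = true_positive / detected if detected else 0.0
    recall = true_positive / relevant if relevant else 0.0
    return precision, recall

# Hypothetical counts: 8 correct boxes, 2 false alarms, 8 missed buildings.
precision, recall = precision_recall(8, 2, 8)
print(precision, recall)  # 0.8 0.5
```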

Conclusion
We combined the YOLO deep learning algorithm with a SAR satellite image dataset of buildings suitable for YOLO training. The trained model shows good test results, especially for traditional and rotated buildings, as well as for compact and dense bright objects with a SAR shadow. We adapted the input of the YOLO algorithm to three different types of images: VV, VH and RVI. The demonstrated approach has proven to be good at detecting buildings, and we think that it can be used to detect any discrete objects. The main requirement is a correct and balanced training set. The shortcomings of our results are connected with insufficient training data. The dielectric properties of wooden buildings lead to low brightness of these objects, and as a result wooden houses have a low detection rate. Buildings with a complex structure are often detected not as one building but as several. These problems are very difficult to solve and may require additional post-processing under the supervision of an experienced user. The problem with metal building detection is redundant detections: metal buildings are reported in all places of high brightness, even where none are present. We solved this problem by moving metal detection to preprocessing and ignoring this class during YOLO detection. Thus, if the training set is well balanced, the result of neural network training is better.
As further improvements, we would try other neural network architectures or use an ensemble of networks, i.e. to predict the value of each pixel not according to the