Acute ischemic stroke lesion segmentation in non-contrast CT images using 3D convolutional neural networks

In this paper, an automatic algorithm aimed at volumetric segmentation of acute ischemic stroke lesion in non-contrast computed tomography brain 3D images is proposed. Our deep-learning approach is based on the popular 3D U-Net convolutional neural network architecture, which was modified by adding the squeeze-and-excitation blocks and residual connections. Robust pre-processing methods were implemented to improve the segmentation accuracy. Moreover, a special patches sampling strategy was used to address the large size of medical images and class imbalance and to stabilize neural network training. All experiments were performed using five-fold cross-validation on the dataset containing non-contrast computed tomography volumetric brain scans of 81 patients diagnosed with acute ischemic stroke. Two radiology experts manually segmented images independently and then verified the labeling results for inconsistencies. The quantitative results of the proposed algorithm and obtained segmentation were measured by the Dice similarity coefficient, sensitivity, specificity and precision metrics. The suggested pipeline provides a Dice improvement of 12.0 %, sensitivity of 10.2 % and precision 10.0 % over the baseline and achieves an average Dice of 62.8  3.3 %, sensitivity of 69.9  3.9 %, specificity of 99.7  0.2 % and precision of 61.9  3.6 %, showing promising segmentation results.


Introduction
Acute cerebral circulation disorder or stroke is a disease with high rates of morbidity and mortality worldwide.According to the American Heart Association, the most common type of stroke is ischemic [1].Early diagnosis of stroke is crucial for treatment choice [2,3], because tissue changes in the ischemic penumbra may be reversible, especially in the early stages [4,5].The choice of diagnostic methods in each specific case strongly depends not only on its applicability (availability, contraindications, patient's condition, etc.), but also on the time of symptoms onset [6].Any delay in medical care increases the risk of severe consequences and death.
Neuroimaging is fundamental to most modern methods of differential diagnosis of acute stroke [5,[7][8][9][10][11][12].For some imaging procedures, contrast injection is required, which is related to the risk of complications and a number of contraindications.The analysis of Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) scans is an integral part of the examination guidelines for patients with signs and symptoms of acute stroke [7][8][9].
CT is the most common diagnostic tool for acute ischemic stroke (AIS) due to its availability (a large number of screening centers, low cost, no contraindications and low body burden) and short screening duration [7,8,13].Non-contrast CT (NCCT) was first used in the evaluation of AIS patients in 1995 to detect intracranial hemorrhage (hemorrhagic stroke) and select a treatment strategy at an early time window (within 3 hours of symptoms onset) [15,16] and proved to be efficient [13,14,17].When diagnosing stroke, NCCT demonstrates relatively high specificity and low sensitivity, however, it allows detecting blood clots in cerebral vessels and intracranial hemorrhage (hemorrhagic stroke) which are absolute contraindications for some strategies of AIS treatment [9,17].
Interpretation of NCCT scans is associated with certain difficulties, since early AIS changes look like areas of slightly reduced density, which the human eye due to various factors cannot always distinguish from the normal tissues [3,7,19].In addition, images often show artefacts (caused by patient movements or imaging camera) that may look like strokes [8].Therefore, developing of automated procedures for localization and assessment of AIS tissues volume based on NCCT scans [19][20][21][22][23][24][25][26] (including the procedures based on convolutional neural networks (CNNs) [19][20][21][22]) is an urgent task.Randomized controlled trials [19,[27][28][29] have shown that CNN-based methods are comparable to radiologists in terms of sensitivity, specificity, overall accuracy, AUC, ROC, etc.It allows using CNNs as auxiliary tool in clinical practice.In addition, automated processing based on 3D CNNs allows volumetric analysis taking into account the spatial context and the rapid detection of small ischemic foci.
Over recent years in the medical image segmentation task CNNs have achieved state-of-the-art results [30,31].It is due to the convolution operation and its weights capable of learning complicated structures and patterns at multiple scales in the data.The U-shaped encoder-decoder CNN architecture 2D U-Net [32] has a wide application [33][34][35].In [33], a self-adapting framework where 2D U-Net is adopted to segment various organs is proposed.For the analysis of volumetric images, 3D CNNs were introduced [36][37][38].3D U-Net [36] shows better performance in comparison with 2D U-Net.In [39], a slightly modified 3D U-Net is utilized for brain tumor segmentation in MRI scans.In [40] DeepMedic CNN architecture [38] is used for stroke lesions segmentation and post-processing techniques are investigated.In [33], cascaded 3D U-Net is proposed, which overcomes the disadvantage of 3D U-Net for datasets with large image sizes, but it requires training two neural networks.U-Net is often improved for a specific task using various architecture choices (e. g., Squeeze-and-Excitation blocks [34,[41][42][43], attention mechanisms, and computer vision transformers [44]).
In this work, we present a neural network algorithm for the volumetric segmentation of acute ischemic stroke lesions on NCCT brain images.We use a 3D CNN based on 3D U-Net architecture [36], which we modify with residual connections and relatively new Squeeze-and-Excitation blocks [41].To achieve better results, we implement robust pre-processing techniques, a particular patch extraction strategy, and a weighted loss function, which are aimed to alleviate the problem of class imbalance.We perform five-fold cross-validation and compare the results of experiments by measuring Dice, sensitivity, specificity, and precision metrics.
In the rest of the paper, we describe the used dataset and the method (including data pre-processing techniques and patches sampling strategy), the neural network architecture, and training and testing procedures.In the end, we summarize our work and discuss future plans.

Materials
The dataset used for our study contains volumetric non-contrast CT head scans of 81 patients diagnosed with acute ischemic stroke.The CT images were made by the Philips Ingenuity CT scanner and stored in medical DI-COM format.The data were acquired from the International Tomography Center SB RAS.All volumetric images have the same resolution of 512×512, but a different number of slices varying from 306 to 505, depending on the patient.For all images, slice thickness and spacing between slices are 1 and 0.5 mm, respectively.Also, the DICOM attribute pixel spacing, that is, the physical distance in mm between pixel centers, ranges from 0.38 to 0.5 for different volumes.For each volumetric image, corresponding manual segmentation is available.All segmentations were performed by two radiology experts (specialists in magnetic resonance imaging and X-ray computed tomography with 9-13 years experience, PhD in Radiology and Radiation Therapy) using 3D Slicer [45].It is worth mentioning that the number of voxels corresponding to the area affected by acute ischemic stroke is 0.8% of the total number of voxels in our dataset.

Data pre-processing
One crucial step in all deep-learning approaches is data pre-processing, which ensures data consistency.Training CNN directly on data without pre-processing leads to poor performance, as will be shown below.First, the intensities of CT images were thresholded between 0 and 80 Hounsfield units [46].This transformation eliminates most of the artifacts and highintensity tissues and remains visible such important parts of the brain as soft tissue, white matter, and gray matter.Second, the skull and coil were removed on each slice, leaving only the brain area on the CT image.This transformation was performed using connected component analysis.In particular, extraction of the largest connected component, then assigning zero values to pixels with the highest intensity, which correspond to the region of the skull, and re-extraction of the largest connected component.Third, each volume was normalized.We apply min-max normalization: where X min and X max are the minimum and the maximum intensity values (X) of 3D image, respectively.Such a normalization rescales values to [0, 1].We also conducted an experiment applying standardization before min-max normalization: where µ is the mean and s is the standard deviation of the brain region.
Standardization was applied to each volumetric CT image of each patient independently.Also, the non-brain region was set to 0. Such a technique gives comparative intensity values in the brain area while invariant to the size of the background part.Fourth, the volumes were cropped to the non-zero region to dispose of a large uninformative background area.An example of a pre-processed image is shown in Figure 1.

Patch-based approach
The main difficulty related to 3D medical images is their large size, while computing resources are always limited.For example, a CT scan from our dataset has on average 512 × 512 × 405 = 106168320 voxels (1.04 GB of memory for 32-bit voxels), where 405 is the average number of slices for a patient.It is worth noting that if the dataset is anisotropic, that is, the resolution of different axes of an image is not the same, resizing methods are not advisable.Different interpolation techniques can deform the physical object and remove small details.
An optimal solution is to extract small parts, so-called patches, of 3D images and use them as input to a neural network.We use the uniform sampler, which extracts patches randomly from the whole volume with uniform distribution, and the weighted sampler, which extracts patches from different parts of the volume with a given non-uniform distribution.In our case, the weighted sampler extracts a patch with its center in the area of the healthy brain tissue with a probability of 0.5 and with its center in the area affected by the stroke with a probability of 0.5.The probability of the background as a patch's center is set to 0. During one training epoch, the same number of patches set to 32 are extracted for each patient.The patch size is set to 128 × 128 × 128.

Neural network architecture
Our deep learning algorithm builds on the encoder-decoder 3D U-Net [36] convolutional neural network, which we specially modify for our objective.The architecture of the neural network is shown in Figure 2. The input size is 128 × 128 × 128.Each convolutional block of encoder and decoder consists of two 3D convolutions with the size of kernel 3 × 3 × 3 and stride of 1 × 1 × 1, where each of them is followed by a normalization layer and activation function.We adopt LeakyReLU as an activation function since it showed better results in our experiments in comparison with more commonly used ReLU.

Fig. 2. CNN architecture
It is known that batch normalization [47] is strongly contingent on the current batch statistics during training.The statistics can include some noise, depending on the input examples.Therefore, it requires a sufficiently large batch size and also a large size of the training set.However, due to the high memory usage of 3D convolutions when employing large patches, we are limited to a maximum batch size of 2. Thus, we apply instance normalization, which solves this issue and shows better performance in our task than batch normalization.
Trilinear upsampling is used instead of the more traditional transposed convolution operator in the decoder part.In our research, we observe similar results, but the upsampling operator has no trainable parameters, so the occupied memory can be reduced.
Residual connections mitigate the problem of vanishing and exploding gradients when training deep neural networks.It was first proposed in ResNet [48].We integrate this technique in each convolution block of the contracting and expansive paths.The input is processed by 3D convolution with a kernel size of 1 × 1 × 1, so the element-wise addition is possible.The architecture diagram of the convolutional block of the encoder with residual connection is illustrated in Figure 3.The parameters in brackets after Conv3D are the number of input channels, output channels, and kernel size.The decoder block is similar except for trilinear upsampling that halves the number of channels instead of 3D convolution that doubles the number of channels, therefore we leave aside the decoder block image.
We insert the Squeeze-and-Excitation (SE) mechanism [41] in our CNN as it shows strong performance in many computer vision tasks.It squeezes the global spatial information into a channel descriptor, captures inter-dependencies of all channels, and then recalibrates the feature maps to accentuate relevant channels.In our case, it can help to learn where the affected area is located by strengthening the informative features and suppressing the weak ones.
Channel-wise global average pooling is applied to the input tensor T ∈ R C×D×H×W of the SE block.Then the obtained vector U ∈ R C is processed by the excitation mechanism: where W 1 ∈ R C r ×C is a linear layer reducing the number of dimensions of the vector U , δ is the ReLU activation function, W 2 ∈ R C× C r is a linear layer increasing the number of dimensions of the vector U , σ is the sigmoid function, and r is a parameter.In the end, a channel-wise multiplication between each element s i of the vector S and the input tensor T is performed.SE module is integrated into each convolutional block.Its particular location is shown in Figure 3.

Training procedure
The obtained segmentation maps from the last CNN layer are transformed by the sigmoid function to get the probabilities of classes.Then the neural network weights are optimized using the sum of binary cross-entropy (BCE) and soft Dice [49] loss functions.We also perform experiments with weighted ), where N 0 is the number of background pixels of all training images, N 1 is the number of pixels related to the affected area, N = N 0 + N 1 , y i and p i denote the ground truth and the confidence score of the model for pixel i, respectively.

Training, implementation and testing details
We train our networks using the following setting: the batch size of 2, Adam optimizer [50] with the initial learning rate of 1E-4, which is reduced by factor 2 every 5000 iterations until the learning rate is 1.25E-5, L2 weight decay of 1E-5.The reduction ratio r of SE is 16.Each model was trained for a total of 40000 iterations.
The proposed algorithm was implemented in Python 3.7 using PyTorch 1.12.0 machine learning framework.We also employ TorchIO 0.18.80 [51] Python library for data pre-processing.All trainings were conducted on NVIDIA Tesla T4 GPU with 16 GB of memory.
All patients were randomly split into five parts, so all experiments were performed using five-fold cross-validation.To evaluate our results, we measure Dice similarity coefficient (DSC), sensitivity, specificity, and precision metrics, which are most significant in medical image segmentation.The use of these metrics, in addition to the main metric DSC, allows us to evaluate aspects of the behavior of segmentation methods in conditions of unbalanced samples.Results are presented as mean ± standard deviation.DSC is calculated similarly to the soft Dice loss but is not subtracted from 1 and thresholded values are used instead of confidence scores.To get the predicted binary segmentation mask, the obtained probabilities were binarized according to the threshold of 0.5.On the test set, patches were extracted across a whole volume with an overlap of 25% over a grid.The predictions in the overlapping areas were averaged.

Experiments and results
The quantitative comparison of our methods is presented in Table 1.The pre-processing of the first experiment included only thresholding of the Hounsfield units and the min-max normalization.Also, patches were derived  While the integration of the residual connections improves the DSC by 2.7%, the consistent inserting of the SE modules gives a 3.8% of DSC increase, also showing the promising sensitivity value of 0.689 ± 0.059.The final algorithm configuration maintains the above-described improvements, while also standardization before min-max normalization and training using the weighted loss function are performed.Such a configuration gives the best results, 0.628 ± 0.033 of DSC and 0.699 ± 0.039 of sensitivity.We also study how the patch overlapping size impacts the segmentation accuracy when performing testing.Table 2 shows the quantitative assessment with 25%, 50%, and 75% path overlapping using the final algorithm configuration.Although the overlapping of 75% gives the best results relying on DSC, it reduces sensitivity, and the running time of the algorithm per one patient is increased by 10 times.
We tried to insert in the training process various augmentation techniques, such as random affine transformation, random flip, random elastic deformation, random gamma intensity transformation, but it worsened the performance of our algorithm.
Qualitative results are presented in Figure 4. Several axial plane slices of the 3D images are from the validation set.

Conclusion
In this work, we presented and evaluated an automatic algorithm for the segmentation of acute ischemic stroke lesion in non-contrast computed tomography brain images.Our deep learning approach is based on the 3D U-Net convolutional neural network.As far as we know, no previous research has investigated the combination of 3D U-Net, Squeeze-and-Excitation blocks, residual connections, specific patch extraction technique, and weighted loss function for solving the problem of acute ischemic stroke lesion segmentation on NCCT images.We have experimentally shown that the suggested pipeline for NCCT images segmentation, as compared to other state-of-the-art methods, improved segmentation performance by implementing robust pre processing techniques, Squeeze-and-Excitation blocks, and residual connections.Also, specific patch extraction technique and weighted loss function partially solved the class imbalance problem and increased the segmentation accuracy.Our proposed model showed an average Dice of 0.628 ± 0.033, sensitivity of 0.699 ± 0.039, specificity of 0.9965 ± 0.0016, and precision of 0.619 ± 0.036.High specificity values are caused by the dominance of the negative class (non-affected tissue) over the positive class (pathological area) in the sample.The method can be used by radiologists in delineating between damaged and healthy brain tissue and deciding on further treatment.In particular, it can help correct inaccuracies in their stroke area predictions.Moreover, the method can assist doctors with the large flow of patients by selecting cases with affected brain areas from normal ones.Thus, doctors first of all examine patients in need of emergency care.
It is also important to note that one of the main constraints of our objec- tive is the small amount of data in our dataset.It is with this we associate the unusual obtained values of the specificity metric, and with the class imbalance problem.In the future, we plan to explore other CNN architectures and increase the dataset.

Table 2 .
The comparison of the patch overlap sizes The results of the second experiment show that adding the robust pre-processing techniques, including selecting the brain tissue and cutting to the non-zero region, increases DSC by 0.4%.The restricted patch extraction method performed by the weighted sampler during training improves performance by 3.6% of DSC and 4.9% of sensitivity.The main proposed modifications of 3D U-Net are residual connections and SE.