
Hybrid architecture of transformer and convolutional neural network with a multi-scale deformable attention mechanism for semantic segmentation task
R.R. Otyrba1, A.A. Sirota1

1Voronezh State University, 394018, Russia, Voronezh, Universitetskaya Square 1


DOI: 10.18287/COJ1686

Article ID: 1686

Language: English

Abstract:
A hybrid neural network architecture, SegTwice, is proposed for the semantic segmentation task. It combines the strengths of transformers and convolutional neural networks within a unified encoder-decoder framework. The original architecture of the encoding network, TWICE-DA, is presented, featuring a hierarchical structure with four levels. New architectural solutions that differ from known analogs are introduced and justified within the transformer blocks: a multi-scale perception unit, a channel attention module, a deformable attention module, and a convolutional feedforward network module. Experiments on image classification tasks were conducted to assess the feature extraction effectiveness of TWICE-DA on datasets of varying complexity. It is shown that TWICE-DA demonstrates high quality, outperforming most modern models in both accuracy and computational complexity. The integration of TWICE-DA into the semantic segmentation network structure is achieved by adding a lightweight MLP decoder, ultimately realizing the SegTwice architecture. Experiments conducted on the standard aerospace datasets LoveDA and Potsdam revealed that the proposed SegTwice network demonstrates competitive performance, matching traditional models and modern transformers in accuracy and, in some cases, outperforming them. Notably, SegTwice was trained from scratch, without pre-training on large datasets, highlighting its resilience to overfitting in scenarios with limited data.
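The abstract names a channel attention module among the transformer-block components but does not detail it. As a point of reference only, a minimal squeeze-and-excitation-style sketch of generic channel attention (not necessarily the variant used in TWICE-DA; `w1` and `w2` are hypothetical projection weights) could look like:

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Generic channel attention over a feature map x of shape (C, H, W).

    w1: (C // r, C) reduction weights, w2: (C, C // r) expansion weights,
    where r is a channel-reduction ratio (illustrative names).
    """
    z = x.mean(axis=(1, 2))                        # global average pool -> (C,)
    h = np.maximum(w1 @ z, 0.0)                    # bottleneck + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))            # sigmoid gate per channel
    return x * s[:, None, None]                    # reweight channels
```

With zero weights the gate is sigmoid(0) = 0.5, so every channel is uniformly halved; learned weights would instead emphasize informative channels and suppress others.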

Keywords:
computer vision, semantic segmentation, deep neural networks, convolutional neural networks, transformers, attention mechanism.

Acknowledgements:
This work was supported by the Ministry of Science and Higher Education within the State assignment № 075-00444-25-00 (dated 26.12.2024).

Citation:
Otyrba RR, Sirota AA. Hybrid architecture of transformer and convolutional neural network with a multi-scale deformable attention mechanism for semantic segmentation task. Computer Optics 2026; 50(1): 1686. DOI: 10.18287/COJ1686.


151, Molodogvardeiskaya str., Samara, 443001, Russia; E-mail: journal@computeroptics.ru; Tel: +7 (846) 242-41-24 (Executive secretary), +7 (846) 332-56-22 (Issuing editor), Fax: +7 (846) 332-56-20