
AP-Pose: Enhancing human pose estimation in complex airport scenes with YOLOv8s-Pose and multi-module optimization
H.Z. Shen1, X.C. Wang1, G. Dong2, Y. Ci3

1School of Electrical Engineering, Shanghai Dianji University, 201306, Shanghai, China, Shuihua Road 300;
2Zhengzhou Coal Industry Group, 100039, Zhengzhou, Henan, China;
3Beijing Simulation Center, 100039, Beijing, China

  Full text (PDF)

DOI: 10.18287/COJ1707

Article ID: 1707

Language: English

Abstract:
Recognizing human behavior in airports is crucial for security management and public safety. To address the low accuracy of human behavior recognition caused by strong lighting changes and complex backgrounds in airport settings, we propose an improved human pose estimation method based on YOLOv8s-Pose, called the AP-Pose algorithm. First, the PSA module is appended to the backbone network to strengthen global feature modeling, capture the relationships between different regions of the image, and suppress interference from complex backgrounds, enabling accurate human target localization and pose estimation. Second, the WASP module is introduced into the neck network to improve the spatial capture capability of the pose estimation network, thereby raising the localization accuracy of human joint points. Finally, the Adapter module enriches the feature representation for pose estimation by incorporating information from the original detection task layer and fine-tuning the pose estimation task with this information. Experiments on a self-constructed airport human pose dataset show that the improved method significantly enhances pose estimation performance, achieving a detection accuracy of 94.9%, with gains of 3.3% in precision (P), 3.3% in recall (R), and 3.5% in mAP@0.5 over the baseline YOLOv8s-Pose at only a marginal increase in computational complexity (12.5 GFLOPs vs. 10.8 GFLOPs). The method thus improves pose estimation accuracy and robustness in complex airport scenes while maintaining computational efficiency.
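As a quick sanity check on the reported efficiency trade-off, the abstract's own figures can be compared directly. A minimal sketch; the variable names are ours, and all numbers come from the abstract:

```python
# Cost/benefit of AP-Pose vs. the YOLOv8s-Pose baseline, using only
# the figures reported in the abstract.

baseline_gflops = 10.8   # YOLOv8s-Pose
ap_pose_gflops = 12.5    # AP-Pose
map_gain_pts = 3.5       # mAP@0.5 improvement, in percentage points

extra_gflops = ap_pose_gflops - baseline_gflops
compute_overhead = extra_gflops / baseline_gflops  # relative extra compute

print(f"extra compute: {extra_gflops:.1f} GFLOPs ({compute_overhead:.1%} over baseline)")
print(f"mAP@0.5 gain per extra GFLOP: {map_gain_pts / extra_gflops:.2f} points")
```

Roughly a 16% compute increase buys a 3.5-point mAP@0.5 gain, which is consistent with the abstract's characterization of the overhead as marginal.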

Keywords:
human pose estimation, airport scene, AP-Pose, Adapter module, WASP module, PSA module.

Citation:
Shen HZ, Wang XC, Dong G, Ci Y. AP-Pose: Enhancing human pose estimation in complex airport scenes with YOLOv8s-Pose and multi-module optimization. Computer Optics 2026; 50(1): 1707. DOI: 10.18287/COJ1707.

References:

  1. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017, 30.
  2. Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
  3. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  4. Liang Y, Ge C, Tong Z, et al. Not all patches are what you need: Expediting vision transformers via token reorganizations. arXiv preprint arXiv:2202.07800, 2022.
  5. Liu Z, Lin Y, Cao Y, et al. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 10012-10022. DOI: 10.1109/iccv48922.2021.00986.
  6. Xie E, Wang W, Yu Z, et al. SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 2021, 34: 12077-12090.
  7. Carion N, Massa F, Synnaeve G, et al. End-to-end object detection with transformers. European Conference on Computer Vision. Cham: Springer International Publishing, 2020: 213-229.
  8. Chi C, Wei F, Hu H. RelationNet++: Bridging visual representations for object detection via transformer decoder. Advances in Neural Information Processing Systems, 2020, 33: 13564-13574.
  9. Zhu X, Su W, Lu L, et al. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.
  10. Bertasius G, Wang H, Torresani L. Is space-time attention all you need for video understanding? ICML, 2021, 2(3): 4.
  11. Arnab A, Dehghani M, Heigold G, et al. ViViT: A video vision transformer. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 6836-6846. DOI: 10.1109/iccv48922.2021.00676.
  12. Fan H, Xiong B, Mangalam K, et al. Multiscale vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 6824-6835. DOI: 10.1109/iccv48922.2021.00675.
  13. Chen S, Yu T, Li P. MVT: Multi-view vision transformer for 3D object recognition. arXiv preprint arXiv:2110.13083, 2021.
  14. Chen H, Wang Y, Guo T, et al. Pre-trained image processing transformer. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 12299-12310. DOI: 10.1109/cvpr46437.2021.01212.
  15. Liang J, Cao J, Sun G, et al. SwinIR: Image restoration using Swin transformer. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 1833-1844. DOI: 10.1109/iccvw54120.2021.00210.
  16. Wang Z, Cun X, Bao J, et al. Uformer: A general U-shaped transformer for image restoration. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 17683-17693. DOI: 10.1109/cvpr52688.2022.01716.
  17. Chen X, Xie S, He K. An empirical study of training self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 9640-9649. DOI: 10.1109/iccv48922.2021.00950.
  18. Pan T, Song Y, Yang T, et al. VideoMoCo: Contrastive video representation learning with temporally adversarial examples. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 11205-11214. DOI: 10.1109/cvpr46437.2021.01105.
  19. Toshev A, Szegedy C. DeepPose: Human pose estimation via deep neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014: 1653-1660. DOI: 10.1109/cvpr.2014.214.
  20. Luvizon DC, Tabia H, Picard D. Human pose regression by combining indirect part detection and contextual information. Computers & Graphics, 2019, 85: 15-22. DOI: 10.1016/j.cag.2019.09.002.
  21. Li K, Wang S, Zhang X, et al. Pose recognition with cascade transformers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 1944-1953. DOI: 10.1109/cvpr46437.2021.00198.
  22. Li J, Bian S, Zeng A, et al. Human pose regression with residual log-likelihood estimation. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 11025-11034. DOI: 10.1109/iccv48922.2021.01084.
  23. Wei SE, Ramakrishna V, Kanade T, et al. Convolutional pose machines. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 4724-4732. DOI: 10.1109/cvpr.2016.511.
  24. Chou CJ, Chien JT, Chen HT. Self adversarial training for human pose estimation. 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2018: 17-30. DOI: 10.23919/apsipa.2018.8659538.
  25. Xiao B, Wu H, Wei Y. Simple baselines for human pose estimation and tracking. Proceedings of the European Conference on Computer Vision (ECCV), 2018: 466-481.
  26. Cai Y, Wang Z, Luo Z, et al. Learning delicate local representations for multi-person pose estimation. European Conference on Computer Vision. Cham: Springer International Publishing, 2020: 455-472.
  27. Li Y, Zhang S, Wang Z, et al. TokenPose: Learning keypoint tokens for human pose estimation. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 11313-11322. DOI: 10.1109/iccv48922.2021.01112.
  28. Ma H, Wang Z, Chen Y, et al. PPT: Token-pruned pose transformer for monocular and multi-view human pose estimation. European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 424-442.
  29. Shi D, Wei X, Li L, et al. End-to-end multi-person pose estimation with transformers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 11069-11078. DOI: 10.1109/cvpr52688.2022.01079.
  30. Pishchulin L, Insafutdinov E, Tang S, et al. DeepCut: Joint subset partition and labeling for multi person pose estimation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 4929-4937. DOI: 10.1109/cvpr.2016.533.
  31. Cao Z, Simon T, Wei SE, et al. Realtime multi-person 2D pose estimation using part affinity fields. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 7291-7299. DOI: 10.1109/cvpr.2017.143.
  32. Cheng B, Xiao B, Wang J, et al. HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 5386-5395. DOI: 10.1109/cvpr42600.2020.00543.
  33. Wang A, Chen H, Liu L, et al. YOLOv10: Real-time end-to-end object detection. Advances in Neural Information Processing Systems, 2024, 37: 107984-108011.
  34. Artacho B, Savakis A. BAPose: Bottom-up pose estimation with disentangled waterfall representations. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023: 528-537. DOI: 10.1109/wacvw58289.2023.00059.
  35. Chen S, Ge C, Tong Z, et al. AdaptFormer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems, 2022, 35: 16664-16678.

151, Molodogvardeiskaya str., Samara, 443001, Russia; E-mail: journal@computeroptics.ru; Tel: +7 (846) 242-41-24 (Executive secretary), +7 (846) 332-56-22 (Issuing editor), Fax: +7 (846) 332-56-20