Hierarchical Vision Transformer with Channel Attention for RGB-D Image Segmentation

Yali Yang, Yuanping Xu, Chaolong Zhang, Zhijie Xu, Jian Huang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

7 Citations (Scopus)


Although convolutional neural networks (CNNs) have become the mainstream approach to image processing and have achieved great success over the past decade, the local nature of convolution makes it difficult for them to capture global, long-range semantic information. Moreover, in some scenes a model based purely on RGB images struggles to classify pixels accurately and to segment object edges finely. This study presents a hierarchical vision Transformer model, named Swin-RGB-D, that exploits the depth information in depth images to supplement and enhance ambiguous and obscure features in RGB images. In this design, RGB and depth images are the two inputs of a two-branch network. The upstream branch applies the Swin Transformer, which is capable of learning global continuous information from RGB images for segmentation; the other branch performs channel attention on the depth image to capture the feature correlation and dependency between channels and generates a weight matrix. The weight matrix is then multiplied with the feature maps at each stage of the down-sampling process to extract weighted multi-modal features. The fused maps are then added to the up-sampled feature maps of the corresponding size, which compensates for the feature distortion introduced during sampling. Experimental results on two benchmark datasets show that the proposed model makes the network more sensitive to edge information.
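The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the attention module, so a squeeze-and-excitation style block is assumed here, and the "matrix multiplication" is read as a channel-wise reweighting of the RGB feature map. The function names (`channel_attention`, `fuse_stage`, `upsample2x`, `decode_step`), the reduction ratio, the random weights, and the tensor shapes are all hypothetical, and the Swin Transformer branch itself is replaced by random feature maps.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(depth_feat, w1, w2):
    """Assumed SE-style attention: one weight per channel from a (C, H, W) depth map."""
    squeezed = depth_feat.mean(axis=(1, 2))   # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)   # bottleneck + ReLU   -> (C // r,)
    return sigmoid(w2 @ hidden)               # weights in (0, 1)   -> (C,)


def fuse_stage(rgb_feat, weights):
    """Reweight each RGB channel by its depth-derived attention weight."""
    return rgb_feat * weights[:, None, None]


def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a (C, H, W) map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)


def decode_step(coarse_feat, fused_skip):
    """Add the fused encoder map to the up-sampled decoder map of the same size."""
    return upsample2x(coarse_feat) + fused_skip


# Toy example: one encoder stage with C = 8 channels and a 4 x 4 spatial grid.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 4
rgb = rng.standard_normal((C, H, W))      # stand-in for a Swin stage output
depth = rng.standard_normal((C, H, W))    # stand-in for a depth-branch feature map
w1 = rng.standard_normal((C // r, C))     # hypothetical bottleneck weights
w2 = rng.standard_normal((C, C // r))

weights = channel_attention(depth, w1, w2)
fused = fuse_stage(rgb, weights)
coarse = rng.standard_normal((C, H // 2, W // 2))  # stand-in for a decoder feature map
out = decode_step(coarse, fused)
print(weights.shape, fused.shape, out.shape)
```

In this sketch the decoder map and the fused skip map are assumed to share the same channel count; a real encoder-decoder would typically project channels before the addition.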

Original language: English
Title of host publication: SSPS 2022
Subtitle of host publication: Proceedings of the 4th International Symposium on Signal Processing Systems
Publisher: Association for Computing Machinery (ACM)
Number of pages: 6
ISBN (Electronic): 9781450396103
Publication status: Published - 25 Mar 2022
Event: 4th International Symposium on Signal Processing Systems - Virtual, Online, China
Duration: 25 Mar 2022 to 27 Mar 2022
Conference number: 4

Publication series

Name: ACM International Conference Proceeding Series
Volume: Par F180473


Conference: 4th International Symposium on Signal Processing Systems
Abbreviated title: SSPS 2022
City: Virtual, Online

