Preset method

InvPT

Introduction

InvPT jointly models multiple visual tasks within a unified framework. It consists of three core components: a task-shared InvPT Transformer encoder, task-specific preliminary decoders, and the InvPT Transformer decoder. The encoder learns general visual representations from the input images of all tasks. The preliminary decoders then generate task-specific features and preliminary predictions, which are supervised by the ground-truth labels. The features and preliminary predictions of each task are combined and fed as a sequence into the InvPT Transformer decoder, which adopts an inverted pyramid structure: it learns cross-task feature interactions while gradually increasing the spatial resolution of the feature maps, and fuses the multi-scale features extracted by the encoder to produce refined task-specific representations and the final predictions.
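The data flow above can be sketched with plain NumPy shape arithmetic. This is an illustrative sketch under assumed names and shapes (`shared_encoder`, `preliminary_decoder`, `invpt_decoder`, the channel widths, and the example task set are all hypothetical), not the authors' implementation; it only shows how task features are concatenated into one sequence and how the inverted-pyramid decoder doubles resolution per stage while fusing encoder features.

```python
import numpy as np

# Hedged sketch of InvPT's data flow; names, shapes, and channel
# widths are illustrative assumptions, not the paper's exact design.

def shared_encoder(image):
    # Task-shared encoder: multi-scale features at 1/4, 1/8, 1/16 resolution.
    h, w = image.shape[:2]
    return [np.zeros((h // s, w // s, 64)) for s in (4, 8, 16)]

def preliminary_decoder(feat, num_classes):
    # Task-specific preliminary decoder: task features plus a coarse
    # prediction, supervised by ground-truth labels during training.
    pred = np.zeros(feat.shape[:2] + (num_classes,))
    return feat, pred

def invpt_decoder(task_feats, multi_scale):
    # Inverted-pyramid decoder: combine every task's features and
    # preliminary prediction into one joint representation, then refine
    # while doubling spatial resolution per stage, fusing the encoder's
    # multi-scale features along the way.
    x = np.concatenate([np.concatenate([f, p], axis=-1)
                        for f, p in task_feats], axis=-1)
    for skip in multi_scale[-2::-1]:               # 1/8-, then 1/4-scale features
        x = x.repeat(2, axis=0).repeat(2, axis=1)  # double spatial resolution
        x = np.concatenate([x, skip], axis=-1)     # fuse encoder features
    return x

image = np.zeros((64, 64, 3))
scales = shared_encoder(image)
# Two example dense tasks: 21-class segmentation and 1-channel depth.
tasks = [preliminary_decoder(scales[-1], c) for c in (21, 1)]
out = invpt_decoder(tasks, scales)
print(out.shape)  # spatial size grows from 1/16 back toward input resolution
```

Starting from the 1/16-scale features (4x4 for a 64x64 input), the two decoder stages upsample to 16x16 while the channel dimension grows with each fused skip connection.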

Citation

@inproceedings{ye2022inverted,
  title={Inverted pyramid multi-task transformer for dense scene understanding},
  author={Ye, Hanrong and Xu, Dan},
  booktitle={European Conference on Computer Vision},
  pages={514--530},
  year={2022},
  organization={Springer}
}