Preset method
InvPT
Introduction
InvPT jointly models multiple visual tasks within a unified framework. It consists of three core components: a task-shared InvPT Transformer encoder, task-specific preliminary decoders, and the InvPT Transformer decoder. The encoder learns general visual representations from the input images, shared across all tasks. The preliminary decoders then generate task-specific features and preliminary predictions, which are supervised by the ground-truth labels. For each task, these features and preliminary predictions are combined and fed as a sequence into the InvPT Transformer decoder. The decoder adopts an inverted pyramid structure: it learns cross-task feature interactions while gradually increasing the resolution of the feature maps, and it fuses the multi-scale features extracted by the encoder to produce refined task-specific representations and the final predictions.
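The flow above can be sketched at the shape level. This is a minimal illustration with hypothetical helper names and toy scales (1/16, 1/8, 1/4 of the input resolution); the real InvPT uses transformer blocks, while this only tracks how the inverted pyramid grows the feature maps stage by stage:

```python
# Shape-level sketch of the InvPT pipeline (hypothetical names; only
# feature-map sizes are modeled, not the actual transformer layers).

def invpt_shapes(img_hw=(448, 448), tasks=("semseg", "depth", "normal")):
    H, W = img_hw
    # Task-shared encoder: coarse features at 1/16 input resolution,
    # plus multi-scale features at 1/8 and 1/4 for the decoder to fuse.
    enc = {16: (H // 16, W // 16), 8: (H // 8, W // 8), 4: (H // 4, W // 4)}

    # Task-specific preliminary decoders: features + preliminary
    # predictions per task, all at the coarse 1/16 scale.
    prelim = {t: enc[16] for t in tasks}

    # InvPT Transformer decoder: inverted pyramid -- each stage doubles
    # the spatial resolution and fuses the matching encoder scale.
    stages = [(s, enc[s]) for s in (16, 8, 4)]
    return prelim, stages

prelim, stages = invpt_shapes()
```

With a 448x448 input, the decoder stages run at 28x28, 56x56, and 112x112, so the final task-specific representations are 16x denser than the coarse features the preliminary decoders produce.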
Citation
@inproceedings{ye2022inverted,
title={Inverted pyramid multi-task transformer for dense scene understanding},
author={Ye, Hanrong and Xu, Dan},
booktitle={European Conference on Computer Vision},
pages={514--530},
year={2022},
organization={Springer}
}
MoGE
Introduction
MoGE is an enhanced multi-task dense prediction method built on the MLoRE (Mixture of Low-Rank Experts) framework. MLoRE adopts a decoder-focused architecture that adds a task-sharing generic convolutional pathway to the standard Mixture-of-Experts (MoE) structure to explicitly model global inter-task relationships. Its expert networks use low-rank convolutions, which sharply reduce parameter count and computational cost and make it practical to scale up the number of experts, thereby increasing representational capacity. On this foundation, MoGE introduces a group-sparsity regularization that enforces structured grouping of experts: related tasks are encouraged to share a common subset of experts, while each task can additionally activate its own specialized experts according to its unique characteristics. This strengthens both inter-task collaboration and task-specific modeling.
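Two ingredients above lend themselves to a small sketch: the parameter savings of a low-rank convolution, and the idea of a per-task group mask over experts. The helper names and group assignments below are hypothetical illustrations, not the papers' actual code; the low-rank factorization shown (a k×k conv into `rank` channels followed by a 1×1 conv back out) is one common way such experts are built:

```python
# Parameter-count arithmetic for an MLoRE-style low-rank conv expert,
# plus a hypothetical MoGE-style group mask over experts.

def lowrank_params(c_in, c_out, rank, k=3):
    # Full k*k conv vs. its low-rank factorization:
    # (k*k conv, c_in -> rank) followed by (1x1 conv, rank -> c_out).
    full = c_in * c_out * k * k
    low = c_in * rank * k * k + rank * c_out
    return full, low

def group_mask(n_experts, task_groups, task):
    # Group-sparsity idea (illustrative): a task activates the experts
    # assigned to its group -- some shared with related tasks, some its
    # own -- and leaves the rest inactive.
    return [1 if e in task_groups[task] else 0 for e in range(n_experts)]

full, low = lowrank_params(256, 256, rank=8)
# Expert 0 is shared by both tasks; the others are task-specific.
mask = group_mask(6, {"semseg": {0, 1, 2}, "depth": {0, 3, 4}}, "depth")
```

At 256 channels and rank 8, the low-rank expert needs roughly 3% of the weights of a full 3×3 convolution, which is what makes adding many experts affordable.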
Citation
@inproceedings{kang2025mixture,
title={Mixture of Group Experts for Multi-task Dense Prediction},
author={Kang, Lei and Li, Jia and Huang, Hua},
booktitle={Chinese Conference on Pattern Recognition and Computer Vision (PRCV)},
year={2025},
organization={Springer}
}
@inproceedings{yang2024multi,
title={Multi-task dense prediction via mixture of low-rank experts},
author={Yang, Yuqi and Jiang, Peng-Tao and Hou, Qibin and Zhang, Hao and Chen, Jinwei and Li, Bo},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={27927--27937},
year={2024}
}