Abstract: Adapting large foundation segmentation models such as SAM2 to video object segmentation (VOS), especially for long sequences, is often limited by the high training cost of full fine-tuning.