Applied Sciences (Switzerland), vol. 15, no. 4, 2025 (SCI-Expanded)
Multi-object tracking (MOT) is an important task in computer vision, particularly in complex, dynamic environments with crowded scenes and frequent occlusions. Traditional tracking methods often suffer from identity switches (IDSws) and fragmented tracks (FMs), which limit their ability to maintain consistent object trajectories. In this paper, we present a novel framework, called ReTrackVLM, that integrates multimodal embeddings from a visual language model (VLM) with a zero-shot re-identification (ReID) module to enhance tracking accuracy and robustness. ReTrackVLM leverages the rich semantic information from VLMs to distinguish objects more effectively, even under challenging conditions, while the zero-shot ReID mechanism enables robust identity matching without additional training. The system also includes a motion prediction module, powered by Kalman filtering, to handle object occlusions and abrupt movements. We evaluated ReTrackVLM on several widely used MOT benchmarks, including MOT15, MOT16, MOT17, MOT20, and DanceTrack. Our approach achieves state-of-the-art results, with improvements of 1.5% MOTA and a 10.3% reduction in IDSws compared to existing methods. ReTrackVLM also excels in tracking precision, reaching 91.7% precision on MOT17. In extremely dense scenes, however, the framework still exhibits slight increases in IDSws. Despite the computational overhead of using VLMs, ReTrackVLM demonstrates the ability to track objects effectively in diverse scenarios.
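The Kalman-filter motion prediction mentioned above can be sketched as a standard constant-velocity filter over object positions, as commonly used in MOT pipelines (e.g., SORT-style trackers). This is a minimal illustrative sketch, not the paper's exact configuration: the state layout ([x, y, vx, vy]), the `KalmanTracker` class name, and all noise-matrix values are assumptions chosen for clarity.

```python
import numpy as np

class KalmanTracker:
    """Minimal constant-velocity Kalman filter for 2-D motion prediction.

    Illustrative sketch only: state is [x, y, vx, vy]; noise covariances
    are assumed values, not tuned parameters from the paper.
    """

    def __init__(self, x, y, dt=1.0):
        self.x = np.array([x, y, 0.0, 0.0])           # state: position + velocity
        self.P = np.eye(4) * 10.0                     # state covariance (uncertain start)
        self.F = np.array([[1, 0, dt, 0],             # constant-velocity transition
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],              # we observe position only
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * 0.01                     # process noise (assumed)
        self.R = np.eye(2) * 1.0                      # measurement noise (assumed)

    def predict(self):
        """Propagate the state one step; returns the predicted (x, y)."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        """Correct the state with an observed (x, y) detection."""
        z = np.asarray(z, dtype=float)
        y = z - self.H @ self.x                       # innovation
        S = self.H @ self.P @ self.H.T + self.R       # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

During occlusions, `predict()` can be called without a matching `update()`, letting the track coast along its estimated velocity until the object reappears and ReID re-associates it.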