ReTrackVLM: Transformer-Enhanced Multi-Object Tracking with Cross-Modal Embeddings and Zero-Shot Re-Identification Integration


BAYRAKTAR E.

Applied Sciences (Switzerland), vol. 15, no. 4, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 15, Issue: 4
  • Publication Date: 2025
  • DOI: 10.3390/app15041907
  • Journal Name: Applied Sciences (Switzerland)
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Aerospace Database, Agricultural & Environmental Science Database, Applied Science & Technology Source, Communication Abstracts, INSPEC, Metadex, Directory of Open Access Journals, Civil Engineering Abstracts
  • Keywords: multimodal embeddings, transformer-based multi-object tracking, visual language models, visual object detection, zero-shot re-identification
  • Affiliated with Yıldız Teknik Üniversitesi: Yes

Abstract

Multi-object tracking (MOT) is an important task in computer vision, particularly in complex, dynamic environments with crowded scenes and frequent occlusions. Traditional tracking methods often suffer from identity switches (IDSws) and fragmented tracks (FMs), which limit their ability to maintain consistent object trajectories. In this paper, we present a novel framework, called ReTrackVLM, that integrates multimodal embeddings from a visual language model (VLM) with a zero-shot re-identification (ReID) module to enhance tracking accuracy and robustness. ReTrackVLM leverages the rich semantic information from VLMs to distinguish objects more effectively, even under challenging conditions, while the zero-shot ReID mechanism enables robust identity matching without additional training. The system also includes a motion prediction module, powered by Kalman filtering, to handle object occlusions and abrupt movements. We evaluated ReTrackVLM on several widely used MOT benchmarks, including MOT15, MOT16, MOT17, MOT20, and DanceTrack. Our approach achieves state-of-the-art results, with an improvement of 1.5% in MOTA and a reduction of 10.3% in IDSws compared to existing methods. ReTrackVLM also excels in tracking precision, achieving 91.7% precision on MOT17. However, in extremely dense scenes, the framework faces challenges, with slight increases in IDSws. Despite the computational overhead of using VLMs, ReTrackVLM demonstrates the ability to track objects effectively in diverse scenarios.
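The zero-shot ReID idea described above — matching detections to existing tracks by comparing embedding vectors, with no ReID-specific training — can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: it assumes cosine similarity over pre-computed VLM embeddings and a greedy assignment with a hypothetical threshold, whereas the actual framework may use a different similarity measure or an optimal (e.g., Hungarian) assignment.

```python
import numpy as np

def cosine_similarity_matrix(track_embs: np.ndarray, det_embs: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between track and detection embeddings."""
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    return t @ d.T

def greedy_zero_shot_reid(track_embs, det_embs, threshold=0.6):
    """Greedily pair each detection with its most similar track.

    Returns a list of (track_idx, det_idx) matches; pairs whose best
    similarity falls below `threshold` are left unmatched (e.g., to
    spawn new tracks). `threshold` is an illustrative value, not one
    taken from the paper.
    """
    sim = cosine_similarity_matrix(track_embs, det_embs)
    matches = []
    while sim.size and sim.max() > threshold:
        ti, di = np.unravel_index(np.argmax(sim), sim.shape)
        matches.append((int(ti), int(di)))
        sim[ti, :] = -1.0  # this track is taken
        sim[:, di] = -1.0  # this detection is taken
    return matches

# Two tracks with near-orthogonal embeddings; detections arrive swapped.
tracks = np.array([[1.0, 0.0], [0.0, 1.0]])
dets = np.array([[0.1, 0.9], [0.9, 0.1]])
print(greedy_zero_shot_reid(tracks, dets))  # → [(0, 1), (1, 0)]
```

Because the embeddings come from a model trained for general vision-language alignment rather than for ReID, no fine-tuning is needed — which is what makes the matching "zero-shot".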