Analysis and Evaluation of VLMs in Multimodal Scene Understanding

Master's thesis


The growing complexity of automated driving demands scalable pipelines that automatically extract relevant traffic scenes from real-world data. This work, in collaboration with Porsche Engineering, explores how state-of-the-art Vision-Language Models (VLMs) can identify and retrieve predefined driving scenarios in large-scale datasets.

State-of-the-art analysis of scene and scenario extraction based on vision‑language and video‑language models.
Implementation of processing strategies that balance detection performance, computational load, and model size for large‑scale datasets.
Ablation studies across different model families (e.g., LLaMA‑Vision, Qwen‑VL, Florence‑2) and network configurations.

PDF announcement