Analysis and Evaluation of VLMs in Multimodal Scene Understanding
Master's Thesis
The growing complexity of automated driving demands scalable pipelines to extract relevant traffic scenes from real-world data automatically. This work, in collaboration with Porsche Engineering, explores how state-of-the-art Vision-Language Models can identify and retrieve predefined driving scenarios in large-scale datasets.
- State-of-the-art analysis of scene and scenario extraction based on vision-language and video-language models.
- Implementation of processing strategies that balance detection performance, computational load, and model size for large-scale datasets.
- Ablation studies across different model families (e.g., LLaMA-Vision, Qwen-VL, Florence-2) and network configurations.
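As an illustration of the scenario retrieval described above, the sketch below shows one possible zero-shot approach using a CLIP-style image-text model from Hugging Face transformers. The scenario descriptions, model checkpoint, and retrieval threshold are illustrative assumptions, not part of the thesis specification; models such as LLaMA-Vision, Qwen-VL, or Florence-2 would be queried through their own interfaces.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical scenario catalogue; the actual predefined scenarios are not
# specified in this posting.
SCENARIOS = [
    "a vehicle changing lanes on a highway",
    "a pedestrian crossing the road in front of the ego vehicle",
    "a construction site narrowing the drivable lane",
    "an unremarkable free-flowing traffic scene",
]

# Example checkpoint; any CLIP-style image-text model could be substituted.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def score_frame(image_path: str) -> dict[str, float]:
    """Return a probability per scenario description for a single camera frame."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=SCENARIOS, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_scenarios)
    probs = logits.softmax(dim=-1).squeeze(0)
    return dict(zip(SCENARIOS, probs.tolist()))

# Example: keep frames whose best-matching scenario exceeds a retrieval threshold.
# scores = score_frame("frame_000123.jpg")
# if max(scores.values()) > 0.6:
#     print("candidate scene:", max(scores, key=scores.get))
```

Scoring each frame against a fixed catalogue of textual scenario descriptions keeps such a pipeline scalable, since only one image embedding per frame has to be computed while the text embeddings can be cached.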