Analysis and Evaluation of VLMs in Multimodal Scene Understanding

Master's thesis

The growing complexity of automated driving demands scalable pipelines that automatically extract relevant traffic scenes from real-world data. This work, in collaboration with Porsche Engineering, explores how state-of-the-art Vision-Language Models (VLMs) can identify and retrieve predefined driving scenarios in large-scale datasets.

  • Analysis of the state of the art in scene and scenario extraction based on vision-language and video-language models.
  • Implementation of processing strategies that balance detection performance, computational load, and model size for large-scale datasets.
  • Ablation studies across different model families (e.g., LLaMA-Vision, Qwen-VL, Florence-2) and network configurations.

PDF announcement