The advent of large vision-language models (LVLMs) has greatly accelerated research on multimodal applications, particularly video understanding. Traditional VideoQA benchmarks, while offering quantitative metrics, have significant limitations: they rarely cover the full breadth of video content, such as complex plot developments, subtle emotional expressions, and fine-grained details, and they are inadequate for assessing temporal understanding, failing to measure how well models grasp temporal information in videos.
To overcome these limitations, we introduce MMBench-Video, a new quantitative benchmark designed to evaluate LVLMs' video understanding more rigorously and comprehensively. MMBench-Video accounts for the diversity and complexity of real videos: it incorporates long YouTube videos with rich content and complex structure, better reflecting real-world video scenarios and providing a more challenging and realistic testbed.
Furthermore, MMBench-Video adopts free-form questions, a format closer to real-world usage. This lets us examine how models handle open-ended, practical queries rather than the fixed, patterned question forms of traditional benchmarks.
A core design goal of MMBench-Video is to probe models' temporal reasoning. All questions are manually annotated according to a carefully constructed capability taxonomy; this manual annotation ensures the questions are accurate and relevant, and requires models to analyze and reason over temporal information in videos, enabling a more faithful assessment of temporal understanding.
For evaluation, we employ GPT-4 as an automatic judge. Compared with earlier LLM-based evaluation methods, GPT-4 offers higher accuracy and robustness, yielding more reliable judgments of free-form answers and a clearer picture of LVLMs' strengths and weaknesses in video understanding.
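As an illustration of this style of LLM-based grading, the sketch below scores a model's free-form answer against a reference answer with GPT-4 as the judge. The prompt wording, the 0-3 score scale, and the `judge_answer` helper are assumptions for illustration only, not the exact protocol used by MMBench-Video.

```python
# Minimal sketch of LLM-as-judge scoring for free-form VideoQA answers.
# The prompt and the 0-3 scale are illustrative assumptions, not the
# exact grading protocol of MMBench-Video.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a model's answer to a video question.
Question: {question}
Ground-truth answer: {reference}
Model answer: {prediction}
Rate the model answer from 0 (wrong) to 3 (fully correct and complete).
Reply with a single integer."""


def judge_answer(question: str, reference: str, prediction: str) -> int:
    """Ask GPT-4 to score a free-form answer against the reference."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, prediction=prediction
            ),
        }],
        temperature=0,  # deterministic grading
    )
    return int(response.choices[0].message.content.strip())
```

Using a fixed prompt and zero temperature keeps the judge's scores reproducible across runs, which matters when comparing many models on the same question set.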
Using MMBench-Video, we conduct a comprehensive evaluation covering proprietary and open-source LVLMs, both image-based and video-based. This broad evaluation provides an in-depth view of how different types of LVLMs perform on video understanding and offers a basis for further improving these models.
MMBench-Video is a valuable resource for the research community: it provides a more scientific, accurate, and comprehensive standard for LVLM evaluation and will help advance research on video understanding models. The evaluation code for MMBench-Video will be integrated into VLMEvalKit, making it straightforward for researchers to run the benchmark in their own evaluation work.
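As a rough usage sketch, once the integration lands, an evaluation run could be launched through VLMEvalKit's `run.py` entry point. The dataset and model identifiers below are placeholders, not confirmed names registered in the toolkit.

```python
# Hypothetical invocation of VLMEvalKit after MMBench-Video support is merged.
# "MMBench-Video" and "GPT4o" are placeholder identifiers for illustration.
import subprocess

subprocess.run(
    ["python", "run.py", "--data", "MMBench-Video", "--model", "GPT4o"],
    check=True,  # raise if the evaluation run fails
)
```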