Pipeline
Long video datasets are challenging to build because of the significant manual effort required to select, watch, understand, and annotate long videos with free-form natural language. Answering challenging questions about longer videos is often a multimodal task that may involve listening to the audio track in addition to watching the video. It may also be non-linear, because it is sometimes necessary to rewind and rewatch key parts to answer a question. It is also difficult for people to consistently propose a varied set of high-level questions that cannot be trivially solved by observing only a few frames.
To address these challenges, we propose a semi-automatic pipeline that first generates candidate multiple-choice questions using a number of strong vision-language models (VLMs) and large language models (LLMs) with carefully designed prompts, and then lets human annotators filter and correct the proposed questions to reduce errors and bias. To reduce human effort, we leverage automatic tools to (1) find suitable videos, (2) extract useful signals, and (3) automatically generate video-level captions, questions, and answers.
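The three automated stages can be sketched as a simple pipeline. The data layout, function names, and stand-in LLM below are illustrative assumptions, not our actual implementation:

```python
from dataclasses import dataclass, field

# Toy record; the real pipeline operates on video files and model outputs.
@dataclass
class Video:
    video_id: str
    is_static: bool = False
    is_gaming_or_animated: bool = False
    asr_captions: list = field(default_factory=list)
    frame_captions: list = field(default_factory=list)

def find_suitable_videos(candidates):
    """Stage 1: drop mostly static content, gaming and animated videos."""
    return [v for v in candidates
            if not v.is_static and not v.is_gaming_or_animated]

def extract_signals(video):
    """Stage 2: gather ASR and frame captions (stubbed here)."""
    return {"asr": video.asr_captions, "frames": video.frame_captions}

def generate_candidates(signals, llm):
    """Stage 3: ask an LLM for question/answer candidates from the captions."""
    context = " ".join(signals["asr"] + signals["frames"])
    return [llm(context)]

# A stand-in LLM so the sketch runs end to end.
def toy_llm(context):
    return {"question": "What happens in the video?", "answer": context}

videos = [Video("a", asr_captions=["a chef narrates a recipe"],
                frame_captions=["a kitchen counter with eggs"]),
          Video("b", is_gaming_or_animated=True)]
kept = find_suitable_videos(videos)
candidates = [generate_candidates(extract_signals(v), toy_llm) for v in kept]
```

The stages are deliberately decoupled so each can be swapped independently, e.g. replacing the stand-in LLM with a real model API.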
Our pipeline begins with the selection of video content. We filter videos to increase visual and demographic diversity, and we remove videos with mostly static content as well as gaming and animated content. In the next stage, we extract two types of captions from the remaining videos: automatic speech recognition (ASR) captions and frame captions. For the latter, we prompt a VLM to describe video frames sampled at one frame per second. The next step condenses these captions: we segment each video into shots, group the shots by topic, and prompt an LLM to merge the ASR and frame-level captions into shot-level captions.
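A minimal sketch of the frame sampling and shot grouping in this step is shown below. Both helpers are hypothetical: real shot segmentation would use a visual boundary detector, whereas here a new shot simply starts whenever the frame caption changes.

```python
def frame_indices_at_1fps(duration_s, video_fps):
    """Indices of the frames closest to each whole second,
    i.e. sampling at one frame per second."""
    return [round(t * video_fps) for t in range(int(duration_s))]

def group_into_shots(frame_captions):
    """Naive stand-in for shot segmentation: start a new shot
    whenever the per-frame caption changes."""
    shots = []
    for cap in frame_captions:
        if not shots or shots[-1][-1] != cap:
            shots.append([cap])        # caption changed: open a new shot
        else:
            shots[-1].append(cap)      # same caption: extend current shot
    return shots

indices = frame_indices_at_1fps(3, 30)            # a 3 s clip at 30 fps
shots = group_into_shots(["a kitchen", "a kitchen", "a street"])
```

Each resulting shot's captions, together with the time-aligned ASR, would then be passed to the LLM for shot-level summarization.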
Given these captions, the pipeline generates multiple-choice questions in two stages. In the first stage, we prompt an LLM to generate a set of challenging questions and answers, providing the video captions as context. In the second stage, we prompt the LLM with a generated question-answer pair and the video captions, and ask it to generate four decoy answers: incorrect but plausible answers to the question. The final stage of the pipeline is human verification, where human raters filter out or correct erroneous questions, answers, and decoys.
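The two prompting stages can be sketched as follows. The prompt wording and the stub LLM are assumptions made for illustration; only the overall structure (question generation first, then decoy generation conditioned on the question-answer pair) reflects the pipeline described above.

```python
def question_prompt(captions):
    return (f"Here are captions describing a video:\n{captions}\n"
            "Propose one challenging question about the video and its answer.")

def decoy_prompt(question, answer, captions):
    return (f"Captions:\n{captions}\nQuestion: {question}\nAnswer: {answer}\n"
            "Write four incorrect but plausible alternative answers.")

def generate_mcq(captions, llm):
    """Stage 1: question + answer; stage 2: four decoys for that pair."""
    qa = llm(question_prompt(captions))
    decoys = llm(decoy_prompt(qa["question"], qa["answer"], captions))
    return {"question": qa["question"], "answer": qa["answer"],
            "decoys": decoys["decoys"]}

# Stand-in LLM so the sketch runs; a real pipeline would call a model API.
def toy_llm(prompt):
    if "Propose one challenging question" in prompt:
        return {"question": "What dish is being prepared?",
                "answer": "An omelette"}
    return {"decoys": ["Pancakes", "A salad", "Soup", "Toast"]}

mcq = generate_mcq("a chef cracks eggs; a pan heats on the stove", toy_llm)
```

The resulting five-way multiple-choice item (one answer plus four decoys) is what human raters then verify or correct.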