The long orbit to benchmarking long video understanding
Pipeline Long video datasets are challenging to build because of the significant manual effort required to select, watch, understand and annotate long videos with free-form natural language. Answering challenging questions about longer videos is often a multimodal task that may involve listening to the audio track in addition to watching the video. It may also […]