We introduce ARC-Hunyuan-Video-7B, a powerful multimodal model designed for understanding real-world short videos. Understanding user-generated videos is challenging due to their complex visual elements, high information density across visuals and audio, and fast pacing that prioritizes emotional expression and viewpoint delivery. To address this challenge, ARC-Hunyuan-Video-7B processes visual, audio, and textual signals end-to-end, integrating and reasoning over multimodal cues to produce a deep, structured understanding of video. Stress tests show an inference time of just 10 seconds for a one-minute video on an H20 GPU, yielding an average of 500 output tokens, with inference accelerated by the vLLM framework.

Compared to prior art, we introduce a new paradigm of Structured Video Comprehension, with capabilities including:

  • Deep Understanding of Real-World Short Videos: ARC-Hunyuan-Video-7B excels at analyzing user-generated content from platforms like WeChat Channels and TikTok. It goes beyond surface-level descriptions to grasp the creator’s intent, emotional expression, and core message by processing complex visual elements, dense audio cues, and rapid pacing.
  • Synchronized Audio-Visual Reasoning: The synchronization of raw visual and audio signals allows our model to answer complex questions that are impossible to solve with only one modality, such as understanding humor in a skit or details in a product review.
  • Precise Temporal Awareness: ARC-Hunyuan-Video-7B knows not just what happens, but when it happens. It supports multi-granularity timestamped captioning, temporal video grounding, and detailed event summarization, making it perfect for applications like video search, highlight generation, and content analysis.
  • Advanced Reasoning and Application Versatility: Leveraging a comprehensive multi-stage training regimen including Reinforcement Learning (RL), ARC-Hunyuan-Video-7B demonstrates strong reasoning capabilities. It supports zero-shot use or fine-tuning with only a few samples for diverse downstream applications such as video tagging, recommendation, and retrieval.

We are open-sourcing ARC-Hunyuan-Video-7B, including the model checkpoint, inference code, and API. The released version supports Chinese and English videos and particularly excels at Chinese. We hope that ARC-Hunyuan-Video will contribute to advancing the field of structured video comprehension and inspire new developments in the understanding of real-world short videos.

Model Capabilities

1. Joint Audio-Visual Reasoning for Complex Understanding

ARC-Hunyuan-Video-7B’s strength lies in its ability to integrate information from both the visual and audio streams. This is critical for real-world videos, where meaning is often conveyed through a combination of spoken words, sound effects, and on-screen text (which may be absent altogether). We want to point out that although many short videos feature subtitles, these visual texts can be easily missed or only partially captured at low frame-sampling rates. This makes processing the complete audio stream essential for reliable comprehension of the spoken content. Joint audio-visual reasoning unlocks a deeper level of understanding that video-only models cannot achieve.

2. Fine-Grained Temporal Understanding and Summarization

A core strength of ARC-Hunyuan-Video-7B is its exceptional ability to understand the chronological flow of events. The model can break a video down into a sequence of coherent, time-stamped events. By combining explicit temporal awareness, provided by our timestamp overlay mechanism, with joint visual-audio reasoning, the model can also localize events with remarkable accuracy. This demonstrates that our model does not just see what happens, but understands precisely when it happens, a cornerstone of structured video comprehension. Such capability is pivotal for applications like video highlight generation and structured data extraction; a small parsing sketch for downstream extraction is shown below.
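As an illustration of downstream structured data extraction, here is a minimal sketch that turns timestamped caption lines into event records. The line format assumed here (`HH:MM:SS - HH:MM:SS: description`) is an illustrative assumption, not the model's guaranteed output schema.

```python
import re
from dataclasses import dataclass

@dataclass
class Event:
    start_s: int      # start time in seconds
    end_s: int        # end time in seconds
    description: str

# Assumed line format: "HH:MM:SS - HH:MM:SS: description"
LINE_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2})\s*-\s*(\d{2}):(\d{2}):(\d{2}):\s*(.+)")

def parse_events(caption_text: str) -> list[Event]:
    """Parse timestamped caption lines into structured Event records."""
    events = []
    for line in caption_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed format
        h1, m1, s1, h2, m2, s2, desc = m.groups()
        events.append(Event(
            start_s=int(h1) * 3600 + int(m1) * 60 + int(s1),
            end_s=int(h2) * 3600 + int(m2) * 60 + int(s2),
            description=desc,
        ))
    return events

# Example with made-up model output:
sample = ("00:00:03 - 00:00:12: The host introduces the product.\n"
          "00:00:12 - 00:00:40: A cooking demonstration begins.")
for ev in parse_events(sample):
    print(ev)
```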

3. High-Level Thematic and Creative Analysis

Beyond just describing what is happening, ARC-Hunyuan-Video-7B demonstrates a remarkable ability for thematic reasoning, which is vital for understanding content focused on emotional expression and viewpoint delivery. It can identify the creator’s intent, analyze the emotional tone, and even comment on creative techniques like narrative structure or symbolism, showing a human-like grasp of the content’s underlying message.

4. Long Video Structured Understanding

Although ARC-Hunyuan-Video-7B is primarily designed for understanding short-form videos (typically under five minutes), it also demonstrates exceptional capabilities in the structural analysis of long videos. This is achieved by segmenting the long video, performing inference on each segment individually, and then using a Large Language Model (LLM) to integrate the results. Benefiting from parallel vLLM inference, the end-to-end pipeline for processing a 40-minute video takes less than 3 minutes. This efficiency highlights its significant potential for practical, real-world applications.
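As a rough sketch of this segment-then-integrate pipeline: the 5-minute segment length follows the description above, but the per-segment inference call and the integration prompt are placeholders, not part of the release; the released inference code is the authoritative reference.

```python
# Hypothetical sketch of the segment-then-integrate pipeline for long videos.
# run_arc_inference() and run_llm() are placeholders for the actual per-segment
# model call and the integrating LLM call.
import os
import subprocess

SEGMENT_SECONDS = 300  # 5-minute segments, as described above

def split_video(path: str, out_dir: str) -> list[str]:
    """Split a long video into fixed-length segments with ffmpeg."""
    os.makedirs(out_dir, exist_ok=True)
    pattern = os.path.join(out_dir, "segment_%03d.mp4")
    subprocess.run([
        "ffmpeg", "-i", path, "-c", "copy", "-map", "0",
        "-f", "segment", "-segment_time", str(SEGMENT_SECONDS),
        "-reset_timestamps", "1", pattern,
    ], check=True)
    return sorted(os.path.join(out_dir, f) for f in os.listdir(out_dir))

def run_arc_inference(segment_path: str) -> str:
    """Placeholder for ARC-Hunyuan-Video-7B inference on one segment."""
    raise NotImplementedError

def run_llm(prompt: str) -> str:
    """Placeholder for the integrating LLM call."""
    raise NotImplementedError

def summarize_long_video(path: str) -> str:
    segments = split_video(path, out_dir="segments")
    # Per-segment inference; in practice this can be parallelized with vLLM.
    per_segment = [run_arc_inference(p) for p in segments]
    prompt = ("Merge these per-segment descriptions into one structured summary:\n\n"
              + "\n\n".join(per_segment))
    return run_llm(prompt)
```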

Method

Specifically, ARC-Hunyuan-Video-7B is built on top of the Hunyuan-7B vision-language model with the following key designs to meet the requirements of effective structured video comprehension:

  • Fine-Grained Visual-Audio Synchronization: We adopt an extra audio encoder and develop a fine-grained visual-audio synchronization mechanism, which fuses visual and audio tokens corresponding to the same time interval to obtain temporally aligned visual-audio inputs.
  • Explicit Temporal Awareness via Timestamp Overlay: In a simple yet highly effective design choice, we overlay the corresponding timestamp (in HH:MM:SS format) directly onto each video frame before it is processed by the vision encoder. This gives the model an explicit, direct signal for temporal localization (see the preprocessing sketch after this list).
  • Automated Bootstrapped Annotation Pipeline: We collect millions of real-world short videos and develop a fully automated, bootstrapped annotation pipeline, in which the model’s own outputs are used to refine the annotations.
  • Comprehensive Training Regimen: A comprehensive training regimen is adopted based on the finding that grounding the model in objective tasks with Reinforcement Learning is key to unlocking high-quality, subjective understanding.
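To make the timestamp overlay concrete, here is a minimal preprocessing sketch that stamps HH:MM:SS onto sampled frames before they reach the vision encoder. The font, position, and colors are illustrative assumptions, not the exact implementation.

```python
# Minimal sketch of the timestamp-overlay idea: stamp HH:MM:SS onto each frame
# before it is passed to the vision encoder. Font, position, and colors are
# illustrative assumptions.
from PIL import Image, ImageDraw

def overlay_timestamp(frame: Image.Image, t_seconds: float) -> Image.Image:
    """Draw the frame's timestamp (HH:MM:SS) onto the image."""
    h, remainder = divmod(int(t_seconds), 3600)
    m, s = divmod(remainder, 60)
    stamp = f"{h:02d}:{m:02d}:{s:02d}"
    frame = frame.copy()
    draw = ImageDraw.Draw(frame)
    draw.rectangle([0, 0, 95, 20], fill="black")  # background box for legibility
    draw.text((4, 4), stamp, fill="white")        # default PIL font
    return frame

# Example: stamp frames sampled at 1 fps from a 3-second clip.
frames = [Image.new("RGB", (640, 360), "gray") for _ in range(3)]
stamped = [overlay_timestamp(f, t) for t, f in enumerate(frames)]
```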

API Service

We also provide access to the model via API, served with vLLM. For details, please refer to the ARC-Hunyuan-Video-7B and ARC-Hunyuan-Video-7B-V0 API documentation.

We release two API versions. The V0 version supports only video description and summarization in Chinese. The other version is consistent with the released model checkpoint and the one described in the paper: it is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning, and it supports both Chinese and English videos, with particularly strong performance on Chinese. For videos longer than 5 minutes, we only support structured descriptions: such videos are processed in 5-minute segments, and an LLM integrates the per-segment inference results.
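For orientation, a request to the API might look roughly like the sketch below, assuming an OpenAI-compatible chat-completions endpoint served by vLLM. The endpoint URL, model name, and payload schema are all assumptions; please consult the official API documentation for the actual interface.

```python
# Hypothetical request sketch, assuming an OpenAI-compatible endpoint served by
# vLLM. The endpoint URL, model name, and payload schema are assumptions.
import requests

API_URL = "https://example.com/v1/chat/completions"  # placeholder endpoint

payload = {
    "model": "ARC-Hunyuan-Video-7B",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": "https://example.com/demo.mp4"}},
            {"type": "text", "text": "Summarize this video with timestamped events."},
        ],
    }],
    "max_tokens": 512,
}

resp = requests.post(API_URL, json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])
```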

If you only need to understand and summarize short Chinese videos, we recommend using ARC-Hunyuan-Video-7B-V0.

Due to video file size limitations imposed by the deployment API, we compressed input video resolutions for our online demo and API services. Consequently, model performance in these interfaces may slightly deviate from the results reported in the paper. To reproduce the original performance, we recommend local inference.
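For reference, a local offline-inference call via vLLM could look roughly like the sketch below. The model identifier, prompt template, and video/audio preprocessing are assumptions; the released inference code remains the authoritative usage (in particular, audio inputs may need dedicated handling beyond what is shown here).

```python
# Rough sketch of local offline inference via vLLM. The model ID, prompt
# template, and preprocessing are assumptions; follow the released inference
# code for the authoritative usage.
import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(model="TencentARC/ARC-Hunyuan-Video-7B",  # assumed model identifier
          trust_remote_code=True)

# Placeholder: frames sampled from the video as a (num_frames, H, W, 3) array.
video_frames = np.zeros((32, 360, 640, 3), dtype=np.uint8)

outputs = llm.generate(
    {
        "prompt": "Describe this video with timestamped events.",
        "multi_modal_data": {"video": video_frames},
    },
    SamplingParams(temperature=0.2, max_tokens=512),
)
print(outputs[0].outputs[0].text)
```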

Future Work

We observe that incorporating generic video datasets during training may inadvertently compromise the model’s capacity for real-world video understanding, potentially due to domain shift or noise introduced by non-real-world samples. To address this limitation, we plan to develop a dedicated model trained exclusively on rigorously curated real-world video data.