We introduce ARC-Hunyuan-Video-7B, a powerful multimodal model designed for
understanding real-world short videos.
Understanding user-generated videos is challenging due to their complex visual elements, high
information density in both visuals and audio, and fast pacing that focuses on emotional expression
and viewpoint delivery.
To address this challenge, ARC-Hunyuan-Video-7B
processes visual, audio, and textual signals end-to-end, integrating and reasoning over multimodal
cues to achieve a deep, structured understanding of video.
Stress tests show an inference time of just 10 seconds for a one-minute video on an H20 GPU,
yielding an average of 500 output tokens, with inference accelerated by the vLLM framework.
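As a rough way to reproduce such latency measurements, the sketch below times a single request against a vLLM OpenAI-compatible server that is assumed to already be serving the checkpoint. The endpoint path and default port are standard vLLM; the served model name, the video URL, and the `video_url` payload shape are assumptions for illustration, so the released inference code and API documentation remain the authoritative reference.

```python
import time
import requests

# Assumes ARC-Hunyuan-Video-7B is already served through vLLM's
# OpenAI-compatible server (e.g. started with `vllm serve <checkpoint>`).
# The "video_url" content part follows the convention vLLM uses for
# video-capable models and is an assumption for this model.
API_URL = "http://localhost:8000/v1/chat/completions"  # vLLM's default port

payload = {
    "model": "ARC-Hunyuan-Video-7B",  # placeholder served-model name
    "messages": [{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
            {"type": "text", "text": "Describe the video content."},
        ],
    }],
    "max_tokens": 1024,
}

start = time.time()
resp = requests.post(API_URL, json=payload, timeout=300).json()
print(f"latency: {time.time() - start:.1f}s")
print(resp["choices"][0]["message"]["content"])
```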
Compared to prior art, we introduce a new paradigm of Structured Video
Comprehension, with capabilities including:
Deep Understanding of Real-World Short Videos: ARC-Hunyuan-Video-7B excels at
analyzing user-generated content from platforms like WeChat Channels and TikTok. It goes beyond
surface-level descriptions to grasp the creator’s intent, emotional expression, and core
message by processing complex visual elements, dense audio cues, and rapid pacing.
Synchronized Audio-Visual Reasoning: The synchronization of raw visual and
audio signals allows our model to answer complex questions that are impossible to solve with
only one modality, such as understanding humor in a skit or details in a product review.
Precise Temporal Awareness: ARC-Hunyuan-Video-7B knows not just what
happens, but when it happens. It supports multi-granularity timestamped captioning,
temporal video grounding, and detailed event summarization, making it perfect for applications
like video search, highlight generation, and content analysis.
Advanced Reasoning and Application Versatility: Leveraging a comprehensive
multi-stage training regimen including Reinforcement Learning (RL), ARC-Hunyuan-Video-7B
demonstrates strong reasoning capabilities. It supports zero-shot use or fine-tuning with a few
samples for diverse downstream applications such as video tagging, recommendation, and retrieval.
We are open-sourcing ARC-Hunyuan-Video-7B, including the model checkpoint, inference code,
and API. The released version supports Chinese and English videos and particularly excels at Chinese.
We hope that ARC-Hunyuan-Video will contribute to advancing the field of structured video
comprehension and inspire new developments in the comprehension of real-world short videos.
1. Joint Audio-Visual Reasoning for Complex Understanding
ARC-Hunyuan-Video-7B’s strength lies in its ability to integrate information from both the
visual and audio streams. This is critical for real-world videos, where meaning is often conveyed
through a combination of spoken words, sound effects, and on-screen text (which may be absent).
Although many short videos feature subtitles, these visual texts can easily be missed or only
partially captured at low frame sampling rates, which makes processing the complete audio stream
essential for reliable comprehension of the spoken content. Joint audio-visual reasoning unlocks a
deeper level of understanding that video-only models cannot achieve.
Question: Describe the video content.
Thinking:
The video, titled 'POV: Parent Logic', presents a comedic skit about a parent's
perception of their child's activities. A young man, playing both the child and the
parent, begins by explaining his 'logic' (00:00:00 - 00:00:08). He states that when
he leaves and returns, seeing his child on the phone, he will assume the child has
been playing all the time, which will make him very angry. The scene then shifts to
illustrate a different scenario (00:00:08 - 00:00:16). The parent leaves and returns
to find the child doing homework. In this case, the parent's logic is that the child
only started homework the moment they came back, causing the parent to be angry.
Next, the scenario is reversed (00:00:16 - 00:00:24). The parent leaves and then
returns to find the child playing on the phone. The parent's assumption is that the
child started playing the instant they left, leading to anger. In the final
scenario, the parent leaves and returns to find the child doing homework (00:00:24 -
00:00:35). The parent's logic is that the child stopped doing homework the moment
they left and only started when they returned, again causing the parent to be angry.
Answer:
This video is a humorous and relatable skit that satirizes the 'parent logic'
phenomenon. The creator portrays a parent who, upon returning from leaving the
house, experiences a series of negative thoughts based on what he sees. The core of
the comedy lies in the parent's tendency to assume the worst in his child's
activities. For instance, if the child is on the phone when he leaves and is doing
homework when he returns, the parent assumes the child has only begun working the
instant he leaves. This flawed logic is exaggerated in the final scene, where the
parent assumes the child has stopped doing homework the moment he departs and only
started when he comes back, leading to a comically intense reaction. The video
effectively uses this exaggerated scenario to create a funny and shareable piece of
content about the unique perspective and often illogical assumptions parents make
about their children's behavior.
Question: List the detailed purchase advice for the iPhone 16 in bullet point.
Thinking:
The user wants a comprehensive list of purchase advice. I need to identify all the
models mentioned and their specific features and price points to create a complete
guide. The video presents three main models with different price points and
features. I will list them in the order they are introduced: the budget-friendly
16e, the standard 16, and the high-end 16 Pro. For each model, I will state its
price range, its main selling point (e.g., battery, display, camera), and any key
features (e.g., chip, lighting). The 16e is for $600, the 16 is for $800, and the 16
Pro is for $1000 and up. I will format this information into a clear, ordered list.
Answer:
The video provides the following detailed purchase advice for the iPhone 16
models:
1. **iPhone 16e (00:00:02 - 00:00:12):** This is the cheapest option, priced at
$600. Its main selling point is a long-lasting battery and a reliable chip set
designed for users who are not tech-savvy. It features a 6.1-inch display.
2. **iPhone 16 (00:00:13 - 00:00:34):** This model starts at $800. Its key upgrades
include a brighter display with a dynamic island, a more powerful camera system with
an ultra-wide and macro capability, and a smaller enhancement like the MagSafe
charging ring. It comes in two sizes: a 6.1-inch model and a 6.7-inch model.
3. **iPhone 16 Pro (00:00:35 - 00:00:55):** These are the most advanced models,
starting at $1000. They feature 120Hz displays, a significantly improved camera
system with a telephoto lens, LiDAR for 3D mapping, better speakers and microphones,
a brighter flashlight, and a faster A18 Pro chip. They are available in two sizes: a
6.3-inch model and a 6.9-inch Pro Max model.
Question: Why is the man dissatisfied with the pork cutlet he cooked himself at home?
Thinking:
The video first shows the man cooking a pork cutlet at home. He batters it, fries
it, and cuts it open to reveal the inside. At 00:00:09, he holds up a slice and
says, 'So thin,' indicating his dissatisfaction with the thickness of the cutlet.
This sets up a comparison with the next scene.
Answer:
The man is dissatisfied with the pork cutlet because, after cutting it open, he
finds that the inside is 'so thin,' indicating the cutlet lacks the desired
thickness for the dish he intended to make.
2. Fine-Grained Temporal Understanding and Summarization
A core strength of ARC-Hunyuan-Video-7B is its exceptional ability to understand the chronological
flow of events. The model can break a video down into a sequence of coherent, time-stamped events.
By leveraging its explicit temporal awareness, provided by our timestamp overlay mechanism, together
with joint visual-audio reasoning, the model can also localize events with remarkable accuracy. This
demonstrates that our model does not just see what happens, but understands precisely when it
happens, a cornerstone of structured video comprehension. Such capability is pivotal for
applications like video highlight generation and structured data extraction.
Question: How many scenes appear in the video? Describe each scene in detail with the time range.
Thinking:
The video is a compilation of different morning routines. I need to identify the
distinct scenes and their corresponding timestamps. The first scene, labeled
'Italy', shows a man being served a cappuccino and croissant from a waiter (00:00:00
- 00:00:11). The second scene, labeled 'UK', features a man pouring tea from a
teapot into a cup and then drinking it (00:00:11 - 00:00:20). The third scene,
labeled 'Switzerland', shows a man in a shirt dancing energetically on the street in
front of a restaurant (00:00:20 - 00:00:23). The final scene, labeled 'USA', depicts
a man with a paper as a bib, eating pancakes while wearing sunglasses (00:00:23 -
00:00:31).
Answer:
1. **Italy (00:00:00 - 00:00:11):** A man in a black polo shirt is seated at an
outdoor cafe. A waiter brings him a cappuccino and a
croissant on a plate. He takes a bite of the croissant, then picks up his cup and
takes a sip of the coffee.
2. **UK (00:00:11 - 00:00:20):** A man in a brown shirt is at a cafe. He pours hot
tea from a white teapot into a cup, smiles, and then
lifts the cup to take a drink.
3. **Switzerland (00:00:20 - 00:00:23):** The scene opens with a view of a pink and
grey building. A man in a shirt and dark pants
suddenly runs and dances energetically on the street in front of the restaurant
window.
4. **USA (00:00:23 - 00:00:31):** A man in a grey shirt and sunglasses is seated at
a cafe table with a paper draped over his shoulder
like a bib. He smiles while looking at the camera.
Question: When can we observe that the woman in the video comment on the food she made?
Thinking:
The user wants to find the moment the woman tastes the food and comments on the
result. I need to locate the part of the video where she eats the treat and gives
her opinion. At 00:00:27, she takes a bite of the frozen treat. She then comments on
its texture and taste, saying 'As always, mine isn't pretty... that's really soft
and so delicious'. This comment segment ends at 00:00:33 when she gives her final
score. Therefore, the relevant time range is from when she starts eating and
commenting until she finishes her sentence.
Answer:
00:00:27 - 00:00:34
Beyond just describing what is happening, ARC-Hunyuan-Video-7B demonstrates a remarkable ability for
thematic reasoning, which is vital for understanding content focused on emotional expression and
viewpoint delivery. It can identify the creator’s intent, analyze the emotional tone, and even
comment on creative techniques like narrative structure or symbolism, showing a human-like grasp of
the content’s underlying message.
Although ARC-Hunyuan-Video-7B is primarily designed for understanding short-form videos (typically
under five minutes), it also demonstrates exceptional capabilities in the structural analysis of
long videos. This is achieved by segmenting the long video, performing inference on each segment
individually, and then using a Large Language Model (LLM) to integrate the results. Benefiting from
parallel vLLM inference, the end-to-end pipeline for processing a 40-minute video takes less than 3
minutes. This efficiency highlights its significant potential for practical, real-world
applications.
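As an illustration of this long-video pipeline (not the released implementation), the sketch below cuts a video into 5-minute chunks with ffmpeg, runs per-segment inference in parallel, and hands the per-segment outputs to a text LLM for integration. The callables `describe_segment` and `integrate_with_llm` are hypothetical placeholders for, respectively, a call to ARC-Hunyuan-Video-7B (e.g. through a vLLM endpoint) and a call to any text LLM.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

SEGMENT_SECONDS = 300  # 5-minute chunks, as described above


def split_video(path: str, duration_s: int) -> list[str]:
    """Cut the video into 5-minute segments with ffmpeg (stream copy, no re-encode)."""
    parts = []
    for i, start in enumerate(range(0, duration_s, SEGMENT_SECONDS)):
        out = f"segment_{i:03d}.mp4"
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-t", str(SEGMENT_SECONDS),
             "-i", path, "-c", "copy", out],
            check=True, capture_output=True)
        parts.append(out)
    return parts


def summarize_long_video(path: str, duration_s: int,
                         describe_segment, integrate_with_llm) -> str:
    """describe_segment(clip_path) -> str and integrate_with_llm(list_of_str) -> str
    are hypothetical callables supplied by the caller."""
    segments = split_video(path, duration_s)
    # Segments are independent, so per-segment inference parallelizes naturally.
    with ThreadPoolExecutor(max_workers=8) as pool:
        descriptions = list(pool.map(describe_segment, segments))
    return integrate_with_llm(descriptions)
```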
Specifically, ARC-Hunyuan-Video-7B is built on top of the Hunyuan-7B vision-language model with the
following key designs to meet the requirements of effective structured video comprehension:
Fine-Grained Visual-Audio Synchronization: We adopt an extra audio encoder and
develop a fine-grained visual-audio synchronization mechanism, which fuses visual and audio
tokens corresponding to the same time interval to obtain temporally aligned visual-audio
inputs (sketched below).
Explicit Temporal Awareness via Timestamp Overlay: In a simple yet highly
effective design choice, we overlay the corresponding timestamp (in HH:MM:SS format) directly
onto each video frame before it is processed by the vision encoder. This gives the model an
explicit, direct signal for temporal localization (also sketched below).
Automated Bootstrapped Annotation Pipeline: We collect millions of real-world
short videos and develop a fully automated, bootstrapped annotation pipeline in which the model’s
own outputs are used to refine the annotations.
Comprehensive Training Regimen: A comprehensive training regimen is adopted
based on the finding that grounding the model in objective
tasks with Reinforcement Learning is key to unlocking high-quality, subjective understanding.
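To make the fine-grained visual-audio synchronization above concrete, here is a minimal sketch of temporally aligned inputs: tokens that belong to the same time interval are packed next to each other in the sequence. The tensor shapes, per-interval token counts, and the simple concatenation are assumptions for illustration; the model's actual fusion operator is described in the paper.

```python
import torch


def fuse_by_time_interval(visual_tokens: torch.Tensor,
                          audio_tokens: torch.Tensor) -> torch.Tensor:
    """Pack visual and audio tokens from the same time interval together.

    visual_tokens: (T, Nv, D) -- Nv visual tokens per sampled interval
    audio_tokens:  (T, Na, D) -- Na audio tokens per interval
    Returns a (T * (Nv + Na), D) sequence in which each interval's visual
    tokens are immediately followed by that interval's audio tokens.
    Illustrative only; not the model's exact fusion operator.
    """
    assert visual_tokens.shape[0] == audio_tokens.shape[0], "interval counts must match"
    fused = torch.cat([visual_tokens, audio_tokens], dim=1)  # (T, Nv + Na, D)
    return fused.flatten(0, 1)                               # (T * (Nv + Na), D)
```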
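The timestamp overlay is just as easy to picture. The sketch below samples frames and burns an HH:MM:SS timestamp into each one with OpenCV before they would be passed to the vision encoder; the 1 fps sampling rate, font, and top-left placement are assumptions, not the model's exact preprocessing.

```python
import cv2


def overlay_timestamps(video_path: str, sample_fps: float = 1.0):
    """Sample frames and draw an HH:MM:SS timestamp onto each one.

    A minimal sketch of the timestamp-overlay idea; the sampling rate,
    font, and placement used by ARC-Hunyuan-Video-7B are assumptions.
    """
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / sample_fps)), 1)

    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            seconds = int(idx / native_fps)
            stamp = f"{seconds // 3600:02d}:{(seconds % 3600) // 60:02d}:{seconds % 60:02d}"
            # The burned-in timestamp gives the vision encoder an explicit
            # temporal signal for every sampled frame.
            cv2.putText(frame, stamp, (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                        1.0, (255, 255, 255), 2, cv2.LINE_AA)
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```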
We release two versions: the V0 version, which only supports video description and summarization
in Chinese, and the version consistent with the released model checkpoint and described in
the paper, which is capable of multi-granularity timestamped video captioning and summarization,
open-ended video question answering, temporal video grounding, and video reasoning (it supports
Chinese and English videos and particularly excels at Chinese). For videos longer than 5 minutes, we
only support structured descriptions: we process these videos in 5-minute segments and use an LLM to
integrate the inference results.
If you only need to understand and summarize short Chinese videos, we recommend using
ARC-Hunyuan-Video-7B-V0.
Due to video file size limitations imposed by the deployment API, we reduce the input video
resolution for our online demo and API services. Consequently, model performance in these
interfaces may deviate slightly from the results reported in the paper. To reproduce the original
performance, we recommend local inference.
We observe that incorporating generic video datasets during training may inadvertently compromise the
model’s capacity for real-world video understanding, potentially due to domain shift or noise
introduced by non-real-world samples. To address this limitation, we plan to develop a dedicated
model trained exclusively on rigorously curated real-world video data.
A huge thank you to a community member for creating this excellent video guide: YouTube Tutorial. It walks you through
the project's features and provides a full tutorial on how to set it up from scratch. The video also
reviews the model's performance, highlighting its precise Q&A and great summarization results. If
you want to see a practical demonstration of how to deploy and use our model, this is a great place
to start.