We introduce ARC-Hunyuan-Video-7B, a powerful multimodal model designed for understanding real-world short videos.
Understanding user-generated videos is challenging due to their complex visual elements, high information density across both visuals and audio, and fast pacing that centers on emotional expression and viewpoint delivery. To address this challenge, ARC-Hunyuan-Video-7B processes visual, audio, and textual signals end-to-end, integrating and reasoning over multimodal cues to produce a deep, structured understanding of the video.
Stress tests show an inference time of just 10 seconds for a one-minute video on an H20 GPU, producing an average of 500 output tokens, with inference accelerated by the vLLM framework.
Compared to prior art, we introduce a new paradigm of Structured Video Comprehension, with capabilities including:
- **Deep Understanding of Real-World Short Videos:** ARC-Hunyuan-Video-7B excels at analyzing user-generated content from platforms like WeChat Channels and TikTok. It goes beyond surface-level descriptions to grasp the creator’s intent, emotional expression, and core message by processing complex visual elements, dense audio cues, and rapid pacing.
- **Synchronized Audio-Visual Reasoning:** The synchronization of raw visual and audio signals allows our model to answer complex questions that cannot be solved with only one modality, such as understanding the humor in a skit or the details in a product review.
- **Precise Temporal Awareness:** ARC-Hunyuan-Video-7B knows not just what happens, but when it happens. It supports multi-granularity timestamped captioning, temporal video grounding, and detailed event summarization, making it well suited for applications like video search, highlight generation, and content analysis.
- **Advanced Reasoning and Application Versatility:** Leveraging a comprehensive multi-stage training regimen that includes Reinforcement Learning (RL), ARC-Hunyuan-Video-7B demonstrates strong reasoning capabilities. It supports zero-shot application as well as few-shot fine-tuning for diverse downstream tasks like video tagging, recommendation, and retrieval.
We are open-sourcing ARC-Hunyuan-Video-7B, including the model checkpoint, inference code, and API. The released version supports Chinese and English videos and particularly excels at Chinese. We hope that ARC-Hunyuan-Video will contribute to advancing the field of structured video comprehension and inspire new developments in the understanding of real-world short videos.
## 1. Joint Audio-Visual Reasoning for Complex Understanding
ARC-Hunyuan-Video-7B’s strength lies in its ability to integrate information from both the visual and audio streams. This is critical for real-world videos, where meaning is often conveyed through a combination of spoken words, sound effects, and on-screen text (which may be absent altogether). Although many short videos feature subtitles, such visual text can easily be missed or only partially captured at low frame sampling rates, which makes processing the complete audio stream essential for reliable comprehension of the spoken content. This joint audio-visual reasoning unlocks a deeper level of understanding that video-only models cannot achieve.
**Question:** Describe the video content.

**Thinking:**
The video, titled 'POV: Parent Logic', presents a comedic skit about a parent's perception of their child's activities. A young man, playing both the child and the parent, begins by explaining his 'logic' (00:00:00 - 00:00:08). He states that when he leaves and returns, seeing his child on the phone, he will assume the child has been playing all the time, which will make him very angry. The scene then shifts to illustrate a different scenario (00:00:08 - 00:00:16). The parent leaves and returns to find the child doing homework. In this case, the parent's logic is that the child only started homework the moment they came back, causing the parent to be angry. Next, the scenario is reversed (00:00:16 - 00:00:24). The parent leaves and then returns to find the child playing on the phone. The parent's assumption is that the child started playing the instant they left, leading to anger. In the final scenario, the parent leaves and returns to find the child doing homework (00:00:24 - 00:00:35). The parent's logic is that the child stopped doing homework the moment they left and only started when they returned, again causing the parent to be angry.
**Answer:**
This video is a humorous and relatable skit that satirizes the 'parent logic' phenomenon. The creator portrays a parent who, upon returning from leaving the house, experiences a series of negative thoughts based on what he sees. The core of the comedy lies in the parent's tendency to assume the worst in his child's activities. For instance, if the child is on the phone when he leaves and is doing homework when he returns, the parent assumes the child has only begun working the instant he leaves. This flawed logic is exaggerated in the final scene, where the parent assumes the child has stopped doing homework the moment he departs and only started when he comes back, leading to a comically intense reaction. The video effectively uses this exaggerated scenario to create a funny and shareable piece of content about the unique perspective and often illogical assumptions parents make about their children's behavior.
**Question:** List the detailed purchase advice for the iPhone 16 in bullet points.

**Thinking:**
The user wants a comprehensive list of purchase advice. I need to identify all the models mentioned and their specific features and price points to create a complete guide. The video presents three main models with different price points and features. I will list them in the order they are introduced: the budget-friendly 16e, the standard 16, and the high-end 16 Pro. For each model, I will state its price range, its main selling point (e.g., battery, display, camera), and any key features (e.g., chip, lighting). The 16e is for $600, the 16 is for $800, and the 16 Pro is for $1000 and up. I will format this information into a clear, ordered list.
**Answer:**
The video provides the following detailed purchase advice for the iPhone 16 models:
1. **iPhone 16e (00:00:02 - 00:00:12):** This is the cheapest option, priced at $600. Its main selling point is a long-lasting battery and a reliable chip set designed for users who are not tech-savvy. It features a 6.1-inch display.
2. **iPhone 16 (00:00:13 - 00:00:34):** This model starts at $800. Its key upgrades include a brighter display with a dynamic island, a more powerful camera system with an ultra-wide and macro capability, and a smaller enhancement like the MagSafe charging ring. It comes in two sizes: a 6.1-inch model and a 6.7-inch model.
3. **iPhone 16 Pro (00:00:35 - 00:00:55):** These are the most advanced models, starting at $1000. They feature 120Hz displays, a significantly improved camera system with a telephoto lens, LiDAR for 3D mapping, better speakers and microphones, a brighter flashlight, and a faster A18 Pro chip. They are available in two sizes: a 6.3-inch model and a 6.9-inch Pro Max model.
**Question:** Why is the man dissatisfied with the pork cutlet he cooked himself at home?

**Thinking:**
The video first shows the man cooking a pork cutlet at home. He batters it, fries it, and cuts it open to reveal the inside. At 00:00:09, he holds up a slice and says, 'So thin,' indicating his dissatisfaction with the thickness of the cutlet. This sets up a comparison with the next scene.
**Answer:**
The man is dissatisfied with the pork cutlet because, after cutting it open, he finds that the inside is 'so thin,' indicating the cutlet lacks the desired thickness for the dish he intended to make.
## 2. Fine-Grained Temporal Understanding and Summarization
A core strength of ARC-Hunyuan-Video-7B is its exceptional ability to understand the chronological flow of events. The model can break a video down into a sequence of coherent, time-stamped events. By leveraging its explicit temporal awareness, achieved through our timestamp overlay mechanism, together with joint visual-audio reasoning, the model can also localize events with remarkable accuracy. It does not just see what happens; it understands precisely when it happens, a cornerstone of structured video comprehension. This capability is pivotal for applications like video highlight generation and structured data extraction.
**Question:** How many scenes appear in the video? Describe each scene in detail with the time range.

**Thinking:**
The video is a compilation of different morning routines. I need to identify the distinct scenes and their corresponding timestamps. The first scene, labeled 'Italy', shows a man being served a cappuccino and croissant from a waiter (00:00:00 - 00:00:11). The second scene, labeled 'UK', features a man pouring tea from a teapot into a cup and then drinking it (00:00:11 - 00:00:20). The third scene, labeled 'Switzerland', shows a man in a shirt dancing energetically on the street in front of a restaurant (00:00:20 - 00:00:23). The final scene, labeled 'USA', depicts a man with a paper as a bib, eating pancakes while wearing sunglasses (00:00:23 - 00:00:31).
**Answer:**
1. **Italy (00:00:00 - 00:00:11):** A man in a black polo shirt is seated at an outdoor cafe. A waiter brings him a cappuccino and a croissant on a plate. He takes a bite of the croissant, then picks up his cup and takes a sip of the coffee.
2. **UK (00:00:11 - 00:00:20):** A man in a brown shirt is at a cafe. He pours hot tea from a white teapot into a cup, smiles, and then lifts the cup to take a drink.
3. **Switzerland (00:00:20 - 00:00:23):** The scene opens with a view of a pink and grey building. A man in a shirt and dark pants suddenly runs and dances energetically on the street in front of the restaurant window.
4. **USA (00:00:23 - 00:00:31):** A man in a grey shirt and sunglasses is seated at a cafe table with a paper draped over his shoulder like a bib. He smiles while looking at the camera.
**Question:** When can we observe the woman in the video commenting on the food she made?

**Thinking:**
The user wants to find the moment the woman tastes the food and comments on the result. I need to locate the part of the video where she eats the treat and gives her opinion. At 00:00:27, she takes a bite of the frozen treat. She then comments on its texture and taste, saying 'As always, mine isn't pretty... that's really soft and so delicious'. This comment segment ends at 00:00:33 when she gives her final score. Therefore, the relevant time range is from when she starts eating and commenting until she finishes her sentence.
**Answer:**
00:00:27 - 00:00:34
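Since temporal grounding answers like the one above come back as plain `HH:MM:SS - HH:MM:SS` text, downstream uses such as highlight clipping need them in seconds. Below is a minimal parsing helper we sketch purely for illustration; it assumes the range format shown in these examples and is not part of the released inference code.

```python
import re

# Matches "HH:MM:SS - HH:MM:SS" ranges as they appear in the answers above.
RANGE_PATTERN = re.compile(r"(\d{2}):(\d{2}):(\d{2})\s*-\s*(\d{2}):(\d{2}):(\d{2})")

def parse_time_ranges(answer: str) -> list[tuple[int, int]]:
    """Convert 'HH:MM:SS - HH:MM:SS' spans in a model answer to (start_s, end_s) pairs."""
    ranges = []
    for h1, m1, s1, h2, m2, s2 in RANGE_PATTERN.findall(answer):
        start = int(h1) * 3600 + int(m1) * 60 + int(s1)
        end = int(h2) * 3600 + int(m2) * 60 + int(s2)
        ranges.append((start, end))
    return ranges

print(parse_time_ranges("00:00:27 - 00:00:34"))  # [(27, 34)]
```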
Beyond just describing what is happening, ARC-Hunyuan-Video-7B demonstrates a remarkable ability for thematic reasoning, which is vital for understanding content focused on emotional expression and viewpoint delivery. It can identify the creator’s intent, analyze the emotional tone, and even comment on creative techniques like narrative structure or symbolism, showing a human-like grasp of the content’s underlying message.
Although ARC-Hunyuan-Video-7B is primarily designed for understanding short-form videos (typically under five minutes), it also demonstrates exceptional capabilities in the structural analysis of long videos. This is achieved by segmenting the long video, performing inference on each segment individually, and then using a Large Language Model (LLM) to integrate the results. Benefiting from parallel vLLM inference, the end-to-end pipeline for processing a 40-minute video takes less than 3 minutes. This efficiency highlights its significant potential for practical, real-world applications.
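As a rough illustration of this long-video pipeline (a sketch, not the authors' exact implementation), the code below splits a video into 5-minute segments, runs segment-level inference in parallel, and hands the per-segment outputs to an LLM for integration. Here `describe_segment` and `integrate_with_llm` are hypothetical placeholders for the ARC-Hunyuan-Video-7B call and the aggregating LLM.

```python
from concurrent.futures import ThreadPoolExecutor

SEGMENT_SECONDS = 300  # 5-minute segments, as described above


def split_into_segments(duration_s: float) -> list[tuple[float, float]]:
    """Return (start, end) boundaries that cover the whole video in 5-minute chunks."""
    bounds, start = [], 0.0
    while start < duration_s:
        end = min(start + SEGMENT_SECONDS, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


def describe_segment(video_path: str, start: float, end: float) -> str:
    """Hypothetical placeholder: run ARC-Hunyuan-Video-7B (e.g. served via vLLM) on one clip."""
    raise NotImplementedError


def integrate_with_llm(segment_reports: list[str]) -> str:
    """Hypothetical placeholder: ask a text LLM to merge per-segment descriptions into one summary."""
    raise NotImplementedError


def summarize_long_video(video_path: str, duration_s: float) -> str:
    segments = split_into_segments(duration_s)
    # Segment-level inference runs in parallel; vLLM batching plays a similar role server-side.
    with ThreadPoolExecutor(max_workers=8) as pool:
        reports = list(pool.map(lambda b: describe_segment(video_path, *b), segments))
    return integrate_with_llm(reports)
```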
Specifically, ARC-Hunyuan-Video-7B is built on top of the Hunyuan-7B vision-language model with the following key designs to meet the requirements of effective structured video comprehension:
- **Fine-Grained Visual-Audio Synchronization:** We adopt an extra audio encoder and develop a fine-grained visual-audio synchronization mechanism, which fuses visual and audio tokens corresponding to the same time interval to obtain temporally aligned visual-audio inputs (see the fusion sketch after this list).
- **Explicit Temporal Awareness via Timestamp Overlay:** In a simple yet highly effective design choice, we overlay the corresponding timestamp (in HH:MM:SS format) directly onto each video frame before it is processed by the vision encoder, giving the model an explicit, direct signal for temporal localization (see the overlay sketch after this list).
- **Automated Bootstrapped Annotation Pipeline:** We collect millions of real-world short videos and develop a fully automated, bootstrapped annotation pipeline in which the model’s own outputs are used to refine the annotations.
- **Comprehensive Training Regimen:** We adopt a comprehensive training regimen based on the finding that grounding the model in objective tasks with Reinforcement Learning is key to unlocking high-quality, subjective understanding.
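The fine-grained visual-audio synchronization can be pictured as grouping visual and audio tokens into shared time intervals and fusing the groups interval by interval. The PyTorch snippet below is a simplified interpretation under that assumption; the actual fusion operator and token layout in ARC-Hunyuan-Video-7B may differ.

```python
import torch


def fuse_by_time_interval(
    visual_tokens: torch.Tensor,  # (num_intervals, v_tokens_per_interval, dim)
    audio_tokens: torch.Tensor,   # (num_intervals, a_tokens_per_interval, dim)
) -> torch.Tensor:
    """Concatenate visual and audio tokens that fall in the same time interval,
    yielding one temporally aligned multimodal token sequence."""
    assert visual_tokens.shape[0] == audio_tokens.shape[0], "interval counts must match"
    fused = torch.cat([visual_tokens, audio_tokens], dim=1)  # fuse per interval
    # Flatten intervals back into a single sequence: (num_intervals * tokens, dim)
    return fused.reshape(-1, fused.shape[-1])
```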
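To make the timestamp overlay concrete, here is a minimal sketch using Pillow; the font, position, and color are our own assumptions rather than the model's exact rendering settings.

```python
from PIL import Image, ImageDraw


def overlay_timestamp(frame: Image.Image, t_seconds: float) -> Image.Image:
    """Burn an HH:MM:SS timestamp into a frame before it enters the vision encoder."""
    hours, rem = divmod(int(t_seconds), 3600)
    minutes, seconds = divmod(rem, 60)
    stamp = f"{hours:02d}:{minutes:02d}:{seconds:02d}"
    stamped = frame.copy()
    draw = ImageDraw.Draw(stamped)
    # Assumed placement: top-left corner, white text, default font.
    draw.text((10, 10), stamp, fill=(255, 255, 255))
    return stamped
```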
We release two versions: the V0 version, which only supports video description and summarization in Chinese; and the version consistent with the released model checkpoint and the one described in the paper, which is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning (it supports Chinese and English videos and particularly excels at Chinese). For videos longer than 5 minutes, we only support structured descriptions: we process these videos in 5-minute segments and use an LLM to integrate the inference results.
If you only need to understand and summarize short Chinese videos, we recommend using ARC-Hunyuan-Video-7B-V0.
Due to video file size limitations imposed by the deployment API, we compressed input video resolutions for our online demo and API services. Consequently, model performance in these interfaces may slightly deviate from the results reported in the paper. To reproduce the original performance, we recommend local inference.
We observe that incorporating generic video datasets during training may inadvertently compromise the model’s capacity for real-world video understanding, potentially due to domain shift or noise introduced by non-real-world samples. To address this limitation, we plan to develop a dedicated model trained exclusively on rigorously curated real-world video data.