LiveCC – A real-time video commentary model open-sourced jointly by ByteDance and the National University of Singapore
What is LiveCC?
LiveCC is a real-time video commentary model jointly developed by the Show Lab team at the National University of Singapore and ByteDance. It is extensively trained using automatic speech recognition (ASR) subtitles. Like a professional commentator, LiveCC quickly analyzes video content and synchronously generates natural, fluent audio or text commentary.
LiveCC introduces two key datasets:
- Live-CC-5M for pretraining
- Live-WhisperX-526K for high-quality supervised fine-tuning
The authors also designed the LiveSports-3K benchmark to evaluate the model's real-time video commentary capabilities.
Experiments demonstrate that LiveCC excels in real-time video commentary and video question answering tasks, showing low latency and high-quality generation.
Key Features of LiveCC
- Real-time Video Commentary: Generates continuous, human-like commentary in real time based on video content. Applicable to sports events, news broadcasting, educational videos, and more.
- Video Question Answering: Answers questions related to the video content, helping users better understand events and details within the video.
- Low-latency Processing: Processes video streams with extremely low latency (under 0.5 seconds per frame), enabling real-time applications.
- Multi-scenario Adaptation: Suitable for various types of video content, including sports, news, education, and entertainment.
Technical Principles of LiveCC
- Streaming Training Method: ASR words are densely interleaved with video frames according to their timestamps, training the model to learn time-aligned vision-language relationships that mimic how humans perceive video in real time (see the interleaving sketch after this list).
- Large-scale Datasets: Two datasets were built from YouTube video ASR subtitles: Live-CC-5M for pretraining and Live-WhisperX-526K for high-quality supervised fine-tuning. These datasets provide rich training resources for the model.
- Model Architecture: Based on the Qwen2-VL architecture, LiveCC combines a vision encoder with a language model. It processes video frames and text inputs, predicting text tokens autoregressively while treating video tokens as non-predictive context (see the loss-masking sketch after this list).
- Real-time Inference: LiveCC processes the input video frame by frame during inference, generating commentary in real time. The model caches the previous prompt, visual frames, and generated text to speed up language decoding (see the streaming-loop sketch after this list).
- Evaluation Method: Evaluated on the LiveSports-3K benchmark with an LLM-as-a-judge framework that compares the quality of commentary generated by different models (an illustrative judging prompt is sketched after this list).
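To make the streaming training format concrete, here is a minimal sketch of interleaving ASR words with sampled video frames by timestamp. It is illustrative only: the frame rate, the `<frame>` placeholder, and the data layout are assumptions, not the released preprocessing code.

```python
# Minimal sketch of the streaming interleaving idea (illustrative, not the
# authors' preprocessing pipeline): ASR words and sampled video frames are
# merged into one sequence ordered by timestamp, so each word the model
# learns to predict follows the frames visible at the moment it was spoken.

FRAME_TOKEN = "<frame>"  # placeholder for one encoded video frame (assumed marker)

def interleave(asr_words, frame_times):
    """asr_words: list of (timestamp_sec, word); frame_times: list of seconds."""
    events = [(t, FRAME_TOKEN) for t in frame_times] + list(asr_words)
    # At equal timestamps, frames sort before words, so a word never precedes
    # the frame shown when it was spoken.
    events.sort(key=lambda e: (e[0], e[1] != FRAME_TOKEN))
    return [token for _, token in events]

# Example: frames sampled at 2 FPS with a short ASR snippet
frames = [0.0, 0.5, 1.0, 1.5]
words = [(0.4, "the"), (0.7, "striker"), (1.2, "shoots")]
print(" ".join(interleave(words, frames)))
# -> <frame> the <frame> striker <frame> shoots <frame>
```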
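The statement that video tokens act as non-predictive context corresponds to masking visual positions out of the language-modeling loss. The sketch below assumes a generic causal-LM cross-entropy and a boolean mask over visual positions; it is a sketch of the idea, not LiveCC's training code.

```python
import torch.nn.functional as F

# Sketch of "video tokens as non-predictive context": a standard causal-LM
# cross-entropy in which visual positions contribute no loss. Tensor shapes
# and the mask convention are assumptions made for illustration.
def lm_loss(logits, input_ids, is_video_token):
    """logits: (B, T, V); input_ids: (B, T); is_video_token: (B, T) bool."""
    labels = input_ids.clone()
    labels[is_video_token] = -100                      # ignore visual positions
    shifted_logits = logits[:, :-1].reshape(-1, logits.size(-1))
    shifted_labels = labels[:, 1:].reshape(-1)
    return F.cross_entropy(shifted_logits, shifted_labels, ignore_index=-100)
```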
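The frame-by-frame decoding with cached state can be pictured as the loop below. The helper functions are hypothetical stand-ins, not LiveCC's released API; the point is that the prompt, earlier frames, and earlier commentary stay in the cached context, so each new frame only adds its own tokens before the next decoding step.

```python
# Sketch of frame-by-frame streaming inference with cached context
# (illustrative stand-ins, not LiveCC's API).

def encode_frame(frame):
    # Stand-in for the vision encoder: returns "visual tokens" for one frame.
    return [f"<frame:{frame}>"]

def decode_step(context):
    # Stand-in for a short autoregressive decoding burst; a real model would
    # also return an updated KV cache instead of re-reading the full context.
    return f"commentary with {len(context)} context tokens"

def stream_commentary(frames, prompt_tokens):
    context = list(prompt_tokens)          # prompt is encoded and cached once
    for frame in frames:                   # frames arrive one by one in real time
        context += encode_frame(frame)     # append only the new frame's tokens
        text = decode_step(context)
        context += text.split()            # generated commentary is cached too
        yield text

for line in stream_commentary(frames=[0, 1, 2], prompt_tokens=["<prompt>"]):
    print(line)
```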
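As a rough illustration of LLM-as-a-judge scoring, the snippet below builds a pairwise comparison prompt from a reference transcript and two model outputs. The wording, fields, and judge model are assumptions; the actual LiveSports-3K judging protocol is defined in the paper.

```python
# Illustrative pairwise LLM-as-a-judge prompt; the exact wording and judge
# model used for LiveSports-3K are not reproduced here.
JUDGE_PROMPT = """You are judging two real-time commentaries for the same video clip.

Reference transcript of the clip:
{reference}

Commentary A:
{a}

Commentary B:
{b}

Which commentary is more accurate, fluent, and informative? Answer "A" or "B"."""

def build_judge_request(reference, commentary_a, commentary_b):
    return JUDGE_PROMPT.format(reference=reference, a=commentary_a, b=commentary_b)

print(build_judge_request("... ASR reference ...", "... model A output ...", "... model B output ..."))
```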
Project Links for LiveCC
- 🌐 Official Website: https://showlab.github.io/livecc/
- 🧑💻 GitHub Repository: https://github.com/showlab/livecc
- 🤗 Hugging Face Model Hub: https://huggingface.co/collections/chenjoya/livecc
- 📄 arXiv Technical Paper: https://arxiv.org/pdf/2504.16030
- 🚀 Online Demo: https://huggingface.co/spaces/chenjoya/LiveCC
Application Scenarios for LiveCC
- Sports Events: Deliver real-time commentary and match analysis, enhancing the viewer experience.
- News Reporting: Assist in real-time interpretation of news, improving the depth and professionalism of reports.
- Education: Generate instructional commentary for educational videos to support skill training and learning.
- Entertainment Media: Provide real-time plot explanations for movies and TV shows to boost engagement and interactivity.
- Smart Assistants: Offer real-time information based on video content to enhance user interaction experiences.