MiniMax-Remover – An AI-based video object removal method that achieves high-quality removal results

What is MiniMax-Remover?

MiniMax-Remover is a novel video object removal method designed to address common issues in existing technologies such as hallucinated objects, visual artifacts, and slow inference speed. It adopts a two-stage approach:

Stage 1 uses a simplified DiT (Diffusion Transformer) architecture with the text inputs and cross-attention layers removed, which yields a lighter, more efficient model.
Stage 2 distills the model with a minimax optimization strategy: it searches for adversarial input noise and trains the model to produce high-quality results even under that noise.

The method requires only 6 sampling steps and does not rely on classifier-free guidance (CFG), enabling state-of-the-art video object removal with significantly improved inference efficiency.
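To make the efficiency argument concrete, the short sketch below counts denoiser forward passes for one sampling run. It relies on the standard formulation of CFG, which evaluates the model twice per step (once conditional, once unconditional). The 50-step CFG baseline is a hypothetical comparison point chosen for illustration and is not taken from the MiniMax-Remover project. The function name forward_passes is likewise illustrative.

```python
def forward_passes(sampling_steps, uses_cfg):
    """Count denoiser forward passes for one sampling run.

    Standard classifier-free guidance evaluates the model twice per step
    (conditional and unconditional); without it, once per step.
    """
    passes_per_step = 2 if uses_cfg else 1
    return sampling_steps * passes_per_step


# 6 steps without CFG is the setting quoted in the text.
remover_cost = forward_passes(6, uses_cfg=False)

# Hypothetical baseline for comparison only: 50 steps with CFG.
baseline_cost = forward_passes(50, uses_cfg=True)

print("remover passes:", remover_cost)
print("baseline passes:", baseline_cost)
print("ratio:", baseline_cost / remover_cost)
```

This counts forward passes only. Actual wall-clock speed also depends on the cost of each pass, which the simplified architecture is meant to reduce as well.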

Key Features of MiniMax-Remover

Efficient Video Object Removal: The method uses a two-stage design. The first stage is a streamlined DiT architecture without text inputs or cross-attention, which gives a faster and leaner model. The second stage distills the remover with minimax optimization to improve editing quality and inference speed.
Fast Inference: With only 6 sampling steps and no CFG, the method reaches state-of-the-art video object removal quality at a much lower inference cost.
High-Quality Removal Results: An inner maximization step identifies adversarial input noise, and an outer minimization step trains the model under that noise. This makes MiniMax-Remover more robust and helps it avoid hallucinated objects and visual artifacts.

Technical Principles of MiniMax-Remover

Stage 1: Model Architecture Optimization
MiniMax-Remover first adopts a simplified DiT (Diffusion Transformer) architecture, removing the text inputs and cross-attention layers to create a lighter, more efficient model. The goal is to reduce model complexity and speed up inference while keeping the core video object removal capability.
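As a rough illustration of why dropping cross-attention makes a block leaner, the sketch below estimates the weight count of a single transformer block with and without a cross-attention sublayer. Everything here is an assumption made for the arithmetic: the hidden size of 1024, the 4x feed-forward expansion, a text context of the same width as the hidden state, and the omission of biases and normalization layers. It does not describe the actual MiniMax-Remover configuration.

```python
def block_params(hidden_dim, with_cross_attention, mlp_ratio=4):
    """Approximate weight count of one transformer block.

    Biases and normalization parameters are ignored, and the cross-attention
    context is assumed to have the same width as the hidden state.
    """
    attention = 4 * hidden_dim * hidden_dim        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * hidden_dim * hidden_dim  # two linear layers
    total = attention + mlp                        # self-attention + feed-forward
    if with_cross_attention:
        total += attention                         # extra attention sublayer
    return total


HIDDEN_DIM = 1024  # hypothetical width, for illustration only

full = block_params(HIDDEN_DIM, with_cross_attention=True)
slim = block_params(HIDDEN_DIM, with_cross_attention=False)
saving = 1 - slim / full

print("with cross-attention:", full)
print("without cross-attention:", slim)
print("fraction of block weights removed:", saving)
```

Under these assumptions the block shrinks by about a quarter. The real saving depends on the model's actual dimensions and on the text encoder that is no longer needed at inference time.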
Stage 2: Minimax Optimization
Building on the first stage, the model is further distilled using a minimax optimization strategy to improve editing quality and inference speed. This involves:
- Inner Maximization: Identifying adversarial input noise (“bad noise”) that is likely to cause removal failures. This noise simulates worst-case scenarios so that the model learns to handle them.
- Outer Minimization: Training the model to generate high-quality results even under these adversarial conditions, so that it maintains strong performance in the worst case.
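The two nested steps above can be sketched on a toy problem. The example below is not the authors' training code: it replaces the video model with a hypothetical two-parameter linear map, the removal loss with a squared error against a fixed target, and the search for bad noise with a few steps of projected sign-gradient ascent inside a bounded interval. All names (model, find_bad_noise, train, worst_case_loss) and constants are assumptions made for illustration.

```python
import random

TARGET = 1.0   # desired output, a stand-in for the clean background
RADIUS = 1.0   # input noise is restricted to [-RADIUS, RADIUS]


def model(w, b, eps):
    """Toy 'remover': a linear map from input noise to an output."""
    return w * eps + b


def loss(w, b, eps):
    """Squared error between the model output and the target."""
    return (model(w, b, eps) - TARGET) ** 2


def find_bad_noise(w, b, eps, steps=5, step_size=0.5):
    """Inner maximization: projected sign-gradient ascent on the noise."""
    for _ in range(steps):
        grad_eps = 2.0 * (model(w, b, eps) - TARGET) * w
        if grad_eps > 0:
            eps += step_size
        elif grad_eps < 0:
            eps -= step_size
        eps = max(-RADIUS, min(RADIUS, eps))  # project back into the allowed range
    return eps


def train(steps=100, lr=0.05, seed=0):
    """Outer minimization: gradient descent on (w, b) at the worst noise found."""
    rng = random.Random(seed)
    w, b = 1.0, 0.0  # start with a model that is sensitive to its input noise
    for _ in range(steps):
        eps = find_bad_noise(w, b, rng.uniform(-RADIUS, RADIUS))
        residual = model(w, b, eps) - TARGET
        w -= lr * 2.0 * residual * eps  # d(loss)/dw
        b -= lr * 2.0 * residual        # d(loss)/db
    return w, b


def worst_case_loss(w, b, grid=201):
    """Largest loss over an evenly spaced grid of noise values."""
    candidates = [-RADIUS + 2 * RADIUS * i / (grid - 1) for i in range(grid)]
    return max(loss(w, b, e) for e in candidates)


w, b = train()
before = worst_case_loss(1.0, 0.0)
after = worst_case_loss(w, b)
print(f"worst-case loss before training: {before:.3f}")
print(f"worst-case loss after training:  {after:.3e}")
```

In this toy setting, training against the worst noise pushes the model toward an output that no longer depends on the noise at all. That mirrors the stated goal of keeping removal quality stable even for unfavorable noise samples.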

Project Website

Official Website: https://minimax-remover.github.io/

Application Scenarios of MiniMax-Remover

Film and TV Post-production: In the post-production of movies, TV shows, and commercials, MiniMax-Remover can remove unwanted elements such as misplaced props, extra actors, or logos, which can reduce post-production time and cost.
Video Content Creation: For content creators, MiniMax-Remover helps remove distracting elements (e.g., background people, billboards), resulting in cleaner, more professional videos and a better viewing experience.
Video Restoration and Enhancement: In the restoration of old or damaged videos, the tool can remove defects or blemishes, improving clarity and completeness.
Visual Effects Production: As a preprocessing tool in VFX workflows, MiniMax-Remover can clear original elements from footage, providing a clean canvas for adding new visual effects.