What Is SAM 3?
SAM 3 (Segment Anything Model 3) is Meta AI’s latest computer vision model for detecting, segmenting, and tracking objects in images and videos using text, example images, and visual prompts. The model accepts open-vocabulary phrases and supports cross-modal interaction, so segmentation results can be refined in real time. Meta reports roughly twice the performance of existing systems on promptable image and video segmentation, along with zero-shot handling of unseen concepts. Together with the companion SAM 3D model, it also extends into 3D reconstruction, powering applications such as home-setup previews, creative video editing, and scientific research.

Key Features of SAM 3
1. Multimodal Prompting
SAM 3 can detect and segment objects in images and videos using text, example images, and visual prompts such as clicks or bounding boxes, adapting to different user needs (a usage sketch follows this list).
2. Image and Video Segmentation
It can detect and segment all matching objects in images and track objects across video frames, with support for real-time interactive refinement.
3. Zero-Shot Learning
SAM 3 handles unseen concepts via open-vocabulary text prompts, enabling segmentation of new object categories without additional training.
4. Real-Time Interactivity
Users can refine results by adding extra prompts (such as clicks or boxes), correcting model errors and improving the segmentation quality.
5. Cross-Domain Applications
SAM 3 is widely used in creative media tools (e.g., Instagram Edits), home-decoration previews (e.g., Facebook Marketplace), and scientific fields such as wildlife monitoring.
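The snippet below illustrates what a mixed-prompt request might look like. It is a minimal, self-contained sketch: the Prompts dataclass and the stand-in predictor are illustrative assumptions, not the official SAM 3 API. The point is that text, exemplar boxes, and clicks can all contribute to the same request, and any subset may be omitted.

```python
"""Minimal usage sketch of mixed prompting. The Prompts dataclass and the
stand-in predictor below are illustrative assumptions, not the official SAM 3 API."""
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import numpy as np


@dataclass
class Prompts:
    """One request may combine any of the supported prompt modalities."""
    text: Optional[str] = None                                             # open-vocabulary phrase
    boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)  # exemplar boxes (x1, y1, x2, y2)
    points: List[Tuple[int, int]] = field(default_factory=list)           # click coordinates
    point_labels: List[int] = field(default_factory=list)                 # 1 = foreground, 0 = background


class FakeSam3Predictor:
    """Stand-in predictor that returns empty masks, used only to make the sketch runnable."""

    def predict(self, image: np.ndarray, prompts: Prompts) -> List[np.ndarray]:
        h, w = image.shape[:2]
        # A real model would return one binary mask per detected instance.
        return [np.zeros((h, w), dtype=bool)]


image = np.zeros((480, 640, 3), dtype=np.uint8)      # placeholder image
prompts = Prompts(
    text="red backpack",                              # text prompt
    boxes=[(120, 80, 260, 300)],                      # visual exemplar
    points=[(190, 190)], point_labels=[1],            # positive click
)
masks = FakeSam3Predictor().predict(image, prompts)
print(f"{len(masks)} instance mask(s) of shape {masks[0].shape}")
```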
How SAM 3 Works
Unified Model Architecture
SAM 3 is built on a unified architecture that supports both image and video segmentation. It combines a powerful visual encoder (such as the Meta Perception Encoder) with a text encoder to process open-vocabulary prompts.
The model includes:
- an image-level detector
- a memory-based video tracker

Both components share the same visual encoder.
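The schematic PyTorch sketch below mirrors this layout: a single backbone feeds an image-level detection head and a memory-based tracking module. Layer choices, dimensions, and module names are simplifying assumptions, not Meta's actual implementation.

```python
from typing import Optional

import torch
import torch.nn as nn


class SharedVisualEncoder(nn.Module):
    """Stand-in for the shared backbone (a Perception-Encoder-style model in SAM 3)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify the frame

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.patchify(frame)                   # (B, dim, H/16, W/16)


class ImageLevelDetector(nn.Module):
    """Predicts coarse per-query mask logits from encoder features."""

    def __init__(self, dim: int = 256, num_queries: int = 10):
        super().__init__()
        self.head = nn.Conv2d(dim, num_queries, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats)                       # (B, num_queries, H/16, W/16)


class MemoryTracker(nn.Module):
    """Fuses current-frame features with a running memory of past frames."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.mix = nn.Conv2d(2 * dim, dim, kernel_size=1)
        self.memory: Optional[torch.Tensor] = None

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        if self.memory is None:
            self.memory = feats.detach()              # first frame seeds the memory
        fused = self.mix(torch.cat([feats, self.memory], dim=1))
        self.memory = fused.detach()                  # update memory with the fused state
        return fused


encoder = SharedVisualEncoder()
detector = ImageLevelDetector()
tracker = MemoryTracker()

frame = torch.randn(1, 3, 256, 256)                  # one video frame
feats = encoder(frame)                               # the same encoder feeds both branches
image_masks = detector(feats)                        # image-level detection
video_feats = tracker(feats)                         # memory-conditioned features for tracking
print(image_masks.shape, video_feats.shape)
```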
Multimodal Input Processing
- Text Encoder: Converts text prompts into feature vectors that guide segmentation.
- Visual Encoder: Encodes images or video frames into visual features for object detection and segmentation.
- Fusion Encoder: Merges text and visual features to produce conditioned image features for segmentation tasks.
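A toy version of this three-encoder flow is sketched below, with text features conditioning visual patch features through a single cross-attention layer. The dimensions and the one-layer fusion are assumptions made to keep the example short.

```python
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Maps a tokenized phrase to a sequence of feature vectors."""

    def __init__(self, vocab_size: int = 1000, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids)                  # (B, T, dim)


class VisualEncoder(nn.Module):
    """Maps an image to a sequence of patch features."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.patchify(image)                  # (B, dim, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)       # (B, num_patches, dim)


class FusionEncoder(nn.Module):
    """Conditions visual features on the text prompt via cross-attention."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        attended, _ = self.cross_attn(query=visual, key=text, value=text)
        return visual + attended                      # prompt-conditioned image features


token_ids = torch.randint(0, 1000, (1, 4))            # toy token ids for a short phrase
image = torch.randn(1, 3, 256, 256)
conditioned = FusionEncoder()(VisualEncoder()(image), TextEncoder()(token_ids))
print(conditioned.shape)                               # (1, 256, 256): 256 patches x 256 dims
```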
Presence Head
To improve classification performance, SAM 3 introduces a Presence Head that predicts whether the target concept exists in the image or video. This helps decouple recognition and localization, enhancing accuracy and efficiency.
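Conceptually, the Presence Head is a small global classifier that sits next to the localization (mask) head. The sketch below, with assumed pooling and layer sizes, shows the idea: one probability per image answering "is the prompted concept here at all?", independent of where it is.

```python
import torch
import torch.nn as nn


class PresenceHead(nn.Module):
    """Predicts whether the prompted concept is present anywhere in the image."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.classifier = nn.Linear(dim, 1)             # a single global presence logit

    def forward(self, conditioned_feats: torch.Tensor) -> torch.Tensor:
        # conditioned_feats: (B, num_patches, dim), e.g. the fusion encoder's output
        pooled = conditioned_feats.mean(dim=1)           # global average over patches
        return torch.sigmoid(self.classifier(pooled))    # P(concept is present)


feats = torch.randn(2, 256, 256)                         # a batch of 2 conditioned feature maps
presence = PresenceHead()(feats)
print(presence.squeeze(-1))                              # one presence probability per image
# The mask (localization) head can then be gated or scored by this probability,
# which is the decoupling of recognition ("what") from localization ("where").
```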
Large-Scale Data Engine
Meta built a high-efficiency data engine that combines human annotations with AI-assisted labeling, producing high-quality annotations for over 4 million unique concepts. This diverse data coverage ensures strong generalization across visual domains and tasks.
Zero-Shot Learning
Leveraging pre-trained visual and language encoders, SAM 3 can identify and segment new object categories from open-vocabulary text prompts without additional training.
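The core mechanism is embedding-space matching: the phrase is encoded once and compared against candidate object features, so no category-specific training is required. The sketch below uses random tensors as stand-ins for real encoder outputs and only illustrates the idea.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

text_embedding = torch.randn(256)       # embedding of an unseen phrase, e.g. "solar panel"
object_features = torch.randn(5, 256)   # features of 5 candidate objects found in the image

# Cosine similarity between the phrase embedding and each candidate object.
scores = F.cosine_similarity(object_features, text_embedding.unsqueeze(0), dim=-1)

# Candidates whose similarity clears a threshold are segmented as matches of the phrase.
matches = (scores > 0.2).nonzero(as_tuple=True)[0]
print("similarity per candidate:", [round(s, 3) for s in scores.tolist()])
print("candidates selected as matches:", matches.tolist())
```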
Real-Time Interactivity
Users can refine segmentation results by adding prompts like clicks or box selections. This interactive loop helps the model better align with user intent.
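A typical refinement loop looks like the sketch below: each correction is appended to the running prompt set and the model is re-run on the same image. The fake_segment function is a stand-in so the loop runs end to end; the real SAM 3 call signature may differ.

```python
import numpy as np


def fake_segment(image: np.ndarray, points: list, labels: list) -> np.ndarray:
    """Stand-in for the model: grows the mask around each positive click."""
    mask = np.zeros(image.shape[:2], dtype=bool)
    for (x, y), label in zip(points, labels):
        if label == 1:                                      # positive click adds a region
            mask[max(0, y - 20):y + 20, max(0, x - 20):x + 20] = True
    return mask


image = np.zeros((240, 320, 3), dtype=np.uint8)
points, labels = [(100, 100)], [1]                          # initial positive click

# Each loop iteration simulates the user adding one corrective prompt.
for new_point, new_label in [((160, 120), 1), ((40, 200), 0)]:
    mask = fake_segment(image, points, labels)
    print(f"{len(points)} prompt(s) -> mask covers {int(mask.sum())} px")
    points.append(new_point)
    labels.append(new_label)

final_mask = fake_segment(image, points, labels)            # refined result after corrections
print(f"{len(points)} prompt(s) -> mask covers {int(final_mask.sum())} px")
```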
Video Tracking and Segmentation
For video tasks, SAM 3 uses a memory-based tracker to maintain spatiotemporal consistency. The tracker uses detection outputs and historical memory to generate high-quality masks and propagate them across frames.
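The sketch below reduces this to its simplest form: a tracker that stores masks from earlier frames and reuses the most recent one when no fresh detection arrives. Meta's actual tracker uses learned memory attention; the last-mask rule here only illustrates the propagation pattern.

```python
from typing import List, Optional

import numpy as np


class SimpleMemoryTracker:
    """Keeps masks from previous frames and propagates the most recent one."""

    def __init__(self) -> None:
        self.memory: List[np.ndarray] = []                 # masks from earlier frames

    def step(self, frame: np.ndarray, detection: Optional[np.ndarray]) -> np.ndarray:
        if detection is not None:
            mask = detection                               # a fresh detection re-anchors the track
        elif self.memory:
            mask = self.memory[-1]                         # otherwise propagate the last known mask
        else:
            mask = np.zeros(frame.shape[:2], dtype=bool)   # nothing to track yet
        self.memory.append(mask)
        return mask


frames = [np.zeros((120, 160, 3), dtype=np.uint8) for _ in range(4)]
first_detection = np.zeros((120, 160), dtype=bool)
first_detection[40:80, 60:100] = True                      # object detected in frame 0 only

tracker = SimpleMemoryTracker()
for i, frame in enumerate(frames):
    mask = tracker.step(frame, first_detection if i == 0 else None)
    print(f"frame {i}: tracked mask covers {int(mask.sum())} px")
```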
Project Links
- Official Website: https://ai.meta.com/sam3/
- GitHub Repository: https://github.com/facebookresearch/sam3/
- Online Demo: https://www.aidemos.meta.com/segment-anything
Application Scenarios
1. Creative Media Tools
Creators can quickly apply visual effects to people or objects in videos, improving production efficiency.
2. Home Decoration Preview
On Facebook Marketplace, SAM 3 powers the “View in Room” feature, letting users visualize how furniture or décor would look in their own space.
3. Scientific Applications
SAM 3 supports wildlife monitoring and ocean research, helping scientists analyze animal behaviors through video.
4. 3D Reconstruction
SAM 3D can reconstruct 3D objects or humans from a single image, setting a new standard for real-world 3D reconstruction and assisting VR/AR applications.
5. Video Creation
SAM 3 enables AI-powered visual creation tools, supporting remixing and editing of AI-generated videos to enhance creative flexibility.