Manzano – An image understanding and generation model developed by Apple
What is Manzano?
Manzano is a multimodal large language model (LLM) developed by Apple that unifies image understanding and image generation. The model employs a hybrid vision tokenizer that converts images into continuous embeddings for understanding tasks and into discrete image tokens for generation tasks. At its core, Manzano uses an autoregressive LLM decoder capable of predicting both text and image tokens, and a diffusion decoder transforms the generated image tokens into pixel-level images. This design enables Manzano to excel at both understanding and generation, with performance scaling as model size increases.
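The hybrid tokenizer idea can be sketched in a few lines. The following is an illustrative toy, not Apple's implementation: a shared encoder produces patch features; the understanding path keeps them as continuous vectors, while the generation path quantizes them against a learned codebook to get discrete token ids. All function names, shapes, and the codebook size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_patches(image, num_patches=16, dim=8):
    """Stand-in for a shared vision encoder: image -> patch features."""
    return rng.standard_normal((num_patches, dim))

def continuous_embeddings(features):
    """Understanding path: pass continuous vectors straight to the LLM."""
    return features  # rich semantics preserved, no quantization

def discrete_tokens(features, codebook):
    """Generation path: nearest-codebook-entry quantization to token ids."""
    # squared distance between each patch feature and each codebook vector
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)  # one discrete id per patch

codebook = rng.standard_normal((64, 8))   # toy 64-entry codebook
feats = encode_patches(image=None)        # image is unused in this stub

emb = continuous_embeddings(feats)        # (16, 8) continuous vectors
ids = discrete_tokens(feats, codebook)    # (16,) integer token ids
```

The key point the sketch shows is that both paths share one encoder, so understanding and generation see consistent visual features while only the generation path pays the quantization cost.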
Key Features of Manzano
- Image Understanding: The model can interpret image content and answer questions related to images.
- Image Generation: Generates high-quality images from text prompts, supporting complex instructions and producing creative, detailed outputs.
- Image Editing: Supports text-driven editing tasks, such as style transfer, local modifications, and content expansion.
- Multimodal Interaction: Combines text and image information to handle complex multimodal tasks, including mixed text–image Q&A and creative generation.
Technical Principles of Manzano
- Hybrid Vision Tokenizer:
  - Continuous Embeddings: Used for image understanding tasks, encoding images into continuous vectors that preserve rich semantic information.
  - Discrete Tokens: Used for image generation tasks, encoding images into discrete tokens that facilitate autoregressive generation.
- Autoregressive LLM Decoder: Processes both text and image tokens, predicting the next token (text or image). This unified approach enables joint learning across multimodal tasks, allowing the model to handle both understanding and generation.
- Diffusion Decoder: Converts generated discrete image tokens into pixel-level images, leveraging the generative capacity of diffusion models to ensure high quality and fine detail.
- Unified Training Framework: Pretrained on large-scale text and image datasets to learn general language and visual representations. The model is further trained on high-quality subsets to boost performance, and fine-tuned on task-specific data for better domain adaptation.
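The principles above can be strung together into an end-to-end decode loop. The sketch below is a toy approximation under stated assumptions (a joint vocabulary split into text ids and image ids, a fixed 16-token image, a one-step "diffusion" stub); none of these details come from the paper, and the real components are large neural networks.

```python
import numpy as np

rng = np.random.default_rng(1)

TEXT_VOCAB = 100     # assumption: ids [0, 100) are text tokens
IMAGE_VOCAB = 64     # assumption: ids [100, 164) are image tokens
END_OF_IMAGE = 164   # assumed sentinel that ends image emission

def llm_next_token(context):
    """Stub for the autoregressive decoder's next-token prediction:
    a few text tokens, then a 16-token image, then the sentinel."""
    if len(context) < 4:
        return int(rng.integers(0, TEXT_VOCAB))
    if len(context) < 4 + 16:
        return int(rng.integers(TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB))
    return END_OF_IMAGE

def diffusion_decode(image_ids, size=8):
    """Stub diffusion decoder: token ids -> pixel array. The iterative
    denoising process is collapsed into a single conditioned step."""
    x = rng.standard_normal((size, size))  # start from noise
    return x + np.mean(image_ids)          # "condition" on the tokens

# Unified autoregressive loop over the joint text/image vocabulary.
context = []
while True:
    tok = llm_next_token(context)
    if tok == END_OF_IMAGE:
        break
    context.append(tok)

# Hand only the image ids to the diffusion decoder for rendering.
image_ids = [t - TEXT_VOCAB for t in context if t >= TEXT_VOCAB]
pixels = diffusion_decode(image_ids)
```

The design point this illustrates is the division of labor: the LLM handles sequence-level reasoning over a single mixed token stream, while the diffusion decoder is responsible only for turning the discrete image tokens into high-fidelity pixels.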
Project Link
- arXiv technical paper: https://arxiv.org/pdf/2509.16197
Application Scenarios of Manzano
- Image Understanding: Applied to visual question answering (VQA), for example assisting doctors in quickly and accurately interpreting medical images, answering questions, and supporting diagnosis.
- Image Generation: In creative design, generates high-quality images from textual descriptions, providing inspiration and assets for advertising, game art, and more.
- Image Editing: For content creators, enables text-driven image editing (style transfer, local modifications, etc.), allowing fast realization of creative ideas.
- Document Understanding: In document processing, interprets embedded images to support content extraction, analysis, and Q&A, improving office productivity.
- Multimodal Interaction: In smart education, combines text and images to provide more intuitive and vivid learning experiences, for example visually explaining complex scientific concepts.