The generation principle behind GPT-4o’s new image model: What exactly is an autoregressive model? Why is it so amazing?

You may have heard that OpenAI’s newly released GPT-4o can generate high-quality images fluently. However, unlike popular diffusion-based tools such as Midjourney, DALL·E, and Stable Diffusion, GPT-4o’s image generation adopts a seemingly simple yet magical approach: the autoregressive model.

So, what exactly does autoregression mean? And how does GPT-4o manage to generate clear images pixel by pixel and region by region?

What is autoregressive image generation?

Let’s start by breaking down the term “autoregressive”:
• Auto means “self”: the model conditions on its own earlier output rather than on any additional outside intervention.
• Regressive refers to regression: the model predicts subsequent information based on what has already been generated.
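In probabilistic terms, this idea is usually written as a chain-rule factorization. The symbols here are assumed notation, not something taken from GPT-4o itself: x_1, ..., x_N stand for the pieces of an image in generation order (pixels, groups of pixels, or tokens), and each factor is the model’s prediction for one piece given all the pieces before it.

```latex
p(x_1, x_2, \dots, x_N) = \prod_{i=1}^{N} p(x_i \mid x_1, \dots, x_{i-1})
```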

Here’s a simple analogy:

Imagine you are drawing a picture by hand. You won’t complete the whole picture all at once. Instead, you start from a small area and gradually expand outward, and what you have already drawn determines the direction of each new stroke.

The core idea of the autoregressive model is similar to this painting process. Specifically, for GPT-4o, it works as follows:
• The model starts from the top and generates the image row by row, moving downward step by step;
• At each step, the model refers to the previously generated pixel information to predict the content of the next pixel (or group of pixels);
• This process repeats continuously, gradually painting the complete image.
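The steps above can be sketched in a few lines of Python. This is only a toy illustration, not GPT-4o’s actual implementation: `toy_next_pixel_prob` is a hypothetical stand-in for a learned model, and the image is a tiny grid of 0/1 pixels. The point it shows is the ordering: every pixel is sampled from a prediction that looks only at pixels generated earlier.

```python
import random

def toy_next_pixel_prob(image, row, col):
    """Hypothetical stand-in for a learned model.

    Returns P(pixel = 1) using only pixels generated earlier in
    top-to-bottom, left-to-right order (the left and upper neighbors).
    """
    left = image[row][col - 1] if col > 0 else 0
    up = image[row - 1][col] if row > 0 else 0
    # A pixel is more likely to be "on" when its earlier neighbors are "on".
    return 0.1 + 0.4 * left + 0.4 * up

def generate(height, width, seed=0):
    """Generate a toy binary image one pixel at a time, row by row."""
    rng = random.Random(seed)
    image = [[None] * width for _ in range(height)]
    for row in range(height):            # start from the top, move downward
        for col in range(width):         # fill each row from left to right
            p_on = toy_next_pixel_prob(image, row, col)
            image[row][col] = 1 if rng.random() < p_on else 0
    return image

if __name__ == "__main__":
    for row in generate(4, 8):
        print("".join("#" if pixel else "." for pixel in row))
```

A real system would replace the hand-written rule with a neural network and would most likely predict larger units than single pixels, but the loop structure stays the same.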

A diffusion model works very differently. It is more like starting from a sheet of paper covered entirely in splashed paint (pure noise) and then removing the unnecessary parts step by step until a clear image remains.
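For contrast, here is a minimal sketch of the diffusion-style loop, under heavy simplifying assumptions. `toy_denoise_step` is a hypothetical stand-in for a learned denoiser; it cheats by being handed the clean image, which a real model never sees. What the sketch is meant to show is only the shape of the process: the starting point is pure noise, and every step updates all pixels at once.

```python
import random

def toy_denoise_step(noisy, target, strength):
    """Hypothetical stand-in for a learned denoiser.

    Nudges EVERY pixel toward the clean image in one step. A real model
    predicts the noise to remove; it is not given the target directly.
    """
    return [n + strength * (t - n) for n, t in zip(noisy, target)]

def toy_diffusion(target, steps=10, seed=0):
    """Start from pure Gaussian noise and refine the whole image each step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]   # complete noise, no outline yet
    history = [x]
    for step in range(steps):
        strength = 1.0 / (steps - step)          # the final step removes what is left
        x = toy_denoise_step(x, target, strength)
        history.append(x)
    return history

def max_error(image, target):
    """Largest per-pixel distance from the clean image."""
    return max(abs(a - b) for a, b in zip(image, target))

if __name__ == "__main__":
    clean = [0.0, 1.0, 0.5, 0.25]
    for i, image in enumerate(toy_diffusion(clean)):
        print(i, round(max_error(image, clean), 4))
```

Compared with the autoregressive loop, there is no notion of “already finished” regions here: the whole image is uncertain at the start and sharpens everywhere together.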

Why use autoregression instead of diffusion?

Although diffusion models are excellent, they have obvious drawbacks:
• They start with complete noise, making it impossible to see any outline of the image in the initial stage.
• It is difficult to “guide” the generation process step by step, because each denoising step updates the whole image at once rather than one region at a time.
• It is challenging to make detailed modifications and edits during the process.

On the other hand, GPT-4o’s autoregressive generation method has two clear advantages:

1. Stronger Coherence
Since each generation step refers to the content generated previously, GPT-4o achieves finer control over the coherence of images. It is like writing an article: we first outline the structure and then write paragraph by paragraph, with each sentence closely connected to the previous one, which results in a smoother flow.

Let me give you a down-to-earth example:
Suppose you ask an AI to draw a cat. If you use a diffusion model, it might initially present just a blurry blob, with the cat’s form only becoming clear in the later stages. However, GPT-4o would start by sketching out the general outline of the cat right away, and then gradually refine each detail, such as the eyes, ears, and fur. This approach makes the generation process feel more “human-like.”

2. More Precise Editing Capabilities
Another significant advantage of autoregression is its potential for precise local modifications. Since the image is generated sequentially, a user could in principle intervene partway through to modify specific parts, and the regions generated afterward would automatically adapt to the modification.

For example:
Suppose an AI is generating a landscape painting from top to bottom. If you suddenly want more clouds in the sky halfway through, you only need to give instructions during the stage of generating the sky. The AI can then make immediate adjustments in the next step and generate cloud shapes that meet your expectations, without having to regenerate the entire image from scratch.
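The landscape example can be mimicked with a toy row-by-row generator. Everything in this sketch is an assumption made for illustration: `toy_next_row` is a hypothetical model that mostly copies the row above, and the “user edit” is simply overwriting one row. It does not describe an editing interface that GPT-4o is known to expose; it only shows why sequential generation makes this kind of edit cheap, since earlier rows stay fixed and later rows are conditioned on the edited one.

```python
import random

def toy_next_row(previous_row, width, rng):
    """Hypothetical stand-in for a learned model: predicts a row from the row above."""
    if previous_row is None:
        return [rng.randint(0, 1) for _ in range(width)]
    # Copy each pixel from the row above, flipping it 10% of the time.
    return [p if rng.random() < 0.9 else 1 - p for p in previous_row]

def continue_generation(rows, total_rows, width, rng):
    """Extend `rows` until there are `total_rows`; existing rows are left untouched."""
    rows = [list(r) for r in rows]
    while len(rows) < total_rows:
        previous = rows[-1] if rows else None
        rows.append(toy_next_row(previous, width, rng))
    return rows

rng = random.Random(0)
width = 8
partial = continue_generation([], 3, width, rng)      # generation pauses after three rows
edited = partial[:2] + [[1] * width]                  # the user overwrites the third row
final = continue_generation(edited, 6, width, rng)    # generation resumes from the edit

for row in final:
    print("".join("#" if pixel else "." for pixel in row))
```

Only the rows after the edit are regenerated; the first two rows are reused exactly as they were.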

Through the web interface of ChatGPT, we can use the browser’s built-in developer tools to observe some interesting details:
• Content is generated line by line from top to bottom.
The process of GPT-4o generating images is like painting, starting from the top and gradually filling in the content.

• The initial contour quickly emerges and is then gradually refined.
This is similar to how a painter first quickly sketches out the general outline of the composition and then gradually adds details.

• Areas that have already been generated may be repeatedly adjusted.
Even after certain local areas have been generated, later steps may still adjust them significantly. This suggests that the model applies some form of global coherence optimization, similar to how a writer might revise earlier parts of a text after completing a paragraph to make the entire passage more fluent.

• Generating simple images is significantly faster.
If you ask for a simple image, such as an apple, the model displays it almost instantly. However, if you request a complex scene (such as a bustling city street), the process takes noticeably longer, and multiple “intermediate images” appear along the way. This suggests that GPT-4o may also use a technique called “speculative decoding,” in which several steps are predicted in advance and then checked and corrected, to improve efficiency.

• Background removal appears to be a separate step.
GPT-4o seems to rely on an external background removal step: it initially displays a “pseudo-transparent” checkerboard background, and the actual background removal is completed only after generation. This appears to be an externally appended post-processing procedure rather than an inherent feature of GPT-4o itself.
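Of the observations above, speculative decoding is the easiest to illustrate in code. The sketch below is a simplified greedy variant and makes no claim about how, or whether, GPT-4o uses the technique. Both “models” are hypothetical toy functions over digits: `target_next` plays the slow, accurate model, and `draft_next` plays a cheap model that is sometimes wrong. The draft proposes several tokens, the target checks them, and the first disagreement is replaced by the target’s own choice.

```python
def target_next(prefix):
    """Hypothetical 'large' model: deterministic next token (a digit)."""
    return (sum(prefix) * 7 + len(prefix)) % 10

def draft_next(prefix):
    """Hypothetical 'small' model: matches the target except at some positions."""
    guess = target_next(prefix)
    return guess if len(prefix) % 4 != 3 else (guess + 1) % 10

def greedy_decode(prompt, n_tokens):
    """Baseline: ask the target model for one token at a time."""
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(target_next(seq))
    return seq

def speculative_decode(prompt, n_tokens, k=4):
    """Draft proposes k tokens per round; the target verifies and corrects them."""
    seq = list(prompt)
    goal = len(prompt) + n_tokens
    rounds = 0
    while len(seq) < goal:
        rounds += 1
        draft = []
        for _ in range(k):                        # cheap guesses, several steps ahead
            draft.append(draft_next(seq + draft))
        # Verification: in a real system the target scores all k positions in one pass.
        for i, token in enumerate(draft):
            expected = target_next(seq + draft[:i])
            if token != expected:
                seq.extend(draft[:i])             # keep the agreed-upon prefix
                seq.append(expected)              # replace the first wrong guess
                break
        else:
            seq.extend(draft)                     # every guess was accepted
    return seq[:goal], rounds

if __name__ == "__main__":
    tokens, rounds = speculative_decode([1, 2], 12)
    print(tokens == greedy_decode([1, 2], 12), rounds)
```

Because every accepted token is one the target model would have chosen anyway, the output is identical to ordinary greedy decoding; the saving comes from needing fewer verification rounds than tokens.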

Technical Difficulties and the Miracles Achieved

The biggest challenge OpenAI appears to have overcome with this model is balancing generation quality and speed within the autoregressive approach. Autoregressive models typically require a very large number of parameters and substantial computational resources to maintain image quality. However, GPT-4o manages to achieve both speed and high quality, leaving many industry insiders amazed:
“Remarkably, GPT-4o has achieved effects comparable to, or even better than, diffusion models using the autoregressive approach. It’s truly unbelievable.”

This feat most likely relies on highly efficient model design and optimization algorithms.

What does this mean for ordinary people?

The success of GPT-4o represents a new stage in AI image generation technology:
• It allows us to interact with design more effortlessly, enabling AI to quickly and accurately generate the desired content.
• It makes image editing more intuitive, as if painting step by step with AI, allowing you to adjust every detail at will.
• It could even lead the future of the visual creative field, freeing creators from the limitations of one-time image generation and enabling them to enjoy the freedom of interacting and adjusting at any time.

Ultimately, this breakthrough in technology is not only an achievement in computer science but also a reminder to us:
“The real progress of technology lies not in replacing humans, but in providing everyone with a better ‘paintbrush’ to freely depict the world of their own.”

Perhaps what GPT-4o tells us is not just what AI can do, but how we really want to use it.