FlexIP – Tencent’s Personalized Image Generation and Editing Framework

What is FlexIP?

FlexIP is a flexible subject attribute editing framework for image synthesis proposed by Tencent. It aims to balance identity preservation and personalized editing in image generation. The framework adopts a dual-adapter architecture that decouples identity preservation from personalized editing, and it protects identity integrity by combining high-level semantic concepts with low-level spatial details. A dynamic weight gating mechanism lets users adjust a parameter to control the trade-off between identity retention and style personalization, turning the traditional binary trade-off into a continuous control surface. FlexIP also uses a multimodal data training strategy: image data is used to optimize the adapters' identity locking capability and video data their deformation capability, which further improves generation robustness.

Main features of FlexIP

Dual Adapter Decoupling Design: Introduces an explicit separation between the Identity Preservation Adapter and the Personalization Adapter, described as the first design of its kind. The Identity Preservation Adapter combines high-level semantic concepts with low-level spatial details to ensure identity integrity. The Personalization Adapter interacts with the text and the visual CLS tokens, absorbs meaningful visual cues, and places textual modifications within a coherent visual context. Keeping the two roles separate avoids feature competition and allows more precise control.
Dynamic Weight Gating Mechanism: Dynamically balances identity preservation and editing intensity through continuously adjustable parameters, transforming the traditional binary trade-off into a continuous parameter control surface. This enables flexible control ranging from subtle adjustments to significant transformations, allowing users to fine-tune the generation effects as needed.
Modality-Aware Training Strategy: Adapts adapter weights according to data characteristics (static images/video frames). Image data strengthens identity locking, while video data optimizes temporal deformation, enhancing generation robustness.
Cross-Attention Mechanism: The adapters use cross-attention to capture visual features at multiple granularities (e.g., facial details), which makes identity preservation more robust.
Dynamic Interpolation: The weight gating mechanism supports real-time adjustment of adapter contributions, forming a continuous “control surface.”
Multi-Modality Data Training: Combines image and video data to separately optimize the identity locking and deformation capabilities of adapters.
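
The dynamic weight gating described above can be pictured as a weighted blend of the two adapters' outputs. The following is only a minimal sketch under stated assumptions: it assumes a single scalar gate in [0, 1] and plain linear interpolation, and the function name blend_adapters and the toy feature vectors are hypothetical, not taken from the FlexIP code or paper.

```python
from typing import List, Sequence


def blend_adapters(
    identity_feat: Sequence[float],
    personal_feat: Sequence[float],
    gate: float,
) -> List[float]:
    """Blend two adapter outputs with a scalar gate (hypothetical helper).

    gate = 1.0 keeps only the identity-preservation features,
    gate = 0.0 keeps only the personalization features,
    values in between interpolate linearly (an assumption of this sketch).
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    if len(identity_feat) != len(personal_feat):
        raise ValueError("feature vectors must have the same length")
    return [
        gate * i + (1.0 - gate) * p
        for i, p in zip(identity_feat, personal_feat)
    ]


if __name__ == "__main__":
    # Toy feature vectors, made up for illustration only.
    identity = [1.0, 0.0, 2.0]
    personal = [0.0, 1.0, 4.0]
    for g in (0.0, 0.5, 1.0):
        print(g, blend_adapters(identity, personal, g))
```

Sweeping the gate from 0 to 1 moves the output smoothly from the personalization features to the identity features, which is the sense in which a binary choice becomes a continuous control.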

Performance Comparison of FlexIP

Quantitative Comparison
◦ Overall Ranking: In terms of the overall ranking (mRank) metric, FlexIP outperforms all other methods, indicating its superior comprehensive performance across multiple key indicators.
◦ Personalization Capability: In the personalization evaluation, FlexIP achieves a CLIP-T score of 0.284, slightly lower than λ-Eclipse. However, λ-Eclipse obtains its result by sacrificing subject retention, whereas FlexIP keeps strong subject features while still reaching a high level of personalization.
◦ Identity Preservation Capability: In terms of identity preservation, FlexIP achieves high scores of 0.873 on CLIP-I and 0.739 on DINO-I, significantly outperforming other methods and demonstrating its strong advantage in preserving image details and semantic consistency.
◦ Image Quality: In image quality evaluation, FlexIP scores 0.598 on CLIP-IQA and 639 in aesthetics, indicating that its generated images are not only of high quality but also possess better aesthetic appeal.
◦ User Study: In the user study, FlexIP performs exceptionally well on two metrics: flexibility (Flex) and identity preservation (ID-Pres). The 60 evaluators unanimously agreed that the images generated by FlexIP best align with the text semantics and preserve the subject features most effectively.
Qualitative Comparison
◦ Fidelity: FlexIP demonstrates excellent fidelity in image generation, highly preserving the main features and details of reference images. Even during personalized editing, it maintains high image quality and realism.
◦ Editability: FlexIP exhibits significant advantages in editability, capable of generating diverse editing results based on different text instructions, meeting users’ personalized needs across various scenarios.
◦ Identity Consistency: FlexIP stably maintains the subject's main features across different reference images. Even under significant transformations or stylized edits, it keeps the subject's identity consistent, avoiding the abrupt identity changes common in traditional methods.
◦ Comparison with Existing Methods: In a qualitative comparison with five state-of-the-art methods, FlexIP-generated images show significant improvements in fidelity, editability, and identity consistency, better meeting users’ needs for high-fidelity personalized image generation.
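
The overall ranking (mRank) cited in the quantitative comparison can be read as an average of a method's ranks across the individual metrics. The sketch below assumes that reading, assumes higher is better for every metric, and breaks ties by input order. The method names, metric names, and scores are made up for illustration and are not results from the paper.

```python
from typing import Dict


def mean_rank(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each method's per-metric rank (1 = best, higher score wins).

    Assumes every method reports the same metrics; ties are broken by
    input order rather than averaged.
    """
    methods = list(scores)
    metrics = list(next(iter(scores.values())))
    totals = {m: 0.0 for m in methods}
    for metric in metrics:
        # Sort methods from best to worst on this metric.
        ordered = sorted(methods, key=lambda m: scores[m][metric], reverse=True)
        for rank, method in enumerate(ordered, start=1):
            totals[method] += rank
    return {m: totals[m] / len(metrics) for m in methods}


if __name__ == "__main__":
    # Hypothetical methods and scores, for illustration only.
    example = {
        "method_a": {"metric_1": 0.9, "metric_2": 0.2, "metric_3": 0.7},
        "method_b": {"metric_1": 0.8, "metric_2": 0.3, "metric_3": 0.6},
        "method_c": {"metric_1": 0.5, "metric_2": 0.1, "metric_3": 0.5},
    }
    ranks = mean_rank(example)
    print(ranks)
    print("best overall:", min(ranks, key=ranks.get))
```

A method can lead on mean rank without winning every metric, which matches the pattern reported above: a slightly lower CLIP-T score but the best overall ranking.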

Project links for FlexIP

Project official website: http://flexip-tech.github.io/flexip/#/
arXiv technical paper: https://arxiv.org/pdf/2504.07405

Application scenarios of FlexIP

Art Creation: FlexIP can flexibly edit images according to the needs of artists while maintaining the identity characteristics of the subjects.
Advertising Design: In the field of advertising design, FlexIP can help designers quickly generate image content that meets brand requirements. Through the dynamic weight gating mechanism, designers can flexibly adjust the style, scene, and details of advertising images while maintaining the brand image.
Film and Television Production: FlexIP can be used for visual effects and character design in film and television production. It can flexibly adjust the appearance of characters while maintaining the identity consistency of the characters.
Game Development: In game development, FlexIP can be used for the generation and editing of characters and scenes. Developers can quickly generate diverse character images through this framework while maintaining the core characteristics of the characters.