NoteLLM: A multimodal large-model framework for note recommendation launched by REDnote


What is NoteLLM?

NoteLLM is a large language model (LLM) framework developed by Xiaohongshu (REDnote) for note recommendation. It compresses each note into a single embedding and automatically generates tags and categories for it. By combining the strong semantic understanding of LLMs with contrastive learning and instruction tuning, NoteLLM significantly improves the accuracy and relevance of note recommendations.
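
For a concrete picture of the "compressed embedding" idea, here is a minimal sketch using PyTorch and Hugging Face Transformers. It is an illustration under stated assumptions, not REDnote's code: the [EMB] token, the prompt wording, and the backbone checkpoint are hypothetical stand-ins for NoteLLM's actual template and model.

```python
# Minimal sketch (assumptions, not REDnote's implementation): compress a
# note into one embedding by prompting a decoder-only LLM and reading the
# hidden state at a special compression token. The "[EMB]" token, the
# template wording, and the backbone model are hypothetical stand-ins.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # any decoder-only LLM would do here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

# Register the compression token so it maps to a trainable embedding.
tokenizer.add_special_tokens({"additional_special_tokens": ["[EMB]"]})
model.resize_token_embeddings(len(tokenizer))

def note_embedding(title: str, content: str) -> torch.Tensor:
    """Return the hidden state at the [EMB] token as the note embedding."""
    prompt = (
        f"Note: title: {title}, content: {content}\n"
        'Compress the note above into one word: "[EMB]". '
        "Then generate the note's tags and category:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state          # [1, seq_len, dim]
    emb_id = tokenizer.convert_tokens_to_ids("[EMB]")
    position = (inputs["input_ids"][0] == emb_id).nonzero(as_tuple=True)[0][-1]
    return hidden[0, position]                              # [dim]
```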

NoteLLM-2 builds on NoteLLM by introducing multimodal inputs. It adopts an end-to-end fine-tuning strategy that integrates visual encoders with LLMs to address the issue of neglected visual information. NoteLLM-2 proposes two mechanisms—multimodal in-context learning (mICL) and late fusion—to further enhance multimodal representation capabilities and significantly boost performance in multimodal recommendation tasks. This framework has demonstrated strong recommendation capabilities and is already deployed in Xiaohongshu’s production recommendation system.


Key Features of NoteLLM

  • Automatic Tag and Category Generation: Generates tags and categories for notes, enhancing the quality of note embeddings.

  • Enhanced User Experience: Provides more accurate recommendations to increase user engagement and satisfaction on the platform.

  • Multimodal Note Recommendation: Combines text and image information to create more comprehensive note representations, improving recommendation accuracy and relevance.

  • Addressing Visual Information Neglect: Enhances the representation of visual data through mICL and late fusion mechanisms.


Technical Principles of NoteLLM

  • Note Compression Prompt: A specially designed prompt template compresses note content into a special token and simultaneously generates tags and categories.

  • Contrastive Learning: Constructs related note pairs from co-occurrence patterns in user behavior data and trains the model on them to strengthen the semantic quality of note embeddings (a minimal loss sketch follows this list).

  • Instruction Tuning: Uses instruction tuning to help the LLM better understand task requirements and produce high-quality tags and categories.

  • Multimodal In-Context Learning (mICL): Separates multimodal content into visual and textual components, compresses them into two modal-specific tokens, and balances attention between modalities using contrastive learning.

  • Late Fusion: Integrates visual information directly at the output stage of the LLM, preserving more of the raw visual signal and avoiding the information loss that early fusion can introduce (a gated-fusion sketch follows this list).

  • End-to-End Fine-Tuning: Combines any existing LLM with a visual encoder, allowing for customized and efficient multimodal representation models without requiring pre-alignment.
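
To make these principles concrete, two hedged sketches follow. The first shows the kind of in-batch InfoNCE loss implied by the contrastive-learning step: each row pairs the embeddings of two notes that co-occur in user behavior logs, and every other note in the batch acts as a negative. The function name and the temperature value are assumptions, not NoteLLM's published settings.

```python
import torch
import torch.nn.functional as F

def co_occurrence_contrastive_loss(anchor: torch.Tensor,
                                   positive: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """In-batch InfoNCE over co-occurring note pairs.

    anchor, positive: [batch, dim] embeddings read from the compression
    token; row i of each tensor comes from two notes that co-occur in
    user behavior logs, so the diagonal holds the positive pairs and
    all off-diagonal entries act as negatives.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature                        # [batch, batch] similarities
    labels = torch.arange(a.size(0), device=a.device)     # positives on the diagonal
    return F.cross_entropy(logits, labels)
```

The second sketches late fusion as a small gating module that merges the LLM's text-side compression embedding with the visual embedding only at the output; the two inputs could be the modal-specific token embeddings produced by mICL. The gate design and dimensions are illustrative assumptions, and NoteLLM-2's actual fusion module may differ.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Gated late fusion (illustrative sketch): merge the LLM's text-side
    compression embedding with the visual embedding at the output stage,
    so the visual signal is not diluted while passing through the LLM."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # learns per-dimension mixing weights

    def forward(self, text_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) decides, per dimension, how much of each modality to keep.
        g = torch.sigmoid(self.gate(torch.cat([text_emb, visual_emb], dim=-1)))
        return g * text_emb + (1.0 - g) * visual_emb

# Usage: fuse a batch of text and visual embeddings of matching width.
fuse = LateFusion(dim=4096)
note_emb = fuse(torch.randn(8, 4096), torch.randn(8, 4096))  # -> [8, 4096]
```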


Project Links for NoteLLM


Application Scenarios of NoteLLM

  • Personalized Note Recommendation: Precisely recommends relevant content from massive note databases based on user interests and behaviors, enhancing content discovery.

  • Cold Start Note Recommendation: Helps newly published notes gain visibility quickly through content similarity-based recommendations.

  • Tag and Category Generation: Automatically generates relevant tags and categories to improve content searchability and help users find interesting content more efficiently.

  • Multimodal Content Recommendation: Processes both text and image data to generate more comprehensive note representations, enhancing recommendation performance.

  • Content Creation Assistance: Provides creators with inspiration and suggestions such as keywords, tags, and related notes to support content production.
