StarVector – An open-source multimodal vision-language model supporting SVG generation from images and text

What is StarVector?

StarVector is an open-source multimodal vision-language model jointly developed by ServiceNow Research, Mila – Quebec AI Institute, and ETS Montreal. It converts images and text into scalable vector graphics (SVG) code. The model accepts both image and text input and works directly in the SVG code space, so its output is standard, editable SVG. StarVector is trained on the SVG-Stack dataset, which contains over 2 million SVG samples, and is released in two sizes: StarVector-1B and StarVector-8B.

Main functions of StarVector

Image-to-SVG: Converts an input image directly into SVG code, vectorizing it.
Text-to-SVG: Generates SVG graphics from text instructions.
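
Because both functions return SVG as plain code, the result can be inspected with ordinary XML tooling. The following is a minimal sketch using only the Python standard library. The SVG string is hand-written as a stand-in for model output (it was not produced by StarVector), and the helper name list_primitives is hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hand-written stand-in for model output (assumption: NOT produced by StarVector).
svg_code = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
  <circle cx="12" cy="12" r="10" fill="#1e88e5"/>
  <rect x="7" y="11" width="10" height="2" fill="#ffffff"/>
</svg>"""


def list_primitives(svg_text):
    """Parse SVG text and return the tag names of the root's direct children, in order."""
    root = ET.fromstring(svg_text)  # raises ET.ParseError if the text is not well-formed XML
    if root.tag != f"{{{SVG_NS}}}svg":
        raise ValueError("root element is not <svg> in the SVG namespace")
    # ElementTree stores tags as "{namespace}name"; keep only the local name.
    return [child.tag.split("}", 1)[-1] for child in root]


print(list_primitives(svg_code))  # ['circle', 'rect']
```

A check like this only confirms that the code is well-formed and shows which primitives it uses; it says nothing about how closely the graphic matches the input image or prompt.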

Technical principles of StarVector

Multimodal Architecture: StarVector combines a vision model with a language model. A visual encoder (such as a Vision Transformer or the CLIP image encoder) extracts visual features from the image, and an adapter maps those features into the embedding space of the language model as visual tokens. The visual tokens and the text embeddings are then fed to the language model together, so images and text are processed in one sequence.
Image Encoding and Visual Token Generation: The image encoder (e.g., a Vision Transformer) splits the input image into small patches and converts them into hidden features. A non-linear adapter projects these features into the embedding space of the language model, producing the visual tokens. The tokens carry key visual properties of the image, such as shape, color distribution, and structural layout.
Language Model and SVG Code Generation: StarVector uses a language model based on StarCoder. It is trained with supervised next-token prediction over SVG code sequences. At inference time, the model generates SVG code autoregressively, one token at a time, conditioned on the visual tokens derived from the input image.
Training on Large-Scale Datasets: StarVector is trained on the SVG-Stack dataset, which contains over 2 million SVG samples. The diversity of this dataset supports several tasks, including image-to-SVG and text-to-SVG generation. To evaluate model performance across these tasks, the authors also introduce the SVG-Bench benchmark.
Performance Advantages: The authors report strong results on both image-to-SVG and text-to-SVG tasks. The generated SVG files are described as more compact and semantically richer, because the model makes effective use of SVG primitives. On SVG-Bench, StarVector is reported to outperform traditional methods and deep learning baselines across multiple metrics.
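
The data flow described above (patches, adapter, visual tokens, autoregressive decoding) can be sketched in a few lines of pure Python. This is a toy illustration under loud assumptions, not StarVector's implementation: the weights are random and untrained, raw patch pixels stand in for the hidden features a real visual encoder would produce, tanh is an arbitrary choice of non-linearity, and scripted_model is a hypothetical stub that replays a fixed token script in place of the StarCoder-based language model.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    return [
        [image[top + r][left + c] for r in range(patch) for c in range(patch)]
        for top in range(0, h - patch + 1, patch)
        for left in range(0, w - patch + 1, patch)
    ]


def make_linear(rng, n_in, n_out):
    """Random weights for a toy linear layer (untrained, illustration only)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def linear(x, layer):
    """Compute W x + b for a single input vector x."""
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def adapter(features, layer1, layer2):
    """Non-linear adapter: linear -> tanh -> linear (the activation is an assumption)."""
    hidden = [math.tanh(v) for v in linear(features, layer1)]
    return linear(hidden, layer2)


def greedy_decode(next_token, visual_tokens, max_new_tokens, eos="<eos>"):
    """Autoregressive loop: each step sees the visual tokens plus all tokens emitted so far."""
    generated = []
    for _ in range(max_new_tokens):
        token = next_token(visual_tokens, generated)
        if token == eos:
            break
        generated.append(token)
    return generated


# Hypothetical stub for the language model: replays a script and ignores the visual tokens.
SCRIPT = ['<svg xmlns="http://www.w3.org/2000/svg">', '<circle r="4"/>', "</svg>", "<eos>"]


def scripted_model(visual_tokens, generated):
    return SCRIPT[len(generated)]


rng = random.Random(0)
image = [[(r + c) % 2 for c in range(8)] for r in range(8)]    # 8x8 checkerboard
patches = patchify(image, patch=4)                               # 4 patches, 16 values each
layer1, layer2 = make_linear(rng, 16, 32), make_linear(rng, 32, 8)
visual_tokens = [adapter(p, layer1, layer2) for p in patches]    # 4 vectors of size 8
svg = "".join(greedy_decode(scripted_model, visual_tokens, max_new_tokens=16))
print(len(visual_tokens), len(visual_tokens[0]))
print(svg)
```

In a trained model the next-token function would be the language model itself, scoring the whole vocabulary at each step given the visual tokens and the SVG prefix. The loop structure is the part this sketch is meant to convey.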

Project links for StarVector

Project official website: https://starvector.github.io/
GitHub repository: https://github.com/joanrod/star-vector
arXiv technical paper: https://arxiv.org/pdf/2312.11556

Application scenarios of StarVector

Icon Generation: Quickly generate SVG icons based on text descriptions or image inputs for use in web navigation bars, buttons, etc.
Art Creation: Artists can use StarVector to transform creative sketches or text descriptions into vector artworks for easy subsequent editing and modification.
Animation Production: The generated SVG graphics can serve as the basic elements for animation production and be further developed into dynamic effects.
Programming Education: Students can use StarVector to learn how SVG code is generated and edited, improving their programming and graphic design skills.
Technical Chart Generation: Generate technical charts, such as flowcharts and structure diagrams, based on text descriptions for use in engineering documents and technical specifications.
Data Visualization: Render data as SVG graphics for display on web pages or in reports while keeping the graphics editable and extensible.