Apertus – Switzerland’s First Open-Source Large-Scale Language Model


What is Apertus?

Apertus is Switzerland’s first large-scale, open, multilingual large language model, jointly developed by EPFL, ETH Zurich, and the Swiss National Supercomputing Centre (CSCS). It comes in two sizes, 8B and 70B parameters, both trained on a massive multilingual corpus: roughly 40% of the training data is non-English, covering languages such as Swiss German and Romansh that are often underrepresented in LLMs. Apertus is built on a decoder-only Transformer architecture and uses the new xIELU activation function and the AdEMAMix optimizer. The model is fully open-source, with weights, datasets, and training details all publicly available, enabling users to run it on their own servers while maintaining full control over their data.
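
Because the weights are openly published, the model can be pulled and served with standard open-source tooling. The sketch below uses the Hugging Face transformers library; the repository id is an assumption based on the public release naming and should be checked against the official model card.

```python
# Minimal local-inference sketch for Apertus with Hugging Face transformers.
# The checkpoint id below is an assumption -- verify it on the official model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "swiss-ai/Apertus-8B-Instruct-2509"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half-precision weights for a single large GPU
    device_map="auto",           # place layers on whatever devices are available
)

prompt = "Explain in two sentences why openly released training data matters."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Running the model this way keeps prompts and outputs entirely on local infrastructure, which is the data-control point made above.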

Key Features of Apertus

  • Text Generation: Produces coherent and contextually relevant text based on user prompts.

  • Multilingual Support: Covers more than 1,800 languages, including many low-resource languages previously overlooked in LLMs (a short chat sketch follows this list).

  • Transparency and Openness: Model weights, training data, and details are fully open, giving users autonomy to deploy it on their own servers.

  • Long-Context Handling: Supports contexts of up to 65,536 tokens, making it suitable for long documents and other complex tasks.
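
As a concrete illustration of the text-generation and multilingual features above, here is a minimal chat-style sketch. It assumes the instruction-tuned checkpoint ships a chat template (standard for instruct releases on Hugging Face) and reuses the assumed repository id from the previous example.

```python
# Multilingual chat sketch: a German question asking for an answer in Romansh.
# Assumes the instruct checkpoint provides a chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "swiss-ai/Apertus-8B-Instruct-2509"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    # "What is the capital of Switzerland? Answer in Romansh."
    {"role": "user", "content": "Was ist die Hauptstadt der Schweiz? Antworte auf Rätoromanisch."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=64)
# Print only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```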


Technical Foundations of Apertus

  • Model Architecture: Built on a dense decoder-only Transformer architecture at two scales: 8B (32 layers / 32 attention heads) and 70B (80 layers / 64 attention heads). It integrates the xIELU activation, RMSNorm normalization, RoPE positional encoding, and grouped-query attention to enhance efficiency and long-context handling (a grouped-query attention sketch follows this list).

  • Pretraining Objective: Uses the Goldfish objective, which randomly drops a fraction of tokens from the loss computation so the model is never fully supervised on any exact passage. This mitigates verbatim memorization while preserving downstream task performance (a minimal sketch of the token-dropping loss follows this list). Pretraining data is sourced entirely from publicly available content, respecting opt-out preferences and avoiding copyrighted, non-licensed, toxic, or personally identifiable information.

  • Pretraining Data: Trained on over 15 trillion tokens across more than 1,800 languages, drawn from diverse domains including high-quality web data, code, and math datasets. Filtering honors robots.txt restrictions, removes personally identifiable information, and excludes toxic content. To strengthen multilingual coverage, roughly 40% of the dataset is non-English.

  • Training Process: Uses the AdEMAMix optimizer and a warmup-stable-decay (WSD) learning rate schedule for stable, efficient training. Context length was progressively extended during training, enabling the model to handle sequences of up to 65,536 tokens.

  • Post-Training: Instruction tuning and alignment were applied using the QRPO algorithm, making Apertus better at following instructions and producing safer, more useful, human-aligned outputs.
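
To make the grouped-query attention mentioned in the architecture bullet concrete, here is a minimal PyTorch sketch. Only the query-head count (32 for the 8B model) comes from the description above; the hidden size and the number of key/value heads are illustrative placeholders, not confirmed Apertus hyperparameters.

```python
# Minimal grouped-query attention (GQA) sketch in PyTorch.
# 32 query heads matches the 8B description above; the hidden size (4096)
# and the 8 key/value heads are illustrative placeholders only.
import torch
import torch.nn.functional as F

batch, seq_len, d_model = 2, 16, 4096
n_q_heads, n_kv_heads = 32, 8            # each KV head serves 4 query heads
head_dim = d_model // n_q_heads          # 128

x = torch.randn(batch, seq_len, d_model)
w_q = torch.nn.Linear(d_model, n_q_heads * head_dim, bias=False)
w_k = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
w_v = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)

q = w_q(x).view(batch, seq_len, n_q_heads, head_dim).transpose(1, 2)
k = w_k(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)
v = w_v(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)

# Each group of query heads shares one key/value head: expand K and V so the
# shapes line up, then run ordinary causal attention.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # PyTorch >= 2.0
out = out.transpose(1, 2).reshape(batch, seq_len, n_q_heads * head_dim)
print(out.shape)  # torch.Size([2, 16, 4096])
```

Sharing each key/value head across a group of query heads shrinks the KV cache, which is part of what makes long contexts such as 65,536 tokens cheaper to serve.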

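The token-dropping idea behind the Goldfish objective can also be sketched in a few lines: per-token cross-entropy is computed as usual, and a random subset of target positions is then excluded from the loss, so the model never receives full supervision on any exact passage. The drop rate and the uniform masking used here are illustrative placeholders, not the scheme used to train Apertus.

```python
# Sketch of a Goldfish-style pretraining loss: exclude a random subset of
# target tokens from the cross-entropy so exact sequences are never fully
# supervised. The drop rate below is an illustrative placeholder.
import torch
import torch.nn.functional as F

def goldfish_style_loss(logits, targets, drop_rate=0.25):
    """logits: (batch, seq, vocab); targets: (batch, seq) next-token ids."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).view(targets.shape)
    # Keep each token's loss with probability (1 - drop_rate); dropped positions
    # contribute nothing, which limits verbatim memorization of the data.
    keep = (torch.rand(targets.shape, device=targets.device) >= drop_rate).float()
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)

# Tiny usage example with random tensors.
logits = torch.randn(2, 8, 100)   # (batch, seq, vocab)
targets = torch.randint(0, 100, (2, 8))
print(goldfish_style_loss(logits, targets))
```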

Application Scenarios of Apertus

  • Multilingual Dialogue Systems: Building multilingual chatbots, customer service tools, and cross-lingual communication systems.

  • Code Generation and Assistance: Generating code snippets from natural language descriptions to boost developer productivity in software engineering.

  • Education and Learning Support: Producing educational content, answering academic questions, and providing study recommendations for online learning platforms and tutoring systems.

  • Content Creation: Assisting with articles, stories, and news reports, offering inspiration and drafting support for creators.

  • Translation Services: Providing high-quality multilingual translation to support cross-cultural information exchange.
