Mu – A lightweight language model developed by Microsoft


What is Mu?

Mu is a lightweight language model developed by Microsoft with only 330 million parameters, optimized for efficient deployment on NPUs and edge devices. Built on an encoder-decoder architecture, Mu combines hardware-aware optimization, model quantization, and task-specific fine-tuning to achieve response speeds exceeding 100 tokens per second. It is integrated into Windows Settings, letting users control system settings with natural language commands such as adjusting screen brightness or changing the mouse pointer size. Despite having roughly one-tenth the parameters of Phi-3.5-mini, Mu delivers comparable performance. Its architectural refinements include Dual LayerNorm, Rotary Positional Embeddings (RoPE), and Grouped-Query Attention (GQA), which improve training stability and inference efficiency.


Key Features of Mu

  • System Settings Adjustment: Users can modify system settings using natural language commands like “Make the mouse pointer larger” or “Adjust the screen brightness.”

  • Low-Latency Responses: Mu offers fast, on-device responses at speeds exceeding 100 tokens per second, ensuring smooth user interactions.

  • Integrated in Windows Settings: Mu is embedded within the Windows Settings search bar, allowing users to input natural language commands that are automatically interpreted and executed.

  • Support for Hundreds of Settings: Mu can manage a wide range of system configurations, covering most everyday user needs.


Technical Principles of Mu

  • Encoder-Decoder Architecture: Mu uses an encoder to convert input text into latent representations and a decoder to generate corresponding outputs (see the first sketch after this list).

  • Hardware-Aware Optimization: The model is optimized for NPU deployment by adjusting its architecture and parameter layout to suit hardware parallelism and memory constraints.

  • Model Quantization: Post-training quantization (PTQ) techniques convert floating-point weights and activations into 8-bit or 16-bit integers, significantly reducing memory and computation requirements while maintaining accuracy (see the second sketch after this list).
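
The encoder-decoder split can be made concrete in a few lines of PyTorch. The sketch below uses the library's built-in nn.Transformer with made-up sizes purely for illustration; it is not Mu's actual architecture or configuration.

```python
# Minimal encoder-decoder sketch (illustrative only; sizes and names are
# assumptions, not Mu's actual configuration).
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # The encoder turns the input text into latent representations once;
        # the decoder attends to them while generating the output tokens.
        src = self.embed(src_ids)
        tgt = self.embed(tgt_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=causal)
        return self.lm_head(hidden)

model = TinyEncoderDecoder()
src = torch.randint(0, 32000, (1, 16))  # e.g. "make the mouse pointer larger"
tgt = torch.randint(0, 32000, (1, 8))   # partially generated output tokens
print(model(src, tgt).shape)            # torch.Size([1, 8, 32000])
```

Because the input is encoded once and reused for every generated token, this layout keeps per-token work low, which matches the latency figures cited above.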

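For the quantization step, a textbook symmetric post-training quantization routine shows the idea: map float32 weights to int8 with a single scale factor and dequantize on the fly. This is a generic sketch, not Microsoft's actual PTQ pipeline.

```python
# Symmetric per-tensor post-training quantization to int8 (generic sketch,
# not Microsoft's actual PTQ pipeline).
import torch

def quantize_int8(weight: torch.Tensor):
    # Choose the scale so the largest absolute weight maps to 127.
    scale = weight.abs().max() / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(256, 256)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", (w - w_hat).abs().max().item())
# int8 storage is 4x smaller than float32 and maps well to NPU integer units.
```
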
Transformer Innovations (illustrated together in the sketch after this list):

  • Dual Layer Normalization: Applies LayerNorm before and after each sub-layer to stabilize training by maintaining well-distributed activations.

  • Rotary Positional Embeddings (RoPE): Encodes token positions as rotations applied to queries and keys in the complex plane, giving strong extrapolation to long sequences and avoiding the limitations of absolute positional embeddings.

  • Grouped-Query Attention (GQA): Shares keys and values across attention head groups to reduce parameter and memory usage while maintaining head diversity and model efficiency.
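
A single PyTorch self-attention block can show all three ideas together: a LayerNorm before and after the sub-layer, rotary position embeddings applied to queries and keys, and fewer key/value heads than query heads. All sizes and names below are assumptions for illustration, not Mu's implementation.

```python
# One attention block combining dual LayerNorm, RoPE and grouped-query
# attention (GQA). Generic illustration; all sizes are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim). Rotate channel pairs by a
    # position-dependent angle (rotary positional embedding).
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # (t, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class GQABlock(nn.Module):
    def __init__(self, d_model=256, n_q_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.hd = d_model // n_q_heads
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.q = nn.Linear(d_model, n_q_heads * self.hd, bias=False)
        self.kv = nn.Linear(d_model, 2 * n_kv_heads * self.hd, bias=False)
        self.out = nn.Linear(n_q_heads * self.hd, d_model, bias=False)
        # Dual LayerNorm: one before and one after the attention sub-layer.
        self.norm_in = nn.LayerNorm(d_model)
        self.norm_out = nn.LayerNorm(d_model)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.norm_in(x)                                   # pre-norm
        q = self.q(h).view(b, t, self.n_q, self.hd).transpose(1, 2)
        k, v = self.kv(h).chunk(2, dim=-1)
        k = k.view(b, t, self.n_kv, self.hd).transpose(1, 2)
        v = v.view(b, t, self.n_kv, self.hd).transpose(1, 2)
        q, k = rope(q), rope(k)                               # rotary positions
        # Each group of query heads shares one key/value head.
        repeat = self.n_q // self.n_kv
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, -1)
        return x + self.norm_out(self.out(attn))              # post-norm + residual

x = torch.randn(1, 16, 256)
print(GQABlock()(x).shape)  # torch.Size([1, 16, 256])
```

Sharing keys and values across groups of query heads is what delivers the parameter and memory savings mentioned above, while the separate query heads preserve head diversity.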

Training Techniques (the sketch after this list illustrates the distillation and LoRA steps):

  • Pretraining was conducted using A100 GPUs.

  • Knowledge distillation was performed from the Phi model.

  • Task-specific fine-tuning utilized Low-Rank Adaptation (LoRA) methods to enhance performance in targeted applications.
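
The last two steps can be sketched together in PyTorch: a frozen pretrained layer gains a trainable low-rank (LoRA) update, and the student is trained against softened teacher logits with a KL-divergence distillation loss. The names, rank, and temperature below are assumptions for illustration, not Mu's actual training setup.

```python
# Sketch of LoRA fine-tuning combined with a distillation loss from a larger
# teacher model. Generic illustration; names and hyperparameters are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen full-rank path plus a trainable low-rank update (B @ A).
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between softened teacher and student distributions.
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

# Toy usage: adapt one projection layer and match a (random) "teacher".
layer = LoRALinear(nn.Linear(256, 1000))
x = torch.randn(4, 256)
teacher_logits = torch.randn(4, 1000)
loss = distill_loss(layer(x), teacher_logits)
loss.backward()  # only the LoRA matrices A and B receive gradients
```

Because only the small A and B matrices are updated, task-specific fine-tuning stays cheap while the distilled base weights are left untouched.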


Application Scenarios for Mu

  • System Settings Adjustment: Mu interprets natural language commands to quickly modify Windows settings, eliminating the need for manual navigation and improving usability.

  • Real-Time Interaction: With on-device response speeds of over 100 tokens per second, Mu is well suited to interactive, real-time applications.

  • Multilingual Support: Mu supports various natural languages, accurately interpreting and executing commands in different linguistic contexts.

  • Accessibility Enhancement: Mu empowers users with visual or motor impairments to control settings through voice commands, enhancing system accessibility.

  • Future Expansion: With strong scalability, Mu has the potential to evolve into a more general-purpose assistant capable of managing schedules, handling files, and executing broader tasks.
