PC Agent-E: An intelligent agent training framework jointly launched by Shanghai Jiao Tong University and SII
What is PC Agent-E?
PC Agent-E is an efficient agent training framework jointly developed by Shanghai Jiao Tong University and SII. Starting from only 312 human-annotated computer-use trajectories, the framework uses the Claude 3.7 Sonnet model to synthesize diverse action decisions, which improves data quality. It consists of four key components: trajectory collection, thought completion, trajectory boost, and agent training. PC Agent-E achieves a 241% performance improvement on the WindowsAgentArena-V2 benchmark and surpasses Claude 3.7 Sonnet with extended thinking, making it the new state-of-the-art (SOTA) open-source PC agent for Windows systems.
Key Features of PC Agent-E
- Efficient Training: Requires only 312 human-labeled trajectories and improves model performance through data augmentation.
- Cross-Platform Generalization: Shows strong cross-platform generalization on the OSWorld benchmark, indicating that it adapts to other operating systems.
- Task Execution: Performs a variety of complex tasks such as file operations, software usage, and web browsing.
- Data Augmentation: Synthesizes diverse action decisions to enrich trajectory data and improve model generalization.
Technical Principles of PC Agent-E
- Trajectory Collection: Uses the PC Tracker tool to record human operation trajectories, including task descriptions, screen captures, and keyboard/mouse actions. A simple annotation process yields a small but high-quality set of human demonstrations.
- Thought Completion: Uses the Claude 3.7 Sonnet model to add the underlying reasoning to each action step. Given the task description, the action history, and the current state, the model generates a human-like chain of thought for the recorded action.
- Trajectory Boost: Uses the Claude 3.7 Sonnet model to generate multiple action decisions for each step of a trajectory. This captures the diversity of valid ways to carry out a task and enriches the dataset.
- Agent Training: Trains the agent on top of the open-source Qwen2.5-VL-72B model using a streamlined end-to-end framework.
- Evaluation and Validation: Performance is validated on benchmarks such as WindowsAgentArena-V2 and OSWorld. Varying the number of synthesized actions confirms that trajectory augmentation has a critical impact on performance.
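The data flow described above (collect, complete thoughts, boost, then build training samples) can be illustrated with a minimal Python sketch. This is not the project's actual code: the `Step` and `Trajectory` classes, the prompt wording, the `stub_model` function standing in for Claude 3.7 Sonnet, and the rule that every human or synthesized decision becomes one training sample are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


@dataclass
class Step:
    screenshot: str      # path of the screen capture seen before acting
    action: str          # recorded human action, e.g. "click(120, 48)"
    thought: str = ""    # filled in by thought completion
    alternatives: List[str] = field(default_factory=list)  # synthesized decisions


@dataclass
class Trajectory:
    task: str
    steps: List[Step]


def build_context(traj: Trajectory, i: int) -> str:
    """Task description, action history, and current state for step i."""
    history = "; ".join(s.action for s in traj.steps[:i]) or "(none)"
    return (f"Task: {traj.task}\n"
            f"History: {history}\n"
            f"Screenshot: {traj.steps[i].screenshot}")


def complete_thoughts(traj: Trajectory, model: Model) -> None:
    """Thought completion: ask the model why the human took each action."""
    for i, step in enumerate(traj.steps):
        prompt = build_context(traj, i) + f"\nAction taken: {step.action}\nExplain why."
        step.thought = model(prompt)


def boost(traj: Trajectory, model: Model, n: int) -> None:
    """Trajectory boost: sample n extra action decisions for every step."""
    for i, step in enumerate(traj.steps):
        prompt = build_context(traj, i) + "\nPropose a thought and the next action."
        step.alternatives = [model(prompt) for _ in range(n)]


def to_training_samples(traj: Trajectory) -> List[Tuple[str, str]]:
    """Assumed sample format: one (context, target) pair per decision."""
    samples = []
    for i, step in enumerate(traj.steps):
        ctx = build_context(traj, i)
        samples.append((ctx, f"{step.thought}\n{step.action}"))  # human decision
        samples.extend((ctx, alt) for alt in step.alternatives)  # synthesized ones
    return samples


def stub_model(prompt: str) -> str:
    """Placeholder for a real model call; always returns the same text."""
    return "Thought: stub reasoning\nAction: wait()"


demo = Trajectory("Rename a file", [Step("s0.png", "click(10, 20)"),
                                    Step("s1.png", "type('b.txt')")])
complete_thoughts(demo, stub_model)
boost(demo, stub_model, n=3)
print(len(to_training_samples(demo)))  # 2 steps x (1 human + 3 synthesized) = 8
```

In this sketch the number of synthesized actions `n` is the knob that the evaluation step varies: each step contributes `1 + n` samples, so a small set of human trajectories grows into a much larger training set.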
Project Links for PC Agent-E
- Project Website: https://gair-nlp.github.io/PC-Agent-E/
- GitHub Repository: https://github.com/GAIR-NLP/PC-Agent-E
- HuggingFace Model Hub: https://huggingface.co/henryhe0123/PC-Agent-E
- arXiv Technical Paper: https://arxiv.org/pdf/2505.13909
Application Scenarios for PC Agent-E
- Office Automation: Automatically completes tasks such as document editing and data analysis to improve office productivity.
- Software Testing: Simulates user operations to detect bugs and other issues, improving software quality.
- Educational Assistance: Serves as a virtual teaching assistant that helps students complete computer tasks and provides real-time guidance.
- Assistive Technology: Offers support functions that make computers more accessible to people with disabilities.
- Cross-Platform Compatibility: Enables task execution across different operating systems.