Cradle: Empowering Foundation Agents towards General Computer Control
We introduce Cradle, a framework for General Computer Control (GCC) that enables foundation agents to operate arbitrary software from pixels and natural language alone: screenshots are the only input and keyboard and mouse actions are the only output, which is how humans interact with computers. No task-specific APIs are required.
Motivation
Most existing agent frameworks are brittle: they depend on task-specific APIs, hand-crafted interfaces, or structured environment observations. This severely limits their applicability to the real, messy world of commercial software.
Cradle’s core insight: the screen is the universal interface. Every piece of software exposes its state through pixels, and every user action reduces to keyboard and mouse inputs. By treating screenshots as observations and keyboard/mouse commands as actions, a single framework can operate any software without per-application engineering.
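To make the "screen as universal interface" idea concrete, here is a minimal Python sketch of what such an interface could look like. It is an illustration only, not Cradle's actual API: the names `Screenshot`, `KeyPress`, `MouseMove`, `MouseClick`, `Computer`, `RecordingComputer`, and `click_at` are all hypothetical, and the stand-in environment records actions instead of driving a real operating system.

```python
from dataclasses import dataclass
from typing import List, Protocol, Union


@dataclass(frozen=True)
class Screenshot:
    """Observation: raw pixels only (row-major RGB bytes, an assumed layout)."""
    width: int
    height: int
    pixels: bytes


@dataclass(frozen=True)
class KeyPress:
    key: str
    duration_s: float = 0.05


@dataclass(frozen=True)
class MouseMove:
    x: int
    y: int


@dataclass(frozen=True)
class MouseClick:
    button: str = "left"


# Every action is a keyboard or mouse event; nothing is application-specific.
Action = Union[KeyPress, MouseMove, MouseClick]


class Computer(Protocol):
    """Hypothetical environment contract: pixels out, keyboard/mouse in."""

    def capture(self) -> Screenshot: ...

    def execute(self, action: Action) -> None: ...


class RecordingComputer:
    """Stand-in environment that logs actions instead of touching a real OS."""

    def __init__(self, width: int = 4, height: int = 2) -> None:
        self.width = width
        self.height = height
        self.log: List[Action] = []

    def capture(self) -> Screenshot:
        # A blank (all-black) frame with 3 bytes per pixel.
        return Screenshot(self.width, self.height, bytes(self.width * self.height * 3))

    def execute(self, action: Action) -> None:
        self.log.append(action)


def click_at(computer: Computer, x: int, y: int) -> None:
    """Compose a higher-level behavior from the two primitive mouse actions."""
    computer.execute(MouseMove(x, y))
    computer.execute(MouseClick())
```

Because the agent only ever sees `Screenshot` values and emits `Action` values, swapping `RecordingComputer` for a real screen-capture and input-injection backend would not change any agent code, which is the point of the design.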
Framework
Cradle is built around six tightly integrated modules:
- Information Gathering — processes raw screenshots into structured observations
- Self-Reflection — evaluates the outcome of past actions and identifies errors
- Task Inference — infers the current sub-goal from context and memory
- Skill Curation — builds and refines a library of reusable skills from experience
- Action Planning — generates executable action sequences toward the current goal
- Memory — maintains episodic and semantic memory across long interaction horizons
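
One way to picture how the six modules fit together is a single decision step that calls them in the order listed above. The sketch below is a hedged illustration, not Cradle's implementation: the call order, the function signatures, the string-encoded actions, and the stub logic in `demo_modules` are all assumptions, and in a real system each callable would be backed by a foundation model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Skills = Dict[str, List[str]]  # skill name -> action strings (assumed encoding)


@dataclass
class Memory:
    """Memory: past steps (episodic) and named facts (semantic)."""
    episodes: List[dict] = field(default_factory=list)
    facts: Dict[str, str] = field(default_factory=dict)


@dataclass
class Modules:
    gather: Callable[[bytes], dict]                 # Information Gathering
    reflect: Callable[[dict, Memory], str]          # Self-Reflection
    infer_task: Callable[[dict, str, Memory], str]  # Task Inference
    curate: Callable[[str, Memory, Skills], None]   # Skill Curation
    plan: Callable[[str, Skills], List[str]]        # Action Planning


def step(screenshot: bytes, m: Modules, memory: Memory, skills: Skills) -> List[str]:
    """Run one decision step and return the actions to execute."""
    obs = m.gather(screenshot)
    reflection = m.reflect(obs, memory)
    subgoal = m.infer_task(obs, reflection, memory)
    m.curate(subgoal, memory, skills)
    actions = m.plan(subgoal, skills)
    # Store the step so later reflection and inference can look back at it.
    memory.episodes.append(
        {"obs": obs, "reflection": reflection, "subgoal": subgoal, "actions": actions}
    )
    return actions


def demo_modules() -> Modules:
    """Trivial stubs so the loop runs without any model behind it."""

    def gather(screenshot: bytes) -> dict:
        return {"num_bytes": len(screenshot)}

    def reflect(obs: dict, memory: Memory) -> str:
        return "previous action executed" if memory.episodes else "no previous action"

    def infer_task(obs: dict, reflection: str, memory: Memory) -> str:
        return "open_menu"

    def curate(subgoal: str, memory: Memory, skills: Skills) -> None:
        # Register a placeholder skill the first time a sub-goal appears.
        skills.setdefault(subgoal, ["press:esc"])

    def plan(subgoal: str, skills: Skills) -> List[str]:
        return list(skills[subgoal])

    return Modules(gather, reflect, infer_task, curate, plan)
```

Calling `step(b"\x00" * 12, demo_modules(), Memory(), {})` walks through all five processing modules once and appends one episode to memory.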
The skill curation module is key to self-improvement: as the agent operates, it distills successful interaction patterns into reusable skills, progressively expanding its capabilities on new tasks without retraining.
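The idea of distilling successful interaction patterns into reusable skills can be sketched with a small library that promotes an action sequence to a named skill once it has succeeded often enough. This is a simplified stand-in rather than Cradle's skill curation module: the `SkillLibrary` class, the `min_successes` threshold, and the string-encoded actions are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


class SkillLibrary:
    """Hypothetical skill store: promote sequences that succeed repeatedly."""

    def __init__(self, min_successes: int = 2) -> None:
        self.min_successes = min_successes
        self._successes: Counter = Counter()
        self.skills: Dict[str, Tuple[str, ...]] = {}

    def record(self, name: str, actions: Sequence[str], succeeded: bool) -> None:
        """Log one attempt; failed attempts never contribute to a skill."""
        if not succeeded:
            return
        key = (name, tuple(actions))
        self._successes[key] += 1
        if self._successes[key] >= self.min_successes:
            # The sequence has proven itself; make it available for reuse.
            self.skills[name] = tuple(actions)

    def lookup(self, name: str) -> List[str]:
        """Return the stored actions for a skill, or an empty list if unknown."""
        return list(self.skills.get(name, ()))
```

For example, recording `["press:m"]` as a successful way to achieve `"open_map"` twice makes `lookup("open_map")` return that sequence, so a planner could reuse it on a new task without any retraining.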
Results
Cradle was evaluated across both commercial games and real productivity software:
| Domain | Application | Highlight |
|---|---|---|
| AAA Game | Red Dead Redemption 2 | First agent to follow storylines and complete 40-minute missions |
| Simulation | Cities: Skylines | City planning and management tasks |
| Life Sim | Stardew Valley | Multi-day farming and harvest sequences |
| Trading | Dealer’s Life 2 | 93.6% transaction completion rate |
| Browser | Chrome | Web navigation and form filling |
| Email | Outlook | Email composition and management |
| Video | CapCut | Video editing workflows |
On the OSWorld benchmark, Cradle achieves a 7.81% success rate without relying on any internal APIs, evidence that the approach generalizes beyond the applications listed above.
Links
- Paper: arXiv:2403.03186
- GitHub: BAAI-Agents/Cradle
- Project Page: baai-agents.github.io/Cradle