Google Enters CUA Battlefield with Gemini 2.5 Computer Use: AI Directly Controls Browser
Google's Computer Use model is here!
Early this morning, Google DeepMind announced the release of Gemini 2.5 Computer Use, a computer-use model built on Gemini 2.5.
Given that Google released the Chrome DevTools MCP server just days ago, the arrival of Gemini 2.5 Computer Use is not particularly surprising. Simply put, much like OpenAI's Computer-Using Agent (CUA), DeepMind's model lets AI directly control the user's browser. Drawing on its visual understanding and reasoning capabilities, the model can click, scroll, and type in the browser on the user's behalf.
Official Demonstrations
Let's look at two official demonstrations:
Prompt: From https://tinyurl.com/pet-care-signup, get all details for any pet with a California residency and add them as a guest in my spa CRM at https://pet-luxe-spa.web.app/. Then, set up a follow up visit appointment with the specialist Anima Lavar for October 10th anytime after 8am. The reason for the visit is the same as their requested treatment.
Prompt: My art club brainstormed tasks ahead of our fair. The board is chaotic and I need your help organizing the tasks into some categories I created. Go to sticky-note-jam.web.app and ensure notes are clearly in the right sections. Drag them there if not.
https://www.youtube.com/watch?v=slOLc1nkKY0
As we can see, whether collecting information from the web and acting on it, or organizing messy sticky notes, Gemini 2.5 Computer Use completed the tasks accurately and at considerable speed.
Performance Benchmarks
On relevant benchmarks, Gemini 2.5 Computer Use achieves state-of-the-art (SOTA) performance:

The model is also faster than competing models:

Currently, developers can access these capabilities through the Gemini API in Google AI Studio and Vertex AI. Users can also try it in Browserbase's hosted demo environment (limited to workflows of up to 5 minutes, with no support for user takeover): https://gemini.browserbase.com/
Real-World Testing
MockSphere ran several tests in this demo environment. Overall, Gemini 2.5 Computer Use is highly accurate on simple tasks but tends to fail once a task gets slightly more complex.
For example, when executing simple tasks like "find the John Wick page on Wikipedia," the model performed very successfully.
However, as soon as the task becomes slightly more complex, the model fails. For example, it did not successfully complete "find the John Wick page on Wikipedia, summarize its information, and provide a Chinese version," nor tasks like "open the official Nobel Prize website and provide this year's announcement schedule."
System Card Release
DeepMind has also released the Gemini 2.5 Computer Use system card:
Gemini 2.5 Computer Use Model Card

How Gemini 2.5 Computer Use Works
The model's core capability is exposed through the new computer_use tool in the Gemini API, which developers run in a loop.
The input should include:
- User requests
- Screenshots of the current environment
- History of recently executed actions
The input can also specify UI actions to exclude from the default supported set, or add custom functions.
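The inputs above can be sketched as a simple request-building helper. This is purely illustrative: the field names and structure are hypothetical and do not reflect the actual Gemini API schema.

```python
# Illustrative sketch of one step's inputs to the computer_use loop.
# Field names here are hypothetical, not the real Gemini API schema.

def build_computer_use_request(user_request, screenshot_png, action_history,
                               excluded_actions=None, custom_functions=None):
    """Assemble the inputs the model needs for a single loop iteration."""
    return {
        "user_request": user_request,              # the task in natural language
        "screenshot": screenshot_png,              # current environment state
        "action_history": action_history[-10:],    # recently executed actions
        "excluded_actions": excluded_actions or [],    # default UI actions to disable
        "custom_functions": custom_functions or [],    # developer-defined extras
    }

request = build_computer_use_request(
    "Find the John Wick page on Wikipedia",
    b"<png bytes>",
    [{"action": "open_browser"}],
    excluded_actions=["drag_and_drop"],
)
```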
After analyzing these inputs, the model generates a response, typically a function call representing a UI action such as clicking or typing. For certain operations (such as purchases), the model will also request user confirmation. The client then executes the action.
After action execution is completed, the system returns the latest screenshot and current URL as function responses to the model, restarting the loop.
This iterative process continues until the task is completed, an error occurs, or it's terminated due to security mechanisms or user decision.
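The iterative process described above can be simulated end to end with stubs standing in for the model and the client. Everything here (function names, the action format, the "done" signal) is an illustrative assumption, not the real Gemini API:

```python
# Minimal simulation of the agentic loop: model proposes an action,
# client executes it, screenshot/URL are fed back, repeat until done.
# All names and formats are illustrative stand-ins for the real API.

def fake_model(state):
    """Stub model: returns the next UI action as a function call,
    or a 'done' signal once the goal appears in the screenshot."""
    if "result" in state["screenshot"]:
        return {"name": "done"}
    return {"name": "click", "args": {"x": 100, "y": 200}}

def execute_action(action):
    """Stub client: performs the action, returns fresh screenshot + URL."""
    return {"screenshot": "result page", "url": "https://example.com/result"}

def run_loop(model, executor, max_steps=10):
    state = {"screenshot": "initial page", "url": "https://example.com"}
    history = []
    for _ in range(max_steps):
        action = model({"screenshot": state["screenshot"], "history": history})
        if action["name"] == "done":
            break  # task completed
        # In the real system, high-risk actions (e.g. purchases) would
        # require explicit user confirmation before this point.
        state = executor(action)   # client executes the UI action
        history.append(action)     # feed the result back; loop restarts
    return history

history = run_loop(fake_model, execute_action)
```

The `max_steps` cap mirrors the termination conditions in the text: the loop also ends on errors, safety triggers, or user decision, not only on success.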
Google states that the current Gemini 2.5 Computer Use model is mainly optimized for web browsers, but also shows strong potential in mobile UI control. However, it has not yet been optimized for desktop operating system-level control.
Security Mechanism Design
Google also shared their security mechanism design for this model in their blog.
Google stated: "Building agents responsibly is the only way to make AI benefit everyone. AI agents that can directly operate computers bring unique risks, including malicious user usage, unexpected model behavior, and prompt injection and fraud in web environments. Therefore, we place great emphasis on security protection in our design."
In the Gemini 2.5 Computer Use model, Google directly integrates security mechanisms during the training phase to address three main types of risks (detailed in the system card).
Additionally, Google provides developers with safety controls to prevent the model from automatically performing potentially high-risk or harmful operations, such as:
- Damaging system integrity
- Endangering security
- Bypassing CAPTCHAs
- Controlling medical devices
Google's implemented control measures include:
- Per-step Safety Service: During inference, an independent safety service evaluates each action the model proposes before it is executed.
- System Instructions: Developers can set that agents must refuse or request user confirmation before specific high-risk operations.
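The two control measures above amount to classifying each proposed action before the client executes it. Here is a hedged sketch of such a per-step check; the category names mirror the examples in this article, but the logic is illustrative and not Google's actual implementation:

```python
# Illustrative per-step safety check: classify each proposed action
# before execution. Categories follow the article's examples; the
# implementation is a sketch, not Google's actual safety service.

REFUSED = {"bypass_captcha", "compromise_system_integrity",
           "control_medical_device"}
NEEDS_CONFIRMATION = {"purchase"}  # e.g. purchasing behavior

def review_action(action_name):
    """Return how the client should handle a proposed action."""
    if action_name in REFUSED:
        return "refuse"
    if action_name in NEEDS_CONFIRMATION:
        return "require_user_confirmation"
    return "allow"
```

A real deployment would let developers extend these sets via system instructions, which is the second control measure described above.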
GitHub link: https://github.com/google/computer-use-preview
Conclusion
Google DeepMind's high-profile entry with Gemini 2.5 Computer Use not only demonstrates leading performance on multiple benchmarks but also pushes competition in the AI agent field to a fever pitch.
From OpenAI to Anthropic, and now Google, the tech giants are racing to define the future of how we interact with computers. Although current models still look immature on complex real-world tasks, that is exactly what the eve of a technology shift looks like. What we see today is not just a new model but a clear signal: the dominance of the keyboard and mouse is being challenged, and an era of driving the digital world directly through natural language is fast approaching.