GPT-5.4 Achieves 75% on OSWorld: AI Now Rivals Human Desktop Productivity

The Transition From Assistant to Agent

GPT-5.4's Thinking variant has surpassed the human baseline on the OSWorld-Verified desktop-task benchmark, scoring 75.0%, a 27.7 percentage point increase over GPT-5.2. This native computer-use capability at the operating system level lets GPT-5.4 act as an autonomous agent, navigating files, browsers, and terminal interfaces with minimal human intervention.

What This Means: OpenAI unveiled GPT-5.4 with a 1-million-token context window and the ability to autonomously execute multi-step workflows across software environments. On the OSWorld-Verified benchmark, which simulates real desktop productivity tasks, the model scored 75.0%, slightly above the human baseline of 72.4%. It also matched or exceeded professional performance on a majority of knowledge-work scenarios, marking a shift from AI as a chat tool to AI as a collaborator.

The Hidden Story: This is more than a capability leap; it changes who operates the computer. Humans used to control the machine; now the AI can. For knowledge workers, the stakes are existential: entire categories of labor, including data entry, document processing, and routine coding, are becoming obsolete faster than anyone predicted.

Contradictions: While OpenAI is automating desktops, Anthropic is building 10-trillion-parameter models for reasoning, and Google is pursuing efficiency. The leading labs do not agree on what matters most for the next phase.
