Mercor Published Research on AI Agents, Then Started Quietly Deleting Their Tweets About It

09 February 2026
Doyle Irvin

Mercor Raised $350M at a $10B Valuation, Published Peer-Reviewed Research on AI Agents, Then Started Quietly Deleting Their Tweets About It

Here's what happened.

Mercor partnered with consultants from McKinsey, BCG, Deloitte, Goldman Sachs, and top law firms to build APEX-Agents, a benchmark testing how well AI agents complete real professional deliverables in Google Workspace.

The "Can AI Do It" Results Were Bad

  • Gemini 3 Flash: 24% success rate
  • GPT-5.2: 23%
  • Claude Opus 4.5: 18.4%

Even the best models failed roughly 75% of actual workflows. Fewer than 25% of tasks were completed on the first attempt, and success rates stayed under 40% even after multiple retries.
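To make the arithmetic behind those figures explicit, here is a minimal sketch that turns the three success rates quoted above into the failure rates they imply. The dictionary and function names are illustrative, not taken from the paper, and the only inputs are the numbers listed in this post.

```python
# Success rates as quoted in this post (percent of tasks completed).
success_rates = {
    "Gemini 3 Flash": 24.0,
    "GPT-5.2": 23.0,
    "Claude Opus 4.5": 18.4,
}


def failure_rate(success_pct: float) -> float:
    """Return the failure rate implied by a success rate, in percent."""
    return round(100.0 - success_pct, 1)


# Hypothetical helper structures, for illustration only.
failure_rates = {model: failure_rate(pct) for model, pct in success_rates.items()}
best_model = max(success_rates, key=success_rates.get)

for model, pct in failure_rates.items():
    print(f"{model}: failed {pct}% of tasks")
print(f"Best performer: {best_model}")
```

By this calculation the top score of 24% implies a 76% failure rate, in line with the "roughly three in four" framing.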

The paper is peer-reviewed and still live on arXiv. But Mercor's tweets promoting it? Quietly disappearing. Their biggest customers are the AI labs this research embarrasses.

Yet The Models Are Smart Enough

This research confirms what we've been saying: raw model intelligence doesn't become reliable work output on its own. The models are capable. What's missing is context, workflow design, guardrails, and the ability to spread what works across teams.

Every company is investing in AI, and almost none are seeing the returns they expected. They've tried letting teams sign up for whatever tools they want, and ended up with shadow AI sprawl, security gaps, and no visibility into what's working. They've tried consolidating on an enterprise license of ChatGPT, and wondered what it was actually doing for productivity. They've tried workshops, trainings, prompt libraries, and internal champions.

The problem isn't the AI. It's everything around it. Most employees don't know what's possible and don't have time to find out. A handful of power users figure it out. Everyone else tries a few prompts, hits a wall, and goes back to the old way of working. The 10x productivity gains stay locked inside the 10% who didn't need help in the first place.

The Research Shows The Missing Layer Isn’t Intelligence—It’s Product, Domain Logic, and Orchestration

This is a main argument of the paper: "The missing layer isn't intelligence it seems—it's product, domain logic, and orchestration."

This is why we built elvex. The models are rapidly becoming commodities. Where the rubber meets the road is the app layer—the institutionalization of workflows, input/output, use cases, and more. You need a platform that embeds company context, builds in domain logic, and makes it easy to share winning workflows so adoption compounds across the organization.

When "using AI" feels as natural as "doing your job," that's when real adoption happens. Other platforms give your employees AI. elvex turns them into AI-native employees.

Research paper: https://arxiv.org/pdf/2601.14242

Head of Marketing
elvex