Screenshot from the JobBench website.

ACTION researchers at the University of Washington have released a new evaluation standard for the real-world utility of AI agents. JobBench, now live at job-bench.github.io, applies a human lens to the question of AI in the workplace. Rather than asking what work AI agents can absorb, the benchmark asks which tasks human professionals want AI to absorb, and then measures how well AI models can do that work.

In the real world, an agent's value lies not only in how smart it is but in how reliably it does the work humans find exhausting. The centerpiece of the project is the JobBench Leaderboard, which ranks models by their success rate per action and their resilience to boredom/drift. Early results indicate that even the most "intelligent" frontier models often struggle with the sustained precision that high-volume, tedious tasks require, frequently hallucinating or skipping steps after repeated iterations.

JobBench is a collaborative open-source project dedicated to the rigorous evaluation of AI agents in workplace settings. It focuses on the replication of complex, repetitive professional tasks to ensure that the next generation of AI is both capable and dependable.
