AI Intelligence | Apr 5, 2026

A Google study finds that standard AI benchmarks systematically ignore how humans disagree in evaluations.

The usual three to five human raters per test example are often insufficient for reliable results.

Data Cube AI Editorial | Source: The Decoder

Source Brief

A Google study finds that standard AI benchmarks systematically ignore how humans disagree in evaluations. The usual three to five human raters per test example are often insufficient for reliable results.