Agents That Know When to Ask: New Benchmark Exposes Help-Seeking Gap

Researchers introduced HiL-Bench, a benchmark that measures whether AI agents know when to ask for help rather than guess on incomplete tasks. Current benchmarks reward execution correctness on fully specified problems, masking a critical failure mode: agents that make lucky guesses score the same as those that would have escalated for clarification. The new benchmark surfaces human-validated blockers through progressive exploration and uses Ask-F1, a metric balancing question precision against blocker recall, to measure selective escalation skill. Tests across coding and SQL domains show that frontier models recover only a fraction of their full-information performance when they must decide for themselves whether to ask, but reinforcement learning on shaped Ask-F1 rewards shows that this judgment can be trained.
- Current AI benchmarks are blind to a critical failure mode: agents that guess on incomplete specs score identically to those that would ask for help
- HiL-Bench introduces the Ask-F1 metric to measure selective escalation, capturing the tension between over-asking and silent guessing while resisting gaming by question spam
- Frontier models show a universal judgment gap, recovering only a fraction of full-information performance when deciding whether to escalate, driven by overconfidence, missed uncertainty signals, and imprecise escalation
- RL training on shaped Ask-F1 rewards improves both help-seeking quality and task pass rates in a 32B model, with gains transferring across domains without learning domain-specific heuristics
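
To make the trade-off behind Ask-F1 concrete, here is a minimal sketch. It assumes the metric is the standard harmonic mean (F1) of question precision and blocker recall; the summary above only says the metric balances the two, so the exact formula, the function name `ask_f1`, and its parameters are illustrative assumptions rather than the benchmark's actual definition.

```python
def ask_f1(questions_asked: int, relevant_questions: int,
           blockers_resolved: int, total_blockers: int) -> float:
    """Harmonic mean of question precision and blocker recall.

    Assumed definition for illustration only; HiL-Bench may compute
    Ask-F1 differently.
    """
    # Precision: share of the agent's questions that targeted a real blocker.
    precision = relevant_questions / questions_asked if questions_asked else 0.0
    # Recall: share of the task's blockers that the agent's questions surfaced.
    recall = blockers_resolved / total_blockers if total_blockers else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical task with 3 blockers, and three hypothetical agent behaviours.
silent_guesser = ask_f1(questions_asked=0, relevant_questions=0,
                        blockers_resolved=0, total_blockers=3)
question_spammer = ask_f1(questions_asked=20, relevant_questions=3,
                          blockers_resolved=3, total_blockers=3)
selective_asker = ask_f1(questions_asked=3, relevant_questions=3,
                         blockers_resolved=3, total_blockers=3)

print(f"silent guesser:   {silent_guesser:.2f}")
print(f"question spammer: {question_spammer:.2f}")
print(f"selective asker:  {selective_asker:.2f}")
```

Under this assumed definition, the silent guesser scores zero because it recalls no blockers, the spammer is dragged down by low precision even though it surfaces every blocker, and only the agent that asks exactly the needed questions reaches a perfect score. That is the sense in which a precision-recall balance penalizes both over-asking and silent guessing.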
This work exposes a fundamental gap in how AI agents are evaluated and trained. Benchmarks that reward only execution correctness on well-specified tasks create a false signal of capability, hiding poor judgment under ambiguity. As agents move into real-world deployment where specifications are inherently incomplete, the ability to recognize and escalate uncertainty becomes as critical as raw problem-solving skill.
- Existing benchmarks systematically underestimate agent failure rates by not penalizing confident guessing on ambiguous inputs, meaning current performance claims may not translate to production reliability
- Help-seeking behavior is a learnable skill that can be improved through RL training, suggesting that judgment gaps are not inherent limitations but rather optimization targets that existing training methods have overlooked
- The transfer of help-seeking improvements across domains indicates agents can learn general uncertainty detection rather than memorizing domain-specific escalation rules, pointing toward more robust generalization