TurnBench

A benchmark for evaluating conversational turn-taking: end-of-turn and interruption detection on real annotated two-speaker conversations.


Reference baselines

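Two degenerate models bound the tradeoff on the public dev set: an energy-based voice activity detector (VAD) that fires at every silence, which makes it fast but fooled by most pauses, and a model that never fires. The ideal model combines high recall, a low false-positive rate, and low latency. In the table, EOT is end-of-turn detection, INT is interruption detection, Lat P50 is median latency, and the arrows mark whether higher or lower is better.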

Model	EOT Recall ↑	EOT FP Rate ↓	EOT Latency P50 ↓	INT Recall ↑	INT FP Rate ↓	INT Latency P50 ↓
Energy VAD	0.520	0.551	-120 ms	0.988	0.390	136 ms
No events	0.000	0.000	—	0.000	0.000	—
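
The page does not say how these columns are computed, so the following is only a minimal sketch of one plausible reading, not TurnBench's actual scorer. It assumes a firing counts as a detection when it lands within a tolerance window of an annotated event, that the false-positive rate is the share of distractor moments (for example, mid-turn pauses) that attract a firing, and that latency is firing time minus annotated time, which can be negative when the model fires early. The function name `score`, its arguments, and the `window` parameter are all hypothetical.

```python
from statistics import median


def score(references, distractors, predictions, window=0.25):
    """Score detector firings against annotated events (times in seconds).

    references:  times of true events (e.g. annotated ends of turn)
    distractors: times where firing is an error (e.g. mid-turn pauses)
    predictions: times at which the model fired
    window:      hypothetical matching tolerance around each annotated time
    """
    def nearest(t):
        # Closest firing within the window, or None if there is none.
        hits = [p for p in predictions if abs(p - t) <= window]
        return min(hits, key=lambda p: abs(p - t)) if hits else None

    # Pair every true event with its nearest firing (simplification:
    # one firing may be matched to more than one annotated time).
    matched = [(t, nearest(t)) for t in references]
    latencies_ms = [(p - t) * 1000 for t, p in matched if p is not None]
    false_hits = sum(nearest(t) is not None for t in distractors)

    return {
        "recall": len(latencies_ms) / len(references) if references else 0.0,
        "fp_rate": false_hits / len(distractors) if distractors else 0.0,
        "latency_p50_ms": median(latencies_ms) if latencies_ms else None,
    }


# A model that never fires scores zero recall, zero false positives,
# and has no latency to report.
never = score(references=[2.0, 5.0], distractors=[3.0, 4.0], predictions=[])

# A toy detector that catches one of two events early and trips on one pause.
toy = score(references=[2.0, 5.0], distractors=[3.0, 4.0],
            predictions=[1.875, 3.0])
```

Under these assumptions the never-firing model yields zero recall, zero false positives, and an undefined latency, the same pattern as the "No events" row. The toy detector illustrates how a median latency can be negative, as in the Energy VAD row.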