TurnBench

A benchmark for evaluating conversational turn-taking: end-of-turn and interruption detection on real annotated two-speaker conversations.

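[Figure: timeline of a two-speaker conversation (spk 1, spk 2) over time, marking a mid-turn pause, an end of turn (EOT), an interruption (INT), and backchannels.]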

Reference baselines

Two degenerate models bound the tradeoff on the public dev set: an energy VAD that fires at every silence (fast, but fooled by most pauses) and a model that never fires. The ideal model has high recall, low false-positive rate, and low latency.

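| Model      | EOT Recall | EOT FP | EOT Lat P50 | INT Recall | INT FP | INT Lat P50 |
|------------|------------|--------|-------------|------------|--------|-------------|
| Energy VAD | 0.520      | 0.551  | -120 ms     | 0.988      | 0.390  | 136 ms      |
| No events  | 0.000      | 0.000  |             | 0.000      | 0.000  |             |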
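
To make the two reference baselines concrete, here is a minimal Python sketch of what a detector of each kind might look like. It is an illustration under stated assumptions, not the benchmark's own implementation: the function names, the 20 ms frame size, the -40 dB threshold, and the 200 ms minimum silence are all hypothetical choices, and the benchmark's actual input format and event interface are not specified here.

```python
import math


def energy_vad_events(samples, sample_rate, frame_ms=20,
                      threshold_db=-40.0, min_silence_ms=200):
    """Hypothetical energy-VAD baseline.

    Emits one event time (in seconds) each time a run of low-energy
    frames reaches min_silence_ms. All parameter defaults are
    illustrative assumptions, not values taken from the benchmark.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    min_frames = max(1, round(min_silence_ms / frame_ms))
    events = []
    run = 0  # number of consecutive silent frames seen so far
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Floor the RMS so that digital silence does not hit log10(0).
        level_db = 20 * math.log10(max(rms, 1e-10))
        run = run + 1 if level_db < threshold_db else 0
        if run == min_frames:
            # Fire once per silence run, at the end of the current frame.
            events.append((i + 1) * frame_ms / 1000.0)
    return events


def no_events(samples, sample_rate):
    """Hypothetical 'never fires' baseline: always returns no events."""
    return []


if __name__ == "__main__":
    sr = 16000
    # Synthetic input: 0.5 s tone, 0.5 s silence, repeated twice.
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr // 2)]
    gap = [0.0] * (sr // 2)
    print(energy_vad_events(tone + gap + tone + gap, sr))
    print(no_events(tone + gap + tone + gap, sr))
```

Because this sketch fires on every sufficiently long silence, it cannot tell a mid-turn pause from a true end of turn, which is the kind of false positive the energy VAD row reflects. The never-firing model produces no events at all, so both its recall and its false-positive rate are zero.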