TurnBench
A benchmark for evaluating conversational turn-taking: end-of-turn and interruption detection on real annotated two-speaker conversations.
Reference baselines
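Two degenerate models bound the tradeoff on the public dev set: an energy VAD that fires at every silence (fast, but fooled by most pauses) and a model that never fires (no false positives, but zero recall). The ideal model has high recall, a low false-positive rate, and low latency. In the table, EOT is end-of-turn detection, INT is interruption detection, and Lat P50 is the median latency.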
| Model | EOT Recall ↑ | EOT FP ↓ | EOT Lat P50 ↓ | INT Recall ↑ | INT FP ↓ | INT Lat P50 ↓ |
|---|---|---|---|---|---|---|
| Energy VAD | 0.520 | 0.551 | -120 ms | 0.988 | 0.390 | 136 ms |
| No events | 0.000 | 0.000 | — | 0.000 | 0.000 | — |
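
To make the "fires at every silence" behaviour concrete, here is a minimal sketch of an energy-based end-of-turn detector. It is an illustration only, not the baseline implementation used for the table: the function names, the 20 ms frame size, the RMS threshold, and the 200 ms silence window are all assumptions chosen for the example.

```python
import math


def frame_rms(samples, frame_len):
    """Return the RMS energy of each non-overlapping frame."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def energy_vad_eot(samples, sample_rate=16000, frame_ms=20,
                   threshold=0.01, min_silence_ms=200):
    """Return times (ms) at which a hypothetical energy VAD fires end-of-turn.

    An event fires once speech has been heard and is then followed by
    min_silence_ms of consecutive low-energy frames. All defaults are
    illustrative assumptions, not TurnBench settings.
    """
    frame_len = sample_rate * frame_ms // 1000
    needed = max(1, min_silence_ms // frame_ms)
    events = []
    in_speech = False
    silent_frames = 0
    for idx, rms in enumerate(frame_rms(samples, frame_len)):
        if rms >= threshold:
            in_speech = True
            silent_frames = 0
        elif in_speech:
            silent_frames += 1
            if silent_frames == needed:
                # Report the end time of the frame that completed the window.
                events.append((idx + 1) * frame_ms)
                in_speech = False
                silent_frames = 0
    return events


def tone(duration_s, sample_rate=16000, freq=200.0, amp=0.5):
    """Synthetic stand-in for speech: a sine tone of the given duration."""
    n = int(duration_s * sample_rate)
    return [amp * math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]


# Two 0.5 s "utterances", each followed by 0.3 s of silence.
silence = [0.0] * int(0.3 * 16000)
signal = tone(0.5) + silence + tone(0.5) + silence
print(energy_vad_eot(signal))  # [700, 1500]
```

The detector fires 200 ms after each burst of energy ends, whether the silence is a real turn boundary or a mid-turn pause. That is why a rule of this kind reacts quickly but accumulates false positives.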