The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP Main 2024)
Bandhav Veluri1,2, Benjamin N Peloquin1, Bokai Yu1,
Hongyu Gong1, Shyam Gollakota2
1Meta AI, 2University of Washington, Seattle
{bandhav, gshyam}@cs.washington.edu, hygong@meta.com
Despite broad interest in modeling spoken dialogue agents, most approaches are inherently "half-duplex" -- restricted to turn-based interaction, with responses requiring explicit prompting by the user or implicit tracking of interruption or silence events. Human dialogue, by contrast, is "full-duplex", allowing for rich synchronicity in the form of quick and dynamic turn-taking, overlapping speech, and backchanneling. Technically, the challenge of achieving full-duplex dialogue with LLMs lies in modeling synchrony, as pre-trained LLMs do not have a sense of "time". To bridge this gap, we propose Synchronous LLMs for full-duplex spoken dialogue modeling. We design a novel mechanism to integrate time information into Llama3-8b so that it runs synchronously with the real-world clock. We also introduce a training recipe that uses 212k hours of synthetic spoken dialogue data generated from text dialogue data to create a model that generates meaningful and natural spoken dialogue, with just 2k hours of real-world spoken dialogue data. Synchronous LLMs outperform state-of-the-art models in dialogue meaningfulness while maintaining naturalness. Finally, we demonstrate the model's ability to participate in full-duplex dialogue by simulating interaction between two agents trained on different datasets, while considering Internet-scale latencies of up to 240 ms.
SyncLLM is an auto-regressive decoder-only transformer model that can function as a full-duplex dialogue agent. In the following figure, at the current time step (chunk N in the figure), SyncLLM's context contains interleaved chunks of the LLM's speech up to the current chunk, and the user's speech corresponding to all but the current chunk. To stay in synchrony with the user, the LLM must generate its next chunk (chunk N+1) before the end of the current chunk. To do so, SyncLLM first generates an estimate of the user's current chunk, which is in turn appended to the context and used to predict its next chunk.
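The inference step described above can be sketched as follows. This is a minimal illustration, not the released implementation: `generate` stands in for an arbitrary autoregressive LLM call (a hypothetical interface taking a token context and a number of tokens to produce), and chunks are plain token lists.

```python
def full_duplex_step(generate, context, chunk_size):
    """One full-duplex inference step of the SyncLLM scheme (sketch).

    generate: callable(context_tokens, n_tokens) -> list of n_tokens new
        tokens; stands in for the autoregressive LLM (assumed interface).
    context: interleaved agent/user token chunks seen so far.
    chunk_size: number of tokens per chunk.
    """
    # The user's current chunk is not yet available, so the model first
    # predicts an estimate of it.
    estimated_user_chunk = generate(context, chunk_size)
    context = context + estimated_user_chunk

    # Conditioned on that estimate, the model generates its own next
    # chunk ahead of the real-time deadline.
    agent_chunk = generate(context, chunk_size)
    return context + agent_chunk, agent_chunk
```

In a running system, the estimated user chunk would be reconciled with the user's actual speech once it arrives before the next step; that correction is omitted here for brevity.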
SyncLLM is trained with a simple next-token prediction objective on full-duplex spoken dialogues formatted as shown below. (Top row) We represent spoken dialogue as interleaved chunks of HuBERT tokens, where the chunk size determines the frequency of the synchronization token [S0]. (Middle row) We train SyncLLM to generate interleaved chunks of deduplicated HuBERT tokens along with periodic synchronization tokens. (Bottom row) We interpolate the deduplicated tokens in each chunk to recover the spoken dialogue sequence in its original format.
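The middle- and bottom-row operations above can be sketched as a pair of helper functions. This is an illustrative sketch only: the deduplication collapses runs of repeated HuBERT tokens within a chunk, and the interpolation shown here is a simple nearest-neighbour stretch back to the chunk length; the paper's exact interpolation scheme may differ.

```python
def deduplicate(chunk):
    """Collapse consecutive runs of identical HuBERT tokens in a chunk."""
    out = []
    for tok in chunk:
        if not out or out[-1] != tok:
            out.append(tok)
    return out


def interpolate(dedup_chunk, target_len):
    """Stretch deduplicated tokens back to the original chunk length by
    repeating each token roughly evenly (nearest-neighbour upsampling)."""
    n = len(dedup_chunk)
    return [dedup_chunk[min(i * n // target_len, n - 1)]
            for i in range(target_len)]
```

For example, a chunk `[5, 5, 7, 7, 7, 2]` deduplicates to `[5, 7, 2]`, which `interpolate(..., 6)` stretches back to a length-6 token sequence.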
Audio samples generated by the following model configurations when provided with a full-duplex spoken dialogue prompt:
| Prompt | Ground-truth continuation | Generated continuation (SyncLLM-F) | Generated continuation (dGSLM) | Generated continuation (SyncLLM-F-C) |
|---|---|---|---|---|