The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP Main 2024)
Bandhav Veluri1,2, Benjamin N Peloquin1, Bokai Yu1,
Hongyu Gong1, Shyam Gollakota2
1Meta AI, 2University of Washington, Seattle
{bandhav, gshyam}@cs.washington.edu, hygong@meta.com
Despite broad interest in modeling spoken dialogue agents, most approaches are inherently "half-duplex" -- restricted to turn-based interaction, with responses requiring explicit prompting by the user or implicit tracking of interruption or silence events. Human dialogue, by contrast, is "full-duplex", allowing for rich synchronicity in the form of quick and dynamic turn-taking, overlapping speech, and backchanneling. Technically, the challenge of achieving full-duplex dialogue with LLMs lies in modeling synchrony, as pre-trained LLMs do not have a sense of "time". To bridge this gap, we propose Synchronous LLMs for full-duplex spoken dialogue modeling. We design a novel mechanism to integrate time information into Llama3-8b so that it runs synchronously with the real-world clock. We also introduce a training recipe that uses 212k hours of synthetic spoken dialogue data generated from text dialogue data to create a model that generates meaningful and natural spoken dialogue, with just 2k hours of real-world spoken dialogue data. Synchronous LLMs outperform state-of-the-art models in dialogue meaningfulness while maintaining naturalness. Finally, we demonstrate the model's ability to participate in full-duplex dialogue by simulating interaction between two agents trained on different datasets, while considering Internet-scale latencies of up to 240 ms.
SyncLLM is an auto-regressive decoder-only transformer model that can function as a full-duplex dialogue agent. In the following figure, at the current time step (chunk N in the figure), SyncLLM's context contains interleaved chunks of the LLM's speech up to the current chunk, and the user's speech corresponding to all but the current chunk. To stay in synchrony with the user, the LLM must generate its next chunk (chunk N+1) before the end of the current chunk. To do so, SyncLLM first generates an estimate of the user's current chunk, which is in turn appended to the context and used to predict its next chunk.
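The inference step described above can be sketched as follows. This is a minimal illustration, not the released implementation: `generate` stands in for an arbitrary autoregressive LLM call (a hypothetical interface taking a token context and a number of tokens to produce), and chunks are plain token lists.

```python
def full_duplex_step(generate, context, chunk_size):
    """One full-duplex inference step of the SyncLLM scheme (sketch).

    generate: callable(context_tokens, n_tokens) -> list of n_tokens new
        tokens; stands in for the autoregressive LLM (assumed interface).
    context: interleaved agent/user token chunks seen so far.
    chunk_size: number of tokens per chunk.
    """
    # The user's current chunk is not yet available, so the model first
    # predicts an estimate of it.
    estimated_user_chunk = generate(context, chunk_size)
    context = context + estimated_user_chunk

    # Conditioned on that estimate, the model generates its own next
    # chunk ahead of the real-time deadline.
    agent_chunk = generate(context, chunk_size)
    return context + agent_chunk, agent_chunk
```

In a running system, the estimated user chunk would be reconciled with the user's actual speech once it arrives before the next step; that correction is omitted here for brevity.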
SyncLLM is trained with a simple next-token prediction objective on full-duplex spoken dialogues formatted as shown below. (Top row) We represent spoken dialogue as interleaved chunks of HuBERT tokens, where the chunk size determines the frequency of the synchronization token [S0]. (Middle row) We train SyncLLM to generate interleaved chunks of deduplicated HuBERT tokens along with periodic synchronization tokens. (Bottom row) We interpolate the deduplicated tokens in each chunk to recover the spoken dialogue sequence in its original format.
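The middle- and bottom-row operations above can be sketched as a pair of helper functions. This is an illustrative sketch only: the deduplication collapses runs of repeated HuBERT tokens within a chunk, and the interpolation shown here is a simple nearest-neighbour stretch back to the chunk length; the paper's exact interpolation scheme may differ.

```python
def deduplicate(chunk):
    """Collapse consecutive runs of identical HuBERT tokens in a chunk."""
    out = []
    for tok in chunk:
        if not out or out[-1] != tok:
            out.append(tok)
    return out


def interpolate(dedup_chunk, target_len):
    """Stretch deduplicated tokens back to the original chunk length by
    repeating each token roughly evenly (nearest-neighbour upsampling)."""
    n = len(dedup_chunk)
    return [dedup_chunk[min(i * n // target_len, n - 1)]
            for i in range(target_len)]
```

For example, a chunk `[5, 5, 7, 7, 7, 2]` deduplicates to `[5, 7, 2]`, which `interpolate(..., 6)` stretches back to a length-6 token sequence.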
Audio samples generated by the following model configurations when provided with a full-duplex spoken dialogue prompt:
| Prompt | Ground-truth continuation | Generated continuation (SyncLLM-F) | Generated continuation (dGSLM) | Generated continuation (SyncLLM-F-C) |
|---|---|---|---|---|