Liquid AI Launches LFM2-Audio-1.5B: A Unified, Low-Latency Audio + Text Foundation Model
Liquid AI has unveiled LFM2-Audio-1.5B, a compact foundation model designed to process audio + text inputs and produce audio + text outputs in a single end-to-end architecture. Unlike traditional stacks that chain ASR → LLM → TTS, LFM2-Audio embeds and generates both modalities within a unified backbone — enabling lower latency and simplified pipelines.
🚀 Key Innovations
Unified Backbone with Disentangled I/O
The model extends Liquid's LFM2 language backbone (≈ 1.2B parameters) to treat audio and text as first-class tokens. Audio inputs are encoded as continuous embeddings taken directly from waveform chunks (~80 ms windows), while audio outputs are generated as discrete codec tokens. This split avoids discretization artifacts on the input side while preserving autoregressive decoding for both modalities.
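To make the input/output asymmetry concrete, here is a minimal numpy sketch of the idea. Only the ~80 ms window, the 8 codebooks, and the 2,049-entry codebook size come from the announcement; the 16 kHz sample rate, the embedding width, and the linear projection standing in for the FastConformer encoder are illustrative assumptions.

```python
# Minimal sketch of disentangled I/O: continuous embeddings in, discrete codec codes out.
# Window size, codebook count, and codebook size come from the announcement; the sample
# rate, embedding width, and linear "encoder" are illustrative stand-ins.
import numpy as np

SAMPLE_RATE = 16_000                         # assumed sample rate
WINDOW = SAMPLE_RATE * 80 // 1000            # ~80 ms window -> 1,280 samples
EMBED_DIM = 2048                             # hypothetical backbone embedding width

def embed_audio_input(waveform: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Continuous path: chunk the waveform and project each chunk to an embedding."""
    n_chunks = len(waveform) // WINDOW
    chunks = waveform[: n_chunks * WINDOW].reshape(n_chunks, WINDOW)
    return chunks @ proj                     # (n_chunks, EMBED_DIM), no input quantization

def sample_audio_output(logits: np.ndarray) -> np.ndarray:
    """Discrete path: one code index per Mimi codebook per generation step."""
    return logits.argmax(axis=-1)            # logits: (8 codebooks, 2049 codes) -> 8 indices

proj = 0.01 * np.random.randn(WINDOW, EMBED_DIM)
audio = np.random.randn(SAMPLE_RATE * 2)     # 2 s of placeholder audio
print(embed_audio_input(audio, proj).shape)  # (25, 2048): 25 windows of ~80 ms
print(sample_audio_output(np.random.randn(8, 2049)))
```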
Audio Encoder + Decoder Modules
- Encoder: a FastConformer (≈ 115M parameters) that encodes raw audio into embeddings
- Decoder: an RQ-Transformer that predicts discrete audio tokens from the Mimi codec's 8 quantized codebooks
- The combined system supports a context length of up to 32,768 tokens (shared across text and audio) and vocabularies of 65,536 text tokens and 2,049 × 8 audio codes (see the illustrative config sketch after this list)
- The model runs in bfloat16 precision and is released under LFM Open License v1.0
- Currently supports English
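For quick reference, the published figures above can be gathered into a single config object. The field names below are ours; only the values come from the release notes.

```python
# Released specs collected into one illustrative config; field names are ours,
# only the numbers come from Liquid's release notes.
from dataclasses import dataclass

@dataclass(frozen=True)
class LFM2AudioSpecs:
    backbone_params: float = 1.2e9       # LFM2 language backbone
    encoder_params: float = 115e6        # FastConformer audio encoder
    context_length: int = 32_768         # shared across text and audio tokens
    text_vocab: int = 65_536
    audio_codebooks: int = 8             # Mimi codec codebooks
    audio_codebook_size: int = 2_049
    dtype: str = "bfloat16"
    languages: tuple = ("en",)

specs = LFM2AudioSpecs()
# Each audio generation step selects one code from each of the 8 codebooks:
print(specs.audio_codebooks * specs.audio_codebook_size)  # 16,392 code slots in total
```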
Two Generation Modes
- Interleaved generation — for live, speech-to-speech conversational agents (alternating text and audio tokens) to reduce perceived latency
- Sequential generation — for classic ASR or TTS use cases (i.e., one input modality followed by one output modality, turn by turn); both modes are sketched below
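The difference between the two modes is easiest to see as a token schedule. The toy generators below are not the liquid-audio API; `generate_step` is a placeholder for the model's real autoregressive sampling step, and a strict text/audio alternation is only an approximation of how interleaving is scheduled in practice.

```python
# Toy token schedules for the two generation modes; generate_step is a placeholder
# for the model's autoregressive sampling step, not the liquid-audio API.

def generate_step(modality: str, i: int) -> str:
    return f"{modality}[{i}]"              # stand-in for a sampled token

def interleaved(n_steps: int):
    """Alternate text and audio tokens so playback can begin almost immediately."""
    for i in range(n_steps):
        yield generate_step("text" if i % 2 == 0 else "audio", i)

def sequential(n_text: int, n_audio: int):
    """Classic ASR/TTS style: finish one modality, then produce the other."""
    for i in range(n_text):
        yield generate_step("text", i)
    for i in range(n_audio):
        yield generate_step("audio", i)

print(list(interleaved(6)))    # ['text[0]', 'audio[1]', 'text[2]', ...]
print(list(sequential(3, 3)))  # all text tokens first, then all audio tokens
```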
Low-Latency Performance
The Liquid AI team reports under 100 ms of latency from a 4 s audio prompt to the model's first audio output, a proxy for perceived responsiveness in interactive settings, and says this outpaces comparable models under their test setup.
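For context, this kind of time-to-first-audio figure can be measured with a simple harness like the one below. `stream_response` is a hypothetical streaming generator, not a liquid-audio function; swap in whichever inference call your stack exposes.

```python
# Generic "time to first audio" measurement for any streaming speech model.
# stream_response is a hypothetical generator yielding (modality, chunk) pairs.
import time

def time_to_first_audio(stream_response, prompt_audio) -> float:
    """Seconds between submitting the prompt and receiving the first audio chunk."""
    start = time.perf_counter()
    for modality, _chunk in stream_response(prompt_audio):
        if modality == "audio":
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")

def dummy_stream(_prompt):
    """Stand-in stream that emits its first audio chunk after ~50 ms."""
    time.sleep(0.05)
    yield ("audio", b"\x00" * 320)

print(f"{time_to_first_audio(dummy_stream, prompt_audio=None) * 1000:.1f} ms")
```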
📊 Benchmarks & Comparisons
Liquid provides evaluations on VoiceBench, a benchmark suite of nine audio-assistant tasks, yielding an overall score of 56.78. Some task-level sub-scores (e.g. AlpacaEval, CommonEval, WildVoice) are shown in the announcement.
In ASR (speech recognition) benchmarks, LFM2-Audio reportedly matches or beats Whisper-large-v3-turbo on datasets such as AMI (15.36 vs. 16.13 WER) and LibriSpeech-clean (2.03 vs. 2.10 WER), where lower is better.
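Those numbers are word error rates, which you can reproduce for your own audio with a standard library such as jiwer; the snippet below shows the metric on a toy pair of transcripts, not the benchmark data itself.

```python
# Word error rate (WER), the metric behind the ASR comparison above; lower is better.
# jiwer is one common library for computing it (pip install jiwer).
import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + insertions + deletions) / reference word count
wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")   # 2 errors over 9 words -> 22.22%
```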
Liquid contrasts its model against larger competitors (e.g. Qwen2.5-Omni 3B, Moshi 7B) in its reported tables.
💡 Why This Matters in Voice AI
Most current "omni" or voice-assistant stacks are modular: ASR → LLM → TTS. These pipelines add interface overhead, accumulate latency at each hop, and can be brittle at the seams between components.
LFM2-Audio's single-backbone approach:
- Reduces glue logic and interface complexity
- Enables interleaved decoding, letting the model begin audio output before the entire response is fully generated
- Provides a unified foundation for ASR, TTS, classification, and conversational agents, all in one model
For developers, this means simpler engineering, lower end-to-end latency, and tighter integration between speech and language tasks.
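The contrast in glue logic looks roughly like this. Every function below is a stub with an invented name; neither path is the liquid-audio API.

```python
# Conceptual contrast between the modular stack and a unified model.
# All functions are stubs with invented names; none of this is the liquid-audio API.

def run_asr(audio: bytes) -> str:            # stub speech recognizer
    return "what's the weather like"

def run_llm(text: str) -> str:               # stub language model
    return "It looks sunny today."

def run_tts(text: str) -> bytes:             # stub speech synthesizer
    return text.encode()

def run_audio_lm(audio: bytes) -> bytes:     # stub unified audio-text model
    return b"...synthesized reply audio..."

def modular_pipeline(audio_in: bytes) -> bytes:
    """ASR -> LLM -> TTS: three components, two hand-offs, latency adds up at each hop."""
    return run_tts(run_llm(run_asr(audio_in)))

def unified_model(audio_in: bytes) -> bytes:
    """One end-to-end model: no hand-offs, and audio can start streaming mid-response."""
    return run_audio_lm(audio_in)

print(modular_pipeline(b"fake mic capture"))
print(unified_model(b"fake mic capture"))
```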
Liquid also provides a Python package (liquid-audio) and a Gradio demo to reproduce these behaviors.
📌 Summary & Outlook
LFM2-Audio-1.5B is an important step toward genuinely unified audio + text AI systems, especially for real-time, on-device, or otherwise resource-constrained settings. By combining continuous-input embeddings with discrete output codes on a single backbone, it sidesteps common drawbacks of pipeline systems. The reported latency and competitive ASR performance make it a credible contender in the voice AI space.
That said, real-world robustness (accents, noise, multilingual expansion) and head-to-head comparisons with other emerging audio-language models, particularly newer openly released ones, will be the key things to watch.