Liquid AI Launches LFM2-Audio-1.5B: A Unified, Low-Latency Audio + Text Foundation Model
Liquid AI has unveiled LFM2-Audio-1.5B, a compact foundation model designed to process audio + text inputs and produce audio + text outputs in a single end-to-end architecture. Unlike traditional stacks that chain ASR → LLM → TTS, LFM2-Audio embeds and generates both modalities within a unified backbone — enabling lower latency and simplified pipelines.
🚀 Key Innovations
Unified Backbone with Disentangled I/O
The model extends Liquid's LFM2 language backbone (≈ 1.2B parameters) to treat audio and text as first-class tokens. Audio inputs are encoded as continuous embeddings taken directly from waveform chunks (~80 ms windows), while audio outputs are generated as discrete codec tokens. This split avoids discretization artifacts on the input side while preserving autoregressive decoding for both modalities.
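To make the input/output asymmetry concrete, here is a minimal numpy sketch of the idea. Only the ~80 ms window, the 8 codebooks, and the 2,049-entry codebook size come from the announcement; the 16 kHz sample rate, the embedding width, and the linear projection standing in for the FastConformer encoder are illustrative assumptions.

```python
# Minimal sketch of disentangled I/O: continuous embeddings in, discrete codec codes out.
# Window size, codebook count, and codebook size come from the announcement; the sample
# rate, embedding width, and linear "encoder" are illustrative stand-ins.
import numpy as np

SAMPLE_RATE = 16_000                         # assumed sample rate
WINDOW = SAMPLE_RATE * 80 // 1000            # ~80 ms window -> 1,280 samples
EMBED_DIM = 2048                             # hypothetical backbone embedding width

def embed_audio_input(waveform: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Continuous path: chunk the waveform and project each chunk to an embedding."""
    n_chunks = len(waveform) // WINDOW
    chunks = waveform[: n_chunks * WINDOW].reshape(n_chunks, WINDOW)
    return chunks @ proj                     # (n_chunks, EMBED_DIM), no input quantization

def sample_audio_output(logits: np.ndarray) -> np.ndarray:
    """Discrete path: one code index per Mimi codebook per generation step."""
    return logits.argmax(axis=-1)            # logits: (8 codebooks, 2049 codes) -> 8 indices

proj = 0.01 * np.random.randn(WINDOW, EMBED_DIM)
audio = np.random.randn(SAMPLE_RATE * 2)     # 2 s of placeholder audio
print(embed_audio_input(audio, proj).shape)  # (25, 2048): 25 windows of ~80 ms
print(sample_audio_output(np.random.randn(8, 2049)))
```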
Audio Encoder + Decoder Modules
- Encoder: a FastConformer (≈ 115M parameters) that encodes raw audio into embeddings
- Decoder: an RQ-Transformer that predicts discrete audio tokens from the Mimi codec's 8 quantized codebooks
- The combined system supports a context length of up to 32,768 tokens (shared across text and audio) and vocabularies of 65,536 text tokens and 2,049 × 8 audio codes (see the illustrative config sketch after this list)
- The model runs in bfloat16 precision and is released under LFM Open License v1.0
- Currently supports English
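For quick reference, the published figures above can be gathered into a single config object. The field names below are ours; only the values come from the release notes.

```python
# Released specs collected into one illustrative config; field names are ours,
# only the numbers come from Liquid's release notes.
from dataclasses import dataclass

@dataclass(frozen=True)
class LFM2AudioSpecs:
    backbone_params: float = 1.2e9       # LFM2 language backbone
    encoder_params: float = 115e6        # FastConformer audio encoder
    context_length: int = 32_768         # shared across text and audio tokens
    text_vocab: int = 65_536
    audio_codebooks: int = 8             # Mimi codec codebooks
    audio_codebook_size: int = 2_049
    dtype: str = "bfloat16"
    languages: tuple = ("en",)

specs = LFM2AudioSpecs()
# Each audio generation step selects one code from each of the 8 codebooks:
print(specs.audio_codebooks * specs.audio_codebook_size)  # 16,392 code slots in total
```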
Two Generation Modes
- Interleaved generation — for live, speech-to-speech conversational agents (alternating text and audio tokens) to reduce perceived latency
- Sequential generation — for classic ASR or TTS use cases (i.e., one input modality followed by one output modality, turn by turn); both modes are sketched below
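The difference between the two modes is easiest to see as a token schedule. The toy generators below are not the liquid-audio API; `generate_step` is a placeholder for the model's real autoregressive sampling step, and a strict text/audio alternation is only an approximation of how interleaving is scheduled in practice.

```python
# Toy token schedules for the two generation modes; generate_step is a placeholder
# for the model's autoregressive sampling step, not the liquid-audio API.

def generate_step(modality: str, i: int) -> str:
    return f"{modality}[{i}]"              # stand-in for a sampled token

def interleaved(n_steps: int):
    """Alternate text and audio tokens so playback can begin almost immediately."""
    for i in range(n_steps):
        yield generate_step("text" if i % 2 == 0 else "audio", i)

def sequential(n_text: int, n_audio: int):
    """Classic ASR/TTS style: finish one modality, then produce the other."""
    for i in range(n_text):
        yield generate_step("text", i)
    for i in range(n_audio):
        yield generate_step("audio", i)

print(list(interleaved(6)))    # ['text[0]', 'audio[1]', 'text[2]', ...]
print(list(sequential(3, 3)))  # all text tokens first, then all audio tokens
```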
Low-Latency Performance
The Liquid AI team reports under 100 ms of latency from a 4 s audio prompt to the model's first audio output, a proxy for perceived responsiveness in interactive settings, and says this outpaces comparable models under their test setup.
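For context, this kind of time-to-first-audio figure can be measured with a simple harness like the one below. `stream_response` is a hypothetical streaming generator, not a liquid-audio function; swap in whichever inference call your stack exposes.

```python
# Generic "time to first audio" measurement for any streaming speech model.
# stream_response is a hypothetical generator yielding (modality, chunk) pairs.
import time

def time_to_first_audio(stream_response, prompt_audio) -> float:
    """Seconds between submitting the prompt and receiving the first audio chunk."""
    start = time.perf_counter()
    for modality, _chunk in stream_response(prompt_audio):
        if modality == "audio":
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")

def dummy_stream(_prompt):
    """Stand-in stream that emits its first audio chunk after ~50 ms."""
    time.sleep(0.05)
    yield ("audio", b"\x00" * 320)

print(f"{time_to_first_audio(dummy_stream, prompt_audio=None) * 1000:.1f} ms")
```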
📊 Benchmarks & Comparisons
Liquid provides evaluations on VoiceBench, a benchmark suite of nine audio-assistant tasks, yielding an overall score of 56.78. Some task-level sub-scores (e.g. AlpacaEval, CommonEval, WildVoice) are shown in the announcement.
In ASR (speech recognition) benchmarks, LFM2-Audio reportedly matches or beats Whisper-large-v3-turbo on datasets such as AMI (15.36 vs. 16.13 WER) and LibriSpeech-clean (2.03 vs. 2.10 WER), where lower is better.
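Those numbers are word error rates, which you can reproduce for your own audio with a standard library such as jiwer; the snippet below shows the metric on a toy pair of transcripts, not the benchmark data itself.

```python
# Word error rate (WER), the metric behind the ASR comparison above; lower is better.
# jiwer is one common library for computing it (pip install jiwer).
import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + insertions + deletions) / reference word count
wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")   # 2 errors over 9 words -> 22.22%
```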
Liquid contrasts its model against larger competitors (e.g. Qwen2.5-Omni 3B, Moshi 7B) in its reported tables.
💡 Why This Matters in Voice AI
Most current "omni" or voice-assistant stacks are modular: ASR → LLM → TTS. These pipelines add interface overhead, accumulate latency at each hop, and can be brittle at the seams between components.
LFM2-Audio's single-backbone approach:
- Reduces glue logic and interface complexity
- Enables interleaved decoding, letting the model begin audio output before the entire response is fully generated
- Provides a unified foundation for ASR, TTS, classification, and conversational agents, all in one model
For developers, this means simpler engineering, lower end-to-end latency, and tighter integration between speech and language tasks.
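The contrast in glue logic looks roughly like this. Every function below is a stub with an invented name; neither path is the liquid-audio API.

```python
# Conceptual contrast between the modular stack and a unified model.
# All functions are stubs with invented names; none of this is the liquid-audio API.

def run_asr(audio: bytes) -> str:            # stub speech recognizer
    return "what's the weather like"

def run_llm(text: str) -> str:               # stub language model
    return "It looks sunny today."

def run_tts(text: str) -> bytes:             # stub speech synthesizer
    return text.encode()

def run_audio_lm(audio: bytes) -> bytes:     # stub unified audio-text model
    return b"...synthesized reply audio..."

def modular_pipeline(audio_in: bytes) -> bytes:
    """ASR -> LLM -> TTS: three components, two hand-offs, latency adds up at each hop."""
    return run_tts(run_llm(run_asr(audio_in)))

def unified_model(audio_in: bytes) -> bytes:
    """One end-to-end model: no hand-offs, and audio can start streaming mid-response."""
    return run_audio_lm(audio_in)

print(modular_pipeline(b"fake mic capture"))
print(unified_model(b"fake mic capture"))
```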
Liquid also provides a Python package (liquid-audio) and a Gradio demo to reproduce these behaviors.
📌 Summary & Outlook
LFM2-Audio-1.5B is an important step toward genuinely unified audio + text AI systems, especially for real-time, on-device, or otherwise resource-constrained settings. By combining continuous-input embeddings with discrete output codes on a single backbone, it sidesteps common drawbacks of pipeline systems. The reported latency and competitive ASR performance make it a credible contender in the voice AI space.
That said, real-world robustness (accents, noise, multilingual expansion) and head-to-head comparisons with other emerging audio-language models, particularly newer openly released ones, will be the key things to watch.