Liquid AI Launches LFM2-Audio-1.5B: A Unified, Low-Latency Audio + Text Foundation Model

Published October 01, 2025 (updated October 07, 2025) · MockSphere News Team · 5 min read · Industry Updates

Liquid AI has unveiled LFM2-Audio-1.5B, a compact foundation model designed to process audio + text inputs and produce audio + text outputs in a single end-to-end architecture. Unlike traditional stacks that chain ASR → LLM → TTS, LFM2-Audio embeds and generates both modalities within a unified backbone — enabling lower latency and simplified pipelines.

🚀 Key Innovations

Unified Backbone with Disentangled I/O

The model extends Liquid's LFM2 language backbone (≈ 1.2B parameters) to treat audio and text as first-class tokens. Audio inputs are encoded as continuous embeddings computed directly from raw waveform chunks in roughly 80 ms windows, while outputs are generated as discrete audio codes. This separation avoids input discretization artifacts while preserving autoregressive decoding for both modalities.
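
A minimal PyTorch-style sketch of this continuous-in / discrete-out split is shown below. The module layout, dimensions, and names are our own illustration under stated assumptions, not Liquid's implementation.

```python
import torch
import torch.nn as nn

class DisentangledAudioLM(nn.Module):
    """Toy backbone: continuous audio embeddings in, discrete text/audio tokens out."""

    def __init__(self, d_model=1024, encoder_dim=512,
                 text_vocab=65_536, audio_vocab=2_049, num_codebooks=8):
        super().__init__()
        # Inputs: text uses an ordinary embedding table; audio arrives as
        # continuous encoder features (a FastConformer in LFM2-Audio) and is
        # only projected, never quantized.
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.audio_in_proj = nn.Linear(encoder_dim, d_model)
        # Shared autoregressive backbone (a plain Transformer stand-in here).
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True),
            num_layers=2,
        )
        # Outputs: a text head plus one head per discrete audio codebook.
        self.text_head = nn.Linear(d_model, text_vocab)
        self.audio_heads = nn.ModuleList(
            [nn.Linear(d_model, audio_vocab) for _ in range(num_codebooks)]
        )

    def forward(self, text_ids, audio_feats):
        # Concatenate both modalities into a single token stream.
        x = torch.cat(
            [self.text_embed(text_ids), self.audio_in_proj(audio_feats)], dim=1
        )
        h = self.backbone(x)
        last = h[:, -1]  # next-token prediction from the final position
        return self.text_head(last), [head(last) for head in self.audio_heads]
```

In the real model the shared backbone is Liquid's LFM2 architecture rather than a vanilla Transformer, and audio code prediction is handled by the RQ-Transformer decoder described below.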

Audio Encoder + Decoder Modules

  • Encoder: a FastConformer (≈ 115M parameters) that encodes raw audio into embeddings
  • Decoder: an RQ-Transformer that autoregressively predicts discrete audio tokens over the Mimi codec's 8 quantized codebooks
  • The combined system supports a context length of up to 32,768 tokens (shared across text and audio) and vocabularies of 65,536 text tokens and 2,049 × 8 audio codes (collected in the config sketch after this list)
  • The model runs in bfloat16 precision and is released under LFM Open License v1.0
  • Currently supports English
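
For quick reference, the reported specs above can be gathered into a small config object; the field names below are ours, not an official API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LFM2AudioSpec:
    """Reported LFM2-Audio-1.5B specs in one place (field names are ours)."""
    backbone: str = "LFM2 language backbone, ~1.2B parameters"
    encoder: str = "FastConformer, ~115M parameters"
    decoder: str = "RQ-Transformer over Mimi codec tokens"
    context_length: int = 32_768       # shared across text and audio tokens
    text_vocab_size: int = 65_536
    audio_vocab_size: int = 2_049      # codes per codebook
    num_codebooks: int = 8             # Mimi quantized codebooks
    precision: str = "bfloat16"
    license: str = "LFM Open License v1.0"
    languages: tuple = ("en",)

spec = LFM2AudioSpec()
print(f"Audio output space: {spec.audio_vocab_size} codes x {spec.num_codebooks} codebooks")
```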

Two Generation Modes

  1. Interleaved generation — for live, speech-to-speech conversational agents (alternating text and audio tokens) to reduce perceived latency
  2. Sequential generation — for classic ASR or TTS use cases (one input modality, then one output modality, turn by turn); the sketch below contrasts the two modes
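
Here is a toy Python sketch of the difference, using a stand-in step function instead of the real model (which emits text tokens and 8-codebook audio codes):

```python
def sequential_generation(step, n=4):
    """ASR/TTS style: produce the full text response, then all audio."""
    text = [step("text") for _ in range(n)]
    audio = [step("audio") for _ in range(n)]
    return text, audio

def interleaved_generation(step, n=4):
    """Conversational style: alternate text and audio tokens, so audio
    playback can begin before the full text response exists."""
    stream = []
    for _ in range(n):
        stream.append(("text", step("text")))
        stream.append(("audio", step("audio")))  # playable immediately
    return stream

# Stand-in "model" that just labels tokens per modality.
counts = {"text": 0, "audio": 0}
def fake_step(modality):
    counts[modality] += 1
    return f"{modality}_{counts[modality]}"

print(interleaved_generation(fake_step, n=2))
# [('text', 'text_1'), ('audio', 'audio_1'), ('text', 'text_2'), ('audio', 'audio_2')]
```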

Low-Latency Performance

The Liquid AI team reports < 100 ms latency from a 4 s audio prompt to the model's first audio output — a metric roughly representing perceived responsiveness in interactive settings. They claim this outpaces comparable models under their setup.
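
As a rough illustration, "time to first audio" can be measured against any streaming client along these lines; the stream_response callable is a hypothetical placeholder for whatever inference API you use.

```python
import time

def time_to_first_audio_ms(stream_response, prompt_audio):
    """Return latency, in milliseconds, from submitting a prompt to
    receiving the first audio chunk from a streaming response."""
    start = time.perf_counter()
    for kind, _chunk in stream_response(prompt_audio):  # yields (kind, chunk) pairs
        if kind == "audio":
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended without producing audio")
```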

📊 Benchmarks & Comparisons

Liquid provides evaluations on VoiceBench, a benchmark suite of nine audio-assistant tasks, yielding an overall score of 56.78. Some task-level sub-scores (e.g. AlpacaEval, CommonEval, WildVoice) are shown in the announcement.

In ASR (speech recognition) benchmarks, LFM2-Audio reportedly matches or beats Whisper-large-v3-turbo on datasets such as AMI (15.36 vs. 16.13 WER) and LibriSpeech-clean (2.03 vs. 2.10 WER); lower WER is better.
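
Word error rate is easy to compute locally when checking transcriptions yourself, for example with the jiwer package (the transcripts below are placeholders):

```python
# pip install jiwer
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / words in the reference
print(f"WER: {jiwer.wer(reference, hypothesis):.4f}")  # 0.2222 (2 errors / 9 words)
```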

Liquid contrasts its model against larger competitors (e.g. Qwen2.5-Omni 3B, Moshi 7B) in its reported tables.

💡 Why This Matters in Voice AI

Most current "omni" or voice-assistant stacks are modular: ASR → LLM → TTS. These pipelines introduce interface overhead, latency, and possible brittleness between components.

LFM2-Audio's single-backbone approach:

  • Reduces glue logic and interface complexity
  • Enables interleaved decoding, letting the model begin audio output before the entire response is fully generated
  • Provides a unified foundation for ASR, TTS, classification, and conversational agents, all in one model

For developers, this means simpler engineering, lower end-to-end latency, and tighter integration between speech and language tasks.

Liquid also provides a Python package (liquid-audio) and a Gradio demo to reproduce these behaviors.
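
To show the shape of such a demo, here is a minimal Gradio wrapper with a placeholder model call. The official liquid-audio API is not reproduced here, so the respond function is purely hypothetical.

```python
# pip install gradio
import gradio as gr

def respond(audio_path: str) -> str:
    """Placeholder: swap in a real call to an LFM2-Audio checkpoint via the
    liquid-audio package (its API is not reproduced here)."""
    return f"(model response for {audio_path})"

demo = gr.Interface(
    fn=respond,
    inputs=gr.Audio(type="filepath", label="Speak or upload audio"),
    outputs=gr.Textbox(label="Model response"),
    title="LFM2-Audio demo sketch",
)

if __name__ == "__main__":
    demo.launch()
```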

📌 Summary & Outlook

LFM2-Audio-1.5B is an important step toward genuinely unified audio + text AI systems, especially for real-time, on-device or resource-constrained settings. By combining continuous-input embeddings with discrete output codes and a unified backbone, it sidesteps common drawbacks of pipeline systems. The latency metrics and competitive ASR performance make it a credible contender in the voice AI space.

That said, real-world robustness (accents, noise, multilingual expansion) and head-to-head comparisons with other emerging audio-language models will be key to watch.
