Realtime Twilio Audio Transformation for Azure Speech Recognition using FFmpeg in .NET
This article is being written and improved.
Table of Contents
- Introduction
- Repository
- Problem: Twilio Audio vs Azure STT Requirements
- Why Not NAudio?
- Why FFmpeg and Why as a Process?
- FFmpeg Command for On-the-Fly Conversion
- Streaming Chunks in .NET: Code Example
- Testing: Quality, Latency, and Edge Cases
- Legal Note
- Conclusion
Introduction
Live speech recognition is a must-have for any advanced automated dialogue system, especially when your app needs to talk to real users in real time rather than passively record calls. Twilio makes it easy to capture live audio from phone calls, but turning that audio into high-quality, actionable text using Azure Speech-to-Text (STT) isn’t trivial. There’s a technical mismatch that developers only notice after running into strange bugs, noise, or low recognition accuracy.
This article walks through the core challenges and a robust solution for streaming Twilio call audio to Azure STT with clean audio, low latency, and no reliance on heavyweight .NET libraries. Instead, we use a proven open source tool, FFmpeg, as an external process. That may feel unusual if you’re used to “pure C#” integrations, but it gives more flexibility, better quality, and cross-platform control.
Repository
This project is a practical companion and code example for the article.
Problem: Twilio Audio vs Azure STT Requirements
Let’s start with the core technical incompatibility.
- Twilio delivers live audio as raw, 8 kHz, 8-bit, mono mu-law, an efficient format for telephony but not suitable for most modern speech recognition engines.
- Azure Speech-to-Text expects PCM WAV: 16-bit, mono, and either 8 kHz or 16 kHz sample rate.
Here’s a quick comparison:
Feature | Twilio Output | Azure STT Required Input | Match? |
---|---|---|---|
Encoding | mu-law (raw) | PCM (WAV, 16-bit) | ❌ |
Sample Rate | 8 kHz | 8 kHz or 16 kHz | ✔️ |
Bit Depth | 8-bit | 16-bit | ❌ |
Container | none (raw bytes) | Raw PCM or WAV | ✔️ |
Channels | mono | mono | ✔️ |
That means you can’t just pipe Twilio’s audio stream directly into Azure STT. Re-encoding from mu-law to 16-bit PCM and proper chunking are required, and this project also resamples to 16 kHz, to get accurate, real-time transcription.
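To make the encoding mismatch concrete, here is a minimal sketch of the standard G.711 mu-law expansion. It is only an illustration: this project delegates the conversion to FFmpeg, and the `MuLaw` class and its method names are hypothetical, not part of the repository.

```csharp
using System;

// Hypothetical helper for illustration; the article's project uses FFmpeg instead.
public static class MuLaw
{
    // Expands one G.711 mu-law byte into a 16-bit linear PCM sample.
    public static short Decode(byte muLaw)
    {
        int u = ~muLaw & 0xFF;            // mu-law bytes are transmitted bit-inverted
        int sign = u & 0x80;              // top bit: sign
        int exponent = (u >> 4) & 0x07;   // next 3 bits: segment number
        int mantissa = u & 0x0F;          // low 4 bits: step within the segment
        int magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
        return (short)(sign != 0 ? -magnitude : magnitude);
    }

    // Decodes a chunk into little-endian 16-bit PCM (2 output bytes per input byte).
    public static byte[] DecodeToPcm16(ReadOnlySpan<byte> muLawChunk)
    {
        var pcm = new byte[muLawChunk.Length * 2];
        for (int i = 0; i < muLawChunk.Length; i++)
        {
            short sample = Decode(muLawChunk[i]);
            pcm[2 * i] = (byte)(sample & 0xFF);
            pcm[2 * i + 1] = (byte)((sample >> 8) & 0xFF);
        }
        return pcm;
    }
}
```

Each mu-law byte is a compressed, logarithmic code rather than an 8-bit PCM sample, which is why the raw bytes cannot be handed to a recognizer that expects linear 16-bit PCM.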
Why Not NAudio?
I noticed a significant drop in recognition accuracy when using NAudio for audio conversion. The audio quality degraded enough to cause noticeable distortions in the transcribed text, and there was audible noise, especially when resampling. These artifacts made live speech recognition unreliable for real-world scenarios.
Tool | Resampling Quality | Artifacts/Noise | Integration Type | License |
---|---|---|---|---|
NAudio | Low/Unstable | Yes (clicks, hiss, artifacts) | Library (can embed in app) | MIT (formerly MS-PL) |
FFmpeg | High | No | External process (must be present in runtime environment) | LGPL/GPL |
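Legal Note: If you use FFmpeg only as an external process (not as a library), your own code is generally not treated as a derivative work of FFmpeg, so its GPL or LGPL terms do not extend to your project, no matter how FFmpeg was built. If you redistribute the FFmpeg binary itself, its license still applies to that binary. For more, see the FFmpeg Legal FAQ.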
Why FFmpeg and Why as a Process?
FFmpeg delivers consistently high-quality audio conversion, making it a better choice for real-time scenarios. But just as important is how we integrate FFmpeg with our .NET app.
By running FFmpeg as an external process rather than embedding it as a library, we gain several critical advantages:
- Licensing: Invoking FFmpeg as a standalone process keeps your own code outside the scope of LGPL/GPL obligations, regardless of how FFmpeg was built. The license still applies to the FFmpeg binary itself if you ship it.
- Integration Simplicity: No need for complex wrappers or native interop; just pass arguments and handle streams.
- Portability & Isolation: FFmpeg runs as a separate executable, making upgrades and troubleshooting easier, and isolating crashes or memory leaks from your main process.
Installing FFmpeg
- macOS: Install via Homebrew:

  ```shell
  brew install ffmpeg
  ```

  If you don’t have Homebrew installed, follow the instructions at brew.sh first.
- Linux (Debian/Ubuntu):

  ```shell
  sudo apt update
  sudo apt install ffmpeg
  ```
- Linux (RHEL/CentOS/Fedora): Enable the EPEL repository if needed, then:

  ```shell
  sudo dnf install epel-release # Only if you don't have EPEL already
  sudo dnf install ffmpeg
  ```

  On older CentOS/RHEL systems, you might need to use RPM Fusion or compile FFmpeg from source.
- Windows: Download the latest static build from the FFmpeg official website or from gyan.dev. Unpack the archive and add the bin directory to your PATH environment variable so that ffmpeg.exe is available in your terminal.
Good luck adding environment variables and fighting with Windows PATH! 😅
Tip for Azure App Service and Cloud Deployments: If you need to run FFmpeg in a cloud environment such as Azure App Service, the most robust approach is to dockerize your application.
This allows you to fully control the runtime environment, package FFmpeg alongside your app, and ensure it works consistently in production.
FFmpeg Command for On-the-Fly Conversion
To convert Twilio’s raw mu-law audio to 16-bit PCM in real time, we launch FFmpeg as a background process and stream the audio through its standard input/output pipes.
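Here’s the FFmpeg command this project is built around. The two -fflags values are written as a single option, because a repeated -fflags without a + prefix replaces the earlier value rather than adding to it:

```shell
ffmpeg \
  -loglevel error \
  -fflags +nobuffer+discardcorrupt \
  -avioflags direct \
  -probesize 32 \
  -analyzeduration 0 \
  -f mulaw \
  -ar 8000 \
  -ac 1 \
  -i pipe:0 \
  -ar 16000 \
  -acodec pcm_s16le \
  -f s16le pipe:1
```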
Explanation
Flag | Purpose |
---|---|
-loglevel error | Show only errors; hides warnings and info for cleaner output |
-fflags nobuffer | Disables internal buffering — lowers latency for live input |
-avioflags direct | Uses direct I/O to reduce buffering delay |
-fflags discardcorrupt | Discards corrupted packets instead of failing the stream |
-probesize 32 | Limits the number of bytes FFmpeg probes to detect format (faster start) |
-analyzeduration 0 | Minimizes the time FFmpeg spends analyzing input (reduces latency) |
-f mulaw | Declares the input format as mu-law |
-ar 8000 | Input audio sample rate is 8 kHz |
-ac 1 | Mono channel input |
-i pipe:0 | Reads input from stdin |
-ar 16000 | Resamples output to 16 kHz |
-acodec pcm_s16le | Sets output encoding to 16-bit signed little-endian PCM |
-f s16le | Declares output as raw PCM stream |
pipe:1 | Sends output to stdout |
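The input and output formats in the table imply a fixed size relationship: every input byte (one 8 kHz mu-law sample) becomes four output bytes (two 16 kHz samples of two bytes each). A small sketch of that arithmetic follows; the `AudioMath` helper is hypothetical and exists only to show the calculation.

```csharp
using System;

// Hypothetical helper, not part of the article's repository.
public static class AudioMath
{
    // Number of bytes occupied by `milliseconds` of uncompressed audio.
    public static int BytesFor(int milliseconds, int sampleRateHz, int bytesPerSample, int channels = 1)
        => sampleRateHz * bytesPerSample * channels * milliseconds / 1000;
}
```

For a 20 ms frame (an illustrative duration), `AudioMath.BytesFor(20, 8000, 1)` gives 160 bytes of mu-law input and `AudioMath.BytesFor(20, 16000, 2)` gives 640 bytes of PCM output. This ratio is worth keeping in mind when sizing read buffers on the .NET side.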
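In .NET, the same conversion is started as a child process with redirected streams:

```csharp
// Requires: using System.Diagnostics;
_ffmpegProcess = new Process
{
    StartInfo = new ProcessStartInfo
    {
        FileName = "ffmpeg", // Or full path to ffmpeg executable
        // Both -fflags values are combined; a repeated plain -fflags would replace the first one.
        Arguments =
            "-loglevel error -fflags +nobuffer+discardcorrupt -avioflags direct -probesize 32 -analyzeduration 0 " +
            "-f mulaw -ar 8000 -ac 1 -i pipe:0 " +          // input: raw 8 kHz mono mu-law on stdin
            "-ar 16000 -acodec pcm_s16le -f s16le pipe:1",  // output: raw 16 kHz 16-bit PCM on stdout
        RedirectStandardInput = true,
        RedirectStandardOutput = true,
        RedirectStandardError = true, // Read stderr somewhere, or FFmpeg can block once the pipe fills up
        UseShellExecute = false,
        CreateNoWindow = true
    }
};
_ffmpegProcess.Start();
```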
Streaming Chunks in .NET: Code Example
Once FFmpeg is running in the background, we interact with it via standard input/output streams. Here’s the flow:
- Twilio sends base64-encoded mu-law audio chunks over WebSocket.
- We decode and buffer the raw bytes.
- Once enough audio is collected, we write it into FFmpeg's stdin.
- The s16le-encoded result is read from stdout and pushed into Azure’s PushAudioInputStream.
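The first two steps can be sketched as follows. This assumes the Twilio Media Streams message shape, where audio arrives in JSON messages with `"event": "media"` and a base64 string under `media.payload`; check the field names against Twilio's documentation before relying on them. The `TwilioMedia` class is a hypothetical name, not code from the repository.

```csharp
#nullable enable
using System;
using System.Text.Json;

// Hypothetical helper for illustration only.
public static class TwilioMedia
{
    // Returns the raw mu-law bytes of a "media" message,
    // or null for any other event (for example "start" or "stop").
    public static byte[]? TryDecodePayload(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var root = doc.RootElement;

        if (!root.TryGetProperty("event", out var ev) || ev.GetString() != "media")
            return null;

        if (!root.TryGetProperty("media", out var media) ||
            !media.TryGetProperty("payload", out var payload))
            return null;

        var base64 = payload.GetString();
        return base64 is null ? null : Convert.FromBase64String(base64);
    }
}
```

The returned bytes are what gets handed to the converter's buffering step; messages that return null can simply be skipped.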
RealtimeAudioConverter.cs (GitHub)
This component handles buffering and communication with FFmpeg:
```csharp
public async Task<byte[]> WriteChunkAsync(byte[] chunk)
{
    if (_ffmpegProcess == null)
        throw new InvalidOperationException("FFmpeg process is not started.");

    // Accumulate incoming mu-law bytes until there is enough to be worth sending.
    _buffer.Write(chunk, 0, chunk.Length);
    if (_buffer.Length < BufferThreshold) return [];

    var dataToSend = _buffer.ToArray();
    _buffer.SetLength(0);

    // Feed the buffered audio to FFmpeg's stdin and flush so it is processed right away.
    await _ffmpegProcess.StandardInput.BaseStream.WriteAsync(dataToSend, 0, dataToSend.Length);
    await _ffmpegProcess.StandardInput.BaseStream.FlushAsync();

    // Read whatever converted PCM is available on stdout, up to 4096 bytes per call.
    // ReadAsync waits until FFmpeg has produced at least one byte.
    var buffer = new byte[4096];
    var bytesRead = await _ffmpegProcess.StandardOutput.BaseStream.ReadAsync(buffer, 0, buffer.Length);
    if (bytesRead > 0)
    {
        return buffer.AsSpan(0, bytesRead).ToArray();
    }
    return [];
}
```
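Because each input byte becomes four output bytes, a single 4096-byte read per write may leave converted audio waiting in the pipe. One possible alternative, sketched here under that assumption and not taken from the repository, is to drain FFmpeg's stdout in a separate loop. `PcmPump` and `onPcm` are hypothetical names.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: a sketch of a background reader, not the repository's implementation.
public static class PcmPump
{
    // Copies everything from `source` (for example FFmpeg's stdout) to `onPcm`
    // until the stream ends or the token is cancelled.
    public static async Task PumpAsync(Stream source, Action<byte[]> onPcm, CancellationToken ct = default)
    {
        var buffer = new byte[4096];
        while (true)
        {
            int n = await source.ReadAsync(buffer.AsMemory(0, buffer.Length), ct);
            if (n == 0) break;                     // end of stream: FFmpeg exited or closed stdout
            onPcm(buffer.AsSpan(0, n).ToArray());  // copy, because `buffer` is reused on the next read
        }
    }
}
```

In this design the pump would be started once after the process starts, for example with `_ffmpegProcess.StandardOutput.BaseStream` as the source and a callback that writes each block to the Azure PushAudioInputStream, while the WebSocket handler only writes to stdin. The trade-off is that output no longer comes back as the return value of the write call.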