Realtime Twilio Audio Transformation for Azure Speech Recognition using FFmpeg in .NET

This article is being written and improved

Introduction

Live speech recognition is a must-have for any advanced automated dialogue system, especially when your app needs to actually talk to real users in real time rather than just passively record calls. Twilio makes it easy to capture live audio from phone calls, but turning that audio into high-quality, actionable text with Azure Speech-to-Text (STT) isn't trivial. There's a technical mismatch that developers often notice only after running into strange bugs, noise, or low recognition accuracy.

This article shows the core challenges and a robust, modern solution for streaming Twilio call audio to Azure STT with high quality, low latency, and no reliance on heavyweight .NET libraries. Instead, we use a proven open-source tool, FFmpeg, as an external process. This may feel unusual if you're used to "pure C#" integrations, but it gives maximum flexibility, quality, and cross-platform control.

Repository

yaroslavyushchenko / TwilioAzureFfmpegRealtime

This project is a practical companion and code example for the article.


Problem: Twilio Audio vs Azure STT Requirements

Let’s start with the core technical incompatibility.

  • Twilio delivers live audio as raw 8 kHz, 8-bit, mono mu-law: an efficient format for telephony, but not suitable for most modern speech recognition engines.
  • Azure Speech-to-Text expects PCM WAV: 16-bit, mono, and either 8 kHz or 16 kHz sample rate.

Here’s a quick comparison:

| Feature | Twilio Output | Azure STT Required Input | Match? |
| --- | --- | --- | --- |
| Encoding | mu-law (raw) | PCM (WAV, 16-bit) | ✖ |
| Sample Rate | 8 kHz | 8 kHz or 16 kHz | ✔️ |
| Bit Depth | 8-bit | 16-bit | ✖ |
| Container | none (raw bytes) | Raw PCM or WAV | ✔️ |
| Channels | mono | mono | ✔️ |

That means you can't just pipe Twilio's audio stream directly into Azure STT: resampling, re-encoding, and proper chunking are all required for accurate, real-time transcription.


Why Not NAudio?

I noticed a significant drop in recognition accuracy when using NAudio for audio conversion. The audio quality degraded enough to cause noticeable distortions in the transcribed text, and there was audible noise, especially when resampling. These artifacts made live speech recognition unreliable for real-world scenarios.

| Tool | Resampling Quality | Artifacts/Noise | Integration Type | License |
| --- | --- | --- | --- | --- |
| NAudio | Low/Unstable | Yes (clicks, hiss, artifacts) | Library (can embed in app) | MS-PL |
| FFmpeg | High | No | External process (must be present in runtime environment) | LGPL/GPL |

Legal Note: If you use FFmpeg only as an external process (not as a linked library), its GPL/LGPL obligations generally do not extend to your own code, regardless of how FFmpeg was built. For details, see the FFmpeg Legal FAQ.


Why FFmpeg and Why as a Process?

FFmpeg delivers consistently high-quality audio conversion, making it a better choice for real-time scenarios. But just as important is how we integrate FFmpeg with our .NET app.

By running FFmpeg as an external process rather than embedding it as a library, we gain several critical advantages:

  • Licensing: Invoking FFmpeg as a standalone process keeps your project free from LGPL/GPL obligations, regardless of how FFmpeg was built or distributed.
  • Integration Simplicity: No need for complex wrappers or native interop: just pass arguments and handle streams.
  • Portability & Isolation: FFmpeg runs as a separate executable, making upgrades and troubleshooting easier, and isolating crashes or memory leaks from your main process.


Installing FFmpeg

  • macOS (Homebrew):
brew install ffmpeg

If you don’t have Homebrew installed, follow the instructions at brew.sh first.

  • Linux (Debian/Ubuntu):
sudo apt update
sudo apt install ffmpeg
  • Linux (RHEL/CentOS/Fedora): Enable EPEL repository if needed, then:
sudo dnf install epel-release  # Only if you don't have EPEL already
sudo dnf install ffmpeg

On older CentOS/RHEL systems, you might need to use RPM Fusion or compile FFmpeg from source.

  • Windows: Download the latest static build from the FFmpeg official website or from gyan.dev.
    Unpack the archive and add the bin directory to your PATH environment variable so that ffmpeg.exe is available in your terminal.
    Good luck adding environment variables and fighting with Windows PATH! 😅

Tip for Azure App Service and Cloud Deployments: If you need to run FFmpeg in a cloud environment such as Azure App Service, the most robust approach is to dockerize your application.
This allows you to fully control the runtime environment, package FFmpeg alongside your app, and ensure it works consistently in production.
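A minimal sketch of such a Dockerfile, assuming a typical ASP.NET Core project on .NET 8 (the base image tags and the TwilioAzureFfmpegRealtime.dll entry point are assumptions for illustration, not taken from the repository):

```dockerfile
# Build stage: compile and publish the app
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY . .
RUN dotnet publish -c Release -o /app

# Runtime stage: install FFmpeg so the app can spawn it as an external process
FROM mcr.microsoft.com/dotnet/aspnet:8.0
RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=build /app .
ENTRYPOINT ["dotnet", "TwilioAzureFfmpegRealtime.dll"]
```

Because ffmpeg ends up on the container's PATH, the C# code can start it with just the executable name, and the same image behaves identically locally and in the cloud.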


FFmpeg Command for On-the-Fly Conversion

To convert Twilio’s raw mu-law audio to 16-bit PCM in real time, we launch FFmpeg as a background process and stream the audio through its standard input/output pipes.

Here’s the exact FFmpeg command used in this project:

ffmpeg \
  -loglevel error \
  -fflags nobuffer \
  -avioflags direct \
  -fflags discardcorrupt \
  -probesize 32 \
  -analyzeduration 0 \
  -f mulaw \
  -ar 8000 \
  -ac 1 \
  -i pipe:0 \
  -ar 16000 \
  -acodec pcm_s16le \
  -f s16le pipe:1

Explanation

| Flag | Purpose |
| --- | --- |
| -loglevel error | Show only errors; hides warnings and info for cleaner output |
| -fflags nobuffer | Disables internal buffering, lowering latency for live input |
| -avioflags direct | Uses direct I/O to reduce buffering delay |
| -fflags discardcorrupt | Discards corrupted packets instead of failing the stream |
| -probesize 32 | Limits the number of bytes FFmpeg probes to detect format (faster start) |
| -analyzeduration 0 | Minimizes the time FFmpeg spends analyzing input (reduces latency) |
| -f mulaw | Declares the input format as mu-law |
| -ar 8000 | Input audio sample rate is 8 kHz |
| -ac 1 | Mono channel input |
| -i pipe:0 | Reads input from stdin |
| -ar 16000 | Resamples output to 16 kHz |
| -acodec pcm_s16le | Sets output encoding to 16-bit signed little-endian PCM |
| -f s16le | Declares output as raw PCM stream |
| pipe:1 | Sends output to stdout |

Here is the same command launched from C# as a background process:
_ffmpegProcess = new Process
{
    StartInfo = new ProcessStartInfo
    {
        FileName = "ffmpeg", // Or full path to ffmpeg executable
        Arguments =
            "-loglevel error -fflags nobuffer -avioflags direct -fflags discardcorrupt -probesize 32 -analyzeduration 0 -f mulaw -ar 8000 -ac 1 -i pipe:0 -ar 16000 -acodec pcm_s16le -f s16le pipe:1",
        RedirectStandardInput = true,
        RedirectStandardOutput = true,
        RedirectStandardError = true,
        UseShellExecute = false,
        CreateNoWindow = true
    }
};

_ffmpegProcess.Start();
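
One caveat with this setup: because StandardError is redirected, FFmpeg can stall once the stderr pipe buffer fills up if nothing ever reads from it. A minimal sketch of draining stderr asynchronously right after Start() (the log prefix is purely illustrative):

```
// Drain stderr continuously so FFmpeg never blocks on a full pipe;
// anything it reports (with -loglevel error, only real errors) gets logged.
_ffmpegProcess.ErrorDataReceived += (_, e) =>
{
    if (!string.IsNullOrEmpty(e.Data))
        Console.Error.WriteLine($"[ffmpeg] {e.Data}");
};
_ffmpegProcess.BeginErrorReadLine();
```

BeginErrorReadLine requires RedirectStandardError = true, which the ProcessStartInfo above already sets.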

Converting chunks: code example

Once FFmpeg is running in the background, we interact with it via standard input/output streams. Here’s the flow:

  1. Twilio sends base64-encoded mu-law audio chunks over WebSocket.
  2. We decode and buffer the raw bytes.
  3. Once enough audio is collected, we write it into FFmpeg's stdin.
  4. The s16le-encoded result is read from stdout and pushed into Azure’s PushAudioInputStream.

RealtimeAudioConverter.cs (GitHub)

This component handles buffering and communication with FFmpeg:

public async Task<byte[]> WriteChunkAsync(byte[] chunk)
{
    if (_ffmpegProcess == null)
        throw new InvalidOperationException("FFmpeg process is not started.");

    // Accumulate incoming mu-law bytes until we have enough to justify a write.
    _buffer.Write(chunk, 0, chunk.Length);
    if (_buffer.Length < BufferThreshold) return [];

    var dataToSend = _buffer.ToArray();
    _buffer.SetLength(0);

    // Feed the buffered mu-law bytes into FFmpeg's stdin and flush
    // so conversion starts immediately.
    await _ffmpegProcess.StandardInput.BaseStream.WriteAsync(dataToSend, 0, dataToSend.Length);
    await _ffmpegProcess.StandardInput.BaseStream.FlushAsync();

    // Read whatever converted 16 kHz, 16-bit PCM is available on stdout.
    // A single ReadAsync may return fewer bytes than FFmpeg eventually
    // produces for this chunk; the remainder is picked up on later calls.
    var buffer = new byte[4096];
    var bytesRead = await _ffmpegProcess.StandardOutput.BaseStream.ReadAsync(buffer, 0, buffer.Length);

    if (bytesRead > 0)
    {
        return buffer.AsSpan(0, bytesRead).ToArray();
    }

    return [];
}
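
To complete the pipeline, the converted PCM has to reach Azure. The following is a hedged sketch rather than the repository's exact code: the Speech SDK calls come from Microsoft.CognitiveServices.Speech, while the Twilio message handling and names such as OnTwilioMediaMessageAsync and converter are illustrative:

```
using System.Text.Json;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Azure expects 16 kHz, 16-bit, mono PCM - exactly what FFmpeg emits above.
var format = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1);
var pushStream = AudioInputStream.CreatePushStream(format);

var speechConfig = SpeechConfig.FromSubscription("<your-key>", "<your-region>");
var audioConfig = AudioConfig.FromStreamInput(pushStream);
var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

recognizer.Recognized += (_, e) =>
{
    if (e.Result.Reason == ResultReason.RecognizedSpeech)
        Console.WriteLine($"Recognized: {e.Result.Text}");
};

await recognizer.StartContinuousRecognitionAsync();

// Called for each Twilio "media" WebSocket message, whose payload field
// carries base64-encoded mu-law audio.
async Task OnTwilioMediaMessageAsync(string json, RealtimeAudioConverter converter)
{
    using var doc = JsonDocument.Parse(json);
    var payload = doc.RootElement.GetProperty("media").GetProperty("payload").GetString();
    var mulawBytes = Convert.FromBase64String(payload!);

    var pcm = await converter.WriteChunkAsync(mulawBytes);
    if (pcm.Length > 0)
        pushStream.Write(pcm); // hand converted audio to Azure STT
}
```

With this wiring, Azure's recognizer pulls audio from the push stream continuously, so transcription results arrive while the call is still in progress.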

Testing: Quality, Latency, and Edge Cases

Conclusion