Realtime Twilio Audio Transformation for Azure Speech Recognition using FFmpeg in .NET

This article is being written and improved

Introduction

Live speech recognition is a must-have for any advanced automated dialogue system, especially when your app needs to actually talk to real users in real time rather than just passively record calls. Twilio makes it easy to capture live audio from phone calls, but turning that audio into high-quality, actionable text with Azure Speech-to-Text (STT) isn't trivial. There's a technical mismatch that developers often notice only after running into strange bugs, noise, or low recognition accuracy.

This article shows the core challenges and a robust, modern solution for streaming Twilio call audio to Azure STT with high quality, low latency, and no reliance on heavyweight .NET libraries. Instead, we use a proven open-source tool, FFmpeg, as an external process. This may feel unusual if you're used to "pure C#" integrations, but it gives maximum flexibility, quality, and cross-platform control.

Repository

yaroslavyushchenko / TwilioAzureFfmpegRealtime

This project is a practical companion and code example for the article.


Problem: Twilio Audio vs Azure STT Requirements

Let’s start with the core technical incompatibility.

  • Twilio delivers live audio as raw 8 kHz, 8-bit, mono mu-law: an efficient format for telephony, but not suitable for most modern speech recognition engines.
  • Azure Speech-to-Text expects PCM WAV: 16-bit, mono, and either 8 kHz or 16 kHz sample rate.

Here’s a quick comparison:

| Feature | Twilio Output | Azure STT Required Input | Match? |
| --- | --- | --- | --- |
| Encoding | mu-law (raw) | PCM (WAV, 16-bit) | ✖ |
| Sample Rate | 8 kHz | 8 kHz or 16 kHz | ✔️ |
| Bit Depth | 8-bit | 16-bit | ✖ |
| Container | none (raw bytes) | Raw PCM or WAV | ✔️ |
| Channels | mono | mono | ✔️ |

That means you can't just pipe Twilio's audio stream directly into Azure STT: resampling, re-encoding, and proper chunking are all required for accurate, real-time transcription.


Why Not NAudio?

I noticed a significant drop in recognition accuracy when using NAudio for audio conversion. The audio quality degraded enough to cause noticeable distortions in the transcribed text, and there was audible noise, especially when resampling. These artifacts made live speech recognition unreliable for real-world scenarios.

| Tool | Resampling Quality | Artifacts/Noise | Integration Type | License |
| --- | --- | --- | --- | --- |
| NAudio | Low/Unstable | Yes (clicks, hiss, artifacts) | Library (can embed in app) | MS-PL |
| FFmpeg | High | No | External process (must be present in runtime environment) | LGPL/GPL |

Legal Note: If you use FFmpeg only as an external process (not as a linked library), its GPL/LGPL obligations generally do not extend to your own code, regardless of how FFmpeg was built. For details, see the FFmpeg Legal FAQ.


Why FFmpeg and Why as a Process?

FFmpeg delivers consistently high-quality audio conversion, making it a better choice for real-time scenarios. But just as important is how we integrate FFmpeg with our .NET app.

By running FFmpeg as an external process rather than embedding it as a library, we gain several critical advantages:

  • Licensing: Invoking FFmpeg as a standalone process keeps your project free from LGPL/GPL obligations, regardless of how FFmpeg was built or distributed.
  • Integration Simplicity: No need for complex wrappers or native interop: just pass arguments and handle streams.
  • Portability & Isolation: FFmpeg runs as a separate executable, making upgrades and troubleshooting easier, and isolating crashes or memory leaks from your main process.


Installing FFmpeg

  • macOS (Homebrew):
brew install ffmpeg

If you don’t have Homebrew installed, follow the instructions at brew.sh first.

  • Linux (Debian/Ubuntu):
sudo apt update
sudo apt install ffmpeg
  • Linux (RHEL/CentOS/Fedora): Enable EPEL repository if needed, then:
sudo dnf install epel-release  # Only if you don't have EPEL already
sudo dnf install ffmpeg

On older CentOS/RHEL systems, you might need to use RPM Fusion or compile FFmpeg from source.

  • Windows: Download the latest static build from the FFmpeg official website or from gyan.dev.
    Unpack the archive and add the bin directory to your PATH environment variable so that ffmpeg.exe is available in your terminal.
    Good luck adding environment variables and fighting with Windows PATH! 😅

Tip for Azure App Service and Cloud Deployments: If you need to run FFmpeg in a cloud environment such as Azure App Service, the most robust approach is to dockerize your application.
This allows you to fully control the runtime environment, package FFmpeg alongside your app, and ensure it works consistently in production.
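A minimal sketch of such a Dockerfile, assuming a typical ASP.NET Core project on .NET 8 (the base image tags and the TwilioAzureFfmpegRealtime.dll entry point are assumptions for illustration, not taken from the repository):

```dockerfile
# Build stage: compile and publish the app
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY . .
RUN dotnet publish -c Release -o /app

# Runtime stage: install FFmpeg so the app can spawn it as an external process
FROM mcr.microsoft.com/dotnet/aspnet:8.0
RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=build /app .
ENTRYPOINT ["dotnet", "TwilioAzureFfmpegRealtime.dll"]
```

Because ffmpeg ends up on the container's PATH, the C# code can start it with just the executable name, and the same image behaves identically locally and in the cloud.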


FFmpeg Command for On-the-Fly Conversion

To convert Twilio’s raw mu-law audio to 16-bit PCM in real time, we launch FFmpeg as a background process and stream the audio through its standard input/output pipes.

Here’s the exact FFmpeg command used in this project:

ffmpeg \
  -loglevel error \
  -fflags nobuffer \
  -avioflags direct \
  -fflags discardcorrupt \
  -probesize 32 \
  -analyzeduration 0 \
  -f mulaw \
  -ar 8000 \
  -ac 1 \
  -i pipe:0 \
  -ar 16000 \
  -acodec pcm_s16le \
  -f s16le pipe:1

Explanation

| Flag | Purpose |
| --- | --- |
| -loglevel error | Show only errors; hides warnings and info for cleaner output |
| -fflags nobuffer | Disables internal buffering, lowering latency for live input |
| -avioflags direct | Uses direct I/O to reduce buffering delay |
| -fflags discardcorrupt | Discards corrupted packets instead of failing the stream |
| -probesize 32 | Limits the number of bytes FFmpeg probes to detect format (faster start) |
| -analyzeduration 0 | Minimizes the time FFmpeg spends analyzing input (reduces latency) |
| -f mulaw | Declares the input format as mu-law |
| -ar 8000 | Input audio sample rate is 8 kHz |
| -ac 1 | Mono channel input |
| -i pipe:0 | Reads input from stdin |
| -ar 16000 | Resamples output to 16 kHz |
| -acodec pcm_s16le | Sets output encoding to 16-bit signed little-endian PCM |
| -f s16le | Declares output as raw PCM stream |
| pipe:1 | Sends output to stdout |

Here is the same command launched from C# as a background process:
_ffmpegProcess = new Process
{
    StartInfo = new ProcessStartInfo
    {
        FileName = "ffmpeg", // Or full path to ffmpeg executable
        Arguments =
            "-loglevel error -fflags nobuffer -avioflags direct -fflags discardcorrupt -probesize 32 -analyzeduration 0 -f mulaw -ar 8000 -ac 1 -i pipe:0 -ar 16000 -acodec pcm_s16le -f s16le pipe:1",
        RedirectStandardInput = true,
        RedirectStandardOutput = true,
        RedirectStandardError = true,
        UseShellExecute = false,
        CreateNoWindow = true
    }
};

_ffmpegProcess.Start();
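
One caveat with this setup: because StandardError is redirected, FFmpeg can stall once the stderr pipe buffer fills up if nothing ever reads from it. A minimal sketch of draining stderr asynchronously right after Start() (the log prefix is purely illustrative):

```
// Drain stderr continuously so FFmpeg never blocks on a full pipe;
// anything it reports (with -loglevel error, only real errors) gets logged.
_ffmpegProcess.ErrorDataReceived += (_, e) =>
{
    if (!string.IsNullOrEmpty(e.Data))
        Console.Error.WriteLine($"[ffmpeg] {e.Data}");
};
_ffmpegProcess.BeginErrorReadLine();
```

BeginErrorReadLine requires RedirectStandardError = true, which the ProcessStartInfo above already sets.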

Converting chunks: code example

Once FFmpeg is running in the background, we interact with it via standard input/output streams. Here’s the flow:

  1. Twilio sends base64-encoded mu-law audio chunks over WebSocket.
  2. We decode and buffer the raw bytes.
  3. Once enough audio is collected, we write it into FFmpeg's stdin.
  4. The s16le-encoded result is read from stdout and pushed into Azure’s PushAudioInputStream.

RealtimeAudioConverter.cs (GitHub)

This component handles buffering and communication with FFmpeg:

public async Task<byte[]> WriteChunkAsync(byte[] chunk)
{
    if (_ffmpegProcess == null)
        throw new InvalidOperationException("FFmpeg process is not started.");

    // Accumulate incoming mu-law bytes until we have enough to justify a write.
    _buffer.Write(chunk, 0, chunk.Length);
    if (_buffer.Length < BufferThreshold) return [];

    var dataToSend = _buffer.ToArray();
    _buffer.SetLength(0);

    // Feed the buffered mu-law bytes into FFmpeg's stdin and flush
    // so conversion starts immediately.
    await _ffmpegProcess.StandardInput.BaseStream.WriteAsync(dataToSend, 0, dataToSend.Length);
    await _ffmpegProcess.StandardInput.BaseStream.FlushAsync();

    // Read whatever converted 16 kHz, 16-bit PCM is available on stdout.
    // A single ReadAsync may return fewer bytes than FFmpeg eventually
    // produces for this chunk; the remainder is picked up on later calls.
    var buffer = new byte[4096];
    var bytesRead = await _ffmpegProcess.StandardOutput.BaseStream.ReadAsync(buffer, 0, buffer.Length);

    if (bytesRead > 0)
    {
        return buffer.AsSpan(0, bytesRead).ToArray();
    }

    return [];
}
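
To complete the pipeline, the converted PCM has to reach Azure. The following is a hedged sketch rather than the repository's exact code: the Speech SDK calls come from Microsoft.CognitiveServices.Speech, while the Twilio message handling and names such as OnTwilioMediaMessageAsync and converter are illustrative:

```
using System.Text.Json;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Azure expects 16 kHz, 16-bit, mono PCM - exactly what FFmpeg emits above.
var format = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1);
var pushStream = AudioInputStream.CreatePushStream(format);

var speechConfig = SpeechConfig.FromSubscription("<your-key>", "<your-region>");
var audioConfig = AudioConfig.FromStreamInput(pushStream);
var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

recognizer.Recognized += (_, e) =>
{
    if (e.Result.Reason == ResultReason.RecognizedSpeech)
        Console.WriteLine($"Recognized: {e.Result.Text}");
};

await recognizer.StartContinuousRecognitionAsync();

// Called for each Twilio "media" WebSocket message, whose payload field
// carries base64-encoded mu-law audio.
async Task OnTwilioMediaMessageAsync(string json, RealtimeAudioConverter converter)
{
    using var doc = JsonDocument.Parse(json);
    var payload = doc.RootElement.GetProperty("media").GetProperty("payload").GetString();
    var mulawBytes = Convert.FromBase64String(payload!);

    var pcm = await converter.WriteChunkAsync(mulawBytes);
    if (pcm.Length > 0)
        pushStream.Write(pcm); // hand converted audio to Azure STT
}
```

With this wiring, Azure's recognizer pulls audio from the push stream continuously, so transcription results arrive while the call is still in progress.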

Testing: Quality, Latency, and Edge Cases

Conclusion