how i built a local-first audio transcription pipeline: privacy-first voice processing in rust
I built a local-first audio processing pipeline in rust that captures, processes, and transcribes audio while respecting privacy. here's how it works:

🎤 audio capture & device management
- supports both input devices (microphones) and output devices (system audio)
- handles multi-channel audio devices through smart channel mixing
- implements device hot-plugging and graceful error handling
- uses tokio channels for efficient async communication
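
the device layer in the repo is more involved than this, but here's a minimal sketch of the idea, assuming the cpal (0.15-style api) and anyhow crates, an f32 input device, and an illustrative AudioChunk type:

use cpal::traits::{DeviceTrait, HostTrait, StreamTrait};
use tokio::sync::mpsc;

// a captured buffer of interleaved samples plus its channel count (illustrative type)
struct AudioChunk {
    samples: Vec<f32>,
    channels: u16,
}

fn spawn_capture(tx: mpsc::Sender<AudioChunk>) -> anyhow::Result<cpal::Stream> {
    let host = cpal::default_host();
    let device = host
        .default_input_device()
        .ok_or_else(|| anyhow::anyhow!("no input device found"))?;
    let config = device.default_input_config()?;
    let channels = config.channels();

    // the callback runs on a realtime audio thread: copy the samples and hand
    // them off without blocking; try_send drops chunks if the pipeline lags
    let stream = device.build_input_stream(
        &config.into(),
        move |data: &[f32], _| {
            let _ = tx.try_send(AudioChunk { samples: data.to_vec(), channels });
        },
        |err| eprintln!("capture error: {err}"),
        None,
    )?;
    stream.play()?;
    Ok(stream) // keep the stream alive for as long as capture should run
}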
🔊 audio processing pipeline
1. channel conversion (small sketches of each step follow this list)
   - converts multi-channel audio to mono using weighted averaging
   - handles various sample formats (f32, i16, i32, i8)
   - resamples in real time to 16khz for whisper compatibility
2. signal processing
   - normalizes audio levels using RMS and peak normalization
   - applies spectral subtraction for noise reduction
   - uses realfft for efficient fourier transforms
   - maintains audio quality while reducing background noise
3. voice activity detection (vad)
   - dual vad engine support: webrtc (lightweight) and silero (more accurate)
   - configurable sensitivity levels (low/medium/high)
   - uses sliding-window analysis for robust speech detection
   - keeps a frame history for better context awareness
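
to make the channel-conversion step concrete, here's roughly what the mono downmix looks like. it's a simplified sketch rather than the exact code from the repo, and it assumes equal per-channel weights:

// downmix interleaved multi-channel samples to mono by averaging each frame;
// real devices sometimes warrant unequal weights (e.g. a silent channel),
// but equal weights keep the example simple
fn downmix_to_mono(samples: &[f32], channels: usize) -> Vec<f32> {
    if channels <= 1 {
        return samples.to_vec();
    }
    samples
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

// convert i16 samples into the f32 range [-1.0, 1.0] before downmixing;
// the same pattern covers the i32 and i8 formats mentioned above
fn i16_to_f32(samples: &[i16]) -> Vec<f32> {
    samples.iter().map(|&s| s as f32 / i16::MAX as f32).collect()
}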
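
the signal-processing step is harder to compress into a snippet, but the RMS/peak normalization part boils down to this (the target level is an illustrative parameter, and the spectral-subtraction pass is omitted here):

// normalize a mono buffer towards a target RMS level, capping the gain so
// the loudest sample never exceeds 1.0 (the peak-normalization safety net)
fn normalize_rms(samples: &mut [f32], target_rms: f32) {
    if samples.is_empty() {
        return;
    }
    let rms = (samples.iter().map(|s| s * s).sum::<f32>() / samples.len() as f32).sqrt();
    if rms <= f32::EPSILON {
        return; // silence, nothing to do
    }
    let peak = samples.iter().fold(0.0_f32, |acc, s| acc.max(s.abs()));
    let gain = (target_rms / rms).min(1.0 / peak);
    for s in samples.iter_mut() {
        *s *= gain;
    }
}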
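
and the sliding-window logic that sits on top of the vad engines looks roughly like this; the per-sensitivity thresholds are made up for the example, only the windowing idea matters:

use std::collections::VecDeque;

// smooths per-frame vad scores over a sliding window so a single noisy frame
// doesn't flip the speech / no-speech decision
struct SlidingVad {
    history: VecDeque<f32>, // recent per-frame speech probabilities
    window: usize,
    threshold: f32, // derived from the configured sensitivity (low/medium/high)
}

impl SlidingVad {
    fn new(window: usize, sensitivity: &str) -> Self {
        // illustrative thresholds; real values are tuned per engine
        let threshold = match sensitivity {
            "high" => 0.3,
            "low" => 0.7,
            _ => 0.5,
        };
        Self { history: VecDeque::with_capacity(window), window, threshold }
    }

    // feed one frame's speech probability (from webrtc or silero) and get the
    // smoothed is-speech decision for the current window
    fn push(&mut self, frame_score: f32) -> bool {
        if self.history.len() == self.window {
            self.history.pop_front();
        }
        self.history.push_back(frame_score);
        let avg = self.history.iter().sum::<f32>() / self.history.len() as f32;
        avg >= self.threshold
    }
}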
🤖 transcription engine
- primary: whisper (tiny/large-v3/large-v3-turbo)
- fallback: deepgram api integration
- smart overlap handling:
// handle cases where consecutive audio chunks cut a sentence in half:
// find where the previous transcript's tail overlaps the current one's head
if let Some((prev_idx, cur_idx)) = longest_common_word_substring(previous, current) {
    // strip the overlapping words and merge the two transcripts
}
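
the real implementation lives in the repo; a simplified word-level longest-common-substring that captures the idea looks like this:

// find the longest run of words shared between `previous` and `current`,
// returning where the run starts in each transcript; a minimum run length
// avoids false matches on common words like "the"
fn longest_common_word_substring(previous: &str, current: &str) -> Option<(usize, usize)> {
    let prev: Vec<&str> = previous.split_whitespace().collect();
    let cur: Vec<&str> = current.split_whitespace().collect();
    let mut best = (0_usize, 0_usize, 0_usize); // (length, prev_start, cur_start)

    // classic dynamic-programming longest common substring, over words
    let mut dp = vec![vec![0_usize; cur.len() + 1]; prev.len() + 1];
    for i in 1..=prev.len() {
        for j in 1..=cur.len() {
            if prev[i - 1].eq_ignore_ascii_case(cur[j - 1]) {
                dp[i][j] = dp[i - 1][j - 1] + 1;
                if dp[i][j] > best.0 {
                    best = (dp[i][j], i - dp[i][j], j - dp[i][j]);
                }
            }
        }
    }
    (best.0 >= 2).then_some((best.1, best.2))
}

the merge step then drops the duplicated words from the start of the current transcript before concatenating it onto the previous one.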
💾 storage & optimization
- uses h265 encoding for efficient audio storage
- implements a local sqlite database for metadata
- stores raw audio chunks with timestamps
- maintains reference to original audio for verification
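
the metadata side is plain sqlite. a rough sketch of the chunk table and insert, assuming the rusqlite crate (the actual schema in the repo is richer):

use rusqlite::{params, Connection};

fn insert_chunk(
    conn: &Connection,
    file_path: &str,   // reference back to the raw audio on disk
    timestamp_ms: i64, // unix millis of the chunk start
    transcript: &str,
) -> rusqlite::Result<()> {
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audio_chunks (
             id         INTEGER PRIMARY KEY,
             file_path  TEXT NOT NULL,
             timestamp  INTEGER NOT NULL,
             transcript TEXT
         )",
        params![],
    )?;
    conn.execute(
        "INSERT INTO audio_chunks (file_path, timestamp, transcript) VALUES (?1, ?2, ?3)",
        params![file_path, timestamp_ms, transcript],
    )?;
    Ok(())
}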
🔒 privacy features
- completely local processing by default
- optional pii removal
- configurable data retention policies
- no cloud dependencies unless explicitly enabled
🧠 experimental features
- context-aware post-processing using llama-3.2-1b
- speaker diarization using voice embeddings
- local vector db for speaker identification over months
- adaptive noise profiling
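
speaker identification is the most experimental part; conceptually it compares a new voice embedding against the stored ones with cosine similarity, roughly like this (the 0.8 threshold and the in-memory storage are stand-ins for the vector db):

// cosine similarity between two voice embeddings of the same dimension
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

// match a new embedding against known speakers; returns the best speaker id
// if it clears the similarity threshold, otherwise None (i.e. a new speaker)
fn identify_speaker(embedding: &[f32], known: &[(u32, Vec<f32>)]) -> Option<u32> {
    known
        .iter()
        .map(|(id, emb)| (*id, cosine_similarity(embedding, emb)))
        .filter(|(_, sim)| *sim > 0.8)
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or(std::cmp::Ordering::Equal))
        .map(|(id, _)| id)
}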
🔧 technical stack
- rust + tokio for async processing
- tauri for cross-platform support
- onnx runtime and huggingface/candle for ml inference
- crossbeam channels for thread communication
📊 performance considerations
- efficient memory usage through streaming processing
- minimal cpu overhead through smart buffering
- configurable quality/performance tradeoffs
- automatic resource management
the result is open source btw:
https://github.com/mediar-ai/screenpipe
drop any questions!