how i built a local-first audio transcription pipeline: privacy-first voice processing in rust

louis030195
2 min read · Oct 30, 2024

I implemented a local-first audio processing pipeline, written in rust, that captures, processes, and transcribes audio while respecting privacy. here’s how it works:

🎤 audio capture & device management

- supports both input devices (microphones) and output devices (system audio)
- handles multi-channel audio devices through smart channel mixing
- implements device hot-plugging and graceful error handling
- uses tokio channels for efficient async communication
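
here’s roughly what the capture side looks like, assuming cpal for the device layer and a tokio unbounded channel to hand buffers to the async side. crate choices and names here are illustrative, not the exact screenpipe code:

use cpal::traits::{DeviceTrait, HostTrait, StreamTrait};
use tokio::sync::mpsc;

fn start_capture(tx: mpsc::UnboundedSender<Vec<f32>>) -> anyhow::Result<cpal::Stream> {
    let host = cpal::default_host();
    // pick the default microphone; output (system audio) devices work the same way
    let device = host
        .default_input_device()
        .ok_or_else(|| anyhow::anyhow!("no input device found"))?;
    // assumes the device delivers f32 samples; other formats would be converted first
    let config = device.default_input_config()?;

    let stream = device.build_input_stream(
        &config.into(),
        move |data: &[f32], _| {
            // forward each callback buffer to the async pipeline; ignoring send
            // errors keeps the audio thread from ever blocking
            let _ = tx.send(data.to_vec());
        },
        |err| eprintln!("audio stream error: {err}"),
        None,
    )?;
    stream.play()?;
    Ok(stream)
}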

🔊 audio processing pipeline

1. channel conversion
- converts multi-channel audio to mono using weighted averaging (see the sketch after this list)
- handles various sample formats (f32, i16, i32, i8)
- implements real-time resampling to 16khz for whisper compatibility

2. signal processing
- normalizes audio using RMS and peak normalization (sketch below)
- implements spectral subtraction for noise reduction
- uses realfft for efficient fourier transforms
- maintains audio quality while reducing background noise

3. voice activity detection (vad)
- dual vad engine support: webrtc (lightweight) and silero (more accurate)
- configurable sensitivity levels (low/medium/high)
- uses sliding window analysis for robust speech detection (sketch below)
- implements frame history for better context awareness
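
for the channel conversion step, a rough sketch of the mono downmix and the 16khz resample target. the equal-weight averaging and the naive linear resampler just keep the example short; a real pipeline would use a dedicated resampling crate:

// average interleaved multi-channel samples down to mono (assumes channels >= 1)
fn downmix_to_mono(samples: &[f32], channels: usize) -> Vec<f32> {
    samples
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

// naive linear-interpolation resampler to 16khz, whisper's expected rate
fn resample_to_16k(mono: &[f32], input_rate: u32) -> Vec<f32> {
    let ratio = input_rate as f32 / 16_000.0;
    let out_len = (mono.len() as f32 / ratio) as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f32 * ratio;
            let idx = pos as usize;
            let frac = pos - idx as f32;
            let a = mono[idx];
            let b = mono[(idx + 1).min(mono.len() - 1)];
            a + (b - a) * frac
        })
        .collect()
}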
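
for the signal processing step, the RMS + peak normalization is roughly this shape (the spectral subtraction part with realfft is left out to keep it short):

// scale the chunk so its RMS hits a target level, clamping the gain
// so no single peak exceeds full scale
fn normalize(samples: &mut [f32], target_rms: f32) {
    if samples.is_empty() {
        return;
    }
    let rms = (samples.iter().map(|s| s * s).sum::<f32>() / samples.len() as f32).sqrt();
    let peak = samples.iter().fold(0.0f32, |m, s| m.max(s.abs()));
    if rms == 0.0 || peak == 0.0 {
        return;
    }
    // gain needed to reach the target rms, limited by what peak normalization allows
    let gain = (target_rms / rms).min(1.0 / peak);
    for s in samples.iter_mut() {
        *s *= gain;
    }
}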
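
for the vad step, the sliding window idea: keep a short history of per-frame decisions from the underlying engine (webrtc or silero) and only report speech when enough recent frames agree. names and thresholds below are illustrative:

use std::collections::VecDeque;

struct SpeechGate {
    history: VecDeque<bool>,
    window: usize,
    // fraction of recent frames that must be voiced, tuned per sensitivity level
    threshold: f32,
}

impl SpeechGate {
    fn new(window: usize, threshold: f32) -> Self {
        Self { history: VecDeque::with_capacity(window), window, threshold }
    }

    // push the per-frame decision from the vad engine and return the
    // smoothed verdict over the whole window
    fn update(&mut self, frame_is_speech: bool) -> bool {
        if self.history.len() == self.window {
            self.history.pop_front();
        }
        self.history.push_back(frame_is_speech);
        let voiced = self.history.iter().filter(|&&v| v).count();
        voiced as f32 >= self.threshold * self.history.len() as f32
    }
}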

🤖 transcription engine

- primary: whisper (tiny/large-v3/large-v3-turbo)
- fallback: deepgram api integration
- smart overlap handling:

// handles cases where audio chunks might cut sentences
if let Some((prev_idx, cur_idx)) = longest_common_word_substring(previous, current) {
    // strip overlapping content and merge transcripts
}
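
a word-level longest common substring between the previous and current transcript is enough to find the duplicated span. here’s a sketch of that idea, not the exact function from the codebase:

// return indices (in previous, in current) where the longest run of
// identical consecutive words starts, if any overlap exists at all
fn longest_common_word_substring(previous: &str, current: &str) -> Option<(usize, usize)> {
    let prev: Vec<&str> = previous.split_whitespace().collect();
    let cur: Vec<&str> = current.split_whitespace().collect();
    let mut best = (0usize, 0usize, 0usize); // (prev_idx, cur_idx, length)
    for i in 0..prev.len() {
        for j in 0..cur.len() {
            let mut len = 0;
            while i + len < prev.len() && j + len < cur.len() && prev[i + len] == cur[j + len] {
                len += 1;
            }
            if len > best.2 {
                best = (i, j, len);
            }
        }
    }
    (best.2 > 0).then_some((best.0, best.1))
}

once the shared run is found, the merge keeps previous up to the overlap, the shared words once, then the rest of current.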

💾 storage & optimization

- uses h265 encoding for efficient audio storage
- implements a local sqlite database for metadata
- stores raw audio chunks with timestamps
- maintains reference to original audio for verification
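
the metadata side is a plain sqlite table keyed by timestamp, with a path back to the raw audio chunk. a minimal sketch using rusqlite (the actual crate and schema in the project may differ):

use rusqlite::{params, Connection};

fn save_transcription(
    conn: &Connection,
    device: &str,
    timestamp_ms: i64,
    audio_path: &str,
    text: &str,
) -> rusqlite::Result<()> {
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transcriptions (
            id INTEGER PRIMARY KEY,
            device TEXT NOT NULL,
            timestamp_ms INTEGER NOT NULL,
            audio_path TEXT NOT NULL, -- reference back to the raw audio chunk
            text TEXT NOT NULL
        )",
        [],
    )?;
    conn.execute(
        "INSERT INTO transcriptions (device, timestamp_ms, audio_path, text)
         VALUES (?1, ?2, ?3, ?4)",
        params![device, timestamp_ms, audio_path, text],
    )?;
    Ok(())
}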

🔒 privacy features

- completely local processing by default
- optional pii removal
- configurable data retention policies
- no cloud dependencies unless explicitly enabled
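
these toggles boil down to a small settings struct with privacy-preserving defaults. the field names here are made up for illustration:

struct PrivacySettings {
    // everything runs locally unless this is flipped on explicitly
    allow_cloud_fallback: bool,
    // scrub emails, phone numbers, etc. from transcripts before storing
    remove_pii: bool,
    // delete audio chunks and transcripts older than this many days (None = keep forever)
    retention_days: Option<u32>,
}

impl Default for PrivacySettings {
    fn default() -> Self {
        Self { allow_cloud_fallback: false, remove_pii: false, retention_days: None }
    }
}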

🧠 experimental features

- context-aware post-processing using llama-3.2-1b
- speaker diarization using voice embeddings
- local vector db for speaker identification over months
- adaptive noise profiling
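
the speaker identification part comes down to comparing a fresh voice embedding against the ones already stored locally. a sketch using plain cosine similarity (the embedding model and vector store are assumptions here):

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { 0.0 } else { dot / (norm_a * norm_b) }
}

// match a fresh embedding against known speakers; if nothing clears the
// threshold, the caller treats it as a new speaker and stores it
fn identify_speaker<'a>(
    embedding: &[f32],
    known: &'a [(String, Vec<f32>)],
    threshold: f32,
) -> Option<&'a str> {
    known
        .iter()
        .map(|(name, emb)| (name.as_str(), cosine_similarity(embedding, emb)))
        .filter(|(_, sim)| *sim >= threshold)
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(name, _)| name)
}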

🔧 technical stack

- rust + tokio for async processing
- tauri for cross-platform support
- onnx runtime and huggingface/candle for ml inference
- crossbeam channels for thread communication

📊 performance considerations

- efficient memory usage through streaming processing
- minimal cpu overhead through smart buffering
- configurable quality/performance tradeoffs
- automatic resource management

the result is open source btw:

https://github.com/mediar-ai/screenpipe

drop any questions!
