how i built a local-first audio transcription pipeline: privacy-first voice processing in rust
I built a local-first audio processing pipeline in rust that captures, processes, and transcribes audio while respecting privacy. here's how it works:

🎤 audio capture & device management
- supports both input devices (microphones) and output devices (system audio)
- handles multi-channel audio devices through smart channel mixing
- implements device hot-plugging and graceful error handling
- uses tokio channels for efficient async communication
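
the device layer in the repo is more involved than this, but here's a minimal sketch of the idea, assuming the cpal (0.15-style api) and anyhow crates, an f32 input device, and an illustrative AudioChunk type:

use cpal::traits::{DeviceTrait, HostTrait, StreamTrait};
use tokio::sync::mpsc;

// a captured buffer of interleaved samples plus its channel count (illustrative type)
struct AudioChunk {
    samples: Vec<f32>,
    channels: u16,
}

fn spawn_capture(tx: mpsc::Sender<AudioChunk>) -> anyhow::Result<cpal::Stream> {
    let host = cpal::default_host();
    let device = host
        .default_input_device()
        .ok_or_else(|| anyhow::anyhow!("no input device found"))?;
    let config = device.default_input_config()?;
    let channels = config.channels();

    // the callback runs on a realtime audio thread: copy the samples and hand
    // them off without blocking; try_send drops chunks if the pipeline lags
    let stream = device.build_input_stream(
        &config.into(),
        move |data: &[f32], _| {
            let _ = tx.try_send(AudioChunk { samples: data.to_vec(), channels });
        },
        |err| eprintln!("capture error: {err}"),
        None,
    )?;
    stream.play()?;
    Ok(stream) // keep the stream alive for as long as capture should run
}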
🔊 audio processing pipeline
1. channel conversion (small sketches of each step follow this list)
   - converts multi-channel audio to mono using weighted averaging
   - handles various sample formats (f32, i16, i32, i8)
   - resamples in real time to 16khz for whisper compatibility
2. signal processing
   - normalizes audio levels using RMS and peak normalization
   - applies spectral subtraction for noise reduction
   - uses realfft for efficient fourier transforms
   - maintains audio quality while reducing background noise
3. voice activity detection (vad)
   - dual vad engine support: webrtc (lightweight) and silero (more accurate)
   - configurable sensitivity levels (low/medium/high)
   - uses sliding-window analysis for robust speech detection
   - keeps a frame history for better context awareness
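
to make the channel-conversion step concrete, here's roughly what the mono downmix looks like. it's a simplified sketch rather than the exact code from the repo, and it assumes equal per-channel weights:

// downmix interleaved multi-channel samples to mono by averaging each frame;
// real devices sometimes warrant unequal weights (e.g. a silent channel),
// but equal weights keep the example simple
fn downmix_to_mono(samples: &[f32], channels: usize) -> Vec<f32> {
    if channels <= 1 {
        return samples.to_vec();
    }
    samples
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

// convert i16 samples into the f32 range [-1.0, 1.0] before downmixing;
// the same pattern covers the i32 and i8 formats mentioned above
fn i16_to_f32(samples: &[i16]) -> Vec<f32> {
    samples.iter().map(|&s| s as f32 / i16::MAX as f32).collect()
}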
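
the signal-processing step is harder to compress into a snippet, but the RMS/peak normalization part boils down to this (the target level is an illustrative parameter, and the spectral-subtraction pass is omitted here):

// normalize a mono buffer towards a target RMS level, capping the gain so
// the loudest sample never exceeds 1.0 (the peak-normalization safety net)
fn normalize_rms(samples: &mut [f32], target_rms: f32) {
    if samples.is_empty() {
        return;
    }
    let rms = (samples.iter().map(|s| s * s).sum::<f32>() / samples.len() as f32).sqrt();
    if rms <= f32::EPSILON {
        return; // silence, nothing to do
    }
    let peak = samples.iter().fold(0.0_f32, |acc, s| acc.max(s.abs()));
    let gain = (target_rms / rms).min(1.0 / peak);
    for s in samples.iter_mut() {
        *s *= gain;
    }
}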
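
and the sliding-window logic that sits on top of the vad engines looks roughly like this; the per-sensitivity thresholds are made up for the example, only the windowing idea matters:

use std::collections::VecDeque;

// smooths per-frame vad scores over a sliding window so a single noisy frame
// doesn't flip the speech / no-speech decision
struct SlidingVad {
    history: VecDeque<f32>, // recent per-frame speech probabilities
    window: usize,
    threshold: f32, // derived from the configured sensitivity (low/medium/high)
}

impl SlidingVad {
    fn new(window: usize, sensitivity: &str) -> Self {
        // illustrative thresholds; real values are tuned per engine
        let threshold = match sensitivity {
            "high" => 0.3,
            "low" => 0.7,
            _ => 0.5,
        };
        Self { history: VecDeque::with_capacity(window), window, threshold }
    }

    // feed one frame's speech probability (from webrtc or silero) and get the
    // smoothed is-speech decision for the current window
    fn push(&mut self, frame_score: f32) -> bool {
        if self.history.len() == self.window {
            self.history.pop_front();
        }
        self.history.push_back(frame_score);
        let avg = self.history.iter().sum::<f32>() / self.history.len() as f32;
        avg >= self.threshold
    }
}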
🤖 transcription engine
- primary: whisper (tiny/large-v3/large-v3-turbo)
- fallback: deepgram api integration
- smart overlap handling:
// handle cases where consecutive audio chunks cut a sentence in half:
// find where the previous transcript's tail overlaps the current one's head
if let Some((prev_idx, cur_idx)) = longest_common_word_substring(previous, current) {
    // strip the overlapping words and merge the two transcripts
}
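
the real implementation lives in the repo; a simplified word-level longest-common-substring that captures the idea looks like this:

// find the longest run of words shared between `previous` and `current`,
// returning where the run starts in each transcript; a minimum run length
// avoids false matches on common words like "the"
fn longest_common_word_substring(previous: &str, current: &str) -> Option<(usize, usize)> {
    let prev: Vec<&str> = previous.split_whitespace().collect();
    let cur: Vec<&str> = current.split_whitespace().collect();
    let mut best = (0_usize, 0_usize, 0_usize); // (length, prev_start, cur_start)

    // classic dynamic-programming longest common substring, over words
    let mut dp = vec![vec![0_usize; cur.len() + 1]; prev.len() + 1];
    for i in 1..=prev.len() {
        for j in 1..=cur.len() {
            if prev[i - 1].eq_ignore_ascii_case(cur[j - 1]) {
                dp[i][j] = dp[i - 1][j - 1] + 1;
                if dp[i][j] > best.0 {
                    best = (dp[i][j], i - dp[i][j], j - dp[i][j]);
                }
            }
        }
    }
    (best.0 >= 2).then_some((best.1, best.2))
}

the merge step then drops the duplicated words from the start of the current transcript before concatenating it onto the previous one.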
💾 storage & optimization
- uses h265 encoding for efficient audio storage
- implements a local sqlite database for metadata
- stores raw audio chunks with timestamps
- maintains reference to original audio for verification
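
the metadata side is plain sqlite. a rough sketch of the chunk table and insert, assuming the rusqlite crate (the actual schema in the repo is richer):

use rusqlite::{params, Connection};

fn insert_chunk(
    conn: &Connection,
    file_path: &str,   // reference back to the raw audio on disk
    timestamp_ms: i64, // unix millis of the chunk start
    transcript: &str,
) -> rusqlite::Result<()> {
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audio_chunks (
             id         INTEGER PRIMARY KEY,
             file_path  TEXT NOT NULL,
             timestamp  INTEGER NOT NULL,
             transcript TEXT
         )",
        params![],
    )?;
    conn.execute(
        "INSERT INTO audio_chunks (file_path, timestamp, transcript) VALUES (?1, ?2, ?3)",
        params![file_path, timestamp_ms, transcript],
    )?;
    Ok(())
}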
🔒 privacy features
- completely local processing by default
- optional pii removal
- configurable data retention policies
- no cloud dependencies unless explicitly enabled
🧠 experimental features
- context-aware post-processing using llama-3.2-1b
- speaker diarization using voice embeddings
- local vector db for speaker identification over months
- adaptive noise profiling
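
speaker identification is the most experimental part; conceptually it compares a new voice embedding against the stored ones with cosine similarity, roughly like this (the 0.8 threshold and the in-memory storage are stand-ins for the vector db):

// cosine similarity between two voice embeddings of the same dimension
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

// match a new embedding against known speakers; returns the best speaker id
// if it clears the similarity threshold, otherwise None (i.e. a new speaker)
fn identify_speaker(embedding: &[f32], known: &[(u32, Vec<f32>)]) -> Option<u32> {
    known
        .iter()
        .map(|(id, emb)| (*id, cosine_similarity(embedding, emb)))
        .filter(|(_, sim)| *sim > 0.8)
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or(std::cmp::Ordering::Equal))
        .map(|(id, _)| id)
}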
🔧 technical stack
- rust + tokio for async processing
- tauri for cross-platform support
- onnx runtime and huggingface/candle for ml inference
- crossbeam channels for thread communication
📊 performance considerations
- efficient memory usage through streaming processing
- minimal cpu overhead through smart buffering
- configurable quality/performance tradeoffs
- automatic resource management
the result is open source btw:
https://github.com/mediar-ai/screenpipe
drop any questions!