YouTube embedding pipeline

A local-first pipeline for audio download, Whisper transcription, text embeddings, and audio embeddings.

Status: active · GitHub


What it is

A local YouTube-to-embedding pipeline with two paths. Talks become Whisper transcripts plus text embeddings; music becomes sampled audio chunks plus AST-style (Audio Spectrogram Transformer) audio embeddings.
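
The two paths and the chunk sampling step can be illustrated with a minimal sketch. Everything named here (`plan_steps`, `sample_chunks`, the step labels, the 10-second window, and the limit of 6 chunks) is a hypothetical choice for illustration, not taken from the project's code. The sketch spreads non-overlapping windows evenly across a track; the actual pipeline may sample differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    start_s: float
    end_s: float


def plan_steps(kind: str) -> list[str]:
    """Return the (hypothetical) step names for one of the two paths."""
    if kind == "talk":
        return ["download_audio", "whisper_transcribe", "embed_text"]
    if kind == "music":
        return ["download_audio", "sample_chunks", "embed_audio"]
    raise ValueError(f"unknown content kind: {kind!r}")


def sample_chunks(
    duration_s: float, chunk_s: float = 10.0, max_chunks: int = 6
) -> list[Chunk]:
    """Pick up to max_chunks evenly spaced, non-overlapping windows."""
    if duration_s <= 0 or chunk_s <= 0 or max_chunks <= 0:
        return []
    if duration_s <= chunk_s:
        # Track is shorter than one window: use all of it.
        return [Chunk(0.0, duration_s)]
    n = min(max_chunks, int(duration_s // chunk_s))
    # Spread window starts evenly over the usable range [0, duration - chunk].
    last_start = duration_s - chunk_s
    step = last_start / (n - 1) if n > 1 else 0.0
    return [Chunk(i * step, i * step + chunk_s) for i in range(n)]
```

Capping the number of chunks keeps the embedding cost per track bounded regardless of its length, at the price of covering long tracks more sparsely.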

Outputs are written to Parquet with run metadata so later search and retrieval work has something durable to build on.
