Inspired by _DavidSmith's Podcast Search, I transcribed every episode of the Hello Internet podcast using whisper.cpp. After cleaning up the transcripts into well-formatted sentences, I grouped consecutive related sentences into 'chunks', using this approach. I then created an embedding for each chunk using OpenAI's text-embedding-ada-002 and stored them in a Postgres database with the pgvector extension. When you enter a search query, an embedding is generated for that query and the database is searched for the chunks most similar to it. This works well, notably even when your query doesn't contain the exact words found in the transcript.
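The query flow above can be sketched in miniature as a cosine-similarity ranking. In the real system the vectors come from an embedding model and the ranking is done by pgvector (its `<=>` operator computes cosine distance); here the chunk names and vectors are toy stand-ins invented for illustration.

```python
import math

# Hypothetical chunk embeddings -- in production these would come from
# the embedding model and live in Postgres/pgvector, not a dict.
chunk_embeddings = {
    "chunk about flags": [0.9, 0.1, 0.0],
    "chunk about planes": [0.1, 0.9, 0.1],
    "chunk about YouTube": [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(query_embedding, top_k=2):
    """Return the top_k chunk names ranked by similarity to the query --
    the same ordering `ORDER BY embedding <=> query LIMIT k` would give."""
    ranked = sorted(
        chunk_embeddings.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]

# A query embedded close to the "flags" chunk should surface it first.
print(search([0.85, 0.15, 0.05]))
```

The point of the cosine metric is that it compares the direction of the vectors rather than their magnitude, which is why a query can match a chunk without sharing any exact words: the model maps related phrasings to nearby directions.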
Update: bge-base-en-v1.5 now creates the embeddings, and Cloudflare Vectorize and KV handle storage instead of Postgres. This is much faster and gave me experience with Cloudflare's tools.