How to Handle Data Ingestion and Updates in RAG
One of the main problems with RAG systems is keeping the knowledge base fresh and accurate:

- Internal documents are constantly updated in Google Drive.
- Public information on the website (blog, product pages, docs) changes regularly.
- Old files need to be removed, otherwise the AI risks retrieving outdated data.

I just built an ingestion workflow for a SaaS client that solves these issues. Here's how it works (rough code sketches for steps 2–4 are at the end of the post):

1. Continuous monitoring
   - Google Drive triggers fire on file creation, updates, and deletions.
   - Monthly website scraping with Firecrawl refreshes all key URLs.

2. Smart updates
   - Each document is hashed. If the hash is unchanged → skip.
   - If it changed → the old embeddings are deleted from Postgres/PGVector and replaced with new ones.
   - When a file is deleted in Drive, its vectors are removed automatically.

3. Metadata for better retrieval
   - GPT-4.1 classifies every document as **internal** or **external** and generates a one-sentence summary.
   - Metadata like `file_id`, `doc_type`, and `summary` makes retrieval more precise.

4. Vectorization pipeline
   - Content is normalized and split into chunks with overlap.
   - OpenAI embeddings are created and stored in **PGVector**.
   - A record manager table tracks file IDs + hashes.

Result: the RAG agent always has access to the latest, cleaned, and properly categorized knowledge, from both internal docs and external web pages. No stale data, no duplicates, no hallucinations from outdated sources.

If you're building RAG systems, I'd argue this ingestion & update layer is the real bottleneck for accuracy, not just the retrieval model itself.

Hope that helps!
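
As promised, here's a minimal sketch of the "smart updates" step (step 2). It assumes a plain `psycopg2` connection and two tables I've named `record_manager(file_id, content_hash)` and `document_chunks(file_id, doc_type, summary, content, embedding)`; those names and columns are my own, so adapt them to your schema.

```python
import hashlib

import psycopg2  # assumption: vectors live in Postgres + pgvector, accessed via psycopg2

# conn = psycopg2.connect(...)  # connect however your environment requires


def needs_reembedding(conn, file_id: str, content: str) -> bool:
    """Compare the content hash against the record manager table.

    Returns False if the hash is unchanged (skip the file), True if the document
    is new or changed. In the changed case the stale chunks are deleted and the
    new hash is staged, but nothing is committed here: the caller commits after
    inserting the fresh chunks so the swap happens in one transaction.
    """
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()

    with conn.cursor() as cur:
        cur.execute(
            "SELECT content_hash FROM record_manager WHERE file_id = %s",
            (file_id,),
        )
        row = cur.fetchone()
        if row is not None and row[0] == content_hash:
            return False  # unchanged -> skip

        # Changed or brand new: drop stale vectors and stage the new hash.
        cur.execute("DELETE FROM document_chunks WHERE file_id = %s", (file_id,))
        cur.execute(
            """
            INSERT INTO record_manager (file_id, content_hash)
            VALUES (%s, %s)
            ON CONFLICT (file_id) DO UPDATE SET content_hash = EXCLUDED.content_hash
            """,
            (file_id, content_hash),
        )
    return True


def handle_drive_deletion(conn, file_id: str) -> None:
    """Drive deletion trigger: remove the vectors and the record manager entry."""
    with conn.cursor() as cur:
        cur.execute("DELETE FROM document_chunks WHERE file_id = %s", (file_id,))
        cur.execute("DELETE FROM record_manager WHERE file_id = %s", (file_id,))
    conn.commit()
```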
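
The metadata step (step 3) can be a single structured call to the model. The prompt wording, the JSON shape, and the 8,000-character truncation below are illustrative choices, not a fixed recipe.

```python
import json

from openai import OpenAI  # assumption: openai>=1.0 Python client

client = OpenAI()


def classify_document(text: str) -> dict:
    """Return {'doc_type': 'internal' | 'external', 'summary': '...'} for a document."""
    response = client.chat.completions.create(
        model="gpt-4.1",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the document as 'internal' or 'external' and write a "
                    "one-sentence summary. Reply with JSON containing the keys "
                    "doc_type and summary."
                ),
            },
            # Truncate long documents; the classifier doesn't need the full text.
            {"role": "user", "content": text[:8000]},
        ],
    )
    return json.loads(response.choices[0].message.content)
```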
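
And a sketch of the vectorization step (step 4): a naive character splitter with overlap, OpenAI embeddings, and an insert into the same `document_chunks` table. The `text-embedding-3-small` model and the 1000/200 chunk sizes are assumptions; pick whatever fits your content.

```python
from openai import OpenAI

client = OpenAI()


def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Naive character-based splitter; consecutive chunks share `overlap` characters."""
    chunks, start, step = [], 0, chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks


def embed_and_store(conn, file_id: str, doc_type: str, summary: str, text: str) -> None:
    """Chunk the normalized text, embed every chunk, and store chunk + metadata."""
    chunks = split_with_overlap(text)
    if not chunks:
        return

    response = client.embeddings.create(model="text-embedding-3-small", input=chunks)

    with conn.cursor() as cur:
        for chunk, item in zip(chunks, response.data):
            # pgvector accepts the '[x,y,z]' literal form, hence the ::vector cast.
            vector_literal = "[" + ",".join(str(x) for x in item.embedding) + "]"
            cur.execute(
                """
                INSERT INTO document_chunks (file_id, doc_type, summary, content, embedding)
                VALUES (%s, %s, %s, %s, %s::vector)
                """,
                (file_id, doc_type, summary, chunk, vector_literal),
            )
    conn.commit()  # the staged hash and the new chunks land together
```

Tying the sketches together for one changed Drive file (`conn`, `file_id`, and `normalized_text` come from your trigger handler):

```python
if needs_reembedding(conn, file_id, normalized_text):
    meta = classify_document(normalized_text)
    embed_and_store(conn, file_id, meta["doc_type"], meta["summary"], normalized_text)
```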