🚀 Project Overview: Designing a contract analysis system for a multi-portfolio family office with 15,000+ vendor agreements spanning 200+ companies. Building an intelligent RAG system that needs to:
- Auto-process contracts from existing Google Drive infrastructure
- Stream new documents via n8n automation pipelines
- Support complex legal queries across the full document corpus
- Enable secure client access to the knowledge base
🛠️ Current Tech Stack:
- Document Storage: Google Drive (client's existing setup)
- Workflow Engine: n8n for document processing automation
- Vector Database: Pinecone for semantic search capabilities
- Parsing Engine: Evaluating LlamaCloud (LlamaIndex's hosted parsing) vs Docling (self-hosted, on-premises)
⚖️ Architecture Decision: Leaning toward LlamaCloud for its managed infrastructure and what looks like stronger handling of legal document layouts, though client security policies may require on-premises parsing with Docling.
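For context on how I picture the stack fitting together: one thing I want to get right early is idempotent ingestion, so that re-processing a Drive file overwrites its existing vectors instead of duplicating them. Here is a minimal sketch of that idea. The function names and the `file_id#chunk` ID format are my own assumptions, not anything prescribed by n8n or Pinecone; it relies only on the fact that upserting a vector with an existing ID replaces the old record.

```python
from __future__ import annotations

import hashlib


def chunk_vector_id(drive_file_id: str, chunk_index: int) -> str:
    """Build a deterministic vector ID from the Drive file ID and chunk position.

    Hypothetical ID scheme: re-running the pipeline on the same file yields
    the same IDs, so an upsert overwrites rather than duplicates.
    """
    return f"{drive_file_id}#{chunk_index:05d}"


def content_fingerprint(text: str) -> str:
    """Hash the parsed text so unchanged files can be skipped."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reindex(stored_fingerprint: str | None, text: str) -> bool:
    """Return True when the file is new or its parsed text has changed."""
    return stored_fingerprint != content_fingerprint(text)
```

The fingerprint would have to be stored somewhere the workflow can read it back (a sheet, a small database, or vector metadata); I have not decided where yet.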
📈 Scale Considerations: At 15K+ legal documents, I'm prioritizing architecture validation over rapid prototyping to avoid performance bottlenecks during production.
🎯 Seeking Community Input:
- Pinecone Performance: Real-world experience with 10K+ document collections? Latency/cost insights?
- Document Pipeline: Google Drive → n8n → Vector DB optimization strategies?
- Legal Parsing: Comparative experiences with LlamaCloud vs Docling on contract documents?
- Access Control: Implementing secure multi-client access patterns with Pinecone?
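To make the access-control question concrete, this is roughly the pattern I have in mind: one namespace per client, plus a server-side metadata filter restricting results to the companies that client may see. It is only a sketch under my own assumptions. `ClientScope`, `scoped_query_args`, the `client-<id>` namespace naming, and the `company_id` metadata field are all hypothetical; the code just builds the `namespace` and `filter` arguments that would be passed to a Pinecone query, using Pinecone's `$in` / `$and` filter operators.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ClientScope:
    """What a single client is allowed to see (hypothetical model)."""
    client_id: str
    company_ids: tuple[str, ...]


def scoped_query_args(scope: ClientScope, user_filter: dict | None = None) -> dict:
    """Combine the mandatory access filter with an optional user filter.

    The access filter is always applied on the server side, so a caller
    cannot widen their own scope by crafting a filter.
    """
    if not scope.company_ids:
        raise PermissionError("client has no accessible companies")
    access = {"company_id": {"$in": list(scope.company_ids)}}
    combined = {"$and": [access, user_filter]} if user_filter else access
    return {"namespace": f"client-{scope.client_id}", "filter": combined}
```

Whether namespaces alone are enough isolation for a family office, or whether separate indexes per client are warranted, is exactly what I would like feedback on.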
🔧 Implementation Scope: Beyond vendor contracts, integrating internal policies and corporate agreements for comprehensive legal intelligence. Target use case: "Find all contracts with auto-renewal clauses expiring in Q3."
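My working assumption is that a query like this is mostly a structured metadata filter rather than pure semantic search, which means the auto-renewal flag and expiry date have to be extracted at parse time. A rough sketch of the filter follows. The field names `auto_renewal` and `expiry_date` are hypothetical, and I am assuming dates are stored as `YYYYMMDD` integers, since Pinecone's range operators (`$gte`, `$lte`) work on numbers.

```python
from __future__ import annotations

from datetime import date


def quarter_bounds(year: int, quarter: int) -> tuple[date, date]:
    """Return the first and last calendar day of the given quarter (1-4)."""
    start_month = 3 * (quarter - 1) + 1
    start = date(year, start_month, 1)
    # The day before the next quarter starts is the last day of this one.
    if quarter == 4:
        next_start = date(year + 1, 1, 1)
    else:
        next_start = date(year, start_month + 3, 1)
    end = date.fromordinal(next_start.toordinal() - 1)
    return start, end


def as_int(d: date) -> int:
    """Encode a date as a YYYYMMDD integer so it can be range-filtered."""
    return d.year * 10000 + d.month * 100 + d.day


def auto_renewal_filter(year: int, quarter: int) -> dict:
    """Metadata filter: auto-renewing contracts expiring in the quarter."""
    start, end = quarter_bounds(year, quarter)
    return {
        "$and": [
            {"auto_renewal": {"$eq": True}},
            {"expiry_date": {"$gte": as_int(start)}},
            {"expiry_date": {"$lte": as_int(end)}},
        ]
    }
```

For example, `auto_renewal_filter(2025, 3)` (the year is arbitrary) bounds `expiry_date` between `20250701` and `20250930`. The hard part is presumably upstream: reliably extracting those two fields from 15K heterogeneous contracts.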
🧠 Learning from Veterans: If you were architecting a legal document RAG system today, what would be your non-negotiable design principles? Where did previous implementations hit unexpected complexity?