Hi everyone,
I’ve built an AI automation workflow that converts PDF, DOC, and PPT files into text, embeds that text, and stores the resulting vectors in Pinecone.
PDF and DOC files work correctly and generate good embeddings.
PowerPoint files also convert to text successfully, but that text is not stored properly as vectors, and the embeddings it produces are inaccurate.
I suspect the issue is related to PPT text structure, chunking, or preprocessing.
My questions:
- What is the best way to structure PPT text (slide-wise, bullet-wise, or section-wise) before embedding?
- Are there recommended chunk sizes or metadata formats for PPT files?
- Has anyone built or seen a working RAG workflow for PowerPoint documents?
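
To make the slide-wise option concrete, here is a minimal sketch of what I mean by one chunk per slide with metadata attached. It assumes the slide text has already been extracted into a title, a list of bullets, and optional speaker notes. The `Slide` class, the function names, and the `id` / `text` / `metadata` record shape are all hypothetical, chosen for illustration; they are not Pinecone's API and not necessarily the right structure, which is exactly what I'm asking about.

```python
from dataclasses import dataclass, field


@dataclass
class Slide:
    """Hypothetical container for text already extracted from one slide."""
    number: int
    title: str
    bullets: list[str] = field(default_factory=list)
    notes: str = ""


def slide_to_chunk(slide: Slide, source: str) -> dict:
    """Build one record per slide: the text to embed plus metadata for filtering."""
    # Keep the title with its bullets so each bullet is embedded with its context.
    lines = [f"Slide {slide.number}: {slide.title}".strip()]
    lines += [f"- {b.strip()}" for b in slide.bullets if b.strip()]
    if slide.notes.strip():
        lines.append(f"Notes: {slide.notes.strip()}")
    return {
        "id": f"{source}#slide-{slide.number}",  # assumed ID scheme
        "text": "\n".join(lines),
        "metadata": {
            "source": source,
            "slide_number": slide.number,
            "title": slide.title,
        },
    }


def has_text(slide: Slide) -> bool:
    """True if the slide carries any title, bullet, or notes text."""
    return bool(
        slide.title.strip()
        or any(b.strip() for b in slide.bullets)
        or slide.notes.strip()
    )


def deck_to_chunks(slides: list[Slide], source: str) -> list[dict]:
    """Convert a deck to per-slide chunks, skipping slides with no text at all."""
    return [slide_to_chunk(s, source) for s in slides if has_text(s)]
```

Each returned `text` would then be passed to the embedding model, with `metadata` stored alongside the vector. Bullet-wise or section-wise chunking would change only how the slides are grouped before `slide_to_chunk` is called.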
Any guidance, examples, or references would be greatly appreciated.
Thanks in advance for your support 🙏