🧐 Turn PDFs into Clean, LLM-Ready Data
PDFs lock content into complex layouts, making it difficult for LLMs to process text, tables, and images effectively.
Dolphin is an open-source parsing framework that converts PDFs into structured formats such as Markdown, HTML, LaTeX, and JSON.
🛠️ How It Works
  1. Layout analysis - Detects and sequences elements according to the document’s natural reading order.
  2. Parallel parsing - Processes each element with specialized prompts tailored to different content types (text blocks, tables, figures, etc.).
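
To make the two stages concrete, here is a minimal Python sketch of the analyze-then-parse idea. It is not Dolphin's actual API: `Element`, `PROMPTS`, `analyze_layout`, `parse_element`, and `parse_page` are hypothetical names, the prompt strings are made up, and both model calls are replaced by simple stand-ins so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical prompts, one per element type (not Dolphin's real prompts).
PROMPTS = {
    "text": "Read the text in this region.",
    "table": "Parse this table into HTML.",
    "figure": "Describe this figure.",
}


@dataclass
class Element:
    order: int   # position in the reading order, assigned by stage 1
    kind: str    # "text", "table", or "figure"
    region: str  # stand-in for the cropped image region


def analyze_layout(page):
    """Stage 1 stand-in: return the page's elements in reading order."""
    # A real system would ask the model for the layout; here we just sort.
    return sorted(page, key=lambda element: element.order)


def parse_element(element):
    """Stage 2 stand-in: pick the prompt that matches the element type."""
    prompt = PROMPTS[element.kind]
    # Placeholder for a model call: echo the prompt and the region.
    return f"[{element.kind}] {prompt} -> {element.region}"


def parse_page(page, max_workers=4):
    """Run stage 1, then parse all elements concurrently."""
    elements = analyze_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so the reading
        # order from stage 1 is preserved even though work runs in parallel.
        return list(pool.map(parse_element, elements))


if __name__ == "__main__":
    page = [
        Element(2, "table", "region-b"),
        Element(1, "text", "region-a"),
        Element(3, "figure", "region-c"),
    ]
    for line in parse_page(page):
        print(line)
```

The thread pool is one plausible way to parallelize stage 2 when each element is an independent model request. A real implementation might instead batch the elements into a single forward pass.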
🗝️ Key Features
  • Two-stage “analyze-then-parse” pipeline powered by a single vision-language model (VLM)
  • Strong performance on complex document parsing tasks
  • Reading-order-aware element sequencing
  • Specialized prompts for different document elements
  • Efficient parallel parsing for faster results
It’s 100% Open Source 🙌🏻
Mišel Čupković
AI Automation Society
skool.com/ai-automation-society
A community for mastering AI-driven automation and AI agents. Learn, collaborate, and optimize your workflows!