🧐 Turn PDFs into Clean, LLM-Ready Data
PDFs lock content into complex layouts, making it difficult for LLMs to process text, tables, and images effectively.
Dolphin is an open-source parsing framework that converts PDFs into structured formats such as Markdown, HTML, LaTeX, and JSON.
🛠️ How It Works
  1. Layout analysis - Detects and sequences elements according to the document’s natural reading order.
  2. Parallel parsing - Processes each element with specialized prompts tailored to different content types (text blocks, tables, figures, etc.).
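The two stages above can be sketched roughly as follows. This is a minimal illustration, not Dolphin's actual API: `Element`, `PROMPTS`, `analyze_layout`, `parse_element`, and `parse_page` are hypothetical names, and the model calls are replaced with stubs so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical prompts, one per element type (not Dolphin's real prompts).
PROMPTS = {
    "text": "Read the text in this region.",
    "table": "Parse this table into HTML.",
    "figure": "Describe this figure.",
}


@dataclass
class Element:
    kind: str    # "text", "table", or "figure"
    order: int   # position in the reading order
    region: str  # stand-in for a cropped image region


def analyze_layout(page):
    """Stage 1 (stub): return the page's elements sorted by reading order."""
    return sorted(page, key=lambda e: e.order)


def parse_element(element):
    """Stage 2 (stub): a real system would send the crop and prompt to the VLM."""
    prompt = PROMPTS[element.kind]
    return f"[{element.kind}] {prompt} -> {element.region}"


def parse_page(page, max_workers=4):
    elements = analyze_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so reading order is preserved
        # even though the elements are parsed concurrently.
        return list(pool.map(parse_element, elements))


page = [
    Element("table", 1, "crop_b"),
    Element("text", 0, "crop_a"),
    Element("figure", 2, "crop_c"),
]
results = parse_page(page)
```

The point of the split is that element parsing only needs the element's own crop and a type-specific prompt, so once the layout is known the elements can be processed independently and then reassembled in reading order.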
🗝️ Key Features
  • Two-stage “analyze-then-parse” pipeline powered by a single VLM
  • Strong performance on complex document parsing tasks
  • Reading-order-aware element sequencing
  • Specialized prompts for different document elements
  • Efficient parallel parsing for faster results
It’s 100% Open Source 🙌🏻