Hugging Face Releases FinePDFs: A 3-Trillion-Token Dataset Built from PDFs

Hugging Face has unveiled FinePDFs, the largest publicly available corpus built entirely from PDFs. The dataset spans 475 million documents in 1,733 languages, totaling roughly 3 trillion tokens and 3.65 terabytes. FinePDFs adds a new source to open training data by tapping PDFs, a resource long considered too complex and expensive to process.

https://www.infoq.com/news/2025/09/finepdfs/
