Hugging Face Releases FinePDFs: A 3-Trillion-Token Dataset Built from PDFs
Hugging Face has unveiled FinePDFs, the largest publicly available corpus built entirely from PDFs. The dataset spans 475 million documents in 1,733 languages, totaling roughly 3 trillion tokens. At 3.65 terabytes, FinePDFs extends open training datasets to a source long considered too complex and expensive to process.
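To put those headline figures in perspective, here is a minimal back-of-the-envelope sketch in Python that derives per-document averages from them. The constants are the rounded numbers quoted above; treating "terabytes" as decimal (10^12 bytes) is an assumption, and the post does not say whether the 3.65 TB figure refers to compressed or uncompressed data. The function name `per_document_averages` is purely illustrative.

```python
# Rounded figures quoted in the announcement above.
DOCUMENTS = 475_000_000
TOKENS = 3_000_000_000_000
SIZE_BYTES = 3.65e12  # assumption: decimal terabytes (10**12 bytes)


def per_document_averages(documents, tokens, size_bytes):
    """Return rough averages derived from the headline totals."""
    return {
        "tokens_per_doc": tokens / documents,
        "bytes_per_doc": size_bytes / documents,
        "bytes_per_token": size_bytes / tokens,
    }


averages = per_document_averages(DOCUMENTS, TOKENS, SIZE_BYTES)
for name, value in averages.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the average document comes out at roughly 6,300 tokens, which fits the idea that PDFs tend to be long-form material. Because the inputs are rounded, the results are order-of-magnitude estimates only.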
Luca Berton