Does anyone have experience handling a large batch of PDFs that differ widely from one another but all contain important tables, graphs, and images? This is a real example from a researcher's database.
For example, there are 150 PDFs, and about 130 of them contain a lot of content that is not easily readable.
That is 150 PDFs averaging 25 pages each, so roughly 4,000 pages with 1,000 images, tables, and graphs.
The regular “extract and go” workflow does not fit here (preparing for 15 document types would be fine, that is not a problem); this needs a completely different approach.
User manuals and assembly instructions might raise a similar issue: their content must be shown correctly and in the right place. How can we handle that? How should the dataset be set up?