Re: Search Pages

Posted by BigKev On 2010/11/2 18:17:32
Keith,

I don't think you are understanding. PDFs can contain both textual data and images. Most PDFs contain both: the text as text, and the images as images. All the PDF content on the website is made from scanned pages, so each page is an image, not text. Basically it is like taking a photograph of each page. So you can't simply extract the part numbers from the pages, as they are not data anymore. It's a picture of the data.

The only way to turn a scanned image back into data is to run it through OCR software. OCR stands for optical character recognition. This is specialized software that looks at blobs of pixels in the image and tries to detect whether they are letters and numerals. To make the OCR work even halfway decently you really have to start with a very high quality scan. The PDF files here on the website were not scanned for that purpose, but for web downloading, so there is a significant amount of compression applied to them to make the files small. This degrades the pixel data and makes it almost impossible to get an accurate OCR from the pages.
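To give a rough feel for why degraded pixels matter, here is a toy sketch in Python. It is not how a real OCR engine works; the 3x5 bitmaps, the two-character alphabet, and the function names are all made up for illustration. It matches a blob of pixels against known shapes and shows that losing just two pixels is enough to turn an "8" into a "3".

```python
# Toy 3x5 bitmaps ('#' = ink, '.' = blank). Hypothetical shapes for illustration only;
# real OCR engines use far more sophisticated recognition than this.
TEMPLATES = {
    "3": ["###", "..#", "###", "..#", "###"],
    "8": ["###", "#.#", "###", "#.#", "###"],
}


def distance(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def classify(glyph):
    """Return the template character whose bitmap is closest to the glyph."""
    return min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))


# A clean scan of an "8".
clean_scan = ["###", "#.#", "###", "#.#", "###"]

# The same "8" after heavy compression has wiped out two pixels on the left edge.
degraded_scan = ["###", "..#", "###", "..#", "###"]

print(classify(clean_scan))     # recognized as "8"
print(classify(degraded_scan))  # now indistinguishable from "3"
```

In a part number, one misread digit like that means a wrong part, which is why the quality of the scan matters so much.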

I know all about this as this is the type of software I write for a living.

So there are only two options for getting the Parts Manuals into a database format. One is hand-keying all the data into a spreadsheet and then doing a mass import into the database. The other is rescanning the Parts Manuals from clean sources in high quality mode and then OCRing the pages into an Excel spreadsheet, which would then have to be double-checked for OCR errors. Either way, this would take a long, long time. If someone wants to volunteer, I would be more than happy for them to do it, and I will build all the database backend to support it.
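As a rough idea of what the mass import and the double-check could look like, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration: the column names, the `parts` table, the SQLite database, and especially the six-digit part number pattern, which is hypothetical and not taken from the actual Parts Manuals. The spreadsheet is assumed to have been saved as CSV.

```python
import csv
import io
import re
import sqlite3

# Hypothetical part number format (exactly six digits), used only for this sketch.
PART_NUMBER = re.compile(r"\d{6}")


def import_parts(csv_text, conn):
    """Insert well-formed rows into the parts table; return rows needing a human look."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS parts (part_number TEXT, description TEXT)"
    )
    suspect = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        number = row["part_number"].strip()
        if PART_NUMBER.fullmatch(number):
            conn.execute(
                "INSERT INTO parts VALUES (?, ?)",
                (number, row["description"].strip()),
            )
        else:
            # Typical OCR slip: a letter where a digit should be.
            suspect.append(row)
    conn.commit()
    return suspect


# Made-up sample data; the second row has a lowercase "l" in place of a "1".
sample = "part_number,description\n312345,Door handle\n3l2346,Window crank\n"

conn = sqlite3.connect(":memory:")
flagged = import_parts(sample, conn)
print(flagged)  # rows set aside for manual checking
```

A pattern check like this only catches characters that cannot appear in a part number. A digit misread as another digit still gets through, so the manual double-check is still needed.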

Just as a point of reference, it takes me about 8 hours to hand-key all the information from each of the Packard Directories (lists of dealers) into the Dealership List. I have already done 4 and have 3 more to do.

This Post was from: https://packardinfo.com/xoops/html/modules/newbb/viewtopic.php?post_id=63348