Topic: Search Pages

PackardV8

Re: Search Pages

#21

Home away from home

OK. Suppose someone sends me a PDF, or I download a PDF to my 'puter. Further suppose the PDF contains just one page of the 55-56 Packard Parts Catalogue.
What canned utility (if any exists) do I use to, say, strip off (capture) the leftmost 10 bytes of each line on the page? Those 10 bytes would be the part numbers.

What's not clear to me is how File/Record handling is accomplished in the PC world.

To me the ENTIRE parts catalogue is a FILE. Each printed line on any given page is a single RECORD of, say, 80 bytes (old card image format).
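For illustration only, here is a minimal Python sketch of that file/record view, assuming the page is already available as plain text with one printed line per record. The 80-byte record width and 10-byte key width come from the description above; the function name `part_numbers` and the sample lines are made up for the example and are not real catalogue data.

```python
RECORD_LENGTH = 80  # assumed card-image record width
KEY_LENGTH = 10     # assumed width of the part-number field

def part_numbers(text):
    """Return the leftmost KEY_LENGTH characters of every non-blank line."""
    keys = []
    for line in text.splitlines():
        # Pad or trim each line so it behaves like a fixed-length record.
        record = line.ljust(RECORD_LENGTH)[:RECORD_LENGTH]
        key = record[:KEY_LENGTH].strip()
        if key:
            keys.append(key)
    return keys

# Made-up sample records, not taken from any real parts catalogue.
sample = "1234567   GASKET, example\n\n7654321   BOLT, example\n"
print(part_numbers(sample))
```

This only works on text. It does nothing useful on a PDF whose pages are stored as pictures.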

It is my understanding that MSFT has a COBOL application or package that can run on a PC, although very slowly, I'm sure.

Accomplishing the 2-file match is child's play in COBOL using any of the old IBM or Honeywell/GE file structures with COBOL formatting.
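As a rough sketch of what a 2-file match looks like outside COBOL, the Python below walks two key lists in step. It assumes both lists are already sorted ascending and that keys are unique within each list. The names `two_file_match`, `catalogue`, and `wanted`, and the sample keys, are hypothetical.

```python
def two_file_match(master_keys, transaction_keys):
    """Sequentially match two ascending-sorted key lists.

    Returns (matched, master_only, transaction_only).
    """
    matched, master_only, transaction_only = [], [], []
    i = j = 0
    while i < len(master_keys) and j < len(transaction_keys):
        m, t = master_keys[i], transaction_keys[j]
        if m == t:
            # Key is present in both files.
            matched.append(m)
            i += 1
            j += 1
        elif m < t:
            # Master key has no partner in the transaction file.
            master_only.append(m)
            i += 1
        else:
            # Transaction key has no partner in the master file.
            transaction_only.append(t)
            j += 1
    # Whatever is left over on either side is unmatched.
    master_only.extend(master_keys[i:])
    transaction_only.extend(transaction_keys[j:])
    return matched, master_only, transaction_only

# Made-up keys for illustration.
catalogue = sorted(["100200", "100300", "100500"])
wanted = sorted(["100300", "100400", "100500"])
print(two_file_match(catalogue, wanted))
```

The matching step is the easy part on any platform. Getting clean keys out of the source pages is where the work lies.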

My problem is that I don't know how PDFs are structured in terms of what is a record, record length, ASCII or ANSI character sets, etc.

We're not dealing with any graphics here. Only byte-by-byte text characters.

Posted on: 2010/11/2 16:00

VAPOR LOCK demystified: See paragraph SEVEN of PMCC documentation as listed in post #11 of the following thread:
packardinfo.com/xoops/html/modules/newbb/viewtopic.php?topic_id=7245

BigKev

Re: Search Pages

#22

Webmaster

Keith,

I don't think you are understanding. PDFs can contain both textual data and images. Most PDFs contain both: the text as text, and the images as images. All the PDF content on the website is made from scanned pages, so each page is an image, not text. It's basically like taking a photograph of each page. So you can't simply extract the part numbers from the pages, as they are not data anymore. It's a picture of the data.
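A crude way to see the difference for a given file is to look for font and image markers in its raw bytes. The Python sketch below is a heuristic under stated assumptions, not a PDF parser: it assumes the relevant dictionaries are stored uncompressed, which many PDFs do not guarantee, so a result of "unknown" or a missed font is quite possible. The function name `classify_pdf_bytes` is made up for the example.

```python
import re

def classify_pdf_bytes(data):
    """Guess whether raw PDF bytes describe text, images, or both.

    Heuristic only: markers hidden inside compressed streams are not seen.
    """
    has_fonts = re.search(rb"/Font\b", data) is not None
    has_images = re.search(rb"/Subtype\s*/Image\b", data) is not None
    if has_fonts and has_images:
        return "text and images"
    if has_fonts:
        return "text"
    if has_images:
        return "images only (likely a scan)"
    return "unknown"

# Hand-written fragments standing in for real files.
scan_like = b"<< /Type /XObject /Subtype /Image /Width 1700 >>"
text_like = b"<< /Type /Page /Resources << /Font << /F1 5 0 R >> >> >>"
print(classify_pdf_bytes(scan_like))
print(classify_pdf_bytes(text_like))
```

To try it on a real file, read the file in binary mode and pass the bytes in, for example `classify_pdf_bytes(open("page.pdf", "rb").read())`, where `page.pdf` is a placeholder name.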

The only way to turn a scanned image back into data is to run it through OCR software. OCR stands for optical character recognition. This is specialized software that looks at blobs of pixels in the image and tries to detect whether they are letters and numerals. To make the OCR work even halfway decently, you really have to start with a very high-quality scan. The PDF files here on the website were not scanned for that purpose; they were scanned for web downloading. So there is a significant amount of compression applied to them to make the files small. This degrades the pixel data and makes it almost impossible to get an accurate OCR result from the pages.

I know all about this as this is the type of software I write for a living.

So there are only two options for getting the Parts Manuals into a database format: hand-keying all the data into a spreadsheet and then doing a mass import into the database, or rescanning the Parts Manuals from clean sources in high-quality mode and then OCRing the pages into an Excel spreadsheet. The OCR output would then have to be double-checked for errors. Either way, this would take a long, long time. If someone wants to volunteer, I would be more than happy for them to do it, and I will build all the database backend to support it.
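Part of that double-checking could be automated. The Python sketch below flags rows whose first column is not purely digits, which catches common OCR swaps such as the letter O for zero or a lowercase L for one. It assumes the spreadsheet has been exported as CSV with the part number in the first column and that part numbers are all digits; both are assumptions for the example, as are the function name `suspect_rows` and the sample rows.

```python
import csv
import io

def suspect_rows(csv_text):
    """Return (line_number, part_number) for rows whose key is not all digits."""
    suspects = []
    reader = csv.reader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        part = row[0].strip()
        if not part.isdigit():
            suspects.append((line_no, part))
    return suspects

# Made-up rows: the second has a letter O, the third a lowercase L.
sample_csv = "440123,Gasket\n44O124,Bolt\n44l125,Washer\n"
print(suspect_rows(sample_csv))
```

A check like this only narrows down where to look. A digit misread as another digit still needs a human comparing against the printed page.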

Just as a point of reference, it takes me about 8 hours to hand-key all the information from each of the Packard Directories (lists of dealers) into the Dealership List. I have already done 4 and have 3 more to do.

Posted on: 2010/11/2 18:17

-BigKev

1954 Packard Clipper Deluxe Touring Sedan -> Registry | Project Blog

1937 Packard 115-C Convertible Coupe -> Registry | Project Blog

JWL

Re: Search Pages

#23

Home away from home

Kev, Scanned photos can be in .jpg and .pdf formats, correct?

(o{I}o)

Posted on: 2010/11/3 10:58

We move toward
And make happen
What occupies our mind... (W. Scherer)

BigKev

Re: Search Pages

#24

Webmaster

Photos should be in jpg format. PDFs should be for documents, and anything that is multiple pages.

The Photo Archive here on the website only takes JPGs.

Posted on: 2010/11/3 11:35


BigKev

Re: Search Pages

#25

Webmaster

The advanced Search Box is now fixed, and the filters at the bottom are now honored. So if you want to search only a specific area, you now can.

Thanks,

Posted on: 2010/11/3 11:50

