OCR
#1
Not too shy to talk

jgrohn
Most of the documents posted in the literature archive are PDF images that someone has scanned. The documents have been very useful, and a big thanks to everyone who has contributed them. The one thing I wish I could do is search through the manuals electronically on my computer.

For example, I would like to be able to search all the documents and find any that include a keyword. Once I find the document, I would like to open the file and have it highlight my search word. As an example, I have attached two screenshots of a search for the word "steering".

I was able to make that work for the documents most relevant to me by downloading them and running them through an OCR (optical character recognition) converter. The converter reads in a PDF file and generates a new PDF file that looks just like the original but includes the text information along with it. The new set of PDF files is larger than the originals, but once a file has been converted, it is searchable on any computer.
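The kind of keyword search described above can be sketched in a few lines of Python. This is only an illustration, not how Cisdem or any PDF viewer works internally: it assumes the text of each page has already been pulled out of the OCRed PDFs by some other tool, and the file names and page text below are made-up sample data.

```python
from typing import Dict, List, Tuple


def search_pages(docs: Dict[str, List[str]], keyword: str) -> List[Tuple[str, int]]:
    """Return (document name, 1-based page number) for each page containing keyword."""
    needle = keyword.casefold()  # case-insensitive match
    hits = []
    for name, pages in docs.items():
        for page_no, text in enumerate(pages, start=1):
            if needle in text.casefold():
                hits.append((name, page_no))
    return hits


# Hypothetical sample data standing in for text extracted from OCRed PDFs.
sample = {
    "service_manual.pdf": ["Front suspension", "Steering gear adjustment"],
    "parts_list.pdf": ["Power steering pump", "Rear axle"],
}

for name, page_no in search_pages(sample, "steering"):
    print(f"{name}: page {page_no}")
```

The search only finds what the OCR step recognized, so a misread word on a page will not show up as a hit.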

To do this, I ended up buying a license for Cisdem converter. Even though the OCR might not be perfect, I have been very impressed with how well the search works on the files. The downside is that the resulting PDF files are larger than the originals. For example, the directory "1955-1956 Parts and Accessories List" is 41.5 MB before conversion and 293.2 MB after.
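To put the size increase in perspective, here is a small sketch of the arithmetic using the two directory sizes quoted above. The function names are just for illustration, and applying the same growth factor to other directories is an assumption; the real increase will depend on the scans.

```python
def growth_factor(before_mb: float, after_mb: float) -> float:
    """How many times larger the converted files are than the originals."""
    return after_mb / before_mb


def projected_size_mb(original_mb: float, factor: float) -> float:
    """Rough size estimate for another directory, assuming the same growth factor."""
    return original_mb * factor


# Sizes of "1955-1956 Parts and Accessories List" before and after conversion.
factor = growth_factor(41.5, 293.2)
print(f"Converted files are about {factor:.1f} times larger")
```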

So here are some questions. Would other forum users find it useful to be able to search the PDF documents? Should I upload the documents that I have already converted to some shared location? Would it make sense to run the conversion on the whole literature archive? I know there are costs associated with hosting data on the web. Any thoughts?

Thanks,

John

Attached files:

SearchFiles.jpg (87.21 KB, 2696 x 600 px)
SearchInFile.jpg (292.15 KB, 1858 x 1134 px)

Posted on: 2022/8/23 4:23


Re: OCR
#2
Home away from home

Wat_Tyler
Not sure I understand exactly what you're talking about, but I think I get the idea. Sounds quite helpful.


Maybe "talk" to BigKev. If there are costs associated, I think this bunch would pony up to make a good thing happen.

Posted on: 2022/8/23 5:56
If you're not having fun, maybe it's your own damned fault.


Re: OCR
#3
Forum Ambassador

HH56
I am one of several who did some of the early scans that are on the website. At the time the website was started, OCR for a home system was not very accurate yet was still fairly expensive. It is a good idea, though, and I know Kev had thought about it then and again a few years ago. I am not sure whether it was the time needed to run the PDFs through again, the inaccuracy, or the large file size that kept it from going forward.

I am not familiar with the program you used, but one concern is that the huge file size could be an issue for some, either when just viewing or, worse, when trying to download, and also for the storage space needed on the server. File size is something that he tried to keep at a compromise level early on because of server bandwidth and the lack of high-speed internet in many places, which is still an issue for some. The aim was a file not so small that image quality suffered, yet not so large that you could blow the images up to the size of a newspaper and not see any pixelation. As I recall, at the time the website was started Kev tried to keep ordinary images at around 1200 pixels in width because the average CRT monitor was not really capable of displaying much more. PDFs were larger, but he still tried to keep them to a relatively low size.

A question now would be: after running the PDFs through OCR, does the Cisdem program have a file reduction capability, or could they be run through another program for file reduction without completely destroying the OCR functionality? I realize the OCR info has to be stored somewhere, but so do image detail and PDF format info. PDFs do seem to have a lot of extraneous info. When I did the 54 sales brochure Kev posted a few months ago, taking high-res photos with the camera software and then compiling and organizing the PDF for display in another program resulted in a huge native file, many GBs. Running it through file reduction in PDF Expert brought it down to less than 20 MB, and the images and page format were still excellent. Maybe you could experiment and see what would happen with an OCRed file.
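For anyone who wants to try that experiment without PDF Expert, one possible route is Ghostscript, a free command-line tool that can rewrite a PDF with downsampled images. The sketch below only builds the command; it assumes Ghostscript is installed as `gs`, the file names are placeholders, and whether the searchable text survives intact on a given OCRed file is something to check on a sample rather than take for granted.

```python
from typing import List


def ghostscript_compress_cmd(src: str, dst: str, preset: str = "/ebook") -> List[str]:
    """Build a Ghostscript command that rewrites src into a smaller PDF at dst.

    preset is one of Ghostscript's PDFSETTINGS presets, for example
    "/screen" (smallest), "/ebook" (medium) or "/printer" (larger).
    """
    return [
        "gs",
        "-sDEVICE=pdfwrite",          # write a new PDF
        f"-dPDFSETTINGS={preset}",    # image downsampling preset
        "-dNOPAUSE",
        "-dBATCH",
        "-dQUIET",
        f"-sOutputFile={dst}",
        src,
    ]


# Placeholder file names for illustration only.
cmd = ghostscript_compress_cmd("ocr_sample.pdf", "ocr_sample_small.pdf")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

After running it, compare the file sizes and repeat a keyword search on the smaller file to confirm the text is still found.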

Posted on: 2022/8/23 9:35
Howard


Re: OCR
#4
Not too shy to talk

jgrohn
I tried compressing some of the generated PDFs with little success. If others are interested in giving it a try, here is a link with some sample generated files:
https://www.dropbox.com/sh/f9x3shj2137g2mm/AAAdFOU_5IEusrrK_5W23lCsa?dl=0

Posted on: 2022/8/24 5:03


Re: OCR
#5
Home away from home

Guscha
Quote:
...Would other forum users find it useful to be able to search the pdf documents? ... Would it make sense to run the conversion on the whole literature archive?...


John (jgrohn), thanks for the offer. Yes, that would be useful. I tried to compress one of your files after changing the color model (RGB/CMYK/grey), but my effort was unsuccessful. The same goes for my attempt to convert your PDF to MS Word. How about downsizing your PDF files by using b&w scans?

Posted on: 2022/8/24 15:28
The story of ZIS-110, ZIS-115, ZIL-111 & Chaika GAZ-13 on www.guscha.de


Re: OCR
#6
Home away from home

Fish'n Jim
I worked on a project to convert all the plant batch records to digital files. The need for OCR is if you have non-recognizable text, like handwriting; otherwise, it should be able to search the text fields, if it's one of the recognizable ones. You used to have to save with OCR, but I'm not sure that's the case anymore. I'm not up to date anymore and don't care to be, which comes with age.
What happened was there wasn't enough manpower to scan all the records, and no one ever accessed the files once they were scanned. It was part of a bigger IT pie in the sky project to automate the plant records, and directly enter data into SAP, but like most pie in the sky, it fell on someone's face.
Make sure you know how much effort it will take, how much use it will get, and of course the cost. It may not be so rosy. Unless you're researching individual systems or something, better to just pull up the file and read it. Most big documents have a table of contents.

Posted on: 2022/8/24 19:35


Re: OCR
#7
Home away from home

MJG
I have had some experience in this area with my work. My company leverages a few different commercial products for invoice processing: Datacap and Abbyy. Both are fairly robust but still have limitations such as 400 dpi resolution (doesn't seem to be the problem here) and correct orientation of the scan. Both still require a lot of manual interpretation.

My team conducts major capital and construction fraud/overbilling investigations. We leverage Adobe's PDF converter for our work, and we find it to be quite good. We convert prime contractor and sub-contractor third-party support invoicing to Word documents and Excel files to conduct our work. We do not care about the images or company logos; it's all about the "Benjamins." Dropping the images could make your files smaller, i.e., just convert to Word files that are all text. It still has limitations and requires a LOT of manipulation with unstructured data. The file above, if converted to Excel, would be a hot mess and need a LOT of manipulation to clean it up. Often I will tell my auditors: only spend the time if the "juice is worth the squeeze."

Mike

Posted on: 2022/8/24 19:54
1948 Custom Eight Victoria Convertible
Others:
1941 Cadillac Series 62 Deluxe Convertible Coupe
1956 Oldsmobile 88 Sedan


Re: OCR
#8
Home away from home

Packard Don
Isn’t this why we digitized and OCRed the Factory Parts List? The 1948-1954 list has been fully proofed line by line. It has not yet been put online with the corrections, but even so it is a very handy tool that is already searchable. The corrected data is on my site too, in a format that emulates the printed book.

Posted on: 2022/8/25 0:25


Re: OCR
#9
Not too shy to talk

jgrohn
I just wanted to wrap things up here. If anyone is interested in OCR versions of the 1955-56 documentation, they are available at my original Dropbox link:
https://www.dropbox.com/sh/f9x3shj2137g2mm/AAAdFOU_5IEusrrK_5W23lCsa?dl=0

They are generated files, and so the handwritten parts are not 100% correct, but they are good enough for me.

At this point, I am not planning on running the conversion on other files. However, if there is a specific set of files that someone would like me to run the OCR on, feel free to get in touch with me.

Posted on: 2022/8/29 4:58


Re: OCR
#10
Just popping in

borhansometimes
Your struggle of wanting to search through scanned documents electronically is super relatable. It can be quite a hassle to manually sift through pages just to find a specific keyword. Your solution of using OCR to convert the documents into searchable pdfs is pretty smart, and I'm glad it's been working well for you!
As for running the conversion on the whole literature archive, it might be a bit too ambitious and time-consuming, but it would definitely be a game-changer for anyone looking to search through the entire archive. I use Smart Engines OCR SDK to make the process more feasible. Check it out!

Posted on: 2023/3/17 6:38








Copyright 2006-2024, PackardInfo.com All Rights Reserved