Re: OCR
|
||||
---|---|---|---|---|
Home away from home
|
Not sure I understand exactly what you're talking about, but I think I get the idea. Sounds quite helpful.
Maybe "talk" to BigKev. If there are costs associated, I think this bunch would pony up to make a good thing happen.
Posted on: 2022/8/23 5:56
|
|||
If you're not having fun, maybe it's your own damned fault.
|
||||
|
Re: OCR
|
||||
---|---|---|---|---|
Forum Ambassador
|
I am one of several who did some of the early scans that are on the website. At the time the website was started, OCR for a home system was not very accurate, yet it was still fairly expensive. It is a good idea, though, and I know Kev had thought about it then and again a few years ago. I am not sure whether the reason it did not go forward was the time needed to run the PDFs through again, inaccuracy, or the large file size.
I am not familiar with the program you used, but one concern is that a huge file size could be an issue for some, whether just viewing or, worse, trying to download, and also for the storage space needed on the server. File size is something Kev tried to keep at a compromise level early on because of server bandwidth and the lack of high-speed internet in many places, which is still an issue for some. The aim was a file not so small that image quality suffered, yet not so large that you could blow the images up to the size of a newspaper and still not see any pixelation. As I recall, at the time the website was started Kev tried to keep ordinary images at around 1200 pixels in width because the average CRT monitor was not really capable of displaying much more. PDFs were larger, but he still tried to keep them to a relatively low size.
A question now would be: after running the PDFs through OCR, does the Cisderm program have a file reduction capability, or could the files be run through another program for file reduction without completely destroying the OCR functionality? I realize the OCR info has to be stored somewhere, but so does image detail and PDF format info. PDFs do seem to have a lot of extraneous info. When I did the '54 sales brochure Kev posted a few months ago, taking high-res photos with the camera software and then compiling and organizing the PDF for display in another program resulted in a huge native file, many GBs in size. Running it through file reduction in PDF Expert brought it down to less than 20 MB, and the images and page format were still excellent. Maybe you could experiment and see what would happen with an OCR'd file.
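For anyone who wants to try the file-reduction experiment without PDF Expert, here is a minimal sketch that drives Ghostscript from Python. It assumes Ghostscript is installed and available as `gs` (on Windows the executable is usually named differently); the function names `build_gs_command` and `shrink` are made up for this illustration. Whether the OCR text layer survives the rewrite is not guaranteed, so search the output file for a known word before trusting it.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# Ghostscript's built-in quality presets, roughly from smallest to largest output.
PRESETS = {"/screen", "/ebook", "/printer", "/prepress", "/default"}


def build_gs_command(src: Path, dst: Path, preset: str = "/ebook") -> List[str]:
    """Build a Ghostscript command that rewrites src to dst with downsampled images."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}")
    return [
        "gs",                        # assumption: Ghostscript is on PATH under this name
        "-sDEVICE=pdfwrite",         # write a new PDF
        f"-dPDFSETTINGS={preset}",   # image downsampling/compression preset
        "-dNOPAUSE",
        "-dBATCH",
        "-dQUIET",
        f"-sOutputFile={dst}",
        str(src),
    ]


def shrink(src: Path, dst: Path, preset: str = "/ebook") -> float:
    """Run Ghostscript and return new size divided by old size."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) was not found on PATH")
    subprocess.run(build_gs_command(src, dst, preset), check=True)
    return dst.stat().st_size / src.stat().st_size
```

Calling `shrink(Path("scan_ocr.pdf"), Path("scan_small.pdf"))` (hypothetical file names) returns a ratio; anything well below 1.0 means the reduction helped.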
Posted on: 2022/8/23 9:35
|
|||
Howard
|
||||
|
Re: OCR
|
||||
---|---|---|---|---|
Quite a regular
|
I tried compressing some of the generated PDFs with little success. If others are interested in giving it a try, here is a link with some sample generated files:
dropbox.com/sh/f9x3shj2137g2mm/AAAdFOU_5IEusrrK_5W23lCsa?dl=0
Posted on: 2022/8/24 5:03
|
|||
|
Re: OCR
|
||||
---|---|---|---|---|
Home away from home
|
Quote:
...Would other forum users find it useful to be able to search the pdf documents? ... Would it make sense to run the conversion on the whole literature archive?...
John (jgrohn), thanks for the offer. Yes, that would be useful. I tried to compress one of your files after changing the color model (RGB/CMYK/grey), but without success. The same goes for my attempt to convert your PDF to MS Word. How about reducing the size of your PDF files by using b&w scans?
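To put a rough number on the b&w idea, this small calculation compares the uncompressed size of one scanned page in color, greyscale, and black and white. The page size (8.5 x 11 in) and the 300 dpi resolution are assumptions chosen for illustration, not the settings used for the archive, and real PDFs compress the images, so actual savings will differ.

```python
def raw_page_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed image size in bytes for one scanned page (row padding ignored)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page: US letter at 300 dpi.
for name, bpp in [("color (24-bit RGB)", 24), ("greyscale (8-bit)", 8), ("black and white (1-bit)", 1)]:
    size = raw_page_bytes(8.5, 11, 300, bpp)
    print(f"{name}: {size / 1_000_000:.1f} MB")
# color (24-bit RGB): 25.2 MB
# greyscale (8-bit): 8.4 MB
# black and white (1-bit): 1.1 MB
```

Before compression, a 1-bit scan carries 1/24 of the data of a 24-bit color scan, which is why b&w scanning is worth a try for text-heavy pages.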
Posted on: 2022/8/24 15:28
|
|||
The story of ZIS-110, ZIS-115, ZIL-111 & Chaika GAZ-13 on www.guscha.de
|
||||
|
Re: OCR
|
||||
---|---|---|---|---|
Home away from home
|
I worked on a project to convert all the plant batch records to digital files. OCR is needed when you have non-recognizable text, like handwriting; otherwise you should be able to search the text fields, if the text is one of the recognizable kinds. You used to have to save with OCR, but I'm not sure that's the case anymore. I'm not up to date anymore and don't care to be, which comes with age.
What happened was that there wasn't enough manpower to scan all the records, and no one ever accessed the files once they were scanned. It was part of a bigger pie-in-the-sky IT project to automate the plant records and enter data directly into SAP, but like most pie in the sky, it fell on someone's face. Make sure you know how much effort it will take, how much use it will get, and of course the cost. It may not be so rosy. Unless you're researching individual systems or something, it's better to just pull up the file and read it. Most big documents have a table of contents.
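One way to gauge the effort before committing is to inventory the archive first. The sketch below counts the PDFs under a folder and turns their total size into a very rough run-time estimate. The `seconds_per_mb` figure is a placeholder you would have to measure yourself by timing the OCR program on a few sample files; it is not a known number for any particular tool.

```python
from pathlib import Path
from typing import Tuple


def inventory(root: Path) -> Tuple[int, int]:
    """Return (number of PDF files, total size in bytes) found under root."""
    pdfs = [p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"]
    return len(pdfs), sum(p.stat().st_size for p in pdfs)


def estimate_hours(total_bytes: int, seconds_per_mb: float) -> float:
    """Rough OCR run time in hours, given a measured seconds-per-megabyte rate."""
    return total_bytes / 1_000_000 * seconds_per_mb / 3600
```

For example, `inventory(Path("literature"))` (a hypothetical folder name) gives the file count and byte total, and feeding that total into `estimate_hours` with your own measured rate gives a ballpark for the whole job.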
Posted on: 2022/8/24 19:35
|
|||
|
Re: OCR
|
||||
---|---|---|---|---|
Home away from home
|
I have had some experience in this area with my work. My company leverages a few different commercial products for invoice processing: Datacap and Abbyy. Both are fairly robust but still have limitations such as 400 dpi resolution (doesn't seem to be the problem here) and correct orientation of the scan. Both still require a lot of manual interpretation.
My team conducts major capital and construction fraud/overbilling investigations. We leverage Adobe's PDF converter for our work and find it to be quite good. We convert prime contractor and sub-contractor third-party support invoicing to Word documents and Excel files to conduct our work. We do not care about the images or company logos; it's all about the "Benjamins". That could make your files smaller, i.e. just convert to Word files, all text. It still has limitations and requires a LOT of manipulation with unstructured data. The file above, if converted to Excel, would be a hot mess and need a LOT of manipulation to clean it up. Often I will tell my auditors: only spend the time if the "juice is worth the squeeze."
Mike
Posted on: 2022/8/24 19:54
|
|||
1948 Custom Eight Victoria Convertible
Others: 1941 Cadillac Series 62 Deluxe Convertible Coupe, 1956 Oldsmobile 88 Sedan
||||
|
Re: OCR
|
||||
---|---|---|---|---|
Home away from home
|
Isn’t this why we digitized and OCRed the Factory Parts List? The 1948-1954 list has been fully proofed line by line. It has not yet been put online with the corrections, but even so it is a very handy tool that is already searchable. The corrected data is on my site too, in a format that emulates the printed book.
Posted on: 2022/8/25 0:25
|
|||
|
Re: OCR
|
||||
---|---|---|---|---|
Quite a regular
|
I just wanted to wrap things up here. If anyone is interested in OCR versions of the 1955-56 documentation, they are available at my original Dropbox link:
dropbox.com/sh/f9x3shj2137g2mm/AAAdFOU_5IEusrrK_5W23lCsa?dl=0
They are generated files, so the handwritten parts are not 100% correct, but they are good enough for me. At this point, I am not planning on running the conversion on other files. However, if there is a specific set of files that someone would like me to run the OCR on, feel free to get in touch with me.
Posted on: 2022/8/29 4:58
|
|||
|
Re: OCR
|
||||
---|---|---|---|---|
Just popping in
|
Your struggle of wanting to search through scanned documents electronically is super relatable. It can be quite a hassle to manually sift through pages just to find a specific keyword. Your solution of using OCR to convert the documents into searchable pdfs is pretty smart, and I'm glad it's been working well for you!
As for running the conversion on the whole literature archive, it might be a bit too ambitious and time-consuming, but it would definitely be a game-changer for anyone looking to search through the entire archive. I use Smart Engines OCR SDK to make the process more feasible. Check it out!
Posted on: 2023/3/17 6:38
|
|||
|