I have a Linux server with over 8,000 PDFs and need to know which PDFs have been OCR'd and which ones haven't. On Linux, the tool for this kind of job is optical character recognition (OCR) software: a text recognition program lets you add a text layer to the data.

If a PDF-to-text converter such as pdftotext is not on your machine, you'll have to install the poppler-utils package:

    sudo apt-get install poppler-utils

You might also find the PDF toolkit of use. A full list of PDF software is available on Wikipedia.

Edit: Since you do need OCR capabilities, I think you'll have to try a different tack (i.e., I couldn't find a Linux pdf2text converter that does OCR).

gs: The command below should convert a multi-page PDF to individual TIFF files.

    gs -sDEVICE=tiffg4 -r600x600 -sPAPERSIZE=letter -sOutputFile=filename_%04d.tif -dNOPAUSE -dBATCH -- filename

ImageMagick utilities: There are other questions on the SuperUser site about using ImageMagick that you might use to help you do the conversion.

I have had success with the BSD-licensed Linux port of the Cuneiform OCR system. No binary packages seem to be available, so you need to build it from source. Be sure to have the ImageMagick C++ libraries installed to have support for essentially any input image format (otherwise it will only accept BMP). While it appears to be essentially undocumented apart from a brief README file, I've found the OCR results quite good.

The nice thing about it is that it can output position information for the OCR text in hOCR format, so that it becomes possible to put the text back in the correct position in a hidden layer of a PDF file. This way you can create "searchable" PDFs from which you can copy text.

I have used hocr2pdf to recreate PDFs out of the original image-only PDFs and OCR results. Sadly, the program does not appear to support creating multi-page PDFs, so you might have to create a script to handle them:

    #!/bin/bash
    # Run OCR on a multi-page PDF file and create a new PDF with the
    # recognized text in a hidden layer.
    input="$1"                # source PDF (assumed to be the first argument)
    output="$2"               # result PDF (assumed to be the second argument)
    tmpdir="$(mktemp -d)"     # scratch directory for per-page files

    # extract images of the pages (note: resolution hard-coded)
    gs -sDEVICE=tiffg4 -r300x300 -sOutputFile="$tmpdir/page-%04d.tiff" -dNOPAUSE -dBATCH -- "$input"

    # OCR each page individually and convert into PDF
    for page in "$tmpdir"/page-*.tiff; do
        base="${page%.tiff}"
        cuneiform -f hocr -o "$base.html" "$page"
        hocr2pdf -i "$page" -o "$base.pdf" < "$base.html"
    done

    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/page-*.pdf
    rm -rf -- "$tmpdir"

Please note that the above script is very rudimentary. For example, it does not retain any PDF metadata.

For OCR best practices and how to begin an OCR project using ABBYY FineReader, Adobe Acrobat Pro, or Tesseract, see the guide "Introduction to OCR and Searchable PDFs".
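For the question of which PDFs in a large collection have already been OCR'd, one rough approach is to check whether a text layer can be extracted at all. The sketch below is not from the original answers. It assumes pdftotext from poppler-utils is installed, and the function names `count_text_chars` and `classify_pdf` as well as the 20-character threshold are hypothetical choices made for illustration. Whitespace is stripped before counting because pdftotext can emit form feeds and newlines even for image-only pages.

```shell
#!/bin/bash
# Sketch: classify PDFs by whether they appear to contain a text layer.
# Assumes pdftotext (poppler-utils) is on the PATH; the default threshold
# of 20 characters is an arbitrary, illustrative value.

# count_text_chars: read text on stdin, print the number of
# non-whitespace characters.
count_text_chars() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# classify_pdf FILE [MIN_CHARS]: print "text" if at least MIN_CHARS
# non-whitespace characters can be extracted, otherwise "image-only".
classify_pdf() {
    local file="$1" min="${2:-20}" n
    n="$(pdftotext "$file" - 2>/dev/null | count_text_chars)"
    if [ "${n:-0}" -ge "$min" ]; then
        echo "text"
    else
        echo "image-only"
    fi
}

# When called with a directory argument, walk it and report each PDF
# as "<classification><TAB><path>".
if [ "$#" -ge 1 ]; then
    find "$1" -type f -iname '*.pdf' -print0 |
    while IFS= read -r -d '' pdf; do
        printf '%s\t%s\n' "$(classify_pdf "$pdf")" "$pdf"
    done
fi
```

Saved under a name of your choosing and run with a directory as its argument, this prints one line per PDF. Files reported as image-only are candidates for the OCR script, though a PDF with very little real text (a cover page, say) may be misclassified, so the threshold is worth tuning on a sample first.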