go-ocr - A tool for extracting text from scanned documents (via OCR), with user-defined post-processing

A tool for extracting plain text from scanned documents (PDF or DjVu), with user-defined post-processing. I once had to OCR a number of scanned documents in PDF format. I quickly built a pipeline of tools to extract images from the input files and convert them to plain text, but then realised that modern OCR software is still far from ideal at recognising text, so a good deal of post-processing was needed to remove at least some of the OCR artefacts and irregularities. I ended up with a long pipeline of sed/grep filters that also had to be adjusted for each document and each document language. What I wanted was a tool that could combine invoking the OCR tools with applying the filters, while also giving an easy way to modify and combine the filter definitions.

https://github.com/maxim2266/go-ocr

Related Projects

Images 2 OpenXML

  •    DotNet

Images2OpenXML is an application that uses the Office 2007 OCR API to convert images of scanned documents to OpenXML documents. There is no longer any need for third-party applications to convert documents; you can use this tool for free. It was developed in C#.

OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

  •    Python

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted. For details, please consult the documentation.

pdfocr - Adds text to PDF files using the cuneiform OCR software

  •    Ruby

pdfocr adds an OCR text layer to scanned PDF files, allowing them to be searched. It currently depends on Ruby 1.8.7 or above, and uses ocropus, cuneiform, or tesseract for performing OCR. For more details, see the manpage.

GOCR

  •    C

GOCR is an OCR (Optical Character Recognition) program, developed under the GNU Public License. It converts scanned images of text back to text files. Joerg Schulenburg started the program, and now leads a team of developers.

gosseract - Go package for OCR (Optical Character Recognition), by using Tesseract C++ library

  •    Go

Go OCR package that uses the Tesseract C++ library. Check the Dockerfile for installation details, or just try it with docker run -it --rm otiai10/gosseract.


JavaOCR

  •    Java

Java OCR is an Optical Character Recognition algorithm based on a mean squared recognizer. This tool also includes utilities to trace and extract characters.

Akshara Malayalam OCR

  •    C++

Akshara Malayalam OCR is a project to develop an OCR system for printed and handwritten documents in the Malayalam language. It is inspired by similar OCR software for other languages.

gImageReader - A Gtk/Qt front-end to tesseract-ocr.

  •    C++

gImageReader is a simple Gtk/Qt front-end to tesseract-ocr. The steps for compiling gImageReader from source are documented in the wiki.

open-ocr - Run your own OCR-as-a-Service using Tesseract and Docker

  •    Go

OpenOCR makes it simple to host your own OCR REST API. The heavy lifting OCR work is handled by Tesseract OCR.

EasyOCR - Java image cleanup and OCR recognition component (based on the Tesseract OCR engine) that automatically cleans up images and recognizes CAPTCHA image content

  •    Java

EasyOCR is a Java OCR component built on the Tesseract engine. With a few simple API calls, Java code can recognize the content of images. It integrates image cleanup with recognition of CAPTCHA images, bills, notes, and similar content. The EasyOCR engine supports plugin programming and ETD templates, and provides a graphical ETD template design tool (EasyTemplateDesigner GUI). EasyOCR not only serves end users but is mainly aimed at providing an SDK for local development, for integration with C/S and B/S projects and for native integration on Android mobile devices.

android-ocr - Experimental optical character recognition app

  •    Java

An experimental app for Android that performs optical character recognition (OCR) on images captured using the device camera. Runs the Tesseract OCR engine using tess-two, a fork of Tesseract Tools for Android.

Terese OCR verifier

  •    C++

Terese is a tool for proofreading OCR text. Terese tries to map the text back to the scanned image, and visually shows the differences. See the homepage for further details.

paperwork - Personal document manager (Linux/Windows)

  •    Python

Paperwork is a personal document manager. It manages scanned documents and PDFs. It's designed to be easy and fast to use. The idea behind Paperwork is "scan & forget": you can just scan a new document and forget about it until the day you need it again.

PassportScanner - Scan the MRZ code of a passport and extract the first name, last name, passport number, nationality, date of birth, expiration date and personal number

  •    Swift

With PassportScanner you can use your camera to scan the MRZ code of a passport. It will extract data such as first name, last name, passport number, nationality, date of birth, expiration date and personal number. IMPORTANT NOTICE: SCANNING IDENTITY DOCUMENTS IS IN MOST CASES RESTRICTED BY LAW. OBSERVE THE APPLICABLE LAWS USING THIS TOOL. THE COPYRIGHT HOLDER IS NOT IN ANY WAY LIABLE FOR UNLAWFUL USAGE OF THIS TOOL.

tess4j - Java JNA wrapper for Tesseract OCR API

  •    Java

A Java JNA wrapper for Tesseract OCR API. Tess4J is released and distributed under the Apache License, v2.0. The library provides optical character recognition (OCR) support for TIFF, JPEG, GIF, PNG, and BMP image formats, multi-page TIFF images, and the PDF document format.

pytesseract - A Python wrapper for Google Tesseract

  •    Python

Python-tesseract is an optical character recognition (OCR) tool for Python. That is, it will recognize and "read" the text embedded in images. Python-tesseract is a wrapper for Google's Tesseract-OCR Engine. It is also useful as a stand-alone invocation script for tesseract, as it can read all image types supported by the Python Imaging Library, including jpeg, png, gif, bmp, tiff, and others, whereas tesseract-ocr by default only supports tiff and bmp. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.

tesseract-ocr-for-php - A wrapper to work with Tesseract OCR inside PHP.

  •    PHP

A wrapper to work with Tesseract OCR inside PHP. ‼️ This library depends on Tesseract OCR, version 3.03 or later.

pyocr - A Python wrapper for Tesseract and Cuneiform

  •    Python

PyOCR is an optical character recognition (OCR) tool wrapper for Python. That is, it helps you use various OCR tools from a Python program. It has been tested only on GNU/Linux systems. It should also work on similar systems (*BSD, etc). It may or may not work on Windows, MacOSX, etc.

tesseract - Tesseract Open Source OCR Engine (main repository)

  •    C++

This package contains an OCR engine - libtesseract and a command line program - tesseract. The lead developer is Ray Smith. The maintainer is Zdenko Podobny. For a list of contributors see AUTHORS and GitHub's log of contributors.

SwiftOCR - Fast and simple OCR library written in Swift

  •    Swift

SwiftOCR is a fast and simple OCR library written in Swift. It uses a neural network for image recognition. As of now, SwiftOCR is optimized for recognizing short, one-line alphanumeric codes (e.g. DI4C9CM). We currently support iOS and OS X.