SharePoint OCR image files indexing


An IFilter plugin for the Microsoft Indexing Service (and SharePoint in particular) that indexes and searches image files (including TIFF, PDF, JPEG, BMP...) using OCR technology.



Related Projects

Automate iFilter and PDF Indexing support to a SharePoint Farm

An stsadm command that automates support for PDF files (or any other extension) in a SharePoint farm (not limited to a single server). It deploys an icon and updates the Office Search/WSS Search servers.

tess4j - Java JNA wrapper for Tesseract OCR API

Tess4J is a Java JNA wrapper for the Tesseract OCR API. It is released and distributed under the Apache License, v2.0. The library provides optical character recognition (OCR) support for TIFF, JPEG, GIF, PNG, and BMP image formats, multi-page TIFF images, and the PDF document format.

Solr - Blazing-fast, open source enterprise search platform

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.


WS-FileConvertor is a .NET application that converts image files into a readable text format. The application also uploads the converted files (in text format) to SharePoint. All image formats are supported, including GIF, JPG, BMP, TIFF, etc.

ambar - Ambar: Document Search System

Ambar is an open-source document search and management system with automated crawling, OCR, tagging, and instant full-text search. There are two editions available: Community and Enterprise. The Enterprise Edition is a full-featured document search and management system that can handle terabytes of data.

An open source .NET web crawler written in C# using SQL 2005/2008. It is a complete and comprehensive crawler for downloading, indexing, and storing Internet content, including e-mail addresses, files, hyperlinks, images, and Web pages.


A SharePoint 2010 Service Application framework containing the infrastructure for easy OCR processing of PDF files in lists. An OCR component is not included.

Constellio - Enterprise Search engine

Constellio Open Source Enterprise Search is based on Apache Solr and uses the Google Search Appliance connector architecture. With a single click, it lets you find all relevant content in your organization (Web, email, ECM, CRM, etc.).

TIFF-to-PDF - a bare-bones application to convert tiff files to pdf

A bare-bones application to convert TIFF files to PDF.


A desktop file index and search tool that lets you choose a list of folders to index and then search them later. It is based on and the IFilter mechanism.

Xapian - Search Engine Library

Xapian is an Open Source Search Engine Library. It is written in C++, with bindings to allow use from Perl, Python, PHP, Java, Tcl, C# and Ruby. Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also supports a rich set of boolean query operators.
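To make the idea of boolean query operators concrete, here is a minimal sketch of AND, OR, and AND NOT over a toy inverted index. It does not use the Xapian API; the function names (`build_index`, `search_and`, `search_or`, `search_and_not`) and the sample documents are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_and(index, *terms):
    """Return ids of documents that contain every term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """Return ids of documents that contain at least one term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


def search_and_not(index, include, exclude):
    """Return ids of documents that contain `include` but not `exclude`."""
    return index.get(include.lower(), set()) - index.get(exclude.lower(), set())


# Hypothetical sample documents (assumption, not from any real corpus).
docs = {
    1: "scanned tiff invoice",
    2: "scanned pdf contract",
    3: "plain text contract",
}
index = build_index(docs)

print(search_and(index, "scanned", "contract"))
print(search_or(index, "tiff", "text"))
print(search_and_not(index, "contract", "scanned"))
```

A real search library adds stemming, ranking, and persistent storage on top of this kind of set arithmetic; the sketch only shows how the boolean operators combine posting sets.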

Open Search Server

Open Search Server is both a modern crawler and search engine and a suite of high-powered full-text search algorithms. It is built using open source technologies such as Lucene, zkoss, Tomcat, POI, and TagSoup. Open Search Server is a stable, high-performance piece of software.

Converting SharePoint List To PDF

This library reads all SharePoint lists, converts them to PDFs, and also saves the PDFs into a SharePoint library.

PDF-OCR - Release history of PDF-OCR

Release history of PDF-OCR

pdftools - Text Extraction, Rendering and Converting of PDF Documents

Scientific articles are typically locked away in PDF format, a format designed primarily for printing but not so great for searching or indexing. The new pdftools package allows for extracting text and metadata from PDF files in R. From the extracted plain text one could find articles discussing a particular drug or species name, without having to rely on publishers providing metadata or on pay-walled search engines. The pdftools package slightly overlaps with the Rpoppler package by Kurt Hornik. The main motivation behind developing pdftools was that Rpoppler depends on glib, which does not work well on Mac and Windows. The pdftools package uses the poppler C++ interface together with Rcpp, which results in a lighter and more portable implementation.

SharePoint PDF Upload Metadata Extractor

When activated, this feature extracts the title of a PDF file during upload to a SharePoint document library and stores it in the "Title" field of the list item.

Searchdaimon - Enterprise Search

Searchdaimon is an open source search engine for corporate data and websites. It comes with a powerful administrator interface and can index websites and several common enterprise systems such as SharePoint, Exchange, SQL databases, and Windows file shares. It also supports many document formats (e.g., Word, PDF, Excel) and offers faceted search, attribute navigation, and collection sorting.

Tiff Splitter

A simple WinForms app that opens a multi-page TIFF file and saves all pages as individual TIFF files. The multi-page TIFF can be opened via an open file dialog or dragged in.

SharePoint Search Admin

SharePoint Search Admin is a Windows Forms-based tool for managing the search functions of Microsoft Office SharePoint Server 2007 and Microsoft Search Server 2008. It can manage content sources, schedules, and crawl status. Please note that source code is included in release packages. ...

SharePoint Search XSL Samples

This project is a place to share examples of XSL that can be applied to SharePoint search web parts. Products include SharePoint Server 2010, Microsoft Office SharePoint Server 2007, Microsoft Search Server 2008, and Microsoft Search Server 2008 Express.