tikaondotnet - Use the Java Tika text extraction library on the .NET platform


Take a look at our tests for more usage examples. Have an idea to make this project better? Great! Start out by taking a look at our Contributing Guide.

http://kevm.github.io/tikaondotnet/
https://github.com/KevM/tikaondotnet

Related Projects

tika - Mirror of Apache Tika

  •    Java

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

yomu - Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)

  •    Ruby

Yomu is a library for extracting text and metadata from files and documents using the Apache Tika content analysis toolkit. For the complete list of supported formats, please visit the Apache Tika Supported Document Formats page.

Tika Converter

  •    

A converter that automates MS Office to perform various kinds of conversion, particularly text extraction. The author hopes to integrate it with Apache Tika and Jackrabbit for document search and display.

Tikka - A content analysis toolkit

  •    Java

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.


extract-text-webpack-plugin - Extracts text from bundle into a file

  •    Javascript

Extract text from a bundle, or bundles, into a separate file. ⚠️ Since webpack v4 the extract-text-webpack-plugin should not be used for css. Use mini-css-extract-plugin instead.

Solr - Blazing-fast, open source enterprise search platform

  •    Java

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

textract - node

  •    HTML

A text extraction node module. In almost all cases above, what textract cares about is the mime type. So .html and .htm, both possessing the same mime type, will be extracted. Other extensions that share mime types with those above should also extract successfully. For example, application/vnd.ms-excel is the mime type for .xls, but also for 5 other file types.
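
The point about extensions sharing a mime type can be illustrated with a minimal sketch. It uses Python's standard `mimetypes` module rather than textract itself, and the helper name `group_by_mime_type` and the sample file names are assumptions made for illustration.

```python
import mimetypes
from collections import defaultdict


def group_by_mime_type(filenames):
    """Group file names by the MIME type guessed from their extension.

    Hypothetical helper for illustration; this is not part of textract.
    """
    groups = defaultdict(list)
    for name in filenames:
        mime_type, _encoding = mimetypes.guess_type(name)
        # Files with an unrecognized extension are collected under "unknown".
        groups[mime_type or "unknown"].append(name)
    return dict(groups)


# .html and .htm map to the same MIME type, so they land in the same group.
groups = group_by_mime_type(["index.html", "legacy.htm", "notes.txt"])
for mime_type, names in sorted(groups.items()):
    print(mime_type, names)
```

A tool that dispatches on the mime type instead of the extension needs only one extractor per group.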

xurls - Extract urls from text

  •    Go

Extract urls from text using regular expressions. Note that the funcs compile regexes, so avoid calling them repeatedly.
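
The advice to avoid recompiling a regex on every call applies in any language. Below is a minimal Python sketch of the compile-once pattern. It does not use xurls, and `URL_PATTERN` is a deliberately simplified, hypothetical pattern that is far less thorough than the library's own.

```python
import re

# Compiled once at import time and reused by every call.
# Simplified, hypothetical pattern for illustration only.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text):
    """Return all substrings of text that match URL_PATTERN, in order."""
    return URL_PATTERN.findall(text)


lines = [
    "Docs: http://kevm.github.io/tikaondotnet/ and source: https://github.com/KevM/tikaondotnet",
    "no links here",
]
found = [url for line in lines for url in extract_urls(line)]
print(found)
```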

textract - extract text from any document. no muss. no fuss.

  •    HTML

Extract text from any document. No muss. No fuss.

elasticsearch-mapper-attachments - Mapper Attachments Type plugin for Elasticsearch

  •    Java

If you have a question about the plugin, please use discuss.elastic.co. If you want to report a bug, please use the elasticsearch repository. The mapper attachments plugin lets Elasticsearch index file attachments in over a thousand formats (such as PPT, XLS, PDF) using the Apache text extraction library Tika.

OpenPipe - Document Pipeline

  •    Java

OpenPipe is an open source scalable platform for manipulating a stream of documents. A pipeline is an ordered set of steps / operations performed on a document to convert from its raw form to something ready to be put into the index. The operations performed on documents include language detection, field manipulation, POS tagging, entity extraction or submitting the document to a search engine.

pdfextract - A tool and library that can extract various areas of text from a PDF, especially a scholarly article PDF

  •    Ruby

A tool and library that can extract various areas of text from a PDF, especially a scholarly article PDF. It performs structural analysis to determine column bounds, headers, footers, sections, titles and so on. It can analyse and categorise sections into reference and non-reference sections and can split reference sections into individual references. The latest version is 0.1.1. Earlier versions are far less reliable.

react-native-parsed-text - Parse text and make them into multiple React Native Text elements

  •    Javascript

This library allows you to parse text and extract parts of it using a RegExp or predefined patterns. Currently there are 3 predefined types: url, phone and email. All the props are passed down to a new Text Component if there is a matching text. If any of those props are functions, they receive the value of the text as a parameter.

Swiss File Knife

  •    C++

Multi-function command-line tool that belongs on every USB stick.

Decode PeopleCode

  •    Java

Decodes PeopleCode (the proprietary language in Oracle's PeopleSoft ERP software) from bytecode to text. Stores the code in text files, or commits it to a Subversion or Git version control system. Can also extract PeopleCode and SQL text from PeopleTools .xml project files, and does three-way merging of PeopleCode (to reapply customizations during an upgrade).

IFilter Text Extracter

  •    

A simple component to extract just the text from any file that has an IFilter installed. Available as a C++ COM component and as a C# .NET library.

Aperture - Java framework for getting data and metadata

  •    Java

Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It can crawl and extract information from file systems, websites, mailboxes, and mail servers. It supports various file formats such as Office, PDF, Zip, and many more. Metadata is also extracted from image files. Aperture has a strong focus on semantics: extracted metadata can be mapped to predefined properties.

PDFBox - Java PDF library

  •    Java

Apache PDFBox is an open source Java PDF library for working with PDF documents. This library allows creation of new PDF documents, manipulation of existing documents, and extraction of content from documents. It provides support for adding bookmarks, fonts, text extraction, encryption, PDF printing, and a lot more.

TextMine

  •    Perl

TextMine is for the Perl hacker who is grappling with the problems of managing unstructured text from various sources. You can use these text mining tools to search the Web, index text, extract entities, categorize your e-mail, and summarize documents.




