Behemoth - Large Scale Document Processing based on Apache Hadoop


Behemoth is an open source platform for large scale document processing based on Apache Hadoop. It consists of a simple annotation-based implementation of a document and a number of modules that operate on these documents. One of Behemoth's main goals is to simplify the deployment of document analysers on a large scale.

https://github.com/jnioche/behemoth
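
To make "annotation-based implementation of a document" more concrete, here is a minimal sketch of what such a model might look like: a document holds its text plus a list of typed annotations, each covering a character span and carrying optional features. All class, field, and method names below are hypothetical and chosen for illustration only; they are not taken from Behemoth's code, whose actual API may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names are assumptions, not Behemoth's real classes.
final class AnnotatedDocumentSketch {

    /** A typed span over the document text, using [start, end) character offsets. */
    static final class Annotation {
        final String type;
        final int start;
        final int end;
        final Map<String, String> features;

        Annotation(String type, int start, int end, Map<String, String> features) {
            this.type = type;
            this.start = start;
            this.end = end;
            // Defensive copy so later changes to the caller's map do not leak in.
            this.features = Collections.unmodifiableMap(new HashMap<>(features));
        }
    }

    private final String url;
    private final String text;
    private final List<Annotation> annotations = new ArrayList<>();

    AnnotatedDocumentSketch(String url, String text) {
        this.url = url;
        this.text = text;
    }

    String url() {
        return url;
    }

    /** Adds an annotation after checking that its span lies inside the text. */
    Annotation annotate(String type, int start, int end, Map<String, String> features) {
        if (start < 0 || end > text.length() || start > end) {
            throw new IllegalArgumentException("span out of bounds: " + start + "-" + end);
        }
        Annotation a = new Annotation(type, start, end, features);
        annotations.add(a);
        return a;
    }

    /** Returns the slice of text that an annotation covers. */
    String coveredText(Annotation a) {
        return text.substring(a.start, a.end);
    }

    /** Returns all annotations of the given type, in insertion order. */
    List<Annotation> annotationsOfType(String type) {
        List<Annotation> matches = new ArrayList<>();
        for (Annotation a : annotations) {
            if (a.type.equals(type)) {
                matches.add(a);
            }
        }
        return matches;
    }
}
```

In a model like this, each processing module would read the text and existing annotations and append its own, so analysers can be chained without rewriting the underlying content.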





Related Projects

behemoth


Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

geometry-api-java


The Esri Geometry API for Java can be used to enable spatial data processing in third-party data-processing solutions. Developers of custom MapReduce-based applications for Hadoop can use this API for spatial processing of data in the Hadoop system. The API is also used by the [Hive UDFs](https://github.com/Esri/spatial-framework-for-hadoop) and could be used by developers building geometry functions for third-party applications such as [Cassandra](https://cassandra.apache.org/), [HBase](http:

Bagri - XML/Document DB on top of distributed cache


Bagri is a document database built on top of a distributed cache solution such as Hazelcast or Coherence. The system can process semi-structured, schema-less documents and run distributed queries on them in real time. It scales horizontally through data sharding, with all documents distributed evenly across the distributed cache partitions.

cloudqc - NGS data processing package based on Hadoop


NGS data processing package based on Hadoop

Cascading - Data Processing Workflows on Hadoop


Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault tolerant data processing workflows on an Apache Hadoop cluster. It is a thin Java library and API that sits on top of Hadoop's MapReduce layer and is executed from the command line like any other Hadoop application.

Cascalog - Data processing on Hadoop


Cascalog is a fully-featured data processing and querying library for Clojure or Java. The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer. Cascalog is a replacement for tools like Pig, Hive, and Cascading and operates at a significantly higher level of abstraction than those tools.

XOM - XML object model in Java


XOM is a new XML object model. It is a tree-based API for processing XML with Java that strives for correctness, simplicity, and performance, in that order.

hadoop-foobar - Hadoop MapReduce example - processing financial data


Hadoop MapReduce example - processing financial data

i2b2-hadoop - A Hadoop Map/Reduce implementation of processing and executing I2B2 CRC Queries


A Hadoop Map/Reduce implementation of processing and executing I2B2 CRC Queries

pcap-on-Hadoop - packet processing library for handling libpcap format packet trace files on Hadoop


Packet processing library for handling libpcap format packet trace files on Hadoop

hadoop-binary-analysis - Framework that makes processing arbitrary binary data in Hadoop easier


Framework that makes processing arbitrary binary data in Hadoop easier

Apache POI - Java API To Access Microsoft Document File Formats


APIs for manipulating various file formats based upon Office Open XML (ECMA-376) and Microsoft's OLE 2 Compound Document formats using pure Java. Apache POI is your Java Excel, Word and PowerPoint solution. We have a complete API for porting other OOXML and OLE 2 Compound Document formats and welcome others to participate.

spatial-framework-for-hadoop


The __Spatial Framework for Hadoop__ allows developers and data scientists to use the Hadoop data processing system for spatial data analysis. For tools, [samples](https://github.com/Esri/gis-tools-for-hadoop/tree/master/samples), and [tutorials](https://github.com/Esri/gis-tools-for-hadoop/wiki) that use this framework, head over to [GIS Tools for Hadoop](https://github.com/Esri/gis-tools-for-hadoop).

LibreOffice - The Document foundation


LibreOffice is the free power-packed Open Source personal productivity suite for Windows, Macintosh and Linux. LibreOffice is the perfect choice for home users, businesses, government and other organizations. Its native file format is the ISO standardized ODF (Open Document Format), but LibreOffice can open and save Microsoft Word, PowerPoint and Excel files, as well as many other formats, bringing you the widest-available compatibility with other products.

behemoth-commoncrawl - Standalone CommonCrawl module for Behemoth


Standalone CommonCrawl module for Behemoth

behemoth - XKE Hackathon 07/02 repository of the Behemoth team, uses Grails, AWS and AngularJS


XKE Hackathon 07/02 repository of the Behemoth team, uses Grails, AWS and AngularJS

behemoth-lws - Integration of Behemoth from Digital Pebble with LucidWorks Search


Integration of Behemoth from Digital Pebble with LucidWorks Search

jumbune - Jumbune is an open-source project to optimize both YARN (v2) and older (v1) Hadoop-based solutions


Jumbune is an open-source product built for analyzing Hadoop clusters and MapReduce jobs. It provides development and administrative insights into Hadoop-based analytical solutions. It enables users to debug, profile, monitor, and validate analytical solutions hosted on decoupled clusters.

HDFS-Internals - Document Containing Hadoop HDFS File System internal details


Document Containing Hadoop HDFS File System internal details