Behemoth - Large-Scale Document Processing Based on Apache Hadoop