Apache OpenNLP - Document Classification

Apache OpenNLP is a library for natural language processing that uses machine learning. To get started with Apache OpenNLP and learn about its license details, refer to our previous article.

In this article, we will explore document/text classification by training a model with sample data and then running it to see the results. We will use the default training setup as one example and then use the Naive Bayes algorithm to train the model.

Document Categorizer

We are going to classify documents based on their license. To do that, we first need to prepare a training file containing information related to software licenses. For our example, we took just two license variants: BSD and GNU. We create a model by parsing tokens and building feature vectors with an exact likelihood (cutoff parameter = 0). The quality and content of the training data are important, since OpenNLP categorizes documents based on it, and good data helps reduce false positives. A sample of the training file format is shown below.
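Here is a minimal sketch of what licensecategory.txt could look like. DocumentSampleStream expects one sample per line, with the category name first, followed by whitespace and the document text; the lines below are illustrative examples of the format, not the actual training file from the source repository.

 BSD Redistribution and use in source and binary forms with or without modification are permitted
 BSD Redistributions of source code must retain the above copyright notice and this list of conditions
 GNU This program is free software you can redistribute it and or modify it under the terms of the GNU General Public License
 GNU This license is intended to guarantee your freedom to share and change all versions of a program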

The training file is provided to MarkableFileInputStreamFactory, which prepares the document sample stream. The stream is then passed as input to the DocumentCategorizerME class, which is responsible for training the model, running 100 iterations to estimate the likelihood of each category.

Once trained, it returns the document categorizer model. The model is then serialized to a binary file. Saving the trained model is helpful because in the future we can use the pre-trained model directly, or train it further with a new data set. A short sketch of reloading a serialized model is shown below.
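For illustration, here is a minimal sketch of reloading a serialized model (assuming it was saved as documentcategorizer.bin, as in the program below); the DoccatModel constructor accepts the input stream of the serialized file:

 try (InputStream modelIn = new FileInputStream("documentcategorizer.bin")) {
     /* Deserialize the model and create a categorizer from it */
     DoccatModel reloadedModel = new DoccatModel(modelIn);
     DocumentCategorizerME categorizer = new DocumentCategorizerME(reloadedModel);
 }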

In our example, we take the input document to be classified from the console. The user types in content, which is then classified, and the software license category is identified using the trained model. In production use, the new documents would ideally come from a different data source.

Tokenization: This is the process of breaking a sentence into words based on a delimiter, which is usually whitespace. The program below breaks the input text into tokens using the tokenizer model "en-token.bin"; to generate "en-token.bin", refer to our previous article.
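As a quick illustration of the tokenizer API, the sketch below uses OpenNLP's built-in WhitespaceTokenizer, which needs no model file; the model-based TokenizerME used in the full program below is generally more accurate:

 import opennlp.tools.tokenize.WhitespaceTokenizer;

 /* Split the sentence on whitespace; no model file required */
 String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize("C Shell is a 2BSD release");
 /* tokens: ["C", "Shell", "is", "a", "2BSD", "release"] */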


Program for training the license categorizer:

 import java.io.File;
 import java.io.FileInputStream;
 import java.io.InputStream;
 import java.nio.charset.StandardCharsets;
 import java.util.Scanner;

 import opennlp.tools.doccat.BagOfWordsFeatureGenerator;
 import opennlp.tools.doccat.DoccatFactory;
 import opennlp.tools.doccat.DoccatModel;
 import opennlp.tools.doccat.DocumentCategorizerME;
 import opennlp.tools.doccat.DocumentSample;
 import opennlp.tools.doccat.DocumentSampleStream;
 import opennlp.tools.doccat.FeatureGenerator;
 import opennlp.tools.tokenize.TokenizerME;
 import opennlp.tools.tokenize.TokenizerModel;
 import opennlp.tools.util.InputStreamFactory;
 import opennlp.tools.util.MarkableFileInputStreamFactory;
 import opennlp.tools.util.ObjectStream;
 import opennlp.tools.util.PlainTextByLineStream;
 import opennlp.tools.util.TrainingParameters;
 import opennlp.tools.util.model.ModelUtil;

 public class OpenNLPDocumentCategorizerExample {

    public static void main(String[] args) throws Exception {

         /* Read human understandable data & train a model */

        /* Read file with classifications samples of sentences. */
        InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File("licensecategory.txt"));
        ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
        ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

       /*
         Use a cutoff of zero since we have very few samples.
         With so few samples, each feature/word will have small counts,
         so it would not meet a higher cutoff.
       */
       TrainingParameters params = ModelUtil.createDefaultTrainingParameters();
       params.put(TrainingParameters.CUTOFF_PARAM, 0);
       DoccatFactory factory = new DoccatFactory(new FeatureGenerator[] { new BagOfWordsFeatureGenerator() });

       /* Train a model with classifications from above file. */
      DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, factory);

      /*
         Serialize the model to a file so that next time we don't have to train a
         model again; we can just load this file directly into a model.
      */
      model.serialize(new File("documentcategorizer.bin"));

      /*
       * Load the serialized model from the file and categorize console input.
       */
      try (InputStream modelIn = new FileInputStream("documentcategorizer.bin");
          Scanner scanner = new Scanner(System.in)) {

          /* Deserialize the trained model and initialize the document categorizer tool once */
          DoccatModel deserializedModel = new DoccatModel(modelIn);
          DocumentCategorizerME myCategorizer = new DocumentCategorizerME(deserializedModel);

          while (true) {
              /* Get inputs in a loop */
              System.out.println("Enter a sentence:");

              /* Get the probabilities of all outcomes, i.e. BSD & GNU */
              double[] probabilitiesOfOutcomes = myCategorizer.categorize(getTokens(scanner.nextLine()));

              /* Get the name of the category with the highest probability */
              String category = myCategorizer.getBestCategory(probabilitiesOfOutcomes);
              System.out.println("Category: " + category);
          }
      }
      catch (Exception e) {
          e.printStackTrace();
      }
  }

  /**
    * Tokenize a sentence into tokens.
    *
    * @param sentence the sentence to tokenize
    * @return the array of tokens
    */
  private static String[] getTokens(String sentence) {

    /*
        Use the tokenizer model that was created in the earlier tokenizer tutorial.
    */
    String fileURL = OpenNLPDocumentCategorizerExample.class.getResource("/models/en-token.bin").getPath();
    try (InputStream modelIn = new FileInputStream(new File(fileURL))) {

           TokenizerME tokenizer = new TokenizerME(new TokenizerModel(modelIn));

           String[] tokens = tokenizer.tokenize(sentence);

           for (String t : tokens) {
             System.out.println("Tokens: " + t);
           }
         return tokens;

    } catch (Exception e) {
          e.printStackTrace();
    }
    return null;
  }
}

Output:

Enter a sentence:
C Shell is a 2BSD release of the Berkeley Software Distribution (BSD).


Tokens: C
Tokens: Shell
Tokens: is
Tokens: a
Tokens: 2BSD
Tokens: release
Tokens: of
Tokens: the
Tokens: Berkeley
Tokens: Software
Tokens: Distribution
Tokens: (
Tokens: BSD
Tokens: )
Tokens: .
Category: BSD

Enter a sentence:


Naive Bayes Classifier

Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is a probabilistic classifier and is well suited for supervised learning. An advantage of the Naive Bayes model is that it requires only a small amount of training data and classifies based on maximum likelihood.
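In other words, for a document made of words w1 ... wn, the classifier assumes the words are independent given the category and picks the category with the highest posterior probability:

 P(category | w1 ... wn) ∝ P(category) × P(w1 | category) × ... × P(wn | category)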

Movie Classification PreBuilt Model

OpenNLP's training parameters class gives us the flexibility to choose the algorithm; here we select the Naive Bayes algorithm, with a cutoff of zero for an exact match and the iterations set to 10. The training file (en-movie-category.train, available in the GitHub repository) has the genre category followed by the movie description on each line. Since it is a large file, training will be more effective, and at execution time the chances of successfully finding the right category are higher. The sample file is wrapped as a plain text line stream and then as a document sample stream. A few illustrative lines of the training file format are shown below.
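As with the license example, DocumentSampleStream expects the category first, followed by whitespace and the text. The lines below are hypothetical examples of the format, not actual lines from en-movie-category.train:

 Thriller A detective races against time to stop a serial killer before he strikes again
 Romance Two strangers meet on a long train journey and slowly fall in love over the summer
 SciFi A crew of astronauts travels through a wormhole in search of a new home for humanity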

The DocumentCategorizerME.train method accepts the language, the training sample stream, the training parameters, and a document categorizer factory object. The factory is used to create the new document categorizer model, which is returned from the train method.

The model is serialized to a temporary .bin file. This file can later be loaded back into a document categorizer, which computes the probability of the text content for each category. The category with the maximum probability is chosen as the best category and displayed.

 package com.nagappans.apachenlp;

 import java.io.BufferedOutputStream;
 import java.io.File;
 import java.io.FileOutputStream;
 import java.io.IOException;
 import java.nio.charset.StandardCharsets;

 import opennlp.tools.doccat.DoccatFactory;
 import opennlp.tools.doccat.DoccatModel;
 import opennlp.tools.doccat.DocumentCategorizer;
 import opennlp.tools.doccat.DocumentCategorizerME;
 import opennlp.tools.doccat.DocumentSample;
 import opennlp.tools.doccat.DocumentSampleStream;
 import opennlp.tools.ml.AbstractTrainer;
 import opennlp.tools.ml.naivebayes.NaiveBayesTrainer;
 import opennlp.tools.util.InputStreamFactory;
 import opennlp.tools.util.MarkableFileInputStreamFactory;
 import opennlp.tools.util.ObjectStream;
 import opennlp.tools.util.PlainTextByLineStream;
 import opennlp.tools.util.TrainingParameters;

 /**
   * OpenNLP version 1.7.2
   * Training of Document Categorizer using Naive Bayes Algorithm in OpenNLP for Document Classification
   *
   */
 public class DocClassificationNaiveBayesTrainer {

      public static void main(String[] args) throws Exception{

      try {
             /* read the training data */
             InputStreamFactory dataIn =
                         new MarkableFileInputStreamFactory(
                                   new File(DocClassificationNaiveBayesTrainer.class.getResource(
                                        "/models/en-movie-category" + ".train").getFile()));

             ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8);
             ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

             /* define the training parameters */
            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ITERATIONS_PARAM, String.valueOf(10));
            params.put(TrainingParameters.CUTOFF_PARAM, String.valueOf(0));
            params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

            /* create a model from training data */
            DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, new DoccatFactory());
            System.out.println("Model is successfully trained.");

            /* save the model to a temporary local file */
            File trainedFile = File.createTempFile("en-movie-classifier-naive-bayes", ".bin");
            try (BufferedOutputStream modelOut = new BufferedOutputStream(new FileOutputStream(trainedFile))) {
                model.serialize(modelOut);
            }
            System.out.println("Trained Model is saved locally at : " + trainedFile.getAbsolutePath());

           /* Test the model file by subjecting it to prediction  */
           DocumentCategorizer docCategorizer = new DocumentCategorizerME(model);

          String[] docWords = "Afterwards Stuart and Charlie notice Kate in the photos Stuart took at Leopolds ball and realise that her destiny must be to go back and be with Leopold That night while Kate is accepting her promotion at a company banquet he and Charlie race to meet her and show her the pictures Kate initially rejects their overtures and goes on to give her acceptance speech but it is there that she sees Stuarts picture and realises that she truly wants to be with Leopold".replaceAll("[^A-Za-z]", " ").split(" ");

          double[] aProbs = docCategorizer.categorize(docWords);

         /* print the probabilities of the categories */
         System.out.println("---------------------------------\nCategory : Probability\n---------------------------------");
          for (int i = 0; i < docCategorizer.getNumberOfCategories(); i++) {
              System.out.println(docCategorizer.getCategory(i) + " : " + aProbs[i]);
          }
          System.out.println("---------------------------------");

          System.out.println("\n" + docCategorizer.getBestCategory(aProbs) + " : is the predicted category for the given sentence.");
  }
  catch (IOException e) {
       System.out.println("An exception in reading the training file. Please check.");
        e.printStackTrace();
   }
 }
}

References:

https://opennlp.apache.org/

Source code - https://github.com/nagappan080810/apache_opennlp_workouts.git

 


   
