Getting started with Apache OpenNLP

The Apache OpenNLP library is a machine learning based toolkit for processing natural language text. It supports maximum entropy and perceptron based machine learning, and it contains several components for a natural language processing pipeline: sentence detector, tokenizer, name finder, document categorizer, part-of-speech tagger, chunker, parser, and co-reference resolution.

It provides both a command line interface and an application programming interface, and it is built in Java. The latest stable version is 1.9.2, licensed under the Apache License 2.0. In this article, we will walk through its usage with simple application examples.

Tokenizer:

WhitespaceTokenizer splits text into an array of tokens at whitespace characters, where tabs and line feeds count as whitespace along with spaces. The code snippet below tokenizes the text and produces an array of tokens with the whitespace characters removed.

  @Test
  public void givenWhitespaceTokenizer_whenTokenize_thenTokensAreDetected()
                   throws Exception {

        WhitespaceTokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
        String[] tokens = tokenizer.tokenize("It is my first attempt, trying to learn apache opennlp.");

        // punctuation stays attached to the words, since the text is split only at whitespace
        String[] expected = {"It", "is", "my", "first", "attempt,", "trying", "to", "learn", "apache", "opennlp."};
        assertArrayEquals(expected, tokens);
  }


If we want punctuation marks to be considered as well, we can use SimpleTokenizer. It splits the sentence into words and emits each punctuation character as a separate token.

  @Test
  public void givenSimpleTokenizer_whenTokenize_thenTokensAreDetected()
                    throws Exception {

       SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
       String[] tokens = tokenizer.tokenize("It is my first attempt, trying to learn apache opennlp.");

       // the comma and the full stop are now separate tokens
       String[] expected = {"It", "is", "my", "first", "attempt", ",", "trying", "to", "learn", "apache", "opennlp", "."};
       assertArrayEquals(expected, tokens);
  }

Apache OpenNLP also provides pre-trained models for basic language processing. The code below tokenizes the text using the pre-trained English tokenizer model en-token.bin.

   @Test
   public void givenEnglishModel_whenTokenize_thenTokensAreDetected()
                    throws Exception {

       // load the pre-trained English tokenizer model from the classpath
       InputStream inputStream = getClass().getResourceAsStream("/models/en-token.bin");
       TokenizerModel model = new TokenizerModel(inputStream);
       TokenizerME tokenizer = new TokenizerME(model);
       String[] tokens = tokenizer.tokenize("Its my first attempt to learn apache nlp tutorial.");

       String[] expected = {"Its", "my", "first", "attempt", "to", "learn", "apache", "nlp", "tutorial", "."};
       assertArrayEquals(expected, tokens);
  }
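
TokenizerME is a maximum entropy tokenizer, so each token carries a probability. A minimal sketch, continuing the example above, prints the probability of every token:

   // getTokenProbabilities() returns one probability per token from the last tokenize() call
   double[] probs = tokenizer.getTokenProbabilities();
   for (int i = 0; i < tokens.length; i++) {
        System.out.println(tokens[i] + " : " + probs[i]);
   }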


Named Entity Recognition

Named Entity Recognition (NER) extracts information from unstructured text and categorizes it into groups. Apache OpenNLP provides pre-trained models for extracting person names, locations, organizations, money, percentages, times, etc.

Suppose we want to extract cricketers' names from a news article. The code below recognizes person names using the pre-trained model.

   @Test
   public void givenEnglishPersonModel_whenNER_thenPersonsAreDetected()
                      throws Exception {

      // tokenize the input first; the name finder operates on tokens
      InputStream tokenModelIn = getClass().getResourceAsStream("/models/en-token.bin");
      TokenizerModel tokenizerModel = new TokenizerModel(tokenModelIn);
      TokenizerME tokenizer = new TokenizerME(tokenizerModel);
      String[] tokens =
                    tokenizer.tokenize("Legends of the game, masters of their art –" +
                                                   " Muttiah Muralitharan, Anil Kumble and Shane Warne " +
                                                   "are the three leading wicket-takers in Tests");

      // load the pre-trained person-name model and find the name spans among the tokens
      InputStream inputStreamNameFinder = getClass().getResourceAsStream("/models/en-ner-person.bin");
      TokenNameFinderModel model = new TokenNameFinderModel(inputStreamNameFinder);
      NameFinderME nameFinderME = new NameFinderME(model);
      List<Span> spans = Arrays.asList(nameFinderME.find(tokens));
      for (Span span : spans) {
            System.out.println(span.getType() + " " + span + " " + span.getProb());
      }
  }

It outputs the span type (person), the token positions of each detected name, and the probability of each match.

 person [10..12) person 0.8058874647477575
 person [13..15) person 0.9360802286465706
 person [16..18) person 0.889340515434591
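
To get the actual name strings rather than the token positions, Span.spansToStrings can map each span back to its tokens. A minimal sketch, continuing the example above:

      // map each detected span back to the underlying tokens, e.g. "Muttiah Muralitharan"
      String[] names = Span.spansToStrings(spans.toArray(new Span[0]), tokens);
      for (String name : names) {
            System.out.println(name);
      }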

en-ner-person.bin is the pre-trained model for extracting person names. Likewise, pre-trained models are available for the other entity types, as shown below.

Date name finder model         - en-ner-date.bin
Location name finder model     - en-ner-location.bin
Money name finder model        - en-ner-money.bin
Organization name finder model - en-ner-organization.bin
Percentage name finder model   - en-ner-percentage.bin
Person name finder model       - en-ner-person.bin
Time name finder model         - en-ner-time.bin

 

POS Tagger

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like 'noun-plural'.

The given text is tagged with its grammatical parts, which can then be used for further processing. The example below parses the text and outputs the part of speech for each token.

  @Test
  public void parts_of_speech_tagger() throws Exception {
      // tokenize the sentence first; the tagger assigns one tag per token
      InputStream tokenModelIn = getClass().getResourceAsStream("/models/en-token.bin");
      TokenizerModel tokenizerModel = new TokenizerModel(tokenModelIn);
      Tokenizer tokenizer = new TokenizerME(tokenizerModel);
      String[] tokens = tokenizer.tokenize("I am trying to tag the tokens");

      // load the pre-trained maximum entropy POS model and tag the tokens
      InputStream posModelIn = getClass().getResourceAsStream("/models/en-pos-maxent.bin");
      POSModel posModel = new POSModel(posModelIn);
      POSTaggerME posTaggerME = new POSTaggerME(posModel);
      String[] tags = posTaggerME.tag(tokens);

      // probs() returns the confidence of each tag from the last tag() call
      double[] probs = posTaggerME.probs();

      System.out.println("Token\t:\tTag\t:\tProbability\n---------------------------------------------");
      for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + "\t:\t" + tags[i] + "\t:\t" + probs[i]);
      }
  }

It produces each token with its tag based on English grammatical parts: PRP is a personal pronoun, VBP a verb (non-3rd person singular present), and so on. The abbreviations come from the Penn Treebank tag set, where each one can be looked up for its description.

 

Token : Tag : Probability
---------------------------------------------
I      : PRP : 0.9850802753661616
am     : VBP : 0.975984809797987
trying : VBG : 0.9884076110770207
to     : TO  : 0.9948503758260098
tag    : VB  : 0.9713875923880564
the    : DT  : 0.9447257899870084
tokens : NNS : 0.8032102920939485


Sentence Detection

The pre-trained model en-sent.bin can be used to detect sentence boundaries. The example below shows how sentences are detected.

   @Test
   public void givenEnglishModel_whenDetect_thenSentencesAreDetected()
                      throws Exception {

        String paragraph = "This is a statement. This is another statement."
                                      + " Now is an abstract word for time, "
                                      + "that is always flying. And my email address is google@gmail.com.";

        // load the pre-trained English sentence detection model
        InputStream is = getClass().getResourceAsStream("/models/en-sent.bin");
        SentenceModel model = new SentenceModel(is);

        SentenceDetector sdetector = new SentenceDetectorME(model);

        String[] sentences = sdetector.sentDetect(paragraph);

        // JUnit expects the expected array before the actual result
        Assert.assertArrayEquals("Sentences detected successfully",
                                 new String[]{
                                       "This is a statement.",
                                       "This is another statement.",
                                       "Now is an abstract word for time, that is always flying.",
                                       "And my email address is google@gmail.com."},
                                 sentences);
   }
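
SentenceDetectorME can also report the character offsets and probabilities of the detected sentences. A minimal sketch, assuming the same paragraph and en-sent.bin model as above:

   // sentPosDetect returns begin/end character offsets instead of sentence strings
   SentenceDetectorME detectorME = new SentenceDetectorME(model);
   Span[] spans = detectorME.sentPosDetect(paragraph);
   double[] probs = detectorME.getSentenceProbabilities();
   for (int i = 0; i < spans.length; i++) {
        System.out.println(spans[i] + " : " + probs[i]);
   }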

Command Line (CLI):

All the features of Apache OpenNLP are also available through the command line interface. Download Apache OpenNLP, untar it, and navigate to the bin directory.

For example, the sentence detector can be executed as below.

nagappan@nagappan-Latitude-E5450:/opt/apache-opennlp-1.9.2/bin$ ls
brat-annotation-service brat-annotation-service.bat morfologik-addon morfologik-addon.bat opennlp opennlp.bat sampletext.txt sentences.txt

nagappan@nagappan-Latitude-E5450:/opt/apache-opennlp-1.9.2/bin$ cat sampletext.txt
This is a sample text file. We are going to check the number of sentences.

nagappan@nagappan-Latitude-E5450:/opt/apache-opennlp-1.9.2/bin$ ./opennlp SentenceDetector ../models/en-sent.bin < "sampletext.txt"
Loading Sentence Detector model ... done (0.052s)
This is a sample text file.
We are going to check the number of sentences.

Average: 666.7 sent/s
Total: 2 sent
Runtime: 0.003s
Execution time: 0.143 seconds
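
The other tools follow the same pattern: the tool name followed by a model file, reading from standard input. For instance, tokenization with the pre-trained model could be run as below (the model path is an assumption about where the models were placed):

./opennlp TokenizerME ../models/en-token.bin < "sampletext.txt"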

Reference:
Apache OpenNLP manual documentation - https://opennlp.apache.org/docs/1.9.2/manual/opennlp.html

 


   
