Getting started with Apache OpenNLP


The Apache OpenNLP library is a machine learning based toolkit for processing natural language text. It includes maximum entropy and perceptron based machine learning, and provides components for a natural language processing pipeline: sentence detector, tokenizer, name finder, document categorizer, part-of-speech tagger, chunker, parser, and co-reference resolution.

It provides both a command line interface and an application programming interface, and is built in Java. The latest stable version is 1.9.2, licensed under the Apache License 2.0. In this article, we will walk through its usage with simple application examples.
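To follow along in code, the opennlp-tools artifact can be added to the build. A minimal Maven dependency, assuming version 1.9.2 as above:

```xml
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.9.2</version>
</dependency>
```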


Tokenization

WhitespaceTokenizer splits the text into an array of tokens on white space. Tab and new line (line feed) characters are treated as white space as well. The code snippet below tokenizes the text and produces an array of tokens with the whitespace characters eliminated.

  public void givenWhitespaceTokenizer_whenTokenize_thenTokensAreDetected()
                   throws Exception {

        WhitespaceTokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
        String[] tokens = tokenizer.tokenize("It is my first attempt, trying to learn apache opennlp.");

        String[] expected = {"It", "is", "my", "first", "attempt,", "trying", "to", "learn", "apache", "opennlp."};
        assertArrayEquals(expected, tokens);
  }
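Conceptually, whitespace tokenization amounts to splitting on runs of whitespace. A rough plain-Java approximation (not OpenNLP's implementation, just a sketch of the same behavior):

```java
// Rough approximation of whitespace tokenization:
// split on any run of whitespace (spaces, tabs, newlines).
public class WhitespaceSplit {
    public static String[] tokenize(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("It is my first attempt,\ttrying to learn apache opennlp.");
        // Punctuation stays attached to the word, e.g. "attempt," and "opennlp."
        System.out.println(String.join("|", tokens));
    }
}
```

As with WhitespaceTokenizer, punctuation marks remain glued to the adjacent word.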

If we also want punctuation marks as separate tokens, we can use SimpleTokenizer. It splits the sentence into words, with each punctuation mark as its own token.

  public void givenSimpleTokenizer_whenTokenize_thenTokensAreDetected()
                    throws Exception {

       SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
       String[] tokens = tokenizer.tokenize("It is my first attempt, trying to learn apache opennlp.");

       String[] expected = {"It", "is", "my", "first", "attempt", ",", "trying", "to", "learn", "apache", "opennlp", "."};
       assertArrayEquals(expected, tokens);
  }
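This character-class style of splitting can also be sketched in plain Java with a regular expression: runs of letters or digits form one token, and each other non-space character becomes its own token. This is only an approximation of SimpleTokenizer, not its actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Approximation of character-class tokenization:
// a token is either a run of letters/digits, or a single
// non-letter, non-digit, non-space character (punctuation).
public class SimpleSplit {
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}\\s]");

    public static String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // prints It|is|my|first|attempt|,|trying|to|learn|apache|opennlp|.
        System.out.println(String.join("|",
                tokenize("It is my first attempt, trying to learn apache opennlp.")));
    }
}
```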

Apache OpenNLP also provides pre-trained models for basic language processing. The code below tokenizes the text using the pre-trained English token model with TokenizerME.

   public void givenEnglishModel_whenTokenize_thenTokensAreDetected()
                    throws Exception {

       InputStream inputStream = getClass().getResourceAsStream("/models/en-token.bin");
       TokenizerModel model = new TokenizerModel(inputStream);
       TokenizerME tokenizer = new TokenizerME(model);
       String[] tokens = tokenizer.tokenize("Its my first attempt to learn apache nlp tutorial.");

       String[] expected = {"Its", "my", "first", "attempt", "to", "learn", "apache", "nlp", "tutorial", "."};
       assertArrayEquals(expected, tokens);
   }

Named Entity Recognition

Named Entity Recognition (NER) extracts named entities from unstructured text and categorizes them into predefined groups. Apache OpenNLP provides models for extracting person names, locations, organizations, money, percentages, times, and dates.

Suppose we want to extract cricketers' names from a news article. The code below recognizes person names using the pre-trained model.

   public void givenEnglishPersonModel_whenNER_thenPersonsAreDetected()
                      throws Exception {

      InputStream tokenModelIn = getClass().getResourceAsStream("/models/en-token.bin");
      TokenizerModel tokenizerModel = new TokenizerModel(tokenModelIn);
      TokenizerME tokenizer = new TokenizerME(tokenizerModel);
      String[] tokens =
                    tokenizer.tokenize("Legends of the game, masters of their art –" +
                                                   " Muttiah Muralitharan, Anil Kumble and Shane Warne " +
                                                   "are the three leading wicket-takers in Tests");

      InputStream inputStreamNameFinder = getClass().getResourceAsStream("/models/en-ner-person.bin");
      TokenNameFinderModel model = new TokenNameFinderModel(inputStreamNameFinder);
      NameFinderME nameFinderME = new NameFinderME(model);
      List<Span> spans = Arrays.asList(nameFinderME.find(tokens));
      for (Span span : spans) {
            System.out.println(span.getType() + " " + span.toString() + " " + span.getProb());
      }
   }

The output shows the span type (person), the token positions of each name, and the probability of each detected name.

 person [10..12) person 0.8058874647477575
 person [13..15) person 0.9360802286465706
 person [16..18) person 0.889340515434591
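Each span is a half-open range of token indices, so [10..12) covers tokens 10 and 11; joining those tokens recovers the name. A minimal plain-Java sketch (the token array is assumed from the tokenization above; OpenNLP's Span class also ships a spansToStrings helper for this):

```java
import java.util.Arrays;

// A span [start..end) selects a half-open range of token indices;
// joining those tokens recovers the entity's surface form.
public class SpanToText {
    public static String extract(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String[] tokens = {"Legends", "of", "the", "game", ",", "masters", "of", "their",
                           "art", "–", "Muttiah", "Muralitharan", ",", "Anil", "Kumble",
                           "and", "Shane", "Warne", "are", "the", "three", "leading",
                           "wicket-takers", "in", "Tests"};
        System.out.println(extract(tokens, 10, 12));  // prints Muttiah Muralitharan
        System.out.println(extract(tokens, 13, 15));  // prints Anil Kumble
        System.out.println(extract(tokens, 16, 18));  // prints Shane Warne
    }
}
```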

en-ner-person.bin is the pre-trained model for extracting person names. Likewise, pre-trained models are available for each entity type, as shown below.

Date name finder model: en-ner-date.bin
Location name finder model: en-ner-location.bin
Money name finder model: en-ner-money.bin
Organization name finder model: en-ner-organization.bin
Percentage name finder model: en-ner-percentage.bin
Person name finder model: en-ner-person.bin
Time name finder model: en-ner-time.bin
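Swapping the model file is the only change needed to find a different entity type; the list above can be captured as a simple lookup (the map below is just an illustration, not an OpenNLP API):

```java
import java.util.Map;

// The pre-trained name-finder models, keyed by entity type.
public class NerModels {
    static final Map<String, String> MODELS = Map.of(
        "date", "en-ner-date.bin",
        "location", "en-ner-location.bin",
        "money", "en-ner-money.bin",
        "organization", "en-ner-organization.bin",
        "percentage", "en-ner-percentage.bin",
        "person", "en-ner-person.bin",
        "time", "en-ner-time.bin");

    public static void main(String[] args) {
        System.out.println(MODELS.get("location"));  // prints en-ner-location.bin
    }
}
```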


POS Tagger

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like 'noun-plural'.

The given text is tagged with its grammatical parts, which can then be used for further processing. The example below parses the text and prints the part of speech for each token.

  public void parts_of_speech_tagger() throws Exception {

      InputStream tokenModelIn = getClass().getResourceAsStream("/models/en-token.bin");
      TokenizerModel tokenizerModel = new TokenizerModel(tokenModelIn);
      Tokenizer tokenizer = new TokenizerME(tokenizerModel);
      String[] tokens = tokenizer.tokenize("I am trying to tag the tokens");

      InputStream posModelIn = getClass().getResourceAsStream("/models/en-pos-maxent.bin");
      POSModel posModel = new POSModel(posModelIn);
      POSTaggerME posTaggerME = new POSTaggerME(posModel);
      String[] tags = posTaggerME.tag(tokens);

      double[] probs = posTaggerME.probs();

      for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + "\t:\t" + tags[i] + "\t:\t" + probs[i]);
      }
  }


It produces each token with its tag based on English grammatical parts: PRP is a personal pronoun, VBP a verb in non-3rd person singular present, and so on. Each tag abbreviation is described in the Penn Treebank POS tag set.


Token : Tag : Probability
I      : PRP : 0.9850802753661616
am     : VBP : 0.975984809797987
trying : VBG : 0.9884076110770207
to     : TO  : 0.9948503758260098
tag    : VB  : 0.9713875923880564
the    : DT  : 0.9447257899870084
tokens : NNS : 0.8032102920939485
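The tags in the output above can be decoded with a small legend; the subset below covers just this example (meanings taken from the Penn Treebank tag set):

```java
import java.util.Map;

// A few Penn Treebank POS tags and their meanings (subset, for illustration).
public class TagLegend {
    static final Map<String, String> LEGEND = Map.of(
        "PRP", "personal pronoun",
        "VBP", "verb, non-3rd person singular present",
        "VBG", "verb, gerund or present participle",
        "TO", "to",
        "VB", "verb, base form",
        "DT", "determiner",
        "NNS", "noun, plural");

    public static void main(String[] args) {
        System.out.println(LEGEND.get("NNS"));  // prints noun, plural
    }
}
```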

Sentence Detection

The pre-trained model en-sent.bin can be used to detect sentence boundaries. The example below shows how sentences are detected.

  public void givenEnglishModel_whenDetect_thenSentencesAreDetected()
                     throws Exception {

       String paragraph = "This is a statement. This is another statement."
                                     + " Now is an abstract word for time, "
                                   + "that is always flying. And my email address is";

       InputStream is = getClass().getResourceAsStream("/models/en-sent.bin");
       SentenceModel model = new SentenceModel(is);

       SentenceDetector sdetector = new SentenceDetectorME(model);

       String[] sentences = sdetector.sentDetect(paragraph);

       Assert.assertArrayEquals("Sentences detected successfully",
                                            new String[]{
                                            "This is a statement.",
                                            "This is another statement.",
                                            "Now is an abstract word for time, that is always flying.",
                                            "And my email address is"},
                                            sentences);
  }
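To see why a trained model is worth the trouble, compare it with a naive baseline that splits on sentence-ending punctuation followed by whitespace. This sketch works on the simple paragraph above but breaks on abbreviations like "Dr." or "e.g.", which the model handles:

```java
// Naive baseline: split wherever a '.', '!' or '?' is followed by whitespace.
// Abbreviations ("Dr. Smith") defeat this rule, which is why a trained
// sentence detector is used instead.
public class NaiveSentenceSplit {
    public static String[] split(String paragraph) {
        return paragraph.split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        String[] sentences = split("This is a statement. This is another statement.");
        System.out.println(sentences.length);  // prints 2
    }
}
```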

Command Line (CLI):

All features of Apache OpenNLP are also available from the command line interface. Download Apache OpenNLP, untar it, and navigate to the bin directory.

For example, the sentence detector can be executed as below.

nagappan@nagappan-Latitude-E5450:/opt/apache-opennlp-1.9.2/bin$ ls
brat-annotation-service brat-annotation-service.bat morfologik-addon morfologik-addon.bat opennlp opennlp.bat sampletext.txt sentences.txt

nagappan@nagappan-Latitude-E5450:/opt/apache-opennlp-1.9.2/bin$ cat sampletext.txt
This is a sample text file. We are going to check the number of sentences.

nagappan@nagappan-Latitude-E5450:/opt/apache-opennlp-1.9.2/bin$ ./opennlp SentenceDetector ../models/en-sent.bin < "sampletext.txt"
Loading Sentence Detector model ... done (0.052s)
This is a sample text file.
We are going to check the number of sentences.

Average: 666.7 sent/s
Total: 2 sent
Runtime: 0.003s
Execution time: 0.143 seconds

Reference: Apache OpenNLP manual documentation.



Nagappan is a techie-geek and a full-stack senior developer with 10+ years of experience on both the front end and the back end. He has experience with front-end web technologies like HTML, CSS, JavaScript, and Angular, and is an expert in Java and related frameworks like Spring, Struts, EJB, and RESTEasy. He holds a bachelor's degree in computer science and is very passionate about learning new technologies.

