The Apache OpenNLP library is a machine learning based toolkit for processing natural language text. It supports maximum entropy and perceptron based machine learning, and it contains components for a typical natural language processing pipeline: sentence detector, tokenizer, name finder, document categorizer, part-of-speech tagger, chunker, parser, and co-reference resolution.
It provides both a command line interface and an application programming interface, and it is written in Java. The most recent stable version is 1.9.2, licensed under the Apache License 2.0. In this article, we will walk through its usage with simple examples.
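The examples below assume the core opennlp-tools library is on the classpath. If you build with Maven, a minimal dependency entry would look like the following (coordinates as published on Maven Central):
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.9.2</version>
</dependency>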
Tokenizer:
WhitespaceTokenizer splits text into an array of tokens at whitespace characters, which include spaces, tabs, and line feeds. The code snippet below tokenizes the text and produces an array of tokens with the whitespace characters removed.
public void givenWhitespaceTokenizer_whenTokenize_thenTokensAreDetected()
        throws Exception {
    WhitespaceTokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize("It is my first attempt, trying to learn apache opennlp.");
    // Punctuation stays attached to the neighboring word.
    String[] expected = {"It", "is", "my", "first", "attempt,", "trying", "to", "learn", "apache", "opennlp."};
    assertArrayEquals(expected, tokens);
}
If we want punctuation marks to become separate tokens, we can use SimpleTokenizer instead. It splits the sentence into words and emits each punctuation character as its own token.
public void givenSimpleTokenizer_whenTokenize_thenTokensAreDetected()
        throws Exception {
    SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize("It is my first attempt, trying to learn apache opennlp.");
    // The comma and the final period are now tokens of their own.
    String[] expected = {"It", "is", "my", "first", "attempt", ",", "trying", "to", "learn", "apache", "opennlp", "."};
    assertArrayEquals(expected, tokens);
}
Apache OpenNLP also provides pre-trained models for basic language processing. The code below tokenizes text using the pre-trained English token model (en-token.bin).
public void givenEnglishModel_whenTokenize_thenTokensAreDetected()
        throws Exception {
    // Load the pre-trained English tokenizer model from the classpath.
    InputStream inputStream = getClass().getResourceAsStream("/models/en-token.bin");
    TokenizerModel model = new TokenizerModel(inputStream);
    TokenizerME tokenizer = new TokenizerME(model);
    String[] tokens = tokenizer.tokenize("Its my first attempt to learn apache nlp tutorial.");
    String[] expected = {"Its", "my", "first", "attempt", "to", "learn", "apache", "nlp", "tutorial", "."};
    assertArrayEquals(expected, tokens);
}
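TokenizerME also scores its decisions. A small sketch continuing the example above, printing the confidence of each token from the most recent tokenize call:
// Token probabilities correspond to the last tokenize(...) call.
double[] tokenProbs = tokenizer.getTokenProbabilities();
for (int i = 0; i < tokens.length; i++) {
    System.out.println(tokens[i] + " : " + tokenProbs[i]);
}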
Named Entity Recognition
Named Entity Recognition (NER) extracts information from unstructured text and categorizes it into predefined groups. Apache OpenNLP provides models for extracting person names, locations, organizations, money, percentages, times, and dates.
Suppose we want to extract cricketers' names from a news article. The code below recognizes person names using the pre-trained en-ner-person.bin model.
public void givenEnglishPersonModel_whenNER_thenPersonsAreDetected()
        throws Exception {
    // First tokenize the text with the pre-trained English tokenizer.
    InputStream tokenModelIn = getClass().getResourceAsStream("/models/en-token.bin");
    TokenizerModel tokenizerModel = new TokenizerModel(tokenModelIn);
    TokenizerME tokenizer = new TokenizerME(tokenizerModel);
    String[] tokens = tokenizer.tokenize("Legends of the game, masters of their art –" +
            " Muttiah Muralitharan, Anil Kumble and Shane Warne " +
            "are the three leading wicket-takers in Tests");
    // Then run the person name finder over the tokens.
    InputStream inputStreamNameFinder = getClass().getResourceAsStream("/models/en-ner-person.bin");
    TokenNameFinderModel model = new TokenNameFinderModel(inputStreamNameFinder);
    NameFinderME nameFinderME = new NameFinderME(model);
    // Each Span holds the start/end token indices and the entity type.
    List<Span> spans = Arrays.asList(nameFinderME.find(tokens));
    for (Span span : spans) {
        System.out.println(span.getType() + " " + span.toString() + " " + span.getProb());
    }
}
It outputs the span type (person), the token positions of each name, and the probability of each match.
person [10..12) person 0.8058874647477575
person [13..15) person 0.9360802286465706
person [16..18) person 0.889340515434591
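If you need the actual name strings rather than token offsets, the Span utility class can map the spans back to the tokens. A minimal sketch continuing the example above:
// Convert each detected span back to the underlying token text.
String[] names = Span.spansToStrings(spans.toArray(new Span[0]), tokens);
for (String name : names) {
    System.out.println(name); // e.g. "Muttiah Muralitharan"
}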
en-ner-person.bin is the pre-trained model for extracting person names. Likewise, pre-trained models are available for the other entity types, as shown below.
| Model | File |
| --- | --- |
| Date name finder model | en-ner-date.bin |
| Location name finder model | en-ner-location.bin |
| Money name finder model | en-ner-money.bin |
| Organization name finder model | en-ner-organization.bin |
| Percentage name finder model | en-ner-percentage.bin |
| Person name finder model | en-ner-person.bin |
| Time name finder model | en-ner-time.bin |
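The flow is identical for every model in the table; only the model file changes. For example, a sketch for locations, assuming en-ner-location.bin has been downloaded to the same /models directory:
// Same NameFinderME flow as above, with the location model swapped in.
InputStream locStream = getClass().getResourceAsStream("/models/en-ner-location.bin");
TokenNameFinderModel locModel = new TokenNameFinderModel(locStream);
NameFinderME locationFinder = new NameFinderME(locModel);
Span[] locationSpans = locationFinder.find(tokens);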
POS Tagger
A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like 'noun-plural'.
The tagged text can then be used for further processing. The example below tags each token in the text with its part of speech.
public void parts_of_speech_tagger() throws Exception {
    // Tokenize the sentence first.
    InputStream tokenModelIn = getClass().getResourceAsStream("/models/en-token.bin");
    TokenizerModel tokenizerModel = new TokenizerModel(tokenModelIn);
    Tokenizer tokenizer = new TokenizerME(tokenizerModel);
    String[] tokens = tokenizer.tokenize("I am trying to tag the tokens");
    // Tag each token using the pre-trained maximum entropy POS model.
    InputStream posModelIn = getClass().getResourceAsStream("/models/en-pos-maxent.bin");
    POSModel posModel = new POSModel(posModelIn);
    POSTaggerME posTaggerME = new POSTaggerME(posModel);
    String[] tags = posTaggerME.tag(tokens);
    // Probabilities correspond to the tags from the most recent tag(...) call.
    double[] probs = posTaggerME.probs();
    System.out.println("Token\t:\tTag\t:\tProbability\n---------------------------------------------");
    for (int i = 0; i < tokens.length; i++) {
        System.out.println(tokens[i] + "\t:\t" + tags[i] + "\t:\t" + probs[i]);
    }
}
It produces each token with its tag based on English grammar: PRP is a personal pronoun, VBP is a verb (non-3rd person singular present), and so on. Each abbreviation can be looked up in the Penn Treebank POS tag set.
Token : Tag : Probability
---------------------------------------------
I : PRP : 0.9850802753661616
am : VBP : 0.975984809797987
trying : VBG : 0.9884076110770207
to : TO : 0.9948503758260098
tag : VB : 0.9713875923880564
the : DT : 0.9447257899870084
tokens : NNS : 0.8032102920939485
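The POS tags feed directly into the chunker component mentioned in the introduction. Below is a minimal sketch, assuming the pre-trained en-chunker.bin model has been downloaded to the same /models directory; it continues from the tokens and tags produced above.
// Shallow parsing (chunking) groups the tagged tokens into phrases.
// Assumption: /models/en-chunker.bin is available on the classpath.
InputStream chunkerModelIn = getClass().getResourceAsStream("/models/en-chunker.bin");
ChunkerModel chunkerModel = new ChunkerModel(chunkerModelIn);
ChunkerME chunker = new ChunkerME(chunkerModel);
String[] chunks = chunker.chunk(tokens, tags);
for (int i = 0; i < chunks.length; i++) {
    // Chunk tags look like B-NP (begin noun phrase), I-VP (inside verb phrase).
    System.out.println(tokens[i] + " : " + chunks[i]);
}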
Sentence Detection
The pre-trained model en-sent.bin can be used to detect sentence boundaries. The example below shows how sentences are detected.
public void givenEnglishModel_whenDetect_thenSentencesAreDetected()
        throws Exception {
    String paragraph = "This is a statement. This is another statement."
            + " Now is an abstract word for time, "
            + "that is always flying. And my email address is google@gmail.com.";
    InputStream is = getClass().getResourceAsStream("/models/en-sent.bin");
    SentenceModel model = new SentenceModel(is);
    SentenceDetector sdetector = new SentenceDetectorME(model);
    String[] sentences = sdetector.sentDetect(paragraph);
    // JUnit expects (message, expecteds, actuals) in that order.
    Assert.assertArrayEquals("Sentences detected successfully", new String[]{
            "This is a statement.",
            "This is another statement.",
            "Now is an abstract word for time, that is always flying.",
            "And my email address is google@gmail.com."}, sentences);
}
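SentenceDetectorME also exposes a confidence score per detected sentence. A small sketch, continuing from the model and paragraph above:
// Probabilities correspond to the sentences from the last sentDetect call;
// the detector must be typed as SentenceDetectorME, not the interface.
SentenceDetectorME detectorME = new SentenceDetectorME(model);
String[] detected = detectorME.sentDetect(paragraph);
double[] sentenceProbs = detectorME.getSentenceProbabilities();
for (int i = 0; i < detected.length; i++) {
    System.out.println(detected[i] + " : " + sentenceProbs[i]);
}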
Command Line (CLI):
All the features of Apache OpenNLP are also available from the command line interface. Download Apache OpenNLP, extract the archive, and navigate to the bin directory.
For example, the sentence detector can be executed as shown below.
nagappan@nagappan-Latitude-E5450:/opt/apache-opennlp-1.9.2/bin$ ls
brat-annotation-service brat-annotation-service.bat morfologik-addon morfologik-addon.bat opennlp opennlp.bat sampletext.txt sentences.txt
nagappan@nagappan-Latitude-E5450:/opt/apache-opennlp-1.9.2/bin$ cat sampletext.txt
This is a sample text file. We are going to check the number of sentences.
nagappan@nagappan-Latitude-E5450:/opt/apache-opennlp-1.9.2/bin$ ./opennlp SentenceDetector ../models/en-sent.bin < "sampletext.txt"
Loading Sentence Detector model ... done (0.052s)
This is a sample text file.
We are going to check the number of sentences.
Average: 666.7 sent/s
Total: 2 sent
Runtime: 0.003s
Execution time: 0.143 seconds
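The other CLI components follow the same pattern of tool name plus model file. For example, tokenization could be run as below (a sketch, assuming en-token.bin has been downloaded to ../models):
$ ./opennlp TokenizerME ../models/en-token.bin < sampletext.txt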
Reference:
Apache OpenNLP manual documentation - https://opennlp.apache.org/docs/1.9.2/manual/opennlp.html