Apache OpenNLP is a library for natural language processing using machine learning. For getting started with Apache OpenNLP and its license details, refer to our previous article.
In this article, we will explore document/text classification by training a model with sample data and then running it to see the results. We will first train a model with the default settings as one example, and then use the Naive Bayes algorithm to train the model.
Document Categorizer
We are going to classify documents based on their license. To do that, we first need to prepare a training file containing text related to software licenses. For our example, we use just two license categories - BSD and GNU. The model is created by parsing tokens and computing the feature vectors with exact likelihood (cutoff parameter = 0). The quality and content of the training data are important: based on them, OpenNLP categorizes the documents, and good training data helps reduce false positives.
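For illustration, licensecategory.txt follows the format expected by DocumentSampleStream: each line starts with the category label, followed by sample text for that category (the sentences below are just placeholders):
BSD Redistribution and use in source and binary forms with or without modification are permitted
BSD The name of the copyright holder may not be used to endorse or promote products derived from this software
GNU This program is free software you can redistribute it and or modify it under the terms of the GNU General Public License
GNU This program is distributed in the hope that it will be useful but without any warranty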
The training file is provided to MarkableFileInputStreamFactory, which prepares the document sample stream. This stream is passed as input to the DocumentCategorizerME class, which trains the model by running 100 iterations to estimate the likelihood of each category.
Once trained, it returns the document categorizer model. The model is serialized to a binary file. Saving the trained model is helpful because in the future we can either use the pre-trained model directly or train it further with a new dataset.
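As a minimal sketch, the saved model can later be reloaded instead of retraining (using the same file name as in the program below):
DoccatModel model = new DoccatModel(new File("documentcategorizer.bin"));
DocumentCategorizerME categorizer = new DocumentCategorizerME(model);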
In our example, we take the document to be classified as input from the console. The user types in content, which is classified, and the software license category is identified using the trained model. In production use, the new documents would typically come from other data sources.
Tokenization: this is the process of breaking a sentence into words based on a delimiter, which is usually whitespace. Below is an example of breaking text into tokens. The sample uses the tokenizer model "en-token.bin"; to generate "en-token.bin", refer to our previous article.
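As a quick sketch of the difference, OpenNLP's built-in WhitespaceTokenizer (opennlp.tools.tokenize.WhitespaceTokenizer) splits purely on whitespace, so punctuation stays attached to words; the learned "en-token.bin" model used by getTokens() in the program below handles such cases:
String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize("C Shell is a 2BSD release of the Berkeley Software Distribution (BSD).");
/* Whitespace splitting keeps "(BSD)." as a single token, unlike the model-based tokenizer output shown later. */
System.out.println(java.util.Arrays.toString(tokens));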
Program for training the license category model:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import opennlp.tools.doccat.BagOfWordsFeatureGenerator;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.doccat.FeatureGenerator;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
import opennlp.tools.util.model.ModelUtil;
public class OpenNLPDocumentCategorizerExample {

    public static void main(String[] args) throws Exception {
        /* Read human-readable data and train a model. */
        /* Read the file with classification samples, one per line. */
        InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File("licensecategory.txt"));
        ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
        ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

        /*
         * Use a cutoff of zero since we have very few samples.
         * With few samples, each feature/word has small counts,
         * so it would not meet a high cutoff.
         */
        TrainingParameters params = ModelUtil.createDefaultTrainingParameters();
        params.put(TrainingParameters.CUTOFF_PARAM, 0);

        DoccatFactory factory = new DoccatFactory(new FeatureGenerator[] { new BagOfWordsFeatureGenerator() });

        /* Train a model with the classifications from the file above. */
        DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, factory);

        /*
         * Serialize the model to a file so that next time we don't have to train it
         * again. Next time we can just load this file directly into a model.
         */
        model.serialize(new File("documentcategorizer.bin"));

        /* Load the serialized trained model and categorize console input with it. */
        try (InputStream modelIn = new FileInputStream("documentcategorizer.bin");
                Scanner scanner = new Scanner(System.in)) {

            /* Initialize the document categorizer tool from the serialized model. */
            DocumentCategorizerME myCategorizer = new DocumentCategorizerME(new DoccatModel(modelIn));

            while (true) {
                /* Get inputs in a loop. */
                System.out.println("Enter a sentence:");

                /* Get the probabilities of all outcomes, i.e. the license categories. */
                double[] probabilitiesOfOutcomes = myCategorizer.categorize(getTokens(scanner.nextLine()));

                /* Get the name of the category with the highest probability. */
                String category = myCategorizer.getBestCategory(probabilitiesOfOutcomes);
                System.out.println("Category: " + category);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Tokenize a sentence into tokens.
     *
     * @param sentence the input sentence
     * @return the tokens of the sentence
     */
    private static String[] getTokens(String sentence) {
        /* Use the tokenizer model that was created in the earlier tokenizer tutorial. */
        try (InputStream modelIn = OpenNLPDocumentCategorizerExample.class.getResourceAsStream("/models/en-token.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(modelIn));
            String[] tokens = tokenizer.tokenize(sentence);
            for (String t : tokens) {
                System.out.println("Tokens: " + t);
            }
            return tokens;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
}
Output:
Enter a sentence:
C Shell is a 2BSD release of the Berkeley Software Distribution (BSD).
Tokens: C
Tokens: Shell
Tokens: is
Tokens: a
Tokens: 2BSD
Tokens: release
Tokens: of
Tokens: the
Tokens: Berkeley
Tokens: Software
Tokens: Distribution
Tokens: (
Tokens: BSD
Tokens: )
Tokens: .
Category: BSD
Enter a sentence:
Naive Bayes Classifier
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is a probabilistic classifier and is well suited for supervised learning. The advantage of the Naive Bayes model is that it requires only a small amount of training data and classifies based on maximum likelihood.
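In other words, for a document with words w1 ... wn, the classifier picks the category c that maximizes P(c) x P(w1 | c) x ... x P(wn | c), where the individual probabilities are estimated from counts in the training data, under the "naive" assumption that the words are conditionally independent given the category.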
Movie Classification Pre-Built Model
OpenNLP's training parameters class gives us flexibility: we can select the Naive Bayes algorithm, use an exact match (cutoff of 0), and run the training for 10 iterations. The training file (en-movie-category.train, available in GitHub) contains the genre category followed by the movie description on each line. Since it is a large file, training is more effective, and at prediction time the chances of finding the correct category are higher. The sample file is read as a plain text stream and wrapped in a document sample stream.
The DocumentCategorizerME.train method accepts the language, the document sample stream, the training parameters, and a document categorizer factory object. The factory is used to create the new document categorizer model, which is returned by the train method.
The model is serialized to a temporary binary (bin) file. This bin file can later be opened as a document categorizer object, which computes the probability of the text content for each category. The category with the maximum probability is chosen as the best category and displayed.
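For example, a minimal sketch of reloading the serialized bin file and predicting with it (the file path here is illustrative; use the temp file created by the trainer below):
DoccatModel loadedModel = new DoccatModel(new File("models/en-movie-classifier-naive-bayes.bin")); /* illustrative path */
DocumentCategorizer categorizer = new DocumentCategorizerME(loadedModel);
double[] probs = categorizer.categorize("a detective investigates a series of murders in a small town".split(" "));
System.out.println(categorizer.getBestCategory(probs) + " : predicted category");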
package com.nagappans.apachenlp;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizer;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.ml.AbstractTrainer;
import opennlp.tools.ml.naivebayes.NaiveBayesTrainer;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

/**
 * OpenNLP version 1.7.2
 * Training of a Document Categorizer using the Naive Bayes algorithm in OpenNLP for document classification.
 */
public class DocClassificationNaiveBayesTrainer {

    public static void main(String[] args) throws Exception {
        try {
            /* Read the training data. */
            InputStreamFactory dataIn = new MarkableFileInputStreamFactory(
                    new File(DocClassificationNaiveBayesTrainer.class.getResource(
                            "/models/en-movie-category" + ".train").getFile()));
            ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
            ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

            /* Define the training parameters. */
            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ITERATIONS_PARAM, "10");
            params.put(TrainingParameters.CUTOFF_PARAM, "0");
            params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

            /* Create a model from the training data. */
            DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, new DoccatFactory());
            System.out.println("Model is successfully trained.");

            /* Save the model locally as a temporary file in the models resource directory. */
            File trainedFile = File.createTempFile("en-movie-classifier-naive-bayes", ".bin",
                    new File(DocClassificationNaiveBayesTrainer.class.getResource("/models/").toURI()));
            try (BufferedOutputStream modelOut = new BufferedOutputStream(new FileOutputStream(trainedFile))) {
                model.serialize(modelOut);
            }
            System.out.println("Trained model is saved locally at : " + trainedFile.getAbsolutePath());

            /* Test the model by subjecting it to a prediction. */
            DocumentCategorizer docCategorizer = new DocumentCategorizerME(model);
            String[] docWords = "Afterwards Stuart and Charlie notice Kate in the photos Stuart took at Leopolds ball and realise that her destiny must be to go back and be with Leopold That night while Kate is accepting her promotion at a company banquet he and Charlie race to meet her and show her the pictures Kate initially rejects their overtures and goes on to give her acceptance speech but it is there that she sees Stuarts picture and realises that she truly wants to be with Leopold".replaceAll("[^A-Za-z]", " ").split(" ");
            double[] aProbs = docCategorizer.categorize(docWords);

            /* Print the probabilities of the categories. */
            System.out.println("---------------------------------\nCategory : Probability\n---------------------------------");
            for (int i = 0; i < docCategorizer.getNumberOfCategories(); i++) {
                System.out.println(docCategorizer.getCategory(i) + " : " + aProbs[i]);
            }
            System.out.println("---------------------------------");
            System.out.println("\n" + docCategorizer.getBestCategory(aProbs) + " : is the predicted category for the given sentence.");
        } catch (IOException e) {
            System.out.println("An exception occurred while reading the training file. Please check.");
            e.printStackTrace();
        }
    }
}
Reference:
Source code - https://github.com/nagappan080810/apache_opennlp_workouts.git