Hydra gives a search solution the tools it needs to modify data efficiently and flexibly before that data is indexed. It does this by providing a scalable pipeline that documents pass through before they are indexed into the search engine. Architecturally, Hydra sits between the search engine and the source integration.
document-processing document-analysis text-analysis document-conversion
Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It can crawl and extract information from file systems, websites, mailboxes, and mail servers. It supports various file formats such as Office, PDF, Zip, and many more. Metadata is also extracted from image files. Aperture has a strong focus on semantics: extracted metadata can be mapped to predefined properties.
document-pipeline connector content-connector text-analysis text-extraction crawler web-crawler
GATE excels at text analysis of all shapes and sizes. It provides support for diverse language processing tasks such as parsers, morphology, tagging, Information Retrieval tools, Information Extraction components for various languages, and many others. It provides support to measure, evaluate, model and persist the data structure. It can analyze text or speech. It has built-in support for machine learning and also supports other machine learning implementations via plugins.
text-extraction text-analysis content-connector text-processing nlp
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
text-analysis text-extraction library content-analysis document-analysis
Doxygen documentation can be found here. We have walkthroughs for a few different parts of MeTA on the MeTA homepage.
nlp nlp-parsing search-engine inverted-index pos-tag text-analysis text-analytics text-classification language-modeling graph-algorithms c-plus-plus word-embeddings
Obsei is a low-code, AI-powered automation tool. It can be used in various business flows such as social listening, AI-based alerting, brand image analysis, comparative study, and more. It consists of an Observer, an Analyzer, and an Informer. The Observer watches platforms such as Twitter, Facebook, app stores, Google reviews, Amazon reviews, news, and websites, and feeds in that information. The Analyzer performs text analysis such as classification, sentiment analysis, translation, and PII detection on the observed data. The Informer sends the results to a ticketing system, data store, dataframe, etc. for further action and analysis.
nlp workflow natural-language-processing sentiment-analysis text-classification customer-support text-analysis artificial-intelligence text-analytics social-network-analysis workflow-automation low-code anonymization issue-tracking-system process-automation customer-engagement lowcode business-process-automation social-listening
OpenPipe is an open source scalable platform for manipulating a stream of documents. A pipeline is an ordered set of steps / operations performed on a document to convert it from its raw form to something ready to be put into the index. The operations performed on documents include language detection, field manipulation, POS tagging, entity extraction or submitting the document to a search engine.
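The idea of a pipeline as an ordered set of steps applied to each document can be sketched in a few lines of Python. This is only an illustration of the concept, not OpenPipe's actual API: the `Document` class, the step functions, and `run_pipeline` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical document: a bag of named string fields."""
    fields: Dict[str, str] = field(default_factory=dict)


# A step takes a document and returns the (possibly modified) document.
Step = Callable[[Document], Document]


def rename_field(old: str, new: str) -> Step:
    """Field manipulation: build a step that renames one field."""
    def step(doc: Document) -> Document:
        if old in doc.fields:
            doc.fields[new] = doc.fields.pop(old)
        return doc
    return step


def normalize_whitespace(doc: Document) -> Document:
    """Collapse runs of whitespace in the body to single spaces."""
    doc.fields["body"] = " ".join(doc.fields.get("body", "").split())
    return doc


def add_word_count(doc: Document) -> Document:
    """Store the number of whitespace-separated words as a new field."""
    doc.fields["word_count"] = str(len(doc.fields.get("body", "").split()))
    return doc


def run_pipeline(doc: Document, steps: List[Step]) -> Document:
    """Apply each step in order; the output of one step feeds the next."""
    for step in steps:
        doc = step(doc)
    return doc


raw = Document({"text": "  OpenPipe   processes a stream\nof documents. "})
indexed = run_pipeline(
    raw,
    [rename_field("text", "body"), normalize_whitespace, add_word_count],
)
print(indexed.fields)
```

Because the steps are plain functions in a list, reordering them or adding a final "submit to search engine" step would only change the list passed to `run_pipeline`.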
content-connector text-analysis nlp document-pipeline text-processing
TextTeaser is an automatic summarization algorithm that combines natural language processing and machine learning. It can provide a gist of an article and better previews in news readers.
summarization nlp text-processing text-analysis summary
This is the main repository for the Programming Historian (http://programminghistorian.org), where we keep the files for the live website. For tutorials in submission, please see: Programming Historian Submissions.
programming-historian text-analysis api data-management data-manipulation data-mining pedagogy linked-open-data mapping network-analysis exhibits scraping dh digital-humanities
Natural language detection for Rust with a focus on simplicity and performance. For more details (e.g. how to blacklist some languages), please check the documentation.
language nlp text-analysis text-classification
Released under MIT License.
text-processing text-analysis javascript-plugin
PHP port of Swearjar.
text-analysis profanity-detection profanity-validator
Oniguruma (or rather, the Onigmo fork of it) is the regular expression library used by the Ruby programming language, and ore is somewhat inspired by Ruby's regular expression features, although it is implemented in what aims to be a natural way for R users, including full vectorisation. This README covers the package's R interface only, and assumes that the reader is already familiar with regular expressions. Please see the official reference document for details of supported regular expression syntax.
r regular-expressions regex text-analysis
Text analysis and assessment library in JavaScript. This library can generate metrics about a text and assess them to produce feedback that can be used to improve the text. Also included is a preview of the Google search results, which can be assessed using the library.
yoast seo text-analysis
Word and sentence tokenization in Python. sent_tokenize can keep the whitespace as-is with the flags keep_whitespace=True and normalize_ascii=False.
natural-language-processing xml tokenizer text text-analysis
textclean is a collection of tools to clean and normalize text. Many of these tools have been taken from the qdap package and revamped to be more intuitive, better named, and faster. The tools are geared at checking for substrings that are not optimal for analysis and either replacing them with more analysis-friendly substrings or removing them (normalizing; see Sproat, Black, Chen, Kumar, Ostendorf, & Richards, 2001, doi:10.1006/csla.2001.0169), or extracting them into new variables. For example, emoticons are often used in text but not always easily handled by analysis algorithms. The replace_emoticon() function replaces emoticons with word equivalents. Other R packages provide some of the same functionality (e.g., english, gsubfn, mgsub, stringi, stringr, qdapRegex). textclean differs from these packages in that it is designed to handle all of the common cleaning and normalization tasks with a single, consistent, pre-configured toolset (note that textclean uses many of these terrific packages as a backend). This means that the researcher spends less time on munging, leading to quicker analysis. This package is meant to be used jointly with the textshape package, which provides text extraction and reshaping functionality. textclean works well with the qdapRegex package, which provides tooling for substring substitution and extraction of pre-canned regular expressions. In addition, the functions of textclean are designed to work within the piping of the tidyverse framework by consistently using the first argument of functions as the data source. The textclean subbing and replacement tools are particularly effective within a dplyr::mutate statement.
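To make the emoticon example concrete, here is a rough Python sketch of the same idea: look up each emoticon in a table and substitute a word equivalent. textclean itself is an R package, so this is not its API; the `replace_emoticons` function and the tiny `EMOTICONS` table (including the word equivalents chosen) are assumptions made up for illustration.

```python
import re

# Hypothetical lookup table: emoticon -> word equivalent.
EMOTICONS = {
    ":)": "smiley",
    ":(": "frown",
    ";)": "wink",
    ":D": "laughing",
}

# Escape each emoticon so characters like ")" are matched literally,
# and try longer emoticons first in case one is a prefix of another.
_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)


def replace_emoticons(text: str) -> str:
    """Replace known emoticons with words, then tidy up spacing."""
    replaced = _PATTERN.sub(lambda m: " " + EMOTICONS[m.group(0)] + " ", text)
    # Collapse the extra spaces introduced around the replacements.
    return " ".join(replaced.split())


cleaned = replace_emoticons("Great talk :) but the demo crashed :(")
print(cleaned)
```

Taking the text as the first argument mirrors the data-first convention the description mentions, which is what makes such functions easy to chain.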
r text-cleaning data-munging text-analysis emoticons regex
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
screenwriting screenplay fountain text-analysis
This is the Go client library for AYLIEN's APIs. If you haven't already done so, you will need to sign up. See the Developers Guide for additional documentation.
natural-language-processing text-analysis nlp machine-learning
AYLIEN's officially supported Java client library for accessing the Text API.
natural-language-processing nlp machine-learning text-analysis
This is the Python client library for AYLIEN's APIs. If you haven't already done so, you will need to sign up. See the Developers Guide for additional documentation.
natural-language-processing nlp machine-learning text-analysis