jchardet - Charset detection algorithm in Java

  •        6216

jchardet is a java port of the source from mozilla's automatic charset detection algorithm. The original author is Frank Tang. What is available here is the java port of that code.The original source in C++ can be found from http://lxr.mozilla.org/mozilla/source/intl/chardet/ More information can be found at http://www.mozilla.org/projects/intl/chardet.html

http://jchardet.sourceforge.net/

Tags
Implementation
License
Platform

   




Related Projects

UIMA - Unstructured information management architecture


UIMA analyzes large volumes of unstructured information in order to discover knowledge that is relevant to an end user. It is a framework with different set of components. The components include Language Identification, Language specific segmentation, Sentence boundary detection, Entity detection (person/place names) etc. The framework manages these components and the data flows between them.

NTextCat


NTextCat is text classification utility. Primary target is language identification. So it helps you to recognize (identify) the language of text (or binary) snippet. Pure .net application (C#).

ImageMagick


ImageMagick is a software suite to create, edit, and compose bitmap images. It can read, convert and write images in a variety of formats (over 100) including DPX, EXR, GIF, JPEG, JPEG-2000, PDF, PhotoCD, PNG, Postscript, SVG, and TIFF. Use ImageMagick to translate, flip, mirror, rotate, scale, shear and transform images, adjust image colors, apply various special effects, or draw text, lines, polygons, ellipses and Bézier curves.

franc - Natural language detection


Detect the language of text.† - Based on the UDHR, the most translated document in the world.


EasyOCR - Java OCR 识别组件(基于Tesseract OCR 引擎)。能自动完成图片清理、识别 CAPTCHA 验证码图片内容的一体化工作。Java Image cleanup, OCR recognition component (based Tesseract OCR engine, automatically cleanup image and identification CAPTCHA verification code picture content)


EasyOCR is a Java language using OCR recognition engine (based Tesseract). By means of a few simple API, the Java language can be used to complete the picture content identification work. And integrated image cleanup, recognition CAPTCHA image, bill notes and other content integration efforts. EasyOCR engine supports plugin programming, ETD templates support, provide a graphical ETD template design tools (EasyTemplateDesigner GUI). EasyOCR not only provide services for consumers, but mainly oriented to provide localized development SDK integration with C/S, B/S and Android mobile terminal native integration projects.

whatlanguage - A language detection library for Ruby that uses bloom filters for speed.


Text language detection. Quick, fast, memory efficient, and all in pure Ruby. Uses Bloom filters for aforementioned speed and memory benefits. It works well on texts of over 10 words in length (e.g. blog posts or comments) and very poorly on short or Twitter-esque text, so be aware. Works with Dutch, English, Farsi, French, German, Italian, Pinyin, Swedish, Portuguese, Russian, Arabic, Finnish, Greek, Hebrew, Hungarian, Korean, Norwegian, Polish and Spanish out of the box.

modernish - cross-platform POSIX shell feature detection and language extension library


modernish is an ambitious, as-yet experimental, cross-platform POSIX shell feature detection and language extension library. It aims to extend the shell language with extensive feature testing and language enhancements, using the power of aliases and functions to extend the shell language using the shell language itself. The name is a pun on Modernizr, the JavaScript feature testing library, -sh, the common suffix for UNIX shell names, and -ish, still not quite a modern programming language but perhaps a little closer. jQuery is another source of general inspiration; like it, modernish adds a considerable feature set by using the power of the language it's implemented in to extend/transcend that same language.

newspaper - 💡 News, full-text, and article metadata extraction in Python 3. Advanced docs:


Newspaper has seamless language extraction and detection. If no language is specified, Newspaper will attempt to auto detect a language. Check out The Documentation for full and detailed guides using newspaper.

langid.py - Stand-alone language identification system


langid.py is a standalone Language Identification (LangID) tool. All that is required to run langid.py is >= Python 2.7 and numpy. The main script langid/langid.py is cross-compatible with both Python2 and Python3, but the accompanying training tools are still Python2-only.

AutoTranslate


A very simple application that translates a block of text from one language to another. Like the online Google translate service, this program supports automatic detection of the input language for translation. Requires an internet connection to work.

Highlight.js - Javascript Syntax Highlighter


Highlight.js is a syntax highlighter written in JavaScript. It works in the browser as well as on the server. It works with pretty much any markup, doesn’t depend on any framework and has automatic language detection. It supports 176 languages and 79 styles, automatic language detection, multi-language code highlighting and lot more.

Modular Audio Recognition Framework


MARF is a general cross-platform framework with a collection of algorithms for audio (voice, speech, and sound) and natural language text analysis and recognition along with sample applications (identification, NLP, etc.) of its use, implemented in Java.

OpenPipe - Document Pipeline


OpenPipe is an open source scalable platform for manipulating a stream of documents. A pipeline is an ordered set of steps / operations performed on a document to convert from its raw form to something ready to be put into the index. The operations performed on documents include language detection, field manipulation, POS tagging, entity extraction or submitting the document to a search engine.

language-babel - ES2017, flow, React JSX and GraphQL grammar and transpilation for ATOM


Language grammar for all versions of JavaScript including ES2016 and ESNext, JSX syntax as used by Facebook React, Atom's etch and others, as well as optional typed JavaScript using Facebook flow. This package also supports highlighting of GraphQL language constructs when inside certain JavaScript template strings. For .graphql and .gql file support please see language-graphql . The colour of syntax is determined by the theme in use. By default the language-babel package will detect file types .js,.babel,.jsx, .es, .es6, .mjs and .flow. Use the standard ATOM interface to enable it for other file types. This provides a grammar that scopes the file in order to colour the text in a meaningful way. If other JavaScript grammars are enabled these may take precedence over language-babel. Look at the bottom right status bar indicator to determine the language grammar of a file being edited. language-babel will be shown as either Babel or Babel ES6 JavaScript. Clicking the name will allow the grammar for a file to be changed.

Rant - Procedural text generation language


Rant is a language for adding rich variations to text. It combines a markup language with functional and imperative programming concepts to deliver a concise but powerful tool for procedurally generating text. The goal of Rant is to augment human creativity with the boundless potential of randomness, enabling content producers to consider their next idea as not just a concept, but a seed for countless possibilities.

ale - Asynchronous linting/fixing for Vim and Language Server Protocol (LSP) integration


ALE (Asynchronous Lint Engine) is a plugin for providing linting in NeoVim 0.2.0+ and Vim 8 while you edit your text files, and acts as a Vim Language Server Protocol client. ALE makes use of NeoVim and Vim 8 job control functions and timers to run linters on the contents of text buffers and return errors as text is changed in Vim. This allows for displaying warnings and errors in files being edited in Vim before files have been saved back to a filesystem.

Char-RNN-TensorFlow - Multi-language Char RNN for TensorFlow >= 1.2.


Multi-language Char RNN in TensorFlow. You can use this code to generate English text, Chinese poetries and lyrics, Japanese text and text in other language.

Gate - General Architecture for Text Engineering


GATE excels at text analysis of all shapes and sizes. It provides support for diverse language processing tasks such as parsers, morphology, tagging, Information Retrieval tools, Information Extraction components for various languages, and many others. It provides support to measure, evaluate, model and persist the data structure. It could analyze text or speech. It has built-in support for machine learning and also adds support for different implementation of machine learning via plugin.