jChardet - Charset detection algorithm in Java


jchardet is a Java port of Mozilla's automatic charset detection algorithm. The original author is Frank Tang. The original C++ source can be found at http://lxr.mozilla.org/mozilla/source/intl/chardet/ and more information is available at http://www.mozilla.org/projects/intl/chardet.html

http://jchardet.sourceforge.net/
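To give a feel for what charset detection involves, here is a minimal sketch in plain Java. It is not jchardet's API and not Mozilla's algorithm: the class name CharsetSniffer and its guess method are hypothetical, and the approach (byte-order mark check, 7-bit check, strict UTF-8 validation) is a much simpler stand-in for the statistical detection the real library performs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of jchardet. */
final class CharsetSniffer {

    /** Returns a best-guess charset name, or null when this sketch cannot decide. */
    static String guess(byte[] data) {
        // 1. A byte-order mark identifies the encoding directly.
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";

        // 2. Pure 7-bit input is valid in every ASCII-compatible charset.
        boolean ascii = true;
        for (byte b : data) {
            if (b < 0) { // high bit set: not 7-bit ASCII
                ascii = false;
                break;
            }
        }
        if (ascii) return "US-ASCII";

        // 3. Strict decoding: input that is well-formed UTF-8 is very likely UTF-8.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            // A real detector would now score candidate legacy encodings.
            return null;
        }
    }

    /** True when data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A null result is where a detector such as jchardet earns its keep: telling legacy multi-byte and single-byte encodings apart requires statistical analysis of the byte sequences, which this sketch does not attempt.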

Related Projects

Charset detector

  •    Delphi

Library for automatic charset detection of a given text or file. The input buffer is analysed to guess the encoding used. The result (charset name or code page ID) can be used as a control parameter for charset conversion. Make your programs Unicode aware!

UIMA - Unstructured information management architecture

  •    Java

UIMA analyzes large volumes of unstructured information in order to discover knowledge that is relevant to an end user. It is a framework made up of a set of components, including language identification, language-specific segmentation, sentence boundary detection, and entity detection (person/place names). The framework manages these components and the data flows between them.

franc - Natural language detection

  •    Javascript

Detect the language of text. Based on the UDHR, the most translated document in the world.

Java port of Mozilla charset detector

  •    Java

Java port of Mozilla's automatic charset detection algorithm. See http://www.mozilla.org/projects/intl/chardet.html for details of the original code and author.


whatlanguage - A language detection library for Ruby that uses bloom filters for speed.

  •    Ruby

Text language detection. Fast, memory efficient, and all in pure Ruby. Uses Bloom filters for the aforementioned speed and memory benefits. It works well on texts of over 10 words in length (e.g. blog posts or comments) and very poorly on short or Twitter-esque text, so be aware. Works with Dutch, English, Farsi, French, German, Italian, Pinyin, Swedish, Portuguese, Russian, Arabic, Finnish, Greek, Hebrew, Hungarian, Korean, Norwegian, Polish and Spanish out of the box.
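For readers unfamiliar with Bloom filters, the sketch below shows the data structure in Java. It is not whatlanguage's code (which is Ruby): the class name WordBloomFilter, the choice of hash functions, and the double-hashing scheme are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical minimal Bloom filter; not taken from whatlanguage. */
final class WordBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions set per word

    WordBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Records a word by setting hashCount bit positions. */
    void add(String word) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(word, i));
        }
    }

    /** False means definitely absent; true means probably present. */
    boolean mightContain(String word) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(word, i))) return false;
        }
        return true;
    }

    /** Double hashing: derives the i-th position from two base hashes. */
    private int index(String word, int i) {
        int h1 = word.hashCode();
        int h2 = fnv1a(word);
        return Math.floorMod(h1 + i * h2, size);
    }

    /** 32-bit FNV-1a hash over the UTF-8 bytes of the string. */
    private static int fnv1a(String s) {
        int h = 0x811C9DC5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }
}
```

One plausible way to use such a filter for language detection is to build one filter per language from a word list, then count how many words of the input each filter reports as present and pick the language with the most hits. The filter never stores the words themselves, which is where the memory saving comes from, at the cost of occasional false positives.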

modernish - cross-platform POSIX shell feature detection and language extension library

  •    Shell

modernish is an ambitious, as-yet experimental, cross-platform POSIX shell feature detection and language extension library. It aims to extend the shell language with extensive feature testing and language enhancements, using the power of aliases and functions to extend the shell language using the shell language itself. The name is a pun on Modernizr, the JavaScript feature testing library, -sh, the common suffix for UNIX shell names, and -ish, still not quite a modern programming language but perhaps a little closer. jQuery is another source of general inspiration; like it, modernish adds a considerable feature set by using the power of the language it's implemented in to extend/transcend that same language.

cz2cz tools

  •    C

cz2cz is software for converting text files between various character encodings (ISO-8859-2, Win-1250, UTF-8, ...). Its main feature is autodetection of the charset used in a text file. Available only in Czech (and useful only for Czech users).
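Once the source charset is known, the conversion step itself is small. The following Java sketch shows the idea; it is not cz2cz's code (which is C), and the Transcoder class and its transcode method are hypothetical names.

```java
import java.nio.charset.Charset;

/** Hypothetical helper showing charset-to-charset conversion of raw bytes. */
final class Transcoder {

    /** Decodes input using the source charset, then re-encodes it in the target charset. */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        // Decoding replaces malformed byte sequences with U+FFFD rather than failing.
        String text = new String(input, from);
        // Encoding replaces characters the target cannot represent with its replacement bytes.
        return text.getBytes(to);
    }
}
```

For Czech text, a call might look like Transcoder.transcode(bytes, Charset.forName("ISO-8859-2"), StandardCharsets.UTF_8). Whether a given legacy charset name is available depends on the Java runtime in use.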

NTextCat

  •    C#

NTextCat is a text classification utility. Its primary target is language identification: it helps you recognize (identify) the language of a text (or binary) snippet. Pure .NET application (C#).

Highlight.js - Javascript Syntax Highlighter

  •    Javascript

Highlight.js is a syntax highlighter written in JavaScript. It works in the browser as well as on the server. It works with pretty much any markup and doesn’t depend on any framework. It supports 176 languages and 79 styles, automatic language detection, multi-language code highlighting and a lot more.

ASPseek

  •    C++

ASPseek is Internet search engine software developed by SWsoft. ASPseek consists of an indexing robot, a search daemon, and a CGI search frontend. It can index as many as a few million URLs and search for words and phrases, use wildcards, and do a Boolean search. Search results can be limited to a given time period, site or Web space (set of sites) and sorted by relevance (PageRank is used) or date.

newspaper - 💡 News, full-text, and article metadata extraction in Python 3.

  •    Python

Newspaper has seamless language extraction and detection. If no language is specified, Newspaper will attempt to auto-detect a language. See the documentation for full and detailed guides to using newspaper.

SeetaFaceEngine

  •    C++

SeetaFace Engine is an open source C++ face recognition engine, which can run on CPU with no third-party dependencies. It contains three key parts, i.e., SeetaFace Detection, SeetaFace Alignment and SeetaFace Identification, which are necessary and sufficient for building a real-world face recognition application system. SeetaFace Detection implements a funnel-structured (FuSt) cascade schema for real-time multi-view face detection, which achieves a good trade-off between detection accuracy and speed. State-of-the-art accuracy can be achieved at high speed on the public FDDB dataset. See SeetaFace Detection for more details.

ImageMagick

  •    C++

ImageMagick is a software suite to create, edit, and compose bitmap images. It can read, convert and write images in a variety of formats (over 100) including DPX, EXR, GIF, JPEG, JPEG-2000, PDF, PhotoCD, PNG, PostScript, SVG, and TIFF. Use ImageMagick to translate, flip, mirror, rotate, scale, shear and transform images, adjust image colors, apply various special effects, or draw text, lines, polygons, ellipses and Bézier curves.

Suricata IDS - Network threat detection engine

  •    C

The Suricata engine is capable of real time intrusion detection (IDS), inline intrusion prevention (IPS), network security monitoring (NSM) and offline pcap processing. Suricata inspects the network traffic using a powerful and extensive rules and signature language, and has powerful Lua scripting support for detection of complex threats.

EasyOCR - Java image cleanup and OCR recognition component (based on the Tesseract OCR engine) that automatically cleans up images and recognizes CAPTCHA verification code image content

  •    Java

EasyOCR is a Java OCR recognition component based on the Tesseract engine. With a few simple API calls, Java programs can recognize the content of images. It integrates image cleanup with recognition of CAPTCHA images, bills, notes and other content. The EasyOCR engine supports plugin programming and ETD templates, and provides a graphical ETD template design tool (EasyTemplateDesigner GUI). EasyOCR not only provides services for consumers, but is mainly oriented toward providing an SDK for local development and integration with C/S, B/S and native Android mobile projects.

AutoTranslate

  •    VBNET

A very simple application that translates a block of text from one language to another. Like the online Google Translate service, this program supports automatic detection of the input language for translation. Requires an internet connection to work.

Snort

  •    C

Snort is a libpcap-based sniffer/logger which can be used as a network intrusion detection and prevention system. It uses a rule-based detection language as well as various other detection mechanisms and is highly extensible.

Filesystem Charset Converter

  •    C

Filesystem Charset Converter (fcc) converts file and directory names from one charset to another.

OpenPipe - Document Pipeline

  •    Java

OpenPipe is an open source scalable platform for manipulating a stream of documents. A pipeline is an ordered set of steps (operations) performed on a document to convert it from its raw form into something ready to be put into the index. The operations performed on documents include language detection, field manipulation, POS tagging, entity extraction and submitting the document to a search engine.