Chardet - The Universal Character Encoding Detector


Chardet is the Universal Character Encoding Detector. It detects ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants), ISO-8859, Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese), EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese), TIS-620 (Thai).

https://github.com/chardet/chardet
https://chardet.readthedocs.io/
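
As a rough illustration of the simplest layer of this kind of detection, the following Python sketch checks for a byte order mark and then tries strict ASCII and UTF-8 decoding. It is not chardet's algorithm, and `sniff_encoding` is a hypothetical helper name used only here. With the real package you would normally call `chardet.detect()` on a bytes object, which returns a dictionary that includes the guessed encoding and a confidence value.

```python
import codecs

# Byte order marks, checked longest first so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32-LE"),
    (codecs.BOM_UTF32_BE, "UTF-32-BE"),
    (codecs.BOM_UTF8, "UTF-8-SIG"),
    (codecs.BOM_UTF16_LE, "UTF-16-LE"),
    (codecs.BOM_UTF16_BE, "UTF-16-BE"),
]


def sniff_encoding(data: bytes):
    """Return a coarse encoding guess, or None if nothing matches.

    Hypothetical helper: covers only BOM-marked Unicode, ASCII and UTF-8.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("ascii")
        return "ASCII"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Legacy single-byte and multi-byte encodings need statistical
        # analysis, which this sketch does not attempt.
        return None
```

Anything that is neither BOM-marked nor valid UTF-8 returns `None` here; telling apart encodings such as Big5, SHIFT_JIS or TIS-620 is where a statistical detector is needed.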

Related Projects

Charset detector

  •    Delphi

Library for automatic charset detection of a given text or file. The input buffer is analysed to guess the encoding used. The result (charset name or code page ID) can be used as a control parameter for charset conversion. Make your programs Unicode aware!

Java UTF-7 Charset support

  •    Java

Charset implementation adding encoding and decoding support for UTF-7 (as in RFC 2152, in two variants) and modified UTF-7 (RFC 3501) to Java. The two variants of UTF-7 supported differ in the encoding chosen for Set O (optional direct characters).
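
To show what UTF-7 output looks like, here is a small sketch using Python's built-in `utf-7` codec rather than the Java library described above. It covers RFC 2152 only and does not attempt the modified UTF-7 of RFC 3501. The sample string is the "Hi Mom" example from RFC 2152.

```python
# Round trip a string containing a non-ASCII character through UTF-7.
text = "Hi Mom -\u263a-!"          # U+263A is the white smiling face
encoded = text.encode("utf-7")      # result contains only 7-bit bytes
decoded = encoded.decode("utf-7")

# The encoded form given in RFC 2152 decodes to the same string.
rfc_example = b"Hi Mom -+Jjo--!".decode("utf-7")

# A literal plus sign is escaped as "+-" because "+" starts a base64 run.
plus = "1 + 1".encode("utf-7")
```

Whether characters such as `!` are written directly or inside a base64 run is the Set O choice on which the two RFC 2152 variants differ, so two conforming encoders can produce different bytes for the same text while decoding to the same string.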

Charset Guessing Library

  •    C

A C/C++ library to guess the encoding and charset of a string


Java port of Mozilla charset detector

  •    Java

Java port of Mozilla's automatic charset detection algorithm. See http://www.mozilla.org/projects/intl/chardet.html for the details of the original code and author.

File Encoding Checker

  •    

File Encoding Checker is a GUI tool that allows you to validate the text encoding of one or more files. The tool can display the encoding for all selected files, or only the files that do not have the encodings you specify. File Encoding Checker requires .NET 2 or above to run.

Filesystem Charset Converter

  •    C

Filesystem Charset Converter (fcc) converts file and directory names from one charset to another.

utf8

  •    Javascript

utf8.js is a well-tested UTF-8 encoder/decoder written in JavaScript. Unlike many other JavaScript solutions, it is designed to be a proper UTF-8 encoder/decoder: it can encode/decode any scalar Unicode code point values, as per the Encoding Standard.

rchardet - Character encoding auto-detection in Ruby. As smart as your browser. Open source.

  •    Ruby

Character encoding auto-detection in Ruby. As smart as your browser. Open source.

fix-mime-charset

  •    C++

Fix incorrect charset information in Content-Type MIME headers of e-mail messages.

cz2cz tools

  •    C

cz2cz is software for converting text files between various character encodings (ISO-8859-2, Win-1250, UTF-8, ...). Its main feature is autodetection of the charset used in a text file. It is available only in Czech (and is useful for Czech users only).

JFileConv

  •    Java

JFileConv is a text file encoding converter. It supports text-processing plugins and has a `preview' function which allows the user to see how a file is decoded with a particular charset. The project has moved to http://sourceforge.net/projects/jencconv/

charlock_holmes - Character encoding detection, brought to you by ICU

  •    Ruby

NOTE: CharlockHolmes::EncodingDetector.detect will return nil if it was unable to find an encoding. Being able to detect the encoding of some arbitrary content is nice, but what you probably want is to be able to transcode that content into an encoding your application is using.
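
The detect-then-transcode pattern described above can be sketched in Python as follows. This is not the charlock_holmes API: `transcode` is a hypothetical helper, and the `detected` argument stands in for whatever a detector returned, with `None` playing the role of Ruby's `nil`.

```python
from typing import Optional


def transcode(data: bytes, detected: Optional[str],
              target: str = "utf-8") -> Optional[bytes]:
    """Re-encode data from the detected encoding into the target encoding.

    Returns None when detection failed, when the codec name is unknown,
    or when the bytes are not valid in the detected encoding.
    """
    if detected is None:
        # The detector could not find an encoding; nothing safe to do.
        return None
    try:
        text = data.decode(detected)
    except (UnicodeDecodeError, LookupError):
        return None
    return text.encode(target)
```

Returning `None` rather than guessing keeps the failure visible to the caller, which mirrors the `nil` result mentioned in the note.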

SpookFlare - Loader, dropper generator with multiple features for bypassing client-side and network-side countermeasures

  •    Python

SpookFlare has a different perspective to bypass security measures and it gives you the opportunity to bypass the endpoint countermeasures at the client-side detection and network-side detection. SpookFlare is a loader/dropper generator for Meterpreter, Empire, Koadic etc. SpookFlare has obfuscation, encoding, run-time code compilation and character substitution features. So you can bypass the countermeasures of the target systems like a boss until they "learn" the technique and behavior of SpookFlare payloads.

Recursive Search and Replace

  •    Java

SandR is a recursive regex search and replace utility. It works on files or directories recursively. It supports Java-style regular expressions in search terms and auto-detects the character encoding of the files. SandR is written in Java.

python-magic - A python wrapper for libmagic

  •    Python

python-magic is a Python interface to the libmagic file type identification library. libmagic identifies file types by checking their headers according to a predefined list of file types. This functionality is exposed to the command line by the Unix command file. There is also a Magic class that provides more direct control, including overriding the magic database file and turning on character encoding detection. This is not recommended for general use. In particular, it's not safe for sharing across multiple threads and will throw if this is attempted.

umap -- a unicode character map

  •    C

A tool like MS Windows Character Map which places a Unicode character (or string thereof) in the clipboard. umap shows all the characters in an encoding. Clicking on a character places that character in the clipboard.

Character Encoding Conversion Table

  •    Objective-C

This program is a simple tool for displaying matching character encoding names among NSString, IANA, MS code pages, and so on.

Simplepie - PHP library to manage RSS feeds

  •    PHP

Simplepie is an easy to use API that handles all of the dirty work when it comes to fetching, caching, parsing, normalizing data structures between RSS and Atom formats, handling character encoding translation, and sanitizing the resulting data.





