juniversalchardet is a Java port of "universalchardet", Mozilla's encoding detector library.
https://github.com/albfernandez/juniversalchardet
Tags | charset charset-detection encoding language-identification language-detection |
Implementation | Java |
License | MPL |
Platform | OS-Independent |
jchardet is a Java port of the source of Mozilla's automatic charset detection algorithm.
language-identification language-detection text-catagorization internationalization charset charset-detection
Library for automatic charset detection of a given text or file. The input buffer is analysed to guess the encoding used. The result (charset name or code page ID) can be used as a control parameter for charset conversion. Make your programs Unicode aware!
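The detect-then-convert flow described above can be sketched in a few lines of Python. This is a minimal illustration only, not the algorithm used by any of the libraries listed here: it checks for a byte-order mark and then tries a short list of candidate encodings in order. The function names guess_charset and to_utf8 and the candidate list are assumptions made for the example.

```python
import codecs

# Byte-order marks, UTF-32 first so it is not mistaken for UTF-16
# (the UTF-32 LE mark begins with the UTF-16 LE mark).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_charset(buf, candidates=("ascii", "utf-8", "cp1252")):
    """Return the name of a codec that can decode buf (hypothetical helper)."""
    for bom, name in BOMS:
        if buf.startswith(bom):
            return name
    # Trial decoding: the first candidate that decodes without error wins.
    for name in candidates:
        try:
            buf.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so it never fails.
    return "latin-1"


def to_utf8(buf):
    """Use the guessed charset as the control parameter for conversion."""
    return buf.decode(guess_charset(buf)).encode("utf-8")
```

Trial decoding cannot tell apart single-byte encodings that accept the same bytes, which is why real detectors rely on statistical models of byte and character frequencies instead.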
UIMA analyzes large volumes of unstructured information in order to discover knowledge that is relevant to an end user. It is a framework with a varied set of components. The components include Language Identification, Language specific segmentation, Sentence boundary detection, Entity detection (person/place names), etc. The framework manages these components and the data flows between them.
document-pipeline connector content-connector text-extraction document-processing unstructured
Java port of Mozilla's automatic charset detection algorithm. See http://www.mozilla.org/projects/intl/chardet.html for the details of the original code and author.
Detect the language of text. Based on the UDHR, the most translated document in the world.
natural-language language-detection nlp natural language detection detect
TextCat, written in Perl, helps identify 69 natural languages.
language-identification language-detection text-catagorization
modernish is an ambitious, as-yet experimental, cross-platform POSIX shell feature detection and language extension library. It aims to extend the shell language with extensive feature testing and language enhancements, using the power of aliases and functions to extend the shell language using the shell language itself. The name is a pun on Modernizr, the JavaScript feature testing library, -sh, the common suffix for UNIX shell names, and -ish, still not quite a modern programming language but perhaps a little closer. jQuery is another source of general inspiration; like it, modernish adds a considerable feature set by using the power of the language it's implemented in to extend/transcend that same language.
Charset implementation adding encoding and decoding support for UTF-7 (as in RFC 2152, in two variants) and modified UTF-7 (RFC 3501) to Java. The two variants of UTF-7 supported differ in the encoding chosen for Set O (optional direct characters).
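For comparison, Python ships a built-in "utf-7" codec for RFC 2152, so the encoding itself can be illustrated without the Java library. The sketch below is not the Java Charset described above and does not cover modified UTF-7 (RFC 3501); the helper name utf7_round_trip is an assumption, and the sample string is the "Hi Mom" example from RFC 2152.

```python
def utf7_round_trip(text):
    """Encode text as UTF-7 (RFC 2152) and decode it back (hypothetical helper)."""
    encoded = text.encode("utf-7")
    return encoded, encoded.decode("utf-7")


# RFC 2152 example: the smiley U+263A becomes the base64 run "+Jjo-".
rfc_bytes = b"Hi Mom -+Jjo--!"
rfc_text = rfc_bytes.decode("utf-7")

encoded, decoded = utf7_round_trip(rfc_text)
# UTF-7 output is pure 7-bit ASCII, whatever the input characters were.
is_seven_bit = all(byte < 128 for byte in encoded)
```

Whether a Set O character such as "!" is written directly or inside a base64 run is the choice that separates the two RFC 2152 variants mentioned above; a decoder accepts both forms.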
cz2cz is software for converting text files between various encoding charsets (ISO-8859-2, Win-1250, UTF-8, ...). Its main feature is autodetection of the charset used in a text file. It is available only in Czech (and is useful for Czech users only).
Highlight.js is a syntax highlighter written in JavaScript. It works in the browser as well as on the server. It works with pretty much any markup and doesn’t depend on any framework. It supports 176 languages and 79 styles, automatic language detection, multi-language code highlighting and a lot more.
highlight syntax highlighting syntax-highlighter code-highlighting
SeetaFace Engine is an open source C++ face recognition engine, which can run on CPU with no third-party dependence. It contains three key parts, i.e., SeetaFace Detection, SeetaFace Alignment and SeetaFace Identification, which are necessary and sufficient for building a real-world face recognition application system. SeetaFace Detection implements a funnel-structured (FuSt) cascade schema for real-time multi-view face detection, which achieves a good trade-off between detection accuracy and speed. State-of-the-art accuracy can be achieved on the public FDDB dataset at high speed. See SeetaFace Detection for more details.
File Encoding Checker is a GUI tool that allows you to validate the text encoding of one or more files. The tool can display the encoding for all selected files, or only the files that do not have the encodings you specify. File Encoding Checker requires .NET 2 or above to run.
charset encoding file file-encodings text validation
The Suricata engine is capable of real time intrusion detection (IDS), inline intrusion prevention (IPS), network security monitoring (NSM) and offline pcap processing. Suricata inspects the network traffic using a powerful and extensive rules and signature language, and has powerful Lua scripting support for detection of complex threats.
intrusion-detection network-security-monitoring security ids ips nsm network-monitoring
Text language detection. Fast, memory efficient, and all in pure Ruby. Uses Bloom filters for the aforementioned speed and memory benefits. It works well on texts of over 10 words in length (e.g. blog posts or comments) and very poorly on short or Twitter-esque text, so be aware. Works with Dutch, English, Farsi, French, German, Italian, Pinyin, Swedish, Portuguese, Russian, Arabic, Finnish, Greek, Hebrew, Hungarian, Korean, Norwegian, Polish and Spanish out of the box.
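To show why Bloom filters suit this task, here is a rough Python sketch of the general idea: one small bit array per language holds that language's common words, and a text is assigned to the language whose filter matches the most words. It is not the Ruby library's implementation; the class, the function names, the filter sizes and the tiny word lists are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; membership tests may give false positives, never false negatives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_filters(wordlists):
    """Build one filter per language from a {language: words} mapping."""
    filters = {}
    for lang, words in wordlists.items():
        bloom = BloomFilter()
        for word in words:
            bloom.add(word)
        filters[lang] = bloom
    return filters


def guess_language(text, filters):
    """Return the language whose filter matches the most words of text."""
    words = text.lower().split()
    scores = {lang: sum(word in bloom for word in words)
              for lang, bloom in filters.items()}
    return max(scores, key=scores.get)


# Hypothetical miniature word lists; a real detector would use far larger ones.
filters = build_filters({
    "english": ["the", "and", "of", "is", "to"],
    "dutch": ["de", "het", "een", "en", "van"],
})
```

Each filter here occupies 1 KB regardless of how many words are added, which is the memory benefit; the dependence on word counts is also why short texts give too few matches to score reliably.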
Snort is a libpcap-based sniffer/logger which can be used as a network intrusion detection and prevention system. It uses a rule-based detection language as well as various other detection mechanisms and is highly extensible.
CyberChef is a simple, intuitive web app for carrying out all manner of "cyber" operations within a web browser. These operations include simple encoding like XOR or Base64, more complex encryption like AES, DES and Blowfish, creating binary and hexdumps, compression and decompression of data, calculating hashes and checksums, IPv6 and X.509 parsing, changing character encodings, and much more. The tool is designed to enable both technical and non-technical analysts to manipulate data in complex ways without having to deal with complex tools or algorithms. It was conceived, designed, built and incrementally improved by an analyst in their 10% innovation time over several years.
data-analysis data-manipulation encryption encoding compression parsing hashing cipher cypher encode decode encrypt decrypt base64 xor charset hex format cybersecurity
Filesystem Charset Convertor (fcc) converts file and directory names from one charset to another.
python-magic is a Python interface to the libmagic file type identification library. libmagic identifies file types by checking their headers according to a predefined list of file types. This functionality is exposed to the command line by the Unix command file. There is also a Magic class that provides more direct control, including overriding the magic database file and turning on character encoding detection. This is not recommended for general use. In particular, it's not safe for sharing across multiple threads and will throw if this is attempted.
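The header-checking idea can be sketched with the Python standard library alone. This is not python-magic's API or libmagic's database, which is far larger and also tests bytes at offsets other than the start of the file; the SIGNATURES table and the identify and identify_file functions are assumptions for illustration, using a handful of well-known file signatures.

```python
# A few well-known leading byte signatures ("magic numbers").
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x1f\x8b", "application/gzip"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def identify(header):
    """Return a MIME type for the leading bytes of a file (hypothetical helper)."""
    for magic, mime in SIGNATURES:
        if header.startswith(magic):
            return mime
    # No signature matched: fall back to the generic binary type.
    return "application/octet-stream"


def identify_file(path):
    """Read only the first bytes of the file; the rest is never loaded."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

Because the result depends only on the content, a PDF renamed to report.txt is still reported as application/pdf.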
Convert character encodings in pure JavaScript.
iconv encoding encoding-convertors convert charset icu