
tldextract - Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List

  •    Python

tldextract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL. For example, say you want just the 'google' part of 'http://www.google.com'. Everybody gets this wrong. Splitting on '.' and taking the last two elements works only for simple suffixes such as .com. Consider parsing http://forums.bbc.co.uk: the naive split gives 'co' as the domain and 'uk' as the TLD, instead of 'bbc' and 'co.uk' respectively.
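
The failure described above can be reproduced with a few lines of standard-library Python. This is only a sketch of the idea, not tldextract's implementation: `KNOWN_SUFFIXES` is a hypothetical three-entry stand-in for the Public Suffix List, and `naive_split` and `suffix_aware_split` are illustrative names.

```python
from urllib.parse import urlsplit

# Hypothetical, tiny stand-in for the Public Suffix List (assumption for
# illustration only; the real list has thousands of entries).
KNOWN_SUFFIXES = {"com", "uk", "co.uk"}


def naive_split(url):
    """Split the host on '.' and treat the last two labels as (domain, tld)."""
    labels = urlsplit(url).hostname.split(".")
    return labels[-2], labels[-1]


def suffix_aware_split(url, suffixes=KNOWN_SUFFIXES):
    """Return (subdomain, domain, suffix) using the longest known suffix."""
    labels = urlsplit(url).hostname.split(".")
    # Candidates run from longest to shortest, so 'co.uk' wins over 'uk'.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return subdomain, domain, candidate
    # No known suffix: treat the whole host as the domain.
    return "", ".".join(labels), ""


print(naive_split("http://forums.bbc.co.uk"))         # ('co', 'uk')
print(suffix_aware_split("http://forums.bbc.co.uk"))  # ('forums', 'bbc', 'co.uk')
```

With the library itself, the equivalent call is `tldextract.extract('http://forums.bbc.co.uk')`, whose result exposes `subdomain`, `domain` and `suffix` fields.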

cstlemma - Lemmatiser that uses affix rules (affix: prefix, infix, suffix, circumfix)

  •    C++

Both 32-bit and 64-bit versions can be built. To run the CST lemmatiser you need, at a minimum, a file containing flex rules. The smallest possible set of flex rules is the empty set, in which case the lemmatiser assumes that every word in your input text is already perfectly lemmatised.
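
The empty-rule-set behaviour can be illustrated with a toy suffix-only lemmatiser. This sketch does not use CST's actual flex rule file format, and it ignores prefixes, infixes and circumfixes; the `(suffix, replacement)` pairs and the `lemmatise` function are assumptions made purely for illustration.

```python
def lemmatise(word, rules):
    """Apply the first matching (suffix, replacement) rule, longest suffix first."""
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if suffix and word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    # No rule matched (always the case for an empty rule set):
    # the word is treated as its own lemma.
    return word


print(lemmatise("walking", []))                        # walking
print(lemmatise("ponies", [("ies", "y"), ("s", "")]))  # pony
```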