Java-based data integration framework that can be used to transform, map, and manipulate data in various formats (CSV, FIXLEN, XML, XBASE, COBOL, LOTUS, etc.). It can be used standalone or embedded as a library, and connects to RDBMS, JMS, SOAP, LDAP, S3, HTTP, FTP, ZIP, and TAR.
etl data-processing data-integration data-extraction

Repo Note: The master branch is an in-development version of Tabula and may differ substantially from the latest releases. As of August 2015, the master branch (and Tabula 1.1.X+) uses tabula-java instead of tabula-extractor under the hood. Previous versions of Tabula use tabula-extractor.
pdf csv excel text-extraction data-extraction

This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm. Documentation can be found at FlashText Read the Docs.
search-in-text keyword-extraction nlp word2vec data-extraction

Extract tables from PDF files.
pdf text-extraction data-extraction

Scriptella is an ETL (Extract-Transform-Load) and script execution tool. Its primary focus is simplicity. It doesn't require the user to learn another complex XML-based language to use it, but allows the use of SQL or another scripting language suitable for the data source to perform required transformations.
etl data-extraction database-migration

PhearJS renders webpages. It runs a server which supervises a set number of PhantomJS workers that do the actual parsing and evaluation. Many websites rely on AJAX and front-end rendering. When a machine requests a page from such a website it sees a completely different page than you would see when viewing it in a browser.
phearjs phantomjs seo prerender ajax data-extraction

Fast keyword extraction using the Aho–Corasick algorithm and tries. Flash is meant as a replacement for regex, which in such cases can be extremely slow.
text search trie data-extraction text-search

Read text and parse tables from PDF files. Supports tabular data with automatic column detection and rule-based parsing.
data-extraction pdf-converter parsing tabular-data pdf-reader parse-tables rule-based-parsing pdf reader parser parse convert cli table data csv json rules

This service extracts summaries and illustrations from Hacker News articles for people who want to get the most out of Hacker News while spending less time deciding which articles to read and which to skip.
hacker-news data-extraction hacker-news-reader rss extract-summaries article hacker-news-digest html content topic spider crawler

Infoboxer is a pure-Ruby Wikipedia (and generic MediaWiki) client and parser, targeting information extraction (hence the name). The whole idea is: you can have any Wikipedia page as a parsed tree with obvious structure, you can navigate that tree easily, and you have a bunch of high-level helper methods, so typical information extraction tasks should be super-easy, one-liners in the best cases.
wikipedia mediawiki data-extraction