pdfplumber - Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables

  •        375

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging. Works best on machine-generated, rather than scanned, PDFs. Built on pdfminer and pdfminer.six.

https://github.com/jsvine/pdfplumber

Tags
Implementation
License
Platform

   




Related Projects

TCPDF - PHP class for generating PDF

  •    PHP

TCPDF is a PHP class for generating PDF documents without requiring external extensions. TCPDF Supports UTF-8, Unicode, RTL languages, XHTML, Javascript, digital signatures, barcodes and much more.

PDF Library - PDF manipulation in .NET

  •    VBNET

A library for PDF manipulation implementing Adobe PDF standard version 1.7. This library allows to read PDF files and apply changes to them, it is written in .NET 2.0 using Visual Studio 2005. Writing and Parsing PDF is supported.

PDFJet - PDF library for Java and .NET

  •    Java

PDFjet is a high performance PDF library for Java and .NET. It has support of drawing points, lines, box, polygons etc. It supports unicode text, embedding images, embedding hyperlinks and lot more. Its simple to use table class helps to generate flexible reports.

PDFBox - Java PDF library

  •    Java

Apache PDFBox is an open source Java PDF library for working with PDF documents. This library allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. It provides support for adding bookmarks, fonts, text extraction, Encryption, PDF printing and lot more.

tabula-extractor - Extract tables from PDF files

  •    Ruby

Deprecation Note: This is the old version of the Tabula extraction engine. New projects wishing to integrate Tabula should use tabula-java (the new Java version of this extraction engine) unless you prefer to use JRuby. Users looking for the command-line version of Tabula should also use tabula-java.Extract tables from PDF files. tabula-extractor is the table extraction engine that used to power Tabula.


PDFClown - PDF library

  •    Java

PDFClown is a PDF library helps to generate, read and edit PDF. It helps to split and merge the PDF documents. It has support to add Images, Fonts, Barcodes, Bookmarks, Annotations, Form fields like checkbox, button, list box etc, Compression, text extraction.

HexaPDF - A Versatile PDF Creation and Manipulation Library for Ruby

  •    Ruby

HexaPDF is a pure Ruby library with an accompanying application for working with PDF files. It supports Creating new PDF files, Manipulating existing PDF files, Merging multiple PDF files into one, Extracting meta information, text, images and files from PDF files, Securing PDF files by encrypting them and optimizing PDF files for smaller file size or other criteria.

iText - Java PDF library

  •    Java

iText is one of the popular and widely used PDF library. It is used to generate PDF documents dynamically. Mostly web developers will love it to generate PDF documents and reports based on data from an XML file or a database and serves it to the browser. It has support of adding bookmarks, watermarks, Encryption, Form filling and lot more.

PDFSharp - Create and process PDF in .NET

  •    CSharp

PDFsharp is the Open Source .NET library that easily creates and processes PDF documents on the fly from any .NET language. The same drawing routines can be used to create PDF documents, draw on the screen, or send output to any printer. Neither Adobe's PDF Library nor Acrobat are required.

PDFEdit

  •    C++

PDFedit is a free open source pdf editor and a library for manipulating PDF documents. It includes PDF manipulating library based on xpdf, GUI, set of command line tools and a pdf editor. You can use it to read, change and extract information from a PDF file. It is based on xpdf library.

Bad-Pdf - Steal Net-NTLM Hash using Bad-PDF

  •    Python

Bad-PDF create malicious PDF file to steal NTLM(NTLMv1/NTLMv2) Hashes from windows machines, it utilize vulnerability disclosed by checkpoint team to create the malicious PDF file. Bad-Pdf reads the NTLM hashes using Responder listener. This method work on all PDF readers(Any version) and java scripts are not required for this attack, most of the EDR/Endpoint solution fail to detect this attack.

jPod - PDF manipulating and rendering framework

  •    Java

jPod is a PDF manipulating and rendering framework. It provides functionality to read, verify the document against the PDF specification. It also provides content stream and rendering framework. It could able to create new document and do incremental updates.

Ghostscript - Document Rendering and Conversion

  •    C

Ghostscript is a rendering and conversion engine for page description languages, including Postscript and PDF. It has ability to convert PostScript language files to many raster formats, view them on displays, and print them on printers that don't have PostScript language capability built in.

PDF.js - PDF Reader in JavaScript

  •    Javascript

PDF.js is a Portable Document Format (PDF) viewer that is built with HTML5. PDF.js is community-driven and supported by Mozilla Labs. Our goal is to create a general-purpose, web standards-based platform for parsing and rendering PDFs.

php-svg-lib - SVG file parsing / rendering library

  •    PHP

The main purpose of this lib is to rasterize SVG to a surface which can be an image or a PDF for example, through a \Svg\Surface PHP interface. This project was initialized by the need to render SVG documents inside PDF files for the DomPdf project.

tabula - Tabula is a tool for liberating data tables trapped inside PDF files

  •    CSS

Repo Note: The master branch is an in development version of Tabula. This may be substantially different from the latest releases of Tabula.As of August 2015, the master branch (and Tabula 1.1.X+) uses tabula-java instead of tabula-extractor under the hood. Previous versions of Tabula use tabula-extractor.

Pandoc - General Markup Converter

  •    Haskell

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It an convert documents in markdown, reStructuredText, textile, HTML, DocBook, or LaTeX to HTML formats, Word processor formats, PDF and other markup formats.

Solr - Blazing-fast, open source enterprise search platform

  •    Java

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

ultimate-beamer-theme-list - A collection of custom Beamer themes

  •    

Hi! Below is a table of custom Beamer themes originally taken from latex.simon04.net, now expanded to include a few more themes. Want to add yours? Awesome! Send a PR with your link added to the bottom of the table, or email me (see my GitHub profile) and I'll do it for you.





We have large collection of open source products. Follow the tags from Tag Cloud >>


Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.