PDFIO is a native Julia implementation for reading PDF files. It is a 100% Julia
implementation of the PDF specification. Other than a few well-established
algorithms like flate decode (zlib library) or cryptographic operations
(openssl library), almost all of the APIs are written in native Julia.
The following are some of the benefits of utilizing this approach:
PDF files have been in existence for over three decades. PDF writer implementations do not always follow the specification, and they may vary significantly from vendor to vendor. Every time you get a new PDF file, there is a possibility that it may not work under the best interpretation of the specification. A script-based language makes it easier for consumers to quickly modify the code and adapt it to their specific needs.
When a higher level scripting language wraps a C/C++ PDF library API, the scope is confined to certain high level application tasks such as graphics or text extraction, annotation or signature content extraction, or page merging or extraction. This API, however, represents the PDF specification as a model (in MVC parlance). Every object in the PDF specification can be represented in some form through these APIs. Hence, objects can be utilized effectively to understand document structure or correlate documents in more meaningful ways.
Potential to be extended as a PDF generator. Since the API is written as an object model of PDF documents, it is easier to extend with additional PDF write or update capabilities.
There are also certain downsides to this approach:
A popular package, Taro.jl, which uses the Java-based Apache Tika, Apache POI and Apache FOP libraries for reading PDF and other file types, may need the following code to extract text and other metadata from a PDF file:

    using Taro
    Taro.init()
    meta, txtdata = Taro.extract("sample.pdf");
The same task with PDFIO may look like this:

    function getPDFText(src, out)
        doc = pdDocOpen(src)
        docinfo = pdDocGetInfo(doc)
        open(out, "w") do io
            npage = pdDocGetPageCount(doc)
            for i=1:npage
                page = pdDocGetPage(doc, i)
                pdPageExtractText(io, page)
            end
        end
        pdDocClose(doc)
        return docinfo
    end
The package can be added to a project by the command below:
The current version of the API requires Julia 1.0. The detailed list of packages PDFIO depends on can be seen in the Project.toml file.
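A minimal sketch of the installation step, using Julia's standard Pkg API. It assumes the package is registered under the name PDFIO and that network access to the registry is available; the version check simply mirrors the Julia 1.0 requirement stated above.

```julia
using Pkg

# The text above states that Julia 1.0 is required; stop early on anything older.
VERSION >= v"1.0" || error("PDFIO requires Julia 1.0 or later")

# Assumption: the package is registered under the name "PDFIO".
Pkg.add("PDFIO")
```

After this, `using PDFIO` should make the functions used in the samples available in the current session.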
The above-mentioned code takes a PDF file src as input and writes the text data into a file out. It enumerates all the pages in the document, extracts the text from each page and writes it to the output file.

    """
    ```
        getPDFText(src, out) -> Dict
    ```
    - src - Input PDF file from where text is to be extracted
    - out - Output TXT file where the output will be written

    return - A dictionary containing metadata of the document
    """
    function getPDFText(src, out)
        doc = pdDocOpen(src)

doc is a handle that can be used for subsequent operations on the document.

        docinfo = pdDocGetInfo(doc)

Metadata extracted from the PDF document. This value is retained and returned from the function.

        open(out, "w") do io
            npage = pdDocGetPageCount(doc)

Returns the number of pages in the document.

            for i=1:npage
                page = pdDocGetPage(doc, i)

page is a handle to the specific page for the given page number index.

                pdPageExtractText(io, page)

Extract text from the page and write it to the output file.

            end
        end
        pdDocClose(doc)

Close the document handle. The doc handle should not be used after this call.

        return docinfo
    end
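One thing the walkthrough leaves open is what happens if extraction throws part-way through: pdDocClose would then be skipped. The variant below is a minimal sketch, not part of PDFIO, that wraps the same calls in try/finally so the handle is always released. The name getPDFTextSafe is hypothetical, and sample.pdf and sample.txt are placeholder paths.

```julia
using PDFIO

# Hypothetical variant of getPDFText: same PDFIO calls as in the walkthrough,
# but the document handle is closed even when a page fails to extract.
function getPDFTextSafe(src, out)
    doc = pdDocOpen(src)
    try
        docinfo = pdDocGetInfo(doc)
        open(out, "w") do io
            for i in 1:pdDocGetPageCount(doc)
                page = pdDocGetPage(doc, i)
                pdPageExtractText(io, page)
            end
        end
        return docinfo
    finally
        # Runs on both normal return and error.
        pdDocClose(doc)
    end
end

# Placeholder paths; the call only runs if the input file exists.
if isfile("sample.pdf")
    docinfo = getPDFTextSafe("sample.pdf", "sample.txt")
    println(docinfo)
end
```

The behaviour on success is intended to match getPDFText; only the error path differs.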
As can be seen above, PDFIO provides granular APIs that can be combined to achieve a desired task. For details, please refer to the Architecture and Design.
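As an illustration of combining the same granular calls differently, the sketch below collects the text of each page as a separate string instead of writing a single file. It relies only on the calls shown in the sample plus Julia's standard IOBuffer, and assumes pdPageExtractText accepts any IO object (the sample passes a file stream). The name pageTexts and the path sample.pdf are hypothetical.

```julia
using PDFIO

# Hypothetical helper: returns a Vector{String} with one entry per page.
function pageTexts(src)
    doc = pdDocOpen(src)
    try
        texts = String[]
        for i in 1:pdDocGetPageCount(doc)
            page = pdDocGetPage(doc, i)
            buf = IOBuffer()                  # in-memory IO instead of a file
            pdPageExtractText(buf, page)
            push!(texts, String(take!(buf)))  # drain the buffer into a String
        end
        return texts
    finally
        pdDocClose(doc)
    end
end

# Placeholder path; the loop only runs if the input file exists.
if isfile("sample.pdf")
    for (i, text) in enumerate(pageTexts("sample.pdf"))
        println("page ", i, ": ", length(text), " characters")
    end
end
```

Keeping pages separate makes it straightforward to search or post-process a single page without re-reading the whole document.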
PDFIO is implemented in layers, enabling the following features:
pdPageExtractText is an apt example of the same.
The Architecture and Design discusses some of these scenarios.
PDFIO is developed to contribute to both commercial activities and scientific research alike. However, we strongly discourage usage of this product for any illegal, immoral or unethical purposes. The PDFIO License, while providing rights under the permissive MIT Expat License, is conditioned upon maintaining strong moral, ethical and legal standards of the final outcome.
This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit. (http://www.openssl.org/)
Contributions in the form of PRs are welcome for any feature you would like to develop for the PDFIO library. Please review the GitHub Issues section to understand the known issues. You can take up a few of the issues, work on them and submit a PR. If you come across a bug or are unable to use the APIs in any manner, feel free to submit an issue.
Taro.jl is an alternative package in Julia that supports reading and extracting content from PDF files.
It is almost impossible to talk about PDF without reference to Adobe. All copyrights or trademarks owned by Adobe or ISO that have been referred to inadvertently without stating ownership remain the property of their owners. The author was also part of Adobe's development culture in the early part of his career, working specifically on PDF technology for about 2 years. However, the author has not been part of any activities related to PDF development since 2003. Hence, this API can be considered a clean room development. Words like Carousel and Cos are public knowledge, and a large number of references to them can be found on industry-related websites.
The package contains Adobe Font Metrics (AFM) for 14 Core Adobe fonts.
Not all PDF files that were used to test the library are owned by the author. Hence, the author cannot make those files available to the general public for distribution under the source code license. The author is grateful to the PDF document library maintained by email@example.com, although these files are no longer available at the link above.
Test files may also have different licensing than PDFIO. Hence, we have now uploaded most test files to another project under PDFTest.