From Fedora Project Wiki

Working, Elegant DocBook to PDF Solution

  • Student: Amit Uttamchandani
  • Mentor: Not Yet Announced

Abstract

A utility to convert DocBook to PDF would be an important tool for many Free Software distributions. The proposed solution uses the open-source reportlab Python toolkit to generate PDF files on demand from a DocBook XML source. It is based on a simple three-pronged approach designed to meet the requirements of the project.

First, Python will be used to implement the utility, which takes valid DocBook XML as input and outputs a PDF file. The latest version of Python (2.5) will be used because it has improved support for Unicode strings. Second, a fast XML parser is needed to process DocBook XML files quickly. The solution will use expat, the standard Python XML parser, which is fairly fast, and xmllint to validate the XML source. Third, the reportlab Python toolkit will be used to generate the PDF output.

The approach described above is simple and straightforward. It takes into account a rapid development time frame as well as the extensibility of the solution. Another approach would be to apply an XSLT stylesheet and a preprocessor to a DocBook XML source to produce RML output, and then use rml2PDF to generate the PDF file. This alternative would otherwise be ideal; however, RML and the rml2PDF script are not open source, which rules out their use for the Fedora project.

The simple three-pronged approach described above is the better fit for this project. It meets the needs of the project and is easily extensible. The initial implementation of the Python utility will take a DocBook XML source and generate a PDF output. Future iterations could take in a list of DocBook XML sources and output each one as a PDF file. The design leaves ample room for further extensions of this kind.

Detailed Description

Application for Summer of Code 2007: Amit Uttamchandani

Synopsis

I will propose a solution to convert DocBook XML files into PDF. The approach is a simple three-pronged solution that focuses on simplicity and extensibility.

Project

The solution involves creating a command-line utility to accomplish the task described above. In its simplest form, the utility takes a DocBook XML source and converts it into a PDF file. To guide this work, I have set the following criteria for the finished product.

Criteria

  1. Simple
  2. Extensible
  3. Standard

Implementation

The three-pronged approach involves the following: a front end, a parser and validator, and a PDF toolkit. First, the front end will be written in Python. Python provides a simple yet powerful development environment for implementing the utility. The command-line tool will have the following interface:

docbook2PDF <input>.xml <output>.pdf
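The interface above could be sketched roughly as follows. This is only an illustration, not the proposal's actual code: it uses the standard argparse module, and the function names build_arg_parser and main are hypothetical. The conversion step is deliberately left out.

```python
import argparse


def build_arg_parser():
    """Build a parser for: docbook2PDF <input>.xml <output>.pdf"""
    parser = argparse.ArgumentParser(
        prog="docbook2PDF",
        description="Convert a DocBook XML source into a PDF file.",
    )
    parser.add_argument("input", help="path to the DocBook XML source")
    parser.add_argument("output", help="path of the PDF file to write")
    return parser


def main(argv=None):
    """Parse the two positional arguments and report what would be done."""
    args = build_arg_parser().parse_args(argv)
    # Placeholder only: a real tool would validate, parse, and render here.
    print(f"would convert {args.input} -> {args.output}")
    return 0
```

Calling main(["guide.xml", "guide.pdf"]) exercises the same path that a console entry point would take with those two arguments. Additional options can later be added with further add_argument calls without changing the two positional arguments.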

Second, an XML parser is needed to parse the DocBook XML source. After studying various implementations, I consider expat, the standard Python XML parser, to be the best choice. Its advantages include parsing speed, simple Python bindings, and availability as a standard Python module. Expat, however, does not validate XML files. To validate the XML source, xmllint will be used.
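A minimal sketch of this parsing and validation step is shown below. It rests on several assumptions: the nested-dictionary tree shape and the names parse_docbook and validate_with_xmllint are hypothetical, and the validator is taken to be the xmllint command from libxml2, run as an external process. Text inside an element is simply concatenated, which ignores the ordering of mixed content.

```python
import shutil
import subprocess
import xml.parsers.expat


def parse_docbook(xml_text):
    """Parse XML text with expat into a nested dict tree.

    Each node is {"tag": str, "attrs": dict, "text": str, "children": list}.
    """
    root = {"tag": None, "attrs": {}, "text": "", "children": []}
    stack = [root]  # the innermost open element is always stack[-1]

    def start_element(name, attrs):
        node = {"tag": name, "attrs": dict(attrs), "text": "", "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)

    def end_element(name):
        stack.pop()

    def char_data(data):
        # Simplification: all text of an element is joined into one string.
        stack[-1]["text"] += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_text, True)  # raises ExpatError if not well-formed
    return root["children"][0]


def validate_with_xmllint(path):
    """Return True if xmllint accepts the file as valid against its DTD.

    Assumes xmllint is installed and that the DTD named in the document's
    DOCTYPE can be resolved (locally via a catalog, or over the network).
    """
    if shutil.which("xmllint") is None:
        raise RuntimeError("xmllint not found on PATH")
    result = subprocess.run(
        ["xmllint", "--noout", "--valid", path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Expat only checks well-formedness, so a document with a mismatched tag raises ExpatError in parse_docbook, while a well-formed document that violates the DocBook DTD would only be caught by the separate validation call.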

Third, the open-source reportlab toolkit will be used to write the parsed XML data structure to a PDF file. The reportlab toolkit makes it easy to render Python data structures as PDF. Thus, once a DocBook XML source has been parsed and validated, the resulting Python object can be formatted and written to a PDF file using reportlab.
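The rendering step might look roughly like the sketch below. It assumes a parsed tree of dictionaries with "tag", "text", and "children" keys, which is a hypothetical shape rather than anything fixed by the proposal. The STYLE_FOR_TAG mapping is likewise illustrative and covers only two DocBook elements. The reportlab calls come from its platypus layout layer and are imported inside write_pdf so that the mapping step can be tried without reportlab installed.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from DocBook element names to reportlab sample styles.
STYLE_FOR_TAG = {"title": "Title", "para": "BodyText"}


def to_blocks(node):
    """Flatten a parsed node tree into (style_name, text) pairs, in order."""
    blocks = []
    style = STYLE_FOR_TAG.get(node["tag"])
    text = node["text"].strip()
    if style and text:
        blocks.append((style, text))
    for child in node["children"]:
        blocks.extend(to_blocks(child))
    return blocks


def write_pdf(blocks, pdf_path):
    """Render (style_name, text) pairs to a PDF file with reportlab."""
    # Deferred import: only this function needs reportlab.
    from reportlab.lib.pagesizes import A4
    from reportlab.lib.styles import getSampleStyleSheet
    from reportlab.platypus import Paragraph, SimpleDocTemplate

    styles = getSampleStyleSheet()
    # Paragraph interprets inline markup, so plain text is escaped first.
    story = [Paragraph(escape(text), styles[style]) for style, text in blocks]
    SimpleDocTemplate(pdf_path, pagesize=A4).build(story)
```

Keeping the element-to-style mapping separate from the PDF writing means new DocBook elements can be supported by extending the mapping, in line with the extensibility criterion.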

The above design gives the Python utility a simple, extensible, and standard implementation. It relies on standard components and does not overcomplicate the process. It is extensible because the command-line utility can easily be expanded to include additional options and features.

Road map

  1. Publish a more detailed description of the implementation and specification, including initial flowcharts and function descriptions, to the Fedora developers mailing list. Obtain feedback and incorporate suggestions into the design. (By end of June)
  2. Complete an initial version of the docbook2PDF utility that successfully validates and parses an existing DocBook XML source. (By 3rd week of July)
  3. Integrate the reportlab toolkit into docbook2PDF and successfully output the parsed XML object as a PDF. (By 2nd week of August)
  4. Thoroughly test the implementation and make sure it meets the requirements and specifications. Write documentation on the usage of the docbook2PDF utility. (By end of August)

Future Road map

  1. The utility can be extended to batch process DocBook XML files. The utility can be passed a directory and convert all DocBook XML sources it finds into PDF files.
  2. A GUI can be added using PyGTK to further extend the functionality of the utility.
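The batch-processing idea could start from something like the following sketch. The function name plan_batch is hypothetical. It treats every file ending in .xml as a DocBook source, which is an assumption, and it only plans the conversions instead of performing them.

```python
from pathlib import Path


def plan_batch(source_dir, output_dir):
    """Pair every *.xml file under source_dir with a .pdf path in output_dir.

    The directory layout below source_dir is mirrored below output_dir.
    """
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    jobs = []
    for xml_path in sorted(source_dir.rglob("*.xml")):
        relative = xml_path.relative_to(source_dir)
        jobs.append((xml_path, output_dir / relative.with_suffix(".pdf")))
    return jobs
```

Each returned pair could then be handed to the single-file conversion, so the batch mode stays a thin layer on top of the basic utility.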

Biography

My name is Amit Uttamchandani and I will be completing my Bachelor's degree in Computer Engineering this summer at California State University, Northridge. Before my current internship, I worked for the Information Systems department at the university. During this time, our group was given the task of taking an inventory of all the computers and peripherals, such as printers and scanners, in the Engineering department. The tool in use at that time was an Excel sheet. I found this quite disturbing. The data that we were collecting would be put to much better use if it were stored in a database, and the entire Engineering department could benefit from it. Thus, I suggested implementing a web-based solution with a PHP front end and a MySQL database back end.

Now, everyone could enter data from virtually anywhere through a simple, easy-to-use web front end. Predefined queries were also available to output the data to a PDF file, complete with charts and graphs. The hidden gem was the combination of Python and reportlab. As soon as a query was made, a Python script was called to retrieve the data from the MySQL database, format it using reportlab, and provide a link to the generated PDF file. The whole process worked seamlessly and allowed our department to analyze, for example, how many computers were still using Windows NT or how many had less than 256 MB of RAM.

The above project took three months during the summer to complete. The integration of Python and the reportlab toolkit was a real highlight of the project. I have been working with reportlab and Python ever since to generate on-demand PDF files and reports from databases.

I have also worked with Python and XML. I created a Python script, 'prop', to parse and propagate XML test case data from one project to another. The implementation used Python and expat. It took around three weeks to implement, and the result was a stable utility. By using standard Python libraries, I was able to develop on an OpenBSD system and still use the script on a Windows machine. That is the beauty of Python.

I have been involved with open source software ever since my exposure to Mac OS X. From that point on, I have strived to use open source software wherever possible. After some time, I felt the need to return the favor to the community, and I believe this is an opportunity for me to give something back and be part of the open source ecosystem.

Links