PDF Clown's Blog

Developing a free/libre open source PDF library

Waiting for PDF Clown 0.1.2 release

[NOTE: this post was updated on February 9, 2012]

NOTE: As version 0.1.2 is currently under development, the new features described below (except PDF Clown DOM Inspector, which is still offline) are available through the trunk (HEAD revision) of PDF Clown’s SVN repository.

1. PDF Clown DOM Inspector

Since its earliest versions, PDF Clown has shipped with a simple Swing-based proof of concept for viewing PDF file structures. Now that little fledgling is going to become a comprehensive tool for visually editing the structure of PDF files: PDF Clown DOM Inspector. It will be part of the upcoming 0.1.2 version as a dedicated project within the PDF Clown distribution.

This tool conforms to the PDF model as defined by PDF Clown (see the diagram above), which adheres to the official PDF Reference 1.7/ISO 32000-1. This implies that a PDF file is represented through several concurrent views which work at different abstraction levels: Document view (document layer), File view (file/object layer, hierarchical) and XRef view (file/object layer, flat).

1.1. Document view

Document view (see the left pane in the above screenshot) shows the high-level structure of a PDF file. When you select a node, its data is shown in the right pane through several views; in this case, selecting a page node shows its content stream structure (Contents view, see below) and its rendering (Render view [¹], see above). Note that the page model represented by both Contents view and Render view corresponds to the content (sub)layer described in the diagram above.

Here is just one of the possible features: when you hover the mouse pointer over a show-text-operation node, a tooltip pops up revealing the actual text encoded inside it (in this example, inspecting a Russian-language document):

There’s such potential for custom features that I’m considering making it pluggable, so that users can extend it with additional modules at will.

1.2. File view

File view shows the low-level representation of the same entities you found in the above-mentioned Document view, expressed as primitive objects like dictionaries (PdfDictionary), arrays (PdfArray), streams (PdfStream) and so on.

1.3. XRef view

XRef view lists the entries of the cross-reference index (either table or stream, but that’s a technical detail you can happily ignore as it’s transparently handled by the library).
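For readers curious about what such an entry looks like in the classic table form, here is a minimal sketch, independent of PDF Clown, that parses a single cross-reference table entry as laid out by the PDF specification (a 10-digit number, a 5-digit generation number and an `n`/`f` flag marking the entry as in use or free). The `XRefEntrySketch` class and its members are hypothetical names made up for illustration; they are not part of the library's API, and cross-reference streams are not covered.

```java
public class XRefEntrySketch {
  final long offset;     // Byte offset of the object (in-use entry) or next free object number (free entry).
  final int generation;  // Generation number of the object.
  final boolean inUse;   // true for 'n' entries, false for 'f' entries.

  XRefEntrySketch(long offset, int generation, boolean inUse) {
    this.offset = offset;
    this.generation = generation;
    this.inUse = inUse;
  }

  /** Parses one table entry such as "0000000017 00000 n" (trailing whitespace/EOL is ignored). */
  static XRefEntrySketch parse(String line) {
    String entry = line.trim();
    if (!entry.matches("\\d{10} \\d{5} [nf]")) {
      throw new IllegalArgumentException("Malformed xref entry: " + line);
    }
    return new XRefEntrySketch(
      Long.parseLong(entry.substring(0, 10)),     // Characters 0-9: offset.
      Integer.parseInt(entry.substring(11, 16)),  // Characters 11-15: generation.
      entry.charAt(17) == 'n');                   // Character 17: in-use/free flag.
  }

  public static void main(String[] args) {
    XRefEntrySketch entry = parse("0000000017 00000 n");
    System.out.println(entry.offset + " " + entry.generation + " " + entry.inUse);
  }
}
```

As noted above, the library handles this index transparently, so this is only meant to make the listed entries easier to read.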

Notably, all the views (Document, File, XRef) are always kept synchronized: when you select a node in one of these views, its corresponding entities in each of the others are automatically selected, allowing you to switch seamlessly from one view to another.

[¹] Rendering is still partial as it’s under development (pre-alpha stage).

2. Text line alignment

Building on a much-appreciated code contribution by Manuel Guilbault, text line alignment now supports all the standard modes commonly available in typesetting environments (Top, Middle, Bottom, Super (absolute/relative) and Sub (absolute/relative)), as well as image inlining.
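To give a rough idea of what the three basic modes mean geometrically, here is a minimal sketch that computes where an item (a text chunk or an inline image) sits vertically inside its line box. It is not PDF Clown's implementation: the `LineAlignmentSketch` class, its `Alignment` enum and the `topOffset` method are hypothetical names, and the Super and Sub modes are deliberately left out because their exact semantics are not described here.

```java
public class LineAlignmentSketch {
  // Hypothetical subset of the alignment modes (Super/Sub intentionally omitted).
  enum Alignment { TOP, MIDDLE, BOTTOM }

  /**
   * Returns the distance from the top of the line box to the top of the item,
   * assuming both heights are expressed in the same unit.
   */
  static double topOffset(Alignment alignment, double lineHeight, double itemHeight) {
    switch (alignment) {
      case TOP:
        return 0;                              // Item flush with the top of the line.
      case MIDDLE:
        return (lineHeight - itemHeight) / 2;  // Item vertically centered.
      case BOTTOM:
        return lineHeight - itemHeight;        // Item flush with the bottom of the line.
      default:
        throw new AssertionError(alignment);
    }
  }

  public static void main(String[] args) {
    // A 10-unit-high inline image placed on a 20-unit-high line.
    for (Alignment alignment : Alignment.values()) {
      System.out.println(alignment + ": " + topOffset(alignment, 20, 10));
    }
  }
}
```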

3. File references (file specifications, file identifiers, PDF stream object externalization)

Spurred by an engaging user request, file specification management (now modelled in the org.pdfclown.documents.files namespace instead of the old org.pdfclown.documents.fileSpec) has been thoroughly revised to smoothly support the import/export of PDF stream objects from/to external files.

In practice, this means that, instead of embedding stream data directly into a PDF file, such data can reside in an external (local or remote) file and be linked from within the PDF file through a file specification object (org.pdfclown.documents.files.FileSpecification). Common resources such as images can thus be shared among multiple documents (useful, for example, in a server scenario where documents may be assembled on the fly).
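At the file level, the PDF specification expresses this link through the F entry of the stream dictionary: when F is present, the bytes between the stream and endstream keywords are ignored, and Length is usually 0. The sketch below only builds the textual form of such an object to show its shape. `ExternalStreamSketch` and its methods are hypothetical names, the file specification is written in its simplest form (a plain string), and the way PDF Clown actually serializes it may differ.

```java
public class ExternalStreamSketch {
  /** Escapes backslashes and parentheses, as required inside PDF literal strings. */
  static String escape(String text) {
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
  }

  /** Builds the textual form of an indirect stream object whose data lives in an external file. */
  static String externalStreamObject(int objectNumber, String relativePath) {
    return objectNumber + " 0 obj\n"
      + "<< /Length 0 /F (" + escape(relativePath) + ") >>\n"  // F: file specification of the external data.
      + "stream\n"                                             // No embedded bytes: the data is external.
      + "endstream\n"
      + "endobj\n";
  }

  public static void main(String[] args) {
    System.out.print(externalStreamObject(12, "StreamExternalizationSample-external0"));
  }
}
```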

There is, however, a caveat to consider before adopting externalized streams: as they are prone to security issues, their actual support by PDF viewers is very restricted (e.g., see so-called “privileged locations” in Adobe Acrobat’s Enhanced Security preferences) or even non-existent (e.g., see Evince).

Here is a code sample demonstrating how external references are applied to PDF stream objects:

  1. PDF stream data is exported and linked back [step 1.2 in the code];
  2. linked files are imported back into their respective PDF stream objects [step 2.2 in the code].

package org.pdfclown.samples.cli;

import org.pdfclown.documents.Document;
import org.pdfclown.documents.files.FileSpecification;
import org.pdfclown.files.File;
import org.pdfclown.files.SerializationModeEnum;
import org.pdfclown.objects.PdfDataObject;
import org.pdfclown.objects.PdfIndirectObject;
import org.pdfclown.objects.PdfStream;

/**
  This sample demonstrates how to move stream data outside PDF files and keep external
  references to them; it demonstrates also the inverse process (reimporting stream data
  from external files).
  Note that, due to security concerns, external streams are a discouraged feature which
  is often unsupported on third-party viewers and disabled by default on recent Adobe
  Acrobat versions; in the latter case, in order to bypass restrictions and allow access
  to external streams, users have to enable Enhanced Security from the Preferences dialog,
  specifying privileged locations.

  @author Stefano Chizzolini (http://www.stefanochizzolini.it)
  @since 0.1.2
  @version 0.1.2, 01/29/12
*/
public class StreamExternalizationSample
  extends Sample
{
  @Override
  public boolean run(
    )
  {
    // 1. Externalizing the streams...
    String externalizedFilePath;
    {
      // 1.1. Opening the PDF file...
      File file;
      {
        String filePath = promptPdfFileChoice("Please select a PDF file");
        try
        {file = new File(filePath);}
        catch(Exception e)
        {throw new RuntimeException(filePath + " file access error.",e);}
      }
      Document document = file.getDocument();
      /*
        NOTE: As we are going to export streams using paths relative to the output path,
        it's necessary to ensure they are properly resolved (otherwise they will be
        written relative to the current user directory).
      */
      file.setPath(getOutputPath());

      // 1.2. Iterating through the indirect objects to externalize streams...
      int filenameIndex = 0;
      for(PdfIndirectObject indirectObject : file.getIndirectObjects())
      {
        PdfDataObject dataObject = indirectObject.getDataObject();
        if(dataObject instanceof PdfStream)
        {
          PdfStream stream = (PdfStream)dataObject;
          if(stream.getDataFile() == null) // Internal stream to externalize.
          {
            stream.setDataFile(
              FileSpecification.get(
                document,
                getClass().getSimpleName() + "-external" + filenameIndex++
                ),
              true // Forces the stream data to be transferred to the external location.
              );
          }
        }
      }

      // 1.3. Serialize the PDF file!
      externalizedFilePath = serialize(file, SerializationModeEnum.Standard);
    }

    // 2. Reimporting the externalized streams...
    {
      // 2.1. Opening the PDF file...
      File file;
      try
      {file = new File(externalizedFilePath);}
      catch(Exception e)
      {throw new RuntimeException(externalizedFilePath + " file access error.",e);}

      // 2.2. Iterating through the indirect objects to internalize streams...
      for(PdfIndirectObject indirectObject : file.getIndirectObjects())
      {
        PdfDataObject dataObject = indirectObject.getDataObject();
        if(dataObject instanceof PdfStream)
        {
          PdfStream stream = (PdfStream)dataObject;
          if(stream.getDataFile() != null) // External stream to internalize.
          {
            stream.setDataFile(
              null,
              true // Forces the stream data to be transferred to the internal location.
              );
          }
        }
      }

      // 2.3. Serialize the PDF file!
      String externalizedFileName = new java.io.File(externalizedFilePath).getName();
      String internalizedFilePath = externalizedFileName.substring(0, externalizedFileName.indexOf(".pdf")) + "-reimported.pdf";
      serialize(file, internalizedFilePath, SerializationModeEnum.Standard);
    }

    return true;
  }
}

Working on file specifications also involved adding support for file identifiers (PDF 1.7, § 10.3, modelled by the org.pdfclown.files.FileIdentifier class), which enforce referential integrity on document interchange. Their generation and update are now part of the document life cycle automatically managed by PDF Clown.
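As a reminder of what a file identifier is, the PDF specification defines it as an array of two byte strings stored in the file trailer: a permanent identifier fixed when the file is created and a changing identifier refreshed on each update. The specification suggests computing them with MD5 over the current time, the file location, the file size and the values of the document information dictionary. The following is a minimal sketch of that recipe; `FileIdentifierSketch` and its methods are hypothetical names and do not reflect how org.pdfclown.files.FileIdentifier is actually implemented.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileIdentifierSketch {
  /** Hashes the inputs suggested by the PDF specification into a 16-byte MD5 digest. */
  static byte[] digest(long timeMillis, String location, long size, String... infoValues) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      md5.update(Long.toString(timeMillis).getBytes(StandardCharsets.UTF_8));
      md5.update(location.getBytes(StandardCharsets.UTF_8));
      md5.update(Long.toString(size).getBytes(StandardCharsets.UTF_8));
      for (String value : infoValues) {
        md5.update(value.getBytes(StandardCharsets.UTF_8));
      }
      return md5.digest();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 is not available", e);
    }
  }

  /** Formats bytes as a PDF hexadecimal string, e.g. <0a1b...>. */
  static String hex(byte[] bytes) {
    StringBuilder builder = new StringBuilder("<");
    for (byte b : bytes) {
      builder.append(String.format("%02x", b & 0xff));
    }
    return builder.append('>').toString();
  }

  /** Builds the value of the trailer's ID entry: [permanent changing]. */
  static String idArray(byte[] permanentId, byte[] changingId) {
    return "[" + hex(permanentId) + " " + hex(changingId) + "]";
  }

  public static void main(String[] args) {
    // On creation both parts are the same; on update only the second part is recomputed.
    byte[] permanentId = digest(System.currentTimeMillis(), "sample.pdf", 1024, "Sample title");
    byte[] changingId = digest(System.currentTimeMillis() + 1, "sample.pdf", 2048, "Sample title");
    System.out.println("/ID " + idArray(permanentId, changingId));
  }
}
```

Keeping the first part stable while refreshing the second is what lets a consumer tell "same document, newer revision" apart from "different document".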

Written by stechio

December 9, 2011 at 6:06 pm

Posted in Development
