PDF Clown 0.1.0 — XRef streams, content rasterization and lots of good stuff

NOTE: PDF Clown 0.1.0 was released on March 4, 2011!

Hi there!

New features currently under development that will be available in the next (0.1.0) release:

Cross-reference streams and object streams
Version compatibility check
Content rasterization
Functions
Page data size (a.k.a. How to split a PDF document based on maximum file size)

It’s time to reveal that I decided to consolidate the project’s identity (and simplify your typing life) by changing its namespace prefix from it.stefanochizzolini.clown to the more succinct org.pdfclown: I know you were eager to strip that cluttering Italian identifier! 😉

Last week I was informed that USGS adopted PDF Clown for relayering their topographic maps and attaching metadata to them. Although a technical note states that its use will be only transitional, as they are converging toward a solution natively integrated with their main application suite (TerraGo), its service in such a production environment seems to be an eloquent demonstration of its reliability. 8)

1. Cross-reference streams and object streams

After lots of requests, I’m currently busy developing read/write support for cross-reference streams and object streams [PDF:1.6:3.4.6-7]; in particular, stream reading has been partially based upon the code that Joshua Tauberer wrote some months ago while he was experimenting with PDF Clown for PDF file analysis in his US Congress activity tracker, GovTrack.
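
To give an idea of what reading a cross-reference stream involves, here is a minimal, self-contained sketch. It is not PDF Clown's actual code: the class and method names are hypothetical, and it assumes the stream data has already been decompressed. It only shows how fixed-width entries are decoded according to the /W array, including the default values that apply when a field has zero width.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy decoder for the entries of an already-decompressed cross-reference stream (hypothetical names). */
final class XRefStreamSketch {
  /** One entry: type 0 = free, 1 = in use at a file offset, 2 = stored inside an object stream. */
  static final class Entry {
    final int type;
    final long field2; // Type 0: next free object number; 1: byte offset; 2: object stream number.
    final int field3;  // Type 0 and 1: generation number; 2: index within the object stream.

    Entry(int type, long field2, int field3) {
      this.type = type;
      this.field2 = field2;
      this.field3 = field3;
    }

    @Override
    public String toString() {
      return "type=" + type + " field2=" + field2 + " field3=" + field3;
    }
  }

  /** Reads a big-endian unsigned integer; a zero-width field yields its default value. */
  static long readField(byte[] data, int offset, int width, long defaultValue) {
    if (width == 0) return defaultValue;
    long value = 0;
    for (int i = 0; i < width; i++) {
      value = (value << 8) | (data[offset + i] & 0xFF);
    }
    return value;
  }

  /** Decodes fixed-width entries laid out according to the /W array of the stream dictionary. */
  static List<Entry> decode(byte[] data, int[] w) {
    int entrySize = w[0] + w[1] + w[2];
    if (entrySize == 0 || data.length % entrySize != 0) {
      throw new IllegalArgumentException("Data length does not match /W " + Arrays.toString(w));
    }
    List<Entry> entries = new ArrayList<>();
    for (int pos = 0; pos < data.length; pos += entrySize) {
      int type = (int) readField(data, pos, w[0], 1); // A missing type field means type 1.
      long field2 = readField(data, pos + w[0], w[1], 0);
      int field3 = (int) readField(data, pos + w[0] + w[1], w[2], 0);
      entries.add(new Entry(type, field2, field3));
    }
    return entries;
  }

  public static void main(String[] args) {
    byte[] data = {
      0, 0, 0, (byte) 0xFF, (byte) 0xFF, // Free entry (head of the free list).
      1, 0x01, 0x2C, 0, 0,               // In-use object at byte offset 300, generation 0.
      2, 0, 5, 0, 3                      // Compressed object: index 3 inside object stream 5.
    };
    for (Entry entry : decode(data, new int[] {1, 2, 2})) {
      System.out.println(entry);
    }
  }
}
```

A real reader would also have to honor the /Index and /Size entries of the stream dictionary and then resolve type 2 entries by opening the referenced object stream; both steps are left out here.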

2. Version compatibility check

Working on cross-reference streams induced me to start supporting version-compatibility checking via annotations. This feature lets users transparently verify that the PDF files they are creating or modifying conform to a target PDF version (as specified in the PDF file header) according to a configurable compatibility policy, defined through Document.Configuration.CompatibilityModeEnum. These are the alternative policies:

Passthrough: document’s conformance version is ignored; any feature is accepted without checking its compatibility.
Loose: document’s conformance version is automatically updated to support actually used features.
Strict: document’s conformance version is mandatory; any unsupported feature is forbidden and causes an exception to be thrown in case of attempted use.

Automatic compatibility checking is very handy as users can enforce generated PDF files’ conformance without manual intervention; for example, you don’t have to tweak your PDF file version to 1.5 if you plan to use the optional content functionality (OCG [PDF:1.6:4.10]), just sit back and see it be done! 🙂
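
The difference between the three policies can be sketched in a few lines. The following is a toy model only: the names (CompatibilitySketch, Doc, require) are hypothetical and are not PDF Clown's API, and the version required by a feature is passed explicitly rather than being read from an annotation.

```java
/** Toy model of the three compatibility policies (hypothetical names, not PDF Clown's API). */
final class CompatibilitySketch {
  enum Mode { PASSTHROUGH, LOOSE, STRICT }

  /** Stand-in for a document: it only tracks its declared version, in tenths (15 means PDF 1.5). */
  static final class Doc {
    private int version;
    private final Mode mode;

    Doc(int version, Mode mode) {
      this.version = version;
      this.mode = mode;
    }

    int getVersion() {
      return version;
    }

    /** To be called right before using a feature introduced in the given PDF version. */
    void require(String feature, int featureVersion) {
      if (mode == Mode.PASSTHROUGH || featureVersion <= version) {
        return; // Either nothing is checked, or the feature is already covered.
      }
      if (mode == Mode.LOOSE) {
        version = featureVersion; // Silently raise the declared version.
        return;
      }
      throw new IllegalStateException(feature + " requires PDF " + format(featureVersion)
        + " but the document declares PDF " + format(version));
    }
  }

  static String format(int version) {
    return (version / 10) + "." + (version % 10);
  }

  public static void main(String[] args) {
    Doc loose = new Doc(14, Mode.LOOSE);
    loose.require("Optional content", 15);
    System.out.println("Loose: version is now " + format(loose.getVersion()));

    Doc strict = new Doc(14, Mode.STRICT);
    try {
      strict.require("Optional content", 15);
    } catch (IllegalStateException e) {
      System.out.println("Strict: " + e.getMessage());
    }
  }
}
```

In this sketch, the optional content example from the text corresponds to the Loose case: the document starts at 1.4 and ends up at 1.5 without the caller touching the version.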

3. Content rasterization

I’m quite impressed by how naturally the existing model is integrating with PDF printing and image rasterization functionalities. Leveraging the existing model means that there’s a common infrastructure (see ContentScanner and ContentObject hierarchy) that serves disparate purposes (content creation, content analysis, content extraction, content rasterization, content editing, and so on), simplifying its understanding, use, maintenance and extension. I wanna stress that my goal is to come to an elegant viewer, NOT to a cumbersome retrofit component that’s added as an alien to fill the gap! 😉

Yes, I know these goodies had been outside my official plans for a long time, but during the last week of September, while crawling through the PDF Clown sources, I stumbled upon the above-mentioned ContentScanner and ContentObject hierarchy: I realized they were just ready for supporting content rendering, so I thought “What are we waiting for? Let’s do it!”… but don’t expect that 0.1 will deliver a full-fledged PDF viewer and printer: I’ll start by prototyping the most basic graphics primitives such as coordinate space transformations, path drawing, color selection and so on. Advanced operations such as glyph outline drawing will necessarily appear afterwards. Anyway, I’m confident that at the end of the development process it will be possible to print and display PDF pages (and even independent parts of them such as external forms) along with their thumbnails.
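
As an illustration of the most basic of those primitives, here is a small Java 2D sketch (hypothetical class name, unrelated to PDF Clown's internals). It assumes a page in default PDF user space, where the origin is at the bottom-left corner and the y axis points up, and maps it onto an image, where the origin is at the top-left corner and the y axis points down; it then fills a path in the lower-left quadrant of the page.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.geom.AffineTransform;
import java.awt.geom.Path2D;
import java.awt.image.BufferedImage;

/** Maps PDF-like user space onto an image and fills a simple path (hypothetical sketch). */
final class RasterSketch {
  static BufferedImage render(double pageWidth, double pageHeight, int deviceWidth, int deviceHeight) {
    BufferedImage image = new BufferedImage(deviceWidth, deviceHeight, BufferedImage.TYPE_INT_RGB);
    Graphics2D g = image.createGraphics();
    try {
      // White background, painted directly in device space.
      g.setColor(Color.WHITE);
      g.fillRect(0, 0, deviceWidth, deviceHeight);

      // User space to device space: scale to the image size, flip the y axis,
      // then shift the flipped origin down to the bottom edge of the image.
      AffineTransform toDevice = new AffineTransform();
      toDevice.translate(0, deviceHeight);
      toDevice.scale(deviceWidth / pageWidth, -deviceHeight / pageHeight);
      g.transform(toDevice);

      // A rectangle covering the lower-left quadrant of the page, in user space.
      Path2D path = new Path2D.Double();
      path.moveTo(0, 0);
      path.lineTo(pageWidth / 2, 0);
      path.lineTo(pageWidth / 2, pageHeight / 2);
      path.lineTo(0, pageHeight / 2);
      path.closePath();

      g.setColor(Color.RED);
      g.fill(path);
    } finally {
      g.dispose();
    }
    return image;
  }

  public static void main(String[] args) {
    BufferedImage image = render(200, 100, 200, 100);
    // The lower-left quadrant of the page ends up in the bottom-left of the image.
    System.out.println(Integer.toHexString(image.getRGB(50, 75)));
    System.out.println(Integer.toHexString(image.getRGB(50, 25)));
  }
}
```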

The figure below compares an example of PDF Clown’s current rasterization capabilities (on the left, via Java 2D graphics) with its equivalent generated by Adobe Reader (on the right). As you can see, path drawing is highly conformant with the reference implementation, while no text rendering has been implemented yet.

Creating this figure was absolutely trivial. Here is the code sample used (line 34 executes the actual rendering of the first page of the document):

package org.pdfclown.samples;

import java.awt.Dimension;
import java.awt.image.BufferedImage;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.pdfclown.documents.Document;
import org.pdfclown.files.File;
import org.pdfclown.tools.Renderer;

public class ContentRenderingSample
  extends Sample
{
  @Override
  public boolean run(
    )
  {
    String filePath = promptPdfFileChoice("Please select a PDF file");

    // 1. Open the PDF file!
    File file;
    try
    {file = new File(filePath);}
    catch(Exception e)
    {throw new RuntimeException(filePath + " file access error.",e);}

    // 2. Get the PDF document!
    Document document = file.getDocument();

    // 3. Rasterize the first page!
    Renderer renderer = new Renderer();
    BufferedImage image = renderer.render(document.getPages().get(0),new Dimension(1400,850));

    // 4. Save the rasterized image!
    try
    {ImageIO.write(image,"jpg",new java.io.File(getOutputPath() + java.io.File.separator + "ContentRenderingSample.jpg"));}
    catch(IOException e)
    {e.printStackTrace();}

    return true;
  }
}

As you can see in the following code chunk, the Renderer.render(…) method takes care of preparing the target graphics context [line 31] and delegates the rendering to the chosen content context [line 32] (in this case, a Page object):

package org.pdfclown.tools;

import java.awt.Dimension;
import java.awt.image.BufferedImage;

import org.pdfclown.documents.contents.IContentContext;

/**
  Tool for rendering {@link IContentContext content contexts}.

  @author Stefano Chizzolini (http://www.stefanochizzolini.it)
  @version 0.1.0
  @since 0.1.0
*/
public class Renderer
{
    . . .

    /**
      Renders the specified content context into an image context.

      @param contentContext Source content context.
      @param size Image size expressed in device-space units (that is typically pixels).
      @return Image representing the rendered contents.
    */
    public BufferedImage render(
      IContentContext contentContext,
      Dimension size
      )
    {
      BufferedImage image = new BufferedImage(size.width,size.height,BufferedImage.TYPE_INT_BGR);
      contentContext.render(image.createGraphics(),size);

      return image;
    }
}

The Page object then delegates the rendering of its own contents to ContentScanner [line 36], which sequentially scans every graphics operation, producing its corresponding raster representation:

package org.pdfclown.documents;

import java.awt.Graphics2D;
import java.awt.geom.Dimension2D;
import java.awt.print.Printable;

import org.pdfclown.PDF;
import org.pdfclown.VersionEnum;
import org.pdfclown.documents.contents.ContentScanner;
import org.pdfclown.documents.contents.IContentContext;
import org.pdfclown.objects.PdfDictionary;
import org.pdfclown.objects.PdfObjectWrapper;

/**
  [PDF:1.6:3.6.2] Document page.

  @author Stefano Chizzolini (http://www.stefanochizzolini.it)
  @since 0.0.0
  @version 0.1.0
*/
@PDF(VersionEnum.PDF10)
public class Page
  extends PdfObjectWrapper
  implements IContentContext,
    Printable
{
  . . .

  @Override
  public void render(
    Graphics2D context,
    Dimension2D size
    )
  {
    ContentScanner scanner = new ContentScanner(getContents());
    scanner.render(context,size);
  }

  . . .
}
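
The idea of scanning operations one by one and turning each into a drawing call can be shown with a toy interpreter. This is only a sketch under heavy assumptions: ToyContentScanner is a hypothetical class that has nothing to do with the real ContentScanner, it understands just six operators (m, l, re, h, rg, f) on whitespace-separated tokens, and it draws straight into device space without any coordinate transformation.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;
import java.awt.image.BufferedImage;
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy interpreter for a tiny subset of path-painting operators (hypothetical sketch). */
final class ToyContentScanner {
  static void render(String content, Graphics2D g) {
    Deque<Double> operands = new ArrayDeque<>(); // Operands precede their operator.
    Path2D path = new Path2D.Double();
    for (String token : content.trim().split("\\s+")) {
      if (token.isEmpty()) continue;
      switch (token) {
        case "m": { // Begin a new subpath.
          double y = operands.pop(), x = operands.pop();
          path.moveTo(x, y);
          break;
        }
        case "l": { // Append a straight line segment.
          double y = operands.pop(), x = operands.pop();
          path.lineTo(x, y);
          break;
        }
        case "re": { // Append a rectangle as a complete subpath.
          double h = operands.pop(), w = operands.pop(), y = operands.pop(), x = operands.pop();
          path.append(new Rectangle2D.Double(x, y, w, h), false);
          break;
        }
        case "h": // Close the current subpath.
          path.closePath();
          break;
        case "rg": { // Set the fill color (RGB components in the 0..1 range).
          double b = operands.pop(), gr = operands.pop(), r = operands.pop();
          g.setColor(new Color((float) r, (float) gr, (float) b));
          break;
        }
        case "f": // Fill the path, then discard it.
          g.fill(path);
          path.reset();
          break;
        default: // Anything else is treated as a numeric operand.
          operands.push(Double.parseDouble(token));
      }
    }
  }

  public static void main(String[] args) {
    BufferedImage image = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
    Graphics2D g = image.createGraphics();
    render("0 0 1 rg 10 10 30 30 re f", g);
    g.dispose();
    System.out.println(Integer.toHexString(image.getRGB(20, 20)));
  }
}
```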

4. Functions

Improving the color space definitions for content rasterization is forcing me to also manage functions [PDF:1.6:3.9] in all their flavors (Type 0 (Sampled), Type 2 (Exponential Interpolation), Type 3 (Stitching) and Type 4 (PostScript Calculator)).
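
Of these, Type 2 is the simplest: for an input x it computes each output as y_j = C0_j + x^N × (C1_j − C0_j). A minimal sketch follows, with a hypothetical class name and without the clipping of x to the function's Domain (and of the results to its Range) that a full implementation would apply.

```java
/** Minimal evaluation of a Type 2 (exponential interpolation) function (hypothetical sketch). */
final class ExponentialFunctionSketch {
  /**
    Evaluates y_j = c0_j + x^n * (c1_j - c0_j) for every output component j.
    Clipping to Domain and Range is intentionally left out.
  */
  static double[] evaluate(double x, double[] c0, double[] c1, double n) {
    if (c0.length != c1.length) {
      throw new IllegalArgumentException("C0 and C1 must have the same length");
    }
    double xn = Math.pow(x, n);
    double[] y = new double[c0.length];
    for (int j = 0; j < y.length; j++) {
      y[j] = c0[j] + xn * (c1[j] - c0[j]);
    }
    return y;
  }

  public static void main(String[] args) {
    // Linear blend (N = 1) halfway between black and an orange tone, as RGB components.
    double[] rgb = evaluate(0.5, new double[] {0, 0, 0}, new double[] {1, 0.5, 0}, 1);
    System.out.println(rgb[0] + " " + rgb[1] + " " + rgb[2]);
  }
}
```

With N = 1 this is a plain linear interpolation between C0 and C1, which is how such functions are typically used for color blends.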

5. Page data size (a.k.a. How to split a PDF document based on maximum file size)

org.pdfclown.tools.PageManager has been enhanced with an elegant algorithm that accurately calculates the data size of PDF pages, taking shared resources (like fonts, images and so on) into consideration. In practice, this means that you can evaluate the incremental size of each page in a document and split the file when the collected pages reach the maximum file size you intended for your target split PDF files, without creating any cumbersome temporary file!
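
The core of the idea, counting a shared object only the first time it is met, can be modeled without any PDF at all. In the following sketch everything is hypothetical: pages are lists of object ids, object sizes come from a plain map, and the class is not the real PageManager. Unlike a naive loop, it never emits an empty range when a single page alone exceeds the limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of size-based splitting with shared resources (hypothetical sketch). */
final class PageSizeSketch {
  /** Sums the sizes of the objects a page references, skipping those already counted. */
  static long getSize(List<Integer> pageObjects, Map<Integer, Long> objectSizes, Set<Integer> visited) {
    long size = 0;
    for (Integer id : pageObjects) {
      if (visited.add(id)) { // add() returns false for an object that was already counted.
        size += objectSizes.getOrDefault(id, 0L);
      }
    }
    return size;
  }

  /** Returns page ranges as {begin, end} pairs, end excluded. */
  static List<int[]> split(List<List<Integer>> pages, Map<Integer, Long> objectSizes, long maxSize) {
    List<int[]> ranges = new ArrayList<>();
    Set<Integer> visited = new HashSet<>();
    long incrementalSize = 0;
    int begin = 0;
    for (int index = 0; index < pages.size(); index++) {
      incrementalSize += getSize(pages.get(index), objectSizes, visited);
      if (incrementalSize > maxSize && index > begin) {
        ranges.add(new int[] {begin, index});
        // The current page opens the next range, so its shared objects count again.
        begin = index;
        visited = new HashSet<>();
        incrementalSize = getSize(pages.get(index), objectSizes, visited);
      }
    }
    if (begin < pages.size()) {
      ranges.add(new int[] {begin, pages.size()});
    }
    return ranges;
  }

  public static void main(String[] args) {
    Map<Integer, Long> sizes = new HashMap<>();
    sizes.put(1, 100L); // Shared font.
    sizes.put(2, 300L); // Image used by the first page.
    sizes.put(3, 300L); // Image used by the third page.
    sizes.put(10, 50L); // Content streams, one per page.
    sizes.put(11, 50L);
    sizes.put(12, 50L);
    List<List<Integer>> pages = Arrays.asList(
      Arrays.asList(10, 1, 2), Arrays.asList(11, 1), Arrays.asList(12, 1, 3));
    for (int[] range : split(pages, sizes, 600)) {
      System.out.println(Arrays.toString(range));
    }
  }
}
```

Note how the second page adds only its own content stream to the running total, because the font it uses was already counted with the first page.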

It’s really funny that proprietary products like PDFTron’s, which cost at least hundreds of bucks, suggest in their official knowledge base awkward trial-and-error strategies, iteratively creating horrible temporary files… evidently, only clowns can solve such a task better 😛 — Free Software rocks!

package org.pdfclown.samples;

import java.util.HashSet;
import java.util.Set;

import org.pdfclown.documents.Document;
import org.pdfclown.documents.Page;
import org.pdfclown.documents.Pages;
import org.pdfclown.files.File;
import org.pdfclown.objects.PdfReference;
import org.pdfclown.tools.PageManager;

public class SplitSample
  extends Sample
{
  private static final long MaxDataSize = 4 << 20; // 4 MBytes (you can obviously set it at your will).
  private PageManager manager;

  @Override
  public boolean run(
    )
  {
    // 1. Opening the PDF file...
    File file;
    {
      String filePath = promptPdfFileChoice("Please select a PDF file");
      try
      {file = new File(filePath);}
      catch(Exception e)
      {throw new RuntimeException(filePath + " file access error.",e);}
    }
    Document document = file.getDocument();
    Pages pages = document.getPages();

    // 2. Splitting the document...
    manager = new PageManager(document);
    int splitIndex = 0;
    long incrementalDataSize = 0;
    int beginPageIndex = 0;
    Set<PdfReference> visitedReferences = new HashSet<PdfReference>();
    for(Page page : pages)
    {
      long pageDifferentialDataSize = PageManager.getSize(page,visitedReferences); 
      incrementalDataSize += pageDifferentialDataSize;
      if(incrementalDataSize > MaxDataSize) // Data size limit reached.
      {
        int endPageIndex = page.getIndex();

        // Split the current document page range!
        splitDocument(++splitIndex,beginPageIndex,endPageIndex);

        beginPageIndex = endPageIndex;
        incrementalDataSize = PageManager.getSize(page,visitedReferences = new HashSet<PdfReference>());
      }
    }
    // Split the last document page range!
    splitDocument(++splitIndex,beginPageIndex,pages.size());

    return true;
  }

  private void splitDocument(
    int splitIndex,
    int beginPageIndex,
    int endPageIndex
    )
  {
    // 1. Split the document!
    Document splitDocument = manager.extract(beginPageIndex,endPageIndex);

    // 2. Serialize the split file!
    serialize(splitDocument.getFile(),this.getClass().getSimpleName() + "." + (splitIndex),false);
  }
}

Hi roki,

when you work at content level modifying an operation like ShowSimpleText you have to consider that content streams do NOT directly deal with “readable” text, as they are simply concerned with graphical entities called “glyphs”, referenced through an arbitrary encoding.

So, if you want to assign some text to a ShowSimpleText operation you have to convert your text into its byte representation as defined by the current font:


File file;
try
{file = new File(filePath);}
catch(Exception e)
{throw new RuntimeException(filePath + " file access error.",e);}
Document document = file.getDocument();
Page page = document.getPages().get(0); // Gets for example the first page.
ContentScanner scanner = new ContentScanner(page); // Creates a scanner to walk through the page contents.
... // walking through the contents (see the sample code available within the downloadable distribution).
ContentObject content = scanner.getCurrent();
if(content instanceof ShowSimpleText)
{
  ContentScanner.GraphicsState state = scanner.getState();
  Font font = state.getFont(); // Gets the current font used to show the text defined by ShowSimpleText.
  ((ShowSimpleText)content).setText(font.encode("dddd"));
}
... // serializing the modified file (see the sample code available within the downloadable distribution).
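
Why the conversion depends on the font can be seen with a toy encoding. The class below is purely hypothetical (it is not PDF Clown's Font): it assigns byte codes to characters in the order its glyphs are listed, the way a subset font might, so the same text yields different bytes under different fonts.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Toy font encoding: each glyph gets a code based on its position in the font (hypothetical sketch). */
final class ToyFontEncoding {
  private final Map<Character, Byte> codes = new HashMap<>();

  /** Assigns codes 1, 2, 3... to the characters in the order given (at most 255 glyphs). */
  ToyFontEncoding(String glyphOrder) {
    if (glyphOrder.length() > 255) {
      throw new IllegalArgumentException("Too many glyphs for a single-byte encoding");
    }
    for (int i = 0; i < glyphOrder.length(); i++) {
      codes.put(glyphOrder.charAt(i), (byte) (i + 1));
    }
  }

  /** Converts readable text into the byte codes that a content stream would carry. */
  byte[] encode(String text) {
    byte[] result = new byte[text.length()];
    for (int i = 0; i < text.length(); i++) {
      Byte code = codes.get(text.charAt(i));
      if (code == null) {
        throw new IllegalArgumentException("No glyph for '" + text.charAt(i) + "' in this font");
      }
      result[i] = code;
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(new ToyFontEncoding("abcd").encode("dddd")));
    System.out.println(Arrays.toString(new ToyFontEncoding("dxyz").encode("dddd")));
  }
}
```

The failure case matters too: if the current font has no glyph for a character, the text simply cannot be shown with that font.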

12 thoughts on “PDF Clown 0.1.0 — XRef streams, content rasterization and lots of good stuff”

jason demont says:

October 6, 2011 at 9:39 pm

I have been testing the raster functionality of PDF Clown on several (literally hundreds of) different PDF documents, including ones created by Acrobat 7, 8, 9 and X and some created with the iTextSharp library.
Some of the files render very well and some give errors. I don’t know if the best way to contribute to your work would be to try and figure out why myself and give you the results or maybe the best way would be to send you a rar with two folders, one good, one bad, with a xrf listing what error I received on each.
I have very little knowledge of the PDF object and I have a feeling trying to parse code line by line wouldn’t help as much as sending the docs to you if you are interested. Either way, thanks for the work and let me know if and how I can help make it better.

1. stechio says:
  
  October 8, 2011 at 4:16 am
  
  PDF Clown currently (version 0.1.0) supports only part of the raster model (e.g., it doesn’t show text).
  If you have some errors to report, please submit them to the Bugs Tracker.
  
Jonathan Hesketh says:

June 24, 2011 at 3:07 pm

Hi stechio,

I have been hunting around for a way to print a PDF for what seems like an age now. I downloaded your sample and hooked it all up; when I start the printing sample it gets to the point where the printer says it is spooling but then fails to render (the PDF I’m trying to print is just a 1-page document containing some text).
I know the printing functionality is in its early stage but any thoughts on the matter would be greatly appreciated.

Thanks for this fantastic library and keep up the good work.

1. stechio says:
  
  June 25, 2011 at 2:49 pm
  
  Hi Jonathan,
  while waiting for the full-fledged implementation, note that another user (Glen) managed to render text through a simple hack that’s described in the 0.1.0 release comments.
  Stefano

Hi, PDF Clown can’t edit the text, right?
e.g.
ShowSimpleText.setText("dddd")

Pingback: PDF Clown 0.1.0 has been released! « PDF Clown's blog
1. Iain says:
  
  March 29, 2011 at 12:19 pm
  
  Good idea to write a content rasterizer! After hitting alignment issues using PDF Clown I also wrote a primitive one. It really helped me to understand what the layout was doing, and where it was going wrong.
  
  Keep up the good work.
  
  1. stechio says:
    
    March 29, 2011 at 5:01 pm
    
    Please consider, anytime you discover and fix an issue affecting PDF Clown, to report your findings and solutions — that’s a fair way to give back at least some of the benefit you received using it.
    Thank you!
Tuoams says:

February 5, 2011 at 7:31 pm

Thank you for the excellent work! Really waiting for the 0.1 release of C#.

btw. I couldn’t get those patches to work on the 0.0.8 release. The patch util gives errors when trying to upgrade with patch #1 in Windows.
I hope that you’ll release all the future fixes as full source code straight to sourceforge — it would be easier for c# developers 🙂

1. stechio says:
  
  February 5, 2011 at 9:59 pm
  
  I assure you that the published patches for PDF Clown have all been successfully tested for merge to the distributed code base.
  
  Please remember to choose the unified diff format when applying those patches and check that your patch utility is able to handle the Unix end-of-line character sequence (you may use the patch command through Cygwin).
  
  Starting with the next release, I’ll activate the SVN repo.
  
Oscar says:

September 24, 2010 at 3:47 am

Excellent project!