PDF Clown 0.2.0 — Enhanced content handling

NOTE — As version 0.2.0 is currently under development, new features described below will appear in the trunk (HEAD revision) of PDF Clown’s SVN repository before its official release.

NOTE — If you are interested in the current developments of PDF Clown, you may follow PDF Clown on Twitter for the latest news and comments!

NOTE — The 0.1.3 release has been frozen due to the ContentScanner refactoring (see section “Content stream manipulation”, below): once it’s completed, COS Inspector’s development will resume.

PDF Clown 0.2.0 development iteration revolves around these topics:

  1. Content stream manipulation: ContentScanner class has been refactored to expand its capabilities;
  2. Content composition engine: ContentComposer class has been introduced to support high-level typographic functionalities.

1. Content stream manipulation

Ever since the ContentScanner class was first conceived, I have been really delighted by its underlying concept, as it proved to be a versatile processor for handling content stream object trees along with their graphics state: you could use it directly to read existing content streams, modify them and also create new ones in a convenient object-oriented fashion, or you could plug it into specialized tools (e.g. PrimitiveComposer, TextExtractor, Renderer, etc.) for more advanced applications.

But up to the 0.1.x version series it suffered from a significant drawback: it lacked separation of concerns from its object model, that is, the algorithmic responsibility for carrying out its tasks was delegated to the respective content stream operations. This may work well when there’s just a single task (“read/write the content stream”), but when further tasks are required (e.g. rendering the content stream into a graphics context) it rapidly becomes unmanageable.

Therefore I proceeded with a massive refactoring which was informed by two main concurrent requirements: algorithmic separation between process and structure (accomplished through the classic Visitor pattern) and preservation of the distinctive cursor-based behavior of ContentScanner (solved through dedicated impedance-matching logic).
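
To make the first requirement more concrete, here is a minimal sketch of the Visitor pattern applied to a content stream object tree. All the names (ContentNode, NodeVisitor, Operation, CompositeObject, OperationCounter) are hypothetical and do not reflect PDF Clown's actual API; the sketch only shows how a task can live in a separate processor instead of inside the content objects themselves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node types: NOT PDF Clown's actual object model.
interface ContentNode {
  <R> R accept(NodeVisitor<R> visitor);
}

// One method per node type: each task (scanning, rendering, ...) is a visitor.
interface NodeVisitor<R> {
  R visit(Operation operation);
  R visit(CompositeObject composite);
}

// Leaf: a single content stream operation, e.g. "Tj" or "re".
final class Operation implements ContentNode {
  final String operator;
  Operation(String operator) { this.operator = operator; }
  @Override public <R> R accept(NodeVisitor<R> visitor) { return visitor.visit(this); }
}

// Container: e.g. a text object (BT ... ET) holding nested nodes.
final class CompositeObject implements ContentNode {
  final List<ContentNode> children = new ArrayList<>();
  CompositeObject add(ContentNode child) { children.add(child); return this; }
  @Override public <R> R accept(NodeVisitor<R> visitor) { return visitor.visit(this); }
}

// A sample task kept outside the object model: counting the operations in a tree.
final class OperationCounter implements NodeVisitor<Integer> {
  @Override public Integer visit(Operation operation) { return 1; }
  @Override public Integer visit(CompositeObject composite) {
    int count = 0;
    for (ContentNode child : composite.children) {
      count += child.accept(this);
    }
    return count;
  }
}
```

Adding another task (say, rendering) would then mean writing another NodeVisitor implementation, leaving Operation and CompositeObject untouched.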

All the non-core functionalities which were bloating the original ContentScanner (like rendering and content wrappers) have been extracted into specialized processors (respectively: ContentRenderer and ContentModeller), resulting in the following classes:

  • ContentVisitor: abstract content stream processor supporting all the common graphics state transformations;
  • ContentScanner: (read/write) multi-purpose cursor-based processor;
  • ContentModeller: (read-only) modelling processor (generates declarative forms (GraphicsElement hierarchy) of the corresponding content objects);
  • ContentRenderer: (read-only) rendering processor (generates raster representations of the content stream).
ContentScanner refactored

1.1. ContentScanner

ContentScanner’s new implementation focuses exclusively on its core purpose, that is to enable users to manipulate content streams through their low-level, procedural, stacked model (operations and composite objects along with their graphics state).

1.2. ContentModeller

ContentModeller works as a parser which maps the low-level content stream model to its high-level, declarative, flat representation through a dedicated model rooted in the GraphicsElement abstract class (which corresponds to the GraphicsObjectWrapper hierarchy of ContentScanner’s old implementation). This simplified-yet-equivalent representation can be modified and saved back into the content stream.

1.3. ContentRenderer

ContentRenderer works on content rasterization (that is page imaging and printing). Its reimplementation spurred enhancements in text rendering, image object rasterization and color space management (more on that soon — stay tuned!).

2. Content composition engine

PDF Clown 0.2.0 introduces the much-requested keystone of its content composition stack: ContentComposer class. This engine features a layout model inspired by a distilled, meaningful subset of the HTML+CSS ruleset.

Its high-level typographic model (columns, sections, paragraphs, tables and so on) is laid out leveraging the existing low-level functionalities provided by BlockComposer (paragraph typesetting) and PrimitiveComposer (native PDF graphics instructions), the latter of which in turn sits upon the above-mentioned ContentScanner for feeding into the content stream (IContentContext).

PDF Clown’s content composition stack

This subject is massively broad, so here I’m going to give you just a few highlights of its features (development is currently underway; I’ll add more details as it advances).

2.1. Multi-column layout

PDF Clown’s layout engine supports the multi-column layout model described by the CSS3 specification, which extends the block layout mode to allow the easy definition of multiple columns of text (and any other kind of content, like tables, images, lists and so on). Columns can be defined by count (number of columns desired), width (minimum column width desired) or both: in any case, the official CSS3 pseudo-algorithm is applied.
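
As a rough illustration of how count and width interact, the following sketch follows the column resolution rules of the CSS multi-column pseudo-algorithm as commonly summarized: count alone is used as is, width alone yields as many columns as fit, and when both are given the count acts as an upper bound. ColumnResolver and its members are hypothetical names, not PDF Clown's API, and the sketch ignores edge cases the specification covers (such as unconstrained available width).

```java
// Sketch of column count/width resolution; names are illustrative only.
final class ColumnResolver {
  /** Resolved layout: number of columns and the width of each one. */
  static final class Result {
    final int count;
    final double width;
    Result(int count, double width) { this.count = count; this.width = width; }
  }

  /**
   * @param availableWidth width of the multi-column block
   * @param gap            space between adjacent columns
   * @param count          desired column count, or null for "auto"
   * @param width          desired minimum column width, or null for "auto"
   */
  static Result resolve(double availableWidth, double gap, Integer count, Double width) {
    if (count == null && width == null)
      throw new IllegalArgumentException("Either count or width must be set");
    if (count != null && count < 1)
      throw new IllegalArgumentException("count must be at least 1");
    if (width != null && width <= 0)
      throw new IllegalArgumentException("width must be positive");

    int n;
    if (width == null) {
      // Count only: use it as is.
      n = count;
    } else {
      // How many columns of the desired width (plus gaps) fit in the block?
      int fitting = Math.max(1, (int) Math.floor((availableWidth + gap) / (width + gap)));
      // Width only: use all that fit. Both: count acts as an upper bound.
      n = (count == null) ? fitting : Math.min(count, fitting);
    }
    // Columns share whatever remains once the gaps are taken out.
    double w = Math.max(0, (availableWidth - (n - 1) * gap) / n);
    return new Result(n, w);
  }
}
```

For instance, with a 500-point-wide block and a 14-point gap, asking for 2 columns gives two 243-point columns, while asking for a 100-point minimum width gives four 114.5-point columns.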

PDF Clown, according to the CSS3 specification, automatically balances the column heights, that is, it sets the maximum column height so that the heights of the content in each column are approximately equal. This is possible because of a powerful simulation algorithm which ensures an accurate arrangement. Should the content exceed the available height on the paged medium, it would automatically flow into the next page.
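
One simple way to picture balancing by simulation is sketched below. It is an assumption for illustration, not PDF Clown's actual algorithm: content is reduced to a list of unbreakable line heights, a greedy fill simulates how many columns a candidate height would require, and a binary search narrows down the smallest height that fits the desired column count. ColumnBalancer and its methods are hypothetical names.

```java
// Sketch of column balancing through repeated layout simulation.
final class ColumnBalancer {
  /** Simulates filling columns of the given height; returns how many are needed. */
  static int columnsNeeded(double[] lineHeights, double columnHeight) {
    int columns = 1;
    double used = 0;
    for (double h : lineHeights) {
      // The line doesn't fit in the current column: start a new one.
      if (used > 0 && used + h > columnHeight) {
        columns++;
        used = 0;
      }
      used += h;
    }
    return columns;
  }

  /**
   * Finds, within the given tolerance, the smallest column height for which
   * the content fits in columnCount columns.
   */
  static double balance(double[] lineHeights, int columnCount, double tolerance) {
    double low = 0;  // no column can be shorter than the tallest line
    double high = 0; // a single column holding everything always fits
    for (double h : lineHeights) {
      low = Math.max(low, h);
      high += h;
    }
    // Invariant: 'high' always fits; shrink the interval around the threshold.
    while (high - low > tolerance) {
      double mid = (low + high) / 2;
      if (columnsNeeded(lineHeights, mid) <= columnCount) {
        high = mid;
      } else {
        low = mid;
      }
    }
    return high;
  }
}
```

With ten 12-point lines and two columns, the search converges on a column height of about 60 points, that is, five lines per column.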

If you are interested in further info about CSS multi-column layouts, I recommend Mozilla’s great introduction to the CSS Multi-column Layout Module.

Here is a practical example of its use:

Multi-column layout sample generated by PDF Clown

And this is the corresponding code:

import org.pdfclown.documents.Document;
import org.pdfclown.documents.contents.composition.*;
import org.pdfclown.documents.contents.fonts.StandardType1Font;
import org.pdfclown.util.math.geom.Dimension;

. . .

DocumentComposer composer = new DocumentComposer(document);

/*
  NOTE: Composer's style is at the root of the style model, that is, its definitions
  are inherited by the descending elements, analogously to the style of BODY element
  in HTML DOM.
*/
composer.getStyle()
  .setTextAlign(XAlignmentEnum.Justify)
  .getFont().setSize(12d);

/*
  NOTE: Element type styles are analogous to CSS styles defined through element type 
  selectors.
*/
composer.getStyle(Paragraph.class)
  .setTextIndent(new Length(10));
composer.getStyle(Heading.class)
  .setMargin(new QuadLength(0, 0, 10, 0));

/*
  NOTE: Styles can be defined analogously to CSS class definitions and can be derived
  analogously to Less mixins (http://lesscss.org/).
*/
Style strongStyle = new Style("strong")
  .setFont(new Font(new StandardType1Font(document, StandardType1Font.FamilyEnum.Times, true, false), null));
Style emStyle = new Style("em")
  .setFont(new Font(new StandardType1Font(document, StandardType1Font.FamilyEnum.Times, false, true), null));
Style noteStyle = new Style("note")
  .setBorder(new Border(
    null,
    new QuadBorderStyle(BorderStyleEnum.Solid, BorderStyleEnum.None, BorderStyleEnum.None, BorderStyleEnum.None),
    new QuadLength(.1, 0, 0, 0),
    null))
  .setFont(new Font(null, 6d))
  .setMargin(new QuadLength(30, 0, 0, 0))
  .setPadding(new QuadLength(5, 0, 0, 0))
  .setTextAlign(XAlignmentEnum.Left)
  .setTextIndent(new Length(0));
Style superStyle = new Style("super")
  .setFont(new Font(null, 6.5d))
  .setVerticalAlign(LineAlignmentEnum.Super);
  
Section section = new Section("Hello World, this is PDF Clown!");

Image clownImage = new Image(document, "Clown.jpg");
clownImage.getStyle()
  .setFloat(FloatEnum.Left)
  .setSize(new Dimension(100,0))
  .setMargin(new QuadLength(new Length(5)));

/*
  NOTE: Group is a typographic element analogous to DIV element in HTML DOM.
*/
Group group = new Group(
  clownImage,
  new Paragraph( 
    new Text("PDF Clown's layout engine supports the "),
    new Text(strongStyle, "multi-column layout model"),
    new Text(" described by the CSS3 specification"),
    new Text(superStyle, "[1]"),
    new Text(" which extends the block layout mode to allow the easy definition of multiple columns "
      + "of text (and any other kind of content, like tables, images, lists and so on).")
    ),
  new Paragraph(
    new Text("PDF Clown, according to the CSS3 specification, "),
    new Text(emStyle, "automatically balances the column heights"),
    new Text(", i.e., it sets the maximum column height so that the heights of the content in each column "
      + "are approximately equal. This is possible because of a powerful simulation algorithm which ensures "
      + "an accurate arrangement. Should the content exceed the available height on the paged medium, it "
      + "would automatically flow into the next page.")
    ),
  new Paragraph(
    new Text("Columns can be defined by count (number of columns desired), width (minimum column width desired)"
      + " or both: in any case, the official CSS3 pseudo-algorithm is applied"),
    new Text(superStyle, "[2]"),
    new Text(". If you are interested in further info about CSS multi-column layouts, I recommend you to see "
      + "Mozilla's documentation for a great introduction to CSS Multi-column Layout Module"),
    new Text(superStyle, "[3]"),
    new Text(".")
    ),
  new Paragraph(noteStyle,
    new Text("1. http://www.w3.org/TR/css3-multicol/\n"
      + "2. http://www.w3.org/TR/css3-multicol/#pseudo-algorithm\n"
      + "3. https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Using_multi-column_layouts")
    )
  );
/*
  NOTE: This is the declarative CSS3-equivalent style which prescribes the layout engine to treat
  this group as a multi-column block (in this case: 2 columns with a 14-point gap between).
*/
group.getStyle().setColumn(new Column(2, new Length(14)));

section.add(group);
    
composer.show(section);
composer.flush();

NOTE — If you carefully read the source code above, you surely wondered whether DocumentComposer was a misspelling of ContentComposer. Not at all: DocumentComposer is a subclass of ContentComposer that manages multi-paged media (conversely, ContentComposer works with single canvases implementing the IContentContext interface, e.g., Page and FormXObject objects).

Honoring the KISS principle, all the magic here is done by a minimal declaration (the setColumn call near the end of the code above) which, analogously to the CSS fragment {column-count:2; column-gap:14pt;}, instructs PDF Clown’s layout engine to render the content group as a multi-column block:

group.getStyle().setColumn(new Column(2, new Length(14)));

Comparing this neat solution to a renowned library like iText, some awkward shortcomings emerge in the way iText deals with multi-column layout: the com.itextpdf.text.pdf.ColumnText class works as a dedicated processor outside the common declarative pattern (i.e., you cannot directly feed a column-aware element into the document as you do for tables, paragraphs and so on). That’s a really bad thing: a well-designed layout engine should hide those implementation details and carry out its duties transparently: you just feed the contents and it takes care of doing the right thing according to their properties and inherent behaviors. Such a crippled layout model forces users into ridiculous bends and twists just to get contents in place!… Let’s examine a few of them:

  • awful treatment of column intrusions: iText requires you to explicitly define the shape of your columns (sic!), distinguishing between “simple” (rectangularly-bound) and “irregular” (arbitrarily-shaped) columns, the latter forcing you to tediously specify each vertex... plain horror!
  • redundant distinction between composite and text modes: iText curiously discriminates between so-called “composite” (complex entities like tables, lists…) and “text” (plain text chunks) modes. The net result of this distinction is that, unfortunately, text mode and irregular columns are intertwined, meaning that if you are adding irregular columns you cannot use composite mode for those columns… absurdity in action!

Why is PDF Clown’s solution way better than iText’s?

  • adaptive column intrusion detection: PDF Clown’s layout engine keeps track of absolutely-positioned elements and its block composer takes care of automatically flowing content around those already-occupied areas. All you have to do is add content the way shown in the code example above, no convoluted crap here!
  • smooth, consistent separation between content and layout models: in PDF Clown, layout processing is ContentComposer’s business, while content definition is the user’s. Multi-column layout is just another style property of your contents, not a strange beast to wrestle with!
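
The idea of flowing content around already-occupied areas can be pictured with a small sketch. This is not PDF Clown's implementation: LineFitter, Box and freeSpans are hypothetical names, and the approach (subtracting every vertically overlapping box from the line's horizontal extent) is just one simple way to do it.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: finding where a text line can go when parts of a column are occupied.
final class LineFitter {
  /** Axis-aligned occupied area; (x, y) is its top-left corner. */
  static final class Box {
    final double x, y, width, height;
    Box(double x, double y, double width, double height) {
      this.x = x; this.y = y; this.width = width; this.height = height;
    }
  }

  /**
   * Returns the free horizontal spans ({start, end} pairs) available to a line
   * occupying the vertical band lineY..lineY+lineHeight within the column.
   */
  static List<double[]> freeSpans(double columnX, double columnWidth,
      double lineY, double lineHeight, List<Box> occupied) {
    List<double[]> spans = new ArrayList<>();
    spans.add(new double[] {columnX, columnX + columnWidth});
    for (Box box : occupied) {
      // Boxes entirely above or below the line band don't intrude.
      if (box.y >= lineY + lineHeight || box.y + box.height <= lineY) continue;

      double boxStart = box.x, boxEnd = box.x + box.width;
      List<double[]> next = new ArrayList<>();
      for (double[] span : spans) {
        if (boxEnd <= span[0] || boxStart >= span[1]) {
          next.add(span); // no horizontal overlap: the span survives intact
          continue;
        }
        // Keep whatever remains to the left and to the right of the box.
        if (boxStart > span[0]) next.add(new double[] {span[0], boxStart});
        if (boxEnd < span[1]) next.add(new double[] {boxEnd, span[1]});
      }
      spans = next;
    }
    return spans;
  }
}
```

With a 200-point-wide column and a 100x80 box in its top-left corner (like the left-floated image in the sample), a line beside the box gets only the span from 100 to 200, while a line below it gets the full column width.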

As I said, multi-column layout is just a little treat in a full-fledged layout engine… PDF Clown is maturing: in the coming weeks new technical details, code snippets and announcements will appear here. Stay tuned to its Twitter stream!

PDF Clown 0.1.3 — COS Inspector

NOTE — As the ContentScanner class is being refactored, the development of version 0.1.3 is temporarily frozen.

1. COS Inspector

Since its earliest versions, PDF Clown has shipped with a simple Swing-based proof of concept for viewing PDF file structures. Now that little fledgling is going to become a comprehensive tool for the visual editing of the structure of PDF files: PDF Clown COS Inspector. It was initially planned to be part of version 0.1.2 as a dedicated project within the PDF Clown distribution, but it wasn’t ready yet when the release deadline approached.

This tool conforms to the PDF model as defined by PDF Clown (see the diagram above), which adheres to the official PDF Reference 1.7/ISO 32000-1. This implies that a PDF file is represented through several concurrent views which work at different abstraction levels: Document view (document layer), File view (file/object layer, hierarchical) and XRef view (file/object layer, flat).


PDF Clown 0.1.2 has been released!

This release enhances several base structures, providing fully automated object change tracking and object cloning (allowing you, for example, to copy page annotations and AcroForm fields). It adds support for video embedding, article threads, page labels and several other functionalities.

This release may be downloaded from:
https://sourceforge.net/projects/clown/files/PDFClown-devel/0.1.2%20Beta/

PDF Clown 0.1.2 — Multimedia and lots of good stuff

LATEST NEWS — On February 10, 2013 PDF Clown 0.1.2 was released!

This release cycle revolves around these topics:

  1. Multimedia
  2. Text line alignment
  3. File references (file specifications, file identifiers, PDF stream object externalization)
  4. Advanced cloning
  5. Article threads

1. Multimedia

For a long time I gave low priority to multimedia features (chapter 9 of PDF Reference 1.7), but recently I received some requests for them on the project’s forum… so yes, video embedding through Screen annotations is now ready!


PDF Clown 0.1.1 has been released!

NOTE — PDF Clown 0.1.1 has been superseded by PDF Clown 0.1.2

This release adds support for optional/layered contents, text highlighting, metadata streams (XMP) and Type1/CFF font files, along with enhancements to the primitive object model and to AcroForm field filling. Lots of minor improvements have been applied too.

Last but not least: the ICSharpCode.SharpZipLib.dll dependency has been removed from the .NET implementation.

This release may be downloaded from:
https://sourceforge.net/projects/clown/files/PDFClown-devel/0.1.1%20Beta/

enjoy!

PDF Clown 0.1.1 — Text highlighting and lots of good stuff

LATEST NEWS — On November 14, 2011 PDF Clown 0.1.1 was released!

The next release is going to introduce exciting new features (text highlighting, optional/layered contents, Type1/CFF font support, etc.) along with improvements and consolidations of existing ones (enhanced text extraction, enhanced content rendering, enhanced AcroForm creation and filling, etc.). This post will be kept updated according to development progress, so please stay tuned! ;-)

These are some of the things I have been working on so far:

  1. Primitive object model enhancements
  2. Text highlighting
  3. Metadata streams (XMP)
  4. Optional/layered contents
  5. AcroForm fields filling


PDF Clown 0.1.0 has been released!

LATEST NEWS — PDF Clown 0.1.0 has been superseded by PDF Clown 0.1.1

This release introduces support for cross-reference-stream-based PDF files (as defined since the PDF 1.5 spec) along with page rendering and printing: a specialized tool provides a convenient way to convert PDF pages into images (aka rasterization). Lots of minor improvements have been applied too.

Last but not least: the project’s base namespace has changed to org.pdfclown

This release may be downloaded from:
https://sourceforge.net/projects/clown/files/PDFClown-devel/0.1.0%20Alpha/

enjoy!

PDF Clown 0.1.0 — XRef streams, content rasterization and lots of good stuff

LATEST NEWS — On March 4, 2011 PDF Clown 0.1.0 was released!

Hi there!

New features currently under development that will be available in the next (0.1.0) release:

  1. Cross-reference streams and object streams
  2. Version compatibility check
  3. Content rasterization
  4. Functions
  5. Page data size (a.k.a. How to split a PDF document based on maximum file size)

It’s time to reveal that I decided to consolidate the project’s identity (and simplify your typing life) by changing its namespace prefix (it.stefanochizzolini.clown) in favor of the more succinct org.pdfclown: I know you were eager to strip that cluttering Italian identifier! ;-)

Last week I was informed that USGS adopted PDF Clown for relayering their topographic maps and attaching metadata to them. Although on a technical note it’s stated that its use will be only transitory, as they are converging toward a solution natively integrated with their main application suite (TerraGo), nonetheless its service in such a production environment seems to be an eloquent demonstration of its reliability. 8)


PDF Clown 0.0.8: Q&A

LATEST NEWS — PDF Clown 0.0.8 functionalities are part of the latest release (PDF Clown 0.1.0). As the 0.0 version series is being decommissioned, you’re warmly invited to adopt the current 0.1 version series. Thank you!

This post collects all the relevant information about issues and questions regarding PDF Clown 0.0.8.

If you have any questions on topics not covered here, please post them to the Help forum.

1. ‘GoToExternalDestination’ class missing

See Topic 3836075 in the Help forum.

2. ‘xref’ keyword not found

See Topic 3434621 in the Help forum.

3. Unknown type: Comment

See Topic 3863926 in the Help forum.

4. Text line height

See Topic 3928380 in the Help forum.