Waiting for PDF Clown 0.1.2 release
1. PDF Clown DOM Inspector
Since its earliest versions, PDF Clown has shipped with a simple Swing-based proof of concept for viewing PDF file structures. Now that little fledgling is going to become a comprehensive tool for the visual editing of the structure of PDF files: PDF Clown DOM Inspector. It will be part of the next 0.1.2 version as a dedicated project within the PDF Clown distribution.

This tool conforms to the PDF model as defined by PDF Clown (see the diagram above), which adheres to the official PDF Reference 1.7/ISO 32000-1. This implies that a PDF file is represented through several concurrent views which work at different abstraction levels: Document view (document layer), File view (file/object layer, hierarchical) and XRef view (file/object layer, flat).
1.1. Document view
Document view (see the left pane in the above screenshot) shows the high-level structure of a PDF file. When you select a node, its data is shown in the right pane through several views: in this case, selecting a page node shows its content stream structure (Contents view, see below) and its rendering (Render view [¹], see above). Note that the page model represented by both Contents view and Render view corresponds to the content (sub)layer described in the diagram above.
Here is just one of the possible features: when you hover the mouse pointer over a show-text-operation node, a tooltip pops up revealing the actual text encoded inside it (in this example, inspecting a Russian-language document):
There’s such potential for custom features that I’m considering making it pluggable, so that users can extend it with additional modules at will.
1.2. File view
File view shows the low-level representation of the same entities you found in the above-mentioned Document view, expressed as primitive objects like dictionaries (PdfDictionary), arrays (PdfArray), streams (PdfStream) and so on.
1.3. XRef view
XRef view lists the entries of the cross-reference index (either table or stream, but that’s a technical detail you can happily ignore as it’s transparently handled by the library).
It’s really interesting to note that all the views (Document, File, XRef) are always kept synchronized: when you select a node in one of these views, its corresponding entities in each of the others are automatically selected, allowing you to switch seamlessly from one view to another.
[¹] Rendering is still partial as it’s under development (pre-alpha stage).
PDF Clown 0.1.1 has been released!
This release adds support for optional/layered contents, text highlighting, metadata streams (XMP) and Type1/CFF font files, along with enhancements to the primitive object model and to AcroForm field filling. Lots of minor improvements have been applied too.
Last but not least: the ICSharpCode.SharpZipLib.dll dependency has been removed from the .NET implementation.
This release may be downloaded from:
https://sourceforge.net/projects/clown/files/PDFClown-devel/0.1.1%20Beta/
enjoy!
Waiting for PDF Clown 0.1.1 release
[NOTE: this post was updated on November 14, 2011]
Latest news: on November 14, 2011 PDF Clown 0.1.1 has been released!
Next release is going to introduce new exciting features (text highlighting, optional/layered contents, Type1/CFF font support, etc.) along with improvements and consolidations of existing ones (enhanced text extraction, enhanced content rendering, enhanced acroform creation and filling, etc.). This post will be kept updated according to development progress, so please stay tuned!
These are some of the things I have been working on till now:
- primitive object model enhancements
- text highlighting
- metadata streams (XMP)
- optional/layered contents
- AcroForm fields filling
1. Primitive object model enhancements
The PDF primitive object model (see org.pdfclown.objects namespace) has undergone a substantial revision in order to simplify its use (transparent update), extend its functionality (bidirectional traversal), enforce its consistency (simple object immutability) and consolidate its code base (parser classes refactoring).
Bidirectional traversal has been accomplished by the introduction of explicit references to ascendants: composite objects (PdfDictionary, PdfArray, PdfStream) are now aware of their parent container, so walking through the ascending path to the root PdfIndirectObject (and File) is absolutely trivial! This functionality has loads of engaging potential applications, such as fine-grained object cloning based on structure context (as in the case of AcroForm annotations residing on a given page).
Ascendant-aware objects are intelligent enough to automatically detect and notify changes to their parent container, making incremental updates transparent to the user.
Simple objects have been made immutable to avoid risks of unintended changes and promote their efficient reuse.
As expected (you may have noticed some TODO task comments about this within the project’s code base), object parsing of PostScript-related formats (PDF file, PDF content stream and CMaps) has been organized under the same class hierarchy to improve its consistency and maintainability.
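To make the parent-awareness idea more concrete, here is a minimal, self-contained sketch. The classes (Node, ArrayNode, ParentAwareDemo) and their methods are hypothetical stand-ins invented for illustration; they are not the PDF Clown classes, and the real implementation may well differ. The sketch only shows how an ascendant reference allows both upward traversal and automatic change notification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for illustration; NOT the PDF Clown classes.
abstract class Node {
  private Node parent;     // ascendant reference enabling upward traversal
  private boolean updated; // dirty flag an incremental save could rely on

  Node getParent() { return parent; }
  void setParent(Node parent) { this.parent = parent; }
  boolean isUpdated() { return updated; }
  void markSaved() { updated = false; }

  // Marks this node as changed and notifies every ascendant up to the root.
  void notifyUpdate() {
    for (Node node = this; node != null; node = node.parent) {
      node.updated = true;
    }
  }

  // Walks the ascending path to the root container.
  Node getRoot() {
    Node node = this;
    while (node.parent != null) {
      node = node.parent;
    }
    return node;
  }
}

// Composite object: tells its children who their parent is.
class ArrayNode extends Node {
  private final List<Object> items = new ArrayList<>();

  void add(Object item) {
    if (item instanceof Node) {
      ((Node) item).setParent(this);
    }
    items.add(item);
    notifyUpdate(); // the change bubbles up without user intervention
  }

  int size() { return items.size(); }
}

class ParentAwareDemo {
  public static void main(String[] args) {
    ArrayNode root = new ArrayNode();
    ArrayNode child = new ArrayNode();
    root.add(child);
    root.markSaved();           // pretend the file has just been saved
    child.add("simple value");  // an immutable simple value (here a String)
    System.out.println(child.getRoot() == root); // prints "true"
    System.out.println(root.isUpdated());        // prints "true"
  }
}
```

In this sketch the simple value is a plain String, which is immutable by nature: changing it means replacing it in its container, and that replacement is exactly what triggers the notification.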
2. Text highlighting
Text highlighting was a much-requested feature. It took me less than one hour of enjoyable coding to write a prototype which could populate a PDF file with highlight annotations matching an arbitrary text pattern, as you can see in the following figure representing a page of Alice in Wonderland resulting from the search of “rabbit” occurrences:
This text highlighting sample leverages both the text extraction (step 2.1) and the annotation (step 2.3, where the TextMarkup object is created) functionalities of PDF Clown, as you can see in its source code:
package org.pdfclown.samples.cli;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.pdfclown.documents.Page;
import org.pdfclown.documents.contents.ITextString;
import org.pdfclown.documents.contents.TextChar;
import org.pdfclown.documents.interaction.annotations.TextMarkup;
import org.pdfclown.documents.interaction.annotations.TextMarkup.MarkupTypeEnum;
import org.pdfclown.files.File;
import org.pdfclown.tools.TextExtractor;
import org.pdfclown.util.math.Interval;
import org.pdfclown.util.math.geom.Quad;
/**
This sample demonstrates how to highlight text matching arbitrary patterns.
Highlighting is defined through text markup annotations.
@author Stefano Chizzolini (http://www.stefanochizzolini.it)
@since 0.1.1
@version 0.1.1
*/
public class TextHighlightSample
extends Sample
{
@Override
public boolean run(
)
{
String filePath = promptPdfFileChoice("Please select a PDF file");
// 1. Open the PDF file!
File file;
try
{file = new File(filePath);}
catch(Exception e)
{throw new RuntimeException(filePath + " file access error.",e);}
// Define the text pattern to look for!
String textRegEx = promptChoice("Please enter the pattern to look for: ");
Pattern pattern = Pattern.compile(textRegEx, Pattern.CASE_INSENSITIVE);
// 2. Iterating through the document pages...
TextExtractor textExtractor = new TextExtractor(true, true);
for(final Page page : file.getDocument().getPages())
{
System.out.println("\nScanning page " + (page.getIndex()+1) + "...\n");
// 2.1. Extract the page text!
Map<Rectangle2D,List<ITextString>> textStrings = textExtractor.extract(page);
// 2.2. Find the text pattern matches!
final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings));
// 2.3. Highlight the text pattern matches!
textExtractor.filter(
textStrings,
new TextExtractor.IIntervalFilter()
{
@Override
public boolean hasNext()
{return matcher.find();}
@Override
public Interval<Integer> next()
{return new Interval<Integer>(matcher.start(), matcher.end());}
@Override
public void process(
Interval<Integer> interval,
ITextString match
)
{
// Defining the highlight box of the text pattern match...
List<Quad> highlightQuads = new ArrayList<Quad>();
{
/*
NOTE: A text pattern match may be split across multiple contiguous lines,
so we have to define a distinct highlight box for each text chunk.
*/
Rectangle2D textBox = null;
for(TextChar textChar : match.getTextChars())
{
Rectangle2D textCharBox = textChar.getBox();
if(textBox == null)
{textBox = (Rectangle2D)textCharBox.clone();}
else
{
if(textCharBox.getY() > textBox.getMaxY())
{
highlightQuads.add(Quad.get(textBox));
textBox = (Rectangle2D)textCharBox.clone();
}
else
{textBox.add(textCharBox);}
}
}
highlightQuads.add(Quad.get(textBox));
}
// Highlight the text pattern match!
new TextMarkup(page, MarkupTypeEnum.Highlight, highlightQuads);
}
@Override
public void remove()
{throw new UnsupportedOperationException();}
}
);
}
// 3. Highlighted file serialization.
serialize(file, false);
return true;
}
}
This is another example matching words which contain “co” (regular expression “\w*co\w*”):

Here you can appreciate the dehyphenation functionality applied to another search (words beginning with “devel” — regular expression “\bdevel\w*”):
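The line-splitting logic at the heart of the highlighting sample can be tried in isolation. The following is a minimal sketch that uses only java.awt.geom, with no PDF Clown types; the class and method names (HighlightBoxes, mergeByLine) are made up for illustration. It assumes character boxes arrive in reading order with the y coordinate growing downwards, as in the sample above. Unlike the sample, it also guards against an empty list of characters.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of PDF Clown.
class HighlightBoxes {
  // Merges per-character boxes into one box per text line.
  static List<Rectangle2D> mergeByLine(List<Rectangle2D> charBoxes) {
    List<Rectangle2D> lineBoxes = new ArrayList<>();
    Rectangle2D current = null;
    for (Rectangle2D charBox : charBoxes) {
      if (current == null) {
        current = (Rectangle2D) charBox.clone();
      } else if (charBox.getY() > current.getMaxY()) {
        // The character starts below the current box: a new line begins.
        lineBoxes.add(current);
        current = (Rectangle2D) charBox.clone();
      } else {
        // Same line: grow the box to include the character.
        current.add(charBox);
      }
    }
    if (current != null) {
      lineBoxes.add(current);
    }
    return lineBoxes;
  }

  public static void main(String[] args) {
    List<Rectangle2D> chars = new ArrayList<>();
    chars.add(new Rectangle2D.Double(10, 10, 5, 10)); // first line
    chars.add(new Rectangle2D.Double(15, 10, 5, 10)); // first line
    chars.add(new Rectangle2D.Double(10, 25, 5, 10)); // second line
    for (Rectangle2D box : mergeByLine(chars)) {
      System.out.println(box);
    }
  }
}
```

Each resulting rectangle corresponds to one highlight quad, which is why a match that wraps across two lines produces two boxes rather than one oversized box.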
3. Metadata streams (XMP)
XMP metadata streams are now available for reading and writing on any dictionary or stream entity within a PDF document (see PdfObjectWrapper.get/setMetadata()).
4. Optional/Layered contents
Smoothing out some PDF spec awkwardness while implementing the content layer (aka optional content) functionality proved to be an interesting challenge. The result was nothing but satisfaction: a clean, intuitive and rich programming interface which automates lots of annoying housekeeping tasks and lets you access even the whole raw structures in case of special needs!
The figure above represents a document generated by the following code sample; for the sake of comparison, I took an iText example and translated it to PDF Clown, adding some niceties like the cooperation between the PrimitiveComposer (whose lower-level role is graphics composition through primitive operations like showing text lines and drawing shapes) and the BlockComposer (whose higher-level role is to arrange text within page areas managing alignments, paragraph spacing and indentation, hyphenation, and so on).
package org.pdfclown.samples.cli;
import java.awt.Dimension;
import java.awt.Point;
import java.awt.Rectangle;
import org.pdfclown.documents.Document;
import org.pdfclown.documents.Document.PageModeEnum;
import org.pdfclown.documents.Page;
import org.pdfclown.documents.contents.composition.AlignmentXEnum;
import org.pdfclown.documents.contents.composition.AlignmentYEnum;
import org.pdfclown.documents.contents.composition.BlockComposer;
import org.pdfclown.documents.contents.composition.PrimitiveComposer;
import org.pdfclown.documents.contents.fonts.StandardType1Font;
import org.pdfclown.documents.contents.layers.Layer;
import org.pdfclown.documents.contents.layers.Layer.ViewStateEnum;
import org.pdfclown.documents.contents.layers.LayerDefinition;
import org.pdfclown.documents.contents.layers.LayerGroup;
import org.pdfclown.documents.contents.layers.Layers;
import org.pdfclown.files.File;
/**
This sample demonstrates how to define layers to control content visibility.
@author Stefano Chizzolini (http://www.stefanochizzolini.it)
@since 0.1.1
@version 0.1.1
*/
public class LayerCreationSample
extends Sample
{
@Override
public boolean run(
)
{
// 1. PDF file instantiation.
File file = new File();
Document document = file.getDocument();
// 2. Content creation.
populate(document);
// 3. Serialize the PDF file!
serialize(file, false, "Layer", "inserting layers");
return true;
}
private void populate(
Document document
)
{
// Initialize a new page!
Page page = new Page(document);
document.getPages().add(page);
// Initialize the primitive composer (within the new page context)!
PrimitiveComposer composer = new PrimitiveComposer(page);
composer.setFont(new StandardType1Font(document, StandardType1Font.FamilyEnum.Helvetica, true, false), 12);
// Initialize the block composer (wrapping the primitive one)!
BlockComposer blockComposer = new BlockComposer(composer);
// Initialize the document layer configuration!
LayerDefinition layerDefinition = new LayerDefinition(document); // Creates the document layer configuration.
document.setLayer(layerDefinition); // Activates the document layer configuration.
document.setPageMode(PageModeEnum.Layers); // Shows the layers tab on document opening.
// Get the root layers collection!
Layers rootLayers = layerDefinition.getLayers();
// 1. Nested layers.
{
Layer nestedLayer = new Layer(document, "Nested layer");
rootLayers.add(nestedLayer);
Layers nestedSubLayers = nestedLayer.getLayers();
Layer nestedLayer1 = new Layer(document, "Nested layer 1");
nestedSubLayers.add(nestedLayer1);
Layer nestedLayer2 = new Layer(document, "Nested layer 2");
nestedSubLayers.add(nestedLayer2);
nestedLayer2.setLocked(true);
// NOTE: Text in this section is shown using PrimitiveComposer.
composer.beginLayer(nestedLayer);
composer.showText(nestedLayer.getTitle(), new Point(50, 50));
composer.end();
composer.beginLayer(nestedLayer1);
composer.showText(nestedLayer1.getTitle(), new Point(50, 75));
composer.end();
composer.beginLayer(nestedLayer2);
composer.showText(nestedLayer2.getTitle(), new Point(50, 100));
composer.end();
}
// 2. Simple group (labeled group of non-nested, inclusive-state layers).
{
Layers simpleGroup = new Layers(document, "Simple group");
rootLayers.add(simpleGroup);
Layer layer1 = new Layer(document, "Grouped layer 1");
simpleGroup.add(layer1);
Layer layer2 = new Layer(document, "Grouped layer 2");
simpleGroup.add(layer2);
// NOTE: Text in this section is shown using BlockComposer along with PrimitiveComposer
// to demonstrate their flexible cooperation.
blockComposer.begin(new Rectangle(50, 125, 200, 50), AlignmentXEnum.Left, AlignmentYEnum.Middle);
composer.beginLayer(layer1);
blockComposer.showText(layer1.getTitle());
composer.end();
blockComposer.showBreak(new Dimension(0, 15));
composer.beginLayer(layer2);
blockComposer.showText(layer2.getTitle());
composer.end();
blockComposer.end();
}
// 3. Radio group (labeled group of non-nested, exclusive-state layers).
{
Layers radioGroup = new Layers(document, "Radio group");
rootLayers.add(radioGroup);
Layer radio1 = new Layer(document, "Radiogrouped layer 1");
radioGroup.add(radio1);
radio1.setViewState(ViewStateEnum.On);
Layer radio2 = new Layer(document, "Radiogrouped layer 2");
radioGroup.add(radio2);
radio2.setViewState(ViewStateEnum.Off);
Layer radio3 = new Layer(document, "Radiogrouped layer 3");
radioGroup.add(radio3);
radio3.setViewState(ViewStateEnum.Off);
// Register this option group in the layer configuration!
LayerGroup options = new LayerGroup(document);
options.add(radio1);
options.add(radio2);
options.add(radio3);
layerDefinition.getOptionGroups().add(options);
// NOTE: Text in this section is shown using BlockComposer along with PrimitiveComposer
// to demonstrate their flexible cooperation.
blockComposer.begin(new Rectangle(50, 185, 200, 75), AlignmentXEnum.Left, AlignmentYEnum.Middle);
composer.beginLayer(radio1);
blockComposer.showText(radio1.getTitle());
composer.end();
blockComposer.showBreak(new Dimension(0, 15));
composer.beginLayer(radio2);
blockComposer.showText(radio2.getTitle());
composer.end();
blockComposer.showBreak(new Dimension(0, 15));
composer.beginLayer(radio3);
blockComposer.showText(radio3.getTitle());
composer.end();
blockComposer.end();
}
composer.flush();
}
}
Some comments on the code:
- document layer configuration initialization (new LayerDefinition(document) followed by document.setLayer(layerDefinition)): this is the first operation to do;
- layer creation (new Layer(document, "Nested layer")) and insertion (rootLayers.add(nestedLayer)) into the hierarchical structure;
- sublayer insertion (nestedSubLayers.add(nestedLayer1));
- content layering (the composer.beginLayer(…) and composer.end() calls): content is enclosed within a layer section, making its visibility dependent on the layer state. There’s a subtle discrepancy in the PDF spec when it comes to nested layers: one may assume they imply a hierarchical dependency of the sublayer states, but that’s NOT the case; if you hide a layer, its descendants are still visible! To work around this counterintuitive behaviour, many software toolkits wrap contents within multiple nested layer blocks; for example, if you want to wrap the text “nested layer 1” into a layer (resource name /Pr2) which is a sublayer of another one (resource name /Pr1), the content stream will contain this cumbersome syntax:
4 0 obj
<< /Length 205 >>
stream
[...]
/OC /Pr1 BDC
/OC /Pr2 BDC
q
BT
1 0 0 1 100 800 Tm
/F1 12 Tf
(nested layer 1)Tj
ET
Q
EMC
EMC
[...]
endstream
endobj
This beast is repeated as many times as there are distinct content chunks to include within the same layer; it gets even worse as the number of nesting levels increases. Just awful!
Instead of this, PDF Clown defines a default hierarchical membership for each layer which can be used as a single, terse wrapping block (resource name /Pr2):
4 0 obj
<< /Length 185 >>
stream
[...]
/OC /Pr2 BDC
q
BT
1 0 0 1 100 800 Tm
/F1 12 Tf
(nested layer 1)Tj
ET
Q
EMC
[...]
endstream
endobj
6 0 obj
<< /Type /Pages /Count 1 /Resources << /Font 7 0 R /Properties 15 0 R >> /Kids [5 0 R ] >>
endobj
15 0 obj
<< /Pr2 16 0 R >>
endobj
16 0 obj
<< /Type /OCMD /OCGs [12 0 R 11 0 R ] /P /AllOn >> % Membership containing the references to the layers belonging to the hierarchical path of nested layer 1.
endobj
This way code is concise and more maintainable (if you want to rearrange the hierarchical structure of the layers you don’t have to walk through the content stream hunting layer block occurrences for correction — just go to the membership associated to the layer and update its hierarchical path!).
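- simple layer group creation and insertion (new Layers(document, "Simple group") followed by rootLayers.add(simpleGroup));
- option group definition (the LayerGroup creation and its registration through layerDefinition.getOptionGroups().add(options)).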
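The hierarchical membership described above (a single /OCMD dictionary listing the layers on the path, combined with the /AllOn policy) can be modelled in a few lines. This is a minimal sketch under stated assumptions: LayerNode and its members are hypothetical names invented for illustration, not the PDF Clown layer classes, and the visibility rule simply mirrors what /P /AllOn expresses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; not the PDF Clown layer classes.
class LayerNode {
  final String title;
  final LayerNode parent; // null for a top-level layer
  boolean on = true;      // view state of this single layer

  LayerNode(String title, LayerNode parent) {
    this.title = title;
    this.parent = parent;
  }

  // Layers on the hierarchical path, from this layer up to its top-level ancestor.
  // This plays the role of the /OCGs array in the membership dictionary.
  List<LayerNode> membership() {
    List<LayerNode> path = new ArrayList<>();
    for (LayerNode layer = this; layer != null; layer = layer.parent) {
      path.add(layer);
    }
    return path;
  }

  // Mirrors the /P /AllOn policy: content shows only if every member is on.
  boolean isContentVisible() {
    for (LayerNode layer : membership()) {
      if (!layer.on) {
        return false;
      }
    }
    return true;
  }
}

class LayerMembershipDemo {
  public static void main(String[] args) {
    LayerNode nested = new LayerNode("Nested layer", null);
    LayerNode nested1 = new LayerNode("Nested layer 1", nested);
    nested.on = false; // hide the parent only
    System.out.println(nested1.on);                 // prints "true"
    System.out.println(nested1.isContentVisible()); // prints "false"
  }
}
```

The point of the design is that rearranging the hierarchy only changes what membership() returns; the content wrapped by the layer does not have to be touched.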
5. AcroForm fields filling
Text fields have been enhanced to support automatic appearance update on value change.
Celebrating Freedom
This is decidedly off-topic but, besides being a human and a European citizen, I’m an Italian: since next week (precisely on March 17, 2011) Italy will celebrate the 150th anniversary of its unification, I’d like to speak out for a moment to all of you coming to this blog from ’round the world.
Italy had a great past made by giants; unfortunately, it currently has a troubled present made by embarrassing dwarves. Despite this sad situation, please do NOT believe the traditional stereotypes which infamously depict Italians: Italy is a complex country populated both by lovely persons and pure bastards, as any other country.
In this particular historical moment my thought goes, full of admiration, to the brave people of Northern Africa and all the other Arabs and Persians who are fighting for their freedom: as a European, I apologize for the weak political and humanitarian response of the Union’s institutions to their efforts for emancipating themselves from dictatorships. History is moving!



PDF Clown 0.1.0 has been released!
Latest news: PDF Clown 0.1.0 has been superseded by PDF Clown 0.1.1
This release introduces support for cross-reference-stream-based PDF files (as defined since PDF 1.5 spec) along with page rendering and printing: a specialized tool provides a convenient way to convert PDF pages into images (aka rasterization). Lots of minor improvements have been applied too.
Last but not least: the project’s base namespace has changed to org.pdfclown
This release may be downloaded from:
https://sourceforge.net/projects/clown/files/PDFClown-devel/0.1.0%20Alpha/
enjoy!
Waiting for PDF Clown 0.1.0 release
[NOTE: this post was updated on March 4, 2011]
Latest news: on March 4, 2011 PDF Clown 0.1.0 has been released!
Hi there!
New features currently under development that will be available in the next (0.1.0) release:
- cross-reference streams and object streams
- version compatibility check
- content rasterization
- functions
- page data size (a.k.a. How to split a PDF document based on maximum file size)
It’s time to reveal that I decided to consolidate the project’s identity (and simplify your typing life) by changing its namespace prefix (it.stefanochizzolini.clown) in favor of the more succinct org.pdfclown: I know you were eager to strip that cluttering Italian identifier!
Last week I was informed that USGS adopted PDF Clown for relayering their topographic maps and attaching metadata to them. Although on a technical note it’s stated that its use will be only transitory, as they are converging toward a solution natively integrated with their main application suite (TerraGo), nonetheless its service in such a production environment seems to be an eloquent demonstration of its reliability. 8)
1. Cross-reference streams and object streams
After lots of requests, I’m currently busy developing the cross-reference stream and object stream read/write functionalities [PDF:1.6:3.4.6-7]; in particular, stream reading has been partially based upon the code that Joshua Tauberer wrote some months ago while he was experimenting with PDF Clown on PDF file analysis for his US Congress activity tracker, GovTrack.
2. Version compatibility check
Working on cross-reference streams prompted me to start supporting version-compatibility checking via annotations. This feature conveniently lets users transparently check that the PDF files they are creating or modifying conform to a target PDF version (as specified in the PDF file header) according to a configurable compatibility policy, defined through Document.Configuration.CompatibilityModeEnum. These are the alternative policies applicable:
- Passthrough: document’s conformance version is ignored; any feature is accepted without checking its compatibility.
- Loose: document’s conformance version is automatically updated to support actually used features.
- Strict: document’s conformance version is mandatory; any unsupported feature is forbidden and causes an exception to be thrown in case of attempted use.
Automatic compatibility checking is very handy as users can enforce generated PDF files’ conformance without manual intervention; for example, you don’t have to tweak your PDF file version to 1.5 if you plan to use the optional content functionality (OCG [PDF:1.6:4.10]), just sit back and see it be done!
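The three policies can be summarized in a small sketch. The names below (CompatibilityChecker, Mode, require) are hypothetical and only echo the policies described above; this is not the PDF Clown API, which drives the check through annotations. Versions are encoded in tenths (14 stands for PDF 1.4) purely to keep the example short.

```java
// Hypothetical sketch: the names echo the policies described above,
// but this is not the PDF Clown API.
class CompatibilityChecker {
  enum Mode { PASSTHROUGH, LOOSE, STRICT }

  private final Mode mode;
  private int version; // PDF version in tenths: 14 stands for PDF 1.4

  CompatibilityChecker(Mode mode, int version) {
    this.mode = mode;
    this.version = version;
  }

  int getVersion() { return version; }

  // To be called whenever a feature requiring at least featureVersion is used.
  void require(String feature, int featureVersion) {
    if (mode == Mode.PASSTHROUGH || featureVersion <= version) {
      return; // nothing to check, or feature already supported
    }
    if (mode == Mode.LOOSE) {
      version = featureVersion; // raise the document version as needed
      return;
    }
    // STRICT: the target version is mandatory.
    throw new IllegalStateException(
        feature + " requires PDF " + featureVersion / 10 + "." + featureVersion % 10);
  }

  public static void main(String[] args) {
    CompatibilityChecker loose = new CompatibilityChecker(Mode.LOOSE, 14);
    loose.require("optional content", 15);
    System.out.println(loose.getVersion()); // prints "15"
  }
}
```

With the strict policy, the same require call would throw instead of raising the version, which corresponds to the "forbidden feature" behaviour described in the list above.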
3. Content rasterization
I’m quite impressed by how naturally the existing model is integrating with PDF printing and image rasterization functionalities. Leveraging the existing model means that there’s a common infrastructure (see ContentScanner and ContentObject hierarchy) that serves disparate purposes (content creation, content analysis, content extraction, content rasterization, content editing, and so on), simplifying its understanding, use, maintenance and extension. I wanna stress that my goal is to come to an elegant viewer, NOT to a cumbersome retrofit component that’s added as an alien to fill the gap!
Yes, I know these goodies had been outside my official plans for a long time, but during the last week of September, while crawling through the PDF Clown sources, I stumbled upon the above-mentioned ContentScanner and ContentObject hierarchy: I realized they were just ready for supporting content rendering, so I thought “What are we waiting for? Let’s do it!”… but don’t expect that 0.1 will deliver a full-fledged PDF viewer and printer — I’ll start prototyping the most basic graphics primitives such as space coordinates transformations, path drawing, color selection and so on. Advanced operations such as glyph outline drawing will necessarily appear afterwards. Anyway, I’m confident that at the end of the development process it will be possible to print and display PDF pages (and even independent parts of them such as external forms) along with their thumbnails.
The figure below compares an example of PDF Clown’s current rasterization capabilities (on the left, via Java 2D graphics) with its equivalent generated by Adobe Reader (on the right). As you can see, path drawing is highly conformant with the reference implementation, while no text rendering has been implemented yet.
Creating this figure was absolutely trivial. Here is the code sample used (the renderer.render(…) call in step 3 executes the actual rendering of the first page of the document):
package org.pdfclown.samples;
import java.awt.Dimension;
import java.awt.image.BufferedImage;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.pdfclown.documents.Document;
import org.pdfclown.files.File;
import org.pdfclown.tools.Renderer;
public class ContentRenderingSample
extends Sample
{
@Override
public boolean run(
)
{
String filePath = promptPdfFileChoice("Please select a PDF file");
// 1. Open the PDF file!
File file;
try
{file = new File(filePath);}
catch(Exception e)
{throw new RuntimeException(filePath + " file access error.",e);}
// 2. Get the PDF document!
Document document = file.getDocument();
// 3. Rasterize the first page!
Renderer renderer = new Renderer();
BufferedImage image = renderer.render(document.getPages().get(0),new Dimension(1400,850));
// 4. Save the rasterized image!
try
{ImageIO.write(image,"jpg",new java.io.File(getOutputPath() + java.io.File.separator + "ContentRenderingSample.jpg"));}
catch(IOException e)
{e.printStackTrace();}
return true;
}
}
As you can see in the following code chunk, the Renderer.render(…) method prepares the target graphics context (the BufferedImage creation) and delegates the rendering to the chosen content context (the contentContext.render(…) call), which in this case is a Page object:
package org.pdfclown.tools;
import java.awt.Dimension;
import java.awt.image.BufferedImage;
import org.pdfclown.documents.contents.IContentContext;
/**
Tool for rendering {@link IContentContext content contexts}.
@author Stefano Chizzolini (http://www.stefanochizzolini.it)
@version 0.1.0
@since 0.1.0
*/
public class Renderer
{
. . .
/**
Renders the specified content context into an image context.
@param contentContext Source content context.
@param size Image size expressed in device-space units (that is typically pixels).
@return Image representing the rendered contents.
*/
public BufferedImage render(
IContentContext contentContext,
Dimension size
)
{
BufferedImage image = new BufferedImage(size.width,size.height,BufferedImage.TYPE_INT_BGR);
contentContext.render(image.createGraphics(),size);
return image;
}
}
The Page object then delegates its own contents rendering to ContentScanner (see its render(…) method below), which sequentially scans every graphics operation producing its corresponding raster representation:
package org.pdfclown.documents;
import java.awt.Graphics2D;
import java.awt.geom.Dimension2D;
import java.awt.print.Printable;
import org.pdfclown.PDF;
import org.pdfclown.VersionEnum;
import org.pdfclown.documents.contents.ContentScanner;
import org.pdfclown.documents.contents.IContentContext;
import org.pdfclown.objects.PdfDictionary;
import org.pdfclown.objects.PdfObjectWrapper;
/**
[PDF:1.6:3.6.2] Document page.
@author Stefano Chizzolini (http://www.stefanochizzolini.it)
@since 0.0.0
@version 0.1.0
*/
@PDF(VersionEnum.PDF10)
public class Page
extends PdfObjectWrapper
implements IContentContext,
Printable
{
. . .
@Override
public void render(
Graphics2D context,
Dimension2D size
)
{
ContentScanner scanner = new ContentScanner(getContents());
scanner.render(context,size);
}
. . .
}
4. Functions
Improving the color space definitions for content rasterization is forcing me to also manage functions [PDF:1.6:3.9] in all their flavors (Type 0 (Sampled), Type 2 (Exponential Interpolation), Type 3 (Stitching) and Type 4 (PostScript Calculator)).
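As a taste of what these functions compute, here is a minimal sketch of the simplest flavor, Type 2 (Exponential Interpolation), which the PDF spec defines as y_j = C0_j + x^N × (C1_j − C0_j) for each output component j. The class and method names are hypothetical, this is not the PDF Clown implementation, and the sketch assumes the default domain [0 1] (the input is clipped to it) and ignores the optional Range entry.

```java
// Hypothetical sketch of a Type 2 (exponential interpolation) function;
// not the PDF Clown implementation.
class ExponentialInterpolation {
  // Evaluates y_j = C0_j + x^N * (C1_j - C0_j) for each output component j.
  // Assumes the domain [0 1]; the input is clipped to it.
  static double[] evaluate(double x, double[] c0, double[] c1, double n) {
    if (c0.length != c1.length) {
      throw new IllegalArgumentException("C0 and C1 must have the same length");
    }
    double clipped = Math.max(0.0, Math.min(1.0, x));
    double factor = Math.pow(clipped, n);
    double[] y = new double[c0.length];
    for (int j = 0; j < y.length; j++) {
      y[j] = c0[j] + factor * (c1[j] - c0[j]);
    }
    return y;
  }

  public static void main(String[] args) {
    // Linear blend (N = 1) halfway between black and an RGB orange.
    double[] rgb = evaluate(0.5, new double[] {0, 0, 0}, new double[] {1, 0.5, 0}, 1);
    System.out.println(rgb[0] + " " + rgb[1] + " " + rgb[2]); // prints "0.5 0.25 0.0"
  }
}
```

Functions of this kind typically back shadings and tint transforms, which is why color space support drags them in.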
5. Page data size (a.k.a. How to split a PDF document based on maximum file size)
org.pdfclown.tools.PageManager has been enhanced with the introduction of an elegant algorithm that accurately calculates the data size of PDF pages keeping shared resources (like fonts, images and so on) into consideration: this practically means that you can evaluate the incremental size of each page in a document, splitting the file when the collected pages reach the maximum file size you intended for your target split PDF files, without creating any cumbersome temporary file!
It’s really funny that proprietary products like PDFTron’s, which cost at least hundreds of bucks, suggest in their official SDKs awkward trial-and-error strategies, iteratively creating horrible temporary files… evidently, only clowns can solve such a task better
— Free Software rocks! *<;o)
package org.pdfclown.samples;
import java.util.HashSet;
import java.util.Set;
import org.pdfclown.documents.Document;
import org.pdfclown.documents.Page;
import org.pdfclown.documents.Pages;
import org.pdfclown.files.File;
import org.pdfclown.objects.PdfReference;
import org.pdfclown.tools.PageManager;
public class SplitSample
extends Sample
{
private static final long MaxDataSize = 4 << 20; // 4 MBytes (you can obviously set it at your will).
private PageManager manager;
@Override
public boolean run(
)
{
// 1. Opening the PDF file...
File file;
{
String filePath = promptPdfFileChoice("Please select a PDF file");
try
{file = new File(filePath);}
catch(Exception e)
{throw new RuntimeException(filePath + " file access error.",e);}
}
Document document = file.getDocument();
Pages pages = document.getPages();
// 2. Splitting the document...
manager = new PageManager(document);
int splitIndex = 0;
long incrementalDataSize = 0;
int beginPageIndex = 0;
Set<PdfReference> visitedReferences = new HashSet<PdfReference>();
for(Page page : pages)
{
long pageDifferentialDataSize = PageManager.getSize(page,visitedReferences);
incrementalDataSize += pageDifferentialDataSize;
if(incrementalDataSize > MaxDataSize) // Data size limit reached.
{
int endPageIndex = page.getIndex();
// Split the current document page range!
splitDocument(++splitIndex,beginPageIndex,endPageIndex);
beginPageIndex = endPageIndex;
incrementalDataSize = PageManager.getSize(page,visitedReferences = new HashSet<PdfReference>());
}
}
// Split the last document page range!
splitDocument(++splitIndex,beginPageIndex,pages.size());
return true;
}
private void splitDocument(
int splitIndex,
int beginPageIndex,
int endPageIndex
)
{
// 1. Split the document!
Document splitDocument = manager.extract(beginPageIndex,endPageIndex);
// 2. Serialize the split file!
serialize(splitDocument.getFile(),this.getClass().getSimpleName() + "." + (splitIndex),false);
}
}
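The accounting idea behind the sample above (a shared resource is charged only to the first page that references it within a split range) can be reproduced without any PDF machinery. The sketch below is a toy model under stated assumptions: SplitPlanner, differentialSize and planSplits are hypothetical names, pages are reduced to maps from a resource identifier to its size in bytes, and, unlike the sample, the planner never emits an empty range when a single page alone exceeds the limit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model for illustration; not the PageManager implementation.
class SplitPlanner {
  // Size contributed by a page, counting each resource only the first time it is seen.
  static long differentialSize(Map<String, Long> pageResources, Set<String> visited) {
    long size = 0;
    for (Map.Entry<String, Long> resource : pageResources.entrySet()) {
      if (visited.add(resource.getKey())) {
        size += resource.getValue();
      }
    }
    return size;
  }

  // Returns the page indexes at which each split file starts.
  static List<Integer> planSplits(List<Map<String, Long>> pages, long maxSize) {
    List<Integer> starts = new ArrayList<>();
    starts.add(0);
    Set<String> visited = new HashSet<>();
    long total = 0;
    for (int index = 0; index < pages.size(); index++) {
      long difference = differentialSize(pages.get(index), visited);
      boolean rangeHasPages = index > starts.get(starts.size() - 1);
      if (total + difference > maxSize && rangeHasPages) {
        // Limit reached: the current page opens a new range and pays
        // again for every resource it uses, shared ones included.
        starts.add(index);
        visited = new HashSet<>();
        total = differentialSize(pages.get(index), visited);
      } else {
        total += difference;
      }
    }
    return starts;
  }

  public static void main(String[] args) {
    List<Map<String, Long>> pages = new ArrayList<>();
    for (int i = 0; i < 3; i++) {
      Map<String, Long> page = new HashMap<>();
      page.put("font", 3L);      // shared by all pages
      page.put("image" + i, 4L); // specific to each page
      pages.add(page);
    }
    System.out.println(planSplits(pages, 12)); // prints "[0, 2]"
  }
}
```

In the example, the first two pages together weigh 11 units (the font is counted once), so the third page, which would bring the total to 15, starts the second file.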
PDF Clown 0.0.8: Q&A
[NOTE: this post was updated on March 21, 2011]
Latest news: PDF Clown 0.0.8 functionalities are part of the latest release (PDF Clown 0.1.0) — as 0.0 version series is under decommissioning, you’re warmly invited to adopt the current 0.1 version series. Thank you!
This post collects all the relevant information about issues and questions regarding PDF Clown 0.0.8.
If you have any doubts about topics not covered here, please post your question in the Help forum.
1. ‘GoToExternalDestination’ class missing
See Topic 3836075 in the Help forum.
2. ‘xref’ keyword not found
See Topic 3434621 in the Help forum.
3. Unknown type: Comment
See Topic 3863926 in the Help forum.
4. Text line height
See Topic 3928380 in the Help forum.
PDF Clown 0.0.8 has been released!
[NOTE: this post was updated on March 21, 2011]
Latest news: PDF Clown 0.0.8 functionalities are part of the latest release (PDF Clown 0.1.0) — as 0.0 version series is under decommissioning, you’re warmly invited to adopt the current 0.1 version series. Thank you!
This release is focused on text extraction support: a specialized tool provides, along with plain-text extraction, advanced functionalities such as full graphic state of extracted text (font, font size, text color, text rendering mode, text position…), text filtering by area, text grouping and sorting. Lots of minor improvements have been applied too.
The Java version has migrated to the Java 6 platform, while the C#/.NET version has migrated to .NET 3.5.
LGPL 3 is the new license applied to the project.
Last but not least: the distribution’s directory structure has been revised to simplify its navigation and ease its integration with common IDEs (Eclipse- and Visual Studio-compatible).
This release may be downloaded from:
https://sourceforge.net/projects/clown/files/PDFClown-devel/0.0.8%20Alpha/
enjoy!
Waiting for PDF Clown 0.0.8 release
[NOTE: this post was updated on March 21, 2011]
Latest news: PDF Clown 0.0.8 functionalities are part of the latest release (PDF Clown 0.1.0). As the 0.0 version series is being decommissioned, you're warmly invited to adopt the current 0.1 version series. Thank you!
I know, it’s been just about one year since the latest version (0.0.7) was released… please, forgive me!
In the meantime PDF Clown has been growing considerably to provide a rich text extraction functionality for its next 0.0.8 version:
- the font model has been deeply revised and expanded to handle character encoding issues smoothly;
- the content stream model has been further harmonized to simplify access to text contents;
- the content scanner has been simplified in its iterative mechanism and enriched through a new level of abstraction to allow easy object placement detection (image and text characters coordinates);
- a text extraction tool allows sub-page region selection to extract text only from specific page areas.
While waiting for the current development iteration to finish, let's look at some new stuff!
NOTE: the following code samples are written as extensions of the Sample class common to all the CLI samples shipped with the PDF Clown 0.0.8 downloadable distribution.
1. Basic text extraction
This code sample demonstrates the most basic way to extract text content with PDF Clown 0.0.8.
package it.stefanochizzolini.clown.samples;
import it.stefanochizzolini.clown.documents.Document;
import it.stefanochizzolini.clown.documents.Page;
import it.stefanochizzolini.clown.documents.contents.ContentScanner;
import it.stefanochizzolini.clown.documents.contents.fonts.Font;
import it.stefanochizzolini.clown.documents.contents.objects.ContainerObject;
import it.stefanochizzolini.clown.documents.contents.objects.ContentObject;
import it.stefanochizzolini.clown.documents.contents.objects.ShowText;
import it.stefanochizzolini.clown.documents.contents.objects.Text;
import it.stefanochizzolini.clown.files.File;
import java.util.HashMap;
import java.util.Map;
public class BasicTextExtractionSample
extends Sample
{
@Override
public boolean run(
)
{
String filePath = promptPdfFileChoice("Please select a PDF file");
// 1. Open the PDF file!
File file;
try
{file = new File(filePath);}
catch(Exception e)
{throw new RuntimeException(filePath + " file access error.",e);}
// 2. Get the PDF document!
Document document = file.getDocument();
// 3. Extracting text from the document pages...
for(Page page : document.getPages())
{
if(!prompt(page))
return false;
extract(
new ContentScanner(page) // Wraps the page contents into a scanner.
);
}
return true;
}
/**
Scans a content level looking for text.
*/
/*
NOTE: Page contents are represented by a sequence of content objects,
possibly nested into multiple levels.
*/
private void extract(
ContentScanner level
)
{
if(level == null)
return;
while(level.moveNext())
{
ContentObject content = level.getCurrent();
if(content instanceof ShowText)
{
Font font = level.getState().font;
// Extract the current text chunk, decoding it!
System.out.println(font.decode(((ShowText)content).getText()));
}
else if(content instanceof Text
|| content instanceof ContainerObject)
{
// Scan the inner level!
extract(level.getChildLevel());
}
}
}
private boolean prompt(
Page page
)
{
int pageIndex = page.getIndex();
if(pageIndex > 0)
{
Map<String,String> options = new HashMap<String,String>();
options.put("", "Scan next page");
options.put("Q", "End scanning");
if(!promptChoice(options).equals(""))
return false;
}
System.out.println("\nScanning page " + (pageIndex+1) + "...\n");
return true;
}
}
To understand this sample, you need to know that the PDF Specification prescribes that text content be shown through so-called ShowText operations, so that is the kind of object we look for.
Here is how it works:
- iterate the document pages [lines 36-44], applying the ContentScanner to the current page [lines 39-41];
- iterate the current page contents (through ContentScanner) looking for ShowText operations [lines 63-78], recursing into ContainerObject-s and Text objects;
- extract the text content from ShowText operations [line 70].
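The recursion described in the steps above can be illustrated without the library. What follows is only a minimal sketch: Node, TextChunk and Container are hypothetical stand-ins for the roles played by ContentObject, ShowText and Text/ContainerObject, and they are not PDF Clown classes. Font decoding is left out, so each chunk is assumed to be already decoded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for content objects; NOT PDF Clown classes.
interface Node {}

// Leaf carrying an already-decoded text chunk (plays the role of ShowText).
final class TextChunk implements Node {
  final String text;
  TextChunk(String text) {this.text = text;}
}

// Nested level (plays the role of Text/ContainerObject).
final class Container implements Node {
  final List<Node> children;
  Container(Node... children) {this.children = Arrays.asList(children);}
}

final class NestedScanSketch {
  /** Depth-first scan that collects the text chunks in content order. */
  static List<String> extract(List<Node> level) {
    List<String> chunks = new ArrayList<String>();
    collect(level, chunks);
    return chunks;
  }

  private static void collect(List<Node> level, List<String> chunks) {
    for (Node node : level) {
      if (node instanceof TextChunk) {
        chunks.add(((TextChunk) node).text);
      } else if (node instanceof Container) {
        // Scan the inner level!
        collect(((Container) node).children, chunks);
      }
      // Any other kind of node is simply skipped.
    }
  }

  public static void main(String[] args) {
    List<Node> page = Arrays.<Node>asList(
      new Container(new TextChunk("Hel"), new Container(new TextChunk("lo "))),
      new TextChunk("world"));
    for (String chunk : extract(page)) {
      System.out.println(chunk);
    }
  }
}
```

The point of the sketch is the shape of the traversal: leaves yield text, containers trigger a recursive call, and everything else is ignored.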

Incipit of the Japanese translation of the UN Universal Declaration of Human Rights
When this code sample is applied to a document such as the Japanese translation of the UN Universal Declaration of Human Rights (see above), the result is pretty accurate, although the extracted text contains superfluous line breaks (see second row in the figure below). This discrepancy is due to the way the PDF Specification defines text data representation: contents within ShowText operations may have been (legally) split arbitrarily by the document generator, since at its inception the PDF format was primarily aimed at typographic rendition rather than content accessibility. To address this, the TextExtractor tool provides heuristics that organize the extracted text in a more intelligible manner (see the following paragraphs).

Text extracted by PDF Clown from the incipit of the Japanese translation of the UN Universal Declaration of Human Rights.
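To give an idea of what reorganizing arbitrarily split chunks can look like, here is a minimal sketch of one simple heuristic: chunks whose baselines lie within a tolerance of each other are treated as one line and concatenated from left to right. This is an illustration only and is not claimed to be the algorithm used by TextExtractor. Chunk, toLines and yTolerance are hypothetical names, and the sketch assumes y coordinates that grow downwards.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical text chunk: decoded text plus the position where it starts.
final class Chunk {
  final String text;
  final double x;
  final double y;
  Chunk(String text, double x, double y) {this.text = text; this.x = x; this.y = y;}
}

final class LineGroupingSketch {
  private static final Comparator<Chunk> BY_Y = new Comparator<Chunk>() {
    @Override
    public int compare(Chunk a, Chunk b) {return Double.compare(a.y, b.y);}
  };
  private static final Comparator<Chunk> BY_X = new Comparator<Chunk>() {
    @Override
    public int compare(Chunk a, Chunk b) {return Double.compare(a.x, b.x);}
  };

  /**
    Joins chunks lying on (roughly) the same baseline into a single line.
    Assumption: y grows downwards, so a smaller y means higher on the page.
  */
  static List<String> toLines(List<Chunk> chunks, double yTolerance) {
    // 1. Order the chunks from top to bottom.
    List<Chunk> sorted = new ArrayList<Chunk>(chunks);
    Collections.sort(sorted, BY_Y);

    // 2. Start a new line whenever a chunk is too far from the current line's first baseline.
    List<List<Chunk>> lines = new ArrayList<List<Chunk>>();
    double lineY = 0;
    for (Chunk chunk : sorted) {
      if (lines.isEmpty() || Math.abs(chunk.y - lineY) > yTolerance) {
        lines.add(new ArrayList<Chunk>());
        lineY = chunk.y;
      }
      lines.get(lines.size() - 1).add(chunk);
    }

    // 3. Within each line, concatenate the chunks from left to right.
    List<String> result = new ArrayList<String>();
    for (List<Chunk> line : lines) {
      Collections.sort(line, BY_X);
      StringBuilder builder = new StringBuilder();
      for (Chunk chunk : line) {
        builder.append(chunk.text);
      }
      result.add(builder.toString());
    }
    return result;
  }
}
```

A real implementation would also have to deal with right-to-left and vertical scripts, rotated text and word spacing, none of which this sketch attempts.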
2. Extended text extraction
This code sample shows how to exploit the new abstraction level provided by the content scanner of PDF Clown 0.0.8, which allows you to get a rich set of information describing the graphic state of extracted text (font, font size, text color, text rendering mode, text bounding box…).
In order to demonstrate its precision in detecting text position, the following code also draws the bounding box of each single character appearing on the pages.
package it.stefanochizzolini.clown.samples;
import it.stefanochizzolini.clown.documents.Document;
import it.stefanochizzolini.clown.documents.Page;
import it.stefanochizzolini.clown.documents.contents.ContentScanner;
import it.stefanochizzolini.clown.documents.contents.TextChar;
import it.stefanochizzolini.clown.documents.contents.colorSpaces.DeviceRGBColor;
import it.stefanochizzolini.clown.documents.contents.composition.PrimitiveFilter;
import it.stefanochizzolini.clown.documents.contents.objects.ContainerObject;
import it.stefanochizzolini.clown.documents.contents.objects.ContentObject;
import it.stefanochizzolini.clown.documents.contents.objects.Text;
import it.stefanochizzolini.clown.files.File;
import it.stefanochizzolini.clown.tools.PageStamper;
import java.awt.geom.Rectangle2D;
public class TextInfoExtractionSample
extends Sample
{
private DeviceRGBColor[] textCharBoxColors = new DeviceRGBColor[]
{
new DeviceRGBColor(200f/255,100f/255,100f/255),
new DeviceRGBColor(100f/255,200f/255,100f/255),
new DeviceRGBColor(100f/255,100f/255,200f/255)
};
private DeviceRGBColor textStringBoxColor = DeviceRGBColor.Black;
@Override
public boolean run(
)
{
String filePath = promptPdfFileChoice("Please select a PDF file");
// 1. Open the PDF file!
File file;
try
{file = new File(filePath);}
catch(Exception e)
{throw new RuntimeException(filePath + " file access error.",e);}
// 2. Get the PDF document!
Document document = file.getDocument();
PageStamper stamper = new PageStamper(); // NOTE: Page stamper is used to draw contents on existing pages.
// 3. Iterating through the document pages...
for(Page page : document.getPages())
{
System.out.println("\nScanning page " + (page.getIndex()+1) + "...\n");
stamper.setPage(page);
extract(
new ContentScanner(page), // Wraps the page contents into a scanner.
stamper.getForeground()
);
stamper.flush();
}
serialize(file,false);
return true;
}
/**
Scans a content level looking for text.
*/
private void extract(
ContentScanner level,
PrimitiveFilter builder
)
{
if(level == null)
return;
while(level.moveNext())
{
ContentObject content = level.getCurrent();
if(content instanceof Text)
{
ContentScanner.TextWrapper text = (ContentScanner.TextWrapper)level.getCurrentWrapper();
int colorIndex = 0;
for(ContentScanner.TextStringWrapper textString : text.getTextStrings())
{
Rectangle2D stringBox = textString.getBox();
System.out.println(
"Text ["
+ "x:" + Math.round(stringBox.getX()) + ","
+ "y:" + Math.round(stringBox.getY()) + ","
+ "w:" + Math.round(stringBox.getWidth()) + ","
+ "h:" + Math.round(stringBox.getHeight())
+ "]: " + textString.getText()
);
// Drawing text character bounding boxes...
colorIndex = (colorIndex + 1) % textCharBoxColors.length;
builder.setStrokeColor(textCharBoxColors[colorIndex]);
for(TextChar textChar : textString.getTextChars())
{
/*
NOTE: You can get further text information
(font, font size, text color, text rendering mode)
through textChar.style.
*/
builder.drawRectangle(textChar.box);
builder.stroke();
}
// Drawing text string bounding box...
builder.beginLocalState();
builder.setLineDash(0, 5);
builder.setStrokeColor(textStringBoxColor);
builder.drawRectangle(textString.getBox());
builder.stroke();
builder.end();
}
}
else if(content instanceof ContainerObject)
{
// Scan the inner level!
extract(level.getChildLevel(),builder);
}
}
}
}
This sample works the same way as the previous “1. Basic text extraction” sample, but it greatly extends the extraction functionality, providing decoded text along with its graphic attributes, such as font, font size, bounding box, text color, and so on:
- ContentScanner.TextWrapper represents a text object extracted from the ContentScanner [line 82];
- each ContentScanner.TextWrapper contains a list of text chunks (ContentScanner.TextStringWrapper) [line 84];
- each ContentScanner.TextStringWrapper contains a list of text characters (TextChar) [line 99];
- each TextChar provides information about the character state (position and style).
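The containment hierarchy above suggests how a string-level box relates to character-level boxes. The following minimal sketch computes the smallest rectangle enclosing a list of character boxes with the JDK's java.awt.geom.Rectangle2D. That a text string's box is exactly the union of its character boxes is an assumption made for illustration, and BoxUnionSketch is a hypothetical helper, not part of PDF Clown.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

final class BoxUnionSketch {
  /** Returns the smallest rectangle enclosing all the given boxes, or null if there are none. */
  static Rectangle2D union(List<Rectangle2D> charBoxes) {
    Rectangle2D result = null;
    for (Rectangle2D box : charBoxes) {
      if (result == null) {
        // Copy the first box, so that the input is left untouched.
        result = (Rectangle2D) box.clone();
      } else {
        // Grow the rectangle so that it also encloses this box.
        result.add(box);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Rectangle2D> charBoxes = Arrays.<Rectangle2D>asList(
      new Rectangle2D.Double(10, 20, 5, 12),
      new Rectangle2D.Double(15, 22, 6, 10),
      new Rectangle2D.Double(21, 18, 4, 14));
    Rectangle2D stringBox = union(charBoxes);
    System.out.println(
      "x:" + stringBox.getX() + ",y:" + stringBox.getY()
      + ",w:" + stringBox.getWidth() + ",h:" + stringBox.getHeight());
  }
}
```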
The figure below shows the result of this code running over the Greek translation of the UN Universal Declaration of Human Rights.
3. Advanced text extraction
PDF Clown supports a third level of text extraction functionality built upon the others (basic and extended, as seen above): the TextExtractor tool.
Its purpose is to leverage the extended text extraction features for sorting, aggregating and integrating the retrieved text chunks. With TextExtractor you can:
- extract full text information (text content along with the graphic attributes of each single character: font, font size, text color, text rendering mode, text bounding box…) or just plain text;
- extract all the text content in a page (or any other IContentContext, such as FormXObject) or filter just partial page areas.
3.1. Plain text extraction
This sample shows how simple it is to extract plain text from a page: after you have instantiated the TextExtractor [line 31], you just pass it your page [line 38]. Nothing but one line of code!
package it.stefanochizzolini.clown.samples;
import it.stefanochizzolini.clown.documents.Document;
import it.stefanochizzolini.clown.documents.Page;
import it.stefanochizzolini.clown.files.File;
import it.stefanochizzolini.clown.tools.TextExtractor;
import java.util.HashMap;
import java.util.Map;
public class AdvancedPlainTextExtractionSample
extends Sample
{
@Override
public boolean run(
)
{
String filePath = promptPdfFileChoice("Please select a PDF file");
// 1. Open the PDF file!
File file;
try
{file = new File(filePath);}
catch(Exception e)
{throw new RuntimeException(filePath + " file access error.",e);}
// 2. Get the PDF document!
Document document = file.getDocument();
// 3. Extracting plain text from the document pages...
TextExtractor extractor = new TextExtractor();
for(Page page : document.getPages())
{
if(!prompt(page))
return false;
// Extract plain text from the current page!
System.out.println(extractor.extractPlain(page));
}
return true;
}
private boolean prompt(
Page page
)
{
int pageIndex = page.getIndex();
if(pageIndex > 0)
{
Map<String,String> options = new HashMap<String,String>();
options.put("", "Scan next page");
options.put("Q", "End scanning");
if(!promptChoice(options).equals(""))
return false;
}
System.out.println("\nScanning page " + (pageIndex+1) + "...\n");
return true;
}
}
3.2. Full text extraction
In this case text content is extracted along with its graphic attributes (font, font size, text color, text rendering mode, text bounding box…).
Note that, as we didn’t specify any particular page area, text strings are all gathered within the default area (the page itself), identified by the null key [line 40].
package it.stefanochizzolini.clown.samples;
import it.stefanochizzolini.clown.documents.Document;
import it.stefanochizzolini.clown.documents.Page;
import it.stefanochizzolini.clown.documents.contents.ITextString;
import it.stefanochizzolini.clown.files.File;
import it.stefanochizzolini.clown.tools.TextExtractor;
import java.awt.geom.Rectangle2D;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class AdvancedTextExtractionSample
extends Sample
{
@Override
public boolean run(
)
{
String filePath = promptPdfFileChoice("Please select a PDF file");
// 1. Open the PDF file!
File file;
try
{file = new File(filePath);}
catch(Exception e)
{throw new RuntimeException(filePath + " file access error.",e);}
// 2. Get the PDF document!
Document document = file.getDocument();
// 3. Extracting text from the document pages...
TextExtractor extractor = new TextExtractor();
for(Page page : document.getPages())
{
if(!prompt(page))
return false;
List<ITextString> textStrings = extractor.extract(page).get(null);
for(ITextString textString : textStrings)
{
Rectangle2D textStringBox = textString.getBox();
System.out.println(
"Text ["
+ "x:" + Math.round(textStringBox.getX()) + ","
+ "y:" + Math.round(textStringBox.getY()) + ","
+ "w:" + Math.round(textStringBox.getWidth()) + ","
+ "h:" + Math.round(textStringBox.getHeight())
+ "]: " + textString.getText()
);
}
}
return true;
}
private boolean prompt(
Page page
)
{
int pageIndex = page.getIndex();
if(pageIndex > 0)
{
Map<String,String> options = new HashMap<String,String>();
options.put("", "Scan next page");
options.put("Q", "End scanning");
if(!promptChoice(options).equals(""))
return false;
}
System.out.println("\nScanning page " + (pageIndex+1) + "...\n");
return true;
}
}
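As a side note on the null key used in the sample above: java.util.HashMap accepts a single null key, which makes it a convenient marker for "no specific area". The sketch below mimics that convention with plain strings. DefaultAreaSketch and its put method are hypothetical, and nothing here is meant to describe how TextExtractor builds its result internally.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DefaultAreaSketch {
  /** Files a text string under the given area; a null area stands for the whole page. */
  static void put(Map<Rectangle2D, List<String>> byArea, Rectangle2D area, String text) {
    List<String> texts = byArea.get(area);
    if (texts == null) {
      texts = new ArrayList<String>();
      byArea.put(area, texts); // HashMap accepts one null key.
    }
    texts.add(text);
  }

  public static void main(String[] args) {
    Map<Rectangle2D, List<String>> byArea = new HashMap<Rectangle2D, List<String>>();
    put(byArea, null, "Universal Declaration");
    put(byArea, null, "of Human Rights");
    put(byArea, new Rectangle2D.Double(0, 0, 100, 50), "Preamble");

    // Everything that was not assigned to a specific area sits under the null key.
    for (String text : byArea.get(null)) {
      System.out.println(text);
    }
  }
}
```

Note that not every Map implementation tolerates null keys (for instance, a TreeMap with natural ordering does not), so this convention depends on the map type in use.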
3.3. Page area filtering
Text filtering by page area can be done both before and after extracting the page text:
- pre-filtering: TextExtractor.getAreas()/setAreas(…) methods allow the user to define the relevant page areas before extracting the text;
- post-filtering: TextExtractor.filter(…) methods allow the user to select text by area from previously-extracted text (useful in case of multi-stage processing).
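To make the idea of post-filtering concrete, here is a minimal sketch that selects previously collected text items by area, with a tolerance that loosens the area boundary (similar in spirit to the setAreaTolerance call in the sample below). BoxedText and AreaFilterSketch are hypothetical, and the matching rule used here (the item's box must lie entirely inside the area grown by the tolerance) is an assumption, not necessarily the rule TextExtractor applies.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical extracted item: some text plus its bounding box.
final class BoxedText {
  final String text;
  final Rectangle2D box;
  BoxedText(String text, Rectangle2D box) {this.text = text; this.box = box;}
}

final class AreaFilterSketch {
  /**
    Selects the items whose box lies within the given area.
    The tolerance grows the area on every side before testing containment.
  */
  static List<BoxedText> filter(List<BoxedText> items, Rectangle2D area, double tolerance) {
    Rectangle2D grown = new Rectangle2D.Double(
      area.getX() - tolerance,
      area.getY() - tolerance,
      area.getWidth() + 2 * tolerance,
      area.getHeight() + 2 * tolerance);
    List<BoxedText> result = new ArrayList<BoxedText>();
    for (BoxedText item : items) {
      if (grown.contains(item.box)) {
        result.add(item);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<BoxedText> items = new ArrayList<BoxedText>();
    items.add(new BoxedText("click here", new Rectangle2D.Double(99, 99, 20, 12)));
    items.add(new BoxedText("unrelated", new Rectangle2D.Double(300, 100, 20, 10)));

    Rectangle2D linkBox = new Rectangle2D.Double(100, 100, 50, 10);
    for (BoxedText item : filter(items, linkBox, 2)) {
      System.out.println(item.text);
    }
  }
}
```

Without any tolerance, text that overshoots the area by a fraction of a point would be dropped, which is why a small slack on the boundary is useful when matching annotations to text.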
In this case we apply text filtering to a common task: retrieving the text associated with link annotations on a page [lines 80-82]. You may not know that text links on PDF pages are merely superimposed on the “associated” text, so location inference is needed to match the position of a link annotation with the respective text: tough work indeed.
package it.stefanochizzolini.clown.samples;
import it.stefanochizzolini.clown.documents.Document;
import it.stefanochizzolini.clown.documents.Page;
import it.stefanochizzolini.clown.documents.PageAnnotations;
import it.stefanochizzolini.clown.documents.contents.ITextString;
import it.stefanochizzolini.clown.documents.fileSpecs.FileSpec;
import it.stefanochizzolini.clown.documents.interaction.actions.Action;
import it.stefanochizzolini.clown.documents.interaction.actions.GoToDestination;
import it.stefanochizzolini.clown.documents.interaction.actions.GoToEmbedded;
import it.stefanochizzolini.clown.documents.interaction.actions.GoToNonLocal;
import it.stefanochizzolini.clown.documents.interaction.actions.GoToURI;
import it.stefanochizzolini.clown.documents.interaction.actions.GoToEmbedded.TargetObject;
import it.stefanochizzolini.clown.documents.interaction.annotations.Annotation;
import it.stefanochizzolini.clown.documents.interaction.annotations.Link;
import it.stefanochizzolini.clown.documents.interaction.navigation.document.Destination;
import it.stefanochizzolini.clown.files.File;
import it.stefanochizzolini.clown.objects.PdfObjectWrapper;
import it.stefanochizzolini.clown.tools.TextExtractor;
import java.awt.geom.Rectangle2D;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class LinkTextExtractionSample
extends Sample
{
@Override
public boolean run(
)
{
String filePath = promptPdfFileChoice("Please select a PDF file");
// 1. Open the PDF file!
File file;
try
{file = new File(filePath);}
catch(Exception e)
{throw new RuntimeException(filePath + " file access error.",e);}
// 2. Get the PDF document!
Document document = file.getDocument();
// 3. Extracting links text from the document pages...
TextExtractor extractor = new TextExtractor();
extractor.setAreaTolerance(2); // 2 pt tolerance on area boundary detection.
for(Page page : document.getPages())
{
if(!prompt(page))
return false;
Map<Rectangle2D,List<ITextString>> textStrings = null;
// Get the page annotations!
PageAnnotations annotations = page.getAnnotations();
if(annotations == null)
{
System.out.println("No annotations here.");
continue;
}
boolean linkFound = false;
for(Annotation annotation : annotations)
{
if(annotation instanceof Link)
{
linkFound = true;
if(textStrings == null)
{textStrings = extractor.extract(page);}
Link link = (Link)annotation;
Rectangle2D linkBox = link.getBox();
/*
Extracting text superimposed by the link...
NOTE: As links have no strong relation to page text but a weak location correspondence,
we have to filter extracted text by link area.
*/
StringBuilder linkTextBuilder = new StringBuilder();
for(ITextString linkTextString : extractor.filter(textStrings,linkBox))
{linkTextBuilder.append(linkTextString.getText());}
System.out.println("Link '" + linkTextBuilder + "' ");
System.out.println(
" Position: "
+ "x:" + Math.round(linkBox.getX()) + ","
+ "y:" + Math.round(linkBox.getY()) + ","
+ "w:" + Math.round(linkBox.getWidth()) + ","
+ "h:" + Math.round(linkBox.getHeight())
);
System.out.print(" Target: ");
PdfObjectWrapper<?> target = link.getTarget();
if(target instanceof Destination)
{printDestination((Destination)target);}
else if(target instanceof Action)
{printAction((Action)target);}
else if(target == null)
{System.out.println("[not available]");}
else
{System.out.println("[unknown type: " + target.getClass().getSimpleName() + "]");}
}
}
if(!linkFound)
{
System.out.println("No links here.");
continue;
}
}
return true;
}
private void printAction(
Action action
)
{
System.out.println("Action [" + action.getClass().getSimpleName() + "] " + action.getBaseObject());
if(action instanceof GoToDestination<?>)
{
if(action instanceof GoToNonLocal<?>)
{
FileSpec fileSpec = ((GoToNonLocal<?>)action).getFileSpec();
if(fileSpec != null)
{System.out.println(" Filename: " + fileSpec.getFilename());}
if(action instanceof GoToEmbedded)
{
TargetObject target = ((GoToEmbedded)action).getTarget();
System.out.println(" EmbeddedFilename: " + target.getEmbeddedFileName() + " Relation: " + target.getRelation());
}
}
System.out.print(" ");
printDestination(((GoToDestination<?>)action).getDestination());
}
else if(action instanceof GoToURI)
{System.out.println(" URI: " + ((GoToURI)action).getURI());}
}
private void printDestination(
Destination destination
)
{
System.out.println(destination.getClass().getSimpleName() + " " + destination.getBaseObject());
System.out.print(" Page ");
Object pageRef = destination.getPageRef();
if(pageRef instanceof Page)
{
Page refPage = (Page)pageRef;
System.out.println((refPage.getIndex()+1) + " [ID: " + refPage.getBaseObject() + "]");
}
else
{System.out.println(((Integer)pageRef+1));}
}
private boolean prompt(
Page page
)
{
int pageIndex = page.getIndex();
if(pageIndex > 0)
{
Map<String,String> options = new HashMap<String,String>();
options.put("", "Scan next page");
options.put("Q", "End scanning");
if(!promptChoice(options).equals(""))
return false;
}
System.out.println("\nScanning page " + (pageIndex+1) + "...\n");
return true;
}
}












