Waiting for PDF Clown 0.1.2 release
[NOTE: this post was updated on February 9, 2012]
1. PDF Clown DOM Inspector
Since its earliest versions, PDF Clown has shipped with a simple Swing-based proof of concept for viewing PDF file structures. Now that little fledgling is going to become a comprehensive tool for visually editing the structure of PDF files: PDF Clown DOM Inspector. It will be part of the upcoming 0.1.2 version as a dedicated project within the PDF Clown distribution.

This tool conforms to the PDF model as defined by PDF Clown (see the diagram above), which adheres to the official PDF Reference 1.7/ISO 32000-1. This implies that a PDF file is represented through several concurrent views which work at different abstraction levels: Document view (document layer), File view (file/object layer, hierarchical) and XRef view (file/object layer, flat).
1.1. Document view
Document view (see the left pane in the above screenshot) shows the high-level structure of a PDF file. When you select a node, its data is shown in the right pane through several views; in this case, selecting a page node shows its content stream structure (Contents view, see below) and its rendering (Render view [¹], see above). Note that the page model represented by both Contents view and Render view corresponds to the content (sub)layer described in the diagram above.
Here is just one of the possible features: when you hover the mouse pointer over a show-text-operation node, a tooltip pops up revealing the actual text encoded inside it (in this example, inspecting a Russian-language document):
There's such potential for custom features that I'm considering making it pluggable, so that users can extend it with additional modules as they wish.
1.2. File view
File view shows the low-level representation of the same entities you found in the above-mentioned Document view, expressed as primitive objects like dictionaries (PdfDictionary), arrays (PdfArray), streams (PdfStream) and so on.
1.3. XRef view
XRef view lists the entries of the cross-reference index (either table or stream, but that’s a technical detail you can happily ignore as it’s transparently handled by the library).
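To give an idea of what the XRef view is listing, here is a minimal sketch of how one entry of a classic cross-reference table could be parsed. It is not PDF Clown's parser: the class and method names (XRefEntrySketch, parse) are hypothetical, and the only thing taken from the PDF specification is the entry layout, that is a 10-digit byte offset, a 5-digit generation number and a keyword (n for in-use, f for free).

```java
/** Hypothetical model of one classic cross-reference table entry (not a PDF Clown class). */
final class XRefEntrySketch {
    final long offset;     // Byte offset of the object ('n'), or next free object number ('f').
    final int generation;  // Generation number of the object.
    final boolean inUse;   // true for 'n' (in use), false for 'f' (free).

    private XRefEntrySketch(long offset, int generation, boolean inUse) {
        this.offset = offset;
        this.generation = generation;
        this.inUse = inUse;
    }

    /** Parses one entry such as "0000000017 00000 n". */
    static XRefEntrySketch parse(String line) {
        String[] parts = line.trim().split("\\s+");
        boolean wellFormed = parts.length == 3
            && parts[0].matches("\\d{10}")
            && parts[1].matches("\\d{5}")
            && (parts[2].equals("n") || parts[2].equals("f"));
        if (!wellFormed) {
            throw new IllegalArgumentException("Malformed xref entry: " + line);
        }
        return new XRefEntrySketch(
            Long.parseLong(parts[0]), Integer.parseInt(parts[1]), parts[2].equals("n"));
    }

    public static void main(String[] args) {
        XRefEntrySketch entry = parse("0000000017 00000 n");
        System.out.println(entry.offset + " " + entry.generation + " " + entry.inUse);
    }
}
```

Cross-reference streams carry the same information in a compressed binary form, which is why the library can present both variants through a single flat list.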
It's worth noting that all the views (Document, File, XRef) are always kept synchronized: when you select a node in one of these views, its corresponding entities in each of the others are automatically selected, allowing you to switch seamlessly from one view to another.
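One common way to obtain this kind of synchronization is a selection model shared by all the views. The following is only a sketch under that assumption, not the DOM Inspector's actual code: SharedSelectionSketch and its methods are hypothetical names, and a selection is reduced to a plain object number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

/** Hypothetical selection model shared by several views (not the DOM Inspector's code). */
final class SharedSelectionSketch {
    private final List<IntConsumer> listeners = new ArrayList<>();
    private int selectedObjectNumber = -1; // -1 means "nothing selected".

    /** Each view registers a callback that moves its own highlight. */
    void addListener(IntConsumer listener) {
        listeners.add(listener);
    }

    /** Called by whichever view the user clicked in. */
    void select(int objectNumber) {
        if (objectNumber == selectedObjectNumber) {
            return; // Already selected: stops views from re-notifying each other forever.
        }
        selectedObjectNumber = objectNumber;
        for (IntConsumer listener : listeners) {
            listener.accept(objectNumber);
        }
    }

    int getSelectedObjectNumber() {
        return selectedObjectNumber;
    }
}
```

Each view would both call select(...) on user input and listen for changes, so none of them needs to know about the others.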
[¹] Rendering is still partial as it’s under development (pre-alpha stage).
2. Text line alignment
Building on a much-appreciated code contribution by Manuel Guilbault, text line alignment now supports all the standard modes commonly available in typesetting environments (Top, Middle, Bottom, Super (absolute/relative) and Sub (absolute/relative)), as well as image inlining.
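As a rough illustration of what the three basic modes mean geometrically, here is a small sketch that places an object (a text chunk or an inline image) inside a line box. Everything in it is an assumption made for the example: the LineAlignmentSketch class, the Mode enum and the convention of measuring from the top of the line are not PDF Clown's API, and the Super and Sub modes are left out because their offsets depend on parameters not described here.

```java
/** Hypothetical alignment helper; names and coordinate convention are assumptions. */
final class LineAlignmentSketch {
    enum Mode { TOP, MIDDLE, BOTTOM }

    /** Returns the distance from the top of the line box to the top of the object. */
    static double offsetFromLineTop(Mode mode, double lineHeight, double objectHeight) {
        switch (mode) {
            case TOP:
                return 0;
            case MIDDLE:
                return (lineHeight - objectHeight) / 2;
            case BOTTOM:
                return lineHeight - objectHeight;
            default:
                throw new IllegalArgumentException("Unsupported mode: " + mode);
        }
    }
}
```

For instance, a 10-unit-high image in a 20-unit-high line sits 5 units below the line top when middle-aligned and 10 units below it when bottom-aligned.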
3. File references (file specifications, file identifiers, PDF stream object externalization)
Spurred by an engaging user request, file specification management (now modelled in the org.pdfclown.documents.files namespace instead of the old org.pdfclown.documents.fileSpec) has been thoroughly revised to smoothly support importing and exporting PDF stream objects from and to external files.
In practice, this means that, instead of embedding stream data directly into a PDF file, such data can reside in an external (local or remote) file and be linked from within the PDF file through a file specification object (org.pdfclown.documents.files.FileSpecification). Thus common resources such as images can be shared among multiple documents (useful, for example, in a server scenario where documents may be assembled on-the-fly).
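At the file level, an externalized stream is simply a stream object whose dictionary carries an F entry holding a file specification; the bytes between the stream and endstream keywords are then ignored by the reader. The sketch below prints such an object in raw PDF syntax, using the simplest form of file specification (a plain string). It is an illustration only: ExternalStreamSketch and its methods are hypothetical helpers, not part of PDF Clown, which generates this for you.

```java
/** Hypothetical writer of the raw syntax of an externalized stream object (illustration only). */
final class ExternalStreamSketch {
    /** Escapes the characters that are special inside a PDF literal string. */
    static String escapeLiteral(String text) {
        return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }

    /** Builds an indirect stream object whose data lives in the given external file. */
    static String externalStreamObject(int objectNumber, int generation, String filePath) {
        return objectNumber + " " + generation + " obj\n"
            + "<< /Length 0 /F (" + escapeLiteral(filePath) + ") >>\n" // No embedded data.
            + "stream\n"
            + "endstream\n"
            + "endobj\n";
    }

    public static void main(String[] args) {
        System.out.print(externalStreamObject(12, 0, "shared/logo.jpg"));
    }
}
```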
There is, however, a caveat to consider before adopting externalized streams: as they are prone to security issues, their actual support by PDF viewers is very restricted (e.g., see so-called "privileged locations" in Adobe Acrobat's Enhanced Security preferences) or even non-existent (e.g., see Evince).
Here is a code sample demonstrating how external references are applied to PDF stream objects:
- PDF stream data is exported and linked back [the setDataFile call in step 1.2];
- linked files are imported back into their respective PDF stream objects [the setDataFile call in step 2.2].
package org.pdfclown.samples.cli;

import org.pdfclown.documents.Document;
import org.pdfclown.documents.files.FileSpecification;
import org.pdfclown.files.File;
import org.pdfclown.files.SerializationModeEnum;
import org.pdfclown.objects.PdfDataObject;
import org.pdfclown.objects.PdfIndirectObject;
import org.pdfclown.objects.PdfStream;

/**
  This sample demonstrates how to move stream data outside PDF files and keep external
  references to them; it demonstrates also the inverse process (reimporting stream data
  from external files).
  Note that, due to security concerns, external streams are a discouraged feature which
  is often unsupported on third-party viewers and disabled by default on recent Adobe
  Acrobat versions; in the latter case, in order to bypass restrictions and allow access
  to external streams, users have to enable Enhanced Security from the Preferences dialog,
  specifying privileged locations.

  @author Stefano Chizzolini (http://www.stefanochizzolini.it)
  @since 0.1.2
  @version 0.1.2, 01/29/12
*/
public class StreamExternalizationSample
  extends Sample
{
  @Override
  public boolean run(
    )
  {
    // 1. Externalizing the streams...
    String externalizedFilePath;
    {
      // 1.1. Opening the PDF file...
      File file;
      {
        String filePath = promptPdfFileChoice("Please select a PDF file");
        try
        {file = new File(filePath);}
        catch(Exception e)
        {throw new RuntimeException(filePath + " file access error.",e);}
      }
      Document document = file.getDocument();
      /*
        NOTE: As we are going to export streams using paths relative to the output path,
        it's necessary to ensure they are properly resolved (otherwise they will be
        written relative to the current user directory).
      */
      file.setPath(getOutputPath());

      // 1.2. Iterating through the indirect objects to externalize streams...
      int filenameIndex = 0;
      for(PdfIndirectObject indirectObject : file.getIndirectObjects())
      {
        PdfDataObject dataObject = indirectObject.getDataObject();
        if(dataObject instanceof PdfStream)
        {
          PdfStream stream = (PdfStream)dataObject;
          if(stream.getDataFile() == null) // Internal stream to externalize.
          {
            stream.setDataFile(
              FileSpecification.get(
                document,
                getClass().getSimpleName() + "-external" + filenameIndex++
                ),
              true // Forces the stream data to be transferred to the external location.
              );
          }
        }
      }

      // 1.3. Serialize the PDF file!
      externalizedFilePath = serialize(file, SerializationModeEnum.Standard);
    }

    // 2. Reimporting the externalized streams...
    {
      // 2.1. Opening the PDF file...
      File file;
      try
      {file = new File(externalizedFilePath);}
      catch(Exception e)
      {throw new RuntimeException(externalizedFilePath + " file access error.",e);}

      // 2.2. Iterating through the indirect objects to internalize streams...
      for(PdfIndirectObject indirectObject : file.getIndirectObjects())
      {
        PdfDataObject dataObject = indirectObject.getDataObject();
        if(dataObject instanceof PdfStream)
        {
          PdfStream stream = (PdfStream)dataObject;
          if(stream.getDataFile() != null) // External stream to internalize.
          {
            stream.setDataFile(
              null,
              true // Forces the stream data to be transferred to the internal location.
              );
          }
        }
      }

      // 2.3. Serialize the PDF file!
      String externalizedFileName = new java.io.File(externalizedFilePath).getName();
      String internalizedFilePath = externalizedFileName.substring(0, externalizedFileName.indexOf(".pdf")) + "-reimported.pdf";
      serialize(file, internalizedFilePath, SerializationModeEnum.Standard);
    }
    return true;
  }
}
Work on file specifications also involved adding support for file identifiers (PDF 1.7, § 10.3; modelled by the org.pdfclown.files.FileIdentifier class), which enforce referential integrity on document interchange. Their generation and update are now part of the document life cycle automatically managed by PDF Clown.
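For context, a file identifier is stored in the file trailer as an ID array of two byte strings: the first is fixed when the file is created, the second changes each time the file is saved. The PDF specification suggests deriving them with MD5 from values such as the current time, the file location, the file size and the document information entries. The sketch below follows that suggestion in a simplified form; it is not how org.pdfclown.files.FileIdentifier is implemented, and the class name, method names and sample input values are all made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical file identifier generator (not PDF Clown's FileIdentifier class). */
final class FileIdentifierSketch {
    /** Hashes the given values in order and returns the MD5 digest as lowercase hex. */
    static String md5Hex(String... parts) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            for (String part : parts) {
                digest.update(part.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest is unavailable", e);
        }
    }

    /** Formats the trailer entry: permanent identifier first, per-save identifier second. */
    static String idEntry(String baseId, String versionId) {
        return "/ID [<" + baseId + "> <" + versionId + ">]";
    }

    public static void main(String[] args) {
        // Made-up inputs: timestamp, file location and file size in bytes.
        String baseId = md5Hex("2012-01-29T10:00:00Z", "/tmp/sample.pdf", "1024");
        String versionId = md5Hex("2012-02-09T18:30:00Z", "/tmp/sample.pdf", "2048");
        System.out.println(idEntry(baseId, versionId));
    }
}
```

A file specification that records the identifier of its target can then be checked against the ID array of the file it finds, which is how identifiers help detect that a linked file has been replaced.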