PDF Explained

A guide on how PDFs work at a low level


Data Types in PDFs

Integers and Real Numbers

An integer is composed of one or more decimal digits (0 through 9), optionally preceded by a plus or minus sign.

0 +1 -1 69

A real number is composed of one or more decimal digits (0 through 9), optionally preceded by a plus or minus sign, and containing a single decimal point (.).

0.1 .25 -0.007 4.20

Be aware that exponential notation (such as 1e5) is not supported.
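The two rules above can be sketched as a pair of regular expressions. This is a minimal illustration rather than a full tokenizer; the names `INTEGER_RE`, `REAL_RE` and `parse_number` are made up for this example.

```python
import re

# Optional sign followed by digits only.
INTEGER_RE = re.compile(r"[+-]?\d+")
# Optional sign, digits with exactly one decimal point, no exponent.
REAL_RE = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")


def parse_number(token: str):
    """Return an int or a float for a PDF numeric token, else raise ValueError."""
    if INTEGER_RE.fullmatch(token):
        return int(token)
    if REAL_RE.fullmatch(token):
        return float(token)
    raise ValueError(f"not a PDF number: {token!r}")
```

With this sketch, `parse_number("+1")` gives an integer, `parse_number(".25")` gives a float, and `parse_number("1e5")` is rejected because exponents are not part of the syntax.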

Strings

A string is a series of bytes, written between parentheses.

(Hello World!)

The backslash \ escapes special characters, such as parentheses and the backslash itself, and it also introduces special character sequences (see the table below).

(Hello \\ world \( but with escaped characters\))

Character sequence | Meaning
------------ | -------------
\n | Line feed
\r | Carriage return
\t | Horizontal tab
\b | Backspace
\f | Form feed
\ddd | The character whose code is given by up to three octal digits
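As a rough illustration of how a reader might undo these escapes, the sketch below decodes the bytes found between the outer parentheses of a literal string. The function name `decode_literal_string` is hypothetical, and the sketch deliberately ignores line continuations and unescaped nested parentheses.

```python
# Map the byte after a backslash to the byte it stands for.
SIMPLE_ESCAPES = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09,
    ord("b"): 0x08, ord("f"): 0x0C,
    ord("("): ord("("), ord(")"): ord(")"), ord("\\"): ord("\\"),
}


def decode_literal_string(body: bytes) -> bytes:
    """Decode escape sequences in the body of a PDF literal string."""
    out = bytearray()
    i = 0
    while i < len(body):
        byte = body[i]
        if byte != ord("\\"):
            out.append(byte)
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break  # lone trailing backslash: dropped in this sketch
        nxt = body[i]
        if nxt in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[nxt])
            i += 1
        elif ord("0") <= nxt <= ord("7"):
            # Collect up to three octal digits.
            j = i
            while j < len(body) and j < i + 3 and ord("0") <= body[j] <= ord("7"):
                j += 1
            out.append(int(body[i:j].decode("ascii"), 8) & 0xFF)
            i = j
        else:
            # Unknown escape: drop the backslash, keep the character.
            out.append(nxt)
            i += 1
    return bytes(out)
```

The result is kept as bytes because a PDF string is a byte sequence; how those bytes map to text depends on where the string is used.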

Hexadecimal Strings

Strings can also be encoded as a sequence of hexadecimal digits between < and > characters, with each pair of digits representing one byte. If the number of digits is odd, a trailing 0 is assumed to complete the last pair.

<504446732061726520636f6f6c21> % Converts to 'PDFs are cool!'
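A hexadecimal string can be decoded in a few lines. The helper below is a minimal sketch (the name `decode_hex_string` is invented here); it pads an odd number of digits with a trailing 0 as described above and skips whitespace between digits.

```python
def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hexadecimal string such as '<48656C6C6F>'."""
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("a hexadecimal string must be wrapped in < and >")
    # Whitespace between the digits carries no meaning, so remove it.
    digits = "".join(token[1:-1].split())
    if len(digits) % 2:
        digits += "0"  # an odd final digit is followed by an assumed 0
    return bytes.fromhex(digits)
```

For example, `decode_hex_string("<48656C6C6F>")` yields the bytes of `Hello`, and `decode_hex_string("<901FA>")` is read as the three bytes 90, 1F and A0.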

Dates

Dates are strings written in the particular format shown below.

(D:YYYYMMDDHHmmSSOHH'mm)

Key | Value
------------ | -------------
YYYY | The year
MM | The month
DD | The day
HH | The hour
mm | The minute
SS | The second
O | Timezone sign: + or -, or Z for UTC
HH | The hours offset of the timezone; if present, it must be followed by an apostrophe '
mm | The minutes offset of the timezone
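To make the layout concrete, here is a hedged sketch that turns a date string (without its surrounding parentheses) into a Python `datetime`. It assumes every field through the seconds is present; the names `DATE_RE` and `parse_pdf_date` are illustrative only. Some files end the offset with a second apostrophe, which the pattern tolerates.

```python
import re
from datetime import datetime, timedelta, timezone

DATE_RE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"  # YYYY MM DD HH mm SS
    r"(Z|[+-]\d{2}'\d{2}'?)?"                        # optional timezone part
)


def parse_pdf_date(text: str) -> datetime:
    """Parse a full-length PDF date such as D:20240131235959+01'00."""
    match = DATE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported PDF date: {text!r}")
    *fields, tz_text = match.groups()
    year, month, day, hour, minute, second = map(int, fields)
    if tz_text is None:
        tzinfo = None  # no timezone given: return a naive datetime
    elif tz_text == "Z":
        tzinfo = timezone.utc
    else:
        sign = -1 if tz_text[0] == "-" else 1
        hours, minutes = int(tz_text[1:3]), int(tz_text[4:6])
        tzinfo = timezone(sign * timedelta(hours=hours, minutes=minutes))
    return datetime(year, month, day, hour, minute, second, tzinfo=tzinfo)
```

Shortened dates, which real files may contain, are rejected by this sketch rather than guessed at.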

Names

Names are words preceded by a forward slash / and are used throughout a document, most commonly as keys for Dictionaries (see below). They can contain #xx hexadecimal escapes (similar to percent-encoding in URLs), which the reader decodes when needed.

/Name
/Name#20with#20spaces % -> 'Name with spaces'
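Decoding the #xx escapes is a simple substitution. The sketch below uses a hypothetical helper called `decode_name` and, as a simplification, maps each escaped byte straight to the character with the same code.

```python
import re


def decode_name(token: str) -> str:
    """Strip the leading slash from a PDF name and expand its #xx escapes."""
    if not token.startswith("/"):
        raise ValueError("a name must start with /")
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),  # two hex digits -> one character
        token[1:],
    )
```

Here `decode_name("/Name#20with#20spaces")` returns `Name with spaces`.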

Booleans

Booleans are written with the keywords true and false and behave exactly as you would expect.

Arrays

Arrays group other data in an ordered manner. The elements don’t all have to be of the same type, and an array can contain other arrays.

[0 0 600 800]
[/Red 42 [true (Hi!)]]

Dictionaries

Dictionaries map keys, written as names, to other data in an unordered manner. They are delimited by << at the beginning and >> at the end. Like arrays, dictionaries can contain other dictionaries and any other data type.

<<  /Hello /World
    /Flag true
    /Numbers <<
        /One 1
        /Two 2
        /Three 3
    >>
    /Words [(Hi) (Hello) (Sup)]
>>
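Because arrays and dictionaries nest, a recursive parser is a natural fit for them. The following is a small sketch under clear assumptions: it handles numbers, booleans, names, simple strings, arrays and dictionaries, but not hexadecimal strings or indirect references. Names and strings are kept as raw text, and `TOKEN_RE`, `parse_value` and `parse` are names invented for this example.

```python
import re

TOKEN_RE = re.compile(
    r"<<|>>|\[|\]"            # dictionary and array delimiters
    r"|\((?:[^()\\]|\\.)*\)"  # literal string without nested parentheses
    r"|/[^\s/\[\]<>()]*"      # name
    r"|[^\s/\[\]<>()]+"       # number, true, false
)


def parse_value(tokens, pos=0):
    """Parse one value starting at tokens[pos]; return (value, next position)."""
    token = tokens[pos]
    if token == "[":
        items, pos = [], pos + 1
        while tokens[pos] != "]":
            item, pos = parse_value(tokens, pos)
            items.append(item)
        return items, pos + 1
    if token == "<<":
        result, pos = {}, pos + 1
        while tokens[pos] != ">>":
            key = tokens[pos]
            if not key.startswith("/"):
                raise ValueError(f"dictionary key must be a name: {key!r}")
            result[key], pos = parse_value(tokens, pos + 1)
        return result, pos + 1
    if token in ("true", "false"):
        return token == "true", pos + 1
    if token.startswith(("(", "/")):
        return token, pos + 1  # strings and names stay as raw text here
    return (float(token) if "." in token else int(token)), pos + 1


def parse(text):
    """Parse a single PDF value from text."""
    value, _ = parse_value(TOKEN_RE.findall(text))
    return value
```

Applied to `[/Red 42 [true (Hi!)]]`, this produces a Python list holding the name, the integer, and a nested list; a dictionary becomes a Python `dict` keyed by its names.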

Indirect References

As described in the PDF structure section, PDFs are made of a series of objects. Indirect references link these objects together. Each reference is composed of the object number, followed by the generation number, and ends with a capital R.

6 0 R
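One way to picture an indirect reference is as a key into a table of objects. The sketch below parses the textual form and looks the target up in a plain dictionary; `Reference`, `parse_reference` and the sample `objects` table are all hypothetical.

```python
import re
from typing import NamedTuple


class Reference(NamedTuple):
    object_number: int
    generation: int


REF_RE = re.compile(r"(\d+)\s+(\d+)\s+R")


def parse_reference(text: str) -> Reference:
    """Parse an indirect reference such as '6 0 R'."""
    match = REF_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an indirect reference: {text!r}")
    return Reference(int(match.group(1)), int(match.group(2)))


# A made-up object table: reference -> already parsed object.
objects = {Reference(6, 0): {"/Type": "/Page"}}
page = objects[parse_reference("6 0 R")]
```

A real reader would build that table from the cross-reference data of the file instead of writing it by hand.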

Streams

Streams are used to store binary data, like images and fonts, but also text and graphics. A stream is made of a dictionary, followed by the stream keyword, the data itself, and then the endstream keyword. The dictionary has to contain the /Length key, whose value is the number of bytes in the stream data. Additionally, it can contain a /Filter key, which names the compression or encoding method applied to the data (streams are almost always compressed); see the table below for more info. Filters can be combined, in which case they have to be listed in an array.

4 0 obj
<<
/Length 64
>>
stream
1. 0. 0. 1. 50. 700. cm
BT
  /F0 36. Tf
  (Hello World!) Tj
ET
endstream
endobj
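Getting /Length right by hand is error-prone, so writers usually compute it. The sketch below serializes a stream object and fills in /Length from the actual data, optionally compressing it with zlib first; `build_stream_object` is a hypothetical helper, and the generation number is assumed to be 0.

```python
import zlib


def build_stream_object(number: int, data: bytes, compress: bool = False) -> bytes:
    """Serialize data as an indirect stream object with a matching /Length."""
    entries = []
    if compress:
        data = zlib.compress(data)
        entries.append(b"/Filter /FlateDecode")
    # /Length counts the bytes as stored, that is, after compression.
    entries.insert(0, b"/Length %d" % len(data))
    header = b"%d 0 obj\n<<\n" % number
    dictionary = b"\n".join(entries)
    # The newline before endstream is not counted in /Length.
    return header + dictionary + b"\n>>\nstream\n" + data + b"\nendstream\nendobj\n"
```

Calling `build_stream_object(4, b"BT ET")` produces an object whose dictionary contains `/Length 5`.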

Method name | Description
------------ | -------------
/ASCIIHexDecode | Similar to hexadecimal strings: every pair of hexadecimal digits encodes one byte of uncompressed data
/ASCII85Decode | Encodes the data using only printable 7-bit ASCII characters, writing every 4 bytes as 5 characters
/LZWDecode | Marks the stream as using Lempel-Ziv-Welch compression
/FlateDecode | Marks the stream as using Deflate compression, as implemented by the open source zlib library
/RunLengthDecode | A simple byte-based run-length compression
/CCITTFaxDecode | Implements Group 3 and Group 4 encoding, as used in fax machines
/JBIG2Decode | A modern alternative to /CCITTFaxDecode; implements JBIG2 compression
/DCTDecode | JPEG lossy compression; a whole JPEG file, including headers, may be put in this stream
/JPXDecode | JPEG 2000 lossy and lossless compression
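When several filters are listed, a reader undoes them one after another in the order given. The following sketch covers only two of the filters from the table, using Python's standard `zlib` module for /FlateDecode; `ascii_hex_decode`, `DECODERS` and `decode_stream` are names made up for this illustration.

```python
import zlib


def ascii_hex_decode(data: bytes) -> bytes:
    """Undo /ASCIIHexDecode: hex digit pairs, terminated by '>'."""
    digits = b"".join(data.split())    # whitespace is ignored
    digits = digits.split(b">", 1)[0]  # '>' marks the end of the data
    if len(digits) % 2:
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


DECODERS = {
    "/ASCIIHexDecode": ascii_hex_decode,
    "/FlateDecode": zlib.decompress,
}


def decode_stream(data: bytes, filters) -> bytes:
    """Apply one filter name, or a list of them, in the order given."""
    if isinstance(filters, str):
        filters = [filters]
    for name in filters:
        if name not in DECODERS:
            raise NotImplementedError(f"filter not covered by this sketch: {name}")
        data = DECODERS[name](data)
    return data
```

For a stream whose dictionary says `/Filter [/ASCIIHexDecode /FlateDecode]`, the data is first turned back from hexadecimal into bytes and then inflated.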