Software Developer, Chemist

PDF is dead, long live PDF

A vector illustration of a red-tinted document.

About a year ago I posted on Facebook saying that people should never use the PDF file format, least of all in academia. Now that I am older and wiser, I think I need to soften that position slightly.

This piece is the first in what is likely to be a very long series of articles writing up the notes I made during my relatively brief stint in academia. Those notes hold about a year's worth of blog posts, with thoughts on everything from the Unicode standard to PDF files and a design for a new kind of operating system.

The irony is not lost on me that I only now have time to write these notes up because I have left academia. I am no longer looking to publish these items in academic journals and so I do not need to get the ideas to any specific point of completeness, nor implement the ideas in order to publish them with any specific gravitas. Funnily enough, I also have a number of things to write about academic publishing and career structure too.

PDF files are not nearly as horrible as I had previously believed, and yet they are also exactly as horrible as I had believed. The problem really stems from what one believes they are for, what one believes the nature of information is and, implicit in this, what the purpose of academic publishing is.

I feel an illustrative parable is in order.

A mathematician, an artist and a computer scientist are in front of a page. On it are some markings. "This is an equation" says the mathematician. "This is a series of strokes in a calligraphic pen" states the artist. "This is a binary plane in which some values are 0 and others are 1", remarks the computer scientist.

All three, of course, have a different understanding of the same item, and some of that comes from their perspective. Our perspective in research comes primarily from what we view a document as being for. Publishers view their publications as a means of broadcasting information from scientists to other scientists, without necessarily caring, or needing to care, that it might be desirable for the consumer of that information to be a machine.

For the purpose of broadcasting to human readers, a PDF is actually ideal. It has a very small data footprint for the information it contains, and because of its internal data structure it is very light on the processor to parse. The problem for academics is that the additional information about what exactly the document contains, for example “This is an equation and not an image”, is lost, making the automated comprehension of information broadcast in this way an extremely challenging problem.

Take, as a concrete example, the typesetting habit of placing a hyphen at the end of a line to split a word across two lines. This convention maintains readability without altering the meaning of the word we have split. Nevertheless, if you take a PDF in which this convention has been followed and search for a word which has been split in this manner, the search will typically pass over the split instances. This is because the PDF file, and the programs which read it, have no notion of a word, much less one which has been split, and so the information that the word exists is lost in those cases.

An illustration of a dash at the end of a typeset document.
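The hyphenation problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the string `extracted` is a made-up stand-in for text pulled out of a PDF, and the function names are hypothetical rather than part of any PDF library.

```python
import re

# Hypothetical text as it might come out of a PDF text extractor:
# "information" has been split across two lines with a hyphen.
extracted = "the additional infor-\nmation is lost when typeset"


def naive_search(text: str, word: str) -> bool:
    """Plain substring search, with no notion of what a word is."""
    return word in text


def dehyphenate(text: str) -> str:
    """Join word halves separated by a hyphen at the end of a line.

    This is a heuristic: it cannot tell a typesetting hyphen from a
    genuine one, so "well-\\nknown" becomes "wellknown".
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def repaired_search(text: str, word: str) -> bool:
    """Search after guessing which line-end hyphens were typesetting only."""
    return word in dehyphenate(text)


print(naive_search(extracted, "information"))     # False
print(repaired_search(extracted, "information"))  # True
print(dehyphenate("a well-\nknown result"))       # a wellknown result
```

The last line is the interesting one. Once the document has been typeset, a program can only guess whether a hyphen belonged to the word or to the layout, which is exactly the kind of information that has been thrown away.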

This problem is compounded in older versions of PDF, which lacked support for some international characters, so makers of PDF files had to come up with fudges to make those characters appear. While the writing appears correct to a human viewer, these fudges impede or prevent searching the PDF for those words and characters.
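As a rough analogy for how text can look right and still fail to match, consider an accented letter stored as a base letter plus a separate accent. This is an assumption made for illustration, using Unicode combining characters in Python; the fudges found in real PDF files vary and may look nothing like this.

```python
import unicodedata

# What the reader types into the search box: "café" with a single
# precomposed character for the accented letter.
search_term = "caf\u00e9"

# Hypothetical stored text: the same word drawn as a plain "e"
# followed by a separate combining acute accent. On screen the two
# forms look identical.
page_text = "a review of cafe\u0301 culture"


def naive_find(text: str, term: str) -> bool:
    """Compare code points exactly, as a simple search would."""
    return term in text


def normalized_find(text: str, term: str) -> bool:
    """Compare after putting both strings into Unicode NFC form."""
    nfc_text = unicodedata.normalize("NFC", text)
    nfc_term = unicodedata.normalize("NFC", term)
    return nfc_term in nfc_text


print(naive_find(page_text, search_term))       # False
print(normalized_find(page_text, search_term))  # True
```

Normalization only rescues the search here because both forms are standard Unicode. A fudge that maps a glyph to an arbitrary code, with no record of which character it stands for, leaves nothing to normalize.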

The PDF is, therefore, a print-like document. It serves the same purpose as a printed document, but in digital format. This is not dissimilar to the relationship between the source code of a computer program and the program that most people actually install on their computers. Few people have an interest in compiling programs from source code (a slow and error-prone process); they simply wish to receive the program already compiled into machine code. If you were to open the compiled version of any reasonably complex program, however, editing it would be a challenge even for the most experienced programmers. The semantic meaning has been removed from the distribution.

So why would we want computers to be able to differentiate content in these documents? There are many potential applications in this area. One which springs immediately to mind is searchability. Chemistry researchers' work, for instance, would be made much more straightforward if they could determine, without depending on human-intensive processes, whether a molecule has previously appeared in the literature. Searching for known equations in an unambiguous way would also be of great help to many scientists. These tools would be much more feasible if the information were stored in a format in which the specifics of the data were digestible by machines. Additionally, it would be much easier to produce survey statistics of the literature.

Of course, this is not without human cost, and will likely necessitate putting some of this semantic information back into old PDF files. However, I would suggest that this definite, finite cost would be much lower than the open-ended cost in human effort of trying to survey an ever-growing body of literature without these tools in place.

One problem remains, however: the alternatives as they currently stand are poor. Those which exist have flawed semantic models for the information they contain. As a compounding issue, they have frequently not been designed or implemented in a way that lets future editions of the same tools retroactively correct the mistakes which have been made. HTML, for instance, has a peculiar way of handling lists, formatting for paged documentation is difficult at best, and its implementation of mathematics is extraordinarily ambiguous for some very simple cases, let alone more complex ones. This is true to the point where well-known sites like Wikipedia still use the notation common in the typesetting program LaTeX.

HTML does have the advantage over many data formats that, at least in principle, the information about how things are displayed is kept separate from the information about the objects themselves, making parsing for understanding very straightforward.
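A small sketch of what that separation buys a machine reader, using Python's standard `html.parser` module. The sample markup and the `HeadingCollector` class are made up for illustration. The point is only that the parser can ask what an element is (a heading) while ignoring every instruction about how it looks.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3"}


class HeadingCollector(HTMLParser):
    """Collect the text of headings, ignoring all styling attributes."""

    def __init__(self) -> None:
        super().__init__()
        self._in_heading = False
        self.headings: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs (style, class and so on) describe appearance; we skip them.
        if tag in HEADING_TAGS:
            self._in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.headings[-1] += data


def collect_headings(html: str) -> list[str]:
    """Return the text of every h1 to h3 element, in document order."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


# Hypothetical sample document with presentation mixed in as attributes.
sample = (
    '<h1 style="color: red">Results</h1>'
    "<p>Some <b>bold</b> body text.</p>"
    '<h2 class="fancy">Methods</h2>'
)

print(collect_headings(sample))  # ['Results', 'Methods']
```

Nothing comparable is possible with an untagged PDF, where a heading is simply some text drawn in a larger font at a particular position on the page.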

With that in mind, LaTeX is a strong contender, but its parsing model makes implementing understanding engines complex. Its parsing model also makes it difficult in some cases to ascertain where one segment of an object ends and the next begins. Some of this stems from the fact that LaTeX is inescapably a typesetting program, and so mixes the code for how things appear with the code for what things are. It only really cares what things are insofar as this informs how they appear. This also leaves some ambiguity in the representation of mathematics, though not nearly so much as is found in MathML (Mathematical Markup Language).

XSL-FO files have many of the weaknesses of both LaTeX and HTML. All three standards have missing elements, which means they are not complete, even at a cursory level, for describing academic texts.

I do have designs for alternative tools, of course (otherwise I would have little to add to this discussion). It would be hubris of me to say that I have found the “one true way” to represent academic documents, but I do believe that I have found a good starting point, and one which would be much more amenable to extension into specific domains than current standards. In upcoming posts, I intend to discuss my alternative designs.