Here’s another of my beefs: publishing PDF on the web is lazy, bad practice.
PDF – portable document format – what does that mean? It means, here’s what you want to print … in a file. It’s a portable print format; not a native web document format and not an open document format. It looks pretty, but is substantially dumb. You can create TOCs and embed links, annotations etc., but almost all the document structure and semantics are lost, sacrificed to the holy grail of print replication.
Most often PDF is published on the web because it’s convenient and no hassle to do so. We’ve authored a pretty Word (or other) doc; we’ll print it to PDF and publish it on the web; it looks good … job done!
It looks good, but how does it feel? If you want to navigate it or extract meaning from it, it feels bad. There’s a place, a big place, for PDF, but it’s no substitute for an open web format which can be rich with meaning.
In the US Adobe is touting its technologies as good for open government; but Adobe is Bad for Open Government; bad for open anything.
I’m unsure about this. I think that PDF integration in browsers is in a woeful state, and that what the web could do with is something that displays natively in the browser like HTML but is still easily encapsulated for complete downloading. I’d also really miss accurate and reliable pagination if it were gone. I think lazy publishing is where you don’t think about the format you should use for publications, and that’s a bigger problem. One of the worst uses of PDF I’ve ever seen was when a law firm emailed a PDF letter for a copyright cease and desist. The email effectively read “see attached document” and the content of the message was in the attachment, which is clearly insane behaviour. That’s not what I think PDF should be used for, though it is better than HTML at keeping assorted bits of information together.
It really helps the opengov thing that (old) PDF is an open standard. I think the strength of PDF is that it is very flexible: it can hold the text of a newspaper article just as easily as the scanned image of a report, and that simply can’t be done in HTML.
There are lots of positive things to be said for PDFs, particularly where what you want is a standardised, authenticated product, a hymn-sheet where everyone can be sure of being on the right page, at the right line, on the same note.
A good example is the PDF versions of the official Law Reports, going back to 1865, where retaining the appearance and pagination of the original document is essential for citation in court. I personally hate reading law reports online where the page number is buried in the HTML mid-line somewhere. I want the original page breaks so when someone quotes from a judgment I can find the quote and read it. I can print it out and know that my page 63 will be the same as the page 63 being read by anyone else, whether from the hard copy published by ICLR or licensed electronic versions available on Justis or Westlaw or LexisNexis.
PDFs can provide many of the benefits of HTML or XML: internal or external links, navigational bookmarks etc. They are not only easier to read on screen but much better formatted when printed out. I certainly wouldn’t reject them quite so glibly. But I accept that they are not as flexible for reading on different devices, such as phones and e-books, where a certain malleability of type size and display is essential.
Stephen and Paul – Yes, PDF is good for the reasons you mention. My point though is that PDF does not capture the structure and semantics of a document which is essential if it is to be used intelligently on the web. For example a computer cannot tell from a PDF version of a law report what are the headnotes, who are the parties, what are the citations within it etc. It cannot even reliably distinguish headings, paragraph breaks etc. These are all visually apparent to the human reader but they are not encoded. PDF was never designed to do so; it was designed to render printed pages faithfully and it is good at that.
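To make that concrete, here is a minimal sketch of what “encoded” structure buys you. The element names (`report`, `party`, `headnote`, `cite`) and the sample case are invented for illustration and do not follow any real law-report schema; the point is only that once the markup says what each piece is, a few lines of Python can pull it out, which a print-oriented PDF does not reliably allow.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for a law report. The element names, parties and
# citations below are all made up for this example.
REPORT_XML = """
<report>
  <parties>
    <party role="claimant">Example</party>
    <party role="defendant">Sample</party>
  </parties>
  <headnote>Illustrative headnote text.</headnote>
  <judgment>
    <para page="63">The principle in <cite>[0000] 1 XX 1</cite> applies.</para>
    <para page="64">See also <cite>[0000] 2 XX 2</cite>.</para>
  </judgment>
</report>
"""


def summarise(xml_text):
    """Pull out the parts of the report that the markup explicitly labels."""
    root = ET.fromstring(xml_text)
    return {
        # Each party together with its role, in document order.
        "parties": [(p.get("role"), p.text) for p in root.iter("party")],
        # The headnote is a named element, so no guessing from layout.
        "headnote": root.findtext("headnote"),
        # Every citation, wherever it sits inside the judgment.
        "citations": [c.text for c in root.iter("cite")],
    }


if __name__ == "__main__":
    for key, value in summarise(REPORT_XML).items():
        print(key, "->", value)
```

Note that the page numbers survive as `page` attributes on each paragraph, so the pagination needed for citation is kept alongside the structure rather than traded against it.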
I completely agree with Paul Magrath. If you want, for example, SSRN to stop hosting PDFs of journal articles, and instead have them as some sort of HTML/XML/whatever web page, someone needs to come up with a reliable, consistent way of exactly reproducing the print versions of those articles. That is what the reader is ultimately searching for.
Martin – PDF *is* the reliable, consistent way of exactly reproducing print versions. My beef is with those who publish only in PDF without regard to other (legitimate) ways in which people may want to access the content or repurpose it. One can generate both PDF and HTML from the same XML; one can also tag PDF with metadata (which is, I guess, what SSRN and HeinOnline do); but too often PDF is just a dumb print dump.
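As a rough sketch of the single-source idea, the snippet below renders a small XML document to HTML while carrying the print page numbers along as attributes. The XML vocabulary, the sample text and the `to_html` function are all assumptions made up for this example; a PDF would come from the same source via a separate print formatter (XSL-FO is one such route), which is not shown here.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single source: invented element names and sample text.
SOURCE_XML = """<report>
  <title>Example v Sample</title>
  <headnote>Illustrative headnote text.</headnote>
  <para page="63">First paragraph of the judgment.</para>
  <para page="64">Second paragraph of the judgment.</para>
</report>"""


def to_html(xml_text):
    """Render the XML source as simple HTML, one element per structural unit."""
    root = ET.fromstring(xml_text)
    parts = [
        f"<h1>{escape(root.findtext('title'))}</h1>",
        f'<p class="headnote">{escape(root.findtext("headnote"))}</p>',
    ]
    for para in root.iter("para"):
        # itertext() flattens any inline children into plain text.
        text = escape("".join(para.itertext()))
        # Keep the print page number so web readers can still cite by page.
        parts.append(f'<p data-page="{escape(para.get("page"))}">{text}</p>')
    return "\n".join(parts)


if __name__ == "__main__":
    print(to_html(SOURCE_XML))
```

The design choice worth noting is that neither output is the master copy: the XML is, and both the web page and the print version are derived from it.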
It’s not lazy per se. It’s more that Word lets users style the document exactly as they want very easily. HTML authoring tools are simply not easy enough to use for the average user.