Ideas for Rich Documents

Frode and I have spent a long time discussing rich text documents. Or augmented or enhanced. Whatever you want to call them.

The basic idea is that structured supporting information should be included in the document.

Generally this should be little or no work for the author, as they will already have the structured information. It’s just a matter of tools to attach it at creation and exploit it at consumption.

The proof of concept demo Frode has recently shown just writes the BibTeX version of the bibliography at the end of the document (in a small font). Discovery is as simple as looking for a line starting with an “@” symbol. Why BibTeX? Just because. Any metadata system would do; Frode just picked that to start.
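To make the discovery step concrete, here is a minimal sketch in Python. It assumes the document has already been converted to plain text, and it is not Frode's implementation: the function name `find_bibtex_entries` and the brace-counting approach are my own illustration. It looks for lines that start with “@” followed by an entry type and an opening brace, then collects lines until the braces balance.

```python
import re

# A line that opens a BibTeX entry, e.g. "@article{key,".
# Requiring "@word{" rather than a bare "@" is an assumption made here
# to avoid picking up things like e-mail handles at the start of a line.
ENTRY_START = re.compile(r"\s*@\w+\s*\{")


def find_bibtex_entries(text):
    """Return the raw BibTeX entries found in plain text, in document order."""
    entries = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        if ENTRY_START.match(lines[i]):
            depth = 0
            chunk = []
            # Collect lines until every opened brace has been closed.
            while i < len(lines):
                line = lines[i]
                chunk.append(line)
                depth += line.count("{") - line.count("}")
                i += 1
                if depth <= 0:
                    break
            entries.append("\n".join(chunk))
        else:
            i += 1
    return entries
```

The entries that come back are still raw BibTeX text; a real tool would hand them to a proper BibTeX parser rather than trust brace counting alone.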

A more advanced system would use a section heading to indicate that the section contains machine-readable content, and maybe some hints on how to consume it, or other metadata.

Putting such data into the body of the document is going to be controversial with some, so an alternative would be to embed the data in other ways. In theory, a PDF can have attached data, much the same way as a JPEG contains EXIF data telling you about the camera, the time of day, the shutter speed etc.

Datasets for academic texts

The most obvious application is the bibliography, but another that could be very valuable is supporting datasets where these are tabular and small to moderate in size. If someone has constructed a graph, then adding this should be easy: pretty much just importing a classic dataset spreadsheet where the first row holds the field titles and each subsequent row is a record.

This could be embedded into the paper with a symbol or code that indicates to an “aware” repository or viewer application that the section contains a dataset that can be read and made interactive. 
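As a rough sketch of how an “aware” viewer might pick such a dataset out of the text, the Python below assumes the dataset is embedded as comma-separated lines between two marker lines. The markers `#dataset-begin` and `#dataset-end` and the function `extract_datasets` are hypothetical names invented for this example; no such convention exists yet.

```python
import csv
import io

# Hypothetical marker lines; any agreed symbol or code would do.
BEGIN = "#dataset-begin"
END = "#dataset-end"


def extract_datasets(text):
    """Return each embedded dataset as a list of records (dicts).

    The first row inside a marked block supplies the field titles and
    each subsequent row becomes one record, as in a classic spreadsheet.
    """
    datasets = []
    current = None  # lines of the block being read, or None when outside one
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == BEGIN:
            current = []
        elif stripped == END and current is not None:
            reader = csv.DictReader(io.StringIO("\n".join(current)))
            datasets.append(list(reader))
            current = None
        elif current is not None:
            current.append(line)
    return datasets
```

All values come back as strings; deciding which fields are numbers is exactly the sort of thing extra metadata could settle.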

For added points the author can add metadata for the dataset, such as assumptions or errors, and even supply units for certain fields.

Providing units in a machine-readable way could be a valuable time saver, allowing a future reader to overlay different datasets with more ease. One of the ways that new discoveries are made is an academic satisfying their curiosity. My dream is for them to be able to pull data from a paper and overlay it on their own research data in seconds to see correlations or other insights which could lead to more formal investigation.
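Here is a small sketch of why units matter for overlaying, again in Python. Everything in it is an assumption made for illustration: the `Column` class, the `TO_METRES` table and the `overlay` function are made-up names, and a real tool would use a proper units library rather than a three-entry lookup table.

```python
from dataclasses import dataclass

# Hypothetical conversion factors from each unit to metres.
TO_METRES = {"m": 1.0, "km": 1000.0, "cm": 0.01}


@dataclass
class Column:
    name: str     # field title from the dataset
    unit: str     # unit supplied by the author as metadata
    values: list  # numeric values for this field


def to_unit(column, target_unit):
    """Return a copy of the column with its values expressed in target_unit."""
    factor = TO_METRES[column.unit] / TO_METRES[target_unit]
    return Column(column.name, target_unit, [v * factor for v in column.values])


def overlay(own, other):
    """Convert `other` into the unit of `own` so both can share one axis.

    Returns a dict of series keyed by column name, plus the shared unit.
    """
    converted = to_unit(other, own.unit)
    return {own.name: own.values, converted.name: converted.values}, own.unit
```

With the units declared, a reader holding their own measurements in metres could overlay a paper's kilometre figures without first hunting through the text to find out what the numbers mean.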


The most obvious place to exploit this is in the client: a viewer or editor application can detect these additions and provide services around them.


As well as smart clients being able to work with such data, it would be relatively easy to make repository software like EPrints able to detect such attachments, render them into a useful view alongside the PDF download, and provide interactive tools to discover the referenced papers. This is much cleaner than trying to parse the text of a citation section.

The server could take tabular datasets and offer them for download as Excel files, graph them, sort and filter them, and integrate them with other datasets.


A middle option is to use JavaScript to download the PDF and produce dynamic interactive elements in the page without the server doing any processing. This requires the whole PDF (or whatever format) to be downloaded in the background, so it isn’t ideal for huge PDFs, but it has the advantage of requiring no new code to run on the server, just some new JavaScript libraries to be linked in.


Most web crawlers looking for citations in a paper use language processing or just look for DOIs or other unique IDs [citation required, I’m just guessing!]. Providing structured data reduces the risk of errors and lowers the barrier to creating such a tool.

