Frode and I have spent a long time discussing rich text documents. Or augmented or enhanced. Whatever you want to call them.
The basic idea is that structured supporting information should be included in the document.
Generally this should be little or no work for the author, as they will already have the structured information. It’s just a matter of having tools to attach it when the document is created and to exploit it when the document is read.
The proof-of-concept demo Frode has recently shown just writes the BibTeX version of the bibliography at the end of the document (in a small font). Discovery is as simple as looking for a line starting with an “@” symbol. Why BibTeX? Just because. Any metadata system would do; Frode just picked that to start.
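To give a feel for how little code that discovery step might need, here is a rough Python sketch. It assumes the document text has already been extracted to a plain string. The function name, the brace-counting rule for finding the end of an entry, and the placeholder entries in the sample are all my own inventions for illustration, not how Frode's demo actually works.

```python
def extract_bibtex_entries(text):
    """Return each BibTeX-looking entry found in plain document text."""
    entries = []
    current = []  # lines of the entry currently being collected
    depth = 0     # number of braces opened but not yet closed
    for line in text.splitlines():
        stripped = line.strip()
        if not current:
            # Assumption: an entry starts on a line that begins with "@"
            # and opens a brace, e.g. "@article{key,".
            if not (stripped.startswith("@") and "{" in stripped):
                continue
        current.append(stripped)
        depth += stripped.count("{") - stripped.count("}")
        if depth <= 0:
            # Braces are balanced again, so the entry is complete.
            entries.append("\n".join(current))
            current = []
            depth = 0
    return entries


# Made-up placeholder entries, only here to exercise the function.
sample = """Last paragraph of the paper body.

@article{placeholder_one,
  title = {First Placeholder Title},
  year = {2000}
}
@book{placeholder_two,
  title = {Second Placeholder Title}
}
"""

entries = extract_bibtex_entries(sample)
print(len(entries), "entries found")
```

A real tool would hand each extracted entry to a proper BibTeX parser rather than trusting brace counting alone.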
A more advanced system would use a section heading to indicate that the section contains machine-readable content, and maybe some hints on how to consume it, or other metadata.
Putting such data into the body of the document is going to be controversial with some, so an alternative would be to embed the data in other ways. In theory, a PDF can have attached data, in much the same way as a JPEG contains EXIF data telling you about the camera, the time of day, the shutter speed etc.
Datasets for academic texts
The most obvious application is the bibliography, but another that could be very valuable is supporting datasets, where these are tabular and small to moderate in size. If someone has constructed a graph, then adding the data behind it should be easy: pretty much just importing a classic spreadsheet dataset where the first row holds the field titles and each subsequent row is a record.
This could be embedded into the paper with a symbol or code that indicates to an “aware” repository or viewer application that the section contains a dataset that can be read and made interactive.
For added points, the author can add metadata for the dataset, such as assumptions or errors, and even supply units for certain fields.
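As a sketch of what reading such a dataset could look like, the Python below parses a small comma-separated block whose first row holds the field titles. The convention of putting a unit in square brackets after a field title (for example "depth [m]") is purely an assumption of mine to show the idea, and the function name and sample values are made up.

```python
import csv
import io
import re

# Hypothetical convention: a field title may end with a unit in brackets.
HEADER_WITH_UNIT = re.compile(r"^(?P<name>.*?)\s*\[(?P<unit>[^\]]+)\]$")


def parse_embedded_dataset(block):
    """Split a CSV block into records (dicts) and a field-to-unit mapping."""
    rows = list(csv.reader(io.StringIO(block.strip())))
    header, records = rows[0], rows[1:]
    names = []
    units = {}
    for title in header:
        title = title.strip()
        match = HEADER_WITH_UNIT.match(title)
        if match:
            names.append(match.group("name"))
            units[match.group("name")] = match.group("unit")
        else:
            names.append(title)
    # Values are kept as strings; converting types is left to the consumer.
    data = [dict(zip(names, (cell.strip() for cell in row))) for row in records]
    return data, units


# Invented values, only to show the shape of the input.
sample = """
sample_id,depth [m],temperature [degC]
A,10,4.5
B,20,3.9
"""

data, units = parse_embedded_dataset(sample)
print(units)
print(data[0])
```

Keeping the units in a separate mapping means a viewer can label axes or convert values without having to re-parse the field titles.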
Providing units in a machine-readable way could be a valuable time saver, allowing a future reader to overlay different datasets with more ease. One of the ways that new discoveries are made is an academic satisfying their curiosity. My dream is for them to be able to pull data from a paper and overlay it on their own research data in seconds to see correlations or other insights which could lead to more formal investigation.
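To make the overlay idea a bit more concrete, here is a minimal Python sketch that lines up a dataset pulled from a paper with a reader's own data, converting the paper's values into the reader's unit first. The tiny conversion table, the function names and all the numbers are assumptions for illustration only; a real tool would lean on a proper units library.

```python
# Hypothetical conversion factors to a common base unit (metres).
TO_BASE = {"m": 1.0, "km": 1000.0, "cm": 0.01}


def convert(value, from_unit, to_unit):
    """Convert a length between two units listed in TO_BASE."""
    return value * TO_BASE[from_unit] / TO_BASE[to_unit]


def overlay(paper_rows, paper_unit, own_rows, own_unit, key, field):
    """Join two datasets on `key`, expressing `field` in the reader's unit."""
    own_by_key = {row[key]: row[field] for row in own_rows}
    merged = []
    for row in paper_rows:
        if row[key] in own_by_key:
            merged.append({
                key: row[key],
                "paper": convert(row[field], paper_unit, own_unit),
                "own": own_by_key[row[key]],
            })
    return merged


# Invented numbers: the paper reports kilometres, the reader works in metres.
paper_rows = [{"year": 2001, "distance": 1.5}, {"year": 2002, "distance": 2.0}]
own_rows = [{"year": 2001, "distance": 1400.0}, {"year": 2003, "distance": 900.0}]

merged = overlay(paper_rows, "km", own_rows, "m", key="year", field="distance")
print(merged)
```

Only rows present in both datasets survive the join, which is usually what you want for a quick side-by-side look.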
Clientside
This is the most obvious option: a viewer or editor application can detect these additions and provide services around them.
Serverside
As well as smart clients being able to work with such data, it would be relatively easy to make repository software like EPrints able to detect such attachments, render them into a useful view alongside the PDF download, and provide interactive tools to discover the referenced papers. This is much cleaner than trying to parse the text of a citation section.
The server could take tabular datasets and offer them for download as Excel files, graph them, sort and filter them, and integrate them with other datasets.
Browserside
A middle option is to use JavaScript to download the PDF and produce dynamic interactive elements in the page without the server doing any processing. This requires the whole PDF (or whatever format) to be downloaded in the background, so it isn’t ideal for huge PDFs, but it has the advantage of requiring no new code to run on the server, just some new JavaScript libraries to be linked in.
Crawlers
Most web crawlers looking for citations in a paper use language processing or just look for DOIs or other unique IDs [citation required, I’m just guessing!]. Providing structured data reduces the risk of errors and lowers the barrier to creating such a tool.