RFC: Glossary Data Formats

First stage

For now, we can use the following format to specify glossary entries for the Journal:

  <dt>Media Fragment</dt>
  <dt>Media Fragments</dt>
  <dd>Paragraphs, datasets, images, topic maps, glossaries… that go into forming one or more document(s).</dd>
  <dd>This glossary entry is a media fragment, and each of its parts could be as well.</dd>

As the Journal is currently stored in WordPress, we might use its data model to some extent, but will abandon it more and more by abusing WordPress as generic online data storage, where arbitrary data is stored in the content-body of a blog post. This will render all plugins, themes and administrative tools on the server useless and might even break the HTML that gets generated for the web, so users who visit the URL in a browser won’t see anything reasonable any more. The views rendered by a server are irrelevant anyway, as the client can’t rely on this external entity to cooperate on ViewSpec operations, or even to be present/reachable at all.

After retrieval, a client is supposed to look into the content-body of a blog post, and if there’s an XHTML <dl> in there, these glossary definitions should be added to the glossary of the Journal, the latter consisting of all <dl>s found in all blog posts. Because of the semantic <dl> markup, we can ignore the explicit WordPress category that marks posts as glossary entries, while on the other hand the entire Journal can’t have two separate glossaries at this first stage. There’s no need to require all glossary entries to be in a single <dl> in a single post, or to demand that a post containing a <dl> holds no more than one definition. One post could contain more than one <dl>, but this is certainly not encouraged. There are no recommendations on what to do with other data found in a post content-body outside of a <dl>. Clients are free to ignore it, as it is expected that different types of data will be stored separately instead of being embedded in big collections, so clients can explicitly retrieve only the data that’s of interest to them and avoid the need to extract relevant data from mixed media fragments in complex ways. It is expected that on the client side, the glossary data will get applied onto other, independent media fragments in the process of building the document.
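
The retrieval step described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the content-bodies have already been fetched as strings, each one is a well-formed XHTML fragment, and the function names `find_definition_lists` and `collect_journal_glossary` are hypothetical and not part of any existing client. Skipping bodies that fail to parse is likewise an assumption, not something this RFC prescribes.

```python
import xml.etree.ElementTree as ET


def find_definition_lists(post_body):
    """Return the <dl> elements found in one post content-body."""
    # Assumption: the body is a well-formed XHTML fragment. It is wrapped in
    # a synthetic root so that several top-level elements still parse.
    root = ET.fromstring("<body>" + post_body + "</body>")
    found = []

    def walk(element):
        for child in element:
            if child.tag == "dl":
                # Do not descend further: a <dl> nested in another <dl>
                # is treated as unsupported markup in this sketch.
                found.append(child)
            else:
                walk(child)

    walk(root)
    return found


def collect_journal_glossary(post_bodies):
    """Gather all <dl>s of all posts; everything outside a <dl> is ignored."""
    glossary = []
    for body in post_bodies:
        try:
            glossary.extend(find_definition_lists(body))
        except ET.ParseError:
            # Assumption: a body that is not well-formed is simply skipped.
            continue
    return glossary
```

A client would pass in the content-bodies of all retrieved posts and get back one flat list of definition lists, regardless of how they were distributed over posts.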

At the first stage, XHTML’s <dl> supports multiple terms and multiple definitions for each glossary entry. If I understand correctly, all consecutive <dt>s are alternatives (synonyms, abbreviations, plurals, languages even?) to each other until the first <dd> is encountered, of which there might be several to express different definitions (different in meaning, not highlighting different aspects of the same meaning) for the same term(s). Clients are expected to associate the term(s) and description(s) with some logic of their own, as XHTML unfortunately doesn’t provide explicit grouping. Within the <dt>s and <dd>s, other semantics are neither supported nor encouraged. Clients are free to ignore such markup, but they must interpret all text nodes inside the unsupported markup as belonging to the text of the term or definition. Clients are allowed to interpret other valid XHTML found in a <dt> or <dd> if they can and want to, but they can never expect an implicitly granted meaning on the Journal, nor will there be a future stage that includes unrelated XHTML elements in glossary definitions via this backdoor.
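
One possible grouping logic, sketched in Python: consecutive <dt>s are collected as alternative terms, the <dd>s that follow are collected as definitions, and the next <dt> after a <dd> starts a new entry. The names `text_of` and `group_entries` are hypothetical, and collapsing whitespace in the extracted text is an assumption of this sketch. Unsupported inline markup is dropped while its text nodes are kept, as required above.

```python
import xml.etree.ElementTree as ET


def text_of(element):
    """Concatenate all text nodes, including those inside nested markup."""
    # Assumption: runs of whitespace are collapsed to single spaces.
    return " ".join("".join(element.itertext()).split())


def group_entries(dl):
    """Split one <dl> into a list of (terms, definitions) pairs."""
    entries = []
    terms, definitions = [], []
    for child in dl:
        if child.tag == "dt":
            if definitions:
                # A <dt> following a <dd> closes the previous entry.
                entries.append((terms, definitions))
                terms, definitions = [], []
            terms.append(text_of(child))
        elif child.tag == "dd":
            definitions.append(text_of(child))
        # Any other child element is ignored.
    if terms or definitions:
        entries.append((terms, definitions))
    return entries
```

Applied to the example entry above, this would yield a single pair with the two terms "Media Fragment" and "Media Fragments" and the two definitions.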

We look at <dl> as some kind of microformat serialized in XML, and as XML namespaces aren’t included in the first stage, the unambiguous meaning of this “magic” markup name is derived from a declaration: it has the XHTML meaning whenever it is found in the data of the official location(s) and sources of the Journal. For now, definitions should be globally unique and not published at two separate places or as duplicates at the same place. If clients encounter a term that’s already in the local data storage, they’re free to ignore the newly retrieved definition, which includes all other term alternatives of that definition and all definition descriptions associated with the term in question, but not the other entries of the same <dl>, which might provide new, previously unknown definitions.
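
The duplicate rule could look like this in Python. It is a sketch of just one permitted behavior (clients are free to ignore duplicates, not obliged to), and it assumes the local storage is a plain dictionary from term to list of definitions and that retrieved entries arrive as (terms, definitions) pairs. The name `merge_entries` is hypothetical.

```python
def merge_entries(local, retrieved):
    """Add retrieved entries to local storage, skipping known terms.

    local:     dict mapping a term to its list of definitions
    retrieved: list of (terms, definitions) pairs from one <dl>
    """
    for terms, definitions in retrieved:
        if any(term in local for term in terms):
            # The whole entry is ignored, including its other term
            # alternatives and all of its definitions ...
            continue
        # ... but the remaining entries of the same <dl> are still added.
        for term in terms:
            local[term] = list(definitions)
    return local
```

Note that in this sketch an already known term also blocks the new alternatives listed next to it, which follows the wording above but is a design choice a client could make differently.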

Second stage

<?ohs version="1.0" format="xml" encoding="UTF-8"?>
<dl xmlns="http://www.w3.org/1999/xhtml">
  <dt>Media Fragment</dt>
  <dt>Media Fragments</dt>
  <dd>Paragraphs, datasets, images, topic maps, glossaries… that go into forming one or more document(s).</dd>
  <dd>This glossary entry is a media fragment, and each of its parts could be as well.</dd>
</dl>
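
With the namespace declared, a client no longer has to rely on where the data was found to know that <dl> has its XHTML meaning. A minimal sketch in Python, assuming the document is well-formed XML with a closed <dl> root. The function name `read_namespaced_glossary` is hypothetical. The `<?ohs …?>` line is treated as an ordinary processing instruction and is not interpreted here. For brevity, terms and definitions are listed without grouping them into entries.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

SAMPLE = (
    '<?ohs version="1.0" format="xml" encoding="UTF-8"?>\n'
    '<dl xmlns="http://www.w3.org/1999/xhtml">\n'
    "  <dt>Media Fragment</dt>\n"
    "  <dt>Media Fragments</dt>\n"
    "  <dd>This glossary entry is a media fragment.</dd>\n"
    "</dl>"
)


def read_namespaced_glossary(document):
    """Return (terms, definitions) if the root is an XHTML <dl>, else None."""
    root = ET.fromstring(document)
    # ElementTree writes namespaced names as "{namespace-uri}localname".
    if root.tag != "{%s}dl" % XHTML_NS:
        return None
    terms = [dt.text for dt in root.findall("{%s}dt" % XHTML_NS)]
    definitions = [dd.text for dd in root.findall("{%s}dd" % XHTML_NS)]
    return terms, definitions
```

A <dl> without the XHTML namespace is rejected by this function, which is the distinction the second stage introduces.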

Copyright (C) 2018 Stephan Kreutzer. This text is licensed under the GNU Affero General Public License 3 + any later version and/or under the Creative Commons Attribution-ShareAlike 4.0 International. It was first published with the canonical URL http://doug-50.info/journal/2018/04/07/rfc-glossary-data-formats/.

https://skreutzer.de/about.html (+ https://skreutzer.de/2018/03/09/my-journey-through-text/).


  1. I think there’s a strong argument for aspiring to keep the default HTML rendering to something useful.

    Use of “data” attributes could be useful to describe richer scope, or additional matching terms that should not be shown in the default HTML view.

    Later, it would be useful to be able to use a semantic code/id/uri to link a concept to a definition, so that you can make the text “The Queen” link to a definition of “Queen Elizabeth II” even though “The Queen” is not an appropriate term for the glossary entry.

    1. Hi Christopher,

      thank you very much for your comment! Could you please elaborate a little on what the strong argument for keeping the HTML rendering is (an expectation that this is how the rendering will look, that clients will be able to produce that rendering, designing the content for a specific visual layout)? I assume that any given ViewSpec will ignore it completely anyway, and while there can be a ViewSpec that mimics the HTML default rendering, it shouldn’t be more privileged than any other ViewSpec.

      What purpose/function would richer scope provide, could you give a quick example? Similarly, for matches/occurrences that should not be shown, this RFC isn’t much concerned with how the glossary is actually used or how connections between glossary terms and a main text are established, because those mechanisms might be capabilities of their own, needing their own RFCs and research, as I expect that we will come up with several of them. Explicit linkage and automatic text search come to mind, already delving into a myriad of possibilities, options, potential. This proposal here is pretty much about a primitive, generic text capability that can be used in different ways, even be repurposed to do other things than we anticipate. I realize that it is not a full traditional glossary spec of course, but if we were to go on to build glossaries, this would be one way to go about it, to structure parts of the data that would be involved in it. In coming up with this RFC at all, my hope is that parts of the glossary data might get stored in a usable format, and especially get published as independent resources with no relation to WordPress (or other server/hosting/storage systems), as becomes a little more apparent in the second stage, because that one might break WordPress already (didn’t try or can’t remember, irrelevant anyway).

      I’m pretty much in favor of adding IDs for more advanced knowledge capabilities like global disambiguation, but it’s not clear yet if that’s supposed to be a separate format/resource/publication/relation/statement, so the glossary data remains free of the many problems that would be invited by such additions. As a glossary, that is, the terms and definitions in it, tends to be explicitly constructed/curated (sure, we could do a lot for the authoring of glossaries, indexes, etc.) in a certain context of only one main text or a collection of texts or the use of words of a single individual or a limited group (not global), the glossary as a reading aid is supposed to be employed in ways that aid the reader, considering the text/terms he will encounter in the material, previous knowledge (what to include and what not to), intended effect of the augmentation, and so on. “The Queen” could point out that this is how the author or people or the text refer to her. On the other hand, sure, in RFCs for the linking/connecting capabilities (if separate), we would need to look at questions like a reference to a term where the term text doesn’t match, linking the same term of the main text in different occurrences to different glossary term definitions (furthermore, if both of those groups have multiple definitions for the same term), linking to a specific form/variant of the term or to one or several specific definitions, and on and on.

      The RFC is obviously a little older and only suggested that if it is done on WordPress, it should rather be in a useful form so as not to bury data in custom formats. It’s also a first draft/stage, not only as an RFC, but for bootstrapping, so it would develop from practical implementation and usage. It’s designed to serve as a transitional compromise for users who are exclusively web-affected (a form of computer illiteracy). It doesn’t need to be the only one; there could be other formats as well, as long as these are reasonable enough. It could be a start to research and experiment with glossary-related capabilities, probably a wide and interesting field in itself, but for our own purposes I’m not aware whether we have already encountered the “Queen problem”. I also didn’t want to suggest full xFiles/EDL/ZigZag/NOSQL dimensions for category/type semantics as a compromise to get a glossary capability for a Journal going, not entering this whole other field of data structuring, which may or may not solve all of the problems we will encounter down the road with this simple, first glossary data RFC. Now my “short reply” has already turned into a report on my thinking about glossary-related capabilities and a potential larger architecture/infrastructure/system they might be part of.

      Copyright (C) 2018 Stephan Kreutzer. This text is licensed under the GNU Affero General Public License 3 + any later version and/or under the Creative Commons Attribution-ShareAlike 4.0 International.

  2. Some more thoughts:
    – How does a “smart/aware” client take advantage of this information? By getting it when the document is viewed or can it get more information via an API mediated by the blog?
    – How does a dumb (HTML+JavaScript) client take advantage of this information? At the most naive level, some JavaScript could detect the glossary and the region(s) of the page, and apply it to the HTML for the benefit of the reader.

    A good example of something like this, that already exists, is Wikipedia. Wikipedia considers the first paragraph of a page the primary definition, so internal links have a nice ajax popup window giving that definition if you hover over them.
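
    The naive detect-and-apply idea from the list above can be sketched as a plain text search, shown here in Python rather than JavaScript. It assumes exact, case-sensitive matching on word boundaries, and the name `find_term_occurrences` is hypothetical. What a client then does with the matches (links, popups, annotations) is left open.

```python
import re


def find_term_occurrences(text, terms):
    """Return (start, end, matched term) for every glossary term in text."""
    if not terms:
        return []
    # Longer terms first, so "Media Fragments" wins over "Media Fragment".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b"
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]
```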

    1. Potentially similar to this: Hypertext Components: HyperGlossary Capability Prototype for Doug@50? I know, it’s using the web stack and therefore was a waste of time, but surely technical people might get the idea that I could arbitrarily pull in different glossaries and apply them to the text, either by text search or alternatively because there are explicit links/connections made available for given text/glossary combinations, or in other ways, and authors could suggest glossaries, but also researchers, readers, insiders of the field, I myself could publish such glossaries (with or without separate links/connections to existing bodies of text), or I could apply my own, non-public glossary for note taking, so I can collect different definitions wherever I encounter the word in what I’m reading. The prototype is also supposed to demonstrate that I can apply different styles/layouts/ViewSpecs (of my own, by the author, publicly shared via a ViewSpec repository/marketplace, made by any group or individual), where again the author can suggest a certain ViewSpec, but if I encounter the work in a different context, they might suggest a different ViewSpec, or I apply my own, etc.

      Copyright (C) 2018 Stephan Kreutzer. This text is licensed under the GNU Affero General Public License 3 + any later version and/or under the Creative Commons Attribution-ShareAlike 4.0 International.
