RFC: Glossary Data Formats

First stage

For now, we can use the following format to specify glossary entries for the Journal:

<dl>
  <dt>Media Fragment</dt>
  <dt>Media Fragments</dt>
  <dd>Paragraphs, datasets, images, topic maps, glossaries… that go into forming one or more document(s).</dd>
  <dd>This glossary entry is a media fragment, and each of its parts could be as well.</dd>
</dl>

As the Journal is currently stored in WordPress, we might use its data model to some extent, but we will rely on it less and less, abusing WordPress as generic online data storage in which arbitrary data is stored in the content-body of a blog post. This will render all plugins, themes and administrative tools on the server useless, and it might even break the HTML that gets generated for the web, so that users who visit the URL in a browser won’t see anything reasonable any more. The views rendered by a server are irrelevant anyway, as the client can’t rely on this external entity to cooperate on ViewSpec operations, or even to be present or reachable at all.

After retrieval, a client is supposed to look into the content-body of a blog post, and if there’s an XHTML <dl> in there, its glossary definitions should be added to the glossary of the Journal, the latter consisting of all <dl>s found in all blog posts. Because of the semantic <dl> markup, we can ignore the explicit WordPress category that marks posts as glossary entries; on the other hand, the Journal as a whole can’t have two separate glossaries at this first stage. There’s no need to require all glossary entries to be in a single <dl> in a single post, nor to demand that a post containing a <dl> holds no more than one definition. One post could contain more than one <dl>, but this is certainly not encouraged.

There are no recommendations on what to do with other data found in a post content-body outside of <dl>. Clients are free to ignore it, as it is expected that different types of data will be stored separately instead of being embedded in big collections. That way clients can explicitly retrieve only the data that’s of interest to them, without the need to extract relevant data from mixed media fragments in complex ways. It is expected that on the client side, the glossary data will get applied onto other, independent media fragments in the process of building the document.
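The retrieval step could be sketched roughly as follows. This is only an illustration under assumptions not stated in the RFC: the function name find_definition_lists and the wrapping <post> element are made up for the example, and every content-body is assumed to be well-formed XML without namespaces. A body that isn’t well-formed (for instance because it uses an HTML entity that plain XML doesn’t define) is simply skipped here.

```python
import xml.etree.ElementTree as ET


def find_definition_lists(post_bodies):
    """Collect every <dl> element found in the given post content-bodies."""
    found = []
    for body in post_bodies:
        # A content-body may hold several top-level elements, so it is wrapped
        # in a hypothetical <post> root to parse it as one XML document.
        try:
            root = ET.fromstring("<post>" + body + "</post>")
        except ET.ParseError:
            # Assumption: bodies that are not well-formed XML are skipped.
            continue
        # Everything outside of <dl> is ignored, as the client is free to do.
        found.extend(root.iter("dl"))
    return found


posts = [
    "<p>Some introduction.</p><dl><dt>Journal</dt><dd>A log.</dd></dl>",
    "<p>A post without any glossary data.</p>",
    "<dl><dt>A</dt><dd>a</dd></dl><dl><dt>B</dt><dd>b</dd></dl>",
]
glossary_lists = find_definition_lists(posts)
print(len(glossary_lists))
```

The glossary of the Journal would then be the union of all entries in the returned <dl> elements, regardless of which post they came from.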

At the first stage, XHTML’s <dl> supports multiple terms and multiple definitions for each glossary entry. If I understand correctly, all consecutive <dt>s are alternatives to each other (synonyms, abbreviations, plurals, maybe even other languages) until the first <dd> is encountered, of which there might be several to express different definitions for the same term(s): different in meaning, not highlighting different aspects of the same meaning. Clients are expected to associate the term(s) and description(s) with some logic of their own, as XHTML unfortunately doesn’t provide explicit grouping. Within the <dt>s and <dd>s, other semantics are neither supported nor encouraged. Clients are free to ignore such markup, but they must interpret all text nodes inside the unsupported markup as belonging to the text of the term or definition. Clients are allowed to interpret other valid XHTML found in a <dt> or <dd> if they can and want to, but they can never expect it to carry an implicitly granted meaning on the Journal, nor will there be a future stage that includes unrelated XHTML elements in glossary definitions via this backdoor.
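One possible grouping logic for a client, given as a minimal sketch and not as the required algorithm: a new entry starts whenever a <dt> follows anything other than a <dt>, and nested markup is flattened to its text nodes. The names flatten and group_entries, the dictionary layout and the whitespace normalization are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def flatten(element):
    """Return all text nodes below the element, ignoring nested markup."""
    text = "".join(element.itertext())
    # Assumption: runs of whitespace are collapsed into single spaces.
    return " ".join(text.split())


def group_entries(dl):
    """Group consecutive <dt>s with the <dd>s that follow them."""
    entries = []
    current = None
    previous_tag = None
    for child in dl:
        if child.tag == "dt":
            # A <dt> that does not directly follow another <dt> opens a new entry.
            if previous_tag != "dt":
                current = {"terms": [], "definitions": []}
                entries.append(current)
            current["terms"].append(flatten(child))
        elif child.tag == "dd":
            if current is None:
                continue  # a <dd> without any preceding <dt> is dropped
            current["definitions"].append(flatten(child))
        else:
            continue  # other children of <dl> are ignored
        previous_tag = child.tag
    return entries


dl = ET.fromstring(
    "<dl>"
    "<dt>Media Fragment</dt><dt>Media Fragments</dt>"
    "<dd>First <em>meaning</em>.</dd><dd>Second meaning.</dd>"
    "<dt>Journal</dt><dd>A log.</dd>"
    "</dl>"
)
entries = group_entries(dl)
print(entries)
```

Here the <em> inside the first <dd> is not interpreted, but its text still ends up in the definition, which matches the rule that text nodes inside unsupported markup belong to the term or definition.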

We look at <dl> as some kind of microformat serialized in XML. As XML namespaces aren’t included in the first stage, the unambiguous meaning of this “magic” markup name is derived from a declaration: it has the XHTML meaning whenever it is found in data from the official location(s) and sources of the Journal. For now, definitions should be globally unique and not published at two separate places or as duplicates at the same place. If clients encounter a term that’s already in the local data storage, they’re free to ignore the newly retrieved definition. That includes all other term alternatives of that definition and all descriptions associated with the term in question, but not the other definitions of the same <dl>, which might provide new, previously unknown definitions.
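The duplicate handling might look like the following sketch. Ignoring an already known term is something clients are free to do, not something they must do, so this shows just one permitted behaviour. The function name merge_entries and the representation of an entry as a dictionary with "terms" and "definitions" lists are assumptions for illustration.

```python
def merge_entries(local, retrieved):
    """Add retrieved entries to the local glossary, skipping known terms."""
    known_terms = {term for entry in local for term in entry["terms"]}
    for entry in retrieved:
        # If any term alternative is already known, the whole retrieved entry
        # (all of its terms and all of its definitions) is ignored.
        if any(term in known_terms for term in entry["terms"]):
            continue
        # Other entries from the same <dl> are still accepted.
        local.append(entry)
        known_terms.update(entry["terms"])
    return local


local = [{"terms": ["Journal"], "definitions": ["A log."]}]
retrieved = [
    {"terms": ["Journal", "Journals"], "definitions": ["Something else."]},
    {"terms": ["ViewSpec"], "definitions": ["A view specification."]},
]
merged = merge_entries(local, retrieved)
print([entry["terms"] for entry in merged])
```

Note that with this choice the alternative "Journals" is not learned, because it arrived together with the already known term "Journal".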

Second stage

<?ohs version="1.0" format="xml" encoding="UTF-8"?>
<dl xmlns="http://www.w3.org/1999/xhtml"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    rdf:about="ohsidentifierscheme:info.doug-50/journal/category/12?created=2018-02-11T23:42">
  <dt>Media Fragment</dt>
  <dt>Media Fragments</dt>
  <dd>Paragraphs, datasets, images, topic maps, glossaries… that go into forming one or more document(s).</dd>
  <dd>This glossary entry is a media fragment, and each of its parts could be as well.</dd>
</dl>
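The RFC doesn’t spell out how the second-stage format is to be processed, so the following is only a guess at what reading it could involve: the element names now live in the XHTML namespace, and the rdf:about attribute is read as an identifier for the <dl>. The function name read_namespaced_dl is hypothetical, and the leading <?ohs …?> processing instruction is left out of the sample to keep the sketch focused on the namespaces.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

SAMPLE = """<dl xmlns="http://www.w3.org/1999/xhtml"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    rdf:about="ohsidentifierscheme:info.doug-50/journal/category/12?created=2018-02-11T23:42">
  <dt>Media Fragment</dt>
  <dt>Media Fragments</dt>
  <dd>This glossary entry is a media fragment, and each of its parts could be as well.</dd>
</dl>"""


def read_namespaced_dl(xml_text):
    """Return the rdf:about value and the terms of a namespaced <dl>."""
    dl = ET.fromstring(xml_text)
    # ElementTree spells qualified names as "{namespace-uri}local-name".
    if dl.tag != "{%s}dl" % XHTML:
        raise ValueError("root element is not an XHTML <dl>")
    # Assumption: rdf:about identifies this <dl> as a whole.
    identifier = dl.get("{%s}about" % RDF)
    terms = [dt.text for dt in dl.findall("{%s}dt" % XHTML)]
    return identifier, terms


identifier, terms = read_namespaced_dl(SAMPLE)
print(identifier)
print(terms)
```

With namespaces in place, a client no longer has to rely on the declaration that a bare “dl” found in Journal data carries the XHTML meaning; it can check the namespace of the element instead.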

Copyright (C) 2018 Stephan Kreutzer. This text is licensed under the GNU Affero General Public License 3 + any later version and/or under the Creative Commons Attribution-ShareAlike 4.0 International. It was first published with the canonical URL http://doug-50.info/journal/2018/04/07/rfc-glossary-data-formats/.

https://skreutzer.de/about.html (+ https://skreutzer.de/2018/03/09/my-journey-through-text/).