Personal Hypertext Report #6

Stephan Kreutzer doesn’t want to work on a WordPress enhancement: the current posts on the default installation broke XML processing by a non-browser client because of a named DTD entity and bugs in WordPress’ wpautop() function – both unnecessary problems caused by the general web context, too expensive to fix and difficult to work with.

There’s great consensus that the first project, the jrnl, should be a WordPress enhancement project. I’m happy with that as a good practical compromise for bootstrapping ourselves: it’s readily available and many people know how to set it up and/or use it, it’s libre-freely licensed (not ideal though, GPLv2 makes no sense for software that is run over a network, the software should be licensed under the GNU Affero General Public License 3 + any later version, but they probably lack a way to update the license or can’t do that for social reasons), it tends to be an “open”, friendly actor of the open web and it has a usage share of the total of all websites that makes it likely that I encounter such legacy web content I want to work with, I wouldn’t want to lose those entirely. For our group, we need something to capture our discussions, otherwise we would lose some of our own history. Doug Engelbart, at the time he didn’t have the NLS yet, printed his proposals on paper, which is good enough to communicate the idea and get the project going, as these documents can be imported into the very system they describe, but it’s not that we need to wait for the perfect tool in order to allow ourselves to talk about the design of such perfect tool. The conversation about this topic is more important than the costs of doing it wrong. It’s part of bootstrapping really to improve on what’s already around, because if we would re-invent everything starting from digital electronics over networking infrastructure to software architecture and graphical rendering engines, we would never get anywhere in regard to our end goals, so the improvement process must happen from within – adjusting, enhancing, replacing piece by piece for an exponentially enabling overall progress.

What I personally plan to do however won’t contribute much to a WordPress enhancement because WordPress for all its benefits also comes with many problems as it happens to follow the web model. The web wasn’t designed as a hypertext capability infrastructure, but as a document accessing scheme that developed into a fully blown programming framework for online software. It has never been much concerned about the needs of text, which becomes all too apparent for the otherwise perfectly reasonable things it cannot do. Changing that within the web paradigm would be unbelievably expensive for socio-political reasons and the required money/time investment, so I just might play devil’s advocate for client components that aren’t browsers and don’t care much about what servers do for the web by building my own little system that solves the problems I still remain to have when it comes to reading, writing and publishing. That can either lead to wonderful interplay with WordPress enhancements or to great conflicts depending on the technical specifics of implementation and design. In any case, it’s a worthwile research endeavor that promises directly usable results in the larger context, no matter how all the other parts behave, being the independent, potentially non-cooperating entities on the network they are.

The first thing I tried as a simple test was to automatically download the existing posts on jrnl.global with my wordpress_retriever_1 workflow. I encountered two minor issues that were easily solvable by curation without having negative effects on what WordPress generates for the web: In the biography post about Teodora Petkova, there was a single, lone   on the last line, and that broke my XML processing of the post data as retrieved via the JSON API. XML does not come with a lot of DTD entities except those to escape its own special characters and therefore is unaware of HTML’s legacy DTD entity declarations. They’re usually not really needed because characters should just be encoded in the file with the encoding clearly stated in the XML declaration or processing instruction. I agree that it might be sometimes favorable to explicitly use escaping for special characters, especially the non-printable ones that can be hard to read and write in standard text editors, but then XML supports the encoding of Unicode characters in the form of   (hexadecimal). My parsers generally still lack the ability to translate these Unicode escape codes to the actual character (with the additional difficulty to never break encoding through the entire processing pipeline), but that needs to be added anyway eventually, with the XML tools probably already being fine. The old HTML entities for encoding special characters is really a legacy remnant of DTDs, and while DTD declarations at the beginning of HTML file might arguably help with identifying the HTML version (especially a huge problem with “versionless”, “living” and therefore bad HTML5), they also cause a lot of trouble for processing XHTML with a XML processor (which is just great, but was abandoned by the web and browser people for questionable reasons): DTDs can include other DTDs, each of which can theoretically contain entity declarations, so an generic, vocabulary-agnostic, XHTML-unaware XML processor needs to load/read all involved DTDs in order to find out how entities need to be replaced for text output, which entity names are defined and which aren’t (the latter will cause the processor to fail). Now, the DTDs define the HTML vocabulary or specialized format based on the generic XML format convention, which is published as a standard by the W3C and needs an unique identifier to be disambiguated from other standards and versions, otherwise the whole vocabulary would be ambiguous and of no help for processors to interpret its meaning correctly. So if the W3C needs an identifier, ideally globally unique (which implies a central name registry), what do you think they would use, especially back in the day? An URL of course! Act two: imagine you write a generic XML parser library and you encounter a DTD declaration, while your library is also going to parse DTDs and allow DTD validation (for proper, but separate HTML support) eventually, and lack the time or knowledge to do a particular good job with the first release of the implementation? It might even be legal constraints as the DTD files by the W3C have some almost non-free restrictions or the question why a generic XML parser library should come with W3C DTDs if some users will never parse any XHTML ever – what happens a lot of times, as the unique identifier looks like and is an URL and the referenced, needed DTDs are not locally available, these parsers send out HTTP GET requests for the initial DTD, which then references several other DTDs internally, and then sometimes the received results (if any) might not get saved (because why would you and how, for small, ad-hoc, generic XML parsers?). There are probably plenty of web crawlers and bots and client implementations that try to obtain a resource from the web and then read it in for further processing with a normal XML tool, and in a distributed network, this causes quite some excessive traffic for the W3C server. So what did the W3C do in response? They throttle so serve the response of course! So imagine yourself, being an unsuspecting user of such an XML library, trying to parse in some nice XHTML, just to observe that your program execution stalls. Is it a bug, an endless loop? You might enter a debugging session but can’t find out a lot. Lucky you if you go to get you a coffee eventually while leaving the parser running instead of giving up the plan entirely, to be surprised when coming back that miraculously the XHTML input somehow did get parsed. That’s likely because the artificial delay at the W3C server has expired – to then serve the requested DTD(s)! To treat the identifier as resolvable instead of an abstract one is the root cause of this problem, delivering the DTDs encourages such behavior, and additionally, the delay might lead to the impression that XHTML/XML parsing is very, very slow. But what else can you do? Not serving the DTDs would leave HTML entities unresolvable and the XML would fail to parse. If parsers/processors don’t have the DTDs and can’t get them, all of XHTML would be unreadable, which would be a grim option for the whole XML/XHTML/semantic effort. But can XHTML demand special handling from all generic XML parsers out there? Of course not, the whole point of XHTML is to make it universally available just as all the other XML vocabularies. Therefore, still having the DTD declaration and named entities in XHTML for backward compatibility with crappy HTML4 while not advancing to XHTML2 (HTML5 removed the DTD and with that, XHTML5 doesn’t have one either, but as rubbish HTML5/XHTML5 is the one without version as a “living standard” indistinguishable from a potentially upcoming HTML6/XHTML6 except for heuristic analysis, a HTML5 DTD would have been better than no version indication at all) is a historical tragedy that renders even the best, well-formed, valid portions of the web hard to use for some of our machinery, completely unnecessarily. For my tools, I install a catalog of the DTDs that I included in the download, so all DTD identifiers can be resolved to the local copy, but that’s not an option for my own primitive StAX parser in C++ or its sloppy JavaScript port (deliberate lack of XML tools in browsers is unbelievably ironic and a big loss of potential). XML parsers of the big programming frameworks usually also have a DTD parser on board and just need the W3C DTDs from somewhere, ideally delivered as a local resource with me resolving the ID to the local file in the catalog. These catalogs could be adjusted to not resolve to the actual W3C DTD for mere parsing (we’re not doing DTD validation here, do we?), but to a pseudo DTD that contains nothing than the named entity declarations and omit all references to any of the other DTDs. For the custom, small StAX parser, I think I’ll add a mechanism that instructs the parser to first read a XML file that lists named entity names and their intended textual replacement, so it can continue with the real input file and recognize all entities that were configured this way in advance. And all of this needs to happen in preparation to handle a single   that hasn’t have any effect to the visual appearance anyway. Pragmatically, I just removed it from the blog post for now, I think it won’t be missed.

Something else broke the parsing of the jrnl launch meeting announcement (for Friday 24th of August 2018) post: welcome to the wpautop() function of WordPress! “autop” stands for “auto paragraph” and replaces two consecutive line breaks with XHTML <p> tags. In the process, for XML/XHTML well-formedness as the very minimum of standard compliance, it has to find the beginning of the previous paragraph and the end of the paragraph that follows. It also must respect other encapsulating tags as tags are in hierarchical structure and aren’t allowed to overlap. For some reason, autop is notoriously error-prone. Have a look at the issue tracker: there’s #103, #488, #2259, #2691, #2813, #2833, #3007, #3362, #3621, #3669, #3833, #3935, #4857, #7511, #9744, #14674, #27350, #27733, #28763, #33466, #33834, #38656, #39377, #40135, #40603, #42748, #43387. The fact that regex is involved explains part of this mess. WordPress recognizes that plain text linebreaks are considered harmful when they’re typed to convey semantic meaning, and as the application is managing output for the web in (X)HTML, whitespace like linebreaks doesn’t encode meaning, it usually gets entirely ignored in the visual rendering and instead formats the source code for better readability. If the plain text editor of WordPress is used, autop should not be applied onto the text whatsoever because it can be expected that the user will directly input XHTML. The visual editor should create a a new <p>-pair with the cursor right within it – any subsequent enter key press in an empty paragraph should have no effect (but then people will start to enter spaces into the empty paragraphs to force their way to vertical spacing for layout/typesetting, so a whitespace check needs to be performed). An actual plain text editor variant should XML-escape all angle brackets as mixing different formats and encodings is always a bad idea. Here we’re already losing WordPress as a decent writing and publishing tool, that’s with the out-of-the-box default configuration and no plugins or themes installed yet to interfere with the content even more. The divs with the ApplePlainTextBody class seem to be of other origin, also being pretty useless as the CSS class doesn’t seem to exist and doesn’t help as a mechanism to derive semantic meaning. There’s even an empty one of those divs, so I just removed all of them and that also caused autop to not fail for what otherwise was perfectly valid XHTML.

“Enhancement” can mean different things, “WordPress enhancement” means using WordPress as the basis for the enhancement, but then “WordPress” can be looked at from different perspectives. The web perspective isn’t one I personally consider much because I precisely want to get rid of the limitations that are associated with it, plus it would be too expensive for me to invest into fixing all the problems that exist with the current implementations/paradigm against the resistance of those who are in actively favor of them or simply don’t care.

Copyright (C) 2018 Stephan Kreutzer. This text is licensed under the GNU Affero General Public License 3 + any later version and/or under the Creative Commons Attribution-ShareAlike 4.0 International.

https://skreutzer.de/about.html (+ https://skreutzer.de/2018/03/09/my-journey-through-text/).