Stephan Kreutzer doesn’t want to work on a WordPress enhancement: the current posts on the default installation broke XML processing by a non-browser client because of a named DTD entity and bugs in WordPress’
wpautop() function – both unnecessary problems caused by the general web context, too expensive to fix and difficult to work with.
There’s great consensus that the first project, the jrnl, should be a WordPress enhancement project. I’m happy with that as a good practical compromise for bootstrapping ourselves: it’s readily available and many people know how to set it up and/or use it, it’s libre-freely licensed (not ideal though, GPLv2 makes no sense for software that is run over a network, the software should be licensed under the GNU Affero General Public License 3 + any later version, but they probably lack a way to update the license or can’t do that for social reasons), it tends to be an “open”, friendly actor of the open web and it has a usage share of the total of all websites that makes it likely that I encounter such legacy web content I want to work with, I wouldn’t want to lose those entirely. For our group, we need something to capture our discussions, otherwise we would lose some of our own history. Doug Engelbart, at the time he didn’t have the NLS yet, printed his proposals on paper, which is good enough to communicate the idea and get the project going, as these documents can be imported into the very system they describe, but it’s not that we need to wait for the perfect tool in order to allow ourselves to talk about the design of such perfect tool. The conversation about this topic is more important than the costs of doing it wrong. It’s part of bootstrapping really to improve on what’s already around, because if we would re-invent everything starting from digital electronics over networking infrastructure to software architecture and graphical rendering engines, we would never get anywhere in regard to our end goals, so the improvement process must happen from within – adjusting, enhancing, replacing piece by piece for an exponentially enabling overall progress.
What I personally plan to do however won’t contribute much to a WordPress enhancement because WordPress for all its benefits also comes with many problems as it happens to follow the web model. The web wasn’t designed as a hypertext capability infrastructure, but as a document accessing scheme that developed into a fully blown programming framework for online software. It has never been much concerned about the needs of text, which becomes all too apparent for the otherwise perfectly reasonable things it cannot do. Changing that within the web paradigm would be unbelievably expensive for socio-political reasons and the required money/time investment, so I just might play devil’s advocate for client components that aren’t browsers and don’t care much about what servers do for the web by building my own little system that solves the problems I still remain to have when it comes to reading, writing and publishing. That can either lead to wonderful interplay with WordPress enhancements or to great conflicts depending on the technical specifics of implementation and design. In any case, it’s a worthwile research endeavor that promises directly usable results in the larger context, no matter how all the other parts behave, being the independent, potentially non-cooperating entities on the network they are.
The first thing I tried as a simple test was to automatically download the existing posts on jrnl.global with my
wordpress_retriever_1 workflow. I encountered two minor issues that were easily solvable by curation without having negative effects on what WordPress generates for the web: In the biography post about Teodora Petkova, there was a single, lone
on the last line, and that broke my XML processing of the post data as retrieved via the JSON API. XML does not come with a lot of DTD entities except those to escape its own special characters and therefore is unaware of HTML’s legacy DTD entity declarations. They’re usually not really needed because characters should just be encoded in the file with the encoding clearly stated in the XML declaration or processing instruction. I agree that it might be sometimes favorable to explicitly use escaping for special characters, especially the non-printable ones that can be hard to read and write in standard text editors, but then XML supports the encoding of Unicode characters in the form of
(hexadecimal). My parsers generally still lack the ability to translate these Unicode escape codes to the actual character (with the additional difficulty to never break encoding through the entire processing pipeline), but that needs to be added anyway eventually, with the XML tools probably already being fine. The old HTML entities for encoding special characters is really a legacy remnant of DTDs, and while DTD declarations at the beginning of HTML file might arguably help with identifying the HTML version (especially a huge problem with “versionless”, “living” and therefore bad HTML5), they also cause a lot of trouble for processing XHTML with a XML processor (which is just great, but was abandoned by the web and browser people for questionable reasons): DTDs can include other DTDs, each of which can theoretically contain entity declarations, so an generic, vocabulary-agnostic, XHTML-unaware XML processor needs to load/read all involved DTDs in order to find out how entities need to be replaced for text output, which entity names are defined and which aren’t (the latter will cause the processor to fail). Now, the DTDs define the HTML vocabulary or specialized format based on the generic XML format convention, which is published as a standard by the W3C and needs an unique identifier to be disambiguated from other standards and versions, otherwise the whole vocabulary would be ambiguous and of no help for processors to interpret its meaning correctly. So if the W3C needs an identifier, ideally globally unique (which implies a central name registry), what do you think they would use, especially back in the day? An URL of course! Act two: imagine you write a generic XML parser library and you encounter a DTD declaration, while your library is also going to parse DTDs and allow DTD validation (for proper, but separate HTML support) eventually, and lack the time or knowledge to do a particular good job with the first release of the implementation? It might even be legal constraints as the DTD files by the W3C have some almost non-free restrictions or the question why a generic XML parser library should come with W3C DTDs if some users will never parse any XHTML ever – what happens a lot of times, as the unique identifier looks like and is an URL and the referenced, needed DTDs are not locally available, these parsers send out HTTP
that hasn’t have any effect to the visual appearance anyway. Pragmatically, I just removed it from the blog post for now, I think it won’t be missed.
Something else broke the parsing of the jrnl launch meeting announcement (for Friday 24th of August 2018) post: welcome to the
wpautop() function of WordPress! “autop” stands for “auto paragraph” and replaces two consecutive line breaks with XHTML
<p> tags. In the process, for XML/XHTML well-formedness as the very minimum of standard compliance, it has to find the beginning of the previous paragraph and the end of the paragraph that follows. It also must respect other encapsulating tags as tags are in hierarchical structure and aren’t allowed to overlap. For some reason, autop is notoriously error-prone. Have a look at the issue tracker: there’s #103, #488, #2259, #2691, #2813, #2833, #3007, #3362, #3621, #3669, #3833, #3935, #4857, #7511, #9744, #14674, #27350, #27733, #28763, #33466, #33834, #38656, #39377, #40135, #40603, #42748, #43387. The fact that regex is involved explains part of this mess. WordPress recognizes that plain text linebreaks are considered harmful when they’re typed to convey semantic meaning, and as the application is managing output for the web in (X)HTML, whitespace like linebreaks doesn’t encode meaning, it usually gets entirely ignored in the visual rendering and instead formats the source code for better readability. If the plain text editor of WordPress is used, autop should not be applied onto the text whatsoever because it can be expected that the user will directly input XHTML. The visual editor should create a a new
<p>-pair with the cursor right within it – any subsequent enter key press in an empty paragraph should have no effect (but then people will start to enter spaces into the empty paragraphs to force their way to vertical spacing for layout/typesetting, so a whitespace check needs to be performed). An actual plain text editor variant should XML-escape all angle brackets as mixing different formats and encodings is always a bad idea. Here we’re already losing WordPress as a decent writing and publishing tool, that’s with the out-of-the-box default configuration and no plugins or themes installed yet to interfere with the content even more. The
divs with the
ApplePlainTextBody class seem to be of other origin, also being pretty useless as the CSS class doesn’t seem to exist and doesn’t help as a mechanism to derive semantic meaning. There’s even an empty one of those
divs, so I just removed all of them and that also caused autop to not fail for what otherwise was perfectly valid XHTML.
“Enhancement” can mean different things, “WordPress enhancement” means using WordPress as the basis for the enhancement, but then “WordPress” can be looked at from different perspectives. The web perspective isn’t one I personally consider much because I precisely want to get rid of the limitations that are associated with it, plus it would be too expensive for me to invest into fixing all the problems that exist with the current implementations/paradigm against the resistance of those who are in actively favor of them or simply don’t care.
Copyright (C) 2018 Stephan Kreutzer. This text is licensed under the GNU Affero General Public License 3 + any later version and/or under the Creative Commons Attribution-ShareAlike 4.0 International.