TextPublishingArtifacts

GregorHagedorn - Sat Mar 12 2005 - Version 1.1
Parent topic: FormattedText
In addition to the inline formatting of text (formatting characters with bold, italic, subscript within a block-level element, see FormattedText), a major problem when marking up legacy text (digitized books) is how to handle high-level publishing artifacts such as page numbering, header and footer text. The block-level structure is not nested within pages, nor are pages nested within the block-level structure: a paragraph may span a page break, and a page may contain many paragraphs. In a single xml tree the text-syntactical view (paragraphs, inline formatting) and the publishing-syntactical view (pages) are therefore difficult to express together.

The following publishing artifacts are generally the most problematic:

A simple solution inserts appropriate empty xml elements (e.g. xhtml using CSS 2 or proprietary methods) into the text at the position of a page break. This solution has several disadvantages:

The first problem may be addressed by a solution similar to that in FormattedText, i.e. using xml-like markup but treating it as text by escaping (or encoding) the characters < and > as the entities &lt; and &gt;. The text formatting proposal (FormattedText) already contains a method to support line breaks through escaped xhtml <br/> tags. The intended use case in that proposal was not to preserve publishing artifacts, but to increase the "fuzzy semantic expressiveness" where authors believe that a new line is necessary for appropriate separation of statements or arguments.
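The escaping approach can be illustrated with a minimal sketch in Python, using only the standard library. The element name `p` and the sample sentence are assumptions made for illustration and are not part of the proposal. The sketch shows that an escaped `<br/>` travels as ordinary character data: the xml parser sees no child element, and a consumer that understands the convention can still recover the tag from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical container element; the proposal does not prescribe this name.
para = ET.Element("p")
# The line break is stored as text, not as a child element.
para.text = "First statement.<br/>Second statement."

# Serialization escapes the angle brackets as entities.
serialized = ET.tostring(para, encoding="unicode")
print(serialized)

# Parsing again yields plain text and no child elements.
parsed = ET.fromstring(serialized)
print(parsed.text)
print(len(list(parsed)))
```

Because the markup is only text to the xml parser, it cannot break the well-formedness of the surrounding tree, whatever structure it overlaps.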

The second problem may be addressed by always placing publishing artifact information after the word in which it occurs (directly in front of the next whitespace character) and recording its position relative to that point (see below).
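A minimal sketch of this placement rule in Python follows. The marker name `pagebreak` and its attributes `page` and `charsback` are hypothetical, chosen only to illustrate the idea; the marker is shown unescaped for readability and would be escaped in the same way as other xml-like markup before being stored as text.

```python
def place_page_break(text: str, break_pos: int, page: int) -> str:
    """Insert a page-break marker after the word that contains break_pos.

    break_pos is the character index at which the new page starts.
    The marker records how many characters back from the marker the
    break originally occurred (0 if it fell on a word boundary).
    """
    end = break_pos
    # Advance to the next whitespace character, or to the end of the text.
    while end < len(text) and not text[end].isspace():
        end += 1
    chars_back = end - break_pos
    # Hypothetical marker syntax, not defined by the proposal.
    marker = f'<pagebreak page="{page}" charsback="{chars_back}"/>'
    return text[:end] + marker + text[end:]


# The page break falls inside the word "problem" (after "pr").
inside_word = place_page_break("a major problem", 10, 2)
# The page break falls exactly between two words.
between_words = place_page_break("a major problem", 7, 2)
print(inside_word)
print(between_words)
```

Keeping the marker outside the word leaves every word intact for searching and indexing, while the recorded offset still allows the original break position to be reconstructed.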

-- Main.GregorHagedorn - 12 Mar 2005


Recommendations

Under the assumption that the primary markup of legacy text should follow semantic and syntactical document categories that are independent of the publishing medium (such as divisions, paragraphs, etc.), and that the publishing artifacts of printed publications should be preserved in a form compatible with xhtml, the following recommendations are proposed:

Please comment on the above - currently this is a very raw first draft!

-- Main.GregorHagedorn - 12 Mar 2005