It is believed that some text content in databases or xml elements requires text formatting to preserve the correct semantics, or to allow the degree of expressiveness required by content authors. A semantic requirement is the use of superscript and subscript markup, since "G2", "G²", and "G₂" may be three different things. In "email-style", superscript is often expressed as in "mm^2", but a more general solution was sought.
Basing this in principle on the relatively widely known xhtml (rather than inventing proprietary codes) was always agreed upon. However, versions of UBIF prior to 1.0 beta 18 used mixed content markup, closely modeled on a minimized subset of xhtml inline formatting. This use of mixed content was received critically and was considered to pose a significant burden. For example, a product like Altova Mapforce (allowing graphical xslt creation to map to and from databases) cannot handle elements with mixed content. Furthermore, the element validation that is implicit in using xhtml-style markup to format label or definitional text creates an impedance problem between database and xml: the database most likely stores &, <, and > as plain ANSI or Unicode characters, rather than natively storing character entities (&amp;, &lt;, &gt;). A converter would thus have to distinguish between passing these characters through unencoded where they form one of the few recognized markup tags, and encoding them otherwise.
Example: the database content "A<sub>1</sub> > A<sub>2</sub>" would have to be recoded into "A<sub>1</sub> &gt; A<sub>2</sub>" when creating SDD xml content. This is especially problematic, since some validation may have to occur in the conversion process to avoid ill-formed xml or non-valid SDD. For example, unbalanced markup like "A<sub>1</sup>" should be converted to "A&lt;sub&gt;1&lt;/sup&gt;".
Starting with UBIF 1.0 beta 18, it is proposed to change this to a new concept based on encoded markup, avoiding mixed content.
The experience with DELTA shows that text formatting is a significant issue. DELTA underwent changes from "[Itext]" to "[I]text['I]" to RTF typesetting marks. The DELTA User guide (Edition 4.12, 2000) explicitly accepts codes for: italics, bold, superscript, subscript, font size, default font appearance, en dash, left/right quote, new paragraph, default paragraph attributes, space before paragraph, space after paragraph, line indentation, first line indentation. Of these, the en dash and quotes are already covered by Unicode. Italics, bold, superscript, and subscript are proposed to be supported through a special convention detailed below. Paragraph-level formatting is considered problematic, since it may conflict with report generation styles and may in fact cause invalid html to be created. As a compromise, support of the break tag (<br/>) is proposed. Font size or style in general is not considered desirable. However, support of the <small> and <big> tags may be introduced if requested.
String content in UBIF xml string fields should not be mixed content. All occurrences of "<", ">", or "&" should be encoded (i. e. to "&lt;", "&gt;", or "&amp;"). However, certain combinations of encoded text should be recognized as having formatting semantics. Thus a text that literally appears in the UBIF document as "H&lt;sub&gt;2&lt;/sub&gt;O" should be treated as the xhtml mixed content markup "H<sub>2</sub>O" when creating xhtml, and thus ultimately displayed as "H₂O". This process recovers the encoded markup. The recovering process should be stable and always produce well-formed xml (e. g. the unbalanced "H&lt;sub&gt;2&lt;/sup&gt;O" should be left unchanged).
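The recovery step described above can be sketched in Python. This is only an illustration, not the proof-of-concept xslt: the function name `recover_encoded_formatting` is hypothetical, the input is assumed to be the string value as delivered by an xml parser (entities already decoded), and the all-or-nothing handling of unbalanced markup is one possible reading of "left unchanged".

```python
import re
from html import escape

# Tags that must occur as balanced pairs; br is handled separately.
PAIRED_TAGS = ("strong", "em", "b", "i", "sub", "sup")
TOKEN = re.compile(r"<(/?)(%s)>|<br\s*/?>" % "|".join(PAIRED_TAGS))


def recover_encoded_formatting(text: str) -> str:
    """Return an xhtml fragment for a decoded UBIF string value.

    Recognized, properly nested tags become real markup; everything else
    is escaped. If the recognized tags are unbalanced, the whole string
    is escaped (assumption: nothing is recovered in that case).
    """
    out = []
    open_tags = []
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(escape(text[pos:m.start()], quote=False))
        pos = m.end()
        closing, name = m.group(1), m.group(2)
        if name is None:                      # one of the three br forms
            out.append("<br/>")
        elif not closing:                     # opening tag
            open_tags.append(name)
            out.append("<%s>" % name)
        elif open_tags and open_tags[-1] == name:
            open_tags.pop()                   # matching closing tag
            out.append("</%s>" % name)
        else:                                 # stray or mismatched closing tag
            return escape(text, quote=False)
    if open_tags:                             # something was never closed
        return escape(text, quote=False)
    out.append(escape(text[pos:], quote=False))
    return "".join(out)


print(recover_encoded_formatting("H<sub>2</sub>O"))   # H<sub>2</sub>O
print(recover_encoded_formatting("H<sub>2</sup>O"))   # H&lt;sub&gt;2&lt;/sup&gt;O
```

Because every character outside a recognized tag is escaped again, the result stays well-formed even when the text contains a bare "<", ">", or "&".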
This is a recommendation that is not validated by the UBIF or SDD schema. It is intended to be an agreement between content authors and processors of SDD (e. g. a routine creating natural language html documents). Processors need not implement this if it is not relevant for their purposes. However, they may wish to realize that "&lt;em&gt;xyz&lt;/em&gt;" and "xyz" are more similar than a plain text comparison may indicate.
The following xhtml tags are proposed to be recognized:
Examples:
Tag name | Content editor or database view | XML encoded string in UBIF document | Browser view after RecoverEncodedFormatting | Plain text after StripEncodedFormatting |
strong | <strong>Strongly emphasized</strong> | &lt;strong&gt;Strongly emphasized&lt;/strong&gt; | Strongly emphasized | Strongly emphasized |
em | This is <em>emphasized</em>. | This is &lt;em&gt;emphasized&lt;/em&gt;. | This is emphasized. | This is emphasized. |
b | Using <b>bold</b> text. | Using &lt;b&gt;bold&lt;/b&gt; text. | Using bold text. | Using bold text. |
i | This is <i>italics</i>. | This is &lt;i&gt;italics&lt;/i&gt;. | This is italics. | This is italics. |
sub | H<sub>2</sub>O needs subscript. | H&lt;sub&gt;2&lt;/sub&gt;O needs subscript. | H₂O needs subscript. | H2O needs subscript. |
sup | cm<sup>3</sup> needs superscript | cm&lt;sup&gt;3&lt;/sup&gt; needs superscript | cm³ needs superscript | cm3 needs superscript |
br | line break (3 forms):<br> (1),<br/> (2),<br /> (3) | line break (3 forms):&lt;br&gt; (1),&lt;br/&gt; (2),&lt;br /&gt; (3) | line break (3 forms): (1), (2), (3) | line break (3 forms): (1), (2), (3) |
Strong and emphasis are usually rendered as bold and italics, respectively. They are logical markup, which leaves the exact rendering to the processor. Thus, emphasized words should be marked with em, whereas explicitly required italics (e. g. for taxonomic names) should use i.
The following xhtml tags are not yet proposed, but discussion is encouraged.
Support for the u/underline tag was already rejected in a previous SDD discussion.
In UBIF/SDD, any element named "Text", "Details", "Definition", "Abbreviation", or "InternationalAbbreviation" should be treated, when creating formatted reports, as potentially containing encoded formatting.
Specific to SDD natural language generation are the following additional element names: "TextBefore", "TextAfter", "SingleDelimiterText", "RepeatedDelimiterText", "LastDelimiterText".
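As an illustration of how a processor might locate the strings that potentially carry encoded formatting, the following Python sketch collects the text of all elements with one of the names listed above. The sample document and the helper names `local_name` and `formatted_strings` are hypothetical; the nesting shown is not taken from the actual UBIF or SDD schema.

```python
import xml.etree.ElementTree as ET

# Element names as listed in the text above.
FORMATTED_ELEMENTS = {
    "Text", "Details", "Definition", "Abbreviation",
    "InternationalAbbreviation",
    "TextBefore", "TextAfter", "SingleDelimiterText",
    "RepeatedDelimiterText", "LastDelimiterText",
}

# Hypothetical sample; the structure is illustrative only.
SAMPLE = """<Dataset>
  <Character>
    <Text>H&lt;sub&gt;2&lt;/sub&gt;O content</Text>
    <Definition>Amount in cm&lt;sup&gt;3&lt;/sup&gt;</Definition>
    <Id>c1 &lt;b&gt;</Id>
  </Character>
</Dataset>"""


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts before names."""
    return tag.rsplit("}", 1)[-1]


def formatted_strings(xml_source: str):
    """Return (element name, decoded string) pairs in document order."""
    root = ET.fromstring(xml_source)
    return [
        (local_name(el.tag), el.text or "")
        for el in root.iter()
        if local_name(el.tag) in FORMATTED_ELEMENTS
    ]


for name, value in formatted_strings(SAMPLE):
    print(name, "->", value)
```

Note that the xml parser already decodes the entities, so the collected values contain literal "<sub>" and "</sub>" characters. The "Id" element is skipped because its name is not in the list, even though its content looks like a tag.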
A short sample xslt has been created as proof of concept; see the files xsl-RecoverEncodedFormattingTest.xsl (xslt code), xsl-RecoverEncodedFormattingTest.xml (example file), and xsl-RecoverEncodedFormattingOutput.html (expected result of transformation). Furthermore, the files xsl-StripEncodedFormattingTest.xsl (xslt code) and xsl-StripEncodedFormattingOutput.html (expected result of transformation) show a variant of the code that strips the recognized formatting marks. Such a function should be used when comparing or indexing strings.
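For readers without an xslt processor at hand, the stripping variant can be approximated in a few lines of Python. This is a sketch under assumptions, not a port of the xslt files named above: the function name `strip_encoded_formatting` is hypothetical, the input is assumed to be the decoded string value, and recognized tags are removed whether or not they are balanced.

```python
import re

# The seven recognized tags: six paired ones plus the three br forms.
_FORMATTING = re.compile(r"</?(?:strong|em|b|i|sub|sup)>|<br\s*/?>")


def strip_encoded_formatting(text: str) -> str:
    """Remove recognized formatting marks, leaving plain text."""
    return _FORMATTING.sub("", text)


def same_plain_text(a: str, b: str) -> bool:
    """Compare two strings while ignoring recognized formatting."""
    return strip_encoded_formatting(a) == strip_encoded_formatting(b)


print(strip_encoded_formatting("H<sub>2</sub>O needs subscript."))
print(same_plain_text("<em>xyz</em>", "xyz"))
```

Tags outside the recognized set, such as "<u>", are left in place, so a comparison still treats them as ordinary text.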
(Gregor Hagedorn, 18.8.2004)