GarryJolleyRogers - Wed Nov 25 2009 - Version 1.5

AmnhLegacyLiteratureProject

* NovitatesTreatment_1_0.xml: Natural Language Coding of a taxonomic treatment in the AMNH legacy literature project

This is my first attempt at an SDD coding of the Novitates ant treatment of Baikurus casei from the digitized version of http://research.amnh.org/informatics/ants/pdf/N3208.pdf the initial digitization of which is http://research.amnh.org/informatics/ants/xml_lite/N3208.xml See also other representations of this particular treatment at http://research.amnh.org/informatics/ants/

Especially see Ants.WebHome

I've done the treatment entirely with the NaturalLanguageDescriptions ("NLD") of SDD. This kind of coding is intended exactly for this purpose, and is designed to encode the text at any level of conceptual granularity desired. In particular, it is the aim of NLD to support subsequent refinement of the markup to subsequently refined concepts. For example, in the document NovitatesTreatment_1_0.xml, I use an SDD Terminology with two ConceptTrees. The first ConceptTree is a (rather flat) tree of SDD whose ConceptTreeType is PartsCompositionHierarchy. It has Concepts named HEAD, ALITRUNK, and GASTER, corresponding to the named parts in the treatment. The second tree is of type MethodHierarchy and has entries labelled MATERIALS EXAMINED, ETYMOLOGY, and DISCUSSION. (These probably do not really correspond to the intent of a MethodHierarchy in SDD, which is more designed to detail how data was gathered. However, it is interesting to note that the ConceptTrees could quite easily be reorganized, retyped, and renamed (i.e. relabeled) in the Terminology without having to recode the NaturalLanguageDescription. That's because in the Description are only references to the keys that uniquely specify the Concept and these would survive such a reorganization in most cases. (There are uniqueness constraints on those keys that would have to be maintained in such a reorganization.)

The essential point of SDD NLD markup is that subsequent agents acting on the document with a refined Terminology could mark it up further. For example, the HEAD section could be refined to separate markup of the ocelli, the eyes, etc., all the way to detailed character/state based markup that would be as searchable with XQuery as had it derived from a database in the first place. Accomplishing that markup is either expert handwork, or the subject of research in natural language processing of morphological characters such as that in Bryan Heidorn's lab. What's missing in this coding. It's important to the AMNH work to also include certain automatically marked document characteristics, including page boundaries, figure locations, etc. I haven't tried to capture that here in part because there have not yet been any attempts to use SDD markup as fragments, e.g. via namespaces, of something else. This doesn't seem difficult in principle, although the heavy dependence of SDD on key/keyref mechanisms could make validation complicated in such a scenario. Particularly unclear to me are the consequences of the (common) case where these document objects come within an informational object, such as the text of the use of a Concept in a NaturalLanguageDescription. So I have just ignored them in NovitatesTreatment_1_0.xml. By design, SDD has only a single top-level element (DescriptiveData) which might make it annoying to use SDD objects embedded in something other than an SDD document, which is what I have coded here.

Less obscure is that I have omitted the (well-defined) use of mechanisms for describing special objects like tables, figures, and images. Normally in SDD these might be treated as part of the ExternalDataInterface except that they are embedded in the document in this case. So possibly they are just another kind of Concept, though I solicit SDD'ers opionion on whether that makes sense. This might be made moot by suitable mechanisms for embedding SDD elements in other kinds of documents. Possibly there is a difficult schema integration problem here.

Finally note two things:

1. NovitatesTreatment_1_0.xml has had the debugRef tool applied so that it is easier to see in situ what the Concepts used in descriptions are since the descriptions just use references, much akin to foreign keys in a traditional database. The tool inserts a heuristically chosen text (usually a mandatory label) from the referenced object---here a Concept---at the point of its reference.

2. My experience with SDD, both in the design and use, has been with CodedDescriptions. I hope someone who has attempted NLDs will comment on how to improve what I've done here. -- Main.BobMorris - 06 Jan 2005

-- Main.BobMorris - 06 Jan 2005