GarryJolleyRogers - Wed Nov 25 2009 - Version 1.14
Parent topic: ClosedTopicSchemaDiscussionSDD09

ClosedTopicAreKeysDescriptions

(This is a probably short-lived, but urgent request for comment/opinion! Input is urgent because it currently prevents upload of a new beta version for SDD 0.91)

In SDD 0.9 Document/Descriptions used to have an unbounded choice resulting in a sequence of NaturalLanguageDescription (NLD) and CodedDescription (CD), which could, e. g., be NLD, CD, CD, NLD, CD, NLD, NLD ... The reason is historical: there used to be an IPR/metadata envelope around multiple description which has now been "normalized" away. The unbounded choice has been a cause of misunderstanding and should go away. To clarify the differences and simplify applications interested only in one of these variants, separate containers for NLD and CD have now been introduced. We thus have three description containers in the wider sense (i.e. including Keys; the relation to terminology and entities is identical in Keys and NaturalLanguageDescriptions).

How should they be arranged? I believe we agree that the form of the schema is important for modularization and communication. I therefore ask:

should we have:

DataSet 
  Terminology
  Entities
    Classes
    Objects
  Descriptions
    CodedDescriptions
    NaturalLanguageDescriptions
    Keys

Or perhaps:

DataSet
  Terminology
  Entities
    Classes
    Objects
  CodedDescriptions
  NaturalLanguageDescriptions
  Keys

Or:

DataSet
  Terminology
  Entities
    Classes
    Objects
  Descriptions
    CodedDescriptions
    NaturalLanguageDescriptions
  Keys

Note: "DataSet" was named "Document" in SDD 0.9)

I prefer the first because structural explanation of relations of parts of the schema are easiest. But is it sufficiently intuitive to consider a diagnostic key with statements as another form of description, or will this prevent people from finding what they are looking for? I wait for your social engineering input!

-- Gregor Hagedorn - 19 Apr 2004

The main reason I don't like the first is that I think a description is one or more paths through a guided key. While it is possible to argue that a guided key is a description of a superclass of the taxa described by paths therein, this class is not always part of any interesting hierarchy. Sometimes it has no better informal description than "the things described by this key". Examples for us include a key to the aquatic invertebrates of New England and a key to the invasive plants of New England. Somewhat characteristic of these keys is that the leaf nodes vary in their taxonomic rank for no other reason than lack of knowledge. Also we also sometimes mix life forms within the keys.

As to the second, I wonder if it presents a problem in Descriptions which are part Coded and part NaturalLanguage. Hence, for the moment, I prefer the third, though maybe it still doesn't address my objection to the second.

-- Main.BobMorris - 19 Apr 2004

The first argument is interesting, because it highlights a possible misunderstanding: I was thinking of keys as containing a number of descriptions of the classes/objects that are keyed out. These descriptions are not organized by class--but it is trivial to dynamically rearrange them in such a way. However, whereas the levels of Descriptions/CodedDescriptions/CodedDescription and Descriptions/NaturalLanguageDescriptions/NaturalLanguageDescription exactly correspond, Descriptions/Keys/Key is a different level.

"Coded and part NaturalLanguage": Syntactically this does not exist in SDD. Does anybody have a use case that requires it (or makes it desirable due to much simpler implementation)? Currently, natural language descriptions offer the full expressiveness of coded descriptions (but not all the validations and not the handling simplicity regarding rearranging data). The other way round, a coded description may contain fragments of free-form text (albeit without markup). So there are some limitations.

BTW: I have less a problem with the structure of the third solution (I agree that CD and NLD are closer related than CD or NLD to Keys). It is more that the name "Descriptions" seems to imply that Keys are not class descriptions. I believe Keys are a special arrangement of descriptions, even though they are not normally called "descriptions" in biology. But perhaps we should stick to conventional use rather than inventing new terminology that may in the end be more appropriate?

-- Gregor Hagedorn - 20 Apr 2004

The third option addresses the needs. The main distinction that I see between keys and coded descriptions is ordering constraints.

Mixed CD and NLD: We mark-up natural language descriptions at varying levels of detail including down to the character data. Thus we end up with mixed representations of CD and NLD. However, because of ordering constraints and other processing issues, we end up pulling the character data into other elements. In SDD I think we would end up with non-parallel CD NLD sections. I do not think this is a problem. Originally, I had envisioned that we might merge them in SDD. For example, we can recognize that a sentence in a NLD is about the flower and marked up appropriately. We may recognize and markup that a clause in that sentence is about the sepals, another the petals, and another the bud. We may even markup that there is a bud tip shape character and the value is blunt. Having done this however it does not fit into the SDD structure at all for characters. It is better to move the bud tip shape=blunt up to the CD sections and leave the NL in an independent section.

-- Main.BryanHeidorn - 30 Apr 2004

Donald Hobern and Steve Shattuck whom I asked at GB8 were of the same opinion. I will use the third arrangement as proposed by each of you. This is also closest to our previous arrangement.

Mixed CD and NLD: I find it interesting and suprising that you have problems with ordering constraints in NLD. Clearly, the text that is being marked up in an NLD imposes ordering constraints, but there is the CD is intended as a completely unordered set. I am not clear what you are trying to achieve. Can you elaborate the example why "... it does not fit into the SDD structure at all for characters."?

-- Gregor Hagedorn - 1 May 2004

In our own present EFG => SDD we have a mix for exactly these reasons as Bryan, and not very good, reasons. Doing no NLP (or more precisely, doing only static pattern recognition from metadata we produce by hand from the underlying data schema), we nevertheless have some fields that are easily, or with minor parsing, generated as CD. We deal with the rest verbatim. Hence it is not the case that we have a CD and an NLD that are the same as one another. Rather, they represent disjoint sets of characters, and the union of the two is the total description. I am fairly sure that this is not really what any of us on the committee envisioned, except as we thought of "partially coded" descriptions. OTOH, I prefer to think of them as "partially natural language". :-)

Jacob will talk about this briefly when in Berlin when he describes how our implementation was done. BTW, as I have been saying publicly for a long time, a great deal of the code is automatically generated, in his case by the Castor XML databinding framework which automatically generates marshalling/unmarshalling (i.e. Java->XML/XML->Java) code for each of the two schemas (EFG and SDD). The main work is then deriving SDD Java objects from EFG Java objects.

-- Main.BobMorris - 02 May 2004

I see no problem in having part information in CD and part in NLD (with markup). Having overlapping or disjoint sets of information in multiple CD is in fact part of the design of SDD, and the processors is expected to aggregate a "complete" description. I believe aggregating CD + NLD is a factual problem, but not a design problem for SDD, right?

Regarding the "dirty data" export problem, I think the best solution is in fact to export it all as NLD. In SDD the NLD is designed such that it can express as much as CD. If not that is a bug in SDD... There is no requirement on NLD to have the same level of markup for all characters. So the "good" CD-enabled characters could be exported into NLD with full character state markup, the more difficult character only with a character bracket around the text.

-- Gregor Hagedorn - 3 May 2004 Our difficulty with order is not in keeping order in the CD or in the NLP but in keeping order in the mix of the two. If for example the parsers recognize three CD level descriptions randomly distributed within a paragraph of text we would like to make those characters available for a different type of processing or use than the full text, a key for example. We wish to keep the NLP to use with different search parameters simultaneously. If we leave the CD characters imbedded in the NLP we have some redundancy. I do not think this is a critical problem for SDD.

-- Main.BryanHeidorn - 03 May 2004