GarryJolleyRogers - Wed Nov 25 2009 - Version 1.11
Parent topic: ClosedTopicSchemaDiscussionSDD09

StateMapping

Introduction

A frequent problem when designing descriptions for both detailed output (e.g. natural language descriptions) and identification is that the former usually prefers detailed categories, whereas the latter is better served by more coarse-grained categories.

The distinction comes from the fact that fine-grained categories ("lanceolate, oblanceolate, obliquely lanceolate, obliquely oblanceolate, ovate-lanceolate") can be understood as "passive vocabulary" by many consumers of the data, but are "active vocabulary" only for the expert. The expressiveness of the finer categories is considerable and may often be relevant when comparing an object with a full description, but when non-experts are required to score them, it becomes a source of error or makes it impossible to continue the identification.

SDD proposal in 0.9

Terminology/Character/Character/Categorical/Mappings provides a mechanism for mapping a number of states to other states. The relation supports n:m mapping, i.e. a source state can be mapped to multiple target states:
- ovate-lanceolate -> ovate
- ovate-lanceolate -> lanceolate
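
The n:m relation described above can be sketched as a plain set of (source, target) pairs. This is only a minimal Python illustration, not the SDD schema itself: the names `MAPPINGS` and `map_states` are hypothetical, and keeping unmapped states unchanged is an assumption made for this sketch.

```python
# Hypothetical illustration of an n:m state mapping; not part of the SDD schema.
MAPPINGS = {
    ("ovate-lanceolate", "ovate"),
    ("ovate-lanceolate", "lanceolate"),
}


def map_states(scored, mappings):
    """Return the set of target states for a set of scored source states.

    Assumption for this sketch: a state without any mapping is kept unchanged.
    """
    result = set()
    for state in scored:
        # Collect every target that this source state is mapped to.
        targets = {t for (s, t) in mappings if s == state}
        result |= targets if targets else {state}
    return result
```

Under these assumptions, calling `map_states({"ovate-lanceolate"}, MAPPINGS)` yields both "ovate" and "lanceolate", while a state that has no mapping passes through as it is.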

Open points and questions

1. SDD currently implicitly assumes that mappings may be defined in cases where a long list of categories (states) should be shortened to a shorter list, OR where error-prone states should be "degenerated" to achieve error tolerance. These two goals coincide in the case of identification, but they are not the same in general. Which use cases would have to distinguish between them? Do we need extra metadata on why a mapping occurs? In principle the degeneration (1 -> n states) can be distinguished from combination (n -> 1) based on the data itself. Is this enough?

2. A serious problem I see is that two types of mapping are possible:
a -> d
b -> d
c -> d
("d" is a generalized state and deprecated for scoring data / should not be used)

or:
a -> c
b -> c
c -> c
where c now has a sensu stricto for scoring and a sensu lato after mapping (for identification)

a) Should it be possible to score "sensu lato states"? What if one flora/fauna treatment has a coarse concept and another a fine one: would it be useful to allow "coarse scoring"? Or if experts and non-experts score? Or if the identification trail (those states scored during identification) is maintained as persistent data to document observations during specimen identification? The latter use case almost seems to require that c s. str. can be distinguished from c s. lat.!

If that is so, what do we have to do? Suggestion for changes of the SDD schema?

b) I believe that in the first case d could be allowed for scoring (rather than being "deprecated", something that does not exist in SDD anyhow - should it?). However, this would still leave us with two sets of concepts:

 1 2 3 4 5 6 7 8 9 10 (ordered fine-grained states)
 --A-- B -- C--- -D-- (coarse-grained states, corresponding to above)

Now, I think it is a bad idea NOT to distinguish between the fine-grained and coarse-grained state sets: "1, 2, A, 3, 4, B, 5, 6, C, 7, 8, 9, D, 10". Currently the only option would be to create two characters and map concepts between them. This seems to be less than ideal, but I still

3. Should mapping be possible on the reusable concept level? Or would mapping depend on the specific structure that is used? Do we need both, adding another complication by doing so? (Currently it is possible only on the character level.)
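
As a rough aid to questions 1 and 2 above, the following Python sketch shows which distinctions could be read from the mapping pairs alone. The function `classify` and its category names are hypothetical and follow the wording of this discussion; they are not SDD terms. It only inspects fan-out and fan-in, so it cannot tell why a mapping was defined.

```python
from collections import defaultdict


def classify(mappings):
    """Summarize a list of (source, target) pairs.

    Hypothetical helper: the category names follow the discussion above
    and are not part of SDD.
    """
    targets_of = defaultdict(set)  # source state -> its target states
    sources_of = defaultdict(set)  # target state -> its source states
    for source, target in mappings:
        targets_of[source].add(target)
        sources_of[target].add(source)

    return {
        # One source state mapped to several targets (1 -> n).
        "degeneration": sorted(s for s, ts in targets_of.items() if len(ts) > 1),
        # Several source states mapped to one target (n -> 1).
        "combination": sorted(t for t, ss in sources_of.items() if len(ss) > 1),
        # Target is mapped to itself (c -> c): it has a sensu stricto
        # reading for scoring and a sensu lato reading after mapping.
        "sensu_lato_targets": sorted(t for t, ss in sources_of.items() if t in ss),
        # Target never occurs as a source (d): a purely generalized state.
        "generalized_only_targets": sorted(
            t for t in sources_of if t not in targets_of
        ),
    }


# The two mapping types from question 2.
case_d = classify([("a", "d"), ("b", "d"), ("c", "d")])
case_c = classify([("a", "c"), ("b", "c"), ("c", "c")])
```

In this sketch both cases count as a combination, but only `case_c` reports "c" as a sensu lato target and only `case_d` reports "d" as a generalized-only target. That suggests the two types can be told apart from the data, provided the self-mapping c -> c is always stated explicitly.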

-- Gregor Hagedorn - 30 Mar 2004


The problems that you are considering relate to how states can be generalized (e.g. by forming hierarchies of related states and substates). If I understand correctly, your mappings can represent a categorisation or grouping of refined states under (or belonging to) a less precise parent state.

With Main.PrometheusII we briefly considered allowing specialisation of states: for example pale violet might be a type of violet which might be a type of purple. However, we decided that state definitions should stand on their own and there would be no structured relationship between states and substates (of course the textual definition might record such relationships textually).

We considered that these types of relationships might only be significant/relevant for some states (e.g. colours, shapes) and even for them it might be contentious for one authority to make the categorisation/grouping. Furthermore, unless the score was actually recorded in the context of this defined hierarchical terminology, an automatic generalisation would only be our interpretation of what the original author meant.

At the end of the day, because we were focussing on accuracy of description, we decided that unless a complete hierarchy was represented in the defined terminology/ontology for description, AND this was fully appreciated and understood by the working taxonomist AND they agreed with the relationships in it… it would be difficult to have any confidence in the relationships/mappings.

The formation of a taxon description can be considered a generalisation of the states present in the constituent taxa (and ultimately the states represented in actual constituent specimens). We envisage this in Main.PrometheusII as being a collection of atomic statements, e.g. leaf shape can be x, y, z or b, or leaf shape is typically x or y, or occasionally z, etc. We have not addressed whether these categorisations could have generalized labels. SDD would seem to require this because it is attempting to provide one data format to fit all uses (for example supporting the generalisation of detailed or composite descriptions into keys), which will require translation between expertise/accuracy levels.

I think this will be wonderful if it works, allowing automation of key generation etc., but it might not be reliably accurate enough for the purposes that Main.PrometheusII was targeting (or, more importantly, it might not be accepted as accurate or useful). This comes down to the basic problem of creating acceptable shared terminologies, which is compounded if the terminology is increasingly constrained with multiple ontological relationships.

-- Main.TrevorPaterson - 30 Mar 2004


Mapping can be used to express two things, and part of what I ask is whether these are strictly separate or not (I believe in practice they mix):

Now perhaps this is significant, since in the first case recording a state as the generalized state is meaningful, but recording the perhaps confused state is not. However, if I want to record the process of identifications, I have to live with possible misinterpretations, and will be happy if they are known in the character definition (or "ontology"). As an aside: is there anything in OWL that represents misinterpretation?

Trevor, can you explain what the "substates" in Main.PrometheusII are (perhaps as subtopic of that topic)? That is one thing I still haven't understood. You say "there are structured relationship between states and substates" - but what makes something a substate rather than a state in Prometheus?

Regarding the inference process from specimen to class (= taxon) and from classes/taxa to higher classes/taxa: SDD considers this a generalization as well (see, e.g., Lisbon minutes). Your example "typically x or y, or occasionally z" would normally be reached without using a state mapping definition, however. Mapping primarily provides views of different detail, and if all data were coded on the detailed state level, there would be no reason to use the coarse generalized states for a generalized taxon. However, if some descriptions use the broad concepts and others the narrow ones, it may be beneficial to first map narrow to broad, and then generalize only the broad concepts in the higher taxon. Nevertheless, the taxon description generalization hierarchy and the terminological generalization hierarchy are fundamentally independent. Note that the class-generalization process would be expected to add "typically" and "occasionally" as SDD frequency modifiers. (I am not sure whether this is the "categorisations could have generalized labels" you refer to, or whether you ask about a label for the combined statement?)
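
The two-step idea in the paragraph above (first map narrow to broad, then generalize only the broad concepts) might look roughly like the following Python sketch. Everything here is hypothetical: the taxon names, the helper names `to_broad` and `generalize`, and the use of a plain set union as "generalization", which deliberately ignores frequency modifiers such as "typically" or "occasionally".

```python
# Hypothetical narrow -> broad mapping, reusing the example from this topic.
NARROW_TO_BROAD = {"ovate-lanceolate": {"ovate", "lanceolate"}}


def to_broad(states, narrow_to_broad):
    """Step 1: replace each narrow state by its broad target states.

    States without a mapping are assumed to be broad already.
    """
    broad = set()
    for state in states:
        broad |= narrow_to_broad.get(state, {state})
    return broad


def generalize(child_descriptions, narrow_to_broad):
    """Step 2: combine the broad states of all child taxa.

    Assumption: generalization is a simple union; frequency modifiers
    are not modelled in this sketch.
    """
    higher = set()
    for states in child_descriptions.values():
        higher |= to_broad(states, narrow_to_broad)
    return higher


# One description scored with the narrow concept, one with a broad concept.
children = {"species_1": {"ovate-lanceolate"}, "species_2": {"ovate"}}
genus_states = generalize(children, NARROW_TO_BROAD)
```

The state mapping is applied per description before any taxon-level step, which keeps the terminological hierarchy and the taxon hierarchy independent, as argued above.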

I think I do not agree with the arguments based on "authority" and "allowing only a single opinion". I think this is not part of the scientific process. You can have that in law or theology, but not in science (see AuthoritativeTerminology). This is probably one of the premises where Prometheus differs fundamentally from SDD. Also, I believe it will be very problematic to use Main.PrometheusII for identification if you have long lists of detailed states (as in my example above) which for most users are only passive vocabulary.

* * * * * * * * *

But this is a side discussion. I have no concrete cases or experience, but in principle Trevor's statement that "creating acceptable shared terminologies" becomes more difficult if "terminology is increasingly constrained with multiple ontological relationships" may be true. However, I view the mapping relations as rules that help in the application of data for identification and not as part of the ontological definition of states. Stating this is good progress that this discussion has achieved. However, the problem then is no longer relevant, and changes in the state mapping would not affect definition and use of a terminology.

Thus, please let us come back to the original questions 1 to 3!

-- Gregor Hagedorn - 30 Mar 2004 / 20 Apr 2004


Just a note: this is still an unresolved topic in SDD after the Berlin meeting!

-- Gregor Hagedorn - 28 May 2004