GarryJolleyRogers - Wed Nov 25 2009 - Version 1.7
Parent topic: ResolvedTopicStateAndStatisticalMeasureBothPermitted

ContingencyProblem

The Contingency Problem is a specific problem of matrix-based keys, and derives from the fact that such keys treat all characters as independent, and there is nowhere in the matrix where IF-THEN statements can be recorded. The problem only occurs when one or more taxa in the key are polymorphic. Polymorphism is usually greater at higher taxonomic levels (e.g. a key to genera or families), but doesn't go away entirely in species-level keys.

Here's an example of the problem. I have a Lucid matrix key to the Families of Flowering Plants of Australia. I have the following characters:

Habit
- shrub
- tree
- climber
Leaves
- opposite
- alternate
Indumentum
- simple hairs
- stellate hairs

Many families are scored for several states in each of these characters.

Now suppose a user of the key is trying to identify a plant that is a climber with opposite leaves and stellate hairs. They choose these states in the key. They will probably get a long list of matching families. Unfortunately, most of these families are "false matches" and should not actually be in contention.

The problem is that the key returns all families that have some members that are climbers AND have some members that have opposite leaves AND have some members that have stellate hairs. In most families in the list, unfortunately (or rather fortunately! Gregor), no species actually combines all these. So it would be nice to have a way to encode the data in such a way that the list returns all families that have some members that are climbers AND have opposite leaves AND have stellate hairs.

One way to do this is to always encode all data at the species level, but there are obviously problems with this in real life. So it would be nice to be able to encode "This family has stellate hairs ONLY WHEN it's a climber"

Interestingly, the contingency problem is not a problem with pathway (e. g. dichotomous) keys - an area where these keys are superior over matrix keys.

-- Main.KevinThiele - 22 Mar 2004

Could you give a short example illustrating the point in the last paragraph?

-- Main.MikeDallwitz - 22 Mar 2004

Family 1 has (among others) these 3 species:
- shrub, opposite leaves, simple hairs
- tree, alternate leaves, simple hairs
- climber, alternate leaves, stellate hairs
Family 2 has (among others) these 3 species:
- shrub, opposite leaves, simple hairs
- tree, alternate leaves, simple hairs
- climber, opposite leaves, stellate hairs

The generalized diagnostics are:

Family 1:
- shrub, tree, or climber, opposite or alternate leaves, simple or stellate hairs
Family 2 has (among others) these 3 species:
- shrub, tree, or climber, opposite or alternate leaves, simple or stellate hairs

Searching for climber with opposite leaves and stellate hairs returns both families based on generalized descriptions, and only 1 based on species descriptions, which are generalized to family only after identification.

SDD assumes that for most taxa, multiple descriptions (= description containers, or fragments of knowledge) will be generated, based on more basic relations to published source of data, authorship, or specific specimens. These would then be generalized by software routines. If that is done, a smarter identification could look into the non-generalized containers and make the identification there.

What we really don't have is:

a) there is no way to tell such a routine that the species are really only examples - which are expected to cover the range of family descriptions - and which are expected to be meaningful only if after id the taxon name is generalized to family. (I realize this a bad description of the process, please help if you can express it better...)

b) it is still undecided whether generalization information should be stored (and marked up as computed data, and ideally marked up as which data it is based upon. Adding computed aggregation/generalization data the output xml is desirable to allow simple processors, that don't support the aggregation/generalization procedures to consume the data. If computed data are added, in the problem described above it would be vital to recognize them. I think BioLink has been thinking about this problem, but we had no feedback from them recently.

(PS: I think the title "contingency problem" is a bit confusing; I would think contingency means how to continue in the case of a problem barring the original approach.)

-- Gregor Hagedorn - 22 Mar 2004