GarryJolleyRogers - Wed Nov 25 2009 - Version 1.9
Parent topic: WebHome

AuthoritativeTerminology

Some notes on changing terminology and whether projects should be based on protying and the fixing terminology, or whether change should be allowed throughout the project. This has bearing on SDD, because some centralized (non-federated) models of terminology would not support change (or at least we would have to think about how they still could be made to do so). Federation is not yet properly tackled in SDD but it will be central for its success and is on the Agenda for review meeting in Berlin. The following is from a draft of a paper I am currently writing, I appreciate any comments on this!

Static versus dynamic terminology models

As soon as descriptive data using a specific terminology have been recorded, any substantial change of the terminology on which these descriptions depend has significant consequences. It is obvious that terminology changes must be carefully considered. Nevertheless, experience with systems allowing the ad-hoc definition of new characters and states (and thus an evolutionary development of terminology) is positive and often considered an indispensable feature (pers. comm. of users of DeltaAccess, which strongly supports reorganization of terminology). Why is that so?

Designing a good terminology is not much different from software development. It is in fact part of the design of an information model. Furthermore, this part of the model is created by the domain expert, who typically has little experience with such activities. Few projects are able to finance a collaboration between domain experts (taxonomists, plant pathologists, etc.) and persons experienced in descriptive data modeling. Even in this fortunate situation, the contradictory nature of conventional terminology, which usually surfaces only during the development of a structured terminology, will be a major obstacle. In software engineering, the attempt to first design a near-perfect information and object model is sometimes called "Big design up front (BDUF)". This classical model of software engineering may still be preferable in certain situations. However, development processes that explicitly contain iterative elements (e. g., the "Rational Unified Process" of inception, elaboration, iterative cycles of design, implementation, testing, and refactoring during construction, and final transition to finished product, Jacobson & al. 1999) are often more successful. The rate of change is even greater in development processes like "extreme programming" (see, e. g., Beck 2000).

Similar to software engineering, the best development process for descriptive terminology depends on the circumstances. Starting with a badly designed terminology, and then having technical personnel record a large amount of data is almost guaranteed to either fail or produce low-quality data. The advantage of iterative or evolutionary development processes can be reaped only if the designer her- or himself struggles with recording data and uses the experiences to improve the terminology. Note that the usefulness of prototyping is very limited when recording biological descriptions. The diversity of organisms is so large that it is very difficult to assess the validity of a terminological model until all organisms have been studied. It is, however, very inefficient to study all descriptions of a taxonomic group in a first pass to ensure that all required characters are present in the terminology, to then record actual data in a second pass. Depending on project size and circumstances:

Some large projects may, nevertheless, fare best with extensive prototyping and then settling on a fixed terminology for a prolonged period of time. This will especially be the case if a large amount of resources are available and a large number of personnel has to be trained.
Most large collaborative projects will, however, fare best with a development process similar to the "Rational Unified Process", where change is explicitly expected, but occurs in a controlled way.
Small projects with only a single primary "developer" of data and terminology will appreciate "extreme describing" and a software design that allows rapid evolutionary changes of terminology even when data already have been entered.

Also, note that many changes in terminology are near to neutral in their effect on other data items (Table 2). Even where adverse effects exist, these changes often occur shortly after the start of data recording, because problems that have been overlooked in the design phase become evident. Those changes that occur later typically affect rarely used characters, allowing a manual revision of affected data. Changing a bad terminology and reducing the validity of a few data items is preferable to sticking to a bad terminology for the remaining 90% of data to be recorded.

Reaching terminological stability

Despite the fact that it is unavoidable to give researchers the freedom to define their own terminology, it is evident that standardization is very desirable. Standardization has a technical aspect and a semantic aspect.

Technical standardization: To compare and integrate descriptive data sets, the identity of terms from different terminologies has to be assessed. This is not possible based on natural language descriptions: For example, two data sets defining a character with the same name "leaf structure" may have completely incongruent definitions of what is considered a "leaf structure" (e. g., rough or smooth surface vs. anatomical features). Even where "ontological" definitions are present, the current semantic techniques express only part of the information. Semantic information can greatly help to find similar terms, but can not yet decide whether the similarity is large enough to warrant data integration. The terminology itself has to be defined through unambiguous identifiers (e. g. through URIs as in the RDF). Multiple terminologies can then make assertions about comparability of their concept with a standard concept.
Semantic standardization: It has repeatably being stated that conventional terminology is still full of inconsistencies and local usage patterns. Improving terminology and reaching convergence of usage is an important goal for biodiversity research. This is in principle independent of the use of computer programs for descriptions, albeit attempting to define a structured terminology helps in finding problems that may otherwise be overlooked. However, different opinions exist as to whether the current phase of terminological instability is transient and will soon be replaced by a period of terminological stability – or not. I believe the latter, i. e. that the assumption that a group of eminent scientists should form a standard group and agree on a fixed terminology is contradictory to the fact that science:
- is constantly discovering new structures and features,
- develops new methods,
- that the results of these new methods also modify the interpretation of features studied previously by older methods (e.g. SEM studies influence how morphology is interpreted in the light microscope),
- reinterprets features because of improved causal and functional understanding,
- but also is an opinionated, often very personal enterprise, where schools of thought arise that identify themselves through use of proprietary terminology and concepts, and who are replaced by an evolutionary process of survival of the fittest ideas rather than through logical argument (Kuhn 1970, Hull 1988, Hull & Ruse 1998).

-- Gregor Hagedorn - 30 Mar 2004

Some thoughts as we come to the end of our Prometheus 'Experiment'

Re: Static Versus dynamic Terminology Models

I would agree that a lot of what Gregor discusses above definitely agrees with the experiences of our Main.PrometheusII project.

We realized that any terminology developed would have to be expandable so that it could cope with new concepts and with a widening user base, but that if it was to allow backwards compatibility, expansion must not alter the semantics of earlier versions of the terminology.

This is problematic because it can be argued that addition of new terminology does change the semantics of data collected with previous versions which lacked the new terminology - but at an operational level a taxonomist would have to choose to 'ignore' this subtlety in order to achieve compatibility or integration.

Prometheus tried to test whether if terminology is standardized at a lower level than that at which characters are currently defined - i.e. the defined terminology is used to actually compose characters/states from more atomic statements - it might be possible that the same data can be used to build up new or alternative 'characters'. We hoped that this approach would mean that new ways of describing variation (i.e. when a new conceptual character is 'created') would still be compatible with old data at the level of the underlying decomposed description. However, in practice this seems to be quite a difficult concept for working taxonomists to develop. The idea of what constitutes useful taxonomic characters seems to be ingrained at a higher compositional level and the taxonomists do not see value in decomposing their observations into more atomic statements about variation. As a non-taxonomist I could propose an alternative methodology where specimens are merely described in factual terms, and 'characters' of taxonomic interest are 'discovered' by analyzing the variation in recorded data. Taxonomists however prefer to evaluate the variation before scoring (or at least they do an initial quick evaluation to decide on their 'characters' of interest - before going to detailed scoring of specimens).

-- Main.TrevorPaterson - 01 Apr 2004

Indeed Prometheus has a huge advantage of testing approaches with users. I believe the problems you mention partly come from the tediousness of recording descriptive data on specimens. I agree that your approach of an "alternative methodology where specimens are merely described in factual terms" would work, but it runs into two problems:

actual concern about efficiency. Taxonomists are few and the amount of biodiversity information to process is huge. The human brain is an excellent parallel pattern processing machine. Decomposing this into an atomic and sequential process is difficult and time consuming. It is also extremely demotivating, if your excellent pattern processor already believes to know the answer (that actually works often, albeit not always...).
it is also a social issue of intelligent people with a high self esteem, who love to work with diversity, feeling themself reduced to technical personell and all exciting part of the work delegated to software algorithms.

SDD has all the time struggled with these problems. The decision to keep a basic flat character model (with largely optional extensions over DELTA) is based on both, the need to incorporate DELTA/Lucid etc. legacy data content, and the believe that many "content providers" are already struggling with structural requirements of character/states, and that highly complex models would probably not find acceptance (we also realized that the believe may well be wrong and that the struggling may be as result of the inadequacies of the character model). However, all through the SDD process, we also tried to accomodate the experience of large projects, who found the analytical capabilities of the character model unsatisfying, and were struggling with managing around a 1000 characters in their project.

So we have added mechanisms that optionally allow to organize characters in more meaningful ways. Unfortunately (?), we currently have two such mechanisms: a fundamental "ontology" definition (= glossary), and an operational mechanism (concept trees mapping on characters; specific concepts like part-hierarchy = structures, properties, methods are marked). I still find this somewhat unsatisfactorily, so the input of how Prometheus organized descriptive knowledge and then thinking about how this could be combined with a relatively low-structure "character model" is very valuable to SDD.

-- Gregor Hagedorn - 2 Apr. 2004

Some thoughts as we come to the end of our Prometheus 'Experiment'

Re: Technical and Semantic Standardisation

It seems very important to scope the range/level/extent over which semantic standardisation is attempted, possible or even useful.

A Top-Down attempt to impose standardisation across a large domain can never succeed. A standard terminology must be developed by/for a specific user/expert group, who can agree on the semantics of terminology for the range of their work. Perhaps it may never be valuable/meaningful to integrate data between distinct user groups - in which case mapping of ontological relations between different terminologies is not worthwhile. However, if different user groups do wish to share information - they would have to agree on a shared terminology between user groups - or do some horrendous non-automatic mapping between their separate terminologies. We have argued about whether it is possible to create a generic terminology - for a wide taxonomic domain, with layers of increasing specificity (perhaps as plug-in modules) of more specific terminology for distinct taxa. Or whether we require totally separate terminologies - that possibly could be partially mapped to each other. There is no obvious answer, but we have to make people aware that if individualised personal terminologies are developed and used - the possibilities for data integration are severely reduced. (A problem being that at an individual level a taxonomist may not care about data integration).

-- Main.TrevorPaterson - 01 Apr 2004

"... at an individual level a taxonomist may not care about data integration": perhaps, but if the data are readily available and can be easily imported, they care a lot about saving time and avoiding spending months on defining terminology. So I do not think that the manual mapping is necessarily horrendous, provided:

part of it can be achieved by voluntarily using certain blocks of terminology as "building blocks" for new projects. The use of building blocks means that different projects will have a common phylogenetic ancestry.
the mapping can be restricted to the most important terms for the integration task. Often across large groups such as "spores that can be found in the air" only a few terms need to be integrated. Manually all terms of large terminologies will be difficult, I agree.

Attempts like in Main.PrometheusII or http://www.plantontology.org to provide well defined sets of definitions will be very useful if they are provided in a way that new projects can incorporate them as parts (and preferably, predefined modules, rather than selecting individually from the the whole). That brings us back to SDD. If Prometheus higher plant terminology could be provided in a way that a new project could import it (to extend it locally, or to combine it with other imports), that would both save time for the new project, probably increase the quality of the descriptive data in the project, and provide a base for integration. NOTE that SDD DOES NOT include all the tools for this! Although we carry this scenario around with us (and have made structural decisions to support federation) We still have not decided on an include or import mechanism.

-- Gregor Hagedorn - 2 Apr. 2004

There's a little rambling on the subject in SDD.ExternalTerminology. Possibly it belongs here, but instead I'll link to here. Mostly it concerns the nature of the use of GUIDs for discovery of external terminologies. -- Main.BobMorris - 18 Apr 2004