Changes in 0.91 beta 15 (relative to the 0.9 Dec. 1. 2003 release)
This is an updated version containing most of the minor changes discussed at the [[SDD2004Berlin][meeting in Berlin]]. Some changes are still pending.
The current version of the SDD schema can always be found at CurrentSchemaVersion. Please do read through the report of changes, except perhaps for the few trivial at the start.
Please take a look at the schema to verify that you agree with the changes and that they make sense to you.
Note: I have tried to document changes, but I cannot guarantee that everything is properly documented. In fact, since GenerationMetadata and ProjectDefinition are heavily changed in an attempt to find common ground between the various GBIF standards (current discussion involves only ABCD so far), I have given up on documenting all detailed changes therein (but some are commented).
Trivial omissions that were present in 0.9, corrected in 0.91
audiencekey in ProjectDefinition/Audiences/Audience was specified to have a pattern in the documentation, but the pattern was not defined in the schema, regular expression pattern added to schema 0.91.
RevisionData were required in Description, Keys, and GlossaryEntries, now made optional.
The Keys collection could be missing, or empty (0 to unlimited Key objects), now changed to 1 to unlimited Key objects.
Non-trivial changes enacted plus proposals not enacted
Root
In an attempt to converge with ABCD:
Document root element changed to DataSets/DataSet collection. DataSet takes the place of the original Document. Multiple "Projects" can now be transported in one file or data stream. This is not urgent for SDD, but does not hurt either.
GenerationMetadata changed to TransformationHistory, conceived as a collection of at least one, possibly multiple Transformation elements. Alternative names: ConversionHistory, UBIF.DerivationHistory, HistoryMetadata, ContentHistoryMetadata, or DataHistoryMetadata.
ProjectDefinition
Element itself changed to ProjectMetadata
AudienceSpecificData/Representation split into Description/Representation and IPRStatements/Representation.
IPRStatements is a list of various copyright, terms of use, disclaimer, acknowledgment etc. statements (new type common to SDD and ABCD schema). However, this is also present in TransformationHistory!
ProjectDefinition/HistoryWebAddress dropped. Annotation was: "@@ To be discussed. The idea is that a project may point to a web resource that informs about details about the history of the data (previous versions or a detailed log of changes)." Unless somebody needs it now, I propose that this should be an addition in a later version rather than included in the first release.
ProjectDefinition/Icon moved to new ProjectDefinition/Description/Representation, thus making it audience specific. Icon (or logos) are not necessarily language independent since they may include text!
ProjectDefinition/WebAddress moved as well, different audiences/languages may be referred to different URIs!
New after Berlin meeting: attempt to use across standards (see UBIF.SchemaDiscussion), therefore audience-dependent project Description and IPR-Statements changed to language dependent. Language should simplify the adoption of common framework elements for all TDWG/GBIF standards.
New after Berlin meeting: Version structure revised.
Version/PublicationDate changed to VersionReleaseDate to avoid possible confusion with LastRevision or data generation date in online situations.
A Modifier element added (for beta, rel. candidate, etc.).
Increment removed (because considered application-internal management mechanism, no need for interoperability).
Major and Minor left as integers to improve interoperability and comparability (nobody commented on the proposal "change version to string" posed in previous version of the change log.)
New after Berlin meeting: The narrative (unconstrained text) elements GeographicCoverage and TaxonomicCoverage in ProjectDefinition|Projectmetadata/Description/Representation combined to Coverage.
Constrained ClassScope added, __OtherScope needs a proposal how to link it to other vocabularies. SourcePublication changed from a single to possibly several, and considered a scoping mechanism as well.
NOTE: Project Definition could also be called "Envelope". This avoids "project", which is meaningful in SDD, but perhaps problematic in ABCD/taxon names?)
QUESTION: Can project definition be merged with transformation history?)
PROPOSAL: Need documentation of quality control methods and standards, e. g.
QualityControlStandard: Name (and version, if applicable) of the published or internally documented quality control standard used.
QualityControlDescription: Free-form description of methods used to ensure the quality of the descriptive data. In the absence of a standard, this should be a short description of the quality control procedures taken.
QUESTION: ProjectDefinition/RevisionData/InitiationDate is xml:dateTime and required, which may cause problems in legacy projects. See discussion under InitiationDateForImportedLegacyData. The proposal makes sense in the context of project definition. However, RevisionDataType is also used in several other contexts (single descriptions, glossary, characters, etc.) and the proposal does not make sense there. Do we need two slightly derived types? Has anybody a better idea?
GeneralDeclarations
New root section "GeneralDeclarations" created for concepts not specific to SDD, but needed in the schema. Alternative names for this section are: GeneralDefinitions, OverarchingIssues/Functions, CrosscuttingIssues/Functions, GeneralTerminology, GeneralTerms, GeneralVocabulary (the latter three do not cover the possible inclusion of "language rules"). The following elements moved there:
(Newly created:) Global definitions for MeasurementUnits (Character definition Numerical/MeasurementUnit is consequently changed to a ref type). The optional generalization allows to define relations between units such that two size measures, one expressed in mm the other in cm become comparable.
In each of CodingStatus, UnivariateStatisticalMeasure, MeasurementUnit, the "Generalization" element (containing the machine-readable partial semantics of an object) was renamed to Specification.
The Audience definitions lang and expertiselevel, previously defined as attributes, have been reorganized to follow the pattern of Label + Specification.
The defaultaudience attribute present at Audience was only appropriately placed because all audience definitions were considered part of the project definition. Now it is separated and moved to ProjectDefinition/DefaultAudience.
StatisticalMeasures renamed to UnivariateStatisticalMeasures (compare Bob's comment on TWIKI about ClosedTopicMultivariateStatistics).
Related: the fact that Char. def. Numerical/StatisticalMeasures had both a ref and a key confused several reviewers. To clarify, the key has now been renamed from ref to GeneralDeclarationRef and both this and the key on GeneralDefinitions/UnivariateStatisticalMeasures/UnivariateStatisticalMeasure is typed as StatisticalMeasureKeyValue.
Element "Dimensionless" added to Specification of UnivariateStatisticalMeasures (answers whether the measurement unit apply to a statistic or not).
Terminology
Sequence of sections changed, Terminology section placed after Entities and Resources sections.
Multiple new ontological relations between terms added and subsumed under a new Ontology element. This urgently needs review!
SensuLabel and KindOfTerm added. The first allows to distinguish between multiple definitions of a term (Term does not have to be unique, but Term + SensuLabel has to be!), the latter categorizes terms (is that doubtful??).
With the introduction of SensuLabel, Term is no longer a keyref in the ontological definitions (synonym, antonym, etc.). Replaced with TermListType = List of GlossaryEntryRefType.
Ontology now refers to GlossaryEntry keys rather than Term strings in a specific language. This is partly necessitated by the introduction of a SensuLabel.
As a result, other parts of the GlossaryEntry (Citations, RevisionData) have now been made language/audience-independent as well. This also resolves some anomalies, e.g. that RevisionData were one the audience-specific part instead on the language-independent object as in all other cases in SDD.
ExternalReference changed to ExternalDefinitionURI
CharacterDefType
Label changed from LabelPlusAbbreviationType to SimpleLabelType. This simplifies the model: Only a single label can be defined at the character level, all extended concepts (abbreviations, export tokens, images) are definable only in concept trees. Since concept trees require a terminal node for each character, the same expressiveness is maintained.
Type changed to MeasurementScale, value list completed to include "ratio".
Section Assumptions added to the character definition, MeasurementScale moved there
Categorical and Numerical are tentatively changed to a choice rather than co-occurring. This needs discussion!
PlausibilityRange added to numeric character definition. Applies to all values and statistics, except those that are dimensionless (like variance).
GenericStates renamed to ConceptStates (= states that are present at nodes in the concept tree; this is the only place where GenericStates was present). "Generic" was considered to be confusing since for biologists it may be understood as referring to states describing a Genus.
"Probability modifiers" have been renamed back to "Certainty modifiers" (they were previously called "Uncertainty modifiers" before changing to "Probability". As already discussed in Brazil (but later forgotten), Probability is ambiguous since low occurrence frequency of a state also results in a low probability that a given object has a given character state.
Terminology/Modifiers/Sets (intended to define reusable modifier sets which would then be associated with characters) and CharacterDefType/ModifierSets where both replaced with a new Concept/ApplicableModifiers element in the concept trees. For the modifier sets a key and a label had to be defined so they could be selected in each characters through a keyref. The new solution avoids both the label and the key/keyref mechanism: The concept label also identifies the modifier set, and the characters are already defined by all characters included in a concept branch. The disadvantage is, that some tree-walking is required to find which modifier is applicable to which character.
In frequency modifiers "ProbabilityRange" was changed to "CertaintyRange".
Frequency and Certainty modifiers changed to now contain the Range definition inside a Specification element.
Concept trees: An organizing element "Specification" added (similar to definitions in GeneralDeclarations). The types, roles, etc. inside were reorganized and the enumerations changed (e. g., MethodHierarchy to InstrumentationHierarchy, PartHierarchy split into PartOfHierarchy and PartGeneralizationHierarchy). Also please critizise the current structure: "DesignedFor/Role=Filtering". Do the element and value names make sense to native speakers? Any better suggestions?
PROPOSAL: Rename AutoAddStates to UpdateStateRefsTriggers (those state from a generic state set must be as StateReference in Character/Categorical/States). GH: I believe it should be the other way round, i.e. instead of a state-set reference at the character, there should be a list of characters referenced at the place concept node. I have started to do this, but not yet finished! See "####" at the end of the document!
QUESTION: Allow multiple mappings of fine-grained states to coarse-grained states, and make these mappings expertise-specific (part of audience definition)? Do we need multiple state sets within a character? Broad categories and narrow categories? Currently mapping of state is within a single character, and the two state sets need to be detected by application (those present minus those mapped away. Note: mapping can be indirect a-> b-> c, only c should remain.) Do we need multiple named mapping definitions in the future? See StateMapping for further discussion.
Entities
The "connector" metapher was not well received and not considered intuitive. As an attempt, I propose to use a proxy metapher: The proxy object is a local object "standing-in" for the external, often asynchronously available resource on the internet. In programming this is called the "proxy-pattern". As a variation proxy objects may, however, also "stand-in" if no external object can be found and a local object (e.g. in biology: taxon name, specimen) has to be defined. Specific changes:
ResourceConnectorBaseType changed to ProxyBaseType
ClassNameConnectorType, ClassHierarchyConnectorType, DescribedObjectConnectorType, etc. all changed to ...ProxyType
Within the ProxyBaseType, the FreeFormDescription was changed to Label. For all internal SDD object like characters or states, Label signifies a human readable representation, which is the intent of this data element as well.
The ID/external object linking was strongly changed. The previous version (which was never really worked out so far) worked only if the object query could be embedded into a single URI query string, or if the old ServiceProvider referred to a web service wsdl with a single method and a single parameter. Now the ObjectLink rather than the old "ExternalID" points to the object in case of a single URI query string. The method and parameter names, and the ID-values are now given separately for web services. Furthermore, ABCD does not plan to provide a single or unified ID for collection units, but uses three separate variables that together uniquely refer to a specimen object. This is supported, but it would still be desirable to have a single ID to simplify ID comparison and distinguish ID from other parameter values that may be required to use a webservice method (but may be constant for different objects).
In addition to URL and webservice, tentative support for DOI (digital object identifiers) and LifeScience ID (LSID) was added (including an LSIDs defining a pattern constraint).
New after Berlin meeting: Sequence of Label (= FreeFormDescription in 0.9) and ObjectLink changed; Label is now first. This agrees with the use of Label throughout the other parts of the schema (characters, states, etc.).
Entities/Classes changed to Entities/ClassNames, //Class to //ClassName. Note: in addition to the ClassName (taxon name) pointers present we may need alternative pointers into the class concepts (taxon concepts) as present ClassHierarchy!
"TaxonNameInSource" renamed to "ClassNameInSource". Related open issue: Combine with Location? Else we need to have a CitationBaseType without ClassNameInSource used in Glossary and Keys, and a derived type used in Descriptions!
New after Berlin meeting: ClassIdentification changed to ClassAssignment; the process will be an identification, but the result is assigning the object description to a class. The term Identification caused confusion in the discussion.
Bob pointed out the inconsistency of declaring the standard to be independent of the biodiversity domain (thus using class/object instead of taxon/specimen) and still having taxon, taxonauthor, etc. in UBIF.FormattedText. For the time being I have removed these (they are still preserved in an unused backup version of the type, so they can easily be put back).
Similarly, the biology-specific elements Sex and Stage were removed from ClassNameProxyType (= ClassNameConnectorType in 0.9; = the type of the proxy object defining links to external name databases). SDD assumes that ClassNameConnectorType in the future will connect to nomenclators or species databases and these are unlikely to provide separate records for sex and stage. It would have been possible to move Sex and Stage to DescriptionBaseType, but they are required at the end of the diagnostic keys as well (sexes or stages may be keyed out separately!). Thus, a new type ClassRefWithAdditionalClassifierType has been derived from the ClassRefType and used for DescriptionBaseType/Class (which is the basis for coded as well as natural language descriptions) and StoredKeyDefType/Lead/Class. Furthermore the Object identifications may be sex/stage specific (but also many objects will have multiple stages in a single specimen...). At the moment the new ClassRefWithAdditionalClassifierType has also been used at DescribedObjectConnectorType/ClassIdentification.
The above mentioned type ClassRefWithAdditionalClassifierType should be defined generalized, avoiding biology-specific concepts like sex and stage.
ClassHierarchies was previously restricted to single hierarchy, now allows multiple ClassHierarchy objects. A ClassHierarchy is the only way available in SDD to define taxon subsets (character subsets are defined in the ConceptTrees).
PROPOSAL: Add an Abbreviation element to Class and Object in Entities? Would not likely be updated by service, but may be useful or even required for reports. Update problem is related to problem with updating the Caption of MediaResources.
Descriptions
In coded and natural language descriptions a Header element was introduced to improve the overview and organization of information.
CharacterData_BaseType/Sequence with values "terminology" or "description" was considered difficult to understand. Bob proposed to replace it with a boolean "StatesAreOrdered" which has been done.
PROPOSAL: Rename CodedDescriptions to SymbolicDescriptions, see Analytical Philosophy (I only checked the Enc. Britannica, I am no expert in this!)
Keys
Keys/Key was changed to IdentificationKey/IdentificationKeys. The term "key" was perceived as too general, causing especially misunderstanding for non-biologists like programmers. Instead of the depracated "guided key" other terms are "Pathway key" and "Stored key". "Dichotomous key" is inappropriate.
CodedStatements in Keys (coded terminology equivalent to the natural language key statement) used to be a simple list of states. To accomodate the frequently occurring more complex statements in keys, e. g. "margin of fruitbody yellow (or orange and hairy)" -> i.e. not if only orange, or "margin of fruitbody yellow, never with denticles" -> other surface structures may be present, a boolean operator logic modeled after MathML has been added to CodedStatements inside Keys.
Related: Should Boolean logic (not, and, or) be added to any natural language markup?
Should guided keys be marked up using the natural language markup method rather than using a separate section, as currently proposed? Currently, the key markup was thought to follow the coded description model, but now it has been extended. Problem: Boolean logic is frequently found in the lead statements of keys, but rarely in natural language taxon descriptions. However, if Boolean logic operators are introduced to both, it would be a strong argument to use the same method in NLD and Keys, rather than having three variants.
Alternatively, we may want to extend the CodedDescriptions and provide Boolean logic operators there as well. This would be a heavy burden on database-oriented descriptive data processing, however. Or can someone provide a simple model how to handle arbitrary logical and/or combinations in a relatively simple database model?
General
CitationType: optional LastVerified and InvalidSince date elements added, important for volatile online publications.
The application-specific data containers (= extension mechanism to store non-SDD data) has been renamed from ApplicationData/Application to CustomExtensions/CustomExtension. Several applications may agree on common extensions, in which case the old names would not have been appropriate. The mechanism itself remains unchanged.
Model groups like "(Rich)AnnotationGroup" containing only optional elements have been themself made optional. This changes nothing in the validation and schema, but seems to help when using Castor data binding.
In the LabelPlusAbbreviationRepresentationType (used frequently in Label/Representation elements) the Selector element containing media (usually images) was renamed to MediaResources. This is the same element name used generically throughout the schema.
The name "Selectors" was intended to express that only certain media should be added here - those that are sufficiently informative and concise at the same time to be used as selectors instead of text labels. However, the use of Selector lead to more confusion than clarification, and the purpose of the media is expressed through the Label context, i.e. these are labeling images etc.
The only other media resource is Icon which remains semantically labeled.
Open Questions
Class names (= taxon names referenced in descriptions or keys) may have to be audience specific! See LanguageSpecificClassNames for a discussion!
Descriptions generalization questions, i.e. inferring descriptions from other descriptions:
Main.PrometheusII proposes to explictly reference descriptions that are to be included or generalized into a current description. Currently we expect in SDD this to rely on am automatic "description resource discovery" mechanism, i. e. all object descriptions with the same class name are generalized, and classes are generalized to higher classes following the class (taxon) hierachy defined in Entities.
BioLink proposes (correct?) to explicitly flag which characters or states allow generalization, and whether from above or below.
(= the first is explicit generalization on the object/class hierarchy level, the second explicit which characters/states are included in generalization.)
Related: SDD probably needs a mechanism to mark the results of aggregation/generalization, computed characters, calulated statistics to document whether they are calculated / inherited or directly entered.
Related: Do we have to document original terminology labels during data entry (i.e. in the language/audience representation used during scoring). The audience itself may be interesting (as a code), but even more the terminology may have been changed slightly (evolution of terminology) since scoring. A record of score-time representation would increase the trust in the coded scores and allow some backtracking of problems.
In Descriptions we call an element GeographicalScope, in ProjectDefinition basically the same thing GeographicalCoverage! However, Descriptions refers to defined objects in Resources, whereas in ProjectDefinition it is free-form text (modeled directly after DublinCore). Make this consistent and always use Resources/Geography/Location object references?
Problem of storing calculated data and marking them as "autogenerated" (or which term to use?). Related to problem of inheriting information up and down taxonomic tree. Similar problems are already marked up in the "Origin" element in character and NLD data, and in the inherited attribute associated with character ratings. In the case of statistical measures, marking the Origin as calculated would refer to the raw data in an observation set. However, there is some discussion on the Wiki (see RepeatedObservations) whether we need a keyref to exactly one observation set or not.
We probably need to have more than one class hierarchy and add a marker to indicate which hierarchy is formal, and which contains non-taxonomic groupings. In Brazil Kevin reported on Lucid providing a "tag" mechanism to mark "silly characters" intended only to group items like "100 worst weed species: yes/no". XPER reported a similar tag mechanism for items (instead of characters as in Lucid) to tags items for specific problems: diseases / quarantine species / disease vectors. To me both kind of problems seem to be most appropriately handled as a non-taxonomic class hierarchy. Any proposals how to handle this? As a first step an additional attribute "IsPhylogenetic" in the class hierarchy is proposed (already done).
Glossary:
Do we need some method to express ranges for cardinality: How many legs may there be etc.?
Do we need some method to associate states with properties/types?
Should the natural language markup be brought closer to xhtml by using <span class=""> for markup?
Basing character states on concept states (= reuse of state sets in multiple characters) causes problem with order (ordinal scale) characters. The states in a character may be inherited from from multiple concepts nodes. Each of these will probably have order in the concept, but the final order can only be defined in each character. This seems unfortunate.
Can we describe images? Is this automatically implied in reversing the association between a description and an image or not? Images may only illustrate parts of the description.
Can we format numeric values in reports? See DELTA *DECIMAL PLACES. How do we format sets of statistical measures in natural language or other reports? The (min-) lowerrange - central - upperrange (-max) format is not necessarily universal. Currently it is nevertheless fixed in application code and cannot be defined by users. Since many variants which individual measures are present exist, this can probably not be done with a TextBefore/After strategy (possible for Min, Max, but not for ranges with/without mean, "3-6", "5", "3-5-6"). Also, open ranges exist, which should be output as "at least 3 cm long" in natural language. Also: formats are audience/language-specific!
Can we find a smart method to format related and dependent value like width x length?
Using polymorphism for character definitions. Color as separate character type?
Media Resource may need a location detail (if figure has multiple labeled fragments). Perhaps call this FragmentLabel?
Media-"FragmentLabels", but even more the "Location" in Citations may be language sensitive! "table 1", "tab. 2", "figure 3" in English, "Abbildung 3" in German etc.!
Problems I believe cannot be solved in xml schema
(please tell me if you disagree!)
We have a frequently used type that prevents validation of requiredness in SDD schema: Most labels use FormattedSimpleTextType, which if the element is required should always be non-empty. However, in contrast to simple text strings, FormattedSimpleTextType allows limited formatting (sup/sub etc.) and has a mixed content model. As a result, it is not possible in xml schema to require the length of it to be at least 1. This may be a case where we have to make a recommendation not to output empty elements, and a requirement that a missing element and an empty element are to be considered identical (applications should not attach different semantics to empty elements).
The missing element issue seems approachable by declaring things nillable and allowing xsi:nil="true" to distinguish from the missing case. This arose also in the discussion ResolvedTopicIsDiGIRadequateForBDI.SDD -- Main.BobMorris - 29 Apr 2004
I cannot follow your argument. The problem I state above is that I cannot constrain the Labels to actually contain a string, the element must be present but may contain nothing. There seems no mechanims in schema to prevent that. I know you warned us against mixed content model! -- Gregor Hagedorn - 3. May.
One reason why this is relevant is that I believe we have to introduce a similar mechanism for StatisticalMeasures, to allow defining sets of statistical measures centrally (min-max range, a simple range/mean type like DELTA, extensions including variance and sample size, etc.).
Also, we have modifier sets as well. Can we also run them over a concept-node-based system, so that we have very similar systems for States, Measures, and modifiers? That seems to improve the schema. Unfortunately, with modifiers I am uncertain how well this works. Modifiers almost cry for inheritance down the concept tree, something we have not yet done so far!