ObjectTypePattern

GregorHagedorn - Sat Mar 19 2005 - Version 1.16

Introduction

This topic discusses the various forms in which an object type (such as taxon concept, geographical area, publication, etc.) may occur in a TDWG xml schema. The goal is to address the problem of object relations, object resolution, and persistence of object and relation semantics in a more general framework, and to allow the construction of truly modular and flexible TDWG schemata. The proposal is based on discussions between me (Gregor Hagedorn), Bob Morris, Donald Hobern, Jessie Kennedy and Markus Döring over the last few months. It closely corresponds to Donald Hobern's message "[tcs-lc] Modularisation of standards, Tue, 8 Mar 2005 10:51:19 +0100".

-- Main.GregorHagedorn - 09 Mar 2005


Levels of detail and structure

Traditionally, information models distinguish primarily between a singular object (instance of a class) and references to it. However, in the context of data interchange, objects are in many cases only representations of richer objects available elsewhere (e.g. in a local institutional database). These representations may have different levels of detail and structure (compare first slide in DataModelMatrix). Levels of detail may be:

1. A short/concise human readable label (or "title"), possibly in multiple languages.
   * Example: Publication references are currently mostly expressed through such human readable strings
2. A resolvable identifier such as url, lsid, doi that informs machines about where additional information can be found
   * Example: Publication references using PubMed URLs or the publishing industry's DOI identifiers
3. A combination of the above
4. A free-form, unconstrained text
   * Example: A natural language description, a text listing and commenting distribution areas
5. Unconstrained text with additional semantic mark-up using xml or other methods
   * Examples: html formatted text (mixed xml content); a natural language description with SDD markup (no mixed content)
   * Note: plain text with no markup may evolve into text with partial markup, and then complete markup. It is unclear whether different types are desirable, or whether the no-markup case should simply be considered a special case of partial markup. A possible name for these alternatives may be "BagType"
   * Note: to avoid mixed content while still combining a limited amount of formatting with semantic markup, the UBIF text formatting conventions may be used, see FormattedText
6. CoreType data: constrained, relatively flat content model (DarwinCore, LinneanCore, AlexandriaCore, etc.)
7. DetailType data: detailed content model (ABCD, SDD, TCS-like)
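
As a rough illustration of levels 1 to 3, here is a minimal Python sketch that reports whether a representation carries a human readable label, a resolvable identifier, or both. The element name `Publication`, the attribute names `label` and `href`, and the `urn:example:` identifiers are assumptions made for this sketch only; they are not taken from any TDWG schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: element and attribute names are illustrative only.
SAMPLES = {
    "label only": '<Publication label="Example publication title"/>',
    "identifier only": '<Publication href="urn:example:pub:42"/>',
    "label and identifier": '<Publication label="Example publication title" href="urn:example:pub:42"/>',
}


def detail_level(xml_text):
    """Return 1 (label only), 2 (identifier only), 3 (both) or 0 (neither)."""
    elem = ET.fromstring(xml_text)
    has_label = bool(elem.get("label"))
    has_id = bool(elem.get("href"))
    if has_label and has_id:
        return 3
    if has_id:
        return 2
    if has_label:
        return 1
    return 0


for name, text in SAMPLES.items():
    print(name, "->", detail_level(text))
```

Levels 4 to 7 concern the content model rather than single attributes and are not covered by this sketch.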

-- Main.GregorHagedorn - 09 Mar 2005


Other representation parameters

In addition to representations of different levels of detail and structure, an abstract (a character, a taxon name) or concrete (a specimen, a publication) object may have representations that differ in:

One result of this is that many of these representations (e.g. a resolvable id, with or without label or core-data) are at the same time object representations and links to further object representations. A representation without machine-readable links consisting only of a label addressing humans (Level 1, above) may be interpreted as a "human resolvable" link to resources found elsewhere (as in free-form textual publication references). Thus the conventional distinction between objects and references is blurred.

-- Main.GregorHagedorn - 09 Mar 2005


#AnchorInstanceID

Instance ID versus abstract IDs

Whether an "object ID" refers to an abstract/concrete object or to a specific representation is in many cases not clear.

Another aspect of id references is whether they are:

It seems appropriate to distinguish between the digital identifier of an object representation that is used for referencing it in its exact form ("instance id"), and digital identifiers of abstract or concrete concepts which the current object instance represents ("abstract IDs" such as DOI IDs). Proposal:

1. The term Instance ID is used for values identifying a specific digital representation of an abstract or concrete object. In UBIF-conforming xml standards, this ID is expressed in an attribute with name "id" placed immediately in the element corresponding to the object type.

#AnchorInstanceIDRecommmendation

2. Relations to fundamental physical or abstract objects, of which the current instance is a representation, are expressed as Abstract IDs. These should be expressed in "Link" elements within objects, corresponding to concepts found in XLink.

(the use of further XLink attributes needs to be investigated!!!)

#AnchorObjectReferences

3. Object references. The link information is considered a form of possible object representation, since it will often occur together with any level of detail, as discussed above. This is different from a reference that points by instance ID to a specific representation of an object, either within the same document (as used, e.g., in xml schema, where elements, attributes, etc. may either be defined locally or by reference), or on the internet, e.g. through lsids. Since both the abstract ID link and the instance reference are a form of reference, the latter may be called "instance reference".
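
To make the distinction concrete, here is a minimal Python sketch that reads the instance ID from the "id" attribute and the abstract ID links from "Link" child elements. The XML is shaped after the draft example in the Examples section of this page, which is itself marked as needing revision; the function name `split_ids` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shaped after the draft example on this page; still subject to revision.
DOC = """
<Object id="urn:lsid:lsid.gbif.net:IPNI:157927-1">
  <Label xml:lang="en">example label</Label>
  <Link rel="alternate" title="IPNI version" href="http://www.ipni.org/xyz/?id=2349872"/>
</Object>
"""


def split_ids(xml_text):
    """Return (instance_id, [(rel, href), ...]) for one object element."""
    elem = ET.fromstring(xml_text)
    # The "id" attribute identifies this specific digital representation.
    instance_id = elem.get("id")
    # Each Link child points at something the representation stands for.
    links = [(link.get("rel"), link.get("href")) for link in elem.findall("Link")]
    return instance_id, links


instance_id, links = split_ids(DOC)
print("instance id:", instance_id)
for rel, href in links:
    print("link:", rel, href)
```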

-- Main.GregorHagedorn - 13 Mar 2005

See IdentifierFunctionality for my take on this. A few minor points of disagreement, but maybe not enough to justify how much I wrote there :-) -- Main.BobMorris - 14 Mar 2005


Semantic persistence

Some kinds of digital identifiers (resolvable or not) are in practice only in temporary use, spanning perhaps several years, the most notorious example being standard internet urls. This is no problem for many commercial uses, but it is a serious problem for scientific use. In taxonomy, the expected lifetime of information is often centuries, and even in new fields like molecular biology decade-old information may be highly relevant. The problem of changing technologies, and the resulting possibility that id-resolution contracts break, cannot be overcome, but it need not be. For the purpose of science, a human readable semantic equivalent, similar to the object identification methods used in conventional printed publications, is fully sufficient. In principle, it is possible to embed such information into a machine-readable identifier. However, the fact that urls are usually at least partly overloaded with semantics is a major reason for their instability. Part of this is habitual (semantic values lead to their use in debugging, and having erratically misleading semantics because concepts change may be worse than having no semantics at all); part of it has legal reasons (a former employee is usually legally prevented from continuing to use institutional or product terms). The identifier (urn, lsid, etc.) itself should therefore not contain any semantics.

Conclusion: At places where an identifier for an object is used, a second attribute should preserve the semantics of the object in human readable form. The proposed name for this attribute is either "label" or "title", please vote on this!

Uniqueness: It is recommended that the text in this "label" attribute should identify the object as uniquely as possible. The uniqueness is, however, not subject to strict validation. Requiring scientists to achieve strict uniqueness would consume a disproportionate amount of time and effort.
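
A minimal Python sketch of how a consumer might use such an attribute: the identifier is tried first, and the human readable label is shown when resolution fails. The element name `PublicationRef`, the attribute names `ref` and `label`, the `urn:example:` identifiers and the dictionary-based resolver are all assumptions made for illustration, not part of the proposal.

```python
import xml.etree.ElementTree as ET


def display_text(elem, resolve):
    """Prefer the resolved title; fall back to the human readable label."""
    try:
        return resolve(elem.get("ref"))
    except LookupError:
        # Resolution failed (service gone, identifier scheme retired, ...):
        # the label still tells a human reader what was meant.
        return elem.get("label", "(unlabelled object)")


# Hypothetical resolver that knows a single identifier.
KNOWN = {"urn:example:pub:1": "Resolved title from the provider"}


def resolve(identifier):
    return KNOWN[identifier]  # raises KeyError (a LookupError) when unknown


live = ET.fromstring('<PublicationRef ref="urn:example:pub:1" label="Stored label A"/>')
dead = ET.fromstring('<PublicationRef ref="urn:example:pub:2" label="Stored label B"/>')
print(display_text(live, resolve))
print(display_text(dead, resolve))
```

The second call illustrates the persistence argument: the identifier no longer resolves, but the stored label survives with the data.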

-- Main.GregorHagedorn - 09 Mar 2005


Social and technical issues regarding size of xml document

In certain cases truly large data sets may have to be transferred. For example, the main dataset behind www.lias.net contains 250 000 descriptive records. To attempt a lichen identification without knowledge of the family, it may be necessary to obtain the entire data set. SDD already attempts to balance readability against size in this case and still needs about 50 Unicode characters per such statement, resulting in a file size of ca. 12 MB (plus data for terminology, which has lesser scaling problems). Such file sizes still cause problems in settings where size matters. To achieve a relatively modest file size, SDD defines the characters and states only once, and then uses these objects in the description by (mostly document-internal) reference. Requiring internal references to also bear semantic information, despite the fact that such information is available in the same document (and the link can therefore not possibly break), is not desirable. On the other hand, for the purpose of debugging it may be desirable to enrich internal cross-references with human-readable text. This has been tested successfully in SDD (@debugkey/ref, add link!!).

Conclusion: Whereas for most objects a human readable label (preserving the semantics of the scientific data) should be provided, an exception should be made for document-internal cross-references. No label should be required here. For the purpose of debugging, optional attributes may be provided, the content of which may be dynamically filled or deleted.
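
The dynamic filling and deleting of such a debugging attribute could look roughly like the following Python sketch. The document structure and the attribute name `debuglabel` are hypothetical and do not reproduce the actual SDD mechanism; only the idea of copying labels onto internal references and removing them again is taken from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: characters are defined once, then used by reference.
DOC = """
<Dataset>
  <Characters>
    <Character id="c1" label="Leaf shape"/>
    <Character id="c2" label="Thallus colour"/>
  </Characters>
  <Descriptions>
    <Description>
      <Char ref="c1"/>
      <Char ref="c2"/>
    </Description>
  </Descriptions>
</Dataset>
"""


def fill_debug_labels(root):
    """Copy each target's label onto the internal references pointing at it."""
    labels = {e.get("id"): e.get("label") for e in root.iter() if e.get("id")}
    for ref in root.iter("Char"):
        label = labels.get(ref.get("ref"))
        if label:
            ref.set("debuglabel", label)


def strip_debug_labels(root):
    """Remove the debugging attribute again, e.g. before transmission."""
    for elem in root.iter():
        elem.attrib.pop("debuglabel", None)


root = ET.fromstring(DOC)
fill_debug_labels(root)
print(ET.tostring(root, encoding="unicode"))
strip_debug_labels(root)
print(ET.tostring(root, encoding="unicode"))
```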

-- Main.GregorHagedorn - 09 Mar 2005

Standard compression techniques such as gzip seem to give compression ratios of 2-3:1 easily. See "Compress XML files for efficient transmission" at IBM developerWorks. Various claims turned up by Google cite 20-30:1 with various techniques. -- Main.BobMorris - 14 Mar 2005
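
A quick way to check the effect on one's own data is a few lines of Python using the standard gzip module. The sketch below compresses a synthetic, highly repetitive document with hypothetical element names; the ratio it prints says nothing about real SDD data and is not meant to confirm any of the figures quoted above.

```python
import gzip

# Synthetic, highly repetitive stand-in for a large descriptive dataset.
# Real data will compress differently; no particular ratio is implied.
record = '<Char ref="c{n}"><State ref="s{n}"/></Char>\n'
body = "".join(record.format(n=i) for i in range(10000))
xml_bytes = ("<Dataset>\n" + body + "</Dataset>\n").encode("utf-8")

compressed = gzip.compress(xml_bytes)
ratio = len(xml_bytes) / len(compressed)
print(f"original:   {len(xml_bytes)} bytes")
print(f"compressed: {len(compressed)} bytes")
print(f"ratio:      {ratio:.1f}:1")
```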


Design pattern for object types

From the above, a fundamental pattern for xml object types in UBIF (and in fact outside of biosciences) can be derived. XML Object types may occur as:

-- Main.GregorHagedorn - 09 Mar 2005


Naming pattern for xml elements

My personal preference would be to use only a single element name for each domain type like "publication". For example, I find it desirable to have <Publication id="11"> and <Publication ref="11"> for object definition and reference, and to distinguish the type by the presence of the id or ref attribute. However, this is not possible in W3C XML Schema, which is largely unable to recognize a type based on element attributes and complex content (the first particle encountered must be able to distinguish the types, which is very difficult for attributes, which may occur in any order; it seems that even required single attributes are not used for this purpose).

It is possible to design a schema that uses the same element name for different types, but this implies that the schema must be fully prescriptive as to whether an object composition or an object reference is to be used, and which level of detail or kind of reference. The belief behind the current proposal is that it is desirable to create xml schemata that leave the user some choice as to which level of detail can be supplied. To our knowledge, this implies that different types must use different element names.

Below is an attempt to propose a possible naming pattern. "{Object}" would be replaced with "Publication", "Agent", "TaxonName", "TaxonConcept", "Specimen", etc.

Aside: In addition to the {Object}Ref pattern, schema designers may use a different pattern for references that is based on abbreviations. In some cases, including SDD, reference names should be as short as possible, considering the social problems with document size. Thus instead of "Character" and "Publication", "Char" and "Pub" may be used. The important point in the pattern is that these abbreviations need to be different from the names of the elements they refer to. (Part of the naming pattern is the agreement between SDD, ABCD, and TCS to use CamelCase for elements and complete lowercase for attributes.)
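
A minimal Python sketch of how separate element names for definition and reference can be processed: every {Object}Ref element is paired with the {Object} element whose id matches its ref attribute. The collection names `Dataset`, `Publications` and `TaxonNames`, the `label` attribute and the function `resolve_refs` are hypothetical; only the Publication/PublicationRef naming and the id/ref attributes follow the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using distinct names for definition and reference.
DOC = """
<Dataset>
  <Publications>
    <Publication id="11" label="Example publication"/>
  </Publications>
  <TaxonNames>
    <TaxonName id="t1" label="Example name">
      <PublicationRef ref="11"/>
    </TaxonName>
  </TaxonNames>
</Dataset>
"""


def resolve_refs(root, object_name):
    """Pair each {Object}Ref element with the {Object} whose id equals its ref."""
    definitions = {e.get("id"): e for e in root.iter(object_name)}
    refs = root.iter(object_name + "Ref")
    # The second item of a pair is None when no matching definition exists.
    return [(r, definitions.get(r.get("ref"))) for r in refs]


root = ET.fromstring(DOC)
for ref, target in resolve_refs(root, "Publication"):
    found = target.get("label") if target is not None else "unresolved"
    print(ref.tag, ref.get("ref"), "->", found)
```

Because the element name alone tells definition and reference apart, the lookup needs no inspection of attributes to decide which kind of element it is dealing with.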

-- Main.GregorHagedorn - 18 Mar 2005


Rules governing placement and relations of objects

Most objects may occur either in object collections defined in the root of the dataset, or as part of the composition of another object.


Examples

######## needs revision ###########

<Object id="urn:lsid:lsid.gbif.net:IPNI:157927-1" idtype="@@@ remove this?">
  <Label xml:lang="en">... probably still multilingual ...</Label>
  <Label xml:lang="fr">francais</Label>
  <Link rel="alternate" title="IPNI version" href="http://www.ipni.org/xyz/?id=2349872" />
  <Link rel="meta" title="abstract and copyright" href="../Help/xxx" />
  ...
</Object> 
######## needs revision ###########


-- Main.GregorHagedorn and Bob Morris - 09-14 Mar 2005


Updates to UBIF:

Following the example of TCS (Jessie Kennedy et al.), UBIF now also allows multiple object collections in the root of the data set. In the case of validated reference, the object may be in any dataset within the datasets collection. It is thus possible to have one object type as "payload" of each dataset as previously, but it is also possible to have multiple objects all from the same provider with the same metadata, similar to the previous use of proxy objects.

We should worry a bit about whether this inhibits desired capability in <xs:import>. Do global things suddenly become non-global? Actually, in general, I don't even see how you can support multiple root collections at all in XML Schema, so maybe my point is moot. All in all, I am nervous about what happens under XML inclusions of stuff that tries to xi:include something that tries to make a reference into multiple object collections -- Main.BobMorris - 14 Mar 2005

CURRENTLY UNFINISHED - I still need some feedback on how to do it!

To achieve this object model, we also need an ontology of knowledge domains and corresponding object types, see ObjectOntology. How shall the totality of biodiversity information be subdivided into object classes?

Part of this is also touched on in ObjectIdentifierPattern (which needs revision and discusses in detail the question of how to represent object identifiers and identifier references).

-- Main.GregorHagedorn - 09 Mar 2005