ObjectIdentifierPattern

GregorHagedorn - Mon Mar 14 2005 - Version 1.19
Note March 2005: Although the proposed solutions may not be ideal, there have been no comments on this since December 2004. I therefore plan to summarize the discussion and present conclusions soon. Since I cannot do that today. please feel free add comments - I will incorporate them when summing up. Please also see ObjectTypePattern.

Probably any kind of object exchange (through TCS, SDD, ABCD, perhaps LinneanCore (LC)) has the following requirements for identifiers:

Inform about the kind of identifier in the data source and whether they are temporary or stable
Cross-link between objects from the same or different knowledge domains (e. g. refer to a taxon name)
Provide machine-resolution mechanisms (e.g. URL, LSID, DOI)

Part of the following discussion is still centered around LC or TCS, but really I think it is a very general question and e.g. SDD objects (characters, descriptions, as well as ClassName or Unit proxies) should follow the same pattern.

The discussion is already a bit long, if you are in a hurry, you could try to pick it up at the id-type diagram.

When revising LC, Sally proposed a general "RecordId" type that is both used to define the identity of a LC object, and to link to other objects. Sally's annotation: "The locally or globally unique Identifier for this name record. Could be one of three kinds of ID: GUID (persistent, machine resolvable such as an LSID), human resolvable Global ID (e.g. 'IPNI citation ID 14098-1') or locally (within this dataset) resolved ID (e.g. '1' or 'K001928712') - probably belongs somewhere else if it looks useful and could end up being widely used". This type has two attributes: "type" and "id". "id" is xs:string. The attribute "type" belongs in Sally's proposal to an enumeration of "Human Resolvable", "Locally Resolvable", "Globally Resolvable". Example: <NameRecordId type"Human Resolvable" id"IpniCitation 1003545-1 version 1.1"/>

The design of UBIF discussed in Christchurch differs in several aspects: 1 "id" is used only for object identifiers, whereas Sally uses this attribute name both for definition and reference. In UBIF, when referring to an id the "ref" attribute is used. TCS uses the same pattern as UBIF (SDD had to change from using "key" to "id" so that we are more compatible). The "ref" pattern makes it clearer that this is not an id of the element, but an object is referred to. The design is also used in xhtml (id/name and href) or xml schema itself. I believe we should follow this. 2 More importantly (and probably UBIF should change here?) UBIF always uses two id systems in parallel: One internal for cross-referencing within a document scope (id/ref) and one to Link to external objects. Under Links you can have general URL, LSID, or DOI. There is no equivalent to Sally's human-resolvable ID, i.e. something that provided you find the right web query interface, you can enter somewhere and resolve it. Please take a look at the top of diagram on ProxyDataPublication. "ObjectLink" is now called "Link", and Webservice linking methods removed by now. 3 There is no way to store a source-database-local id in current UBIF. The assumption is that if such a thing exists, it would be used for the document cross-referencing id - but this assertion can nowhere be made explicit or rejected! 4 On the other hand, UBIF proposes in parallel to the machine-readable links a Label (with text and abbreviations possibly in multiple language) that are intended for human consumption. In the LC world, that is where I believe the FullName belongs - and I actually propose to rename it to "Label".

I think Sally's example falls somewhere between in combining a local id with a text helping humans to guess about how to resolve this. On the other hand, it seems that such a text is not ideally suited to cross referencing names.

Who can help us to make the best of both worlds?

-- Main.GregorHagedorn - 03 Nov 2004

Rich commented by email:

NameRecordID: One of my (overly ambitious?) hopes for LC was to get away from project-specific ID numbers for taxon names (e. g., separate IPNI, ITIS, Species2000, GBIF, etc. ID numbers for the same name instance). I would be very happy if we could agree on a GUID system for LC name instances, perhaps administered by GBIF. That way, we could avoid the issues surrounding LUIDs altogeher, and just pass around one set of GUIDs. I realize this may be too much to ask for, but my main concern is, how can we cross-map the same funtional "name" instance across LC datasets provided by different nomenclators? In theory, LC should only include information records that are "objectively discernable" -- not really open to interpretation (except, perhaps, for interpretations on the relevant Code and how it should be applied to form the "correct" name in rare cases). So, my point is -- would it be impractical to restrict this to GUID only, after we identify a logical GUID provider?

Separate from the previous comment, would it be better to to allow the ID element to be Unbounded -- so multiple LUIDs and/or GUIDs can be provided, when more than one are known to be congruent (i. e., refer to the same name instance?)

Gregor: I basically agree with you, but I think if we want to be inclusive and access data sources, we can not require a single GUID system. I agree with you that a specific one (LSID?) should be recommended, however. Ultimately, the nice thing about LSID is that simply by prefixing the IPNI, ITIS, etc. ids with a namespace, authority, etc. of the LSID, the old system is turned into the new system. This allows the local software to remain unchanged and the LSID created or interpreted on the fly in the provider interface. - 16 Nov 2004

Just stepping back and thinking. In a way, one may want to distinguish between the following types of identifiers:

temporary identifiers valid only in the current xml document or stream context
- Examples: an ID/IDREF pair in an xml document, but also an SDD document where the cross-referencing identifiers are created on the fly
permanent local identifiers that will remain valid in the foreseeable future
- Example: IPNI-IDs (not yet searchable with web interface, but probably planned), SDD document, where multiple documents (perhaps different character subsets) will refer to the same character always by the same numeric ID
public identifiers that are human resolvable via a web form
- whether these are Local ID or Globally Unique ID (GUID) probably does not matter yet
- Examples are IndexFungorum IDs, Specimen collection IDs
- Is is relevant to inform whether an identifier is "local" or "public"? Seems to be closely related.
machine resolvable identifiers
- these are always GUID
- Examples are LSID, DOI, any kind or URL
- A distinction may be made about how permanent the link can be expected to be.

Currently, in the UBIF proxy model only the resolvable GUIDs are addressed in linking. Example (assuming IPNI would support LSIDs):

<Link><LSID>urn:lsid:lsid.gbif.net:IPNI:157927-1</LSID>
-- or (with alternatives) --
<Link>
  <LSID>urn:lsid:lsid.gbif.net:IPNI:157927-1</LSID>
  <URL>http://www.ipni.org/ipni/plantsearch?id=157927-1&query_type=by_id&output_format=object_view</URL>
</Link>

Markus Döring proposes to attempt to get rid of the Link layer and put the ids as attributes on the element defining the object. This does not allow multiple URLs as the current UBIF model does, but if they are required, a denormalization having url and alturl may be acceptable. This could look like:

<ScientificName lsid="urn:lsid:lsid.gbif.net:IPNI:157927-1"/>
-- or (with alternatives) --
<ScientificName lsid="urn:lsid:lsid.gbif.net:IPNI:157927-1" url="http://www.ipni.org/ipni/plantsearch?id=157927-1&query_type=by_id&output_format=object_view" alturl="http://www.ipni.org ..."/>

I am willing to follow a denormalized attribute model - there is little value in supporting a truly unlimited number of id and resolution services, but for a transition period we have to be flexible and support a few.

However, this still does not allow to express all the options listed above. Also, when cross-referencing within a document (having a list of literature/publications somewhere and referring to these, as currently done in SDD or TCS) it is unclear to which of the multiple attribute the ref value would refer. UBIF currently always introduces a secondary id for this purpose, but this seems undesirable in the longer run when people actually start using directly resolvable ids like lsids.

Thus, an alternative model could be to follow Sally's "id-type" model:

<ScientificName id="urn:lsid:lsid.gbif.net:IPNI:157927-1" idtype="lsid" id2="http://www.ipni.org ..." id2type="url" />
-- or (preferring to use a public/local id for cross-referencing within the xml document) --
<ScientificName id="238723" idtype="local" 
					 id2="urn:lsid:lsid.gbif.net:IPNI:157927-1" id2type="lsid" 
					 id3="http://www.ipni.org ..." id3type="url" />

Please give me some feedback on these loose ideas!

-- Main.GregorHagedorn - 16 Nov 2004

I would like to see all this in the light of the different use for object-ids and object-references here:

Object ID:
<object id=... idtype=... id2=... id2type=... />

Object Reference:
<object ref=... label=... />

An object could have several IDs whereas a reference to an object would only need 1. A single simple label string would also be nice for object references, but not for objects themselves which would probably need multiple languages, etc. Also the reference does not necessarily need an idtype, as it is given in the schema where the reference is pointing to. Just for references pointing beyond its local document it would be good to know the type. But as this has to be a global resolvable ID it might not be important, cause LSID and URL are self explanatory.

Remaining questions I see:

do we need multiple IDs for an object or/and reference and how could we deal with that?
- I guess for a reference it is enough to reference the "primary" id of an object
- multiple IDs would be good to have for objects so a client could chose between them (if he prefers LSIDs for example)
- to avoid new elements we could agree on a limited amount of possible ids, maybe 2 are already enough (local/global) e.g. id, id2 ?
list of identifier types being used. The types listed already are delimited by different attributes such as local/global, permanent/temporary, resolvable/unresolvable, URI/non-URI. But instead of capturing this I suggest to use a small definite list of idtypes that primarily distinguishes between temporary and permantent ids and global/local ones:
- temp (e.g. generated on the fly for the xml document only)
- luid (permanent but local; this would have to include IPNI etc.)
- guid (other than the special cases below, e.g. GUID strings)
- doi (resovable special case guid)
- url (resovable special case guid)
- lsid (resovable special case url listed to promote its use)
Does a reference need an idtype?

-- Main.MarkusDoering - 16 Nov 2004

I would vote for at least 3 alternatives, to allow giving both an official local id for cross-referencing, a conventional resolvable URI, and have space to migrate to a new GUID system like LSIDs. The reference needs no idtype I believe.

I really like Markus's idea of having a label attribute on the object reference side. In UBIF there is a label in the proxy, and a debugref as an option to create a human readable id-ref analogue. But the debugref depends on the ref-id relation. In contrast the label idea could be interpreted as making the ref itself optional. That could be a migration option for all those cases where people only have a single string in their databases and somehow justified recoil from the idea to have to add this string to a publication proxy object list, add an id to it, and then ref this id - at the place where in their database there is just a simple string type.

So a system may define

Object definition or Object Proxy has one required id/idtype, and two optional alternative id2/id2type/id3/id3type attributes. Semantics:
- The first id is used for document-internal cross-referencing. The data provider can choose which type is preferred for cross-referencing based on the convenience of implementation.
- If several of the three ids are of a resolvable type, the sequence indicate a usage-preference by the provider.
Object reference has one optional ref and an optional label.
- The implied syntax is that either the ref or the label must be present (however, this is not expressible in xml schema)
- What would be expressible is that the label is required - I am not sure how desirable this is - since the ref would point to a place inside the document, which should have sufficient human-readable semantics.
- Yet an alternative to have label attribute would be to use the string content of the element that contains the ref attribute. The advantage would be that even in cases where the alternative to "by-ref" is not a single string but 2-3 elements if would be possible to offer the local-or-by-reference pattern as alternatives. I am not sure whether this complicates the design too much, however.
- In TCS 0.8 the ref attribute has an annotation "Points to a top-level element via an ID reference; use gref to reference an external entity via a GUID." - however there is not gref attribute in TCS. But it gives me the idea that yet another method may have a choice of label alone, ref + optional label pointing within xml document, ref + required label pointing directly to external resolvable ID. Unfortunately, xml schema does not allow me to make these alternatives strict without introducing three different element names for each usage of this pattern which somewhat bloats the schema.

Regarding the idtype categories: Markus and I drew up the following possible ontology "diagram":

					ID
				/			temporary		  permanent
						 /		  						/			 				local ID			 global ID (GUID)
				/		\				/				 private	public	 resolvable  non-resolvable
								  /		\				/		\ 
								URI	  Other		private public	 
							  /  \		 | 
							URL  URN  Example: DOI
									|
							 Example: LSID
NOTES:
ID = comparison possible
temporary = comparison only in single document, two repeated queries are not comparable based on ID
permanent = IDs from repeated queries are comparable 
local = IDs are comparable only in usage/provider context
global = IDs are globally comparable
private = not queryable
public = queryable, = Sally's human-resolvable IDs

As Markus points out we do not want it to be that complicated... Here is a modified version of Markus's enumeration again:

temp: temporary, e. g. generated on the fly for the xml document only
local: permanent but local id (luid); this would include human-resolvable ids like IF, IPNI etc.
guid: global ids other than the special cases below, e.g. 128 bit GUID integers or universal resource names (urn) that do not offer resolution mechansims
url: resolvable standard internet address. In practice "permanent" only over moderate periods of time.
doi: resolvable "digital object identifiers" (not a urn)
lsid: life science IDs (resolvable special case urn that promises excellent permanency and resolution mechanisms

Note: I propose to replaced luid with local because I find luid rarely used (according to google) and if used strictly defined as a local 64 bit integer.

Question: Should we separate "urn"s other than lsid, rather than including it in the non-resolvable other guids? Is there perhaps a better name for this non-resolvable guid type? As the ontology shows, url, doi, lsid are all in principle a guid. I would prefer to highlight lsid as a type if TDWG and GBIF make this the recommendation. It should be clear that the flat enumeration is based on practical considerations, not on the ontology itself.

So this would mean (for Object insert any of "Publication", "TaxonConcept", ABCD "Unit", SDD "Character", etc.)

Object ID:

<Object id="urn:lsid:lsid.gbif.net:IPNI:157927-1" idtype="lsid" 
		  id2="http://www.ipni.org/ ..." id2type="url" />

-- or (preferring to use a public-local id for cross-referencing within the xml document) --

<Object idtype="local" id="238723" 
		  id2type="lsid" id2="urn:lsid:lsid.gbif.net:IPNI:157927-1" 
		  id3type="url"  id3="http://www.ipni.org/ipni/plantsearch?id=157927-1&query_type=by_id&output_format=object_view" />

and Object Reference:

<Object ref="238723" label="some human readable label text capturing 
 the semantic identity of the object" />

-- Gregor Hagedorn - 17 Nov 2004

A thought that we had here at Kew, which I think will clarify some of our concerns ... One thing I think is important is that anyone who has an existing set of 'permanent public local ids' (in Gregor's terms, Human Resolvable in my terms) has a good upgrade path from those to LSIDs. For instance, we have been giving out IPNI ids for a few years now, so there are a lot of datasets out there with IPNI ids and even version numbers in them. It should be a goal of any Global ID system that if, say, IPNI switches over to using LSIDs, then a simple batch process should be able to convert those legacy Human Resolvable Ids into LSIDs - for example by adding the string 'urn:lsid:lsid.gbif.net:IPNI:' onto the front of them. A simple point, but one that will prevent a lot of problems in the long run! -- Main.SallyHinchcliffe - 30 Nov 2004

As a separate, but related topic: Sally mentions a problem when some records are the result of a query, and others are provided only because they are required by refs in these records. "if the search request was for 'everything in Poa' and some of the related concepts weren't in Poa, then how was the requester to know which records were returned because they matched the request, and which were returned because they were referred to in the other records? I could imagine that if you had a very interconnected set of concepts, then it would always return the entire data set, whatever you asked it for!" -- In TCS v. 08 a xs:attribute name="primary" type="xs:boolean" use="optional" is intended for this. xs:annotation: "If primary='true' the concept is the first level response to a query. If 'false' the concept may be a secondary concept linked directly or indirectly to the definition of a primary concept." -- I think not delivering the required referenced records is not desirable. Should some attribute like primary (or "request=primary?") be part of a general design pattern for all kind of objects that is defined together with the object id design pattern? Or is this a protocol issue? -- Gregor Hagedorn - 17 Nov 2004
Again this is a question of goals. As a data server, my goal is to serve up the least amount of data required to satisfy a query. So if someone asks for 'everything in Poa' I don't want to have to serve up names from another genus just for completeness. I'd rather serve up the simple list of results along with some ids that allow those interested to get the complete data by following up the links. Whether these be LSIDs or whatever, I'm not that bothered. Bandwidth may be cheap, but getting a completely resolved set of data (e.g. everything in Poa, plus all of the basionyms, related records, child records, parent records, etc.) usually requires repeated hits of the database and when you've got thousands of queries a day that's _not_ cheap. -- Main.SallyHinchcliffe - 30 Nov 2004

I'd like to suggest that the suggested denormalised form for the id/idtype pairings may actually be harder to process than would be the case if we simply have a container with one or more subelements. In the latter case software to generate the XML needn't count ids to track how many have already been inserted, and software to process the XML can use simple XPath style expressions to extract elements of interest (whereas with the denormalised version three separate expressions would be needed). The arbitrary restriction on the number of ids could also be frustrating. I would recommend that we go with a container element for this. -- Main.DonaldHobern - 10 Dec 2004

I do prefer a collection as well (and as currently proposed in UBIF). The main reason for denormalization is probably shorter xml that is easier to debug. These social questions may be very important.

How would your proposed collection look? Object reference is not affected, but Object ID might look:

<Object id="urn:lsid:lsid.gbif.net:IPNI:157927-1" idtype="lsid">
  <Label>
	  ... probably still multilingual ...
  </Label>
  <Links>
	 <Link id="http://www.ipni.org/ ..." idtype="url" >
	 <Link id="238723" idtype="local">
  </Links>
  ...
</Object>

This includes the aspect that a link is always an alternative id (if a machine can resolve it unambiguously, it must have all properties of ids), but an id is not always a link. The sequence of id and the link objects also indicates a preference of usage by the provider. However, the provider may already have switched to an lsid system and prefer this to be used, but the client may be unable to do so. Multiple links allow coexistence and migration scenarios.

-- Gregor Hagedorn - 10 Dec 2004

This is exactly what I had in mind -- Main.DonaldHobern - 10 Dec 2004

Robert Kukla writes in email: "I think there are several aspects to this ID discussion:

1 Should a single (TaxonConcept) record have more than one ID? I think conceptually it should not. If there is a central repository that allows the retrieval of this (TC) record via the resolution of a GUID I am all in favour of using that GUID instead of a local ID. It could be argued that a local ID should be considered temporary or volatile (exists only for this document) and I am fine with that. Everything else, in my mind, is not a (TC) record ID. 2 Would it be useful to have a pointer to a web page that has the information from which the (TC) record has been derived?
Possibly. Reasons are:
- has better formatting of the data
- is without conversion artefacts
- has additional information
As there seems to be a requirement for it, we shall include it. This however is not an ID, but a reference. Potentially two or more concepts could point to the same web page. I would suggest a single container (only one such thing per (TC) record) e.g. <ProviderLink type="url">http://www.provider.com/concepthome.html<ProviderLink> 3 Would it be useful to have pointers to web pages that have additional information related to the (TC) record? IMO not. This information should be stored in seperate data sets pointing to (TC) records. If this is not possible for whatever reason the provider specific Element should be used.

-- Cheers, Robert" - 10 Dec 2004

Regarding Robert's "aspects":

1 I agree that each TCS object should have a single id. Other identifiers are references to the same data object under other schemes and it would be useful to keep these somehow separate, indicating that all others are alternate ids. 2 I agree that such pointers should be possible and I believe that they can be supported through any of the (normalised or denormalised) alternatives on this page. 3 I think we have to allow for this option as the only way in the near future to indicate the existence of much information which will not be transformed into a standardised structured form. There may be no provider-specific element. The associated data may take any of a wide range of forms and a fully open model is required until the day when we have managed to structure everything (i.e. probably forever).

Main.DonaldHobern - 10 Dec 2004

One thing needs perhaps to be clarified: If the id is one of "temp", "local", or "guid", then the only resolution mechanism is in Links, and if this is missing the object is only "human resolvable". If however, the id type itself is already url, doi or lsid, one resolution mechanism is through the id attribute in the object element itself, others are under Links.

I feel uncomfortable both about this choice, but the alternative would be to repeat the stuff under link.

And another second thought: "Link" is used in (x)html head both to cross-reference other relevant information and for alternative representations. Example:

  <link rel="stylesheet" type="text/css" href="markup.css" />
  <link rel="bookmark" href="#top" title="Page top" />
  <link rel="start" href="../" title="W3C Home Page" />
  <link rel="contents" href="#navbar" title="Navigation" />
  <link rel="appendix" href="Activity" title="Activity Statement" />
  <link rel="appendix" href="xhtml-roadmap/" title="Roadmap" />
  <link rel="appendix" href="2004/xhtml-faq" title="FAQ" />
  <link rel="help" href="../Help/siteindex" title="Site Index" />
  <link rel="glossary" href="../2001/12/Glossary" title="Glossary" />
  <link rel="copyright" href="#copyright" title="Copyright" />
  <link rel="alternate" type="text/html" title="HTML version" href=",html" />
  <link rel="alternate" type="application/xhtml+xml" title="XHTML version" href=",xhtml" />

So is the general term "Link" appropriate here? My current feeling is that in contrast to xhtml, for the biosciences domain we would rather like to distinguish between alternate id/resolution mechanisms (which are easily generalizable) and cross-reference kind of links, which are very specific to the exact subdomain (taxonomic names, specimens, descriptive data, etc.). So here only alternative representations are intended. We could follow the xhtml style and use both rel, type, href. Or perhaps the following pattern is more precise:

<Object id="urn:lsid:lsid.gbif.net:IPNI:157927-1" idtype="lsid">
  <Label>
	  ... probably still multilingual ...
  </Label>
  <Alternate>
	 <Link id="http://www.ipni.org/ ..." idtype="url" >
	 <Link id="238723" idtype="local"><!-- = "human resolvable" -->
  </Alternate>
  ...
</Object>

What do you think?

-- Gregor Hagedorn - 10/14. Dec 2004

One thing I keep stumbling on here is the fact that the urn mechanism actually provides for the same information that is in an idtype in that case. And I shudder when I see the same information in two places--too much chance of inconsistency. Secondly, urn's are a type of uri and possibly uri is in fact the notion being sought here. There is a great deal written about, and some standards for, the things supported in uri's, even though urn's evolved because uri's may have been too general. The recent approach to an integrated view of urn's and uri's is the series of RFCs leading up to RFC3404 Dynamic Delegation Discovery System (DDDS) Part Four: The Uniform Resource Identifiers (URI). DDDS is in fact the foundation of the only(?) resolution service for LSIDs and certainly the only kind mentioned explicitly in the LSID spec.

At the very least, probably the requirements here should be tested against those for uri's as well as urn's, if not for DDDS. Particularly worth study are RFC1737 Functional Requirements for Uniform Resource Names, RFC 2168 Resolution of Uniform Resource Identifiers using the Domain Name System (obsoleted by the DDDS RFCs) and RFC 2483 URI Resolution Services Necessary for URN Resolution, and the things they reference. In particular, RFC2483 lays out a set of operations and operands desired for resolution, along with their operands, error conditions, etc. RFC2915l The Naming Authority Pointer (NAPTR) DNS Resource Record is the specification of a specific part of RFC2168, also obsoleted by DDDS. In my opinion there is substantial motivation to remain close to the model of RFC2483, because there is a lot of existing infrastructure premised on it. The requirements in the RFCs obsoleted by DDDS remain worthy of mapping to those mentioned in this wiki topic.

-- Main.BobMorris - 11 Mar 2005

I think only few identifier types are in fact URIs, viz. URN (with LSID) and URL. How shall we deal with the other kind of identifiers? DOI may or may not be embedded in urns, as far as I remember the method to do so is contentious. How to providers that provider numeric guids inform what that is? How does a provider inform whether an integer number is created on the fly for cross-referencing purposes or is indeed stable? -- Gregor Hagedorn

Please see UriAndUrnAndUrl -- Main.BobMorris - 13 Mar 2005