ObjectIdentifierPattern

GregorHagedorn - Mon Mar 14 2005 - Version 1.19
Note March 2005: Although the proposed solutions may not be ideal, there have been no comments on this since December 2004. I therefore plan to summarize the discussion and present conclusions soon. Since I cannot do that today, please feel free to add comments - I will incorporate them when summing up. Please also see ObjectTypePattern.


Probably any kind of object exchange (through TCS, SDD, ABCD, perhaps LinneanCore (LC)) has the following requirements for identifiers:

Part of the following discussion is still centered around LC or TCS, but really I think it is a very general question and e.g. SDD objects (characters, descriptions, as well as ClassName or Unit proxies) should follow the same pattern.

The discussion is already a bit long; if you are in a hurry, you could try to pick it up at the id-type diagram.


When revising LC, Sally proposed a general "RecordId" type that is used both to define the identity of an LC object and to link to other objects. Sally's annotation: "The locally or globally unique Identifier for this name record. Could be one of three kinds of ID: GUID (persistent, machine resolvable such as an LSID), human resolvable Global ID (e.g. 'IPNI citation ID 14098-1') or locally (within this dataset) resolved ID (e.g. '1' or 'K001928712') - probably belongs somewhere else if it looks useful and could end up being widely used". This type has two attributes: "type" and "id". "id" is xs:string. In Sally's proposal the attribute "type" is an enumeration of "Human Resolvable", "Locally Resolvable", "Globally Resolvable". Example: <NameRecordId type="Human Resolvable" id="IpniCitation 1003545-1 version 1.1"/>

The design of UBIF discussed in Christchurch differs in several aspects:

1 "id" is used only for object identifiers, whereas Sally uses this attribute name both for definition and reference. In UBIF, the "ref" attribute is used when referring to an id. TCS uses the same pattern as UBIF (SDD had to change from using "key" to "id" so that we are more compatible). The "ref" pattern makes it clearer that this is not the id of the element itself, but that another object is referred to. The same design is used in xhtml (id/name and href) and in xml schema itself. I believe we should follow this.

2 More importantly (and probably UBIF should change here?) UBIF always uses two id systems in parallel: one internal for cross-referencing within a document scope (id/ref) and one to link to external objects. Under Links you can have a general URL, LSID, or DOI. There is no equivalent to Sally's human-resolvable ID, i.e. something that you can enter somewhere and resolve, provided you find the right web query interface. Please take a look at the top of the diagram on ProxyDataPublication. "ObjectLink" is now called "Link", and the Webservice linking methods have been removed by now.

3 There is no way to store a source-database-local id in current UBIF. The assumption is that if such a thing exists, it would be used for the document cross-referencing id - but this assumption cannot be made explicit or rejected anywhere!

4 On the other hand, UBIF proposes, in parallel to the machine-readable links, a Label (with text and abbreviations, possibly in multiple languages) that is intended for human consumption. In the LC world, that is where I believe the FullName belongs - and I actually propose to rename it to "Label".

I think Sally's example falls somewhere in between, combining a local id with text that helps humans guess how to resolve it. On the other hand, it seems that such a text is not ideally suited to cross-referencing names.

Who can help us to make the best of both worlds?

-- Main.GregorHagedorn - 03 Nov 2004


Rich commented by email:

NameRecordID: One of my (overly ambitious?) hopes for LC was to get away from project-specific ID numbers for taxon names (e. g., separate IPNI, ITIS, Species2000, GBIF, etc. ID numbers for the same name instance). I would be very happy if we could agree on a GUID system for LC name instances, perhaps administered by GBIF. That way, we could avoid the issues surrounding LUIDs altogether, and just pass around one set of GUIDs. I realize this may be too much to ask for, but my main concern is, how can we cross-map the same functional "name" instance across LC datasets provided by different nomenclators? In theory, LC should only include information records that are "objectively discernible" -- not really open to interpretation (except, perhaps, for interpretations on the relevant Code and how it should be applied to form the "correct" name in rare cases). So, my point is -- would it be impractical to restrict this to GUID only, after we identify a logical GUID provider?

Separate from the previous comment, would it be better to allow the ID element to be unbounded -- so multiple LUIDs and/or GUIDs can be provided, when more than one are known to be congruent (i. e., refer to the same name instance)?

Gregor: I basically agree with you, but I think if we want to be inclusive and access data sources, we can not require a single GUID system. I agree with you that a specific one (LSID?) should be recommended, however. Ultimately, the nice thing about LSID is that simply by prefixing the IPNI, ITIS, etc. ids with a namespace, authority, etc. of the LSID, the old system is turned into the new system. This allows the local software to remain unchanged and the LSID created or interpreted on the fly in the provider interface. - 16 Nov 2004


Just stepping back and thinking. In a way, one may want to distinguish between the following types of identifiers:

Currently, in the UBIF proxy model only the resolvable GUIDs are addressed in linking. Example (assuming IPNI would support LSIDs):

<Link><LSID>urn:lsid:lsid.gbif.net:IPNI:157927-1</LSID></Link>
-- or (with alternatives) --
<Link>
  <LSID>urn:lsid:lsid.gbif.net:IPNI:157927-1</LSID>
  <URL>http://www.ipni.org/ipni/plantsearch?id=157927-1&amp;query_type=by_id&amp;output_format=object_view</URL>
</Link>
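
A minimal sketch of how a consumer might read the alternatives out of such a Link element, assuming Python with its standard xml.etree.ElementTree module. The function name link_alternatives is hypothetical and not part of UBIF; the fragment is copied from the example above, with "&" escaped so that it parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the Link example above ("&" written as "&amp;").
FRAGMENT = """
<Link>
  <LSID>urn:lsid:lsid.gbif.net:IPNI:157927-1</LSID>
  <URL>http://www.ipni.org/ipni/plantsearch?id=157927-1&amp;query_type=by_id&amp;output_format=object_view</URL>
</Link>
"""


def link_alternatives(link_xml):
    """Return (kind, value) pairs for every child of a Link, in document order."""
    link = ET.fromstring(link_xml)
    return [(child.tag, (child.text or "").strip()) for child in link]


alternatives = link_alternatives(FRAGMENT)
for kind, value in alternatives:
    print(kind, value)
```

The document order is kept on purpose, so that a provider could express a preference simply by listing the preferred mechanism first.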

Markus Döring proposes to attempt to get rid of the Link layer and put the ids as attributes on the element defining the object. This does not allow multiple URLs as the current UBIF model does, but if they are required, a denormalization having url and alturl may be acceptable. This could look like:

<ScientificName lsid="urn:lsid:lsid.gbif.net:IPNI:157927-1"/>
-- or (with alternatives) --
<ScientificName lsid="urn:lsid:lsid.gbif.net:IPNI:157927-1" url="http://www.ipni.org/ipni/plantsearch?id=157927-1&amp;query_type=by_id&amp;output_format=object_view" alturl="http://www.ipni.org ..."/>

I am willing to follow a denormalized attribute model - there is little value in supporting a truly unlimited number of id and resolution services, but for a transition period we have to be flexible and support a few.

However, this still does not allow all the options listed above to be expressed. Also, when cross-referencing within a document (having a list of literature/publications somewhere and referring to these, as currently done in SDD or TCS), it is unclear to which of the multiple attributes the ref value would refer. UBIF currently always introduces a secondary id for this purpose, but this seems undesirable in the longer run when people actually start using directly resolvable ids like lsids.

Thus, an alternative model could be to follow Sally's "id-type" model:

<ScientificName id="urn:lsid:lsid.gbif.net:IPNI:157927-1" idtype="lsid" id2="http://www.ipni.org ..." id2type="url" />
-- or (preferring to use a public/local id for cross-referencing within the xml document) --
<ScientificName id="238723" idtype="local"
                id2="urn:lsid:lsid.gbif.net:IPNI:157927-1" id2type="lsid"
                id3="http://www.ipni.org ..." id3type="url" />
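
To make the consequences of the numbered-attribute form concrete, here is a rough sketch (Python, standard library only) of what a consumer would have to do to collect all id/idtype pairs. The helper name collect_ids is hypothetical; the attribute names are those of the example above, and the elided URL is left exactly as written there.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<ScientificName id="238723" idtype="local"
                id2="urn:lsid:lsid.gbif.net:IPNI:157927-1" id2type="lsid"
                id3="http://www.ipni.org ..." id3type="url" />
"""


def collect_ids(element):
    """Return [(idtype, id), ...] by probing id/idtype, id2/id2type, id3/id3type, ..."""
    pairs = []
    n = 1
    while True:
        suffix = "" if n == 1 else str(n)   # the first pair carries no number
        value = element.get("id" + suffix)
        if value is None:                   # stop at the first missing idN
            break
        pairs.append((element.get("id" + suffix + "type"), value))
        n += 1
    return pairs


ids = collect_ids(ET.fromstring(FRAGMENT))
print(ids)
```

The loop has to count attribute names by hand, and a schema would have to fix an upper limit for the numbering in advance.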

Please give me some feedback on these loose ideas!

-- Main.GregorHagedorn - 16 Nov 2004


I would like to see all this in the light of the different use for object-ids and object-references here:

Object ID:
<object id=... idtype=... id2=... id2type=... />

Object Reference:
<object ref=... label=... />

An object could have several IDs whereas a reference to an object would only need one. A single simple label string would also be nice for object references, but not for objects themselves, which would probably need multiple languages, etc. Also, the reference does not necessarily need an idtype, as it is given in the schema where the reference is pointing to. Only for references pointing beyond the local document would it be good to know the type. But as this has to be a globally resolvable ID it might not be important, because LSID and URL are self-explanatory.

Remaining questions I see:

-- Main.MarkusDoering - 16 Nov 2004


I would vote for at least 3 alternatives, to allow giving both an official local id for cross-referencing, a conventional resolvable URI, and have space to migrate to a new GUID system like LSIDs. The reference needs no idtype I believe.

I really like Markus's idea of having a label attribute on the object reference side. In UBIF there is a label in the proxy, and a debugref as an option to create a human-readable analogue of the id-ref. But the debugref depends on the ref-id relation. In contrast, the label idea could be interpreted as making the ref itself optional. That could be a migration option for all those cases where people only have a single string in their databases and, with some justification, recoil from the idea of having to add this string to a publication proxy object list, add an id to it, and then ref this id - at the place where in their database there is just a simple string type.

So a system may define

Regarding the idtype categories: Markus and I drew up the following possible ontology "diagram":

                         ID
                       /    \
              temporary      permanent
                            /         \
                    local ID           global ID (GUID)
                    /      \            /           \
              private    public   resolvable    non-resolvable
                                   /     \         /      \
                                 URI    Other   private  public
                                /   \     |
                              URL   URN   Example: DOI
                                     |
                                Example: LSID
NOTES:
ID = comparison possible
temporary = comparison only in single document, two repeated queries are not comparable based on ID
permanent = IDs from repeated queries are comparable 
local = IDs are comparable only in usage/provider context
global = IDs are globally comparable
private = not queryable
public = queryable, = Sally's human-resolvable IDs

As Markus points out we do not want it to be that complicated... Here is a modified version of Markus's enumeration again:

Note: I propose to replace luid with local because I find luid rarely used (according to google) and, where it is used, strictly defined as a local 64 bit integer.

Question: Should we separate "urn"s other than lsid, rather than including them in the non-resolvable other guids? Is there perhaps a better name for this non-resolvable guid type? As the ontology shows, url, doi, lsid are all in principle guids. I would prefer to highlight lsid as a type if TDWG and GBIF make this the recommendation. It should be clear that the flat enumeration is based on practical considerations, not on the ontology itself.


So this would mean (for Object insert any of "Publication", "TaxonConcept", ABCD "Unit", SDD "Character", etc.)

Object ID:

<Object id="urn:lsid:lsid.gbif.net:IPNI:157927-1" idtype="lsid"
        id2="http://www.ipni.org/ ..." id2type="url" />

-- or (preferring to use a public-local id for cross-referencing within the xml document) --

<Object idtype="local" id="238723"
        id2type="lsid" id2="urn:lsid:lsid.gbif.net:IPNI:157927-1"
        id3type="url"  id3="http://www.ipni.org/ipni/plantsearch?id=157927-1&amp;query_type=by_id&amp;output_format=object_view" />

and Object Reference:

<Object ref="238723" label="some human readable label text capturing 
 the semantic identity of the object" />
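
As an illustration of the id/ref relation within one document, the following sketch (Python, standard library only) resolves each ref to the element that defines the matching id. The wrapper elements Dataset, Objects and Usage and the function name resolve_refs are assumptions made for the example; only the id, idtype, ref and label attributes come from the proposal above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one object definition and one reference to it.
DOCUMENT = """
<Dataset>
  <Objects>
    <Object idtype="local" id="238723"
            id2type="lsid" id2="urn:lsid:lsid.gbif.net:IPNI:157927-1" />
  </Objects>
  <Usage>
    <Object ref="238723" label="some human readable label text" />
  </Usage>
</Dataset>
"""


def resolve_refs(root):
    """Pair every element carrying a ref with the element defining that id (or None)."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    return [
        (el, by_id.get(el.get("ref")))
        for el in root.iter()
        if el.get("ref") is not None
    ]


resolved = resolve_refs(ET.fromstring(DOCUMENT))
for reference, target in resolved:
    status = "resolved" if target is not None else "label only"
    print(reference.get("label"), "->", status)
```

Only the primary id attribute is indexed here, which matches the idea that the ref always points at the first id and that the reference itself needs no idtype. A reference whose ref cannot be resolved still carries its label.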

-- Gregor Hagedorn - 17 Nov 2004


A thought that we had here at Kew, which I think will clarify some of our concerns ... One thing I think is important is that anyone who has an existing set of 'permanent public local ids' (in Gregor's terms, Human Resolvable in my terms) has a good upgrade path from those to LSIDs. For instance, we have been giving out IPNI ids for a few years now, so there are a lot of datasets out there with IPNI ids and even version numbers in them. It should be a goal of any Global ID system that if, say, IPNI switches over to using LSIDs, then a simple batch process should be able to convert those legacy Human Resolvable Ids into LSIDs - for example by adding the string 'urn:lsid:lsid.gbif.net:IPNI:' onto the front of them. A simple point, but one that will prevent a lot of problems in the long run! -- Main.SallyHinchcliffe - 30 Nov 2004
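
The batch conversion described above could be as small as the following Python sketch. The prefix is the one quoted in the comment; the function names to_lsid and to_legacy are hypothetical, and whether IPNI would actually use this authority and namespace is an assumption carried over from the discussion.

```python
# Prefix quoted in the comment above; an assumption, not an established IPNI LSID scheme.
LSID_PREFIX = "urn:lsid:lsid.gbif.net:IPNI:"


def to_lsid(legacy_id, prefix=LSID_PREFIX):
    """Turn a legacy 'human resolvable' id into an LSID by prefixing it."""
    legacy_id = legacy_id.strip()
    if legacy_id.lower().startswith("urn:lsid:"):
        return legacy_id                      # already an LSID, leave untouched
    return prefix + legacy_id


def to_legacy(lsid, prefix=LSID_PREFIX):
    """Strip the prefix again, so local software can keep using the old ids."""
    if not lsid.startswith(prefix):
        raise ValueError("not an LSID of this authority/namespace: " + lsid)
    return lsid[len(prefix):]


legacy_ids = ["157927-1", "1003545-1"]
lsids = [to_lsid(i) for i in legacy_ids]
print(lsids)
```

Because the conversion is a pure prefix operation, it is reversible, and a provider interface could create or interpret the LSIDs on the fly while the local database keeps its old ids.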


I'd like to suggest that the suggested denormalised form for the id/idtype pairings may actually be harder to process than would be the case if we simply have a container with one or more subelements. In the latter case software to generate the XML needn't count ids to track how many have already been inserted, and software to process the XML can use simple XPath style expressions to extract elements of interest (whereas with the denormalised version three separate expressions would be needed). The arbitrary restriction on the number of ids could also be frustrating. I would recommend that we go with a container element for this. -- Main.DonaldHobern - 10 Dec 2004


I do prefer a collection as well (and as currently proposed in UBIF). The main reason for denormalization is probably shorter xml that is easier to debug. These social questions may be very important.

How would your proposed collection look? Object reference is not affected, but Object ID might look like this:

<Object id="urn:lsid:lsid.gbif.net:IPNI:157927-1" idtype="lsid">
  <Label>
    ... probably still multilingual ...
  </Label>
  <Links>
    <Link id="http://www.ipni.org/ ..." idtype="url" />
    <Link id="238723" idtype="local" />
  </Links>
  ...
</Object>

This includes the aspect that a link is always an alternative id (if a machine can resolve it unambiguously, it must have all properties of ids), but an id is not always a link. The sequence of id and the link objects also indicates a preference of usage by the provider. However, the provider may already have switched to an lsid system and prefer this to be used, but the client may be unable to do so. Multiple links allow coexistence and migration scenarios.
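
A sketch of how the container form could be queried, assuming Python's xml.etree.ElementTree and its limited XPath support. The function names ids_of_type and ids_in_preference_order are hypothetical; the fragment follows the example above, with the Link elements closed and the Label omitted for brevity.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<Object id="urn:lsid:lsid.gbif.net:IPNI:157927-1" idtype="lsid">
  <Links>
    <Link id="http://www.ipni.org/ ..." idtype="url" />
    <Link id="238723" idtype="local" />
  </Links>
</Object>
"""


def ids_of_type(obj, idtype):
    """Return all ids of one type: the primary id first, then matching Links."""
    found = []
    if obj.get("idtype") == idtype:
        found.append(obj.get("id"))
    # One path expression covers any number of alternative ids.
    path = "./Links/Link[@idtype='%s']" % idtype
    found.extend(link.get("id") for link in obj.findall(path))
    return found


def ids_in_preference_order(obj):
    """Primary id followed by the Links in document order (the provider's preference)."""
    pairs = [(obj.get("idtype"), obj.get("id"))]
    pairs.extend((l.get("idtype"), l.get("id")) for l in obj.findall("./Links/Link"))
    return pairs


obj = ET.fromstring(FRAGMENT)
print(ids_of_type(obj, "url"))
print(ids_in_preference_order(obj))
```

A client that cannot yet handle lsids can walk the preference list and pick the first type it understands, which is the coexistence scenario described above.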

-- Gregor Hagedorn - 10 Dec 2004


This is exactly what I had in mind -- Main.DonaldHobern - 10 Dec 2004


Robert Kukla writes in email: "I think there are several aspects to this ID discussion:

1 Should a single (TaxonConcept) record have more than one ID? I think conceptually it should not. If there is a central repository that allows the retrieval of this (TC) record via the resolution of a GUID I am all in favour of using that GUID instead of a local ID. It could be argued that a local ID should be considered temporary or volatile (exists only for this document) and I am fine with that. Everything else, in my mind, is not a (TC) record ID.

2 Would it be useful to have a pointer to a web page that has the information from which the (TC) record has been derived? Possibly. Reasons are:
- has better formatting of the data
- is without conversion artefacts
- has additional information
As there seems to be a requirement for it, we shall include it. This however is not an ID, but a reference. Potentially two or more concepts could point to the same web page. I would suggest a single container (only one such thing per (TC) record) e.g. <ProviderLink type="url">http://www.provider.com/concepthome.html</ProviderLink>

3 Would it be useful to have pointers to web pages that have additional information related to the (TC) record? IMO not. This information should be stored in separate data sets pointing to (TC) records. If this is not possible for whatever reason the provider specific Element should be used.

-- Cheers, Robert" - 10 Dec 2004


Regarding Robert's "aspects":

1 I agree that each TCS object should have a single id. Other identifiers are references to the same data object under other schemes and it would be useful to keep these somehow separate, indicating that all others are alternate ids.

2 I agree that such pointers should be possible and I believe that they can be supported through any of the (normalised or denormalised) alternatives on this page.

3 I think we have to allow for this option as the only way in the near future to indicate the existence of much information which will not be transformed into a standardised structured form. There may be no provider-specific element. The associated data may take any of a wide range of forms and a fully open model is required until the day when we have managed to structure everything (i.e. probably forever).

Main.DonaldHobern - 10 Dec 2004


One thing needs perhaps to be clarified: If the id is one of "temp", "local", or "guid", then the only resolution mechanism is in Links, and if this is missing the object is only "human resolvable". If however, the id type itself is already url, doi or lsid, one resolution mechanism is through the id attribute in the object element itself, others are under Links.
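
The rule above can be written down as a small Python sketch. The names SELF_RESOLVING, resolution_candidates and human_resolvable_only are hypothetical, and the decision to count only Links of type url, doi or lsid as machine-resolvable is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Id types that can be resolved by themselves, as named in the paragraph above.
SELF_RESOLVING = {"url", "doi", "lsid"}


def resolution_candidates(obj):
    """Return (idtype, id) pairs a machine could try to resolve, primary id first."""
    candidates = []
    if obj.get("idtype") in SELF_RESOLVING:
        candidates.append((obj.get("idtype"), obj.get("id")))
    for link in obj.findall("./Links/Link"):
        # Assumption: Links of type temp/local/guid are not machine-resolvable.
        if link.get("idtype") in SELF_RESOLVING:
            candidates.append((link.get("idtype"), link.get("id")))
    return candidates


def human_resolvable_only(obj):
    """True if neither the id nor any Link offers a resolution mechanism."""
    return not resolution_candidates(obj)


local_only = ET.fromstring('<Object id="238723" idtype="local" />')
with_links = ET.fromstring(
    '<Object id="238723" idtype="local">'
    '<Links><Link id="urn:lsid:lsid.gbif.net:IPNI:157927-1" idtype="lsid" /></Links>'
    '</Object>'
)
lsid_object = ET.fromstring(
    '<Object id="urn:lsid:lsid.gbif.net:IPNI:157927-1" idtype="lsid" />'
)
print(human_resolvable_only(local_only), resolution_candidates(with_links))
```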

I feel uncomfortable about this choice, but the alternative would be to repeat the stuff under link.

And another second thought: "Link" is used in (x)html head both to cross-reference other relevant information and for alternative representations. Example:

  <link rel="stylesheet" type="text/css" href="markup.css" />
  <link rel="bookmark" href="#top" title="Page top" />
  <link rel="start" href="../" title="W3C Home Page" />
  <link rel="contents" href="#navbar" title="Navigation" />
  <link rel="appendix" href="Activity" title="Activity Statement" />
  <link rel="appendix" href="xhtml-roadmap/" title="Roadmap" />
  <link rel="appendix" href="2004/xhtml-faq" title="FAQ" />
  <link rel="help" href="../Help/siteindex" title="Site Index" />
  <link rel="glossary" href="../2001/12/Glossary" title="Glossary" />
  <link rel="copyright" href="#copyright" title="Copyright" />
  <link rel="alternate" type="text/html" title="HTML version" href=",html" />
  <link rel="alternate" type="application/xhtml+xml" title="XHTML version" href=",xhtml" />

So is the general term "Link" appropriate here? My current feeling is that in contrast to xhtml, for the biosciences domain we would rather like to distinguish between alternate id/resolution mechanisms (which are easily generalizable) and cross-reference kind of links, which are very specific to the exact subdomain (taxonomic names, specimens, descriptive data, etc.). So here only alternative representations are intended. We could follow the xhtml style and use both rel, type, href. Or perhaps the following pattern is more precise:

<Object id="urn:lsid:lsid.gbif.net:IPNI:157927-1" idtype="lsid">
  <Label>
    ... probably still multilingual ...
  </Label>
  <Alternate>
    <Link id="http://www.ipni.org/ ..." idtype="url" />
    <Link id="238723" idtype="local" /><!-- = "human resolvable" -->
  </Alternate>
  ...
</Object>

What do you think?

-- Gregor Hagedorn - 10/14. Dec 2004


One thing I keep stumbling on here is the fact that the urn mechanism actually provides for the same information that is in an idtype in that case. And I shudder when I see the same information in two places--too much chance of inconsistency. Secondly, urn's are a type of uri and possibly uri is in fact the notion being sought here. There is a great deal written about, and some standards for, the things supported in uri's, even though urn's evolved because uri's may have been too general. The recent approach to an integrated view of urn's and uri's is the series of RFCs leading up to RFC3404 Dynamic Delegation Discovery System (DDDS) Part Four: The Uniform Resource Identifiers (URI) Resolution Application. DDDS is in fact the foundation of the only(?) resolution service for LSIDs and certainly the only kind mentioned explicitly in the LSID spec.
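
The inconsistency risk can be illustrated with a short Python sketch that infers a type from the identifier string itself and compares it with a declared idtype. The function names and the extra type name "urn" are hypothetical, and the prefix tests are a simplification rather than a full URI parser.

```python
def infer_idtype(identifier):
    """Guess the id type from the identifier's own syntax, or None if it says nothing."""
    lowered = identifier.strip().lower()
    if lowered.startswith("urn:lsid:"):
        return "lsid"
    if lowered.startswith("urn:"):
        return "urn"          # hypothetical type name for urns other than lsid
    if lowered.startswith(("http://", "https://", "ftp://")):
        return "url"
    return None               # e.g. local ids or bare numbers carry no type information


def is_consistent(identifier, declared_idtype):
    """True unless the identifier's syntax contradicts the declared idtype."""
    inferred = infer_idtype(identifier)
    return inferred is None or inferred == declared_idtype


print(is_consistent("urn:lsid:lsid.gbif.net:IPNI:157927-1", "lsid"))
print(is_consistent("http://www.ipni.org/", "lsid"))
```

For uri-style identifiers the idtype is redundant and can only be checked, never trusted blindly. For identifiers without a scheme, such as a local integer, the declared idtype is the only source of this information.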

At the very least, probably the requirements here should be tested against those for uri's as well as urn's, if not for DDDS. Particularly worth study are RFC1737 Functional Requirements for Uniform Resource Names, RFC2168 Resolution of Uniform Resource Identifiers using the Domain Name System (obsoleted by the DDDS RFCs) and RFC2483 URI Resolution Services Necessary for URN Resolution, and the things they reference. In particular, RFC2483 lays out a set of operations desired for resolution, along with their operands, error conditions, etc. RFC2915 The Naming Authority Pointer (NAPTR) DNS Resource Record is the specification of a specific part of RFC2168, also obsoleted by DDDS. In my opinion there is substantial motivation to remain close to the model of RFC2483, because there is a lot of existing infrastructure premised on it. The requirements in the RFCs obsoleted by DDDS remain worthy of mapping to those mentioned in this wiki topic.

-- Main.BobMorris - 11 Mar 2005

I think only a few identifier types are in fact URIs, viz. URN (with LSID) and URL. How shall we deal with the other kinds of identifiers? DOI may or may not be embedded in urns; as far as I remember the method to do so is contentious. How do providers that provide numeric guids indicate what these are? How does a provider indicate whether an integer number is created on the fly for cross-referencing purposes or is indeed stable? -- Gregor Hagedorn

Please see UriAndUrnAndUrl -- Main.BobMorris - 13 Mar 2005