GarryJolleyRogers - Wed Nov 25 2009 - Version 1.28
Parent topic: BDI.SDD_

ObsoleteTopicProxyDataModel

(This is part of the UBIF.SchemaDiscussion discussion. See also InterfaceDiscussion. The name for the section containing the Agent, Publication, etc. proxy objects in the TopLevelStructure is discussed under NameForProxyOrInterfaceSection.)


Fundamentals:

1 Ignore the problem and pretend that external objects can be represented by a simple human readable string (publication, person name, taxon name), disregarding the fact that this only allows humans to guess identity and prevents the integration of different datasets. Unions of such sets can be queried using fuzzy query operators, but the datasets cannot be joined, e.g. GenBank sequences cannot be joined with ABCD specimen and GLOPP host-parasite interaction data.
2 Realize the problem and provide an internal object representation that is specific to the application. Thus Specify is also a literature database, DarwinCore and ABCD have geography and taxon name models, and the TaxonConcept schema proposes to use the EndNote literature information model. These methods do not allow true references to external object providers.
3 Use a URL web address. This has two major problems:
   * It may break, and then not only is the link lost, but often also the semantics of the referencing object (a description claiming to describe "http://x.y.org/taxon?ID=234764" is worthless).
   * Although most objects can be provided, some are not available. No external taxonomic name service will ever be able to provide all names, including new taxon names so that they can be described.

To solve these problems I propose to consistently use a class of objects called proxy data. These primarily have a very simple representation of an object, composed of

Together with a developer Annotation and extensibility containers (Ext and CustomExtensions) this constitutes the base functionality of the ProxyData type.

In addition, each kind of proxy data class may contain extensions providing additional, more detailed data. To discuss both the base proxy elements and such extensions, I propose to discuss publications. These are external data for all currently active groups and are of similar importance to most biodiversity groups. It is also obvious that no external publication database will ever fully meet the needs of scientists who cite both difficult-to-find 200-year-old literature and publications that have just appeared.

So in > > ProxyDataPublication < < please discuss both the basics of Proxies and the specifics of reference models!


The proxy architecture is proposed as a general architecture for all biodiversity knowledge domains (see ObsoleteProxyListsInAllTdwgGbifStandards). The following is a graphical overview of the use of proxies in the current version of BDI.SDD_:

ProxyTypeOverview.png

I personally believe that proxy data objects are what the GBIF indexer should be built upon. A data provider that participates in GBIF could export all internal objects into a ProxyData object, mapping the local complex data model to the shared data model. This is similar to the flat representations that DiGIR uses, with the slight difference that they are not exclusively coupled to a query protocol, but can also be used to cache data locally. This would be ideal as a basis for the GBIF indexer.


Other proxy discussions (except ProxyDataPublication):

History and term for the concept: In version BDI.SDD_ 0.9 (Dec. 2003) ProxyTypes used to be called ResourceConnectors. This was considered not intuitive. At the Berlin 2004 meeting the term "proxy" (-data or -type) was accepted by all those present. The term "proxy" is borrowed from the widely used proxy programming pattern. It refers to the fact that the objects may either be proxies if no external object exists, or proxy caches if the asynchronous external object is temporarily unreachable (see also the earlier wiki discussion ResolvedTopicConnectorOrProxy). -- Gregor Hagedorn - 18 June 2004


Donald Hobern commented in email:

My comments: Moving things out is the general trend on the web, but I am very wary about this. I am very much in favor of federating data, collaborating and using IDs to link information. However, I believe data streams and documents should maintain a sense of continuity and preserve the semantics of links for human consumption. This is the essential model of printed books and libraries - and it is one of the foundations of science.

Imagine that printed scientific articles, instead of giving human readable references, referred to some GUID code that you could enter into a computer to obtain the information. This is the current state on the web. Unfortunately, we all know that institutions, even if well managed, change. Data citing specimens solely by case numbers used at a given time in a specific collection may become worthless after some restructuring. In our institution we find that the knowledge of coded references that was preserved over decades often disappears with the retirement of colleagues, and valuable data become trash.

So when talking about proxies I have two interfaces in mind:

Perhaps these could be separated. Would it make sense to have the base type (perhaps modified and simplified itself, e.g. using fewer types in Links), and leave the truly immature simple-data interface question out?

I am not sure whether I am ahead of the times (realizing that we need additional interfacing beyond simple guid/uri) or behind the times (because the web will become so stable that references are easily retrieved after 100 years) - both are quite possible.

-- Gregor Hagedorn - 29 July 2004


Markus Döring commented on 11 August: "Assume we always have a service which resolves IDs by appending them to the webservice URL, like http://www.bioservice.org/alpinevegetation/426781876 or http://www.bioservice.org/alpinevegetation/getObject?id=426781876. Couldn't we live with a simple proxy object defined just as this: <Taxon datasource="http://www.bioservice.org/alpinevegetation" id="426781876">Abies alba Mill.</Taxon>?"
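
As a minimal sketch of the mapping Markus describes, assuming the datasource attribute always holds a base URL to which the id can simply be appended (the function name resolve_url is hypothetical and not part of any schema discussed here):

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def resolve_url(proxy_xml: str) -> Tuple[str, str]:
    """Return (resolvable URL, human readable label) for a simple proxy element.

    Assumption: the service resolves an object at <datasource>/<id>.
    """
    elem = ET.fromstring(proxy_xml)
    base = elem.attrib["datasource"].rstrip("/")
    url = f"{base}/{elem.attrib['id']}"
    label = (elem.text or "").strip()
    return url, label


snippet = (
    '<Taxon datasource="http://www.bioservice.org/alpinevegetation" '
    'id="426781876">Abies alba Mill.</Taxon>'
)
url, label = resolve_url(snippet)
print(url)    # http://www.bioservice.org/alpinevegetation/426781876
print(label)  # Abies alba Mill.
```

The sketch only works if every data source follows the same URL convention, which is exactly the requirement questioned in the reply that follows.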

Gregor Hagedorn: I think the problem is that only a small number of services will be fully under our control (e.g. not publications like MedLine, geographical gazetteers, molecular sequence databases like GenBank). This makes it difficult to require such an automatic mapping. However, if I change your proposal to: <Taxon lsid="urn:lsid:www.bioservice.org:alpinevegetation:426781876" id="426781876">Abies alba Mill.</Taxon> this is already close to what the proxy base model tries to achieve: a locally referable id that defines identity even if no external service id is present, a link to the outside, and a human readable label.

Two more problems:

a) Most likely at least URL, LSID, and DOI will exist in parallel. That is the only reason why Link is a collection. I personally have no major problem in making it three attributes. It seems a bit artificial, but if that improves acceptance I will gladly endorse it! As you can see in the new UBIF versions (BDI.SDD_.CurrentSchemaVersion) the complex webservice proposal is underscored (starts with "_"), meaning it is tentative and should perhaps be removed.

b) Almost all object labels are potentially multilingual. Examples are geographical names, and even the full Agent label often needs transcriptions (Chinese to Roman letters) or contains place names to improve name uniqueness ("Hans Heinrich, Munich" or "Hans Heinrich, München"?). I believe this is a problem for GBIF, which is already viewed as a shop dominated by English-speaking countries. And Chinese is the most widely used language on earth... However, providing several languages is impossible with an attribute approach. Can this be hidden better than I do it now? The proposed model is simply the model used in BDI.SDD, which is multilingual throughout. For BDI.SDD_ it is easier to keep it as it is, because this responds to generic code.
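
To illustrate points a) and b), the following Python sketch models a proxy object with a local id, a collection of links keyed by type, and labels keyed by language. The class name ProxyData is taken from this discussion, but the field names, the dictionary layout, and the fallback rule are assumptions made for this illustration; they are not the UBIF schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ProxyData:
    """Hypothetical stand-in for the proxy base model (not the UBIF schema)."""
    id: str  # local id; defines identity even without any external link
    labels: Dict[str, str] = field(default_factory=dict)  # language code -> label
    links: Dict[str, str] = field(default_factory=dict)   # link type -> value

    def label(self, language: str, fallback: str = "en") -> Optional[str]:
        """Label in the requested language, else the fallback, else any label."""
        if language in self.labels:
            return self.labels[language]
        if fallback in self.labels:
            return self.labels[fallback]
        return next(iter(self.labels.values()), None)


# A taxon with two parallel links (URL and LSID) and a single label.
taxon = ProxyData(
    id="t1",
    labels={"la": "Abies alba Mill."},
    links={
        "url": "http://www.bioservice.org/alpinevegetation/426781876",
        "lsid": "urn:lsid:www.bioservice.org:alpinevegetation:426781876",
    },
)

# An agent with no external link at all, but labels in two languages.
agent = ProxyData(
    id="a1",
    labels={"en": "Hans Heinrich, Munich", "de": "Hans Heinrich, München"},
)

print(agent.label("de"))
print(sorted(taxon.links))
```

A single XML attribute could carry only one of the two agent labels, which is the limitation of the attribute approach noted in b).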


Markus Döring: Is it really required to reference another object using XML validation techniques? I could imagine the above simple proxy model being used in place somewhere inside the XML hierarchy, without referencing global proxy objects via XML. So something like:

<SDD>
 <Taxon datasource="local" id="123">
  <ScientificName>Abies alba Mill.</ScientificName>
  <HigherTaxon>Pinaceae</HigherTaxon>
  <Genus>Abies</Genus>
  <Synonyms> ... </Synonyms>
  ...
 </Taxon>

 <Taxon datasource="local" id="124">
 ...
 </Taxon>

 <Description datasource="local" id="567">
  <Taxon datasource="local" id="123">Abies alba Mill.</Taxon>
  <HumanDescription>...</HumanDescription>
  ...
 </Description>
</SDD>

I don't much like imposing the XML ID/IDREF constraints on users (the ids must be document-global numbers, which usually makes it more complicated to output a document, since the numbers have to be created on the fly rather than being able to use or hash existing ids). Some people think that the XML ID/IDREF mechanism should be considered deprecated. However, replace "id" with "ref" in <Taxon datasource="local" id="123">Abies alba Mill.</Taxon>, and it seems you are close to the proxy ref that UBIF is proposing - plus an optional copied Label inside the ProxyRef. A similar thing is actually proposed in the MicroAgent (although that is structured) and in the MicroMeasurementUnit (although again an extra element is present there, which could be removed). I see no problem in generalizing this and allowing, in any place where a proxy ref is expected, either

<Publication ref="123123" />  or
<Publication ref="123123">Smith &amp; Gordon 1999</Publication>
The more important distinction at the moment is that in the Micro... types the ref is optional, as in:
<Publication>Smith &amp; Gordon 1999</Publication>
I am unclear whether it is desirable to allow this, but it may be possible. If this is the accepted migration path to a truly linked system, that is fine. Perhaps one could optionally also allow:
<Publication language="en">Smith &amp; Gordon 1999</Publication>
to simplify later migration to full proxy objects.
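
The four variants above can be read by one small routine. The following Python sketch assumes that a proxy ref is a single element with an optional ref attribute, an optional language attribute, and an optional text label; the names ProxyRef and parse_proxy_ref are hypothetical. Note that the ampersand in the label is written as &amp; so that an XML parser accepts the fragment.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class ProxyRef(NamedTuple):
    """Hypothetical in-memory form of a proxy reference element."""
    ref: Optional[str]       # id of the referenced proxy object, if any
    label: Optional[str]     # copied human readable label, if any
    language: Optional[str]  # language of the label, if stated


def parse_proxy_ref(xml_text: str) -> ProxyRef:
    """Parse one proxy reference element; every part is optional."""
    elem = ET.fromstring(xml_text)
    label = (elem.text or "").strip() or None
    return ProxyRef(elem.get("ref"), label, elem.get("language"))


examples = [
    '<Publication ref="123123" />',
    '<Publication ref="123123">Smith &amp; Gordon 1999</Publication>',
    '<Publication>Smith &amp; Gordon 1999</Publication>',
    '<Publication language="en">Smith &amp; Gordon 1999</Publication>',
]
refs = [parse_proxy_ref(x) for x in examples]
for r in refs:
    print(r)
```

A consumer can treat entries with ref set to None as unlinked labels and resolve them to full proxy objects later, which is the migration path suggested above.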

-- Gregor Hagedorn - 11 August 2004


Note: In the change from UBIF 1.0 beta 14 to beta 15 I have now removed the Webservice-Linking mechanism, see ProxyLinkByWebservice. -- Gregor Hagedorn - 13 August 2004