GarryJolleyRogers - Wed Nov 25 2009 - Version 1.19
Parent topic: ImplementationsOfBDI.SDD
David Remsen (dremsen-at-mbl.edu) from uBio (Universal Biological Indexer and Organizer) is leading the development of a new xml-based internet identification key. In this topic, he and his principal programmer Patrick will discuss questions about or problems with SDD. See http://uio.mbl.edu/services/key.html in general, and http://uio.mbl.edu/SDD/player.php for ongoing development of making it SDD compatible. -- Main.GregorHagedorn - 07 Jun 2004
Note Gregor: We are currently rewriting the generalities of the Lucid LIF to SDD conversion in a new topic: ConvertingLIF2BDI.SDD!
A version of the X:ID Player is available online using XML in the structure of the XML Schema as defined by SDD. Gregor Hagedorn supplied us with a current version of the SDD schema, and we have X:ID running off of this structure. This can be found at http://uio.mbl.edu/SDD/player.php . This version of the X:ID Player only runs the key to Atlantic Tunas transformed by our default stylesheet. Clicking the "XML" option in the functions bar allows you to view the SDD-compliant XML and the source code of the default stylesheet used to transform the XML.
I have attached an example of the XML (XIDwithBDI.SDD.xml) that is now created by the X:ID Player. It is meant to comply with the latest version of SDD as supplied to us by Gregor. I was still unclear as to the structure and content of the "
[..Main Data (txs)..]
6100000100001100100010000011111
6010000010001010010010000101110
6001000100001001010010000131110
6001000100010100001001000101100
6001000001001010100000101000000
6000100100001010000100100101000
6000010100001010001000310101100
6000001000101010000100100101000
This matrix scores taxa by rows and states by columns. So, it seems you would like, for each taxon, the values of each state in their respective character. If this is not how you intended the
I just attempted to load a rather large LIF (1070 states X 540 taxa) into the SDD version of X:ID with the above declaration of the matrix information. The XML file that it created was 4.5 megabytes and over 185,000 lines long, and this was only after 30 seconds when the process timed out. Only the scores of 54 of 540 taxa were displayed in this manner, meaning the XML file, if complete, would be more than 9 megabytes and would take more than 60 seconds to create.
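As a rough cross-check of these figures, the following Python sketch extrapolates the complete output. It assumes (this is an assumption, not a measurement) that file size and processing time grow linearly with the number of taxa written; the function name is made up for illustration.

```python
# Rough extrapolation of the full SDD output for the large LIF key.
# Assumption (not measured): size and time scale linearly with taxa written.

STATES = 1070
TAXA = 540
TAXA_WRITTEN = 54          # taxa completed before the timeout
PARTIAL_SIZE_MB = 4.5      # size of the partial XML file
PARTIAL_TIME_S = 30        # seconds elapsed when the process timed out


def extrapolate(partial: float, done: int, total: int) -> float:
    """Scale a partial measurement linearly up to the full data set."""
    return partial * total / done


matrix_cells = STATES * TAXA
full_size_mb = extrapolate(PARTIAL_SIZE_MB, TAXA_WRITTEN, TAXA)
full_time_s = extrapolate(PARTIAL_TIME_S, TAXA_WRITTEN, TAXA)

print(f"matrix cells: {matrix_cells}")
print(f"estimated full size: {full_size_mb:.0f} MB")
print(f"estimated full time: {full_time_s:.0f} s")
```

Under the linear assumption, the estimates come out well above the "more than 9 megabytes" and "more than 60 seconds" lower bounds quoted above.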
This is something important for each of us to keep in mind when using a verbose language such as SDD to describe sets of data that are big. Perhaps the X:ID data can be presented without the behind-the-scenes information, such as the matrix information itself. Consider the key above (1070 X 540) where there are 577,800 values in the matrix to be considered. This means that, to represent the data in a detailed language such as XML, where every value gets its own line, the
-- Main.PatrickLeary - 07 Jun 2004
SDD is certainly verbose, but not that much. In a way, xml is meant to be verbose so that it is extensible and self-documenting. There is even some unnecessary redundancy between state and character refs (the character refs are redundant, but this was explicitly requested to simplify life for character-based processors). However, in SDD only positive statements are made; the absence of a state is implicit, provided that the character has been scored at all. The value you are using in your example:
<State ref="2">
<Value>1</Value>
</State>
is not SDD. You can provide values only for numeric measurement data, not categorical states. If I understand LIF correctly, the 0 indicates absence, 1 presence, and other values are used to code frequency, uncertainty. So step 1 would be to provide a translation of values other than 1 to SDD (these facts about states would be expressed as frequency and certainty modifiers).
So in the conversion, drop all 0-valued states, drop the value from those with 1, and add the equivalent SDD modifiers for those greater than 1. I will try to help with the modifier question. Also, you can minimize the space needed for the state refs within each character by not indenting them.
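The conversion rule just described can be sketched in Python as follows. This is a minimal illustration, not the X:ID converter: the function name, the input shape (a dictionary of state ref to LIF score) and the example modifier table are all hypothetical, and real modifier keys must come from the Terminology section of the target SDD file.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup from LIF score values greater than 1 to an SDD modifier
# element name and key. Real keys depend on the Terminology of the SDD file.
EXAMPLE_MODIFIERS = {3: ("Frequency", "22")}


def character_element(char_ref, state_scores, modifiers):
    """Build one <Character> element from {state_ref: LIF score} pairs."""
    character = ET.Element("Character", ref=str(char_ref))
    for state_ref, score in state_scores.items():
        if score == 0:
            continue  # absence is implicit in SDD, so nothing is written
        state = ET.SubElement(character, "State", ref=str(state_ref))
        if score == 1:
            continue  # plain presence: a bare state reference, no value
        kind, key = modifiers[score]  # KeyError flags an unmapped score
        ET.SubElement(state, kind, ref=key)
    return character


elem = character_element(1, {1: 1, 2: 0, 3: 3}, EXAMPLE_MODIFIERS)
# ElementTree adds no indentation by default, which keeps the output compact.
print(ET.tostring(elem, encoding="unicode"))
```

State 2 (score 0) is omitted entirely, state 1 becomes a bare reference, and state 3 carries a modifier child.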
By the way: Your work provides a very valuable LIF to SDD converter that has its own utility - separately from X:ID!
-- Main.GregorHagedorn - 07 Jun 2004
Separately: is there any reason why you use xidisopen="F" xidlast="F" xidimage="F" xidmetric="F" instead of defining your extensions in a namespace, like xid:isopen="F" xid:last="F" xid:image="F" xid:metric="F", as I proposed in the example file? Did the namespace extension not work for you? I am asking because the question of whether attribute extensions from other namespaces should be allowed in SDD needs to be answered. I tried to add that experimentally in the special XID-modified SDD version. Did I get it wrong? -- Main.GregorHagedorn - 07 Jun 2004
It would be helpful if the Schema to which this complies were provided here also. In general, possibly SDD should require that a schema be named in the XML PI (including enough to guarantee that the correct schema version is deducible for validity checking). -- Main.BobMorris - 07 Jun 2004
I have attached the SDD Schema and example XML that was given to me by Gregor. It has been cut down in size for use specifically with X:ID: SDD_for_XID.zip.
I tried to enter attributes with an xid namespace, for example - xid:isopen, but our XML parser crashed at this point and said it did not recognize the namespace. I have never used different namespaces in XML files, so I'm not sure if there is some way to explicitly define a new namespace somewhere in the file. So, I temporarily renamed the attributes by removing the colon, for example - xidisopen.
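For reference, a prefix such as xid only becomes usable once it is bound with an xmlns:xid declaration on the element itself or on one of its ancestors (usually the root). The Python sketch below illustrates this with the standard library parser. The namespace URI and the Class element are placeholders chosen for this example only; they are not taken from the SDD schema or from X:ID.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, chosen only for this illustration.
XID_NS = "http://example.org/xid"

# The xmlns:xid declaration binds the "xid" prefix. Without it, a
# namespace-aware parser rejects xid:isopen as an unbound prefix.
DECLARED = f'<Class xmlns:xid="{XID_NS}" key="1" xid:isopen="F"/>'
UNDECLARED = '<Class key="1" xid:isopen="F"/>'

elem = ET.fromstring(DECLARED)
# ElementTree exposes namespaced attributes as "{uri}localname".
print(elem.get(f"{{{XID_NS}}}isopen"))

try:
    ET.fromstring(UNDECLARED)
except ET.ParseError as err:
    print("parse failed:", err)
```

The second document fails in the same way as described above, which suggests the crash was caused by the missing declaration rather than by the colon itself.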
As far as the
You were correct in your assumption about Lucid scoring. Lucid scores taxa under the following scheme:
0 - absent
1 - present
2 - unknown
3 - rare
4 - present but may be misinterpreted as absent
5 - absent but may be misinterpreted as present
6 - metric data
So, values are to be used only for metric data? [Yes] In a Lucid-style key, metric data is not entered as a single value, but instead there are parameters. There are extreme upper, normal upper, normal lower, and extreme lower values. So, for one metric state/taxon pair, there are four metric values to be entered.
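To make the scoring scheme and the four metric parameters concrete, here is a small Python sketch of how a converter might hold them in memory. The class and member names are my own, the ordering check is an assumed sanity rule, and the example values are made up; nothing here is part of Lucid or SDD.

```python
from dataclasses import dataclass
from enum import IntEnum


class LucidScore(IntEnum):
    """Lucid score codes as listed above; the member names are hypothetical."""
    ABSENT = 0
    PRESENT = 1
    UNKNOWN = 2
    RARE = 3
    PRESENT_MAYBE_MISSED = 4   # present but may be misinterpreted as absent
    ABSENT_MAYBE_SEEN = 5      # absent but may be misinterpreted as present
    METRIC = 6


@dataclass(frozen=True)
class MetricRange:
    """The four parameters of one metric state/taxon pair."""
    extreme_lower: float
    normal_lower: float
    normal_upper: float
    extreme_upper: float

    def __post_init__(self):
        # Assumed sanity check: the four values should be non-decreasing.
        values = [self.extreme_lower, self.normal_lower,
                  self.normal_upper, self.extreme_upper]
        if values != sorted(values):
            raise ValueError(f"metric values out of order: {values}")


length = MetricRange(2.0, 3.5, 6.0, 8.0)  # made-up values
print(LucidScore(6).name, length)
```

Only scores of 6 would need a MetricRange; all other codes describe categorical states.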
Finally, if I am correct in my current understanding of the
-- Main.PatrickLeary - 07 Jun 2004
So to map LIF to SDD, we need:
"2" debugkey="Unknown"). However, this differs from the fact that a specific state is unknown. The way to express this in SDD is to say it is "perhaps" this state. Example: the Lucid statement "flower elliptic, unknown whether ovate" is in SDD interpreted as "flower elliptic, perhaps ovate". Thus we first need to define in Terminology/Modifiers a Certainty modifier for perhaps (see Modifier key="41" in the SDD_tech.xml example file). Example for the application of this in a description:
<CodedDescription key="101"><Header><ClassName ref="1"/></Header>
<CodedData>
<Character ref="1">
<State ref="1"><Certainty ref="41"/></State>
</Character>
</CodedData>
</CodedDescription>
<CodedDescription key="101"><Header><ClassName ref="1"/></Header>
<CodedData>
<Character ref="1">
<State ref="1"><Frequency ref="22"/></State>
</Character>
</CodedData>
</CodedDescription>
<CodedDescription key="101"><Header><ClassName ref="1"/></Header>
<CodedData>
<Character ref="1">
<State ref="1"><Certainty ref="42"/></State>
</Character>
</CodedData>
</CodedDescription>
The size issue is known. It is common to most xml data. SDD does not make any assumptions that processors will work natively on SDD. An identification tool may easily read the SDD data, process it, throw away those parts it is not interested in, and store the rest in a matrix view. I believe it will be very valuable when you have created a large file, and it would be good if this file could be shared for testing purposes. However, to me the issue seems to be a choice between defining an extensible format based on xml and a non-extensible one optimized for a specific purpose, like LIF. Any suggestions about options for SDD are appreciated!
Regarding extensibility: Note that descriptions may be associated with images or documents. The "file" element you add in the
you can refer to such resources for the entire description like:
<MediaResource key="125">
<Label><Representation language="en"><Text>Melampsora evonymi-caprearum</Text></Representation></Label>
<!-- Label is required, but if the source provides no separate title or description of a resource, the url may be used here as well -->
<ObjectLink><URL>www.xxx.org/img/Melampsora_evonymi-caprearum.png</URL></ObjectLink>
<Type>Image</Type>
<Caption><!-- Caption is optional -->
<Representation language="fr"><Text><i>Melampsora evonymi-caprearum</i> Kleb., stade II sur <i>Salix caprea</i> L.</Text></Representation>
<Representation language="de"><Text><i>Melampsora evonymi-caprearum</i> Kleb., Sporenstadium II auf <i>Salix caprea</i> L.</Text></Representation>
<Representation language="en"><Text><i>Melampsora evonymi-caprearum</i> Kleb., spore stage II on <i>Salix caprea</i> L.</Text></Representation>
</Caption>
</MediaResource>
<CodedDescription key="101">
<Header><ClassName ref="1"/></Header>
<MediaResources><MediaResource ref="123"/><MediaResource ref="124"/></MediaResources>
...
-- Main.GregorHagedorn - 7/8. Jun 2004
I updated our code to better match SDD as I am starting to understand it. The file XIDwithBDI.SDD.xml was also updated to represent the current working version of the X:ID Player running over SDD. The link to this version of the X:ID Player can be found above.
It seems odd to me that the different scoring options all have different tags (
Also, for the time being, I changed all Lucid-style scores of "4-present but may be misinterpreted as absent" to plain present, since the fact that the feature itself is present seems to me the most relevant point. The fact that it can be misinterpreted could perhaps be another type of
I have left the metric data untouched. Could you provide a code sample of how I should report this? I downloaded xmlSpy, but I couldn't find anything about metric data in the simplified version of SDD that you made for me, and I think you mentioned earlier that you had left it out. (* ... danger of trying to simplify for what I thought X:ID was going to handle. I did not notice numerics in the keys I looked at. -- Main.GregorHagedorn)
I also moved the description of the taxon media down to the
-- Main.PatrickLeary - 08 Jun 2004
This is certainly progress! Some points, however: In "3" you try to use a media resource. However, similar to agents media resources are managed in the central
"1" (compare example above).
I could not see a coding status or certainty statement. Perhaps you can introduce them into a dummy taxon at the start of the data set if they are not present. <State ref="23"><Frequency ref="22"/></State> is ok. However, you have not added the terminology definitions yet. Similar to characters, audiences, etc., you first have to define any term you want to use in the terminology. Please take a look at the normal example file (not the XID one) in beta 16 or later. The X:ID hierarchy version differs slightly, but you can copy the modifiers and coding status examples directly into your file. If you want, you can insert only those I gave in the examples above. -- I will try to put out a generally updated SDD that is closer than beta 16 to what you have.
Very minor point: the audience should always correspond to an audiencekey. Currently you define en, and use en or en5. It would be good if the code took these values from a single variable, so as to allow conversion of non-English Lucid data as well.
I have a couple of questions on X:ID to check whether we have equivalent things in SDD that could be used directly:
"F" key
"3" xidremain="Y""en" xidsource
"local"-- Main.GregorHagedorn - 9. Jun 2004
Patrick wrote: "It seems odd to me that the different scoring options all have different tags (
(Question for numeric example is still open, I know. We need real SDD discussion there, since we currently have alternative models in SDD.)
-- Main.GregorHagedorn - 9. Jun 2004
Once again I have made some changes. Now, the X:ID with SDD Development version plays a key to automobiles. This key is extremely simplified and meant for development only. I once again updated the file XIDwithBDI.SDD.xml to match the recent changes.
I have added definitions to the coding types and I believe they are in the appropriate places. I was wondering what information is necessary in these definitions. In
Also, the Certainty Modifier has Text and
With the Frequency Modifier for rare, could you tell me more about the
I also changed the position of the
As far as the attributes that we use that are specific to xid:
"local" simply defines that the name came from the LIF file itself. An alternate source will be source="tns", meaning the name came from the taxonomic name server.
I can understand the need to clean up the XML to match SDD, but is it not SDD compliant if there are extra attributes? Some of these attributes are basically shortcuts that cut down on programming and save processing time, which is necessary for extremely large keys. Some of the tricks are the reason we can process such large amounts of information in such a small amount of time. That is why some tags that don't necessarily have anything to do with names are listed in the name tag. It is for the sole reason that when we print the name, we need to know other things.
Let me know if I am being unclear; these attributes are X:ID-specific, and I understand that I am familiar with them while no one else is. Hopefully we can find common ground, as we are trying to make our XML backbone compliant with SDD standards.
-- Main.PatrickLeary - 09 Jun 2004
Just very quick: <State ref="4"><CodingStatus ref="2"/></State> is not SDD,
Regarding the last question: I have no problem with processing tricks and extensions. You explained them very clearly. Yes, currently it will break validity of SDD to put such things anywhere else but the
My interest, however, is to see which is a processing shortcut and which is something general that either is already or should be in SDD (the latter is especially interesting; SDD is not necessarily finished). The point about image and xidsource seems to be such a point for me. So if you can, try to use SDD directly here, and then see whether you really need shortcuts. If you do, I have no problem. If you think SDD should change, I appreciate the comments!
-- Main.GregorHagedorn - 9. Jun 2004
I changed the
For State images, I left a short, empty <Icon/> tag in the Representation section for the states. Is this incorrect according to SDD? The reason for this, and for the xidimage tag in the Class definitions is that we initially were trying to avoid listing out all the image titles and URLs. On our server-side, we do not need this information for the functioning of X:ID, so we left out this information to save time and space.
I had another idea, but it is also a little unusual. I thought about making a
-- Main.PatrickLeary - 09 Jun 2004
Very good! As I said, I have absolutely no problems with the UI-extensions like isopen or ischosen. The xidsource attribute intention of linking to a name source is actually something very close to the heart of SDD and its
I don't fully understand the icon/empty media resources issue. Yes, an empty <Icon/> is not SDD: SDD validation requires that a claim that there is an image also provides the information about it. Viewed from an information-exchange point of view, that makes sense. But I think/hope I am beginning to get the idea. You say the server will look up whether there are images somewhere, and in the UI you only need to know whether that lookup will fail or not. Is that correct?
Basically I see no problem with omitting data (although it would be nice to send them if the xml-source is requested). This seems to be also the case in the
So first, suppressing information to reduce traffic for UI issues is certainly legal. The hint to the UI that media resources are defined in the full data could then be validly UI-specific. I would recommend something like "mediaavailable" or "resourcepresent" etc. rather than image (you yourself use both image and a full html page!). The proposal to use a dummy media definition is possible and generates legal SDD, but since you redefine semantics, I am not sure it is valuable.
One minor point: I would recommend changing the boolean xid-attributes from the values T and F to 1 and 0. This allows them to be typed as xs:boolean, which only accepts true, false, 1, and 0 as values. That seems to me closer to programming and semantics.
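A minimal Python sketch of that change, assuming the X:ID flags only ever take the values T and F as shown earlier in this thread (the function names are hypothetical):

```python
# Lexical values accepted by xs:boolean in XML Schema.
XS_BOOLEAN = {"true": True, "1": True, "false": False, "0": False}

# Current X:ID convention as described in this thread (assumed complete).
XID_FLAGS = {"T": "1", "F": "0"}


def to_xs_boolean(flag: str) -> str:
    """Translate an X:ID 'T'/'F' flag into the xs:boolean form '1'/'0'."""
    return XID_FLAGS[flag]


def parse_xs_boolean(text: str) -> bool:
    """Parse an xs:boolean lexical value; raises KeyError if it is invalid."""
    return XS_BOOLEAN[text.strip()]


print(to_xs_boolean("T"), parse_xs_boolean("0"))
```

Note that "T" and "F" themselves are rejected by parse_xs_boolean, which mirrors what a validating parser would do with an attribute typed as xs:boolean.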
Separately: Earlier question: "In
Should probability and frequency specifications be made optional in SDD, preventing processors from relying on them? I could make them optional with defaults of 0 and 1 respectively.
-- Main.GregorHagedorn - 9. Jun 2004
I made a few more changes. I no longer have to rely on that messy
As far as the name sources go, we do plan on combining the two projects eventually, but we are really trying to get X:ID fully functioning and comprehensive by itself. Perhaps a future version of X:ID will be integrated, and we will be sure to let you know when that happens.
The frequency issue is not a problem for us, but it may be something you want to consider. Since "rare" is a loose term, it will discriminate to varying degrees for different users. As long as the creator of the SDD file can set the levels, and perhaps create multiple Frequency Modifiers for varying levels of "rarity", I would say SDD is fine and does not need to be changed. But, again, this is not relevant to us, and we are not taxonomists; this is only my opinion from a programmer's point of view. Thank you for your explanations. I now feel that I understand everything that goes into our XML files.
One small thing: "Answer: the "Perhaps" certainty modifier you should not have
I also performed a small test on the file sizes of our old XML format and the new SDD-compliant format with the "Key to Marine Fishes". Our old XML format with the XID-specific DTD created a file 213 lines long and about 8KB in size. The SDD-compliant version creates an XML file 2339 lines long and 60KB in size. This is troubling to me from a programmer's standpoint, because we want everything to be small and run quickly. So, I was thinking of using SDD in an export function, or even create a new program that was a LIF->SDD converter.
I will work on making the LIF->SDD converter, as I think it will be useful to all interested in SDD. It will serve as a good example of a real-time application of SDD, and will hopefully help to further its development. I was wondering if you or any of your colleagues had any Lucid keys (preferably in .LIF format) that we could use to test this application. All of our keys are small and not of any real taxonomic significance, so I would like to test my conversion with a real example of a large and detailed key.
I also posted the newest version of our XIDwithBDI.SDD.xml file so you can take a look. Thanks for all the help.
-- Main.PatrickLeary - 10 Jun 2004
I just finished my LIF->SDD converter. It can be found at http://uio.mbl.edu/SDD/converter.php . I also attached the PHP code to this forum, so that people may check out how the conversion actually happens. I included some sample .LIF files that we have here at the MBL, but they are demonstration files as opposed to actual comprehensive taxonomic keys. LIFs can also be uploaded from the user's computer, or a URL to a LIF file somewhere on the web can be typed in (this URL must start with http://). I hope this is a useful application, and I would certainly like any feedback regarding how it can be improved.
-- Main.PatrickLeary - 10 Jun 2004
No doubt that the LIF->SDD converter is a great idea!!!
-- Main.GregorHagedorn - 11 Jun 2004