Real world examples (SDD 1.1)

To help application developers in understanding SDD and testing their software, we provide a number of "real world" example data sets for SDD, version 1.1. We try to provide different data sets with different properties and originating from different applications.

1. Data sets for multi-access keys ("matrix keys")

Transformed DELTA example data set

Introduction: The following two datasets are either distributed together with the CSIRO DELTA programs, or used in feature comparisons. They are provided here to help people with a DELTA background to understand the relation between SDD and DELTA.

Description of data sets: The first data set is a minimal set with 4 characters and 3 descriptions of beetles. It is used as the example on Data Requirements for Natural-language Descriptions and Identification and provided there in various formats (DELTA, NEXUS, Lucid Interchange Format File v. 1.1 (an old version of Lucid), and XDELTA). The second set contains a larger character set and 14 grass species. It is distributed together with the DELTA programs (version from 2000, see "All programs (including Intkey)" on DELTA Programs and Documentation). The original DELTA file is ANSI (not ASCII) encoded and uses RTF for character markup. This example is provided both as a single document and as a multifile XML document set. The multifile approach uses multiple XML fragments that can be edited individually or placed in different repositories, and that are finally combined into a "master document" using XML entities. To some extent this mirrors the most common use of DELTA, a folder containing multiple directive files.
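
As a rough illustration of the multifile approach described above, the following Python sketch generates the text of a master document that pulls in fragments through external XML entities. The root element name, the entity names, and the file names are hypothetical placeholders and are not taken from the actual DELTA2000_Multifile archive.

```python
def build_master(root: str, fragments: dict[str, str]) -> str:
    """Return XML text whose root element includes fragments via external entities.

    `fragments` maps an entity name to the relative path of a fragment file.
    """
    # One external entity declaration per fragment file.
    decls = "\n".join(
        f'  <!ENTITY {name} SYSTEM "{path}">' for name, path in fragments.items()
    )
    # One entity reference per fragment, in the same order.
    refs = "\n".join(f"  &{name};" for name in fragments)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f"<!DOCTYPE {root} [\n{decls}\n]>\n"
        f"<{root}>\n{refs}\n</{root}>\n"
    )


# Hypothetical root element and fragment files, for illustration only.
master = build_master(
    "Master",
    {"characters": "characters.xml", "descriptions": "descriptions.xml"},
)
print(master)
```

A consuming application has to resolve external entities to see the combined document. Many XML parsers disable this by default for security reasons, so it usually needs to be switched on explicitly.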

Data conversion: Both datasets were initially converted by importing the DELTA data into DiversityDescriptions 2.0 beta 10 and exported to SDD 1.1 from there. Since the SDD produced in this way contains various DiversityDescriptions-specific information, the datasets were slightly cleaned by hand afterwards.

Copyright and license: Both datasets are used with the specific permission of Mike Dallwitz (2008). They are not placed under a general license.

SDD documents:

FOLLOWING ARE 3 BROKEN LINKS - NEED TO BE FIXED:

Beetles.sdd11.zip: Beetles dataset in SDD 1.1 format
DELTA2000-Sample.sdd11.zip: DELTA.exe (2000) sample dataset on 14 Grass genera in SDD 1.1 format
DELTA2000_Multifile.sdd11.zip: As above, but using multiple XML fragments combined into a master document

LIAS data set

Introduction: LIAS is a global information system for Lichenized and Non-Lichenized Ascomycetes. The vision of LIAS is to establish a non-commercial global information system for the collection and distribution of descriptive, phylogenetic, and other biodiversity data on these taxonomic groups that uses advanced technology and where published biodiversity data of all ascomycetes are joined in a multi-authored database and used for the most sophisticated queries. Specific goals are to

provide a working space for cooperation and collaboration of experts on ascomycetes on the Internet
establish a multi-authored worldwide database on descriptive data of all ascomycetes
design user-friendly web tools for easier access and remote editing of database records via the Internet
offer an online database system for multiple usage and thereby disseminate expert knowledge, especially by providing public access to database-generated identification keys and natural language descriptions of ascomycetes
promote the gathering, furnishing and administration of data by experts in a standard database system which allows an information deposit for individual use only (e.g. for revision) and – after agreement – public access to the data via the Internet
promote common standards on descriptive data connected with taxonomic names of ascomycetes to facilitate interoperability and data exchange

LIAS is the work of many collaborators. The primary editors are G. Rambold and D. Triebel. Cooperating institutions are the University of Bayreuth, Department of Mycology; Botanische Staatssammlung München; Arizona State University, Department of Plant Biology; University of Hamburg, Herbarium Hamburgense; and the University of Oslo, Botanical Museum. It is or was supported by funds from the Bayerisches Staatsministerium für Wissenschaft, Forschung und Kunst, Bundesministerium für Bildung und Wissenschaft (BMBF), Deutsche Forschungsgemeinschaft (DFG), and Staatliche Naturwissenschaftliche Sammlungen Bayerns.

Description of data set: The data set provided here is the complete LIAS main data set as of 2007-07. It provides descriptions of 2480 genera and species of lichens using 987 characters with a total of 7632 categorical state definitions (plus 3128 status values or statistical measures for quantitative characters). The descriptions are atomized to a total of 221821 values. Only relatively few characters and states are "pseudo-" or "management-characters", dealing with taxonomy, revision management, etc. Of the total LIAS main character data matrix of 2480x987 = 2447760 cells, 157041 cells are filled (6.4%). Part of this low fill factor is due to the taxonomic diversity encompassed in the data set, but it also shows that significant work still has to be done.
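
The fill factor quoted above can be reproduced with a few lines of Python. This is only a minimal sketch of the arithmetic. The constants are the figures from the description, while the function and variable names are illustrative.

```python
# Figures quoted in the dataset description above.
TAXA = 2480            # described genera and species
CHARACTERS = 987       # characters in the LIAS main matrix
FILLED_CELLS = 157041  # taxon/character cells holding at least one value


def fill_factor(filled: int, rows: int, cols: int) -> float:
    """Return the fraction of matrix cells that are filled."""
    total = rows * cols
    if total == 0:
        raise ValueError("matrix has no cells")
    return filled / total


total_cells = TAXA * CHARACTERS
ratio = fill_factor(FILLED_CELLS, TAXA, CHARACTERS)
print(f"{total_cells} cells, {ratio:.1%} filled")  # 2447760 cells, 6.4% filled
```

An application that loads the dataset can compare its own counts against these figures as a simple sanity check on the import.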

Related data sets: Two datasets are closely related to "LIAS main": (1) LIAS light contains fewer characters but has been more extensively revised and has a higher fill factor. It is therefore more suitable for practical identification and is currently being strongly expanded as part of two major joint projects: the BIOTA Africa project and the Greater Sonoran Desert Lichen Flora. (2) A key to around 700 powdery mildews (Erysiphales), which for reasons of convenience is coupled with LIAS main, has been excluded from this release.

Data conversion: The "LIAS main" dataset is managed in DiversityDescriptions; the attached SDD 1.1 export file was created by the DiversityDescriptions export routine.

Highlights for testing SDD: The LIAS dataset is a large dataset and is especially suitable for testing the behavior of an application with large and rich keys.

Copyright and license: The "LIAS main" dataset attached here is © 1996-2007 by Botanische Staatssammlung München. All rights reserved. It is here released under the Creative Commons non-commercial, by attribution, share-alike license in version 2.5. Further details are included in the file itself.

SDD document: LIAS_Main.sdd11.zip: LIAS Main dataset in SDD 1.1 format

Interactive Key to Species of Erythroneura

Introduction: The Interactive Key to Species of the Genus Erythroneura (Homoptera, Cicadellidae) by D. Dmitriev & C. Dietrich is also available online under the 3I software created by Dmitry A. Dmitriev. 3I (Internet-accessible Interactive Identification) is a set of software tools for creating online identification keys, taxonomic databases, and virtual taxonomic revisions. By organizing illustrations and nomenclatural, morphological, bibliographical, and distributional data into a single database, 3I also facilitates production of traditional, printed taxonomic papers and monographs. As such it is more comprehensive than SDD alone, pointing in the direction in which SDD plans to evolve (online monographs including nomenclature as well as descriptions and identification tools).

Description of data set: The data set is a small sized key for 54 taxa, using 42 characters, 171 categorical state definitions, and 2401 values. It contains only a single quantitative character.

Data conversion: The export to SDD occurred indirectly, by importing the original 3I database into DiversityDescriptions (converter available since version 2.0) and creating SDD from there. As a result, some details (specimen, nomenclature, publication data) that could in principle be expressed in SDD 1.1 were lost because they were not fully supported by DiversityDescriptions.

Highlights for testing SDD: The dataset is a small fully revised and published dataset with rich illustrations. Although the images are not included here, as of 2007-10-12 the given URLs were resolvable. Note: the dataset does not use any Status values ("unknown", "not applicable", etc.).

Copyright and license: The Erythroneura dataset attached here is © 2003-2006 D. Dmitriev & C. Dietrich. The SDD version is released here under the Creative Commons non-commercial, by attribution, share-alike license in version 2.5.

SDD document: Erythroneura.sdd11.zip: D.Dmitriev's Erythroneura key in SDD 1.1 format

An Interactive Key to Tribes of Leafhoppers / Интерактивная Определительная Таблица Цикадок (Cicadellidae, in English and Russian)

Introduction: This key by D. Dmitriev & C. Dietrich is used to demonstrate the multilingual properties of the 3I software and is available in English and Russian. See the "Interactive Key to Species of Erythroneura" above for further information on 3I.

Description of data set: The data set is a small to medium sized key for 152 taxa, using 146 characters, 414 categorical state definitions, and 13252 values. It contains no quantitative or text characters. The revision of the dataset is not complete.

Data conversion: The export to SDD occurred indirectly, by importing the original 3I database into DiversityDescriptions (converter available since version 2.0) and creating SDD from there. As a result, some details that could in principle be expressed in SDD 1.1 were lost because they were not fully supported by DiversityDescriptions.

Highlights for testing SDD: This dataset is provided as a fully multilingual dataset. Note that at the moment the natural language features are only partly exported in both languages; this is solely due to incomplete conversion and not to a limitation of either 3I or SDD.

Copyright and license: The "Key to Tribes of Leafhoppers" dataset attached here is © 2003-2006 D. Dmitriev & C. Dietrich. The SDD version is released here under the Creative Commons non-commercial, by attribution, share-alike license in version 2.5.

SDD document: Cicad.sdd11.zip: D.Dmitriev's English/Russian multilingual example data set.

2. Data sets for natural language descriptions including markup

(None at the moment, please help us providing such a data set!)

3. Data sets for branching (static dichotomous or polytomous) keys

Dichotomous key to higher plants from Val Rosandra (Italy)

This SDD dataset is an export of the FRIDA key to the higher plants of the Val Rosandra nature reserve in Italy. The original FRIDA key is available online. The dataset has been created as a prototype for more widespread adoption of SDD in the context of the Key to Nature EU project.

Description of data set: The data set is a medium to large sized dichotomous key covering 1149 taxa in 1154 couplets (2308 leads). 1949 images are linked into the key. The dichotomous key itself is fully translated into English. The key contains a single inner reticulation (where a lead can be reached by multiple paths) and many "terminal reticulations", i.e. taxa that are keyed out multiple times. It also contains 400 Italian natural language descriptions. In addition to the real FRIDA key, the dataset contains a second, dummy key to illustrate two points: a) a dataset may have multiple labeled keys, and b) the optional question/answer style available in SDD.
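
To make the two kinds of reticulation concrete, here is a small Python sketch on a made-up four-couplet key. The dictionary layout is a hypothetical in-memory representation chosen for illustration, not the way SDD encodes keys in XML, and the taxa are placeholders.

```python
from collections import Counter

# Hypothetical toy key: each couplet id maps to its two leads. A lead points
# either to another couplet (int) or to a taxon name (str).
KEY = {
    1: [2, 3],
    2: ["Taxon A", 4],
    3: [4, "Taxon B"],          # couplet 4 is reachable from couplets 2 and 3
    4: ["Taxon A", "Taxon C"],  # Taxon A is keyed out a second time
}


def reticulations(key):
    """Return (couplets, taxa) that are the target of more than one lead."""
    counts = Counter(target for leads in key.values() for target in leads)
    inner = sorted(t for t, n in counts.items() if n > 1 and isinstance(t, int))
    terminal = sorted(t for t, n in counts.items() if n > 1 and isinstance(t, str))
    return inner, terminal


inner, terminal = reticulations(KEY)
lead_count = sum(len(leads) for leads in KEY.values())
print(len(KEY), "couplets,", lead_count, "leads")
print("inner reticulations:", inner)
print("terminal reticulations:", terminal)
```

In this toy key, couplet 4 is an inner reticulation and Taxon A is a terminal one. A strictly dichotomous key has exactly two leads per couplet, which matches the 1154 couplets and 2308 leads reported for the Val Rosandra data set.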

Data conversion: The dataset is a semi-manual prototype export from the FRIDA database. It is planned that the export routine will be fully automated and that all available FRIDA keys will in future also be available in SDD format.

Copyright and license: The "Val Rosandra" dataset attached here is © 2008 P.L. Nimis & S. Martellos. The SDD version is released here under the Creative Commons non-commercial, by attribution, share-alike license (Creative Commons 3.0 NC-BY-SA unported).

SDD document: Val-Rosandra-FRIDA-Key.sdd11.zip: Dichotomous key to higher plants from Val Rosandra (Italy).

Key to Dutch reptiles and amphibians (by ETI)

The dataset has been created as a prototype while implementing SDD in the ETI BioInformatics mobile key created in the context of the Key to Nature EU project. Its goal is to create a small, but realistic identification dataset for testing purposes, combining several features of SDD.

Description of data set: The taxon names here contain atomized data (CanonicalName; this is the only dataset that features this), and the key is bilingual in Dutch and English. The key contains only categorical characters (no quantitative or text). The characters are labeled in question style, with the states giving the answers. Each taxon has a short natural language description (plain text without semantic markup; note that the English text does not fully reflect the Dutch). The key contains both coded descriptions for use with a multi-access key and a manually created, fixed single-access key (polytomous). The latter uses question/answer style in part ("Does it have legs? yes/no") and couplet style with leads in part ("Warty skin, pupil horizontal/Warty skin, pupil vertical/Smooth skin, pupil vertical"). The data set is small, with 24 taxa and 20 characters.

Data conversion: The dataset is a semi-manual prototype export from ETI data.

Copyright and license: The dataset attached here is © 2008 ETI. The SDD version is released here under the Creative Commons non-commercial, by attribution, share-alike license (Creative Commons 3.0 NC-BY-SA unported).

SDD document: ETI_rept_amph_key.sdd11.xml.zip: Key to Dutch reptiles and amphibians (by ETI)

-- Main.GregorHagedorn - 20 Nov 2008
