GarryJolleyRogers - Wed Nov 25 2009 - Version 1.8
Parent topic: SDD2004Berlin

RepeatedObservations

Issues related to repeated observations / observation sets

"ObservationSet" etc. can be confused with collection vs. observation of an object. Should we rather call it "RawData", like the type? The "raw" seems to be colloquial though. Any better term?

Also, the association b/w raw data and synthetic statements (categorical data with frequency categories, univariate statistical measures) is not preserved. Note that it can not be automatically assumed that statistical measures come from those raw data recorded in an ObservationSet. They may have been calculated separately and entered without the raw data they are based on. Although this makes little sense, it may even be possible to have an ObservationSet from which nothing is calculated, and statistical measures calculated from other data. Not clarifying this seems to be poor design.

-- Gregor Hagedorn - 3 May 2004

I guess here you mean that the raw data may not be available in the instance document, even by reference, and that the stat. measure is "original" data, at least to this instance document. So I guess the point is the distinction between derived and primary data. Alas, "primary" has a certain cachet not intended here. "Anecdotal" or "uncomputed" might. Another possibility is that if there is a reference to raw data for the computation, then end of story. The distinction is thus that for which there is a reference and that for which there isn't. Machines can, in principle, redo the computation in the first case but not in the second. All other semantics of the lack of reference would be unspecified.

-- Main.BobMorris - 3 May 2004

"uncomputed" is misleading, the statistical measures would have been calculated, but not in a way that can be reproduced by an SDD consuming process. Same for anecdotal: statistical measures published in a scientific paper are not appropriately called "anecdotal". The assumption is that they are correctly calculated.

So, is it sufficient to say for data source of the statistical measure "inherited (from above)", "aggregated from below", "computed from raw data"? Or do we have to ID-ref it as well? What is a good element name for this "data source" concept?

-- Gregor Hagedorn - 4 May 2004

I guess we need a key/keyref to a data set or a resource: <computedData keyref=refToDataSetOrResource>

I'd most like a program to be able to verify the claim. This might be possible with a reference to data in the current instance doc, but for external data might require us to define a protocol for data service. That is not a small issue, and is being addressed in the communities that do modeling servers.

-- Main.BobMorris - 04 May 2004

The problem of attributing the source of data may be a more general problem, also relating to inheritance of data for other reasons, e. g. from higher or lower taxa. Again, either all data resulting from inheritance should be considered dynamic and never stored, or the storage must maintain the reason/association.

-- Gregor Hagedorn - 04 May 2004

Multiple Observation sets, e. g. 5 individuals, each with 30 leaf measurements. The hierarchy of measurements may be relevant to detect that perhaps one individual is in fact a different species. Should we distinguish between "observation of repeated part" and "observation on whole individual = repeated as individual". How?

Do we need to change our base model (i.e. make the part hierarchy special and required / which would prevent exporting DELTA legacy data to the new format!) or can we find a better solution. Perhaps have instead of Observations Individuals and Parts containers? Would this work?

A related problem is the handling of outliers in raw observation data. It seems to be unsatisfactorily to simply purge them from the list of recorded data. Instead they should preferably be marked up to indicate they are not to be included in the statistical evaluation.

-- Gregor Hagedorn - 03 May 2004