Syntaxus baccata
This page isn't actually my blog; it's more of a proxy, and a sandbox to play around with MDL and AngularJS (see Attribution). My real blog is hosted here. You are, of course, still welcome to view the posts on this page. I have limited the post length and the number of shown posts (I generally don't like infinitely-scrolling pages), so to read everything you're better off visiting the real blog.

New paper: Identification of Cholevinae larvae
In 2022, I started my Master’s in Biology at Radboud University in the Netherlands, where I had just finished my Bachelor’s degree. The Master’s programme includes two research internships of 36 EC (approx. 6 months), both of which include writing a thesis. As I had been working on a database of identification keys, I was interested in a project focused on taxonomy for my first research internship.
Thanks to Henk Siepel I ended up contacting Menno Schilthuizen at Naturalis, who suggested I work on Cholevinae larvae. Schilthuizen had been collecting Cholevinae larvae since the 1980s, and had also received material from Peter Zwick, who had started collecting larvae in different areas of Germany in the 1960s. The challenge was to use this material to make an identification key.
Although the first description of a larva of Cholevinae was published back in 1861 by J. C. Schiødte, descriptions have since been relatively few and far between. This also means that there are almost no existing identification keys for larval Cholevinae. Making these descriptions and keys is difficult, as you need larvae of a known species. This is only possible if the larvae are cultured from adults, which takes time and effort, if molts are collected and the emerged adult is identified, or if DNA barcoding can be used. The specimens collected by Zwick and Schilthuizen were mainly obtained with the first method.
However, there happened to be a recent, detailed description of Sciodrepoides watsoni, a species for which I also had specimens. I started by comparing the larvae of S. watsoni (as well as a few of the related S. fumatus) to the drawings and descriptions made by Kilian and Mądra. From there, I could start looking at different species and identify potential areas and types of characteristics that are consistent within a species, but that differ between species. To illustrate these differences I also made schematic drawings (Fig. 1) of different sets of characteristic features. Finally, I measured certain parts of the larvae, where possible, for specimens preserved on microscope slides.
Figure 1: Illustrations of Cholevinae larvae
At the end of the 6 months, I had a complete key to all species for which specimens were available, but only for the 1st instar. When the larvae molt for the first time, they gain secondary bristles, grow in size, and more, meaning the identifying characteristics cannot always be used for both the 1st instar and the 2nd and 3rd instars. I ended up spending another year or so finalizing the key for all instars. The finished key covers 28 of the 39 species of Cholevinae occurring in the Netherlands, and includes descriptions of many species for which no (detailed) description was previously available. In a true full-circle moment, I could add my own work to the aforementioned database of identification keys (as B1860).
Ultimately, this work, in collaboration with Schilthuizen, Siepel, and Zwick, culminated in an article, Comparative morphology of the larval stages of Cholevinae (Coleoptera: Leiodidae), with special reference to those in the Netherlands. We were able to publish it in the final issue of Tijdschrift voor Entomologie, which is unfortunately being discontinued after 167 volumes. Again, many thanks to Menno Schilthuizen, Peter Zwick, and Henk Siepel for this great opportunity. Check it out!
References
- Willighagen, L. (2022, August 6). Library of Identification Resources. Syntaxus Baccata. https://doi.org/10.59350/h8qka-z4a05
- Schiødte, J. C. (1861). De metamorphosi eleutheratorum observationes: Bidrag til insekternes udviklingshistorie (pp. 1–558). Thieles Bogtrykkeri. https://doi.org/10.5962/bhl.title.8797
- Kilian, A., & Mądra, A. (2015). Comments on the biology of Sciodrepoides watsoni watsoni (Spence, 1813) with descriptions of larvae and pupa (Coleoptera: Leiodidae: Cholevinae). Zootaxa, 3955(1), 45–64. https://doi.org/10.11646/zootaxa.3955.1.2
- Willighagen, L. G., Schilthuizen, M., Siepel, H., & Zwick, P. (2025). Comparative morphology of the larval stages of Cholevinae (Coleoptera: Leiodidae), with special reference to those in the Netherlands. Tijdschrift Voor Entomologie, 167, 59–101. https://doi.org/10.1163/22119434-bja10033
Citation.js: 2024 in review
This past year was relatively quiet for Citation.js as well.
Ulex europaeus, observed December 24th, 2024, Vlieland, The Netherlands.
Changes
- BibTeX: output of non-ASCII characters was improved.
- BibLaTeX: support for data annotations was added!
- DOI: the DOI pattern was broadened to include non-standard DOI formats.
- Support for ORCIDs was improved, making it possible to map authors’ ORCIDs to different formats.
New Year’s Eve tradition
After the releases on New Year’s Eve of 2016, 2017, 2021, 2022, and 2023, this New Year’s Eve also brings the new `v0.7.17` release. The CSL field `publisher` is now mapped to the BibTeX field `organization` for `paper-conference` (`inproceedings`) entries.
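As a minimal sketch of the new mapping in action, assuming `@citation-js/core` and `@citation-js/plugin-bibtex` at `v0.7.17` (the item data here is made up):

```js
const { Cite } = require('@citation-js/core')
require('@citation-js/plugin-bibtex')

// A made-up paper-conference item; the publisher field should now come
// out as organization = {...} in the resulting @inproceedings entry.
const cite = new Cite({
  id: 'example2024',
  type: 'paper-conference',
  title: 'An example conference paper',
  publisher: 'Example Society',
  issued: { 'date-parts': [[2024]] }
})

console.log(cite.format('bibtex'))
```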
Happy New Year!
Next.js, SWC, and citeproc-js
Last year I got a bug report that Citation.js was not working when built in a Next.js production environment, for unclear reasons. Next.js is a popular server framework for making web applications with React, and by default it transforms all JavaScript files and their dependencies into “chunks” to improve page load times. In production environments, Next.js uses the Rust-based “Speedy Web Compiler” (SWC) to optimize and minify JavaScript code.
I was able to figure out that somewhere, this process transformed an already difficult-to-grok function (`makeRegExp`) in the `citeproc` dependency into actually broken code. After some trial and error I found the following MCVE (Minimal, Complete, Verifiable Example):
```js
function foo (bar) {
  var bar = bar.slice()
  return bar
}

foo(["bar"])

// equivalent to

function foo (bar) {
  var bar // a no-op in this case, apparently
  bar = bar.slice()
  return bar
}

foo(["bar"])
```
But then, in the chunks generated by Next.js, the argument `bar` gets optimized away from `foo()`, generating the following code (it also inlines the function):
```js
var bar;
bar = bar.slice();
```
Now, this is a simple mistake to make. If you expect `var bar` to actually re-declare the `bar` argument, the argument is clearly unused and can be removed. Due to the quirks of JavaScript that is not the case though, and the incorrect assumption leads to incorrect code.
This is not a one-off thing though: last August I got another, similar bug report with the same cause. Some slightly non-idiomatic code (`CSL.parseXml`) from `citeproc` got mis-compiled by SWC, and I found another MCVE:
```js
function foo (arg) {
  const a = { b: [] }
  const c = [a.b]
  c[0].push(arg)
  return a.b[0]
}
```
The compiler misses that `c[0]` refers to the same object as `a.b` and thinks that makes the function a no-op. It does not optimize the function away fully though, instead producing the following:
```js
function (n) {
  return [[]][0].push(n), [][0]
}
```
This was apparently already noticed and fixed last May, though the SWC patch still has to land in a stable version of Next.js. Interestingly, the patch includes a test fixture that uses `CSL.parseXml` as example code; apparently `citeproc` is a good stress-test for JavaScript compilers.
This is all fine with me: I am not going to blame the maintainers of a complex open-source project like SWC for occasional bugs. However, I would like to see a popular framework like Next.js, with 6.5 million downloads per week and corporate backing, do more testing on such essential parts of its infrastructure. I also do not see them among the sponsors of SWC.
Edited 2024-10-15 at 17:26: Actually, the creator of SWC is also a maintainer of Next.js, though I do not know in which order. Given that, it makes more sense that they switched away from the well-tested but slower BabelJS in version 12, and it is more confusing why they did not test it a bit more thoroughly.
Citation.js: BibLaTeX Data Annotations support
Version `0.7.9` of Citation.js comes with a new feature: `plugin-bibtex` now supports the import and export of Data Annotations in BibLaTeX files. This means ORCID identifiers from DOI, Wikidata, CFF, and other sources can now be exported to BibLaTeX. Combined with a BibLaTeX style that displays ORCID identifiers, you can now quickly improve your reference lists with ORCIDs.
```js
const { Cite } = require('@citation-js/core')
require('@citation-js/plugin-bibtex')
require('@citation-js/plugin-doi')

Cite
  .async('10.1111/icad.12730')
  .then(cite => cite.format('biblatex'))
```
This produces the following BibLaTeX file (note the `author+an:orcid` field):
```bibtex
@article{Willighagen2024Mapping,
  author = {Willighagen, Lars G. and Jongejans, Eelke},
  author+an:orcid = {1="http://orcid.org/0000-0002-4751-4637"; 2="http://orcid.org/0000-0003-1148-7419"},
  journaltitle = {Insect Conservation and Diversity},
  shortjournal = {Insect Conserv Diversity},
  doi = {10.1111/icad.12730},
  issn = {1752-458X},
  date = {2024-03-02},
  language = {en},
  publisher = {Wiley},
  title = {Mapping wing morphs of \textit{{Tetrix} subulata} using citizen science data: Flightless groundhoppers are more prevalent in grasslands near water},
  url = {http://dx.doi.org/10.1111/icad.12730},
}
```
References
- Willighagen, L. (2024). Including ORCID identifiers in BibLaTeX (and using them). Syntaxus Baccata. https://doi.org/10.59350/bk8yd-b1307

Including ORCID identifiers in BibLaTeX (and using them)
On the Fediverse, @petrichor@digipres.club raised the question of how to include identifiers for authors in Bib(La)TeX-based bibliographies:
Any Bib(La)TeX/biber users have a preferred way to include author identifiers like ORCID or ISNI in your .bib file? Ideally supported by a citation style that will include the identifiers and/or hyperlink the authors.
I have wanted to try including ORCIDs in bibliographies for a while now, and while CSL-JSON makes it nearly trivial to encode, neither CSL styles nor CSL processors are at the point where those can actually be inserted in the formatted bibliography. However, BibLaTeX may grant more opportunities, so this piqued my interest.
I first thought of the Extended Name Format (Kime et al., 2023, §3.4), which allows breaking names up into key-value pairs. Normally, those are reserved for name parts (`family`, `given`, etc.), but I believed I had seen a way to define additional “name parts”, one of which could be used for specifying the ORCID. However, in the process of figuring that out, I found the actual, intended, proper solution.
BibLaTeX has, exactly for things like this, Data Annotations (Kime et al., 2023, §3.7). For every field, or every item of every field in the case of list fields, additional annotations can be provided. (There are some additional features and nuances; for a full explanation see the manual.) For ORCIDs, data annotations could look like this:
```bibtex
@software{willighagen_2022_7017208,
  author = {Willighagen, Lars and
            Willighagen, Egon},
  author+an:orcid = {1="0000-0002-4751-4637"; 2="0000-0001-7542-0286"},
  title = {ISAAC Chrome Extension},
  month = aug,
  year = 2022,
  publisher = {Zenodo},
  version = {v1.4.0},
  doi = {10.5281/zenodo.7017208}
}
```
Now, implementing it in a BibLaTeX style proved more difficult than I hoped, but that might have been due to my inexperience with argument expansion and Biber internals. I started with the `authoryear` style and looked for the default name format that it uses in bibliographies; this turned out to be `family-given/given-family`. I copied that definition, and amended it to include the ORCID icon after each name (when available). To insert the icon, I used the `orcidlink` package. This part was tricky, as `\getitemannotation` does not work in an argument to `\orcidlink` or `\href`, but I ended up with the following:
```latex
\DeclareNameFormat{family-given/given-family}{%
  % ...
  \hasitemannotation[\currentname][orcid]
    {~\orcidlink{\expandafter\csname abx@annotation@literal@item@\currentname @orcid@\the\value{listcount}\endcsname}}
    {}%
  % ...
}
```
You could repeat the same with ISNI links, or Wikidata, VIAF, you get the idea. Then you could put the `\DeclareNameFormat` in a new `authoryear-orcid.bbx` file, so that the changes do not show up in the in-text citations, and set the bibliography style like so:

```latex
\usepackage[bibstyle=authoryear-orcid]{biblatex}
```
This can all be seen in action on Overleaf: https://www.overleaf.com/read/gvxqmrqmwswh#f156b5
References
- Kime, P., Wemheuer, M., & Lehman, P. (2023, March 5). The biblatex package: Programmable bibliographies and citations (Version 3.19). http://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf

Three new userscripts for Wikidata
Today I worked on three user scripts for Wikidata. Together, these tools hopefully make the data in Wikidata more accessible and make it easier to navigate between items. To enable these, include one or more of the following lines in your `common.js` (depending on which script(s) you want):

```js
mw.loader.load('//www.wikidata.org/w/index.php?title=User:Lagewi/properties.js&oldid=2039401177&action=raw&ctype=text/javascript');
mw.loader.load('//www.wikidata.org/w/index.php?title=User:Lagewi/references.js&oldid=2039248554&action=raw&ctype=text/javascript');
mw.loader.load('//www.wikidata.org/w/index.php?title=User:Lagewi/bibliography.js&oldid=2039245516&action=raw&ctype=text/javascript');
```
User:Lagewi/properties.js
Note: It turns out there is a Gadget, EasyQuery, that does something similar to this user script.
Inspired by the interface for observation fields on iNaturalist, I wanted to easily find entities that also used a certain property, or that had a specific property-value combination. This script adds two types of links:
- For each property, a link to query results listing entities that have that property.
- For each statement, a link to query results listing entities that have that claim, i.e. that property-value combination. (This does not account for qualifiers.)
These queries are certainly not useful for all properties: listing entities that are instance of (P31) human (Q5) is not particularly meaningful in the context of a specific person. However, sometimes it just helps to find other items that use a property, and listing other compounds found in taxon (P703) Pinus sylvestris (Q133128) is interesting.
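To give an idea of the kind of link this adds, here is a hypothetical sketch of how such a query URL could be built; the function and SPARQL shape are illustrative, not necessarily the script’s actual implementation.

```js
// With a value, list entities sharing the claim; without one,
// list entities merely using the property.
function queryUrl (property, value) {
  const sparql = value
    ? `SELECT ?item WHERE { ?item wdt:${property} wd:${value}. } LIMIT 500`
    : `SELECT ?item WHERE { ?item wdt:${property} []. } LIMIT 500`
  return 'https://query.wikidata.org/#' + encodeURIComponent(sparql)
}

// e.g. entities with "found in taxon" (P703) Pinus sylvestris (Q133128):
queryUrl('P703', 'Q133128')
```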
User:Lagewi/references.js
Sometimes, the data on Wikidata does not answer all your questions. Some types of information are difficult to encode in statements, or simply have not been encoded on Wikidata yet. In such cases, it might be useful to go through the references attached to the claims of the entity for additional information. To simplify this process, this user script lists all unique references, based on stated in (P248) and reference URL (P854). The references are listed in a collapsible list below the table of labels and descriptions, collapsed by default so as not to be obtrusive.
User:Lagewi/bibliography.js
Going a step further than the previous script, this user script appends a full bibliography at the bottom of the page. This uses the (relatively) new citation-js tool hosted on Toolforge (made using Citation.js). Every reference in the list of claims has a footnote link to its entry in the bibliography, and every entry in the bibliography has a list of back-references to the claims where the reference is used, labeled by the property number.

Citation.js: 2023 in review
This past year was relatively quiet for Citation.js, as changes to the more prominent components (BibTeX, RIS, Wikidata) start to slow down. I believe this is a good sign, and that it indicates the quality of the mappings is high. Nonetheless, following the reworks of BibTeX and RIS in the past couple of years, some room for improvement still came up.
Tytthaspis sedecimpunctata, observed May 9th, 2021, Sint-Oedenrode, The Netherlands.
Changes
- BibTeX: The mappings of the fields `howpublished`, `langid`, and `addendum` are improved. Plain BibTeX now allows `doi`, `url`, and more (see below).
- RIS: `PY` (publication year) is now always exported, and imported more resiliently.
- Wikidata: Names of institutions, and author names that differ from the person name, are now handled better.
- CSL: a single citation entry can now be exported as documented.
New plugins
- The new TeX alternative Typst came with its own YAML-based bibliographical format, Hayagriva, which can now be imported and exported in Citation.js with the new `@citation-js/plugin-hayagriva`.
New users
- I started working on a publicly available API running Citation.js on Wikimedia Toolforge. This API can be used to extract bibliographical data from Wikidata items, or to import bibliographical data into Wikidata. Available at https://citation-js.toolforge.org/.
- I found out that Codeberg uses Citation.js for the “Cite this repository” feature.
New Year’s Eve tradition
After the releases on New Year’s Eve of 2016, 2017, 2021, and 2022, this New Year’s Eve also brings the new `v0.7.5` release. It turns out that plain BibTeX has more fields than documented in the manual! At some point, `natbib` introduced the `plainnat` styles, which include `doi`, `eid`, `isbn`, `issn`, and `url`. These are now supported in `bibtex` export, as well as strict BibTeX import. (The default, non-strict BibTeX import is basically just BibLaTeX, for which these fields were already supported.) Thank you to @jheer and @kc9jud for bringing this up!
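As a quick sketch of the effect (the input item here is made up):

```js
const { Cite } = require('@citation-js/core')
require('@citation-js/plugin-bibtex')

const cite = new Cite({
  id: 'example2023',
  type: 'article-journal',
  title: 'An example article',
  DOI: '10.1000/example',
  URL: 'https://example.org/article',
  issued: { 'date-parts': [[2023]] }
})

// As of v0.7.5, the doi and url fields are kept in plain bibtex
// output, matching the plainnat styles.
console.log(cite.format('bibtex'))
```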
Happy New Year!

Finding shield bug nymphs on iNaturalist
While working on translating a key to the European shield bug nymphs (Puchkov, 1961), I thought I would look for pictures of the earlier life stages (nymphs, Fig. 1) of shield bugs (Pentatomoidea) on iNaturalist, and found that few observations actually had the life stage annotation. I do not have the exact numbers for Europe as a whole at that point in time, but Denmark currently has around 19.8% and the United Kingdom around 29.4% of observations annotated (GBIF.org, 2023).
Figure 1: Fourth instar nymph of Nezara viridula (Linnaeus, 1758). 2023.vi.22, Bad Bellingen, Germany.
So I set out to add those annotations myself instead, starting with the Netherlands, followed by the rest of the Benelux, Germany, and Ireland. Last Monday, I finished annotating the observations from France. These regions total about 80 000 observations, of which I annotated a bit more than 40 000 (again, I do not have the exact numbers from before I started).
Methods
I made these annotations with the iNaturalist Identify tool, which has plenty of keyboard shortcuts that I found only after using the mouse for 2000 observations. This allowed me to develop some muscle memory, and I ended up annotating a single page of 30 observations in around 60 seconds, so 2 seconds per observation. Most of that time was usually spent waiting for the images to load, and there were plenty of small glitches in the interface to further slow me down (including a memory leak requiring me to reload every 10-ish pages).
I was not able to annotate 715 of the verifiable observations (i.e. those with pictures, a location, and a time). In some cases, the pictures were simply not clear enough (or taken too closely) for me to determine the life stage with certainty. Another issue I had to work around was observations of multiple individuals at different life stages. Common were observations of egg clusters and just-hatched nymphs of Halyomorpha halys (Stål, 1855); the “parent bug” Elasmucha grisea (Linnaeus, 1758) doing parenting; kale plants infested with adults and nymphs of Eurydema; and adults of various species in the process of laying eggs. However, there were also many observations containing multiple pictures where one was of an adult and a second of a nymph, with no indication that it was the same individual at different times. There is currently no way to annotate multiple life stages on a single observation on iNaturalist, except through non-standard observation fields, which are a lot more laborious to use and can be disabled by users.
Results
Coloring the observations by life stage on a map clearly shows the effect of the work, with the aforementioned countries covered in red, and most of the rest of Europe in blue (Fig. 2). (There are two other notable red patches, in Abruzzo, Italy and in Granada, Spain. These are not my doing, and seem to be caused by two prolific observers annotating their own observations, respectively esant and aggranada.)
Figure 2: Map of research-grade iNaturalist observations of Pentatomoidea in Europe, colored by whether or not they have a life stage annotation.
These annotations mean additional data is available on the seasonality of these species. For example, looking at the four most observed species already reveals that Pentatoma rufipes (Linnaeus, 1758) overwinters as nymphs, whereas the other three species overwinter as adults (Fig. 3). The larger volume of data also means that more detailed analyses with more explanatory variables can be carried out, for example on the effect of climate change on the life cycle of invasive species like H. halys.
Figure 3: Seasonality of nymphs and adults of the four most observed shield bug species.
In addition, for less common species the classification of life stages makes it possible to find out more about the morphology of the earlier life stages of these species. This is useful for individuals working on keys (such as me), but perhaps also for computer vision models. Classifying the not-yet-identified observations of nymphs as such also allows for more targeted searches by identifiers, potentially leading to even more research-grade observations of rarer species.
It should be said, though, that even Chlorochroa pinicola (Mulsant & Rey, 1852), which is not particularly common in western Europe, still has many more validated pictures on Waarneming.nl than on iNaturalist. In fact, nearly half (43.2%) of all observations with images of Pentatomoidea in Europe are from the Netherlands. These are not all annotated with a life stage though, and the Observation.org platform (which Waarneming.nl is part of) seemingly only allows curators and observers to add life stage annotations to an observation.
Luckily, iNaturalist does allow this, and enables me to contribute hopefully valuable data to GBIF for further analysis, by myself or by others. I will continue adding annotations — I have now started on the observations from Switzerland, luckily a lot fewer than those from France. At the same time, I am maintaining the high rate of annotation in the countries I have already covered. In August, this means annotating about 200 observations per day (10–15 minutes), which is entirely doable. It does quickly start to add up if you are on holiday for a week, as you do in August, but that is still fewer observations than the entirety of France. Still, for this reason I hope other identifiers (or even better, observers) start annotating more as well.
References
- GBIF.org. (2023, August 11). Occurrence Download [Darwin Core Archive]. The Global Biodiversity Information Facility. https://doi.org/10.15468/DL.NXCTJW
- Puchkov, V. G. (1961). Shieldbugs. In Fauna of Ukraine (Vol. 21, Issue 1). Naukova Dumka.

Ingesting Structured Descriptive Data into Pandoc
Ever since I found out about Structured Descriptive Data (SDD), I keep coming back to it. Right now I am planning to make two identification keys: one translated and modernized from a 1961 publication for which I have not found a good alternative, and one completely de novo (well, nearly). In both cases, I started writing out the key in a word processor, with a numbered list and tabs and manual alignment, etc., but in both cases that experience has been pretty bad so far.
If the keys were interactive, multi-access keys there would be plenty of alternatives to word processors, including Xper3. However, these are both single-access, simple, dichotomous keys. There are CTAN packages to display such keys in LaTeX (biokey, dichokey), but those packages seem limited in several ways, and given how difficult LaTeX can be to parse, I would be locked into that choice. Normally I would encode the key data in some simple JSON format and generate HTML followed by a PDF, but at that point it makes as much sense to encode the data in SDD instead. One problem: how do I go from an SDD file to a PDF? Or HTML for a webpage, for that matter?
Pandoc
That is where Pandoc comes in. By creating a custom reader that transforms SDD files into a Pandoc AST, files can be generated as PDF, HTML, LaTeX, Word, ODT, and more. These custom readers allow for external dependencies, so a pure-Lua XML parser library like xml2lua is easily imported. The challenge then becomes twofold:
- ‘Implementing’ the SDD format itself, and
- Building a layout that looks good in the various different formats, like HTML and PDF (which is LaTeX-based in Pandoc).
Firstly, SDD is in many places under-specified, showing only some suggested ways of encoding data in some places, and no examples at all in others. This means that I personally need to decide how I want certain data to be encoded in SDD (for the Pandoc reader) when multiple ways are possible.
Secondly, where in HTML (+ CSS) I could use `float: right;` or `display: grid;` to make pretty, dichotomous keys, and in LaTeX (and for the PDF) the `longtblr` environment from `tabularray`, a single design that works for both is more difficult. This is especially the case if you want to include numbered back-references, as that eliminates the possibility of using simple numbered lists.
Well, it turns out there is a trick for that. The AST supports `RawInline`, with which a reader can insert markup specific to the output format: `pandoc.RawInline('html', '<span style="float: right;">')` for HTML, `pandoc.RawInline('tex', '\\hfill')` for LaTeX (and PDF), et cetera. It is still better to use regular Pandoc features as much as possible, to allow for more flexibility in selecting output formats, but I feel like progressive enhancements like these are definitely “okay” to apply.
SDD limitations
In spite of the freedom and under-specification that SDD gives in some places, the format (as documented) is limited in other places. Apart from not being able to incorporate related formats from TDWG, such as Darwin Core (other than through non-standard extensions), some features are missing:
- Endpoints of identification keys (`<Lead>`) can only have a `<TaxonName>` or a `<Subkey>`, not both, meaning that the many publications that present the key to genera separately from each genus’s subkey to species cannot be encoded in an optimal way.
- Endpoints of identification keys (`<Lead>`) are only shown to have one `<TaxonName>`, so keys leading to a species pair (or a differently-sized, non-taxonomic group of species) cannot be encoded. (At the same time, descriptions can be linked to multiple taxa with `<Scope>`.)
- Descriptions (`<NaturalLanguageData>`) and identification keys are not shown to allow markup. Confusingly, the term “markup” is used, but as far as I can tell only for annotating the text with concepts and publications, not for putting taxon names in italics or other styling.
Results
After quite some time I was finally content with the key that I was testing with (Fig. 1). The resulting custom reader is available at https://github.com/larsgw/pandoc-reader-sdd, and will probably see continued development as I try to use it for other keys. A major bottleneck is still getting the SDD in the first place, but I hope that becomes easier as I continue developing scripts and tooling. Overall though, the Pandoc reader has been a great success.
Figure 1: Pages 17 and 18 of the PDF, made with Pandoc from a Structured Descriptive Data XML file. Translated, original text and figures from Puchkov, 1961 (Fauna of Ukraine, vol. 21 (1), http://www.izan.kiev.ua/fau-ukr/vol21iss1.htm).
On the modelling of the content of identification keys
Last November I wrote a blog post about how to model the taxonomic coverage of identification keys. I wanted to model this coverage to be able to determine to what extent an identification key applies to a given observation or specimen, for use in my Library of Identification Resources project. For the same project I also find it useful to be able to archive identification keys. Many keys can be downloaded as PDFs or saved in the Internet Archive. However, some keys cannot be archived so easily:
- Matrix-type keys (also known as multi-access keys) are usually presented as interactive apps. Sometimes these work from the Wayback Machine, but sometimes they dynamically load additional web resources in which case they do not work. Although this is less common, it also occurs for single-access keys.
- Even non-interactive, dichotomous keys can be spread out over multiple web pages. For example, in B477 each step has its own page. This can be crawled normally (although it has not yet been, in this example), but the key still is not in one place.
- Some keys only exist in obscure print (or CD) sources or outdated or otherwise obscure file formats, where it may be useful to have a more standard format to archive or re-publish the data in (taking copyright into account of course).
In these cases it may be useful to archive the keys in a different format, especially if conserving the source is not feasible. In this blog post I will examine how I think I can best express these keys in more standard formats.
Adscita statices, observed June 28th, 2020
The Library of Identification Resources supports 7 resource types: `collection`, `checklist`, `gallery`, `reference`, `key`, `matrix`, and `supplement`. The first one, `collection`, is used for pages that link to a variety of different resources that are not all in the same series, and can be modelled with `schema:Collection` or something similar. `checklist` is also relatively simple, and can be modelled with a list of `dwc:Taxon` entities, which could be linked to a `schema:ItemList`.
`gallery`, `reference`, `key`, and `matrix` are a bit more difficult to model. `gallery` and `reference` are both supersets of checklists that have, respectively, just an image, or a description and optionally an image, a distribution range, and/or more info for each taxon; `key` is for single-access keys; and `matrix` for multi-access keys. Regarding `gallery`: sure, Schema.org has a `schema:ImageGallery` type, but that is specifically for websites, whereas `gallery` also includes booklets, flyers, and CD-based programs, and the `schema:Collection` type from before does not capture the cohesiveness in my opinion. `reference`, `key`, and `matrix` are even more difficult to capture accurately.
For these types, I would like to introduce Structured Descriptive Data (SDD)¹, a TDWG standard from 2005 that defines a format to encode descriptions and single-access and multi-access keys in XML files. `gallery`, `reference`, `key`, and `matrix` can all be expressed in SDD files. However, I think it is missing some features:
- It is not linked data, so it is difficult to relate the content of the key to the metadata.
- SDD has a `<Taxon>` element, but it would be great if SDD files were instead interoperable with the definitions from `dwc:Taxon`.
- For `reference` and `gallery` resources, it should be possible to specify the relation between an image and a taxon.
- For `matrix` keys, SDD has qualitative/discrete characters and quantitative/continuous characters. Characters relating to temporal, seasonal, and spatial distributions are common in matrix keys, but are difficult to express in SDD. I suggest temporal, seasonal, and spatial characters for both discrete and continuous characters: the former because old keys may, for example, have divided the continent into regions, while the user should be able to enter coordinates (so a regular discrete character does not suffice); the latter because newer keys may use the feature to include more detailed distribution ranges.
- For `matrix` keys it may be useful to assign probabilities to discrete character states per taxon.
Lastly, `supplement` remains. This is a difficult one, even when just listing the taxonomic coverage. The goal when modelling this would be to list the differences between the original source and the “fixed” version in a standard format. If all original sources (i.e., the above types) could be expressed as linked data, the standard format could be the Linked Data Patch Format. Of course, this would mean that unlike the other types, `supplement` would not be expressed in linked data, which is a bit unfortunate.
¹ At the time of writing, TDWG only hosts the SDD Primer in the twiki source format on GitHub. I have taken the liberty to parse the twiki source and produce HTML: https://larsgw.github.io/tdwg-wiki-archive/SDD/Primer/SddIntroduction.html
Citation.js: 2022 in review
Following up on the previous two updates this year (Version 0.5 and a 2022 update and Version 0.6: CSL v1.0.2), here are some updates from the second half of 2022, as well as some statistics.
Sapygina decemguttata, a small wasp, observed July 8th, 2021
Changes
- The mappings of the Wikidata plugin were updated, especially to accommodate software.
- The `default-locale` setting of some CSL styles is now respected.
New plugins
- `plugin-enw` for `.enw` files (similar, but different from Refer files).
- `plugin-wikipedia-templates` for outputting Citation Style 1 templates to use on pages of the English Wikipedia.
New users
- WikiPathways is in the process of moving to a new site which uses Citation.js to generate citations.
- Page, R. D. (2022). Wikidata and the bibliography of life. PeerJ, 10, e13712. https://doi.org/10.7717/peerj.13712
- Boer, D. (2021). A Novel Data Platform for Low-Temperature Plasma Physics. [Master’s thesis] Radboud University, Nijmegen, The Netherlands. URL
Statistics
`@citation-js/core` was downloaded approximately 240,000 times in 2022, against 205,000 times in 2021. The legacy package `citation-js` was downloaded approximately 160,000 times (compared to 190,000 times in 2021). This is good, as it indicates people are starting to use the new system more often. Note that most downloads of `citation-js` also lead to a download of `@citation-js/core`, so most people still use the legacy package. The shift away from `citation-js` seems to have started in October 2021, which coincides with the time I forgot to update that package for 2 months after releasing v0.5.2 of `@citation-js/core`.

Since February 2022, a relative decrease in downloads of `@citation-js/plugin-wikidata` compared to `core` is also visible. In December, the Wikidata plugin was downloaded more than 50% fewer times, in fact. This is also good, and the exact point of the modularisation introduced back in 2018: to let users choose which formats to include. The DOI plugin was similarly less popular, while the BibTeX and CSL plugins were — as expected — almost always included. Surprisingly, BibJSON, a very niche format, only had a 25% reduction (maybe due to confusion with BibTeX and the internal BibTeX JSON format), while RIS, after BibTeX the most common format, had a 50% reduction as well.
New Year’s Eve tradition
After the releases on New Year’s Eve of 2016, 2017, and 2021, this New Year’s Eve also brings the new `v0.6.5` release, which changes the priority of some RIS fields in very specific situations, most notably the date field of conference papers (now using `DA` instead of `C2`).
Happy New Year!
Explore identification keys on the world map
The website of the Library of Identification Resources has a new feature: a map view. The resources in the catalog are associated with a geographic scope to approximate which species in a taxonomic group can be identified. These geographic scopes are usually countries or continents but can also be subdivisions within countries, multinational ecological regions like mountain ranges, and biogeographical realms.
The scopes are also linked to Wikidata identifiers, and the Wikidata identifiers are linked to iNaturalist identifiers. This makes it possible to get GeoJSON for these places from the iNaturalist API. The places can then be displayed with the Leaflet library, and shaded according to the number of resources associated with them.
World map with countries, continents and other regions shaded according to the number of associated resources.
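The gist of that pipeline can be sketched in a few lines of JavaScript. This is a simplified sketch, assuming Leaflet is loaded on the page and using an assumed place ID; it is not the site’s actual code.

```js
// Fetch the GeoJSON of a place from the iNaturalist API and draw it on
// a Leaflet map, shaded by the number of associated resources.
const map = L.map('map').setView([52, 5], 4)

function addPlace (placeId, resourceCount) {
  return fetch(`https://api.inaturalist.org/v1/places/${placeId}`)
    .then(response => response.json())
    .then(({ results: [place] }) => {
      L.geoJSON(place.geometry_geojson, {
        style: { fillOpacity: Math.min(resourceCount / 100, 1) }
      }).addTo(map)
    })
}

addPlace(7506, 42) // 7506: assumed iNaturalist place ID for the Netherlands
```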
There are a number of places that are not displayed:
- The biogeographical realms are not all in iNaturalist (e.g. Indomalayan realm).
- Some places have ambiguous definitions (e.g. the Middle East), making it difficult to decide whether to link their Wikidata & iNaturalist entities if different definitions are used.
- Other places are historical and are difficult to express in modern borders (e.g. Japanese Empire in ca. 1930).
- A number of resources have scopes that have never been expressed in political borders (e.g. “the coast of Italy”).
All in all 347 of the current 1286 resources are not displayed, though it should be noted that 72 of those have no scope (e.g. keys for all globally known species).
Either way, the map clearly shows the geographical bias: Western Europe, especially the Netherlands, is overrepresented. This is because I have been mostly adding Dutch resources and various series from the UK (RES Handbooks), France (Faune de France), Denmark (Danmarks Fauna), Sweden (online keys and Nationalnyckeln), and Norway (Norske Insekttabeller).
The map is available from the home page and at https://identification-resources.github.io/catalog/visualizations/map.

On the modelling and application of the taxonomic coverage of identification keys
The main feature of the Library of Identification Resources is the description of the identification key (or matrix, reference, etc.). At its most basic, this description should specify when the key can or should be used. I initially split this description into the taxonomic coverage and the ‘scope’. The latter includes life stage and sex, but also some restrictions on the taxonomic coverage that are more difficult to characterize, like “species that cause galls on plants in the Rosa genus” or “species that live in aquatic environments”.
Zygaena filipendulae (Linnaeus, 1758)
Simple solution
For the taxonomic coverage sensu stricto, I started with the parent taxon, e.g. the family or genus. However, many keys do not treat all the species in a family, and are instead limited by a geographic scope. This geographic scope should clearly also be included. Then there are more casual keys that can be very useful but may not be complete even at the time of publication, either excluding some rare species or only including some common species. This can be detailed (to some extent) with an incomplete/complete switch. Finally, although many keys are for species, there are some keys primarily for identifying genera, families, or other ranks. Below are some example works where these aspects do or do not apply.
Title | Parent taxon | Geographic scope | Incomplete | Target rank
---|---|---|---|---
B460: A revision of the world Embolemidae (Hymenoptera Chrysidoidea) | Embolemidae | — | — | —
B1: Identification Key to the European Species of the Bee Genus Nomada Scopoli, 1770 (Hymenoptera: Apidae), Including 23 New Species | Nomada Scopoli, 1770 | Europe | — | —
B81: Key to some European species of Xylomyidae | Xylomyidae | Europe | Yes | —
B63: MOSCHweb - Interactive key to the genera of the Palaearctic Tachinidae (Insecta: Diptera) | Tachinidae | Palaearctic realm | — | genus
If only it were that simple
It became clear quite quickly that this is not enough. For one, the parent taxon, geographic scope, and target rank should be able to contain multiple values. Additionally, as we saw before, some taxonomic coverages, like “gallers on Rosa sp.”, cannot be captured with these parameters unless the “Parent taxon” list gets very long and detailed.
Another, more common problem is that even the combination of a parent taxon and a geographic scope is hardly specific enough to be able to say whether an identification key is reliable and complete for a situation. Species are discovered, migrate, emigrate, and become extinct. Taxa are moved to different genera, families, and orders. In B659: Orthoptères et Dermaptères (Faune de France 3) from 1922, the order Orthoptera also contains Dictyoptera, currently a superorder consisting of cockroaches, stick insects, praying mantises, and more. This is a big problem too. Key questions that should be answerable are “How well does a British key apply to a Dutch observation?” and “How well does a British key from 1950 apply to a British observation from 2020?”
To account for the changes within higher-level taxa you might want to make a list of species that are included in the keys. For B81 for example, that could look like this:
https://purl.org/identification-resources/resource/B81:1
This gives a very clear image of the taxonomic coverage of the key: it includes three species, Solva marginata, S. varia and Xylomya maculata. The inclusion of each species in this key has a permalink, and each taxon is linked to a GBIF identifier. The list of species can then be matched (especially with the identifiers) to current checklists for the region that the observation was made in. More on that later.
This solution can easily be used for more complex taxonomic coverages, like the aforementioned key for species that cause galls on Rosa species. The ‘species’ list can also be a taxon list and have e.g. genera as the lowest rank. Another advantage is that there is no longer a need to explicitly specify the geographic scope, whether or not the key was known to be incomplete at the time of writing, and what the target taxon rank of the key is. In addition, this specific implementation also allows for multiple keys per work, which has various uses.
Two purposes
One problem that still comes up is that these taxon lists have two purposes:
- They describe which taxa can be distinguished with the key.
- They describe which taxa are considered when making the key.
These two are at odds. In a key to genera it would be simple to make a list of these genera, fulfilling (1) but not (2): if a species in one of those genera gets moved to a different one, or if an additional species appears in one of the existing genera, the key ‘breaks’. The latter could be fixed by just listing all species, but at that point you might as well fix both problems with (2) by grouping the species by genus. To simultaneously fulfill purposes (1) and (2), it is necessary to divide the taxonomic tree into units that can be distinguished by the key.
Even when fulfilling purpose (1) it is still possible to partially accomplish (2) in keys for e.g. families that key out to species. If, when making the species list, higher ranks such as genera are included, the key to genera can be validated according to (2) without having to go into more detail than the key. Therefore it is important to capture the entire taxonomic tree as presented in the work.
Another thing that might be useful to record is synonyms. Many of the more academic works publish, along with the key, a series of descriptions including synonyms. If the status of such a synonymic taxon is now different from its status in the key, this may be of importance to the results of the identification. It comes down to the same split of purposes: (1) no distinction is made in the key between the two taxa, but (2) both taxa are, in theory, considered. Either way it is important to list both in some way.
If only it were that simple
At this point it became clear which taxa are necessary to include in such a list. But how to describe the taxa in those lists? This is an (enormous) rabbit hole in and of itself, and if I had more experience I could write a Falsehoods Programmers Believe About Names-style post about it. It starts with the normal parts: every taxon has a name, an author, a year, and a rank. That rank can be kingdom, phylum, class, order, family, genus, or species. But there is a seemingly endless list of increasingly obscure ranks, sometimes only in use by one or a few authors (what is a “stirps”??); taxa can have multiple authors, with different ways of presenting this (`et`/`and`/`&`, `et al.`, or listing everyone); the names themselves can be spelled in different ways and include or omit initials; the year can be very unclear, especially for taxa published in older works published over multiple years; and the scientific names themselves can have different spellings, capitalizations, and hyphenations as well.
As with personal names, the ‘trick’ is to trust the source to some extent, and avoid focussing on the specifics of the scientific name. After all, there is no reason to make the computer understand the scientific names. The goal is to allow for matching (and there are ways to do that without trying to standardize everything) and, secondarily, to present the names in such a way that humans can understand them (and the original key already did that).
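For instance, one way to match without standardizing everything yourself is to delegate to the GBIF backbone taxonomy. A minimal sketch using the public species-match endpoint; this is an illustration, not necessarily how the Library does it:

```js
// Match a name string against the GBIF backbone taxonomy.
async function matchTaxon (name) {
  const url = 'https://api.gbif.org/v1/species/match?name=' + encodeURIComponent(name)
  const match = await (await fetch(url)).json()
  // matchType is EXACT, FUZZY, HIGHERRANK, or NONE
  return match.matchType === 'NONE' ? null : match.usageKey
}

// e.g. matchTaxon('Solva marginata').then(console.log)
```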
Measuring applicability (more problems)
Now, with a complete and structured taxonomic tree, the question becomes how to check whether it applies to a certain observation. The idea is to compare the taxon list to a list of all the species that a certain observation could be. But where does such a list come from? The simple solution would be a checklist: a professional, thorough list of all the species that are known to occur in a certain area.
If such a checklist is not (easily) available (I am not aware of a global database of checklists, let alone one in a standardised format), it might be tempting to create one from a map of observations from e.g. GBIF. The advantage of this is that you have, to some extent, global coverage and a standardised format. However, this has some caveats. The correctness of the resulting checklist is entirely dependent on the quality and quantity of the observations in the region. To get higher sensitivity it may be useful to instead make a checklist of a larger, encompassing region, but this lowers the specificity. A species occurring in Belgium might only have recorded observations in the Netherlands on GBIF. The same goes for the time scale: there might be museum specimens of species that are now extinct in the region, but there might also be rare species that are only seen every 50 years or so.
A good addition (or a risky alternative) to a checklist would be a measure of the (relative) abundance of species in the region, weighing the taxon lists of the keys accordingly. This prioritises keys that include more common species over keys that include more rare species. Of course, a key of common species can still be wrong about an observation of a rare species. Another problem is how to determine the (relative) abundance. Again it is tempting to derive this from GBIF observations, but a species with a lot of observations is not necessarily more common: it might be the focus of an observation campaign, or of national attention of a different kind; it might sit still more often; it might be easier to recognize; or it might even be easier to identify to species level.
An additional possibility of measuring applicability is bringing the scope restrictions mentioned at the start into this. Apart from belonging to a certain taxon, the observed organism also has a biological sex, a life stage, and more characteristics that may restrict which keys can be used. To compare these characteristics between keys and organisms they need to be described with a common, consistent vocabulary. iNaturalist has such a vocabulary (the “Annotations”), but there might be more suitable ones out there somewhere.
(For this it might also be useful to map keys that distinguish castes of ants, males/females of solitary bees, or life stages of shield bugs. How do you model that? Not in the same way as described above, that is for sure.)
Using the data
As I teased in the previous blog post about the Library of Identification Resources, I have started to collect this data, and have attempted to recruit some others to contribute as well. All works which have their keys (and matrices, checklists, descriptions) mapped out can be found by searching for “TRUE” in the “Tax. data extr.” field. To get from the work page to a taxon list, look for the row titled “Resources” in the first table on the page. If available, it lists the individual resources in the work for which the taxonomic data is extracted.
The pages for the taxonomic data of the individual resources contain some basic information about the work (as well as a link), and the page numbers of the resource within the work, if available. There is also some info on the resource, derived from the same info in the work unless it differs for that resource. Then there is a list of taxa, displayed as a tree. Each taxon has a rank, an anchor, and, if available, a link to the corresponding taxon in GBIF.
The data is also used in the proof-of-concept app that I introduced in the previous blog post. Searching for a taxon and coordinates will now query GBIF for observations of that taxon in the country encompassing the coordinates, and match these with the GBIF identifiers in the taxon lists. It displays the relative number of taxa in the ‘checklist’ that are also in the key, as a percentage and a small pie chart. It does not yet deal with synonyms.
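A rough sketch of that matching step in JavaScript; the endpoint and parameters follow the public GBIF occurrence API, but the actual app may well work differently:

```js
// Build a 'checklist' of GBIF species keys from occurrences of a parent
// taxon in a country, then compute which fraction is covered by a key.
async function coverage (keyTaxonKeys, parentTaxonKey, countryCode) {
  const url = 'https://api.gbif.org/v1/occurrence/search' +
    `?taxonKey=${parentTaxonKey}&country=${countryCode}` +
    '&facet=speciesKey&facetLimit=1000&limit=0'
  const { facets } = await (await fetch(url)).json()
  const checklist = facets[0].counts.map(count => Number(count.name))

  const covered = checklist.filter(key => keyTaxonKeys.includes(key))
  return covered.length / checklist.length
}
```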

Library of Identification Resources
Since around this time last year, I have been working on creating a library of identification resources. Here, “identification resources” are identification keys, multi-access (matrix) keys, and other works that can aid in the identification of species. The project is managed on GitHub: https://github.com/identification-resources.
The logo: Zygaena filipendulae (Linnaeus, 1758)
About the project
Right now the project mainly consists of a catalog, a list of works (or rather “manifestations” in the FRBR model) that contain identification resources. The catalog contains bibliographic information, but also a summary of the identification resource(s) that they contain.
This summary consists of the starting taxon or taxa (e.g. Chrysididae), a region that the resource applies to (e.g. Europe), an optional scope (e.g. adults, males and females), the taxonomic rank(s) that the resource distinguishes (usually species), and whether the key was supposed to be complete for the region-scope combination at the time of publication.
This data is similar to that provided by BioInfo UK (example). They actually go into more detail, with more nuance in completeness, the equipment required for using the key (e.g. stereo microscope), whether and what specimen preparation is required, and expert reviews. The Library of Identification Resources does not include this information, but has a broader geographical scope¹ and can be searched in a more structured way.
The places (for the geographical scope of keys) and the authors and publishers linked to works as well as the works themselves are linked to Wikidata entities wherever possible. Journals and book series are linked to ISSNs.
Using the data
On top of this catalog, I built a website: https://identification-resources.github.io/. This provides a user interface, allowing for structured searches through the catalog, linking multiple versions (FRBR “manifestation”) of the same work (FRBR “work”), a few graphs exploring the distribution of languages, licenses, etc., and some additional functionality like exporting citations.
To improve searching, I am also working on an app that lists the most relevant resources for a given situation. In the proof-of-concept (GitHub) users can enter a parent taxon and a coordinate location and get (possibly) applicable resources for that situation. This uses the GBIF API for taxon searches, and the iNaturalist API for looking up places around coordinates.
Results are sorted according to a few heuristics, including:
- The availability of full text.
- How close the key taxon is: a key for Vespidae is likely better than one for Insecta when identifying Vespa.
- How recent the resource is.
- Whether the level of detail indicates that the key can actually improve on the parent taxon.
- Whether the resource was considered complete during publication.
- Whether the resource is a (matrix) key, just a reference or a photo gallery, or even just a collection of other resources (which are not guaranteed to contain anything relevant, and if they did, those resources would likely show up in the results already).
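As a hypothetical illustration of how such heuristics could combine into a sort order; the field names and weights here are made up, not taken from the app:

```js
// Score a resource for a given query; higher is more relevant.
function score (resource, query) {
  let total = 0
  if (resource.fulltextAvailable) total += 2
  total += 1 / (1 + resource.taxonDistance)            // closeness of the key taxon
  if (resource.year) total += (resource.year - 1900) / 200 // recency
  if (resource.targetRank !== query.rank) total += 1   // actually adds detail
  if (resource.complete) total += 1
  if (resource.type === 'key' || resource.type === 'matrix') total += 1
  return total
}

// results.sort((a, b) => score(b, query) - score(a, query))
```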
The data is currently biased towards the Netherlands and the rest of western Europe, as well as towards insects and mainly Hymenoptera. This is because I have mainly worked on adding resources that I was using, or resources referenced therein.
The proof-of-concept app in action.
Future plans
Improving the interface
I want to improve the interface of the proof-of-concept app and I have a number of ideas for this:
- Allow users to “upload” photos. These would not be uploaded to a server but rather just displayed locally, and could serve a couple of purposes:
- Automatic extraction of coordinates from EXIF data
- Automatic determination of a parent taxon using computer vision [needed: a computer vision model :)]
- Viewing the key side-by-side with the photos.
- Same, but by entering a GBIF/iNaturalist observation ID.
Improving the data model
One of the main improvements that I want to make is the modelling of individual keys/resources within works, for two reasons:
- Works can contain multiple resources with completely different properties. If a work contains a key from Family A to the genera B and C, a key to the species of Genus B, and a key to the species of Genus C, those can be modelled as a single key to the species of Family A (assuming B and C are the only relevant genera). But this is not always the case. In addition, a work can contain resources of different types, like a key and a checklist (B159).
- Listing the taxa in the resource. I am aware of a few problems with this, but a list of species in a key can be matched to a modern checklist. This gives a better idea of how well a British key could apply to a Dutch observation (or an older British key to a more recent British observation, with all the species migration going on).
Example of what this would look like.
Contributing data
You are welcome to contribute data to the catalog. I plan to make this easier in the future, as currently the master copy of the data is still in the non-version-controlled spreadsheet that I started the work in.
¹ Note that the current data has a geographical bias, but this is not the same as a strict scope in my opinion.
Citation.js Version 0.6: CSL v1.0.2
Since the `citation-js` npm package was first published, version 0.6 is the first major version of Citation.js that did not start out as a pre-release. Version 0.3 itself spent almost 6 months in pre-release, but only received updates for less than half a month. Version 0.4 spent more than a year in pre-release and received updates for about 4 months. Version 0.5 takes the cake with one and a half years in pre-release, receiving updates for a year, also making it the best-maintained version.
Tussilago farfara, March 27th, 2022
Version 0.6 is a major version bump because it introduces a number of breaking changes, including raising the minimal Node.js version to 14. Since April 2022, Node.js 12 has been End-of-Life, which led a lot of dependencies to drop support. Now, Citation.js does so too. Other changes include the following:
Update data format to CSL v1.0.2
The internal data format is now updated from CSL v1.0.1 to v1.0.2. This introduces the `software` type and the generic `document` type, as well as some other types and some new fields. The `event` field is also renamed to `event-title`. That, and `software` replacing `book`, means CSL v1.0.2 is not compatible with CSL v1.0.1 styles, making this a breaking change.
- CSL data is now automatically upgraded to v1.0.2 on input.
- `Cite#data` (`(new Cite()).data`) now contains CSL v1.0.2 data.
- Output formatters of plugins now receive CSL v1.0.2 data as input.
- `util` (`import { util } from '@citation-js/core'`) now has two functions, `downgradeCsl` and `upgradeCsl`, to convert data between the two versions.
- The `data` formatter (`.format('data')`) now takes a `version` option. When set to `'1.0.1'`, this downgrades the CSL data before outputting.
- `@citation-js/plugin-csl` already automatically downgrades CSL to v1.0.1 for compatibility with the style files.
- Custom fields are now generally put in the `custom` object, instead of prefixing an underscore to the field name.
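For example (assuming, as a sketch, that both helpers take and return a single CSL item):

```js
const { util } = require('@citation-js/core')

// 'event' was renamed to 'event-title' in CSL v1.0.2
const modern = util.upgradeCsl({ type: 'paper-conference', event: 'ExampleConf 2022' })
// modern['event-title'] === 'ExampleConf 2022'

const legacy = util.downgradeCsl(modern)
// legacy.event === 'ExampleConf 2022'
```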
The mappings are also updated. Especially the RIS and BibLaTeX mappings were made more complete by the increased capabilities of the CSL data schema. Non-core plugins are also being updated, mainly affecting `@citation-js/plugin-software-formats` and `@citation-js/plugin-zotero-translation-server`.
Test coverage
While updating the plugin mappings, the test suites of the plugins were also expanded. This led to the identification of a number of bugs, which were also fixed in this release:
- BibLaTeX
  - handling of CSL entries without a `type`
  - handling of `bookpagination`
  - handling of `masterthesis`
- RIS
  - RegExp pattern for ISSNs
  - name parsing of single-component names
Closing issues
A number of issues were also fixed in this release:
- Adding full support for the Bib(La)TeX `crossref` field
- Mapping BibLaTeX `eid` to `number` instead of `page`
- Adding a mapping for the custom BibLaTeX `s2id` field
- In Wikidata, getting issue/volume/page/publication date info from qualifiers as well as top-level properties
CSL styles
The bundled styles (`apa`, `vancouver`, and `harvard1`) were updated. Note that `harvard1` is now an alias for `harvard-cite-them-right`. Quoting the documentation:
The original “harvard1.csl” independent CSL style was not based on any style guide, but it nevertheless remained popular because it was included by default in Mendeley Desktop. We have now taken one step to remove this style from the central CSL repository by turning it into a dependent style that uses the Harvard author-date format specified by the popular “Cite Them Right” guide. This dependent style will likely be removed from the CSL repository entirely at some point in the future.
http://www.zotero.org/styles/harvard1, CC-BY-SA-3.0
Looking forward
Some breaking changes are still pending, mainly changes to the plugin API and the removal of some left-over APIs. However, I also want to work on a more comprehensive format for machine-readable mappings, a format for mappings for linked data, and of course implementing more mappings in general!
A story about a university login with a broken security configuration, and a mildly uncooperative help desk
Last semester I followed some courses at a different university, and went through the process of collecting login credentials and multi-factor authentication tokens and familiarizing myself with a network of university systems all over again. Most (but not all) of those systems use the main single sign-on login process of the university, at https://login.universityfoo.nl.
| Note |
| --- |
| The university has two main domains; let's call them `universityfoo.nl` and `universitybar.nl`. |
One of those systems is Brightspace, used by course coordinators to communicate course information, syllabi, and additional documents to students. Very important for someone new to the university, especially someone who did not go through the normal process of introduction weeks and tutorials. But when I logged in at https://login.universityfoo.nl, I was met with a blank screen. Other systems worked fine, including those to set up my email, but Brightspace did not.
Naturally, I opened the trusty Chrome DevTools and saw the following error:
Refused to navigate to 'https://brightspace.universitybar.nl/d2l/lp/auth/login/samlLogin.d2l' because it violates the following Content Security Policy directive: "navigate-to 'self' https://*.universityfoo.nl:443 https://*.services.universitybar.nl:443".
That was already pretty clear: one of the Content Security Policy directives was simply blocking any navigation to any domains other than a short list of exceptions, which did not include the domain that Brightspace was on. But that seems like a major problem, one that would have been caught already unless I had some incredibly (un)lucky timing.
It turned out, however, that the `navigate-to` directive specifically was not yet supported at all, in any browser, at least according to MDN. Yet in the Chromium code, the following could be found:
// Content counterpart of ExperimentalContentSecurityPolicyFeatures in
// third_party/blink/renderer/platform/runtime_enabled_features.json5. Enables
// experimental Content Security Policy features ('navigate-to' and
// 'prefetch-src').
public static final String EXPERIMENTAL_CONTENT_SECURITY_POLICY_FEATURES = "ExperimentalContentSecurityPolicyFeatures";
Turns out I had the `#enable-experimental-web-platform-features` flag enabled, for some reason, and that flag probably included the `EXPERIMENTAL_CONTENT_SECURITY_POLICY_FEATURES`. I probably enabled the flag for development at some point? I do not even remember. But that meant the `navigate-to` directive was just wrong.
Since I did not want to disable the flag (and was not sure whether it would help), I instead turned to ModHeader: a Chrome extension to modify requests and responses in the browser. I mainly use it to view DOI content negotiation requests in the browser instead of using cURL. With that, I could modify the `navigate-to` part of the `Content-Security-Policy` header to the following (line breaks and `[...]` mine):
Content-Security-Policy: [...] navigate-to 'self'
https://*.universityfoo.nl:443
https://*.services.universitybar.nl:443
https://brightspace.universitybar.nl:443; [...]
This finally allowed me to log in to Brightspace.
Naturally, I wanted to share my findings, especially since whenever `navigate-to` gets support without experimental flags, the Brightspace login breaks for everyone. So I went to the online university helpdesk. There, I was also met with a blank page. Imagine that: suddenly logging in to Brightspace does not work anymore, and all the students going to the digital helpdesk are met with a blank page as well. Students panicking, the IT department (maybe) panicking because they were not doing any upgrades or maintenance or anything. Good thing I got a sneak preview of the problem, so I could warn them. First, bypassing `navigate-to` for the helpdesk as well:
Content-Security-Policy: [...] navigate-to 'self'
https://*.universityfoo.nl:443
https://*.services.universitybar.nl:443
https://brightspace.universitybar.nl:443
https://helpdesk.universitybar.nl:443; [...]
However, when I sent a message detailing the problem, I was met with "can you try clearing your cache?" I did, even though I knew that was not the problem, and it did not help. I did know what would help, but they clearly did not care, since I could still reproduce the problem while writing this blog post almost 9 months later. When I confirmed that clearing the cache did not help, I was asked to disable `#enable-experimental-web-platform-features`. Which, sure, but that was not really the point. I guess they will find out in time anyway, but I was still a bit disappointed.
Citation.js Version 0.5 and a 2022 update
Version 0.5.0
Version 0.5.0 of Citation.js was released on April 1st, 2021.
BibTeX and BibLaTeX
After the update to the Bib(La)TeX file parser, described in the earlier BibTeX Rework: Syntax Update blog post, the mapping of BibTeX and BibLaTeX data to CSL-JSON was also updated. The mapping is now split in two: one for BibLaTeX (which is backwards-compatible with BibTeX) and one for BibTeX. The output formats were also updated to output either BibTeX-compatible files or BibLaTeX-compatible files. The most common difference there is the use of `year` and `month` versus `date`, respectively. In addition, a number of updates were made to the file parser.
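That difference shows up when serializing the same item to both formats. A minimal sketch, assuming `@citation-js/plugin-bibtex` is installed and registers the `bibtex` and `biblatex` output formats:

```js
const { Cite } = require('@citation-js/core')
require('@citation-js/plugin-bibtex')

const cite = new Cite({
  id: 'example',
  type: 'article-journal',
  title: 'An example article',
  issued: { 'date-parts': [[2021, 4]] }
})

// BibTeX-compatible output should use the separate year/month fields...
console.log(cite.format('bibtex'))

// ...while BibLaTeX-compatible output should use a single date field
console.log(cite.format('biblatex'))
```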
Core changes
In the `Grammar` utility class, bugs were fixed and behavior was updated to better account for the Bib(La)TeX parser. Some of the code for correcting CSL-JSON was also updated, including moving the code that corrects results of the Crossref API from the DOI plugin to the core module, as CSL-JSON from that API may end up in Citation.js through methods other than the DOI plugin. Earlier in 0.5 development, some of the HTTP handling code was also updated for increased stability.
2022 update
v0.5.1–v0.5.7
The versions released since v0.5.0 mostly contain bug fixes and small enhancements. The latter include some more descriptive errors in certain places, as well as mappings for some non-standard fields in Bib(La)TeX and RIS.
New site design
The design of the Citation.js site was updated for the first time since 2018. The changes were detailed in the recent Citation.js: New site blog post.
New plugins
New plugins for the refer file format (`plugin-refer`) and the RefWorks tagged format (`plugin-refworks`) were released.
More coming
More changes are expected, including more long-awaited output formats, better mappings for software and datasets, and more work on machine-readable mappings.
Citation.js: New site
I recently updated the website of Citation.js. This involved getting rid of the Material Design Lite framework, simplifying and refreshing the site design, and modernizing some of the code behind it. Additionally, I updated the content of the homepage and added some functionality to the interface of the blog page and the demo.
Homepage
The old layout of the homepage had a dark grey background, with a grid of four cards in the middle containing the main content of the site, and the Citation.js banner-variant logo between the top and bottom rows. The grid of cards had a background of syntax-highlighted source code: actually the start of the Citation.js v2 code, which at that point still consisted of a single file. At the very top of the page was a yellow header, and at the bottom a thin black footer.
The new layout incorporates a lot of the design elements of the first design, but in a way that hopefully improves the readability and feel of the page. The yellow header remains, but the links are centered instead of right-aligned. The footer is full-width (though the text is still centered) and has a larger font size and more vertical padding. The grid is gone; instead, the top of the page has a full-width background of code with the banner logo and some introductory text. The other content is now aligned in a single row, and the cards are replaced with plain text, although the headers still have white text with a slight shadow on a dark background.
Blog
The blog page had the same header, footer, and dark grey background in the old layout, with individual blog posts as cards and the introductory text and search bar as a slightly wider card.
The new layout mirrors the changes to the homepage, especially the white background and full-width code background and the changing of cards to plain text. To the right of the blog content is now a sidebar listing the blog posts per year, which moves to the bottom of the page on narrow screens. Below the search bar is now a clickable list of tags.
Demo page
The design of the demo page had not been updated since I made it in April 2016; it was more or less plain text, but with paragraphs limited in width and centered.
The new design adds the header, footer, and code background from the homepage as well as some styles for the headers. The interface of the demo is simplified at the cost of easy-to-read code. That also means that the live view of the code is removed.
API documentation
The styles of the home page now also apply to the API documentation.
Re-implementing the upload of images for the LaTeX→HTML converter
The CDLI is developing a new website. That website's admin interface for its journals contains a page where a LaTeX source file, following a specific template, is converted to an HTML page. For this, apart from the LaTeX file itself, two additional components are needed: a BibTeX file, containing metadata of the references, and image files.
Current implementation
The current implementation involves uploading images separately from the form that creates the article. This can lead to "orphan" images if such a form is abandoned after uploading the images. The current implementation has an additional problem, where files are saved in a single directory, so multiple images with the same file name (say, `Figure1.jpg`) will overwrite previous images.
The current flow, roughly (client → server):

1. `GET /admin/articles/add/cdlj` → 200 (add page)
2. `POST /admin/articles/convert-latex` → 200 (converted HTML, list of images)
3. `POST /admin/articles/image` with `Figure1.jpg` → the server saves the image as `Figure1.jpg` → 200
4. `POST /admin/articles/add/cdlj` → 300 `/articles/<sequential ID>`
New implementation
With a new implementation of the rest of the forms, simplifying a lot of the code, an attempt is also made to resolve the issues with the image uploads.
Saving images together with the metadata
To avoid the problem of "orphan" images resulting from abandoned forms, one could submit the images in the same form that creates the article. If that form is not submitted, or contains invalid data and so cannot be saved, the images are never uploaded.
Saving images according to metadata
The second problem could be solved immediately by saving the images in subdirectories according to the metadata of the article, e.g. `2021-01/`, where `2021` is the year the article is published and `01` is the article sequence number within that year. This, however, assumes that that metadata does not change after the initial submission.
Both of these solutions, however, create constraints on other parts of the problem, because the images can only be saved after the user submits the main form, which is after the HTML containing `<img>` elements with references to the image locations is generated. Somehow, those image locations should be able to identify the article before any information about that article is known:
1. `GET /admin/articles/add/cdlj` → 200 (add page)
2. `POST /admin/articles/convert-latex` → at this point the HTML should already contain links to the permanent locations of the images → 200 (converted HTML, list of images)
3. `POST /admin/articles/add/cdlj` → at this point the images are saved in a permanent location → 300 `/articles/<sequential ID>`
So, what is the solution? I propose the following:
1. `GET /admin/articles/add/cdlj` → the server generates a random article ID → 200 (add page with embedded article ID)
2. `POST /admin/articles/convert-latex` → the server generates image URLs according to the random article ID → 200 (converted HTML, list of images)
3. `POST /admin/articles/add/cdlj` → the server verifies the random article ID, saves the images, and generates the sequential ID → 300 `/articles/<sequential ID>`
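In other words, something like the following. This is a purely hypothetical sketch in Express; the actual CDLI framework works differently, and the route handlers and helper names below are made up for illustration.

```js
// Hypothetical sketch of the proposed flow; not the actual implementation
const crypto = require('crypto')
const express = require('express')

const app = express()
app.use(express.json())

// 1. The add page embeds a freshly generated random article ID
app.get('/admin/articles/add/cdlj', (req, res) => {
  const articleId = crypto.randomUUID()
  res.send(`<form data-article-id="${articleId}">...</form>`)
})

// 2. Image URLs are derived from the random ID, so the converted HTML
//    can reference permanent locations before the article (and its
//    sequential ID) exists
app.post('/admin/articles/convert-latex', (req, res) => {
  const { articleId, images } = req.body
  const imageUrls = images.map(name => `/images/${articleId}/${name}`)
  res.json({ imageUrls })
})

// 3. On submission, the server verifies the random ID, stores the images
//    under it, and only then assigns the sequential article ID
app.post('/admin/articles/add/cdlj', (req, res) => {
  // verification and saving omitted; redirect to the new article
  res.redirect(303, '/articles/42')
})

app.listen(3000)
```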

CDLI catalogue growth over time
Since Google Summer of Code 2020 I have been contributing code to the new framework of the Cuneiform Digital Library Initiative (CDLI), a digital repository for metadata, transliterations, images, and other data of cuneiform inscriptions, as well as tools to work with that data.
One of the features of the new CDLI framework is improved editing of artifact metadata, as well as of inscriptions. Artifacts will be editable in several ways: by uploading metadata in CSV format, by batch edits in the website interface, and by editing individual artifacts. At the basis of all those pathways is the database representation of individual revisions. To evaluate the planned representation, and to see if alternatives are possible, I took a look at the catalogue metadata (CDLI 2021) and at what edits are currently being made.
Artifact registrations
Figure 1: The number of artifacts registered each year, split by the collection they are in. Collections with fewer than 7,000 artifacts represented are combined into the "Other" category to keep the legend manageable. 13K artifacts without valid creation dates were excluded.
First, I took a quick look at the composition of the catalogue (Fig. 1). As it turns out, some collections had most of their (included) artifacts added in a single year, such as the University of Pennsylvania Museum of Archaeology and Anthropology in 2005 and the Anadolu Medeniyetleri Müzesi in 2012. Other collections have seen a steadier flow of artifact entries, particularly the British Museum. Overall though, this does not help much with selecting a database representation of revisions.
One of the options we want to evaluate is storing revisions in a linked format, similar to how artifacts are stored now, instead of in a flat format. This means that if each of those artifacts has about 10 links in total (say 1 material, 1 genre, 1 holding collection, 1 language, 1 associated composite, and 5 publications), each revision would need 10 rows for the links and 1 row for the rest of the entry. The question, therefore, is: are 11 rows per revision manageable for 343,371 active artifacts?
Revisions
To find out, let's take a look at the daily updates of the catalogue data. With the git history, we can find out how many artifacts were edited on each day. Since the commits are made daily, multiple consecutive edits to the same artifact are counted as a single revision. On the other hand, the removal of an artifact from the catalogue might be counted as a revision. Whether that balances out is hard to tell, so these numbers are a rough estimate. Unfortunately, the analysis only goes back to 2017, as before that the catalogue was included as a ZIP file.
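The counting itself can be done from `git log` output. A rough sketch of the idea, not the actual analysis script: the file name is hypothetical, and changed CSV lines are only a proxy for edited artifacts.

```js
const { execSync } = require('child_process')

// One COMMIT line per (daily) commit, followed by numstat lines
// of the form "<added>\t<deleted>\t<file>" for the catalogue CSV
const log = execSync(
  'git log --format=COMMIT%x09%as --numstat -- cdli_catalogue.csv',
  { encoding: 'utf8' }
)

const revisionsPerDay = {}
let date
for (const line of log.split('\n')) {
  if (line.startsWith('COMMIT\t')) {
    date = line.split('\t')[1]
  } else if (line.trim() !== '') {
    const [added, deleted] = line.split('\t').map(Number)
    // Each edited artifact shows up as (at least) one changed line
    revisionsPerDay[date] = (revisionsPerDay[date] ?? 0) + Math.max(added, deleted)
  }
}

console.log(revisionsPerDay)
```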
Figure 2: Number of artifact revisions per year. The 12 largest revisions are highlighted and explained below.
Figure 2 highlights in various colors the 12 revisions affecting the highest number of artifacts. Most of these are 7 consecutive revisions in October and November of 2017, which involved editing the ID column, something that should not happen in the current system. Other large revisions usually affected a single column, thereby revising almost every artifact:
- Padding designation numbers with leading zeroes (“CDLI Lexical 000030, ex. 13” → “013”): 2020-02-11, 2018-01-26
- Addition of new columns: 2019-09-10
- New default values ("" → "no translation" for the column `translation_source`): 2018-10-30
Outside the top 12 the edits become a lot more meaningful. Often, credits for new transliterations or translations are added, sometimes with the previously undetermined language now specified.
As it turns out, approximately 3 million edits are made every year. If all those edits are stored as linked entities, we are looking at 30 million table rows, per year. However, even if the edits are stored in a flat format there would be 3 million table rows per year already. Either option might become a problem in 5–10 years, depending on SQL performance. With indexing it might not be a problem at all: usually the only query is by identifier anyway.
Changed fields
That said, let’s presume I choose the flat format. Most revisions only change one or two fields (excluding the modification date which would not be included). Duplicating the whole row might be wasteful, so what could I do to avoid that?
Since the flat format is based on the schema of the CSV files used for batch edits, each column can be considered as text, with an empty string for empty values. This leaves `NULL` (i.e. "no value") available to represent "no change". Together with MySQL's `SPARSE` columns, only edited fields would be stored. (Otherwise, each empty value would need to be signified as such; now, actual values carry some extra information to the same end.)
It would also make it even easier to display a list of changes, as there is no need to compare the value with the previous one. Other operations with revisions, such as merging two revisions made simultaneously on the same version of an artifact, would be easier for the same reason.
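To illustrate the scheme, here is a minimal sketch with made-up field names, where `null` marks "no change" and `''` is a real, empty value:

```js
// Apply a flat revision to an artifact: null means "field unchanged"
function applyRevision (artifact, revision) {
  const result = { ...artifact }
  for (const [field, value] of Object.entries(revision)) {
    if (value !== null) result[field] = value
  }
  return result
}

// Listing the changes needs no comparison with the previous version:
// the non-null fields of the revision are exactly the changes
const artifact = { designation: 'CDLI Lexical 000030, ex. 013', language: '' }
const revision = { designation: null, language: 'Sumerian' }

console.log(applyRevision(artifact, revision))
// { designation: 'CDLI Lexical 000030, ex. 013', language: 'Sumerian' }
```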
Since this would not be possible, or at least not as easy, with a linked format, perhaps it was good that the sheer volume of edits pointed me towards the flat format anyway.
References
Cuneiform Digital Library Initiative. (2021, March 8). Monthly release 2021.03 (Version 2021.03). Zenodo. https://doi.org/10.5281/zenodo.4588551