I applied to the ContentMine fellowships to extract facts and other info from papers about conifers, and visualise them. This program accomplishes that, and possibly more.
ctj has been around longer. It started as a way to learn my way around the ContentMine pipeline of tools, but turned out to uncover a lot of possibilities for further processing the output of this pipeline (1, 2).
The recent addition of ctj rdf expands on this. While a lot of data is lost between the ContentMine output and the resulting rdf, the possibilities are certainly no fewer. This is mainly because of SPARQL, which makes it possible to integrate with other databases, such as Wikidata, without many changes to ctj rdf itself.
Here’s a simple demonstration of how this works:
aardvark (classic ContentMine example)
This generates `data.ttl`, which holds the following information:
Note that there are no links to Wikidata whatsoever. When we list, for instance, how often each term is mentioned in an article (in the dataset), we only have string values, an identifiers.org URI and some custom namespace URIs.
However, in this format, we can easily combine the information in these papers with the enormous amount of data in Wikidata by using federated queries.
To accomplish this, we first link the identifiers in our dataset to the ones in Wikidata; then we link the matched text of each term to the taxon name of a species in Wikidata. This alone already gives us a set of semantic triples where both values in every triple are linked to values in the extensive community-driven database that is Wikidata.
Now say we want to list the Swedish name of each of those species. Well, we can, because that info probably exists on Wikidata (see stats below). And if we can’t find something, remember that each of those Wikidata values is also linked to numerous other databases.
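To make the linking step more concrete, here is a rough sketch of what such a federated query could look like, built as a string in Python. The local predicate URI (`http://example.org/ctj/matchedTerm`) and the function name are hypothetical placeholders, not the actual ctj rdf vocabulary. The Wikidata parts (`wdt:P225` for taxon name, `rdfs:label` for names, the public endpoint URL) are real. It also assumes the matched text is stored as a plain string literal, so that it can equal a Wikidata taxon name.

```python
# Public Wikidata SPARQL endpoint, used inside the SERVICE clause.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_label_query(term_predicate: str, language: str = "sv") -> str:
    """Build a federated SPARQL query as a string.

    term_predicate is a HYPOTHETICAL URI for the predicate that links a
    mention in the local dataset to its matched text; replace it with
    whatever the local rdf actually uses.
    """
    return f"""
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?term ?taxon ?label WHERE {{
  # Local data: the matched text of each mention (assumed predicate).
  ?mention <{term_predicate}> ?term .

  # Wikidata: find the taxon whose taxon name (P225) equals that text,
  # and optionally its label in the requested language.
  SERVICE <{WIKIDATA_ENDPOINT}> {{
    ?taxon wdt:P225 ?term .
    OPTIONAL {{
      ?taxon rdfs:label ?label .
      FILTER(LANG(?label) = "{language}")
    }}
  }}
}}
""".strip()


query = build_label_query("http://example.org/ctj/matchedTerm")
print(query)
```

The resulting string would be run against a SPARQL engine that has the local data loaded and supports `SERVICE`. How well this performs depends on the engine and on how many terms are sent to Wikidata.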
Again, this is without having to change anything in the rdf output (to be fair, I forgot to list an article identifier in the first version of the program, but that could/should have been anticipated). Not having to add this data to the output has the added benefit of not having to make and maintain local dictionaries and lists of this data.
Some stats:
Note that not all terms are actually valid. A lot of `genus` matches are actually just capitalised words, and a lot of common species names are abbreviated, e.g. to `E. coli`, making it impossible to unambiguously map to Wikidata or any other database. This could explain the difference between found ‘terms’ and mapped terms.
Here’s a more detailed example. We are using my dataset available at 10.5281/zenodo.845935. It was generated from 1000 articles that mention ‘Pinus’ somewhere. This one has 15326 statements, of which 8875 (57.9%) can be mapped, taking ~50 seconds. Now, for example, we can list the top 100 most mentioned species. The top 20:
Species | Hits |
---|---|
Pinus sylvestris | 248 |
Picea abies | 177 |
Pinus taeda | 138 |
Pinus pinaster | 120 |
Pinus contorta | 96 |
Arabidopsis thaliana | 91 |
Picea glauca | 77 |
Pinus radiata | 77 |
Pinus massoniana | 72 |
Pseudotsuga menziesii | 65 |
Oryza sativa | 56 |
Pinus halepensis | 56 |
Pinus ponderosa | 55 |
Pinus banksiana | 53 |
Pinus koraiensis | 53 |
Picea mariana | 52 |
Pinus nigra | 51 |
Pinus strobus | 46 |
Quercus robur | 45 |
Fagus sylvatica | 45 |
… | … |
The top of the list isn’t surprising: mostly pines, other conifers, other trees, Arabidopsis thaliana which I’ve seen represented in pine literature before, and Oryza sativa or rice, which I haven’t seen before in this context.
Note that only 248 of the 1000 articles mention the top Pinus species. This may be because the query getting the articles was quite broad. Also note that this doesn’t take into account how often an article mentions a species, which is a limitation of the current rdf output.
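As an illustration of what "hits" means here, the following sketch counts the number of distinct articles per species, so repeated mentions within one article count once. The function name and the toy data are made up for the example; the real counts come from querying the rdf, not from this code.

```python
from collections import Counter


def count_articles_per_species(mentions):
    """Count in how many distinct articles each species is mentioned.

    mentions is an iterable of (article_id, species_name) pairs.
    Duplicate pairs (several mentions in one article) are counted once.
    """
    unique_pairs = set(mentions)
    return Counter(species for _, species in unique_pairs)


# Toy data, invented for this example.
toy_mentions = [
    ("article-1", "Pinus sylvestris"),
    ("article-1", "Pinus sylvestris"),  # second mention in the same article
    ("article-1", "Picea abies"),
    ("article-2", "Pinus sylvestris"),
]

hits = count_articles_per_species(toy_mentions)
for species, count in hits.most_common():
    print(f"{species} | {count}")
```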
Going off this list, we can then look at which non-tree or even non-plant species are named most often in conjunction with a given species, or, in this case, a genus. Top 20:
Species 1 | Species 2 | Co-occurrences |
---|---|---|
Picea abies | Pinus sylvestris | 98 |
Picea abies | Pinus taeda | 56 |
Arabidopsis thaliana | Pinus taeda | 47 |
Picea glauca | Pinus taeda | 43 |
Arabidopsis thaliana | Oryza sativa | 43 |
Pinus pinaster | Pinus taeda | 41 |
Pinus pinaster | Pinus sylvestris | 41 |
Picea abies | Picea glauca | 41 |
Arabidopsis thaliana | Picea abies | 37 |
Pinus contorta | Pinus sylvestris | 36 |
Betula pendula | Pinus sylvestris | 36 |
Pinus sylvestris | Pinus taeda | 36 |
Pinus nigra | Pinus sylvestris | 35 |
Pinus contorta | Pseudotsuga menziesii | 32 |
Picea abies | Pinus contorta | 31 |
Picea abies | Pinus pinaster | 30 |
Arabidopsis thaliana | Physcomitrella patens | 30 |
Oryza sativa | Pinus taeda | 29 |
Pinus sylvestris | Quercus robur | 29 |
Picea abies | Picea sitchensis | 28 |
… | … | … |
Interesting to see that rice is mostly mentioned with Arabidopsis. Let’s explore that further. Below are species named in conjunction with Oryza sativa.
Species | Co-occurrences |
---|---|
Arabidopsis thaliana | 43 |
Pinus taeda | 29 |
Picea abies | 24 |
Physcomitrella patens | 23 |
Populus trichocarpa | 21 |
Glycine max | 20 |
Vitis vinifera | 17 |
Picea glauca | 17 |
Pinus pinaster | 16 |
Selaginella moellendorffii | 15 |
Pinus sylvestris | 13 |
Triticum aestivum | 12 |
Pinus contorta | 10 |
Picea sitchensis | 10 |
Ginkgo biloba | 10 |
Pinus radiata | 10 |
Ricinus communis | 9 |
Amborella trichopoda | 9 |
Medicago truncatula | 9 |
Cucumis sativus | 8 |
… | … |
So attention seems divided between trees and more agriculture-related plants. More to explore for later.
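For reference, the co-occurrence counts in the tables above could be computed along these lines. This is a minimal sketch on invented toy data, with hypothetical function names, and it assumes a co-occurrence means two species being mentioned in the same article. Pairs are stored in alphabetical order, matching the layout of the pair table.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_cooccurrences(mentions):
    """Count, per species pair, the number of articles mentioning both.

    mentions is an iterable of (article_id, species_name) pairs.
    """
    species_by_article = defaultdict(set)
    for article_id, species in mentions:
        species_by_article[article_id].add(species)

    pairs = Counter()
    for species in species_by_article.values():
        # Sorting makes each pair appear in one fixed (alphabetical) order.
        pairs.update(combinations(sorted(species), 2))
    return pairs


def partners_of(pairs, target):
    """Return co-occurrence counts for every species paired with target."""
    partners = Counter()
    for (first, second), count in pairs.items():
        if first == target:
            partners[second] += count
        elif second == target:
            partners[first] += count
    return partners


# Toy data, invented for this example.
toy_mentions = [
    ("article-1", "Oryza sativa"),
    ("article-1", "Arabidopsis thaliana"),
    ("article-1", "Pinus taeda"),
    ("article-2", "Oryza sativa"),
    ("article-2", "Arabidopsis thaliana"),
    ("article-3", "Pinus taeda"),
    ("article-3", "Picea abies"),
]

pairs = count_cooccurrences(toy_mentions)
rice_partners = partners_of(pairs, "Oryza sativa")
for species, count in rice_partners.most_common():
    print(f"{species} | {count}")
```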