Finnish Museum of Natural History Museum websites

E-infrastructure for organism names to facilitate data sharing - LSID Nordic project

 

A joint project of the Nordic GBIF Nodes, funded by NordForsk in 2008-2010

 

Project Coordinating Group

Hannu Saarenmaa, Finnish Museum of Natural History, University of Helsinki (Project Coordinator)

Henrik Enghoff, Natural History Museum of Denmark, University of Copenhagen

Tiiu Kull, Estonian University of Life Sciences, Tartu

Sven Kullander, Swedish Museum of Natural History, Stockholm

Sergey Sinev, Zoological Institute of the Russian Academy of Sciences, St. Petersburg

Einar Timdal, Natural History Museum, University of Oslo

 

Project Advisory Group

(Members to be invited)

 

Summary

The purpose of the project is to build an e-infrastructure for resolving scientific names of organisms, to facilitate biodiversity data use and data sharing in the Nordic region and beyond. The work requires setting up a service on the Internet that will issue globally unique identifiers for scientific names and the underlying taxonomic concepts. The identifiers will be based on the LSID specification, which has been standardised by the Biodiversity Information Standards organisation TDWG and is recommended by the Global Biodiversity Information Facility (GBIF). Environmental authorities, research groups, and mobile observers out in the wild can then use these identifiers to remove ambiguities in data exchange. One benefit is that large integrated studies that need to combine data, for instance global change studies, become more feasible.

The participants are already major research infrastructure elements; the project extends their interaction to the electronic frontier. This is the first joint project of the Nordic GBIF nodes, and it also aims to strengthen Nordic cooperation in the global GBIF process.

Project plan

a. Overall aims and objectives

Since Linnaeus, scientific names written in Latin have formed the backbone of biology into which all knowledge about organisms has been anchored. Unfortunately, scientific names alone cannot always be used to uniquely specify organisms. Most species do have synonym names, which are relatively easy to handle. However, species get split and lumped, and due to erroneous usage of names throughout history, there are many homonyms. Typically 2-5% of species in most organism groups suffer from ambiguity of names beyond easy cases of synonymy. Ultimately, names are just tags attached to scientific hypotheses and opinions about how organisms are classified and grouped and what constitute taxonomic concepts (e.g. Ytow 2002). In other words, underlying names are taxonomic concepts and their published descriptions (a.k.a. ''circumscriptions'', which delineate the borders of taxonomic concepts). It is these descriptions that really need to be retrievable and identified uniquely.

Until now there has not been any lasting solution for communicating about taxonomic concepts other than using the ''imperfect'' names. Scientists do not necessarily need such a solution, but authorities that exchange information, such as environmental administrations, do. In the Nordic countries, a joint Nordic Code Centre (NCC) operated in the years 1981-94 and standardised which names would be used for data interchange (Clausen & Ravn 1999). It also issued numbers for the underlying concepts. In North America, the Integrated Taxonomic Information System (ITIS) was established in 1994 and has since been issuing ''Taxonomic Serial Numbers'' for names. However, in 1994 NCC funding from the Nordic Council of Ministers was terminated. As a consequence, name lists are now kept separately in each Nordic country, different sets of names are used in different countries for some groups, and nobody is issuing numbers for taxonomic concepts anymore. Data interchange and integration therefore require more manual work and are prone to error. The need will grow, since the INSPIRE directive of the EU requires that, within five years, data about the distribution of species be made available in electronic form.

Meanwhile, sharing of biodiversity data is becoming a widespread activity worldwide. The Global Biodiversity Information Facility (GBIF) has spearheaded the development together with the Biodiversity Information Standards organisation TDWG. Together with the Species 2000/Catalogue of Life initiative, these organisations provide the global e-infrastructure for biodiversity information. Already 130 million records of observations of organisms from over 1000 databases are openly available through GBIF. Still, GBIF integrates all these data using names alone, as none of GBIF's current name providers offers identifiers for taxonomic concepts. The result is that an integrated dataset downloaded through GBIF has to be cleansed manually, and much of the data has to be left out of analyses.

The need to improve this situation has been recognised (Hobern & Saarenmaa 2005), and TDWG and GBIF have adopted a standard and are building an e-infrastructure for resolving names to their underlying taxonomic concepts. The solution is based on the Life Sciences Identifier (LSID) specification (Life... 2004), which originates from the biomedical community (Szekely 2003). LSIDs are Uniform Resource Names (URNs) that can be resolved using the Internet's naming conventions. That is, a software application or web browser can automatically retrieve the documentation for the taxonomic concept and name identified by an LSID and analyse it. The content that an LSID resolves to should always be the same, although different revisions of it could be added. Hence, LSIDs can be used as keys in disconnected databases that do not know of each other, and can be used in data interchange.

The benefits of using LSIDs in datasets that are exchanged and integrated are many. Datasets that use different nomenclatures and spellings of names can be merged for analysis automatically. Taxa with homonyms and split or lumped taxa no longer cause ambiguity or call for manual intervention. A scientist looking at a dataset in a web browser can point at a scientific name or an LSID and automatically retrieve the literature reference or other description of what was actually meant by the person who inserted the name in the database or web page.

Examples of what LSIDs could look like:

  • urn:lsid:gbif.fi:name:LEP-TIMGRIS9:1 Scientific name for the moth Timandra griseata auct. W. Petersen, 1902
  • urn:lsid:gbif.fi:taxon:123-4567890:1 Taxon concept for a species, for example Timandra griseata sensu Saarenmaa & Calabuig 2009, after (purely hypothetically) splitting this species for the n'th time.

An LSID consists of six elements separated by colons: 1) the first element states that this is a URN; 2) the second states that this URN is an LSID; 3) authority identification (normally the party that issued the LSID and provides the service to resolve it); 4) namespace identification (normally the name of a database at the authority); 5) object identification (a database key); and 6) optionally, a revision number.
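The element structure described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not part of any TDWG or project software: the names Lsid and parse_lsid are hypothetical, and the case-insensitive comparison of the two prefix elements is an assumption of this sketch.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    """The elements of an LSID that follow the fixed 'urn:lsid' prefix."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split an LSID string into its colon-separated elements."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated elements: {text!r}")
    urn, scheme, authority, namespace, object_id = parts[:5]
    # Assumption of this sketch: the two prefix elements match case-insensitively.
    if urn.lower() != "urn" or scheme.lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if not (authority and namespace and object_id):
        raise ValueError(f"empty element in LSID: {text!r}")
    # The sixth element (revision) is optional.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(authority, namespace, object_id, revision)


# The name LSID used as an example in this proposal.
print(parse_lsid("urn:lsid:gbif.fi:name:LEP-TIMGRIS9:1"))
```

An LSID without a revision, such as urn:lsid:gbif.fi:taxon:123-4567890, parses to the same structure with the revision left empty.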

For this to work, some e-infrastructure has to be in place. Basically, one or several LSID authorities (comparable to the Internet's Domain Name Servers, which keep track of web addresses) with services to resolve the LSIDs have to be set up. When dealing with scientific names, different name lists have to be brought together and an LSID generated for each circumscription underlying the names. In the case of synonymy, different names can point to the same circumscription. In the case of a species that has been split into two (where one retains the original name), two different circumscriptions exist for one name, and two LSIDs would be generated. The same applies to homonyms.
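The relationship between names and circumscriptions described above can be sketched as a small in-memory registry. This is a simplified illustration, not the design of the planned LSID authority: the class ConceptRegistry, the authority example.org, and the names "Exemplum primum" and "Exemplum secundum" are hypothetical, and the split of Timandra griseata is used purely hypothetically, as elsewhere in this proposal.

```python
from collections import defaultdict
from itertools import count

AUTHORITY = "example.org"  # hypothetical LSID authority


class ConceptRegistry:
    """Links name strings to the circumscriptions (concepts) they refer to."""

    def __init__(self):
        self._serial = count(1)
        self._concepts_by_name = defaultdict(set)
        self._names_by_concept = defaultdict(set)

    def new_concept(self):
        """Issue a 'naked' concept LSID (no description attached yet)."""
        return f"urn:lsid:{AUTHORITY}:taxon:{next(self._serial)}"

    def link(self, name, concept):
        """Record that a name has been used for a given circumscription."""
        self._concepts_by_name[name].add(concept)
        self._names_by_concept[concept].add(name)

    def synonyms_of(self, name):
        """Other names that point to a concept this name also points to."""
        found = set()
        for concept in self._concepts_by_name.get(name, ()):
            found |= self._names_by_concept[concept]
        found.discard(name)
        return sorted(found)

    def is_ambiguous(self, name):
        """True when one name stands for more than one circumscription."""
        return len(self._concepts_by_name.get(name, ())) > 1


registry = ConceptRegistry()

# Synonymy: two names, one circumscription.
shared = registry.new_concept()
registry.link("Exemplum primum", shared)
registry.link("Exemplum secundum", shared)

# Split species: one name, two circumscriptions (before and after the split).
broad = registry.new_concept()
narrow = registry.new_concept()
registry.link("Timandra griseata", broad)
registry.link("Timandra griseata", narrow)

print(registry.synonyms_of("Exemplum primum"))
print(registry.is_ambiguous("Timandra griseata"))
```

In this sketch a synonym lookup needs no description at all, whereas an ambiguous name can only be resolved by choosing one of its concept LSIDs.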

Setting up this e-infrastructure for the species that occur in the Nordic region is the aim of this proposal. Such e-infrastructure cannot be set up globally once and for all. It is best built separately for each taxonomic group and geographic region, as name catalogues, use of data, and organisms are distributed that way. For instance, there have been many regional name catalogue projects such as the North American ITIS, Fauna Europaea, Flora Europaea, and Species 2000 europa, and there is now an Australasian pilot project for LSID implementation. The e-infrastructure will have to be scaled up stepwise, starting from one organism group, learning from experience, and moving to the next. At the top level, GBIF and the TDWG infrastructure project are tying the parts together globally.

This study will be carried out by the major natural history museums and environmental administrations in the Nordic region. The museums themselves are major infrastructure elements, but this study raises their interaction to a new level of electronic data interchange. No single Nordic institute could carry out this study alone, as it deals with integration of data between countries. The study also involves major users of museum databases. It ties the Nordic efforts to the global initiatives in this area, and prepares the way for larger studies and implementation, for instance as a European Union research infrastructure project.

b. Proposed methodology

In the first phase of this study, butterflies and moths (Lepidoptera) will be used, as this group illustrates the multitude of problems well, and there is great need for data exchange because species of the group are often used as environmental indicators. This order has about 3000 species in the Nordic region, and because of climate change 10-40 new species are spreading from the south annually. Together with changes caused by taxonomic research, this means that the catalogues need constant updating. Due to differing taxonomic opinions, at least three different catalogues (taxonomic classifications) are currently in use in the region: Finland, the Scandinavian countries, and Russia each use a different set of names for Lepidoptera. European name lists from Species 2000 europa and Fauna Europaea will also be scrutinised, as well as the global LepIndex. LSIDs will be issued for all available names of Lepidoptera in the region, and for the related taxonomic concepts.

The name data are used in databases of observations made in the field. The current GBIF data providers offer about 800,000 observation records of Lepidoptera from the region. However, the development of data sharing has been somewhat uneven. In Finland and Sweden, public data collection portals now offer good datasets. Norwegian data are less abundant, but fully available. The Danish Lepidoptera species are already included in the Norwegian dynamic checklist of Nordic Lepidoptera. In NW Russia and the Baltic countries, data collection and sharing for Lepidoptera have not yet been organised on the Internet. Where the opportunity arises, the Nordic GBIF nodes will use this project to promote the establishment of observation databases on the Internet.

After the techniques have been developed with Nordic Lepidoptera, they will be tried with two other types of name datasets: a global dataset of fish names, which is available from the project partner Swedish Museum of Natural History, and national datasets of all species from Denmark, Estonia and Norway, which are (or, for Denmark, shortly will be) available from the project partners Natural History Museum of Denmark - University of Copenhagen, Estonian University of Life Sciences, and the Norwegian Biodiversity Information Centre (Artsdatabanken). These large datasets should illustrate other types of issues with the application of LSIDs.

The Norwegian Biodiversity Information Centre has recently developed a species name database. Authorised taxonomists will be able to administer nomenclature and taxonomy using a shared web-based tool: insert new taxa, correct the nomenclature, insert synonyms, split or merge taxa, show historical data, export names, give researchers' own databases online access to names (through web services), and import lists of species from Species 2000 (Catalogue of Life) and Norwegian registers. The project will consider using this tool as a platform for sharing species names in the Nordic countries. Another option is to use the Species Names Database to administer national taxonomy and nomenclature.

A sizable piece of work will be creating the synonym lists across the name lists. However, this is quite achievable with the help of experts from the region, who will be gathered at a workshop to initiate this work. We want to make it clear that there is no intention to harmonise or merge the name lists, or in any way to limit the creation of new name lists, as this would interfere with taxonomic researchers' work. The purpose is simply to design a way for the various keepers of name lists to achieve interoperability by sending their updates to the LSID authority.

Another large piece of work is to initiate the collection of descriptions for Lepidoptera and fishes. This will not be fully achievable in a limited time period, as the descriptions would have to be extracted from the thousands of publications where the species were originally described. Some material can be retrieved from the global initiatives uBio and Biodiversity Heritage Library (BHL), and cooperation with them will be set up. It is nevertheless acknowledged that most taxonomic concepts, perhaps up to 90% of the targeted 3000 Lepidoptera species, will not have a fully documented concept in electronic form by the end of this study. Electronic descriptions for homonyms and split species will be prioritised. Simple synonym resolution can work even with ''naked'' LSIDs without an electronically available circumscription. In any case, the e-infrastructure to complete the work in the long run will be set up.

One or several LSID authorities will be set up. There will be at least a development site for this project's purposes and one operational site that mirrors the services. Where these will reside will be decided at the project kick-off at the latest. To implement the LSID infrastructure, guidelines from TDWG will be followed; see http://wiki.tdwg.org/twiki/bin/view/GUID/WebHome.

The name lists will also be served to GBIF using TDWG standards, that is, the TAPIR protocol and the Taxonomic Concept Schema (TCS). The issued LSIDs will be included in this material.

Dissemination of the results will be done through a website, scientific conferences, and visits at major user institutions.

c. Description of the research environment and service provided by the suggested infrastructure

A small research group will be set up at the beginning of this project to develop this service. It will consist of a biologist and a computer scientist, who will be housed at one of the partner institutions. The aim is that the biologist will prepare a Ph.D. thesis on the subject in 3 years, while the computer scientist could work on a Master's thesis for a shorter period. They will set up the computer servers and handle the communication. Although the group will be housed at a major museum, there is no particular physical research environment, only a distributed e-infrastructure.

The proposed e-infrastructure will offer the following services:

  • Issue an LSID for each scientific name and each circumscription, and resolve the LSIDs upon request.

  • Integrate globally the services through the TDWG infrastructure project.

  • Coordinate integration of name lists in the Nordic region, initially for Lepidoptera.

  • Develop guidelines for incorporating LSIDs in datasets. Offer helpdesk and training. Disseminate results.

  • Promote data sharing in the region.

These services will be used in several ways. Here we list some brief use scenarios that will be developed fully when the work has started.

  • Environmental administration sending or receiving data. Datasets that are exchanged will contain an LSID to the taxonomic concept in addition to the scientific name. This removes the ambiguity of what species is being dealt with.

  • Scientist integrating data sets for analysis. For instance, analysis of trends of species abundance and change of distribution because of climate change requires integration of many datasets using different sets of scientific names. Now this integration can be done automatically.

  • Citizen scientist entering data. Most of the legwork of nature observation is done by advanced amateurs. They use personal computer applications and mobile devices for data entry. These applications can be enhanced to tie a recognised LSID to a particular species abbreviation (abbreviations are used for rapid data entry, such as TIMGRIS9 for Timandra griseata ''sensu Saarenmaa & Calabuig 2009'') so that the ambiguity of the taxon observed is removed.

  • GBIF indexing data. The GBIF Data Portal contains a cache of all the world's shared biodiversity data. Integration is done by scientific name only, which leads to observations of one taxonomic concept being split into many. Use of LSIDs in the source databases will consolidate GBIF data into at least 20% fewer species.

  • The Encyclopedia of Life (EoL) project creating web pages. The EoL will harvest all available web pages about species into a huge mash-up and synthesise knowledge from many such sources. This process will be greatly enhanced if the species home pages contain the species LSID instead of just names.
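The data-entry and data-integration scenarios above can be sketched in a few lines of Python. The sketch is illustrative only: the function field_record, the lookup table, the second concept LSID, and the name "Exemplum nomen" are hypothetical, while the abbreviation TIMGRIS9 and the first concept LSID are the examples used earlier in this proposal.

```python
from collections import Counter

# Concept LSID taken from the example earlier in this proposal.
GRISEATA_2009 = "urn:lsid:gbif.fi:taxon:123-4567890:1"
# Hypothetical second concept, invented for this sketch.
OTHER_CONCEPT = "urn:lsid:gbif.fi:taxon:000-0000001:1"

# Hypothetical lookup table of a data-entry tool: abbreviation -> concept LSID.
ABBREVIATIONS = {"TIMGRIS9": GRISEATA_2009}


def field_record(abbreviation, name_as_written):
    """Expand a rapid-entry abbreviation into a record carrying an LSID."""
    return {"name": name_as_written, "concept": ABBREVIATIONS[abbreviation]}


# Two datasets that write the name of the same concept differently.
dataset_a = [
    field_record("TIMGRIS9", "Timandra griseata"),
    field_record("TIMGRIS9", "Timandra griseata"),
]
dataset_b = [
    {"name": "Timandra griseata auct.", "concept": GRISEATA_2009},
    {"name": "Exemplum nomen", "concept": OTHER_CONCEPT},
]

merged = dataset_a + dataset_b
by_name = Counter(record["name"] for record in merged)
by_concept = Counter(record["concept"] for record in merged)

print(len(by_name), "groups when merging by name")
print(len(by_concept), "groups when merging by concept LSID")
```

Grouping by name string keeps the differently written records apart, while grouping by concept LSID brings them together without manual cleaning.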

d. Novelty of the proposed project, positioning the project in the international context of research in this field, and expected results

This is a cutting-edge implementation of open standards. The TDWG infrastructure project has pioneered the methodology and sorted out most of the issues. Sufficiently stable open source software implementations are available. There is not yet an operational LSID authority for taxonomy anywhere in the world. There are several development sites for LSIDs in the USA (Franz & al. 2004), Brazil (TDWG Infrastructure... 2005), and Australia (Whitbred 2007). By entering the implementation phase now, the Nordic region would be at the forefront of biodiversity informatics in this area.

It is also a good time to enter the scene, as several leading biodiversity informatics units are now working on these issues. There is an opportunity to learn from the leaders of this field and to share experience with them.

LSIDs pave the way for integrating the activities of the ecological community into the development of the semantic web, in particular via the Encyclopedia of Life.

The new science of biodiversity informatics stands at the forefront when the world addresses questions around the ongoing global change. Simply put, understanding the loss of biodiversity, the impact of climate change on biodiversity, and the impact of poverty on biodiversity all require such large datasets that individual scientists, research groups, or countries cannot put them together anymore. Integration of data from hundreds of databases around the world is needed. It can only succeed if research infrastructure services such as those proposed in this project can be put in place.

e. Impact and potential for promoting scientific innovation

We believe the impact of the project will be very high. Large-scale studies integrating data become more feasible. The emphasis in ecological and environmental research is shifting from small-scale hypothetico-deductive experiments to large-scale, data-driven integrated studies. Because of the proposed services, data mining of large datasets will become easier and will use higher-quality data.

Better possibilities for integration of data will also help to promote more data sharing, as the owners of the data will know that their data are easier for other groups to use. This will lead to increased opportunities for the dataset owners. Entirely new kinds of studies may also become possible with large integrated datasets. For instance, it will be possible to compare different taxonomies and analyse their different facets. Data can be better linked together, for instance in food-web analyses.

For the Nordic GBIF Nodes, a first joint project like this will open a new platform for cooperation. It is expected to lead to innovations and projects in other areas as well.

f. Work plan -- milestones and targets for the proposal

Tentative timeline by which the described work is performed

Month 1: Hire workforce. Hold kick-off meeting with the Steering Group.

Month 3: Acquire and set up server computers. Install LSID software tools. Study experiences from other LSID projects. Acquire relevant name lists.

Month 6: Issue LSIDs for all names and tentatively generate a naked LSID (without link to description) for each corresponding concept. Link simple synonyms together. Integrate several test databases of observations. Have entered 25 descriptions by hand from literature for homonyms and split species.

Month 9: Open test service for LSID issuance for new names and their resolution. Present demonstration of simple data integration. Design linking to descriptions from literature. Have entered 50 descriptions by hand from literature for homonyms and split species. Hold meeting of IT Advisory Group.

Month 12: Acquire links to literature references from uBio and BHL, and link them to the concept LSIDs. Have entered 100 descriptions by hand from literature. Start linking together split species and homonyms. Set up the TAPIR/TCS name provider. Hold meeting of the Coordinating Group, looking in particular at lessons learnt and new opportunities for Nordic GBIF cooperation.

Month 15: Have entered 150 descriptions by hand from literature and automatic sources. Write guidelines on how to incorporate LSIDs in observation databases and data collection tools. Test these guidelines with the global fish list and the Danish, Estonian, and Norwegian all-species lists.

Month 18: Open service on the Internet for LSID resolution. First operational use of LSIDs in data exchange. Write papers. Present results at TDWG and other meetings. Prepare proposals to the Steering Group for continuation projects.

Month 24: Have entered 300 descriptions by hand from literature and automatic sources. Start working on lists other than Lepidoptera to test the procedures. Hold meeting of the Steering Group and explore, as a continuation, the feasibility of a Nordic Catalogue of Life project that would build permanent infrastructure for biodiversity informatics in the region.

Month 30: Experiment with using LSIDs in observation databases. Write thesis. Write about lessons learned. Propose further Nordic GBIF projects at this stage at the latest.

Month 36: Final meeting of the Coordinating Group. Training workshop to disseminate results. Transition of project to other bodies.

Contributions by country and partner

  • Denmark (SNM-UKBH): Contribute names of all Danish taxa (preparations for a national dataset of all species are being finalised; it will be shared through the Danish GBIF node portal www.danbif.dk in the near future). Feed updates on the Danish inventory of Lepidoptera to the Norwegian dynamic checklist of Nordic Lepidoptera.

  • Estonia (EMU): Contribute names of all Estonian taxa. Test observation databases to increase availability of data.

  • Finland (FMNH): Good Lepidoptera data exists and is already being shared. Contribute names for Finnish Lepidoptera. Carry out project management and technical development.

  • Norway (University of Oslo and Artsdatabanken): Some Lepidoptera data exists and is already being shared. Possibly establish a data collection portal to increase availability of data. Contribute names for all Norwegian taxa.

  • Russia (ZIN): Contribute Lepidoptera names. Test observation databases and digitize sample collection data.

  • Sweden (NRM): Good Lepidoptera data exists and is already being shared. Contribute names for Nordic Lepidoptera and global fish.

  • All: Put LSIDs in selected observation datasets according to guidelines that will be developed by the project. Share experience between GBIF nodes in the region.

g. Management and project organisation

The project leader will be Dr. Hannu Saarenmaa. He will oversee the work of the technological development unit, communicate with the Coordinating Group, Advisory Group, and the funding organisation, and ensure the quality of the project deliverables.

A Coordinating Group consisting of the project leaders of all partners will be set up. This group of six people will meet at least annually and will otherwise work intensively by email.

An Advisory Group consisting of major users and experts of informatics solutions will be invited internationally. It will consist of about five people from international organisations, such as GBIF, TDWG, Catalogue of Life, and the European Environment Agency, from national environment authorities, and from research groups that are well-known leaders in this area. This group will meet at least once in person, and will otherwise comment on plans and deliverables electronically.

A technological development unit will be set up at the beginning of the project. Two people will be hired. The computer scientist will have to know Java, MySQL, Linux, XML, and key Internet protocols very well. The biologist will have to have a taxonomic background, preferably in the targeted group, be aware of international developments in Species 2000, ITIS, etc., have good computer skills as a user, and have good communication skills. This group will need about 60% of the project cost. The location of this group has not yet been fully decided, although the budgeting is done on the assumption that the group will be with the Project Leader in Helsinki. Possibly more person months can be afforded if the group is based in Estonia or Russia instead of the Nordic countries. The Steering Group will use the autumn of 2007 to investigate the options, search for suitable people, and decide on the location and people at project kick-off at the latest.

h. In which way(s) will the project create ''Nordic Strength''?

Despite the opportunity created by the establishment of the GBIF Secretariat in Copenhagen in 2002, Nordic participation in global biodiversity informatics initiatives has not lately been particularly strong. The reasons for this are many, but it is time to change this, because Nordic research groups cannot answer the ongoing global challenges otherwise. The closure of the NCC in 1994 was particularly unfortunate. It removed the advantage this region had in biodiversity informatics just at the moment when the field started to gain momentum internationally. It is time that Linnaeus' home region again takes the lead in this area.

This project will strengthen Nordic participation in large-scale international biodiversity science infrastructure projects such as GBIF, TDWG, and Encyclopedia of Life. A joint Nordic effort is important, because the Nordic countries individually are relatively small and not rich in biodiversity compared to the countries that play leading roles in these global mega-science efforts. The proposed work will also stimulate planning of future Nordic participation in EU FP7 infrastructure projects such as LIFEWATCH and SpeciesBase, and other related projects.

Biodiversity informatics is an emerging science that studies the organisation of data, information and knowledge about biodiversity. It stands at the intersection of bioinformatics, environmental informatics, geographic information systems, etc. It is an integrative discipline that builds on large databanks and very much on the idea of open access to biodiversity data on the Internet. The leading research groups in this area are in Amsterdam, Berlin, Reading, Woods Hole (Massachusetts), Lawrence (Kansas), Berkeley, Mexico City, San Jose (Costa Rica), Campinas (Brazil), and Canberra (Australia). The above-mentioned large initiatives build directly on the efforts of these groups. In the Nordic region there is quite a bit of research on biodiversity, but the results and data are not in full use until there is also an integrative component that enables syntheses. This requires increased efforts by the Nordic GBIF Nodes for data sharing, but also methodological development. It is our belief that there needs to be at least one research group specialising in biodiversity informatics in the Nordic region that can match the best groups in the world and support the work of the Nordic GBIF Nodes. One important aspect of this proposal is to get started with building that capacity.

The partners are national nodes and data providers for GBIF. There will be an opportunity to raise awareness of GBIF in Russia, which is not yet a member. This is the first concrete project proposed jointly by a group of Nordic GBIF nodes, and as such it will be used to explore and strengthen the potential for further cooperation in areas of biodiversity informatics that could be addressed from a specifically Nordic perspective. This aspect alone will be very important for Nordic science infrastructure.

References

Clausen, K. & Ravn, A. 1999. STANDAT - Experience from developing and implementing a standardised format for exchange of data. Report, 76 p. European Environment Agency, Copenhagen. http://reports.eea.europa.eu/PROstandatXXX/en

Franz, N., Liu, X. & Peet, R. 2004. SEEK taxon tools. http://seek.ecoinformatics.org/Wiki.jsp?page=SeekTaxonTools

Hobern, D. & Saarenmaa, H. 2005. GBIF data portal strategy. 40 p. http://circa.gbif.net/Public/irc/gbif/dadi/library?l=/architecture/portal_strategy_1/

Life Sciences Identifiers Specification. 2004. OMG Final Adopted Specification, dtc/04-05-01, 32 p. Object Management Group, Inc. http://www.omg.org/cgi-bin/doc?dtc/04-05-01

Szekely, B. 2003. Build a life sciences collaboration network with LSID. http://www.ibm.com/developerworks/webservices/library/os-lsid2/

TDWG infrastructure project. 2005-2007. TDWG GUID Wiki. http://wiki.tdwg.org/twiki/bin/view/GUID/WebHome

Whitbred, G. 2007. An LSID Policy for the Australasian Biodiversity Federation. http://wiki.tdwg.org/twiki/bin/view/GUID/AustralasianBiodiversityFederationLsidPolicy

Ytow, N. 2002. Managing species names. The First Ebbe Nielsen Prize Acceptance Lecture. http://www.gbif.org/GBIF_org/prize/lecture2002.htm