ABSTRACT
The course will explore the use of large-scale biodiversity databases for macroecological research. A new e-infrastructure for biodiversity is emerging, which makes it possible and practical to merge data from a large number of distributed sources. Instrumental to such progress is the ability to harmonise content in the integrated database, where the use of scientific nomenclature is one of the hardest aspects to deal with. This is being solved by linking scientific names of organisms to their underlying taxonomic concepts, and by automatic processing of their globally unique identifiers. This is the topic of an existing joint NordForsk project of the Nordic nodes of the Global Biodiversity Information Facility (GBIF). During the course the students will explore the utility of the new e-infrastructure and integrate the available Nordic databases of butterflies and moths. That unique integrated resource will be used to study the impact of climate change on the butterfly and moth fauna. The course will be held in Joensuu, Eastern Finland. The teachers include several world-leading researchers in macroecology whose advice and review of the results will help to build Nordic strength in this important topic.
DATES: 17-27 May 2011
VENUE: University of Eastern Finland, Joensuu campus
ATTENDEES: 15 students total from all Nordic countries, 5-10 teachers.
(Proposal submitted to NordForsk 2008-04-20. Proposal accepted 2008-07-30. This text abridged for publication 2008-11-25. Updated 2011-02-18.)
PROPOSAL TEXT
The widest e-infrastructure for biodiversity is that of the Global Biodiversity Information Facility (GBIF), which is currently making 150 million data records available from 1700 databases on 210 provider computers in 40 countries. About 15% of these are Nordic data. GBIF has enabled access to these databases through the application of protocols and data exchange standards of the Biodiversity Standards Organisation TDWG, but little work has yet been done to prove the usefulness of such a large data pool.
In order to prove utility, the content of the databases needs to be integrated for analytical purposes, for instance for macro-ecological and biogeographical analyses such as studying the impact of global warming on the abundance and distribution of species. The integration step is still tedious, as the content in biodiversity databases is largely not harmonised. For instance, among the 150 million records of the GBIF Data Portal’s central cache, 2.4 million different scientific names can be found, but only 300,000 of them are known to the global Catalogue of Life. Every database can use its own conventions for taxonomic concepts and names of organisms, georeferencing, temporal aspects, record basis, individual counts, etc. When a scientist wants to analyse, for instance, trends of fauna in a major region such as Northern Europe, the nomenclature used across databases in particular must be unified, which takes time and expert knowledge. Therefore most studies of this kind have so far used data from one database or one country only.
Scientific names of organisms pose some of the hardest problems in this context. Basically, scientific names are insufficient to uniquely denote the organisms in question. For instance, among German mosses the taxonomic concept is ambiguous for 65% of the taxa if only the name is used to refer to them. As a solution, unique identifiers for the taxonomic concepts, based on the Life Sciences Identifier (LSID) standard, have been recommended by GBIF and TDWG. NordForsk therefore funded a joint e-infrastructure project of the Nordic GBIF Nodes which aims at coming up with LSIDs for Nordic taxa, starting from organism groups that are often used for analysing global change, such as butterflies and moths (Lepidoptera). The project is now underway, and is expected to deliver the LSID e-infrastructure in 1-2 years' time, including its implementation in selected observation databases. However, the use of the LSIDs beyond this has not been included in the project. This training course aims at exploring the usage of the generated LSIDs in data integration.
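To make the idea of an identifier for a taxonomic concept more concrete, the following is a minimal sketch of how an LSID could be split into its parts. It assumes only the general LSID syntax, urn:lsid:authority:namespace:object with an optional revision. The authority "example.org", the namespace "taxconcept" and the helper names are hypothetical and do not come from the NordForsk project.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    """Parts of a Life Sciences Identifier (LSID)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.strip().split(":")
    # The "urn" and "lsid" prefixes are compared case-insensitively.
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"LSID has an empty component: {text!r}")
    # The revision is optional; an empty trailing part counts as absent.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(authority, namespace, object_id, revision)


# Hypothetical identifier, for illustration only.
EXAMPLE = parse_lsid("urn:lsid:example.org:taxconcept:12345")
```

Two databases that record the same organism under different names can then compare such parsed identifiers instead of comparing name strings.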
In practical terms the course will begin by examining the results of the current project and the available e-infrastructure. The organisation of taxonomic databases (which contain names and information on the sense in which each name has been used) and the inclusion of LSIDs in them are discussed. Data interchange between taxonomic databases and observation databases is explored, in particular how to include the LSIDs in observation databases in order to remove ambiguities of the taxonomic concept and to resolve synonyms automatically. As a joint exercise, multiple observation databases of Lepidoptera from all Nordic countries are then brought together into a single large-scale data warehouse.
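The integration step described above can be sketched as follows. This is only an illustration under stated assumptions: the lookup table, the LSIDs, the database names and the record layout are all hypothetical, and a real taxonomic database would supply the mapping from names to concepts. The sketch shows how records filed under a synonym end up under the same concept identifier, and how names that cannot be resolved are set aside for expert review rather than silently dropped.

```python
from collections import defaultdict

# Hypothetical lookup table from scientific name to concept identifier.
# "Cynthia cardui" is listed as a synonym of "Vanessa cardui" for illustration.
NAME_TO_CONCEPT = {
    "Papilio machaon": "urn:lsid:example.org:taxconcept:1",
    "Vanessa cardui": "urn:lsid:example.org:taxconcept:2",
    "Cynthia cardui": "urn:lsid:example.org:taxconcept:2",
}


def integrate(sources):
    """Merge records from several databases, keyed by concept identifier.

    `sources` maps a database name to a list of records, each a dict with
    at least a "name" key. Returns (merged, unresolved).
    """
    merged = defaultdict(list)
    unresolved = []
    for source_name, records in sources.items():
        for record in records:
            # Collapse repeated whitespace before the lookup.
            name = " ".join(record["name"].split())
            concept = NAME_TO_CONCEPT.get(name)
            # Keep track of which database each record came from.
            tagged = dict(record, source=source_name)
            if concept is None:
                unresolved.append(tagged)
            else:
                merged[concept].append(tagged)
    return dict(merged), unresolved


# Hypothetical observation records from two databases.
SOURCES = {
    "db_fi": [
        {"name": "Vanessa cardui", "year": 2006},
        {"name": "Papilio machaon", "year": 2006},
    ],
    "db_se": [
        {"name": "Cynthia  cardui", "year": 2007},
        {"name": "Unknown species", "year": 2007},
    ],
}

MERGED, UNRESOLVED = integrate(SOURCES)
```

Keeping the source name on every record lets later analyses trace an observation back to its original database.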
To prove utility, this resource will then be used for macro-ecological exploration. Trends of abundance and changes of distribution for large subsets of the fauna will be investigated. In particular, northern and high-arctic species will be studied, as first reports now indicate that they may be in trouble because of the warming of the climate. Multiple techniques will be exploited, such as trend analysis using the popular TRIM package and other statistical techniques, ecological niche modelling with popular algorithms such as GARP and MAXENT, and data mining. The course is completed by looking at where the leading research in the field is moving, and by applying the data in the data warehouse to study scenarios for the high-arctic butterfly fauna of the Kilpisjärvi region.
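As a rough illustration of what a trend of abundance means in practice, the sketch below estimates an average yearly rate of change by fitting a straight line to the logarithm of yearly counts. This is a simplified stand-in and not the TRIM method itself: TRIM fits log-linear Poisson models and handles missing counts, which this sketch does not. The function name and the example numbers are hypothetical.

```python
import math


def loglinear_trend(years, counts):
    """Return the average multiplicative change per year.

    Fits log(count) = a + b * year by ordinary least squares and
    returns exp(b). A value below 1 indicates a decline.
    """
    if len(years) != len(counts) or len(years) < 2:
        raise ValueError("need at least two (year, count) pairs")
    if any(c <= 0 for c in counts):
        raise ValueError("counts must be positive to take logarithms")
    logs = [math.log(c) for c in counts]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("years must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    return math.exp(sxy / sxx)


# Hypothetical counts that decline by 10% per year.
YEARS = [2001, 2002, 2003, 2004, 2005]
COUNTS = [100 * 0.9 ** i for i in range(5)]
RATE = loglinear_trend(YEARS, COUNTS)
```

With real monitoring data, zero counts and missing years are common, which is one reason dedicated tools such as TRIM are used instead of a plain regression on logarithms.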
Probably the leading laboratory for macro-ecological education and training in the Nordic countries is the Danish Center for Macroecology led by Professor Carsten Rahbek. Macroecology seeks to explain the large-scale spatial distribution of biological diversity on Earth from first principles of evolution, ecology and historical contingency. One of the topics is the question of how climate acts as a principal factor affecting the distribution of life on Earth in the past, today and in the future, see http://www.macroecology.ku.dk/. Macroecological research is probably the most important immediate user of large integrated biodiversity databases.
In order to boost the use of the data this way, GBIF has carried out courses in ecological niche modelling around the world. Ecological niche modelling is one of the most important techniques used in macroecology. The latest event was held in Warsaw in 2007 where several Nordic trainees attended, see http://www.ksib.pl/enm2007/. An idea to organise such a course in the Nordic region has since been refined by the attendees, and is realised herewith.
Besides macro-ecological research, the course also falls into the discipline of biodiversity informatics. Biodiversity informatics is an emerging science whose purpose is the application of information technologies to the management, algorithmic exploration, analysis and interpretation of primary data regarding life, particularly at the species level of organization (Soberon and Peterson 2004, see http://en.wikipedia.org/wiki/Biodiversity_Informatics). At the moment the only course in biodiversity informatics in the Nordic region is given at the University of Helsinki where it started in 2008.
Hundreds of studies have been published lately using these techniques to study the effects of global change. These studies, however, have mainly concerned South and North American species and problems, and have been carried out using data from a limited number of species and data sources in just one country, such as Mexico or Brazil. These methods have not been used widely in the Nordic countries. Nowhere have they been applied with large-scale automatic data integration, even though GBIF now makes such a global repository available. An openly accessible integrated data warehouse of GBIF data in the Nordic countries would therefore be a unique resource that would probably boost macro-ecological research and biodiversity informatics in the region.
Macroecological research training has been working well in Denmark, but is somewhat less advanced in the other Nordic countries. This course will be a launchpad for related activities in those countries. While the Danish group has focussed on birds, this course will also cover other groups of organisms.
As a result of the course, the attendees will have learned fundamental GBIF techniques in data source interoperability and primary data integration. They will have learned to use taxonomic databases and to deal with taxonomic concepts. They will have been introduced to state-of-the-art methods used for global change analysis. The Lepidoptera data resource that will be put together during the course will remain available afterwards for the students and others to explore and use in research. These are all key elements for introducing biodiversity informatics research into the Nordic academic curricula.
Still, there are only a few curricula in biodiversity informatics anywhere in the world. This course will continue the ongoing effort of establishing this science in the Nordic region. This is a good opportunity to build on Nordic strength, as the countries in the Nordic region have very good databases about biodiversity, but full use is not yet being made of them in all cases.
As examples of some of the leading biodiversity informatics research groups in the world we can mention those at the University of Kansas http://www.nhm.ku.edu/, the Brazilian Reference Center on Environmental Information http://www.cria.org.br/, the Freie Universität Berlin http://www.bgbm.org/BioDivInf/, and the Costa Rican INBio http://www.inbio.ac.cr/. Through cooperation, systematic scaling up of training and education, and building on the Nordic strengths of large databases and macroecological research, the Nordic GBIF nodes could possibly achieve a similar level in about 5-10 years.
Scientists from these and other leading biodiversity informatics laboratories and macroecological research groups will be invited as instructors of the course. As the course will only commence in two years' time, detailed commitments from individual speakers have not yet been requested. We believe that making detailed commitments would be premature at this point, as the collaborations around the emerging e-infrastructure are still being formed.
Several of the teachers and organisers are women. The choice of attendees for the course will be handled through an open selection process, and we will ensure a significant proportion of both sexes among the participants.
17 May: Introductions, practicalities, problem statement and discussion, biodiversity databases, GBIF/TDWG, open access protocols. Overview of the results of the LSID project. Taxonomic databases in general and the Nordic Lepidoptera databases in particular.
18 May: Nordic observation databases of Lepidoptera, and their LSID implementations. Merger of the target databases and their exploration and data mining. Setting the scene for coursework.
19 May: Introduction to R and TRIM, begin use of R and TRIM for coursework
20 May: Preparation of analysis and report on use of R and TRIM
21 May: Day off (or spare day for optional programme)
22 May: Excursion to Koli national park
23 May: Introduction to ecological niche modelling
24 May: Introduction to GARP and MAXENT, case studies
25 May: Use of MAXENT for own analysis (All students)
26 May: Macroecology seminar
- Presentation of results using MAXENT
- Presentations of macroecological projects by visiting scientists
27 May: Seminar continues
- Adjourn in the afternoon
The venue will be Joensuu Science Park http://www.carelian.fi/en/facility+services/network+oasis/ which is located on the University of Eastern Finland campus in downtown Joensuu http://www.uef.fi/uef/english
Joensuu is located 450 km north-east of Helsinki. The trip takes about 40 minutes by air or 4½ hours by train.
Most teachers will stay for one week of this two-week event. Teachers include representatives of macro-ecological and biodiversity informatics research projects in Europe and the USA.
Hannu Saarenmaa and staff of Digitarium, the Digitisation Centre of the Finnish Museum of Natural History and the University of Eastern Finland.