Combining large datasets from several sources requires careful harmonization of taxon names. Non-standardized, incorrect, ambiguous, or synonymous taxonomic names can lead to unreliable results and conclusions. The bdc package includes tools to help standardize the names of major taxonomic groups (e.g., animals and plants) by comparing scientific names against one of 10 taxonomic databases. The taxonomic harmonization process borrows heavily from the taxadb package (Norman et al. 2020), which contains functions for querying millions of taxonomic names in a fast, automated, and consistent way using high-quality, locally stored taxonomic databases.
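The functions used to harmonize names (bdc_clean_names and bdc_query_names_taxadb) contain several additions to the taxadb package, including tools for cleaning and parsing scientific names, suggesting candidate names for misspelled names, replacing synonyms with accepted names, and reporting the results.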
The taxonomic harmonization is based on one taxonomic database chosen by the user. taxadb makes available the following taxonomic sources:
Abbreviation | Taxonomic database |
---|---|
col | Catalogue of Life |
fb | FishBase |
itis | Integrated Taxonomic Information System |
iucn | International Union for Conservation of Nature’s Red List |
gbif | Global Biodiversity Information Facility |
ncbi | National Center for Biotechnology Information |
ott | OpenTree Taxonomy |
tpl | The Plant List |
slb | SeaLifeBase |
wd | Wikidata |
⚠️IMPORTANT:
Note that some taxonomic databases may be momentarily unavailable in taxadb. Check ?bdc_query_names_taxadb for a list of available taxonomic databases.
Check here how to install the bdc package.
Read the database created in the Pre-filter module of the bdc package. It is also possible to read any dataset containing the fields required to run the package (more details here).
database <-
  readr::read_csv(here::here("Output/Intermediate/01_prefilter_database.csv"))
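Improperly formatted scientific names usually cannot be matched with valid names. To solve this issue, we developed bdc_clean_names, which unifies the writing style of scientific names. This optimizes taxonomic queries by increasing the probability of finding matching names. The tool flags and removes family names prepended to scientific names, terms denoting taxonomic uncertainty, and infraspecific terms; it also capitalizes the first letter of the generic name, replaces empty names with NA, and removes extra spaces.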
parse_names <-
  bdc_clean_names(sci_names = database$scientificName, save_outputs = FALSE)
#> >> Family names prepended to scientific names were flagged and removed from 0 records.
#> >> Terms denoting taxonomic uncertainty were flagged and removed from 1 records.
#> >> Other issues, capitalizing the first letter of the generic name, replacing empty names by NA, and removing extra spaces, were flagged and corrected or removed from 0 records.
#> >> Infraspecific terms were flagged and removed from 1 records.
#>
#> >> Scientific names were cleaned and parsed. Check the results in 'Output/Check/02_clean_names.csv'.
An example of bdc_clean_names output.
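To make the cleaning steps more concrete, the following base R sketch mimics a few of the transformations reported in the messages above. It is not how bdc_clean_names is implemented: the helper clean_name_sketch, its regular expression, the column name uncertain, and the example names are all assumptions made for illustration.

```r
# Hypothetical helper: a simplified stand-in for a few cleaning steps.
# This is NOT the implementation used by bdc_clean_names.
clean_name_sketch <- function(x) {
  # Flag names that carry a term of taxonomic uncertainty ("cf." or "aff.").
  pattern <- "(^|\\s)(cf|aff)\\.?(\\s|$)"
  uncertain <- grepl(pattern, x, ignore.case = TRUE)
  # Replace the uncertainty term with a single space.
  x <- gsub(pattern, " ", x, ignore.case = TRUE)
  # Collapse repeated spaces and trim both ends.
  x <- trimws(gsub("\\s+", " ", x))
  # Capitalize the first letter of the generic name (skip missing values).
  ok <- !is.na(x)
  x[ok] <- paste0(toupper(substr(x[ok], 1, 1)), substr(x[ok], 2, nchar(x[ok])))
  # Replace empty names with NA.
  x[!nzchar(x)] <- NA_character_
  data.frame(names_clean = x, uncertain = uncertain, stringsAsFactors = FALSE)
}

# Toy input (hypothetical names, not taken from the example database).
cleaned <- clean_name_sketch(c("Myrcia cf. splendens", "  ocotea   odorifera ", ""))
print(cleaned)
```

The sketch returns the cleaned name next to a flag so that the original information about uncertainty is not lost when the term is removed.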
Let’s merge the parsed names with the complete database. As the column ‘scientificName’ is in the same order in both tables (i.e., parse_names and database), we can append the parsed names to the database. Only the columns “names_clean” and “.uncer_terms” will be used in the downstream analyses, but you can check the full results of the name-parsing process in “Output/Check/02_parsed_names.qs”.
parse_names <-
  parse_names %>%
  dplyr::select(.uncer_terms, names_clean)

database <- dplyr::bind_cols(database, parse_names)
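Because columns are appended by position, it can be worth confirming that both tables really list the same names in the same order before binding them. The sketch below shows one way to do that on toy data. The objects toy_db and toy_parsed and their values are hypothetical, and base R's cbind() stands in for dplyr::bind_cols() only to keep the example free of dependencies.

```r
# Toy stand-ins for `database` and for the output of the cleaning step.
# All values are hypothetical.
toy_db <- data.frame(
  database_id = c("id_1", "id_2"),
  scientificName = c("Myrcia cf. splendens", "Ocotea odorifera"),
  stringsAsFactors = FALSE
)
toy_parsed <- data.frame(
  scientificName = c("Myrcia cf. splendens", "Ocotea odorifera"),
  .uncer_terms = c(FALSE, TRUE),
  names_clean = c("Myrcia splendens", "Ocotea odorifera"),
  stringsAsFactors = FALSE
)

# Positional binding is only safe when both tables have the same number of
# rows and the shared column is in the same order.
stopifnot(
  nrow(toy_db) == nrow(toy_parsed),
  identical(toy_db$scientificName, toy_parsed$scientificName)
)

# Append only the two columns used downstream.
toy_merged <- cbind(toy_db, toy_parsed[, c(".uncer_terms", "names_clean")])
print(toy_merged)
```

If the check fails, joining on a shared key is a safer alternative to binding by position.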
The taxonomic harmonization is based on one of the taxonomic authorities listed above. It starts by creating a local database: the taxadb package downloads, extracts, and imports the taxonomic database chosen by the user. The download may take some time, depending on the internet connection.
⚠️IMPORTANT: If you have problems downloading
databases, consider removing previous versions of the taxonomic
databases using fs::dir_delete(taxadb:::taxadb_dir())
query_names <- bdc_query_names_taxadb(
  sci_name = database$names_clean,
  replace_synonyms = TRUE, # replace synonyms by accepted names?
  suggest_names = TRUE, # try to find a candidate name for misspelled names?
  suggestion_distance = 0.9, # distance between the searched and suggested names
  db = "gbif", # taxonomic database
  rank_name = "Plantae", # a taxonomic rank
  rank = "kingdom", # name of the taxonomic rank
  parallel = FALSE, # should parallel processing be used?
  ncores = 2, # number of cores to be used in the parallelization process
  export_accepted = FALSE # save names linked to multiple accepted names
)
#> A total of 0 NA was/were found in sci_name.
#>
#> 115 names queried in 3.1 minutes
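To give an intuition for what a threshold such as suggestion_distance = 0.9 can mean, the sketch below scores candidate names against a misspelled query and accepts the best candidate only if its score reaches the threshold. The score used here (one minus the Levenshtein edit distance divided by the length of the longer name) is an assumption chosen for illustration; whether bdc computes its measure this way is not confirmed by this document. The helper name_similarity and the candidate names are hypothetical.

```r
# Hypothetical helper: similarity in [0, 1] derived from the Levenshtein
# edit distance. That bdc uses this exact measure is an assumption.
name_similarity <- function(query, candidates) {
  edits <- as.vector(utils::adist(query, candidates))
  1 - edits / pmax(nchar(query), nchar(candidates))
}

# Toy reference names (hypothetical) and a query with one missing letter.
candidates <- c("Myrcia splendens", "Myrcia sylvatica", "Ocotea odorifera")
query <- "Myrcia splendes"

scores <- name_similarity(query, candidates)
best <- which.max(scores)

# Accept the best candidate only if it reaches the threshold.
threshold <- 0.9
suggestion <- if (scores[best] >= threshold) candidates[best] else NA_character_
print(data.frame(candidates, scores))
print(suggestion)
```

A higher threshold makes suggestions more conservative: fewer misspellings are resolved, but the risk of matching a different, similarly spelled taxon is lower.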
Now merge the results of the taxonomic harmonization with the original database. Before that, let’s rename the column containing the original scientific names to “verbatim_scientificName”. From now on, “scientificName” corresponds to the verified names (resulting from the name harmonization process). As the columns “original_search” in “query_names” and “names_clean” are equal, only the first will be kept.
database <-
  database %>%
  dplyr::rename(verbatim_scientificName = scientificName) %>%
  dplyr::select(-names_clean) %>%
  dplyr::bind_cols(., query_names)
The report is based on the column notes containing
the results of the name harmonization process. The notes can be grouped
into two categories: accepted names and those with a taxonomic issue or
warning, needing further inspections. Accepted names are returned as
“valid” in the column “Description”. The report can be automatically
saved if save_report = TRUE.
report <-
  bdc_create_report(data = database,
                    database_id = "database_id",
                    workflow_step = "taxonomy",
                    save_report = FALSE)

report
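The grouping of notes into accepted names and names needing inspection can be illustrated with a small base R summary. This is only a sketch of the idea, not the code behind bdc_create_report: the note strings in toy_notes, the label "check needed", and the column names n_records and percent are hypothetical.

```r
# Hypothetical notes; the strings actually produced by bdc may differ.
toy_notes <- c("accepted", "accepted", "not found", "accepted", "multiple accepted")

# Count records per note.
counts <- as.data.frame(table(notes = toy_notes), stringsAsFactors = FALSE)
names(counts)[2] <- "n_records"

# Accepted names are described as "valid"; everything else needs a closer
# look ("check needed" is a placeholder label chosen for this sketch).
counts$Description <- ifelse(counts$notes == "accepted", "valid", "check needed")
counts$percent <- round(100 * counts$n_records / sum(counts$n_records), 1)
print(counts)
```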
It is also possible to filter out records with taxonomic status different from “accepted”. Such records may be potentially resolved manually.
unresolved_names <-
  bdc_filter_out_names(data = database,
                       col_name = "notes",
                       taxonomic_status = "accepted",
                       opposite = TRUE)
Save the table containing unresolved names
unresolved_names %>%
  readr::write_csv(., here::here("Output/Check/02_unresolved_names.csv"))
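The role of the opposite argument can be pictured with a small stand-in function that either keeps the records matching a status or keeps everything else. The helper filter_by_status_sketch and the toy table are hypothetical, and the sketch uses exact matching on the notes column, which may be simpler than what bdc_filter_out_names actually does.

```r
# Hypothetical helper: keep rows whose status is listed in `status`, or,
# when opposite = TRUE, keep all other rows. Exact matching is an assumption.
filter_by_status_sketch <- function(data, col_name, status, opposite = FALSE) {
  matches <- data[[col_name]] %in% status
  if (opposite) {
    data[!matches, , drop = FALSE]
  } else {
    data[matches, , drop = FALSE]
  }
}

# Toy table with hypothetical values.
toy_db <- data.frame(
  scientificName = c("Myrcia splendens", "Ocotea odorifera", "Myrcia sp"),
  notes = c("accepted", "accepted", "not found"),
  stringsAsFactors = FALSE
)

# Records to inspect manually (status different from "accepted").
toy_unresolved <- filter_by_status_sketch(toy_db, "notes", "accepted", opposite = TRUE)
# Records with accepted names.
toy_resolved <- filter_by_status_sketch(toy_db, "notes", "accepted", opposite = FALSE)
print(toy_unresolved)
```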
It is possible to remove records with unresolved or invalid names to get a ‘clean’ database. However, to ensure that all records will be evaluated in all the data quality tests (i.e., tests of the taxonomic, spatial, and temporal modules of the package), potentially erroneous or suspect records will be removed in the final module of the package.
# output <-
#   bdc_filter_out_names(
#     data = database,
#     col_name = "notes",
#     taxonomic_status = "accepted",
#     opposite = FALSE
#   )
You can use qs::qsave() instead of write_csv to save a large database in a compressed format.
# use qs::qsave() to save the database in a compressed format and then qs::qread() to load the database
database %>%
  readr::write_csv(.,
                   here::here("Output", "Intermediate", "02_taxonomy_database.csv"))
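For completeness, here is a minimal round trip with qs::qsave() and qs::qread() on a toy table. The table toy_db is hypothetical, and a temporary file is used instead of the project folders so that the sketch leaves nothing behind. The example runs only if the qs package is installed.

```r
# Toy data standing in for the harmonized database (hypothetical values).
toy_db <- data.frame(
  database_id = c("id_1", "id_2"),
  scientificName = c("Myrcia splendens", "Ocotea odorifera"),
  stringsAsFactors = FALSE
)

# Temporary file, so nothing is written to the project folders.
path <- tempfile(fileext = ".qs")

# NA means "not checked" (qs not installed).
roundtrip_ok <- NA
if (requireNamespace("qs", quietly = TRUE)) {
  qs::qsave(toy_db, path)        # save in a compressed format
  restored <- qs::qread(path)    # load it back
  roundtrip_ok <- isTRUE(all.equal(toy_db, restored))
  unlink(path)
}
print(roundtrip_ok)
```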