Taxon‐specific BLAST percent identity thresholds for identification of unknown sequences using metabarcoding (Academic Article)

Abstract

  • The identification of organisms in environmental samples using metabarcoding relies on factors such as taxonomic assignment methods, genetic markers, reference databases and confidence thresholds for taxonomic assignment. Because lineages evolve at different rates, a global threshold (e.g. an unknown sequence is assigned to a species name using a 97% BLAST percent identity threshold) is likely not optimal for accurate taxonomic assignments across all taxa, but no study has systematically evaluated taxon‐specific confidence thresholds. We developed taxon‐specific confidence thresholds for marine eukaryotes to improve the performance of BLAST‐based assignment methods. We tested our approach using published whole community DNA datasets and evaluated factors affecting the accuracy of taxonomic assignments. We provide confidence thresholds for multiple taxonomic levels using the mitochondrial cytochrome c oxidase subunit I (COI) and large ribosomal RNA (lrRNA or 16S rRNA in metazoans) to achieve five false positive error rates (0%, 1%, 5%, 10% and 20%). Additionally, we test two methods for picking the final BLAST hit: the best‐hit and best‐shared (similar to lowest common ancestor) methods. The best‐shared approach reduced assignment errors for lower taxonomic levels, and therefore all subsequent analyses used this method. The estimated taxon‐specific thresholds required to achieve reliable assignments at predefined error rates varied among phyla, genetic markers and taxonomic levels. With a 5% error rate, optimized thresholds identified a higher number of sequences at the phylum and family levels than global thresholds. For the genus and species levels, global thresholds identified more sequences to lower taxonomic levels, but with a false positive error rate that increased from 14% for family to 21% for genus and 44% for species. Our results suggest that commonly used global thresholds may be too relaxed and lead to misidentifications at lower taxonomic levels. Using taxon‐optimized thresholds can help define an acceptable error rate (which can be study‐dependent), enable the identification of more unknown sequences at higher taxonomic levels or in specific groups, and prevent misclassifications at lower taxonomic levels. We developed an R Shiny application to filter users' BLAST results using predefined error rates and their choice of global or optimized thresholds.
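The filtering idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' R Shiny application or their exact procedure. The "best-shared" step here is an assumed simplification (keep the deepest taxonomy shared by all hits tied at the top percent identity), and every threshold value, rank list, and function name below is a hypothetical placeholder rather than a value from the paper, apart from the 97% species-level global threshold that the abstract cites as an example.

```python
from typing import Dict, List, Tuple

# Taxonomic ranks from broadest to most specific (assumed rank list).
RANKS = ["phylum", "class", "order", "family", "genus", "species"]

# Placeholder global thresholds (percent identity). Only the 97% species
# value echoes the abstract's example; the rest are invented for illustration.
GLOBAL_THRESHOLDS: Dict[str, float] = {
    "phylum": 80.0, "class": 82.0, "order": 85.0,
    "family": 90.0, "genus": 95.0, "species": 97.0,
}

# Placeholder taxon-specific thresholds keyed by (phylum, rank).
# These numbers are hypothetical and NOT taken from the paper.
TAXON_THRESHOLDS: Dict[Tuple[str, str], float] = {
    ("Cnidaria", "species"): 98.5,
    ("Mollusca", "family"): 88.0,
}


def best_shared(hits: List[dict]) -> Tuple[float, Dict[str, str]]:
    """Return the top percent identity and the taxonomy shared by all top hits."""
    top = max(h["pident"] for h in hits)
    tied = [h for h in hits if h["pident"] == top]
    shared: Dict[str, str] = {}
    for rank in RANKS:
        names = {h["lineage"].get(rank) for h in tied}
        # Stop at the first rank where the tied hits disagree or lack a name.
        if len(names) != 1 or None in names:
            break
        shared[rank] = names.pop()
    return top, shared


def assign(hits: List[dict], taxon_specific: bool = True) -> Dict[str, str]:
    """Keep each rank of the shared lineage only while identity meets its threshold."""
    if not hits:
        return {}
    pident, shared = best_shared(hits)
    phylum = shared.get("phylum")
    accepted: Dict[str, str] = {}
    for rank in RANKS:
        if rank not in shared:
            break
        threshold = GLOBAL_THRESHOLDS[rank]
        if taxon_specific:
            # Fall back to the global value when no taxon-specific one exists.
            threshold = TAXON_THRESHOLDS.get((phylum, rank), threshold)
        if pident < threshold:
            break
        accepted[rank] = shared[rank]
    return accepted


# Hypothetical BLAST hits for one query sequence.
lineage = {"phylum": "Cnidaria", "class": "Hydrozoa", "order": "Siphonophorae",
           "family": "Agalmatidae", "genus": "Nanomia", "species": "Nanomia bijuga"}
single_hit = [{"pident": 97.5, "lineage": lineage}]
tied_hits = [
    {"pident": 99.0, "lineage": lineage},
    {"pident": 99.0, "lineage": {**lineage, "species": "Nanomia cara"}},
]
```

With these placeholder values, `assign(single_hit, taxon_specific=False)` keeps the species name because 97.5 clears the 97% global threshold, while `assign(single_hit)` stops at genus because the stricter hypothetical Cnidaria species threshold is not met. For `tied_hits`, the two top hits disagree at species level, so the assignment stops at genus regardless of the thresholds.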

Authors

  • Pappalardo, Paula
  • Hemmi, Jan M.
  • Machida, Ryuji J.
  • Leray, Matthieu
  • Collins, Allen G.
  • Osborn, Karen J.

Publication date

  • 2025

Number of pages

  • 14

Start page

  • 2380

End page

  • 2394

Volume

  • 16

Issue

  • 10