    • A comprehensive phylogenomic platform for exploring the angiosperm tree of life

      Baker, William J.; Bailey, Paul; Barber, Vanessa; Barker, Abigail; Bellot, Sidonie; Bishop, David; Botigué, Laura R.; Brewer, Grace E.; Carruthers, Tom; Clarkson, James J.; et al. (Oxford University Press, 2021-05-13)
      The tree of life is the fundamental biological roadmap for navigating the evolution and properties of life on Earth, and yet remains largely unknown. Even angiosperms (flowering plants) are fraught with data gaps, despite their critical role in sustaining terrestrial life. Today, high-throughput sequencing promises to significantly deepen our understanding of evolutionary relationships. Here, we describe a comprehensive phylogenomic platform for exploring the angiosperm tree of life, comprising a set of open tools and data based on the 353 nuclear genes targeted by the universal Angiosperms353 sequence capture probes. The primary goals of this paper are to (i) document our methods, (ii) describe our first data release and (iii) present a novel open data portal, the Kew Tree of Life Explorer (https://treeoflife.kew.org). We aim to generate novel target sequence capture data for all genera of flowering plants, exploiting natural history collections such as herbarium specimens, and augment it with mined public data. Our first data release, described here, is the most extensive nuclear phylogenomic dataset for angiosperms to date, comprising 3,099 samples validated by DNA barcode and phylogenetic tests, representing all 64 orders, 404 families (96%) and 2,333 genera (17%). A "first pass" angiosperm tree of life was inferred from the data, which totalled 824,878 sequences, 489,086,049 base pairs, and 532,260 alignment columns, for interactive presentation in the Kew Tree of Life Explorer. This species tree was generated using methods that were rigorous, yet tractable at our scale of operation. Despite limitations pertaining to taxon and gene sampling, gene recovery, models of sequence evolution and paralogy, the tree strongly supports existing taxonomy, while challenging numerous hypothesized relationships among orders and placing many genera for the first time.
The validated dataset, species tree and all intermediates are openly accessible via the Kew Tree of Life Explorer and will be updated as further data become available. This major milestone towards a complete tree of life for all flowering plant species opens doors to a highly integrated future for angiosperm phylogenomics through the systematic sequencing of standardised nuclear markers. Our approach has the potential to serve as a much-needed bridge between the growing movement to sequence the genomes of all life on Earth and the vast phylogenomic potential of the world's natural history collections.
    • Factors affecting targeted sequencing of 353 nuclear genes from herbarium specimens spanning the diversity of angiosperms

      Brewer, Grace E.; Clarkson, James J.; Maurin, Olivier; Zuntini, Alexandre R.; Barber, Vanessa; Bellot, Sidonie; Biggs, Nicola; Cowan, Robyn S.; Davies, Nina M.; Dodsworth, Steven; et al. (Frontiers, 2019-09-12)
      The world’s herbaria collectively house millions of diverse plant specimens, including endangered or extinct species and type specimens. Unlocking genetic data from the typically highly degraded DNA obtained from herbarium specimens was difficult until the arrival of high-throughput sequencing approaches, which can be applied to low quantities of severely fragmented DNA. Target enrichment involves using short molecular probes that hybridise and capture genomic regions of interest for high-throughput sequencing. In this study on herbariomics, we used this targeted sequencing approach and the Angiosperms353 universal probe set to recover up to 351 nuclear genes from 435 herbarium specimens that are up to 204 years old and span the breadth of angiosperm diversity. We show that on average 207 genes were successfully retrieved from herbarium specimens, although the mean number of genes retrieved and target enrichment efficiency are significantly higher for silica gel-dried specimens. Forty-seven target nuclear genes were recovered from a herbarium specimen of the critically endangered St Helena boxwood, Mellissia begoniifolia, collected in 1815. Herbarium specimens yield significantly less high molecular weight DNA than silica gel-dried specimens, and genomic DNA quality declines with sample age, which is negatively correlated with target enrichment efficiency. Climate, taxon-specific traits, and collection strategies additionally impact target sequence recovery. We also detected taxonomic bias in targeted sequencing outcomes for the 10 most numerous angiosperm families that were investigated in depth.
We recommend that 1) for species distributed in wet tropical climates, silica gel-dried specimens should be used preferentially, 2) for species distributed in seasonally dry tropical climates, herbarium and silica gel-dried specimens yield similar results, and either collection can be used, 3) taxon-specific traits should be explored and established for effective optimisation of taxon-specific studies using herbarium specimens, 4) all herbarium sheets should, in future, be annotated with details of the preservation method used, 5) long-term storage of herbarium specimens should be in stable low humidity and low temperature environments, and 6) targeted sequencing with universal probes, such as Angiosperms353, should be investigated closely as a new approach for DNA barcoding that will ensure better exploitation of herbarium specimens than traditional Sanger sequencing approaches.
    • On the origin of giant seeds: the macroevolution of the double coconut (Lodoicea maldivica) and its relatives (Borasseae, Arecaceae)

      Bellot, Sidonie; Bayton, Ross P.; Couvreur, Thomas L.P.; Dodsworth, Steven; Eiserhardt, Wolf L.; Guignard, Maite S.; Pritchard, Hugh W.; Roberts, Lucy; Toorop, Peter E.; Baker, William J. (Wiley, 2020-06-16)
      Seed size shapes plant evolution and ecosystems, and may be driven by plant size and architecture, dispersers, habitat and insularity. How these factors influence the evolution of giant seeds is unclear, as are the rate of evolution and the biogeographical consequences of giant seeds. We generated DNA and seed size data for the palm tribe Borasseae (Arecaceae) and its relatives, which show a wide diversity in seed size and include the double coconut (Lodoicea maldivica), the largest seed in the world. We inferred their phylogeny, dispersal history and rates of change in seed size, and evaluated the possible influence of plant size, inflorescence branching, habitat and insularity on these changes. Large seeds were involved in 10 oceanic dispersals. Following theoretical predictions, we found that: taller plants with fewer-branched inflorescences produced larger seeds; seed size tended to evolve faster on islands (except Madagascar); and seeds of shade-loving Borasseae tended to be larger. Plant size and inflorescence branching may constrain seed size in Borasseae and their relatives. The possible roles of insularity, habitat and dispersers are difficult to disentangle. Evolutionary contingencies better explain the gigantism of the double coconut than unusually high rates of seed size increase.
    • A roadmap for global synthesis of the plant tree of life

      Eiserhardt, Wolf L.; Antonelli, Alexandre; Bennett, Dominic J.; Botigue, Laura R.; Burleigh, J. Gordon; Dodsworth, Steven; Enquist, Brian J.; Forest, Felix; Kim, Jan T.; Kozlov, Alexey M.; et al. (Wiley, 2018-03-31)
      Providing science and society with an integrated, up-to-date, high quality, open, reproducible and sustainable plant tree of life would be a huge service that is now coming within reach. However, synthesizing the growing body of DNA sequence data in the public domain and disseminating the trees to a diverse audience are often not straightforward due to numerous informatics barriers. While big synthetic plant phylogenies are being built, they remain static and become quickly outdated as new data are published and tree-building methods improve. Moreover, the body of existing phylogenetic evidence is hard to navigate and access for non-experts. We propose that our community of botanists, tree builders, and informaticians should converge on a modular framework for data integration and phylogenetic analysis, allowing easy collaboration, updating, data sourcing and flexible analyses. With support from major institutions, this pipeline should be re-run at regular intervals, storing trees and their metadata long-term. Providing the trees to a diverse global audience through user-friendly front ends and application development interfaces should also be a priority. Interactive interfaces could be used to solicit user feedback and thus improve data quality and to coordinate the generation of new data. We conclude by outlining a number of steps that we suggest the scientific community should take to achieve global phylogenetic synthesis.
    • A universal probe set for targeted sequencing of 353 nuclear genes from any flowering plant designed using k-medoids clustering

      Johnson, Matthew G.; Pokorny, Lisa; Dodsworth, Steven; Botigue, Laura R.; Cowan, Robyn S.; Devault, Alison; Eiserhardt, Wolf L.; Epitawalage, Niroshini; Forest, Felix; Kim, Jan T.; et al. (Oxford University Press (OUP), 2018-12-10)
      Sequencing of target-enriched libraries is an efficient and cost-effective method for obtaining DNA sequence data from hundreds of nuclear loci for phylogeny reconstruction. Much of the cost of developing targeted sequencing approaches is associated with the generation of preliminary data needed for the identification of orthologous loci for probe design. In plants, identifying orthologous loci has proven difficult due to a large number of whole-genome duplication events, especially in the angiosperms (flowering plants). We used multiple sequence alignments from over 600 angiosperms for 353 putatively single-copy protein-coding genes identified by the One Thousand Plant Transcriptomes Initiative to design a set of targeted sequencing probes for phylogenetic studies of any angiosperm group. To maximize the phylogenetic potential of the probes, while minimizing the cost of production, we introduce a k-medoids clustering approach to identify the minimum number of sequences necessary to represent each coding sequence in the final probe set. Using this method, 5–15 representative sequences were selected per orthologous locus, representing the sequence diversity of angiosperms more efficiently than if probes were designed using available sequenced genomes alone. To test our approximately 80,000 probes, we hybridized libraries from 42 species spanning all higher-order groups of angiosperms, with a focus on taxa not present in the sequence alignments used to design the probes. Out of a possible 353 coding sequences, we recovered an average of 283 per species and at least 100 in all species. Differences among taxa in sequence recovery could not be explained by relatedness to the representative taxa selected for probe design, suggesting that there is no phylogenetic bias in the probe set.
Our probe set, which targeted 260 kbp of coding sequence, achieved a median recovery of 137 kbp per taxon in coding regions, a maximum recovery of 250 kbp, and an additional median of 212 kbp per taxon in flanking non-coding regions across all species. These results suggest that the Angiosperms353 probe set described here is effective for any group of flowering plants and would be useful for phylogenetic studies from the species level to higher-order groups, including the entire angiosperm clade itself.
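
The idea of choosing a small set of representative sequences by k-medoids clustering, mentioned in the abstract above, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes pre-aligned sequences of equal length, uses the proportion of differing sites as the distance, and applies a simple swap-based heuristic that may stop at a local optimum. The function names, the toy sequences and the `max_dist` threshold are all hypothetical and chosen only for illustration.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Full pairwise distance matrix as a list of lists."""
    return [[p_distance(a, b) for b in seqs] for a in seqs]


def total_cost(dist, medoids):
    """Sum over all sequences of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def k_medoids(dist, k):
    """Swap-based k-medoids heuristic on a precomputed distance matrix.

    Starts from the first k items and accepts any medoid/non-medoid swap
    that strictly lowers the total cost, until no swap helps.
    """
    n = len(dist)
    medoids = list(range(k))
    cost = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_cost(dist, trial)
                if trial_cost < cost:
                    medoids, cost, improved = trial, trial_cost, True
    return sorted(medoids), cost


def min_representatives(dist, max_dist):
    """Smallest k (under this heuristic) such that every sequence lies
    within max_dist of a medoid. max_dist is a hypothetical threshold."""
    n = len(dist)
    for k in range(1, n + 1):
        medoids, _ = k_medoids(dist, k)
        if all(min(dist[i][m] for m in medoids) <= max_dist for i in range(n)):
            return medoids
    return list(range(n))


# Toy, made-up aligned sequences forming two obvious groups.
toy_seqs = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT",
            "CCCCCCCC", "CCCCCCCG", "CCCCCCGG"]
toy_dist = distance_matrix(toy_seqs)
print(min_representatives(toy_dist, max_dist=0.25))
```

On these toy sequences the sketch selects one representative from each of the two groups. For real alignments, a library implementation of k-medoids and a distance measure suited to the data would likely be preferable.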