De novo assembly of arthropod genomes of public health importance
Long-range de novo genome assembly from short sequence reads is still one of the greatest challenges in genomics despite vast and rapid improvements in obtaining those short reads. Numerous viral, bacterial, and parasitic agents causing human and veterinary diseases are carried and transmitted by arthropods including but not limited to ticks, mosquitoes, triatomines, sandflies, mites, lice, and fleas. The prevalence, diversity, and range of these arthropod disease vectors render them an important subject of study for improving our understanding of their roles in human disease transmission, and consequently for the prevention of those diseases. The impact of assembly of the human genome sequence on medicine has been tremendous. Similarly, high-quality genome assemblies of arthropod vectors are a critical precondition to new approaches to studying vector-pathogen interactions and for controlling vector populations. Such genome assemblies will be used to underpin demographic, phylogenetic, host/parasite, and population genetic analyses. Unfortunately, very few reference or even draft-quality assemblies exist for arthropod vectors of public health importance, with the exception of species with small genomes like mosquitoes. This is true despite the huge economic and public health impacts of many other vector species. Assemblies of larger, more complex arthropod genomes, such as those of ticks, were first proposed in 2006 (Lyme disease vector) but they are still very poor despite their importance to public health. A consequence of the lack of cost-effective assembly technologies to complete even single sequences is that large-scale comparative assembly efforts are virtually nonexistent. This is partly a consequence of the costs required to generate high-quality assemblies for this massively diverse phylum, even as the cost per read has declined.
However, the read length of efficient sequencers with contemporary libraries is too short for effective construction of chromosome-length assemblies. Higher-order assembly requires very expensive and cumbersome approaches that are still being applied to correction of the human genome sequence. New approaches may substantially reduce the time and effort required for higher-order genome assemblies and thus make fuller and more cost-effective use of the short-read sequence libraries, which are the norm for next-generation sequencers but quite inadequate for achieving the goal of accurate full genome assemblies. Currently, genomes are often released as assemblies containing tens of thousands of contigs (e.g., Rhodnius prolixus, the triatomine vector of Chagas disease, has 58,559 contigs and 27,872 supercontigs even at 8X coverage).
Two recent Institute of Medicine of the National Academies Workshops by expert panels have focused on the costs and public health threats associated with vector-borne disease. The more general 2008 workshop was entitled "Vector-Borne Diseases, Understanding the Environmental, Human Health, and Ecological Connections, Workshop Summary (Forum on Microbial Threats)" and was published by the National Academy Press. This was followed in 2011 by another, more focused National Academy Press publication of a workshop proceeding entitled "Critical Needs and Gaps in Understanding Prevention, Amelioration, and Resolution of Lyme and Other Tick-Borne Diseases: The Short-Term and Long-Term Outcomes – Workshop Report." These lengthy documents fully explore the public health problem, costs, and research directions needing effort in order to reduce the health burden posed by vector-borne diseases. Suffice it to say, greater genetic understanding of the target species, the focus of this solicitation, was fully addressed as a pressing need in these publications. Those same considerations apply to a wide range of arthropods of veterinary and agricultural importance.
The goal of the proposed research is to rapidly and cost-effectively assemble high-quality arthropod genomes de novo. The innovation should ultimately enable large numbers of genomes to be assembled in multi-megabase scaffolds rapidly and affordably. A scalable, parallelizable approach will enable much broader surveys and targeted studies of arthropod genomes to better understand their role in disease transmission and myriad costs to society. Technologies designed to meet these needs will need to employ computational and assay-based innovations. Projects must start with input DNA and yield assembled genomes, not just data from which assembly may be done eventually. This will be the first such effort attempted with large arthropod genomes for higher-order and more complete assemblies. Technologies previously found to be effective for human and alligator genome assemblies (e.g., "Chromosome-scale shotgun assembly using an in vitro method for long-range linkage," arXiv:1502.05331v1 [q-bio.GN], 18 Feb 2015) using emerging sequencing and bioinformatic technologies may be used to achieve this goal.
Phase I Activities and Expected Deliverables:
Phase I must demonstrate the feasibility of an advanced methodology pipeline for rapid and high-quality de novo genome assembly of several arthropod genomes. Specifically, at least three tick vector genomes of public health importance (e.g., Ixodes scapularis, Dermacentor variabilis, Amblyomma americanum) with different genomic characteristics (total estimated genome sizes of >1 Gbp and different amounts of repetitive DNA families) must be assembled to reasonable contiguity (N50 > 200 Kbp) and quality by the responder to the solicitation. For Phase I, only the final data demonstrating successful de novo assembly of these targets will need to be provided. Additional high-quality conventional annotation and chromosome mapping confirmation that these assemblies are indeed correct will be required in Phase II for each of these three targets as well as for the additional genomes to be analyzed in Phase II.
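For orientation, the N50 contiguity criterion above can be checked with a few lines of code. The following is a minimal sketch, not a prescribed evaluation method: it uses the common definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length), takes 200 Kbp to mean 200,000 bp, and uses hypothetical function names (n50, meets_phase1_contiguity) and made-up example lengths.

```python
def n50(lengths):
    """Return the N50 (in bp) of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def meets_phase1_contiguity(lengths, threshold_bp=200_000):
    """Assumed reading of the Phase I criterion: N50 strictly above 200 Kbp."""
    return n50(lengths) > threshold_bp


# Illustrative scaffold lengths only; not data from any real assembly.
example_lengths = [500_000, 300_000, 250_000, 100_000, 50_000]
print(n50(example_lengths))
print(meets_phase1_contiguity(example_lengths))
```

In practice the lengths would be read from the assembly's FASTA file, and whether N50 is reported on contigs or scaffolds should be stated alongside the value.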
Projected Phase II activities
Phase II projects must demonstrate the scalability and cost-effectiveness of the technology approach demonstrated in Phase I, as well as quality annotations for the assemblies. For each respective year of the Phase II project period, a total of 8 (Phase II-yr1; including the three Phase I targets) and 10 (Phase II-yr2) additional arthropod vector genome assemblies and annotations, including other arthropod species of public health interest in at least four arthropod orders (e.g., fleas, ticks, lice, mites, triatomines) with genomes > 500 Mbp, must be produced that meet the same annotation quality, N50, and contiguity criteria established for the three Phase I assemblies. Furthermore, each assembly performed in the last year of Phase II must be completed at a total reagent cost of less than $10,000, or a credible roadmap to reach that cost must be demonstrated. Generated assemblies must be released to the public domain, and the applicant must perform and demonstrate gene annotation and synteny comparisons of quality comparable to the state of the art generally achieved for well-assembled reference vertebrate genomes.
Higher-quality arthropod genome assemblies will support public health research and interventions on a number of fronts. Better assemblies and broader sampling are highly conducive to comparative analyses for understanding phylogenetic relationships between closely and distantly related disease vectors. Population genetic analyses will become more powerful by allowing for more accurate characterization of population growth or decline, changes in geographic distribution through time, and evolutionary forces such as selection and drift. Higher-quality assemblies also allow better genome annotation via both syntenic and computational analyses, which in turn offer insight into host/symbiont relationships and the genetic basis of disease transmission. All of these benefits positively impact the ability to understand arthropod-borne disease transmission and, consequently, capabilities for effective prevention and intervention. Arthropods are increasingly resistant to pesticides used in their control. A fundamental understanding of the genetics of vectors is essential to developing new strategies for reducing the huge economic and public health burdens they cause. The first step in this process is to obtain high-quality reference genome sequences, which can provide the basis for development of other novel methodologies applicable to related species and diverging populations. The methods demonstrated should be applicable to a much wider selection of species of interest, both vertebrates and other invertebrate groups, and provide a commercially viable future for a successful responder to this solicitation.
More than 5,000 arthropod genomes have been proposed for sequencing (i5K project), and numerous pests of agricultural importance as well as medically important vectors are on this list. Insects alone comprise the largest number of species of any life form except bacteria, and arthropods are the most successful animals on the planet. Given the importance of arthropods in disease transmission, the vast diversity of the phylum, and the importance of high-quality genome assemblies for effective scientific investigations, the proposed technologies embody substantial market potential. Whereas the i5K project is estimated to take 5 or 10 years to complete, our goal is to use advanced technologies to obtain results sooner and of better quality. The ultimate service or product provided will find extensive use with all eukaryotic subjects of genomics research. Thorough characterization of the many important arthropod genomes will be a long-term effort by the scientific community, providing an ongoing need for the proposed product from the solicitation responder for many years to come.
- Agency: Department of Health and Human Services
- Program: SBIR
- Phase: Phase I
- Release Date: July 24, 2015
- Open Date: July 24, 2015
- Close Date: October 16, 2015
- URL: https://sbir.nih.gov/sites/default/files/PHS2016-1.pdf