Endophytic bacteria (Supplementary Table S1) were isolated from within leaves of Arabidopsis thaliana or Draba verna from Southwest Michigan (Kniskern et al., 2007; Traw et al., 2007; Barrett et al., 2011). Isolates were cultured and maintained in King’s B media (King et al., 1954).
Sequence generation and processing
Using a modified protocol from (Morgan et al., 2010), genomic DNA was isolated from 1 mL of bacterial overnight, KB culture. First, cells were pelleted by centrifugation at 14,000 RPM for 2 minutes. After removal of supernatant, cell pellets were resuspended in TES (10 mM Tris-HCl pH 7.5, 1 mM EDTA, 100 mM NaCl) supplemented with 50 U/μL lysozyme (Ready-lyse Lysozyme Solution, Epicentre, Madison, WI, USA), and incubated overnight at room temperature. Then, SDS and proteinase-k were added to final concentrations of 1% and 0.5 mg/mL, respectively, and incubated at 55°C for 4 hours. Three 2.3 mm stainless steel beads (BioSpec, Bartlesville, OK, USA) were added and lysates were mechanically disrupted at 1,750 RPM for 5 minutes (2010 Genogrinder, SPEX, Metuchen, NJ, USA). DNA was then extracted with the Gentra Puregene Yeast/Bacteria Kit (Qiagen, Valencia, CA, USA), using the standard protocol.
DNA was quantified by fluorimetry (Quant-iT DNA Assay, Life Technologies, Carlsbad, CA, USA) and sent to the Institute for Genomics and Systems Biology Next Generation Sequencing Core at Argonne National Laboratory (Lemont, IL, USA). Short insert libraries (~ 200 bp) were prepared with the standard protocol for the TruSeq DNA Sample Prep kit (Illumina, San Diego, CA, USA), and long insert libraries (insert ~2500 bp) were constructed with the standard protocols for either the Mate Pair Library Prep Kit v2 or the Nextera Mate Pair Sample Prep Kit (Illumina, San Diego). Actual insert sizes are listed in Supplementary Table S1. Paired end, 101 bp reads were generated using the Illumina HiSeq 2000 platform with base caller CASAVA v1.8.2.
All sequencing processing steps and parameters are included in custom shell scripts that are publicly available at https://bitbucket.org/perisin/16stimator. Briefly, raw reads were quality filtered with Trimmomatic v0.32 (Bolger et al., 2014). The first and last three bases were trimmed due to low sequence quality. A sliding window of four bases was further used to trim reads where the average sequencing quality was below Q15. Any Illumina adapter sequences were removed and all the reads were cropped to the same length. Reads that aligned to PhiX were further removed with Bowtie 2 v2.1.0 (Langmead and Salzberg, 2012). To check for contamination in the DNA isolation and library generation, processed reads were taxonomically classified with Kraken v0.10.4 (Wood and Salzberg, 2014), and contaminating reads were removed. Finally, mate pair reads were reverse complemented with FASTX-Toolkit v0.0.13 (http://hannonlab.cshl.edu/fastx_toolkit/). After processing, some sequencing libraries did not meet the previous quality requirements and were not used in subsequent assemblies and analysis (Supplementary Table S1). All processed reads were deposited in NCBI SRA (Supplementary Table S1). For Escherichia coli TY-2482 (Rohde et al., 2011), processed Illumina paired end reads generated from short and long insert libraries were downloaded from SRA (SRR292678, SRR292862).
Draft genome assembly
Interleaved reads were assembled into contigs with the VelvetOptimiser.pl script packaged with Velvet v1.2.10 (Zerbino and Birney, 2008). Final assemblies were generated with a kmer size that maximized N50 length, and deposited into NCBI WGS database (Supplementary Table S1 for accession numbers).
16S rRNA gene copy number estimation for endophytes
16S copy numbers were estimated by a sequencing read-depth approach as outlined in Supplementary Figure 1. An annotated genome and sequencing reads are required for estimation. Draft genome assemblies were submitted to the Rapid Annotation using Subsystem Technology (RAST Overbeek et al., 2013) server for annotation, and a custom R v3.1.0 (R Core Team, 2014) script was used to extract the positions of 16S rRNA, 23S rRNA, and single-copy genes. To calculate read-depths, bed files were created by applying a sliding window of two times the read length over the genome. Read-depth can be biased for GC% of the genomic region; therefore, we corrected for GC%, where appropriate, by applying a modified method from Yoon et al. (2009). Briefly, processed reads were mapped back to the assembly with Bowtie 2, and the resulting sam file was converted to a bam file with SAMtools v0.1.19 (Li et al., 2009). The GC% and read-depth for each window was calculated via bedtools v2.17.0 (Quinlan and Hall, 2010) nuc and intersect functions, respectively. For single-copy gene windows, read-depth was plotted against GC% to test for a linear relationship. If the linear model (lm function in R) explained a significant proportion of the variance (R2 > 0.1, slope p < 0.05), then we used the model parameters to correct read-depths for single-copy and 16S windows. Otherwise, we used the raw read-depths for further calculations. Assemblies can result in partial resolution of 16S copies; therefore, each gene was collapsed to a single contig and read-depths were summed for overlapping regions.
Estimates and confidence intervals were calculated by two different methods, Price-Bonett and Permuation. For Price-Bonett, the ratio of median read-depths for 16S to single-copy genes was calculated and the 95% confidence intervals were calculated as in Price & Bonett (2002). For Permutation, a random position of each gene was chosen and the read-depth was calculated as above for the window surrounding each position. The copy number was calculated by dividing the 16S read-depth by the single-copy read-depth. This process was repeated 1000 times and 95% confidence intervals were calculated for the resulting distribution with the Hmisc R package (https://github.com/harrelfe/Hmisc).
16S rRNA gene copy number estimation for NCBI deposited genomes
The table of deposited complete and draft prokaryotic genomes was downloaded from NCBI on November 25, 2014. This table was parsed for assemblies that are annotated and have sequencing reads in SRA. For each sequencing read/assembly combination, copy numbers were computed using the Price-Bonett method outlined above, except annotations were downloaded directly from NCBI. The results were further parsed for sequencing coverage (> 10,000 reads), proportion of mapped reads (> 50%), and outliers of 16S copy number estimates (< 20 copies). All estimates and metadata for NCBI genomes can be found in Supplementary Table S2.
Experimental validation of 16S rRNA gene copy number
We identified single-copy genes (argS, atpD, ppK or valS) for each bacterial isolate, designed and quality tested quantitative PCR primers (Supplementary Table S3), and constructed gBlocks® (Integrated DNA Technologies, Coralville, IA, USA) for reference. An individual gBlock reference contained one single-copy amplicon and one 16S amplicon, comparable to a reference plasmid (Lee et al., 2008). The concentration of genomic DNA samples was estimated by fluorimetry (Quant-iT DNA Assay, Life Technologies, Carlsbad, CA, USA) and adjusted to 3ng/µL. gBlocks were adjusted to 0.1 pg/µL. A threefold dilution series with seven total dilutions was produced for each sample and gBlock. The final volume per PCR was 11 µL and contained 4.4 µL molecular grade water, 5 µL SYBR FAST ABI Prism qPCR Kit reagent (KK4604, KAPA Biosystems, Wilmington, MA, USA), 0.3 µL of 10 µM forward and reverse primers, and 1 µL template. The qPCR program is detailed in Supplementary Table S3. qPCR runs were carried out in three technical replicates per primer and dilution on an ABI PRISM 7900HT (Applied Biosystems Instruments, Life Technologies, Carlsbad, CA, USA). Dissociation curves for each run were inspected for contamination and analysis performed by ABI PRISM software SDS 2.4 with the automatic Ct setting.
Copy numbers were determined by absolute quantification (Lee et al., 2008). All calculations were performed in R v3.1.0 (R Core Team, 2014). First, the weight of gBlock DNA for each dilution was converted into number of copies. Standard curves for 16S and single-copy genes were constructed for Ct vs. log(Copies) and fitted with a linear model (lm function). The regression parameters were used to calculate the number of gene copies for 16S and single-copy gene based on the genomic DNA Ct values. The 16S:single-copy gene copy ratio at each dilution was used to determine copy number per isolate.
16S rRNA copy number estimate comparison for 16Stimator and PICRUSt
To address whether 16Stimator represents an improvement in copy number prediction over ancestral reconstructions methods, we compared our estimates to PICRUSt predictions for Greengenes 13_5 OTUs (McDonald et al., 2012). We first extracted all the 16S sequences for all genomes with 16Stimator copy number estimates. We aligned these sequences against the Greengenes 99% OTUs (pick_closed_reference_otus.py in QIIME with uclust, enable_rev_strand_match True, and similarity 0.99) (Caporaso et al., 2010; Edgar, 2010). With the precalculated table found on the PICRUSt website (16S_13_5_precalculated.tab, http://picrust.github.io/picrust/picrust_precalculated_files.html), we were able to match PICRUSt copy number estimates for each OTU. Some genomes contained multiple 16S sequences that aligned to different 99% OTUs. We excluded these genomes from further analysis. As each genome had multiple 16Stimator estimates from multiple reads and sequencing libraries, we took the mean estimate and confidence interval for each genome. As several Greengenes OTUs encompassed multiple genomes, we then took the median estimate for each OTU. To calculate Pearson correlations between 16Stimator and PICRUSt estimates at different nearest sequenced taxon index (NSTI) cutoffs, we used the cor.test function in R (R Core Team, 2014). We have included a table of median 16Stimator copy number estimates for the Greengenes 99% OTUs for direct application in the PICRUSt workflow (Supplementary Table S4).
Note: All annotated, custom scripts are available at https://bitbucket.org/perisin/16stimator.
Titles and legends to supplementary figures
Supplementary Figure 1
16Stimator pipeline. This pipeline requires three inputs: sequencing reads, assembly fasta, and annotations. These inputs can be downloaded from NCBI Short Read Archive (SRA) and Genomes ftp sites. For analysis of de novo assemblies, reads are generated on various sequencing platforms, assembled into contiguous sequence fragments (contigs), here using the Velvet assembler (Zerbino and Birney, 2008), and annotated, here using RAST (Aziz et al., 2008; Overbeek et al., 2013). Sequencing reads are then aligned to the assembly with Bowtie 2 (Langmead and Salzberg, 2012), and read-depth at each genomic position is calculated with bedtools (Quinlan and Hall, 2010). If a significant GC% effect is detected, read-depths are corrected using a linear model. Custom R scripts are used to extract read-depths for 16S and single-copy genes and to estimate 16S copy number with Price & Bonett (2002) confidence intervals. Custom shell and R scripts can be found at https://bitbucket.org/perisin/16stimator.
Supplementary Figure 2
Correlations between 16Stimator and PICRUSt 16S copy number estimates deteriorate with increasing NSTI. For genomes with 16Stimator estimates, 16S sequences were extracted and aligned to Greengenes 99% OTUs (McDonald et al., 2012). For matching OTUs (535 total), the median 16Stimator estimates were plotted against PICRUSt estimates (Langille et al., 2013), and colored according to their nearest sequenced taxon index (NSTI) cutoff (<0.01, n = 341; <0.03, n = 77; <0.06, n = 50; <0.09, n = 21; <0.12, n = 17; <0.15, n = 11; >0.15, n = 18). Lines represent linear relationships between estimates colored by NSTI, with Pearson correlations: (<0.01, PCC = 0.50, p < 0.001; <0.03, PCC = 0.28, p = 0.014; <0.06, PCC = 0.71, p <0.001; <0.09, PCC = -0.07, p = 0.76; <0.12, PCC = 0.47, p = 0.056; <0.15, PCC = -0.24, p = 0.48; >0.15, PCC = 0.05, p = 0.86).
Supplementary Figure 3
16Stimator confidence interval size is unaffected by NSTI. We calculated 16Stimator copy number estimates for the Greengenes 99% OTUs and plotted the proportion of median confidence interval (CI) to median copy number estimate over increasing NSTI cutoffs. The proportion of CI to median estimate allows for comparison of estimates with different copy numbers, as the CI size alone will necessarily increase with higher copy numbers. To better visualize and compare the boxplots, outlier points greater than 10 are not shown.
Supplementary Figure 4
Interaction between sequencing depth and GC correction affects 16Stimator accuracy for Listeria monocytogenes genomes sequenced on the Illumina MiSeq. Sequencing libraries generated on the MiSeq (n = 1616) were used to estimate 16S copy number for Listeria monocytogenes genomes. The sequencing read-depth is plotted against the difference between 16Stimator estimate and actual copy number. Points are colored red if linear (“LM”) GC% correction was applied and blue if no GC% correction was applied to the read-depths.
Supplementary Figure 5
Bacterial phylogeny does not impact 16Stimator confidence intervals. For the Greengenes 99% OTUs, the proportion of median confidence interval to median 16Stimator copy number estimate is shown as a heatmap plotted on the Greengenes phylogeny tree.
Titles and legends to supplementary tables
Supplementary Table S1
Endophytic isolate collection data and genome assembly summary statistics.
Supplementary Table S2
16Stimator generated 16S rRNA gene copy number estimates and metadata for genomes and sequencing libraries downloaded from NCBI.
Supplementary Table S3
Quantitative PCR primers and conditions for 16S rRNA gene and single copy gene amplifications.
Supplementary Table S4
PICRUSt formatted table of 16Stimator generated 16S rRNA gene copy number estimates for Greengenes 99% OTUs.
Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, et al. (2008). The RAST Server: Rapid Annotations using Subsystems Technology. BMC Genomics 9:75.
Barrett LG, Bell T, Dwyer G, Bergelson J. (2011). Cheating, trade-offs and the evolution of aggressiveness in a natural pathogen population. Ecology Letters 14:1149–1157.
Bolger AM, Lohse M, Usadel B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120.
Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. (2010). QIIME allows analysis of high-throughput community sequencing data. Nat Meth 7:335–336.
Edgar RC. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26:2460–2461.
King EO, Ward MK, Raney DE. (1954). Two simple media for the demonstration of pyocyanin and fluorescin. J Lab Clin Med 44:301–307.
Kniskern JM, Traw MB, Bergelson J. (2007). Salicylic acid and jasmonic acid signaling defense pathways reduce natural bacterial diversity on Arabidopsis thaliana. MPMI 20:1512–1522.
Langille MGI, Zaneveld J, Caporaso JG, McDonald D, Knights D, Reyes JA, et al. (2013). Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nature Biotechnology 31:814–821.
Langmead B, Salzberg SL. (2012). Fast gapped-read alignment with Bowtie 2. Nat Meth 9:357–359.
Lee C, Lee S, Shin SG, Hwang S. (2008). Real-time PCR determination of rRNA gene copy number: absolute and relative quantification assays with Escherichia coli. Appl Microbiol Biotechnol 78:371–376.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079.
McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ, Probst A, et al. (2012). An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. The ISME Journal 6:610–618.
Morgan JL, Darling AE, Eisen JA. (2010). Metagenomic sequencing of an in vitro-simulated microbial community. PLoS ONE 5:e10209.
Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, et al. (2013). The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Research 42:D206–D214.
Price RM, Bonett DG. (2002). Distribution-Free Confidence Intervals for Difference and Ratio of Medians. Journal of Statistical Computation and Simulation 72:119–124.
Quinlan AR, Hall IM. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:841–842.
R Core Team. (2014). The R project for statistical computing from http://www.R-project.org.
Rohde H, Qin J, Cui Y, Li D, Loman NJ, Hentschke M, et al. (2011). Open-source genomic analysis of Shiga-toxin-producing E. coli O104:H4. N Engl J Med 365:718–724.
Traw MB, Kniskern JM, Bergelson J. (2007). SAR increases fitness of Arabidopsis thaliana in the presence of natural bacterial pathogens. Evolution 61:2444–2449.
Wood D, Salzberg S. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15:R46.
Yoon S, Xuan Z, Makarov V, Ye K, Sebat J. (2009). Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Research 19:1586–1592.
Zerbino DR, Birney E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18:821–829.