Abstract: Two families of statistical approaches, parametric models and nonparametric estimators, were applied to predict bacterial species richness in core MD05-2896, collected from the South China Sea (SCS) during the Chinese-French joint MARCO POLO/IMAGES 147 cruise (chief operator Yvon Balut). MD05-2896 (08°49.50'N, 111°26.47'E) is an 11.03 m long core collected from the southern slope of the SCS at a water depth of 1657 m. A total of twelve microbial subsamples were collected from the top to the bottom of the core at intervals of 1 m. For each sample, we extracted the bulk DNA and amplified the bacterial 16S rRNA genes. All of the bacterial 16S rRNA gene sequences were used to construct a 16S rRNA gene library. A total of 194 unique phylotypes were identified by phylogenetic analysis, most of which clustered into 17 phyla and candidate divisions: Planctomycetes, Proteobacteria, Chloroflexi, Actinobacteria, Spirochaetes, Verrucomicrobia, Acidobacteria, Bacteroidetes, Deferribacteres, Nitrospirae, and candidate divisions OP1, OP3, OP8, OP11, JS1, WS3, and TM6. All sequences were grouped into Operational Taxonomic Units (OTUs) at similarity cut-off values of 99%, 98%, 97%, 95%, 90%, and 80%. Sequences were aligned using the CLUSTALW software, after which the percent sequence similarity was calculated and the sequences were grouped into OTUs using the unweighted pair group method. The resulting OTU frequency data were then analyzed with parametric models and with the coverage-based nonparametric estimators ACE and ACE-1. Because nonparametric estimators usually underestimate species richness, this study focused on the application of parametric models. Five models, the inverse Gaussian, log normal, negative binomial, Pareto, and 2-mixed exponential, were fitted to the OTU frequency data and used to predict species richness.
The parametric models were implemented step by step according to the procedures described by Hong et al., and the best-fitting model at each similarity level was selected for the final parametric analysis. At 99% rRNA gene sequence similarity, the data were best described by the 2-mixed exponential distribution model, which estimated the richness at 326 ± 40 (SE). At 97% similarity, the negative binomial distribution model described the data best, with an estimated richness of 244 ± 10 (SE). At 95% similarity, the negative binomial distribution model again described the data best and estimated the richness at 220 ± 6 (SE). At 90% similarity, the 2-mixed exponential distribution model described the data best and estimated the richness at 127 ± 4 (SE). At 80% similarity, the Pareto distribution model described the data best and estimated the richness at 62 ± 4 (SE). The 99%, 97%, 95%, 90%, and 80% rRNA gene sequence similarity levels were adopted to delineate bacterial strains, species, genera, families/classes, and phyla, respectively. Accordingly, core MD05-2896 contains a minimum of 326 ± 40 (SE) bacterial strains, 244 ± 10 (SE) bacterial species, 220 ± 6 (SE) bacterial genera, 127 ± 4 (SE) bacterial families/classes, and 62 ± 4 (SE) bacterial phyla. However, these numbers are conservative because of limitations associated with the laboratory experiments, such as coextracted interfering substances (including humic and fulvic acids) and PCR bias.
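To illustrate the coverage-based nonparametric approach mentioned above, the following is a minimal Python sketch of the ACE estimator as it is commonly formulated, assuming the conventional rare-abundance cutoff of 10. The function name `ace_richness`, the `rare_threshold` parameter, and the toy counts are hypothetical and are not taken from this study; ACE-1 and the five parametric models are not covered here.

```python
from collections import Counter


def ace_richness(otu_counts, rare_threshold=10):
    """Abundance-based Coverage Estimator (ACE) of total richness.

    otu_counts: number of sequences observed for each OTU.
    rare_threshold: OTUs with at most this many sequences are "rare"
    (10 is the conventional choice; an assumption, not from the study).
    """
    counts = [c for c in otu_counts if c > 0]
    rare = [c for c in counts if c <= rare_threshold]

    s_rare = len(rare)              # number of rare OTUs
    s_abund = len(counts) - s_rare  # number of abundant OTUs
    if s_rare == 0:
        return float(s_abund)       # nothing rare, nothing to extrapolate

    n_rare = sum(rare)              # sequences belonging to rare OTUs
    freq = Counter(rare)            # freq[k] = rare OTUs seen exactly k times
    f1 = freq[1]                    # singletons

    if f1 == n_rare:
        # Every rare OTU is a singleton: estimated coverage is zero
        # and ACE is undefined.
        raise ValueError("ACE undefined: all rare OTUs are singletons")

    # Estimated sample coverage of the rare group.
    c_ace = 1 - f1 / n_rare

    # Squared coefficient of variation of the rare OTU abundances.
    sum_term = sum(k * (k - 1) * fk for k, fk in freq.items())
    gamma_sq = max(
        (s_rare / c_ace) * sum_term / (n_rare * (n_rare - 1)) - 1,
        0,
    )

    return s_abund + s_rare / c_ace + (f1 / c_ace) * gamma_sq


if __name__ == "__main__":
    # Toy counts for six OTUs (hypothetical data).
    toy_counts = [1, 1, 2, 2, 4, 12]
    print(f"observed OTUs: {len(toy_counts)}")
    print(f"ACE estimate:  {ace_richness(toy_counts):.2f}")
```

The estimate can never fall below the observed OTU count, and it rises as singletons make up a larger share of the rare group, which is the sense in which the estimator is coverage-based.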