The size distribution of these contigs is shown in Fig. 1B. Among these contigs, 16,569 (50.8%) were longer than 500 bp, including 3,977 (12.2%) longer than 1,000 bp. These results demonstrated the effectiveness of 454 pyrosequencing in rapidly capturing a large portion of the P. yessoensis transcriptome. The average sequencing depth was 5.8×. As expected for a randomly fragmented transcriptome, there was a positive relationship between the length of a given contig and the number of reads assembled into it (Fig. 1C). The remaining 106,807 high-quality reads were retained as singletons. About 7.7% of the reads produced in this study matched microbial sequences, and over 83% of the microbial transcripts turned out to come from the embryo and larval library, whose samples were collected directly from non-sterile seawater.
It is highly plausible that most of the identified microbial sequences resulted from microbial contamination from seawater. Therefore, these microbial sequences were excluded from functional annotation and from SSR and SNP mining.

Sequence annotation

We utilized several complementary approaches to annotate the assembled sequences. First, the assembled sequences were compared against the public Nr and Swiss-Prot databases using BlastX (E-value < 1e-4). Of the 139,397 assembled sequences, 38,942 (14,638 contigs plus 24,304 singletons) had significant matches (Table S1), corresponding to 25,237 unique accession numbers, of which 6,622 were matched by multiple queries without overlap.
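Counts of this kind (annotated queries, unique subject accessions, and subjects matched by more than one query) can be tallied from tabular BLAST output. The following is a minimal sketch, not the pipeline used in this study. It assumes the results are in the standard 12-column tabular format (`-outfmt 6`), keeps only the best hit per query, and applies the same E-value cutoff (< 1e-4). The function name `summarize_blast_hits` and the example rows are hypothetical, and the "without overlap" criterion is not implemented here.

```python
from collections import defaultdict


def summarize_blast_hits(lines, max_evalue=1e-4):
    """Summarize best hits from BLAST tabular (outfmt 6) lines.

    Assumed columns: qseqid, sseqid, pident, length, mismatch, gapopen,
    qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}  # query id -> (evalue, subject id) of the best hit so far
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue >= max_evalue:
            continue  # keep only hits with E-value below the cutoff
        if query not in best or evalue < best[query][0]:
            best[query] = (evalue, subject)

    # Group queries by the subject accession of their best hit.
    queries_per_subject = defaultdict(set)
    for query, (_, subject) in best.items():
        queries_per_subject[subject].add(query)

    # Subjects whose accession is the best hit of more than one query.
    multi = {s: q for s, q in queries_per_subject.items() if len(q) > 1}
    return {
        "annotated_queries": len(best),
        "unique_subjects": len(queries_per_subject),
        "multi_query_subjects": len(multi),
        "queries_on_multi_subjects": sum(len(q) for q in multi.values()),
    }


if __name__ == "__main__":
    # Hypothetical rows for illustration only (not data from this study).
    example_rows = [
        "contig1\tP1\t90\t100\t0\t0\t1\t300\t1\t100\t1e-20\t200",
        "contig2\tP1\t85\t80\t0\t0\t1\t240\t150\t230\t1e-10\t120",
        "single1\tP2\t70\t50\t0\t0\t1\t150\t1\t50\t1e-3\t40",
        "single2\tP3\t95\t60\t0\t0\t1\t180\t1\t60\t1e-6\t90",
    ]
    print(summarize_blast_hits(example_rows))
```

Dividing `queries_on_multi_subjects` by `multi_query_subjects` gives the average number of matched queries per multiply-matched subject, the same kind of ratio as 20,327 / 6,622 ≈ 3.1.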
These 6,622 subject sequences were matched by 20,327 different query sequences (3.1 matched queries per subject, on average). Additionally, 24,304 singletons showed significant matches to 17,204 unique accession numbers, of which 13,661 (79.4%) were not found among contigs, suggesting that most singletons contained useful gene information that could not be obtained from contigs. This could be because many genes in the transcriptome are expressed at levels too low to be adequately sampled by 454 sequencing. The percentage of sequences without annotation information in this study was considerable (approximately 72.1%). The poor annotation efficiency could be due to the lack of sequences in public databases from phylogenetically closely related species. For example, 461 (1.2%) hits matched P. yessoensis; 228 (0.6%) matched C. farreri (Zhikong scallop); 182 (0.5%) matched C. gigas (Pacific oyster); and 176 (0.5%) matched A. irradians (Bay scallop). In total, only 4.1% of the BLAST hits matched the class Bivalvia. On the other hand, because the significance of a BLAST comparison depends in part on the length of the query sequence, short reads obtained from sequencing would rarely be matched to known genes [13]. In this study, almost half of the assembled sequences were not very long (48.