How can corpus embedding contribute to the predictive results?

To explore how the viral corpus influences the prediction results, we varied the size of the corpus input (the number of abstracts) and generated a corpus embedding for each size. We then generated the top k results under each condition and used a rank correlation coefficient to measure how the predicted results changed.

The Kendall rank correlation coefficient evaluates the degree of similarity between two sets of ranks given to the same set of objects. If the Kendall rank correlation is close to 1, the two predicted rankings of the mutants are almost the same. Thus, the Kendall rank correlation reflects how much the ranking varies across different corpus inputs.
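As an illustration of the coefficient described above, the following is a minimal sketch of Kendall's tau (the tau-a variant, without tie correction) in plain Python. The mutant names and ranks are hypothetical and are not taken from the experiments; the variant used in the actual analysis is not stated here.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two rankings of the same items.

    rank_a and rank_b map each item to its rank (1 = best).
    This sketch applies no tie correction.
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same items")
    items = list(rank_a)
    n = len(items)
    if n < 2:
        raise ValueError("at least two items are required")

    concordant = 0
    discordant = 0
    for x, y in combinations(items, 2):
        # A pair is concordant when both rankings order it the same way.
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


# Hypothetical example: two rankings of four mutants that differ by one swap.
ranks_run_1 = {"mutant_A": 1, "mutant_B": 2, "mutant_C": 3, "mutant_D": 4}
ranks_run_2 = {"mutant_A": 1, "mutant_B": 3, "mutant_C": 2, "mutant_D": 4}

tau = kendall_tau(ranks_run_1, ranks_run_2)
print(f"Kendall tau = {tau:.3f}")
```

With one swapped pair out of six, five pairs are concordant and one is discordant, so the coefficient is (5 - 1) / 6, roughly 0.667. Identical rankings give 1 and fully reversed rankings give -1.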

Experimental details

Case name: SARS-CoV-2 B1 strain.

The case sequence embeddings/features are extracted from the SARS-CoV-2 B1 FASTA file.

The corpus embedding is generated from different numbers of abstracts. For this case study, we varied the number of abstracts from 3 to 100 (3, 30, 50, 80, 100), generated the corresponding ALBERT embedding for each setting, and fed it into the final predictor to observe the change in the top k results. We evaluated k = 20, 30, 50, 70, and 100.
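The sweep over abstract counts and top k values can be sketched as follows. This is only an illustration under stated assumptions: `mock_scores` is a hypothetical stand-in for the real predictor and returns synthetic scores, the smallest abstract count is taken as the reference condition, and the top k mutants of the reference ranking are re-scored under each other condition before computing the Kendall correlation. The actual pipeline may define these steps differently.

```python
import random
from itertools import combinations

ABSTRACT_COUNTS = [3, 30, 50, 80, 100]  # numbers of abstracts in the corpus
TOP_KS = [20, 30, 50, 70, 100]          # top k cutoffs


def kendall_tau(xs, ys):
    """Kendall tau-a between two paired score lists (no tie correction)."""
    pairs = list(combinations(range(len(xs)), 2))
    total = 0
    for i, j in pairs:
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        total += (sign > 0) - (sign < 0)  # +1 concordant, -1 discordant
    return total / len(pairs)


def mock_scores(n_abstracts, n_mutants=200):
    """Hypothetical predictor: synthetic scores, seeded by the abstract count."""
    rng = random.Random(n_abstracts)
    return {
        f"mutant_{i}": (n_mutants - i) + rng.gauss(0, 5)
        for i in range(n_mutants)
    }


def topk_kendall(base_scores, other_scores, k):
    """Compare how the reference top k mutants are ordered under another condition."""
    top = sorted(base_scores, key=base_scores.get, reverse=True)[:k]
    return kendall_tau(
        [base_scores[m] for m in top],
        [other_scores[m] for m in top],
    )


# Assumption: the smallest abstract count serves as the reference condition.
baseline = mock_scores(ABSTRACT_COUNTS[0])
results = {
    (n, k): topk_kendall(baseline, mock_scores(n), k)
    for n in ABSTRACT_COUNTS[1:]
    for k in TOP_KS
}

for (n, k), tau in sorted(results.items()):
    print(f"abstracts={n:3d}  top_k={k:3d}  tau={tau:.3f}")
```

Each (abstract count, k) pair yields one correlation value, which is the kind of quantity plotted per top k curve in the figures discussed in the results.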

Results

We compared the Kendall correlation across different vdkeys. Figure 1 shows the Kendall variance when the vdkey is 4mer-resnet50 or doc2vec-resnet50. Because these two predictors contain no abstract information (4mer comes from viral sequence features and doc2vec comes from viral sequence embeddings), the top results remain the same.

Figure 2 shows the Kendall variance of doc2vec-resnet50 and Albert-resnet50.

Figure 1. Kendall variance for different numbers of abstracts, for the vdkeys 4mer-resnet50 and doc2vec-resnet50. Because changes in the abstracts do not influence the sequence-based inputs, the prediction remains stable (a flat line).

Figure 2. Kendall variance for different numbers of abstracts, for the vdkeys doc2vec-resnet50 and Albert-resnet50. Changing the number of abstracts influences the abstract quality and therefore the top results predicted by the Albert-resnet50 predictor. However, no clear trend is apparent.

Figure 3. Kendall variance for different numbers of abstracts, for the vdkeys doc2vec-GPT2 and Albert-GPT2. Changing the number of abstracts influences the abstract quality and therefore the top results predicted by the Albert-GPT2 predictor. However, no clear trend is apparent.

Figure 4. Kendall variance for different numbers of abstracts, for the ensemble results (ensemble of 9 predictors). As more abstracts were introduced, the Kendall correlation showed an overall decreasing trend for all top k values, indicating that the top results varied more as more information was introduced. Even so, the minimum Kendall correlation here is 0.86, which means that the ensemble predictions did not vary much.

Summary

In summary, first (Figure 1), changing the abstracts does not influence the viral sequence-based predictors. Second (Figures 2 and 3), the number of abstracts can influence the ranking for predictors that use abstract information, but the trend depends on the vdkey predictor. Third (Figure 4), we compared the ensemble results obtained with 2 abstracts against those obtained with other numbers of abstracts (3-100). As more abstracts were introduced, the Kendall correlation fluctuated and decreased for all top k values, showing that the top results of the ensemble method varied more as more information was introduced.