Supplementary Materials Webpage

DeepSeq2Drug: An Expandable Ensemble End-to-end Antiviral Drug Repurposing Benchmark Framework by Multi-Modal Embeddings and Transfer Learning.

Extra Experiments

EXP 1 Time-complexity/ Runing time

EXP 2 comparison-with-sota-methods

EXP 4 mutation-exploration

EXP 5 Negative Sampling Rate

EXP 6 Corpus-embedding-contribute-to-the-predictive-results

EXP 7 MPVX- in – 1.5 years

Supplementary Materials Contents

More details concerning the pre-trained models can be found in Supplementary Materials. (PART I Features/embedding)

Part II Datasets

Details about the negative sampling strategies are shown in Supplementary Table 3.

Part Ⅲ Results

The visualized results (violin-swam plot), dataset overlap, and the p-value matrix of transfer results can be found in Supplementary Materials, Part III.

Part Ⅳ Extended Results and Parametrical Analysis

Part V Case Study of RSV

Additionally, we checked the findings for the respiratory syndrome virus (RSV); 20 out of the top 20 predictive results were reported (further information regarding the RSV prediction outcomes can be found in Supplementary Table 5.

—————————————————————————————–

PART Ⅰ Features/embedding

This manuscript mainly focused on embeddings from sequences, corpus, images, and networks. For viruses, the DNA sequence and virus-related abstracts/descriptions are more accessible. Thus, we mainly leveraged sequence-based and corpus-based (NLP) embeddings (for the virus). Only a few drugs have sequences. We didn’t consider drug sequences in this framework to maximize the number of drugs that can generate all modal feature embedding. We leveraged descriptions, molecular images, and networks to generate drug features/embeddings. We also introduced a virus or drugs feature from GCNMDA [1], downloaded from its official GitHub, to compare with our methods in embedding quality. The sources of previous contents and candidate methods leveraged for extracting features/embeddings are shown in Supplementary Table 1.

Supplementary Table 1. Feature/embedding types and candidate extractors

Feature-view domains	Source	Candidate Virus features/embedding extractors	Candidate Drug embedding extractors
Sequence-based ( iLearn [2])	NCBI virus DNA sequences [3]	4mer,5mer, mismatch41, PseKNC2013, Z48bit	N/A
Sequence-based embedding (Doc2vec)	NCBI virus RNA sequences	Doc2vec (model trained from 867020 sequences)	N/A
corpus-based (NLP) embedding (7984)	PubMed Collected Virus-related Papers abstracts	Roberta/ Albert/Bert (finetune selected Roberta and Albert)*	Bert/Gpt-2/Roberta/ Albert/ (Sources: drug-bank descriptions [4])
Molecular Image-based Embedding (9212)	DrugBank Molecular Image [4]	N/A (9212) +	Resnet50, Resnest101, Rfficient net, Inception-Resnet
Networks-based Embedding	Drug-Drug-interaction, Drug-Target interactions [4], Drug-disease interactions	N/A	Role2vec, node2vec, GCNMDA (SOTA method) (finetune selected Role2vec 768 and Role2vec 768(30) )

* Gpt-2 is not able to be applied to virus abstract corpus due to the limitation of the GPU memory of the experiment platform. The Platform is i7-9700 CPU, 64 Gigabytes of Memory, and GTX2080Ti, 11 Gigabytes of graphic memory. Each feature-view domain may contain more than one feature extractor.

Part II Datasets

Supplementary Table 2. Details of the collected datasets/databases

Database	No. virus	No.drugs	No.interactions	Type	purpose
NCBI virus [3]	Expandable	Not applicable	N/A	Virus seq	Virus-sequence based features
DrugBank [4]	N/A	190 (13,473)	N/A	Drug info	Drug features (corpus, Image, network)
Drugvirus [5]	103 (153)	190 (231)	1,156 (1,518)	Drug-virus Relation	Constructed datasets Train model
CTD [6].	11 (2 overlap)	585	1,588	Chemical-virus relations	Transfer Test
DrugBank: Drug-Target [4]	4,596 (seq)	7,381 (Overlap with DrugVirus:136)	20,127	Drug-target relations	Network construction
Drug-Drug-interaction,	N/A	391 (ODV: 56)	875	Drug-Drug Interaction,	Network construction
Drug-disease interactions	N/A	1,663 (ODV:117)	466,657	Drug-Disease Interactions	Network construction
GCNMDA [1]		167( 175-8 no DBID) (ODV:136)	N/A	Drug-virus Prediction methods	State Of The Art (SOTA)
Total Network		(ODV:175)	487,679	Heterogeneous Network	Network Construction

The numbers in the brackets are the original numbers before the data filtration (ODV means the Overlap with DrugVirus datasets, in total, our network has 175 drugs of overlaps).

Pretrained deep-learning models

Here is some information about the pre-trained deep learning models; they are all Tensorflow-based models, without further finetune:

For language models:

Roberta: https://huggingface.co/exbert/?model=xlm-roberta-base

Albert: https://github.com/google-research/albert

Bert: https://huggingface.co/bert-base-uncased

GPT-2: GPT2LMHeadModel

For image-based models, it is also Tensorflow-based, and downloaded from tensorflow.keras.applications, details of the model are as follows:

#model=ResNet101(weights=”imagenet”, include_top=False, pooling=”avg”)

#model=ResNet50(weights=”imagenet”, include_top=False, pooling=”avg”)

#model=EfficientNetV2L(weights=”imagenet”, include_top=False, pooling=”max”)

#model=InceptionResNetV2(weights=”imagenet”, include_top=False, pooling=”avg”)

Different negative sampling strategies

Supplementary Table 3. Details of the constructed datasets

Constructed datasets	Negative samples from	Selected From No.virus	Selected From No.drug	No.positve samples	No.negative samples	Dataset random seed k
Drugvirus (GCNMDA)	GCNMDA drug set	95	175(167 with DBID)	781	781	0-4
Drugvirus (same set)	DrugVirus set	103	190 with DBID	1156	1156	0-4
Drugvirus (Random Select)	All drugs set	103	9212 with DBID	1156	1156	0-4
CTD	All drugs set	103	9212 with DBID	1592	1592	0-4

We designed transfer validation in the results to reflect the predictive ability of different negative sampling methods. We repeated the sampling process k times (in this manuscript, k=5, we fixed the random seeds to repeat the experiments).

The main difference is that the negative samples are selected from different

The GCNMDA virus/drug features cannot conduct this experiment because of its limited number of viruses/drug entries (samples). Thus, if we use the GCNMDA features for training set construction, the model would have trouble learning the dataset’s distribution. Therefore, we selected a virus and drug embeddings group to conduct the transfer verification.

Part Ⅲ Results

Transfer results

Although for the known pair, there is no overlap, However, as we leveraged the random sampling policy, we cannot guarantee that the negative Overlap does not have overlaps. To further validate if the CTD dataset can be an independent dataset, we further counted the overlaps of our constructed dataset; details are as follow:

	CTD_frs0	CTD_frs1	CTD_frs2	CTD_frs3	CTD_frs4	Size
DB_frs0	991	5	3	2	0	2312
DB_frs1	5	987	1	3	1	2312
DB_frs2	4	3	958	0	1	2312
DB_frs3	3	2	0	996	0	2312
DB_frs4	1	1	1	1	982	2312
Size	3184	3184	3184	3184	3184	–

Frs_k means the dataset is constructed with the random seed k. As long as it is not using the same random seeds, the Overlap is acceptable.

As can be seen from above, if the CTD and drug virus use the same random seed to make the constructed dataset, it would contain plenty of overlaps. In order to make it fairer to other constructed datasets, thus, the results from the sample random seed would be removed from the results.

The adjusted AUC/AUPR/metric and p-values of transfer results are shown as follows:

P-value_0 means that the P-value of results (Metrics) from the first pair of VDkey: Roberta-ResNet50 in the dataset of DrugVirus compared with other vdkeys. It is the same with _1,_2,_3, meaning Albert-ResNet50,4mer-ResNet50, and 5mer-ResNet50.

The p-values calculated in the pictures are rank-sum tests with Bonferroni corrections. The results of transfer verification grouped by datasets are shown in Supplementary Figure 1.

The findings reveal that DrugVirus has considerably superior overall transfer verification outcomes than DrugVirus(SS), with a p-value of 2.297e-17, and DrugVirus(GCN), with a p-value of 5.124e-34.

In Supplementary Figure 2, various built datasets’ findings are sorted by vdkeys. We discovered that the 5mer-Resnet vdkey could obtain the greatest average metric (AUC+AUPR). The top four vdkeys of the Drugvirus are distributed similarly to one another in terms of results distribution. However, Drugvirus is not significantly better in the created dataset than Roberta-Resnet (SS). This suggests that compared to other constructed datasets, the Drugvirus-constructed dataset has a greater capacity to predict transfer.

As a result, we carried out more tests, mostly using the Drugvirus dataset that was created.

Supplementary Figure 1

Supplementary Figure 2

Part Ⅳ Extended Results and Parametrical Analysis

We extracted features from discrepant aspects (Including viruses’ sequential and sematic, drugs’ corpus, Image, and network-based embedding). Some aspects involve parameters, such as virus semantical features and Drug network-based embedding. We conducted some parametrical analysis to filter the best features of those aspects. During this process, we will fix features from other aspects and conduct some experiments on the chosen aspects’ candidate extractors.

Virus Semantical (Corpus-Based)Embedding Finetuning

Due to the limitations of GPU memory, we have to apply different strategies to the original abstracts list. As shown in Supplementary Table 4, we can only generate features from the first ten abstracts if we do not apply any preprocess to the corpus. If we apply a preprocess and a length limitation to the corpus, we could generate features from the first 20 abstracts. Furthermore, if we make the corpus with no overlap words, it will contain information from 100 abstracts. However, this will cause the loss of some information from the frequency of words.

Supplementary Table 4. Details of Finetuning the virus corpus-based embeddings

Pre-trained-Model	Preprocess Method	10 Abstracts	20 Abstracts	50 Abstracts	100 Abstracts
Albert	Without preprocess	v (able to run)	OOM (Out Of Memory)	OOM	OOM
Albert	PreProcess (PP)	v	v	OOM	OOM
Albert	PP+Non overlaop (PPNoL)	v	v	v	v
Roberta	Without preprocess	v	OOM	OOM	OOM
Roberta	PreProcess (PP)	v	v	OOM	OOM
Roberta	PP+Non overlaop (PPNoL)	v	v	v	v
Bert	Without Preprocess	v	OOM	OOM	OOM
Bert	PP	v	v	OOM	OOM
Bert	PPNoL	v	v	v	v

To further explore how those features would affect the performance, we leveraged those features by conducting a parametrical analysis. The dataset we chose is Drugvirus, Type2 verification, and repetition as ten times. The drug virus could perform well on the transfer tasks.

We categorized the parameters and ran various tests. In Supplementary Table 5, the results are displayed. We can see that the improvement is not very noticeable when compared to 10 or 50, or 100 papers (abstract) (Supplementary Figure 3). According to our analysis, each virus’s average abstract number is 86.4606. Although the 100 abstracts are slightly fewer than the 50 abstracts, we still used the 100 abstracts as parameters to create corpus-based embedding in order to add more information.

We further sorted the results by using different models to generate context-based embedding. As we can see from Supplementary Figure 4, the Albert is significantly better than other comparison methods with the Rank-sum test with Bonferroni correction (****: p <= 1.00e-04).

For the methods of preprocessing the corpus, we can see that when no preprocess is employed, the metric is 1.8665; if we applied the PreProcess No Overlap, the average metrics increased to 1.86913. if we only use the Preprocess, it will decrease to 1.8640. it is reasonable that some information might be removed when using the preprocess. However, removing the Overlap can significantly increase performance compared to Preprocessed only, as shown in Supplementary Figure 5. As we want to introduce more semantical information(including more abstracts) into the embedding here, we decide to use the PPNoL as preprocess method to generate corpus-based embeddings.

Thus in the feature pool, we leveraged albertPPNoL100 and robertaPPNoL100 as candidate semantic embedding extractors.

Supplementary Table 5. Results of finetuning the virus corpus-based embeddings

Numbers of documents	Abstract_10	Abstract_20	Abstract_50s	Abstract_100	figures
Metrics (average)	1.86901	1.8640	1.86902	1.8680	Figure 3
models	Albert	Bert	Roberta
Metrics (average)	1.8764	1.8623	1.8691		Figure 4
Preprocess methods	None-PreProcess	PreProcess	PreProcess No Overlap
Metrics (average)	1.8665	1.8640	1.86913		Figure 5

Supplementary Figure 3

Supplementary Figure 4

Supplementary Figure 5

Drug Network-based Embedding Finetuning

We initially chose no more than two aspects(two sequence features, two sequence embedding, two NLP embedding) for each feature domain of the infection virus and re-paired those features.

Then, in order to build the embedding datasets, we experimented with variously produced network-based drug embeddings and used five-fold cross-validation verification. The table contains information about the parameters used to generate the embedding. Drug network composition (D-target, Drug-Drug, Drug-Disease networks, and their combinations), embedding dimensions, and creation techniques are some of these factors.

In order to create embedding datasets with outstanding transfer verification performance, we used the Drug Virus (random choice for negative samples). Our network-based embedding has 487,697 linkages and 7,592 medicines, whereas the SOTA technique, GCNMDA, has 136 medications. The outcomes are shown in Supplementary Figure 6.

The heatmap demonstrates that performance often rose as the dimension grew. Adding extra links can improve performance, as seen in the first two rows. We further plotted a violin-swam plot with adjusted p-values shown in Supplementary Figure 7,

The plot’s trend further suggested that adding dimensions and networks might improve the performance of the model. _30 means that each node will visit 30 times when producing the role2vec embedding. From the figure, it does not improve performance significantly compared with the default setting.

Supplementary Figure 8 shows the output of each virus-drug feature domain extractor (vdkey). We can see that doc2vec-role2vec( Dtarget-DDI-Ddis-768_30) can get the highest mean AUC+AUPR values in this task( with a selected group of virus features).

We ultimately decided to include Drug role2vec(dtarget+DDI+Ddis768 30), and node2vec(dtarget+DDI128) as network representatives to the feature pool for features selection.

Supplementary Figure 6

Supplementary Figure 7

Supplementary Figure 8

Doc2vec finetuning

In our previous work [13], we generated sequence-based embedding for each RNA sequence by first choosing the longest sequence as a sample. Then we generated the embedding for each DNA sequence. We have additional sequences for each virus in this endeavor. Therefore, if a virus has more than one sample, we opted to average the embedding for that virus. In this case, the virus variants are regarded as the same the virus, and their embeddings would be averaged. The longest sequence embedding is marked as doc2vec_longest or doc2vec_(s), and the average embedding is noted as doc2vec_0.8m.

As a SOTA comparison, we also provided the GCNMDA viral features, which are named GCNMDA. We chose one embedding generator for each feature domain for the drug features. The AUC/AUPR/Metric violin-swam plot is shown in Supplementary Figure 9. (a-c). Rank-sum Bonferroni P-values were added at the top of each violin-swam plot. The heatmap of the average metric findings for particular virus-drug feature combinations is shown in Supplementary Figure 9. (d). The ROC/PR curves for those two techniques and GCNMDA are shown in Supplementary Figure 9. (e–f).

The viral averaged embedding doc2vec with drug network embedding role2vec is noticeably superior to the other virus-drug feature domain combinations, as shown in Supplementary Figure 9. (a-c). Additionally, doc2vec_0.8m-role2vec and doc2vec_0.8m-ResNet50 might produce results with a comparable distribution for the overall performance of AUC and AUPR. Supplementary Figure 9. (d). demonstrates that when paired with the three distinct drug feature domains, the viral embedding doc2vec_0.8m can achieve the greatest average metric values.

Supplementary Figure 9

Part V Case Study of RSV

We further collected respiratory syndrome virus from recent records (December 22, 2022),

We conducted drug repurposing for those newly recorded nucleotide sequences. Results are shown in Supplementary Table 5. As can be seen from the table, DeepSeq2drug performed well, with 10 out of the top 10 predicted medicines being reported from PubMed to treat the contagious RSV. Those PMIDs were also recorded for further research.

The sample IDs are shown in the IDs of the selected sequences for RSV results.

Supplementary Table 5. Predictive results of RSV

Virus	Drug name	Count	Possible citation from PUBMED
respiratory syndrome virus	Ribavirin	10	[‘31384456’, ‘33924302’, ‘32352535’, ‘32307245’, ‘27281742’, ‘32634603’, ‘30849247’, ‘33961695’, ‘32282022’, ‘32284326’]
respiratory syndrome virus	Chloroquine	10	[‘32348588’, ‘32217113’, ‘33010669’, ‘32964796’, ‘32295814’, ‘32446285’, ‘34356617’, ‘33236131’, ‘32696108’, ‘32373993’]
respiratory syndrome virus	Mycophenolic acid	10	[‘33116299’, ‘32639598’, ‘33743151’, ‘34549821’, ‘33957273’, ‘24323636’, ‘24626235’, ‘25542975’, ‘25810418’, ‘32579258’]
respiratory syndrome virus	Amantadine	10	[‘31275265’, ‘14643124’, ‘34152583’, ‘1048031’, ‘3500376’, ‘35390511’, ‘33364201’, ‘10965680’, ‘25446940’, ‘15071371’]
respiratory syndrome virus	Nitazoxanide	10	[‘33336780’, ‘33361100’, ‘32768971’, ‘33588727’, ‘35130104’, ‘27095301’, ‘34755538’, ‘35069994’, ‘28500431’, ‘36094778’]
respiratory syndrome virus	Itraconazole	10	[‘32428379’, ‘33666253’, ‘35064041’, ‘34984948’, ‘33472466’, ‘35405278’, ‘27895278’, ‘35229317’, ‘36038303’, ‘29899416’]
respiratory syndrome virus	Nelfinavir	10	[‘33817567’, ‘33482181’, ‘34755538’, ‘16312205’, ‘33217030’, ‘34611467’, ‘33995308’, ‘33080984’, ‘32259313’, ‘32705942’]
respiratory syndrome virus	Amodiaquine	10	[‘33941899’, ‘34755538’, ‘34239286’, ‘36453012’, ‘34217752’, ‘32916297’, ‘34541995’, ‘17176632’, ‘32805422’, ‘33475021’]
respiratory syndrome virus	Sunitinib	4	[‘34951532’, ‘32540268’, ‘32669298’, ‘35738348’]
respiratory syndrome virus	Gemcitabine	7	[‘29795047’, ‘33557278’, ‘33479570’, ‘32563698’, ‘36213871’, ‘36334362’, ‘35971500’]

IDs of the selected sequences for RSV results

“>OP730529.1 |Porcine reproductive and respiratory syndrome virus strain PRRSV-CH-SDLY27-2022 envelope glycoprotein gene, complete cds

“>OP730530.1 |Porcine reproductive and respiratory syndrome virus strain PRRSV-CH-SDLY28-2022 envelope glycoprotein gene, complete cds

“>OP730531.1 |Porcine reproductive and respiratory syndrome virus strain PRRSV-CH-SDLY32-2022 envelope glycoprotein gene, complete cds

“>OP785693.1 |Porcine reproductive and respiratory syndrome virus isolate PRRSV-CH-SDLY176-2022 nonstructural protein 2 gene, partial cds

“>OP785694.1 |Porcine reproductive and respiratory syndrome virus isolate PRRSV-CH-SDLY177-2022 nonstructural protein 2 gene, partial cds

“>OP785695.1 |Porcine reproductive and respiratory syndrome virus isolate PRRSV-CH-SDLY28-2022 nonstructural protein 2 gene, partial cds

“>OP785696.1 |Porcine reproductive and respiratory syndrome virus isolate PRRSV-CH-SDLY32-2022 nonstructural protein 2 gene, partial cds

“>OM677752.1 |Porcine reproductive and respiratory syndrome virus strain NPUST3064 glycoprotein 5 gene, complete cds

“>OM677753.1 |Porcine reproductive and respiratory syndrome virus strain 108-355 glycoprotein 5 gene, complete cds

“>OM677754.1 |Porcine reproductive and respiratory syndrome virus strain 108-603 glycoprotein 5 gene, complete cds

“>OM677755.1 |Porcine reproductive and respiratory syndrome virus strain NPUST3554 glycoprotein 5 gene, complete cds

“>OM677756.1 |Porcine reproductive and respiratory syndrome virus strain NPUST3599 glycoprotein 5 gene, complete cds

“>OM677757.1 |Porcine reproductive and respiratory syndrome virus strain 108-2275 glycoprotein 5 gene, complete cds

“>OM677758.1 |Porcine reproductive and respiratory syndrome virus strain NPUST4028 glycoprotein 5 gene, complete cds

“>OM677759.1 |Porcine reproductive and respiratory syndrome virus strain NPUST4035 glycoprotein 5 gene, complete cds

“>OM677760.1 |Porcine reproductive and respiratory syndrome virus strain NPUST4178 glycoprotein 5 gene, complete cds

“>OM677761.1 |Porcine reproductive and respiratory syndrome virus strain NPUST4260 glycoprotein 5 gene, complete cds

“>OM677762.1 |Porcine reproductive and respiratory syndrome virus strain 109-920 glycoprotein 5 gene, complete cds

“>OM686875.1 |Porcine reproductive and respiratory syndrome virus strain 103-555 ORF5 gene, complete cds

“>OM801587.1 |Porcine reproductive and respiratory syndrome virus strain KN-21035TB envelope protein GP5, membrane protein GP6, and nucleocapsid protein genes, complete cds

“>OM860456.1 |Porcine reproductive and respiratory syndrome virus isolate VNUA-PRRS-HY-01 glycoprotein 5 (GP5) gene, complete cds

“>MK279739.1 |Porcine reproductive and respiratory syndrome virus strain PRRSV/pig/CHN/JTS/201606, complete genome

“>MK279740.1 |Porcine reproductive and respiratory syndrome virus strain PRRSV/pig/CHN/TG/201711, complete genome

“>MK279741.1 |Porcine reproductive and respiratory syndrome virus strain PRRSV/pig/CHN/JK/201805, complete genome

“>ON691479.1 |Porcine reproductive and respiratory syndrome virus strain GD-H1, complete sequence

“>ON691480.1 |Porcine reproductive and respiratory syndrome virus strain GD-H1, complete sequence

Supplementary Table 6.

DeepSeq2drug Top Predicted Drug for Monkeypox Virus

(Ref seq and average seq)

Rank	Drug_name (Ref_seq)	Count	REF	Drug_name (200_Monkeypox)	REF
1	Ribavirin	2	[51], [52]	Ribavirin	[51], [52]
2	Amantadine	0		Chloroquine	0
3	Mycophenolic acid	1	[51]	Mycophenolic acid	[51]
4	Chloroquine	0		Nitazoxanide	0
5	Itraconazole	0		Itraconazole	0
6	Nitazoxanide	0		Amantadine	0
7	Nelfinavir	0		Nelfinavir	0
8	Amodiaquine	0		Amodiaquine	0
9	Sorafenib	0		Sunitinib	0
10	Gemcitabine	1	[53]	Sorafenib	0
11	Sunitinib	0		Gemcitabine	[53]

The Monkeypox ref_seq (NCBI Accession ID GCF_000857045.1) and 200 monkeypox DNA nucleotide sequences (NCBI Accession ID range from ON669283.1 to OP881933.1, searched on 12th Dec 2022) are the parameters that control the embedding is generated by one sequence or a batch of sequences. We can find out that the predictive results of the Reference sequence and a batch of recent Monkeypox are slightly different.

Reference

[2] Chen Z, Zhao P, Li F, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data[J]. Briefings in bioinformatics, 2020, 21(3): 1047-1057.
[4] Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, Sajed T, Johnson D, Li C, Sayeeda Z, Assempour N, Iynkkaran I, Liu Y, Maciejewski A, Gale N, Wilson A, Chin L, Cummings R, Le D, Pon A, Knox C, Wilson M. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2017 November 8. doi: 10.1093/nar/gkx1037.