Similarity of sequences in the CTD dataset

How similar are the sequences in the CTD dataset to the training datasets?

As described in http://deepseq2drug.cs.cityu.edu.hk/supplementary-materials-webpage/, the CTD dataset does not overlap with the Drug virus info dataset.

We used cosine similarity and Euclidean distance to compare the viral sequences from the two datasets. Cosine similarity is the cosine of the angle between two vectors in an inner product space. A value close to 1 means that the two vectors point in almost the same direction, so the sequences they represent are likely to be similar; a value close to 0 means that the two vectors are almost orthogonal.

Taking the doc2vec embedding as an example, we first averaged the doc2vec embeddings of the CTD dataset and of the Drug virus info dataset separately (both are 128-dimensional vectors), and then calculated the cosine similarity between the two averaged embeddings.

The resulting cosine similarity is 0.020797768101430344, which is close to 0, indicating that the two averaged vectors (each representing the sequences of one dataset) are almost orthogonal to each other.
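The averaging and cosine similarity steps described above can be sketched as follows. This is a minimal illustration rather than the code used for the reported value: the helper names `mean_vector` and `cosine_similarity` are hypothetical, and the toy 3-dimensional vectors stand in for the real 128-dimensional doc2vec embeddings.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy stand-ins (assumption): the real embeddings have 128 dimensions.
ctd_embeddings = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
drugvirus_embeddings = [[0.0, 2.0, 0.0], [0.0, 4.0, 0.0]]

# Step 1: average the embeddings of each dataset separately.
ctd_mean = mean_vector(ctd_embeddings)              # [2.0, 0.0, 0.0]
drugvirus_mean = mean_vector(drugvirus_embeddings)  # [0.0, 3.0, 0.0]

# Step 2: compare the two averaged embeddings.
similarity = cosine_similarity(ctd_mean, drugvirus_mean)
print(similarity)  # 0.0, because the two toy means are orthogonal
```

In this toy case the two mean vectors lie on different axes, so the similarity is exactly 0; the value reported above is close to, but not exactly, 0.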

We further calculated the average Euclidean distance from the embeddings of each dataset to the averaged embedding of each of the two datasets.

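Table 1. Average Euclidean distance from the embeddings of each dataset (rows) to the averaged embedding of each dataset (columns).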

Dataset embedding	Drug virus info(ave)	CTD(ave)
Drug virus info	24.294509640108675	3.9599758181140876
CTD	25.281865959883508	2.6539594222977936

As can be seen from the table, the average distance from the Drug virus info viral sequence embeddings to their own averaged embedding is 24.29, while the average distance from the CTD viral sequence embeddings to that same averaged embedding is 25.28. The mean distance from the Drug virus info viral sequence embeddings to the CTD averaged embedding is 3.96, while from the CTD embeddings it is only 2.65. This indicates that the embeddings from the two datasets are quite different.
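A table of this shape can be produced along the following lines. This is only a sketch under stated assumptions: the function names `mean_vector` and `mean_distance_to` are hypothetical, the 2-dimensional vectors are toy data rather than the real 128-dimensional embeddings, and the printed numbers are unrelated to those in Table 1.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mean_distance_to(vectors, center):
    """Average Euclidean distance from each vector to a fixed center."""
    return sum(math.dist(v, center) for v in vectors) / len(vectors)


# Toy stand-ins (assumption) for the per-sequence embeddings of each dataset.
datasets = {
    "Drug virus info": [[0.0, 0.0], [0.0, 2.0]],
    "CTD": [[3.0, 0.0], [5.0, 0.0]],
}

# One averaged embedding per dataset (the table columns).
centers = {name: mean_vector(vectors) for name, vectors in datasets.items()}

# Rows are the source embeddings, columns are the averaged embeddings.
table = {
    (source, target): mean_distance_to(vectors, centers[target])
    for source, vectors in datasets.items()
    for target in centers
}

for (source, target), value in table.items():
    print(f"{source} -> {target} (ave): {value:.4f}")
```

Each cell averages the per-sequence distances to a fixed center, which is different from measuring the distance between the two averaged embeddings themselves.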