The Distant Segments Kernel: a tutorial

This post is a tutorial on how to use the distant segments kernel. String kernels were recently introduced as a more precise way to perform pairwise comparisons. First, I will define the concept of a kernel. A kernel is a function that takes two objects from an input space, maps each of them to a vector in a feature space, and returns the inner product of the two vectors. A kernel therefore associates a real number with a pair of instances. String kernels are a particular case of kernels whose input space is the set of strings. Recall that strings are sequences of symbols drawn from a given alphabet. The distant segments kernel is a string kernel: the feature vector associated with a string is the distribution of its distant segments. See the paper listed in the references for more details.
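
To make the definition concrete, here is a minimal Python sketch of a string kernel. It is not the distant segments kernel: as a stand-in, it uses a much simpler feature map (counts of substrings of length k, often called a spectrum kernel), and the function names are chosen only for this illustration. The point is just to show the two steps, mapping each string to a feature vector and then taking the inner product.

```python
from collections import Counter


def kmer_counts(s, k):
    """Feature map: count every substring of length k in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(s, t, k=2):
    """Inner product of the k-mer count vectors of s and t."""
    fs = kmer_counts(s, k)
    ft = kmer_counts(t, k)
    # A Counter returns 0 for missing keys, so only shared k-mers contribute.
    return sum(count * ft[kmer] for kmer, count in fs.items())


if __name__ == "__main__":
    print(spectrum_kernel("abab", "abba"))
    print(spectrum_kernel("abab", "abab"))
```

The distant segments kernel follows the same pattern, with a richer feature map that is described in the paper.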

In this tutorial, command lines start with the shell prompt seb@ubuntu:~$. First, download the source code.


seb@ubuntu:~$ wget http://boisvert.info/software/PermutationDSKernel.cpp

This software computes the kernel matrix of a set of strings. Now, compile the C++ file with g++.

seb@ubuntu:~$ g++ PermutationDSKernel.cpp -O4 -Wall -o DSK

Executing the program without providing parameters will output the usage.

seb@ubuntu:~$ ./DSK
Usage:
DSkernel l sequenceFile delta_m theta_m out

Here, sequenceFile is a file containing sequences, with exactly one sequence per line. Note that this file is not a FASTA file. In the run shown later in this tutorial, the first argument, l, is set to the number of sequences in the file.
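
If your sequences are stored in a FASTA file, they have to be converted to the one-sequence-per-line layout first. The following Python sketch shows one way to do it. It is not part of the DSK distribution, the function name is hypothetical, and it assumes a plain FASTA layout (header lines starting with ">", sequence data possibly wrapped over several lines).

```python
def fasta_to_lines(fasta_text):
    """Return one sequence string per FASTA record, with headers dropped."""
    sequences = []
    current = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current:
                sequences.append("".join(current))
            current = []
        elif line:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


if __name__ == "__main__":
    example = ">seq1\nsvyd\naaaq\n>seq2\nlsaa\n"
    print("\n".join(fasta_to_lines(example)))
```

Writing the returned strings to a file, one per line, gives a file in the layout that sequenceFile expects.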

In this tutorial, I will use the file sequences.txt. You can copy the content below into sequences.txt on your computer. The sequences.txt file contains protein sequences from the SCOP database.

seb@ubuntu:~$ cat sequences.txt
svydaaaqltadvkkdlrdswkvigsdkkgngvalmttlfadnqetigyfkrlgnvsqgmandklrghsitlmyalqnfidqldnpddlvcvvekfavnhitrkisaaefgkingpikkvlasknfgdkyanawaklvavvqaal
lsaaqkdnvksswakasaawgtagpeffmalfdahddvfakfsglfsgaakgtvkntpemaaqaqsfkglvsnwvdnldnagalegqcktfaanhkargisagqleaafkvlagfmksyggdegawtavagalmgmirpdm
glsaaqrqviaatwkdiagadngagvgkkclikflsahpqmaavfgfsgasdpgvaalgakvlaqigvavshlgdegkmvaqmkavgvrhkgygnkhikaqyfeplgasllsamehriggkmnaaakdawaaayadisgalisglqs
vlsegewqlvlhvwakveadvaghgqdilirlfkshpetlekfdrfkhlkteaemkasedlkkhgvtvltalgailkkkghheaelkplaqshatkhkipikylefiseaiihvlhsrhpgdfgadaqgamnkalelfrkdiaakykelgy
xslsaaeadlagkswapvfanknangldflvalfekfpdsanffadfkgksvadikaspklrdvssriftrlnefvnnaanagkmsamlsqfakehvgfgvgsaqfenvrsmfpgfvasvaappagadaawtklfgliidalkaaga
lsadqistvqasfdkvkgdpvgilyavfkadpsimakftqfagkdlesikgtapfethanrivgffskiigelpnieadvntfvashkprgvthdqlnnfragfvsymkahtdfagaeaawgatldtffgmifskm
seb@ubuntu:~$ wc -l sequences.txt
6 sequences.txt

Pairwise comparisons are usually done with alignment software. String kernels offer an alternative: instead of relying on an alignment, we will use the program that we just compiled from its source code.

seb@ubuntu:~$ ./DSK 6 sequences.txt 1000 3 KernelMatrix-DSK-1000-3.txt

This command computed the distant segments kernel for the 6 sequences in sequences.txt. The hyperparameters delta_m and theta_m were set to 1000 and 3, respectively. See the paper for a clear explanation of the role of these hyperparameters in the distant segments kernel. The last argument is simply the output file.

The output file contains a square matrix. For each pair of sequences, it contains the value of the distant segments kernel. Let us look at the content of this file.

seb@ubuntu:~$ cat KernelMatrix-DSK-1000-3.txt
x 0 1 2 3 4 5
0 95925 5707 6163 5114 6218 4130
1 5707 93627 9197 5396 9088 5900
2 6163 9197 103266 6711 8617 5547
3 5114 5396 6711 105790 5833 4359
4 6218 9088 8617 5833 101886 6256
5 4130 5900 5547 4359 6256 83565
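
The layout shown above (a header row starting with x, then one row per sequence prefixed by its index) can be read with a few lines of Python. The sketch below is only an illustration with a hypothetical function name. It parses the top-left 3x3 corner of the matrix above and checks that the matrix is symmetric and that each diagonal entry is the largest value in its row.

```python
def parse_kernel_matrix(text):
    """Parse the matrix layout written by DSK into (labels, rows of floats)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    labels = rows[0][1:]  # skip the leading "x" of the header row
    matrix = [[float(value) for value in row[1:]] for row in rows[1:]]
    return labels, matrix


def is_symmetric(matrix):
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


def diagonal_is_row_maximum(matrix):
    return all(row[i] == max(row) for i, row in enumerate(matrix))


if __name__ == "__main__":
    # Top-left corner of KernelMatrix-DSK-1000-3.txt, copied from above.
    sample = "x 0 1 2\n0 95925 5707 6163\n1 5707 93627 9197\n2 6163 9197 103266\n"
    labels, matrix = parse_kernel_matrix(sample)
    print(labels, is_symmetric(matrix), diagonal_is_row_maximum(matrix))
```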

As we can see, the matrix is diagonally dominant and the inner products are large. The distant segments kernel has broad applicability. One application is to use it as the similarity operator of a support vector machine (SVM). svmlight, an implementation of the SVM algorithm, provides ways to plug in any kernel. You can download this package to use a precomputed matrix with svmlight. Note that svmlight itself must be downloaded separately, and that it is free for academic use.

For real-world bioinformatics applications, the kernel matrix needs to be normalized. One of the reasons is that proteins share subsequences but vary in length. One way to normalize the matrix is to scale every feature vector to unit norm. In this last part of the tutorial, I show how to normalize the kernel matrix associated with the distant segments kernel. Let us compile the program Normalize.

seb@ubuntu:~$ wget http://genome.ulaval.ca/dav/boiseb01/pub/software/svmoptimize-v1.zip
seb@ubuntu:~$ unzip svmoptimize-v1.zip
seb@ubuntu:~$ gcc -O4 -Wall svmoptimize-v1/Normalize.c -o svmoptimize-v1/Normalize -lm

Again, executing the program without arguments will show the correct usage.

seb@ubuntu:~$ ./svmoptimize-v1/Normalize
usage: program matrix n
k'(x,y) <- k(x,y)/sqrt(k(x,x) k(y,y))

The following command writes the normalized matrix to the file KernelMatrix-DSK-1000-3.txt.1:

seb@ubuntu:~$ ./svmoptimize-v1/Normalize KernelMatrix-DSK-1000-3.txt 6 > KernelMatrix-DSK-1000-3.txt.1
seb@ubuntu:~$ cat KernelMatrix-DSK-1000-3.txt.1
x 0 1 2 3 4 5
0 1.000000 0.060220 0.061922 0.050766 0.062897 0.046129
1 0.060220 1.000000 0.093533 0.054219 0.093049 0.066702
2 0.061922 0.093533 1.000000 0.064208 0.084008 0.059713
3 0.050766 0.054219 0.064208 1.000000 0.056184 0.046361
4 0.062897 0.093049 0.084008 0.056184 1.000000 0.067800
5 0.046129 0.066702 0.059713 0.046361 0.067800 1.000000
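
The formula printed in the usage message, k'(x,y) = k(x,y)/sqrt(k(x,x) k(y,y)), can be checked independently with a short Python sketch. This is not the code of Normalize.c, and the function name is chosen for illustration only. It applies the formula to the top-left 3x3 corner of the unnormalized matrix shown earlier.

```python
import math


def normalize_kernel(matrix):
    """Apply k'(x,y) = k(x,y) / sqrt(k(x,x) * k(y,y)) to a square matrix."""
    n = len(matrix)
    return [
        [matrix[i][j] / math.sqrt(matrix[i][i] * matrix[j][j]) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Top-left corner of KernelMatrix-DSK-1000-3.txt.
    corner = [
        [95925.0, 5707.0, 6163.0],
        [5707.0, 93627.0, 9197.0],
        [6163.0, 9197.0, 103266.0],
    ]
    for row in normalize_kernel(corner):
        print(" ".join(f"{value:.6f}" for value in row))
```

Because the formula only uses the two diagonal entries involved, normalizing a corner of the matrix should give the same values as the corresponding corner of the full normalized matrix above.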

Normalizing the matrix preserves the angle between proteins in the feature space but sets the norm of every protein to 1. Presumably, the angle between proteins is more important than their norm. This normalized matrix is suitable for supervised learning with the SVM, if labels are available. This concludes the tutorial.



References:

Sebastien Boisvert, Mario Marchand, Francois Laviolette, and Jacques Corbeil. HIV-1 coreceptor usage prediction without multiple alignments: an application of string kernels. Retrovirology, 5(1):110, Dec 2008.
