Title: | Gapped-Kmer Support Vector Machine |
---|---|
Description: | Imports the 'gkmSVM' v2.0 functionalities into R <https://www.beerlab.org/gkmsvm/> It also uses the 'kernlab' library (separate R package by different authors) for various SVM algorithms. Users should note that the suggested packages 'rtracklayer', 'GenomicRanges', 'BSgenome', 'BiocGenerics', 'Biostrings', 'GenomeInfoDb', 'IRanges', and 'S4Vectors' are all BioConductor packages <https://bioconductor.org>. |
Authors: | Mahmoud Ghandi |
Maintainer: | Mike Beer <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.83.0 |
Built: | 2025-01-25 05:07:53 UTC |
Source: | https://github.com/cran/gkmSVM |
Imports the 'gkmSVM' v2.0 functionalities into R <http://www.beerlab.org/gkmsvm/> . It also uses the 'kernlab' library (separate R package by different authors) for various SVM algorithms.
The gkm-SVM provides implementation of a new SVM kernel method using gapped k-mers as features for DNA or Protein sequences.
There are three main functions in the gkmSVM package:
gkmsvm_kernel: computes the kernel matrix
gkmsvm_train: computes the SVM coefficients
gkmsvm_classify: scores new sequences using the SVM model
Tutorial
========
We introduce the users to the basic workflow of our gkmSVM step-by-step. Please refer to help messages for more detailed information of each function.
1) making a kernel matrix
First of all, we should calculate a full kernel matrix before training SVM classifiers. In this tutorial, we are going to use test_positives.fa as a positive set, and test_negatives.fa as a negative set.
#Input file names:
posfn= 'test_positives.fa' #positive set (FASTA format)
negfn= 'test_negatives.fa' #negative set (FASTA format)
testfn= 'test_testset.fa' #test set (FASTA format)
Alternatively if the negative set is not available, and positive set is provided as a bed file, genNullSeqs function could be used to generate the negative set and positive set sequences.
#Output file names:
kernelfn= 'test_kernel.txt' #kernel matrix
svmfnprfx= 'test_svmtrain' #SVM files
outfn = 'output.txt' #output scores for sequences in the test set
gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel
2) training SVM
We can now train a SVM classifier using the kernel matrix generated above. For that we use gkmsvm_train function It takes four arguments; kernel file, positive sequences file, negative sequences file, and prefix of output file names for the svm model.
gkmsvm_train(kernelfn, posfn, negfn, svmfnprfx); #trains SVM
It will generate two files, test_svmtrain_svalpha.out and test_svmtrain_svseq.fa, which will then be used for classification/scoring of test sequences as described below.
3) classification using SVM
gkmsvm_classify can be used to score any set of sequences. Here, we will score the test sequences which are given in test_testset.fa. Note that the same set of parameters used in the gkmsvm_kernel should always be specified for optimal classification (here we used default parameters).
gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences
In a more advanced example, we set the word length L=18, and the number of non-gapped positions K=7, and maximum number of mismatches maxnmm=4:
gkmsvm_kernel(posfn, negfn, kernelfn, L=18, K=7, maxnmm=4); #computes kernel
gkmsvm_train(kernelfn,posfn, negfn, svmfnprfx); #trains SVM
gkmsvm_classify(testfn, svmfnprfx, outfn, L=18, K=7, maxnmm=4); #scores test sequences
In another example, we run a 5-fold cross validation to plot the ROC curves:
gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel
cvres = gkmsvm_trainCV(kernelfn, posfn, negfn, svmfnprfx, outputPDFfn='ROC.pdf', outputCVpredfn='cvpred.out'); #trains SVM, plots ROC and PRC curves, and outputs model predictions.
Mahmoud Ghandi
Maintainer: Mike Beer <[email protected]>
Ghandi M, Lee D, Mohammad-Noori M, Beer MA. 2014. Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features. PLoS Comput Biol 10: e1003711.
Ghandi M, Mohammad-Noori M, Ghareghani N, Lee D, Garraway LA, and Beer MA. 2016. gkmSVM an R package for gapped-kmer SVM, Bioinformatics 32 (14), 2205-2207.
#Input file names: posfn= 'test_positives.fa' #positive set (FASTA format) negfn= 'test_negatives.fa' #negative set (FASTA format) testfn= 'test_testset.fa' #test set (FASTA format) #Output file names: kernelfn= 'test_kernel.txt' #kernel matrix svmfnprfx= 'test_svmtrain' #SVM files outfn = 'output.txt' #output scores for sequences in the test set # gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel # gkmsvm_train(kernelfn, posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences # using L=18, K=7, maxnmm=4 # gkmsvm_kernel(posfn, negfn, kernelfn, L=18, K=7, maxnmm=4); #computes kernel # gkmsvm_train(kernelfn, posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_classify(testfn, svmfnprfx, outfn, L=18, K=7, maxnmm=4); #scores test sequences
#Input file names: posfn= 'test_positives.fa' #positive set (FASTA format) negfn= 'test_negatives.fa' #negative set (FASTA format) testfn= 'test_testset.fa' #test set (FASTA format) #Output file names: kernelfn= 'test_kernel.txt' #kernel matrix svmfnprfx= 'test_svmtrain' #SVM files outfn = 'output.txt' #output scores for sequences in the test set # gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel # gkmsvm_train(kernelfn, posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences # using L=18, K=7, maxnmm=4 # gkmsvm_kernel(posfn, negfn, kernelfn, L=18, K=7, maxnmm=4); #computes kernel # gkmsvm_train(kernelfn, posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_classify(testfn, svmfnprfx, outfn, L=18, K=7, maxnmm=4); #scores test sequences
Generates null sequences (negative set) with matching repeat and GC content as the input bed file for positive set regions.
genNullSeqs( inputBedFN, genomeVersion='hg19', outputBedFN = 'negSet.bed', outputPosFastaFN = 'posSet.fa', outputNegFastaFN = 'negSet.fa', xfold = 1, repeat_match_tol = 0.02, GC_match_tol = 0.02, length_match_tol = 0.02, batchsize = 5000, nMaxTrials = 20, genome = NULL)
genNullSeqs( inputBedFN, genomeVersion='hg19', outputBedFN = 'negSet.bed', outputPosFastaFN = 'posSet.fa', outputNegFastaFN = 'negSet.fa', xfold = 1, repeat_match_tol = 0.02, GC_match_tol = 0.02, length_match_tol = 0.02, batchsize = 5000, nMaxTrials = 20, genome = NULL)
inputBedFN |
positive set regions |
genomeVersion |
genome version: 'hg19' and 'hg18' are supported. Default='hg19'. For other genomes, provide the BSgenome object using parameter 'genome' |
outputBedFN |
output file name for the null sequences genomic regions. Default='negSet.bed' |
outputPosFastaFN |
output file name for the positive set sequences. Default='posSet.fa' |
outputNegFastaFN |
output file name for the negative set sequences. Default='negSet.fa' |
xfold |
controls the desired number of sequences in the negative set. Default=1 (same number as in positive set) |
repeat_match_tol |
tolerance for difference in repeat ratio. Default=0.02 (repeat content difference of 0.02 or less is acceptable) |
GC_match_tol |
tolerance for difference in GC content. Default=0.02 |
length_match_tol |
tolerance for difference in relative sequence length. Default=0.02 |
batchsize |
number of candidate random sequences tested in each trial. Default=5000 |
nMaxTrials |
maximum number of trials. Default=20. |
genome |
BSgenome object. Default=NULL. If this parameter is used, parameter genomeVersion is ignored. |
Writes the null sequences to files with the provided filenames. Outputs the filename for the output negative sequences file.
Mahmoud Ghandi
# Example 1: # genNullSeqs('ctcfpos.bed' ); #Example 2: # genNullSeqs('ctcfpos.bed', nMaxTrials=3, xfold=2, genomeVersion = 'hg18' ); #Example 3: # genNullSeqs('ctcfpos.bed', xfold=2, genomeVersion = 'hg18', outputBedFN = 'ctcf_negSet.bed', # outputPosFastaFN = 'ctcf_posSet.fa',outputNegFastaFN = 'ctcf_negSet.fa' ); #Example 4: # Input file names: posBedFN = 'test_positives.bed' # positive set genomic ranges (bed format) genomeVer = 'hg19' #genome version testfn= 'test_testset.fa' #test set (FASTA format) # output file names: posfn= 'test_positives.fa' #positive set (FASTA format) negfn= 'test_negatives.fa' #negative set (FASTA format) kernelfn= 'test_kernel.txt' #kernel matrix svmfnprfx= 'test_svmtrain' #SVM files outfn = 'output.txt' #output scores for sequences in the test set # genNullSeqs(posBedFN, genomeVersion = genomeVer, # outputPosFastaFN = posfn, outputNegFastaFN = negfn ); # gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel # gkmsvm_train(kernelfn, posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences # using L=18, K=7, maxnmm=4 # gkmsvm_kernel(posfn, negfn, kernelfn, L=18, K=7, maxnmm=4); #computes kernel # gkmsvm_train(kernelfn, posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_classify(testfn, svmfnprfx, outfn, L=18, K=7, maxnmm=4); #scores test sequences
# Example 1: # genNullSeqs('ctcfpos.bed' ); #Example 2: # genNullSeqs('ctcfpos.bed', nMaxTrials=3, xfold=2, genomeVersion = 'hg18' ); #Example 3: # genNullSeqs('ctcfpos.bed', xfold=2, genomeVersion = 'hg18', outputBedFN = 'ctcf_negSet.bed', # outputPosFastaFN = 'ctcf_posSet.fa',outputNegFastaFN = 'ctcf_negSet.fa' ); #Example 4: # Input file names: posBedFN = 'test_positives.bed' # positive set genomic ranges (bed format) genomeVer = 'hg19' #genome version testfn= 'test_testset.fa' #test set (FASTA format) # output file names: posfn= 'test_positives.fa' #positive set (FASTA format) negfn= 'test_negatives.fa' #negative set (FASTA format) kernelfn= 'test_kernel.txt' #kernel matrix svmfnprfx= 'test_svmtrain' #SVM files outfn = 'output.txt' #output scores for sequences in the test set # genNullSeqs(posBedFN, genomeVersion = genomeVer, # outputPosFastaFN = posfn, outputNegFastaFN = negfn ); # gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel # gkmsvm_train(kernelfn, posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences # using L=18, K=7, maxnmm=4 # gkmsvm_kernel(posfn, negfn, kernelfn, L=18, K=7, maxnmm=4); #computes kernel # gkmsvm_train(kernelfn, posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_classify(testfn, svmfnprfx, outfn, L=18, K=7, maxnmm=4); #scores test sequences
Given support vectors SVs and corresponding coefficients alphas and a set of sequences, calculates the SVM scores for the sequences.
gkmsvm_classify(seqfile, svmfnprfx, outfile, L=10, K=6, maxnmm=3, maxseqlen=10000, maxnumseq=1000000, useTgkm=1, alg=0, addRC=TRUE, usePseudocnt=FALSE, batchSize=100000, wildcardLambda=1.0, wildcardMismatchM=2, alphabetFN="NULL", svseqfile=NA, alphafile=NA)
gkmsvm_classify(seqfile, svmfnprfx, outfile, L=10, K=6, maxnmm=3, maxseqlen=10000, maxnumseq=1000000, useTgkm=1, alg=0, addRC=TRUE, usePseudocnt=FALSE, batchSize=100000, wildcardLambda=1.0, wildcardMismatchM=2, alphabetFN="NULL", svseqfile=NA, alphafile=NA)
seqfile |
input sequences file name (FASTA format) |
svmfnprfx |
SVM model file name prefix |
outfile |
output file name |
L |
word length, default=10 |
K |
number of informative columns, default=6 |
maxnmm |
maximum number of mismatches to consider, default=3 |
maxseqlen |
maximum sequence length in the sequence files, default=10000 |
maxnumseq |
maximum number of sequences in the sequence files, default=1000000 |
useTgkm |
filter type: 0(use full filter), 1(use truncated filter: this gaurantees non-negative counts for all L-mers), 2(use h[m], gkm count vector), 3(wildcard), 4(mismatch), default=1 |
alg |
algorithm type: 0(auto), 1(XOR Hashtable), 2(tree), default=0 |
addRC |
adds reverse complement sequences, default=TRUE |
usePseudocnt |
adds a constant to count estimates, default=FALSE |
batchSize |
number of sequences to compute scores for in batch, default=100000 |
wildcardLambda |
lambda for wildcard kernel, defaul=0.9 |
wildcardMismatchM |
max mismatch for Mismatch kernel or wildcard kernel, default=2 |
alphabetFN |
alphabets file name, if not specified, it is assumed the inputs are DNA sequences |
svseqfile |
SVM support vectors sequence file name (not needed if svmfnprfx is provided) |
alphafile |
SVM support vectors weights file name (not needed if svmfnprfx is provided) |
classification using SVM: gkmsvm_classify can be used to score any set of sequences. Note that the same set of parameters (L, K, maxnmm) used in the gkmsvm_kernel should be specified for optimal classification.
gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences
Mahmoud Ghandi
#Input file names: posfn= 'test_positives.fa' #positive set (FASTA format) negfn= 'test_negatives.fa' #negative set (FASTA format) testfn= 'test_testset.fa' #test set (FASTA format) #Output file names: kernelfn= 'test_kernel.txt' #kernel matrix svmfnprfx= 'test_svmtrain' #SVM files outfn = 'output.txt' #output scores for sequences in the test set # gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel # gkmsvm_train(kernelfn,posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences
#Input file names: posfn= 'test_positives.fa' #positive set (FASTA format) negfn= 'test_negatives.fa' #negative set (FASTA format) testfn= 'test_testset.fa' #test set (FASTA format) #Output file names: kernelfn= 'test_kernel.txt' #kernel matrix svmfnprfx= 'test_svmtrain' #SVM files outfn = 'output.txt' #output scores for sequences in the test set # gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel # gkmsvm_train(kernelfn,posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences
Given support vectors SVs and corresponding coefficients alphas and a pair of file test sequence files (one for reference allele, and one for alternate allele), calculates the deltaSVM scores for the sequences.
gkmsvm_delta(seqfile_allele1, seqfile_allele2, svmfnprfx, outfile, L = 10, K = 6, maxnmm = 3, maxseqlen = 10000, maxnumseq = 1e+06, useTgkm = 1, alg = 2, addRC = TRUE, usePseudocnt = FALSE, batchSize = 1e+05, wildcardLambda = 1, wildcardMismatchM = 2, alphabetFN = "NULL", svseqfile = NA,alphafile = NA, outfile_allele1 = NA, outfile_allele2 = NA)
gkmsvm_delta(seqfile_allele1, seqfile_allele2, svmfnprfx, outfile, L = 10, K = 6, maxnmm = 3, maxseqlen = 10000, maxnumseq = 1e+06, useTgkm = 1, alg = 2, addRC = TRUE, usePseudocnt = FALSE, batchSize = 1e+05, wildcardLambda = 1, wildcardMismatchM = 2, alphabetFN = "NULL", svseqfile = NA,alphafile = NA, outfile_allele1 = NA, outfile_allele2 = NA)
seqfile_allele1 |
fasta file containing the test sequences (reference allele) |
seqfile_allele2 |
fasta file containing the test sequences (alternate allele). The sequences in this file should be in the exact same order as in seqfile_allele1. |
svmfnprfx |
SVM model file name prefix |
outfile |
output file name |
L |
word length, default=10 |
K |
number of informative columns, default=6 |
maxnmm |
maximum number of mismatches to consider, default=3 |
maxseqlen |
maximum sequence length in the sequence files, default=10000 |
maxnumseq |
maximum number of sequences in the sequence files, default=1000000 |
useTgkm |
filter type: 0(use full filter), 1(use truncated filter: this gaurantees non-negative counts for all L-mers), 2(use h[m], gkm count vector), 3(wildcard), 4(mismatch), default=1 |
alg |
algorithm type: 0(auto), 1(XOR Hashtable), 2(tree), default=0 |
addRC |
adds reverse complement sequences, default=TRUE |
usePseudocnt |
adds a constant to count estimates, default=FALSE |
batchSize |
number of sequences to compute scores for in batch, default=100000 |
wildcardLambda |
lambda for wildcard kernel, defaul=0.9 |
wildcardMismatchM |
max mismatch for Mismatch kernel or wildcard kernel, default=2 |
alphabetFN |
alphabets file name, if not specified, it is assumed the inputs are DNA sequences |
svseqfile |
SVM support vectors sequence file name (not needed if svmfnprfx is provided) |
alphafile |
SVM support vectors weights file name (not needed if svmfnprfx is provided) |
outfile_allele1 |
output filename for gkmSVM scores for the reference sequences (optional) |
outfile_allele2 |
output filename for gkmSVM scores for the alternate sequences (optional) |
predicting the effect of variants using gkmSVM model: gkmsvm_delta can be used to predict the effect of sequence variants. The sequences corresponding to reference allele and alternate alleles are given in two separate files. gkmSVM is used to score each set of sequences, and the difference in the gkmSVM score for the reference and alternate allele is reported. Note that the same set of parameters (L, K, maxnmm) used in the gkmsvm_kernel should be specified for optimal scoring.
gkmsvm_kernel(seqfile_allele1, seqfile_allele2, svmfnprfx, outfn); #scores test sequences
deltaSVM scores
Mahmoud Ghandi
Ghandi M, Mohammad-Noori M, Ghareghani N, Lee D, Garraway LA, and Beer MA. gkmSVM: an R package for gapped-kmer SVM, Bioinformatics 2016.
Ghandi M, Lee D, Mohammad-Noori M, Beer MA. 2014. Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features. PLoS Comput Biol 10: e1003711.
Lee D, Gorkin DU, Baker M, Strober BJ, Asoni AL, McCallion AS, and Beer MA. A method to predict the impact of regulatory variants from DNA sequence. Nature Genetics 2015.
#Input file names: posfn= 'test_positives.fa' #positive set (FASTA format) negfn= 'test_negatives.fa' #negative set (FASTA format) testfn_ref= 'test_testsetRef.fa' #test set (reference allele) (FASTA format) testfn_alt= 'test_testsetAlt.fa' #test set (alternate allele) (FASTA format) #Output file names: kernelfn= 'test_kernel.txt' #kernel matrix svmfnprfx= 'test_svmtrain' #SVM files outfn = 'output.txt' #output delta svm scores for sequences in the test set # gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel # gkmsvm_train(kernelfn,posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_delta(testfn_ref,testfn_alt, svmfnprfx, outfn); #scores test sequences
#Input file names: posfn= 'test_positives.fa' #positive set (FASTA format) negfn= 'test_negatives.fa' #negative set (FASTA format) testfn_ref= 'test_testsetRef.fa' #test set (reference allele) (FASTA format) testfn_alt= 'test_testsetAlt.fa' #test set (alternate allele) (FASTA format) #Output file names: kernelfn= 'test_kernel.txt' #kernel matrix svmfnprfx= 'test_svmtrain' #SVM files outfn = 'output.txt' #output delta svm scores for sequences in the test set # gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel # gkmsvm_train(kernelfn,posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_delta(testfn_ref,testfn_alt, svmfnprfx, outfn); #scores test sequences
Generates a lower triangle of kernel matrix (i.e. pairwise similarities) between the sequences.
gkmsvm_kernel(posfile, negfile, outfile, L=10, K=6, maxnmm=3, maxseqlen=10000, maxnumseq=1000000, useTgkm=1, alg=0, addRC=TRUE, usePseudocnt=FALSE, wildcardLambda=1.0, wildcardMismatchM=2, alphabetFN="NULL")
gkmsvm_kernel(posfile, negfile, outfile, L=10, K=6, maxnmm=3, maxseqlen=10000, maxnumseq=1000000, useTgkm=1, alg=0, addRC=TRUE, usePseudocnt=FALSE, wildcardLambda=1.0, wildcardMismatchM=2, alphabetFN="NULL")
posfile |
positive sequences file name (FASTA format) |
negfile |
negative sequences file name (FASTA format) |
outfile |
output file name |
L |
word length, default=10 |
K |
number of informative columns, default=6 |
maxnmm |
maximum number of mismatches to consider, default=3 |
maxseqlen |
maximum sequence length in the sequence files, default=10000 |
maxnumseq |
maximum number of sequences in the sequence files, default=1000000 |
useTgkm |
filter type: 0(use full filter), 1(use truncated filter: this gaurantees non-negative counts for all L-mers), 2(use h[m], gkm count vector), 3(wildcard), 4(mismatch), default=1 |
alg |
algorithm type: 0(auto), 1(XOR Hashtable), 2(tree), default=0 |
addRC |
adds reverse complement sequences, default=TRUE |
usePseudocnt |
adds a constant to count estimates, default=FALSE |
wildcardLambda |
lambda for wildcard kernel, defaul=0.9 |
wildcardMismatchM |
max mismatch for Mismatch kernel or wildcard kernel, default=2 |
alphabetFN |
alphabets file name, if not specified, it is assumed the inputs are DNA sequences |
It calculates the full kernel matrix that can be then used to train an SVM classifier. gkmsvm_kernel(posfn, negfn, kernelfn);
Mahmoud Ghandi
#Input file names: posfn= 'test_positives.fa' #positive set (FASTA format) negfn= 'test_negatives.fa' #negative set (FASTA format) testfn= 'test_testset.fa' #test set (FASTA format) #Output file names: kernelfn= 'test_kernel.txt' #kernel matrix svmfnprfx= 'test_svmtrain' #SVM files outfn = 'output.txt' #output scores for sequences in the test set # gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel # gkmsvm_train(kernelfn,posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences
#Input file names: posfn= 'test_positives.fa' #positive set (FASTA format) negfn= 'test_negatives.fa' #negative set (FASTA format) testfn= 'test_testset.fa' #test set (FASTA format) #Output file names: kernelfn= 'test_kernel.txt' #kernel matrix svmfnprfx= 'test_svmtrain' #SVM files outfn = 'output.txt' #output scores for sequences in the test set # gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel # gkmsvm_train(kernelfn,posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences
Using the kernel matrix created by 'gkmsvm_kernel', this function trains the SVM classifier. Here we rely on the 'kernlab' package, and merely provide a wrapper function.
gkmsvm_train(kernelfn, posfn, negfn, svmfnprfx, Type="C-svc", C=1, shrinking=FALSE, ...)
gkmsvm_train(kernelfn, posfn, negfn, svmfnprfx, Type="C-svc", C=1, shrinking=FALSE, ...)
kernelfn |
kernel matrix file name |
posfn |
positive sequences file name |
negfn |
negative sequences file name |
svmfnprfx |
output SVM model file name prefix |
Type |
optional: SVM type (default='C-svc'), see 'kernlab' documentation for more details. |
C |
optional: SVM parameter C (default=1), see 'kernlab' documentation for more details. |
shrinking |
optional: shrinking parameter for kernlab (default=FALSE), see 'kernlab' documentation for more details. |
... |
optional: additional SVM parameters, see 'kernlab' documentation for more details. |
Trains SVM classifier and generates two files: [svmfnprfx]_svalpha.out for SVM alphas and the other for the corresponding SV sequences ([svmfnprfx]_svseq.fa)
Mahmoud Ghandi
#Input file names: posfn= 'test_positives.fa' #positive set (FASTA format) negfn= 'test_negatives.fa' #negative set (FASTA format) testfn= 'test_testset.fa' #test set (FASTA format) #Output file names: kernelfn= 'test_kernel.txt' #kernel matrix svmfnprfx= 'test_svmtrain' #SVM files outfn = 'output.txt' #output scores for sequences in the test set # gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel # gkmsvm_train(kernelfn,posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences
#Input file names: posfn= 'test_positives.fa' #positive set (FASTA format) negfn= 'test_negatives.fa' #negative set (FASTA format) testfn= 'test_testset.fa' #test set (FASTA format) #Output file names: kernelfn= 'test_kernel.txt' #kernel matrix svmfnprfx= 'test_svmtrain' #SVM files outfn = 'output.txt' #output scores for sequences in the test set # gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel # gkmsvm_train(kernelfn,posfn, negfn, svmfnprfx); #trains SVM # gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences
Using the kernel matrix created by 'gkmsvm_kernel', this function trains the SVM classifier. It uses repeated CV to find optimum SVM parameter C. Also generates ROC and PRC curves.
gkmsvm_trainCV(kernelfn, posfn, negfn, svmfnprfx=NA, nCV=5, nrepeat=1, cv=NA, Type="C-svc", C=1, shrinking=FALSE, showPlots=TRUE, outputPDFfn=NA, outputCVpredfn=NA, outputROCfn=NA, ...)
gkmsvm_trainCV(kernelfn, posfn, negfn, svmfnprfx=NA, nCV=5, nrepeat=1, cv=NA, Type="C-svc", C=1, shrinking=FALSE, showPlots=TRUE, outputPDFfn=NA, outputCVpredfn=NA, outputROCfn=NA, ...)
kernelfn |
kernel matrix file name |
posfn |
positive sequences file name |
negfn |
negative sequences file name |
svmfnprfx |
(optional) output SVM model file name prefix |
nCV |
(optional) number of CV folds |
nrepeat |
(optional) number of repeated CVs |
cv |
(optional) CV group label. An array of length (npos+nneg), containing CV group number (between 1 an nCV) for each sequence |
Type |
(optional) SVM type (default='C-svc'), see 'kernlab' documentation for more details. |
C |
(optional)a vector of all values of C (SVM parameter) to be tested. (default=1), see 'kernlab' documentation for more details. |
shrinking |
optional: shrinking parameter for kernlab (default=FALSE), see 'kernlab' documentation for more details. |
showPlots |
generate plots (default==TRUE) |
outputPDFfn |
filename for output PDF, default=NA (no PDF output) |
outputCVpredfn |
filename for output cvpred (predicted CV values), default=NA (no output) |
outputROCfn |
filename for output auROC (Area Under an ROC Curve) and auPRC (Area Under the Precision Recall Curve) values, default=NA (no output) |
... |
optional: additional SVM parameters, see 'kernlab' documentation for more details. |
Trains SVM classifier and generates two files: [svmfnprfx]_svalpha.out for SVM alphas and the other for the corresponding SV sequences ([svmfnprfx]_svseq.fa)
Mahmoud Ghandi
#Input file names: posfn= 'test_positives.fa' #positive set (FASTA format) negfn= 'test_negatives.fa' #negative set (FASTA format) testfn= 'test_testset.fa' #test set (FASTA format) #Output file names: kernelfn= 'test_kernel.txt' #kernel matrix svmfnprfx= 'test_svmtrain' #SVM files outfn = 'output.txt' #output scores for sequences in the test set # gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel # cvres = gkmsvm_trainCV(kernelfn,posfn, negfn, svmfnprfx, # outputPDFfn='ROC.pdf', outputCVpredfn='cvpred.out'); # #trains SVM, plots ROC and PRC curves, and outputs model predictions. # gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences
#Input file names: posfn= 'test_positives.fa' #positive set (FASTA format) negfn= 'test_negatives.fa' #negative set (FASTA format) testfn= 'test_testset.fa' #test set (FASTA format) #Output file names: kernelfn= 'test_kernel.txt' #kernel matrix svmfnprfx= 'test_svmtrain' #SVM files outfn = 'output.txt' #output scores for sequences in the test set # gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel # cvres = gkmsvm_trainCV(kernelfn,posfn, negfn, svmfnprfx, # outputPDFfn='ROC.pdf', outputCVpredfn='cvpred.out'); # #trains SVM, plots ROC and PRC curves, and outputs model predictions. # gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences