Search

Search for Sequences

For all types of sequence searches:

  • Do not enter spaces or dashes.
    Example: do not use: G E T R A P L, G-E-T-R-A-P-L
  • Do not enter amino acid modifications, such as phosphorylation, terminal groups, or other marks.
    Example: do not use: Myr-GE(pT)RAPL-NH2
  • Remove the D prefix from D-amino acids.
    Example: do not use: Gly-Glu-Thr-Arg-D-Ala-Pro-Leu
  • Do not use three-letter amino acid codes or full names.
    Example: do not use: Gly-Glu-Thr-Arg-Ala-Pro-Leu
  • Use only the single-letter amino acid code, uppercase or lowercase.
    Example: for all of the above, use: GETRAPL, getrapl
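
The rules above can be sketched as a small clean-up helper. This is only an illustration, not part of PepBank: the function name normalize_peptide and its behaviour are assumptions. It strips spaces and dashes, converts three-letter codes, drops the D marker in a three-letter context, and refuses anything that still contains modification marks rather than guessing. A run of three-letter codes is only recognized when separated by spaces or dashes.

```python
import re

# Standard three-letter to single-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
VALID_LETTERS = set(THREE_TO_ONE.values())


def normalize_peptide(raw):
    """Return an uppercase single-letter sequence, or raise ValueError.

    Hypothetical helper: the parsing rules are assumptions for illustration.
    """
    # Split on whitespace and dashes, discarding empty pieces.
    tokens = [t for t in re.split(r"[\s-]+", raw.strip()) if t]
    # A standalone "D" is treated as a D-amino-acid marker only when all
    # other tokens are three-letter codes.
    non_d = [t for t in tokens if t.upper() != "D"]
    three_letter = len(non_d) > 1 and all(
        t.capitalize() in THREE_TO_ONE for t in non_d
    )
    if three_letter:
        return "".join(THREE_TO_ONE[t.capitalize()] for t in non_d)
    sequence = "".join(tokens).upper()
    # Anything left that is not a standard residue letter (brackets, digits,
    # terminal groups) is rejected instead of silently removed.
    if not sequence or not set(sequence) <= VALID_LETTERS:
        raise ValueError("not a plain single-letter sequence: %r" % raw)
    return sequence


if __name__ == "__main__":
    for text in ["G E T R A P L", "G-E-T-R-A-P-L",
                 "Gly-Glu-Thr-Arg-D-Ala-Pro-Leu", "getrapl"]:
        print(text, "->", normalize_peptide(text))
```

A modified sequence such as Myr-GE(pT)RAPL-NH2 raises ValueError in this sketch, so the marks have to be removed by hand.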

To search for sequences, use the tools designed specifically for this purpose. In most cases these are not Quick Search or Advanced Search, which are intended for text search or pattern/motif search. Instead, we suggest using BLAST search (if the query sequence is 4 amino acids or longer) or Smith-Waterman search (3 amino acids or longer). For shorter sequences, use either of these, but increase the Expect or E() threshold. BLAST search and Smith-Waterman search:

  • use substitution matrices which take into account different similarities or distances between amino acids. For example, when searching for GETRAPL, the hit GETKAPL will be closer to the top than GETDAPL. This is based on the fact that R is more often replaced with K than with D in evolution. In most cases this is what you want.
  • allow multiple query sequences in a single search (bulk queries).
  • find partial or shorter matches, in addition to exact matches. Although we store only peptides that are 20 amino acids or shorter, the query can be much longer.
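
As a rough illustration of why GETKAPL ranks above GETDAPL, the sketch below sums substitution scores position by position, without gaps. The nine scores are taken from the BLOSUM62 matrix; the helper names are made up for this example, and the real BLAST and SSEARCH programs do considerably more (gapped alignment, statistics) than this simple sum.

```python
# A few BLOSUM62 scores (the matrix is symmetric); only the pairs needed here.
BLOSUM62_SUBSET = {
    ("G", "G"): 6, ("E", "E"): 5, ("T", "T"): 5, ("R", "R"): 5,
    ("A", "A"): 4, ("P", "P"): 7, ("L", "L"): 4,
    ("R", "K"): 2, ("R", "D"): -2,
}


def pair_score(a, b):
    """Look up a substitution score in either orientation."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def ungapped_score(query, hit):
    """Sum the per-position scores of two equal-length sequences."""
    if len(query) != len(hit):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(q, h) for q, h in zip(query, hit))


if __name__ == "__main__":
    for hit in ["GETRAPL", "GETKAPL", "GETDAPL"]:
        print(hit, ungapped_score("GETRAPL", hit))
```

With these values the exact match scores 36, GETKAPL 33, and GETDAPL 29, which matches the ordering described above.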

Try this query example:

>exact_query
GETRAPL
>partial_query
GETKAPL
>long_query
GETRAPLAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQ
AQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQA
QAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQAQ
AQAQAQAQAQ
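
A bulk query like the one above can also be assembled programmatically. The following is a minimal sketch, assuming the search form accepts ordinary FASTA text; the function name build_bulk_query and the default line width of 60 are choices made for this example only.

```python
def build_bulk_query(queries, width=60):
    """Format a {name: sequence} mapping as multi-sequence FASTA text."""
    lines = []
    for name, sequence in queries.items():
        lines.append(">" + name)
        # Wrap long sequences over several lines, as in the example above.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_bulk_query({
        "exact_query": "GETRAPL",
        "partial_query": "GETKAPL",
    }))
```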

As the next best choice, use Advanced Search for Peptide: Sequence, or Quick Search. Note that with these tools a partially matching query (GETKAPL) or a long query (GETRAPLAQAQ...) will not return any results, which is probably not what you want. Quick Search and Advanced Search are designed mostly for text search rather than sequence search. Their main advantage for sequence search is the ability to search for more precisely specified patterns or motifs. Examples:

  • GET*APL will find entries with GET and APL having 0 or more characters in between, such as GETAPL, GETRAPL, GETKAPL, AAAGETQQAPLQQ, etc.
    Beware that Quick Search will also find strings like this in text: 'GET me an APL', because it searches all fields, including both sequences and text.
  • G_T_APL will find entries with G followed by any one character, followed by T, followed by any one character, followed by APL - such as GETRAPL, GQTAAPL, KKGQTAAPLWW, etc.

Quick and Advanced Search

For both Quick Search and Advanced Search, queries are case-insensitive.

The wildcard characters are star ('*') and underscore ('_'). Star matches zero or more characters. Underscore matches any single character. A star is automatically added to the left and to the right of the query. To prevent this, put double quotes around the query. To browse the entries, click on 'Browse Interactions', or use Quick Search with an empty query or with * and hit 'submit'.

Examples below show whether the entry can be found with the search query.

Entry     Search query   Found?
abc       abc            yes
abc       AbC            yes
abc       "abc"          yes
abc de    abc            yes
abc de    "abc"          no
abc de    bc d           yes
abc de    bc *           yes
abc de    bc             yes
abc de    ab *           no
abc de    b*d            yes
abc de    d*b            no
abc de    "abc de"       yes
abc de    "a*d*"         yes
abc de    "a_c*"         yes
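
The matching rules can be modelled with regular expressions. The sketch below is an approximation written for illustration, not the site's actual implementation: the function names are hypothetical, and it assumes that an unquoted query may match anywhere in the entry while a quoted query must match the whole entry.

```python
import re


def wildcard_to_regex(query):
    """Translate '*' and '_' wildcards into a regular expression."""
    parts = []
    for ch in query:
        if ch == "*":
            parts.append(".*")   # star: zero or more characters
        elif ch == "_":
            parts.append(".")    # underscore: exactly one character
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def is_found(entry, query):
    """Return True if the query would find the entry under the assumed rules."""
    quoted = len(query) >= 2 and query.startswith('"') and query.endswith('"')
    if quoted:
        # Quotes suppress the implicit stars: the whole entry must match.
        pattern = wildcard_to_regex(query[1:-1])
        return re.fullmatch(pattern, entry, re.IGNORECASE) is not None
    # Unquoted: implicit star on both sides, so match anywhere.
    pattern = wildcard_to_regex(query)
    return re.search(pattern, entry, re.IGNORECASE) is not None


if __name__ == "__main__":
    print(is_found("abc de", "bc *"))      # yes in the table
    print(is_found("abc de", '"abc"'))     # no in the table
    print(is_found("KKGQTAAPLWW", "G_T_APL"))
```

Under these assumptions the function reproduces every row of the table above.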

Boolean queries (AND, OR, NOT) are not supported. Examples:

  • Do not use: MMP and Cathepsin
  • Use: MMP
  • Use: MMP*Cathepsin
  • Use: Cathepsin*MMP

We store MEDLINE abstracts without additional modifications, and do not resolve gene synonyms. Thus, queries for MMP2 and MMP-2 are not the same (just as they are not the same in MEDLINE/PubMed).

Got fewer results than expected? Try using a less restrictive query. Examples:

  • MMP2 (most restrictive)
  • MMP*2
  • MMP (least restrictive)
  • Thromboembolism (most restrictive)
  • Thromboemb
  • Thrombo
  • Thromb (least restrictive)

Quick Search

Quick Search searches all text fields.

Advanced Search

Advanced Search searches individual fields. Using only * without other characters is not allowed, unlike in Quick Search. The query fields are automatically connected by 'AND'.

Examples:

  • To find entries with score 0.9 exactly, use query: score: 0.9
  • To find entries manually marked as valid peptide sequence, use query: score: > 0.6
  • To find entries where the interaction takes place in human cells or tissues, use query: organism: Homo sapiens, or organism: sapiens, for short.
  • To find entries that have peptide motif 'RGD' and the word 'integrin' in the abstract text, use query: sequence: RGD, text: integrin
  • To find entries with PubMed ID 14759804, use query: text id: 14759804
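
The effect of connecting query fields by 'AND' can be pictured as a filter in which every filled-in field must match. The sketch below is purely illustrative: the record layout, the sample records, and the function name matches_all are invented for this example, and only plain case-insensitive substring matching is modelled (no wildcards, no score comparisons).

```python
def matches_all(record, criteria):
    """Return True only if every criterion matches its field (logical AND)."""
    return all(
        wanted.lower() in str(record.get(field, "")).lower()
        for field, wanted in criteria.items()
    )


if __name__ == "__main__":
    # Hypothetical records, made up for the example.
    records = [
        {"sequence": "GRGDSP", "text": "peptide binds an integrin receptor"},
        {"sequence": "GRGDSP", "text": "abstract about something else"},
        {"sequence": "GETRAPL", "text": "integrin is mentioned here"},
    ]
    query = {"sequence": "RGD", "text": "integrin"}
    for record in records:
        print(record["sequence"], matches_all(record, query))
```

Only the first record satisfies both conditions, which mirrors the 'sequence: RGD, text: integrin' example.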

BLAST Search

BLAST search uses the NCBI blastall program. BLAST is faster but less accurate than Smith-Waterman search. The default option settings (-p blastp -W 2 -F F -T T) are optimized for a peptide database.

They can be overridden using 'Other advanced options'. For example, to search with options similar to those suggested by NCBI for short, nearly exact matches, use Matrix: PAM30, Expect: 10000, Other advanced options: -G 9 -E 1. The full list of available blastall options is below. For more information, see the NCBI BLAST web page.
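
For readers who run blastall locally, the settings above correspond roughly to a command line like the one built below. This is a sketch under assumptions: how the web form actually combines its defaults with the advanced options is not documented here, the database and query file names are placeholders, and the function name blastall_args is made up. The -d (database) and -i (query file) flags are standard blastall options that do not appear in the list below.

```python
def blastall_args(query_file, database, matrix="BLOSUM62", expect=10.0, extra=()):
    """Build an argument list for blastall (the command is not executed)."""
    args = [
        "blastall",
        "-p", "blastp",   # protein query against protein database
        "-W", "2",        # word size used by the default settings
        "-F", "F",        # no low-complexity filtering
        "-T", "T",        # HTML output
        "-d", database,
        "-i", query_file,
        "-M", matrix,
        "-e", str(expect),
    ]
    args.extend(extra)    # e.g. gap costs from 'Other advanced options'
    return args


if __name__ == "__main__":
    # Settings for short, nearly exact matches, as described above.
    short_match = blastall_args(
        "query.fasta", "peptides",          # placeholder names
        matrix="PAM30", expect=10000, extra=["-G", "9", "-E", "1"],
    )
    print(" ".join(short_match))
```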

-p  Program Name [String]
-e  Expectation value (E) [Real]
  default = 10.0
-m  alignment view options:
	0 = pairwise,
	1 = query-anchored showing identities,
	2 = query-anchored no identities,
	3 = flat query-anchored, show identities,
	4 = flat query-anchored, no identities,
	5 = query-anchored no identities and blunt ends,
	6 = flat query-anchored, no identities and blunt ends,
	7 = XML Blast output,
	8 = tabular,
	9 = tabular with comment lines,
	10 = ASN, text,
	11 = ASN, binary [Integer]
  default = 0
-F  Filter query sequence (DUST with blastn, SEG with others) [String]
  default = T
-G  Cost to open a gap (zero invokes default behavior) [Integer]
  default = 0
-E  Cost to extend a gap (zero invokes default behavior) [Integer]
  default = 0
-X  X dropoff value for gapped alignment (in bits) (zero invokes default behavior)
    blastn 30, megablast 20, tblastx 0, all others 15 [Integer]
  default = 0
-I  Show GI's in deflines [T/F]
  default = F
-q  Penalty for a nucleotide mismatch (blastn only) [Integer]
  default = -3
-r  Reward for a nucleotide match (blastn only) [Integer]
  default = 1
-v  Number of database sequences to show one-line descriptions for (V) [Integer]
  default = 500
-b  Number of database sequences to show alignments for (B) [Integer]
  default = 250
-f  Threshold for extending hits, default if zero
    blastp 11, blastn 0, blastx 12, tblastn 13
    tblastx 13, megablast 0 [Integer]
  default = 0
-g  Perform gapped alignment (not available with tblastx) [T/F]
  default = T
-Q  Query Genetic code to use [Integer]
  default = 1
-D  DB Genetic code (for tblast[nx] only) [Integer]
  default = 1
-a  Number of processors to use [Integer]
  default = 1
-O  SeqAlign file [File Out]  Optional
-J  Believe the query defline [T/F]
  default = F
-M  Matrix [String]
  default = BLOSUM62
-W  Word size, default if zero (blastn 11, megablast 28, all others 3) [Integer]
  default = 0
-z  Effective length of the database (use zero for the real size) [Real]
  default = 0
-K  Number of best hits from a region to keep (off by default, if used a value of 100 
    is recommended) [Integer]
  default = 0
-P  0 for multiple hit, 1 for single hit (does not apply to blastn) [Integer]
  default = 0
-Y  Effective length of the search space (use zero for the real size) [Real]
  default = 0
-S  Query strands to search against database (for blast[nx], and tblastx)
     3 is both, 1 is top, 2 is bottom [Integer]
  default = 3
-T  Produce HTML output [T/F]
  default = F
-l  Restrict search of database to list of GI's [String]  Optional
-U  Use lower case filtering of FASTA sequence [T/F]  Optional
-y  X dropoff value for ungapped extensions in bits (0.0 invokes default behavior)
    blastn 20, megablast 10, all others 7 [Real]
  default = 0.0
-Z  X dropoff value for final gapped alignment in bits (0.0 invokes default behavior)
    blastn/megablast 50, tblastx 0, all others 25 [Integer]
  default = 0
-R  PSI-TBLASTN checkpoint file [File In]  Optional
-n  MegaBlast search [T/F]
  default = F
-L  Location on query sequence [String]  Optional
-A  Multiple Hits window size, default if zero (blastn/megablast 0, all others 40) [Integer]
  default = 0
-w  Frame shift penalty (OOF algorithm for blastx) [Integer]
  default = 0
-t  Length of the largest intron allowed in a translated nucleotide sequence when linking 
    multiple distinct alignments. (0 invokes default behavior; a negative value disables 
    linking.) [Integer]
  default = 0
-B  Number of concatenated queries, for blastn and tblastn [Integer]  Optional
  default = 0

Smith-Waterman Search

Smith-Waterman search uses the SSEARCH program. It is more accurate but slower than BLAST. The default option settings (-w 80 -z 2 -L -m 6 -H -Q) are optimized for a peptide database.

They can be overridden using 'Other advanced options'. For example, to set the penalty for the first residue in a gap to -6, use: -f 6. The full list of available FASTA/SSEARCH options is below (use SSEARCH options only).
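
To see what changing -f does, it helps to work out the total cost of a gap. The sketch below assumes the scheme implied by the option descriptions that follow: the first residue of a gap costs the -f penalty and every additional residue costs the -g penalty (-12 and -2 by default for proteins). The function name gap_cost is hypothetical.

```python
def gap_cost(length, first=-12, additional=-2):
    """Total penalty for a gap of the given length (in residues)."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    # One 'first residue' penalty plus one 'additional' penalty per extra residue.
    return first + (length - 1) * additional


if __name__ == "__main__":
    print(gap_cost(3))             # protein defaults
    print(gap_cost(3, first=-6))   # first-residue penalty lowered to -6
```

With the protein defaults a three-residue gap costs -16; lowering the first-residue penalty to -6 reduces this to -10, so gapped alignments become easier to obtain.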


-a   (fasta3, ssearch3 only) show both sequences in their
     entirety.

-A   force Smith-Waterman alignments for fasta3 DNA sequences.
     By default, only fasta3 protein sequence comparisons use
     Smith-Waterman alignments.

-B   Show normalized score as a z-score, rather than a bit-score
     in the list of best scores.

-b # Number of sequence scores to be shown on output.  In the
     absence of this option, fasta (and tfasta and ssearch)
     display all library sequences obtaining similarity scores
     with expectations less than 10.0 if optimized score are
     used, or 2.0 if they are not. The -b option can limit the
     display further, but it will not cause additional sequences
     to be displayed.

-c # Threshold score for optimization (OPTCUT).  Set "-c 1" to
     optimize every sequence in a database.

-E # Limit the number of scores and alignments shown based on the
     expected number of scores.  Used to override the expectation
     value of 10.0 used by default.  When used with -Q, -E 2.0
     will show all library sequences with scores with an
     expectation value <= 2.0.

-d # Maximum number of alignments to be displayed.  Ignored if
     "-Q" is not used.

-f   Penalty for the first residue in a gap (-12 by default for
     proteins, -16 for DNA, -15 for FAST[XY]/TFAST[XY]).

-F # Limit the number of scores and alignments shown based on the
     expected number of scores. "-E #" sets the highest E()-value
     shown; "-F #" sets the lowest E()-value. Thus, "-F 0.0001"
     will not show any matches or alignments with E() < 0.0001.
     This allows one to skip over close relationships in searches
     for more distant relationships.

-g   Penalty for additional residues in a gap (-2 by default for
     proteins, -4 for DNA, -3 for FAST[XY]/TFAST[XY]).

-h   Penalty for frameshift (fastx3/y3, tfastx3/y3 only).

-H   Omit histogram.

-i   Invert (reverse complement) the query sequence if it is DNA.
     For tfasta3/x3/y3, search the reverse complement of the
     library sequence only.

-j # Penalty for frameshift within a codon (fasty3/tfasty3 only).

-l file
     Location of library menu file (FASTLIBS).

-L   Display more information about the library sequence in the
     alignment.

-M low-high
     Range of amino acid sequence lengths to be included in the
     search.

-m # Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 9, 10

             -m 0        -m 1          -m 2          -m 3        -m 4
         MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
         ::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
         MWKSCGYPYT   MWKSCGYPYT

     -m 5 provides a combination of -m 4 and -m 0. -m 6 provides
     -m 5 plus HTML formatting.

     -m 9 provides coordinates and scores with the best score
     information.  A simple  -m 9 extends the normal best score
     information:

         The best scores are:                                      opt bits E(14548)
         XURTG4 glutathione transferase (EC 2.5.1.18) 4 -   ( 219) 1248 291.7 1.1e-79

     to include the additional information (on the same line,
     separated by a <tab>):

         %_id  %_gid   sw  alen  an0  ax0  pn0  px0  an1  ax1 pn1 px1 gapq gapl  fs
         0.771 0.771 1248  218    1  218    1  218    1  218    1  219   0   0   0

     -m 9c provides additional information: an encoded alignment
     string.  Thus:

                10        20        30        40        50          60         70
         GT8.7  NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
                :.::  . :: ::  .   .:::         : .:    ::.:   .: : ..:.. :::  :..:
         XURTG  NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
                        20        30                 40        50        60

     would be encoded:

         =23+9=13-2=10-1=3+1=5

     The alignment encoding is with respect to the alignment, not
     the sequences.  The coordinate of the alignment is given
     earlier in the " -m 9c" line.

-m 10
     -m 10 is a new, parseable format for use with other
     programs.  See the file "readme.v20u4" for a more complete
     description.

     As of version "fa34t23b2", it has become possible to combine
     independent "-m" options.  Thus, one can use "-m 1 -m 6 -m
     9".

-M low-high
     Include library sequences (proteins only) with lengths
     between low and high.

-n   Force the query sequence to be treated as a DNA sequence.
     This is particularly useful for query sequences that contain
     a large number of ambiguous residues, e.g. transcription
     factor binding sites.

-o   Turn off default optimization of all scores greater than
     OPTCUT. Sort results by "initn" scores (reduces the accuracy
     of statistical estimates).

-p   Force query to be treated as protein sequence.

-Q,-q
     Quiet - does not prompt for any input.  Writes scores and
     alignments to the terminal or standard output file.

-r   Specify match/mismatch scores for DNA comparisons.  The
     default is "+5/-4". "+3/-2" can perform better in some
     cases.

-R file
     Save a results summary line for every sequence in the
     sequence library.  The summary line includes the sequence
     identifier, superfamily number (if available) position in
     the library, and the similarity scores calculated.  This
     option can be used to evaluate the sensitivity and
     selectivity of different search strategies (Pearson, 1995,
     Pearson, 1998).

-s file
     Specify the scoring matrix file.  fasta3 uses the same
     scoring matrices as Blast1.4/2.0.  Several scoring matrix
     files are included in the standard distribution.  For
     protein sequences: codaa.mat - based on minimum mutation
     matrix; idnaa.mat - identity matrix; pam250.mat - the PAM250
     matrix developed by Dayhoff et al. (Dayhoff et al., 1978);
     pam120.mat - a PAM120 matrix.  The default scoring matrix is
     BLOSUM50 ("-s BL50"). Other matrices available from within
     the program are: PAM250/"-s P250", PAM120/"-s P120",
     PAM40/"-s P40", PAM20/"-s P20", MDM10 - MDM40/"-s M10 - M40"
     (MDM are modern PAM matrices from Jones et al. (Jones et
     al., 1992)), BLOSUM50, 62, and 80/"-s BL50", "-s BL62", "-s
     BL80".

-S   Treat lower-case characters in the query or library
     sequences as "low-complexity" ("seg"-ed) residues.
     Traditionally, the "seg" program (Wootton and
     Federhen, 1993) is used to remove low complexity regions in
     DNA sequences by replacing the residues with an "X".  When
     the "-S" option is used, the FASTA33 programs provide a
     potentially more informative approach.  With "-S", lower
     case characters in the query or database sequences are
     treated as "X"'s during the initial scan, but are treated as
     normal residues during the final alignment display.  Since
     statistical significance is calculated from the similarity
     score calculated during the library search, when the lower
     case residues are "X"'s, low complexity regions will not
     produce statistically significant matches.  However, if a
     significant alignment contains low complexity regions, their
     alignment is shown.  With "-S", lower case characters may be
     included in the alignment to indicate low complexity
     regions, and the final alignment score may be higher than
     the score obtained during the search.

     The pseg program can be used to produce databases (or query
     sequences) with lower case residues indicating low
     complexity regions using the command:

         pseg database.fasta -z 1 -q  > database.lc_seg

     (seg can also be used with some post processing, see
     readme.v33tx.)

-U   Treat the query sequence as an RNA sequence.  In addition to
     selecting a DNA/RNA alphabet, this option causes changes to
     the scoring matrix so that 'G:A' , 'T:C' or 'U:C' are scored
     as 'G:G'.

-V str
     It is now possible to specify some annotation characters
     that can be included (and will be ignored), in the query
     sequence file.  Thus, one might have a file with:
     "ACVS*ITRLFT?", where "*" and "?"  are used to indicate
     phosphorylation.  By giving the option -V '*?', those
     characters in the query will be moved to an "annotation
     string", and alignments that include the annotated residues
     will be highlighted with the appropriate character above the
     sequence (on the number line).

-w # Line length (width) = number (<200)

-W # Context length (default is 1/2 of line width -w) for
     alignment programs, like fasta and ssearch, that provide
     additional sequence context.

-x # Specify the penalty for a match to an 'X', independently of
     the PAM matrix.  Particularly useful for fastx3/fasty3,
     where termination codons are encoded as 'X'.

-X   Specifies offsets for the beginning of the query and library
     sequence.  For example, if you are comparing upstream
     regions for two genes, and the first sequence contains 500
     nt of upstream sequence while the second contains 300 nt of
     upstream sequence, you might try:

         fasta -X "-500 -300" seq1.nt seq2.nt

     If the -X option is not used, FASTA assumes numbering starts
     with 1.  (You should double check to be certain the negative
     numbering works properly.)

-y   Set the width of the band used for calculating "optimized"
     scores.  For proteins and ktup=2, the width is 16.  For
     proteins with ktup=1, the width is 32 by default.  For DNA
     the width is 16.

-z -1,0,1,2,3,4,5
     -z -1 turns off statistical calculations. -z 0 estimates the
     significance of the match from the mean and standard
     deviation of the library scores, without correcting for
     library sequence length.  -z 1 (the default) uses a weighted
     regression of average score vs library sequence length; -z 2
     uses maximum likelihood estimates of Lambda and K; -z 3 uses
     Altschul-Gish parameters (Altschul and Gish, 1996); -z 4 and
     -z 5 use two variations on the -z 1 strategy. -z 1 and -z 2
     are the best methods, in general.

-z 11,12,14,15
     estimate the statistical parameters from shuffled copies of
     each library sequence.  This doubles the time required for a
     search, but allows accurate statistics to be estimated for
     libraries comprised of a single protein family.

-Z db_size
     set the apparent size of the database to be used when
     calculating expectation E() values.  If you searched a
     database with 1,000 sequences, but would like to have the
     E()-values calculated in the context of a 100,000 sequence
     database, use '-Z 100000'.

-1   sort output by init1 score (for compatibility with FASTP -
     do not use).

-3   translate only three forward frames
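
The encoded alignment string produced by -m 9c (see the -m entry in the list above) can be unpacked with a few lines of code. The sketch below is an interpretation based only on the example given there: '=' is read as a run of aligned columns, '+' as residues of the first sequence opposite a gap in the second, and '-' as the reverse. The function names are hypothetical.

```python
import re


def decode_alignment(encoded):
    """Split an encoded string such as '=23+9=13' into (op, count) pairs."""
    if not re.fullmatch(r"(?:[=+-]\d+)+", encoded):
        raise ValueError("unexpected encoding: %r" % encoded)
    return [(op, int(count)) for op, count in re.findall(r"([=+-])(\d+)", encoded)]


def summarize(encoded):
    """Count alignment columns and residues used from each sequence."""
    ops = decode_alignment(encoded)
    return {
        "columns": sum(n for _, n in ops),
        # '=' and '+' runs consume residues of the first sequence.
        "first_residues": sum(n for op, n in ops if op in "=+"),
        # '=' and '-' runs consume residues of the second sequence.
        "second_residues": sum(n for op, n in ops if op in "=-"),
    }


if __name__ == "__main__":
    print(summarize("=23+9=13-2=10-1=3+1=5"))
```

For the example string this gives 67 alignment columns, 64 residues from the first sequence and 57 from the second, which agrees with the alignment shown in the -m 9c description.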

PepBank has been developed and is maintained by http://ric.csb.mgh.harvard.edu