NAME

Peptide::Pubmed - extract peptide sequences from MEDLINE article abstracts.


SYNOPSIS

  use Peptide::Pubmed;
  $parser = Peptide::Pubmed->new;
  $in = { 
     PMID       => q[15527327],
     Author     => q[Doe JJ, Smith Q],
     Journal    => q[J Biological Foo. 2004;8(2):123-30.],
     Title      => q[Foo, bar and its significance in phage display.],
     Abstract   => 
      q[Peptide sequences EYHHYNK and Arg-Gly-Asp, but not ACCCGTNA or VEGFRI.],
     Mesh       => q[Genes, p53/genetics; Humans; Bar],
     Chemical   => q[Multienzyme Complexes; Peptide Library; Foo],
    };
  $parser->parse_abstract($in);
  # get the peptide sequences in 1 letter symbols (select all words where the 
  # combined word/abstract score is above threshold: 
  # WordAbstScore >= WordAbstScoreMin):
  @seqs = $parser->get_seqs;
  print "@seqs\n"; # prints: 'EYHHYNK RGD'


EXAMPLES

  # same as above, set threshold explicitly:
  $parser->WordAbstScoreMin(0.4);
  @seqs = $parser->get_seqs;
  # set low threshold to get more peptide sequences (but at a cost of getting 
  # more false positives) 
  $parser->WordAbstScoreMin(-1);
  @seqs = $parser->get_seqs;
  print "@seqs\n"; # prints: 'EYHHYNK RGD ACCCGTNA VEGFRI'
  # reset threshold back:
  $parser->WordAbstScoreMin(0.4);
  # get more data for the abstract:
  $abst = $parser->get_abst;
  print "$abst->{AbstScore}\n"; # abstract score, in the [0,1] interval
  print "$abst->{AbstMtext}\n"; # abstract with sequences marked up: 
  # 'Peptide sequences <mark>EYHHYNK</mark> and <mark>Arg-Gly-Asp,</mark> 
  # but not ACCCGTNA or VEGFRI.'
  # get more data for the words, in addition to peptide sequences:
  @words = $parser->get_words;
  for my $word (@words) {
      # combined word/abstract score, in the [0,1] interval
      print "$word->{WordAbstScore}\n"; 
      # word as found in the abstract, eg 'Arg-Gly-Asp,'
      print "$word->{WordOrig}\n";
      # peptide sequence in 1 letter symbols, eg 'RGD'  
      print "$word->{WordSequence}\n";  
  }
  # There are no mandatory input fields. This will work too, but may give a lower score.
  $in = { 
         Abstract => 
          q[Peptide sequences EYHHYNK and Arg-Gly-Asp, but not ACCCGTNA or VEGFRI.],
        };
  $parser->parse_abstract($in);
  @words = $parser->get_words;
  # No peptide sequences are found in empty input:
  $in = undef;
  $parser->parse_abstract($in);
  @words = $parser->get_words;


DESCRIPTION

Provides common methods to parse peptide sequences from Pubmed abstracts. The computed variables can be used for classification in external programs.

For all variables below:

Variables (Abst|Word)Is*: Allowed values: 1/0.

Variables (Abst|Word)Num*: Allowed values: integer.

Variables (Abst|Word)Prop*: Allowed values: real in [0,1].

Variables (Abst|Word)Score*: Allowed values: real in [0,1]. A score near 1 corresponds to 'more relevant' abstracts or words (that is, likely to contain peptide sequences); a score near 0 corresponds to 'less relevant' abstracts or words.

Variables (Abst|Word)* (all other): Allowed values: string, unless otherwise specified below.

Input variables

Input variables can be used optionally without the 'Abst' prefix, so 'AbstPMID' and 'PMID' are treated identically.

AbstPMID: PubMed ID.

AbstAuthor: article authors.

AbstJournal: journal citation, with year if available. For format, see examples.

AbstTitle: article title.

AbstAbstract: abstract.

AbstMesh: Medical Subject Headings (MeSH) terms.

AbstChemical: chemical list.

Output variables, for the abstract text

AbstNumAb: number of matches to words like antibody, epitope, etc.

AbstNumAllCap: number of all capitalized words in the abstract.

AbstNumBind: number of matches to words like binds, interacts, etc.

AbstNumDigest: number of matches to words like Edman, MS/MS, trypsin, etc.

AbstNumMHC: number of matches to words like MHC, TCR, etc.

AbstNumPeptide: number of matches to words like peptide, oligopeptide, motif, etc.

AbstNumPhage: number of matches to words like phage, display, etc.

AbstNumProtease: number of matches to words like peptidase, cutting, etc.

AbstNumWords: number of words. Allowed values: integer.

AbstPropAllCap: proportion of all capitalized words in the abstract (= AbstNumAllCap / AbstNumWords).

AbstScore: heuristic score to predict whether the abstract contains a peptide sequence, computed based on Abst* variables.
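
As an illustration of the count and proportion variables, the sketch below computes AbstNumAllCap, AbstNumWords and AbstPropAllCap for a text. The helper name abst_prop_all_cap, the whitespace tokenization, and the rule that a word is 'all capitalized' when it consists of two or more uppercase letters are assumptions made for this example, not the module's actual definitions.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (AbstNumAllCap, AbstNumWords, AbstPropAllCap).
# Assumption: words are whitespace-separated, and an "all capitalized" word
# is two or more uppercase ASCII letters after stripping punctuation.
sub abst_prop_all_cap {
    my ($text) = @_;
    my @words = grep { length } split ' ', ($text // '');
    my $num_all_cap = 0;
    for my $word (@words) {
        (my $core = $word) =~ s/^\W+|\W+$//g;    # drop leading/trailing punctuation
        $num_all_cap++ if $core =~ /^[A-Z]{2,}$/;
    }
    my $num_words = scalar @words;
    my $prop = $num_words ? $num_all_cap / $num_words : 0;    # avoid division by zero
    return ($num_all_cap, $num_words, $prop);
}

my ($n_cap, $n_words, $prop) = abst_prop_all_cap(
    'Peptide sequences EYHHYNK and Arg-Gly-Asp, but not ACCCGTNA or VEGFRI.');
printf "%d %d %.2f\n", $n_cap, $n_words, $prop;    # 3 10 0.30
```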

Other variables:

AbstComment: free text comment for debugging.

AbstMtext: abstract with sequences marked using '<mark>...</mark>' tags.

Output variables, for the word

WordIsDNA: is an all DNA sequence? Example: ACCTTG.

WordIsDict: is in the dictionary (currently, of English words and of scientific terms, software names and abbreviations)? Example: MATERIALS, RT-PCR.

WordIsGene: is a gene name, protein name, gene symbol, protein symbol, etc? Example: TFIIA.

WordSeqLen: peptide length, in amino acids.

WordOrig: the word as found in the abstract text.

WordPropDegen: proportion of degenerate amino acids, e.g., 0.6 for AXXXC.

WordPropProtein: a measure that is positive if the composition of a given word looks more like a protein sequence than like an English word, and negative otherwise. It is computed from the frequencies of overlapping k-mers: WordPropProtein = sum (over all overlapping k-mers within the word) of (log10Pp - log10Pn), where log10Pp is the log base 10 of the proportion of the k-mer in the database of known protein sequences, and log10Pn is the same for non-sequences (here, English words from PubMed abstracts not related to peptides). For a word in which every k-mer is equally frequent among sequences and non-sequences, log10Pp = log10Pn and WordPropProtein = 0. Allowed values: real in [-Inf,+Inf].

WordScore: heuristic score to predict whether the word contains a peptide sequence, computed based on Word* variables.

WordAbstScore: heuristic score to predict whether the word contains a peptide sequence, computed based on both Abst* and Word* variables.

WordSequence: word converted to peptide sequence in 1-letter amino acid symbols.
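
The definitions of WordPropDegen and WordPropProtein above can be sketched as follows. This is a minimal illustration, not the module's code: the function names are hypothetical, 'X' is assumed to be the only degenerate symbol, the k-mer proportion tables are invented, and k-mers missing from either table are simply skipped (the module may treat them differently).

```perl
use strict;
use warnings;

# Proportion of degenerate positions; assumes 'X' is the only degenerate symbol.
sub word_prop_degen {
    my ($seq) = @_;
    return 0 unless length($seq);
    my $num_degen = () = $seq =~ /X/g;    # count matches of X
    return $num_degen / length($seq);
}

sub log10 { return log($_[0]) / log(10) }

# Sum over overlapping k-mers of (log10 Pp - log10 Pn).
# $rh_pp and $rh_pn map k-mer => proportion among sequences / non-sequences.
# Assumption: k-mers absent from either table contribute nothing.
sub word_prop_protein {
    my ($word, $k, $rh_pp, $rh_pn) = @_;
    my $sum = 0;
    for my $i (0 .. length($word) - $k) {
        my $kmer = substr($word, $i, $k);
        next unless $rh_pp->{$kmer} && $rh_pn->{$kmer};
        $sum += log10($rh_pp->{$kmer}) - log10($rh_pn->{$kmer});
    }
    return $sum;
}

# Invented proportions, for illustration only.
my %pp = (EYH => 0.001,   YHH => 0.001);
my %pn = (EYH => 0.00001, YHH => 0.0001);
printf "%.1f\n", word_prop_degen('AXXXC');                     # 0.6
printf "%.1f\n", word_prop_protein('EYHH', 3, \%pp, \%pn);     # 3.0
```

With equal proportions in both tables, every term is zero and the sum is 0, which matches the neutral case described above.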

Other variables:

WordAaSymbols: amino acid code. Allowed values: 1 (1 letter), 3 (3 letter or full name). Note that separate handling of 3 letter symbols and full names is currently not implemented.

WordIdx: word index in the abstract. Allowed values: nonnegative integer.


NOTES

WordPropProtein

Note that to optimize classification using frequencies of k-mers, k should be chosen so that for a given text, there are 'not too many' empty cells, that is, 'not too many' k-mers that did not occur. For a typical English text, with the combined text length of 100,000 and an alphabet size of 26 (A-Z, case-insensitive), k=3 is a good choice, because there are 26 ** 3 = 17576 different k-mers, and the expected frequency of each k-mer is 5.7. Actually, the expected frequency is somewhat less because of the effect of the word boundaries. That is, the text 'foobletch' contains more 3-mers than 'foo bletch', because after splitting on whitespace, 'oob' and 'obl' do not occur in 'foo bletch'.
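
The arithmetic and the word boundary effect above can be checked with a short sketch. The helper name kmers is hypothetical; it splits on whitespace, lowercases, and lists the overlapping k-mers of each word.

```perl
use strict;
use warnings;

# Overlapping k-mers of each whitespace-separated word, lowercased.
sub kmers {
    my ($text, $k) = @_;
    my @kmers;
    for my $word (split ' ', lc($text)) {
        for my $i (0 .. length($word) - $k) {
            push @kmers, substr($word, $i, $k);
        }
    }
    return @kmers;
}

my $num_cells = 26 ** 3;                  # 17576 possible 3-mers
my $expected  = 100_000 / $num_cells;     # about 5.7 occurrences per 3-mer
printf "%d cells, expected frequency %.1f\n", $num_cells, $expected;

my @joined   = kmers('foobletch',  3);    # 7 k-mers, including 'oob' and 'obl'
my @separate = kmers('foo bletch', 3);    # 5 k-mers: 'foo' plus 4 from 'bletch'
print scalar(@joined), " vs ", scalar(@separate), "\n";    # 7 vs 5
```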

Variable and method names

By convention, method, variable and key names like these: VarNamesLikeThese are used for cases where the corresponding field names may be printed in the output table, such as the rdb table in parse_file(). For the rest of the names, var_names_like_these are used.


KNOWN BUGS

False negatives

Peptide length cutoff

Peptide length cutoff is 20 amino acids. This is a somewhat arbitrary choice. In various sources, cutoffs between 15 and 50 amino acids are used to define oligopeptides.

False positives

Gene symbols, english words, scientific terms and abbreviations

Some of these were misclassified as peptide sequences, even though this code uses several dictionaries to find such non-sequence words.

Incorrectly parsed sequence

The recommendations of IUPAC can be found in: Nomenclature and Symbolism for Amino Acids and Peptides,

http://www.chem.qmul.ac.uk/iupac/AminoAcid/A2021.html

http://www.chem.qmul.ac.uk/iupac/AminoAcid/A1416.html

They are not always followed in the published abstracts. More flexible input rules are thus allowed for peptide sequences. However, the following bugs may occur in parsed sequences.

Amino acid position and the number of repetitions are not resolved

Y(n) usually means amino acid Y at position n, but sometimes also means Y repeated n times. It is always resolved as Y, and n is ignored. However, X(n), where X is 'X', 'Xaa', 'Xxx', etc, usually means 'any amino acid', repeated n times. It is always resolved as X repeated n times.
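
The resolution rule above can be illustrated on sequences already written in 1 letter symbols. The helper name resolve_counts is hypothetical, and the sketch does not handle 'Xaa' or 'Xxx'; it only shows that X(n) is expanded while n is dropped for any other residue.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the rule on 1-letter symbols.
# X(n) is expanded to n copies of X; for any other residue the "(n)" is dropped.
sub resolve_counts {
    my ($seq) = @_;
    $seq =~ s/X\((\d+)\)/'X' x $1/ge;    # X(3) -> XXX
    $seq =~ s/([A-Z])\(\d+\)/$1/g;       # Y(5) -> Y
    return $seq;
}

print resolve_counts('CX(3)C'), "\n";      # CXXXC
print resolve_counts('RGDY(5)K'), "\n";    # RGDYK
```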


REFERENCES

ADAM

This section refers to ADAM, on which Peptide::DictionaryAbbreviations is based. See http://arrowsmith.psych.uic.edu. I would like to thank ADAM authors (in particular, Neil Smalheiser) for graciously providing ADAM.

ADAM citation:

Zhou W, Torvik VI, Smalheiser NR. ADAM: Another Database of Abbreviations in MEDLINE. Bioinformatics 2006; 22(22): 2813-2818.

 ADAM OVERVIEW                                                                 
                                                                          
     ADAM is an abbreviation database generated from the 2006             
     baseline of MEDLINE. It covers frequently used abbreviations         
     and their definitions (or long-forms), including both                
     acronyms and non-acronym abbreviations.                              
     Reference
     
     Zhou W, Torvik VI, Smalheiser NR. ADAM: Another Database 
     of Abbreviations in MEDLINE. Bioinformatics 2006; 22(22): 2813-2818.
 ADAM Copyright                                                                
                                                                          
     University of Illinois at Chicago, 2006.                             
                                                                          
 ADAM License
                                                                
     By using this software, you expressly agree that your use will be
     noncommercial, that you will not use this software to make money, and that
     you will not distribute the software to anyone else or let anyone else use
     it.  Moreover, you will give credit to the University of Illinois and Dr.
     Smalheiser as the author of the software.  The software is provided "as is"
     and without warranties of any kind, express or implied, including but not
     limited to the implied warranties of merchantability and fitness, and
     statutory warranties of noninfringement.


AUTHOR

Timur Shtatland, tshtatland at mgh dot harvard dot edu

Copyright (C) 2007 by The General Hospital Corporation.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html


SEE ALSO

RDB - a fast, portable, relational database management system without arbitrary limits, (other than memory and processor speed) that runs under, and interacts with, the UNIX Operating system, by Walter V. Hobbs.


APPENDIX

The rest of the documentation details the methods.

new

  Args        : %named_parameters:
                AbstScoreMin : print abstract data if AbstScore >= this threshold.
                WordScoreMin : print word data if WordScore >=  this threshold.
                WordAbstScoreMin : print word data if WordAbstScore >= this threshold.
                Allowed values: real in [0,1] interval, or -1.
                Set to -1 to undefine these, and print all that satisfy 
                *PrintMin*, *PrintMax* thresholds.
                WordNumPrintMin : min number of words to print per abstract.
                Allowed values: 1 and -1
                WordNumPrintMax : max number of words to print per abstract.
                Allowed values: non-negative integer, or -1.
                Set to -1 to undefine these, and print all that satisfy 
                *ScoreMin* thresholds.
                If WordNumPrintMin = 1, at least one line is printed
                per each abstract for which AbstScore satisfies
                AbstScoreMin threshold. That is, 1 line is printed
                even if no sequences were found (many Word* vars will
                be empty or 0 when this happens) or if no sequences
                satisfy WordScore / WordScoreMin threshold. If there
                are sequences found, at least 1 line is printed (which
                may or may not satisfy WordScore / WordScoreMin
                threshold), and the rest are printed only if they
                satisfy it. If WordNumPrintMax = N, where N is a
                positive integer, only the first N words are printed
                (_not_ the best N words by score). A special case is
                (WordNumPrintMin => 1, WordNumPrintMax => 1), which
                prints exactly 1 line per abstract. Word* vars are not
                computed, which makes the code much faster. An even
                more special case is (AbstScoreMin => -1, WordScoreMin
                => -1, WordAbstScoreMin => -1, WordNumPrintMin => 1,
                WordNumPrintMax => 1), which prints all abstracts, 1
                abstract per line, and computes Abst*, but not Word*
                vars. This allows the computation to be separated into
                2 steps: compute and select abstracts with good
                AbstScore, then compute and select abstracts and words
                with good WordAbstScore, WordScore.
                in_col : columns for input
                kmer_length : k-mer length for computing WordPropProtein, eg 3.
                print_col : columns for printing
                print_colf : column formats for rdb table, eg [qw(50S 10N)]; 
                default: undef, to be determined automatically from col names.
                verbose : verbosity level for diagnostic msgs. Allowed values: 0..4.
  Example     : my $parser = Peptide::Pubmed->new(verbose => $verbose) or 
                        carp "not ok: new Peptide::Pubmed" and return;
  Description : bare constructor. All work done by init().
  Returns     : TRUE if successful, FALSE otherwise.

get_abst

  Args        : none
  Example     : $abst = $parser->get_abst;
  Description : get the output data for Abst* variables.
  Returns     : a ref to hash with Abst* variables if successful, 
                ref to empty hash otherwise.

get_words

  Args        : none
  Example     : # get data for words likely to contain peptide sequences:
                @words = $parser->get_words;
  Description : get the output data for Word* variables.
  Returns     : an array where each element is a ref to hash with
                Word* variables output data for 1 word. Only words
                that satisfy the threshold: combined word/abstract score 
                WordAbstScore greater than or equal to WordAbstScoreMin, 
                are included, otherwise an empty array is returned. 
                To return all words, set WordAbstScoreMin to -1.
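
get_seqs

  Args        : none
  Example     : @seqs = $parser->get_seqs;
  Description : get the peptide sequences in 1 letter symbols for the words 
                selected by the same threshold as in get_words 
                (WordAbstScore >= WordAbstScoreMin).
  Returns     : a list with sequences for the selected words in the abstract, 
                empty list if none found.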
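
The thresholding described for get_words and get_seqs can be pictured with the sketch below. The functions select_words and select_seqs are hypothetical stand-ins that work on a plain list of word hashes, and the scores are invented; only the comparison WordAbstScore >= WordAbstScoreMin and the special value -1 come from the text above.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for get_words / get_seqs operating on a plain list.
sub select_words {
    my ($ra_words, $score_min) = @_;
    return @$ra_words if $score_min == -1;    # -1 disables the threshold
    return grep { $_->{WordAbstScore} >= $score_min } @$ra_words;
}

sub select_seqs {
    my ($ra_words, $score_min) = @_;
    return map { $_->{WordSequence} } select_words($ra_words, $score_min);
}

# Scores are invented for illustration only.
my @words = (
    { WordSequence => 'EYHHYNK',  WordAbstScore => 0.9  },
    { WordSequence => 'RGD',      WordAbstScore => 0.8  },
    { WordSequence => 'ACCCGTNA', WordAbstScore => 0.1  },
    { WordSequence => 'VEGFRI',   WordAbstScore => 0.05 },
);
print join(' ', select_seqs(\@words, 0.4)), "\n";    # EYHHYNK RGD
print join(' ', select_seqs(\@words, -1)),  "\n";    # EYHHYNK RGD ACCCGTNA VEGFRI
```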

WordAbstScoreMin

  Args        : (optional) new value
  Example     : $parser->WordAbstScoreMin(0.4);
                $parser->WordAbstScoreMin;
  Description : assign to WordAbstScoreMin a new value, if called with an argument.
  Returns     : WordAbstScoreMin value (after assignment)

init

  Args        : none
  Example     : $parser->init or return;
  Description : check to see if the entry is ok. Initialize several fields used for 
                printing, such as column definitions.
  Returns     : always TRUE

splitln

  Arg[1]      : string of items, one item per line.
  Arg[2]      : (optional) %named_parameters:
                lc : convert to lowercase? 1 = yes, 0 = no. Default = 1.
  Example     : my %words = map { $_ => 1 } splitln("foo\nbar\n", lc => 0);
                my %words = map { $_ => 1 } $this->splitln("foo\nbar\n");
  Description : parse input string into list of items, remove leading and trailing 
                whitespace, remove empty lines, convert to lowercase (unless 
                lc => 0). Can be used as a method call or direct call.
  Returns     : list of items
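
A minimal sketch of the behavior described for splitln, written as a separate function named splitln_sketch so that it does not pretend to be the module's implementation. It assumes that a method call passes a blessed object as the first argument; class-name invocants are not handled.

```perl
use strict;
use warnings;

# Sketch of splitln-like behavior; works as a function or as an object method.
sub splitln_sketch {
    shift if ref $_[0];                         # drop the invocant on method calls
    my ($str, %opt) = @_;
    my $to_lc = exists $opt{lc} ? $opt{lc} : 1; # default: convert to lowercase
    my @items;
    for my $line (split /\n/, ($str // '')) {
        $line =~ s/^\s+|\s+$//g;                # trim leading/trailing whitespace
        next unless length($line);              # skip empty lines
        push @items, $to_lc ? lc($line) : $line;
    }
    return @items;
}

print join(',', splitln_sketch("  Foo \n\nBAR\n")), "\n";            # foo,bar
print join(',', splitln_sketch("  Foo \n\nBAR\n", lc => 0)), "\n";   # Foo,BAR
```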

parse_file

  Args        : %named_parameters:
                mandatory:
                in_fname : input file name.
                out_fname : output file name.
  Example     : $parser->parse_file(in_fname => $in_fname, out_fname => $out_fname);
  Description : reads input and writes output, both in rdb
                format. Input data is 1 abstract per line. Prints data
                for all sequences found, 1 sequence per line. Skips
                abstracts and words based on rules described in
                new(). Because it prints data for 1 word per line, the
                same abstract data is printed for all sequences from
                the same abstract.
  Returns     : TRUE if successful, FALSE otherwise.

read_rdb_header

  Args        : input rdb file handle
  Example     : my ($comment, $ra_col, $ra_colf) = read_rdb_header($in_fh);
  Description : Reads header of an rdb table. 
  Returns     : comment (or empty string), ref to array of column names, 
                ref to array of column definitions.

print_rdb_header

  Args        : output rdb file handle, comment (or undef), ref to array of 
                column names, reference to array of column definitions.
  Example     : print_rdb_header($out_fh, $comment, $self->{print_col}, 
                $self->{print_colf});
  Description : Prints header of an rdb table. If comment is undefined, 
                it is not printed.
  Returns     : always TRUE.

parse_abstract

  Args        : ref to hash with input data for one pubmed abstract
  Example     : $parser->parse_abstract($rh_pm_in);
  Description : does all parsing for 1 abstract by calling init_abstract, AbstVars.
  Returns     : ref to hash with parsed data if successful, FALSE if error. 
                If no sequences are found, the hash will have a single sequence 
                consisting of an empty string (this is not considered an error).

init_abstract

  Args        : ref to hash with input data with field names 'Title', 'Abstract', etc 
                (optionally, with prefix 'Abst', eg 'AbstTitle', 'AbstAbstract'). 
                See field names listed in in_col.
  Example     : $rh_pm_in = { Title => q[Some title.], Abstract => q[Some abstract.] };
                $self->init_abstract($rh_pm_in) or return;
  Description : Reads the input data and stores it in 'Abst*' fields
                (the 'Abst' prefix is added to field names if
                needed). Converts to lower case and stores text in
                AbstLc* fields. Undefined fields in the input are
                converted to empty strings and stored. Currently,
                there are no mandatory fields: undefined $rh_pm_in is
                not considered an error. In that case, all fields
                will be empty strings, no sequences will be found,
                and the default score (zero) will be assigned.
  Returns     : TRUE if successful, FALSE otherwise.
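
The field handling described for init_abstract can be sketched as below. The helper name normalize_input is hypothetical, the field list is taken from the input variables in the DESCRIPTION, and the exact AbstLc* key names (eg AbstLcTitle) are an assumption.

```perl
use strict;
use warnings;

# Field names taken from the "Input variables" list.
my @in_col = qw(PMID Author Journal Title Abstract Mesh Chemical);

# Hypothetical helper: normalize an input hash ref into Abst* and AbstLc* fields.
sub normalize_input {
    my ($rh_in) = @_;
    $rh_in = {} unless ref($rh_in) eq 'HASH';    # undef input is not an error
    my %abst;
    for my $name (@in_col) {
        # Accept both 'Title' and 'AbstTitle'; undefined becomes ''.
        my $val = $rh_in->{"Abst$name"} // $rh_in->{$name} // '';
        $abst{"Abst$name"}   = $val;
        $abst{"AbstLc$name"} = lc($val);         # key name is an assumption
    }
    return \%abst;
}

my $rh = normalize_input({ Title => 'Some title.', AbstAbstract => 'Some abstract.' });
print "$rh->{AbstTitle} | $rh->{AbstLcAbstract} | [$rh->{AbstPMID}]\n";
# Some title. | some abstract. | []
```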

AbstVars

  Args        : none
  Example     : $parser->AbstVars or return;
  Description : Calls all Abst* methods that compute the corresponding variables, 
                eg AbstNumAb. Calls Words which in turn calls Word* methods.
  Returns     : TRUE if successful, FALSE otherwise.

AbstNum*

All AbstNum* methods use the same general scheme, unless specified otherwise. Each method refers to a specific class of patterns, eg AbstNumAb() refers to antibody related patterns. Each method looks in the AbstLcAllText field (concatenated abstract, mesh terms, chemical terms), finds the matches, and stores them in an array ref (eg $self->{abst}{ra_AbstNumAb}) together with the number of matches (eg $self->{abst}{AbstNumAb}).

Pattern matching is done in several steps, using patterns that can match anywhere in the word and patterns that must match the entire word (typically shorter patterns, like 'mab' or 'fab'). Often, there are additional steps. Step 1 patterns are mandatory: if there are no matches at step 1, then step 2 is skipped. If step 1 succeeds, then matches to the optional step 2 patterns are also added. For example, for AbstNumPeptide(), step 1 patterns may include 'peptid', and step 2 patterns 'synthe'. This is because 'synthe' by itself very often refers to non-peptides, but if 'peptid' matches, then 'synthe' more likely refers to peptide-related terms.
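
The two-step scheme can be sketched as follows. The function name match_two_step and the whitespace tokenization are assumptions; the patterns 'peptid', 'synthe', 'mab' and 'fab' are the examples named in the text, and the real pattern lists are longer.

```perl
use strict;
use warnings;

# Hypothetical two-step matcher. Step 1 patterns are mandatory; step 2
# matches are kept only if step 1 matched at least once.
sub match_two_step {
    my ($lc_text, $ra_step1, $ra_step2) = @_;
    my (@step1_hits, @step2_hits);
    for my $word (split ' ', $lc_text) {
        if (grep { $word =~ $_ } @$ra_step1) {
            push @step1_hits, $word;
        }
        elsif (grep { $word =~ $_ } @$ra_step2) {
            push @step2_hits, $word;
        }
    }
    return () unless @step1_hits;          # step 2 is skipped
    return (@step1_hits, @step2_hits);
}

my @peptide_step1 = (qr/peptid/);          # may match anywhere in the word
my @peptide_step2 = (qr/synthe/);          # optional, counted only after step 1
my @ab_step1      = (qr/^(?:mab|fab)$/);   # must match the entire word

my @hits = match_two_step('synthetic peptides were made',
                          \@peptide_step1, \@peptide_step2);
print scalar(@hits), "\n";                 # 2
@hits = match_two_step('synthetic polymers were made',
                       \@peptide_step1, \@peptide_step2);
print scalar(@hits), "\n";                 # 0
```

The number of matches corresponds to an AbstNum* value, and the list of matches to the array ref stored next to it.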

AbstNumAb

  Args        : none
  Example     : $parser->AbstNumAb or return;
  Description : See package DESCRIPTION above.
  Returns     : TRUE if successful, FALSE otherwise.

AbstNumBind

  Args        : none
  Example     : $parser->AbstNumBind or return;
  Description : See package DESCRIPTION above.
  Returns     : TRUE if successful, FALSE otherwise.

AbstNumDigest

  Args        : none
  Example     : $parser->AbstNumDigest or return;
  Description : See package DESCRIPTION above.
  Returns     : TRUE if successful, FALSE otherwise.

AbstNumMHC

  Args        : none
  Example     : $parser->AbstNumMHC or return;
  Description : See package DESCRIPTION above.
  Returns     : TRUE if successful, FALSE otherwise.

AbstNumPeptide

  Args        : none
  Example     : $parser->AbstNumPeptide or return;
  Description : See package DESCRIPTION above.
  Returns     : TRUE if successful, FALSE otherwise.

AbstNumPhage

  Args        : none
  Example     : $parser->AbstNumPhage or return;
  Description : See package DESCRIPTION above.
  Returns     : TRUE if successful, FALSE otherwise.

AbstNumProtease

  Args        : none
  Example     : $parser->AbstNumProtease or return;
  Description : See package DESCRIPTION above.
  Returns     : TRUE if successful, FALSE otherwise.

AbstScore

  Args        : none
  Example     : $parser->AbstScore or return;
  Description : Computes AbstScore based on the results of Abst* vars in the abstract.
  Returns     : TRUE if successful, FALSE otherwise.

Words


  Args        : none
  Example     : $parser->Words or return;
  Description : Parses all peptide sequences from abstract text in AbstAbstract. 
                Calls the necessary methods, and stores results in array ref 
                $parser->{words}.
  Returns     : TRUE if successful, FALSE otherwise.

aa3_to_aa1

  Args        : string with a peptide sequence
  Example     : $str = $self->aa3_to_aa1($str)
  Description : finds the first string that looks like a protein
                sequence in 3 letter amino acid symbols or full
                names. Cleans it (removes non-informative parens, -, /,
                etc), and returns the cleaned sequence. Currently,
                full names are also handled here, but this may change.
  Returns     : sequence (TRUE) if a sequence is found, FALSE otherwise.

to_aa1

  Args        : string with a peptide sequence
  Example     : $str = $self->to_aa1($str)
  Description : finds the first string that looks like a peptide sequence in 
                uppercase 1 letter amino acid symbols. Cleans it (removes 
                non-informative parens, -, /, etc), and returns the cleaned sequence.
  Returns     : sequence (TRUE) if a sequence is found, FALSE otherwise.

clean_seq

  Args        : string with a peptide sequence in 1 letter aa symbols
  Example     : $str = clean_seq($str);
  Description : for sequence in 1 letter aa symbols, remove extra parens, etc. 
                Simple motifs are ok: A(B/C/D)E...
  Returns     : cleaned string.

clean_orphan_parens

  Args        : string with a sequence to clean up.
  Example     : $_ = clean_orphan_parens($_);
  Description : removes orphan parens from a string.
  Returns     : cleaned string.

parse_slashes

  Args        : string with a sequence.
  Example     : $_ = parse_slashes($_);
  Description : handles slashes. Changes A/B/C/etc to (A/B/C/etc)
                which means any of A, B, C, etc. Returns the resulting
                string. Exceptions are cases where the result would
                make no sense: if X occurs next to '/' or if the
                result would contain a repeated character. For
                exceptions, splits the string on slashes, and returns
                the first longest string.
  Returns     : see above

has_repeats_at_slashes

  Args        : string with a sequence.
  Example     : print 1 if has_repeats_at_slashes($_);
  Description : see below
  Returns     : TRUE if the string has a repeated char next to a series of 
                slashes and chars, FALSE otherwise.

WordVars

  Args        : $rh_word - ref to hash with data for the word
  Example     : $parser->WordVars($rh_word) or return;
  Description : computes variables for WordScore, eg WordPropDegen, WordIsDNA, etc.
  Returns     : TRUE if successful, FALSE otherwise.

WordPropProtein

  Args        : $word : string
  Example     : $rh_word->{WordPropProtein} = $self->WordPropProtein($word);
  Description : See package DESCRIPTION above.
  Returns     : sum over k-mers of (log10Pp - log10Pn)

WordScore

  Args        : $rh_word - ref to hash with word data.
  Example     : $parser->WordScore($rh_word) or return;
  Description : See package DESCRIPTION above.
  Returns     : TRUE if successful, FALSE otherwise.

WordScoreForMatch

  Args        : $rh_word - ref to hash with word data.
  Example     : $parser->WordScoreForMatch($rh_word) or return;
  Description : Calls WordIsGeneOrDict to compute WordIsDict, WordIsGene. 
                Changes WordScore based on these vars, as well as based on WordIsDNA, 
                WordPropProtein, etc.
  Returns     : TRUE if successful, FALSE otherwise.

WordIsGeneOrDict

  Args        : $rh_word - ref to hash with word data.
  Example     : $parser->WordIsGeneOrDict($rh_word) or return;
  Description : Computes WordIsDict, WordIsGene and related vars. 
  Returns     : TRUE if successful, FALSE otherwise.

WordAbstScore

  Args        : none
  Example     : $parser->WordAbstScore or return;
  Description : Computes WordAbstScore based on the results of
                AbstScore and WordScore vars for all words in the
                abstract. AbstMaxWordScore is high if either AbstScore
                is high or WordScore is high, and higher if both are
                high. Instead of using simply AbstMaxWordScore =
                AbstScore * WordScore, AbstMaxWordScore is also added
                to WordScore with a smaller weight. This is because the
                probability that a given word is a sequence increases
                if another word in the same abstract is a
                sequence. The weight for AbstMaxWordScore is smaller
                than that for WordScore in order not to make all words
                have equal AbstMaxWordScore. WordAbstScore = 0 for
                WordScore = 0, to reduce obvious noise (otherwise, all
                words that are clearly not peptide sequences will get
                positive WordAbstScore when they occur in an abstract
                with at least one sequence). Skip transform to (0,1)
                because the scores below are all in (0,1), and the
                weights add to 1, so WordAbstScore is already in
                (0,1).
  Returns     : TRUE if successful, FALSE otherwise.

AbstMtext

  Args        : none
  Example     : $parser->AbstMtext
  Description : Marks all words that are putative sequences with
                '<mark>...</mark>' tags. This is done only for words
                for which WordScore is at least WordScoreMin.
  Returns     : always TRUE
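
The markup step can be sketched as below. The helper name mark_words is hypothetical, the scores are invented, and WordIdx is assumed to be a 0-based index into the whitespace-separated words of the abstract.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap selected words of the abstract in <mark> tags.
# Assumption: WordIdx is a 0-based index into the whitespace-separated words.
sub mark_words {
    my ($abstract, $ra_words, $score_min) = @_;
    my @tokens = split ' ', $abstract;
    for my $rh_word (@$ra_words) {
        next unless $rh_word->{WordScore} >= $score_min;
        my $idx = $rh_word->{WordIdx};
        next unless defined $idx && $idx >= 0 && $idx <= $#tokens;
        $tokens[$idx] = '<mark>' . $tokens[$idx] . '</mark>';
    }
    return join ' ', @tokens;
}

# Scores are invented for illustration only.
my @words = (
    { WordIdx => 2, WordScore => 0.9 },    # EYHHYNK
    { WordIdx => 4, WordScore => 0.8 },    # Arg-Gly-Asp,
    { WordIdx => 7, WordScore => 0.1 },    # ACCCGTNA
);
print mark_words(
    'Peptide sequences EYHHYNK and Arg-Gly-Asp, but not ACCCGTNA or VEGFRI.',
    \@words, 0.4), "\n";
```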

formatScoreProp

  Arg[1]      : $number - score or proportion.
  Example     : $rh_word->{WordScore} = formatScoreProp($rh_word->{WordScore});
  Description : changes $number to be in (0,1) interval, formats $number 
                for nice printing
  Returns     : formatted score

find

  Arg[1]      : $key - field name
  Arg[2]      : $val - value
  Example     : # print WordScore for the first GETRAPL sequence.
                # print $parser->find(WordSequence => 'GETRAPL')->{WordScore};
  Description : finds the first word for which the value of field $key 
                is equal to $val (string eq, not numeric ==)
  Returns     : returns this word (hash ref) if successful, FALSE otherwise.
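
The lookup performed by find can be sketched with List::Util. The function find_word is a hypothetical stand-alone version that takes the word list explicitly; the word data below is invented.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Hypothetical stand-alone version of find: returns the first word hash ref
# whose field $key is string-equal to $val, or undef (FALSE) if none matches.
sub find_word {
    my ($ra_words, $key, $val) = @_;
    return first { defined $_->{$key} && $_->{$key} eq $val } @$ra_words;
}

# Invented word data, for illustration only.
my @words = (
    { WordSequence => 'RGD',     WordScore => 0.8 },
    { WordSequence => 'GETRAPL', WordScore => 0.9 },
    { WordSequence => 'GETRAPL', WordScore => 0.2 },
);
my $rh_word = find_word(\@words, WordSequence => 'GETRAPL');
print "$rh_word->{WordScore}\n" if $rh_word;    # 0.9
```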