
Protein Sequence Comparative Analysis (PSCA)



About PSCA

Protein Sequence Comparative Analysis (PSCA) is an integrated web tool for the comparative analysis of protein sequences. It analyzes a protein sequence in multiple layers: domains and families, secondary structure features, similarity to PDB protein structures, protein structure prediction, and protein homologues. The PSCA user interface is supported by the server and database of the Joint Center for Structural Genomics (JCSG).

Layers of Comparative Analysis

Sequence Information

Protein sequence information contains annotation from both JCSG and SWISS-PROT. The annotation pages from the SWISS-PROT and TrEMBL databases are accessed through SWALL on the EBI SRS server. Enzymatic and metabolic information for enzyme targets is available from KEGG.

A group of JCSG tools provides broad support for target annotation. The Data Acquisition Prioritization System (DASP) prioritizes crystallized proteins for data acquisition at the Joint Center for Structural Genomics Structure Determination Core. Functional & Structural Space (FSS) and Target PDB Monitor (TPM) monitor the functional and structural coverage of protein targets.

Homologous Protein Sequences

NCBI BLAST and PSI-BLAST are used to search for homologues against several sequence databases with an E-value cut-off of 0.001. The databases include NCBI NR and the JCSG sets: JCSG Target, Thermotoga, Yeast, C. elegans and Mouse.

Domains and Families

Domain search is done using the HMMER program and the Pfam database. Given a protein sequence, PSCA searches for Pfam domains and presents them in both graphical and textual formats. A single domain is shown in a single color over the corresponding subsequence region. Overlapping domains are shown as mixed colors. PSCA can also retrieve a specified Pfam domain from a subset of the JCSG database, such as the proteome of Thermotoga maritima or Caenorhabditis elegans.

Secondary Structure Feature

JNET is used to predict secondary structure elements such as alpha helices and beta strands. TMHMM is used to predict transmembrane helices.

PDB Fold Similarity

Homologous protein sequences usually have similar structures. BLAST from NCBI is used to search for homologous sequences in the PDB sequence database pdbnr.

The structure of a PDB chain may be fully or only partly solved, and PDB structure files may have missing residues. The non-redundant sequence database pdbnr consists of unique PDB sequences taken from representative PDB chains. Each representative chain has the best percent structural coverage (%covp) within its group of identical sequences. The percent structural coverage reflects the difference between the real sequence submitted to PDB and the atom sequence extracted from the PDB coordinate data. An aligned subsequence of the real PDB sequence may therefore have no corresponding structural data in the PDB coordinate file.

covp (%) = (number of aligned atom residues - gaps)/(length of real sequence) * 100%
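
The formula and the choice of representative chains can be illustrated with a minimal Python sketch. The function names, the input tuple layout and the chain identifiers are hypothetical and chosen only for illustration; this is not the code PSCA runs.

```python
def covp(aligned_atom_residues: int, gaps: int, real_sequence_length: int) -> float:
    """Percent structural coverage, following the covp formula above."""
    if real_sequence_length <= 0:
        raise ValueError("real_sequence_length must be positive")
    return (aligned_atom_residues - gaps) / real_sequence_length * 100.0


def pick_representatives(chains):
    """Pick, for each group of identical sequences, the chain with the best covp.

    `chains` is an iterable of (chain_id, real_sequence, covp_percent) tuples.
    This layout is an assumption made for the example.
    Returns a dict mapping each unique sequence to its representative chain id.
    """
    best = {}
    for chain_id, sequence, coverage in chains:
        current = best.get(sequence)
        # Keep the first chain seen on ties; replace only on strictly better coverage.
        if current is None or coverage > current[1]:
            best[sequence] = (chain_id, coverage)
    return {sequence: chain_id for sequence, (chain_id, _) in best.items()}


if __name__ == "__main__":
    # 180 aligned atom residues with 5 gaps over a 200-residue real sequence.
    print(covp(180, 5, 200))
    # Hypothetical chains: two share the same sequence, one is unique.
    print(pick_representatives([
        ("1abcA", "MKV", 80.0),
        ("2xyzB", "MKV", 95.0),
        ("3defA", "GGS", 60.0),
    ]))
```

Ties are resolved in favor of the first chain encountered, which is an arbitrary choice in this sketch; the source does not say how ties are handled.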

Fold & Function Assignment System (FFAS)

FFAS is a profile-profile fold recognition method developed by the Godzik Laboratory.

Reference System

The reference system provides two ways to collect references. The first is a subject-oriented search using protein descriptions and keywords. The second is a sequence-oriented search, an automated procedure based on protein sequence annotation.

PSCA focuses on the automated sequence-oriented approach. References are collected from the target sequence itself, domains and families, PDB structures and NCBI NR homologues. A user interface supports the classification and selection of references. A group of buttons lets you select references by interest or by how closely they relate to the target, judged by criteria such as trusted or high similarity. Entrez PubMed provides access to MEDLINE citations.

Reference Classification

The reference system classifies references into three groups (Trusted or Extreme similarity, Gathering or High similarity, Noise or Low similarity) using the following criteria, which are open to discussion. The sequence itself is treated as Trusted.

Pfam domains and families are classified by their Pfam HMM scores as Trusted, Gathering or Noise. A hit scoring above the gathering level is considered very significant. InterPro domains and families are collected from EBI SWALL (SPTR) and EBI InterPro and are treated as Trusted without further judgment.
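
One possible reading of the Pfam rule is sketched below in Python. The source does not give the exact decision rule, so the comparison against per-family trusted and gathering cut-offs is an assumption, and the function name and the numeric cut-offs in the example are hypothetical.

```python
def classify_pfam_hit(score: float, trusted: float, gathering: float) -> str:
    """Assign a Pfam hit to Trusted, Gathering or Noise.

    Assumption: each Pfam family supplies its own trusted and gathering
    cut-offs (in bits), with trusted >= gathering. A score at or above the
    trusted cut-off is Trusted, at or above the gathering cut-off is
    Gathering, and anything lower is Noise.
    """
    if trusted < gathering:
        raise ValueError("trusted cut-off must not be below gathering cut-off")
    if score >= trusted:
        return "Trusted"
    if score >= gathering:
        return "Gathering"
    return "Noise"


if __name__ == "__main__":
    # Hypothetical cut-offs for one family: trusted 25.0 bits, gathering 20.0 bits.
    for score in (30.0, 22.0, 10.0):
        print(score, classify_pfam_hit(score, trusted=25.0, gathering=20.0))
```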

Sequence homologues from BLAST searches are classified as Extreme similarity (identity >= 85% and coverage >= 50%), High similarity (identity >= 30% and coverage >= 50%) or Low similarity (E-value <= 0.001, but not meeting the criteria for High similarity). For PSI-BLAST homologues, the identity threshold for High similarity is lowered to 25%, because PSI-BLAST alignments are usually longer and the search is more sensitive.

By default, the reference system shows references at the Gathering or High similarity level. You can filter and rebuild the list of references using the buttons in the user interface.
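
The BLAST and PSI-BLAST thresholds above can be written as a small Python sketch. The function name and signature are hypothetical, and the sketch assumes that every hit must first pass the E-value cut-off of 0.001, which the source implies but does not state for the Extreme and High levels.

```python
from typing import Optional


def classify_homologue(identity: float, coverage: float, evalue: float,
                       psi_blast: bool = False) -> Optional[str]:
    """Classify a BLAST or PSI-BLAST hit by similarity level.

    identity and coverage are percentages (0 to 100). Returns "Extreme",
    "High" or "Low", or None when the hit fails the E-value cut-off.
    """
    if evalue > 0.001:
        return None  # assumption: hits above the cut-off are not reported at all
    # The High similarity identity threshold drops from 30% to 25% for PSI-BLAST.
    high_identity = 25.0 if psi_blast else 30.0
    if identity >= 85.0 and coverage >= 50.0:
        return "Extreme"
    if identity >= high_identity and coverage >= 50.0:
        return "High"
    return "Low"


if __name__ == "__main__":
    print(classify_homologue(90.0, 60.0, 1e-50))
    print(classify_homologue(27.0, 80.0, 1e-5))
    print(classify_homologue(27.0, 80.0, 1e-5, psi_blast=True))
```

The second and third calls use the same hit and differ only in the `psi_blast` flag, which shows how the lower PSI-BLAST threshold moves a 27% identity hit from Low to High similarity.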

Genome Reference Filter

Some genome references are not specific to a particular target. The genome reference filter removes these non-specific genome references from the list of references.

How To Use PSCA

Direct PSCA Server Call

Given an accession, you can analyze the protein sequence directly, with or without setting E-value cut-offs. If you do not set them, the defaults are used: 0.01 for the Pfam domain search and 0.001 for the BLAST homology search (e.g. http://www1.jcsg.org/prod/newscripts/psca/targetinfo.cgi?acc=TM0023). The cut-offs can be set with hexpect=[non-negative number] for the Pfam domain search and bexpect=[non-negative number] for the BLAST search (e.g. http://www1.jcsg.org/prod/newscripts/psca/targetinfo.cgi?acc=TM0023&hexpect=0.01&bexpect=0.001).
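
A direct call URL of this form can be assembled with a short Python helper. This is a minimal sketch: the helper name `psca_url` is hypothetical, and only the script address and the acc, hexpect and bexpect parameters come from the description above.

```python
from typing import Optional
from urllib.parse import urlencode

# Script address taken from the example URLs above.
BASE_URL = "http://www1.jcsg.org/prod/newscripts/psca/targetinfo.cgi"


def psca_url(acc: str, hexpect: Optional[float] = None,
             bexpect: Optional[float] = None) -> str:
    """Build a direct PSCA call URL for one accession.

    hexpect is the Pfam E-value cut-off and bexpect the BLAST E-value
    cut-off. A cut-off left as None is omitted, so the server default applies.
    """
    params = {"acc": acc}
    for name, value in (("hexpect", hexpect), ("bexpect", bexpect)):
        if value is None:
            continue
        if value < 0:
            raise ValueError(f"{name} must be a non-negative number")
        params[name] = value
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(psca_url("TM0023"))
    print(psca_url("TM0023", hexpect=0.01, bexpect=0.001))
```

The helper only builds the address; it does not contact the server, so it makes no assumption about whether the service is still reachable.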

Acknowledgement

Protein Sequence Information

Protein sequence information contains annotation from both JCSG and SWISS-PROT. The annotation pages from the SWISS-PROT and TrEMBL databases are accessed through SWALL on the EBI SRS server. Enzymatic and metabolic information for enzyme targets is available from KEGG.

A group of JCSG tools provides broad support for target annotation. The Data Acquisition Prioritization System (DASP) prioritizes crystallized proteins for data acquisition at the Joint Center for Structural Genomics Structure Determination Core. Functional & Structural Space (FSS) and Target PDB Monitor (TPM) monitor the functional and structural coverage of protein targets.

Homologous Sequences

Homology search of protein sequences is done using BLAST and PSI-BLAST from NCBI with an E-value cut-off of 0.001. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997 Sep 1;25(17):3389-402.

Sequence clustering is done using the CD-HI program developed by the Godzik Laboratory. Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 2001;17:282-283.

Pfam Domains

Domain and family search is done using the HMMER program and the Pfam database with an E-value cut-off of 0.01. Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer ELL. The Pfam protein families database. Nucleic Acids Res 2000;28:263-266.

Secondary Structure Feature

Secondary structure prediction is done using JNET, a neural network protein secondary structure prediction method. Cuff JA, Barton GJ. Application of enhanced multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins 1999;40:502-511.

Transmembrane helix prediction is done using TMHMM, a method for predicting transmembrane helices based on a hidden Markov model, developed by Anders Krogh and Erik Sonnhammer.

PDB Fold Similarity

The PDB sequences and structure data are downloaded from the PDB FTP server. The BLAST search is run against pdbnr, a non-redundant PDB sequence database consisting of unique PDB sequences from representative PDB chains. Each representative chain has the best structural coverage within its group of identical sequences.

Fold & Function Assignment System

The Fold & Function Assignment System (FFAS) is developed and maintained by the Godzik Laboratory. Jaroszewski L, Rychlewski L, Godzik A. Improving the quality of twilight-zone alignments. Protein Sci 2000 Aug;9(8):1487-96.

Reference System

Entrez PubMed provides access to MEDLINE citations. Protein sequence information is collected from NCBI GenBank and SWALL (SPTR) on the EBI SRS server. Functional and structural information is collected from EBI InterPro, Pfam and PDB. A user interface supports the classification and selection of references.
