Options

Query and background samples

Composition Profiler provides an easy way to visually investigate bias in amino acid composition between two sets of protein sequences. A set of proteins under study (the query sample) can be analyzed against a representative sample of proteins from the organism under study, or a group of proteins with a contrasting functional annotation (the background sample), which provides a suitable background amino acid distribution.

Sequences should be entered in FASTA format.
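As an illustration of the expected input, the following minimal sketch parses FASTA text into (header, sequence) pairs. The function name `parse_fasta` and the two example records are hypothetical and are not part of Composition Profiler; the tool's own parser may treat edge cases differently.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; join them into one string.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record input; the names and sequences are made up.
example = """>query_1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>query_2 another hypothetical protein
GSSGSSGPEP
"""

records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```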

There is no theoretical limit on the number of sequences that can be given as input. Composition Profiler treats differences in amino acid composition as "signals". Since the p-value is a function of both the sample size and the signal strength, samples that are too small for the inherent signal will yield differences with p-values above the statistical significance threshold, and these differences will be discarded as spurious. With small sample sizes, only the strongest signals will be identified.

For example, if a sample consisting of only one sequence, AAAAAAAAAA, were analyzed against SwissProt, the difference between 100% A in the sample and 7.89% A in SwissProt would be statistically significant, because it presents a very strong signal (more than 12-fold enrichment). Larger sample sizes allow identification of more subtle signals.
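The arithmetic behind this example can be sketched as follows. The helper `composition_percent` is a hypothetical name used only for illustration; the 7.89% figure is the SwissProt alanine value from the table of standard datasets.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_percent(sequences):
    """Return the percentage of each standard amino acid over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


query = composition_percent(["AAAAAAAAAA"])
swissprot_ala = 7.89  # % Ala in SwissProt, taken from the standard datasets table
fold = query["A"] / swissprot_ala
print(f"Ala: {query['A']:.1f}% vs {swissprot_ala}% ({fold:.1f}-fold)")
```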

Background distribution

Alternatively, the query sample can be analyzed against one of the standard protein datasets:

  • SwissProt 51 (Bairoch et al., 2005), closest to the distribution of amino acids in nature among the four datasets
  • PDB Select 25 (Berman et al., 2000), a subset of structures from the Protein Data Bank with less than 25% sequence identity, biased towards the composition of proteins amenable to crystallization studies
  • Surface residues determined by the Molecular Surface Package over a sample of PDB structures of monomeric proteins, suitable for analyzing phenomena on protein surfaces, such as binding
  • DisProt 3.4, comprised of a set of consensus sequences of experimentally determined disordered regions (Sickmeier et al., 2007)

To expedite the calculations, amino acid compositions of the standard datasets have been pre-computed as means and standard deviations over 100,000 bootstrap iterations.

Residue    SwissProt (%)   PDB S25 (%)    Surface residues (%)   DisProt (%)
Ala (A)    7.89 ± 0.05     7.70 ± 0.08    6.03 ± 0.13            8.10 ± 0.35
Arg (R)    5.40 ± 0.04     4.93 ± 0.06    6.56 ± 0.13            4.82 ± 0.23
Asn (N)    4.13 ± 0.04     4.58 ± 0.06    6.23 ± 0.15            3.82 ± 0.27
Asp (D)    5.35 ± 0.03     5.83 ± 0.05    8.18 ± 0.10            5.80 ± 0.30
Cys (C)    1.50 ± 0.02     1.74 ± 0.05    0.78 ± 0.04            0.80 ± 0.08
Gln (Q)    3.95 ± 0.03     3.95 ± 0.05    5.21 ± 0.09            5.27 ± 0.37
Glu (E)    6.67 ± 0.04     6.65 ± 0.07    8.70 ± 0.17            9.89 ± 0.61
Gly (G)    6.96 ± 0.04     7.16 ± 0.07    7.06 ± 0.11            7.41 ± 0.40
His (H)    2.29 ± 0.02     2.41 ± 0.04    2.60 ± 0.06            1.93 ± 0.11
Ile (I)    5.90 ± 0.04     5.61 ± 0.06    2.77 ± 0.07            3.24 ± 0.13
Leu (L)    9.65 ± 0.04     8.68 ± 0.08    5.11 ± 0.08            6.22 ± 0.25
Lys (K)    5.92 ± 0.05     6.37 ± 0.08    9.75 ± 0.16            7.85 ± 0.45
Met (M)    2.38 ± 0.02     2.22 ± 0.04    1.13 ± 0.04            1.87 ± 0.10
Phe (F)    3.96 ± 0.03     3.98 ± 0.04    2.38 ± 0.05            2.44 ± 0.13
Pro (P)    4.83 ± 0.03     4.57 ± 0.05    5.63 ± 0.10            8.11 ± 0.63
Ser (S)    6.83 ± 0.04     6.19 ± 0.06    6.87 ± 0.13            8.65 ± 0.43
Thr (T)    5.41 ± 0.02     5.63 ± 0.05    6.08 ± 0.11            5.56 ± 0.24
Trp (W)    1.13 ± 0.01     1.44 ± 0.03    1.33 ± 0.05            0.67 ± 0.06
Tyr (Y)    3.03 ± 0.02     3.50 ± 0.04    3.58 ± 0.08            2.13 ± 0.15
Val (V)    6.73 ± 0.03     6.72 ± 0.06    4.01 ± 0.06            5.41 ± 0.44

To illustrate the importance of choosing an appropriate background distribution, we generated composition profiles of PDB Select 25 (A), the surface residues dataset (B) and DisProt (C) against SwissProt:

All three graphs have the same y-axis scale, the same order of amino acids and the same color-coding scheme (Vihinen flexibility), which allows direct visual comparison of the amino acid biases in the three datasets.
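To give a sense of how strongly the choice of background matters, the sketch below compares a few DisProt values from the table of standard datasets against SwissProt. It assumes that a profile bar is the fractional difference (query - background) / background; this formula is an assumption made for illustration and is not stated on this page. The function name `fractional_difference` is likewise hypothetical.

```python
# Mean compositions (%) for four residues, copied from the standard datasets table.
SWISSPROT = {"W": 1.13, "C": 1.50, "E": 6.67, "P": 4.83}
DISPROT = {"W": 0.67, "C": 0.80, "E": 9.89, "P": 8.11}


def fractional_difference(query, background):
    """Assumed bar height: relative enrichment (+) or depletion (-) vs. background."""
    return {aa: (query[aa] - background[aa]) / background[aa] for aa in background}


profile = fractional_difference(DISPROT, SWISSPROT)
# Print from most depleted to most enriched.
for aa, value in sorted(profile.items(), key=lambda item: item[1]):
    print(f"{aa}: {value:+.2f}")
```

Under this assumption, proline and glutamate come out enriched in DisProt relative to SwissProt, while cysteine and tryptophan come out depleted.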

Discovery

Alpha value

Statistical significance associated with a specific enrichment or depletion is estimated using the two-sample t-test between two sequences of binary indicator variables, one sequence for each of the samples. A particular enrichment or depletion is statistically significant when the p-value (the lowest significance level at which the null hypothesis, that the same underlying Gaussian distribution generated both samples, can be rejected) is lower than or equal to the user-specified statistical significance (alpha) value.
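A minimal sketch of such a test is shown below, assuming the pooled-variance form of the two-sample t-test and working directly from residue counts (each residue contributes an indicator of 1 if it is the amino acid of interest and 0 otherwise). The function name `indicator_t_test`, the normal approximation used for the p-value, and the background size of 100,000 residues are all assumptions made for illustration; the tool's actual implementation may differ.

```python
import math
from statistics import NormalDist


def indicator_t_test(k1, n1, k2, n2):
    """Two-sample t-test on 0/1 indicators.

    k = occurrences of the amino acid of interest, n = total residues (n >= 2).
    Returns (t statistic, two-sided p-value).
    """
    p1, p2 = k1 / n1, k2 / n2
    # Sample variance of a 0/1 indicator with mean p is p(1-p) * n/(n-1).
    var1 = p1 * (1 - p1) * n1 / (n1 - 1)
    var2 = p2 * (1 - p2) * n2 / (n2 - 1)
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    if se == 0:
        # Both samples are constant: identical means are indistinguishable,
        # different means are perfectly separated.
        if p1 == p2:
            return 0.0, 1.0
        return math.copysign(math.inf, p1 - p2), 0.0
    t = (p1 - p2) / se
    # Assumption: normal approximation to the t distribution,
    # which is reasonable when n1 + n2 is large.
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p_value


# 10 of 10 residues are Ala in the query; the background is a hypothetical
# sample of 100,000 residues with 7.89% Ala.
t, p = indicator_t_test(10, 10, 7890, 100_000)
alpha = 0.05
print(f"t = {t:.2f}, significant at alpha={alpha}: {p <= alpha}")
```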

Bonferroni correction

Adjusts the alpha value when multiple hypotheses are tested, by dividing it by the number of tests. See Weisstein for details.
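In its standard form, the Bonferroni correction divides alpha by the number of hypotheses tested. The sketch below assumes one test per standard amino acid (20 tests per profile); that count, and the function name `bonferroni_alpha`, are assumptions made for illustration rather than documented behavior.

```python
def bonferroni_alpha(alpha, num_tests):
    """Return the per-test significance threshold under Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


# Assumption: 20 tests, one for each standard amino acid.
corrected = bonferroni_alpha(0.05, 20)

# A p-value is then compared against the corrected threshold.
p_value = 0.004  # hypothetical p-value for one amino acid
print(p_value <= 0.05, p_value <= corrected)
```

With these hypothetical numbers, the difference would pass the uncorrected threshold but fail the corrected one.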

Number of bootstrap iterations

In the context of calculating composition differences, bootstrapping is used for non-parametric estimation of the confidence intervals for the reported amino acid compositions. More precisely, reported compositions are means of pseudo-replicate compositions, and confidence intervals are standard deviations of pseudo-replicate compositions.
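The idea can be sketched as follows for a single residue. This is a minimal illustration that assumes whole sequences are the resampling unit; the page does not say whether sequences or individual residues are resampled, and the function name `bootstrap_composition` is hypothetical.

```python
import random
from statistics import mean, stdev


def bootstrap_composition(sequences, residue, iterations=1000, seed=0):
    """Return (mean, standard deviation) of the residue percentage
    over bootstrap pseudo-replicates of a non-empty list of sequences."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(iterations):
        # Pseudo-replicate: draw as many sequences as in the original
        # sample, with replacement (assumed resampling unit).
        sample = [rng.choice(sequences) for _ in sequences]
        total = sum(len(s) for s in sample)
        hits = sum(s.count(residue) for s in sample)
        replicates.append(100.0 * hits / total)
    return mean(replicates), stdev(replicates)


# Hypothetical toy sample.
toy = ["AAAA", "AAGG", "GGGG", "AGAG"]
m, s = bootstrap_composition(toy, "A", iterations=200, seed=1)
print(f"Ala: {m:.2f} ± {s:.2f} %")
```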

In the context of relative entropy calculations, bootstrapping is used to estimate the statistical significance of the observed relative entropy. In each iteration, random samples of the two starting samples are created, relative entropy between the random samples is computed and compared to the observed relative entropy.
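One way to realize this procedure is sketched below. It is not the tool's documented algorithm: the sketch assumes that the random samples are produced by pooling the sequences of both samples and randomly re-splitting them (a permutation-style resampling), and it adds a pseudocount so that the logarithm is always defined. The function names are hypothetical.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences, pseudocount=1.0):
    """Amino acid frequencies; the pseudocount (an assumption) avoids zeros."""
    counts = Counter("".join(sequences))
    total = sum(counts[aa] for aa in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return [(counts[aa] + pseudocount) / total for aa in AMINO_ACIDS]


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))


def entropy_p_value(query, background, iterations=1000, seed=0):
    """Return (observed relative entropy, estimated p-value)."""
    rng = random.Random(seed)
    observed = relative_entropy(composition(query), composition(background))
    pooled = list(query) + list(background)
    hits = 0
    for _ in range(iterations):
        # Assumed null model: shuffle the pooled sequences and re-split them
        # into two random samples of the original sizes.
        rng.shuffle(pooled)
        a, b = pooled[:len(query)], pooled[len(query):]
        if relative_entropy(composition(a), composition(b)) >= observed:
            hits += 1
    return observed, hits / iterations


# Hypothetical toy samples.
obs, p = entropy_p_value(["AAAAAAAAAA"] * 5, [AMINO_ACIDS] * 20,
                         iterations=200, seed=1)
print(f"relative entropy = {obs:.3f} bits, p = {p}")
```

Note that the smallest non-zero p-value this estimate can resolve is 1 / iterations.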

The observed relative entropy is independent of the number of bootstrap iterations, and is relatively fast to compute. The bulk of the calculation time goes towards determining the statistical significance, and for large datasets this time may be considerable. We therefore advise starting the comparison with a small number of iterations to get a rough estimate of the p-value, and then increasing the number of iterations for comparisons which have a p-value below the threshold of 1 / number of iterations.

The following table gives example running times for the relative entropy calculations between the alpha MoRF dataset and a sample of proteins from SwissProt (both datasets are provided as part of the program distribution).

Iterations      10    100    1,000    10,000    100,000
Running time    1s    3s     23s      3m 40s    36m 38s

In principle, if we disregard the short initialization period when amino acids are counted and the counts stored, the running time grows linearly with the number of iterations.
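The linear trend can be checked from the running times listed above by dividing each time by its iteration count. The short calculation below uses only those reported numbers; the variable names are illustrative.

```python
# Running times from the table above, converted to seconds.
timings = {
    10: 1,
    100: 3,
    1_000: 23,
    10_000: 3 * 60 + 40,    # 3m 40s
    100_000: 36 * 60 + 38,  # 36m 38s
}

# Cost per iteration in milliseconds.
per_iteration_ms = {n: 1000 * seconds / n for n, seconds in timings.items()}
for n, ms in per_iteration_ms.items():
    print(f"{n:>7} iterations: {ms:6.1f} ms per iteration")
```

For the three largest runs the cost settles at roughly 22 to 23 ms per iteration, while the smaller runs are dominated by the fixed initialization cost.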

Amino acid grouping, ordering and color-coding

Alpha helix frequency (Nagano, 1973)

Y 0.63   P 0.70   G 0.72
N 0.77   S 0.78   R 0.83
T 0.87   C 0.94   I 0.94
V 0.97   D 1.00   W 1.06
Q 1.10   L 1.23   K 1.23
M 1.23   F 1.23   A 1.29
H 1.29   E 1.54
  

Aromatics

Aromatic amino acids (F, W, Y) are colored green, and the remaining amino acids are colored black.

Beta structure frequency (Nagano, 1973)

E 0.33   R 0.67   N 0.72
P 0.75   S 0.77   K 0.81
H 0.87   D 0.9    G 0.9
A 0.96   Y 1.07   C 1.13
W 1.13   Q 1.18   T 1.23
L 1.26   M 1.29   F 1.37
V 1.41   I 1.54
  

Bulkiness (Zimmerman et al., 1968)

G 3.4     S 9.47    A 11.5
D 11.68   N 12.82   C 13.46
E 13.57   H 13.69   R 14.28
Q 14.45   K 15.71   T 15.77
M 16.25   P 17.43   Y 18.03
F 19.8    I 21.4    L 21.4
V 21.57   W 21.67
  

Charge

Positively charged residues (K, R) are colored blue, negatively charged residues (D, E) are colored red, and neutral residues are colored black.

Coil propensity (Nagano, 1973)

F 0.58   M 0.62   L 0.63
A 0.72   E 0.75   H 0.76
I 0.8    Q 0.81   V 0.83
K 0.84   W 0.87   C 1.01
T 1.03   D 1.04   R 1.33
S 1.34   G 1.35   Y 1.35
N 1.38   P 1.43
  

Disorder propensity (Dunker et al., 2001)

Disorder-promoting residues (A, R, S, Q, E, G, K, P) are colored red, order-promoting residues (N, C, I, L, F, W, Y, V) are colored blue, and disorder-order neutral residues (D, H, M, T) are colored black.
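A grouping-based color scheme such as this one amounts to a lookup from residue to color. The sketch below encodes the three groups listed above; the names `DISORDER_GROUPS`, `RESIDUE_COLOR` and `color_sequence` are hypothetical and only illustrate the mapping.

```python
# Color groups as listed above (Dunker et al., 2001).
DISORDER_GROUPS = {
    "red": "ARSQEGKP",   # disorder-promoting
    "blue": "NCILFWYV",  # order-promoting
    "black": "DHMT",     # disorder-order neutral
}

# Invert the grouping into a per-residue lookup table.
RESIDUE_COLOR = {
    aa: color for color, group in DISORDER_GROUPS.items() for aa in group
}


def color_sequence(seq):
    """Return (residue, color) pairs for a sequence of standard amino acids."""
    return [(aa, RESIDUE_COLOR[aa]) for aa in seq.upper()]


# The three groups cover all 20 standard amino acids exactly once.
print(len(RESIDUE_COLOR))
print(color_sequence("MPW"))
```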

Flexibility (Vihinen et al., 1994)

W 0.904   C 0.906   F 0.915
I 0.927   Y 0.929   V 0.931
L 0.935   H 0.950   M 0.952
A 0.984   T 0.997   R 1.008
G 1.031   Q 1.037   S 1.046
N 1.048   P 1.049   D 1.068
E 1.094   K 1.102
  

Hydrophobicity (Eisenberg et al., 1984)

R -2.53   K -1.5    D -0.9
Q -0.85   N -0.78   E -0.74
H -0.4    S -0.18   T -0.05
P 0.12    Y 0.26    C 0.29
G 0.48    A 0.62    M 0.64
W 0.81    L 1.06    V 1.08
F 1.19    I 1.38
  

Hydrophobicity (Kyte and Doolittle, 1982)

R -4.5   K -3.9   D -3.5
E -3.5   N -3.5   Q -3.5
H -3.2   P -1.6   Y -1.3
W -0.9   S -0.8   T -0.7
G -0.4   A 1.8    M 1.9
C 2.5    F 2.8    L 3.8
V 4.2    I 4.5
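The ordering and grouping used for this scale can be reproduced by sorting the residues by their Kyte and Doolittle values and cutting the sorted list into consecutive groups of three. That the groups are formed this way is inferred from the layout of the listing and is an assumption; the names used below are hypothetical.

```python
# Kyte and Doolittle (1982) hydropathy values, as listed for this scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def ordered_groups(scale, group_size=3):
    """Sort residues by scale value and split them into consecutive groups."""
    # sorted() is stable, so tied residues keep their dictionary order.
    ordered = sorted(scale, key=scale.get)
    return ["".join(ordered[i:i + group_size])
            for i in range(0, len(ordered), group_size)]


groups = ordered_groups(KYTE_DOOLITTLE)
print(" ".join(groups))
```

The order among tied residues (D, E, N and Q all share a value of -3.5) is arbitrary; here it simply follows the order in which the dictionary was written.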
  

Hydrophobicity (Fauchere and Pliska, 1983)

R -1.01   K -0.99   D -0.77
E -0.64   N -0.6    Q -0.22
S -0.04   G 0.0     H 0.13
T 0.26    A 0.31    P 0.72
Y 0.96    V 1.22    C 1.54
L 1.7     F 1.79    I 1.8
M 1.23    W 2.25
  

Interface propensity (Jones and Thornton, 1997)

D -0.38   K -0.36   S -0.33
P -0.25   T -0.18   A -0.17
E -0.13   Q -0.11   G -0.07
N 0.12    R 0.27    V 0.27
L 0.4     H 0.41    C 0.43
I 0.44    M 0.66    Y 0.66
F 0.82    W 0.83
  

Linker propensity (George and Heringa, 2003)

C 0.778   G 0.835   W 0.895
D 0.916   I 0.922   N 0.944
K 0.944   S 0.947   V 0.955
A 0.964   Y 1.0     H 1.014
T 1.017   M 1.032   Q 1.047
E 1.051   L 1.085   F 1.119
R 1.143   P 1.299
  

Polarity (Zimmerman et al., 1968)

A 0.0    G 0.0    I 0.13
L 0.13   V 0.13   F 0.35
M 1.43   C 1.48   P 1.58
Y 1.61   T 1.66   S 1.67
W 2.1    N 3.38   Q 3.53
K 49.5   D 49.7   E 49.9
H 51.6   R 52.0
  

Size (Dawson, 1972)

G 0.5   A 2.5   D 2.5
C 3.0   S 3.0   N 5.0
E 5.0   T 5.0   V 5.0
I 5.5   L 5.5   P 5.5
Q 6.0   H 6.0   M 6.0
F 6.5   K 7.0   W 7.0
Y 7.0   R 7.5
  

Surface exposure (Janin, 1979)

K -1.8   R -1.4   E -0.7
Q -0.7   D -0.6   N -0.5
Y -0.4   P -0.3   T -0.2
H -0.1   S -0.1   A 0.3
G 0.3    W 0.3    M 0.4
L 0.5    F 0.5    V 0.6
I 0.7    C 0.9
  

Solvation potential (Jones and Thornton, 1996)

W -0.68   F -0.55   I -0.49
L -0.49   M -0.40   Y -0.32
V -0.31   C -0.30   H -0.06
A 0.05    G 0.08    S 0.15
T 0.16    P 0.19    N 0.22
R 0.41    Q 0.45    D 0.64
E 0.77    K 1.61
  

Additional color schemes

WebLogo default colors

Amino Acids               Color Name   RGB           Hexadecimal
G, S, T, Y, C             green        [0,204,0]     00CC00
N, Q                      purple       [204,0,204]   CC00CC
K, R, H                   blue         [0,0,204]     0000CC
D, E                      red          [204,0,0]     CC0000
P, A, W, F, L, I, M, V    black        [0,0,0]       000000

Shapley color table

Amino Acids   Color Name      RGB             Hexadecimal
D, T          dark red        [160,0,66]      A00042
E             red-brown       [102,0,0]       660000
C             bright yellow   [255,255,112]   FFFF70
M, Y          dark yellow     [184,160,66]    B8A042
K             blue            [71,71,184]     4747B8
R             dark blue       [0,0,124]       00007C
S, Q          orange          [255,76,76]     FF4C4C
F, P, W       dark grey       [83,76,66]      534C42
N             flesh           [255,124,112]   FF7C70
G, V          light grey      [235,235,235]   EBEBEB
I             dark green      [0,76,0]        004C00
L             grey-green      [69,94,69]      455E45
A             light green     [140,255,140]   8CFF8C
H             pale blue       [112,112,255]   7070FF

In the original Shapley scheme, G and V are color-coded as white. Since this would render them invisible against a white background, their color has been changed to light grey.

Amino Colors

Amino Acids   Color Name   RGB             Hexadecimal
D, E          bright red   [230,10,10]     E60A0A
C, M          yellow       [230,230,0]     E6E600
K, R          blue         [20,90,255]     145AFF
S, T          orange       [250,150,0]     FA9600
F, Y          mid blue     [50,50,170]     3232AA
N, Q          cyan         [0,220,220]     00DCDC
G             light grey   [235,235,235]   EBEBEB
L, V, I       green        [15,130,15]     0F820F
A             dark grey    [200,200,200]   C8C8C8
W             pink         [180,90,180]    B45AB4
H             pale blue    [130,130,210]   8282D2
P             flesh        [220,150,130]   DC9682
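The RGB and hexadecimal columns in these color tables encode the same value: each of the three RGB components is written as a two-digit hexadecimal number. A short sketch of the conversion, using three rows of the Amino Colors table as examples (the function name `rgb_to_hex` is hypothetical):

```python
def rgb_to_hex(rgb):
    """Convert an [R, G, B] triple (each 0-255) to a six-digit hex string."""
    r, g, b = rgb
    for component in (r, g, b):
        if not 0 <= component <= 255:
            raise ValueError("RGB components must be in the range 0-255")
    return f"{r:02X}{g:02X}{b:02X}"


# Rows taken from the Amino Colors table above.
print(rgb_to_hex([230, 10, 10]))   # D, E (bright red)
print(rgb_to_hex([20, 90, 255]))   # K, R (blue)
print(rgb_to_hex([0, 220, 220]))   # N, Q (cyan)
```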

Show Y-axis label

Shows the label on the Y-axis of the profile graph.

Output format

Composition Profiler supports:

  • Encapsulated PostScript (EPS)
  • Portable Document Format (PDF)
  • Graphics Interchange Format (GIF)
  • Portable Network Graphics (PNG)

Additionally, raw values for bar heights and error bars are displayed if the plain text (TXT) option is chosen.

Output size

Height and width of the output image, in pixels, centimeters or inches.

Antialiasing

Turns antialiasing on or off. Applicable to bitmaps only (GIF and PNG).

Resolution

Sets the image resolution. Applicable to bitmaps only (GIF and PNG).
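For bitmap output, a physical size and a resolution together determine the pixel dimensions. The sketch below shows the usual conversion, assuming the resolution is expressed in dots per inch; this page does not state the unit, and the function name `to_pixels` is hypothetical rather than part of the tool.

```python
CM_PER_INCH = 2.54


def to_pixels(value, unit, dpi):
    """Convert a length to whole pixels at the given resolution (assumed DPI)."""
    if unit == "pixels":
        return round(value)
    if unit == "inches":
        return round(value * dpi)
    if unit == "cm":
        return round(value / CM_PER_INCH * dpi)
    raise ValueError(f"unknown unit: {unit!r}")


# A 5 x 3 inch image at 300 dots per inch.
width_px = to_pixels(5, "inches", 300)
height_px = to_pixels(3, "inches", 300)
print(width_px, height_px)
```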

References

  • Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, and Yeh LS. (2005) "The Universal Protein Resource (UniProt)." Nucleic Acids Research 33:D154-159.
  • Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, and Bourne PE. (2000) "The Protein Data Bank." Nucleic Acids Research, 28:235-242.
  • Dawson DM. (1972) In The Biochemical Genetics of Man (Brock DJH and Mayo O, eds.), Academic Press, New York, 1-38.
  • Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh JS, Oldfield CJ, Campen AM, Ratliff CM, Hipps KW et al. (2001) "Intrinsically disordered protein." J. Mol. Graph. Model., 19, 26-59.
  • Eisenberg D, Schwarz E, Komaromy M, and Wall R. (1984) "Analysis of membrane and surface protein sequences with the hydrophobic moment plot." J. Mol. Biol. 179:125-142.
  • Fauchere J-L, and Pliska VE. (1983) "Hydrophobic parameters pi of amino acid side chains from partitioning of N-acetyl-amino-acid amides." Eur. J. Med. Chem. 18:369-375.
  • George RA, and Heringa J. (2003) "An analysis of protein domain linkers: their classification and role in protein folding." Protein Eng. 15:871-879.
  • Hogg RV and Craig A. (1994) Introduction to Mathematical Statistics, 5th edition, Prentice Hall.
  • Janin J. (1979) "Surface and inside volumes in globular proteins." Nature, 277, 491-492.
  • Jones S, and Thornton J. (1997) "Analysis of protein-protein interaction sites using surface patches." J. Mol. Biol. 272:121-132.
  • Jones S, and Thornton J. (1996) "Principles of protein-protein interactions." Proc. Natl. Acad. Sci. USA, 93:13-20.
  • Kyte J, and Doolittle RF. (1982) "A simple method for displaying the hydropathic character of a protein." J. Mol. Biol. 157:105-132.
  • Nagano K. (1973) "Local analysis of the mechanism of protein folding. I. Prediction of helices, loops, and beta-structures from primary structure." J. Mol. Biol. 75:401-420.
  • Sickmeier M, Hamilton JA, LeGall T, Vacic V, Cortese MS, Tantos A, Szabo B, Tompa P, Chen J, Uversky VN, Obradovic Z, and Dunker AK. (2007) "DisProt: the Database of Disordered Proteins." Nucleic Acids Research 35:D786-793.
  • Vihinen M, Torkkila E, and Riikonen P. (1994) "Accuracy of protein flexibility predictions." Proteins, 19, 141-149.
  • Weisstein EW. "Bonferroni Correction." From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/BonferroniCorrection.html
  • Zimmerman JM, Eliezer N, and Simha R. (1968) "The characterization of amino acid sequences in proteins by statistical methods." J. Theor. Biol. 21:170-201.