Options
Query and background samples
Composition Profiler provides an easy way to visually investigate bias
in amino acid composition between two sets of protein sequences. A set
of proteins under study (the query sample) can be analyzed
against a representative sample of proteins from the organism under
study, or a group of proteins with a contrasting functional annotation
(the background sample), which provides a suitable background
amino acid distribution.
Sequences should be entered in FastA format.
There are no theoretical limits on the number of sequences that can be
given as input. Composition Profiler treats differences in amino acid
composition as "signals". Since the p-value is a function of both the sample
size and the signal strength, samples that are too small for the
inherent signal will yield differences with p-values above the statistical
significance threshold, and these differences will be discarded as spurious.
With small sample sizes, only the strongest signals will be identified.
For example, if a sample consisting of only one sequence, AAAAAAAAAA,
were analyzed against SwissProt, the difference would be statistically
significant, because the gap between 100% A in the sample and 7.89% A in
SwissProt is a very strong signal (more than 12-fold enrichment).
Larger sample sizes allow identification of more subtle signals.
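The enrichment quoted in this example can be reproduced with a few lines of Python. This is a minimal sketch and not the code Composition Profiler itself runs; the helper names `parse_fasta` and `composition_percent` are hypothetical, and the 7.89% alanine frequency is the SwissProt value quoted in the example.

```python
from collections import Counter


def parse_fasta(text):
    """Return the list of sequences found in FASTA-formatted text."""
    sequences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return sequences


def composition_percent(sequences):
    """Percentage of each residue over all sequences taken together."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}


query = parse_fasta(">example\nAAAAAAAAAA\n")
comp = composition_percent(query)

SWISSPROT_ALA = 7.89  # percent, value quoted in the text
fold = comp["A"] / SWISSPROT_ALA
print(f"Ala: {comp['A']:.1f}% in query, {fold:.1f}-fold enrichment")
```

The fold enrichment only measures signal strength; whether it is reported as significant still depends on the sample size, as described above.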
Background distribution
Alternatively, the query sample can be analyzed against one of the standard protein datasets:
- SwissProt 51 (Bairoch et al., 2005), closest to the distribution of amino acids in nature among the four datasets
- PDB Select 25 (Berman et al., 2000), a subset of structures from the Protein Data Bank with less than 25% sequence identity, biased towards the composition of proteins amenable to crystallization studies
- Surface residues determined by the Molecular Surface Package over a sample of PDB structures of monomeric proteins, suitable for analyzing phenomena on protein surfaces, such as binding
- DisProt 3.4, comprised of a set of consensus sequences of experimentally determined disordered regions (Sickmeier et al., 2007)
To expedite the calculations, amino acid compositions of the standard datasets have been pre-computed as means and standard deviations over 100,000 bootstrap iterations.
Residue \ % | SwissProt | PDB S25 | Surface Residues | DisProt |
Ala (A) | 7.89 ± 0.05 | 7.70 ± 0.08 | 6.03 ± 0.13 | 8.10 ± 0.35 |
Arg (R) | 5.40 ± 0.04 | 4.93 ± 0.06 | 6.56 ± 0.13 | 4.82 ± 0.23 |
Asn (N) | 4.13 ± 0.04 | 4.58 ± 0.06 | 6.23 ± 0.15 | 3.82 ± 0.27 |
Asp (D) | 5.35 ± 0.03 | 5.83 ± 0.05 | 8.18 ± 0.10 | 5.80 ± 0.30 |
Cys (C) | 1.50 ± 0.02 | 1.74 ± 0.05 | 0.78 ± 0.04 | 0.80 ± 0.08 |
Gln (Q) | 3.95 ± 0.03 | 3.95 ± 0.05 | 5.21 ± 0.09 | 5.27 ± 0.37 |
Glu (E) | 6.67 ± 0.04 | 6.65 ± 0.07 | 8.70 ± 0.17 | 9.89 ± 0.61 |
Gly (G) | 6.96 ± 0.04 | 7.16 ± 0.07 | 7.06 ± 0.11 | 7.41 ± 0.40 |
His (H) | 2.29 ± 0.02 | 2.41 ± 0.04 | 2.60 ± 0.06 | 1.93 ± 0.11 |
Ile (I) | 5.90 ± 0.04 | 5.61 ± 0.06 | 2.77 ± 0.07 | 3.24 ± 0.13 |
Leu (L) | 9.65 ± 0.04 | 8.68 ± 0.08 | 5.11 ± 0.08 | 6.22 ± 0.25 |
Lys (K) | 5.92 ± 0.05 | 6.37 ± 0.08 | 9.75 ± 0.16 | 7.85 ± 0.45 |
Met (M) | 2.38 ± 0.02 | 2.22 ± 0.04 | 1.13 ± 0.04 | 1.87 ± 0.10 |
Phe (F) | 3.96 ± 0.03 | 3.98 ± 0.04 | 2.38 ± 0.05 | 2.44 ± 0.13 |
Pro (P) | 4.83 ± 0.03 | 4.57 ± 0.05 | 5.63 ± 0.10 | 8.11 ± 0.63 |
Ser (S) | 6.83 ± 0.04 | 6.19 ± 0.06 | 6.87 ± 0.13 | 8.65 ± 0.43 |
Thr (T) | 5.41 ± 0.02 | 5.63 ± 0.05 | 6.08 ± 0.11 | 5.56 ± 0.24 |
Trp (W) | 1.13 ± 0.01 | 1.44 ± 0.03 | 1.33 ± 0.05 | 0.67 ± 0.06 |
Tyr (Y) | 3.03 ± 0.02 | 3.50 ± 0.04 | 3.58 ± 0.08 | 2.13 ± 0.15 |
Val (V) | 6.73 ± 0.03 | 6.72 ± 0.06 | 4.01 ± 0.06 | 5.41 ± 0.44 |
To illustrate the importance of choosing an appropriate background distribution,
we generated composition profiles of PDB Select 25 (A), the surface residues
dataset (B) and DisProt (C) against SwissProt.
All three graphs have the same y-axis scale, the same order of amino acids
and the same color-coding scheme (Vihinen flexibility), which allows direct
visual comparison of the amino acid biases in the three datasets.
Discovery
Alpha value
Statistical significance associated with a specific enrichment or depletion is estimated using the two-sample t-test between two sequences of binary indicator variables, one sequence for each of the samples. A particular enrichment or depletion is statistically significant when the p-value (the lowest significance level at which the null hypothesis, that the same underlying Gaussian distribution generated both samples, can be rejected) is lower than or equal to a user-specified statistical significance (alpha) value.
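The test on binary indicator variables can be sketched as follows. This is an illustration under stated assumptions rather than the program's implementation: the function name `indicator_t_test` is hypothetical, a pooled-variance t statistic is assumed, and the two-sided p-value is approximated with the standard normal distribution, which is reasonable only for large samples.

```python
import math
from statistics import NormalDist


def indicator_t_test(query, background, residue):
    """Two-sample t statistic on 0/1 indicators of `residue`.

    `query` and `background` are strings of residues. Returns (t, p),
    where p is a normal approximation to the two-sided p-value
    (an assumption made to keep the sketch dependency-free).
    """
    x = [1 if aa == residue else 0 for aa in query]
    y = [1 if aa == residue else 0 for aa in background]
    n1, n2 = len(x), len(y)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two residues")
    m1, m2 = sum(x) / n1, sum(y) / n2
    # Unbiased sample variances of the indicator variables.
    v1 = sum((a - m1) ** 2 for a in x) / (n1 - 1)
    v2 = sum((b - m2) ** 2 for b in y) / (n2 - 1)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    if se == 0:
        # Both samples are constant: identical means carry no signal.
        return (0.0, 1.0) if m1 == m2 else (math.copysign(math.inf, m1 - m2), 0.0)
    t = (m1 - m2) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


query_sample = "AAAAAAKLAA" * 20               # 80% alanine
background_sample = "ACDEFGHIKLMNPQRSTVWY" * 50  # 5% alanine
t_stat, p_value = indicator_t_test(query_sample, background_sample, "A")
alpha = 0.05
print(t_stat > 0, p_value <= alpha)
```

A positive t statistic corresponds to enrichment in the query sample and a negative one to depletion.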
Bonferroni correction
A correction of the alpha value for cases in which multiple hypotheses are tested: the alpha value is divided by the number of hypotheses. See (Weisstein) for
details.
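As a small worked example of the correction, the sketch below divides alpha by the number of tests. The choice of 20 tests (one per amino acid) is an assumption made for illustration, and `bonferroni_alpha` is a hypothetical helper name.

```python
def bonferroni_alpha(alpha, n_tests):
    """Return the per-test significance level after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Assumption for illustration: one hypothesis per amino acid, 20 in total.
corrected = bonferroni_alpha(0.05, 20)

# Hypothetical p-values for three residues.
p_values = {"A": 0.0001, "K": 0.01, "W": 0.2}
significant = sorted(aa for aa, p in p_values.items() if p <= corrected)
print(corrected, significant)
```

Here only the first p-value survives the correction, although the second would have passed the uncorrected threshold of 0.05.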
Number of bootstrap iterations
In the context of calculating composition differences, bootstrapping
is used for non-parametric estimation of the confidence intervals for
the reported amino acid compositions. More precisely, reported
compositions are means of pseudo-replicate compositions, and
confidence intervals are standard deviations of pseudo-replicate
compositions.
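The bootstrap estimate described above can be sketched as follows. The text does not say whether whole sequences or individual residues are resampled; this sketch assumes whole sequences are drawn with replacement, and `bootstrap_composition` is a hypothetical name.

```python
import random
from statistics import mean, stdev


def bootstrap_composition(sequences, residue, iterations=1000, seed=0):
    """Mean and standard deviation of the percentage of `residue`
    over bootstrap pseudo-replicates of `sequences`."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(iterations):
        # Pseudo-replicate: same number of sequences, drawn with replacement.
        sample = rng.choices(sequences, k=len(sequences))
        total = sum(len(s) for s in sample)
        count = sum(s.count(residue) for s in sample)
        replicates.append(100.0 * count / total)
    return mean(replicates), stdev(replicates)


mean_pct, sd_pct = bootstrap_composition(["AAKL", "AGGA", "KLAV", "AAAA"], "A")
print(f"Ala: {mean_pct:.2f} ± {sd_pct:.2f} %")

# A sample in which every sequence has the same composition shows no spread.
flat_mean, flat_sd = bootstrap_composition(["AC"] * 5, "A", iterations=50)
```

The reported value is the mean of the pseudo-replicates and the "±" term is their standard deviation, matching the format of the pre-computed table.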
In the context of relative entropy calculations, bootstrapping is used
to estimate the statistical significance of the observed relative
entropy. In each iteration, random samples of the two starting samples
are created, relative entropy between the random samples is computed
and compared to the observed relative entropy.
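One plausible reading of this procedure is sketched below. Several details are assumptions and are not confirmed by the text: the random samples are drawn with replacement from the pooled sequences, a pseudocount keeps all frequencies non-zero, and the p-value is the fraction of iterations whose relative entropy reaches the observed value. All function names are hypothetical.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences, pseudocount=1.0):
    """Residue frequencies with a pseudocount (assumption) to avoid log(0)."""
    counts = Counter("".join(sequences))
    total = sum(counts[aa] for aa in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def relative_entropy(p, q):
    """Relative entropy (Kullback-Leibler divergence) of p from q, in bits."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS)


def bootstrap_p_value(query, background, iterations=200, seed=0):
    """Observed relative entropy and its resampling-based p-value."""
    rng = random.Random(seed)
    observed = relative_entropy(composition(query), composition(background))
    pooled = list(query) + list(background)
    hits = 0
    for _ in range(iterations):
        # Random samples of the original sizes, drawn from the pooled data.
        a = rng.choices(pooled, k=len(query))
        b = rng.choices(pooled, k=len(background))
        if relative_entropy(composition(a), composition(b)) >= observed:
            hits += 1
    return observed, hits / iterations


query_seqs = ["PPPPSSPPEE"] * 30
background_seqs = ["ACDEFGHIKLMNPQRSTVWY"] * 30
observed_re, p_value = bootstrap_p_value(query_seqs, background_seqs)
print(f"relative entropy = {observed_re:.3f} bits, p = {p_value}")
```

With this scheme the smallest non-zero p-value that can be resolved is 1 / iterations, which is why more iterations are needed for strongly significant comparisons.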
The observed relative entropy is independent of the number of bootstrap
iterations, and is relatively fast to compute. The bulk of the time spent
on the calculations goes towards determining the statistical
significance, and for large datasets this time may be considerable.
We therefore advise starting the comparison with a small number of
iterations to get a rough estimate of the p-value, and increasing
the number of iterations for comparisons which have a p-value below the
threshold of 1 / number of iterations.
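This advice amounts to a simple refinement loop, sketched below. The callback `estimate_p`, the starting value, the growth factor and the cap are all hypothetical choices made for illustration; the loop only encodes the rule that a p-value below 1 / number of iterations calls for more iterations.

```python
def refine_p_value(estimate_p, start=10, max_iterations=100_000, factor=10):
    """Increase the iteration count until the p-value is resolved.

    `estimate_p(iterations)` is a hypothetical callback returning the
    bootstrap p-value obtained with the given number of iterations.
    """
    iterations = start
    while True:
        p = estimate_p(iterations)
        # Stop once p is no longer below the resolution of 1 / iterations,
        # or when the iteration cap has been reached.
        if p >= 1 / iterations or iterations >= max_iterations:
            return p, iterations
        iterations = min(iterations * factor, max_iterations)


def toy_estimator(iterations):
    """Stand-in for a real bootstrap run: 4 hits per 1,000 iterations."""
    hits = (4 * iterations) // 1000
    return hits / iterations


result = refine_p_value(toy_estimator)
print(result)  # (0.004, 1000)
```

In the toy run, 10 and 100 iterations both report a p-value of 0, so the loop continues until 1,000 iterations resolve it.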
We here provide an example of running times for the relative entropy
calculations between the alpha MoRF dataset and a sample of proteins
from SwissProt (both datasets are provided as part of the program
distribution).
Iterations | 10 | 100 | 1,000 | 10,000 | 100,000 |
Running time | 1s | 3s | 23s | 3m 40s | 36m 38s |
In principle, if we disregard the short initialization period when amino
acids are counted and the counts stored, the running time grows linearly
with the number of iterations.
Amino acid grouping, ordering and color-coding
Alpha helix frequency (Nagano, 1973)
Y | P | G | N | S | R | T | C | I | V | D | W | Q | L | K | M | F | A | H | E |
0.63 | 0.70 | 0.72 | 0.77 | 0.78 | 0.83 | 0.87 | 0.94 | 0.94 | 0.97 | 1.00 | 1.06 | 1.10 | 1.23 | 1.23 | 1.23 | 1.23 | 1.29 | 1.29 | 1.54 |
Aromatics
Aromatic amino acids (F, W, Y) are colored
green, and the remaining amino acids are
colored black.
Beta structure frequency (Nagano, 1973)
E | R | N | P | S | K | H | D | G | A | Y | C | W | Q | T | L | M | F | V | I |
0.33 | 0.67 | 0.72 | 0.75 | 0.77 | 0.81 | 0.87 | 0.9 | 0.9 | 0.96 | 1.07 | 1.13 | 1.13 | 1.18 | 1.23 | 1.26 | 1.29 | 1.37 | 1.41 | 1.54 |
Bulkiness (Zimmerman et al., 1968)
G | S | A | D | N | C | E | H | R | Q | K | T | M | P | Y | F | I | L | V | W |
3.4 | 9.47 | 11.5 | 11.68 | 12.82 | 13.46 | 13.57 | 13.69 | 14.28 | 14.45 | 15.71 | 15.77 | 16.25 | 17.43 | 18.03 | 19.8 | 21.4 | 21.4 | 21.57 | 21.67 |
Charge
Positively charged residues (K, R) are
colored blue, negatively charged residues
(D, E) are colored red,
and neutral residues are colored black.
Coil propensity (Nagano, 1973)
F | M | L | A | E | H | I | Q | V | K | W | C | T | D | R | S | G | Y | N | P |
0.58 | 0.62 | 0.63 | 0.72 | 0.75 | 0.76 | 0.8 | 0.81 | 0.83 | 0.84 | 0.87 | 1.01 | 1.03 | 1.04 | 1.33 | 1.34 | 1.35 | 1.35 | 1.38 | 1.43 |
Disorder propensity (Dunker et al., 2001)
Disorder-promoting residues (A, R, S, Q, E, G, K, P)
are colored red, order-promoting
residues (N, C, I, L, F, W, Y, V) are colored
blue, and disorder-order neutral residues
(D, H, M, T) are colored black.
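The three residue groups listed above can be used directly to summarize a sequence. The following is a minimal sketch; the function name `disorder_class_fractions` is hypothetical and the grouping is copied from the text.

```python
DISORDER_PROMOTING = set("ARSQEGKP")
ORDER_PROMOTING = set("NCILFWYV")
NEUTRAL = set("DHMT")


def disorder_class_fractions(sequence):
    """Fraction of residues in each of the three groups."""
    n = len(sequence)
    return {
        "disorder": sum(aa in DISORDER_PROMOTING for aa in sequence) / n,
        "order": sum(aa in ORDER_PROMOTING for aa in sequence) / n,
        "neutral": sum(aa in NEUTRAL for aa in sequence) / n,
    }


fractions = disorder_class_fractions("AAPPNNDH")
print(fractions)  # {'disorder': 0.5, 'order': 0.25, 'neutral': 0.25}
```

Together the three groups cover all 20 standard amino acids, so the fractions sum to 1 for sequences made of standard residues.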
Flexibility (Vihinen et al., 1994)
W | C | F | I | Y | V | L | H | M | A | T | R | G | Q | S | N | P | D | E | K |
0.904 | 0.906 | 0.915 | 0.927 | 0.929 | 0.931 | 0.935 | 0.950 | 0.952 | 0.984 | 0.997 | 1.008 | 1.031 | 1.037 | 1.046 | 1.048 | 1.049 | 1.068 | 1.094 | 1.102 |
Hydrophobicity (Eisenberg et al., 1984)
R | K | D | Q | N | E | H | S | T | P | Y | C | G | A | M | W | L | V | F | I |
-2.53 | -1.5 | -0.9 | -0.85 | -0.78 | -0.74 | -0.4 | -0.18 | -0.05 | 0.12 | 0.26 | 0.29 | 0.48 | 0.62 | 0.64 | 0.81 | 1.06 | 1.08 | 1.19 | 1.38 |
Hydrophobicity (Kyte and Doolittle, 1982)
R | K | D | E | N | Q | H | P | Y | W | S | T | G | A | M | C | F | L | V | I |
-4.5 | -3.9 | -3.5 | -3.5 | -3.5 | -3.5 | -3.2 | -1.6 | -1.3 | -0.9 | -0.8 | -0.7 | -0.4 | 1.8 | 1.9 | 2.5 | 2.8 | 3.8 | 4.2 | 4.5 |
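To show how a property scale determines the ordering of amino acids, the sketch below sorts residues by their Kyte and Doolittle values and averages the scale over a sequence. The values are copied from the table above; the helper names `order_by_scale` and `mean_scale_value` are hypothetical and this is not taken from the program's source.

```python
# Kyte and Doolittle hydropathy values, in the order given in the table.
KYTE_DOOLITTLE = {
    "R": -4.5, "K": -3.9, "D": -3.5, "E": -3.5, "N": -3.5, "Q": -3.5,
    "H": -3.2, "P": -1.6, "Y": -1.3, "W": -0.9, "S": -0.8, "T": -0.7,
    "G": -0.4, "A": 1.8, "M": 1.9, "C": 2.5, "F": 2.8, "L": 3.8,
    "V": 4.2, "I": 4.5,
}


def order_by_scale(scale):
    """Residues sorted by increasing scale value (ties keep table order)."""
    return sorted(scale, key=scale.get)


def mean_scale_value(sequence, scale):
    """Average scale value over the residues of a sequence."""
    return sum(scale[aa] for aa in sequence) / len(sequence)


ordered = order_by_scale(KYTE_DOOLITTLE)
print("".join(ordered))  # RKDENQHPYWSTGAMCFLVI
print(round(mean_scale_value("AI", KYTE_DOOLITTLE), 2))  # 3.15
```

The same two helpers work for any of the numeric scales in this section once its values are entered as a dictionary.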
Hydrophobicity (Fauchere and Pliska, 1983)
R | K | D | E | N | Q | S | G | H | T | A | P | Y | V | C | L | F | I | M | W |
-1.01 | -0.99 | -0.77 | -0.64 | -0.6 | -0.22 | -0.04 | 0.0 | 0.13 | 0.26 | 0.31 | 0.72 | 0.96 | 1.22 | 1.54 | 1.7 | 1.79 | 1.8 | 1.23 | 2.25 |
Interface propensity (Jones and Thornton, 1997)
D | K | S | P | T | A | E | Q | G | N | R | V | L | H | C | I | M | Y | F | W |
-0.38 | -0.36 | -0.33 | -0.25 | -0.18 | -0.17 | -0.13 | -0.11 | -0.07 | 0.12 | 0.27 | 0.27 | 0.4 | 0.41 | 0.43 | 0.44 | 0.66 | 0.66 | 0.82 | 0.83 |
Linker propensity (George and Heringa, 2003)
C | G | W | D | I | N | K | S | V | A | Y | H | T | M | Q | E | L | F | R | P |
0.778 | 0.835 | 0.895 | 0.916 | 0.922 | 0.944 | 0.944 | 0.947 | 0.955 | 0.964 | 1.0 | 1.014 | 1.017 | 1.032 | 1.047 | 1.051 | 1.085 | 1.119 | 1.143 | 1.299 |
Polarity (Zimmerman et al., 1968)
A | G | I | L | V | F | M | C | P | Y | T | S | W | N | Q | K | D | E | H | R |
0.0 | 0.0 | 0.13 | 0.13 | 0.13 | 0.35 | 1.43 | 1.48 | 1.58 | 1.61 | 1.66 | 1.67 | 2.1 | 3.38 | 3.53 | 49.5 | 49.7 | 49.9 | 51.6 | 52.0 |
Size (Dawson, 1972)
G | A | D | C | S | N | E | T | V | I | L | P | Q | H | M | F | K | W | Y | R |
0.5 | 2.5 | 2.5 | 3.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.5 | 5.5 | 5.5 | 6.0 | 6.0 | 6.0 | 6.5 | 7.0 | 7.0 | 7.0 | 7.5 |
Surface exposure (Janin, 1979)
K | R | E | Q | D | N | Y | P | T | H | S | A | G | W | M | L | F | V | I | C |
-1.8 | -1.4 | -0.7 | -0.7 | -0.6 | -0.5 | -0.4 | -0.3 | -0.2 | -0.1 | -0.1 | 0.3 | 0.3 | 0.3 | 0.4 | 0.5 | 0.5 | 0.6 | 0.7 | 0.9 |
Solvation potential (Jones and Thornton, 1996)
W | F | I | L | M | Y | V | C | H | A | G | S | T | P | N | R | Q | D | E | K |
-0.68 | -0.55 | -0.49 | -0.49 | -0.40 | -0.32 | -0.31 | -0.30 | -0.06 | 0.05 | 0.08 | 0.15 | 0.16 | 0.19 | 0.22 | 0.41 | 0.45 | 0.64 | 0.77 | 1.61 |
Additional color schemes
WebLogo default colors
Amino Acid | Color Name | RGB | Hexadecimal |
G, S, T, Y, C | green | [0,204,0] | 00CC00 |
N, Q | purple | [204,0,204] | CC00CC |
K, R, H | blue | [0,0,204] | 0000CC |
D, E | red | [204,0,0] | CC0000 |
P, A, W, F, L, I, M, V | black | [0,0,0] | 000000 |
Shapley color table
Amino Acids | Color Name | RGB | Hexadecimal |
D, T | dark red | [160,0,66] | A00042 |
E | red-brown | [102,0,0] | 660000 |
C | bright yellow | [255,255,112] | FFFF70 |
M, Y | dark yellow | [184,160,66] | B8A042 |
K | blue | [71,71,184] | 4747B8 |
R | dark blue | [0,0,124] | 00007C |
S, Q | orange | [255,76,76] | FF4C4C |
F, P, W | dark grey | [83,76,66] | 534C42 |
N | flesh | [255,124,112] | FF7C70 |
G, V | light grey | [235,235,235] | EBEBEB |
I | dark green | [0,76,0] | 004C00 |
L | grey-green | [69,94,69] | 455E45 |
A | light green | [140,255,140] | 8CFF8C |
H | pale blue | [112,112,255] | 7070FF |
In the original Shapley scheme, G and V are color-coded as white. Since this
would render them invisible against a white background, their color has been
changed to light grey.
Amino Colors
Amino Acid | Color Name | RGB | Hexadecimal |
D, E | bright red | [230,10,10] | E60A0A |
C, M | yellow | [230,230,0] | E6E600 |
K, R | blue | [20,90,255] | 145AFF |
S, T | orange | [250,150,0] | FA9600 |
F, Y | mid blue | [50,50,170] | 3232AA |
N, Q | cyan | [0,220,220] | 00DCDC |
G | light grey | [235,235,235] | EBEBEB |
L, V, I | green | [15,130,15] | 0F820F |
A | dark grey | [200,200,200] | C8C8C8 |
W | pink | [180,90,180] | B45AB4 |
H | pale blue | [130,130,210] | 8282D2 |
P | flesh | [220,150,130] | DC9682 |
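The RGB and hexadecimal columns in these tables encode the same color, and converting between them is a quick way to check an entry. The sketch below is only an illustration; `hex_to_rgb` and `rgb_to_hex` are hypothetical helper names, and the sample entries are taken from the Amino Colors table.

```python
def hex_to_rgb(hex_code):
    """Convert a six-digit hexadecimal color such as 'E60A0A' to (R, G, B)."""
    return tuple(int(hex_code[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert an (R, G, B) triple to a six-digit uppercase hexadecimal code."""
    return "".join(f"{channel:02X}" for channel in rgb)


# A few entries from the Amino Colors table: residues -> (RGB, hexadecimal).
AMINO_COLORS = {
    "D, E": ((230, 10, 10), "E60A0A"),
    "K, R": ((20, 90, 255), "145AFF"),
    "W": ((180, 90, 180), "B45AB4"),
}

consistent = all(
    hex_to_rgb(hex_code) == rgb and rgb_to_hex(rgb) == hex_code
    for rgb, hex_code in AMINO_COLORS.values()
)
print(consistent)  # True
```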
Show Y-axis label
Shows label on the Y-axis of the profile graph.
Output format
Composition Profiler supports:
- Encapsulated PostScript (EPS)
- Portable Document Format (PDF)
- Graphics Interchange Format (GIF)
- Portable Network Graphics (PNG)
Additionally, raw values for bar heights and error bars are displayed if
the plain text (TXT) option is chosen.
Output size
Height and width of the output image, in pixels, centimeters or inches.
Antialiasing
Turns antialiasing on or off. Applicable to bitmaps only (GIF and PNG).
Resolution
Sets the image resolution. Applicable to bitmaps only (GIF and PNG).
References
- Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, and Yeh LS. (2005) "The Universal Protein Resource (UniProt)." Nucleic Acids Research 33:D154-159.

- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, and Bourne PE. (2000) "The Protein Data Bank." Nucleic Acids Research, 28:235-242.
- Dawson DM. (1972) In The Biochemical Genetics of Man (Brock DJH and Mayo O, eds.), Academic Press, New York, 1-38.
- Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh JS, Oldfield CJ, Campen AM, Ratliff CM, Hipps KW et al. (2001) "Intrinsically disordered protein." J. Mol. Graph. Model., 19, 26-59.

- Eisenberg D, Schwarz E, Komaromy M, and Wall R. (1984) "Analysis of membrane and surface protein sequences with the hydrophobic moment plot." J. Mol. Biol. 179:125-142.

- Fauchere J-L, and Pliska VE. (1983) "Hydrophobic parameters pi of amino acid side chains from partitioning of N-acetyl-amino-acid amides." Eur. J. Med. Chem. 18:369-375.

- George RA, and Heringa J. (2003) "An analysis of protein domain linkers: their classification and role in protein folding." Protein Eng. 15:871-879.

- Hogg RV and Craig A. (1994) Introduction to Mathematical Statistics, 5th edition, Prentice Hall.
- Janin J. (1979) "Surface and inside volumes in globular proteins." Nature, 277, 491-492.

- Jones S, and Thornton J. (1997) "Analysis of protein-protein interaction sites using surface patches." J. Mol. Biol. 272:121-132.

- Jones S, and Thornton J. (1996) "Principles of protein-protein interactions." Proc. Natl. Acad. Sci. USA, 93:13-20.
- Kyte J, and Doolittle RF. (1982) "A simple method for displaying the hydropathic character of a protein." J. Mol. Biol. 157:105-132.

- Nagano K. (1973) "Local analysis of the mechanism of protein folding. I. Prediction of helices, loops, and beta-structures from primary structure." J. Mol. Biol. 75:401-420.
- Sickmeier M, Hamilton JA, LeGall T, Vacic V, Cortese MS, Tantos A, Szabo B, Tompa P, Chen J, Uversky VN, Obradovic Z, and Dunker AK. (2007) "DisProt: the Database of Disordered Proteins." Nucleic Acids Research 35:D786-93.

- Vihinen M, Torkkila E, and Riikonen P. (1994) "Accuracy of protein flexibility predictions." Proteins, 19, 141-149.

- Weisstein EW. "Bonferroni Correction." From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/BonferroniCorrection.html
- Zimmerman JM, Eliezer N, and Simha R. (1968) "The characterization of amino acid sequences in proteins by statistical methods." J. Theor. Biol. 21:170-201.
