pepstats
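pepstats reads one or more protein sequences and writes an output file with various statistics on the protein properties. These include the molecular weight, number of residues, average residue weight, charge, isoelectric point, A280 molar extinction coefficients, A280 extinction coefficients at 1 mg/ml, the improbability of expression in inclusion bodies, the count and molar percentage of each amino acid together with its DayhoffStat, and the count and molar percentage of each physico-chemical class of residue.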
% pepstats
Calculates statistics of protein properties
Input protein sequence(s): tsw:laci_ecoli
Pepstats program output file [laci_ecoli.pepstats]:
Calculates statistics of protein properties
Version: EMBOSS:6.3.0
Standard (Mandatory) qualifiers:
[-sequence] seqall Protein sequence(s) filename and optional
format, or reference (input USA)
[-outfile] outfile [*.pepstats] Pepstats program output file
Additional (Optional) qualifiers: (none)
Advanced (Unprompted) qualifiers:
-aadata datafile [Eamino.dat] Amino acid properties
-mwdata datafile [Emolwt.dat] Molecular weight data for amino
acids
-[no]termini boolean [Y] Include charge at N and C terminus
-mono boolean [N] Use monoisotopic weights
Associated qualifiers:
"-sequence" associated qualifiers
-sbegin1 integer Start of each sequence to be used
-send1 integer End of each sequence to be used
-sreverse1 boolean Reverse (if DNA)
-sask1 boolean Ask for begin/end/reverse
-snucleotide1 boolean Sequence is nucleotide
-sprotein1 boolean Sequence is protein
-slower1 boolean Make lower case
-supper1 boolean Make upper case
-sformat1 string Input sequence format
-sdbname1 string Database name
-sid1 string Entryname
-ufo1 string UFO features
-fformat1 string Features format
-fopenfile1 string Features file name
"-outfile" associated qualifiers
-odirectory2 string Output directory
General qualifiers:
-auto boolean Turn off prompts
-stdout boolean Write first file to standard output
-filter boolean Read first file from standard input, write
first file to standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options and exit. More
information on associated and general
qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report dying program messages
-version boolean Report version number and exit
| Qualifier | Type | Description | Allowed values | Default |
|---|---|---|---|---|
| Standard (Mandatory) qualifiers | | | | |
| [-sequence] (Parameter 1) | seqall | Protein sequence(s) filename and optional format, or reference (input USA) | Readable sequence(s) | Required |
| [-outfile] (Parameter 2) | outfile | Pepstats program output file | Output file | <*>.pepstats |
| Additional (Optional) qualifiers | | | | |
| (none) | | | | |
| Advanced (Unprompted) qualifiers | | | | |
| -aadata | datafile | Amino acid properties | Data file | Eamino.dat |
| -mwdata | datafile | Molecular weight data for amino acids | Data file | Emolwt.dat |
| -[no]termini | boolean | Include charge at N and C terminus | Boolean value Yes/No | Yes |
| -mono | boolean | Use monoisotopic weights | Boolean value Yes/No | No |
| Associated qualifiers | | | | |
| "-sequence" associated seqall qualifiers | | | | |
| -sbegin1 -sbegin_sequence | integer | Start of each sequence to be used | Any integer value | 0 |
| -send1 -send_sequence | integer | End of each sequence to be used | Any integer value | 0 |
| -sreverse1 -sreverse_sequence | boolean | Reverse (if DNA) | Boolean value Yes/No | N |
| -sask1 -sask_sequence | boolean | Ask for begin/end/reverse | Boolean value Yes/No | N |
| -snucleotide1 -snucleotide_sequence | boolean | Sequence is nucleotide | Boolean value Yes/No | N |
| -sprotein1 -sprotein_sequence | boolean | Sequence is protein | Boolean value Yes/No | N |
| -slower1 -slower_sequence | boolean | Make lower case | Boolean value Yes/No | N |
| -supper1 -supper_sequence | boolean | Make upper case | Boolean value Yes/No | N |
| -sformat1 -sformat_sequence | string | Input sequence format | Any string | |
| -sdbname1 -sdbname_sequence | string | Database name | Any string | |
| -sid1 -sid_sequence | string | Entryname | Any string | |
| -ufo1 -ufo_sequence | string | UFO features | Any string | |
| -fformat1 -fformat_sequence | string | Features format | Any string | |
| -fopenfile1 -fopenfile_sequence | string | Features file name | Any string | |
| "-outfile" associated outfile qualifiers | | | | |
| -odirectory2 -odirectory_outfile | string | Output directory | Any string | |
| General qualifiers | | | | |
| -auto | boolean | Turn off prompts | Boolean value Yes/No | N |
| -stdout | boolean | Write first file to standard output | Boolean value Yes/No | N |
| -filter | boolean | Read first file from standard input, write first file to standard output | Boolean value Yes/No | N |
| -options | boolean | Prompt for standard and additional values | Boolean value Yes/No | N |
| -debug | boolean | Write debug output to program.dbg | Boolean value Yes/No | N |
| -verbose | boolean | Report some/full command line options | Boolean value Yes/No | Y |
| -help | boolean | Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose | Boolean value Yes/No | N |
| -warning | boolean | Report warnings | Boolean value Yes/No | Y |
| -error | boolean | Report errors | Boolean value Yes/No | Y |
| -fatal | boolean | Report fatal errors | Boolean value Yes/No | Y |
| -die | boolean | Report dying program messages | Boolean value Yes/No | Y |
| -version | boolean | Report version number and exit | Boolean value Yes/No | N |
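The qualifiers listed above can also be supplied in one non-interactive command rather than at the prompts. The following is a minimal sketch of assembling such a command from Python. The helper name `build_pepstats_command` is hypothetical, and the sketch assumes that `-notermini` is the negated form of `-[no]termini` and that `-auto` suppresses prompting, as the tables describe.

```python
from typing import List, Optional


def build_pepstats_command(
    sequence: str,
    outfile: Optional[str] = None,
    termini: bool = True,
    mono: bool = False,
) -> List[str]:
    """Assemble an argument list for a non-interactive pepstats run.

    Hypothetical helper: only qualifiers documented in the table are used.
    """
    cmd = ["pepstats", "-sequence", sequence]
    if outfile is not None:
        cmd += ["-outfile", outfile]
    if not termini:
        # Negated form of the -[no]termini qualifier (default is Yes).
        cmd.append("-notermini")
    if mono:
        # Use monoisotopic weights (default is No).
        cmd.append("-mono")
    # Turn off prompts so the command can run unattended.
    cmd.append("-auto")
    return cmd


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run() to execute it
    # on a machine where EMBOSS is installed.
    print(" ".join(build_pepstats_command("tsw:laci_ecoli", "laci_ecoli.pepstats")))
```

Building the command as a list rather than a single string avoids shell quoting problems when a sequence reference or file name contains unusual characters.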
ID LACI_ECOLI Reviewed; 360 AA.
AC P03023; O09196; P71309; Q2MC79; Q47338;
DT 21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT 19-JUL-2003, sequence version 3.
DT 15-JUN-2010, entry version 117.
DE RecName: Full=Lactose operon repressor;
GN Name=lacI; OrderedLocusNames=b0345, JW0336;
OS Escherichia coli (strain K12).
OC Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
OC Enterobacteriaceae; Escherichia.
OX NCBI_TaxID=83333;
RN [1]
RP NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RX MEDLINE=78246991; PubMed=355891; DOI=10.1038/274765a0;
RA Farabaugh P.J.;
RT "Sequence of the lacI gene.";
RL Nature 274:765-769(1978).
RN [2]
RP NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RA Chen J., Matthews K.K.S.M.;
RL Submitted (MAY-1991) to the EMBL/GenBank/DDBJ databases.
RN [3]
RP NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RA Marsh S.;
RL Submitted (JAN-1997) to the EMBL/GenBank/DDBJ databases.
RN [4]
RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC STRAIN=K12 / MG1655 / ATCC 47076;
RA Chung E., Allen E., Araujo R., Aparicio A.M., Davis K., Duncan M.,
RA Federspiel N., Hyman R., Kalman S., Komp C., Kurdi O., Lew H., Lin D.,
RA Namath A., Oefner P., Roberts D., Schramm S., Davis R.W.;
RT "Sequence of minutes 4-25 of Escherichia coli.";
RL Submitted (JAN-1997) to the EMBL/GenBank/DDBJ databases.
RN [5]
RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC STRAIN=K12 / MG1655 / ATCC 47076;
RX MEDLINE=97426617; PubMed=9278503; DOI=10.1126/science.277.5331.1453;
RA Blattner F.R., Plunkett G. III, Bloch C.A., Perna N.T., Burland V.,
RA Riley M., Collado-Vides J., Glasner J.D., Rode C.K., Mayhew G.F.,
RA Gregor J., Davis N.W., Kirkpatrick H.A., Goeden M.A., Rose D.J.,
RA Mau B., Shao Y.;
RT "The complete genome sequence of Escherichia coli K-12.";
RL Science 277:1453-1474(1997).
RN [6]
RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC STRAIN=K12 / W3110 / ATCC 27325 / DSM 5911;
RX PubMed=16738553; DOI=10.1038/msb4100049;
RA Hayashi K., Morooka N., Yamamoto Y., Fujita K., Isono K., Choi S.,
RA Ohtsubo E., Baba T., Wanner B.L., Mori H., Horiuchi T.;
RT "Highly accurate genome sequences of Escherichia coli K-12 strains
[Part of this file has been deleted for brevity]
FT CHAIN 1 360 Lactose operon repressor.
FT /FTId=PRO_0000107963.
FT DOMAIN 1 58 HTH lacI-type.
FT DNA_BIND 6 25 H-T-H motif.
FT VARIANT 282 282 Y -> D (in T41 mutant).
FT MUTAGEN 17 17 Y->H: Broadening of specificity.
FT MUTAGEN 22 22 R->N: Recognizes an operator variant.
FT CONFLICT 286 286 L -> S (in Ref. 1, 4 and 7).
FT HELIX 6 11
FT TURN 12 14
FT HELIX 17 24
FT HELIX 33 45
FT HELIX 51 56
FT STRAND 63 69
FT HELIX 74 89
FT STRAND 93 98
FT STRAND 101 103
FT HELIX 104 115
FT TURN 116 118
FT STRAND 122 126
FT HELIX 130 139
FT TURN 140 142
FT STRAND 145 150
FT STRAND 154 156
FT STRAND 158 161
FT HELIX 163 177
FT STRAND 181 186
FT HELIX 192 207
FT STRAND 213 217
FT HELIX 222 234
FT STRAND 240 246
FT HELIX 247 259
FT TURN 265 267
FT STRAND 268 271
FT HELIX 277 281
FT STRAND 282 284
FT STRAND 287 290
FT HELIX 293 308
FT STRAND 314 319
FT STRAND 322 324
FT STRAND 334 338
FT HELIX 343 353
FT HELIX 354 356
SQ SEQUENCE 360 AA; 38590 MW; 347A8DEE92D736CB CRC64;
MKPVTLYDVA EYAGVSYQTV SRVVNQASHV SAKTREKVEA AMAELNYIPN RVAQQLAGKQ
SLLIGVATSS LALHAPSQIV AAIKSRADQL GASVVVSMVE RSGVEACKAA VHNLLAQRVS
GLIINYPLDD QDAIAVEAAC TNVPALFLDV SDQTPINSII FSHEDGTRLG VEHLVALGHQ
QIALLAGPLS SVSARLRLAG WHKYLTRNQI QPIAEREGDW SAMSGFQQTM QMLNEGIVPT
AMLVANDQMA LGAMRAITES GLRVGADISV VGYDDTEDSS CYIPPLTTIK QDFRLLGQTS
VDRLLQLSQG QAVKGNQLLP VSLVKRKTTL APNTQTASPR ALADSLMQLA RQVSRLESGQ
//
PEPSTATS of LACI_ECOLI from 1 to 360

Molecular weight = 38590.16   Residues = 360
Average Residue Weight = 107.195   Charge = 1.5
Isoelectric Point = 6.8820
A280 Molar Extinction Coefficients = 22920 (reduced) 23045 (cystine bridges)
A280 Extinction Coefficients 1mg/ml = 0.594 (reduced) 0.597 (cystine bridges)
Improbability of expression in inclusion bodies = 0.660

Residue    Number   Mole%    DayhoffStat
A = Ala    44       12.222   1.421
B = Asx    0        0.000    0.000
C = Cys    3        0.833    0.287
D = Asp    17       4.722    0.859
E = Glu    15       4.167    0.694
F = Phe    4        1.111    0.309
G = Gly    22       6.111    0.728
H = His    7        1.944    0.972
I = Ile    18       5.000    1.111
J = ---    0        0.000    0.000
K = Lys    11       3.056    0.463
L = Leu    41       11.389   1.539
M = Met    10       2.778    1.634
N = Asn    12       3.333    0.775
O = ---    0        0.000    0.000
P = Pro    14       3.889    0.748
Q = Gln    28       7.778    1.994
R = Arg    19       5.278    1.077
S = Ser    32       8.889    1.270
T = Thr    19       5.278    0.865
U = ---    0        0.000    0.000
V = Val    34       9.444    1.431
W = Trp    2        0.556    0.427
X = Xaa    0        0.000    0.000
Y = Tyr    8        2.222    0.654
Z = Glx    0        0.000    0.000

Property    Residues                  Number   Mole%
Tiny        (A+C+G+S+T)               120      33.333
Small       (A+B+C+D+G+N+P+S+T+V)     197      54.722
Aliphatic   (A+I+L+V)                 137      38.056
Aromatic    (F+H+W+Y)                 21       5.833
Non-polar   (A+C+F+G+I+L+M+P+V+W+Y)   200      55.556
Polar       (D+E+H+K+N+Q+R+S+T+Z)     160      44.444
Charged     (B+D+E+H+K+R+Z)           69       19.167
Basic       (H+K+R)                   37       10.278
Acidic      (B+D+E+Z)                 32       8.889
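The property rows in the example output are sums over the per-residue counts, expressed as a percentage of all residues. The sketch below reproduces a few of those rows from the counts reported for LACI_ECOLI. The names `COUNTS`, `CLASSES` and `class_summary` are illustrative and are not part of pepstats.

```python
# Residue counts copied from the example output above (zero rows omitted).
COUNTS = {
    "A": 44, "C": 3, "D": 17, "E": 15, "F": 4, "G": 22, "H": 7,
    "I": 18, "K": 11, "L": 41, "M": 10, "N": 12, "P": 14, "Q": 28,
    "R": 19, "S": 32, "T": 19, "V": 34, "W": 2, "Y": 8,
}

# Class membership as printed in the "Residues" column of the output.
CLASSES = {
    "Tiny": "ACGST",
    "Small": "ABCDGNPSTV",
    "Aliphatic": "AILV",
    "Aromatic": "FHWY",
    "Non-polar": "ACFGILMPVWY",
    "Polar": "DEHKNQRSTZ",
    "Charged": "BDEHKRZ",
    "Basic": "HKR",
    "Acidic": "BDEZ",
}


def class_summary(counts):
    """Return {class name: (number, mole percent)} for each property class."""
    total = sum(counts.values())
    summary = {}
    for name, letters in CLASSES.items():
        # Codes absent from the counts (B, Z here) contribute zero.
        number = sum(counts.get(code, 0) for code in letters)
        summary[name] = (number, round(100.0 * number / total, 3))
    return summary


if __name__ == "__main__":
    for name, (number, mole) in class_summary(COUNTS).items():
        print(f"{name:<10} {number:>4} {mole:>8.3f}")
```

Note that the classes overlap (for example, His is counted as Aromatic, Polar, Charged and Basic), so the class totals do not sum to the sequence length.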
Absorption coefficients use values read from the EMBOSS data file 'Eamino.dat'. Values in this file assume cysteines are reduced. If cysteines are in disulphide bridges the value should be adjusted as documented at the top of the file, and a local copy used to override the default values.
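As a rough illustration of how the A280 figures in the example output relate to one another, the sketch below sums per-residue absorption values for Trp and Tyr and optionally adds a term for each cystine bridge. The coefficient values (5500 for Trp, 1490 for Tyr, 125 per cystine) are assumptions chosen because they reproduce the example output; they are not read from 'Eamino.dat' here, and the function name `a280_coefficients` is hypothetical.

```python
def a280_coefficients(n_trp, n_tyr, n_cys, mol_weight,
                      e_trp=5500, e_tyr=1490, e_cystine=125):
    """Estimate A280 extinction coefficients from residue counts.

    The default coefficients are assumed values, not taken from Eamino.dat.
    """
    # Reduced form: only Trp and Tyr contribute.
    reduced = n_trp * e_trp + n_tyr * e_tyr
    # Bridged form: every pair of cysteines is assumed to form one cystine.
    bridged = reduced + (n_cys // 2) * e_cystine
    return {
        "molar_reduced": reduced,
        "molar_cystine": bridged,
        # Dividing by the molecular weight gives the value at 1 mg/ml.
        "mg_ml_reduced": round(reduced / mol_weight, 3),
        "mg_ml_cystine": round(bridged / mol_weight, 3),
    }


if __name__ == "__main__":
    # LACI_ECOLI: 2 Trp, 8 Tyr, 3 Cys, molecular weight 38590.16.
    print(a280_coefficients(2, 8, 3, 38590.16))
```

With three cysteines only one bridge can form, which is why the two molar values in the example differ by a single cystine term.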
Molecular weights are read from a local data file Emolwt.dat.
EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.
To see the available EMBOSS data files, run:
% embossdata -showall
To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:
% embossdata -fetch -file Exxx.dat
Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".
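The directories are searched in the following order: the current directory, the ".embossdata" subdirectory of the current directory, the user's home directory, and the ".embossdata" subdirectory of the home directory.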
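The lookup of user-supplied data files can be pictured as a first-match search over a short list of directories. The sketch below assumes the order current directory, its ".embossdata" subdirectory, the home directory, then its ".embossdata" subdirectory; it does not model the fallback to the standard EMBOSS data directory, and `find_data_file` is a hypothetical name rather than an EMBOSS function.

```python
from pathlib import Path
from typing import Optional


def find_data_file(name: str, cwd: Path, home: Path) -> Optional[Path]:
    """Return the first user-level copy of a data file, or None.

    Assumed search order: cwd, cwd/.embossdata, home, home/.embossdata.
    """
    candidates = [
        cwd / name,
        cwd / ".embossdata" / name,
        home / name,
        home / ".embossdata" / name,
    ]
    for path in candidates:
        if path.is_file():
            return path
    # No local override found; EMBOSS would then use its distributed copy.
    return None


if __name__ == "__main__":
    print(find_data_file("Eamino.dat", Path.cwd(), Path.home()))
```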
DayhoffStat is the amino acid's molar percentage divided by the Dayhoff statistic. The Dayhoff statistic is read from the EMBOSS data file Edayhoff.freq and is the amino acid's relative occurrence per 1000 aa normalised to 100.
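A small worked example may make the DayhoffStat definition concrete. The sketch below divides a residue's molar percentage by its Dayhoff statistic. The Dayhoff values used (8.6 for Ala, 7.4 for Leu, 3.9 for Gln) are assumptions that are consistent with the example output; they are not read from Edayhoff.freq here, and `dayhoff_stat` is an illustrative name.

```python
def dayhoff_stat(count, total, dayhoff_freq):
    """Molar percentage of a residue divided by its Dayhoff statistic."""
    if dayhoff_freq == 0:
        # Codes without a Dayhoff value (for example B, J, O) are reported as 0.
        return 0.0
    mole_percent = 100.0 * count / total
    return round(mole_percent / dayhoff_freq, 3)


if __name__ == "__main__":
    # Counts from the LACI_ECOLI example (360 residues in total).
    # The Dayhoff values below are assumed, not read from Edayhoff.freq.
    for code, count, freq in [("A", 44, 8.6), ("L", 41, 7.4), ("Q", 28, 3.9)]:
        print(code, dayhoff_stat(count, 360, freq))
```

A value above 1 indicates that the residue is more common in this sequence than the Dayhoff reference frequency, and a value below 1 that it is rarer.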
The probability of expression in inclusion bodies is sometimes referred to as a type of solubility measure. However, a recombinant protein expressed in Escherichia coli can be either soluble in the cytosol or insoluble in inclusion bodies. If the Harrison model predicts that a given protein is likely to be expressed in inclusion bodies, this does not mean that it cannot be obtained in soluble form in the cytosol. One example: Thermotoga maritima cell division protein FtsA with a C-terminal His-tag has a 58% Harrison probability of being expressed in inclusion bodies, yet there was plenty of soluble protein in the E. coli cytosol (F. van den Ent and J. Lowe, EMBO J. 19, 5300-5307, 2000). Whether the protein is expressed in inclusion bodies depends not only on the sequence but also on many other factors, such as E. coli strain, incubation temperature, type of expression vector, strength of promoter and medium.
| Program name | Description |
|---|---|
| backtranambig | Back-translate a protein sequence to ambiguous nucleotide sequence |
| backtranseq | Back-translate a protein sequence to a nucleotide sequence |
| charge | Draw a protein charge plot |
| checktrans | Reports STOP codons and ORF statistics of a protein |
| compseq | Calculate the composition of unique words in sequences |
| emowse | Search protein sequences by digest fragment molecular weight |
| freak | Generate residue/base frequency table or plot |
| iep | Calculate the isoelectric point of proteins |
| mwcontam | Find weights common to multiple molecular weights files |
| mwfilter | Filter noisy data from molecular weights file |
| octanol | Draw a White-Wimley protein hydropathy plot |
| pepinfo | Plot amino acid properties of a protein sequence in parallel |
| pepwindow | Draw a hydropathy plot for a protein sequence |
| pepwindowall | Draw Kyte-Doolittle hydropathy plot for a protein alignment |
| wordcount | Count and extract unique words in molecular sequence(s) |
Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.