Bioinformatic Linguistic Models
1 IPN – UPIBI, Mexico City
zaratustra@universo.com
Abstract. Informatics is applied in this work to identify linguistic patterns in protein sequences. To this end, the information found in protein sequence banks was taken as the basis and searched for behavioral patterns, which made it possible to state linguistic equations for the proteins analyzed. A series of tools (programs written in C) was created for this purpose, enabling the analysis of both protein and genetic information.
1 Introduction
A vast amount of genetic and protein sequence information is available on the Internet, and it is constantly expanding. Each newly decoded gene or protein creates numerous needs, ranging from classification and storage to comprehension of its meaning and biological relevance.
It is necessary to design and develop tools that not only facilitate the storage of information and the extraction of data, but also help uncover patterns in biological information. In the long run, this will make it possible to model gene and/or protein sequences.
This would open multiple possibilities. Some are very pragmatic, such as designing better matrices for protein purification; others include designing genes and proteins to suit our interests, or programming the genes of microorganisms (m.o.).
As a step toward the latter, this project aims to find behavior patterns in gene and protein sequences by applying informatics.
2 Methodology
Initially, information was compiled from protein and gene sequence banks. This information was then classified and integrated in order to search for behavior patterns at the genetic level. To this end, mathematical-linguistic tools were used.
In the present work, cellulases and amylases were used as study proteins because of their economic importance. Cytochrome C (CYC) was also included: the model obtained from its analysis serves as a reference to validate the other models, since there are enough studies based on CYC against which to compare our results.
It is worth noting that the procedures used here may be applied to the analysis of proteins and, where appropriate, to the study of genes, since the tools created to analyze our sequences treat both proteins and genes as sequences of characters.
3 Results and Discussion
As the first step, information was collected from protein and gene sequence banks, which are available on the Internet. The first part of the project was to locate a suitable site. After an initial search, it was decided to use the European Bioinformatics Institute site, found at www.ebi.ac.uk [11].
Cytochrome C was the first protein worked on. It has several advantages:
- It is a relatively small protein (between 100 and 120 amino acids).
- There are several studies based on cytochrome C against which our results can be compared.
The protein bank “Swiss-Prot” from the European Bioinformatics Institute was chosen, and the cytochrome C of Euglena gracilis (CYC_EUGGR) was selected. Its sequence of 102 amino acids (a.a.) is as follows:
GDAERGKKLF ESRAAQCHSA QKGVNSTGPS LWGVYGRTSG
SVPGYAYSNA NKNAAIVWEE ETLHKFLENP KKYVPGTKMA FAGIKAKKDR QDIIAYMKTL
KD
A Blast-p analysis was performed on this sequence using the tools found on the web page of the Institute [11]. The sequences with significant alignments to the cytochrome C of Euglena gracilis were the following:
1 CYC_EUGGR | 7 CYC_BOVIN | 13 CYC_HUMAN | 19 CYC_NEUCR
2 CYC_EUGVI | 8 CYC_CYPCA | 14 CYC_CANFA | 20 CYC_RANCA
3 CYC_MOUSE | 9 CYC_MACGI | 15 CYC_MIRLE | 21 CYC_MINSC
4 CYC_RAT | 10 CYC_HIPAM | 16 CYC_KATPE | 22 CYC_APTPA
5 CYC_EQUAS | 11 CYC_THELA | 17 CYC_MACMU | 23 CYC_ENTTR
6 CYC_HORSE | 12 CYC_CRIFA | 18 CYC_ESCGI | 24 CYC_MOUSE
From the list above we can assert that all the sequences found to be related to the one submitted were cytochromes. It should be noted that the program was given no indication that the sequence under analysis was a cytochrome C.
For the amylases, several Blast-p analyses were carried out with different amylase sequences until one was found that allowed a suitable analysis, from which the linguistic equation for the amylases could later be established.
First, the Blast-p analysis was carried out using the Bacillus megaterium amylase. Homology was found mostly with several types of glucosidases, such as xylanases and mannanases, and to a lesser extent with amylases. When the Bacillus circulans sequence was used, higher homology was found with cyclomaltodextrin glucanotransferases.
Based on these results, it was decided to reclassify the information and carry out a Blast-p analysis again, this time using the Bacillus subtilis amylase sequence reported in the European Bioinformatics Institute under the ID AMY_BACSU STANDARD, a sequence of 660 a.a.:
MFAKRFKTSL LPLFAGFLLL FHLVLAGPAA ASAETANKSN ELTAPSIKSG TILHAWNWSF
NTLKHNMKDI HDAGYTAIQT SPINQVKEGN QGDKSMSNWY WLYQPTSYQI GNRYLGTEQE
FKEMCAAAEE YGIKVIVDAV INHTTSDYAA ISNEVKSIPN WTHGNTQIKN WSDRWDVTQN
SLLGLYDWNT QNTQVQSYLK RFLDRALNDG ADGFRFDAAK HIELPDDGSY GSQFWPNITN
TSAEFQYGEI LQDSASRDAA YANYMDVTAS NYGHSIRSAL KNRNLGVSNI SHYASDVSAD
KLVTWVESHD TYANDDEEST WMSDDDIRLG WAVIASRSGS TPLFFSRPEG GGNGVRFPGK
SQIGDRGSAL FEDQAITAVN RFHNVMAGQP EELSNPNGNN QIFMNQRGSH GVVLANAGSS
SVSINTATKL PDGRYDNKAG AGSFQVNDGK LTGTINARSV AVLYPDDIAK APHVFLENYK
TGVTHSFNDQ LTITLRADAN TTKAVYQINN GPDDRRLRME INSQSEKEIQ FGKTYTIMLK
GTNSDGVTRT EKYSFVKRDP ASAKTIGYQN PNHWSQVNAY IYKHDGSRVI ELTGSWPGKP
MTKNADGIYT LTLPADTDTT NAKVIFNNGS AQVPGQNQPG FDYVLNGLYN DSGLSGSLPH
This sequence was subjected to a Blast-p analysis; the results are summarized in Table 1:
Table 1. Summary of Blast-p with the sequence AMY_BACSU
Enzyme | n° | %
alpha-amylase | 25 | 50
cyclomaltodextrin | 9 | 18
pancreatic alpha-amylase | 4 | 8
(amy b) alpha-amylase b | 4 | 8
alpha-amylase precursor | 3 | 6
salivary alpha-amylase | 2 | 4
acid alpha-amylase | 1 | 2
putative alpha-amylase | 1 | 2
maltogenase | 1 | 2
Afterwards, the information from Blast-p was classified and integrated in order to use it in the search for linguistic behavior patterns. To this end, several informatic tools were developed, based on mathematical linguistics and evolutionary systems.
At this stage the linguistic structure of the protein or gene was established. To illustrate this, and as an example of the results obtained, we start with Table 2, which shows a small fragment of the Blast-p used for cytochrome C.
Table 2. Fragment of Blast-p performed on cytochrome C (CYC); columns give the amino acid (a.a.) at sites 1 to 11
Organism | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
EUGGR | G | D | A | E | R | G | K | K | L | F | E
EUGVI | G | D | A | E | R | G | K | K | L | F | E
MOUSE | G | D | A | E | A | G | K | K | I | F | V
EQUAS | G | D | V | E | K | G | K | K | I | F | V
HORSE | G | D | V | E | K | G | K | K | I | F | V
BOVIN | G | D | V | E | K | G | K | K | I | F | V
Table 2 shows that in all the CYC the first amino acid is G and the second is D, while the third may be either of two a.a., A or V. This place was marked x1, where x indicates a variable and 1 the first place where a variation occurs. The same happens at the 5th a.a., which was marked x2 because it is the 2nd place with different a.a. at the same position. This is how the linguistic structure is obtained:
G D x1 E x2 G K K x3 F x4
Through this process, different types of proteins have been analyzed and their linguistic structures found.
To make it easier to establish the linguistic equation, several programs were created. Their underlying pseudocode is:
- Open the “input” file, where the a.a. chains are stored, and open the “output” file, where the linguistic structure will be stored (in this case “salebio.txt”).
- Read the data from the input file and store them in the array “texto”.
- Compare the characters of each column. If all are the same, save the character in "textsal"; if not, store an "x" in that position.
- Print "textsal" and save it in salebio.txt.
- Close the files and exit.
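The steps above can be sketched in C roughly as follows. This is only an illustrative sketch, not the original program: the function names, the limits MAX_SEQS and MAX_LEN, the one-chain-per-line input format, and the decision to compare only the columns shared by all chains are assumptions made here. The array names "texto" and "textsal" and the output file "salebio.txt" are taken from the pseudocode.

```c
#include <stdio.h>
#include <string.h>

#define MAX_SEQS 64    /* assumed upper bound on the number of chains */
#define MAX_LEN  1024  /* assumed upper bound on chain length */

/* Builds the linguistic structure of n aligned chains: a column keeps its
 * character when every chain agrees there, otherwise it gets `wildcard`.
 * Only the columns shared by all chains (the shortest length) are compared.
 * Returns the length of the structure written to textsal. */
size_t linguistic_structure(const char *texto[], size_t n,
                            char wildcard, char *textsal, size_t cap)
{
    if (cap == 0) return 0;
    if (n == 0) { textsal[0] = '\0'; return 0; }

    size_t len = strlen(texto[0]);
    for (size_t i = 1; i < n; i++) {
        size_t l = strlen(texto[i]);
        if (l < len) len = l;
    }
    if (len > cap - 1) len = cap - 1;

    for (size_t col = 0; col < len; col++) {
        char c = texto[0][col];
        for (size_t i = 1; i < n; i++) {
            if (texto[i][col] != c) { c = wildcard; break; }
        }
        textsal[col] = c;
    }
    textsal[len] = '\0';
    return len;
}

/* Reads one chain per line from in_path, then prints the structure and
 * saves it to out_path. Returns 0 on success, -1 on failure. */
int structure_from_file(const char *in_path, const char *out_path)
{
    static char store[MAX_SEQS][MAX_LEN];
    const char *texto[MAX_SEQS];
    char textsal[MAX_LEN];
    size_t n = 0;

    FILE *in = fopen(in_path, "r");
    if (!in) return -1;
    while (n < MAX_SEQS && fgets(store[n], MAX_LEN, in)) {
        store[n][strcspn(store[n], "\r\n")] = '\0';   /* strip newline */
        if (store[n][0] != '\0') { texto[n] = store[n]; n++; }
    }
    fclose(in);

    linguistic_structure(texto, n, 'x', textsal, sizeof textsal);

    FILE *out = fopen(out_path, "w");
    if (!out) return -1;
    fprintf(out, "%s\n", textsal);
    printf("%s\n", textsal);
    return fclose(out) == 0 ? 0 : -1;
}
```

Applied to the six rows of Table 2, linguistic_structure with the wildcard 'x' yields GDxExGKKxFx, which corresponds to G D x1 E x2 G K K x3 F x4 without the numbering of the variables. A call such as structure_from_file("entrada.txt", "salebio.txt") would cover the file handling, where "entrada.txt" is a hypothetical input name.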
These programs were fed with the data from the full Blast-p on cytochrome C (25 CYC sequences of between 95 and 102 a.a.) and the full Blast-p on amylases (50 proteins of 620-660 a.a.), and the following linguistic structures were obtained:
Linguistic structure for cytochrome C:
GD...G...F.....QCH....G....GP.L.G..GR..G...G..Y..A.......W....L...L..PKK..PGTKM.F.G.K....R.D..........
Here, a letter indicates that all the cases analyzed show that amino acid at that position, and a dot indicates that different amino acids appear at that position.
Linguistic structure for alpha-amylase:
.
................................................G........................................................W.........YQP......................G...........F.......................................N...............................................................................................................G.........G...R................H......................................................................Y........V................................................................V....H...D................................................................................................................................NG................................................RG.......................N.................................TLGY...................................G..........................
From the entries for the base proteins used (CYC_EUGGR and AMY_BACSU), available in the protein bank of the EBI [11], the a.a. belonging to the heme group and to the active site were determined. Comparing them with the linguistic equations, it was established that all of these a.a. were present. Based on this linguistic analysis, it appears that the active sites of the amylases and the heme group of cytochrome C tend to be highly conserved.
A possible conformation of the heme group in cytochrome C would be:
Site | 1 | 17 | 18 | 79 | 85
Relative site | x | x+16 | x+17 | x+78 | x+84
a.a. | G | C | H | M | K
The possible active site for the amylases would be:
Relative site | x | x+1 | x+2 | x+3 | x+4 | x+13
a.a. | D | A | A | K | H | D
Here x is the site of the a.a. at which the active site or heme group starts, depending on the case.
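A pattern given as residues at fixed offsets from a starting site x can be searched for in a chain with a short routine. The following is a minimal sketch under assumptions made here: the struct site, the function name find_relative_pattern and the 0-based return value are illustrative and do not come from the original tools.

```c
#include <stddef.h>
#include <string.h>

/* One residue required at a fixed offset from the start position x. */
struct site {
    size_t offset;   /* 0 for x, 16 for x+16, and so on */
    char residue;    /* one-letter amino acid code */
};

/* Returns the 0-based index x of the first position in seq at which every
 * site of the pattern matches, or -1 if the pattern does not occur. */
long find_relative_pattern(const char *seq, const struct site *pat, size_t npat)
{
    size_t len = strlen(seq);
    size_t span = 0;

    /* The largest offset decides how far from the end x may still start. */
    for (size_t k = 0; k < npat; k++)
        if (pat[k].offset > span) span = pat[k].offset;

    for (size_t x = 0; x + span < len; x++) {
        size_t k = 0;
        while (k < npat && seq[x + pat[k].offset] == pat[k].residue)
            k++;
        if (k == npat) return (long)x;
    }
    return -1;
}
```

With the heme pattern above written as the pairs (0, G), (16, C), (17, H), (78, M), (84, K), the routine returns 0 for the CYC_EUGGR sequence, that is, x is site 1 in the 1-based numbering used in the text.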
For a better understanding of the linguistic properties of the proteins studied, it was necessary to begin an analysis at the character level. Each amino acid was viewed as a character in a sentence, and programs were created to display the percentage of occurrence of a given amino acid at a given site, with very interesting results.
To do that, a program was written in C and supplied with the data from Table 2, yielding the results shown in Table 3:
Table 3. Output of the a.a. percentage variations, based on data from Table 2. Rows are amino acids, columns are sites, and entries are percentages of occurrence; the last row is the linguistic equation.
a.a. / Site | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
A | 0 | 0 | 40 | 0 | 20 | 0 | 0 | 0 | 0 | 0 | 0
B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
D | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
E | 0 | 0 | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 20
F | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 0
G | 100 | 0 | 0 | 0 | 0 | 100 | 0 | 0 | 0 | 0 | 0
H | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
I | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 80 | 0 | 0
K | 0 | 0 | 0 | 0 | 60 | 0 | 100 | 100 | 0 | 0 | 0
L | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20 | 0 | 0
M | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
N | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Q | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
R | 0 | 0 | 0 | 0 | 20 | 0 | 0 | 0 | 0 | 0 | 0
S | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
V | 0 | 0 | 60 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 80
W | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Equation | G | D | . | E | . | G | K | K | . | F | .
In Table 3, the computer presents the analysis in the following form: the first column lists all the amino acids, and the remaining columns give the percentage of occurrence of each amino acid at each specific site in the protein. In the last line, the linguistic equation is printed.
For example, at the first site, all the cytochromes C from Table 2 have glycine (G), so in Table 3 “100” is printed next to G for site 1, indicating 100% occurrence of G at that site, and G is written in the last line. At the third site we see 40% alanine (A) and 60% valine (V), so “.” is printed in the last line, indicating that this site varies.
The pseudocode for the program performing this task is as follows:
- Open the “input” file, where the a.a. chains are stored, and open the “output” file, where the linguistic structure and the table will be stored.
- Read the data from the input file and store them in the array “texto”.
- Count the number of times each character appears in each column. If all the characters in a column are the same, save the character in "textsal"; if not, store a "." in that position.
- Print a list of all the amino acids, each followed by its percentage of occurrence at each specific site in the protein, and print "textsal" under the corresponding columns.
- Save the printout in the output file.
- Close the files and exit.
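The counting and printing steps of this pseudocode might look as follows in C. This is a sketch under stated assumptions rather than the original program: the function names are invented here, the chains are passed in memory instead of being read from a file, and the row alphabet is taken from the row labels of Table 4.

```c
#include <stdio.h>
#include <string.h>

/* Row labels as in Table 4; using them as the alphabet is an assumption. */
#define ALPHABET "ABCDEFGHIKLMNPQRSTVWY"

/* Number of chains that carry aa at the 0-based column col. */
size_t count_at(const char *texto[], size_t n, size_t col, char aa)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (strlen(texto[i]) > col && texto[i][col] == aa)
            hits++;
    return hits;
}

/* Percentage of the n chains that carry aa at column col. */
double occurrence_pct(const char *texto[], size_t n, size_t col, char aa)
{
    return n ? 100.0 * (double)count_at(texto, n, col, aa) / (double)n : 0.0;
}

/* Prints one row per amino acid with its percentage at each of the first
 * ncols sites, followed by a last row holding the linguistic structure. */
void print_variation_table(FILE *out, const char *texto[], size_t n,
                           size_t ncols)
{
    for (const char *a = ALPHABET; *a; a++) {
        fprintf(out, "%c", *a);
        for (size_t col = 0; col < ncols; col++)
            fprintf(out, " %3.0f", occurrence_pct(texto, n, col, *a));
        fputc('\n', out);
    }
    fputc(' ', out);
    for (size_t col = 0; col < ncols; col++) {
        char c = '.';   /* variable site unless every chain agrees */
        if (n > 0 && strlen(texto[0]) > col &&
            count_at(texto, n, col, texto[0][col]) == n)
            c = texto[0][col];
        fprintf(out, " %3c", c);
    }
    fputc('\n', out);
}
```

Applied to five of the rows of Table 2 (EUGGR, MOUSE, EQUAS, HORSE and BOVIN, a selection made here for illustration), occurrence_pct gives 40% A and 60% V at site 3, and 20% R, 20% A and 60% K at site 5, which are the values listed in Table 3.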
Among other things, these results let us see the patterns of variation at a specific site in the protein, showing whether the amino acids occurring at that site are all of the same type (e.g. 100% neutral) or whether there are combinations (e.g. 70% neutral, 20% acidic and 10% basic, or even 30% neutral aromatic and 70% neutral non-aromatic).
Below are some of the preliminary results obtained from the analysis of the complete resulting tables.
In the first table of variations for cytochrome C, note that at site 14, 88% of the CYC analyzed had cysteine, while only 12% had alanine. Looking for other sites with this proportion of variation, it was found that sites 39 and 48 also show an 88-12% proportion, but in these cases threonine (T) was present 88% of the time and serine (S) 12%, as can be seen in Table 4. The corresponding fragment of the Blast-p is shown in Table 5.
Table 4. Fragment of analysis of variations in CYC
a.a. | Site 14 | Site 39 | Site 48
A | 12 | , | ,
B | , | , | ,
C | 88 | , | ,
D | , | , | ,
E | , | , | ,
F | , | , | ,
G | , | , | ,
H | , | , | ,
I | , | , | ,
K | , | , | ,
L | , | , | ,
M | , | , | ,
N | , | , | ,
P | , | , | ,
Q | , | , | ,
R | , | , | ,
S | , | 12 | 12
T | , | 88 | 88
V | , | , | ,
W | , | , | ,
Y | , | , | ,
Table 5. Fragment of Blast-p analysis performed on CYC
No. | Organism | Site 14 | Site 39 | Site 48
1 | CYC_EUGGR | A | S | S
2 | CYC_EUGVI | A | S | S
3 | CYC2_MOUSE | C | T | T
4 | CYC2_RAT | C | T | T
5 | CYC_EQUAS | C | T | T
6 | CYC_HORSE | C | T | T
7 | CYC_BOVIN | C | T | T
8 | CYC_CYPCA | C | T | T
9 | CYC_MACGI | C | T | T
10 | CYC_HIPAM | C | T | T
11 | CYC_THELA | C | T | T
12 | CYC_CRIFA | A | S | S
13 | G298836 | C | T | T
14 | CYC_HUMAN | C | T | T
15 | CYC_CANFA | C | T | T
16 | CYC_MIRLE | C | T | T
17 | CYC_KATPE | C | T | T
18 | CYC_MACMU | C | T | T
19 | CYC_ESCGI | C | T | T
20 | CYC_NEUCR | C | T | T
21 | CYC_RANCA | C | T | T
22 | CYC_MINSC | C | T | T
23 | CYC_APTPA | C | T | T
24 | CYC_ENTTR | C | T | T
25 | CYC_MOUSE | C | T | T
From the Blast-p, the organisms can be divided into two categories: those with cysteine at site 14 and threonine at sites 39 and 48, and those with alanine at site 14 and serine at the other two. Based on these results, it is possible that the a.a. combination CTT can be substituted by ASS; in other words, CTT and ASS may act as synonyms.
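One way to look for such candidate synonyms is to read the residues at a chosen set of sites in every chain, join them into a word, and count how often each distinct word occurs. The sketch below illustrates this idea only; the struct word_count, the function tally_site_words and the limits MAX_SITES and MAX_WORDS are assumptions made here, not part of the original tools.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SITES 8    /* assumed limit on sites joined into one word */
#define MAX_WORDS 32   /* assumed limit on distinct words kept */

struct word_count {
    char word[MAX_SITES + 1];
    size_t count;
};

/* Reads the residues found at the given 0-based columns of every chain,
 * joins them into a word (e.g. "CTT") and tallies each distinct word.
 * Chains too short for one of the columns are skipped.
 * Returns the number of distinct words stored in out. */
size_t tally_site_words(const char *texto[], size_t n,
                        const size_t cols[], size_t ncols,
                        struct word_count out[])
{
    size_t nwords = 0;
    if (ncols == 0 || ncols > MAX_SITES) return 0;

    for (size_t i = 0; i < n; i++) {
        char w[MAX_SITES + 1];
        size_t len = strlen(texto[i]);
        size_t k;

        for (k = 0; k < ncols && cols[k] < len; k++)
            w[k] = texto[i][cols[k]];
        if (k < ncols) continue;              /* chain too short */
        w[ncols] = '\0';

        size_t j = 0;
        while (j < nwords && strcmp(out[j].word, w) != 0) j++;
        if (j == nwords) {
            if (nwords == MAX_WORDS) continue;   /* table full: drop word */
            strcpy(out[nwords].word, w);
            out[nwords].count = 0;
            nwords++;
        }
        out[j].count++;
    }
    return nwords;
}
```

For the 25 chains of Table 5 and sites 14, 39 and 48 (columns 13, 38 and 47 when counting from zero), the tally would contain two words, CTT with 22 occurrences and ASS with 3, in line with the 88-12% proportion discussed above.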
Finally, another tool was developed which, in conjunction with the previous work, makes it possible to establish the linguistic equation and to compare it with amino acid sequences, selecting those that match the equation. Its pseudocode is as follows:
- Open the “input” file, where the a.a. chains from which the linguistic equation is derived are stored, and open the “output” file, where the linguistic results will be stored.
- Read the data from the input file and store them in the array “texto”.
- Establish the linguistic equation.
- Open the file where the chains to be compared are stored.
- For each of the chains to be compared, verify that the equation exists within the chain.
- Print a list of all the chains that fulfill the linguistic equation, and report how many sequences there were in the file and how many of them fulfilled the equation.
- Save the printout in the output file.
- Close the files and exit.
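The verification step, checking that the equation exists within a chain, can be sketched in C as a wildcard substring search. The function names below are assumptions made here, and the sketch treats the equation as fulfilled when it matches at any starting position of the chain, which is one possible reading of the pseudocode.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if equation occurs somewhere in chain, where the wildcard
 * character in the equation matches any residue; otherwise returns 0. */
int fulfills_equation(const char *chain, const char *equation, char wildcard)
{
    size_t n = strlen(chain);
    size_t m = strlen(equation);

    if (m > n) return 0;
    for (size_t start = 0; start + m <= n; start++) {
        size_t k = 0;
        while (k < m && (equation[k] == wildcard ||
                         equation[k] == chain[start + k]))
            k++;
        if (k == m) return 1;
    }
    return 0;
}

/* Counts how many of the n chains fulfill the equation. */
size_t count_matches(const char *chains[], size_t n,
                     const char *equation, char wildcard)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (fulfills_equation(chains[i], equation, wildcard))
            hits++;
    return hits;
}
```

With the equation GD.E.GKK.F. from Table 3 and '.' as the wildcard, the EUGGR fragment GDAERGKKLFE fulfills the equation, while a hypothetical chain GDAERGAKLFE, with A in place of the conserved K at site 7, does not.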
This program could be used, for example, to create a sequence identifier: instead of storing the sequences of multiple proteins, only one linguistic equation would be stored for each type of protein and used to perform a preliminary identification.
4 Conclusions
It is possible to find linguistic structures based on the analysis of protein information, since among different enzymes catalyzing the same reaction there are a.a. zones that repeat independently of the organism from which the enzymes are extracted.
From the results of the different Blast-p analyses performed for amylases, we can conclude that there may be certain sequences of a.a. acting as verbs and another series of amino acids acting as subjects. When they are combined, a high specificity is attained in the enzymes. In addition, based on the analysis of the variations of cytochrome C, there may be equivalent amino acid sequences that act as synonyms.
It is feasible to develop programs that facilitate the establishment of different grammatical characteristics for protein or gene sequences.
References
1. El Origen de las formas, special issue of Mundo Científico #188, Barcelona (March 1998)
2. Singh, Jagjit: Teoría de la Información, del lenguaje y de la cibernética, Alianza Editorial AU-29, Madrid (1982)
3. Galindo, S.F.: Algunas propiedades matemáticas de los sistemas lingüísticos, in: Memorias sobre Sistemas Evolutivos del 1er Congreso Internacional de Investigación en Ciencias Computacionales, Instituto Tecnológico de Toluca, Metepec, Edo. de México (1994)
4. Galindo, S.F.: Sistemas Evolutivos de Reescritura, in: Memorias sobre Sistemas Evolutivos del 1er Congreso Internacional de Investigación en Ciencias Computacionales, Instituto Tecnológico de Toluca, Metepec, Edo. de México (1994)
5. Galindo, S.F.: Sistemas Evolutivos de Lenguajes de Trayectoria, in: Memorias de la VI Reunión Nacional de Inteligencia Artificial, Ed. Limusa, Querétaro, Qro. (1989)
6. Jullien, Remi, Botet, Robert, Kolb, M.: Los Agregados, Mundo Científico vol. 6, #54, p. 36, Ed. Fontalba, S.A., Barcelona, Spain
7. Galindo, P.G.Z., Rodríguez, P.P.: Modelos Bioinformáticos, in: Memorias del VIII Congreso Nacional de Biotecnología y Bioingeniería y IV Congreso Latinoamericano de Biotecnología y Bioingeniería, p. 599, Huatulco, Oaxaca, Mexico (1999)
8. Segovia, L.: Bioinformática: Análisis de la familia estructural de las Beta-lactamasas, in: Memorias del VIII Congreso Nacional de Biotecnología y Bioingeniería y IV Congreso Latinoamericano de Biotecnología y Bioingeniería, p. 598, Huatulco, Oaxaca, Mexico (1999)
9. Lehninger, A.: Biochemistry, 2nd edn., New York (1975)
10. Smith, C.U.M.: Biología Molecular, Alianza Editorial AU-7, Madrid (1971)
11. www.ebi.ac.uk, web page of the European Bioinformatics Institute