Bioinformatic Linguistic Models

Gamma Zaratustra Galindo Pérez1

1 IPN – UPIBI, Mexico City

 zaratustra@universo.com

Abstract. In this work, informatics is applied to identify linguistic patterns in protein sequences. The information found in protein sequence banks was taken as the basis, and behavioral patterns were searched for in it. This made it possible to state linguistic equations for the analyzed proteins. To this end, a series of tools (programs written in C) was created to analyze both protein and genetic information.

1  Introduction

A vast and constantly growing amount of gene and protein sequence information is available on the Internet. Each newly decoded gene or protein creates numerous needs, ranging from classification and storage to understanding its meaning and biological relevance.

It is therefore necessary to design and develop tools that not only facilitate storing information and extracting data, but also help uncover patterns in biological information. In the long run, this will make it possible to model gene and protein sequences.

This would open multiple possibilities. Some are very pragmatic, such as designing better matrices for protein purification; others include designing genes and proteins to suit our interests, or programming the genes of microorganisms.

As a step in this direction, this project aims to find behavior patterns in gene and protein sequences by applying informatics.

2  Methodology

Initially, information was compiled from protein and gene sequence banks. This information was then classified and integrated to search for behavior patterns at the genetic level. To this end, tools from mathematical linguistics were used.

In the present work, cellulases and amylases were used as study proteins because of their economic importance. Cytochrome C (CYC) was also included: since enough studies on CYC exist to compare with our results, the model obtained from its analysis serves as a reference to validate the other models.

It is worth noting that the procedures used here may be applied to the analysis of proteins and, where appropriate, to the study of genes, since the tools created to analyze our sequences treat both proteins and genes as sequences of characters.

3 Results and Discussion

As a first step, information was collected from protein and gene sequence banks, which are available on the Internet. The first part of the project was to locate a suitable site. After an initial search, it was decided to use the European Bioinformatics Institute site at www.ebi.ac.uk [11].

Cytochrome C was the first protein analyzed. This choice has several advantages:

It is a relatively small protein (between 100 and 120 amino acids).

There are several studies on cytochrome C with which our results can be compared.

The “Swiss-Prot” protein bank of the European Bioinformatics Institute was chosen, and cytochrome C of Euglena gracilis (CYC_EUGGR) was selected. Its sequence of 102 amino acids (a.a.) is as follows:

 

GDAERGKKLF ESRAAQCHSA QKGVNSTGPS LWGVYGRTSG SVPGYAYSNA NKNAAIVWEE
ETLHKFLENP KKYVPGTKMA FAGIKAKKDR QDIIAYMKTL KD

 

A Blast-p analysis was performed on this sequence using the tools found on the web page of the Institute [11]. The sequences with significant alignments to the cytochrome C of Euglena gracilis were the following:

 

 1  CYC_EUGGR     7  CYC_BOVIN    13  CYC_HUMAN    19  CYC_NEUCR
 2  CYC_EUGVI     8  CYC_CYPCA    14  CYC_CANFA    20  CYC_RANCA
 3  CYC2_MOUSE    9  CYC_MACGI    15  CYC_MIRLE    21  CYC_MINSC
 4  CYC2_RAT     10  CYC_HIPAM    16  CYC_KATPE    22  CYC_APTPA
 5  CYC_EQUAS    11  CYC_THELA    17  CYC_MACMU    23  CYC_ENTTR
 6  CYC_HORSE    12  CYC_CRIFA    18  CYC_ESCGI    24  CYC_MOUSE

 

From the above we can assert that all the sequences matched to the one we submitted were cytochromes. It should be noted that the program was given no indication that the sequence under analysis was a cytochrome C.

For the amylases, several Blast-p analyses were carried out with different amylase sequences until one was found that allowed a suitable analysis, from which the linguistic equation for the amylases could later be established.

First, the Blast-p analysis was carried out using the Bacillus megaterium amylase. Homology was found mostly with several types of glucosidases, such as xylanases and mannanases, and to a lesser extent with amylases. When the Bacillus circulans sequence was used, higher homology was found with cyclomaltodextrin glucanotransferases.

From these results, it was decided to reclassify the information and carry out a Blast-p analysis again, this time using the Bacillus subtilis amylase sequence reported by the European Bioinformatics Institute with ID AMY_BACSU STANDARD, a sequence of 660 a.a.:

 

MFAKRFKTSL LPLFAGFLLL FHLVLAGPAA ASAETANKSN ELTAPSIKSG TILHAWNWSF
NTLKHNMKDI HDAGYTAIQT SPINQVKEGN QGDKSMSNWY WLYQPTSYQI GNRYLGTEQE
FKEMCAAAEE YGIKVIVDAV INHTTSDYAA ISNEVKSIPN WTHGNTQIKN WSDRWDVTQN
SLLGLYDWNT QNTQVQSYLK RFLDRALNDG ADGFRFDAAK HIELPDDGSY GSQFWPNITN
TSAEFQYGEI LQDSASRDAA YANYMDVTAS NYGHSIRSAL KNRNLGVSNI SHYASDVSAD
KLVTWVESHD TYANDDEEST WMSDDDIRLG WAVIASRSGS TPLFFSRPEG GGNGVRFPGK
SQIGDRGSAL FEDQAITAVN RFHNVMAGQP EELSNPNGNN QIFMNQRGSH GVVLANAGSS
SVSINTATKL PDGRYDNKAG AGSFQVNDGK LTGTINARSV AVLYPDDIAK APHVFLENYK
TGVTHSFNDQ LTITLRADAN TTKAVYQINN GPDDRRLRME INSQSEKEIQ FGKTYTIMLK
GTNSDGVTRT EKYSFVKRDP ASAKTIGYQN PNHWSQVNAY IYKHDGSRVI ELTGSWPGKP
MTKNADGIYT LTLPADTDTT NAKVIFNNGS AQVPGQNQPG FDYVLNGLYN DSGLSGSLPH

This sequence was subjected to a Blast-p analysis, and the results are summarized in Table 1:

Table 1. Summary of the Blast-p analysis of the sequence AMY_BACSU

Enzyme                      Sequences     %
alpha-amylase                      25    50
cyclomaltodextrin                   9    18
pancreatic alpha-amylase            4     8
(amy b) alpha-amylase b             4     8
alpha-amylase precursor             3     6
salivary alpha-amylase              2     4
acid alpha-amylase                  1     2
putative alpha-amylase              1     2
maltogenase                         1     2

 

Afterwards, the information from Blast-p was classified and integrated for use in the search for linguistic behavior patterns. To this end, several informatics tools were developed, based on mathematical linguistics and evolutionary systems.

At this stage, the linguistic structure of the protein or gene was established. To illustrate this, and as an example of the results obtained, we start with Table 2, which shows a small fragment of the Blast-p used for cytochrome C.

Table 2. Fragment of Blast-p performed on cytochrome C

Cytochrome C (CYC)

Organism   Amino acid sequence (a.a.), by position
            1   2   3   4   5   6   7   8   9  10  11
EUGGR       G   D   A   E   R   G   K   K   L   F   E
EUGVI       G   D   A   E   R   G   K   K   L   F   E
MOUSE       G   D   A   E   A   G   K   K   I   F   V
EQUAS       G   D   V   E   K   G   K   K   I   F   V
HORSE       G   D   V   E   K   G   K   K   I   F   V
BOVIN       G   D   V   E   K   G   K   K   I   F   V

 

Table 2 shows that in all the CYC sequences the first amino acid is G and the second is D, while the third may be either of two amino acids, A or V. This position was marked x1, where x indicates a variable and 1 indicates the first position at which a variation occurs. The same happens at the 5th amino acid, which was marked x2 because it is the second position showing different amino acids. The linguistic structure is obtained in this way:

 

G  D  x1  E  x2  G  K  K  x3  F  x4
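The marking scheme can be illustrated with a short C sketch. It is not the program used in this work: the function name numbered_structure is hypothetical, and the chains are assumed to be already aligned and of equal length.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the linguistic structure of n aligned,
 * equal-length chains into out, e.g. "G D x1 E x2".
 * Conserved columns print their amino acid; variable columns print
 * x1, x2, ... in order of appearance.
 * Returns the number of variable positions found. */
int numbered_structure(const char *seqs[], int n, char *out, size_t out_size)
{
    assert(n > 0 && out_size > 0);
    size_t len = strlen(seqs[0]);
    size_t used = 0;
    int variables = 0;

    out[0] = '\0';
    for (size_t col = 0; col < len; col++) {
        int conserved = 1;
        for (int row = 1; row < n; row++) {
            if (seqs[row][col] != seqs[0][col]) {
                conserved = 0;
                break;
            }
        }
        const char *sep = (col == 0) ? "" : " ";
        int written;
        if (conserved)
            written = snprintf(out + used, out_size - used, "%s%c",
                               sep, seqs[0][col]);
        else
            written = snprintf(out + used, out_size - used, "%sx%d",
                               sep, ++variables);
        if (written < 0 || (size_t)written >= out_size - used)
            break;  /* output buffer full: stop with a valid string */
        used += (size_t)written;
    }
    return variables;
}
```

Called with the six rows of Table 2, the function reports four variable positions and fills the buffer with the structure G D x1 E x2 G K K x3 F x4.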

 

Using this process, different types of proteins were analyzed to find their linguistic structure.

To make it easier to establish the linguistic equation, several programs were created. Their basic pseudocode is:

 

-  Open the “input” file, where the a.a. chains are stored, and open the “output” file, where the linguistic structure will be stored (in this case “salebio.txt”).

-  Read the data from the input file and store them in the array “texto”.

-  Compare the characters in each column. If all are the same, save the character in “textsal”; if not, store an “x” in that position.

-  Print “textsal” and save it in salebio.txt.

-  Close the files and exit.
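The steps above can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the original source: the function names, the limits MAX_SEQS and MAX_LEN, the one-chain-per-line input format and the var_mark parameter (which selects "x" or "." as the marker) are all introduced here for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_SEQS 64    /* assumed upper bound on the number of chains */
#define MAX_LEN  1024  /* assumed upper bound on chain length */

/* Compare each column of the n chains in texto; copy the amino acid to
 * textsal when every chain agrees, otherwise store var_mark. */
void linguistic_structure(char texto[][MAX_LEN], int n, char var_mark,
                          char *textsal)
{
    assert(n > 0);
    size_t len = strlen(texto[0]);
    for (int i = 1; i < n; i++)          /* use the shortest chain length */
        if (strlen(texto[i]) < len)
            len = strlen(texto[i]);

    for (size_t col = 0; col < len; col++) {
        char c = texto[0][col];
        for (int i = 1; i < n; i++) {
            if (texto[i][col] != texto[0][col]) {
                c = var_mark;
                break;
            }
        }
        textsal[col] = c;
    }
    textsal[len] = '\0';
}

/* Read one chain per line from in_path, print the structure and save it
 * in out_path. Returns 0 on success, -1 on I/O error or empty input. */
int structure_from_file(const char *in_path, const char *out_path,
                        char var_mark)
{
    static char texto[MAX_SEQS][MAX_LEN];
    char textsal[MAX_LEN];
    int n = 0;

    FILE *in = fopen(in_path, "r");
    if (in == NULL)
        return -1;
    while (n < MAX_SEQS && fgets(texto[n], MAX_LEN, in) != NULL) {
        texto[n][strcspn(texto[n], "\r\n")] = '\0';  /* strip newline */
        if (texto[n][0] != '\0')                     /* skip blank lines */
            n++;
    }
    fclose(in);
    if (n == 0)
        return -1;

    linguistic_structure(texto, n, var_mark, textsal);

    FILE *out = fopen(out_path, "w");
    if (out == NULL)
        return -1;
    printf("%s\n", textsal);
    fprintf(out, "%s\n", textsal);
    fclose(out);
    return 0;
}
```

A call such as structure_from_file("input.txt", "salebio.txt", '.') would write a dotted structure of the kind shown in this section; the input file name is only an example.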

 

These programs were fed with the data from the full Blast-p on cytochrome C (25 CYC sequences of between 95 and 102 a.a.) and the full Blast-p on amylases (50 proteins of 620-660 a.a.), and the following linguistic structures were obtained:

Linguistic structure for cytochrome C:

 

GD...G...F.....QCH....G....GP.L.G..GR..G...G..Y..A.......W....L...L..PKK..PGTKM.F.G.K....R.D..........

 

Here, a letter indicates that all the cases analyzed show that amino acid at that position, and a dot indicates that different amino acids appear at that position.

 

Linguistic structure for alpha-amylase:

.

................................................G........................................................W.........YQP......................G...........F.......................................N...............................................................................................................G.........G...R................H......................................................................Y........V................................................................V....H...D................................................................................................................................NG................................................RG.......................N.................................TLGY...................................G..........................

 

From the entries for the base proteins used (CYC_EUGGR and AMY_BACSU), available in the protein bank of the EBI [11], the amino acids belonging to the heme group and to the active site were determined. Comparison with the linguistic equations established that all of these amino acids were present in them. Based on this linguistic analysis, it appears that the active sites of amylases and the heme group of cytochrome C tend to be highly conserved.

 

A possible conformation of the heme group in cytochrome C would be:

 

Position               1     17     18     79     85
Relative position      x   x+16   x+17   x+78   x+84
Amino acid             G      C      H      M      K

 

 

The possible active site for the amylases would be:

 

Relative position      x    x+1    x+2    x+3    x+4   x+13
Amino acid             D      A      A      K      H      D

 

Here, x is the position of the amino acid at which the active site or the heme group starts, depending on the case.
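Searching a chain for the start position x of such a conformation can be sketched in C as follows. The offsets and residues come from the two tables above; the type struct site, the array names and the function find_pattern are hypothetical names introduced for this illustration, and the chain is assumed to contain no alignment gaps.

```c
#include <assert.h>
#include <string.h>

/* One conserved residue at a fixed offset from the start position x. */
struct site {
    int offset;    /* distance from x, in residues */
    char residue;  /* one-letter amino acid code */
};

/* Offsets taken from the two tables above, sorted by offset. */
const struct site HEME_GROUP[] = {
    {0, 'G'}, {16, 'C'}, {17, 'H'}, {78, 'M'}, {84, 'K'}
};
const struct site AMYLASE_SITE[] = {
    {0, 'D'}, {1, 'A'}, {2, 'A'}, {3, 'K'}, {4, 'H'}, {13, 'D'}
};

/* Returns the first 1-based position x at which every site of the
 * pattern matches seq, or 0 when no such position exists. */
int find_pattern(const char *seq, const struct site *pattern, int n_sites)
{
    assert(seq != NULL && n_sites > 0);
    int len = (int)strlen(seq);
    int span = pattern[n_sites - 1].offset;  /* largest offset */

    for (int x = 0; x + span < len; x++) {
        int ok = 1;
        for (int i = 0; i < n_sites; i++) {
            if (seq[x + pattern[i].offset] != pattern[i].residue) {
                ok = 0;
                break;
            }
        }
        if (ok)
            return x + 1;
    }
    return 0;
}
```

For the CYC_EUGGR sequence given earlier, find_pattern(seq, HEME_GROUP, 5) returns 1, which corresponds to G at position 1, C at 17, H at 18, M at 79 and K at 85.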

 

For a better understanding of the linguistic properties of the studied proteins, it was necessary to start an analysis at the character level. Each amino acid was viewed as a character in a sentence. Programs were therefore created to show the percentage of occurrence of each amino acid at a given site, with very interesting results.

To do this, a program was written in C and supplied with the data from Table 2, yielding the results in Table 3:

Table 3. Percentage of occurrence of each amino acid per site, based on the data from Table 2.

a.a./ Percentage of occurrence
Site      1    2    3    4    5    6    7    8    9   10   11
A         0    0   40    0   20    0    0    0    0    0    0
B         0    0    0    0    0    0    0    0    0    0    0
C         0    0    0    0    0    0    0    0    0    0    0
D         0  100    0    0    0    0    0    0    0    0    0
E         0    0    0  100    0    0    0    0    0    0   20
F         0    0    0    0    0    0    0    0    0  100    0
G       100    0    0    0    0  100    0    0    0    0    0
H         0    0    0    0    0    0    0    0    0    0    0
I         0    0    0    0    0    0    0    0   80    0    0
K         0    0    0    0   60    0  100  100    0    0    0
L         0    0    0    0    0    0    0    0   20    0    0
M         0    0    0    0    0    0    0    0    0    0    0
N         0    0    0    0    0    0    0    0    0    0    0
P         0    0    0    0    0    0    0    0    0    0    0
Q         0    0    0    0    0    0    0    0    0    0    0
R         0    0    0    0   20    0    0    0    0    0    0
S         0    0    0    0    0    0    0    0    0    0    0
T         0    0    0    0    0    0    0    0    0    0    0
V         0    0   60    0    0    0    0    0    0    0   80
W         0    0    0    0    0    0    0    0    0    0    0
          G    D    .    E    .    G    K    K    .    F    .

 

Table 3 shows how the program renders the analysis: the first column lists all the amino acids, and the following columns give the percentage of occurrence of each amino acid at each specific site in the protein. The last line prints the linguistic equation.

For example, all the cytochrome C sequences in Table 2 have glycine (G) at the first site, so in Table 3 “100” is printed next to G for site 1, indicating 100% occurrence of G at that site, and G is written in the last line. At the third site there is 40% alanine (A) and 60% valine (V), so “.” is printed in the last line, indicating that this site varies.

The pseudocode for the program performing this task is as follows:

 

-  Open the “input” file, where the a.a. chains are stored, and open the “output” file, where the linguistic structure and the table will be stored.

-  Read the data from the input file and store them in the array “texto”.

-  Count the number of times each character appears in each column. If all the characters in a column are the same, save the character in “textsal”; if not, store a “.” in that position.

-  Print a list of all the amino acids, each followed by its percentage of occurrence at each specific site in the protein, and print “textsal” under the corresponding columns.

-  Save the printout in the output file.

-  Close the files and exit.
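A compact C sketch of the counting and printing steps is given below. It is an illustration rather than the original program: the names count_at, percent_at and print_variation_table are hypothetical, the row order follows the amino acid labels of Table 4, percentages are rounded to the nearest integer, and the chains are assumed to be aligned and of equal length.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Row labels, in the order used by Table 4 (an assumption of this sketch). */
static const char ALPHABET[] = "ABCDEFGHIKLMNPQRSTVWY";

/* Number of chains carrying amino acid aa at the 0-based column col. */
int count_at(const char *seqs[], int n, size_t col, char aa)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (seqs[i][col] == aa)
            count++;
    return count;
}

/* Percentage of occurrence, rounded to the nearest integer. */
int percent_at(const char *seqs[], int n, size_t col, char aa)
{
    assert(n > 0);
    return (100 * count_at(seqs, n, col, aa) + n / 2) / n;
}

/* Prints one row per amino acid with its percentage at every site, then
 * the linguistic structure underneath; the structure is also copied
 * into textsal, which must hold at least strlen(seqs[0]) + 1 bytes. */
void print_variation_table(FILE *out, const char *seqs[], int n,
                           char *textsal)
{
    assert(n > 0);
    size_t len = strlen(seqs[0]);

    fprintf(out, "Site");
    for (size_t col = 0; col < len; col++)
        fprintf(out, "%4zu", col + 1);
    fputc('\n', out);

    for (const char *aa = ALPHABET; *aa != '\0'; aa++) {
        fprintf(out, "%-4c", *aa);
        for (size_t col = 0; col < len; col++)
            fprintf(out, "%4d", percent_at(seqs, n, col, *aa));
        fputc('\n', out);
    }

    fprintf(out, "    ");
    for (size_t col = 0; col < len; col++) {
        /* Conserved only when every chain shares the first chain's residue. */
        int conserved = count_at(seqs, n, col, seqs[0][col]) == n;
        textsal[col] = conserved ? seqs[0][col] : '.';
        fprintf(out, "%4c", textsal[col]);
    }
    fputc('\n', out);
    textsal[len] = '\0';
}
```

The conservation test compares raw counts rather than rounded percentages, so that a site with a single deviating chain among many is still marked with a dot. Passing stdout as the first argument prints the table to the screen; passing a file opened with fopen saves it.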

 

 

Among other things, these results show the patterns of variation at a specific site in the protein. They reveal whether the amino acids occurring at a site are all of the same type (e.g., 100% neutral) or a combination of types (e.g., 70% neutral, 20% acidic and 10% basic, or even 30% neutral aromatic and 70% neutral non-aromatic).

Below are some preliminary results obtained from the analysis of the complete resulting tables.

In the first table of variations for cytochrome C, 88% of the CYC sequences analyzed have cysteine at site 14, while only 12% have alanine. A search for other sites with the same proportion of variation showed that sites 39 and 48 also have an 88%-12% split, but in these cases threonine is present 88% of the time and serine 12%, as shown in Table 4. The corresponding Blast-p data are shown in Table 5.

 

Table 4. Fragment of analysis of variations in CYC

a.a.   Site 14   Site 39   Site 48
A           12         ,         ,
B            ,         ,         ,
C           88         ,         ,
D            ,         ,         ,
E            ,         ,         ,
F            ,         ,         ,
G            ,         ,         ,
H            ,         ,         ,
I            ,         ,         ,
K            ,         ,         ,
L            ,         ,         ,
M            ,         ,         ,
N            ,         ,         ,
P            ,         ,         ,
Q            ,         ,         ,
R            ,         ,         ,
S            ,        12        12
T            ,        88        88
V            ,         ,         ,
W            ,         ,         ,
Y            ,         ,         ,

Table 5. Fragment of Blast-p analysis performed on CYC

No.   Organism      Site 14   Site 39   Site 48
 1    CYC_EUGGR        A         S         S
 2    CYC_EUGVI        A         S         S
 3    CYC2_MOUSE       C         T         T
 4    CYC2_RAT         C         T         T
 5    CYC_EQUAS        C         T         T
 6    CYC_HORSE        C         T         T
 7    CYC_BOVIN        C         T         T
 8    CYC_CYPCA        C         T         T
 9    CYC_MACGI        C         T         T
10    CYC_HIPAM        C         T         T
11    CYC_THELA        C         T         T
12    CYC_CRIFA        A         S         S
13    G298836          C         T         T
14    CYC_HUMAN        C         T         T
15    CYC_CANFA        C         T         T
16    CYC_MIRLE        C         T         T
17    CYC_KATPE        C         T         T
18    CYC_MACMU        C         T         T
19    CYC_ESCGI        C         T         T
20    CYC_NEUCR        C         T         T
21    CYC_RANCA        C         T         T
22    CYC_MINSC        C         T         T
23    CYC_APTPA        C         T         T
24    CYC_ENTTR        C         T         T
25    CYC_MOUSE        C         T         T

 

From the Blast-p, the organisms can be divided into two categories: those with cysteine at site 14 and threonine at sites 39 and 48, and those with alanine at site 14 and serine at the other two. Based on these results, the amino acid combination CTT may be substitutable by ASS; in other words, CTT and ASS may act as synonyms.
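One way to check for such co-varying sites is to read, for every chain, the residues at the chosen sites as a short word and to tally the distinct words. The C sketch below illustrates the idea; the names struct word_count and tally_words and the limits MAX_SITES and MAX_WORDS are hypothetical, and every chain is assumed to be at least as long as the largest site number.

```c
#include <assert.h>
#include <string.h>

#define MAX_SITES 8    /* assumed limit on sites examined together */
#define MAX_WORDS 32   /* assumed limit on distinct combinations */

struct word_count {
    char word[MAX_SITES + 1];  /* residues found at the chosen sites */
    int count;                 /* number of chains showing this word */
};

/* Collects the residues each chain has at the given 1-based sites and
 * tallies the distinct combinations. Returns the number of distinct
 * words stored in table (at most MAX_WORDS). */
int tally_words(const char *seqs[], int n, const int sites[], int n_sites,
                struct word_count table[])
{
    assert(n > 0 && n_sites > 0 && n_sites <= MAX_SITES);
    int n_words = 0;

    for (int i = 0; i < n; i++) {
        char word[MAX_SITES + 1];
        for (int s = 0; s < n_sites; s++)
            word[s] = seqs[i][sites[s] - 1];
        word[n_sites] = '\0';

        int w = 0;
        while (w < n_words && strcmp(table[w].word, word) != 0)
            w++;
        if (w == n_words) {             /* first time this word is seen */
            if (n_words == MAX_WORDS)
                continue;               /* table full: ignore new words */
            strcpy(table[w].word, word);
            table[w].count = 0;
            n_words++;
        }
        table[w].count++;
    }
    return n_words;
}
```

With sites 14, 39 and 48, a result of exactly two words (CTT and ASS), with no mixed forms such as CTS, would correspond to the two categories described above.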

 

Finally, another tool was developed which, together with the previous work, establishes the linguistic equation and compares it with amino acid sequences, selecting those that match the equation. Its pseudocode is as follows:

 

-  Open the “input” file, which stores the a.a. chains from which the linguistic equation is derived, and open the “output” file, where the results will be stored.

-  Read the data from the input file and store them in the array “texto”.

-  Establish the linguistic equation.

-  Open the file where the chains to be compared are stored.

-  For each of the chains to be compared, verify that the equation exists within the chain.

-  Print a list of all the chains that fulfill the linguistic equation, and report how many sequences the file contained and how many of them fulfilled the equation.

-  Save the printout in the output file.

-  Close the files and exit.
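The comparison step can be sketched in C as follows. This is a minimal illustration, not the original tool: the function names fulfills_equation and report_matches are hypothetical, the chains are passed in memory rather than read from a file, and "the equation exists within the chain" is interpreted here as a match at any starting position, with each dot accepting any amino acid.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 when the linguistic equation occurs somewhere inside chain.
 * A letter in the equation must match exactly; '.' matches any amino acid. */
int fulfills_equation(const char *equation, const char *chain)
{
    assert(equation != NULL && chain != NULL);
    size_t eq_len = strlen(equation);
    size_t ch_len = strlen(chain);

    if (eq_len > ch_len)
        return 0;
    for (size_t start = 0; start + eq_len <= ch_len; start++) {
        size_t i = 0;
        while (i < eq_len &&
               (equation[i] == '.' || equation[i] == chain[start + i]))
            i++;
        if (i == eq_len)
            return 1;
    }
    return 0;
}

/* Prints every chain that fulfills the equation, then a summary line.
 * Returns the number of matching chains. */
int report_matches(FILE *out, const char *equation,
                   const char *chains[], int n)
{
    int matched = 0;
    for (int i = 0; i < n; i++) {
        if (fulfills_equation(equation, chains[i])) {
            fprintf(out, "%s\n", chains[i]);
            matched++;
        }
    }
    fprintf(out, "%d of %d sequences fulfill the equation\n", matched, n);
    return matched;
}
```

For instance, the equation GD.E.GKK.F. is fulfilled by the fragment GDVEKGKKIFV from Table 2 but not by MFAKRFKTSLL, the first residues of AMY_BACSU.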

 

This program could be used, for example, to build a sequence identifier: instead of storing the sequences of multiple proteins, only one linguistic equation is stored for each type of protein and used to perform a preliminary identification.

4   Conclusions

It is possible to find linguistic structures by analyzing protein information, since different enzymes catalyzing the same reaction share amino acid regions that repeat regardless of the organism from which the enzymes are extracted.

From the results of the different Blast-p analyses performed for amylases, we conclude that certain amino acid sequences may act as verbs and others as subjects. When they are combined, the enzymes attain high specificity. In addition, the analysis of the variations of cytochrome C suggests that there may be equivalent amino acid sequences that act as synonyms.

It is feasible to develop programs that facilitate establishing different grammatical characteristics of protein or gene sequences.

References

1. El Origen de las formas, special issue of Mundo Científico #188, Barcelona (March 1998)

2. Singh, J.: Teoría de la Información, del lenguaje y de la cibernética. Alianza Editorial AU-29, Madrid (1982)

3. Galindo, S.F.: Algunas propiedades matemáticas de los sistemas lingüísticos. In: Memorias sobre Sistemas Evolutivos del 1er Congreso Internacional de Investigación en Ciencias Computacionales, Instituto Tecnológico de Toluca, Metepec, Edo. de México (1994)

4. Galindo, S.F.: Sistemas Evolutivos de Reescritura. In: Memorias sobre Sistemas Evolutivos del 1er Congreso Internacional de Investigación en Ciencias Computacionales, Instituto Tecnológico de Toluca, Metepec, Edo. de México (1994)

5. Galindo, S.F.: Sistemas Evolutivos de Lenguajes de Trayectoria. In: Memorias de la VI Reunión Nacional de Inteligencia Artificial, Ed. Limusa, Querétaro, Qro. (1989)

6. Jullien, R., Botet, R., Kolb, M.: Los Agregados. Mundo Científico vol. 6, #54, p. 36, Ed. Fontalba, S.A., Barcelona, Spain

7. Galindo, P.G.Z., Rodríguez, P.P.: Modelos Bioinformáticos. In: Memorias del VIII Congreso Nacional de Biotecnología y Bioingeniería y IV Congreso Latinoamericano de Biotecnología y Bioingeniería, p. 599, Huatulco, Oaxaca, Mexico (1999)

8. Segovia, L.: Bioinformática: Análisis de la familia estructural de las Beta-lactamasas. In: Memorias del VIII Congreso Nacional de Biotecnología y Bioingeniería y IV Congreso Latinoamericano de Biotecnología y Bioingeniería, p. 598, Huatulco, Oaxaca, Mexico (1999)

9. Lehninger, A.: Biochemistry, 2nd edn. New York (1975)

10. Smith, C.U.M.: Biología Molecular. Alianza Editorial AU-7, Madrid (1971)

11. www.ebi.ac.uk, website of the European Bioinformatics Institute