Biological Database | Online Biotech Notes
Biological Databases
Databases are effectively electronic filing cabinets: a convenient and efficient way to store biological data in an electronic format. Common types of biological data are:
- Nucleotide sequences (genes and genomes)
- Protein sequences
- Macromolecular structures
- Metabolic pathways
- Gene expression data
- Protein-protein interactions
- Literature
The amount of biological data is increasing exponentially because of various sequencing projects and new technologies that generate large-scale genomic and proteomic datasets.
Probably the first published biological sequence database was the Atlas of Protein Sequence and Structure (1965), by Margaret Dayhoff et al.
Characteristics of Biological Databases
Each biological database uses its own standardized format, but all databases have two main features:
Non-Redundancy
- Each entry occurs only once in the database, i.e. duplication of an entry is not allowed within the same database.
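As a minimal sketch of the non-redundancy rule, the snippet below keeps only the first entry for each distinct sequence. The record layout (an ID paired with a sequence string) and the function name are assumptions made for illustration; real databases apply their own, more elaborate criteria.

```python
def remove_redundant(entries):
    """Keep the first entry seen for each distinct sequence.

    `entries` is a list of (entry_id, sequence) pairs. This layout is
    hypothetical and chosen only for illustration.
    """
    seen = set()
    unique = []
    for entry_id, sequence in entries:
        key = sequence.upper()  # treat upper and lower case as the same sequence
        if key not in seen:
            seen.add(key)
            unique.append((entry_id, sequence))
    return unique


entries = [
    ("E1", "ATGGCC"),
    ("E2", "atggcc"),  # duplicate of E1, differing only in letter case
    ("E3", "ATGTTT"),
]
unique_entries = remove_redundant(entries)
print(unique_entries)
```

Comparing exact sequence strings is the simplest possible test for duplication; it would not catch near-identical entries.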
Data Sharing
- Biological data in databases are shared for the scientific community's examination and inspection.
Classification of Biological Databases
There are many different types of database, depending on the nature of the biological data being stored. Here, we discuss databases classified only by the nature of the information they store.
1. Sequence Databases
Primary Sequence Databases
Primary databases are repositories for raw sequence data generated through laboratory experiments. They can be described in two subsections:
1. Nucleotide Database
The sequences are collected from published sources or by direct author submission. There are three major nucleotide sequence databases.
(a) European Molecular Biology Laboratory (EMBL)
It is maintained by the EBI. The Sequence Retrieval System (SRS) is used to retrieve information; it links the sequence databases with other databases, including the MEDLINE facility.
(b) DNA DataBank of Japan (DDBJ)
It is maintained by the National Institute of Genetics. Sequences may be submitted from all over the world through a web-based data submission tool.
(c) Genetic Sequence DataBank (GenBank)
This is maintained by the National Center for Biotechnology Information (NCBI).
Entrez, an integrated retrieval system, is used to retrieve information from GenBank. Each entry in GenBank follows a flat-file format. The three nucleotide databases work collaboratively and exchange data with one another on a daily basis.
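To make the idea of a flat-file entry concrete, here is a minimal sketch of reading one GenBank-style record with plain Python. It assumes only the general layout of the format (keywords starting in the first column, an ORIGIN line before the sequence, and `//` as the end-of-record marker). The sample record is made up for illustration and is not a real database entry, and the parser ignores features and multi-line fields.

```python
def parse_flat_file(text):
    """Parse one GenBank-style flat-file record into a dict (simplified)."""
    record = {"sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_sequence:
            # Sequence lines look like "        1 atggccattg taatgggccg";
            # keep the letters, drop position numbers and spaces.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line[:1].isalpha():
            # A top-level keyword starts in column 1; its value follows it.
            keyword, _, value = line.partition(" ")
            record[keyword] = value.strip()
    return record


# Made-up record, for illustration only.
SAMPLE = """\
LOCUS       EXAMPLE01                 30 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up record for illustration only.
ACCESSION   EXAMPLE01
ORIGIN
        1 atggccattg taatgggccg ctgaaagggt
//
"""

record = parse_flat_file(SAMPLE)
print(record["ACCESSION"], len(record["sequence"]))
```

For real work, an established parser is a better choice than a hand-written one, since actual records contain feature tables and continuation lines that this sketch skips.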
2. Protein Sequence Databases
(a) Protein Information Resource (PIR)
This database was developed by the National Biomedical Research Foundation. It is divided into four sub-sections:
(i) PIR1 It contains fully classified and annotated entries.
(ii) PIR2 It includes preliminary entries and may contain redundancy.
(iii) PIR3 It contains unverified entries.
(iv) PIR4 It may contain conceptual translations of artefactual sequences, genetically engineered sequences, and sequences that are not transcribed or translated.
(b) SWISS-PROT
It was produced collaboratively by the Department of Medical Biochemistry at the University of Geneva and EMBL. It is now maintained by the Swiss Institute of Bioinformatics (SIB). SWISS-PROT has minimal redundancy and a high level of annotation.
(c) Translated EMBL (TrEMBL)
- The database contains translations of all coding sequences (CDS) in EMBL. It is divided into two sections.
(i) SWISS-PROT TrEMBL (SP-TrEMBL)
- It contains entries that will eventually be incorporated into SWISS-PROT.
(ii) Remaining TrEMBL (REM-TrEMBL)
- It contains entries that will not be incorporated into SWISS-PROT.
NRL-3D
- This database is produced by PIR. The ATLAS retrieval system is used to access information.
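TrEMBL entries, described above, are conceptual translations of coding sequences. The sketch below shows what such a translation involves, using the standard genetic code. The function name and the example sequence are made up for illustration; it assumes the CDS starts in frame and stops at the first stop codon.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code: amino acids for codons in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...); "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence until the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # "X" for unrecognized codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


protein = translate_cds("ATGGCCATTGTAATGGGCCGCTGA")
print(protein)  # MAIVMGR
```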
Martinsried Institute for Protein Sequences (MIPS)
- This database is distributed with PATCHX, and access to it is provided through its web server.
Composite Databases
- A composite database amalgamates a variety of different primary database sources, which obviates the need to search multiple resources.
Secondary Sequence Database
The source of data for secondary databases is the primary databases. Each secondary database is different and has its own format. The information stored in each secondary database concerns conserved regions, obtained through multiple sequence alignment.
PROSITE
- This database is maintained by the Swiss Institute of Bioinformatics. A protein family can be characterized by a single conserved motif that is responsible for key biological functions. Such motifs are represented as regular expressions.
- A regular expression is a consensus description of a motif, e.g. C-T-x(2)-[IG]-C-[RMS]. Any one of the residues listed within square brackets can be placed at that position.
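Because PROSITE motifs are regular expressions, they translate almost directly into the regular-expression syntax of a programming language. The sketch below converts the common elements of the PROSITE pattern syntax (`x` for any residue, square brackets for allowed residues, braces for excluded residues, parentheses for repeats, `<` and `>` for sequence ends) into a Python regex. The function name and the test sequence are made up for illustration; the example pattern is the N-glycosylation site pattern.

```python
import re

ELEMENT = re.compile(
    r"(<?)"                             # optional N-terminal anchor
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})"  # x, one residue, [allowed] or {excluded}
    r"(?:\((\d+)(?:,(\d+))?\))?"        # optional repeat: (n) or (n,m)
    r"(>?)"                             # optional C-terminal anchor
)


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            regex = "."                       # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"   # any residue except those listed
        else:
            regex = core                      # single residue or [allowed set]
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + regex + ("$" if end else ""))
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
hit = re.search(regex, "MKNGSAL")  # made-up test sequence
print(regex, hit.group(), hit.start())
```

Only the elements listed above are handled; anything else raises a `ValueError` rather than being guessed at.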
PRINTS
- In the PRINTS database, a protein family is represented by a signature or fingerprint. The fingerprint is a consensus description of several conserved motifs within the sequences of a particular protein family.
Blocks
- In this database, the motifs or blocks are created by automatically detecting the most highly conserved regions of each protein family.
Pfam
- This database is maintained by the Sanger Centre. It is a collection of Hidden Markov Models (HMMs) for protein domains. An HMM is a statistically based mathematical treatment consisting of a linear chain of match, delete and insert states that encodes the conserved regions within an aligned family.
Profiles
- In this database, a protein family is characterized by a profile. The profile indicates the positions where insertions and deletions (INDELs) are allowed and where the conserved regions lie. These profiles are also known as weight matrices.
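A heavily simplified sketch of a weight matrix follows. It builds per-position log-odds scores from a gap-free alignment and scores a candidate segment against them. The pseudocount, the uniform background frequency, the function names and the toy alignment are all assumptions made for illustration; real profiles also carry position-specific gap penalties for the INDELs mentioned above, which this sketch leaves out.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(aligned, pseudocount=1.0):
    """Build per-position log-odds scores from equal-length aligned sequences."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for pos in range(len(aligned[0])):
        counts = Counter(seq[pos] for seq in aligned)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        matrix.append(column)
    return matrix


def score(matrix, segment):
    """Sum the position-specific scores of a segment of matching length."""
    return sum(column[aa] for column, aa in zip(matrix, segment))


# Toy alignment, made up for illustration.
alignment = ["ACDE", "ACDE", "ACDF", "GCDE"]
matrix = build_weight_matrix(alignment)
print(score(matrix, "ACDE") > score(matrix, "WWWW"))  # True
```

Residues that are common in a column receive positive scores and rare ones negative scores, so a segment resembling the family scores higher than an unrelated one.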
IDENTIFY
- This database is derived from BLOCKS and PRINTS. eMOTIF is the search software used with IDENTIFY to assess protein function.
2. Structural Databases
Structural Classification of Proteins (SCOP)
- This database is based on evolutionary studies of all proteins of known structure. The levels of hierarchical classification in SCOP are class, fold, superfamily, family, proteins and sequences. Here, the four major levels are described.
Class
- Proteins with similar secondary structures are said to belong to the same class. The classes are all-α proteins, all-β proteins, α/β, α+β (segregated α and β regions), multi-domain proteins, small proteins and membrane proteins.
Fold
- The fold describes the arrangement and topological connections of the major secondary structures in a protein. Proteins in the same fold category may not have a common evolutionary origin.
Super-family
- Proteins in the same superfamily show low sequence identity but probably share a common ancestor, because their structures and functions have common characteristics.
Family
- Members of the same family are clearly evolutionarily related and show 30% or more sequence identity.
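The identity threshold used at the family level can be checked with a short calculation. The sketch below computes percent identity between two sequences that have already been aligned, skipping gapped columns. The function name, the choice of denominator (aligned, ungapped columns) and the example sequences are assumptions made for illustration; other conventions for the denominator exist and give slightly different values.

```python
FAMILY_THRESHOLD = 30.0  # percent identity quoted for the family level


def percent_identity(seq_a, seq_b):
    """Percent identity over the ungapped columns of two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Made-up aligned sequences, for illustration only.
identity = percent_identity("MKTAYIAKQR", "MKSAYLAKHR")
print(identity, identity >= FAMILY_THRESHOLD)  # 70.0 True
```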
Class, Architecture, Topology, Homology (CATH)
- It is a hierarchical domain classification of the protein structures in the Protein Data Bank. Only structures with a resolution better than 3.0 angstroms are considered. There are four basic levels of classification.
Class
- This is determined according to the secondary structure composition and packing. Four major classes are recognized: (i) mainly α, (ii) mainly β, (iii) α-β (α/β and α+β) and (iv) low secondary structure content.
Architecture
- This describes the overall shape of the domain structure, ignoring the connectivities between the secondary structures, e.g. barrel, sandwich, etc.
Topology
- Proteins are grouped into fold families depending on both the overall shape and the connectivity of the secondary structures. Proteins that show 60% or more identity share the same topology.
Homology
- This indicates that the structures share a common ancestor and show high structural and functional similarity.
PDBsum
- It provides summaries and analysis of all structures in the protein data bank.
3. 3D Structure Databases
Protein Data Bank (PDB)
Molecular Modelling Database (MMDB)
- NCBI has two 3D structure databases: MMDB and CDD (Conserved Domain Database). MMDB is the retrieval version of the PDB structures and contains experimentally determined biomolecular structures.
Conserved Domain Database (CDD)
- This provides a directory of the sequence and structure alignments that represent conserved functional domains within proteins. Conserved domains (CDs) are displayed in MMDB structure summaries and link to a sequence alignment.
4. Literature Databases
- PubMed Central (PMC) is a literature database that gives researchers access to full-text articles and journals.
5. Gene Expression Databases
- These databases can be described under the following sub-headings.
SAGEmap
- The repository of serial analysis of gene expression (SAGE) data.
GEO
- The repository and retrieval system for any high-throughput gene expression data.
GENSAT
- A database of data on the mouse central nervous system, produced by the National Institute of Neurological Disorders and Stroke, USA.
Probe
- This database contains entries for probe sequences. The entries indicate the intended experimental applications and include the experimental results generated by using the probe.
6. Chemical Databases
- PubChem is NCBI's popular chemical database. It contains structural, chemical and biological data on small molecules and on their diagnostic and therapeutic applications.