Online Biotech Notes

Biological Databases

Databases are effectively electronic filing cabinets: a convenient and efficient way to store biological data in an electronic format. The common types of biological data are

Nucleotide sequences (genes and genomes)
Protein sequences
Macromolecular structures
Metabolic pathways
Gene expression data
Protein-protein interactions
Literature

The amount of biological data is increasing exponentially because of various sequencing projects and new technologies that generate large-scale genomic and proteomic datasets.

Probably the first published biological sequence database was the Atlas of Protein Sequence and Structure (1965) by Margaret Dayhoff et al.

Characteristics of Biological Databases

Biological databases each use their own standardized formats, but all of them share two main features:

Non-Redundancy

Each entry in the database occurs only once, i.e. duplication of an entry is not allowed within the same database.
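The non-redundancy rule can be illustrated with a minimal Python sketch. It assumes that entries are simple (identifier, sequence) pairs and that two entries count as duplicates when their sequences are identical ignoring case; real databases use richer criteria, and the identifiers here are hypothetical.

```python
def deduplicate(entries):
    """Keep the first entry seen for each distinct sequence.

    `entries` is an iterable of (identifier, sequence) pairs.
    """
    seen = set()
    unique = []
    for identifier, sequence in entries:
        key = sequence.upper()  # compare sequences case-insensitively
        if key not in seen:
            seen.add(key)
            unique.append((identifier, sequence))
    return unique


# Hypothetical entries: entry2 repeats the sequence of entry1.
records = [
    ("entry1", "ATGGCC"),
    ("entry2", "atggcc"),
    ("entry3", "ATGTTT"),
]
print(deduplicate(records))
```

Only `entry1` and `entry3` survive, because `entry2` carries the same sequence as `entry1`.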

Data Sharing

Biological data in databases are shared for the scientific community's examination and inspection.

So, we can define a biological database as a collection of data that is structured, searchable, updated periodically and cross-referenced.

Classification of Databases

Classification of Biological Databases

There are many different database types, depending both on the nature of the biological data and on how the data are stored. Here, we will discuss only the classification based on the nature of the information being stored.

1. Sequence Databases

Primary Sequence Databases

Primary databases are repositories for raw sequence data generated through laboratory experiments. They can be described in two subsections:

1. Nucleotide Database

The sequences are collected from published sources or by direct author submission. There are three major nucleotide sequence databases.

(a) European Molecular Biology Laboratory (EMBL)

It is maintained by the EBI. The Sequence Retrieval System (SRS) is used to retrieve information, and it links the sequence databases with other databases.

(b) DNA DataBank of Japan (DDBJ)

It is maintained by the National Institute of Genetics. Sequences may be submitted from all over the world through a web-based data submission tool.

(c) Genetic Sequence DataBank (GenBank)

This is maintained by the National Center for Biotechnology Information (NCBI).

Entrez, an integrated retrieval system, is used to retrieve information from GenBank. Each entry in GenBank follows a flat-file format. The three nucleotide databases work collaboratively and exchange data among themselves on a daily basis.
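To make the idea of a flat-file entry concrete, here is a minimal Python sketch that reads a simplified GenBank-style record. The record text is a hypothetical, hand-written example (the accession `DEMO0001` is not a real entry), and the parser handles only single-line keyword fields, the `ORIGIN` sequence block and the `//` terminator; continuation lines and feature tables are ignored.

```python
# Hypothetical, heavily simplified GenBank-style record.
RECORD = """\
LOCUS       DEMO0001      12 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical demonstration record.
ACCESSION   DEMO0001
ORIGIN
        1 atggccattg ta
//
"""


def parse_flat_file(text):
    """Return a dict of keyword fields plus the sequence of one record."""
    fields = {}
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_sequence:
            # Sequence lines: a position number followed by blocks of bases.
            sequence_parts.extend(line.split()[1:])
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line and not line[0].isspace():
            # Keyword line: keyword, spaces, value.
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    fields["SEQUENCE"] = "".join(sequence_parts).upper()
    return fields


entry = parse_flat_file(RECORD)
print(entry["ACCESSION"], entry["SEQUENCE"])
```

For real records a dedicated parser is the safer choice; this sketch only shows why a line-oriented, keyword-prefixed layout is easy to process.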

2. Protein Sequence Databases

(a) Protein Information Resources (PIR)

This database was developed by the National Biomedical Research Foundation. It is divided into four sub-sections:

(i) PIR1 It contains fully classified and annotated entries.

(ii) PIR2 It includes preliminary entries and may contain redundancy.

(iii) PIR3 It contains unverified entries.

(iv) PIR4 It may contain conceptual translations of artefactual sequences, genetically engineered sequences and sequences that are not transcribed or translated.

(b) SWISS-PROT

It was produced collaboratively by the Department of Medical Biochemistry at the University of Geneva and EMBL. It is now maintained by the Swiss Institute of Bioinformatics (SIB). SWISS-PROT has minimal redundancy and a high level of annotation.

(c) Translated EMBL (TrEMBL)

The database contains translations of all coding sequences (CDS) in EMBL. It is divided into two sections.

(i) SWISS-PROT TrEMBL (SP-TrEMBL)

It contains entries that will eventually be incorporated into SWISS-PROT.

(ii) Remaining TrEMBL (REM-TrEMBL)

It contains entries that are not incorporated into SWISS-PROT.

NRL-3D

This database is produced by PIR. The ATLAS retrieval system is used to access its information.
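TrEMBL, described above, is built from conceptual translations of coding sequences. The following Python sketch shows what such a translation involves, assuming the standard genetic code, a CDS that starts in frame, and translation that stops at the first stop codon. The input sequence is a made-up example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate an in-frame coding sequence up to the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCATTGTAATGGGCCGCTGA"))
```

Real pipelines also have to deal with alternative genetic codes, partial codons and annotated start positions, which this sketch leaves out.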

Martinsried Institute for Protein Sequences (MIPS)

This database is distributed with PATCHX, and access to it is provided through its web server.

Composite Databases

A composite database amalgamates a variety of different primary database sources, which obviates the need to search multiple resources.

Secondary Sequence Database

The primary databases are the source of data for the secondary databases. Each secondary database is different and has its own format. The information stored in each secondary database concerns conserved regions obtained through multiple sequence alignment.

PROSITE

This database is maintained by the Swiss Institute of Bioinformatics. A protein family can be characterized by a single conserved motif that is responsible for key biological functions. Such motifs are represented as regular expressions.

A regular expression is the consensus description of a motif, e.g. C-T-X2-1161-C-RMS1. Within square brackets, any of the listed residues can be placed at that position.
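PROSITE-style patterns map closely onto ordinary regular expressions. The Python sketch below converts a pattern written in the standard PROSITE syntax (`x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` for repeats) into a Python regex. The pattern `C-x(2)-[GS]-{P}-C` and the test sequence are hypothetical examples chosen for illustration, not real PROSITE entries, and anchors placed inside square brackets are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")  # patterns may end with a period
    regex_parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2) -> x{2}
        element = element.replace("x", ".").replace("X", ".")   # any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        regex_parts.append(element)
    return "".join(regex_parts)


# Hypothetical pattern and sequence, for illustration only.
regex = prosite_to_regex("C-x(2)-[GS]-{P}-C")
match = re.search(regex, "MKCAAGLCQ")
print(regex, match.group(), match.start())
```

The converted expression is `C.{2}[GS][^P]C`, which matches the stretch `CAAGLC` starting at index 2 of the test sequence.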

PRINTS

In the PRINTS database, a protein family is represented by a signature or fingerprint. The fingerprint is a consensus description of several conserved motifs within the sequences of a particular protein family.

Blocks

In this database, the motifs or blocks are created by automatically detecting the most highly conserved regions of each protein family.

Pfam

This database is maintained by the Sanger Centre. It holds a collection of Hidden Markov Models (HMMs) for protein domains. An HMM is a statistically based mathematical treatment consisting of linear chains of match, delete and insertion states, which encode the conserved regions within an aligned family.

Profiles

In this database, a protein family is characterized by a profile. The profile indicates the positions at which insertions and deletions (INDELs) are allowed and where the conserved regions lie. These profiles are also known as weight matrices.
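As a rough illustration of how a weight matrix is used, the Python sketch below scores every window of a sequence against a tiny position-specific matrix and reports the best-scoring position. The matrix values, the mismatch penalty and the test sequence are all invented for the example, and insertions and deletions are not modelled, so this is a simplification of a real profile.

```python
# Hypothetical three-column weight matrix: one dict of residue scores per position.
PROFILE = [
    {"C": 2.0},
    {"A": 1.0, "G": 1.0},
    {"H": 2.0},
]
MISMATCH = -1.0  # assumed score for residues not listed in a column


def score_window(window, profile=PROFILE):
    """Sum the per-position scores of one ungapped window."""
    return sum(column.get(residue, MISMATCH)
               for column, residue in zip(profile, window))


def best_hit(sequence, profile=PROFILE):
    """Return (score, start) of the best window.

    Assumes the sequence is at least as long as the profile.
    """
    width = len(profile)
    scores = [
        (score_window(sequence[i:i + width], profile), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores)


print(best_hit("MKCGHLL"))
```

Here the window `CGH` at index 2 scores highest, because each of its residues is favoured at the corresponding profile position.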

IDENTIFY

This database is derived from BLOCKS and PRINTS. In IDENTIFY, eMOTIF is used as the search software to access protein function information.

2. Structural Databases

We can classify proteins on the basis of structure, as many proteins share structural similarities. During evolution, protein function sometimes remains the same while the secondary structural environment varies. To understand the relationships between structure and sequence, a variety of structural classifications have been developed. Some classification schemes are as follows:

Structural Classification of Proteins (SCoP)

This database is based on evolutionary studies of all proteins of known structure. The levels of hierarchical classification in SCoP are class, fold, superfamily, family, proteins and sequences. Here, the four major levels are described.

Class

Proteins with similar secondary structures are said to belong to the same class. The classes are all-α proteins, all-β, α/β, α+β (segregated α and β regions), multi-domain, small proteins and membrane proteins.

Fold

The fold describes the arrangement and topological connections of the major secondary structures in a protein. Proteins in the same fold category may not have a common evolutionary origin.

Super-family

In the same superfamily, proteins show low sequence identity, but their structures and functions share common characteristics that point to a common ancestor.

Family

Members of the same family are clearly evolutionarily related and show 30% or more sequence identity.
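Thresholds such as the 30% identity mentioned above are computed from an alignment. The Python sketch below shows one common way to do this, assuming two sequences that are already aligned, use `-` for gaps, and are compared only at positions where neither has a gap. Other conventions for the denominator exist and give different percentages. The aligned sequences are invented for the example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":  # skip columns containing a gap
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned pair: 5 identical residues out of 6 ungapped columns.
identity = percent_identity("MKT-AYIA", "MRTGAY-A")
print(round(identity, 1), identity >= 30.0)
```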

Class, Architecture, Topology, Homology (CATH)

It is a hierarchical domain classification of protein structures in the Protein Data Bank. Only structures with a resolution better than 3.0 angstroms are considered. There are four basic levels of classification.

Class

This is determined according to the secondary structure composition and packing. Four major classes are recognized: (i) mainly α, (ii) mainly β, (iii) α-β (α/β and α+β) and (iv) structures with low secondary structure content.

Architecture

This describes the overall shape of the domain structure, ignoring the connectivities between secondary structures, e.g. barrel, sandwich, etc.

Topology

The proteins are grouped into fold families depending on both the overall shape and the connectivity of the secondary structures. Proteins that show 60% or more identity share the same topology.

Homology

This indicates that the structures share a common ancestor and show high structural and functional similarity.

PDBsum

It provides summaries and analyses of all structures in the Protein Data Bank.

3. 3D Structure Databases

Protein Data Bank (PDB)

The PDB was established at Brookhaven National Laboratory, USA, and is now maintained by the Research Collaboratory for Structural Bioinformatics (RCSB). It is the repository for macromolecular structures derived experimentally by X-ray crystallography, NMR, neutron diffraction and cryo-electron microscopy.

The AutoDep Input Tool (ADIT) is used to deposit structures in the PDB. It checks the coordinate format and runs validation tests on a structure prior to deposition.

Molecular Modelling Database (MMDB)

NCBI has two 3D structure databases: MMDB and CDD (Conserved Domain Database). MMDB is the retrieval version of PDB structures, containing experimentally determined biomolecular structures.

Conserved Domain Database (CDD)

This provides a directory of the sequence and structure alignments representing conserved functional domains within proteins. Conserved domains (CDs) are displayed in MMDB structure summaries and link to a sequence alignment.

4. Literature Databases

PubMed Central (PMC) is the literature database that provides researchers with access to full-text articles and journals.

5. Gene Expression Databases

These databases can be explained under the following sub-headings:

SAGEmap

The repository of serial analysis of gene expression (SAGE) data.

GEO

The repository and retrieval system for any high-throughput gene expression data.

GENSAT

A database of gene expression data for the mouse central nervous system, produced by the National Institute of Neurological Disorders and Stroke, USA.

Probe

This database holds entries for probe sequences. Each entry indicates the intended experimental applications and includes the experimental results generated by using the probe.

6. Chemical Databases

PubChem is the popular chemical database of NCBI. It contains structural, chemical and biological data on small molecules and their diagnostic and therapeutic applications.

Other Databases

dbEST - This database contains information about Expressed Sequence Tags (ESTs).

UniGene - A database of ESTs with a focus on human sequences.

SGD - The genome database for different strains of yeast.

THANKYOU!!!


Bhanu prakash
