Abstract
Background: The mathematical foundation of information theory in communication engineering was laid by Claude Shannon in 1948. Since then, information theory has been used to investigate various information-carrying systems, including biomolecules such as DNA and proteins.
Objective: In this study, a measure for estimating the structural information content of proteomes is proposed. The primary-structure feature considered is the sequence length organization of the proteins in a proteome, as opposed to the amino acid order in individual protein sequences.
Method: We analyzed and compared the information content estimates of a representative set of ten proteomes for three cases: measured, model-predicted (linguistic distribution model), and simulated (random sequence length).
Results: Excellent agreement was observed between the measured and model-predicted information contents of the proteomes. The overall average information per proteomic protein was 8 bits for the measured and model-predicted data and 7 bits for the simulated proteomic collection data.
Conclusion: The study indicates that biological interaction mechanisms may rely primarily on the number of amino acids rather than on the amino acid order of an interaction-initiating protein sequence. The approach presented here may serve as a practical tool for studying and comparing biological processes taking place in an organism or in a collection of organisms, and is expected to be useful for exploring proteomic information characteristics at other levels of the structural hierarchy, such as secondary and tertiary structure.
Keywords: Proteomic message source, information theory, protein length distribution, Menzerath-Altmann distribution model, biomolecules, linguistic distribution model.
Graphical Abstract