Introduction

Traumatic brain injury (TBI) is a heterogeneous condition resulting from an external force on the head, causing brain damage and impairing cognitive, physical, and emotional functions. TBI is a significant cause of mortality and morbidity worldwide, particularly among young and elderly populations. Symptoms vary depending on the severity and location of the injury and may include headache, dizziness, confusion, memory loss, personality changes, and loss of consciousness (Dadas et al. 2018). TBI can also lead to chronic neurological and cognitive disorders, such as epilepsy, Parkinson’s disease, and Alzheimer’s disease (Smith et al. 2013).

TBI diagnosis is based on clinical assessment and neuroimaging modalities, such as computed tomography and magnetic resonance imaging (Cheema et al. 2024). Treatment strategies involve pharmacological, surgical, rehabilitative, and psychological interventions (Maas et al. 2008). Preventative measures include wearing protective equipment, using seatbelts, and enforcing safety regulations (Langlois et al. 2006).

Biomarkers are biological indicators that can be measured to diagnose, monitor, and predict the outcome of TBI. They provide objective and specific information regarding the extent and nature of brain damage, as well as responses to treatment and recovery (Mondello et al. 2021). Biomarker discovery for TBI relies on various approaches and technologies, such as proteomics, transcriptomics, and metabolomics, which enable the analysis of molecular changes in the brain following injury (Zetterberg and Blennow 2016). Several protein biomarkers have been proposed for TBI, including S-100 calcium-binding protein B (S-100B), neuron-specific enolase, tau, and glial fibrillary acidic protein (GFAP), each reflecting different aspects of brain injury and recovery (Papa et al. 2016).

Ubiquitin is a regulatory protein found in all cells of the body. Ubiquitin C-terminal hydrolase L1 (UCH-L1), a deubiquitinating enzyme, is primarily located in central neurons and the neuroendocrine system but has also been detected in the testis, ovaries, and kidneys (Zetterberg et al. 2010). GFAP, a member of the intermediate filament family of cytoskeletal proteins, provides structural support to neuroglia. Neuroglia help maintain homeostasis, form myelin, and protect neurons in both the peripheral and central nervous systems. GFAP has also been detected in other cell types outside the central nervous system, including Schwann cells, myoepithelial cells, chondrocytes, fibroblasts, and lymphocytes (Posti et al. 2016).

GFAP and UCH-L1 are frequently used together in mild TBI (m-TBI) biomarker analysis to measure the different cell types potentially affected by injury. UCH-L1 is associated with more diffuse brain injuries, whereas GFAP is typically elevated in focal injuries (Papa et al. 2012). The UCH-L1 and GFAP proteins are measured and reported separately, and both results are needed to obtain a final Brain Trauma Indicator (BTI) result. A BTI is reported as “positive” if either or both UCH-L1 and GFAP levels exceed the predetermined cutoff (Mitchell et al. 2020).
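
The reporting rule described above can be written as a short decision function. The following is a minimal sketch in Python, not the logic of any validated assay: the function name, the argument names, and the assumption that both markers are reported in pg/mL are hypothetical, and the cutoffs must be supplied by the caller.

```python
def bti_result(uchl1_pg_ml: float, gfap_pg_ml: float,
               uchl1_cutoff: float, gfap_cutoff: float) -> str:
    """Combine two separately measured markers into one BTI call.

    Hypothetical helper: the result is "positive" when either marker,
    or both, exceeds its own predetermined cutoff.
    """
    if uchl1_pg_ml > uchl1_cutoff or gfap_pg_ml > gfap_cutoff:
        return "positive"
    return "negative"
```

Because the rule is a logical OR, a single elevated marker is enough for a positive call, which is why both measurements are required before a negative result can be reported.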

S-100B, a calcium-binding protein primarily produced by astrocytes, serves as a biomarker for neural distress and plays a dual role in brain function (Michetti et al. 2018). At low concentrations, it promotes neuronal survival and astrocyte proliferation, whereas at high levels, it induces inflammation and neuronal death (Rothermundt et al. 2003; Sorci et al. 2010). S-100B is involved in various neurological disorders, including acute brain injury, neurodegenerative diseases, and psychiatric conditions (Michetti et al. 2018). Although often considered a brain-specific marker, S-100B is also synthesized in other tissues (Gayger-Dias et al. 2023). The protein’s ability to cross the blood-brain barrier remains debated, with recent research emphasizing the role of the glymphatic system in S-100B clearance (Gayger-Dias et al. 2023). S-100B has diverse functions, including the regulation of protein phosphorylation, energy metabolism, and cell proliferation (Sorci et al. 2010). Its levels in biological fluids are used to monitor disease progression; however, its broad involvement reduces specificity (Michetti et al. 2018).

Proteomics and bioinformatics are particularly useful for identifying and validating protein biomarkers for TBI, as proteins play a crucial role in brain function and pathology (Kobeissy et al. 2008). Protein structure prediction is a fundamental aspect of computational biology and bioinformatics, aiming to determine the three-dimensional structure of a protein from its amino acid sequence. This field has seen significant advancements with the integration of conventional computational methods and deep learning techniques.

Traditional approaches to protein structure prediction often involve comparative modeling, in which the structure of an unknown protein is inferred based on its similarity to one or more known protein structures. These methods rely heavily on the availability of homologous protein sequences in databases (Jisna and Jayaraj 2021). In recent years, deep learning has revolutionized protein structure prediction. Techniques such as convolutional neural networks and recurrent neural networks have been employed to extract complex features from protein sequences, leading to more accurate predictions.

Accurate protein structure prediction is crucial for various applications, including drug discovery, antibody design, and understanding protein–protein interactions. As the field continues to evolve, computational methods are expected to become even more integral to biological research and medicine (Jisna and Jayaraj 2021).

This study aims to discover biomarkers for TBI using an integrative approach that combines proteomics and bioinformatics. Additionally, it employs systematic in silico prediction and analysis of novel biomarker proteins to interpret the structural and functional correlations between known and newly determined protein structures. These findings could be effectively used in further studies as potential candidates for drug targeting.

Materials and methods

Our methodology for analyzing traumatic brain injury biomarker proteins included predicting conserved regions, domains, secondary structures, three-dimensional structures, post-translational modification (PTM) sites, signatures, and motifs.

Conserved regions

Multiple sequence alignments of GFAP (NP_002046.1), S-100B (NP_006263.1), and UCH-L1 (NP_004172.2) were performed using BioEdit 7.2 software (Hall et al. 2001) to extract conserved regions through hidden Markov model (HMM) profile-profile algorithms and seeded guide trees. BioEdit 7.2 is a biological sequence alignment editor that provides basic editing, alignment, manipulation, and analysis functions for protein sequences.

Molecular evolutionary and phylogenetic analysis

The evolutionary history was inferred using the Neighbor-Joining approach. The maximum likelihood method, which favors the tree under which the observed amino acid sequences are most probable, was used to determine the topology and branch lengths of the phylogenetic tree.
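
Distance-based methods such as Neighbor-Joining start from a matrix of pairwise distances between aligned sequences. The sketch below computes the simplest such distance, the proportion of differing sites (p-distance), for every pair in an alignment. It is an illustration only: the function names are hypothetical, gaps are assumed to be written as "-", and model-corrected distances such as those under the JTT model are not reproduced here.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns in which either sequence has a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(aligned: dict) -> dict:
    """Pairwise p-distances keyed by (name_1, name_2), names sorted."""
    return {
        (n1, n2): p_distance(aligned[n1], aligned[n2])
        for n1, n2 in combinations(sorted(aligned), 2)
    }
```

A tree-building method then joins the closest pairs step by step, whereas maximum likelihood scores whole candidate trees against the alignment.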

MEGA11 (Tamura et al. 2021) represents a significant advancement in computational molecular evolution. It offers a comprehensive suite of tools for constructing time trees of species, pathogens, and gene families, employing rapid relaxed-clock methods to estimate divergence times and confidence intervals. The software has been enhanced with new features, including a Bayesian method for estimating the neutral evolutionary probabilities of alleles using multispecies sequence alignments and a machine learning approach to test for the autocorrelation of evolutionary rates in phylogenies.

Domain separation

Domain separation is the first step in predicting a three-dimensional protein structure. The NCBI Conserved Domains Database (CDD) (Lu et al. 2020) is a freely accessible tool for annotating sequences with the positions of conserved protein domain footprints, functional sites, and motifs deduced from these footprints.

ThreaDom has been the top prediction server for protein domains in CASP12, CASP13, CASP14, and CASP15. ThreaDomEx, which integrates ThreaDom and DomEx, provides more precise predictions (Wang et al. 2017). ProDom is a comprehensive database of protein domain families derived from a global comparison of protein sequences (Bru et al. 2005). NCBI CD-Search queries the Conserved Domain Database to annotate domains on a query sequence (Marchler-Bauer et al. 2015).

Secondary structure prediction

Several servers have been utilized for secondary structure prediction, including PredictProtein (Qiu et al. 2020), a meta-service that provides predictions of structural and functional features of proteins, such as secondary structure, solvent accessibility, transmembrane helices, coiled coils, disulfide bonds, and disorder regions. JPred (Drozdetskiy et al. 2015) employs the Jnet algorithm, one of the most accurate methods for secondary structure prediction. PredictProtein and JPred were used to analyze the exposed and buried regions of GFAP, S-100B, and UCH-L1 proteins.

RaptorX, a deep learning-based method, has achieved state-of-the-art performance in contact prediction in CASP12 and CASP13. Other methods, such as PSIPRED, SOPMA, Porter, YASPIN, and PROTEUS, use different neural network architectures and input features to predict secondary structure elements (alpha helices, beta strands, and coils) with varying accuracy depending on sequence quality and protein size.

Three-dimensional (3-D) structure prediction

Protein structure prediction, a key area in computational biology, involves homology modeling, fold recognition, and ab initio methods. Various servers, including I-TASSER (Zhou et al. 2022), SWISS-MODEL (Waterhouse et al. 2018), Phyre2 (Kelley et al. 2015), and GalaxyWEB (Ko et al. 2012), have been developed for these techniques.

I-TASSER, a top-performing platform in CASP7–CASP14 assessments, uses iterative simulations for full-length atomic model construction. SWISS-MODEL (Waterhouse et al. 2018), a dedicated service for protein structure homology modeling, provides access to a vast collection of experimentally determined protein structures. The Robetta server (Kim et al. 2004) offers automated methods for protein structure analysis and prediction. DeepMind’s AlphaFold (Jumper et al. 2020), the winner of the CASP13 competition, accurately predicts protein structures from amino acid sequences.

Model refinement

Web-based tools such as DeepRefiner (Shuvo et al. 2021), GalaxyRefine (Heo et al. 2013), ModRefiner (Xu and Zhang 2011), and 3Drefine (Bhattacharya et al. 2016) refine protein structures using energy minimization and molecular dynamics techniques. These tools enhance both global and local structural features of initial protein models. The refinement process involves optimizing the hydrogen bonding network and applying composite physics- and knowledge-based force fields for atomic-level energy minimization (Feig and Mirjalili 2015). The refined protein structures can be used for various downstream analyses.

Model evaluation

Large-scale model quality assessment (QA) techniques are employed alongside model clustering to rank and select protein structural models. Various metrics, such as GDT-TS, GDT-HA, TM-score, Z-score, MolProbity (MP) score, QMEAN score, projected absolute model quality Z-score, clash score, and root-mean-square deviation (RMSD), are used to evaluate refinement category predictions. These metrics assess model quality aspects, including the overall fold, interatomic contact distributions, and dihedral angle distributions.
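
Two of the metrics listed above, RMSD and TM-score, reduce to short formulas once a model has been superimposed on a reference. The sketch below is a simplified illustration under stated assumptions: the coordinates are taken to be already superimposed and paired residue by residue, the function names are hypothetical, and the lower bound of 0.5 Å on the distance scale d0 is an assumption of this sketch. A full TM-score calculation also searches for the superposition that maximizes the score, which is not done here.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates.

    Assumes the two structures are already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: per-residue distances (in angstroms) of the aligned pairs.
    target_length: number of residues in the reference protein.
    """
    if target_length > 15:
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5
    d0 = max(d0, 0.5)  # assumed floor to keep the scale positive
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length
```

RMSD weights every residue pair equally, so a few badly placed loops can dominate it, whereas the TM-score down-weights large deviations and is normalized by the length of the reference protein.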

The efficacy of automated protein structure prediction methods for GFAP, S-100B, and UCH-L1 was assessed using servers such as GalaxyRefine (Heo et al. 2013), ModRefiner (Xu and Zhang 2011), ProQ (Protein Quality Predictor) (Benkert et al. 2011), ProSA-web (Wiederstein and Sippl 2007), the Ramachandran Plot Server (Kleywegt and Jones 1996), the QMEAN Server for Model Quality Estimation (Studer et al. 2020), TM-Score (Zhang and Skolnick 2004), and SAVES v6.0 (Hooft et al. 1996), a multiprogram suite that includes ERRAT (Colovos and Yeates 1993), VERIFY 3D (Lüthy et al. 1992), PROVE (Pontius et al. 1996), PROCHECK (Laskowski et al. 1993), and WHATCHECK (Hooft et al. 1996). Additionally, TM-align (Zhang and Skolnick 2005) was used for structural alignment.

Functional motifs prediction

Motifs and fingerprints are instrumental in identifying distant sequence relationships and facilitating protein–protein interactions (PPI). The PROSITE web server (De Castro et al. 2006; Sigrist et al. 2012), including its enhanced version ScanProsite, was used to match regular expressions with a query sequence. The SMART (Letunic et al. 2021) web server stores sequence information from multiple sequence alignments and represents it using probabilistic models, such as Position-Specific Scoring Matrices (PSSMs), profiles, or HMMs. Several servers like MotifScan (Shao et al. 2012), MotifFinder, InterPro (Mitchell et al. 2015), and Superfamily (Wilson et al. 2009), and visualization tools like CDvist (Adebali et al. 2015) aid in identifying and interpreting functional motifs within the protein.
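
Matching a PROSITE-style pattern against a query sequence amounts to translating the pattern into a regular expression. The sketch below shows one way to do this in Python for the common pattern elements; it is not the ScanProsite implementation, the function names are hypothetical, and rarer syntax (for example anchors placed inside square brackets) is not handled. The example pattern N-{P}-[ST]-{P} is the N-glycosylation site pattern, and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles: '-' separators, x (any residue), [..] allowed residues,
    {..} forbidden residues, (n) and (n,m) repeats, '<' and '>' anchors.
    """
    regex = pattern.rstrip(".").replace("-", "")
    # Forbidden residues first, so the braces added for repeats are untouched.
    regex = regex.replace("{", "[^").replace("}", "]")
    regex = regex.replace("(", "{").replace(")", "}")
    regex = regex.replace("x", ".")
    regex = regex.replace("<", "^").replace(">", "$")
    return regex


def scan(sequence: str, pattern: str):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    regex = prosite_to_regex(pattern)
    # A lookahead lets overlapping matches be reported.
    return [(m.start() + 1, m.group(1))
            for m in re.finditer("(?=(" + regex + "))", sequence)]
```

For instance, `scan("MNGTANPSQ", "N-{P}-[ST]-{P}")` reports the site starting at residue 2 and skips the asparagine at residue 6, because it is followed by a proline.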

Structural classification

The InterPro database (Mitchell et al. 2015) classifies protein sequences into families and identifies significant domains and conserved regions. InterProScan checks sequences against InterPro’s signatures, which are prediction models defining protein families, domains, or functional sites. Protein structural domains are classified in the SCOP database (Andreeva et al. 2014) based on their structures and amino acid sequences. Databases such as CATH (Sillitoe et al. 2021) and PIR (Wu et al. 2003) predict protein function based on structural features, while Superfamily (Wilson et al. 2009) provides annotation and classification of protein domains and families. CATH (Sillitoe et al. 2021) recognizes domains in protein structures from the wwPDB and groups them into evolutionary superfamilies.

Pathway and systems biology analysis

To elucidate the functional relationships between GFAP, S-100B, and UCH-L1 in TBI, we conducted a structured bioinformatics analysis using the STRING database (version 11.5) (Szklarczyk et al. 2019). The proteins GFAP (ENSP00000253408), S-100B (ENSP00000291700), and UCH-L1 (ENSP00000284440) were queried using their Ensembl identifiers to construct a PPI network. Interactions were predicted using STRING’s default parameters, including a medium confidence threshold (score ≥ 0.4), and integrated evidence from co-expression, experimental datasets, and text mining. Functional enrichment analysis was performed to identify associations with TBI-related pathways, such as neuroinflammation and ubiquitination, using Gene Ontology (GO), Reactome, and WikiPathways annotations. The network topology and interaction scores were visualized using coordinates provided in the STRING output files, and all raw data were cross-validated for consistency.
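
The medium confidence threshold can be pictured as a simple filter over scored protein pairs. The sketch below is illustrative only: it assumes each interaction has already been exported as a (protein, protein, combined score) tuple with scores on a 0 to 1 scale, the function names are hypothetical, and the scores in any example are placeholders rather than values from the STRING output of this study.

```python
from collections import Counter


def filter_interactions(edges, min_score=0.4):
    """Keep interactions whose combined score meets the threshold.

    edges: iterable of (protein_a, protein_b, score) with score in [0, 1].
    """
    return [(a, b, s) for a, b, s in edges if s >= min_score]


def node_degrees(edges):
    """Count how many retained interactions each protein takes part in."""
    degrees = Counter()
    for a, b, _ in edges:
        degrees[a] += 1
        degrees[b] += 1
    return dict(degrees)
```

Raising `min_score` trades network coverage for confidence: fewer edges survive, but each is supported by stronger combined evidence.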

Results and discussion

The UniProt Knowledgebase (UniProtKB) was used to retrieve the amino acid sequences of three biomarker proteins: GFAP (accession number NP_002046.1), UCH-L1 (accession number NP_004172.2), and S-100B (accession number NP_006263.1). These proteins were then subjected to in silico prediction and three-dimensional structural analysis.

Prediction of the conserved regions of GFAP, S-100B, and UCH-L1

BioEdit 7.2 software was used to assess essential features and predict conserved regions in UCH-L1, S-100B, and GFAP, identifying 5, 1, and 5 conserved segments, respectively. The analysis highlighted significant similarity and crucial roles for these conserved regions, with minimum segment lengths of sixteen and maximum average entropy values of 0.0331 (Table 1).

Table 1

Predicted conserved region of UCH-L1, S-100B and GFAP protein of traumatic brain injury using BioEdit

| Proteins (a) | Region (b) | Position (c) | Consensus (d) | Segment length (e) | Average entropy (Hx) (f) |
|---|---|---|---|---|---|
| UCH-L1 | 1 | 61–92 | NFRKKQIEELKGQEVSPKVYFMKQTIGNSCGT | 32 | 0.0095 |
| | 2 | 108–123 | FEDGSVLKQFLSETEK | 16 | 0.0331 |
| | 3 | 125–144 | SPEDRAKCFEKNEAIQAAHD | 20 | 0.0000 |
| | 4 | 146–186 | VAQEGQCRVDDKVNFHFILFNNVDGHLYELDGRMPFPVNHG | 41 | 0.0074 |
| | 5 | 196–223 | DAAKVCREFTEREQGEVRFSAVALCKAA | 28 | 0.0000 |
| S-100B | 1 | 22–38 | EGDKHKLKKSELKELIN | 17 | 0.0199 |
| GFAP | 1 | 66–130 | GFKETRASERAEMMELNDRFASYIEKVRFLEQQNKALAAELNQLRAKEPTKLADVYQAELRELRL | 65 | 0.0094 |
| | 2 | 157–175 | RQKLQDETNLRLEAENNLA | 19 | 0.0338 |
| | 3 | 177–194 | YRQEADEATLARLDLERK | 18 | 0.0063 |
| | 4 | 251–276 | ASSNMHEAEEWYRSKFADLTDAAARN | 26 | 0.0234 |

a Proteins: The analyzed protein names.

b Region: The conserved sequence region identified within each protein.

c Position: The specific amino acid range where the conserved region is located.

d Consensus: The consensus sequence of the conserved region is based on multiple sequence alignments.

e Segment length: The number of amino acids in the conserved region.

f Average entropy (Hx): A measure of sequence variability within the conserved region, where lower entropy values indicate higher conservation
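
The average entropy column can be understood through a small calculation. The sketch below computes a Shannon entropy for each alignment column and reports runs of low-entropy columns as conserved segments. It is a simplified illustration, not BioEdit's algorithm: the use of the natural logarithm, the function names, the default thresholds, and the toy alignment are all assumptions of this sketch.

```python
import math
from collections import Counter


def column_entropy(column) -> float:
    """Shannon entropy (natural log) of the residues in one alignment column."""
    total = len(column)
    counts = Counter(column)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def conserved_segments(alignment, max_entropy=0.2, min_length=3):
    """Find runs of columns whose entropy stays at or below max_entropy.

    alignment: list of equally long aligned sequences.
    Returns (start, end, mean_entropy) tuples with 1-based inclusive positions.
    """
    entropies = [column_entropy(col) for col in zip(*alignment)]
    segments = []
    start = None
    # The trailing infinite value closes a run that reaches the last column.
    for i, h in enumerate(entropies + [float("inf")]):
        if h <= max_entropy:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                run = entropies[start:i]
                segments.append((start + 1, i, sum(run) / len(run)))
            start = None
    return segments
```

A column in which every sequence carries the same residue has an entropy of zero, so a segment with an average entropy of 0.0000 is identical across all aligned sequences.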

Molecular evolutionary and phylogenetic analysis

The Maximum Likelihood approach, based on the JTT matrix-based model, along with the Neighbor-Joining and UPGMA methods, was used to infer evolutionary history (Tables 2, 3, and 4). For GFAP, two primary groups were identified. Group A comprised primates, including Homo sapiens, Pan troglodytes, and Gorilla gorilla, demonstrating strong evolutionary conservation. Group B consisted of species from diverse orders, suggesting broader functional diversification (Figure 1). Similarly, the S-100B phylogenetic tree revealed two clusters. Group A included Homo sapiens, Macaca mulatta, and various rodents, indicating high functional conservation. Group B comprised a smaller but diverse set of species, highlighting the widespread distribution of S-100B across taxa (Figure 2). For UCH-L1, a highly conserved pattern was observed. Group A encompassed vertebrates such as Mesocricetus auratus, Peromyscus maniculatus bairdii, and Homo sapiens, underscoring its essential role in cellular processes. Notably, Homo sapiens clustered closely with Macaca fascicularis, reaffirming the evolutionary stability of UCH-L1 within primates. Group B, though smaller, demonstrated the presence of UCH-L1 across diverse species, reinforcing its fundamental biological importance (Figure 3).

Table 2

Maximum likelihood estimate of substitution matrix of GFAP protein of traumatic brain injury using MEGA 11

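| From \ To | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | - | 0.1407 | 0.1230 | 0.2198 | 0.0604 | 0.1185 | 0.3417 | 0.6737 | 0.0262 | 0.0985 | 0.1464 | 0.1139 | 0.0570 | 0.0290 | 0.5131 | 1.3742 | 1.3896 | 0.0063 | 0.0234 | 1.0058 |
| R | 0.2117 | - | 0.0994 | 0.0411 | 0.1071 | 0.6428 | 0.1020 | 0.5262 | 0.3822 | 0.0651 | 0.1757 | 2.0124 | 0.0523 | 0.0137 | 0.1860 | 0.3540 | 0.1971 | 0.0934 | 0.0394 | 0.0591 |
| N | 0.2226 | 0.1195 | - | 1.4767 | 0.0330 | 0.1638 | 0.1855 | 0.2999 | 0.4802 | 0.1340 | 0.0649 | 0.7811 | 0.0402 | 0.0155 | 0.0319 | 1.7910 | 0.7141 | 0.0021 | 0.1175 | 0.0567 |
| D | 0.3295 | 0.0410 | 1.2234 | - | 0.0111 | 0.1110 | 2.4877 | 0.4926 | 0.1229 | 0.0316 | 0.0290 | 0.0871 | 0.0231 | 0.0068 | 0.0333 | 0.2083 | 0.1289 | 0.0043 | 0.0760 | 0.1084 |
| C | 0.2287 | 0.2697 | 0.0690 | 0.0280 | - | 0.0194 | 0.0173 | 0.2114 | 0.0863 | 0.0410 | 0.0777 | 0.0151 | 0.0496 | 0.1424 | 0.0324 | 0.7616 | 0.1424 | 0.0820 | 0.3538 | 0.2136 |
| Q | 0.2216 | 0.7992 | 0.1694 | 0.1385 | 0.0096 | - | 1.0944 | 0.0895 | 0.6766 | 0.0213 | 0.3346 | 0.9143 | 0.0554 | 0.0096 | 0.4209 | 0.1939 | 0.1588 | 0.0128 | 0.0426 | 0.0618 |
| E | 0.4252 | 0.0843 | 0.1276 | 2.0650 | 0.0057 | 0.7278 | - | 0.4323 | 0.0291 | 0.0305 | 0.0461 | 0.5343 | 0.0213 | 0.0092 | 0.0503 | 0.1106 | 0.1006 | 0.0085 | 0.0106 | 0.1602 |
| G | 0.6936 | 0.3600 | 0.1706 | 0.3383 | 0.0575 | 0.0492 | 0.3576 | - | 0.0240 | 0.0147 | 0.0328 | 0.0833 | 0.0158 | 0.0106 | 0.0545 | 0.6631 | 0.0962 | 0.0405 | 0.0088 | 0.1618 |
| H | 0.0876 | 0.8493 | 0.8873 | 0.2742 | 0.0762 | 1.2091 | 0.0781 | 0.0781 | - | 0.0495 | 0.2552 | 0.1619 | 0.0400 | 0.0952 | 0.2990 | 0.2628 | 0.1447 | 0.0095 | 0.9787 | 0.0419 |
| I | 0.1440 | 0.0633 | 0.1082 | 0.0308 | 0.0158 | 0.0167 | 0.0358 | 0.0208 | 0.0216 | - | 1.1024 | 0.0624 | 0.5862 | 0.1632 | 0.0258 | 0.1432 | 0.7743 | 0.0100 | 0.0508 | 3.2788 |
| L | 0.1236 | 0.0986 | 0.0303 | 0.0163 | 0.0173 | 0.1510 | 0.0312 | 0.0269 | 0.0644 | 0.6365 | - | 0.0452 | 0.4682 | 0.5254 | 0.2779 | 0.2096 | 0.0827 | 0.0394 | 0.0404 | 0.6062 |
| K | 0.1472 | 1.7283 | 0.5579 | 0.0751 | 0.0052 | 0.6315 | 0.5550 | 0.1045 | 0.0626 | 0.0552 | 0.0692 | - | 0.0758 | 0.0052 | 0.0567 | 0.1678 | 0.2930 | 0.0066 | 0.0147 | 0.0427 |
| M | 0.1872 | 0.1142 | 0.0730 | 0.0505 | 0.0430 | 0.0973 | 0.0561 | 0.0505 | 0.0393 | 1.3176 | 1.8229 | 0.1928 | - | 0.0917 | 0.0430 | 0.1011 | 0.6420 | 0.0150 | 0.0318 | 1.0462 |
| F | 0.0551 | 0.0173 | 0.0162 | 0.0087 | 0.0714 | 0.0097 | 0.0141 | 0.0195 | 0.0541 | 0.2119 | 1.1819 | 0.0076 | 0.0530 | - | 0.0389 | 0.3341 | 0.0422 | 0.0400 | 0.9192 | 0.2044 |
| P | 0.7814 | 0.1882 | 0.0269 | 0.0338 | 0.0130 | 0.3426 | 0.0616 | 0.0807 | 0.1362 | 0.0269 | 0.5013 | 0.0668 | 0.0199 | 0.0312 | - | 0.9869 | 0.3573 | 0.0052 | 0.0191 | 0.0728 |
| S | 1.5495 | 0.2652 | 1.1161 | 0.1567 | 0.2267 | 0.1169 | 0.1002 | 0.7263 | 0.0886 | 0.1105 | 0.2800 | 0.1464 | 0.0347 | 0.1984 | 0.7308 | - | 1.4500 | 0.0231 | 0.1053 | 0.1406 |
| T | 1.8267 | 0.1722 | 0.5188 | 0.1130 | 0.0494 | 0.1115 | 0.1063 | 0.1228 | 0.0569 | 0.6962 | 0.1288 | 0.2980 | 0.2568 | 0.0292 | 0.3084 | 1.6904 | - | 0.0060 | 0.0337 | 0.3938 |
| W | 0.0337 | 0.3338 | 0.0061 | 0.0153 | 0.1164 | 0.0368 | 0.0368 | 0.2113 | 0.0153 | 0.0368 | 0.2511 | 0.0276 | 0.0245 | 0.1133 | 0.0184 | 0.1103 | 0.0245 | - | 0.1256 | 0.0827 |
| Y | 0.0556 | 0.0624 | 0.1546 | 0.1207 | 0.2224 | 0.0542 | 0.0203 | 0.0203 | 0.6969 | 0.0827 | 0.1139 | 0.0271 | 0.0231 | 1.1525 | 0.0298 | 0.2224 | 0.0610 | 0.0556 | - | 0.0569 |
| V | 1.1648 | 0.0455 | 0.0363 | 0.0838 | 0.0653 | 0.0383 | 0.1491 | 0.1820 | 0.0145 | 2.5974 | 0.8317 | 0.0383 | 0.3687 | 0.1247 | 0.0554 | 0.1444 | 0.3469 | 0.0178 | 0.0277 | - |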

[i] Each entry is the probability of substitution (r) from one amino acid (row) to another (column). Substitution patterns and rates were estimated under the Jones-Taylor-Thornton model (Jones et al. 1992). Relative values of instantaneous r should be considered when evaluating them. For simplicity, the sum of r values is made equal to 100. The amino acid frequencies are 7.69% (A), 5.11% (R), 4.25% (N), 5.13% (D), 2.03% (C), 4.11% (Q), 6.18% (E), 7.47% (G), 2.30% (H), 5.26% (I), 9.11% (L), 5.95% (K), 2.34% (M), 4.05% (F), 5.05% (P), 6.82% (S), 5.85% (T), 1.43% (W), 3.23% (Y), and 6.64% (V). For estimating ML values, a tree topology was automatically computed. The maximum Log-likelihood for this computation was –2493.154. This analysis involved 42 amino acid sequences. There was a total of 436 positions in the final dataset. Evolutionary analyses were conducted in MEGA11 (Tamura et al. 2021)

Table 3

Maximum likelihood estimate of substitution matrix of S-100B protein of traumatic brain injury using MEGA 11

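| From \ To | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | - | 0.1407 | 0.1230 | 0.2198 | 0.0604 | 0.1185 | 0.3417 | 0.6737 | 0.0262 | 0.0985 | 0.1464 | 0.1139 | 0.0570 | 0.0290 | 0.5131 | 1.3742 | 1.3896 | 0.0063 | 0.0234 | 1.0058 |
| R | 0.2117 | - | 0.0994 | 0.0411 | 0.1071 | 0.6428 | 0.1020 | 0.5262 | 0.3822 | 0.0651 | 0.1757 | 2.0124 | 0.0523 | 0.0137 | 0.1860 | 0.3540 | 0.1971 | 0.0934 | 0.0394 | 0.0591 |
| N | 0.2226 | 0.1195 | - | 1.4767 | 0.0330 | 0.1638 | 0.1855 | 0.2999 | 0.4802 | 0.1340 | 0.0649 | 0.7811 | 0.0402 | 0.0155 | 0.0319 | 1.7910 | 0.7141 | 0.0021 | 0.1175 | 0.0567 |
| D | 0.3295 | 0.0410 | 1.2234 | - | 0.0111 | 0.1110 | 2.4877 | 0.4926 | 0.1229 | 0.0316 | 0.0290 | 0.0871 | 0.0231 | 0.0068 | 0.0333 | 0.2083 | 0.1289 | 0.0043 | 0.0760 | 0.1084 |
| C | 0.2287 | 0.2697 | 0.0690 | 0.0280 | - | 0.0194 | 0.0173 | 0.2114 | 0.0863 | 0.0410 | 0.0777 | 0.0151 | 0.0496 | 0.1424 | 0.0324 | 0.7616 | 0.1424 | 0.0820 | 0.3538 | 0.2136 |
| Q | 0.2216 | 0.7992 | 0.1694 | 0.1385 | 0.0096 | - | 1.0944 | 0.0895 | 0.6766 | 0.0213 | 0.3346 | 0.9143 | 0.0554 | 0.0096 | 0.4209 | 0.1939 | 0.1588 | 0.0128 | 0.0426 | 0.0618 |
| E | 0.4252 | 0.0843 | 0.1276 | 2.0650 | 0.0057 | 0.7278 | - | 0.4323 | 0.0291 | 0.0305 | 0.0461 | 0.5343 | 0.0213 | 0.0092 | 0.0503 | 0.1106 | 0.1006 | 0.0085 | 0.0106 | 0.1602 |
| G | 0.6936 | 0.3600 | 0.1706 | 0.3383 | 0.0575 | 0.0492 | 0.3576 | - | 0.0240 | 0.0147 | 0.0328 | 0.0833 | 0.0158 | 0.0106 | 0.0545 | 0.6631 | 0.0962 | 0.0405 | 0.0088 | 0.1618 |
| H | 0.0876 | 0.8493 | 0.8873 | 0.2742 | 0.0762 | 1.2091 | 0.0781 | 0.0781 | - | 0.0495 | 0.2552 | 0.1619 | 0.0400 | 0.0952 | 0.2990 | 0.2628 | 0.1447 | 0.0095 | 0.9787 | 0.0419 |
| I | 0.1440 | 0.0633 | 0.1082 | 0.0308 | 0.0158 | 0.0167 | 0.0358 | 0.0208 | 0.0216 | - | 1.1024 | 0.0624 | 0.5862 | 0.1632 | 0.0258 | 0.1432 | 0.7743 | 0.0100 | 0.0508 | 3.2788 |
| L | 0.1236 | 0.0986 | 0.0303 | 0.0163 | 0.0173 | 0.1510 | 0.0312 | 0.0269 | 0.0644 | 0.6365 | - | 0.0452 | 0.4682 | 0.5254 | 0.2779 | 0.2096 | 0.0827 | 0.0394 | 0.0404 | 0.6062 |
| K | 0.1472 | 1.7283 | 0.5579 | 0.0751 | 0.0052 | 0.6315 | 0.5550 | 0.1045 | 0.0626 | 0.0552 | 0.0692 | - | 0.0758 | 0.0052 | 0.0567 | 0.1678 | 0.2930 | 0.0066 | 0.0147 | 0.0427 |
| M | 0.1872 | 0.1142 | 0.0730 | 0.0505 | 0.0430 | 0.0973 | 0.0561 | 0.0505 | 0.0393 | 1.3176 | 1.8229 | 0.1928 | - | 0.0917 | 0.0430 | 0.1011 | 0.6420 | 0.0150 | 0.0318 | 1.0462 |
| F | 0.0551 | 0.0173 | 0.0162 | 0.0087 | 0.0714 | 0.0097 | 0.0141 | 0.0195 | 0.0541 | 0.2119 | 1.1819 | 0.0076 | 0.0530 | - | 0.0389 | 0.3341 | 0.0422 | 0.0400 | 0.9192 | 0.2044 |
| P | 0.7814 | 0.1882 | 0.0269 | 0.0338 | 0.0130 | 0.3426 | 0.0616 | 0.0807 | 0.1362 | 0.0269 | 0.5013 | 0.0668 | 0.0199 | 0.0312 | - | 0.9869 | 0.3573 | 0.0052 | 0.0191 | 0.0728 |
| S | 1.5495 | 0.2652 | 1.1161 | 0.1567 | 0.2267 | 0.1169 | 0.1002 | 0.7263 | 0.0886 | 0.1105 | 0.2800 | 0.1464 | 0.0347 | 0.1984 | 0.7308 | - | 1.4500 | 0.0231 | 0.1053 | 0.1406 |
| T | 1.8267 | 0.1722 | 0.5188 | 0.1130 | 0.0494 | 0.1115 | 0.1063 | 0.1228 | 0.0569 | 0.6962 | 0.1288 | 0.2980 | 0.2568 | 0.0292 | 0.3084 | 1.6904 | - | 0.0060 | 0.0337 | 0.3938 |
| W | 0.0337 | 0.3338 | 0.0061 | 0.0153 | 0.1164 | 0.0368 | 0.0368 | 0.2113 | 0.0153 | 0.0368 | 0.2511 | 0.0276 | 0.0245 | 0.1133 | 0.0184 | 0.1103 | 0.0245 | - | 0.1256 | 0.0827 |
| Y | 0.0556 | 0.0624 | 0.1546 | 0.1207 | 0.2224 | 0.0542 | 0.0203 | 0.0203 | 0.6969 | 0.0827 | 0.1139 | 0.0271 | 0.0231 | 1.1525 | 0.0298 | 0.2224 | 0.0610 | 0.0556 | - | 0.0569 |
| V | 1.1648 | 0.0455 | 0.0363 | 0.0838 | 0.0653 | 0.0383 | 0.1491 | 0.1820 | 0.0145 | 2.5974 | 0.8317 | 0.0383 | 0.3687 | 0.1247 | 0.0554 | 0.1444 | 0.3469 | 0.0178 | 0.0277 | - |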

[i] Each entry is the probability of substitution (r) from one amino acid (row) to another (column). Substitution patterns and rates were estimated under the Jones-Taylor-Thornton model (Jones et al. 1992). Relative values of instantaneous r should be considered when evaluating them. For simplicity, the sum of r values is made equal to 100. The amino acid frequencies are 7.69% (A), 5.11% (R), 4.25% (N), 5.13% (D), 2.03% (C), 4.11% (Q), 6.18% (E), 7.47% (G), 2.30% (H), 5.26% (I), 9.11% (L), 5.95% (K), 2.34% (M), 4.05% (F), 5.05% (P), 6.82% (S), 5.85% (T), 1.43% (W), 3.23% (Y), and 6.64% (V). For estimating ML values, a tree topology was automatically computed. The maximum log-likelihood for this computation was –516.110. This analysis involved 42 amino acid sequences. There was a total of 436 positions in the final dataset. Evolutionary analyses were conducted in MEGA11 (Tamura et al. 2021)

Table 4

Maximum likelihood estimate of substitution matrix of UCH-L1 protein of traumatic brain injury using MEGA 11

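| From \ To | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | - | 0.1407 | 0.1230 | 0.2198 | 0.0604 | 0.1185 | 0.3417 | 0.6737 | 0.0262 | 0.0985 | 0.1464 | 0.1139 | 0.0570 | 0.0290 | 0.5131 | 1.3742 | 1.3896 | 0.0063 | 0.0234 | 1.0058 |
| R | 0.2117 | - | 0.0994 | 0.0411 | 0.1071 | 0.6428 | 0.1020 | 0.5262 | 0.3822 | 0.0651 | 0.1757 | 2.0124 | 0.0523 | 0.0137 | 0.1860 | 0.3540 | 0.1971 | 0.0934 | 0.0394 | 0.0591 |
| N | 0.2226 | 0.1195 | - | 1.4767 | 0.0330 | 0.1638 | 0.1855 | 0.2999 | 0.4802 | 0.1340 | 0.0649 | 0.7811 | 0.0402 | 0.0155 | 0.0319 | 1.7910 | 0.7141 | 0.0021 | 0.1175 | 0.0567 |
| D | 0.3295 | 0.0410 | 1.2234 | - | 0.0111 | 0.1110 | 2.4877 | 0.4926 | 0.1229 | 0.0316 | 0.0290 | 0.0871 | 0.0231 | 0.0068 | 0.0333 | 0.2083 | 0.1289 | 0.0043 | 0.0760 | 0.1084 |
| C | 0.2287 | 0.2697 | 0.0690 | 0.0280 | - | 0.0194 | 0.0173 | 0.2114 | 0.0863 | 0.0410 | 0.0777 | 0.0151 | 0.0496 | 0.1424 | 0.0324 | 0.7616 | 0.1424 | 0.0820 | 0.3538 | 0.2136 |
| Q | 0.2216 | 0.7992 | 0.1694 | 0.1385 | 0.0096 | - | 1.0944 | 0.0895 | 0.6766 | 0.0213 | 0.3346 | 0.9143 | 0.0554 | 0.0096 | 0.4209 | 0.1939 | 0.1588 | 0.0128 | 0.0426 | 0.0618 |
| E | 0.4252 | 0.0843 | 0.1276 | 2.0650 | 0.0057 | 0.7278 | - | 0.4323 | 0.0291 | 0.0305 | 0.0461 | 0.5343 | 0.0213 | 0.0092 | 0.0503 | 0.1106 | 0.1006 | 0.0085 | 0.0106 | 0.1602 |
| G | 0.6936 | 0.3600 | 0.1706 | 0.3383 | 0.0575 | 0.0492 | 0.3576 | - | 0.0240 | 0.0147 | 0.0328 | 0.0833 | 0.0158 | 0.0106 | 0.0545 | 0.6631 | 0.0962 | 0.0405 | 0.0088 | 0.1618 |
| H | 0.0876 | 0.8493 | 0.8873 | 0.2742 | 0.0762 | 1.2091 | 0.0781 | 0.0781 | - | 0.0495 | 0.2552 | 0.1619 | 0.0400 | 0.0952 | 0.2990 | 0.2628 | 0.1447 | 0.0095 | 0.9787 | 0.0419 |
| I | 0.1440 | 0.0633 | 0.1082 | 0.0308 | 0.0158 | 0.0167 | 0.0358 | 0.0208 | 0.0216 | - | 1.1024 | 0.0624 | 0.5862 | 0.1632 | 0.0258 | 0.1432 | 0.7743 | 0.0100 | 0.0508 | 3.2788 |
| L | 0.1236 | 0.0986 | 0.0303 | 0.0163 | 0.0173 | 0.1510 | 0.0312 | 0.0269 | 0.0644 | 0.6365 | - | 0.0452 | 0.4682 | 0.5254 | 0.2779 | 0.2096 | 0.0827 | 0.0394 | 0.0404 | 0.6062 |
| K | 0.1472 | 1.7283 | 0.5579 | 0.0751 | 0.0052 | 0.6315 | 0.5550 | 0.1045 | 0.0626 | 0.0552 | 0.0692 | - | 0.0758 | 0.0052 | 0.0567 | 0.1678 | 0.2930 | 0.0066 | 0.0147 | 0.0427 |
| M | 0.1872 | 0.1142 | 0.0730 | 0.0505 | 0.0430 | 0.0973 | 0.0561 | 0.0505 | 0.0393 | 1.3176 | 1.8229 | 0.1928 | - | 0.0917 | 0.0430 | 0.1011 | 0.6420 | 0.0150 | 0.0318 | 1.0462 |
| F | 0.0551 | 0.0173 | 0.0162 | 0.0087 | 0.0714 | 0.0097 | 0.0141 | 0.0195 | 0.0541 | 0.2119 | 1.1819 | 0.0076 | 0.0530 | - | 0.0389 | 0.3341 | 0.0422 | 0.0400 | 0.9192 | 0.2044 |
| P | 0.7814 | 0.1882 | 0.0269 | 0.0338 | 0.0130 | 0.3426 | 0.0616 | 0.0807 | 0.1362 | 0.0269 | 0.5013 | 0.0668 | 0.0199 | 0.0312 | - | 0.9869 | 0.3573 | 0.0052 | 0.0191 | 0.0728 |
| S | 1.5495 | 0.2652 | 1.1161 | 0.1567 | 0.2267 | 0.1169 | 0.1002 | 0.7263 | 0.0886 | 0.1105 | 0.2800 | 0.1464 | 0.0347 | 0.1984 | 0.7308 | - | 1.4500 | 0.0231 | 0.1053 | 0.1406 |
| T | 1.8267 | 0.1722 | 0.5188 | 0.1130 | 0.0494 | 0.1115 | 0.1063 | 0.1228 | 0.0569 | 0.6962 | 0.1288 | 0.2980 | 0.2568 | 0.0292 | 0.3084 | 1.6904 | - | 0.0060 | 0.0337 | 0.3938 |
| W | 0.0337 | 0.3338 | 0.0061 | 0.0153 | 0.1164 | 0.0368 | 0.0368 | 0.2113 | 0.0153 | 0.0368 | 0.2511 | 0.0276 | 0.0245 | 0.1133 | 0.0184 | 0.1103 | 0.0245 | - | 0.1256 | 0.0827 |
| Y | 0.0556 | 0.0624 | 0.1546 | 0.1207 | 0.2224 | 0.0542 | 0.0203 | 0.0203 | 0.6969 | 0.0827 | 0.1139 | 0.0271 | 0.0231 | 1.1525 | 0.0298 | 0.2224 | 0.0610 | 0.0556 | - | 0.0569 |
| V | 1.1648 | 0.0455 | 0.0363 | 0.0838 | 0.0653 | 0.0383 | 0.1491 | 0.1820 | 0.0145 | 2.5974 | 0.8317 | 0.0383 | 0.3687 | 0.1247 | 0.0554 | 0.1444 | 0.3469 | 0.0178 | 0.0277 | - |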

[i] Each entry is the probability of substitution (r) from one amino acid (row) to another (column). Substitution patterns and rates were estimated under the Jones-Taylor-Thornton model (Jones et al. 1992). Relative values of instantaneous r should be considered when evaluating them. For simplicity, the sum of r values is made equal to 100. The amino acid frequencies are 7.69% (A), 5.11% (R), 4.25% (N), 5.13% (D), 2.03% (C), 4.11% (Q), 6.18% (E), 7.47% (G), 2.30% (H), 5.26% (I), 9.11% (L), 5.95% (K), 2.34% (M), 4.05% (F), 5.05% (P), 6.82% (S), 5.85% (T), 1.43% (W), 3.23% (Y), and 6.64% (V). For estimating ML values, a tree topology was automatically computed. The maximum log-likelihood for this computation was –1195.037. This analysis involved 42 amino acid sequences. There was a total of 223 positions in the final dataset. Evolutionary analyses were conducted in MEGA11 (Tamura et al. 2021).

Figure 1

Molecular phylogenetic analysis of the GFAP protein using the maximum likelihood method. The evolutionary history was inferred using the maximum likelihood method and the JTT matrix-based model (Jones et al. 1992). The tree with the highest log likelihood (–2493.15) is shown. Initial trees for the heuristic search were obtained automatically by applying the Neighbor-Joining and BioNJ algorithms to a matrix of pairwise distances estimated using the JTT model, followed by selecting the topology with the highest log likelihood value. The final dataset included 436 positions

https://www.biotechnologia-journal.org/f/fulltexts/202470/BTA-106-2-202470-g001_min.jpg
Figure 2

Molecular phylogenetic analysis of the S-100B protein using the maximum likelihood method. The evolutionary history was inferred using the maximum likelihood method and the JTT matrix-based model (Jones et al. 1992). The tree with the highest log likelihood (–517.32) is shown. Initial trees for the heuristic search were obtained automatically by applying the Neighbor-Joining and BioNJ algorithms to a matrix of pairwise distances estimated using the JTT model, followed by selecting the topology with the highest log likelihood value. The final dataset included 92 positions

https://www.biotechnologia-journal.org/f/fulltexts/202470/BTA-106-2-202470-g002_min.jpg
Figure 3

Molecular phylogenetic analysis of the UCH-L1 protein using the maximum likelihood method. The evolutionary history was inferred using the maximum likelihood method and the JTT matrix-based model (Jones et al. 1992). The tree with the highest log likelihood (–1186.09) is shown. Initial trees for the heuristic search were obtained automatically by applying the Neighbor-Joining and BioNJ algorithms to a matrix of pairwise distances estimated using the JTT model, followed by selecting the topology with the highest log likelihood value. The final dataset included 223 positions

https://www.biotechnologia-journal.org/f/fulltexts/202470/BTA-106-2-202470-g003_min.jpg

Domain separation

CD-Search annotates conserved domains on the query sequences and displays the queries within the corresponding domain multiple sequence alignments. For the GFAP protein, the NCBI Conserved Domain Search identified two domains: one with accession number pfam00038, spanning residues 68–376 with an E-value of 1.12e–127, and another with accession number pfam04732, covering residues 4–66 with an E-value of 2.51e–08. ThreaDom analysis also revealed two domains for GFAP, spanning 1–171 and 172–345, with a cutoff of 0.56. Similarly, the S-100B protein showed one domain via NCBI CDD, with accession number cd05027, an interval of 2–89, and an E-value of 1.68e–47. ThreaDom analysis identified a single domain in S-100B with the same cutoff of 0.56. For the UCH-L1 protein, NCBI CDD revealed a single domain with accession number cd09616, spanning residues 5–219 with an E-value of 3.16e–127. ThreaDom also identified one domain in UCH-L1, using the same cutoff of 0.56 (Table 5 and Figure 4).

Table 5

Domain assignment of GFAP, S-100B and UCH-L1 protein using CD-Search (NCBI server)

| Proteins (a) | Name (b) | Accession (c) | Description (d) | Interval (e) | E-value (f) | Bitscore (g) | Superfamily (h) |
|---|---|---|---|---|---|---|---|
| GFAP | Filament | pfam00038 | Intermediate filament protein | 68–376 | 1.12e–127 | 371.555 | cl25641 |
| | Filament_head | pfam04732 | Intermediate filament head (DNA binding) region: This family represents the N-terminal head… | 4–66 | 2.51e–08 | 50.8519 | cl04711 |
| S-100B | S-100B | cd05027 | S-100B: The S-100B domain is found in proteins similar to S100B. S100B is a calcium-binding protein | 2–89 | 1.68e–47 | 146.155 | cl08302 |
| UCH-L1 | Peptidase_C12_UCH_L1_L3 | cd09616 | Cysteine peptidase C12 containing ubiquitin carboxyl-terminal hydrolase (UCH) families L1 and… | 5–219 | 3.16e–127 | 321.122 | cl08306 |

a Proteins: The analyzed protein names.

b Name: The specific domain or structural component of the protein.

c Accession: The unique identifier assigned to the protein family or domain in the database.

d Description: A brief functional or structural description of the protein or domain.

e Interval: The residue range within the protein where the domain is located.

f E-value: The statistical significance of the match, with lower values indicating higher confidence.

g Bitscore: A sequence similarity measure where higher scores indicate more decisive matches.

h Superfamily: The broader classification of structurally and functionally related proteins
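
The agreement between the two domain parsers can be gauged by measuring how much their residue ranges overlap. The following is a minimal sketch, not part of the original pipeline: the helper names `overlap` and `jaccard` are hypothetical, and the inputs are the GFAP intervals reported above (CD-Search: 4–66 and 68–376; ThreaDom: 1–171 and 172–345).

```python
# Compare domain boundaries from two predictors as inclusive residue ranges.
def overlap(a, b):
    """Number of residues shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def jaccard(a, b):
    """Shared residues divided by residues covered by either interval."""
    shared = overlap(a, b)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - shared
    return shared / union

# GFAP intervals as reported in the text (CD-Search vs. ThreaDom).
cd_search = {"Filament_head": (4, 66), "Filament": (68, 376)}
threadom = {"Domain 1": (1, 171), "Domain 2": (172, 345)}

for cd_name, cd_iv in cd_search.items():
    for td_name, td_iv in threadom.items():
        print(cd_name, td_name, overlap(cd_iv, td_iv), round(jaccard(cd_iv, td_iv), 3))
```

With these inputs, the Filament domain shares 104 residues with ThreaDom Domain 1 and 174 residues with Domain 2, so ThreaDom appears to split the region that CD-Search reports as a single Pfam domain.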

Figure 4

Domain separation of (A) the GFAP protein (T11613), (B) the S-100B protein (T11612), and (C) the UCH-L1 protein (T11624) using the ThreaDom server

https://www.biotechnologia-journal.org/f/fulltexts/202470/BTA-106-2-202470-g004_min.jpg

Secondary structure and solvent accessibility prediction

The GFAP protein’s secondary structure is predominantly alpha-helical, accounting for 65% (281 residues), with a minor presence of beta strands (6%, 25 residues) and coils (28%, 120 residues), achieving an overall prediction confidence of 86%. Similarly, the S-100B protein is helix-dominant, with 67% (63 residues) forming alpha helices, no beta strands, and 33% (19 residues) structured as coils, with an 87.5% confidence level. In contrast, UCH-L1 exhibits a more balanced composition, with alpha helices and coils each constituting 41% (91 and 92 residues, respectively), while beta strands make up 18% (40 residues), with an 80.4% confidence level.

Solvent accessibility analyses indicate that GFAP and S-100B are primarily buried, with 63.89% and 66.30% of residues buried, respectively, while UCH-L1 has a more exposed surface, with 42.60% of residues solvent-accessible and 57.40% buried. These structural characteristics provide insight into the proteins’ solvent interactions and potential functional dynamics (Table 6).

Table 6

Predicted secondary structure of proteins using different servers

Protein (a) | Alpha helix (b) | Beta sheet (b) | Others: Coil-Turn-Loop (b) | Exposed (c) | Intermediate (d) | Buried (e)
GFAP | 65% | 6.2% | 28.7% | 36.11% |  | 63.89%
S-100B | 67.39% | 0% | 32.61% | 33.70% |  | 66.30%
UCH-L1 | 40.81% | 17.94% | 41.26% | 42.60% |  | 57.40%

a Protein: The analyzed protein name.

b 2ry structure (secondary structure): The predicted composition of the protein secondary structure elements.

c Exposed: The percentage of residues that are solvent-exposed on the protein surface.

d Intermediate: The percentage of residues partially buried in the protein structure.

e Buried: The percentage of residues fully buried within the protein core
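
The composition percentages in Table 6 are the kind of summary obtained by counting states in a per-residue prediction. The sketch below is illustrative only: it assumes a three-state string (H = helix, E = strand, C = coil), the function name `composition` is hypothetical, and the toy prediction is not one of the three proteins.

```python
from collections import Counter

def composition(ss):
    """Percentage of helix (H), strand (E) and coil (C) in a 3-state string."""
    if not ss:
        raise ValueError("empty secondary-structure string")
    counts = Counter(ss)
    return {state: round(100 * counts.get(state, 0) / len(ss), 2) for state in "HEC"}

# Hypothetical 20-residue prediction, for illustration only.
toy_prediction = "CC" + "H" * 10 + "CC" + "E" * 4 + "CC"
print(composition(toy_prediction))
```

The same counting applies to a two-state buried/exposed string, although the accessibility threshold that separates the two states depends on the server and is not specified here.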

Three-dimensional (3-D) structure prediction

Initial models were generated, developed, and reviewed using several servers aligned with CASP15 protocols to create the 3D model, and the highest-quality model was selected.

Construction of an initial model using target-template alignment

GalaxyWEB, Swiss-Model, and LOMETS were used for aligned regions, while I-TASSER, Robetta, Phyre2, and AlphaFold targeted low-similarity regions to construct structural models for unaligned regions. AlphaFold demonstrated superior performance, particularly in modeling full-length structures with high confidence scores, making it a critical tool for assessing structural integrity.

For the GFAP protein, I-TASSER generated five models, with a C-score of –3.23 for the main protein, –3.24 for Domain 1, and –1.15 for Domain 2. In contrast, AlphaFold provided a QMEAN Z-score of 0.89, indicating a highly accurate model. In the case of the S-100B protein, the C-scores were 0.06 for the main protein, –0.5 for Domain 1, and –0.25 for Domain 2, while AlphaFold achieved a QMEAN score of 0.79 ± 0.09, confirming its reliability. For the UCH-L1 protein, I-TASSER developed five models, with a C-score of 1.51 for Domain 1. In contrast, AlphaFold provided an RMSD value of 3.36 Å after refinement, suggesting enhanced accuracy in secondary structure alignment.
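
As a rough guide to reading these C-scores, the I-TASSER documentation describes the C-score as typically falling in the range [–5, 2], with values above –1.5 suggesting a model of correct global topology. The sketch below applies that rule of thumb to the values quoted above; the function name `likely_correct_fold` is hypothetical, and the cutoff is a heuristic rather than a guarantee.

```python
# Rule of thumb from the I-TASSER documentation: C-scores typically fall in
# [-5, 2], and a C-score above -1.5 suggests a correct global topology.
C_SCORE_CUTOFF = -1.5

# C-scores quoted in the text (main protein for GFAP and S-100B,
# Domain 1 for UCH-L1).
c_scores = {"GFAP": -3.23, "S-100B": 0.06, "UCH-L1": 1.51}

def likely_correct_fold(c_score, cutoff=C_SCORE_CUTOFF):
    """True when the C-score clears the heuristic cutoff."""
    return c_score > cutoff

for protein, score in c_scores.items():
    label = "above" if likely_correct_fold(score) else "below"
    print(f"{protein}: C-score {score} is {label} the {C_SCORE_CUTOFF} cutoff")
```

Under this heuristic, the full-length GFAP model from I-TASSER falls below the cutoff, whereas the S-100B and UCH-L1 models clear it.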

Each query sequence was given five models by GalaxyWEB, which also selected templates for modeling by rescoring HHsearch results. While Phyre2 built 3D models using advanced distant homology detection techniques, SWISS-MODEL generated multiple models with QMEAN scores of 0.86 ± 0.06, 0.27 ± 0.12, and 0.69 ± 0.07 for GFAP; 0.81 ± 0.06, 0.80 ± 0.09, and 0.81 ± 0.11 for S-100B; and 0.86 ± 0.06 and 0.87 ± 0.06 for UCH-L1. Among these, AlphaFold consistently ranked as one of the top-performing predictors, producing models with high structural fidelity across all three biomarkers.

Reduced-level structure assembly and refinement simulations

The second stage of structure prediction involved refining the S-100B protein. In terms of hydrogen bonds, backbone structure, and side-chain positioning, the results from the GalaxyWEB, ModRefiner, and 3Drefine servers successfully optimized the basic starting models, bringing them closer to their native state. Refinement improved the physical quality of global and local structures compared to the original model generated by selected servers, such as I-TASSER for the target domains. This was achieved by lowering the RMSD and clash scores while increasing the TM-score, enhancing structural accuracy and stability.

Model evaluation and selection

The best 3D model of the correct fold was chosen through model evaluation from all generated conformations, selecting those most closely resembling the native structure. Various evaluation metrics were used to assess structural accuracy and stability, including Swiss-Model Works, QMEAN Server, TM-align, TM-score, Z-score, RMSD, Clash-score, and PROCHECK. AlphaFold and I-TASSER were identified as the best-performing approaches, consistently ranking among the top predictors in CASP11, CASP12, CASP13, CASP14, and CASP15 assessments.

The I-TASSER server produced five full-length models with high C-scores, an estimated TM-score of 0.92 ± 0.06, and an RMSD of 2.7 ± 2.0 Å, confirming the accuracy of its models. However, AlphaFold delivered the best structural predictions for GFAP, S-100B, and UCH-L1, with TM-scores exceeding 0.99, demonstrating near-native accuracy. The selected AlphaFold models outperformed other methods in terms of RMSD reduction and global alignment accuracy, making them the optimal choice for further structural and functional interpretation.

The LOMETS server’s best prediction of the three-dimensional structures of GFAP, S-100B, and UCH-L1 (Table 10) further validated AlphaFold’s superiority. The estimated scores for the projected three-dimensional structures using AlphaFold consistently ranked higher than experimentally determined structures in terms of RMSD, TM-score, C-score, QMEAN Z-score, MolProbity score, and Clash score (Tables 7–10).

Table 7

Three-dimensional structure prediction of the GFAP protein for the main protein

Serversa \ ScoresI-TasserLometsRobettaPhyre2Swiss-ModelAlphaFold
RMSDb
 3Drefine33.22.682.71.992.83
 GalaxyWebrefine2.973.182.862.52.192.89
 Modrefine3.073.293.412.622.013.08
 DeepRefiner0.583.263.492.872.62.54
TM-scorec
 3Drefine0.96550.93380.66410.97480.84890.7870
 GalaxyWebrefine0.99000.94830.91120.99000.97400.8215
 Modrefine0.99750.70930.96710.99930.99180.9449
 DeepRefiner0.98540.92160.88150.96630.94610.8796
GDT-TSd
 3Drefine0.18520.18690.21820.65380.49030.1794
 GalaxyWebrefine0.18920.18580.21640.65930.48540.173
 Modrefine0.17940.1840.21530.67580.52110.1649
 DeepRefiner0.25820.18740.18870.71980.49840.1719
GDT-HAe
 3Drefine0.11460.13250.14760.46430.33120.1285
 GalaxyWebrefine0.11920.13080.14810.46430.32310.1273
 Modrefine0.10760.12670.14470.49180.3620.1152
 DeepRefiner0.22280.12760.12380.52470.33120.1238
QMEANf
 3Drefine0.51 ± 0.050.53 ± 0.050.57 ± 0.050.74 ± 0.090.70 ± 0.070.58 ± 0.05
 GalaxyWebrefine0.52 ± 0.050.52 ± 0.050.57 ± 0.050.77 ± 0.090.73 ± 0.070.60 ± 0.05
 Modrefine0.51 ± 0.050.52 ± 0.050.55 ± 0.050.77 ± 0.090.73 ± 0.070.57 ± 0.05
 DeepRefiner0.73 ± 0.090.54 ± 0.050.58 ± 0.050.76 ± 0.090.73 ± 0.070.59 ± 0.05
MolProbityg
 3Drefine3.731.91.291.651.392.04
 GalaxyWebrefine2.331.330.730.81.030.69
 Modrefine2.562.181.521.371.461.59
 DeepRefiner2.732.952.612.622.743.1
Clash scoreh
 3Drefine40.186.142.5713.957.066
 GalaxyWebrefine13.282.710.710.661.570.57
 Modrefine40.86259.866.658.6311.29
 DeepRefiner185.48157.7130.93143.73126.97129.11
 Aligned lengthi15619918489127151
 RFj84.19%96.05%99.07%100%98.68%99.07 %
 Overall factork86.32%95.88%-100%100%99.70 %

a Servers: The computational protein structure prediction and refinement tools.

b RMSD (root mean square deviation): Measures the average deviation between the predicted and reference structures, with lower values indicating better accuracy.

c TM-score (template modeling score): Assesses the similarity between the predicted and native structures, where values closer to 1 indicate higher accuracy.

d GDT-TS (Global Distance Test-Total Score): Evaluates the accuracy of structural alignment by considering the fraction of residues within a certain distance threshold from the reference structure.

e GDT-HA (Global Distance Test-High Accuracy): A more stringent version of GDT-TS, focusing on higher precision in structural alignment.

f QMEAN (Qualitative Model Energy Analysis): A composite score reflecting the overall quality of the predicted structure based on statistical potentials.

g MolProbity: A structural validation score considering atomic clashes, bond angles, and steric hindrances, where lower values indicate better quality.

h Clash score: The number of atomic clashes per 1000 atoms, with lower values suggesting fewer steric conflicts.

i Aligned length: The number of residues successfully aligned between the predicted and reference structures.

j RF (Residue Frequency): The percentage of correctly predicted residues compared to the reference structure.

k Overall factor: A combined score reflecting the overall reliability of the predicted model
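
The distance-based metrics defined in the footnotes above can be written out explicitly. The following is a minimal sketch under stated assumptions: it takes per-residue Cα distances from one fixed superposition, uses the standard TM-score length scaling d0 = 1.24 (L – 15)^(1/3) – 1.8, and reports GDT on a 0–1 scale as in Tables 7–9. Production tools such as TM-align and TM-score search over many superpositions, so a fixed superposition gives at best a lower bound for TM-score and GDT. The distances and target length below are hypothetical.

```python
import math

def rmsd(distances):
    """Root mean square deviation over aligned residue pairs (in angstroms)."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

def tm_score(distances, target_length):
    """TM-score for one fixed superposition, normalised by the target length."""
    d0 = 1.24 * (target_length - 15) ** (1 / 3) - 1.8  # requires target_length > 15
    return sum(1 / (1 + (d / d0) ** 2) for d in distances) / target_length

def gdt(distances, target_length, cutoffs):
    """Mean fraction of target residues lying within each distance cutoff."""
    fractions = [sum(d <= c for d in distances) / target_length for c in cutoffs]
    return sum(fractions) / len(cutoffs)

def gdt_ts(distances, target_length):
    return gdt(distances, target_length, (1.0, 2.0, 4.0, 8.0))

def gdt_ha(distances, target_length):
    return gdt(distances, target_length, (0.5, 1.0, 2.0, 4.0))

# Hypothetical distances for 6 aligned residues of a 50-residue target.
toy = [0.4, 0.8, 1.5, 3.0, 6.0, 9.0]
print(round(rmsd(toy), 3), round(tm_score(toy, 50), 3))
print(round(gdt_ts(toy, 50), 3), round(gdt_ha(toy, 50), 3))
```

RMSD is averaged over the aligned residues only, whereas TM-score and GDT are normalised by the full target length, so a short, well-fitting alignment can give a low RMSD together with low TM-score and GDT values.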

Table 8

3D-Structure prediction of S100B protein for the main protein

Serversa \ ScoresI-TasserLometsQuarkRobettaPhyre2Swiss-ModelAlphaFold
RMSDb
 3Drefine3.373.623.413.353.323.253.42
 GalaxyWebrefine3.413.633.393.393.333.253.44
 Modrefine3.283.372.933.293.273.333.4
 DeepRefiner3.453.623.543.443.63.383.38
TM-scorec
 3Drefine0.98740.99730.99510.98350.98740.98380.9905
 GalaxyWebrefine0.98760.99820.99660.98380.96930.98030.9886
 Modrefine0.99780.99820.99920.99980.99960.99990.9992
 DeepRefiner0.98070.99800.99250.98730.98500.49270.9892
GDT-TSd
 3Drefine0.44840.48640.42390.45650.41850.44290.4429
 GalaxyWebrefine0.45110.49730.43210.45380.42660.44290.4484
 Modrefine0.45110.38860.0740.45650.43480.44840.4457
 DeepRefiner0.44570.47250.40490.45920.40490.21990.4484
GDT-HAe
 3Drefine0.24180.2690.23370.250.22010.250.2446
 GalaxyWebrefine0.23910.29770.24180.24460.22830.250.25
 Modrefine0.23910.19570.05040.25540.24180.25270.2473
 DeepRefiner0.23640.25270.20920.250.2120.1230.25
QMEANf
 3Drefine0.76 ± 0.090.70 ± 0.090.69 ± 0.090.77 ± 0.090.66 ± 0.090.82 ± 0.060.80 ± 0.09
 GalaxyWebrefine0.74 ± 0.090.69 ± 0.090.71 ± 0.090.75 ± 0.090.68 ± 0.090.82 ± 0.060.79 ± 0.09
 Modrefine0.73 ± 0.090.40 ± 0.090.84 ± 0.060.76 ± 0.090.66 ± 0.090.78 ± 0.090.79 ± 0.09
 DeepRefiner0.75 ± 0.090.71 ± 0.090.66 ± 0.090.76 ± 0.090.68 ± 0.090.80 ± 0.060.79 ± 0.09
MolProbityg
 3Drefine3.091.623.211.381.851.431.13
 GalaxyWebrefine1.461.51.751.510.921.430.72
 Modrefine2.472.862.182.261.871.841.93
 DeepRefiner2.992.662.692.682.652.762.67
Clash scoreh
 3Drefine20.584.822.636.867.541.383.43
 GalaxyWebrefine4.83.437.548.231.371.380.69
 Modrefine45.356.9747.9237.0621.2821.9627.45
 DeepRefiner182.03160.05171.66164.17153.23185.93160.71
 Aligned lengthi85892783802979
 RFj93.33%95.56%93.33%98.89%93.33%100.00%100.00 %
 Overall factork100.00%89.29%97.62%100.00%96.34%92%100.00 %

a Servers: The computational protein structure prediction and refinement tools.

b RMSD (root mean square deviation): Measures the average deviation between the predicted and reference structures, with lower values indicating better accuracy.

c TM-score (template modeling score): Assesses the similarity between the predicted and native structures, where values closer to 1 indicate higher accuracy.

d GDT-TS (Global Distance Test-Total Score): Evaluates the accuracy of structural alignment by considering the fraction of residues within a certain distance threshold from the reference structure.

e GDT-HA (Global Distance Test-High Accuracy): A more stringent version of GDT-TS, focusing on higher precision in structural alignment.

f QMEAN (Qualitative Model Energy Analysis): A composite score reflecting the overall quality of the predicted structure based on statistical potentials.

g MolProbity: A structural validation score considering atomic clashes, bond angles, and steric hindrances, where lower values indicate better quality.

h Clash score: The number of atomic clashes per 1000 atoms, with lower values suggesting fewer steric conflicts.

i Aligned length: The number of residues successfully aligned between the predicted and reference structures.

j RF (residue frequency): The percentage of correctly predicted residues compared to the reference structure.

k Overall factor: A combined score reflecting the overall reliability of the predicted model

Table 9

3D-Structure prediction of UCH-L1 protein for the main protein

Serversa \ ScoresI-TasserLometsRobettaPhyre2Swiss-ModelAlphaFold
RMSDb
 3Drefine3.212.933.083.213.273.36
 GalaxyWebrefine3.462.863.123.233.273.34
 Modrefine3.212.933.173.23.353.36
 DeepRefiner3.3333.13.343.273.25
TM-scorec
 3Drefine0.99580.99650.99660.99660.99230.9963
 GalaxyWebrefine0.99600.99640.99750.99560.99600.9956
 Modrefine0.99960.99880.99940.99910.99930.9997
 DeepRefiner0.99680.99490.99320.98910.74430.9946
GDT-TSd
 3Drefine0.0930.09530.09190.09190.09420.0897
 GalaxyWebrefine0.09420.09530.09420.0930.09420.0919
 Modrefine0.09420.09530.09530.0930.09420.0908
 DeepRefiner0.09080.09680.09420.0930.09420.0886
GDT-HAe
 3Drefine0.04930.05160.04820.04820.05160.0493
 GalaxyWebrefine0.05040.05160.05040.04820.05160.0504
 Modrefine0.05040.05040.04930.04820.04930.0493
 DeepRefiner0.0460.05180.04930.0460.05160.046
QMEANf
 3Drefine0.88 ± 0.060.82 ± 0.060.81 ± 0.060.86 ± 0.060.87 ± 0.060.86 ± 0.06
 GalaxyWebrefine0.86 ± 0.060.81 ± 0.060.81 ± 0.060.87 ± 0.060.87 ± 0.060.86 ± 0.06
 Modrefine0.84 ± 0.060.80 ± 0.060.79 ± 0.060.85 ± 0.060.86 ± 0.060.84 ± 0.06
 DeepRefiner0.83 ± 0.060.87 ± 0.060.81 ± 0.060.85 ± 0.060.87 ± 0.060.85 ± 0.06
MolProbityg
 3Drefine2.741.661.561.761.141.76
 GalaxyWebrefine1.31.621.471.181.141.29
 Modrefine2.112.362.162.092.252.25
 DeepRefiner2.952.962.872.891.142.71
Clash scoreh
 3Drefine13.286.355.1917.321.75.48
 GalaxyWebrefine5.486.937.53.751.73.46
 Modrefine43.343.347.9238.3941.2840.99
 DeepRefiner188.54192.07190.72200.631.7176.94
 Aligned lengthi798987859191
 RFj93.67%95.48%95.93%98.19%96.83%100.00 %
 Overall factork94.88%83.57%91.63%94%98.09%92.99 %

a Servers: The computational protein structure prediction and refinement tools.

b RMSD (root mean square deviation): Measures the average deviation between the predicted and reference structures, with lower values indicating better accuracy.

c TM-score (template modeling score): Assesses the similarity between the predicted and native structures, where values closer to 1 indicate higher accuracy.

d GDT-TS (Global Distance Test-Total Score): Evaluates the accuracy of structural alignment by considering the fraction of residues within a certain distance threshold from the reference structure.

e GDT-HA (Global Distance Test-High Accuracy): A more stringent version of GDT-TS, focusing on higher precision in structural alignment.

f QMEAN (Qualitative Model Energy Analysis): A composite score reflecting the overall quality of the predicted structure based on statistical potentials.

g MolProbity: A structural validation score considering atomic clashes, bond angles, and steric hindrances, where lower values indicate better quality.

h Clash score: The number of atomic clashes per 1000 atoms, with lower values suggesting fewer steric conflicts.

i Aligned length: The number of residues successfully aligned between the predicted and reference structures.

j RF (residue frequency): The percentage of correctly predicted residues compared to the reference structure.

k Overall factor: A combined score reflecting the overall reliability of the predicted model

Table 10

Best template structure for GFAP, S-100B, and UCH-L1

Protein (a) | Subject | TM-score | RMSD (b) | Sequence identity | Cov (c)
GFAP | 7ogtB1 | 0.67 | 1.17 | 0.104 | 0.683
S-100B | 1xk4L | 0.947 | 0.64 | 0.378 | 0.978
UCH-L1 | 2etlA | 0.994 | 0.42 | 1 | 1

a Protein ranking is based on the TM-score of the structural alignment between the query structure and known structures in the PDB library.

b RMSD is the RMSD between the residues that TM-align structurally aligns.

c Cov represents the coverage of the alignment by TM-align and is equal to the number of structurally aligned residues divided by the length of the query protein

Motifs prediction

Motif analysis

MotifFinder and MotifScan, which employ distinct algorithms, were used to identify conserved sequence motifs associated with specific functions in the GFAP, S-100B, and UCH-L1 proteins. GFAP’s motifs, including Filament and Filament_head, serve as structural foundations for astrocytic integrity, while motifs of unknown function (DUF1664) suggest potential novel roles. S-100B’s calcium-binding motifs, such as S_100 and EF-hand, are implicated in cellular regulatory mechanisms. UCH-L1’s Peptidase_C12 motif is essential for ubiquitin-mediated protein turnover (Figure 5). These findings, presented in Tables 11 and 12, highlight the potential of these proteins as biomarkers in TBI pathophysiology and recovery processes.

Figure 5

A) Cartoon representation of the GFAP protein showing known and predicted motifs: Filament (68–376) is highlighted in purple, Filament_head (7–66) in red, DUF1664 (129–199) in blue, and DUF1664_2 (226–315) in green. B) Cartoon representation of the S-100B protein showing known and predicted motifs: S_100 (4–47) is highlighted in purple, EF-hand_1 in red, EF-hand_7 in blue, EF-hand_6 in green, EF-hand_4 in orange, EF-hand_5 in brown, EF-hand_8 in yellow, and Spt20 in cyan. C) Cartoon representation of the UCH-L1 protein showing known and predicted motifs: the 3–221 region is highlighted in red, while the remaining structure is in green

https://www.biotechnologia-journal.org/f/fulltexts/202470/BTA-106-2-202470-g005_min.jpg
Table 11

Motif analysis of GFAP, S-100B and UCH-L1 protein using MotifFinder and MotifScan servers

Proteins | Pfam_ID (a) | Description (b) | Position (c) | E-value (d)
GFAP | Filament | PF00038, Intermediate filament protein | 68–376 | 3.1e–109
GFAP | Filament_head | PF04732, Intermediate filament head (DNA binding) region | 7–66 | 2.1e–06
GFAP | DUF1664 | PF07889, Protein of unknown function (DUF1664) | 129–199 | 0.23
GFAP | DUF1664 | PF07889, Protein of unknown function (DUF1664) | 226–315 | 0.22
S-100B | S_100 | PF01023, S-100/ICaBP type calcium binding domain | 4–47 | 8.1e–22
S-100B | EF-hand_1 | PF00036, EF-hand | 54–80 | 3.7e–06
S-100B | EF-hand_7 | PF13499, EF-hand domain pair | 26–78 | 2e–05
S-100B | EF-hand_6 | PF13405, EF-hand domain | 59–79 | 0.01
S-100B | EF-hand_4 | PF12763, Cytoskeletal-regulatory complex EF-hand | 32–80 | 0.0035
S-100B | EF-hand_5 | PF13202, EF-hand | 55–78 | 0.011
S-100B | EF-hand_8 | PF13833, EF-hand domain pair | 43–79 | 0.045
S-100B | Spt20 | PF12090, Spt20 family | 25–63 | 0.076
UCH-L1 | Peptidase_C12 | PF01088, Ubiquitin carboxyl-terminal hydrolase, family 1 | 5–204 | 1.1e–57

a Pfam ID: The identifier for the protein family in the Pfam database.

b Description: A brief description of the protein family, including its function or characteristic features.

c Position: The range of amino acid positions in the protein associated with the respective Pfam ID.

d E-value: The statistical significance of the Pfam domain match; a lower value indicates a more significant match

Table 12

Post-translational modification site prediction of GFAP, S-100B, and UCH-L1 protein using PROSITE server

Proteins (a) | Category (b) | Signature (c) | Matching positions (d)
GFAP | RNA Associated Protein Domain Posttranslational Modifications | IF_ROD_2: Intermediate filament (IF) rod domain profile | 69–377
GFAP | RNA Associated Protein Domain Posttranslational Modifications | IF_ROD_1: Intermediate filament (IF) rod domain signature | 363–371
S-100B | RNA Associated Protein Domain Posttranslational Modifications | EF_HAND_2: EF-hand calcium-binding domain profile | 49–84
S-100B | RNA Associated Protein Domain Posttranslational Modifications | S100_CABP: S-100/ICaBP type calcium-binding protein signature | 57–78
S-100B | RNA Associated Protein Domain Posttranslational Modifications | EF_HAND_1: EF-hand calcium-binding domain | 62–74
UCH-L1 | RNA Associated Protein Domain Posttranslational Modifications | UCH_1: Ubiquitin carboxyl-terminal hydrolase family one cysteine active-site | 84–100

a Proteins: The name of the protein being analyzed.

b Category: The classification of the protein domain or its associated modifications, such as RNA-associated or posttranslational modifications.

c Signature: The specific domain signature associated with the protein, including the domain profile and its functional description.

d Matching positions: The range of amino acid positions in the protein that matches the identified signature

Post-translational modification site prediction using PROSITE server

Different signatures were identified across various locations within the proteins using the PROSITE database. In GFAP, these include the Intermediate Filament (IF) rod domain profile site spanning positions 69–377 and the IF rod domain signature site at 363–371 (Table 12). In S-100B, the EF-hand calcium-binding domain profile site was detected at positions 49–84, along with the S-100/ICaBP-type calcium-binding protein signature site at 57–78 and another EF-hand calcium-binding domain site at 62–74. For UCH-L1, the ubiquitin carboxyl-terminal hydrolase family 1 cysteine active-site was identified between positions 84–100 (Table 12).
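
PROSITE signatures of the pattern type are written in a compact syntax that maps directly onto regular expressions, which is how match positions such as those above can be located in a sequence. The sketch below is a simplified illustration, not the PROSITE implementation: the function names are hypothetical, the pattern and sequence are toy examples rather than real PROSITE entries, and anchors (`<`, `>`) are not handled. Entries labelled "profile" are scored with weight matrices and are outside the scope of this sketch.

```python
import re

# One pattern element: x, a residue, a [allowed] or {forbidden} set,
# optionally followed by a repeat count such as (2) or (2,4).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            token = "."                         # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"     # forbidden residues
        else:
            token = core                        # residue or [allowed] set
        if high is not None:
            token += "{%s,%s}" % (low, high)
        elif low is not None:
            token += "{%s}" % low
        parts.append(token)
    return "".join(parts)

def find_sites(pattern, sequence):
    """Return 1-based inclusive (start, end) positions of every match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end()) for m in regex.finditer(sequence)]

# Hypothetical pattern and sequence, for illustration only.
toy_pattern = "D-x-[DNS]-{ILVFYW}-x(2)-[DE]"
toy_sequence = "AADKDGSGEAA"
print(prosite_to_regex(toy_pattern))
print(find_sites(toy_pattern, toy_sequence))
```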

Identification, annotation, and analysis of domain architectures

The SMART server is an invaluable resource for exploring protein domain architecture and genetic modification. Our study, complemented by PredictProtein and SCOP data, analyzed the GFAP, S-100B, and UCH-L1 proteins, identifying disordered regions critical to their functionality.

GFAP is classified under SCOP’s superfamily of intermediate filament proteins, characterized by a coiled-coil region (Family: Intermediate filament protein, coiled-coil region). It shares a Fold known for its left-handed parallel coiled-coil structure within the Class of all-alpha proteins. This classification includes proteins such as Prelamin-A/C, Vimentin, various Keratins, and Lamin-B.

S-100B belongs to the Family of S100 proteins, which adopt a Fold resembling a pair of EF-hands within the EF-hand superfamily. This family includes S100-A4, S100-A8, S100-B, and other S100 variants, as well as Filaggrin.

UCH-L1 falls under the category of cysteine proteinases. The Superfamily of cysteine proteinases, with a Family specific to Ubiquitin carboxyl-terminal hydrolase UCH-L, includes two distinct Folds: the canonical cysteine proteinase catalytic core and a variant type. Notable proteins in this category include UCH-L1, UCH-L3, and YUH1 (Table 13).

Table 13

Structural classification of proteins using the SCOP server

Protein (a) | Superfamily (b) | Family (c) | Fold (d) | Class (e)
GFAP | Superfamily 3001560 – intermediate filament protein, coiled-coil region | Family 4003819 – intermediate filament protein, coiled-coil region | Fold 2000962 – left-handed parallel coiled-coil | Class 1000000 – all alpha proteins
S-100B | Superfamily 3001983 – EF-hand | Family 4000919 – S100 proteins | Fold 2000120 – pair of EF-hands-like | Class 1000000 – all alpha proteins
UCH-L1 | Superfamily 3001808 – cysteine proteinases | Family 4000880 – ubiquitin carboxy-terminal hydrolase UCH-L | Fold 2001107 – a canonical type of cysteine proteinases catalytic core; Fold 2001570 – variant types of cysteine proteinases catalytic core | Class 1000003 – alpha and beta proteins (a+b)

a Protein: The name of the protein being analyzed.

b Superfamily: The broader classification of the protein family based on structural and functional similarities.

c Family: The specific family within the superfamily, detailing the protein’s function.

d Fold: The structural classification of the protein, indicating the arrangement of its secondary structure elements.

e Class: The highest classification level, based on the protein structure

Table 14

Structural classification of GFAP, S-100B and UCH-L1 proteins using the SUPERFAMILY server

Protein (a) | Classification level (b) | Residues | Classification (c) | E-value (d)
GFAP | Superfamily | 294–372 | Intermediate filament protein, coiled-coil region | 2.09e–23
GFAP | Superfamily | 68–104 | Intermediate filament protein, coiled-coil region | 3.3e–11
GFAP | Superfamily | 116–210 | Myosin rod fragments | 1.31e–2
GFAP | Family | 294–372 | Intermediate filament protein, coiled-coil region | 3.43e–5
GFAP | Family | 68–104 | Intermediate filament protein, coiled-coil region | 6.6e–4
GFAP | Family | 116–210 | Myosin rod fragments | 0.012
S-100B | Superfamily | 1–89 | EF-hand | 2.72e–24
S-100B | Family | 1–89 | S100 proteins | 6.47e–5
UCH-L1 | Superfamily | 3–221 | Cysteine proteinases | 2.42e–76
UCH-L1 | Family | 3–221 | Ubiquitin carboxyl-terminal hydrolase UCH-L | 1.37e–9

a Protein: The name of the protein being analyzed.

b Classification level: The hierarchical level of structural classification, distinguishing between superfamily and family.

c Classification: The specific protein classification within its respective level indicates structural and functional similarities.

d E-value: The statistical significance of the classification, representing the likelihood of the match occurring by chance

The structural classification analysis of GFAP, S-100B, and UCH-L1 proteins revealed distinctive domain architectures and functional characteristics. For GFAP, multiple classifications were identified at both the superfamily and family levels. Within the superfamily classification, residues 294–372 were assigned to an intermediate filament protein with a coiled-coil region (E-value: 2.09e–23), while residues 68–104 exhibited a similar classification (E-value: 0.00000000000033). Another segment, spanning residues 116–210, was classified as myosin rod fragments (E-value: 0.0131). Consistent with these findings, the family classifications corroborated the intermediate filament protein classification for the same residue ranges, albeit with slightly different E-values.

For S-100B, a superfamily classification covering residues 1–89 identified an EF-hand motif (E-value: 2.72e–24), while the family classification recognized S100 proteins within the same range (E-value: 0.0000647). In the case of UCH-L1, a superfamily classification spanning residues 3–221 indicated cysteine proteinases (E-value: 2.42e–76), with the family classification assigning the protein as ubiquitin carboxyl-terminal hydrolase UCH-L within the same range (E-value: 0.00000000137) (Table 14).

This synthesis underscores the structural categorization and significance of these proteins within their respective superfamilies and families, highlighting their disordered regions that are critical for functionality. The study demonstrates the utility of domain architecture analysis in understanding protein function and potential genetic modification strategies. These insights, derived from the integration of SMART, PredictProtein, and SCOP data, provide a detailed understanding of the structural and functional aspects of these proteins, which are essential for future genetic research and manipulation. Structural classification tools such as CATH and SCOP were instrumental in establishing structure–function and evolution links to the GFAP protein and other proteins analyzed in this research. By analyzing domain architectures and understanding the roles of specific domains, researchers can gain valuable insights into the functions and mechanisms of these proteins, contributing to a broader understanding of traumatic brain injury biomarkers such as GFAP, S-100B, and UCH-L1.

Pathway and systems biology analysis

The STRING analysis revealed a network of interactions among GFAP, S-100B, and UCH-L1, with direct connections supported by multiple lines of evidence (Figure 6). The interaction between GFAP and S-100B exhibited the highest confidence (combined score: 0.925), driven by co-expression patterns (score: 0.239) and experimentally validated interactions (score: 0.087). This pairing is biologically significant in the context of neuroinflammation, as both proteins are enriched in pathways such as Signaling by ERBB4 (Reactome: HSA-1236394) and Neuroinflammation (WikiPathways: WP5083), which play a crucial role in TBI-induced glial activation.

Figure 6

Protein–protein interaction network of GFAP, S-100B, and UCH-L1, generated using the STRING database, illustrates the functional relationships among these proteins in the context of traumatic brain injury (TBI). Nodes represent proteins, while edges represent interaction confidence scores, with thicker lines indicating higher confidence. GFAP and S-100B exhibit the strongest interaction (combined score: 0.925), supported by coexpression and experimental evidence, while S-100B and UCH-L1 show moderate interaction (combined score: 0.699). GFAP and UCH-L1 are linked with a lower confidence score (0.590), primarily supported by text mining. The network highlights the involvement of these proteins in TBI-related pathways, including neuroinflammation and ubiquitination. Pathway annotations are color-coded for clarity

https://www.biotechnologia-journal.org/f/fulltexts/202470/BTA-106-2-202470-g006_min.jpg

S-100B and UCH-L1 demonstrated a moderate interaction (combined score: 0.699), primarily supported by text mining (score: 0.671) and coexpression (score: 0.121). This interaction aligns with S-100B’s role in calcium signaling and UCH-L1’s function in ubiquitin-mediated proteolysis, as evidenced by its association with the Deubiquitination pathway (Reactome: HSA-5688426). The link between GFAP and UCH-L1, though weaker (combined score: 0.590), suggests a potential regulatory mechanism connecting GFAP’s structural role in astrocytes to UCH-L1’s protein degradation functions, particularly in pathways like Autophagy (Reactome: HSA-9612973) and Parkinson Disease (WikiPathways: WP2371), which are relevant to protein aggregation in TBI.
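
The relationship between the per-channel scores and the combined score can be illustrated numerically. The sketch below assumes the channel-combination scheme described in the STRING documentation: a prior of 0.041 is removed from each channel, the channels are combined as independent evidence, and the prior is added back once. The function names are hypothetical, and the prior value is stated here as an assumption rather than taken from this study.

```python
# Assumption: prior probability of a random interaction used by STRING.
PRIOR = 0.041

def remove_prior(score, prior=PRIOR):
    """Strip the prior from one channel score (floored at the prior)."""
    return (max(score, prior) - prior) / (1 - prior)

def combined_score(channel_scores, prior=PRIOR):
    """Combine independent channel scores, then add the prior back once."""
    not_interacting = 1.0
    for score in channel_scores:
        not_interacting *= 1 - remove_prior(score, prior)
    return 1 - not_interacting * (1 - prior)

# Channel scores quoted in the text for S-100B and UCH-L1
# (text mining 0.671, coexpression 0.121).
s100b_uchl1 = combined_score([0.671, 0.121])
print(round(s100b_uchl1, 3))

# Channel scores quoted in the text for GFAP and S-100B
# (coexpression 0.239, experiments 0.087).
gfap_s100b = combined_score([0.239, 0.087])
print(round(gfap_s100b, 3))
```

Under these assumptions, the S-100B and UCH-L1 channels combine to roughly 0.698, close to the reported 0.699 given rounding of the inputs. The two channels quoted for GFAP and S-100B combine to well below 0.925, which suggests that additional evidence channels not listed in the text contribute to that score.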

Pathway enrichment analysis highlighted the central role of neuroinflammation and ubiquitination in the network. GFAP and S-100B were strongly associated with immune response pathways, including Toll-like Receptor Cascades (Reactome: HSA-168898) and Glial Cell Differentiation (GO:0010001), while UCH-L1 was linked to protein homeostasis mechanisms such as the Ubiquitin-Proteasome System (KEGG: hsa05012).

Discussion

This study establishes a foundational computational framework for analyzing GFAP, S-100B, and UCH-L1 as TBI biomarkers, leveraging state-of-the-art in silico tools to generate structural and functional insights. While the absence of experimental validation is acknowledged, the computational predictions align with recent advancements in structural biology, including AI-driven protein modeling, exemplified by the Nobel Prize-winning work on AlphaFold. This underscores the growing reliability of such methods in guiding biomedical research.

Advanced bioinformatics tools facilitate the structural analysis of GFAP, S-100B, and UCH-L1, revealing intricate details of their secondary structures and functional motifs. Structural bioinformatics analysis of GFAP identified a predominantly alpha-helical architecture (65%, 281 residues), complemented by minor beta-strands (6.2%, 25 residues) and coils (28.7%, 120 residues). This configuration underscores its role as a stable cytoskeletal protein essential for maintaining astrocytic integrity. Two conserved domains were identified: the Pfam00038 intermediate filament domain (residues 68–376; E-value: 1.12e–127) and the Pfam04732 filament head domain (residues 4–66; E-value: 2.51e–08). These domains, along with motifs such as Filament_head and DUF1664, highlight GFAP’s involvement in synaptic plasticity and axonal transport.

PTMs, including the IF rod domain (residues 69–377) and a bipartite nuclear localization signal (NLS), were computationally predicted, suggesting roles in DNA repair and nuclear shuttling during traumatic injury. Solvent accessibility analysis indicated that 63.89% of residues are buried, conferring proteolytic resistance and explaining GFAP’s persistence in biofluids post-TBI.

Clinically, GFAP’s α-helix-rich structure aligns with its use in FDA-approved assays (e.g., BANYAN GFAP test) to reduce unnecessary neuroimaging in mild TBI cases. This structural stability (Gogishvili et al. 2024) enables reliable detection in serum and CSF. The study’s novel identification of the DUF1664 motif, previously uncharacterized in GFAP, opens avenues for investigating its role in neuroinflammation. AlphaFold-predicted models (TM-score: 0.92 ± 0.06; RMSD: 2.7 ± 2.0 Å) surpassed prior homology-based structures, offering atomic-level insights into GFAP’s interaction with inflammatory mediators such as IL-6 and TNF-α. These findings bridge structural predictions with experimental validations, including murine models demonstrating GFAP’s nuclear translocation during DNA damage (Posti et al. 2017).
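
For readers less familiar with the two model-quality metrics quoted here, the sketch below shows how RMSD and TM-score are computed from per-residue Cα distances between a model and a reference. It assumes the two structures have already been superimposed and aligned residue by residue; the function names and the example distances are hypothetical, and a full TM-score calculation additionally searches for the superposition that maximizes the score.

```python
import math


def rmsd(distances):
    """Root-mean-square deviation over aligned residue pairs (in angstroms)."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def tm_d0(target_length):
    """Length-dependent distance scale used in the TM-score formula."""
    if target_length <= 15:
        raise ValueError("target_length must exceed 15 for this formula")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """TM-score for a fixed superposition, normalized by the target length."""
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Hypothetical distances (angstroms) for a 100-residue target, for illustration only.
example_distances = [0.5] * 80 + [4.0] * 20
print(f"RMSD = {rmsd(example_distances):.2f} A")
print(f"TM-score = {tm_score(example_distances, 100):.2f}")
```

Because each residue's contribution to the TM-score is bounded between 0 and 1, a few poorly modeled residues lower the TM-score far less than they raise the RMSD, which is why the two metrics are usually reported together.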

S-100B exhibited a highly α-helical structure (67.39%, 63 residues) with no β-strands and 32.61% coils, consistent with its role as a calcium-sensing protein. The cd05027 S100B domain (residues 2–89; E-value: 1.68e–47) and EF-hand motifs (residues 49–84) were critical for Ca2+ binding and TLR4-mediated neuroinflammatory signaling. Buried residues (66.30%) stabilized Ca2+-binding pockets, while solvent-exposed regions mediated interactions with inflammatory receptors. Structural refinement using DeepRefiner yielded high-accuracy models (RMSD: 3.45 Å; TM-score: 0.99), resolving ambiguities in earlier SWISS-MODEL templates. A noncanonical coiled-coil region (residues 26–78), identified via ThreaDom analysis, suggests a scaffold for Ca2+-dependent oligomerization, a mechanism not previously described (Moreira et al. 2021; Michetti et al. 2023).

S-100B’s clinical relevance is underscored by its association with blood-brain barrier disruption, as demonstrated in multicenter studies (Mondello et al. 2021). Its EF-hand motifs align with experimental evidence showing S-100B activation of TLR4/NF-κB pathways, exacerbating neuroinflammation in rodent models (Gupta et al. 2021). The study’s prediction of S-100B’s coiled-coil domain provides a structural basis for its oligomerization, a feature implicated in amplifying inflammatory cascades. Furthermore, therapeutic targeting of S-100B using pentamidine, which reduces IL-6 and TNF-α in vivo, validates the computational insights into its Ca2+-binding pockets as druggable sites (Gupta et al. 2021).

UCH-L1 exhibited a balanced secondary structure, with 40.81% α-helices (91 residues), 17.94% β-strands (40 residues), and 41.26% coils (92 residues). The cd09616 Peptidase_C12 domain (residues 5–219; E-value: 3.16e–127) harbored a catalytic triad (Cys90, His161, Asp176) essential for ubiquitin hydrolysis. Solvent-exposed residues (42.60%) facilitated interactions with ubiquitinated proteins such as tau and α-synuclein, while buried regions stabilized the protease core. I-TASSER simulations revealed conformational flexibility in the ubiquitin-binding domain (residues 84–100), a feature undetected in static X-ray structures (PDB: 2ETL). Phylogenetic analysis positioned UCH-L1 within the cysteine protease superfamily, resolving its evolutionary divergence from UCH-L3 (bootstrap value: 98%) (Puri et al. 2024).
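
The secondary-structure percentages quoted for UCH-L1 follow directly from the residue counts. As a minimal sketch, the code below derives them from a per-residue state string of the kind produced by secondary-structure predictors. The three-letter alphabet (H for helix, E for strand, C for coil) and the function name are assumptions for illustration, and the example string is synthetic, built only from the reported counts rather than from the actual prediction.

```python
from collections import Counter


def secondary_structure_composition(ss_string):
    """Return the percentage of each state in a per-residue string, to 2 decimals."""
    if not ss_string:
        raise ValueError("ss_string must not be empty")
    counts = Counter(ss_string)
    total = len(ss_string)
    return {state: round(100.0 * n / total, 2) for state, n in counts.items()}


# Synthetic string reproducing the reported UCH-L1 counts:
# 91 helix, 40 strand, and 92 coil residues (223 in total).
example = "H" * 91 + "E" * 40 + "C" * 92
composition = secondary_structure_composition(example)
print(composition)  # {'H': 40.81, 'E': 17.94, 'C': 41.26}
```

The same calculation applies to the buried and exposed fractions from solvent accessibility predictions, with the state string replaced by per-residue burial labels.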

UCH-L1’s dual role in TBI – neuroprotection via aggregate clearance and neurotoxicity through excessive proteolysis – was corroborated by clinical studies. Its rapid release postinjury (detectable within 1 h) supports its inclusion in the ALERT-TBI diagnostic panel, which achieves 97% sensitivity for detecting intracranial lesions (Papa et al. 2016). The study’s dynamic modeling of UCH-L1’s catalytic triad provides mechanistic insights into its therapeutic modulation. For instance, inhibitors like 6-AMA have been shown to reduce axonal degeneration in vitro, aligning with computational predictions of UCH-L1’s role in tau aggregation (Yu et al. 2018). These findings highlight UCH-L1’s potential as a therapeutic target for mitigating secondary injury in TBI.

The focus on these three biomarkers was intentional, as they are well-characterized and clinically validated indicators of TBI, providing a targeted basis for future expansion to additional molecules. Though static models were employed, they offer critical preliminary insights into solvent accessibility, conserved regions, and PTMs, which can inform experimental designs for dynamic or microenvironment-specific studies. These computational findings serve as a roadmap for complementary experimental studies and clinical validation, ensuring a balanced integration of in silico and empirical approaches in advancing TBI biomarker research.

We recommend prioritizing experimental validation of the predicted structural and functional features to confirm their biological relevance. Future studies should expand the biomarker panel to include additional TBI-associated molecules, enhancing diagnostic specificity. Additionally, integrating multiomics approaches – including genomics, transcriptomics, and proteomics – could provide a holistic understanding of TBI mechanisms, effectively bridging computational predictions with clinical insights.

Conclusions

This study leverages an integrated proteomic and bioinformatic framework to elucidate the structural and functional nuances of key TBI biomarkers: GFAP, S-100B, and UCH-L1. Through meticulous analysis, we have identified conserved regions, secondary structures, solvent accessibility, and PTM sites, enhancing our understanding of their structural models and domain architectures. Domain analysis positioned each protein within specific superfamilies, shedding light on domain-specific functions. The predominance of alpha-helices in GFAP and S-100B and a balanced mix of structural elements in UCH-L1 were confirmed, with solvent accessibility profiles indicating a majority of buried regions for GFAP and S-100B, whereas UCH-L1 displayed a more exposed structure (Figures 7–9).

Figure 7

Integrative map of predicted results for GFAP. Line 1 shows the conserved regions (66–130, 157–175, 177–194, 251–276, and 367–398). Lines 2–6 show the alpha helix, beta strand, other, exposed, and buried assignments from the secondary structure prediction servers. Line 7 shows the predicted protein binding sites, and line 8 shows the predicted motifs.

https://www.biotechnologia-journal.org/f/fulltexts/202470/BTA-106-2-202470-g007_min.jpg
Figure 8

Integrative map of predicted results for S-100B. Line 1 shows the conserved region (22–38). Lines 2–6 show the alpha helix, beta strand, other, exposed, and buried assignments from the secondary structure prediction servers. Line 7 shows the predicted protein binding sites, and line 8 shows the predicted motifs.

https://www.biotechnologia-journal.org/f/fulltexts/202470/BTA-106-2-202470-g008_min.jpg
Figure 9

Integrative map of predicted results for UCH-L1. Line 1 shows the conserved regions (61–92, 108–123, 125–144, 146–186, and 196–223). Lines 2–6 show the alpha helix, beta strand, other, exposed, and buried assignments from the secondary structure prediction servers. Line 7 shows the predicted protein binding sites, and line 8 shows the predicted motifs.

https://www.biotechnologia-journal.org/f/fulltexts/202470/BTA-106-2-202470-g009_min.jpg

Advanced bioinformatics servers facilitated the identification of protein binding motifs and structural features, with AlphaFold and I-TASSER providing the most accurate full-length tertiary structure predictions. Domain architecture analysis across various databases confirmed GFAP’s affiliation with the intermediate filament superfamily, S-100B’s with the EF-hand superfamily, and UCH-L1’s with the cysteine proteinase superfamily. These findings offer profound insights into the functional roles of these proteins in TBI pathophysiology.

The integrative approach adopted in this study not only deepens our comprehension of TBI biomarkers but also paves the way for the development of targeted diagnostic and therapeutic strategies, ultimately enhancing patient care.
