Title: Beyond ESM2: Graph-Enhanced Protein Sequence Modeling with Efficient Clustering

Abstract: Proteins are essential to life's processes, underpinning evolution and diversity. Advances in sequencing technology have revealed millions of proteins, underscoring the need for sophisticated pre-trained protein models for biological analysis and AI development. Facebook's ESM2, the most advanced protein language model to date, leverages a masked prediction task for unsupervised learning, crafting amino acid representations with notable biochemical accuracy. Yet it falls short in delivering functional protein insights, signaling an opportunity to enhance representation quality. Our study addresses this gap by incorporating protein family classification into ESM2's training. This approach, augmented with a Community Propagation-Based Clustering Algorithm, improves global protein representations, while a contextual prediction task fine-tunes local amino acid accuracy. Significantly, our model achieved state-of-the-art results in several downstream experiments, demonstrating the power of combining global and local methodologies to substantially boost protein representation quality.
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Cite as: arXiv:2404.15805 [q-bio.BM]
  (or arXiv:2404.15805v1 [q-bio.BM] for this version)

Submission history

From: Wei Chen
[v1] Wed, 24 Apr 2024 11:09:43 GMT (888kb,D)
