City Research Online

Once Read is Enough: Domain-specific Pretraining-free Language Models with Cluster-guided Sparse Experts for Long-tail Domain Knowledge

Dong, F., Chen, M., Zhou, J., Shi, Y., Chen, Y., Dong, M., Wang, Y., Li, D., Yang, X., Zhu, R. ORCID: 0000-0002-9944-0369, Dick, R., Lv, Q., Yang, F., Lu, T., Gu, N. & Shang, L. (2024). Once Read is Enough: Domain-specific Pretraining-free Language Models with Cluster-guided Sparse Experts for Long-tail Domain Knowledge. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J. & Zhang, C. (Eds.), Advances in Neural Information Processing Systems. NeurIPS 2024, 10-15 Dec 2024, Vancouver, Canada.

Abstract

Language models (LMs) pretrained only on a general, massive corpus usually cannot attain satisfactory performance on domain-specific downstream tasks, so applying domain-specific pretraining to LMs is a common and indispensable practice. However, domain-specific pretraining can be costly and time-consuming, hindering LMs' deployment in real-world applications. In this work, we identify the inability to memorize domain-specific knowledge embedded in the general corpus, which occurs rarely and follows “long-tail” distributions, as the leading cause of pretrained LMs' inferior downstream performance. Analysis of Neural Tangent Kernels (NTKs) reveals that such long-tail data are commonly overlooked in the model's gradient updates and, consequently, are not effectively memorized, leading to poor domain-specific downstream performance. Based on the intuition that data with similar semantic meaning lie closer in the embedding space, we devise a Cluster-guided Sparse Expert (CSE) layer to actively learn long-tail domain knowledge that previous pretrained LMs typically neglect. During pretraining, a CSE layer efficiently clusters domain knowledge together and assigns long-tail knowledge to designated extra experts. CSE is also a lightweight structure that only needs to be incorporated in several deep layers. With our training strategy, we find that during pretraining, long-tail knowledge gradually forms isolated, “outlier” clusters in an LM's representation spaces, especially in deeper layers. Our experimental results show that pretraining CSE-based LMs alone is enough to achieve better performance than regularly pretrained-then-finetuned LMs on various downstream tasks, implying the prospect of domain-specific-pretraining-free language models.
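The abstract gives no implementation details, so the PyTorch sketch below only illustrates one plausible way a cluster-guided sparse-expert layer of the kind described could be wired up: tokens are assigned to clusters in the embedding space, rarely hit ("long-tail") clusters are routed to dedicated extra experts, and everything else passes through a shared feed-forward expert. The class name, the nearest-centroid assignment, the frequency-based rarity heuristic, and all hyperparameters are assumptions for illustration, not the authors' method.

# Hypothetical sketch of a cluster-guided sparse-expert (CSE) layer, based only on the
# abstract. Cluster training, load balancing, and any auxiliary losses the paper may use
# are omitted; centroids here receive no gradient because argmin is non-differentiable.
import torch
import torch.nn as nn


class ClusterGuidedSparseExperts(nn.Module):
    def __init__(self, d_model: int, n_clusters: int = 16, n_tail_experts: int = 4,
                 tail_fraction: float = 0.25, momentum: float = 0.99):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_clusters, d_model))
        # Running estimate of how often each cluster is hit; rare clusters count as long-tail.
        self.register_buffer("cluster_freq", torch.ones(n_clusters) / n_clusters)
        self.momentum = momentum
        self.tail_fraction = tail_fraction
        # One shared expert for frequent clusters, several extra experts for tail clusters.
        self.shared_expert = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                           nn.Linear(4 * d_model, d_model))
        self.tail_experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_tail_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for routing.
        tokens = x.reshape(-1, x.size(-1))
        # Hard-assign each token to its nearest cluster centroid.
        dists = torch.cdist(tokens, self.centroids)            # (n_tokens, n_clusters)
        cluster_id = dists.argmin(dim=-1)                      # (n_tokens,)

        if self.training:
            # Track cluster usage with an exponential moving average.
            counts = torch.bincount(cluster_id, minlength=self.centroids.size(0)).float()
            batch_freq = counts / counts.sum().clamp(min=1.0)
            self.cluster_freq.mul_(self.momentum).add_((1 - self.momentum) * batch_freq)

        # Mark the rarest `tail_fraction` of clusters as long-tail.
        n_tail = max(1, int(self.tail_fraction * self.centroids.size(0)))
        tail_clusters = torch.topk(self.cluster_freq, n_tail, largest=False).indices
        is_tail = torch.isin(cluster_id, tail_clusters)

        out = self.shared_expert(tokens)
        if is_tail.any():
            # Route each long-tail token to one dedicated extra expert, chosen by cluster id.
            expert_id = cluster_id[is_tail] % len(self.tail_experts)
            tail_tokens = tokens[is_tail]
            tail_out = out[is_tail].clone()
            for e, expert in enumerate(self.tail_experts):
                mask = expert_id == e
                if mask.any():
                    tail_out[mask] = expert(tail_tokens[mask])
            out = out.clone()
            out[is_tail] = tail_out
        return out.reshape_as(x)


if __name__ == "__main__":
    # Usage sketch: drop-in replacement for a feed-forward block in a few deep layers.
    layer = ClusterGuidedSparseExperts(d_model=768)
    h = torch.randn(2, 128, 768)
    assert layer(h).shape == h.shape

Because the abstract says CSE is lightweight and only needed in several deep layers, a module like this would presumably replace the feed-forward sublayer in just a handful of the deepest transformer blocks rather than the whole stack.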

Publication Type: Conference or Workshop Item (Paper)
Additional Information: Copyright, the authors, 2025.
Subjects: H Social Sciences > HD Industries. Land use. Labor
Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Departments: Bayes Business School
Bayes Business School > Actuarial Science & Insurance
Text - Published Version: once-read-is-enough.pdf (1MB)
