Once Read is Enough: Domain-specific Pretraining-free Language Models with Cluster-guided Sparse Experts for Long-tail Domain Knowledge
Dong, F., Chen, M., Zhou, J., Shi, Y., Chen, Y., Dong, M., Wang, Y., Li, D., Yang, X., Zhu, R. ORCID: 0000-0002-9944-0369, Dick, R., Lv, Q., Yang, F., Lu, T., Gu, N. & Shang, L. (2024).
Once Read is Enough: Domain-specific Pretraining-free Language Models with Cluster-guided Sparse Experts for Long-tail Domain Knowledge.
In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J. & Zhang, C. (Eds.),
Advances in Neural Information Processing Systems.
NeurIPS 2024, 10-15 Dec 2024, Vancouver, Canada.
Abstract
Language models (LMs) pretrained only on a general, massive corpus usually cannot attain satisfactory performance on domain-specific downstream tasks, so applying domain-specific pretraining to LMs is a common and indispensable practice. However, domain-specific pretraining can be costly and time-consuming, hindering LMs' deployment in real-world applications. In this work, we identify the failure to memorize domain-specific knowledge that appears in the general corpus only rarely, with a “long-tail” distribution, as the leading cause of pretrained LMs' inferior downstream performance. Analysis of Neural Tangent Kernels (NTKs) reveals that such long-tail data are commonly overlooked in the model's gradient updates and, consequently, are not effectively memorized, leading to poor domain-specific downstream performance. Based on the intuition that data with similar semantic meaning lie closer together in the embedding space, we devise a Cluster-guided Sparse Expert (CSE) layer to actively learn long-tail domain knowledge that previous pretrained LMs typically neglect. During pretraining, a CSE layer efficiently clusters domain knowledge together and assigns long-tail knowledge to designated extra experts. CSE is also a lightweight structure that only needs to be incorporated into several deep layers. With our training strategy, we found that during pretraining, long-tail knowledge gradually forms isolated “outlier” clusters in an LM's representation space, especially in the deeper layers. Our experimental results show that pretraining CSE-based LMs alone is enough to achieve performance superior to that of regularly pretrained-then-finetuned LMs on various downstream tasks, implying the prospect of domain-specific-pretraining-free language models.
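To make the core idea concrete, below is a minimal PyTorch sketch of a cluster-guided sparse expert layer in the spirit of the abstract, not the authors' released implementation. The hard nearest-centroid routing, the learnable centroids, and names such as `num_clusters` and `long_tail_ids` (the cluster indices treated as long-tail) are assumptions made for illustration; the paper's actual routing, clustering, and training strategy may differ.

```python
# Illustrative sketch of a cluster-guided sparse expert (CSE) style layer.
# NOT the authors' implementation: routing by nearest centroid and the
# `long_tail_ids` argument are assumptions made for this example.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """Standard transformer FFN block used as a single expert."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class ClusterGuidedSparseExperts(nn.Module):
    """Route tokens to experts by their nearest cluster centroid.

    Tokens falling into clusters flagged as long-tail are handled by
    dedicated extra experts; everything else goes through a shared expert,
    so the extra capacity is spent only on rarely seen knowledge.
    """

    def __init__(self, d_model: int, d_ff: int, num_clusters: int,
                 long_tail_ids: list[int]):
        super().__init__()
        # Centroids could be initialized from k-means over hidden states of
        # a deep layer; here they are simply learnable parameters.
        self.centroids = nn.Parameter(torch.randn(num_clusters, d_model))
        self.shared_expert = FeedForward(d_model, d_ff)
        # One extra expert per long-tail cluster.
        self.extra_experts = nn.ModuleDict(
            {str(c): FeedForward(d_model, d_ff) for c in long_tail_ids}
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        flat = x.reshape(-1, x.size(-1))
        # Hard assignment: nearest centroid by cosine similarity.
        sims = F.normalize(flat, dim=-1) @ F.normalize(self.centroids, dim=-1).T
        cluster = sims.argmax(dim=-1)            # (batch * seq,)

        out = self.shared_expert(flat)           # default (shared) path
        for cid, expert in self.extra_experts.items():
            mask = cluster == int(cid)
            if mask.any():                       # long-tail path
                out[mask] = expert(flat[mask])
        return out.reshape_as(x)


# Usage sketch: clusters 5 and 11 are (hypothetically) flagged as long-tail.
layer = ClusterGuidedSparseExperts(d_model=256, d_ff=1024,
                                   num_clusters=16, long_tail_ids=[5, 11])
hidden = torch.randn(2, 8, 256)
print(layer(hidden).shape)  # torch.Size([2, 8, 256])
```

Because only tokens routed to long-tail clusters invoke the extra experts, such a layer stays lightweight and, as the abstract notes, would only need to be inserted into a few deep layers rather than throughout the network.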
| Publication Type: | Conference or Workshop Item (Paper) |
|---|---|
| Additional Information: | Copyright, the authors, 2025. |
| Subjects: | H Social Sciences > HD Industries. Land use. Labor; Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
| Departments: | Bayes Business School; Bayes Business School > Actuarial Science & Insurance |