City Research Online

A Probabilistic Approach for Chinese Information Retrieval: Theory, Analysis and Experiments

Huang, X. (2001). A Probabilistic Approach for Chinese Information Retrieval: Theory, Analysis and Experiments. (Unpublished Doctoral thesis, City, University of London)

Abstract

Using probabilistic methods to retrieve information has always been a challenging task in the area of information retrieval. A key issue in probabilistic retrieval methods is the design of query term weighting functions. In this thesis, we provide a comprehensive description of the probabilistic retrieval model and propose several new weighting functions, which include both single unit weighting and compound unit weighting functions. Detailed analysis and evaluation of these new weighting functions are also provided.

This thesis provides a large number of empirical results for comparing different weighting methods in Chinese word-based and character-based retrieval systems. The results show that (1) compound unit weighting is useful for improving system performance; (2) a newly designed single unit weighting function, BM26, contributes to the improvement of Chinese information retrieval; (3) the character-based system outperforms the word-based system in terms of average precision.

The thesis makes three original contributions to modern information retrieval. First, it demonstrates that probabilistic compound unit weighting is useful for Chinese information retrieval systems. Second, it proposes a new probabilistic single unit weighting function, BM26, that considers document lengths when assigning weights to documents, and it demonstrates that the new function outperforms the function that it evolved from. Third, this thesis reports the results of large scale experiments that compare Chinese word-based and character-based retrieval systems.

In summary, the thesis combines a comprehensive description of the probabilistic model of retrieval with some new designs of probabilistic weighting formulae and new systematic experiments on the Chinese TREC Programme material. The experiments demonstrate, for a large test collection, that the probabilistic model is effective and robust for Chinese text retrieval, and that it responds appropriately, with major improvements in performance, to key features of retrieval situations in Chinese text retrieval.

Publication Type: Thesis (Doctoral)
Subjects: Z Bibliography. Library Science. Information Resources > Z665 Library Science. Information Science
Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4050 Electronic information resources
Departments: School of Communication & Creativity > Media, Culture & Creative Industries > Library & Information Science
School of Communication & Creativity > School of Communication & Creativity Doctoral Theses
Doctoral Theses
Text - Accepted Version
