A Probabilistic Approach for Chinese Information Retrieval: Theory, Analysis and Experiments
Huang, X. (2001). A Probabilistic Approach for Chinese Information Retrieval: Theory, Analysis and Experiments. (Unpublished Doctoral thesis, City, University of London)
Abstract
Using probabilistic methods to retrieve information has always been a challenging task in the area of information retrieval. A key issue in probabilistic retrieval methods is the design of query term weighting functions. In this thesis, we provide a comprehensive description of the probabilistic retrieval model and propose several new weighting functions, which include both single unit weighting and compound unit weighting functions. Detailed analysis and evaluation of these new weighting functions are also provided.
This thesis provides a large number of empirical results comparing different weighting methods in Chinese word-based and character-based retrieval systems. The results show that (1) compound unit weighting is useful for improving system performance; (2) a newly designed single unit weighting function, BM26, contributes to the improvement of Chinese information retrieval; (3) the character-based system outperforms the word-based system in terms of average precision.
The thesis makes three original contributions to modern information retrieval. First, it demonstrates that probabilistic compound unit weighting is useful for Chinese information retrieval systems. Second, it proposes a new probabilistic single unit weighting function, BM26, that considers document lengths when assigning weights to documents, and it demonstrates that the new function outperforms the function that it evolved from. Third, this thesis reports the results of large scale experiments that compare Chinese word-based and character-based retrieval systems.
In summary, the thesis combines a comprehensive description of the probabilistic model of retrieval with some new designs of probabilistic weighting formulae and new systematic experiments on the Chinese TREC Programme material. The experiments demonstrate, for a large test collection, that the probabilistic model is effective and robust for Chinese text retrieval, and that it responds appropriately, with major improvements in performance, to key features of retrieval situations in Chinese text retrieval.