- Accepted Version
– As an important standard for data exchange and information storage, XML has generated a great deal of interest, and particular attention has been paid to XML indexing. Clear use cases for structured search in XML have been established. However, most research in the area is based on either relational database systems or specialized semi‐structured data management systems. This paper proposes a method for XML indexing based on the information retrieval (IR) system Okapi.
– First, the paper reviews the structure of inverted files and explains why this indexing mechanism cannot properly support XML retrieval, using the underlying data structures of Okapi as an example. The paper then explores a revised method, implemented on Okapi, that uses path indexing structures. These index structures are evaluated in terms of indexing run time, path search run time and space costs, using the INEX and Reuters RCV1 collections.
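To make the idea of a path index concrete, the following is a minimal sketch of an inverted file whose postings also record the XML path in which each term occurs. It is an illustration only and does not reproduce Okapi's actual data structures; the names `build_path_index`, `search`, `path_ids` and `postings` are assumptions introduced here, as is the choice to store one small integer id per distinct path.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_path_index(docs):
    """Build a path dictionary and a term -> {(doc_id, path_id)} index.

    `docs` maps a document id to an XML string. Both structures are
    hypothetical and are not Okapi's on-disk layout.
    """
    path_ids = {}                # "/article/title" -> small integer id
    postings = defaultdict(set)  # term -> {(doc_id, path_id), ...}

    def add_terms(text, doc_id, pid):
        # Lowercase and split on word characters; a real system would
        # apply its own tokenisation, stemming and stopword handling.
        for term in re.findall(r"\w+", (text or "").lower()):
            postings[term].add((doc_id, pid))

    def walk(elem, prefix, doc_id):
        path = f"{prefix}/{elem.tag}"
        pid = path_ids.setdefault(path, len(path_ids))
        add_terms(elem.text, doc_id, pid)       # text directly in this element
        for child in elem:
            walk(child, path, doc_id)
            add_terms(child.tail, doc_id, pid)  # text following the child

    for doc_id, xml_text in docs.items():
        walk(ET.fromstring(xml_text), "", doc_id)
    return path_ids, postings


def search(path_ids, postings, term, path):
    """Return sorted ids of documents containing `term` directly under `path`."""
    pid = path_ids.get(path)
    if pid is None:
        return []
    return sorted({d for d, p in postings.get(term.lower(), ()) if p == pid})


docs = {
    1: "<article><title>XML indexing</title><body>Okapi uses inverted files</body></article>",
    2: "<article><title>Okapi</title><body>XML search</body></article>",
}
path_ids, postings = build_path_index(docs)
print(search(path_ids, postings, "xml", "/article/title"))
```

A plain inverted file would return both documents for the term "xml"; attaching a path id to each posting lets the query be restricted to a structural context, at the cost of larger postings.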
– Initial results on the INEX collections show that the method incurs a substantial overhead in space costs, but that this increase does not adversely affect run time. Indexing results on Reuters RCV1 sub‐collections of differing sizes show that space costs grow significantly as collection size increases, whereas run time grows linearly. Path search results show sub‐millisecond run times, demonstrating minimal overhead for XML search.
– Overall, the results show that the method is a viable way to support XML search in a traditional IR system such as Okapi.
– The paper provides useful information on a method for XML indexing based on the IR system Okapi.
|Additional Information:||This article is (c) Emerald Group Publishing and permission has been granted for this version to appear here: http://openaccess.city.ac.uk/. Emerald does not grant permission for this article to be further copied/distributed or hosted elsewhere without the express permission from Emerald Group Publishing Limited. See http://www.emeraldgrouppublishing.com/authors/writing/author_rights.htm|
|Uncontrolled Keywords:||Information retrieval, Data structures, Extensible markup language, Indexing, Resource efficiency|
|Subjects:||Q Science > QA Mathematics > QA75 Electronic computers. Computer science; Z Bibliography. Library Science. Information Resources > Z665 Library Science. Information Science|
|Divisions:||School of Informatics > Centre for Human Computer Interaction Design|