City Research Online

Distributed Inverted Files and Performance: A Study of Parallelism and Data Distribution Methods in IR

MacFarlane, A. ORCID: 0000-0002-8057-0737 (2000). Distributed Inverted Files and Performance: A Study of Parallelism and Data Distribution Methods in IR. (Unpublished Doctoral thesis, City, University of London)

Abstract

The study investigates the performance of parallel information retrieval (IR) algorithms on different data distribution methods for Inverted files to identify which is the best for the requirements of specific IR tasks. We define a data distribution method as a way of distributing Inverted file data to local disks on a parallel machine. A data distribution method may be on-the-fly (with one copy of the index held), replication (all nodes have all of the index) or partitioning (data for index is split amongst nodes). Partitioning of inverted file data can be done in many ways but we consider only two: by term (Termld) and by document (Dodd). Termld partitioning is a type of partitioning which distributes unique word data to a single partition, while D odd partitioning distributes unique document data to a single partition. We consider the issue of improving the performance of standard IR algorithms on these data distribution methods by looking at sequential job service not concurrent job service, e.g. we consider the issue of sequential query service not concurrent query service. This methodology rules out some distribution methods for some tasks studied. We consider the following main tasks of IR: indexing, search, passage retrieval, inverted file update and query optimisation for routing /filtering. We produce a synthetic performance model for each of these tasks for the purposes of comparison. We have two subsidiary aims; one was to demonstrate portability of our implemented data structures and algorithms on different parallel machines. Secondly, we also study the possibility of increased retrieval effectiveness by examining a larger section of the search space for both passage retrieval and routing/filtering. We consider the implications of concurrency in updates on Inverted files. Our theoretical and empirical results show that in most cases the D odd partitioning method is the best data distribution method apart from routing/filtering where replication was found to be superior.

Publication Type: Thesis (Doctoral)
Subjects: Z Bibliography. Library Science. Information Resources > Z665 Library Science. Information Science
Departments: School of Communication & Creativity
School of Communication & Creativity > Media, Culture & Creative Industries > Library & Information Science
School of Communication & Creativity > School of Communication & Creativity Doctoral Theses
Doctoral Theses
[thumbnail of Macfarlane thesis 2000 PDF-A.pdf]
Preview
Text - Accepted Version
Download (16MB) | Preview

Export

Add to AnyAdd to TwitterAdd to FacebookAdd to LinkedinAdd to PinterestAdd to Email

Downloads

Downloads per month over past year

View more statistics

Actions (login required)

Admin Login Admin Login