City Research Online

Secondary use of electronic medical records for early identification of raised condition likelihoods in individuals: a machine learning approach

Turner, Jonathan (2019). Secondary use of electronic medical records for early identification of raised condition likelihoods in individuals: a machine learning approach. (Unpublished Doctoral thesis, City, University of London)


Because many symptoms are common to multiple diseases, it is challenging to produce an initial diagnosis, or a recommendation for diagnostic tests, from a set of symptoms that could have been produced by any of several diseases. Often the initial choice of diagnosis or testing is based on a clinician’s impression of the likelihood of that condition in a general population; however, there may be an opportunity to modify these likelihoods based on individuals’ recorded medical histories. This data-driven approach uses existing data and is thus cheap and non-invasive. A method is proposed by which an individual’s likelihoods of having specified medical conditions are modified according to the similarity of that individual’s medical history to the medical histories of other individuals: the prevalence of each condition among individuals whose records are similar to that of the individual of interest is compared with its prevalence among individuals whose records are dissimilar.

To maximise the number of records available for analysis, a process was developed for merging data from disparate sources that used different clinical coding systems. This included extensive development of a technique for semi-automatically mapping clinical events coded in ICD-9-CM to Clinical Terms Version 3 (CTV3), for which no existing mapping table was found. Semantically similar fields in the source code sets were identified and retained in the combined data set. ‘Codelists’, each comprising multiple CTV3 codes, were built for a variety of conditions to define the presence of those conditions within individual records. The hierarchical structure of the CTV3 code table was used to identify codes that differed in structure but had clinically similar or related meanings. The optimum degree of granularity of the coded data for identifying similar records was investigated and used in subsequent analysis.
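The comparison of condition prevalence among similar and dissimilar records can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not state which similarity measure was used, so Jaccard similarity over sets of codes is an assumption here. The function names, the single-letter codes, and the choice to exclude the condition's own codelist when measuring similarity are likewise hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two code sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prevalence(records: Dict[str, Set[str]], ids: List[str],
               codelist: Set[str]) -> float:
    """Fraction of the given records containing any code from the codelist."""
    if not ids:
        return 0.0
    return sum(1 for i in ids if records[i] & codelist) / len(ids)


def similar_vs_dissimilar(records: Dict[str, Set[str]], target_id: str,
                          codelist: Set[str], k: int) -> Tuple[float, float]:
    """Prevalence of a condition in the k most similar records vs the rest."""
    # Assumption: codes that define the condition are ignored when measuring
    # similarity, so the condition itself does not drive the grouping.
    target = records[target_id] - codelist
    others = [i for i in records if i != target_id]
    # Sort by descending similarity; ties are broken by record id.
    ranked = sorted(
        others, key=lambda i: (-jaccard(target, records[i] - codelist), i)
    )
    similar, dissimilar = ranked[:k], ranked[k:]
    return (prevalence(records, similar, codelist),
            prevalence(records, dissimilar, codelist))


# Toy records with placeholder codes (not real CTV3 codes).
records = {
    "p0": {"A", "B", "C"},   # individual of interest
    "p1": {"A", "B", "X"},
    "p2": {"A", "C", "X"},
    "p3": {"D", "E"},
    "p4": {"E", "F"},
    "p5": {"D", "F", "X"},
}
sim_prev, dis_prev = similar_vs_dissimilar(records, "p0", {"X"}, k=2)
print(sim_prev, dis_prev)
```

In this toy data the condition appears in both of the two most similar records but in only one of the three remaining records, which is the kind of contrast the method uses to adjust an individual's condition likelihood.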

Two methods were used for discovering groups of similar and dissimilar individuals: the ‘nearest neighbours’ method and the grouping of records using a clustering process. Altered likelihoods for a range of conditions were investigated, and results for the nearest-neighbours approach were compared with those for the clustering approach. Results for adjusted condition likelihoods for 18 conditions are reported, together with a discussion of possible reasons for a change, or otherwise, in the condition likelihood, and a discussion of the clinical significance and potential use of information about such a change. Logistic regressions were also performed on a selection of conditions. The nearest-neighbours method (KNN) performed better than logistic regression when judged by F-score (or by sensitivity and specificity separately); however, the situation is more nuanced when looking at likelihood ratios: logistic regression produced higher (better) positive likelihood ratios, but KNN produced lower (better) negative likelihood ratios. Logistic regression also produced higher odds ratios.
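For reference, the evaluation measures named above can all be derived from a 2x2 confusion matrix. The sketch below uses the standard definitions and makes two assumptions: that the F-score is the F1 score, and that the odds ratio is the diagnostic odds ratio. The counts are illustrative only and are not results from the thesis.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard diagnostic measures from confusion-matrix counts.

    Assumes every denominator is non-zero; empty cells would need
    separate handling (for example a continuity correction).
    """
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    lr_pos = sensitivity / (1 - specificity)   # higher is better
    lr_neg = (1 - sensitivity) / specificity   # lower is better
    odds_ratio = (tp * tn) / (fp * fn)         # equals lr_pos / lr_neg
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "lr_pos": lr_pos,
        "lr_neg": lr_neg,
        "odds_ratio": odds_ratio,
    }


# Illustrative counts only.
metrics = diagnostic_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the positive likelihood ratio depends on the false positive rate and the negative likelihood ratio on the false negative rate, one classifier can have the better positive ratio while another has the better negative ratio, which matches the pattern reported for logistic regression and KNN.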

Publication Type: Thesis (Doctoral)
Subjects: Q Science > Q Science (General)
T Technology > T Technology (General)
Departments: Doctoral Theses
School of Science & Technology
School of Science & Technology > Computer Science
Text - Accepted Version