Detecting beats in the photoplethysmogram: benchmarking open-source algorithms

Abstract: The photoplethysmogram (PPG) signal is widely used in pulse oximeters and smartwatches. A fundamental step in analysing the PPG is the detection of heartbeats. Several PPG beat detection algorithms have been proposed, although it is not clear which performs best. Objective: This study aimed to: (i) develop a framework with which to design and test PPG beat detectors; (ii) assess the performance of PPG beat detectors in different use cases; and (iii) investigate how their performance is affected by patient demographics and physiology. Approach: Fifteen beat detectors were assessed against electrocardiogram-derived heartbeats using data from eight datasets. Performance was assessed using the F1 score, which combines sensitivity and positive predictive value. Main results: Eight beat detectors performed well in the absence of movement with F1 scores of ≥90% on hospital data and wearable data collected at rest. Their performance was poorer during exercise with F1 scores of 55%–91%; poorer in neonates than adults with F1 scores of 84%–96% in neonates compared to 98%–99% in adults; and poorer in atrial fibrillation (AF) with F1 scores of 92%–97% in AF compared to 99%–100% in normal sinus rhythm. Significance: Two PPG beat detectors denoted 'MSPTD' and 'qppg' performed best, with complementary performance characteristics. This evidence can be used to inform the choice of PPG beat detector algorithm. The algorithms, datasets, and assessment framework are freely available.


Introduction
The photoplethysmogram (PPG) signal is acquired by a range of clinical and consumer devices, from pulse oximeters to smartwatches (Allen 2007, Charlton and Marozas 2022). It exhibits a pulse wave for each heartbeat, caused by the ejection of blood from the heart into the circulation. A wealth of physiological information can be deduced from the timing and shape of PPG pulse waves. Consequently, a fundamental step in analysing the PPG is to detect individual pulse waves, corresponding to individual heartbeats. Indeed, several beat detection algorithms have been developed for the PPG, although it is not yet known how their performance compares.
It is important to assess the performance of beat detectors in different use cases, where PPG signals can have different morphologies and levels of artifact. Specifically, pulse oximeters acquire PPG signals at the finger close to major arteries, often with little motion artifact. In contrast, smart wearables such as smartwatches and fitness bands acquire the PPG at the wrist further from major arteries, often in challenging conditions such as during exercise. Assessing the performance of beat detectors across different use cases would allow one to select the best beat detector for a particular use case, and to understand its expected performance.
It is also important to investigate the impact of patient demographics and physiology on performance. First, it is important to assess performance during arrhythmias, since the PPG is now being used to identify atrial fibrillation (AF) (Perez et al 2019). Second, performance should be compared between ethnicities, as the performance of pulse oximeters has been found to be related to ethnicity (Sjoding et al 2020). Third, it is important to assess whether performance differs in babies, who have higher heart rates (HRs) than adults (Fleming et al 2011). Assessing the impact of patient demographics and physiology on performance could highlight areas for future algorithm development.
This study aimed to: (i) develop an assessment framework with which to design and test PPG beat detectors; (ii) assess the performance of several beat detectors in different use cases; and (iii) investigate how their performance is affected by patient demographics and physiology. Fifteen open-source beat detectors were assessed against reference beats from electrocardiogram (ECG) signals in eight freely available datasets. This study builds on previous work which assessed the performance of four beat detectors on a single dataset (Kotzen et al 2021), extending the assessment to fifteen beat detectors across eight datasets.

Materials and methods
Ethical approval was not required for this study as it used pre-existing, anonymised data.

Datasets
The datasets used in this study are summarised in table 1, and are now described.
For each dataset, the table indicates the duration of recordings and the total number of beats used in the analysis (shown for the MSPTD beat detector).

Hospital monitoring
A total of six datasets were used to assess performance during hospital monitoring: the CapnoBase and BIDMC datasets (which contain high-quality data), and four novel datasets extracted from the MIMIC Database (which contain real-world data).
The CapnoBase and BIDMC datasets were originally designed for developing and assessing PPG signal processing algorithms. They contain high-quality ECG and PPG signals with little artifact. Therefore, the performance of beat detectors on these datasets represents the best possible performance that could be expected in hospital monitoring. CapnoBase (Karlen et al 2013) contains data from 42 paediatric and adult subjects undergoing elective surgery and anaesthesia. BIDMC (Pimentel et al 2017) contains data from 53 adults receiving critical care on a Medical Intensive Care Unit (46 subjects), Coronary Care Unit (6), or Surgical Intensive Care Unit (1). The BIDMC dataset was originally derived from the MIMIC-II Database (Goldberger et al 2000, Saeed et al 2011).
In addition, four novel datasets were extracted from the MIMIC-III Database (Goldberger et al 2000, Johnson et al 2016) for this study. These are named the 'MIMIC PERform' Datasets, as they contain (P) PPG, (E) ECG and (R) Respiration signals. These datasets were designed to be representative of real-world critical care data: their signals contain motion artifact and some low-quality periods. The MIMIC PERform Training and Testing Datasets each contain 10 minutes of data from 200 patients, consisting of 100 adults and 100 neonates. The MIMIC PERform Testing Dataset was used to compare performance between adults and neonates in this study. The MIMIC PERform AF Dataset contains 20 minutes of data from 19 patients in AF and 16 patients in normal sinus rhythm (non-AF). It was used to compare performance between AF and normal sinus rhythm. Labels of AF were obtained from manual annotations by cardiologists (Bashar et al 2019, Bashar 2020). The MIMIC PERform Ethnicity Dataset contains 10 minutes of data from 100 Black and 100 White subjects. It was used to compare performance between Black and White subjects, in keeping with Sjoding et al (2020).
All MIMIC PERform Datasets were extracted from the MIMIC-III Waveform Database, except for the Ethnicity Dataset, which was extracted from the MIMIC-III Matched Waveform Database (Moody et al 2020). Data were extracted by searching for MIMIC records which met the following criteria: (i) contain the required signals (PPG, ECG, and for all except the AF Dataset, respiration); (ii) are of sufficient duration (10 minutes in the case of the Training, Testing and Ethnicity Datasets, and 20 minutes in the case of the AF Dataset); and (iii) contain minimal flat line segments (indicating sensor disconnection or saturation). The MIMIC PERform Datasets are available in Charlton (2022b).
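Criterion (iii) can be screened for programmatically. The sketch below is a minimal illustration in Python, not the extraction code used in the study; the helper name `has_flat_line` and the run-length approach are our own assumptions.

```python
import numpy as np

def has_flat_line(sig, fs, max_flat_s=0.2):
    """Return True if sig contains a run of identical consecutive samples
    lasting longer than max_flat_s seconds (e.g. sensor disconnection)."""
    flat = np.diff(sig) == 0          # True where consecutive samples are equal
    run = 0
    for f in flat:
        run = run + 1 if f else 0     # length of the current flat run (in diffs)
        if run >= int(max_flat_s * fs):
            return True
    return False
```

A signal with noise throughout would pass this check, whereas one containing a 0.3 s flat segment at 100 Hz would be flagged.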

PPG beat detection
First, any PPG signals sampled at over 100 Hz were resampled at this frequency to reduce computation time. For signals sampled at multiples of 100 Hz this was performed using downsampling, and for other signals using resampling with an anti-aliasing lowpass filter. Second, signals were band-pass filtered between 0.67 and 8.0 Hz to eliminate non-cardiac frequencies. Third, beats were detected using each of the fifteen open-source PPG beat detectors in turn, as demonstrated for two beat detectors in figure 1. The beat detectors are described in table 2. Beat detection was performed on 20 s windows of the PPG signal, overlapping by 5 s. Repeated beat detections due to overlapping windows were eliminated. This approach ensured that beat detectors were not penalised for missing beats at the start or end of a window. Fourth, windows were excluded if they contained a flat line lasting more than 0.2 s (typically caused by sensor disconnection or signal 'clipping').
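The resampling, filtering, and windowing steps above can be sketched as follows (a minimal illustration using SciPy; the function names, filter order, and windowing helper are our own choices, not the study's implementation):

```python
import numpy as np
from scipy.signal import butter, decimate, filtfilt, resample_poly

def preprocess_ppg(ppg, fs, target_fs=100):
    """Resample to target_fs, then band-pass filter between 0.67 and 8 Hz."""
    if fs > target_fs and fs % target_fs == 0:
        ppg = decimate(ppg, fs // target_fs)       # downsampling (anti-aliasing included)
    elif fs > target_fs:
        ppg = resample_poly(ppg, target_fs, fs)    # resampling with anti-aliasing filter
    nyq = target_fs / 2
    b, a = butter(4, [0.67 / nyq, 8.0 / nyq], btype="bandpass")
    return filtfilt(b, a, ppg)                     # zero-phase band-pass filtering

def window_indices(n_samples, fs, win_s=20, step_s=15):
    """Start/end samples of 20 s windows overlapping by 5 s (i.e. a 15 s step)."""
    win, step = win_s * fs, step_s * fs
    return [(s, s + win) for s in range(0, n_samples - win + 1, step)]
```

For example, a 10 s signal sampled at 500 Hz is reduced to 1000 samples at 100 Hz, and a 60 s recording at 100 Hz yields windows starting at 0 s, 15 s, 30 s, and so on.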
The beat detectors are available in (Charlton 2022a). For consistency, each beat detector's annotations were used to obtain the corresponding middle-amplitude point of the systolic upslope on each detected PPG pulse wave (Peralta et al 2019), which was used for analysis. This point has been found to provide more accurate timings than peaks or onsets (Peralta et al 2019).
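For illustration, given the sample indices of a pulse onset and the subsequent systolic peak, the mid-amplitude point of the upslope can be located as below. This is a sketch only; the helper name and the onset/peak inputs are our own assumptions, not the implementation of Peralta et al (2019).

```python
import numpy as np

def mid_amplitude_point(ppg, onset, peak):
    """Index of the first sample on the systolic upslope whose amplitude is at
    or above the midpoint between the pulse onset and the systolic peak."""
    mid = (ppg[onset] + ppg[peak]) / 2
    upslope = np.asarray(ppg[onset:peak + 1])
    return onset + int(np.argmax(upslope >= mid))  # first crossing of the midpoint
```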

Reference ECG beat detection
The CapnoBase and PPG-DaLiA datasets contain manual beat annotations which were used as reference beats. In the remaining datasets, reference beats were obtained from simultaneous ECG signals by: (i) detecting beats using two separate ECG beat detectors; (ii) identifying 'correct' beats as those which both beat detectors detected within 150 ms of each other; and (iii) excluding from the analysis any 20 s windows in which the two beat detectors did not agree. The two beat detectors were: the 'jqrs' ECG beat detector, which is based on the Pan and Tompkins method (Behar et al 2014, Johnson et al 2014), and the 'rpeakdetect' ECG beat detector (Clifford).
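The agreement step (ii) can be sketched as follows (illustrative Python; the choice to keep the averaged timing of each agreed pair is our assumption, as the text does not state which timing was retained):

```python
import numpy as np

def agreed_beats(beats_a, beats_b, tol=0.150):
    """Return beat times detected by both detectors within tol seconds of each
    other. Times are in seconds; the averaged timing of each pair is kept."""
    beats_b = np.asarray(beats_b, dtype=float)
    agreed = []
    for t in beats_a:
        nearest = beats_b[np.argmin(np.abs(beats_b - t))]  # closest beat from detector B
        if abs(nearest - t) <= tol:
            agreed.append((t + nearest) / 2)               # assumption: average the pair
    return agreed
```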

Aligning PPG beats with reference ECG beats
PPG and ECG signals were not necessarily precisely aligned, so the timings of PPG-derived beats and reference ECG-derived beats were aligned as follows. The time difference between each ECG-derived beat and its closest PPG-derived beat was calculated. Those ECG-derived beats for which the absolute time difference was <150 ms were determined to be correctly identified. This process was repeated when offsetting the beats by lags of −10 to 10 s, in increments of 20 ms. The lag which resulted in the highest proportion of beats being correctly identified was accepted as the true lag and used to synchronise the timings of beats. Figure 2(a) shows an example of this time-alignment.
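The lag search described above can be sketched as follows (a brute-force illustration in Python; the function name is our own, and this is not the study's implementation):

```python
import numpy as np

def best_lag(ecg_beats, ppg_beats, tol=0.150, max_lag=10.0, step=0.020):
    """Find the lag (s) to subtract from PPG beat times which maximises the
    proportion of ECG beats with a PPG beat within tol seconds."""
    ecg = np.asarray(ecg_beats, dtype=float)
    ppg = np.asarray(ppg_beats, dtype=float)
    best_hits, best = -1, 0.0
    for lag in np.arange(-max_lag, max_lag + step / 2, step):  # -10 to 10 s, 20 ms steps
        shifted = ppg - lag
        hits = sum(np.min(np.abs(shifted - t)) <= tol for t in ecg)
        if hits > best_hits:
            best_hits, best = hits, lag
    return best
```

For instance, if PPG beats consistently lag ECG beats by 0.5 s, the returned lag is within the matching tolerance of 0.5 s.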

Statistical analysis
The ability of beat detectors to detect beats was assessed by comparing PPG-derived beats with reference beats. Reference beats were determined to be correctly identified if the closest PPG-derived beat was within ±150 ms, as shown in figure 2(b). For each recording, the numbers of reference beats (n_ref), PPG-derived beats (n_PPG), and correctly identified beats (n_correct) were used to calculate: sensitivity = n_correct / n_ref; positive predictive value (PPV) = n_correct / n_PPG; and the F1 score, F1 = 2 × sensitivity × PPV / (sensitivity + PPV). Beat detectors were ranked according to the F1 score, which is the harmonic mean of sensitivity and PPV.
The accuracy of PPG-derived heart rates (HRs) was assessed by comparing PPG-derived HRs to reference ECG-derived HRs. A HR (in beats per minute, bpm) was calculated at the time of each PPG-derived beat, from the number of PPG-derived beats in the preceding 8 s window (n_beats), as HR = 60 × (n_beats − 1) / (t(n_beats) − t(1)), where t denotes the times of the PPG-derived beats in the window. Each HR signal was interpolated using sample-and-hold interpolation at 50 Hz. Performance was assessed as the mean absolute percentage error (MAPE) between time series. A median MAPE of <10% was deemed to be acceptable for HR monitoring. This was based on the acceptable limits of ±10% stated in the AAMI standard (ANSI/AAMI 2002) and implemented using the MAPE statistic in (Consumer Technology Association 2018), although we note that the true threshold of acceptability is likely to vary between applications (Mühlen et al 2021).
Performance statistics are reported as median (25th–75th percentiles). The Wilcoxon rank sum test was used to compare performances between groups, at a significance level of α = 0.05. A Holm-Sidak correction was applied to account for multiple comparisons.
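For illustration, the beat detection statistics and the HR error metric can be computed as follows (a minimal Python sketch; the function names and matching implementation are our own, not the study's code):

```python
import numpy as np

def beat_detection_metrics(ref_beats, ppg_beats, tol=0.150):
    """Sensitivity, PPV and F1 score from reference and PPG-derived beat
    times (in seconds), using a +/-150 ms matching tolerance."""
    ppg = np.asarray(ppg_beats, dtype=float)
    n_correct = sum(np.min(np.abs(ppg - t)) <= tol for t in ref_beats)
    sens = n_correct / len(ref_beats)
    ppv = n_correct / len(ppg_beats)
    f1 = 2 * sens * ppv / (sens + ppv) if (sens + ppv) > 0 else 0.0
    return sens, ppv, f1

def hr_mape(ref_hr, est_hr):
    """Mean absolute percentage error (%) between two HR time series."""
    ref = np.asarray(ref_hr, dtype=float)
    est = np.asarray(est_hr, dtype=float)
    return 100 * np.mean(np.abs(est - ref) / ref)
```

For example, with four reference beats of which two are matched by three detected beats, sensitivity is 0.5, PPV is 2/3, and the F1 score is their harmonic mean, 4/7.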

Results
The main results are summarised in table 3. This table reports the performance of beat detectors (F1 score) and their performance for HR monitoring (HR MAPE). Results are provided for the best-performing beat detectors (found to be MSPTD and qppg, as detailed in section 3.2), and all beat detectors (reported as the range in performance metrics from the worst to the best performance).

Performance of beat detectors in different use cases
The performance of beat detectors is presented in figure 3 using the F1 score, and in figure 4 using the HR MAPE. Additional results are provided in appendix A for sensitivity and PPV (figures A1 and A2 respectively). The key findings are as follows.
First, eight beat detectors performed very well across all datasets with low levels of movement: AMPD, MSPTD, qppg, PWD, ERMA, SPAR, ABD, and HeartPy. These had median F1 scores of: 99% on the hospital monitoring datasets containing high-quality data (CapnoBase and BIDMC); 90% on the hospital monitoring datasets containing real-world data (MIMIC PERform Training and Testing Datasets); and 90% on the wearable datasets with low levels of movement (WESAD (meditation) and PPG-DaLiA (sitting)). The remainder of the Results will focus on these eight beat detectors. Figure 5(a) shows an example of (mostly) accurate beat detection during low levels of movement. Of note, the Pulses beat detector performed less well on the PPG-DaLiA (sitting) dataset because its assumed duration of the systolic upslope was no longer valid in these wrist signals acquired at rest.
Second, performance decreased during activities associated with more movement. The eight beat detectors which performed well on data with low levels of movement had median F1 scores of 93%–96% on PPG-DaLiA (sitting). This performance decreased to 70%–91% on PPG-DaLiA (cycling), 60%–77% on PPG-DaLiA (walking), and 55%–72% on PPG-DaLiA (stair climbing). Performance was also poorer during stress, as shown by median F1 scores of 59%–70% on WESAD (stress) compared to 71%–80% on WESAD (baseline). This was primarily due to beat detectors missing beats, rather than falsely detecting beats, as shown by the generally lower sensitivities than positive predictive values on the PPG-DaLiA (walking) and WESAD (stress) datasets (see appendix A, figures A1 and A2).
Third, the variability in performance between subjects was low during activities associated with low levels of movement, as shown by the relatively low inter-quartile ranges of F1 scores (indicated by the heights of boxes) on WESAD (meditation) and PPG-DaLiA (sitting). However, performance varied much more between subjects in more challenging datasets, e.g. WESAD (stress) and PPG-DaLiA (walking).

Table 2 (excerpt). Descriptions of selected beat detectors, which appeared at this point in the source layout:
- (Adaptive-threshold trough detection) The PPG is bandpass filtered between 0.5 and 20 Hz. Troughs are identified as local minima which are below an adaptive threshold. The adaptive threshold increases from the value of the previous trough, at a rate related to the PPG amplitude. Any troughs occurring within a period of 0.6 times the previous inter-beat interval are excluded. The 'Vmin' implementation of this beat detector was used, as it performed slightly better than the 'Vmax' implementation in initial testing.
- (Adaptive-threshold peak detection) Peaks are identified in the differentiated PPG using an adaptive threshold set to the amplitude of the previous peak, which decreases for a period after that peak at a rate dependent on previous inter-beat intervals. Beats are identified as maxima in the PPG within 300 ms of each peak in the differentiated PPG.
- qppg (W. Zong) Systolic upslopes are detected from a signal generated with a slope sum function, which sums the magnitudes of the PPG upslopes in the preceding 0.17 s. Adaptive thresholding is used to identify systolic upslopes in this signal. The 'qppgfast' implementation of this beat detector was used, after testing showed it performed similarly to the original 'qppg' implementation.
- SPAR: Symmetric Projection Attractor Reconstruction (description truncated in the source).
- (Wavelet-based detection) The PPG is decomposed using the Stationary Wavelet Transform. Multi-scale sums and products of selected detail subbands are calculated to emphasise systolic upslopes. An envelope is then extracted by: adaptive thresholding to reduce the influence of noise; calculating the Shannon entropy; and smoothing the result. Finally, beats are identified in the envelope using a Gaussian derivative filter.
- WFD: Wavelet Foot Delineation (Conn and Borkholder 2013) The PPG is bandpass filtered between 0.5 and 8 Hz, and interpolated to 250 Hz. It is decomposed using a wavelet transform, retaining the fifth wavelet scale for analysis. This signal is rectified and squared to eliminate values below zero. Regions containing beats are identified as those where the signal exceeds a low-pass filtered version of the signal. The timing of the beat within each region is identified as the first zero-crossing of the third derivative, or failing that, the maximum in the second derivative.
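The slope sum function used by the qppg beat detector (described in table 2) can be illustrated as follows. This is a sketch of the general idea only, not W. Zong's implementation; the 0.17 s analysis window follows the description in table 2.

```python
import numpy as np

def slope_sum_function(ppg, fs, win_s=0.17):
    """At each sample, sum the positive slopes over the preceding win_s
    seconds, emphasising systolic upslopes."""
    slopes = np.diff(ppg, prepend=ppg[0])
    slopes[slopes < 0] = 0.0                       # keep upslopes only
    w = max(int(win_s * fs), 1)                    # window length in samples
    return np.convolve(slopes, np.ones(w))[:len(ppg)]  # causal moving sum
```

On a rising segment the output grows towards the total amplitude gained over the window, whereas on a falling segment it is zero, which is what makes systolic upslopes easy to threshold.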

Best-performing beat detectors
To identify the best-performing beat detectors, we focused on results from the MIMIC PERform Testing and PPG-DaLiA (working) datasets, since these are representative of real-world performance in critical care and daily life respectively. On the MIMIC PERform Testing Dataset, the top-scoring beat detectors were MSPTD, AMPD, qppg, ABD, and Pulses (all with F1 scores of 96.6%–97.5%, whereas the remainder scored ≤95.6%). On PPG-DaLiA (working), the top scorers were PWD, MSPTD, AMPD, ABD, qppg, and WFD (all with F1 scores of 80.0%–81.4%, whereas the remainder scored <79.0%). In addition, MSPTD was the best-performing beat detector on 5 of the 12 WESAD and PPG-DaLiA datasets, and qppg was the best-performing beat detector on 4 of these datasets. Therefore, we suggest that MSPTD and qppg performed best, although we note that this judgement is subjective, and that some other beat detectors also performed well (notably ABD and AMPD).
The best-performing beat detectors have complementary performance characteristics: MSPTD tended to have a higher positive predictive value, whereas qppg tended to have higher sensitivity (see appendix A, figures A1 and A2). Figure 5 shows examples of this: qppg sometimes detected additional beats during noise (see figure 5(a) at 0.5 s), whereas MSPTD sometimes missed beats (see figure 5(h)).

Acceptability for heart rate monitoring
The performance of beat detectors was deemed to be acceptable for HR monitoring in some use cases but not others (see figure 4). All eight beat detectors which had been found to perform well on data with low levels of movement also had acceptable HR MAPEs of <10% on datasets associated with low and moderate levels of movement (the hospital monitoring datasets, and WESAD (meditation, amusement, baseline) and PPG-DaLiA (sitting, working)). At least some of these beat detectors did not perform acceptably on each of the remaining datasets. None of the eight beat detectors produced acceptable HR errors during stress (see WESAD (stress)). Five of the eight beat detectors (MSPTD, qppg, ABD, AMPD, and ERMA) produced acceptable errors during less intensive activities (PPG-DaLiA (lunch break) and PPG-DaLiA (car driving)). Only qppg performed acceptably on PPG-DaLiA (cycling). None of the beat detectors performed acceptably during more intensive exercise (PPG-DaLiA (walking), PPG-DaLiA (stair climbing), and PPG-DaLiA (table soccer)).

Figure 2. Comparing PPG-derived beats with reference beats: (a) Time-alignment of electrocardiogram (ECG) and photoplethysmogram (PPG) signals. The time lag between ECG and PPG signals (0.68 s in this case) was automatically identified from ECG and PPG beat timings. (b) Assessing the ability of a beat detector to detect beats in the PPG. Those beats detected in the PPG (red circles) which occurred within ±150 ms of time-aligned reference ECG beats were deemed to be correct.

Association between performance and patient physiology and demographics
The associations between beat detector performance and the assessed factors are shown in figure 6.
The performance of beat detectors was poorer in AF (figure 6(a)). The eight beat detectors which performed well at rest achieved F1 scores of 99.4%–99.7% in normal sinus rhythm (non-AF), compared to 91.8%–97.1% in AF. This was primarily because beat detectors missed beats during AF (see appendix B, figures A3(a) and A4(a)), similarly to their performance during movement. Performance was worse in AF subjects than non-AF subjects for all eight beat detectors at the 5% significance level, and four of these differences remained significant after accounting for multiple comparisons (0.2% significance level).
All eight beat detectors performed worse in neonates than adults (figure 6(b)). Seven of these differences remained significant after accounting for multiple comparisons. The eight beat detectors achieved F1 scores of 97.8%–98.5% in adults compared to 84.2%–95.9% in neonates. These beat detectors missed beats, as shown by their lower sensitivities (see appendix B, figure A3(b)). The lower performance in neonates may be because the neonatal PPG signals were of lower quality, as shown by their lower SNRs (−10.9 (−12.2 to −8.8) dBc in neonates compared to −5.9 (−9.6 to −1.6) dBc in adults). In addition, some beat detectors may have been designed for use with adults, who typically have HRs between 60 and 100 bpm, whereas neonates typically have HRs between 110 and 160 bpm (Fleming et al 2011).
Five of the eight beat detectors had lower F1 scores in White subjects than Black subjects (figure 6(c)), although none of these differences were significant after accounting for multiple comparisons.
Table 4 presents the proposed assessment framework. The MIMIC PERform datasets are recommended for developing and testing algorithms, and for comparing performance between adults and neonates. Of the wearable datasets, WESAD is recommended for training and PPG-DaLiA for testing, as the latter allows performance to be assessed during several activities of daily living. The MIMIC PERform AF Dataset is recommended for assessing performance in AF, although it would benefit from the inclusion of additional subjects in the future. The CapnoBase and BIDMC datasets were designated as 'preliminary design' datasets: all beat detectors achieved F1 scores of >93% on these datasets, so it is unlikely that they could be used to substantially improve beat detector design.

Discussion
This study assessed the performance of several open-source PPG beat detectors across a range of datasets. Most beat detectors performed well on hospital data and at rest, but performed worse during movement, stress, AF, and in neonates. The study provides a standardised framework with which to develop and test beat detectors.
The findings could inform PPG-based monitoring strategies and directions for algorithm development. The poorer performance of beat detectors during movement is reflected in current monitoring strategies. For instance, smartwatches which use the PPG to check for an irregular pulse often only do so whilst the subject is stationary (Perez et al 2019) -a strategy which is supported by this study. Future work should investigate how best to use a simultaneous accelerometry signal to identify periods in which the subject is stationary and therefore beats can be accurately detected. The poorer performance in neonates and during AF indicates areas for development (Han et al 2022). Future work could also assess performance in other situations which impact the pulse wave, such as during ectopic beats, hypoperfusion, and vascular disease. This study also provides motivation for strategies to improve beat detection and exclude unreliable data from analyses, such as motion artifact cancellation and signal quality assessment.
The beat detectors used in this study are indicative of the range of approaches proposed in the literature to detect beats in the PPG. As detailed in table 2, approaches included: (i) identifying peaks in the original PPG signal (HeartPy and COppg); (ii) identifying systolic upslopes using the original signal (IMS) or first derivative (qppg, ABD, PWD and Pulses); (iii) using the local maxima scalogram to identify peaks across several scales (MSPTD and AMPD); and (iv) representing the PPG in phase space (SPAR). The MSPTD and qppg beat detectors performed best in this study. MSPTD searches for peaks without using any prior knowledge of the characteristics of PPG pulse waves. In contrast, qppg searches for systolic upslopes based on their expected characteristics. In the future, different approaches could be combined to improve performance.
The algorithms, datasets, and assessment framework used in this study are all freely available. This has several benefits. First, it ensures that the study is reproducible. Second, it allows others to assess the performance of their own beat detection or quality assessment algorithms. Third, the framework provides a basis with which to design (using the training datasets) and test such algorithms. Since the training datasets contain a variety of challenges, such as different use cases and populations, we expect that developers will benefit from using this framework for algorithm development. The framework cannot be considered exhaustive, and datasets recorded in additional settings and from further patient populations could be added in the future. These resources and corresponding documentation are archived at Charlton (2022a, 2022b), whilst the most up-to-date versions can be obtained at: https://github.com/peterhcharlton/ppg-beats.
The key limitations are as follows. First, the study was limited to open-source beat detectors, rather than all those reported in the literature (see  for a description of additional beat detectors). Second, no attempt was made to improve the algorithms; rather, this study established the performance of existing algorithms. Third, some datasets were relatively small: WESAD and PPG-DaLiA contain data from 15 subjects each, and the MIMIC PERform AF Dataset contains data from 35 patients. Fourth, the framework assumes that pulse arrival time (PAT) is constant within a subject's recording, which is reasonable for the short recordings in this study, but changes in PAT should be accounted for if using longer recordings (Kotzen et al 2021).

Conclusions
This study demonstrated the high performance of the MSPTD and qppg beat detectors across a range of use cases. Most beat detectors performed well in the absence of movement, whereas performance was poorer during stress, during activities of daily living, in neonates, and during AF. The results inform key directions for future work: (i) improving performance in neonates and during AF; (ii) investigating whether motion artifact cancellation improves performance; and (iii) investigating whether algorithms to assess signal quality can distinguish between periods in which beats can or cannot be accurately detected. The algorithms, datasets, and assessment framework used in this study are all publicly available in Charlton (2022a, 2022b).

Figure A1. Box plots showing the performance of beat detectors, expressed as the sensitivity. Each graph shows the results for each of the beat detectors on a particular dataset. Performance is shown as the median (circles), inter-quartile range (boxes), and 10th and 90th percentiles (whiskers) across subjects. See table 2 in the main text for definitions of beat detectors.

Appendix B. Association between PPG beat detector performance and patient demographics and physiology
Associations between PPG beat detector performance and patient demographics and physiology were presented in figure 6 in the main text, using the F1 score to describe performance. Additional results are shown in figure A3, which shows the sensitivity of the beat detectors, and figure A4, which shows their positive predictive value.

Figure A2. Box plots showing the performance of beat detectors, expressed as the positive predictive value. Each graph shows the results for each of the beat detectors on a particular dataset. Performance is shown as the median (circles), inter-quartile range (boxes), and 10th and 90th percentiles (whiskers) across subjects. See table 2 in the main text for definitions of beat detectors.

Figure A3. Box plots showing the associations between beat detector performance and patient physiology and demographics, expressed as the sensitivity. Each graph shows the results for each of the beat detectors on a particular dataset. Performance is shown as the median (circles), inter-quartile range (boxes), and 10th and 90th percentiles (whiskers) across subjects. See table 2 in the main text for definitions of beat detectors.

Figure A4. Box plots showing the associations between beat detector performance and patient physiology and demographics, expressed as the positive predictive value. Each graph shows the results for each of the beat detectors on a particular dataset. Performance is shown as the median (circles), inter-quartile range (boxes), and 10th and 90th percentiles (whiskers) across subjects. See table 2 in the main text for definitions of beat detectors.