Scene representation networks, or implicit neural representations (INRs), have seen a range of success in numerous image and video applications. However, being universal function fitters, they fit all variations in videos without any selectivity. This is particularly problematic for tasks such as remote plethysmography, the extraction of heart rate information from face videos. Because of the low native signal-to-noise ratio, prior signal processing techniques suffer from poor performance, while prior learning-based methods improve performance but suffer from hallucinations that limit generalizability. Directly applying prior INRs cannot remedy this signal strength deficit, since they fit the signal and the interfering factors alike. In this work, we introduce an INR framework that increases the plethysmograph signal strength. Specifically, we design architectures with selective representation capabilities that decompose a face video into a blood plethysmograph component and a face appearance component. By inferring the plethysmograph signal from the blood component, we show state-of-the-art performance on out-of-distribution samples without sacrificing performance on in-distribution samples. We implement our framework on a custom-built multiresolution hash encoding backbone, enabling practical dataset-scale representations through a 50x speed-up over traditional INRs. We also present a dataset of optically challenging out-of-distribution scenes to test generalization to real-world scenarios.
Sinusoidal Representation Networks (SRNs) are able to represent both the $\mathcal{A}$- and $\mathcal{B}$-functions, while phase-based methods can only represent the $\mathcal{A}$-function. This provides a route to $\mathcal{A}$-$\mathcal{B}$ decomposition: SRNs capture the plethysmograph color variations almost perfectly, whereas phase-based motion representations cannot capture them.
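As a concrete illustration of this representational difference, a minimal SRN/SIREN-style layer in PyTorch is sketched below. This is not the paper's exact architecture; the $\omega_0$ frequency scale and initialization follow the original SIREN formulation and are assumptions here.

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """One SRN-style layer: a linear map followed by a scaled sine activation.

    The sine nonlinearity lets the network represent high-frequency content
    such as subtle plethysmograph color variations; omega_0 = 30 follows the
    common SIREN initialization (an assumption here, not the paper's value).
    """

    def __init__(self, in_features, out_features, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)
        # SIREN-style initialization keeps activations well-distributed.
        with torch.no_grad():
            if is_first:
                bound = 1.0 / in_features
            else:
                bound = (6.0 / in_features) ** 0.5 / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

# A small coordinate network mapping (x, y, t) to an RGB value.
srn = nn.Sequential(
    SineLayer(3, 256, is_first=True),
    SineLayer(256, 256),
    nn.Linear(256, 3),
)
```

Stacking such layers yields a coordinate network whose sine activations can fit the subtle, high-frequency plethysmograph color variations that smoother representations miss.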
To enable fast $\mathcal{A}$-$\mathcal{B}$ decomposition, we use implicit neural representations as decomposing function fitters. Training proceeds sequentially: first, the cascaded appearance model learns the $\mathcal{A}$-function; the appearance model is then frozen and the residual model learns the $\mathcal{B}$-function, completing the decomposition. The use of multiresolution hash encodings makes dataset-scale decomposition viable.
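A minimal sketch of this sequential, two-stage fit is given below, assuming hypothetical coordinate networks `appearance_model` and `residual_model` that map space-time coordinates to RGB and combine additively; the hash-encoding backbone, batching, and loss weighting are omitted.

```python
import torch
import torch.nn.functional as F

def fit_decomposition(appearance_model, residual_model, coords, rgb,
                      steps_a=2000, steps_b=2000, lr=1e-3):
    """Two-stage fit: the appearance model learns the A-function, then is
    frozen while the residual model absorbs what remains (the B-function).

    `appearance_model` and `residual_model` are hypothetical coordinate
    networks (e.g. hash-encoded MLPs) mapping (x, y, t) -> RGB.
    """
    # Stage 1: fit the appearance (A) function to the video.
    opt_a = torch.optim.Adam(appearance_model.parameters(), lr=lr)
    for _ in range(steps_a):
        loss = F.mse_loss(appearance_model(coords), rgb)
        opt_a.zero_grad()
        loss.backward()
        opt_a.step()

    # Stage 2: freeze A and fit the residual (B) function to what A missed.
    for p in appearance_model.parameters():
        p.requires_grad_(False)
    opt_b = torch.optim.Adam(residual_model.parameters(), lr=lr)
    for _ in range(steps_b):
        with torch.no_grad():
            a_pred = appearance_model(coords)
        loss = F.mse_loss(a_pred + residual_model(coords), rgb)
        opt_b.zero_grad()
        loss.backward()
        opt_b.step()

    return appearance_model, residual_model
```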
Using the implicitly decomposed $\mathcal{B}$-function along with the original video, we learn high-fidelity neural signal strength masks. The mask network takes the original RGB frames and the $\mathcal{B}$-function estimate as inputs and returns a spatial strength mask. Training is supervised through an auxiliary 1-D CNN whose training target is an accurate plethysmograph prediction. The 1-D CNN is discarded after training; at inference time, the learned mask model generates weight masks used to estimate the plethysmograph signal from the $\mathcal{B}$-function.
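To make the inference-time use of the mask concrete, the sketch below shows how a learned spatial strength mask can turn the per-pixel $\mathcal{B}$-function estimate into a single 1-D plethysmograph trace via a weighted spatial average. The `mask_net` interface and the per-frame softmax normalization are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def masked_plethysmograph(mask_net, frames, b_est):
    """Estimate a 1-D plethysmograph from the B-function using a learned mask.

    frames:   (T, 3, H, W) original RGB frames
    b_est:    (T, H, W) per-pixel B-function (blood component) estimate
    mask_net: hypothetical network taking RGB + B-estimate and returning a
              per-pixel strength mask of shape (T, H, W).
    """
    # Concatenate RGB frames and the B-function estimate as network input.
    x = torch.cat([frames, b_est.unsqueeze(1)], dim=1)   # (T, 4, H, W)
    # Normalize the strength mask into per-frame weights (assumed softmax).
    mask = torch.softmax(mask_net(x).flatten(1), dim=1)  # (T, H*W)
    b_flat = b_est.flatten(1)                            # (T, H*W)
    # Weighted spatial average gives one plethysmograph sample per frame.
    signal = (mask * b_flat).sum(dim=1)                  # (T,)
    return signal
```

During training, this per-frame signal would be fed to the auxiliary 1-D CNN for supervision; at inference, the weighted average alone yields the plethysmograph estimate.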
Across challenging out-of-distribution (OOD) optical settings, the proposed method captures details of the plethysmograph waveform that prior methods do not.
@inproceedings{chari2024implicit,
  title={Implicit Neural Models to Extract Heart Rate from Video},
  author={Chari, Pradyumna and Harish, Anirudh Bindiganavale and Armouti, Adnan and Vilesov, Alexander and Sarda, Sanjit and Jalilian, Laleh and Kadambi, Achuta},
  booktitle={European Conference on Computer Vision},
  year={2024},
  organization={Springer}
}