We address the problem of estimating the visual focus of attention (VFOA), e.g., who is looking at whom. This is of particular interest in human-robot interaction scenarios, for instance when the task requires identifying targets of interest over time. The paper makes the following contributions. We propose a Bayesian temporal model that connects VFOA to gaze direction and to head pose. Model inference is then cast into a switching Kalman filter formulation, which makes it tractable. The model parameters are estimated from manually annotated training data. The method is tested and benchmarked on a publicly available dataset. We show that both the gaze and the VFOA of several persons can be reliably and simultaneously estimated over time from observed head poses as well as from people and object locations. On average, our method compares favorably with two other methods.
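
To illustrate what a switching Kalman filter formulation involves, the following minimal sketch runs one filtering step on a toy one-dimensional problem. It is not the model proposed in this paper. It assumes a scalar gaze angle that drifts toward the target selected by a discrete VFOA variable, a noisy scalar observation standing in for head pose, and a moment-matching collapse of the Gaussian mixture after each step (a common approximation for switching linear models). The function name `skf_step` and the parameters `alpha`, `q`, `r`, `targets` and `trans` are hypothetical and chosen only for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a scalar Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def skf_step(mean, var, mode_probs, obs, targets, trans, alpha=0.5, q=0.01, r=0.05):
    """One approximate switching-Kalman-filter step on a scalar state.

    Assumed toy model (hypothetical, not the paper's):
      gaze_t = gaze_{t-1} + alpha * (targets[k] - gaze_{t-1}) + N(0, q)
      obs_t  = gaze_t + N(0, r)
    where k is the discrete VFOA index and trans[i][j] = P(k_t = j | k_{t-1} = i).
    """
    n_modes = len(targets)
    # Predicted probability of each discrete mode before seeing the observation.
    prior = [sum(trans[i][j] * mode_probs[i] for i in range(n_modes))
             for j in range(n_modes)]

    post_means, post_vars, weights = [], [], []
    for j in range(n_modes):
        # Kalman prediction under mode j: the gaze drifts toward target j.
        m_pred = mean + alpha * (targets[j] - mean)
        v_pred = (1.0 - alpha) ** 2 * var + q
        # Kalman update with the scalar observation.
        s = v_pred + r                      # innovation variance
        gain = v_pred / s
        post_means.append(m_pred + gain * (obs - m_pred))
        post_vars.append((1.0 - gain) * v_pred)
        # Unnormalized mode posterior: mode prior times innovation likelihood.
        weights.append(prior[j] * gaussian_pdf(obs, m_pred, s))

    total = sum(weights)
    probs = [w / total for w in weights]

    # Collapse the mixture to a single Gaussian by moment matching.
    new_mean = sum(p * m for p, m in zip(probs, post_means))
    new_var = sum(p * (v + (m - new_mean) ** 2)
                  for p, m, v in zip(probs, post_means, post_vars))
    return new_mean, new_var, probs


def run_filter(observations, targets, trans):
    """Filter a sequence, starting from a broad prior and uniform mode probabilities."""
    mean, var = 0.0, 1.0
    probs = [1.0 / len(targets)] * len(targets)
    for obs in observations:
        mean, var, probs = skf_step(mean, var, probs, obs, targets, trans)
    return mean, var, probs


if __name__ == "__main__":
    # Two candidate targets at angles -1 and +1; observations cluster near +1.
    result = run_filter([0.9, 1.0, 1.1, 1.0],
                        targets=[-1.0, 1.0],
                        trans=[[0.9, 0.1], [0.1, 0.9]])
    print(result)
```

The collapse step keeps the cost per time step linear in the number of modes, at the price of approximating the exact posterior, whose number of mixture components would otherwise grow exponentially with time.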