Task manipulation is direct evidence of understanding, and speakers adjust their utterances in progress by monitoring the listener’s task manipulation. With the aim of developing animated agents that control multimodal instruction dialogues by monitoring users’ task manipulation, this paper presents a probabilistic model of fine-grained timing dependencies among multimodal communication behaviors. Our preliminary evaluation showed that the model quite accurately judges whether the user understands the agent’s utterances and predicts the user’s successful mouse manipulation, suggesting that the model is useful for estimating the user’s understanding and can be applied to determining the agent’s next action.