This paper discusses a mobile robot that follows a person. It focuses on a less-researched aspect of this task: interacting with the human's attitude through the robot's movements. A reward that indicates the human's attitude is used to train a network so that the robot learns an appropriate position relative to the person. The algorithm presented in this study overcomes the difficulty that the feedback reward given by the human has no gradient over large parts of the input space. The network operates online and can adapt to unpredictable changes in the person's preference.