Neural Representations of Dialogical History for Improving Upcoming Turn Acoustic Parameters Prediction
- Acoustic features
Predicting the acoustic and linguistic parameters of an upcoming conversational turn is important for dialogue systems that aim to adapt to the user at a low level. It is known that speakers can influence each other's speech production during an interaction. However, the precise dynamics of this phenomenon are not well established, especially in natural conversations. We developed a model based on an RNN architecture that predicts speech variables (energy, F0 range, and speech rate) of the upcoming turn from a representation vector describing the speech information of previous turns. We compare prediction performance when using a dialogical history (from both participants) versus a monological history (from the upcoming turn's speaker only). We found that the information contained in previous turns produced by both the speaker and the interlocutor reduces the error in predicting the current acoustic target variable. In addition, the prediction error decreases as the number of previous turns taken into account increases.
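To make the distinction between the two history types concrete, the following is a minimal sketch of how the input window for one target turn could be assembled. It is an illustration only, not the model described here: the function name `build_history`, the zero-padding of short histories, the per-turn feature layout (energy, F0 range, speech rate), and the toy data are all assumptions introduced for this example.

```python
import numpy as np

def build_history(features, speakers, target_idx, n_prev, mode="dialogical"):
    """Collect the turns preceding `target_idx` as an (n_prev, n_features) array.

    Hypothetical helper, not taken from the paper.
    mode="dialogical":  the last n_prev turns from both participants.
    mode="monological": the last n_prev turns by the target turn's speaker only.
    Shorter histories are left-padded with zeros (an assumption of this sketch).
    """
    if n_prev < 1:
        raise ValueError("n_prev must be at least 1")
    features = np.asarray(features, dtype=float)

    if mode == "dialogical":
        idx = list(range(target_idx))
    elif mode == "monological":
        idx = [i for i in range(target_idx) if speakers[i] == speakers[target_idx]]
    else:
        raise ValueError(f"unknown mode: {mode}")

    idx = idx[-n_prev:]  # keep only the most recent n_prev eligible turns
    history = np.zeros((n_prev, features.shape[1]))
    if idx:
        history[n_prev - len(idx):] = features[idx]
    return history

# Toy conversation: 5 turns, 3 values per turn (energy, F0 range, speech rate).
toy_features = np.arange(15).reshape(5, 3)
toy_speakers = ["A", "B", "A", "B", "A"]

dialogical = build_history(toy_features, toy_speakers, target_idx=4, n_prev=2)
monological = build_history(toy_features, toy_speakers, target_idx=4, n_prev=2,
                            mode="monological")
```

In this toy case the dialogical window holds turns 2 and 3 (one from each participant), while the monological window holds turns 0 and 2 (both from speaker A). Either array could then be fed to a sequence model such as an RNN, and varying `n_prev` corresponds to changing the number of previous turns taken into account.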