User Simulation for Conversational Information Access
An annotated bibliography of papers (2020-)
This page is maintained by Nolwenn Bernard and Krisztian Balog. We welcome suggestions via email.
- Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions, Abbasiantaeb et al., WSDM 2024
- TL;DR: Proposes a simulation framework involving two large language models (LLMs) acting as a questioner and an answerer engaged in a conversation. The main objectives are to investigate the effectiveness of LLMs in simulating question-answering conversations and to compare the generated conversations against human-human conversations with regard to various characteristics. The analysis shows that the LLMs tend to generate longer questions and answers than humans, and that these provide better coverage of the topic in focus.
- Identifying Breakdowns in Conversational Recommender Systems using User Simulation, Bernard and Balog, CUI 2024
- TL;DR: Proposes a methodology to systematically test conversational recommender systems with regard to conversational breakdowns. It consists of analyzing conversations generated between the system and a user simulator to identify pre-defined types of breakdowns. A case study demonstrates that the methodology can be applied to make an existing conversational recommender system more robust to conversation breakdowns.
- Towards a Formal Characterization of User Simulation Objectives in Conversational Information Access, Bernard and Balog, ICTIR 2024
- TL;DR: Formally characterizes the distinct objectives of user simulators: (1) training aims to maximize behavioral similarity to real users and (2) evaluation focuses on the accurate prediction of real-world conversational agent performance. An empirical study shows that optimizing for one objective does not necessarily lead to improved performance on the other. This finding highlights the need for distinct design considerations during the development of user simulators.
- An Evaluation Framework for Conversational Information Retrieval Using User Simulation, Fu et al., UM-CIR 2024
- TL;DR: Proposes a new user simulator prototype to perform simulation-based evaluation of conversational information retrieval (CIR) systems. The user simulator comprises two modules: (1) an action predictor and (2) a response generator. The action predictor determines the next action based on the context; the available actions depend on the dataset used. The response generator uses the conversational context, user profile, and previously predicted action to output a realistic and personalized response. Success is assessed through a study of stopping strategies. A minimal sketch of this two-module design is given below.
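A minimal sketch of such a two-module simulator, in Python. The class and function names, the action heuristic, and the `call_llm` stub are illustrative assumptions, not the authors' implementation.

```python
# Illustrative two-module user simulator: an action predictor and a response
# generator. All names here are assumptions, not the authors' code.
from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; replace with a real client."""
    return "stub response"


@dataclass
class SimulatedUser:
    profile: str
    history: list[str] = field(default_factory=list)

    def predict_action(self, system_utterance: str) -> str:
        # Action predictor: chooses the next user action from the context.
        # The available actions are dataset-dependent; this heuristic is a stand-in.
        if "did you mean" in system_utterance.lower():
            return "clarify"
        return "ask" if len(self.history) < 6 else "stop"

    def generate_response(self, system_utterance: str, action: str) -> str:
        # Response generator: conditions on context, user profile, and the
        # predicted action to produce a personalized natural-language reply.
        prompt = (
            f"User profile: {self.profile}\n"
            f"Conversation so far: {self.history}\n"
            f"System: {system_utterance}\n"
            f"Respond as the user, performing the action '{action}'."
        )
        reply = call_llm(prompt)
        self.history.extend([system_utterance, reply])
        return reply
```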
- Concept - An Evaluation Protocol on Conversation Recommender Systems with System- and User-centric Factors, Huang et al., arXiv 2024
- TL;DR: Proposes a new evaluation protocol for conversational recommender systems that considers both system- and user-centric factors influencing user experience and engagement. The protocol identifies and defines six abilities, related to three factors, along with corresponding metrics. Some metrics are computed based on scores given by a large language model. The authors apply the protocol to evaluate off-the-shelf conversational recommender systems and demonstrate its comprehensiveness.
- What Else Would I Like? A User Simulator using Alternatives for Improved Evaluation of Fashion Conversational Recommendation Systems, Vlachou and Macdonald, arXiv 2024
- TL;DR: Proposes a meta user simulator that can provide knowledge about alternative targets in the context of conversational recommendation in fashion. Based on a patience parameter, the target item is replaced by the closest alternative (i.e., the one with the highest visual similarity); see the sketch below. The experiments show that this leads to shorter conversations, as users are inclined to change their minds and accept an alternative target; another positive consequence is an improved success rate of the conversational recommender system.
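A minimal sketch of the patience-driven switch to an alternative target. The function name, the similarity scores, and the patience threshold are illustrative assumptions, not the paper's exact procedure.

```python
# Patience-driven switch to an alternative target. Similarity scores and the
# patience threshold below are made-up illustrative values.
def maybe_switch_target(target: str,
                        alternatives: dict[str, float],
                        turns_elapsed: int,
                        patience: int) -> str:
    """Return the current target; once patience is exhausted, switch to the
    alternative with the highest visual similarity to the original target."""
    if turns_elapsed < patience or not alternatives:
        return target
    return max(alternatives, key=alternatives.get)


# After 4 turns with patience 3, the simulated user accepts the closest alternative.
print(maybe_switch_target("red dress A", {"red dress B": 0.91, "pink dress C": 0.73}, 4, 3))
```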
- Evaluating Large Language Models as Generative User Simulators for Conversational Recommendation, Yoon et al., NAACL 2024
- TL;DR: Proposes a new protocol to evaluate LLM-based user simulators for the conversational recommendation scenario. The protocol comprises five evaluation tasks: choosing items to talk about, expressing binary preferences, expressing open-ended preferences, requesting recommendations, and giving feedback. The objective of these tasks is to uncover distortions between simulator and human behavior. The experiments show that LLM-based simulators differ from humans in several ways, such as low diversity in the items discussed, low correlation in how preferences are represented/expressed, a lack of personalization, and incoherent feedback.
- How Reliable is Your Simulator? Analysis on the Limitations of Current LLM-based User Simulators for Conversational Recommendation, Zhu et al., arXiv 2024
- TL;DR: Performs an analysis of the limitations of an LLM-based user simulator, iEvaLM, and proposes a new user simulator to mitigate the identified limitations. The analysis of iEvaLM on two datasets reveals that (1) it leads to data leakage, which inflates performance, (2) the simulated responses are not the main factor behind successful recommendations, and (3) controlling the simulated responses via a single prompt is difficult. To address these limitations, the authors propose SimpleUserSim, which does not know the name of the target item during the conversation and uses a different prompt for each possible user action (see the sketch below). Using the same experimental setting, they show that SimpleUserSim is less prone to data leakage and produces more impactful responses.
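A minimal sketch of the per-action prompting idea behind SimpleUserSim: the simulator only sees the target item's attributes, never its name, and each user action has its own prompt template. The template wording and action names are assumptions, not taken from the paper.

```python
# Per-action prompt templates; the simulator only sees target *attributes*,
# never the target item's name. Wording is illustrative.
PROMPTS = {
    "express_preference": (
        "You are a user looking for an item with these attributes: {attributes}.\n"
        "Describe what you want without naming any specific item.\nSystem: {system}"
    ),
    "give_feedback": (
        "You are a user whose target has these attributes: {attributes}.\n"
        "Tell the system whether its recommendation matches them.\nSystem: {system}"
    ),
}


def build_prompt(action: str, target_attributes: str, system_utterance: str) -> str:
    # Exposing only attributes (not the item name) is what limits data leakage here.
    return PROMPTS[action].format(attributes=target_attributes, system=system_utterance)


print(build_prompt("give_feedback", "sci-fi movie from the 1980s",
                   "How about 'Blade Runner'?"))
```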
- A LLM-based Controllable, Scalable, Human-Involved User Simulator Framework for Conversational Recommender Systems, Zhu et al., arXiv 2024
- TL;DR: Introduces an LLM-based user simulator that is controllable, scalable, and allows human intervention during user profile creation. The dialogue generation process is divided into stages: profile initialization, preference initialization, and message handling. The user behavior at each stage is controlled with specific plugins, such as user preference summarization and intent understanding. These plugins can easily be modified, extended, or replaced to change the behavior of the user simulator. The authors perform experiments in two scenarios, with and without human annotations in the dataset, to showcase the adaptability of the user simulator and its ability to effectively simulate user preferences.
- User Simulation with Large Language Models for Evaluating Task-Oriented Dialogue, Davidson et al., arXiv 2023
- TL;DR: Proposes a user simulator built with an LLM using in-context learning instead of fine-tuning. The main objective is to generate linguistically diverse and human-like utterances. The use of goal success rate as a metric to evaluate user simulators is criticized, as humans tend to behave non-optimally.
- Unlocking the Potential of User Feedback: Leveraging Large Language Model as User Simulators to Enhance Dialogue System, Hu et al., CIKM 2023
- TL;DR: Proposes a new optimization approach that leverages simulated user satisfaction from a large language model to enhance task-oriented dialogue systems. It integrates simulated user satisfaction into the reward function of proximal policy optimization, which is used to optimize a fine-tuned task-oriented dialogue system; a sketch of this reward shaping is given below. Empirical experiments with a fine-tuned Flan-T5 (dialogue system) and ChatGPT (user simulator) on two benchmark datasets show the potential of the proposed approach when user satisfaction annotations are not available.
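A minimal sketch of the reward-shaping idea, assuming a simple weighted mix of task success and simulated satisfaction; the weighting scheme and the `rate_satisfaction` stub are assumptions, not the paper's exact formulation.

```python
# Reward shaping with simulated satisfaction, assuming a simple weighted mix.
def rate_satisfaction(dialogue: list[str]) -> float:
    """Placeholder for prompting an LLM (e.g., ChatGPT) to rate user
    satisfaction on the dialogue so far, normalized to [0, 1]."""
    return 0.8


def shaped_reward(task_success: float, dialogue: list[str], weight: float = 0.5) -> float:
    # Mix the usual task-completion reward with simulated satisfaction, so the
    # policy can be optimized even without human satisfaction annotations.
    return (1 - weight) * task_success + weight * rate_satisfaction(dialogue)


print(shaped_reward(1.0, ["User: Book a table for two.", "System: Done, 7pm tonight."]))
```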
- One Cannot Stand for Everyone! Leveraging Multiple User Simulators to train Task-oriented Dialogue Systems, Liu et al., ACL 2023
- TL;DR: Proposes to train task-oriented dialogue systems using multiple user simulators. The problem is framed as a multi-armed bandit where each arm corresponds to one user simulator (see the sketch below), which allows balancing how much each simulator is used during optimization and helps tackle catastrophic forgetting. The experimental results show improved performance compared to baseline agents trained with a single user simulator in a single-domain scenario; the agents trained with the framework are also more robust to unseen domains. While the results are promising, the authors note that experiments in a multi-domain scenario are still needed.
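A minimal sketch of treating simulator selection as a multi-armed bandit, using epsilon-greedy as an illustrative strategy; the paper's actual bandit algorithm, simulator names, and reward definition may differ.

```python
# Simulator selection as a multi-armed bandit; each arm is one user simulator.
# Epsilon-greedy is an illustrative choice, not necessarily the paper's algorithm.
import random


class SimulatorBandit:
    def __init__(self, simulator_names: list[str], epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = {name: 0 for name in simulator_names}
        self.values = {name: 0.0 for name in simulator_names}

    def select(self) -> str:
        # Explore with probability epsilon, otherwise pick the simulator with
        # the highest average training reward so far.
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))
        return max(self.values, key=self.values.get)

    def update(self, name: str, reward: float) -> None:
        # Incremental average of the reward obtained when training against `name`.
        self.counts[name] += 1
        self.values[name] += (reward - self.values[name]) / self.counts[name]


bandit = SimulatorBandit(["agenda_based", "neural", "llm_prompted"])
arm = bandit.select()
bandit.update(arm, reward=0.6)  # e.g., dialogue success from one training episode
```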
- Exploiting Simulated User Feedback for Conversational Search: Ranking, Rewriting, and Beyond, Owoicho et al., SIGIR 2023
- TL;DR: Proposes a user simulator able to answer clarifying questions and give direct feedback to a conversational search system. The simulator is initialized with a given information need and can interact with conversational search systems over multiple turns while using natural language and staying coherent. It also integrates the notion of patience and stops a conversation when patience runs out. Crowd workers assess the naturalness and usefulness of the simulator's generated answers in experiments on the TREC CAsT dataset. The experiments show the benefits of using simulated user feedback to improve conversational search systems.
- Metaphorical User Simulators for Evaluating Task-oriented Dialogue Systems, Sun et al., arXiv 2023
- TL;DR: Presents a metaphorical user simulator, MetaSim, that uses historical conversation strategies as metaphors for the current conversation. This improves the simulator's dialogue reasoning, its generalization to new domains, and its realism. The authors also propose a tester-based evaluation framework for user simulators and task-oriented dialogue systems; a manual evaluation shows that it is a promising solution for automatic evaluation.
- In-Context Learning User Simulators for Task-Oriented Dialog Systems, Terragni et al., arXiv 2023
- TL;DR: Proposes an approach to build an in-context learning user simulator using an LLM. The user simulator is given a prompt comprising the task description, example dialogues, user goal, and dialogue history to generate responses (see the sketch below). It also comprises an evaluation component that tracks goal completion and assesses the system's actions. The experiments show that the in-context learning abilities of LLMs are valuable for generating diverse dialogues (exploring many dialogue paths) but suffer from limitations such as unpredictability and hallucinations.
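A minimal sketch of assembling the in-context learning prompt from the four components listed above. The exact prompt layout, wording, and example content are assumptions, not the paper's template.

```python
# Assembling the in-context learning prompt from its four components.
def build_icl_prompt(task_description: str,
                     example_dialogues: list[str],
                     user_goal: str,
                     dialogue_history: list[str]) -> str:
    examples = "\n\n".join(example_dialogues)
    history = "\n".join(dialogue_history)
    return (
        f"{task_description}\n\n"
        f"Example dialogues:\n{examples}\n\n"
        f"Your goal: {user_goal}\n\n"
        f"Current dialogue:\n{history}\nUser:"
    )


print(build_icl_prompt(
    "You are a user talking to a restaurant booking assistant.",
    ["User: I need a cheap Italian place.\nSystem: Pizza Express City Centre matches."],
    "Book a table for 4 at a moderately priced Thai restaurant on Friday.",
    ["System: Hello, how can I help you?"],
))
```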
- User Behavior Simulation with Large Language Model based Agents, Wang et al., arXiv 2023
- TL;DR: Introduces a simulation environment where agents can interact with a recommender system, other agents, and "social media". An agent is based on an LLM (specifically ChatGPT) and comprises three modules: profile, memory (inspired by cognitive neuroscience), and action. Two main questions need to be considered when leveraging LLMs: (1) what behavior to simulate and (2) how to design prompts.
- Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models, Wang et al., EMNLP 2023
- TL;DR: Proposes an interactive evaluation approach, iEvaLM, using LLM-based user simulators. The approach is validated through experiments on two public datasets (ReDial and OpenDialKG). The simulated user is given a persona based on preferences derived from ground-truth items, and the allowed behaviors, i.e., talking about a preference, providing feedback, and completing the conversation, are defined in the prompt. The evaluation considers two types of metrics: objective (recall) and subjective (persuasiveness, scored using an LLM). The authors also mention some limitations of this approach, mostly related to the LLM.
- A Multi-Task Based Neural Model to Simulate Users in Goal Oriented Dialogue Systems, Kim and Lipani, SIGIR 2022
- TL;DR: Proposes a user simulator, based on a generative model, that predicts users' satisfaction scores, actions, and utterances in a multi-task learning setting. The authors perform an ablation study showing that the three tasks help each other to better simulate users. Note that the proposed user simulator does not represent users' knowledge or mental state.