CSL 2025
Authors
Algherairy and Ahmed

Prompting large language models for user simulation in task-oriented dialogue systems

TL;DR

Investigates the abilities of LLM-based user simulators to train the dialogue policy of task-oriented dialogue (TOD) systems in a reinforcement learning context. The experiments consist of training a policy with an LLM-based user simulator and evaluating it with other LLM-based simulators to study generalisation abilities. The results show that the ChatGPT-based simulator tends to perform better than the Llama-based simulator, which exhibits limitations when dealing with complex goals.

task-oriented, training
WWW 2025
Authors
Chen et al.

RecUserSim: A Realistic and Diverse User Simulator for Evaluating Conversational Recommender Systems

TL;DR

Proposes RecUserSim, an LLM-based user simulator comprising profile, memory, action, and response refinement modules. The profile module builds fine-grained and diverse user personas, the memory module tracks history and preferences to ensure consistency, the action module is responsible for decision making (inspired by bounded rationality theory), and the refinement module aligns generated responses with the user persona. On a food recommendation task, RecUserSim outperforms other LLM-based simulators and aligns with human evaluators.

recommendation, evaluation
EMNLP 2025
Authors
Gromada et al.

Evaluating Conversational Agents with Persona-driven User Simulations based on Large Language Models: A Sales Bot Case Study

TL;DR

Proposes an evaluation framework relying on a single LLM-based user simulator to evaluate conversational agents in sales scenarios. The simulator's prompt includes an LLM-generated persona (covering demographics, technology profile, and relationship with the company), an objective, and a scenario based on the precision of needs and assertiveness. The generated personas are manually verified to ensure coherence, diversity, and realism for the use case. The authors evaluate the simulator on persona, objective, and scenario compliance using both automatic (LLM-based) and human evaluation. The results show that the simulator performs well across all dimensions (scores predominantly above 4.5/5), indicating its effectiveness at following its assigned role and objectives.

task-oriented, evaluation
EMNLP 2025
Authors
Kim et al.

Stop Playing the Guessing Game! Target-free User Simulation for Evaluating Conversational Recommender Systems

TL;DR

Proposes PEPPER, a new evaluation protocol for conversational recommender systems that uses a simulator relying on general preferences extracted from real user data (interactions with items and reviews) rather than a target item to find. The protocol also includes measures to assess the preference elicitation capabilities of the system. Experiments comparing target-biased and target-free user simulators show the effectiveness of PEPPER and empirically highlight limitations of traditional evaluation protocols and metrics.

recommendation, evaluation
TASPL 2025
Authors
Luo et al.

Utterance Alignment of Language Models for Effective User Simulation in Task-Oriented Dialogues

TL;DR

Introduces AlignUS, a simulator for task-oriented dialogue systems that combines a small and a large language model. The small language model is trained to generate dialogue actions and the corresponding natural language utterances, while the large language model validates the generated dialogue actions and refines/enriches the natural language. A post-training step using the refined and enriched utterances aligns the small language model and reduces invocations of the LLM. The experiments show that AlignUS outperforms other LLM-based user simulators on the MultiWOZ benchmark, while producing diverse and rich natural language.

task-oriented
arXiv 2025
Authors
Mehri et al.

Goal Alignment in LLM-Based User Simulators for Conversational AI

TL;DR

Introduces the User Goal State Tracking (UGST) framework (inspired by dialogue state tracking) to support the development of LLM-based user simulators capable of generating goal-aligned and personalized utterances during multi-turn task-oriented conversations. The user goal state comprises distinct, modular sub-components, each representing an independent, self-contained aspect of the overall goal (e.g., requirements, preferences, and profile). During the conversation, the completion of these sub-components is assessed using an LLM. The methodology to develop an LLM-based user simulator leveraging UGST comprises three steps: (1) generation of synthetic conversations with goal tracking information, (2) supervised fine-tuning of the LLM-based simulator with the synthetic data (to learn to track the goal implicitly), and (3) RL-based optimization of the simulator using reward signals from UGST. Experiments show improvements in goal alignment over traditional LLM-based user simulators.
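The modular goal state described above can be sketched as a small data structure; this is an illustrative reconstruction, not the paper's implementation, and the `judge` callable stands in for the LLM that assesses sub-component completion:

```python
from dataclasses import dataclass, field

@dataclass
class GoalSubComponent:
    # One independent, self-contained aspect of the overall goal,
    # e.g., requirements, preferences, or profile.
    name: str
    description: str
    completed: bool = False

@dataclass
class UserGoalState:
    sub_components: list = field(default_factory=list)

    def update(self, judge):
        # `judge` is a placeholder for the LLM-based completion check;
        # each sub-component is assessed independently.
        for sc in self.sub_components:
            if not sc.completed:
                sc.completed = judge(sc)

    def progress(self) -> float:
        # Fraction of sub-components completed so far; usable as an
        # RL reward signal as in step (3) of the methodology.
        done = sum(sc.completed for sc in self.sub_components)
        return done / len(self.sub_components)
```

A per-sub-component progress score like this is what makes the reward signal in the RL step dense rather than a single end-of-dialogue success flag.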

task-oriented
SIGIR 2025
Authors
Wang et al.

Search-Based Interaction For Conversation Recommendation via Generative Reward Model Based Simulated User

TL;DR

Proposes a user simulator based on a generative reward model for conversational recommendation. The simulator provides two types of feedback: (1) item critiques based on item attributes and (2) item scoring using the probability of the yes or no token. This feedback helps the conversational recommender system refine its recommendations. Training the simulator consists of fine-tuning a large language model on instruction data built from descriptions of the feedback actions. At inference time, the recommender component of the CRS can leverage the simulated user interactions to improve its recommendations.
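The yes/no-token scoring idea can be sketched as follows; this is a hedged illustration under the assumption that the score is the probability of "yes" renormalized over the two tokens, with the log-probabilities supplied by the fine-tuned LLM (here they are just inputs):

```python
import math

def item_score(logprob_yes: float, logprob_no: float) -> float:
    """Score an item as P(yes | {yes, no}) from the next-token
    log-probabilities of the "yes" and "no" tokens."""
    p_yes = math.exp(logprob_yes)
    p_no = math.exp(logprob_no)
    # Renormalize over just the two candidate tokens so the score
    # lies in (0, 1) regardless of the rest of the vocabulary.
    return p_yes / (p_yes + p_no)
```

Restricting the normalization to the two tokens makes the score comparable across items even when the model spreads mass over other tokens.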

recommendation
arXiv 2025
Authors
Zhao et al.

Exploring the Impact of Personality Traits on Conversational Recommender Systems: A Simulation with Large Language Models

TL;DR

Introduces an LLM-based personality-aware user simulation framework to analyze the influence of personality traits on conversational recommendation outcomes. It uses in-context learning to configure the CRS and the personality traits of the simulated user, based on the Big Five personality traits. Experiments show that simulation with current state-of-the-art LLMs leads to simulated utterances aligned with the given personality traits and to different user behaviors based on these traits, thereby impacting the conversation outcomes.

recommendation, evaluation
WSDM 2024
Authors
Abbasiantaeb et al.

Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions

TL;DR

Proposes a simulation framework involving two large language models (LLMs) acting as a questioner and an answerer engaged in a conversation. The main objectives are to investigate the effectiveness of LLMs in simulating question-answering conversations and to compare the generated conversations against human-human conversations with regard to various characteristics. The analysis shows that the LLMs tend to generate longer questions and answers than humans, and that these provide better coverage of the topic in focus.

qa, generation
CUI 2024
Authors
Bernard and Balog

Identifying Breakdowns in Conversational Recommender Systems using User Simulation

TL;DR

Proposes a methodology to systematically test conversational recommender systems with regard to conversational breakdowns. It consists of analyzing conversations generated between the system and a user simulator to identify pre-defined types of breakdowns. A case study demonstrates that the methodology can be applied to make an existing conversational recommender system more robust to conversational breakdowns.

recommendation, evaluation
ICTIR 2024
Authors
Bernard and Balog

Towards a Formal Characterization of User Simulation Objectives in Conversational Information Access

TL;DR

Formally characterizes the distinct objectives of user simulators: (1) training aims to maximize behavioral similarity to real users and (2) evaluation focuses on the accurate prediction of real-world conversational agent performance. An empirical study shows that optimizing for one objective does not necessarily lead to improved performance on the other. This finding highlights the need for distinct design considerations during the development of user simulators.

cia, training, evaluation
EMNLP 2024
Authors
Ferreira et al.

Multi-trait User Simulation with Adaptive Decoding for Conversational Task Assistants

TL;DR

Proposes a new method to create adaptable multi-trait simulators. Multi-trait adaptive decoding makes it possible to combine multiple user traits to generate utterances reflective of these traits, e.g., low fluency and high cooperativeness. The experiments, performed on conversations synthesized from the Alexa Prize Taskbot Challenge 1, show that training a trait-specific language model is better than jointly learning all the traits in one language model. It also facilitates the addition of new traits to the simulator.

task-oriented
UM-CIR 2024
Authors
Fu et al.

An Evaluation Framework for Conversational Information Retrieval Using User Simulation

TL;DR

Proposes a new user simulator prototype to perform simulation-based evaluation of CIR systems. The user simulator comprises two modules: (1) an action predictor and (2) a response generator. The action predictor determines the next action based on the context; the available actions depend on the dataset used. The response generator uses the conversational context, user profile, and previously predicted action to output a realistic and personalised response. The assessment of success is based on the study of stopping strategies.

search, evaluation
arXiv 2024
Authors
Huang et al.

Concept – An Evaluation Protocol on Conversation Recommender Systems with System- and User-centric Factors

TL;DR

Proposes a new evaluation protocol for conversational recommendation systems that considers both system- and user-centric factors that influence user experience and engagement. The protocol identifies and defines six abilities that relate to three factors, along with corresponding metrics. Some metrics are computed from scores given by a large language model. The authors apply the protocol to evaluate off-the-shelf conversational recommender systems and demonstrate its comprehensiveness.

recommendation, evaluation
ECIR 2024
Authors
Kiesel et al.

Simulating Follow-Up Questions in Conversational Search

TL;DR

Investigates the abilities of large language models to simulate follow-up questions in a conversational search context. Experiments with GPT-4, Llama 2, and Alpaca models (fine-tuned or not) show that the different models can simulate follow-up questions. Specifically, automatic evaluation shows that simulated questions are semantically similar to human questions, while human evaluation finds that the synthetic questions are valid. Moreover, results on using simple prompt modifications to simulate persona-based questions are inconclusive, indicating that more advanced approaches might be better suited.

search, generation
LREC-COLING 2024
Authors
Luo et al.

DuetSim: Building User Simulator with Dual Large Language Models for Task-Oriented Dialogues

TL;DR

Introduces DuetSim, a simulator that employs two large language models (LLMs) in tandem to interact with task-oriented dialogue systems. The first LLM generates utterances following a chain-of-thought approach, while the second verifies them. Utterance generation is an iterative process: the generator (first LLM) produces dialogue acts, which the verifier (second LLM) checks for potential errors and misalignments with the requirements; upon approval of the dialogue acts, the natural language utterance is generated. Experiments on the MultiWOZ benchmark, using both automatic and human evaluation, show that DuetSim enhances the quality and correctness of the utterances.

task-oriented
SIGIR-AP 2024
Authors
Sekulić et al.

Simulating Conversational Search Users with Parameterized Behavior

TL;DR

Proposes a parametrised LLM-based user simulator that incorporates behavioural traits (i.e., patience, cooperativeness, and assertiveness) to evaluate conversational search systems. At each turn, a Bernoulli variable is sampled to determine which specific trait to exhibit in the next utterance; the description of that trait is then used to build the next utterance-generation prompt. The experiments show variation in the performance of a system when interacting with simulated users with different behavioural traits.
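The per-turn trait sampling can be sketched as below; this is an illustrative reconstruction, with one Bernoulli draw per trait and hypothetical trait descriptions (the paper's actual prompt wording is not reproduced here):

```python
import random

# Hypothetical trait descriptions used to condition the prompt.
TRAIT_DESCRIPTIONS = {
    "patience": "You are willing to continue the conversation.",
    "cooperativeness": "You answer the system's questions helpfully.",
    "assertiveness": "You state your needs directly.",
}

def sample_traits(probs: dict, rng: random.Random) -> list:
    """One Bernoulli draw per trait: a trait is exhibited this turn
    with its configured probability."""
    return [t for t, p in probs.items() if rng.random() < p]

def build_prompt(active_traits: list) -> str:
    """Assemble the next utterance-generation prompt from the
    descriptions of the sampled traits."""
    lines = [TRAIT_DESCRIPTIONS[t] for t in active_traits]
    return "\n".join(lines + ["Write the user's next utterance."])
```

Resampling every turn, rather than fixing traits per dialogue, is what lets the simulated user's behaviour vary within a single conversation.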

search, evaluation
TIST 2024
Authors
Sekulić et al.

Analysing Utterances in LLM-based User Simulation for Conversational Search

TL;DR

Provides an analysis of LLM-based utterance generation in the context of conversational search, specifically for answering clarification questions. The authors compare two LLM-based (GPT-2 and GPT-3) utterance generation approaches using automatic NLG metrics, a crowdsourcing study, and a qualitative analysis. The results show that both LLM-based approaches generate natural and coherent answers, with an advantage for GPT-3. The analysis also highlights differences in the types of utterance reformulation used by humans and the LLM-based user simulators, motivating further research on parametrized user simulators.

search
SCI-CHAT 2024
Authors
Sekulić et al.

Reliable LLM-based User Simulator for Task-Oriented Dialogue Systems

TL;DR

Introduces a domain-aware user simulator for the evaluation of task-oriented dialogue systems that uses a fine-tuned large language model. The experiments, on two domains, show that fine-tuning a large language model (Llama 2) reduces hallucinations in the simulator's responses while maintaining high lexical diversity. Two main limitations of this approach are: (1) user goal annotations are required in the dialogue data and (2) the user simulator does not generalize well to unseen tasks.

task-oriented, evaluation
TORS 2024
Authors
Shen et al.

Multi-Interest Multi-Round Conversational Recommendation System with Fuzzy Feedback based User Simulator

TL;DR

Introduces a new policy learning framework for conversational recommender systems that can leverage the newly proposed user simulator UUSFF. This simulator answers attribute questions with different types of feedback, including yes, no, and fuzzy (e.g., “alright”, “I don't know”). Experiments show that this user simulator provides more comprehensive and realistic feedback than traditional yes/no responses, which can expose weaknesses of CRSs unable to handle fuzzy feedback.

recommendation, training
arXiv 2024
Authors
Vlachou and Macdonald

What Else Would I Like? A User Simulator using Alternatives for Improved Evaluation of Fashion Conversational Recommendation Systems

TL;DR

Proposes a meta user simulator that can provide knowledge of alternative targets in the context of conversational recommendation in fashion. Based on a patience parameter, the target item is replaced by the closest alternative (i.e., the one with the highest visual similarity). The experiments show that this leads to shorter conversations, as users are inclined to change their minds and accept an alternative target; another positive consequence is an improved success rate of the conversational recommender system.
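The patience-based target switch can be sketched as follows; this is a hedged illustration under the assumption that patience is a turn budget after which the simulator adopts the visually most similar alternative (the function and parameter names are hypothetical):

```python
def maybe_switch_target(target, alternatives, similarity, turn, patience):
    """Return the current target item: keep the original target while
    within the patience budget, otherwise switch to the alternative
    with the highest visual similarity to the original target."""
    if turn <= patience or not alternatives:
        return target
    # Patience exhausted: the simulated user accepts the closest alternative.
    return max(alternatives, key=lambda item: similarity(target, item))
```

Exposing patience as a parameter is what lets the evaluation vary how readily simulated users change their minds.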

recommendation, evaluation
WWW 2024
Authors
Wang et al.

An In-depth Investigation of User Response Simulation for Conversational Search

TL;DR

Provides an analysis of simulated answers to clarification questions in a conversational search scenario. The authors find that small fine-tuned language models can compete with, and even outperform, LLMs on simulating answers to clarification questions. The experiments and analysis highlight points to consider to build better simulators, as well as limitations of existing evaluation metrics.

search
NAACL 2024
Authors
Yoon et al.

Evaluating Large Language Models as Generative User Simulators for Conversational Recommendation

TL;DR

Proposes a new protocol to evaluate LLM-based user simulators in conversational recommendation scenarios. The protocol comprises five evaluation tasks: choosing items to talk about, expressing binary preferences, expressing open-ended preferences, requesting recommendations, and giving feedback. The objective of these tasks is to discover distortions between simulator and human behaviors. The experiments show that LLM-based simulators exhibit differences from humans, such as low diversity in the items discussed, low correlation in the representation/expression of preferences, a lack of personalization, and incoherent feedback.

recommendation, evaluation
arXiv 2024
Authors
Zhu et al.

How Reliable is Your Simulator? Analysis on the Limitations of Current LLM-based User Simulators for Conversational Recommendation

TL;DR

Analyzes the limitations of an LLM-based user simulator, iEvaLM, and proposes a new user simulator to mitigate the identified limitations. The analysis of iEvaLM on two datasets reveals that (1) it leads to data leakage, which inflates performance, (2) the simulated responses are not the main factor in getting successful recommendations, and (3) controlling the simulated responses via a single prompt is complex. To address these limitations, the authors propose SimpleUserSim, which does not know the name of the target item during the conversation and uses a different prompt for each possible user action. Using the same experimental setting, they show that SimpleUserSim is less prone to data leakage and produces more impactful responses.

recommendation, evaluation
arXiv 2024
Authors
Zhu et al.

A LLM-based Controllable, Scalable, Human-Involved User Simulator Framework for Conversational Recommender Systems

TL;DR

Introduces an LLM-based user simulator that is controllable, scalable, and allows human intervention during user profile creation. The dialogue generation process is divided into stages: profile initialization, preference initialization, and message handling. The user behavior in each stage is controlled with specific plugins, such as user preference summarization and intent understanding. These plugins can easily be modified, extended, or replaced to change the behavior of the user simulator. The authors perform experiments in two scenarios, with and without human annotations in the dataset, to showcase the adaptability of the user simulator and its ability to effectively simulate user preferences.

recommendation, generation
arXiv 2023
Authors
Davidson et al.

User Simulation with Large Language Models for Evaluating Task-Oriented Dialogue

TL;DR

Proposes a user simulator built with an LLM using in-context learning instead of fine-tuning. The main objective is to generate linguistically diverse and human-like utterances. The use of goal success rate as a metric to evaluate user simulators is criticized, as humans tend to have non-optimal behavior.

task-oriented, evaluation
CIKM 2023
Authors
Hu et al.

Unlocking the Potential of User Feedback: Leveraging Large Language Model as User Simulators to Enhance Dialogue System

TL;DR

Proposes a new optimization approach that leverages simulated user satisfaction from a large language model to enhance task-oriented dialogue systems. It integrates simulated user satisfaction into the reward function of the proximal policy optimization used to optimize a fine-tuned task-oriented dialogue system. Empirical experiments with a fine-tuned Flan-T5 (dialogue system) and ChatGPT (user simulator) on two benchmark datasets show the potential of the proposed approach when user satisfaction annotations are not available.

task-oriented, training
ACL 2023
Authors
Liu et al.

One Cannot Stand for Everyone! Leveraging Multiple User Simulators to train Task-oriented Dialogue Systems

TL;DR

Proposes to train task-oriented dialogue systems using multiple user simulators. The authors frame the problem as a multi-armed bandit where each arm corresponds to one user simulator, which balances how much each simulator is used during optimisation and mitigates catastrophic forgetting. The experimental results show improved performance compared to baseline agents trained with a single user simulator in a single-domain scenario; agents trained with the framework are also more robust to unseen domains. While the results are promising, the authors note that experiments in a multi-domain scenario are needed.
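The bandit view of simulator selection can be sketched with a standard UCB1 policy; this is an assumption for illustration (the paper's exact bandit algorithm may differ), where each arm is one user simulator and rewards come from dialogue training outcomes:

```python
import math

class SimulatorBandit:
    """UCB1 over user simulators: pick the simulator to train against
    next, trading off its observed reward against how rarely it has
    been used (which helps counter catastrophic forgetting)."""

    def __init__(self, n_simulators: int):
        self.counts = [0] * n_simulators   # times each simulator was used
        self.values = [0.0] * n_simulators  # running mean reward per simulator

    def select(self) -> int:
        for i, c in enumerate(self.counts):
            if c == 0:  # use each simulator at least once first
                return i
        total = sum(self.counts)
        ucb = [v + math.sqrt(2 * math.log(total) / c)
               for v, c in zip(self.values, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm: int, reward: float):
        # Incremental mean update for the chosen simulator.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

The exploration term keeps under-used simulators in rotation, so the policy is not trained against a single simulator's quirks.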

task-oriented, training
SIGIR 2023
Authors
Owoicho et al.

Exploiting Simulated User Feedback for Conversational Search: Ranking, Rewriting, and Beyond

TL;DR

Proposes a user simulator able to answer clarifying questions and give direct feedback to a conversational search system. The simulator is initialized with a given information need and can interact with conversational search systems over multiple turns while using natural language and staying coherent. It also integrates the notion of patience and stops a conversation when patience runs out. Crowd workers assess the naturalness and usefulness of the answers generated by the simulator in experiments with the TREC CAsT dataset. The experiments show the benefits of using simulated user feedback to improve conversational search systems.

search, evaluation
arXiv 2023
Authors
Sun et al.

Metaphorical User Simulators for Evaluating Task-oriented Dialogue Systems

TL;DR

Presents a metaphorical user simulator, MetaSim, that uses historical conversation strategies as metaphors for the current conversation. This improves the simulator's realism and its abilities in dialogue reasoning and in generalizing to new domains. The authors also propose a tester-based evaluation framework to evaluate user simulators and task-oriented dialogue systems; a manual evaluation shows that it is a promising solution for automatic evaluation.

task-oriented, evaluation
arXiv 2023
Authors
Terragni et al.

In-Context Learning User Simulators for Task-Oriented Dialog Systems

TL;DR

Proposes an approach to build an in-context learning user simulator using an LLM. The user simulator is given a prompt comprising the task description, example dialogues, user goal, and dialogue history to generate responses. The simulator also comprises an evaluation component that tracks goal completion and assesses the system's actions. The experiments show that the in-context learning abilities of LLMs are valuable for generating diverse dialogues (exploring many dialogue paths) but suffer from limitations like unpredictability and hallucinations.

task-oriented, generation
arXiv 2023
Authors
Wang et al.

User Behavior Simulation with Large Language Model based Agents

TL;DR

Introduces a simulation environment where agents can interact with a recommender system, other agents, and "social media". An agent is based on an LLM (ChatGPT in particular) and comprises three modules: profile, memory (inspired by cognitive neuroscience), and action. Two main questions need to be considered when leveraging LLMs: (1) what behavior to simulate and (2) how to design the prompts.

recommendation, training, evaluation
EMNLP 2023
Authors
Wang et al.

Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models

TL;DR

Proposes an interactive evaluation approach, iEvaLM, using LLM-based user simulators. The approach is validated through experiments on two public datasets: ReDial and OpenDialKG. The simulated user is given a persona based on preferences established from ground-truth items, and the allowed behaviors, i.e., talking about a preference, providing feedback, and completing the conversation, are defined in the prompt. The evaluation considers two types of metrics: objective (recall) and subjective (persuasiveness, scored using an LLM). The authors also mention some limitations of this approach, mostly related to the LLM.

recommendation, evaluation
SIGIR 2022
Authors
Kim and Lipani

A Multi-Task Based Neural Model to Simulate Users in Goal Oriented Dialogue Systems

TL;DR

Proposes a user simulator, based on a generative model, that predicts users' satisfaction scores, actions, and utterances in a multi-task learning setting. The authors perform an ablation study showing that the three tasks help each other to better simulate users. Note that the proposed user simulator does not represent users' knowledge and mental state.

task-oriented

This page is maintained by Nolwenn Bernard and Krisztian Balog.

We welcome suggestions via email.