Tutorial on User Simulation for Evaluating Information Access Systems on the Web @TheWebConf’24
Half-day tutorial at the 2024 ACM Web Conference, Singapore, May 2024.
Although Web information access systems, such as search engines, recommender systems, and conversational assistants, are used by millions on a daily basis, how to appropriately evaluate those systems remains an open scientific challenge. For example, the weak correlation of online and offline evaluation results makes it hard to choose the best algorithm to deploy in a production environment, while inaccurate evaluation of algorithms would result in misleading conclusions and thus hinder progress in research.
The emergence of large language models (LLMs) such as ChatGPT makes information access on the Web increasingly more interactive and conversational. It is especially challenging to evaluate an interactive system’s overall effectiveness in helping a user finish a task via interactive support, because the utility of such a system can only be assessed by a user interacting with the system. Moreover, the fact that users vary significantly in terms of their behaviour and preferences makes it very difficult to perform system evaluation with reproducible experiments.
User simulation has the potential to enable repeatable and reproducible evaluations at low cost, without consuming valuable user time (human assessor time or online experimentation bandwidth). Further, simulation can augment traditional evaluation methodologies by providing insights into how system performance changes under different conditions and with varying user behaviour. Relevant research work, however, is scattered across multiple research communities, including information retrieval, recommender systems, dialogue systems, and user modeling. This tutorial aims to synthesize this extensive body of research into a coherent framework, with a focus on applications of user simulation to evaluating Web information access systems.
Target Audience and Prerequisites
Accurately evaluating a search engine, a recommender system, or a conversational assistant matters both to practitioners who want to assess the utility of their production systems and to researchers who want to know whether their new algorithms are truly more effective than existing ones. We therefore expect the tutorial to appeal broadly to participants of the Web Conference, including undergraduate and graduate students, academic and industry researchers, industry practitioners, and government policy/decision makers. The tutorial is largely self-contained and requires only minimal background knowledge, so it should be accessible to most attendees of the Web Conference.
Participants of the tutorial can expect to learn what user simulation is, why it is important to use it for evaluation, how existing user simulation techniques can already be useful for evaluating interactive Web information access systems, how to develop new user simulators, and how to use user simulation broadly to evaluate assistive AI systems. They can also expect to learn about associated challenges and where additional research is still needed.
Scope and Outline
- Background: Evaluation of Web Information Access Systems [20 min]
We first describe the spectrum of Web information access tasks. Next, we briefly discuss the goals of evaluation and general methodologies of evaluation (reusable test collections, user studies, and online evaluation). We then highlight the challenges involved in evaluating Web information access systems and how user simulation can help address those challenges.
- Overview of User Simulation [15 min]
This part provides a brief historical account of the use of simulation techniques and highlights how various research communities have focused on different but complementary areas of evaluation and user simulation. This includes early work on simulation in information retrieval and studies in interactive information retrieval pointing out discrepancies between interactive and non-interactive evaluation results. In dialogue systems research, simulation-based techniques have been used for dialogue policy learning and, to a more limited extent, for evaluation. User simulation can be regarded as developing a complete and operational user model, which makes work on search tasks and intents, information-seeking models, cognitive models of users, and economic IR models highly relevant.
- Simulation-based Evaluation Frameworks [25 min]
We make the key observation that traditional evaluation measures used in IR may be viewed as naive user simulators, and discuss how to interpret Precision, Recall, and NDCG@k from a user simulation perspective. Next, we discuss metrics based on explicit models of user behavior, characterized by (1) the assumed user task, (2) the assumed user behavior when interacting with results, (3) the measurement of the reward a user would receive from examining a result, and (4) the measurement of the effort a user would need to make in order to receive the reward. Specifically, we cover the RBP, ERR, EBU, and time-biased gain measures, as well as the more general C/W/L and C/W/L/A frameworks and the model-based framework by Carterette. Finally, we present a general simulation-based evaluation framework and the Interface Card Model, which can be used to evaluate an interactive information access system with a computationally generated dynamic browsing interface using user simulation.
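As a concrete illustration of the "metrics as naive user simulators" view (a sketch for this page, not part of the tutorial materials), the snippet below computes rank-biased precision (RBP) as the expected gain of a simulated user who scans results top-down and continues to the next result with a fixed persistence probability:

```python
def rbp(relevances, persistence=0.8):
    """Rank-biased precision, read as a naive user simulator: the user
    examines result k and moves on to result k+1 with probability
    `persistence`, so rank k is reached with probability persistence**(k-1).
    The score is the expected (normalised) gain of that simulated browse."""
    return (1 - persistence) * sum(
        rel * persistence ** rank for rank, rel in enumerate(relevances)
    )

# Example: binary relevance judgments for a ranked list of five results.
print(rbp([1, 0, 1, 0, 0], persistence=0.8))  # 0.2 * (1 + 0.64) = 0.328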
- User Simulation and Human Decision-making [15 min]
In this part, we provide a high-level overview of research on conceptual models that can provide theoretical guidance for modeling processes and decisions from an individual’s perspective. We cover models of search behavior within three main categories: (1) cognitive models, focusing on the cognitive processes underlying the information-seeking activity, (2) process models, representing the different stages and activities during the search process, and (3) strategic models, describing tactics that users employ when searching for information. Then, we discuss how to model decision-making processes mathematically using Markov decision processes (MDPs). MDPs provide a general formal framework for constructing user simulators, which we will use to discuss specific user simulation techniques in the next two sections.
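To hint at how the MDP view translates into an executable simulator, here is a minimal, purely illustrative sketch (the states, actions, and probabilities are made up for this example and not taken from any particular system): a simulated session is sampled from a state-dependent policy and a transition model.

```python
import random

# A minimal MDP-style user simulator sketch; all values are illustrative.
POLICY = {                        # P(action | state)
    "start":     {"issue_query": 1.0},
    "examining": {"click": 0.4, "reformulate": 0.3, "stop": 0.3},
}

TRANSITIONS = {                   # deterministic next state per action
    "issue_query": "examining",
    "click": "examining",
    "reformulate": "examining",
    "stop": "done",
}

def simulate_session(max_steps=20, seed=None):
    """Sample a sequence of user actions by repeatedly drawing an action
    from the state-dependent policy and following the transition model."""
    rng = random.Random(seed)
    state, trace = "start", []
    for _ in range(max_steps):
        if state == "done":
            break
        actions, probs = zip(*POLICY[state].items())
        action = rng.choices(actions, weights=probs)[0]
        trace.append(action)
        state = TRANSITIONS[action]
    return trace

print(simulate_session(seed=42))  # prints the sampled action sequence
```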
- Simulating Interactions with Search and Recommender Systems [45 min]
We start by presenting models that describe interaction workflows, that is, models that specify the space of user actions and system responses and the possible transitions between them. Then, we discuss specific user actions: query formulation, scanning behavior, clicks, the effort involved in processing documents, and stopping. We also provide an overview of toolkits and resources and discuss approaches to validating simulators.
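As a small, self-contained example of one such building block (illustrative only, with made-up attractiveness probabilities), the snippet below simulates clicks under the basic cascade model, in which the user scans results top-down and stops after the first click:

```python
import random

def simulate_clicks_cascade(attractiveness, rng=random):
    """Cascade-model click simulation: the simulated user scans results
    top-down, clicks result k with probability attractiveness[k], and
    stops scanning after the first click (the basic cascade assumption)."""
    clicks = []
    for rank, attr in enumerate(attractiveness):
        if rng.random() < attr:
            clicks.append(rank)
            break  # cascade: the session ends after the first click
    return clicks

# Example with made-up per-result attractiveness probabilities.
print(simulate_clicks_cascade([0.7, 0.3, 0.5, 0.1]))
```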
- Simulating Interactions with Conversational Assistants [30 min]
We begin with a conceptualization of conversational information access in terms of intents and dialogue structure, and discuss two fundamentally different simulator architectures: modular and end-to-end systems. There is a solid body of work within dialogue systems research on simulating user decisions to build on, including the widely used agenda-based simulation and more recent sequence-to-sequence models. This is followed by a discussion of simulation approaches developed specifically for conversational information access. We review toolkits and resources, followed by a discussion of how simulators themselves can be evaluated.
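To give a flavour of agenda-based simulation, the sketch below keeps the simulated user’s pending dialogue acts on a stack and pops the next one to utter, pushing an answer when the system asks a clarifying question; the class, slots, and dialogue acts are hypothetical and only meant to illustrate the idea.

```python
# A minimal agenda-based user simulator sketch. The agenda is a stack of
# dialogue acts the simulated user still intends to convey; system acts
# can push new items onto it. All names and acts are illustrative only.

class AgendaBasedUser:
    def __init__(self, goal):
        # Put the final request at the bottom of the agenda and the
        # constraints on top, so they are uttered in their original
        # order before the request.
        self.agenda = [("request", goal["request"])]
        self.agenda += [("inform", slot, value)
                        for slot, value in reversed(list(goal["constraints"].items()))]

    def respond(self, system_act):
        # If the system asks about an unconstrained slot, push an answer.
        if system_act and system_act[0] == "ask":
            self.agenda.append(("inform", system_act[1], "dontcare"))
        return self.agenda.pop() if self.agenda else ("bye",)

user = AgendaBasedUser({"constraints": {"genre": "jazz", "era": "1960s"},
                        "request": "album recommendation"})
print(user.respond(None))             # ('inform', 'genre', 'jazz')
print(user.respond(("ask", "mood")))  # ('inform', 'mood', 'dontcare')
```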
- Conclusion and Future Challenges [15 min]
We conclude by highlighting open issues and providing several potential research directions. We discuss how simulation technologies can help foster collaboration between academia and industry. We also argue that some of the major challenges that remain require research from multiple subject areas, including information science, information retrieval, recommender systems, machine learning, natural language processing, knowledge representation, human-computer interaction, and psychology, making user simulation a truly interdisciplinary area for research.
- Discussion [15 min]
We dedicate the final part of the tutorial to open-ended discussion and feedback from participants.
Presenters
Krisztian Balog is a full professor at the University of Stavanger and a staff research scientist at Google. His general research interests lie in the use and development of information retrieval, natural language processing, and machine learning techniques for intelligent information access tasks. His current research concerns novel evaluation methodologies, and conversational and explainable search and recommendation methods. Balog regularly serves on the senior programme committee of SIGIR, WSDM, WWW, CIKM, and ECIR. He previously served as general co-chair of ICTIR’20 and ECIR’22, program committee co-chair of ICTIR’19 (full papers), CIKM’21 (short papers), and SIGIR’24 (resource and reproducibility), Associate Editor of ACM Transactions on Information Systems, and coordinator of IR benchmarking efforts at TREC and CLEF. Balog is the recipient of the 2018 Karen Spärck Jones Award. He has previously given tutorials at WWW’13, SIGIR’13, WSDM’14, ECIR’16, SIGIR’19, CIKM’23, and AAAI’24.
ChengXiang Zhai is a Donald Biggar Willett Professor in Engineering in the Department of Computer Science at the University of Illinois at Urbana-Champaign. His research interests include intelligent information retrieval, text mining, natural language processing, machine learning, and their applications. He serves as a Senior Associate Editor of ACM Transactions on Intelligent Systems and Technology, and previously served as an Associate Editor of ACM TOIS, ACM TKDD, and Elsevier’s IPM, and as Program Co-Chair of NAACL-HLT’07, SIGIR’09, and WWW’15. He is an ACM Fellow and a member of the ACM SIGIR Academy. He received the ACM SIGIR Gerard Salton Award and the ACM SIGIR Test of Time Award (three times). He has previously given tutorials at HLT-NAACL’04, SIGIR’05, SIGIR’06, HLT-NAACL’07, ICTIR’13, SIGIR’14, KDD’17, SIGIR’17, SIGIR’18, SIGIR’20, SIGIR’21, CIKM’23, and AAAI’24.