Kei Tateno Publications

Discussion Paper
Abstract

In many real recommender systems, novel items are added frequently over time. The importance of sufficiently presenting novel actions has been widely acknowledged for improving long-term user engagement. Recent work builds on Off-Policy Learning (OPL), which trains a policy from only logged data; however, the existing methods can be unsafe in the presence of novel actions. Our goal is to develop a framework to enforce exploration of novel actions with a guarantee for safety. To this end, we first develop Safe Off-Policy Policy Gradient (Safe OPG), which is a model-free safe OPL method based on high-confidence off-policy evaluation. In our first experiment, we observe that Safe OPG almost always satisfies a safety requirement, even when existing methods violate it greatly. However, the result also reveals that Safe OPG tends to be too conservative, suggesting a difficult tradeoff between guaranteeing safety and exploring novel actions. To overcome this tradeoff, we also propose a novel framework called Deployment-Efficient Policy Learning for Safe User Exploration, which leverages a safety margin and gradually relaxes safety regularization during multiple (not many) deployments. Our framework thus enables exploration of novel actions while guaranteeing safe implementation of recommender systems.
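To make the idea of a safety requirement backed by high-confidence off-policy evaluation concrete, the following is a minimal sketch and not the Safe OPG method itself. It assumes rewards lie in [0, 1], the logging policy gives positive probability to every logged action, and importance weights are clipped at a hypothetical `max_weight`; under those assumptions a Hoeffding bound gives a lower confidence bound on the new policy's value, which is compared against a baseline value. The function names and parameters are illustrative only.

```python
import math


def hoeffding_lower_bound(values, upper, delta):
    """Lower confidence bound on the mean of i.i.d. values in [0, upper].

    Holds with probability at least 1 - delta by Hoeffding's inequality.
    """
    n = len(values)
    mean = sum(values) / n
    return mean - upper * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def is_safe(rewards, logging_probs, new_probs, baseline_value,
            max_weight=2.0, delta=0.05):
    """Return True if the lower bound on the new policy's value beats the baseline.

    Assumption: rewards are in [0, 1], so clipping the importance weights can
    only lower the expected estimate and the bound stays valid (if conservative).
    """
    weighted = [
        min(p_new / p_log, max_weight) * r
        for r, p_log, p_new in zip(rewards, logging_probs, new_probs)
    ]
    return hoeffding_lower_bound(weighted, max_weight, delta) >= baseline_value


# Toy logs: the new policy matches the logging policy and every reward is 1.
few = is_safe([1.0] * 4, [0.5] * 4, [0.5] * 4, baseline_value=0.5)
many = is_safe([1.0] * 1000, [0.5] * 1000, [0.5] * 1000, baseline_value=0.5)
```

With only four logged samples the bound is too loose to certify safety, while with a thousand samples it is tight enough. This mirrors the conservativeness noted in the abstract: a high-confidence test rejects policies whenever the logged data are too sparse to support them.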

Discussion Paper
Abstract

Automated decision-making algorithms drive applications in domains such as recommendation systems and search engines. These algorithms often rely on off-policy contextual bandits or off-policy learning (OPL). Conventionally, OPL selects actions that maximize the expected reward within an existing action set. However, in many real-world scenarios, actions (such as news articles or video content) change continuously, so the action space differs from the one in place when the logged data were collected. We define actions introduced after deploying the logging policy as new actions and focus on the problem of OPL with new actions. Existing OPL methods cannot learn and select new actions because no relevant data are logged. To address this limitation, we propose a new OPL method that leverages action features. In particular, we first introduce the Local Combination PseudoInverse (LCPI) estimator for the policy gradient, generalizing the PseudoInverse estimator initially proposed for off-policy evaluation of slate bandits. LCPI controls the trade-off between the reward-modeling condition and the data-collection condition regarding the action features, capturing the interaction effects among different dimensions of action features. Furthermore, we propose a generalized algorithm called Policy Optimization for Effective New Actions (PONA), which integrates LCPI, a component specialized for new action selection, with Doubly Robust (DR), which excels at learning within existing actions. We define PONA as a weighted sum of the LCPI and DR estimators, optimizing the selection of both existing and new actions, and allowing the proportion of new action selections to be adjusted by controlling the weight parameter.

Proceedings of the AAAI Conference on Artificial Intelligence
Abstract

Off-policy evaluation (OPE) aims to accurately evaluate the performance of counterfactual policies using only offline logged data. Although many estimators have been developed, there is no single estimator that dominates the others, because the estimators' accuracy can vary greatly depending on the characteristics of a given OPE task, such as the evaluation policy, number of actions, and noise level. Thus, the data-driven estimator selection problem is becoming increasingly important and can have a significant impact on the accuracy of OPE. However, identifying the most accurate estimator using only the logged data is quite challenging because the ground-truth estimation accuracy of estimators is generally unavailable. This paper thus studies this challenging problem of estimator selection for OPE for the first time. In particular, we enable an estimator selection that is adaptive to a given OPE task, by appropriately subsampling available logged data and constructing pseudo policies useful for the underlying estimator selection task. Comprehensive experiments on both synthetic and real-world company data demonstrate that the proposed procedure substantially improves the estimator selection compared to a non-adaptive heuristic. Note that the complete version with the technical appendix is available on arXiv: http://arxiv.org/abs/2211.13904.
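As a concrete reference point for what an OPE estimator computes, here is a minimal sketch of inverse propensity scoring (IPS), one standard estimator of the kind an estimator selection procedure would choose among. It is not the selection procedure proposed in the paper. The function name and the toy logged data are illustrative assumptions, and the sketch assumes the logging policy's action probabilities are known and strictly positive.

```python
def ips_estimate(rewards, logging_probs, eval_probs):
    """Inverse propensity scoring estimate of an evaluation policy's value.

    rewards[i]       : reward observed for the i-th logged action
    logging_probs[i] : probability the logging policy gave to that action
    eval_probs[i]    : probability the evaluation policy gives to that action
    """
    if not (len(rewards) == len(logging_probs) == len(eval_probs)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for r, p_log, p_eval in zip(rewards, logging_probs, eval_probs):
        if p_log <= 0.0:
            raise ValueError("logging probabilities must be positive")
        # Reweight each logged reward by how much more (or less) likely
        # the evaluation policy is to take the logged action.
        total += (p_eval / p_log) * r
    return total / len(rewards)


# Toy logged data (hypothetical values, for illustration only).
rewards = [1.0, 0.0, 1.0, 1.0]
logging_probs = [0.5, 0.5, 0.25, 0.25]
eval_probs = [0.25, 0.25, 0.5, 0.5]
ips_value = ips_estimate(rewards, logging_probs, eval_probs)
```

The importance weights here are 0.5, 0.5, 2 and 2, so a few heavily weighted samples dominate the estimate. That sensitivity to the evaluation policy and to noise is one reason no single estimator is best across all OPE tasks.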

Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD)
Abstract

Off-policy Evaluation (OPE), or offline evaluation in general, evaluates the performance of hypothetical policies leveraging only offline log data. It is particularly useful in applications where online interaction is high-stakes and expensive, such as precision medicine and recommender systems. Since many OPE estimators have been proposed and some of them have hyperparameters to be tuned, there is an emerging challenge for practitioners to select and tune OPE estimators for their specific application. Unfortunately, identifying a reliable estimator from results reported in research papers is often difficult because the current experimental procedure evaluates and compares the estimators’ accuracy on a narrow set of hyperparameters and evaluation policies. Therefore, we cannot know which estimator is safe and reliable to use in general practice. In this work, we develop Interpretable Evaluation for Offline Evaluation (IEOE), an experimental procedure to evaluate OPE estimators’ sensitivity to the choice of hyperparameters and possible changes in evaluation policies in an interpretable manner. We also build open-source Python software, pyIEOE, to streamline the evaluation with the IEOE protocol. With this software, researchers can use IEOE to compare different OPE estimators in their research, and practitioners can select an appropriate estimator for the given practical situation. Then, using the IEOE procedure, we perform an extensive re-evaluation of a wide variety of existing estimators on public datasets. We show that, surprisingly, simple estimators that have fewer hyperparameters are more reliable than other advanced estimators because advanced estimators need environment-specific hyperparameter tuning to perform well. Finally, we apply IEOE to real-world e-commerce platform data and demonstrate how to use our protocol in practice.