About me
My goal is to create AI agents that can make decisions in complex environments using evaluative feedback, which is formalized under the reinforcement learning (RL) framework. Currently, I work on using RL algorithms to help large language models (LLMs) interact better with humans and their environments.
Experiences
- 2022.08 - Now: Applied Scientist, then Senior Applied Scientist at Amazon. I work on RL, LLMs, and agents. I built the RL(HF) fine-tuning of Amazon Titan models and Rufus.
- 2021.08 - 2022.08: Research Scientist at ByteDance. I worked on RL for recommendation problems, in what is arguably the largest-scale recommender system.
- 2021: Ph.D. in Computer Science at Stanford, advised by Emma Brunskill. I worked on the theory and algorithms of reinforcement learning, especially in the offline setting. Our work provided some of the early theoretical results on batch RL with function approximation and the principle of pessimistic value estimates, as well as real-world applications of batch RL in healthcare and education.
- 2016: B.S. in Machine Intelligence at Peking University.
Preprints and Publications
- Teaching Large Language Models to Reason through Learning and Forgetting. Preprint.
- From Demonstrations to Rewards: Alignment Without Explicit Human Preferences. Preprint.
- Bridging the Training-Inference Gap in LLMs by Leveraging Self-Generated Tokens. TMLR.
- AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents. ICLR 2025.
- EXTRACT: Efficient Policy Learning by Extracting Transferrable Robot Skills from Offline Data. CoRL 2024.
- Learning the Target Network in Function Space. ICML 2024.
- TAIL: Task-specific Adapters for Imitation Learning with Large Pretrained Models. ICLR 2024.
- Budgeting Counterfactual for Offline RL. NeurIPS 2023.
- TD Convergence: An Optimization Perspective. NeurIPS 2023.
- Reinforcement Learning Tutor Better Supported Lower Performers in a Math Task. Machine Learning Journal.
- Provably Sample-Efficient RL with Side Information about Latent Dynamics. NeurIPS 2022.
- Offline Policy Optimization with Eligible Actions. UAI 2022.
- Provably Good Batch Reinforcement Learning Without Great Exploration. NeurIPS 2020.
- Understanding the Curse of Horizon in Off-Policy Evaluation via Conditional Importance Sampling. ICML 2020.
- Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions. ICML 2020.
- All-Action Policy Gradient Methods: A Numerical Integration Approach. Preprint.
- Off-Policy Policy Gradient with State Distribution Correction. UAI 2019 (Oral).
- Combining Parametric and Nonparametric Models for Off-Policy Evaluation. ICML 2019 (Oral).
- Representation Balancing MDPs for Off-Policy Policy Evaluation. NeurIPS 2018.
- When Simple Exploration is Sample Efficient: Identifying Sufficient Conditions for Random Exploration to Yield PAC RL Algorithms. EWRL 2018.
- Behaviour Policy Estimation in Off-Policy Evaluation: Calibration Matters. ICML 2018 Workshops.
- Switched Trajectories for Off-Policy Learning. ICML 2018 Workshops.
- Model Selection for Off-Policy Policy Evaluation. RLDM 2017 (Extended Abstract).
- PAC Continuous State Online Multitask Reinforcement Learning with Identification. AAMAS 2016.
- Local Orthogonality Preserving Alignment for Nonlinear Dimensionality Reduction. Journal of Computer Science and Technology, 31(3): 512-524, 2016.
Professional Service
Journal Reviewing: JMLR, IEEE TPAMI, Machine Learning, Artificial Intelligence, Biometrika
Conference Reviewing: NeurIPS, ICLR, ICML, AISTATS, UAI, AAAI