Richard Zhuang

(Last Updated: 2026.06)

Welcome to my personal space! I am currently a first-year Master’s student in Computer Science at Stanford University, where I am a core contributor to the OpenThoughts-Agent project with Prof. Ludwig Schmidt, working on data recipes for post-training agents. Before Stanford, I graduated from UC Berkeley with a double major in Applied Math and Computer Science. During my time at Cal, I researched LLM routing (EmbedLLM) with Jiantao Jiao and Tianhao Wu, as well as LLMs for games (PokerBench) with Akshat Gupta. I also interned at Bespoke Labs in Spring 2025, where I worked on enhancing the tool-use capabilities of LLM agents through RL (blog).

I’m broadly interested in understanding and improving the capabilities of Large Language Models (LLMs) in a data-centric way. Specifically, I’m intrigued by how certain data “foster” skills that are essential for LLM agents (e.g., reasoning and planning). I have also had a long-standing passion for sports analytics.

Outside the realm of AI, you will usually find me playing basketball🏀 or immersing myself in Chinese Hip-hop music🔥.

News

Jun 25, 2026	Announcing OpenThoughts-Agent and OpenThinkerAgent-32B, the strongest Qwen3-based open-data agentic model for terminal use and coding, reaching 44.8% average accuracy across seven agentic benchmarks. We openly share the full stack: paper, model, data, and code. Read the X thread for the highlights.

Selected Work

OpenThoughts-Agent: Data Recipes for Agentic Models (500K+ views on X)

Negin Raoof*, Richard Zhuang*, Marianna Nezhurina*, Etash Guha*, and 46 more authors

2026

We present OpenThoughts-Agent, an open data curation pipeline for training agentic models. Through more than 100 controlled ablation experiments, we study how each stage of the pipeline and the diversity of training tasks shape downstream agent performance. Fine-tuning Qwen3-32B on 100K curated examples reaches 44.8% average accuracy across seven agentic benchmarks, a 3.9 percentage point improvement over the previous best open model of the same size. We release our training sets, data pipeline, experimental data, and models to advance open research on agentic AI.

Improving Multi-Turn Tool Use with Reinforcement Learning (200K+ views on X)

Richard Zhuang*, Trung Vu*, Alex Dimakis, and 1 more author

2025

Recently, OpenAI demonstrated that RL can be used to train a research agent that uses tools to carry out complex, multi-step workflows. However, they disclosed few details about their training recipe. Meanwhile, experiments in the open-source and research community are often single-turn, lacking back-and-forth interaction with an environment. We are excited to share recent findings showing that RL can improve multi-step tool-use capabilities, a first milestone. Using GRPO, the core algorithm behind DeepSeek-R1, we improved Qwen2.5-7B-Instruct’s tool-use performance by 23% on a subset of the BFCL benchmark using only 100 training samples. The tasks in this benchmark require an agent to orchestrate multiple tools to complete multi-step tasks in a simulated environment, such as booking air travel using credit cards and placing orders in a trading system.

EmbedLLM: Learning Compact Representations of Large Language Models (ICLR 2025 Spotlight🌟)

Richard Zhuang, Tianhao Wu, Zhaojin Wen, and 3 more authors

In The Thirteenth International Conference on Learning Representations (ICLR), 2025

With hundreds of thousands of language models available on Hugging Face today, efficiently evaluating and utilizing these models across various downstream tasks has become increasingly critical. Many existing methods repeatedly learn task-specific representations of Large Language Models (LLMs), which leads to inefficiencies in both time and computational resources. To address this, we propose EmbedLLM, a framework designed to learn compact vector representations of LLMs that facilitate downstream applications involving many models, such as model routing. We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness. Empirical results show that EmbedLLM outperforms prior methods in model routing. Additionally, we demonstrate that our method can forecast a model’s performance on multiple benchmarks without incurring additional inference cost. Extensive probing experiments validate that the learned embeddings capture key model characteristics, e.g., whether the model is specialized for coding tasks, even without being explicitly trained on them. We open-source our dataset, code, and embedder to facilitate further research and application.

PokerBench: Training Large Language Models to become Professional Poker Players (AAAI 2025)

Richard Zhuang, Akshat Gupta, Richard Yang, and 3 more authors

In The 39th Annual AAAI Conference on Artificial Intelligence (AAAI), 2025

We introduce PokerBench, a benchmark for evaluating the poker-playing abilities of large language models (LLMs). As LLMs excel in traditional NLP tasks, their application to complex, strategic games like poker poses a new challenge. Poker, an incomplete-information game, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology. This makes poker the ideal next frontier for large language models. PokerBench consists of a comprehensive compilation of the 11,000 most important scenarios, split between pre-flop and post-flop play, developed in collaboration with trained poker players. We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after fine-tuning, these models show marked improvements. We validate PokerBench by having models with different scores compete against each other, demonstrating that higher scores on PokerBench lead to higher win rates in actual poker games. Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning optimal playing strategy, suggesting the need for more advanced methodologies for effectively training language models to excel in games. PokerBench thus presents a unique benchmark for quick and reliable evaluation of the poker-playing ability of LLMs, as well as a comprehensive benchmark to study the progress of LLMs in complex game-playing scenarios.