The Homepage of Banghua Zhu

Principal Research Scientist at Nvidia. Incoming Assistant Professor at the University of Washington.


I am a principal research scientist at Nvidia, where I work on Nemotron post-training with a focus on reinforcement learning, agentic systems, and the science of model evaluation.

I’m also an incoming assistant professor at the University of Washington, where I lead the Foundation Model and Reinforcement Learning Research Lab (FMRL2).

Before joining Nvidia, I co-founded Nexusflow AI in 2023, which provides reliable AI agent solutions for enterprise use cases.

I received my PhD from the Department of EECS, UC Berkeley. I am very fortunate to have been advised by Prof. Jiantao Jiao and Prof. Michael I. Jordan. I am a recipient of the 2023 David J. Sakrison Memorial Prize from Berkeley EECS for truly outstanding PhD research.

News: Check out our new short course on Post-training of LLMs, co-taught with Andrew Ng on Deeplearning.ai!

Research Interests

I’m currently interested in the theoretical foundations, training, serving, evaluation, and applications of foundation models. In the past, I have also worked on statistics, information theory, and machine learning, with applications in game theory, robust statistics, reinforcement learning, and human-AI interaction.


Training

  • Starling-7B:
    Check out our open 7B model, Starling-7B, which ranks first among all existing Mistral-based 7B models according to human evaluation in Chatbot Arena!
    • Starling-7B is trained with our open-source high-quality preference dataset, Nectar, using our new reward-training and policy-finetuning algorithms.
  • Athene Series:
    • Athene-70B: Our first chat model, fine-tuned from Llama-3-70B, which gained 30+ Elo on Chatbot Arena and greatly improved its multilingual capability.
    • Athene-V2-72B-Chat: Fine-tuned from Qwen-2.5-72B. It ranks behind only DeepSeek V3 & R1 (671B) among all non-reasoning open models on Chatbot Arena and is competitive with GPT-4o on benchmarks such as MMLU-Pro, GPQA, AIME, IFEval, BigCodeBench, and LiveBench.
    • Athene-V2-72B-Agent: An agent model specializing in function calling and agentic use cases, surpassing GPT-4o in complex function-calling tasks, especially in parallel and nested calls.

Evaluation

  • Hugging Face Function Calling Leaderboard: Used in the Llama-3.1 technical report for evaluating function-calling capabilities.
  • Chatbot Arena: One of the most reliable platforms for evaluating models with human preferences.
  • Arena-Hard-Auto: An automatic benchmark creation pipeline that uses LLM-as-a-judge to quickly evaluate model performance.
  • Preference Proxy Evaluations: A high-quality evaluation pipeline for reward models in RLHF that correlates very well with downstream RL performance.
  • MMMG: A comprehensive and reliable evaluation suite for Multitask Multimodal Generation.

Theoretical Foundations

  • Fundamental Limits of RLHF:
    We identify the fundamental limits of RLHF and develop near-optimal algorithms with improved sample complexity for reward training [ZJJ23]. We also propose an alternative to Proximal Policy Optimization (PPO) for policy optimization that is more stable and sample-efficient [ZSFDZJJ23].

  • LLM Watermarking:
    We recently proposed a statistically near-optimal algorithm for LLM watermarking.
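To illustrate the reward-training step that this line of work analyzes, here is a minimal sketch of the standard Bradley-Terry pairwise log-likelihood objective used to fit reward models from preference data. This is an illustrative textbook formulation in plain Python, not the specific algorithms from [ZJJ23]; the function name is mine.

```python
import math

def bradley_terry_loss(chosen_scores, rejected_scores):
    """Average negative log-likelihood of the Bradley-Terry model,
    where P(chosen preferred over rejected) = sigmoid(r_c - r_r)."""
    total = 0.0
    for r_c, r_r in zip(chosen_scores, rejected_scores):
        total += math.log(1.0 + math.exp(-(r_c - r_r)))  # -log sigmoid(r_c - r_r)
    return total / len(chosen_scores)

# A reward model that separates the pairs well has low loss:
print(bradley_terry_loss([2.0, 1.5], [-1.0, 0.0]))  # small
# An uninformative model sits at log(2) ≈ 0.693:
print(bradley_terry_loss([0.0, 0.0], [0.0, 0.0]))
```

Minimizing this loss over the reward model's parameters (here the scores stand in for model outputs) is the reward-training stage whose sample complexity the work above studies.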


Serving

  • Model Routing and Caching: We analyze and propose near-optimal algorithms for caching and model multiplexing for serving large models, significantly enhancing the efficiency of inference in LLMs [ZSZBJJ23].
  • S-LoRA: We also proposed S-LoRA, an algorithm and serving system for thousands of LoRA adapters.
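As a toy illustration of the caching side of this work, an exact-match LRU response cache for repeated prompts can be sketched as below. This is a standard LRU policy for intuition only, not the near-optimal caching algorithm of [ZSZBJJ23]; class and method names are mine.

```python
from collections import OrderedDict

class ResponseCache:
    """Minimal LRU cache: serve a stored response when the same prompt
    repeats, evicting the least recently used entry at capacity."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, prompt):
        if prompt not in self._store:
            return None
        self._store.move_to_end(prompt)  # mark as most recently used
        return self._store[prompt]

    def put(self, prompt, response):
        self._store[prompt] = response
        self._store.move_to_end(prompt)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the LRU entry

cache = ResponseCache(capacity=2)
cache.put("q1", "a1")
cache.put("q2", "a2")
cache.get("q1")        # touch q1, so q2 becomes least recently used
cache.put("q3", "a3")  # evicts q2
print(cache.get("q2"))  # None (evicted)
print(cache.get("q1"))  # a1 (still cached)
```

In LLM serving, each cache hit avoids a full model invocation; the analysis in the work above studies how to choose caching and model-multiplexing policies near-optimally rather than heuristically.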

Additional Research Areas

  1. Bandit and Reinforcement Learning
    • We study online and offline learning, off-policy evaluation, and inverse RL [RZMJR21, MZJW22].
  2. Information-theoretic Lower Bounds
    • We investigate achieving fundamental limits in noisy searching, sorting, and computing tasks using information-theoretic tools [WGZW22, ZWGJW23].
  3. Statistics & Robustness
    • We explore techniques to enhance the resilience of AI models against malicious attacks, extending the theory in high-dimensional robust statistics [ZJS22].
    • We propose efficient algorithms for outlier detection, robust mean estimation, robust covariance estimation, and robust linear regression [ZJS21], as well as Byzantine-robust distributed learning and distributed systems [ZPWWJSJ23].
    • We design doubly-robust estimators that outperform traditional self-training pipelines in computer vision and autonomous driving [ZDJWZJJ23].
    • We conduct theoretical analyses of Generative Adversarial Networks (GANs), providing insights for practical implementations [ZJT19].
    • We explore the interaction between ML systems and self-interested, strategic humans, a crucial topic in economics. By modeling and analyzing online learning in contract theory and the creator economy, we provide near-optimal regret bounds for both problems, addressing the longstanding challenge of sample complexity in online contract design [ZBYWJJ23, ZKJJ23].