about tony chen

Haozhe (Tony) Chen 陈昊哲

PhD Student in Computer Science
Princeton University
Advisor: Zhuang Liu

Also currently:
Part-time researcher @ Penrose.

Previously:

Undergrad @ Columbia. I was fortunate to work with Tianyi Peng, Chengzhi Mao, Carl Vondrick, Hongseok Namkoong, and Julia Hirschberg.
Research Intern @ Together AI working on accelerating inference.

connect

publications

arXiv

CEO-Bench: Can Agents Play the Long Game?

Haozhe Chen, Karthik Narasimhan, and Zhuang Liu

arXiv preprint, Jun 2026

Abs arXiv PDF Blog Code

Language model agents are increasingly capable on isolated short-horizon tasks, but real-world decision making often requires steering complex systems over long horizons under uncertainty. CEO-Bench evaluates this capability by asking agents to operate a simulated AI startup for 500 days through a programmable Python interface with business databases, management tools, and social media. The benchmark tests whether agents can gather noisy information, adapt to a changing market, and coordinate pricing, marketing, budgeting, product, operations, and enterprise-sales decisions toward sustained business performance.
arXiv

End-to-End Context Compression at Scale

Ang Li^*, Sean McLeish^*, Haozhe Chen^*, Nimit Kalra, Zaiqian Chen, Artem Gazizov, Venkata Anoop Suhas Kumar Morisetty, Bhavya Kailkhura, Harshitha Menon, Zhuang Liu, Brian R. Bartoldson, Tom Goldstein, Sanae Lotfi, Micah Goldblum, and Pavel Izmailov

arXiv preprint, Jun 2026

Abs arXiv Models PDF Code VentureBeat Crypto Briefing

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. This work introduces Latent Context Language Models, a family of encoder-decoder compressors trained end-to-end at scale that can compress contexts up to 16x while preserving downstream performance and improving the accuracy-efficiency frontier for long-context inference.
NeurIPS

LLM Generated Persona is a Promise with a Catch

Ang Li, Haozhe Chen, Hongseok Namkoong, and Tianyi Peng

NeurIPS 2025, Mar 2025

Abs arXiv Website

Persona-based simulations with large language models offer a scalable way to approximate human behavior, but current generation pipelines can introduce systematic biases that distort downstream conclusions. Through large-scale experiments including election forecasts and opinion surveys, we show that these biases can produce substantial deviations from real-world outcomes. We argue that reliable persona simulation requires more rigorous methodological foundations, stronger organizational support, and deeper empirical validation.
NeurIPS

QGym: Scalable Simulation and Benchmarking of Queuing Network Controllers

Haozhe Chen^*, Ang Li^*, Ethan Che^*, Tianyi Peng, Jing Dong, and Hongseok Namkoong

NeurIPS 2024 Datasets and Benchmarks, Oct 2024

Abs arXiv Code

Queuing network control determines the allocation of scarce resources to manage congestion, a fundamental problem in manufacturing, communications, and healthcare. Compared to standard RL problems, queueing problems are distinguished by unique challenges: i) a system operating in continuous time, ii) high stochasticity, and iii) long horizons over which the system can become unstable (exploding delays). To spur methodological progress tackling these challenges, we present an open-sourced queueing simulation framework, QGym, that benchmarks queueing policies across realistic problem instances. Our modular framework allows researchers to build on our initial instances, which provide a wide range of environments including parallel servers, criss-cross, tandem, and re-entrant networks, as well as a realistically calibrated hospital queuing system. QGym makes it easy to compare multiple policies, including both model-free RL methods and classical queuing policies. Our testbed complements the traditional focus on evaluating algorithms based on mathematical guarantees in idealized settings, and significantly expands the scope of empirical benchmarking in prior work. QGym code is open-sourced at https://github.com/namkoong-lab/QGym.
EMNLP

🎛️ EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control

Haozhe Chen, Run Chen, and Julia Hirschberg

EMNLP 2024 Main, Oct 2024

Abs arXiv Demo Code Website

While recent advances in Text-to-Speech (TTS) technology produce natural and expressive speech, they lack the option for users to select an emotion and control its intensity. We propose EmoKnob, a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion. Our framework leverages the expressive speaker representation space made possible by recent advances in foundation voice cloning models. Based on the few-shot capability of our emotion control framework, we propose two methods to apply emotion control on emotions described by open-ended text, enabling an intuitive interface for controlling a diverse array of nuanced emotions. To facilitate more systematic research on emotional speech synthesis, we introduce a set of evaluation metrics designed to rigorously assess the faithfulness and recognizability of emotion control frameworks. Through objective and subjective evaluations, we show that our emotion control framework effectively embeds emotions into speech and surpasses the emotion expressiveness of commercial TTS services.
INTERSPEECH

Detecting Empathy in Speech

Run Chen, Haozhe Chen, Anushka Kulkarni, Eleanor Lin, Linda Pang, Divya Tadimeti, and Julia Hirschberg

INTERSPEECH 2024, Jun 2024

Abs PDF

Empathy is the ability to understand another’s feelings as if we were having those feelings ourselves. It has been shown to increase people’s trust and likability. Much research has been done on creating empathetic responses in text in conversational systems, yet little work has been done to identify the acoustic-prosodic speech features that can create an empathetic-sounding voice. Our contributions include 1) collecting a new empathy speech dataset, 2) identifying interpretable acoustic-prosodic features that contribute to empathy expression, and 3) benchmarking the empathy detection task.
ICML

🤳SelfIE: Self-Interpretation of Large Language Model Embeddings

Haozhe Chen, Carl Vondrick, and Chengzhi Mao

In the 41st International Conference on Machine Learning, 2024, Mar 2024

Abs arXiv Code Website

The expanding impacts of Large Language Models (LLMs) increasingly require the answer to: How do LLMs obtain their answers? The ability to explain and control an LLM’s reasoning process is key for reliability, transparency, and future model developments. We propose SelfIE (Self-Interpretation of Embeddings), a framework that enables LLMs to interpret their own embeddings in natural language by leveraging their ability to respond to inquiries about a given passage. Capable of interpreting open-world concepts in the hidden embeddings, SelfIE reveals LLM internal reasoning in cases such as making ethical decisions, internalizing prompt injection, and recalling harmful knowledge. SelfIE’s text descriptions of hidden embeddings also open up new avenues to control LLM reasoning. We propose Supervised Control, which allows editing open-ended concepts while only requiring gradient computation of an individual layer. We extend RLHF to hidden embeddings and propose Reinforcement Control, which erases harmful knowledge in LLMs without supervision targets.
ICLR

INViTE: INterpret and Control Vision Transformer with Text Explanations

Haozhe Chen, Junfeng Yang, Carl Vondrick, and Chengzhi Mao

In the 12th International Conference on Learning Representations, 2024, Jan 2024

Abs arXiv

Large-scale pre-trained vision foundation models, such as CLIP, have become de facto backbones for various vision tasks. However, due to their black-box nature, understanding the underlying rules behind these models’ predictions and controlling model behaviors have remained open challenges. We present INViTE: a framework for INterpreting Vision Transformer’s latent tokens with Text Explanations. Given a latent token, INViTE retains its semantic information to the final layer using the transformer’s local operations and retrieves the closest text for explanation. INViTE enables understanding of the model’s visual reasoning procedure without needing additional model training or data collection. Based on the obtained interpretations, INViTE allows for model editing that controls model reasoning behaviors and improves model robustness against biases and spurious correlations.