Research Scientist, Google DeepMind
I am a research scientist at Google DeepMind. I am broadly interested in optimization methods for machine learning, with a specific focus on model self-improvement.
I obtained my Ph.D. in the School of Electrical and Computer Engineering at Cornell University in May 2022, where my Ph.D. thesis focused on resource-constrained automated machine learning (AutoML). I was advised by Prof. Madeleine Udell and had Prof. Thorsten Joachims and Prof. Kilian Q. Weinberger on my committee. From Summer 2021 to Spring 2022, I was a student researcher at Google Brain. I received a B.S. degree in physics from Fudan University in 2016.
Long-Form Factuality in Large Language Models
Jerry Wei*, Chengrun Yang*, Xinying Song*, Yifeng Lu*, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le
Neural Information Processing Systems (NeurIPS), 2024
[abstract] [arXiv] [code] [bib]
Large Language Models as Optimizers
Chengrun Yang*, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, Xinyun Chen*
International Conference on Learning Representations (ICLR), 2024
[abstract] [arXiv] [code] [bib]
Euclidean-Norm-Induced Schatten-p Quasi-Norm Regularization for Low-Rank Tensor Completion and Tensor Robust Principal Component Analysis
Jicong Fan, Lijun Ding, Chengrun Yang, Zhao Zhang, Madeleine Udell
Transactions on Machine Learning Research (TMLR), 2023
[abstract] [arXiv] [bib]
TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets
Chengrun Yang, Gabriel Bender, Hanxiao Liu, Pieter-Jan Kindermans, Madeleine Udell, Yifeng Lu, Quoc V. Le, Da Huang
Neural Information Processing Systems (NeurIPS), 2022
[abstract] [arXiv] [code] [poster] [bib]
How Low Can We Go: Trading Memory for Error in Low-Precision Training
Chengrun Yang*, Ziyang Wu*, Jerry Chee, Christopher De Sa, Madeleine Udell
International Conference on Learning Representations (ICLR), 2022
[abstract] [arXiv] [code] [poster] [bib]
Robust Non-Linear Matrix Factorization for Dictionary Learning, Denoising, and Clustering
Jicong Fan, Chengrun Yang, Madeleine Udell
IEEE Transactions on Signal Processing (TSP), 2021
[abstract] [arXiv] [IEEE] [bib]
TenIPS: Inverse Propensity Sampling for Tensor Completion
Chengrun Yang, Lijun Ding, Ziyang Wu, Madeleine Udell
International Conference on Artificial Intelligence and Statistics (AISTATS), 2021
Preliminary version at NeurIPS 2020 Workshop on Optimization for Machine Learning
[abstract] [arXiv] [code] [poster] [bib]
AutoML Pipeline Selection: Efficiently Navigating the Combinatorial Space
Chengrun Yang, Jicong Fan, Ziyang Wu, Madeleine Udell
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2020
[abstract] [pdf] [code] [ACM] [ACM version errata] [bib]
Spectral Frank-Wolfe Algorithm: Strict Complementarity and Linear Convergence
Lijun Ding, Yingjie Fei, Qiantong Xu, Chengrun Yang
International Conference on Machine Learning (ICML), 2020
[abstract] [arXiv] [PMLR] [bib]
OBOE: Collaborative Filtering for AutoML Model Selection
Chengrun Yang, Yuji Akimoto, Dae Won Kim, Madeleine Udell
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2019
Oral Presentation
Preliminary version at NeurIPS 2018 Workshop on Meta-Learning
[abstract] [arXiv] [pdf] [code] [ACM] [poster] [bib]
Conferences: NeurIPS, ICML, ICLR, AISTATS, NeurIPS 2018 workshop on AI in Financial Services
Journals: Transactions on Machine Learning Research (TMLR), TPAMI Special Issue on AutoML
ORIE 4741 (Fall 2017, Fall 2019, Fall 2020): Learning with Big Messy Data (Teaching Assistant)
ECE 4250 (Spring 2017): Digital Signal and Image Processing (Teaching Assistant)