(1) Distribution over rewards for pluralistic AI alignment & (2) Data Reliability Scoring
Time & Location: Friday, November 21, 1:30pm - 2:30pm at SEC LL 2.221
Speaker 1: Itai Shapira (Harvard University)
Title 1: Distribution over rewards for pluralistic AI alignment
Abstract 1: Current alignment pipelines based on human feedback typically learn a single scalar reward that reflects a presumed universal notion of desirable behavior. Yet human preferences often diverge across users, contexts, and cultures, so disagreement in the feedback collapses into a majority signal, minority perspectives are discounted, and downstream policy optimization amplifies this preference collapse. This talk treats reward learning explicitly as preference aggregation and shows that broad classes of loss-based reward learning rules, including Bradley–Terry-style objectives, cannot satisfy basic social choice axioms once rewards are constrained to a parametric family, revealing that the resulting single reward is often a misspecified objective.
Motivated by this impossibility, we propose reflecting diverse human preferences through a distribution over multiple reward functions, each inducing a distinct aligned policy. The central criterion is pairwise calibration: for every pair of candidate responses, the fraction of reward functions preferring one response matches the fraction of annotators with that preference. Our results show that even a small, outlier-free ensemble can accurately represent diverse preference distributions while remaining practical to train and deploy.
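The pairwise calibration criterion can be illustrated with a small Python sketch. It is not the speaker's implementation: the function names, the toy rewards, the ensemble weights, and the annotator fractions are all hypothetical, and the check simply reports the largest gap between the weighted fraction of reward functions preferring one response and the fraction of annotators with that preference.

```python
from itertools import combinations


def ensemble_preference(rewards, weights, i, j):
    """Weighted fraction of reward functions that score response i above response j."""
    return sum(w for r, w in zip(rewards, weights) if r[i] > r[j])


def pairwise_calibration_error(rewards, weights, annotator_pref):
    """Largest gap, over all response pairs, between ensemble and annotator fractions."""
    n_responses = len(rewards[0])
    return max(
        abs(ensemble_preference(rewards, weights, i, j) - annotator_pref[(i, j)])
        for i, j in combinations(range(n_responses), 2)
    )


# Hypothetical toy data: annotator_pref[(i, j)] is the fraction of annotators
# who prefer response i over response j.
annotator_pref = {(0, 1): 0.7, (0, 2): 0.7, (1, 2): 0.3}

# An ensemble of two reward functions (one row per function, one column per response).
ensemble_rewards = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
ensemble_weights = [0.7, 0.3]
ensemble_error = pairwise_calibration_error(ensemble_rewards, ensemble_weights, annotator_pref)

# A single reward function that sides with the majority on every pair.
single_error = pairwise_calibration_error([[1.0, 0.0, 0.5]], [1.0], annotator_pref)

print(ensemble_error, single_error)
```

In this toy setup the two-member ensemble matches the annotator fractions on every pair, while the single majority-aligned reward is off by the size of the minority on each pair.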
Speaker 2: Shi Feng (Harvard University)
Title 2: Data Reliability Scoring
Abstract 2: How can we assess the reliability of a dataset without access to ground truth? We introduce the problem of reliability scoring for datasets collected from potentially strategic sources. The true data are unobserved, but we see outcomes of an unknown statistical experiment that depends on them. To benchmark reliability, we define ground-truth-based orderings that capture how much reported data deviate from the truth. We then propose the Gram determinant score, which measures the volume spanned by vectors describing the empirical distribution of the observed data and experiment outcomes. We show that this score preserves several ground-truth-based reliability orderings and, uniquely up to scaling, yields the same reliability ranking of datasets regardless of the experiment, a property we term experiment agnosticism. Experiments on synthetic noise models, CIFAR-10 embeddings, and real employment data demonstrate that the Gram determinant score effectively captures data quality across diverse observation processes.
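To make the idea of a Gram determinant concrete, here is a rough Python sketch. The abstract does not specify which vectors the score uses, so this assumes one vector per reported value, holding the empirical joint frequencies of that value with each experiment outcome. The helper names and the toy data are hypothetical, and the talk's actual definition may differ.

```python
def joint_frequency_rows(reports, outcomes, n_values, n_outcomes):
    """One row per reported value; entries are empirical joint frequencies with each outcome."""
    counts = [[0.0] * n_outcomes for _ in range(n_values)]
    for x, y in zip(reports, outcomes):
        counts[x][y] += 1.0
    n = len(reports)
    return [[c / n for c in row] for row in counts]


def gram_determinant(vectors):
    """Determinant of the Gram matrix G[a][b] = <v_a, v_b> (the squared spanned volume)."""
    k = len(vectors)
    g = [[sum(x * y for x, y in zip(vectors[a], vectors[b])) for b in range(k)]
         for a in range(k)]
    det = 1.0
    # Gaussian elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(g[r][c]))
        if abs(g[pivot][c]) < 1e-15:
            return 0.0  # linearly dependent vectors span zero volume
        if pivot != c:
            g[c], g[pivot] = g[pivot], g[c]
            det = -det
        det *= g[c][c]
        for r in range(c + 1, k):
            factor = g[r][c] / g[c][c]
            for j in range(c, k):
                g[r][j] -= factor * g[c][j]
    return det


# Hypothetical toy data with binary reports and binary experiment outcomes.
outcomes = [0, 0, 1, 1]
informative_reports = [0, 0, 1, 1]    # reports track the outcomes
uninformative_reports = [0, 1, 0, 1]  # reports are independent of the outcomes

informative_score = gram_determinant(joint_frequency_rows(informative_reports, outcomes, 2, 2))
uninformative_score = gram_determinant(joint_frequency_rows(uninformative_reports, outcomes, 2, 2))

print(informative_score, uninformative_score)
```

Under this assumed construction, reports that are independent of the outcomes give identical rows and therefore zero volume, while reports that track the outcomes give a positive determinant.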