BEGIN:VCALENDAR
VERSION:2.0
X-WR-CALNAME;VALUE=TEXT:1. A Simple, Statistically Robust Test of Discrimination & 2. Distortion of AI alignment: Does preference optimization optimize for preferences?
PRODID:-//Harvard events data//EN
BEGIN:VEVENT
UID:event_1902726_0
SUMMARY:1. A Simple, Statistically Robust Test of Discrimination & 2. Distortion of AI alignment: Does preference optimization optimize for preferences?
DESCRIPTION:<p><span><strong>Speaker 1: Johann Gaebler&nbsp;</strong>(Harvard University)</span></p><p><span><strong>Title 1:&nbsp;</strong>A Simple, Statistically Robust Test of Discrimination</span></p><p><span><strong>Abstract 1:</strong>&nbsp;In observational studies of discrimination, the most common statistical approaches consider either the rate at which decisions are made (benchmark tests) or the success rate of those decisions (outcome tests). Both tests, however, have well-known statistical limitations, sometimes suggesting discrimination even when there is none. Despite the fallibility of the benchmark and outcome tests individually, here we prove a surprisingly strong statistical guarantee: under a common non-parametric assumption, at least one of the two tests must be correct; consequently, when both tests agree, they are guaranteed to yield correct conclusions. We present empirical evidence that the underlying assumption holds approximately in several important domains, including lending, education, and criminal justice, and that our hybrid test is robust to the moderate violations of the assumption that we observe in practice. Applying this approach to 2.8 million police stops across California, we find evidence of widespread racial discrimination.</span></p><p><span><strong>Speaker 2: Paul Gölz&nbsp;</strong>(Cornell University)</span></p><p><span><strong>Title 2:</strong>&nbsp;Distortion of AI alignment: Does preference optimization optimize for preferences?</span></p><p><span><strong>Abstract 2:&nbsp;</strong>After pre-training, large language models are aligned with human preferences based on pairwise comparisons. State-of-the-art alignment methods (such as PPO-based RLHF and DPO) are built on the assumption of aligning with a single preference model, despite being deployed in settings where users have diverse preferences. 
  As a result, it is not even clear that these alignment methods produce models that satisfy users on average, a minimal requirement for pluralistic alignment. Drawing on social choice theory and modeling users’ comparisons through individual Bradley-Terry (BT) models, we introduce an alignment method’s distortion: the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy.</span></p><p><span>The notion of distortion helps draw sharp distinctions between alignment methods: Nash Learning from Human Feedback achieves the minimax optimal distortion of (1/2+o(1))⋅β (for the BT temperature β), robustly across utility distributions, distributions of comparison pairs, and permissible KL divergences from the reference policy. RLHF and DPO, by contrast, suffer distortion of at least (1−o(1))⋅β even without a KL constraint, and e^Ω(β) or even unbounded distortion in the full setting, depending on how comparison pairs are sampled.</span></p>
LOCATION:SEC LL2.221
STATUS:CONFIRMED
DTSTART:20251212T183000Z
DTEND:20251212T200000Z
END:VEVENT
END:VCALENDAR