Generation Next: Experimentation with AI


Gary Charness, Brian Jabarian, and John List


Sept. 18, 2023 | NBER WP #31679


Download PDF


Abstract

We investigate the potential for Large Language Models (LLMs) to enhance scientific practice within experimentation by identifying key areas, directions, and implications. First, we discuss how these models can improve experimental design, including refining elicitation wording, coding experiments, and producing documentation. Second, we delve into the use of LLMs in experiment implementation, with an emphasis on bolstering causal inference through creating consistent experiences, improving instruction comprehension, and real-time monitoring of participant engagement. Third, we underscore the role of LLMs in analyzing experimental data, encompassing tasks like pre-processing, data cleaning, and assisting reviewers and replicators in examining studies. Each of these tasks improves the probability of reporting accurate findings. Lastly, we suggest a scientific governance framework that mitigates the potential risks of using LLMs in experimental research while amplifying their advantages. This could pave the way for open science opportunities and foster a culture of policy and industry experimentation at scale.

1. Introduction

Large Language Models (LLMs) represent a sophisticated application of machine-learning algorithms, showing a capacity to create original content and thus underscoring their status as generative Artificial Intelligence (AI) (Bubeck et al., 2023). Despite their relatively recent emergence, the full extent of the rapid effects of generative and transformative AI on science, policy, and society remains to be seen (Frank et al., 2019; Zhang et al., 2021; Bommasani et al., 2022; Brynjolfsson, 2017; Manning et al., 2022; Acemoglu & Johnson, 2023; Korinek, 2023).

This observation brings us to a pivotal question: How can we harness AI's full potential at a societal scale? At the forefront of this quest stands the Human-AI Interface (HAI), a paradigmatic shift in the interaction between human cognition and artificial intelligence. This interface symbolizes a melding of worlds, where human decision-making processes and AI algorithms unite in a synergistic exchange of insights and learning. The HAI represents not merely a technological advancement but a fundamental reimagining of how humans and AI can collaborate. This paper explores the interface's potential, envisioning a future where AI's latent capabilities are fully realized, transforming scientific inquiry and societal applications.

A natural venue in economics is the generation of data for causal inference in experimental settings, for example, online. While once an academic curiosity, online experiments have become a bona fide contributor to causal estimates in the social sciences (Athey, 2015; Brynjolfsson et al., 2019). With the burgeoning digital economy, researchers expect the generation of causal insights from online experiments to continue to increase (Fréchette et al., 2022).

However, one key feature of online experiments that tempers the optimism of even their most enthusiastic supporters is the violation of the four exclusion restrictions, which calls into question the internal validity of the received estimates. For example, compliance, one of the four identification assumptions that underlie the experimental approach (List, 2023), is often questioned in online experiments because such settings are typically associated with high measurement error (Gillen et al., 2019). Checking whether individual participants understand the experiment's instructions is often difficult, particularly online, where people usually cannot ask questions and receive live responses. While one remedy might involve incorporating real-time human support to address participant inquiries, it would require at least one of the following: 1) a sizable skilled labor force to accommodate simultaneous questions from many participants, or 2) extended availability to cover the protracted timelines of online experiments.

Although machine learning algorithms have improved causal inference methods in economics (Athey & Imbens, 2019), we expect these new models, LLMs, to radically improve critical areas of scientific knowledge production, in particular by overcoming these issues in online experiments. LLMs can be fine-tuned as chat assistants that simulate sophisticated human interactions while reducing labor costs. Given their inherent scalability and versatility, such integration could become standard practice for future online experiments, revolutionizing the field and fostering unprecedented advancements across various types of online experiments, including surveys, incentivized individual decisions, and game-theoretic experiments. In addition, this approach can be deployed with minimal or no coding knowledge and is compatible with many experimental online platforms familiar to researchers, such as Qualtrics, oTree, and z-Tree. By ensuring consistency of treatment within and across these settings, another of the exclusion restrictions, the stable unit treatment value assumption (SUTVA), is more likely to hold. Similarly, observability, a third exclusion restriction, is more likely to hold when the experimental burden on subjects is minimized by maintaining participant focus and engagement.
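As a concrete illustration of such a chat assistant, consider the minimal sketch below. It assumes the OpenAI Python client; the model name, the system prompt, and the restriction to clarifying (rather than hinting at) the instructions are our own illustrative choices, not a protocol from the paper.

```python
# Minimal sketch (our illustration, not the paper's protocol): an LLM
# "instruction assistant" that answers participant questions during an
# online experiment. Assumes the OpenAI Python client; the model name
# and system prompt are hypothetical choices.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an assistant for an online experiment. Answer participant "
    "questions only by restating or clarifying the instructions below. "
    "Never reveal the hypotheses or hint at a 'right' answer.\n\n"
    "Instructions:\n{instructions}"
)

def answer_participant(question: str, instructions: str) -> str:
    """Return a clarification that stays within the experiment's script."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        temperature=0,        # near-deterministic replies keep treatment consistent
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(instructions=instructions)},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```

Setting the temperature to zero is one way to keep replies consistent across participants, in the spirit of the SUTVA point above.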

While this example highlights one key area where generative AI can improve experimentation, other areas are open to similar enhancements. For example, specific fine-tuned language models could homogenize and carry out randomization and re-randomization techniques, lending more credibility to the fourth exclusion restriction, statistical independence. Furthermore, integrating them into the development and analysis of experimental research can address challenges researchers commonly face, such as optimizing the wording of tasks, improving comprehension (Ouyang et al., 2022), and streamlining data analysis, especially coding and data visualization (Wang et al., 2023). Using the capabilities of this technology, we can create more immersive online experiences, facilitate real-time monitoring of participant engagement, and improve the quality and replicability of experiments. In addition, its use can promote open science, fostering increased collaboration among researchers.
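To make the re-randomization point concrete, here is a minimal sketch of re-randomizing until covariates balance. The balance criterion (maximum absolute standardized mean difference) and the 0.1 threshold are illustrative assumptions on our part, not prescriptions from the paper.

```python
# Minimal sketch of re-randomization for covariate balance, in support of
# the statistical-independence restriction. The balance criterion and the
# 0.1 threshold are illustrative assumptions; n is assumed even and the
# covariates non-constant.
import numpy as np

def rerandomize(X: np.ndarray, threshold: float = 0.1,
                max_tries: int = 10_000, seed: int = 0) -> np.ndarray:
    """Draw 50/50 treatment assignments until covariates X (n x k) balance."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(max_tries):
        z = rng.permutation(np.repeat([0, 1], n // 2))  # balanced assignment
        diff = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
        smd = np.abs(diff) / X.std(axis=0)  # standardized mean differences
        if smd.max() < threshold:
            return z
    raise RuntimeError("No acceptable draw found; relax the threshold.")
```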

This paper often refers to LLMs and their capabilities. (We do so with 'foundation models' and 'fine-tuned models' in mind, under the umbrella of generative AI; Bommasani et al., 2022.) But this does not imply that users can simply input our suggested directions and some experiment details into ChatGPT and expect satisfactory results.

Generative AI, a rapidly transforming technology, is sensitive to inputs and can produce unpredictable outputs (Ganguli et al., 2022). As a result, working out which inputs lead to the most desired outputs, known as prompt engineering, is becoming a growing part of the industry. Furthermore, the stochastic nature of generative AI means that results can be improved by taking multiple draws for the same prompt and selecting the best result ex post (Davies et al., 2021), or by launching A/B tests and other types of experiments to determine which prompt is most effective.
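A minimal sketch of the "multiple draws, select ex post" idea follows, again assuming the OpenAI Python client; the model name is hypothetical, and the default scorer is a placeholder. In practice, the scorer could be a readability metric, a rubric graded by a second model, or the outcome of a pilot A/B test with participants.

```python
# Minimal sketch of "multiple draws, select ex post": sample several
# completions for one prompt and keep the highest-scoring draft. Assumes
# the OpenAI Python client; the model name is hypothetical, and the
# default scorer (length) is a stand-in for a real quality metric.
from openai import OpenAI

client = OpenAI()

def best_of_n(prompt: str, n: int = 5, score=len) -> str:
    """Draw n completions and return the one the scorer ranks highest."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        temperature=0.9,      # higher temperature yields more diverse drafts
        n=n,                  # n independent samples in a single call
        messages=[{"role": "user", "content": prompt}],
    )
    drafts = [choice.message.content for choice in response.choices]
    return max(drafts, key=score)
```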

In their best light, based on up-and-coming research and development at leading AI labs, we envision these language models as the wise sage always available at the experimentalist's beck and call. Within this framework, we explore their implementation more generally in Section 2, focusing on their role in comprehension and immersive experiences. Section 3 examines their capacities in data collection, including real-time monitoring, preprocessing, and cleaning, while Section 4 considers data analysis. The final section discusses the broader risks and benefits of the proliferation of generative AI in behavioral and experimental economics, along with implications for open science and for scaling a culture of experimentation in business and policy-making, and offers some speculative pointers on how to manage these risks.

Read the rest of the paper


Media Coverage

World Economic Forum, October 9, 2023

VoxEU Column, October 16, 2023