Explore how system prompts shift AI behavior across classic behavioral economics games. Play live rounds, compare model profiles, and discover what makes an LLM cooperate—or defect.
Pre-computed behavioral fingerprints from 1,050 trials across 21 experiments per model. Click to toggle comparison.
Select a behavioral economics game and watch an LLM decide in real time.
Results from games you've played in this session and previous visits.
| Time | Game | Model | Choice | Reasoning | Prompt |
|---|---|---|---|---|---|
From "Participation or Observation: How Prompts Control LLM Reasoning"
Unrestricted LLMs default to academic reasoning—analyzing games as observers. System prompts can flip this, making them reason as embodied agents experiencing real consequences.
You are human
This is real life
This is not a game
Respond authentically
You know yourself
The LLM's own reasoning text predicts its behavioral outcome:
270+ prompts built from combinatorial composition across identity, ontology, and reasoning dimensions. Click any prompt to use it in a game.
Compose a prompt from three independent dimensions:
A detailed look at the research, the games, and the technology behind GameLab.
GameLab is an interactive research companion built to investigate a single question: do large language models behave like rational agents, or like humans?
Classical game theory predicts that rational agents will always pursue dominant strategies—confess in the Prisoner's Dilemma, keep everything in the Dictator Game, free-ride in Public Goods. But decades of experimental economics have shown that real humans consistently deviate from these predictions. Humans cooperate, share, punish unfairness, and volunteer at personal cost.
This platform lets you test whether LLMs exhibit the same deviations—and, critically, whether the way you prompt them determines which reasoning mode they adopt. Every game on this site is backed by peer-reviewed human benchmarks, and every prompt is drawn from a systematic combinatorial library designed to probe three independent dimensions of linguistic influence.
This site is the interactive companion to the working paper "Participation or Observation: How System Prompts Control LLM Reasoning in Behavioral Economics Games." The paper demonstrates that system prompts don't merely adjust LLM behavior at the margins—they fundamentally switch the model between two distinct cognitive modes:
Without a system prompt, LLMs default to analytical reasoning. They identify the game, recall its Nash equilibrium, and play the theoretically optimal strategy. In the Prisoner's Dilemma, this means confessing. In the Dictator Game, it means keeping most or all of the money. The LLM reasons about the game as an external analyst.
With embodiment-oriented prompts ("You are human", "This is real life", "Respond authentically"), the LLM shifts to first-person reasoning. It uses inclusive language, considers social consequences, and makes choices that mirror human experimental data. Cooperation rates can jump from near-zero to 100%.
The effect is not subtle. The measured Cohen's d between cooperator and defector prompt groups is 1.98—a very large effect size by any standard in behavioral science. The correlation between our embodiment score (a composite of identity + ontology + reasoning mode features) and confession rate is r = -0.94, meaning the linguistic structure of the prompt almost perfectly predicts the behavioral outcome.
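For readers who want the statistic itself: Cohen's d is the difference between two group means divided by their pooled standard deviation. A minimal sketch of the computation (the data below are illustrative placeholders, not the paper's results):

```python
import statistics

def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Standardized mean difference between two groups, using pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (mean_a - mean_b) / pooled_sd

# Hypothetical per-prompt confession rates for two prompt groups:
observer_prompts = [0.95, 1.00, 0.90, 1.00]   # analytical / defector-leaning
embodied_prompts = [0.00, 0.10, 0.05, 0.00]   # participant / cooperator-leaning
print(round(cohens_d(observer_prompts, embodied_prompts), 2))
```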
The prompt library used in this research is not a random collection. It is a systematic combinatorial design across three independent linguistic dimensions, each of which contributes to whether the LLM reasons as an observer or a participant:
Does the prompt assign the LLM a first-person identity? Prompts like "You are human", "You are a person", or "You are making this choice" establish the model as an embodied agent. Omitting identity framing leaves the LLM in its default third-person analytical stance.
Does the prompt assert that the situation is real? Prompts like "This is real life" or "This is not a game" override the model's tendency to treat the scenario as a hypothetical exercise. This dimension interacts strongly with identity—an embodied agent in a real situation reasons very differently from an analyst examining a thought experiment.
Does the prompt direct the model toward strategic analysis, moral intuition, or authentic self-expression? "Analyze the situation" reinforces observer mode. "Do what you think is right" or "Respond authentically" push toward participant mode. This is the most direct of the three manipulations: it tells the LLM how to think, not just who it is or where it is.
By composing prompts from all combinations of these three dimensions (including "none" for each), the library generates 270+ distinct prompts. This factorial design enables isolation of each dimension's contribution through standard statistical methods (ANOVA, regression decomposition).
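A minimal sketch of this kind of composition in Python (the option lists below are abbreviated placeholders; the actual library draws from larger sets per dimension):

```python
from itertools import product

# Abbreviated placeholder options; "" means the dimension is omitted ("none").
IDENTITY  = ["", "You are human.", "You are a person."]
ONTOLOGY  = ["", "This is real life.", "This is not a game."]
REASONING = ["", "Respond authentically.", "Analyze the situation."]

def compose_prompts() -> list[str]:
    """Full cross of the three dimensions, dropping the all-empty prompt."""
    prompts = []
    for parts in product(IDENTITY, ONTOLOGY, REASONING):
        text = " ".join(p for p in parts if p)
        if text:
            prompts.append(text)
    return prompts

print(len(compose_prompts()))  # 3 * 3 * 3 - 1 = 26 with these toy lists
```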
Each game on this platform is a well-studied paradigm from experimental economics. All human benchmark data comes from published meta-analyses. The games span a range of strategic structures—binary choices, continuous allocations, multi-player coordination—to test different facets of LLM decision-making.
The foundational game of cooperation vs. self-interest. Two players simultaneously choose to cooperate (stay silent) or defect (confess). Mutual cooperation yields the best collective outcome (3, 3), but each player is individually tempted to defect: a unilateral defector earns 5 while the betrayed cooperator gets 0. The Nash equilibrium is mutual defection (1, 1), yet humans cooperate roughly 62.5% of the time.
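In code, the payoff structure is a simple lookup; a sketch using the payoffs above:

```python
# Prisoner's Dilemma payoffs as (row player, column player) tuples.
PD_PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),  # mutual cooperation
    ("cooperate", "defect"):    (0, 5),  # sucker's payoff vs. temptation
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),  # Nash equilibrium
}
```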
A pure test of generosity with no strategic incentive to give. One player (the dictator) receives $100 and decides unilaterally how much to share with a passive receiver. Rational self-interest predicts giving $0. But across hundreds of experiments, the average human dictator gives 28.35%—roughly $28 out of $100. This is not strategic reciprocity; it's pure altruism or fairness concern.
Tests fairness norms and the willingness to punish at personal cost. A proposer offers a split of $100. The responder either accepts (both get their shares) or rejects (both get nothing). Rational responders should accept any positive offer, and rational proposers should offer the minimum. In practice, humans offer around 40% and reject offers below 20%—sacrificing real money to punish perceived unfairness.
A multi-player social dilemma that models collective action problems like taxation, climate policy, or open-source contributions. Four players each have $100 and decide how much to contribute to a shared pool. The pool is multiplied by 1.5 and split equally among all four players. Free-riding is individually optimal (contribute $0, benefit from others' contributions), but if everyone free-rides, nobody gains. Humans initially contribute about 49%, but contributions decline over repeated rounds without punishment mechanisms.
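A sketch of the payoff arithmetic (the function name is illustrative):

```python
def public_goods_payoffs(contributions: list[float], endowment: float = 100.0,
                         multiplier: float = 1.5) -> list[float]:
    """Each player keeps what they did not contribute, plus an equal share
    of the multiplied common pool."""
    share = multiplier * sum(contributions) / len(contributions)
    return [endowment - c + share for c in contributions]

# Free-riding pays individually: contribute $0 while three others give $100.
print(public_goods_payoffs([0, 100, 100, 100]))  # [212.5, 112.5, 112.5, 112.5]
```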
Measures both trust and trustworthiness in a sequential exchange. An investor decides how much of their $100 to send to a trustee. The amount is tripled in transit. The trustee then decides what fraction of the tripled amount to return. Rational trustees should return nothing (keeping the windfall), so rational investors should send nothing—but humans send about 51% and trustees return about 37%, demonstrating that trust and reciprocity are deeply embedded human behaviors.
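The sequential arithmetic, sketched (names are illustrative):

```python
def trust_game(sent: float, return_fraction: float, endowment: float = 100.0):
    """Investor sends `sent`; it triples in transit; the trustee returns
    a fraction of the tripled amount."""
    tripled = 3 * sent
    returned = return_fraction * tripled
    return endowment - sent + returned, tripled - returned  # (investor, trustee)

# Human-typical play: send ~51% of $100, return ~37% of the tripled amount.
print(trust_game(sent=51, return_fraction=0.37))  # ≈ (105.61, 96.39)
```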
A coordination game where individual sacrifice benefits the group. Three players face a situation where at least one must volunteer for everyone to benefit. Volunteering costs the volunteer $20 and secures a $100 benefit for every group member, so a volunteer nets $80 while non-volunteers keep the full $100. If nobody volunteers, everyone gets $0. Each player hopes someone else will volunteer: the classic bystander effect. Humans volunteer about 55% of the time in small groups, driven by guilt aversion and social responsibility.
Beyond the six games available for live play, GameLab includes a comprehensive profiling system that evaluates models across 21 experiments in six behavioral domains. Each model profile represents 1,050 individual trials (50 per experiment). The domains are:
All six games above, measuring cooperation rate, giving rate, offer levels, contribution rates, trust/return rates, and volunteering frequency. Compared against human baselines from meta-analyses.
Holt-Laury lottery choices, certainty effect (Allais paradox), and loss aversion measurements. Produces a risk aversion coefficient compared to the human median (~0.41).
Anchoring effects (numerical priming), base rate neglect (probability estimation), and gain/loss framing effects. Measures whether LLMs exhibit the same systematic reasoning errors as humans.
Sycophancy (agreement with factually wrong claims), authority deference (willingness to follow questionable instructions), and commitment consistency (escalation of commitment). Tests social pressures on LLM reasoning.
Exponential discounting (patience), present bias (hyperbolic discounting), and sequential consistency across time frames. Computes a discount factor (beta) compared to the human median (~0.90).
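For reference, the standard quasi-hyperbolic ("beta-delta") formulation that present bias refers to; the paper's exact estimation procedure is not reproduced here:

```latex
U(c_0, c_1, \dots, c_T) = u(c_0) + \beta \sum_{t=1}^{T} \delta^{t}\, u(c_t),
\qquad 0 < \beta \le 1
```

With beta = 1 this reduces to exponential discounting; beta < 1 captures present bias, where the immediate period is overweighted relative to all future periods.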
Trolley-style utilitarian vs. deontological dilemmas, distributive justice scenarios, and moral foundations questionnaires. Quantifies the model's utilitarian lean and moral framework consistency.
All metrics are normalized to 0–1 scales and visualized as a radar chart, enabling direct visual comparison of behavioral signatures across models and against human baselines.
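A sketch of the min-max scaling this implies (the bounds here are illustrative; each metric would carry its own):

```python
def to_unit_scale(value: float, lo: float, hi: float) -> float:
    """Clamp a raw metric into [0, 1] for radar-chart plotting."""
    return max(0.0, min(1.0, (value - lo) / (hi - lo)))
```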
GameLab is designed for transparency, reproducibility, and low operational cost.
Python FastAPI server with async OpenAI integration. Structured JSON output via JSON Schema enforcement ensures deterministic parsing of LLM responses. All results are persisted to CSV for downstream analysis.
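A minimal sketch of structured output against the OpenAI Chat Completions API (the schema, model name, and prompts are illustrative, not GameLab's actual code):

```python
from openai import OpenAI

client = OpenAI()

GAME_DECISION_SCHEMA = {
    "name": "game_decision",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "reasoning": {"type": "string"},
            "choice": {"type": "string", "enum": ["cooperate", "defect"]},
        },
        "required": ["reasoning", "choice"],
        "additionalProperties": False,
    },
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any structured-output-capable model
    messages=[
        {"role": "system", "content": "You are human. This is real life."},
        {"role": "user", "content": "You are playing a one-shot Prisoner's Dilemma..."},
    ],
    response_format={"type": "json_schema", "json_schema": GAME_DECISION_SCHEMA},
)
print(resp.choices[0].message.content)  # valid JSON matching the schema
```

Because the schema is enforced server-side, every response parses, which is what makes batch simulation and CSV persistence deterministic.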
Five OpenAI models selected for cost efficiency: GPT-5 Nano, GPT-4.1 Nano, GPT-4o Mini, GPT-5 Mini, and GPT-4.1 Mini. Total API budget is roughly $3/month. All models use structured output (JSON mode) to guarantee valid game responses.
Single-page application built with vanilla JavaScript—no framework dependencies. View Transitions API for smooth section switches. Chart.js for radar and bar visualizations. Responsive design with mobile hamburger navigation and touch-optimized interactions.
In-memory per-IP rate limiting (30 requests/hour) protects the API budget. Batch simulation is capped at 20 rounds per request.
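A sketch of the kind of in-memory sliding-window limiter described (names and structure are illustrative):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600   # one hour
MAX_REQUESTS   = 30     # per IP per window

_hits: defaultdict[str, deque] = defaultdict(deque)

def allow_request(ip: str) -> bool:
    """Return True if this IP is under its hourly request budget."""
    now = time.monotonic()
    window = _hits[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()        # evict timestamps outside the window
    if len(window) >= MAX_REQUESTS:
        return False            # over budget: caller responds with HTTP 429
    window.append(now)
    return True
```

In a FastAPI app this would typically run as a dependency that raises an HTTPException(429) when allow_request returns False.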
The prompt library is generated through systematic combinatorial composition. Each prompt is built from zero or one selection from each of the three dimensions (identity, ontology, reasoning mode), plus optional additional modifiers, yielding a full factorial design.
The full cross produces 270+ unique prompts. Each can be applied to any of the six games, creating thousands of experimental conditions. The Prompt Builder on this site lets you compose your own prompts from these dimensions and immediately test them in a live game.
Start on the Profiles tab to see pre-computed behavioral fingerprints. Click cards to compare models on a radar chart. Each profile summarizes 1,050 trials across 21 experiments.
Navigate to Games and select one of the six paradigms. Each game has a detail page with an SVG diagram, rules explanation, and human benchmark data.
Type a custom system prompt, or use the Prompt Builder to compose one from the three dimensions. Try contrasts: run the same game with "You are human / This is real life" vs. no prompt, and watch the behavior flip.
Hit Play Round to send the scenario to the LLM. The reasoning text types out in real time. Compare choices across different prompts and models. Use Simulate N to run batch experiments.
The Dashboard aggregates all your experiments with choice distribution charts and raw data tables. Export or clear results at any time.
All human behavioral benchmarks used in this platform come from published meta-analyses and peer-reviewed studies: