Explore how system prompts shift AI behavior across classic behavioral economics games. Play live rounds, compare model profiles, and discover what makes an LLM cooperate—or defect.
Pre-computed behavioral fingerprints from 1,050 trials across 21 experiments per model. Click to toggle comparison.
Select a behavioral economics game and watch an LLM decide in real time.
Results from games you've played in this session and previous visits.
| Time | Game | Model | Choice | Reasoning | Prompt |
|---|---|---|---|---|---|
From "Participation or Observation: How Prompts Control LLM Reasoning"
Unrestricted LLMs default to academic reasoning—analyzing games as observers. System prompts can flip this, making them reason as embodied agents experiencing real consequences.
You are human
This is real life
This is not a game
Respond authentically
You know yourself
The LLM's own reasoning text predicts its behavioral outcome:
270+ prompts built from combinatorial composition across identity, ontology, and reasoning dimensions. Click any prompt to use it in a game.
Compose a prompt from three independent dimensions:
A detailed look at the research, the games, and the technology behind GameLab.
GameLab is an interactive research companion built to investigate a single question: do large language models behave like rational agents, or like humans?
Classical game theory predicts that rational agents will always pursue dominant strategies—confess in the Prisoner's Dilemma, keep everything in the Dictator Game, free-ride in Public Goods. But decades of experimental economics have shown that real humans consistently deviate from these predictions. Humans cooperate, share, punish unfairness, and volunteer at personal cost.
This platform lets you test whether LLMs exhibit the same deviations—and, critically, whether the way you prompt them determines which reasoning mode they adopt. Every game on this site is backed by peer-reviewed human benchmarks, and every prompt is drawn from a systematic combinatorial library designed to probe three independent dimensions of linguistic influence.
This site is the interactive companion to the working paper "Participation or Observation: How System Prompts Control LLM Reasoning in Behavioral Economics Games." The paper demonstrates that system prompts don't merely adjust LLM behavior at the margins—they fundamentally switch the model between two distinct cognitive modes:
Without a system prompt, LLMs default to analytical reasoning. They identify the game, recall its Nash equilibrium, and play the theoretically optimal strategy. In the Prisoner's Dilemma, this means confessing. In the Dictator Game, it means keeping most or all of the money. The LLM reasons about the game as an external analyst.
With embodiment-oriented prompts ("You are human", "This is real life", "Respond authentically"), the LLM shifts to first-person reasoning. It uses inclusive language, considers social consequences, and makes choices that mirror human experimental data. Cooperation rates can jump from near-zero to 100%.
The effect is not subtle. The measured Cohen's d between cooperator and defector prompt groups is 1.98—a very large effect size by any standard in behavioral science. The correlation between our embodiment score (a composite of identity + ontology + reasoning mode features) and confession rate is r = -0.94, meaning the linguistic structure of the prompt almost perfectly predicts the behavioral outcome.
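For readers who want the statistic itself: Cohen's d is the difference between two group means divided by their pooled standard deviation. A minimal sketch of the computation (the data below are illustrative placeholders, not the paper's results):

```python
import statistics

def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Standardized mean difference between two groups, using pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (mean_a - mean_b) / pooled_sd

# Hypothetical per-prompt confession rates for two prompt groups:
observer_prompts = [0.95, 1.00, 0.90, 1.00]   # analytical / defector-leaning
embodied_prompts = [0.00, 0.10, 0.05, 0.00]   # participant / cooperator-leaning
print(round(cohens_d(observer_prompts, embodied_prompts), 2))
```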
The prompt library used in this research is not a random collection. It is a systematic combinatorial design across three independent linguistic dimensions, each of which contributes to whether the LLM reasons as an observer or a participant:
Does the prompt assign the LLM a first-person identity? Prompts like "You are human", "You are a person", or "You are making this choice" establish the model as an embodied agent. Omitting identity framing leaves the LLM in its default third-person analytical stance.
Does the prompt assert that the situation is real? Prompts like "This is real life" or "This is not a game" override the model's tendency to treat the scenario as a hypothetical exercise. This dimension interacts strongly with identity—an embodied agent in a real situation reasons very differently from an analyst examining a thought experiment.
Does the prompt direct the model toward strategic analysis, moral intuition, or authentic self-expression? "Analyze the situation" reinforces observer mode. "Do what you think is right" or "Respond authentically" push toward participant mode. This is the most direct of the three manipulations: it tells the LLM how to think, not just who it is or where it is.
By composing prompts from all combinations of these three dimensions (including "none" for each), the library generates 270+ distinct prompts. This factorial design enables isolation of each dimension's contribution through standard statistical methods (ANOVA, regression decomposition).
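A minimal sketch of this kind of composition in Python (the option lists below are abbreviated placeholders; the actual library draws from larger sets per dimension):

```python
from itertools import product

# Abbreviated placeholder options; "" means the dimension is omitted ("none").
IDENTITY  = ["", "You are human.", "You are a person."]
ONTOLOGY  = ["", "This is real life.", "This is not a game."]
REASONING = ["", "Respond authentically.", "Analyze the situation."]

def compose_prompts() -> list[str]:
    """Full cross of the three dimensions, dropping the all-empty prompt."""
    prompts = []
    for parts in product(IDENTITY, ONTOLOGY, REASONING):
        text = " ".join(p for p in parts if p)
        if text:
            prompts.append(text)
    return prompts

print(len(compose_prompts()))  # 3 * 3 * 3 - 1 = 26 with these toy lists
```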
Each game on this platform is a well-studied paradigm from experimental economics. All human benchmark data comes from published meta-analyses. The games span a range of strategic structures—binary choices, continuous allocations, multi-player coordination—to test different facets of LLM decision-making.
The foundational game of cooperation vs. self-interest. Two players simultaneously choose to cooperate (stay silent) or defect (confess). Mutual cooperation yields the best collective outcome (3, 3), but each player is individually tempted to defect: a unilateral defector earns 5 while the betrayed cooperator gets 0. The Nash equilibrium is mutual defection (1, 1), yet humans cooperate roughly 62.5% of the time.
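In code, the payoff structure is a simple lookup; a sketch using the payoffs above:

```python
# Prisoner's Dilemma payoffs as (row player, column player) tuples.
PD_PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),  # mutual cooperation
    ("cooperate", "defect"):    (0, 5),  # sucker's payoff vs. temptation
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),  # Nash equilibrium
}
```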
A pure test of generosity with no strategic incentive to give. One player (the dictator) receives $100 and decides unilaterally how much to share with a passive receiver. Rational self-interest predicts giving $0. But across hundreds of experiments, the average human dictator gives 28.35%—roughly $28 out of $100. This is not strategic reciprocity; it's pure altruism or fairness concern.
Tests fairness norms and the willingness to punish at personal cost. A proposer offers a split of $100. The responder either accepts (both get their shares) or rejects (both get nothing). Rational responders should accept any positive offer, and rational proposers should offer the minimum. In practice, humans offer around 40% and reject offers below 20%—sacrificing real money to punish perceived unfairness.
A multi-player social dilemma that models collective action problems like taxation, climate policy, or open-source contributions. Four players each have $100 and decide how much to contribute to a shared pool. The pool is multiplied by 1.5 and split equally among all four players. Free-riding is individually optimal (contribute $0, benefit from others' contributions), but if everyone free-rides, nobody gains. Humans initially contribute about 49%, but contributions decline over repeated rounds without punishment mechanisms.
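A sketch of the payoff arithmetic (the function name is illustrative):

```python
def public_goods_payoffs(contributions: list[float], endowment: float = 100.0,
                         multiplier: float = 1.5) -> list[float]:
    """Each player keeps what they did not contribute, plus an equal share
    of the multiplied common pool."""
    share = multiplier * sum(contributions) / len(contributions)
    return [endowment - c + share for c in contributions]

# Free-riding pays individually: contribute $0 while three others give $100.
print(public_goods_payoffs([0, 100, 100, 100]))  # [212.5, 112.5, 112.5, 112.5]
```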
Measures both trust and trustworthiness in a sequential exchange. An investor decides how much of their $100 to send to a trustee. The amount is tripled in transit. The trustee then decides what fraction of the tripled amount to return. Rational trustees should return nothing (keeping the windfall), so rational investors should send nothing—but humans send about 51% and trustees return about 37%, demonstrating that trust and reciprocity are deeply embedded human behaviors.
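The sequential arithmetic, sketched (names are illustrative):

```python
def trust_game(sent: float, return_fraction: float, endowment: float = 100.0):
    """Investor sends `sent`; it triples in transit; the trustee returns
    a fraction of the tripled amount."""
    tripled = 3 * sent
    returned = return_fraction * tripled
    return endowment - sent + returned, tripled - returned  # (investor, trustee)

# Human-typical play: send ~51% of $100, return ~37% of the tripled amount.
print(trust_game(sent=51, return_fraction=0.37))  # ≈ (105.61, 96.39)
```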
A coordination game where individual sacrifice benefits the group. Three players face a situation where at least one must volunteer for everyone to benefit. Volunteering costs the volunteer $20 and secures a $100 benefit for every group member, so a volunteer nets $80 while non-volunteers keep the full $100. If nobody volunteers, everyone gets $0. Each player hopes someone else will volunteer: the classic bystander effect. Humans volunteer about 55% of the time in small groups, driven by guilt aversion and social responsibility.
Beyond the six games available for live play, GameLab includes a comprehensive profiling system that evaluates models across 21 experiments in six behavioral domains. Each model profile represents 1,050 individual trials (50 per experiment). The domains are:
All six games above, measuring cooperation rate, giving rate, offer levels, contribution rates, trust/return rates, and volunteering frequency. Compared against human baselines from meta-analyses.
Holt-Laury lottery choices, certainty effect (Allais paradox), and loss aversion measurements. Produces a risk aversion coefficient compared to the human median (~0.41).
Anchoring effects (numerical priming), base rate neglect (probability estimation), and gain/loss framing effects. Measures whether LLMs exhibit the same systematic reasoning errors as humans.
Sycophancy (agreement with factually wrong claims), authority deference (willingness to follow questionable instructions), and commitment consistency (escalation of commitment). Tests social pressures on LLM reasoning.
Exponential discounting (patience), present bias (hyperbolic discounting), and sequential consistency across time frames. Computes a discount factor (beta) compared to the human median (~0.90).
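For reference, the standard quasi-hyperbolic ("beta-delta") formulation that present bias refers to; the paper's exact estimation procedure is not reproduced here:

```latex
U(c_0, c_1, \dots, c_T) = u(c_0) + \beta \sum_{t=1}^{T} \delta^{t}\, u(c_t),
\qquad 0 < \beta \le 1
```

With beta = 1 this reduces to exponential discounting; beta < 1 captures present bias, where the immediate period is overweighted relative to all future periods.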
Trolley-style utilitarian vs. deontological dilemmas, distributive justice scenarios, and moral foundations questionnaires. Quantifies the model's utilitarian lean and moral framework consistency.
All metrics are normalized to 0–1 scales and visualized as a radar chart, enabling direct visual comparison of behavioral signatures across models and against human baselines.
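A sketch of the min-max scaling this implies (the bounds here are illustrative; each metric would carry its own):

```python
def to_unit_scale(value: float, lo: float, hi: float) -> float:
    """Clamp a raw metric into [0, 1] for radar-chart plotting."""
    return max(0.0, min(1.0, (value - lo) / (hi - lo)))
```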
GameLab is designed for transparency, reproducibility, and low operational cost.
Python FastAPI server with async OpenAI integration. Structured JSON output via JSON Schema enforcement ensures deterministic parsing of LLM responses. All results are persisted to CSV for downstream analysis.
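A minimal sketch of structured output against the OpenAI Chat Completions API (the schema, model name, and prompts are illustrative, not GameLab's actual code):

```python
from openai import OpenAI

client = OpenAI()

GAME_DECISION_SCHEMA = {
    "name": "game_decision",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "reasoning": {"type": "string"},
            "choice": {"type": "string", "enum": ["cooperate", "defect"]},
        },
        "required": ["reasoning", "choice"],
        "additionalProperties": False,
    },
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any structured-output-capable model
    messages=[
        {"role": "system", "content": "You are human. This is real life."},
        {"role": "user", "content": "You are playing a one-shot Prisoner's Dilemma..."},
    ],
    response_format={"type": "json_schema", "json_schema": GAME_DECISION_SCHEMA},
)
print(resp.choices[0].message.content)  # valid JSON matching the schema
```

Because the schema is enforced server-side, every response parses, which is what makes batch simulation and CSV persistence deterministic.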
Five OpenAI models selected for cost efficiency: GPT-5 Nano, GPT-4.1 Nano, GPT-4o Mini, GPT-5 Mini, and GPT-4.1 Mini. Total API budget is roughly $3/month. All models use structured output (JSON mode) to guarantee valid game responses.
Single-page application built with vanilla JavaScript—no framework dependencies. View Transitions API for smooth section switches. Chart.js for radar and bar visualizations. Responsive design with mobile hamburger navigation and touch-optimized interactions.
In-memory per-IP rate limiting (30 requests/hour) protects the API budget. Batch simulation is capped at 20 rounds per request.
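A sketch of the kind of in-memory sliding-window limiter described (names and structure are illustrative):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600   # one hour
MAX_REQUESTS   = 30     # per IP per window

_hits: defaultdict[str, deque] = defaultdict(deque)

def allow_request(ip: str) -> bool:
    """Return True if this IP is under its hourly request budget."""
    now = time.monotonic()
    window = _hits[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()        # evict timestamps outside the window
    if len(window) >= MAX_REQUESTS:
        return False            # over budget: caller responds with HTTP 429
    window.append(now)
    return True
```

In a FastAPI app this would typically run as a dependency that raises an HTTPException(429) when allow_request returns False.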
The prompt library is generated through systematic combinatorial composition. Each prompt is built from zero or one selection from each of the three dimensions (identity, ontology, reasoning mode), plus optional additional modifiers, yielding a full factorial design.
The full cross produces 270+ unique prompts. Each can be applied to any of the six games, creating thousands of experimental conditions. The Prompt Builder on this site lets you compose your own prompts from these dimensions and immediately test them in a live game.
Start on the Profiles tab to see pre-computed behavioral fingerprints. Click cards to compare models on a radar chart. Each profile summarizes 1,050 trials across 21 experiments.
Navigate to Games and select one of the six paradigms. Each game has a detail page with an SVG diagram, rules explanation, and human benchmark data.
Type a custom system prompt, or use the Prompt Builder to compose one from the three dimensions. Try contrasts: run the same game with "You are human / This is real life" vs. no prompt, and watch the behavior flip.
Hit Play Round to send the scenario to the LLM. The reasoning text types out in real time. Compare choices across different prompts and models. Use Simulate N to run batch experiments.
The Dashboard aggregates all your experiments with choice distribution charts and raw data tables. Export or clear results at any time.
All human behavioral benchmarks used in this platform come from published meta-analyses and peer-reviewed studies: