AlphaForgeBench: Benchmarking End-to-End Trading Strategy Design with LLMs

We introduce AlphaForgeBench, accepted at KDD 2026, a benchmark that repositions LLMs from direct trading agents to quantitative researchers. Models are tasked with generating executable alpha factors and strategy code, which is then evaluated through standardized backtesting across seven assets and six frontier models.

The Problem: LLMs as Trading Agents Are Unstable

Prior work on LLM-based financial evaluation often asks models to emit trading actions directly. This approach suffers from a fundamental problem: behavioral instability during sequential decision-making. LLMs exhibit extreme run-to-run variance, inconsistent action sequences even under deterministic decoding, and irrational action reversals between adjacent decision steps. With this much noise, a score reflects sampling luck as much as financial reasoning, which makes fair benchmarking nearly impossible.

AlphaForgeBench solves this by separating financial reasoning from execution: LLMs generate strategy code and alpha factors rather than individual trades, and a deterministic backtesting engine handles execution.
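The separation described above can be illustrated with a minimal sketch. The `Strategy` interface, the `momentum_strategy` example, and the `backtest` function are all hypothetical names chosen for illustration; the benchmark's actual engine and interface are not described in this post. The point is only that once the model's output is code, executing it on fixed data is fully repeatable.

```python
from typing import Callable, List

# Hypothetical interface (an assumption, not the benchmark's API): a strategy
# maps the price history seen so far to a target position in [0, 1].
Strategy = Callable[[List[float]], float]


def momentum_strategy(prices: List[float]) -> float:
    """Example of the kind of code a model might emit: go long when the
    latest close is above the mean of the previous three closes."""
    if len(prices) < 4:
        return 0.0
    lookback = prices[-4:-1]
    return 1.0 if prices[-1] > sum(lookback) / len(lookback) else 0.0


def backtest(strategy: Strategy, prices: List[float]) -> List[float]:
    """Deterministic execution: the position chosen at bar t, using data up
    to and including t, earns the asset return from bar t to bar t+1."""
    returns = []
    for t in range(len(prices) - 1):
        position = strategy(prices[: t + 1])
        asset_return = prices[t + 1] / prices[t] - 1.0
        returns.append(position * asset_return)
    return returns


prices = [100.0, 101.0, 102.0, 104.0, 103.0, 105.0, 107.0]
run_a = backtest(momentum_strategy, prices)
run_b = backtest(momentum_strategy, prices)
print(run_a == run_b)  # same code and same data give identical results
```

In this framing, any remaining variance comes from which code the model writes, not from how individual trades are sampled step by step.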

Framework

[Figure: AlphaForgeBench framework]

The benchmark is built on a two-stage dataset construction pipeline:

Stage 1 — Real-world queries (633 samples): Natural-language financial queries paired with ground-truth alpha factors and trading strategies sourced from real-world quantitative research.

Stage 2 — Augmented queries (270 samples): LLM-generated queries following a 3×3 level-grade taxonomy that progresses from direct translation to open-ended strategy design under controlled difficulty. This systematic structure enables measuring how model capability scales with task complexity.
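As a rough illustration of the 3×3 structure, the sketch below enumerates the nine level-grade cells. The integer labels are placeholders: only Level 1 (direct translation) and Level 3 (open-ended design) are named in this post, and the meaning of the grades is not given. The per-cell count is simple arithmetic that holds only if the 270 samples are spread evenly, which is an assumption here.

```python
from itertools import product

LEVELS = (1, 2, 3)  # Level 1 = direct translation, Level 3 = open-ended design
GRADES = (1, 2, 3)  # placeholder labels; grade definitions are not given here
STAGE2_TOTAL = 270  # Stage 2 sample count stated above

# Every (level, grade) combination in the taxonomy.
cells = list(product(LEVELS, GRADES))

# Samples per cell IF the split is even (an assumption, not a reported fact).
per_cell, remainder = divmod(STAGE2_TOTAL, len(cells))

print(len(cells), per_cell, remainder)
```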

Evaluation Pipeline

| Step | Description |
| --- | --- |
| Query submission | Natural-language strategy request sent to the model |
| Code generation | Model produces executable alpha factor / strategy code |
| Backtesting | Standardized engine runs code across 7 assets and multiple market regimes |
| Metrics | SR, ARR, MDD, Calmar Ratio, Sortino Ratio, Volatility |
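For readers unfamiliar with the metrics, here is a minimal sketch of how they are commonly computed from a series of per-period returns, reading SR, ARR, and MDD as Sharpe ratio, annualized return, and maximum drawdown. The conventions used (zero risk-free rate, 252 periods per year, sample standard deviation, downside deviation measured against zero) are assumptions; the benchmark's exact definitions are not given in this post and may differ.

```python
import math
from typing import Dict, List


def performance_metrics(returns: List[float], periods_per_year: int = 252) -> Dict[str, float]:
    """Common backtest metrics from per-period simple returns.
    All conventions here are assumptions, not the benchmark's settings."""
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    vol = math.sqrt(variance) * math.sqrt(periods_per_year)

    # Build the equity curve and track the largest peak-to-trough loss.
    equity, peak, mdd = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        mdd = max(mdd, 1.0 - equity / peak)

    # Annualized return from the compounded total return.
    arr = equity ** (periods_per_year / n) - 1.0

    # Downside deviation: only negative returns contribute.
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in returns) / n)
    downside *= math.sqrt(periods_per_year)

    nan = float("nan")
    return {
        "SR": mean * periods_per_year / vol if vol > 0 else nan,
        "ARR": arr,
        "MDD": mdd,
        "Calmar": arr / mdd if mdd > 0 else nan,
        "Sortino": mean * periods_per_year / downside if downside > 0 else nan,
        "Volatility": vol,
    }


metrics = performance_metrics([0.01, -0.02, 0.015, 0.005, -0.01, 0.02])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

SR, ARR, Calmar, and Sortino are higher-is-better, while MDD and Volatility are lower-is-better, which matches the arrows used in the result tables.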

The benchmark covers 7 assets (cryptocurrencies and US equities), 6 frontier LLMs, and 35,190 generated implementations in total, which to our knowledge makes it the most comprehensive LLM trading evaluation to date.

Results

Stage 1 — Real-World Queries

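| Model | SR ↑ | ARR ↑ | MDD ↓ | Calmar ↑ |
| --- | --- | --- | --- | --- |
| gemini-3-pro-preview | 0.449 | 0.171 | 0.174 | 1.411 |
| gemini-3-flash-preview | 0.388 | 0.142 | 0.138 | 1.504 |
| claude-sonnet-4.5 | 0.378 | 0.138 | 0.138 | 1.456 |
| grok-4.1-fast | 0.366 | 0.135 | 0.142 | 1.396 |
| gpt-5.2 | 0.342 | 0.123 | 0.122 | 1.534 |
| deepseek-v3.2 | 0.329 | 0.116 | 0.114 | 1.575 |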

Stage 2 — LLM-Augmented Queries (τ = 0.7)

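| Model | SR ↑ | ARR ↑ | MDD ↓ | Calmar ↑ |
| --- | --- | --- | --- | --- |
| gemini-3-pro-preview | 0.627 | 0.209 | 0.188 | 1.639 |
| gemini-3-flash-preview | 0.530 | 0.165 | 0.151 | 1.658 |
| claude-sonnet-4.5 | 0.508 | 0.162 | 0.147 | 1.634 |
| grok-4.1-fast | 0.429 | 0.135 | 0.127 | 1.692 |
| deepseek-v3.2 | 0.424 | 0.130 | 0.123 | 1.570 |
| gpt-5.2 | 0.417 | 0.130 | 0.117 | 1.660 |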
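As a quick sanity check on the two result tables, the sketch below compares the SR column across stages: it computes the gap between the best and worst model and lists the models whose rank changes. The SR values are copied from the tables; the helper names `ranking` and `spread` are illustrative only. This is a stage-to-stage comparison, not the per-level analysis reported in the Key Findings.

```python
from typing import Dict, List

# Sharpe ratios copied from the Stage 1 and Stage 2 result tables.
stage1_sr: Dict[str, float] = {
    "gemini-3-pro-preview": 0.449,
    "gemini-3-flash-preview": 0.388,
    "claude-sonnet-4.5": 0.378,
    "grok-4.1-fast": 0.366,
    "gpt-5.2": 0.342,
    "deepseek-v3.2": 0.329,
}
stage2_sr: Dict[str, float] = {
    "gemini-3-pro-preview": 0.627,
    "gemini-3-flash-preview": 0.530,
    "claude-sonnet-4.5": 0.508,
    "grok-4.1-fast": 0.429,
    "deepseek-v3.2": 0.424,
    "gpt-5.2": 0.417,
}


def ranking(scores: Dict[str, float]) -> List[str]:
    """Model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def spread(scores: Dict[str, float]) -> float:
    """Gap between the best and the worst model."""
    return max(scores.values()) - min(scores.values())


rank1, rank2 = ranking(stage1_sr), ranking(stage2_sr)
moved = [m for m in rank1 if rank1.index(m) != rank2.index(m)]

print(round(spread(stage1_sr), 3), round(spread(stage2_sr), 3))  # 0.12 0.21
print(moved)  # ['gpt-5.2', 'deepseek-v3.2']
```

On SR, the top four positions are identical in both stages; only gpt-5.2 and deepseek-v3.2 swap places, while the best-to-worst gap grows from 0.120 to 0.210.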

[Figure: Multi-metric radar profiles]

[Figure: Aligned cumulative return curves]

Key Findings

Temperature invariance: Results at τ = 0 and τ = 0.7 are near-identical across models, validating that the code-generation framing eliminates the run-to-run variance that plagues direct action-generation approaches.

Systematic difficulty scaling: Inter-model performance spread widens monotonically from Level 1 (direct translation) through Level 3 (open-ended design), confirming the benchmark effectively stress-tests different capability levels.

Cross-level ranking reversals: Models strong at translating existing strategies don’t necessarily excel at open-ended strategy design — suggesting the two skills are distinct and worth evaluating separately.

Three persistent risk profiles: Aggressive (high return, high drawdown), balanced-stable, and conservative-rigid archetypes emerge across models and persist across evaluation conditions.


© 2026. Wentao Zhang. All rights reserved.
