Continual Learning Through Introspection: Training Language Models to Generate Transferable Self-Knowledge via Reinforcement Learning
tags, with compositional generalisation emerging naturally from sampling over the learned distribution. We describe desirable properties of introspection outputs (cross-input transfer, compositionality) and report preliminary experiments on a 4B parameter model showing that per-question introspection advantage is achievable, that training only on winner-generated introspection tokens produces insufficient gradient signal, and that losers (models shown the correct answer after failure) generate more informative learning points than winners.
Introduction
Large language models deployed as autonomous agents encounter recurring failure modes: ambiguous instructions, poorly structured data, subtle domain conventions that conflict with training priors. In multi-turn interactions, users correct these failures through conversational exchanges—iterative feedback loops that eventually steer the model toward the desired behaviour. However, these corrections are ephemeral. The same model, presented with a structurally identical problem in a new context, will repeat the same error and require the same correction.
Prompting-based approaches to self-reflection (e.g., chain-of-thought, self-critique) ask the model to perform introspection in a single forward pass, which amounts to pattern-matching against what introspection looks like in training data rather than genuine reflection over experience [shinn2023reflexion]. Real introspection is a process over time: notice something, revisit with new context, gradually refine understanding.
We propose training language models to produce introspection outputs: explicit textual artefacts that, when injected into context, reconstruct the internal state that would otherwise require a full multi-turn exchange. The key distinction from prior work on self-summarisation [yang2025cursor] and self-distillation [hubotter2025sdpo] is that introspection produces a portable, reusable artefact with strict output invariance: the output produced from the input plus the introspection must match the output obtained through the full corrective exchange.
Problem Formulation
Definitions
An exchange is a multi-turn conversation where the model is corrected from an initial wrong output to the final correct output. An exchange is defined by the underlying lesson being corrected, not the surface form—two different conversations correcting the same type of mistake constitute the same exchange.
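An introspection is a generated text $I$ such that, for input $x$ and exchange $E$, the model $M$ produces the same output from the introspection as from the full exchange:
\[ M(x, I) = M(x, E) \]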
Desirable Properties
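Cross-input transfer. If introspection $I$ is generated from exchange $E$ on input $x$, it should also work on a different input $x'$ that involves the same underlying error type:
\[ M(x', I) = M(x', E) \]
where $M(\cdot)$ denotes the model’s output given its context.
Compositionality. Tasks can be compositions of subtasks. If input $x$ is composed of sub-problems $x_a$ and $x_b$, with corresponding exchanges $E_a, E_b$ and introspections $I_a, I_b$, then:
\[ M(x, I_a \oplus I_b) = M(x, E_a \oplus E_b) \]
where $\oplus$ denotes concatenation in context.
Generalisation. Ideally, introspections should generalise: a single introspection should help on many questions sharing the same underlying error type. This can be expressed as:
\[ M(x', I) = M(x', E) \quad \text{for every input } x' \text{ with the error type corrected by } E \]
Spontaneous recall. Given a new question $x'$, the model should autonomously generate a recall block containing the introspection most relevant to $x'$ (or a novel composition of previously learned introspections) before generating its answer. The recalled text enters the token context and directly conditions the model’s subsequent generation:
\[ y = M(x', \hat{I}), \quad \text{where } \hat{I} \text{ is the introspection recalled by the model from } x' \text{ alone} \]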
An introspection generated for a question already within the model’s base capability should confer no advantage and thus receive no reinforcement.
Method
Introspection as Internal State Reconstruction
Introspection is the ability for a model to generate outputs that recreate its internal states and replicate behaviours associated with those states, without going through the full interaction exchange. This perspective explains a known phenomenon: “don’t argue with the model, just start a new chat”—the model gets stuck in a low-energy internal state that additional prompting cannot escape. The exchange has put the model into a state basin, and no amount of further input within that exchange can shift it.
The model requires training on two fronts: generation (producing outputs that stimulate specific internal states) and usage (consuming those outputs and entering the corresponding states). This creates a handshake—the model learns both sides jointly, meaning introspection outputs are a private protocol between the model’s generation and consumption capabilities.
GRPO Training Procedure
We adapt Group Relative Policy Optimisation (GRPO) [shao2024grpo] for introspection training. Each training step draws a batch of randomly sampled questions, with a fixed number of rollouts per question, and proceeds as follows:
- Rollout 1 — Initial attempt. Generate rollouts across questions. Split into winners (correct answer) and losers (incorrect).
- Rollout 2 — Learning point generation. Winners are asked to reflect on what they got right and generate a learning point. Losers are given the correct answer and asked to identify what they were missing. Both produce introspection text within the same conversation.
- Rollout 3 — Verification. Each learning point is injected into a fresh rollout on the same question. The result indicates whether the learning point was sufficient to produce the correct answer.
- Compute introspection advantage: for each learning point, compute advantage relative to the group mean. Advantages are applied across all tokens in the trajectory, giving dense gradient signal.
- Policy gradient: backpropagate through the trajectory log-probabilities, weighted by introspection advantage.
This procedure addresses two problems identified in preliminary experiments (Section [sec:contrastive]): the contrastive signal problem (losers now contribute learning points that articulate what was missing, rather than being discarded) and the compute-to-signal problem (gradient flows through the full trajectory, not just introspection tokens).
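The three rollout stages and the group-relative advantage can be sketched as follows. This is a minimal illustration rather than the training code: the callables `attempt`, `reflect` and `verify` are hypothetical stand-ins for model inference, and treating the rollouts of one question as the GRPO group is an assumption.

```python
from statistics import mean


def introspection_step(questions, attempt, reflect, verify, k=4):
    """Run the three rollout stages for one step (no gradient update).

    Hypothetical callables standing in for model inference:
      attempt(q) -> (answer, is_correct)        # Rollout 1
      reflect(q, answer, is_correct) -> str     # Rollout 2; losers are shown the key
      verify(q, learning_point) -> float        # Rollout 3; fresh context, reward
    """
    records = []
    for q in questions:
        group = []
        for _ in range(k):
            answer, is_correct = attempt(q)
            learning_point = reflect(q, answer, is_correct)
            reward = verify(q, learning_point)
            group.append({
                "question": q,
                "winner": is_correct,
                "learning_point": learning_point,
                "reward": reward,
            })
        # Assumption: the group is the set of rollouts for the same question.
        baseline = mean(r["reward"] for r in group)
        for r in group:
            r["advantage"] = r["reward"] - baseline
        records.extend(group)
    return records
```

Winners and losers pass through the same path here; only the prompt used inside `reflect` differs, which is how losers contribute learning points instead of being discarded.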
The reward structure trains three capabilities simultaneously:
- When to introspect: rollouts on questions within base capability produce zero-advantage introspections, so the model learns not to introspect unnecessarily.
- What to introspect about: introspections capturing the critical missing knowledge get positive advantage; surface-level summaries get zero.
- How to format introspection: the model discovers what representation is most consumable by itself (the handshake).
Cost. Each training step requires three inference-only rollout passes over the batch, followed by one gradient update. The inference rollouts require no gradients and can be optimised with an inference framework. The final gradient step is structurally identical to SFT: compute loss over tokens, backpropagate.
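To make the comparison with SFT concrete, the sketch below computes an advantage-weighted token loss in plain Python. The function name and the choice to normalise by the total token count are assumptions made for illustration; in practice the log-probabilities would come from a differentiable forward pass.

```python
def trajectory_loss(token_logprobs, advantages):
    """Advantage-weighted negative log-likelihood over full trajectories.

    token_logprobs: one list of per-token log-probabilities per trajectory.
    advantages:     one scalar per trajectory, shared by all of its tokens.
    Assumption: the loss is averaged over all tokens in the batch.
    """
    total = 0.0
    n_tokens = 0
    for logps, adv in zip(token_logprobs, advantages):
        total += -adv * sum(logps)
        n_tokens += len(logps)
    return total / max(n_tokens, 1)
```

With every advantage set to 1 this reduces to the mean negative log-likelihood used in SFT, which is the sense in which the gradient step is structurally the same.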
SFT for Introspection Selection via Recall
The GRPO phase (Section 3.2) produces a library of validated introspections with ground truth effectiveness mappings: for each introspection $I$, we know which question types it helps and its measured advantage score. This gives us a dataset of (question, introspection, advantage) triples. The second training phase uses supervised fine-tuning to teach the model to select the right introspection for a given task.
Given a problem statement, the model should generate recall tags containing the introspection most relevant to the current task. The SFT training data is constructed directly from Phase 1: for each question $q$, the target is the introspection that achieved the highest introspection advantage on $q$.
Compositional generalisation through sampling. SFT trains the model to produce introspections token-by-token from a learned distribution. At inference time, the model samples from this distribution—it need not reproduce any introspection verbatim. Because the distribution is learned over many introspection examples spanning different error types, sampling can produce novel recombinations: introspections the model has never generated during training but that combine elements from multiple learned introspections.
For example, if the model has learned introspection $I_1$ (about threshold-based escalation) and $I_2$ (about recognising lab value patterns), sampling from the recall distribution on a question requiring both may produce a novel $I_3$ that composes them, without $I_3$ appearing in the training data. The SFT distribution is the mechanism through which composition emerges naturally, without requiring explicit compositional training.
Spontaneous recall. The model must also learn when recall is beneficial. We include both recall-beneficial examples (questions where injecting an introspection improves performance) and recall-unnecessary examples (straightforward questions within the model’s base capability) in the SFT data. The model learns to produce a recall block only when the question matches a known error pattern, analogous to how reasoning models learn when a <think> block is worth producing [deepseekr1].
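One way the SFT data described in this section could be assembled is sketched below. It assumes the Phase 1 library is a list of (question, introspection, advantage) triples; the tag strings, the dictionary keys and the rule that only positive-advantage introspections become targets are all illustrative assumptions.

```python
def build_recall_sft(library, easy_questions,
                     open_tag="<recall>", close_tag="</recall>"):
    """Build SFT examples for recall.

    library:        iterable of (question, introspection, advantage) triples.
    easy_questions: questions within base capability (recall-unnecessary).
    The tag strings are hypothetical placeholders.
    """
    best = {}
    for question, introspection, advantage in library:
        # Assumption: only introspections that actually helped become targets.
        if advantage <= 0:
            continue
        if question not in best or advantage > best[question][1]:
            best[question] = (introspection, advantage)

    examples = [
        {"prompt": q, "target_prefix": f"{open_tag}{intro}{close_tag}"}
        for q, (intro, _) in best.items()
    ]
    # Recall-unnecessary examples: the target begins directly with the answer.
    examples += [{"prompt": q, "target_prefix": ""} for q in easy_questions]
    return examples
```

Mixing the two kinds of example in one dataset is what lets the model learn when to emit a recall block as well as what to put inside it.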
Full Training Pipeline
The complete pipeline is cyclic:
- Deploy and collect: the model encounters tasks and produces exchanges (corrections).
- RL (GRPO): train the model to generate good introspections, producing a library of (question, introspection, advantage) triples.
- SFT: train the model to select and recall the right introspection via recall tags, using the library as supervision.
- Inference: the model sees a new question, spontaneously generates a recall block by sampling from the learned distribution (which may produce a novel composed introspection), and solves the task in a single shot.
- Test-time adaptation: integrate new introspections into weights (see below).
Test-Time Adaptation
Because of the handshake, the model may produce introspections at test time—including novel compositions from the recall distribution—that its weights have not yet learned to consume. The notes analogy is instructive: study notes are triggers, not the knowledge itself—someone else’s notes are often useless because the background knowledge differs.
We propose test-time GRPO as a “dreaming” mechanism: the model simulates alternate scenarios and runs GRPO on them to integrate new introspections into its weights. Given introspection $I$:
- Take $I$, generate rollouts
- Compute reward (similarity to desired behaviour)
- Vary the input, expect the same behavioural shift
- Update weights via RL
Relation to Prior Work
Self-Summarisation
Cursor Composer 2 [yang2025cursor] uses self-summarisation: the model summarises its own context at token limits, optimised via RL on both summary and downstream task quality.
| | Self-Summ. | SDPO | SSD | Composer 2 | Ours |
|---|---|---|---|---|---|
| Goal | Continue task | Alignment | Base dist. | Task perf. | Elim. exchange |
| Trigger | Token limit | Each interact. | Offline | Each rollout | Worth learning |
| Transfer | Same task | Per-interact. | Same type | Same task | Cross-input |
| Artefact | Compressed st. | None (wts) | None (wts) | Summary text | Reusable text |
| Multi-shot → ? | Many → many | Impl. shift | Impl. shift | Many → many | Many → one |
| Output invar. | No | No | No | No | Strict |
| Grad. signal | All tokens | Token-level | All tokens | All tokens | All tokens |
SDPO: Self-Distillation Policy Optimisation
Hübotter et al. [hubotter2025sdpo] propose SDPO, which converts rich textual feedback (e.g., runtime errors, judge evaluations) into a dense learning signal by treating the model conditioned on feedback as a self-teacher. The model’s feedback-informed next-token predictions are distilled back into the base policy, leveraging the model’s ability to retrospectively identify its own mistakes in-context. SDPO improves sample efficiency and final accuracy over RLVR baselines across scientific reasoning, tool use, and competitive programming.
The overlap is significant: both use the exchange as the learning signal, both want the model to behave without the exchange as if it had the exchange, and both see personalisation as a natural consequence. The fundamental difference: SDPO bakes learning into weights implicitly. Introspection produces an explicit, portable artefact that can be stored, composed, and applied to novel inputs. SDPO cannot do this because there is no introspection object.
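Useful borrowing. SDPO’s token-level log-ratio advantage (their Equation 1) is a clean reward signal: the per-token log-ratio between the feedback-conditioned self-teacher’s probability and the base policy’s probability. This could replace or supplement our similarity-based reward.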
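As a rough illustration of what a token-level log-ratio signal looks like, the sketch below subtracts the base policy's log-probability from the feedback-conditioned log-probability for each generated token. This is one plausible reading of the description above, not a reproduction of the SDPO equation; the function name and argument layout are assumptions.

```python
import math


def log_ratio_advantages(teacher_logprobs, student_logprobs):
    """Per-token log-ratio: log p_teacher(y_t) - log p_student(y_t).

    teacher_logprobs: log-probs of the generated tokens when the model is
                      conditioned on feedback (the self-teacher).
    student_logprobs: log-probs of the same tokens without feedback.
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("both sequences must score the same tokens")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


# Example: feedback raises the first token's probability from 0.2 to 0.8
# and leaves the second token unchanged.
example = log_ratio_advantages(
    [math.log(0.8), math.log(0.1)],
    [math.log(0.2), math.log(0.1)],
)
```

Tokens whose probability rises under feedback receive a positive value and unchanged tokens receive zero, so the signal is dense without requiring a separate similarity metric.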
Self-Distillation from Shifted Distributions (SSD)
Zhang et al. [zhang2026ssd] show that sampling from a model at shifted temperature and truncation parameters, then fine-tuning on the raw (unverified) outputs, substantially improves code generation: Qwen3-30B improves from 42.4% to 55.3% pass@1 on LiveCodeBench. The method uses no RL and no verifier, and it works even when 62% of training samples are incorrect.
The key insight is the precision-exploration conflict: token positions are either “locks” (one correct continuation, distractors in the tail) or “forks” (multiple valid continuations). No single decoding temperature handles both optimally. SSD reshapes the output distribution context-dependently—stripping tails at locks while preserving diversity at forks—through distributional self-distillation rather than explicit token-level control.
Like SDPO, SSD improves the model’s weights implicitly with no artefact produced; it cannot perform cross-input transfer via an explicit introspection object. However, SSD and introspection are complementary rather than competitive: SSD could improve the base model before introspection RL, and SSD’s finding that noisy training outputs still provide useful gradient signal through distributional reshaping is encouraging for our approach, where early-stage introspections are often semantically imperfect.
Composer 2
Cursor’s Composer 2 [composer2] achieves 61.3% on CursorBench via continued pretraining on code followed by large-scale asynchronous RL, built on Kimi K2.5 (1.04T parameters, 32B active MoE). Rollouts execute in full Firecracker VMs with real codebases; each rollout chains multiple generations connected by self-summaries. Critically, RL applies the final reward to all tokens in the chain—agent actions and summaries—so good summaries are reinforced because they led to downstream success.
Like our approach, Composer 2 trains on full trajectories with dense gradient signal. An initial variant of our method that trained only on 50 introspection tokens per rollout suffered from an extremely low gradient-signal-to-compute ratio (Section [sec:contrastive]); the observation that Composer 2 rewards all tokens in the chain, including summaries, informed our decision to apply advantages across full trajectories. Composer 2 also operates at massive scale (hundreds of thousands of parallel VMs, single-epoch training over unique real-world coding tasks), with mid-rollout weight updates and asynchronous RL that never waits. Their Dr. GRPO variant omits length standardisation and standard-deviation normalisation of advantages, using KL regularisation with the estimator instead.
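The difference in advantage normalisation can be seen in a few lines. The sketch below contrasts a standard GRPO-style advantage (mean-centred and divided by the group standard deviation) with a Dr. GRPO-style advantage (mean-centred only). Using the population standard deviation and a small `eps` are assumptions; the length-normalisation change concerns how token losses are averaged and is not shown.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Mean-centred rewards divided by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Mean-centred rewards with no standard-deviation normalisation."""
    mu = mean(rewards)
    return [r - mu for r in rewards]
```

Without the division, groups whose rewards barely differ produce small advantages instead of being rescaled to unit variance.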
Preliminary Experiments
Setup
We train Gemma 4 e2b-it (4B parameters) on a single RTX A6000 (51GB). Early experiments drew questions from obstetrics clustered by clinical principle; later experiments sample randomly from a full medical question bank across all specialties. We use single-shot rollouts—winners from the first generation only, no multi-turn correction—to avoid meta-gaming (Section [sec:metagaming]).
Rollout scaling. Batched generation latency on this hardware shows two regimes: sublinear growth from 1–32 rollouts (amortising weight-loading and launch overhead) and near-linear growth from 64–128 (KV-cache traffic and straggler effects dominate). We operate at 32 rollouts.
Training Stability (Winner-Only, Clustered Questions)
The following results are from the initial training approach: introspection tokens only, winner-generated learning points, obstetrics questions clustered by clinical principle. This approach was subsequently revised (Section 3.2) based on the failures identified here.
Initial training collapsed within 5 steps—the model degenerated to producing “N/A” outputs. We identified six causes:
- SGD with lr= and momentum 0.9 was too aggressive for policy gradient. Fixed: lr=, no momentum.
- Negative advantages destroyed the model: guarantees some correct rollouts get negative advantage. Fixed: .
- No gradient clipping. Fixed: .
- Wrong baseline for regularisation advantage: subtracting the initial question’s mean rather than a fresh baseline on the transfer question.
- Missing `torch.no_grad()` on rollout generation, causing KV cache to persist in the computation graph. Reserved GPU memory grew from 10.4GB to 41.0GB.
- No skip for zero-advantage steps, wasting forward/backward passes.
After fixes, 23 training steps completed without collapse. A canary metric (initial rollout accuracy) remained stable, confirming base capability preservation.
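Two of the fixes listed above, gradient clipping and skipping zero-advantage steps, are simple enough to sketch without a deep-learning framework. The function names and the tolerance are hypothetical, and the clipping threshold is left as a parameter because its value is not given here. In a PyTorch training loop the clipping would typically be done with `torch.nn.utils.clip_grad_norm_`, with rollout generation wrapped in `torch.no_grad()`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm


def has_signal(advantages, tol=1e-8):
    """True if at least one advantage is non-zero, i.e. the step is worth running."""
    return any(abs(a) > tol for a in advantages)
```

Checking `has_signal` before the forward and backward pass avoids spending compute on batches where every rollout received the same reward.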
Results
Validation advantage (same question). Mostly positive—learning points help on the question they were generated from. This is expected and represents the easy case.
Regularisation advantage (different question). Noisy around zero. Learning points are not consistently transferring to different questions within the same cluster over 23 steps.
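The two quantities reported here can be read as the same computation applied to different questions. The sketch below assumes each rollout is scored as 1 (correct) or 0 (incorrect) and that the advantage is the accuracy with the learning point injected minus a fresh baseline accuracy on the same question; the function name is hypothetical.

```python
from statistics import mean


def injection_advantage(with_learning_point, baseline):
    """Accuracy gain from injecting a learning point.

    with_learning_point: 0/1 outcomes of rollouts with the learning point in context.
    baseline:            0/1 outcomes of fresh rollouts on the SAME question without it.
    """
    return mean(with_learning_point) - mean(baseline)


# Validation advantage: both lists come from the source question.
validation = injection_advantage([1, 1, 0, 1], [0, 1, 0, 0])

# Regularisation advantage: both lists come from a different (transfer) question,
# so the baseline must be re-measured there rather than reused.
regularisation = injection_advantage([1, 0, 0, 1], [1, 0, 1, 0])
```

Measuring the baseline on the transfer question itself keeps differences in question difficulty from being mistaken for transfer.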
Mini-Cluster Training
A focused run on 5 Type A pre-eclampsia questions (5 epochs, 25 steps, SGD lr=, 32 initial rollouts, single-shot) produced no collapse and no meta-gaming, but combined reward was flat over 25 steps (0.27 average, no upward trend). Learning points at step 24 were paraphrases of step 0 outputs—all encoding the same rule (“hypertension + proteinuria = escalate”).
The contrastive signal problem. Training on winners produces no contrastive gradient. Winners already answered correctly; their reflections are post-hoc rationalisations of what the model already knew. Reinforcing these tokens barely moves weights because they were already high-probability. The model needs a signal that distinguishes better introspections from worse ones, not just correct from incorrect answers.
Compute-to-signal ratio. Each training step takes 10 minutes (32 initial rollouts + validation + regularisation + baseline) but produces gradient updates on only 50 introspection tokens—an extremely low gradient-signal-to-compute ratio compared to standard GRPO where all reasoning tokens contribute.
Regularisation advantage validates reward design. Despite flat training, the regularisation advantage correctly differentiated learning point quality: broad learning points mentioning 140/90 transferred well across the cluster; narrow ones anchoring only on 160/110 failed on lower-BP questions. The reward signal is informative—the model simply is not changing in response to it.
Meta-Gaming
Multi-turn correction prompting (“the answer is wrong, try again”) caused the model to produce exam-strategy learning points instead of domain reasoning:
“In a multiple-choice setting, if the standard options are repeatedly rejected, the intended answer is often the most extreme, definitive action listed.”
Analysis of 87 learning points: 13% showed meta-gaming. At 1–2 correction turns, 7–14% meta-gaming; at 3+ turns, 100%. We switched to single-shot rollouts, eliminating the correction loop entirely.
Cluster Analysis
Manual analysis of a 13-question pre-eclampsia cluster revealed four sub-types with different transferability profiles:
- Type A (5 questions): BP + proteinuria urgent escalation. One learning point covers all 5.
- Type B (3 questions): HELLP syndrome recognition from lab values. Transferable within sub-type.
- Type C (3 questions): Standalone knowledge facts. No transferable structure.
- Type D (1 question): Pathophysiology diagnosis.
The ideal learning point for Type A, discovered through iterative analysis, encodes two decision thresholds (140/90 urgent referral; 160/110 emergency admission). A learning point capturing only “escalate urgently” without the threshold distinction fails on questions requiring the finer discrimination. This demonstrates that natural introspection tends to over-generalise, compressing away discriminating detail.
Discussion
Energy landscape interpretation. Model behaviour can be understood through an energy landscape where low-energy states are attractors—easy to fall into, hard to escape. The “start a new chat” phenomenon occurs because the model is trapped in a basin that further prompting cannot escape. Introspection aims to stabilise high-energy states that are useful but fragile, making them attainable in a single step.
Cross-learning between rollouts. Within a GRPO batch, successful rollouts could teach unsuccessful ones—using successful rollout content as supervision rather than just scalar rewards. This self-distillation within the group provides richer gradient signal than “this rollout was good, that one was bad,” though it risks collapsing diversity.
Winner vs. loser learning points. A model that failed a question and is then shown the correct answer must bridge the gap between its wrong reasoning and the correct answer, which is a more informative signal than a winner’s post-hoc rationalisation. Loser-generated learning points may therefore be higher quality: losers articulate what was missing, while winners describe what they already knew. This observation motivates including losers in the training loop (Section 3.2) rather than discarding them as in standard GRPO.
Emergent structure detection. Forcing generalisation via a regularisation reward across handcrafted question clusters is counterproductive when the task lacks transferable structure. In medicine, determining which questions share structure is itself a hard problem; hand-selecting “structurally similar” questions amounts to hand-designing the generalisations, defeating the purpose of automation. Training over a full, unstructured question bank and letting the model discover which learning points transfer is preferable to imposing structure through clustering.
Spontaneous introspection and recall. Analogous to how reasoning models learned to use <think> tags through RL reward [deepseekr1], we expect models trained with introspection advantage to learn when an introspection block is worth producing, without explicit prompting. The SFT phase extends this to recall: the model learns when a recall block would help (novel problem matching a known error pattern) vs. when base capability suffices. The two mechanisms are complementary: introspection generates new knowledge after a task, recall retrieves existing knowledge before a task.
Conclusion
We have presented introspection as a trainable capability for language models: the ability to generate explicit, transferable artefacts that reconstruct internal states otherwise requiring multi-turn exchanges. Our two-phase pipeline—RL for introspection generation with both winner and loser contributions, SFT for introspection selection via recall—provides both a clear reward signal (introspection advantage) and a mechanism for compositional generalisation (sampling novel recombinations from the learned recall distribution). Preliminary experiments on a winner-only, clustered-question variant identified key challenges: meta-gaming from multi-turn correction, the contrastive signal problem (winner reflections are post-hoc rationalisations that do not move weights), and low gradient-signal-to-compute ratio when training only on introspection tokens. These findings directly shaped the current training procedure, which incorporates loser-generated learning points and dense gradient signal across full trajectories. The core research questions remain open: can RL produce introspections that genuinely transfer across inputs, and can the SFT recall distribution produce composed introspections that exceed the sum of their parts?