Continual Learning Through Introspection: Training Language Models to Generate Transferable Self-Knowledge via Reinforcement Learning
`<recall>` tags, with compositional generalisation emerging naturally from sampling over the learned distribution. We describe desirable properties of introspection outputs (cross-input transfer, compositionality, generalisation-as-regularisation) and report preliminary experiments on a 4B-parameter model showing that while per-question introspection advantage is achievable, cross-question transfer remains an open challenge.

Introduction
Large language models deployed as autonomous agents encounter recurring failure modes: ambiguous instructions, poorly structured data, subtle domain conventions that conflict with training priors. In multi-turn interactions, users correct these failures through conversational exchanges—iterative feedback loops that eventually steer the model toward the desired behaviour. However, these corrections are ephemeral. The same model, presented with a structurally identical problem in a new context, will repeat the same error and require the same correction.
Prompting-based approaches to self-reflection (e.g., chain-of-thought, self-critique) ask the model to perform introspection in a single forward pass, which amounts to pattern-matching against what introspection looks like in training data rather than genuine reflection over experience [shinn2023reflexion]. Real introspection is a process over time: notice something, revisit with new context, gradually refine understanding.
We propose training language models to produce introspection outputs—explicit textual artefacts that, when injected into context, reconstruct the internal state that would otherwise require a full multi-turn exchange. The key distinction from prior work on self-summarisation [yang2025cursor] and self-distillation [hubotter2025sdpo] is that introspection produces a portable, reusable artefact with strict output invariance: the output produced with the introspection in context must match the output obtained through the full corrective exchange.
Problem Formulation
Definitions
An exchange is a multi-turn conversation where the model is corrected from an initial wrong output to the final correct output. An exchange is defined by the underlying lesson being corrected, not the surface form—two different conversations correcting the same type of mistake constitute the same exchange.
An introspection is a generated text $I$ such that for input $x$ and exchange $E$:

$$M(x \oplus I) = M(x \oplus E)$$

where $M(\cdot)$ denotes the model's output given a context and $\oplus$ denotes context concatenation.
Desirable Properties
Cross-input transfer. If $I$ is generated from exchange $E$ on input $x$, it should also work on a different input $x'$ that involves the same underlying error type:

$$M(x' \oplus I) = M(x' \oplus E)$$
Compositionality. Tasks can be compositions of subtasks. If input $x$ is composed of sub-problems $x_1$ and $x_2$ with introspections $I_1$ and $I_2$, then:

$$M(x \oplus I_1 \oplus I_2) = M(x \oplus E_1 \oplus E_2)$$
Generalisation as regularisation. The objective should maximise the number of different questions for which a single introspection helps:

$$\max_{I} \;\big|\{\, x' : \bar{R}(x' \oplus I) > \bar{R}(x') \,\}\big|$$

where $\bar{R}$ denotes mean reward across rollouts.
Spontaneous recall. Given a new question $x'$, the model should autonomously generate a `<recall>` block containing the introspection most relevant to $x'$—or a novel composition of previously learned introspections—before generating its answer. The recalled text enters the token context and directly conditions the model's subsequent generation. On a question already within the model's base capability, recall should confer no advantage and thus receive no reinforcement.
Method
Introspection as Internal State Reconstruction
Introspection is the ability of a model to generate outputs that recreate its internal states and replicate behaviours associated with those states, without going through the full interaction exchange. This perspective explains a known phenomenon: “don’t argue with the model, just start a new chat”—the model gets stuck in a low-energy internal state that additional prompting cannot escape. The exchange has put the model into a state basin, and no amount of further input within that exchange can shift it.
The model requires training on two fronts: generation (producing outputs that stimulate specific internal states) and usage (consuming those outputs and entering the corresponding states). This creates a handshake—the model learns both sides jointly, meaning introspection outputs are a private protocol between the model’s generation and consumption capabilities.
GRPO Training Procedure
We adapt Group Relative Policy Optimisation (GRPO) [shao2024grpo] for introspection training. For each training step on question $q$:
- Generate $N$ rollouts for $q$. Some rollouts naturally include `<introspect>` output, some do not.
- Extract introspections from successful rollouts that produced introspection text.
- Evaluate transfer: for each extracted introspection $I$, inject it into rollouts on a different question $q'$ of the same type. Run baseline rollouts on $q'$ without introspection.
- Compute introspection advantage:

$$A_I = \bar{R}(q' \oplus I) - \bar{R}(q')$$

where $\bar{R}$ denotes mean reward across rollouts. The baseline subtraction isolates the introspection’s contribution from the model’s base capability on $q'$.
- Policy gradient: backpropagate through the original rollout’s log-probabilities, weighted by the introspection advantage. Gradients are applied only to the introspection tokens.
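The advantage computation above can be sketched in a few lines. This is a minimal illustration, not the training implementation: `rollout_fn` is a hypothetical callable that runs one rollout and returns a scalar reward, optionally with text injected into context.

```python
import statistics

def introspection_advantage(rollout_fn, introspection, q_prime, n=8):
    """Estimate the introspection advantage A_I on a transfer question q'.

    rollout_fn(question, prefix) -> scalar reward for one sampled rollout;
    `prefix` is text injected into context before the question (or None).
    Both rollout_fn and the parameter names here are illustrative.
    """
    # Baseline: mean reward on q' with no introspection in context.
    baseline = statistics.mean(rollout_fn(q_prime, prefix=None) for _ in range(n))
    # Treatment: mean reward on q' with the introspection injected.
    treated = statistics.mean(rollout_fn(q_prime, prefix=introspection) for _ in range(n))
    # Positive => the introspection transfers; near zero => no contribution.
    return treated - baseline
```

Because both terms are means over fresh rollouts on $q'$, the estimate isolates the introspection's effect from base capability, at the cost of $2n$ extra rollouts per evaluated introspection.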
This reward structure trains three capabilities simultaneously:
- When to introspect: rollouts that include `<introspect>` on questions with nothing to learn produce zero-advantage introspections, so the model learns not to introspect unnecessarily.
- What to introspect about: introspections capturing transferable lessons get positive advantage; surface-level summaries get zero.
- How to format introspection: the model discovers what representation is most consumable by itself (the handshake).
Cost. Each reward evaluation requires extra rollouts on $q'$ (with and without introspection), making each step substantially more expensive than standard GRPO. We mitigate this by using small $N$ (4–8), evaluating only introspections from successful rollouts, and batching baseline and introspection rollouts together.
SFT for Introspection Selection via Recall
The GRPO phase (Section 3.2) produces a library of validated introspections with ground-truth effectiveness mappings: for each introspection $I$, we know which question types it helps and its measured advantage score. This gives us a dataset of (question type, introspection, advantage) triples. The second training phase uses supervised fine-tuning to teach the model to select the right introspection for a given task.
Given a problem statement, the model should generate `<recall>` tags containing the introspection most relevant to the current task. The SFT training data is constructed directly from Phase 1: for each question $q$, the target is the introspection that achieved the highest introspection advantage on $q$'s type cluster.
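The data construction step can be sketched as follows. The triple schema and the `<recall>` target format are taken from the description above; the function and field names are our own illustrative choices.

```python
def build_sft_pairs(library, questions):
    """Construct SFT (prompt, target) pairs from the Phase-1 library.

    library: iterable of (cluster_id, introspection_text, advantage) triples.
    questions: iterable of (question_text, cluster_id) pairs.
    Names are illustrative; the real pipeline's schema may differ.
    """
    # For each type cluster, keep the introspection with the highest
    # measured introspection advantage.
    best = {}
    for cluster, text, adv in library:
        if cluster not in best or adv > best[cluster][1]:
            best[cluster] = (text, adv)

    pairs = []
    for question, cluster in questions:
        if cluster in best:
            text, _ = best[cluster]
            # Target: emit the recalled introspection before answering.
            pairs.append((question, f"<recall>{text}</recall>"))
    return pairs
```

Questions whose cluster has no validated introspection simply produce no pair here; in practice those become the recall-unnecessary examples discussed below, with an empty-recall target.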
Compositional generalisation through sampling. SFT trains the model to produce introspections token-by-token from a learned distribution. At inference time, the model samples from this distribution—it need not reproduce any introspection verbatim. Because the distribution is learned over many introspection examples spanning different error types, sampling can produce novel recombinations: introspections the model has never generated during training but that combine elements from multiple learned introspections.
For example, if the model has learned introspection $I_1$ (about threshold-based escalation) and $I_2$ (about recognising lab-value patterns), sampling from the recall distribution on a question requiring both may produce a novel $I_{1 \oplus 2}$ that composes them—without $I_{1 \oplus 2}$ appearing in the training data. The SFT distribution is the mechanism through which composition emerges naturally, without requiring explicit compositional training.
Spontaneous recall. The model must also learn when recall is beneficial. We include both recall-beneficial examples (questions where injecting an introspection improves performance) and recall-unnecessary examples (straightforward questions within the model’s base capability) in the SFT data. The model learns to produce `<recall>` only when the question matches a known error pattern, analogous to how reasoning models learn when `<think>` is worth producing [deepseekr1].
Full Training Pipeline
The complete pipeline is cyclic:
- Deploy and collect: the model encounters tasks and produces exchanges (corrections).
- RL (GRPO): train the model to generate good introspections, producing a library of (question type, introspection, advantage) triples.
- SFT: train the model to select and recall the right introspection via `<recall>` tags, using the library as supervision.
- Inference: the model sees a new question, spontaneously generates `<recall>`, samples from the learned distribution (possibly producing a novel composed introspection), and solves the question in a single shot.
- Test-time adaptation: integrate new introspections into weights (see below).
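The cycle can be summarised as a short control loop. Every callable and method name below is a placeholder standing in for the component described above, not a real API.

```python
def continual_learning_cycle(model, tasks, grpo_train, sft_train, tta_update):
    """One pass of the cyclic pipeline; all names are illustrative stubs."""
    # 1. Deploy and collect: interactions yield corrective exchanges.
    exchanges = [model.interact(t) for t in tasks]
    # 2. RL (GRPO): produce a library of validated introspections.
    library = grpo_train(model, exchanges)
    # 3. SFT: teach introspection selection/recall from the library.
    sft_train(model, library)
    # 4-5. Inference with spontaneous recall, then test-time adaptation.
    answers = []
    for t in tasks:
        recalled = model.recall(t)                  # may be a novel composition
        answers.append(model.answer(t, context=recalled))
        tta_update(model, recalled)                 # integrate into weights
    return library, answers
```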
Test-Time Adaptation
Because of the handshake, the model may produce introspections at test time—including novel compositions from the recall distribution—that its weights have not yet learned to consume. The notes analogy is instructive: study notes are triggers, not the knowledge itself—someone else’s notes are often useless because the background knowledge differs.
We propose test-time GRPO as a “dreaming” mechanism: the model simulates alternate scenarios and runs GRPO on them to integrate new introspections into its weights. Given introspection $I$:
- Take $I$, inject it into a fresh input, and generate rollouts
- Compute reward (similarity to the desired behaviour)
- Vary the input, expecting the same behavioural shift
- Update weights via RL
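The loop above can be sketched as follows. This is a schematic under stated assumptions: `make_variant` and `reward` are hypothetical callables supplying varied inputs and a behaviour-similarity score, and `model.rollout`/`model.reinforce` stand in for generation and a policy-gradient update.

```python
def dream_update(model, introspection, make_variant, reward, steps=4, n=8):
    """Test-time GRPO sketch ("dreaming"): integrate a novel introspection
    into weights. make_variant() yields a fresh input of the same type;
    reward() scores how closely a rollout matches the behaviour the
    introspection should induce. All names here are illustrative."""
    for _ in range(steps):
        x = make_variant()                               # vary the input
        group = [model.rollout(x, prefix=introspection) for _ in range(n)]
        rewards = [reward(r) for r in group]
        mean_r = sum(rewards) / n
        for r, rew in zip(group, rewards):
            adv = max(rew - mean_r, 0.0)                 # non-negative advantage
            if adv > 0.0:
                model.reinforce(r, adv)                  # policy-gradient step
```

The non-negative group-relative advantage mirrors the stability fix from the training experiments: only above-average rollouts are reinforced.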
Relation to Prior Work
Self-Summarisation
Cursor Composer 2 [yang2025cursor] uses self-summarisation: the model summarises its own context at token limits, optimised via RL on both summary and downstream task quality.
| | Self-Summ. | SDPO | SSD | Introspection (ours) |
|---|---|---|---|---|
| Goal | Continue long task | Better alignment | Improve base dist. | Eliminate exchange |
| Trigger | Token limit | Every interaction | Offline | Something worth learning |
| Transfer | Same task | Per-interaction | Same task type | Cross-input |
| Artefact | Compressed state | None (weights) | None (weights) | Reusable text |
| Multi-shot → ? | Many → many | Implicit shift | Implicit shift | Many → one |
| Output invariance | Not claimed | Not claimed | Not claimed | Strict |
SDPO: Self-Distillation Policy Optimisation
Hübotter et al. [hubotter2025sdpo] propose SDPO, which converts rich textual feedback (e.g., runtime errors, judge evaluations) into a dense learning signal by treating the model conditioned on feedback as a self-teacher. The model’s feedback-informed next-token predictions are distilled back into the base policy, leveraging the model’s ability to retrospectively identify its own mistakes in-context. SDPO improves sample efficiency and final accuracy over RLVR baselines across scientific reasoning, tool use, and competitive programming.
The overlap is significant: both use the exchange as the learning signal, both want the model to behave without the exchange as if it had the exchange, and both see personalisation as a natural consequence. The fundamental difference: SDPO bakes learning into weights implicitly. Introspection produces an explicit, portable artefact that can be stored, composed, and applied to novel inputs. SDPO cannot do this because there is no introspection object.
Useful borrowing. SDPO’s token-level log-ratio advantage (their Equation 1) is a clean reward signal:

$$A_t = \log \pi_\theta(y_t \mid x, f, y_{<t}) - \log \pi_\theta(y_t \mid x, y_{<t})$$

where $f$ is the textual feedback. This could replace or supplement our similarity-based reward.
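Given per-token log-probabilities of the same sampled tokens under the two contexts, the signal is a simple difference. The function below is our own sketch of the idea, not SDPO's actual interface.

```python
def log_ratio_advantages(logp_with_feedback, logp_base):
    """Per-token log-ratio advantage: how much more probable each sampled
    token becomes once the feedback is in context. Positive values mark
    tokens the feedback-conditioned 'teacher' endorses. Also returns the
    mean as a scalar reward proxy. Illustrative signature only."""
    per_token = [lf - lb for lf, lb in zip(logp_with_feedback, logp_base)]
    return per_token, sum(per_token) / len(per_token)
```

In our setting, the "feedback" context would be the corrective exchange, and the mean log-ratio could serve as the reward on transfer rollouts in place of (or alongside) behavioural similarity.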
Self-Distillation from Shifted Distributions (SSD)
Zhang et al. [zhang2026ssd] show that sampling from a model at shifted temperature and truncation parameters, then fine-tuning on the raw (unverified) outputs, substantially improves code generation—Qwen3-30B improves from 42.4% to 55.3% pass@1 on LiveCodeBench. No RL, no verifier, and the method works even when 62% of training samples are incorrect.
The key insight is the precision-exploration conflict: token positions are either “locks” (one correct continuation, distractors in the tail) or “forks” (multiple valid continuations). No single decoding temperature handles both optimally. SSD reshapes the output distribution context-dependently—stripping tails at locks while preserving diversity at forks—through distributional self-distillation rather than explicit token-level control.
Like SDPO, SSD improves the model’s weights implicitly with no artefact produced; it cannot perform cross-input transfer via an explicit introspection object. However, SSD and introspection are complementary rather than competitive: SSD could improve the base model before introspection RL, and SSD’s finding that noisy training outputs still provide useful gradient signal through distributional reshaping is encouraging for our approach, where early-stage introspections are often semantically imperfect.
Preliminary Experiments
Setup
We train Gemma 4 e2b-it (4B parameters) on a single RTX A6000 (51GB). Questions are drawn from obstetrics, clustered by underlying clinical principle. We use single-shot rollouts—winners from the first generation only, no multi-turn correction—to avoid meta-gaming (see the Meta-Gaming section).
Rollout scaling. Batched generation latency on this hardware shows two regimes: sublinear growth from 1–32 rollouts (amortising weight-loading and launch overhead) and near-linear growth from 64–128 (KV-cache traffic and straggler effects dominate). We therefore operate within the sublinear regime.
Training Stability
Initial training collapsed within 5 steps—the model degenerated to producing “N/A” outputs. We identified six causes:
- SGD with momentum 0.9 and the initial learning rate was too aggressive for policy gradient. Fixed: a much lower learning rate, no momentum.
- Negative advantages destroyed the model: group-mean baseline subtraction guarantees that some correct rollouts get negative advantage. Fixed: clamp advantages to be non-negative.
- No gradient clipping. Fixed: gradient-norm clipping.
- Wrong baseline for the regularisation advantage—subtracting the initial question’s mean rather than a fresh baseline on the transfer question $q'$.
- Missing `torch.no_grad()` on rollout generation, causing the KV cache to persist in the computation graph; reserved GPU memory grew from 10.4GB to 41.0GB.
- No skip for zero-advantage steps, wasting forward/backward passes.
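The corrected step logic can be summarised in a framework-agnostic skeleton. Every callable below is a placeholder for the real framework call (generation under `torch.no_grad()`, the backward pass, `clip_grad_norm_`, the optimiser step); the structure, not the names, is the point.

```python
def safe_grpo_step(generate_rollouts, rewards_of, backprop, clip_grads, apply_update):
    """Training-step skeleton encoding the stability fixes above.
    All five callables are illustrative stand-ins for framework calls."""
    # Rollout generation must run without autograd in the real code
    # (torch.no_grad()), so the KV cache never enters the graph.
    rollouts = generate_rollouts()
    rewards = rewards_of(rollouts)
    mean_r = sum(rewards) / len(rewards)
    # Non-negative group-relative advantages: never punish correct rollouts.
    advantages = [max(r - mean_r, 0.0) for r in rewards]
    if not any(advantages):
        return False                      # skip zero-advantage steps entirely
    backprop(rollouts, advantages)        # introspection tokens only
    clip_grads()                          # gradient-norm clipping
    apply_update()                        # low learning rate, no momentum
    return True
```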
After fixes, 23 training steps completed without collapse. A canary metric (initial rollout accuracy) remained stable, confirming base capability preservation.
Results
Validation advantage (same question). Mostly positive—learning points help on the question they were generated from. This is expected and represents the easy case.
Regularisation advantage (different question). Noisy around zero. Learning points are not consistently transferring to different questions within the same cluster over 23 steps.
Meta-Gaming
Multi-turn correction prompting (“the answer is wrong, try again”) caused the model to produce exam-strategy learning points instead of domain reasoning:
“In a multiple-choice setting, if the standard options are repeatedly rejected, the intended answer is often the most extreme, definitive action listed.”

Analysis of 87 learning points: 13% showed meta-gaming. At 1–2 correction turns, 7–14% meta-gaming; at 3+ turns, 100%. We switched to single-shot rollouts, eliminating the correction loop entirely.
Cluster Analysis
Manual analysis of a 13-question pre-eclampsia cluster revealed four sub-types with different transferability profiles:
- Type A (5 questions): BP + proteinuria urgent escalation. One learning point covers all 5.
- Type B (3 questions): HELLP syndrome recognition from lab values. Transferable within sub-type.
- Type C (3 questions): Standalone knowledge facts. No transferable structure.
- Type D (1 question): Pathophysiology diagnosis.
The ideal learning point for Type A, discovered through iterative analysis, encodes two decision thresholds (140/90 urgent referral; 160/110 emergency admission). A learning point capturing only “escalate urgently” without the threshold distinction fails on questions requiring the finer discrimination. This demonstrates that natural introspection tends to over-generalise, compressing away discriminating detail.
Discussion
Energy landscape interpretation. Model behaviour can be understood through an energy landscape where low-energy states are attractors—easy to fall into, hard to escape. The “start a new chat” phenomenon occurs because the model is trapped in a basin that further prompting cannot escape. Introspection aims to stabilise high-energy states that are useful but fragile, making them attainable in a single step.
Cross-learning between rollouts. Within a GRPO batch, successful rollouts could teach unsuccessful ones—using successful rollout content as supervision rather than just scalar rewards. This self-distillation within the group provides richer gradient signal than “this rollout was good, that one was bad,” though it risks collapsing diversity.
Spontaneous introspection and recall. Analogous to how reasoning models learned to use `<think>` tags through RL reward [deepseekr1], we expect models trained with introspection advantage to learn when `<introspect>` is worth producing—without explicit prompting. The SFT phase extends this to recall: the model learns when `<recall>` would help (a novel problem matching a known error pattern) vs. when base capability suffices. The two mechanisms are complementary—`<introspect>` generates new knowledge after a task, `<recall>` retrieves existing knowledge before a task.
Conclusion
We have presented introspection as a trainable capability for language models: the ability to generate explicit, transferable artefacts that reconstruct internal states otherwise requiring multi-turn exchanges. Our two-phase pipeline—RL for introspection generation, SFT for introspection selection via recall—provides both a clear reward signal (introspection advantage) and a mechanism for compositional generalisation (sampling novel recombinations from the learned recall distribution). Preliminary experiments identify key challenges: meta-gaming from multi-turn correction, the difficulty of cross-question transfer, and the need for fine-grained cluster design. The core research questions remain open: can RL produce introspections that genuinely transfer across inputs, and can the SFT recall distribution produce composed introspections that exceed the sum of their parts?