Continual Learning Through Introspection: Training Language Models to Generate Transferable Self-Knowledge via Reinforcement Learning

Yi Hein Chai

Abstract. We propose introspection: the ability of a language model to generate explicit textual artefacts that reconstruct internal states which would otherwise take a multi-turn conversational exchange to reach. Unlike self-summarisation, which compresses task context for continuation, introspection produces transferable lessons that remove the need for corrective exchanges: given an introspection generated from one episode, a structurally similar episode should be solved in a single shot. We present a two-phase training pipeline: (1) RL via GRPO to train introspection generation, where the reward signal is the introspection advantage, i.e. the causal improvement in downstream performance when an introspection is injected into a different question of the same type; and (2) SFT to train introspection selection, where the model learns to recall the right introspection for a given task via tags, with compositional generalisation expected to emerge from sampling over the learned distribution. We describe desirable properties of introspection outputs (cross-input transfer, compositionality, generalisation-as-regularisation) and report preliminary experiments on a 4B parameter model showing that while per-question introspection advantage is achievable, cross-question transfer remains an open challenge.

Introduction

Large language models deployed as autonomous agents encounter recurring failure modes: ambiguous instructions, poorly structured data, subtle domain conventions that conflict with training priors. In multi-turn interactions, users correct these failures through conversational exchanges—iterative feedback loops that eventually steer the model toward the desired behaviour. However, these corrections are ephemeral. The same model, presented with a structurally identical problem in a new context, will repeat the same error and require the same correction.

Prompting-based approaches to self-reflection (e.g., chain-of-thought, self-critique) ask the model to perform introspection in a single forward pass, which amounts to pattern-matching against what introspection looks like in training data rather than genuine reflection over experience [shinn2023reflexion]. Real introspection is a process over time: notice something, revisit with new context, gradually refine understanding.

We propose training language models to produce introspection outputs: explicit textual artefacts that, when injected into context, reconstruct the internal state that would otherwise require a full multi-turn exchange. The key distinction from prior work on self-summarisation [yang2025cursor] and self-distillation [hubotter2025sdpo] is that introspection produces a portable, reusable artefact with strict output invariance: $\text{input} + \text{introspection} \to \text{output}$ must match the output obtained through the full corrective exchange.

Problem Formulation

Definitions

An exchange is a multi-turn conversation where the model is corrected from an initial wrong output to the final correct output. An exchange is defined by the underlying lesson being corrected, not the surface form—two different conversations correcting the same type of mistake constitute the same exchange.

An introspection is a generated text $r$ such that, for input $x$ and exchange $e$:

f(x, e) = y \implies f(x, r) = y

where $f$ is the model’s output function. The exchange becomes redundant: multi-shot tasks become single-shot.

Desirable Properties

Cross-input transfer. If $r_A$ is generated from exchange $e_A$ on input $x_A$, it should also work on a different input $x_C$ that involves the same underlying error type:

f(x_C, r_A) = f(x_C, e_A) = y_C

Introspection is a function of the exchange, not the input.

Compositionality. Tasks can be compositions of subtasks. If input $x_D$ is composed of sub-problems $K$ and $F$, then the introspection $r_D$ generated on $x_D$ should solve each sub-problem on its own:

\begin{align} f(x_K, r_D) &= y_K \\ f(x_F, r_D) &= y_F \end{align}

and conversely, $r_K + r_F$ composed should help on $x_D$. This implies introspection has internal structure: it is a composition of separable lessons, not a monolithic summary.

Generalisation as regularisation. The objective should maximise the number of different questions for which a single introspection helps:

\max_{r} \sum_{q \in \mathcal{Q}} \text{reward}(f(q, r))

This prevents overfitting introspections to specific inputs and pushes toward modular, reusable lessons.
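As a minimal sketch of this objective, the snippet below scores each candidate introspection by its summed reward over a question set and keeps the best one. The names total_reward, best_introspection and toy_reward are hypothetical, reward_fn(q, r) stands in for $\text{reward}(f(q, r))$, and the keyword-matching toy reward is an illustrative assumption rather than the reward used in the experiments.

```python
from typing import Callable, Sequence


def total_reward(
    introspection: str,
    questions: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> float:
    """Summed reward when one introspection is injected into every question."""
    return sum(reward_fn(q, introspection) for q in questions)


def best_introspection(
    candidates: Sequence[str],
    questions: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate that maximises the summed reward over the question set."""
    return max(candidates, key=lambda r: total_reward(r, questions, reward_fn))


def toy_reward(question: str, introspection: str) -> float:
    """Hypothetical reward: 1.0 if the introspection mentions the question's key term."""
    key = question.split(":")[0]
    return 1.0 if key in introspection else 0.0


questions = ["threshold: q1", "threshold: q2", "labs: q3"]
candidates = ["note on threshold", "note on labs", "note on threshold and labs"]
best = best_introspection(candidates, questions, toy_reward)
print(best)
```

In this toy setting the candidate that covers both key terms wins because it is rewarded on all three questions, which is the pressure toward reusable lessons that the objective is meant to create.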

Spontaneous recall. Given a new question $x$, the model should autonomously generate a recall block containing the introspection most relevant to $x$, or a novel composition of previously learned introspections, before generating its answer. The recalled text enters the token context and directly conditions the model’s subsequent generation:

f(x) \rightarrow f(x, \hat{r})

where $\hat{r}$ is sampled from a distribution learned via SFT over validated introspections (Section 3.3). Because the model generates the recall tokens autoregressively, $\hat{r}$ need not match any introspection seen during training: it can be a novel recombination of learned lessons, composed on the fly for the current task. Crucially, this is not retrieval from an external memory store; the introspection is regenerated from the model’s own weights into the context window, where it shapes the downstream computation. The model should also learn when not to recall: when the question is within base capability, producing a recall block should confer no advantage and thus receive no reinforcement.

Method

Introspection as Internal State Reconstruction

Introspection is the ability of a model to generate outputs that recreate its internal states and replicate the behaviours associated with those states, without going through the full interaction exchange. This perspective offers an explanation for a familiar piece of advice, “don’t argue with the model, just start a new chat”: the model gets stuck in a low-energy internal state from which additional prompting cannot dislodge it. The exchange has put the model into a state basin, and no amount of further input within that exchange can shift it.

The model requires training on two fronts: generation (producing outputs that stimulate specific internal states) and usage (consuming those outputs and entering the corresponding states). This creates a handshake—the model learns both sides jointly, meaning introspection outputs are a private protocol between the model’s generation and consumption capabilities.

GRPO Training Procedure

We adapt Group Relative Policy Optimisation (GRPO) [shao2024grpo] for introspection training. For each training step on question $Q$:

Generate $N$ rollouts for $Q$. Some rollouts naturally include introspection output, some do not.
Extract introspections from successful rollouts that produced introspection text.
Evaluate transfer: for each extracted introspection $r_i$, inject it into $M$ rollouts on a different question $Q'$ of the same type. Run $M$ baseline rollouts on $Q'$ without introspection.
Compute introspection advantage: $A_{\text{intro}}(r_i) = \bar{R}(Q' + r_i) - \bar{R}(Q')$, where $\bar{R}$ denotes mean reward across rollouts. The baseline subtraction isolates the introspection’s contribution from the model’s base capability on $Q'$.
Policy gradient: backpropagate through the original rollout’s log-probabilities, weighted by introspection advantage. Only the introspection tokens contribute to the update.
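The advantage computation and the token restriction in the steps above can be sketched in a few lines. This is a pure-Python illustration under stated assumptions: the function names and the boolean token mask are hypothetical, and a real training loop would presumably apply the same weights to per-token log-probabilities inside the RL framework.

```python
from statistics import mean
from typing import List, Sequence


def introspection_advantage(
    rewards_with: Sequence[float], rewards_without: Sequence[float]
) -> float:
    """A_intro(r) = mean reward on Q' with r injected minus mean baseline reward on Q'."""
    if not rewards_with or not rewards_without:
        raise ValueError("need at least one rollout in each arm")
    return mean(rewards_with) - mean(rewards_without)


def introspection_token_weights(
    is_introspection_token: Sequence[bool], advantage: float
) -> List[float]:
    """Per-token policy-gradient weights: the advantage on introspection tokens, zero elsewhere."""
    return [advantage if flag else 0.0 for flag in is_introspection_token]


# M = 4 rollouts per arm on the transfer question Q' (toy rewards).
with_r = [1.0, 1.0, 0.0, 1.0]
baseline = [0.0, 1.0, 0.0, 0.0]
adv = introspection_advantage(with_r, baseline)

# Hypothetical mask marking which tokens of the original rollout are introspection text.
weights = introspection_token_weights([False, True, True, False], adv)
print(adv, weights)
```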

This reward structure trains three capabilities simultaneously:

When to introspect: rollouts that include introspection on questions with nothing to learn produce zero-advantage introspections, so the model learns not to introspect unnecessarily.
What to introspect about: introspections capturing transferable lessons get positive advantage; surface-level summaries get zero.
How to format introspection: the model discovers what representation is most consumable by itself (the handshake).

Cost. Each reward evaluation requires $2M$ extra rollouts on $Q'$ ($M$ with and $M$ without the introspection), making each step up to roughly $2M\times$ more expensive than standard GRPO. We mitigate this by using small $M$ (4–8), only evaluating introspections from successful rollouts, and batching baseline and introspection rollouts together.
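A back-of-the-envelope count makes the cost and the effect of the mitigations concrete. The sketch assumes that every evaluated introspection gets its own $M$ injected and $M$ baseline rollouts; the function names are hypothetical, and the numbers use $N=16$ and $M=4$ purely as an example.

```python
def rollouts_per_step(n_rollouts: int, n_evaluated: int, m: int) -> int:
    """N original rollouts plus 2M transfer rollouts (M injected, M baseline)
    for each introspection that is actually evaluated."""
    return n_rollouts + n_evaluated * 2 * m


def cost_multiplier(n_rollouts: int, n_evaluated: int, m: int) -> float:
    """Rollout count relative to a standard GRPO step with N rollouts."""
    return rollouts_per_step(n_rollouts, n_evaluated, m) / n_rollouts


# Worst case: all 16 rollouts yield an introspection that is evaluated.
worst = cost_multiplier(16, 16, 4)
# Only 4 successful rollouts produced an introspection worth evaluating.
typical = cost_multiplier(16, 4, 4)
print(worst, typical)
```

Under these assumptions the worst case is $1 + 2M$ times the baseline rollout count, and restricting evaluation to a quarter of the rollouts brings it down to a factor of 3.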

SFT for Introspection Selection via Recall

The GRPO phase (Section 3.2) produces a library of validated introspections with ground-truth effectiveness mappings: for each introspection $r_i$, we know which question types it helps and its measured advantage score. This gives us a dataset of $(q_{\text{type}}, r, A_{\text{intro}})$ triples. The second training phase uses supervised fine-tuning to teach the model to select the right introspection for a given task.

Given a problem statement, the model should generate a tagged recall block containing the introspection most relevant to the current task. The SFT training data is constructed directly from Phase 1: for each question $q$, the target is the introspection $r^*$ that achieved the highest introspection advantage on $q$'s type cluster.
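One way to picture the data construction is as an argmax over the Phase 1 library, grouped by question type. The sketch below is illustrative only: the tuple layouts, the function names and the toy library are assumptions, and questions whose type has no validated introspection are simply left out.

```python
from typing import Dict, Iterable, List, Tuple

# (question type, introspection text, introspection advantage)
Triple = Tuple[str, str, float]


def best_per_type(library: Iterable[Triple]) -> Dict[str, str]:
    """For each question type, keep the introspection with the highest advantage."""
    best: Dict[str, Tuple[str, float]] = {}
    for q_type, text, adv in library:
        if q_type not in best or adv > best[q_type][1]:
            best[q_type] = (text, adv)
    return {q_type: text for q_type, (text, _) in best.items()}


def build_sft_pairs(
    questions: Iterable[Tuple[str, str]], library: Iterable[Triple]
) -> List[Tuple[str, str]]:
    """Map (question text, question type) to (question text, target introspection r*)."""
    targets = best_per_type(library)
    return [(q, targets[t]) for q, t in questions if t in targets]


library = [("threshold", "r1", 0.1), ("threshold", "r2", 0.4), ("labs", "r3", 0.2)]
questions = [("q1", "threshold"), ("q2", "labs"), ("q3", "other")]
pairs = build_sft_pairs(questions, library)
print(pairs)
```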

Compositional generalisation through sampling. SFT trains the model to produce introspections token-by-token from a learned distribution. At inference time, the model samples from this distribution—it need not reproduce any introspection verbatim. Because the distribution is learned over many introspection examples spanning different error types, sampling can produce novel recombinations: introspections the model has never generated during training but that combine elements from multiple learned introspections.

For example, if the model has learned introspection $r_A$ (about threshold-based escalation) and $r_B$ (about recognising lab value patterns), sampling from the recall distribution on a question requiring both may produce a novel $r_C$ that composes them—without $r_C$ appearing in the training data. The SFT distribution is the mechanism through which composition emerges naturally, without requiring explicit compositional training.

Spontaneous recall. The model must also learn when recall is beneficial. We include both recall-beneficial examples (questions where injecting an introspection improves performance) and recall-unnecessary examples (straightforward questions within the model’s base capability) in the SFT data. The model learns to produce a recall block only when the question matches a known error pattern, analogous to how reasoning models learn when explicit reasoning is worth producing [deepseekr1].

Full Training Pipeline

The complete pipeline is cyclic:

1. Deploy and collect: the model encounters tasks and produces exchanges (corrections).
2. RL (GRPO): train the model to generate good introspections, producing a library of $(q_{\text{type}}, r, A_{\text{intro}})$ triples.
3. SFT: train the model to select and recall the right introspection via tags, using the library as supervision.
4. Inference: the model sees a new question $\rightarrow$ spontaneously generates a recall block $\rightarrow$ samples from the learned distribution $\rightarrow$ may produce a novel composed introspection $\rightarrow$ solves in a single shot.
5. Test-time adaptation: integrate new introspections into weights (see below).

Steps 4–5 produce new exchanges and introspections that feed back into steps 2–3, creating a continual learning loop.

Test-Time Adaptation

Because of the handshake, the model may produce introspections at test time—including novel compositions from the recall distribution—that its weights have not yet learned to consume. The notes analogy is instructive: study notes are triggers, not the knowledge itself—someone else’s notes are often useless because the background knowledge differs.

We propose test-time GRPO as a “dreaming” mechanism: the model simulates alternate scenarios and runs GRPO on them to integrate new introspections into its weights. Given introspection $r$ :

Take $x + r$ and generate rollouts
Compute reward (similarity to desired behaviour)
Vary the input, expect the same behavioural shift
Update weights via RL

Relation to Prior Work

Self-Summarisation

Cursor Composer 2 [yang2025cursor] uses self-summarisation: the model summarises its own context at token limits, optimised via RL on both summary and downstream task quality.

Comparison of introspection with related approaches.
Property	Self-Summarisation	SDPO	SSD	Introspection (ours)
Goal	Continue long task	Better alignment	Improve base dist.	Eliminate exchange
Trigger	Token limit	Every interaction	Offline	Something worth learning
Transfer	Same task	Per-interaction	Same task type	Cross-input
Artefact	Compressed state	None (weights)	None (weights)	Reusable text
Multi-shot $\rightarrow$ ?	Many $\rightarrow$ many	Implicit shift	Implicit shift	Many $\rightarrow$ one
Output invariance	Not claimed	Not claimed	Not claimed	Strict

SDPO: Self-Distillation Policy Optimisation

Hübotter et al. [hubotter2025sdpo] propose SDPO, which converts rich textual feedback (e.g., runtime errors, judge evaluations) into a dense learning signal by treating the model conditioned on feedback as a self-teacher. The model’s feedback-informed next-token predictions are distilled back into the base policy, leveraging the model’s ability to retrospectively identify its own mistakes in-context. SDPO improves sample efficiency and final accuracy over RLVR baselines across scientific reasoning, tool use, and competitive programming.

The overlap is significant: both use the exchange as the learning signal, both want the model to behave without the exchange as if it had the exchange, and both see personalisation as a natural consequence. The fundamental difference: SDPO bakes learning into weights implicitly. Introspection produces an explicit, portable artefact that can be stored, composed, and applied to novel inputs. SDPO cannot do $r_A + x_C \rightarrow y_C$ because there is no introspection object.

Useful borrowing. SDPO’s token-level log-ratio advantage (their Equation 1) is a clean reward signal: $A_i = \log \pi(y_i \mid x, o, y_{<i}) - \log \pi(y_i \mid x, y_{<i})$. This could replace or supplement our similarity-based reward.
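A minimal sketch of the log-ratio advantage, assuming the per-token log-probabilities have already been computed under both conditionings. We read $o$ as the feedback the self-teacher is conditioned on, which is our interpretation of the formula rather than a quotation; the function name and the toy values are hypothetical.

```python
from typing import List, Sequence


def log_ratio_advantages(
    logp_with_feedback: Sequence[float], logp_without_feedback: Sequence[float]
) -> List[float]:
    """A_i = log pi(y_i | x, o, y_<i) - log pi(y_i | x, y_<i) for each token i."""
    if len(logp_with_feedback) != len(logp_without_feedback):
        raise ValueError("both sequences must score the same tokens")
    return [a - b for a, b in zip(logp_with_feedback, logp_without_feedback)]


# Toy log-probabilities for a three-token response.
with_o = [-0.5, -1.0, -2.0]
without_o = [-1.5, -1.0, -1.0]
advantages = log_ratio_advantages(with_o, without_o)
print(advantages)
```

A positive entry marks a token that the feedback-conditioned model finds more likely than the unconditioned one, a negative entry the reverse, and zero means the feedback did not change the prediction.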

Self-Distillation from Shifted Distributions (SSD)

Zhang et al. [zhang2026ssd] show that sampling from a model at shifted temperature and truncation parameters, then fine-tuning on the raw (unverified) outputs, substantially improves code generation: Qwen3-30B improves from 42.4% to 55.3% pass@1 on LiveCodeBench. The method uses no RL and no verifier, and works even when 62% of training samples are incorrect.

The key insight is the precision-exploration conflict: token positions are either “locks” (one correct continuation, distractors in the tail) or “forks” (multiple valid continuations). No single decoding temperature handles both optimally. SSD reshapes the output distribution context-dependently—stripping tails at locks while preserving diversity at forks—through distributional self-distillation rather than explicit token-level control.

Like SDPO, SSD improves the model’s weights implicitly with no artefact produced; it cannot perform cross-input transfer via an explicit introspection object. However, SSD and introspection are complementary rather than competitive: SSD could improve the base model before introspection RL, and SSD’s finding that noisy training outputs still provide useful gradient signal through distributional reshaping is encouraging for our approach, where early-stage introspections are often semantically imperfect.

Preliminary Experiments

Setup

We train Gemma 4 e2b-it (4B parameters) on a single RTX A6000 (51GB). Questions are drawn from obstetrics, clustered by underlying clinical principle. We use single-shot rollouts (winners from the first generation only, with no multi-turn correction) to avoid meta-gaming (Section 5.4).

Rollout scaling. Batched generation latency on this hardware shows two regimes: sublinear growth from 1–32 rollouts (amortising weight-loading and launch overhead) and near-linear growth from 64–128 (KV-cache traffic and straggler effects dominate). We operate at $N=16$ rollouts.

Training Stability

Initial training collapsed within 5 steps—the model degenerated to producing “N/A” outputs. We identified six causes:

SGD with lr=$10^{-4}$ and momentum 0.9 was too aggressive for policy gradient. Fixed: lr=$10^{-6}$, no momentum.
Negative advantages destroyed the model: $\text{adv} = R - \bar{R}$ guarantees some correct rollouts get negative advantage. Fixed: clamp(adv, min=0).
No gradient clipping. Fixed: clip_grad_norm(max=1.0).
Wrong baseline for regularisation advantage—subtracting the initial question’s mean rather than a fresh baseline on the transfer question $Q'$ .
Missing torch.no_grad() on rollout generation, causing KV cache to persist in the computation graph. Reserved GPU memory grew from 10.4GB to 41.0GB.
No skip for zero-advantage steps, wasting forward/backward passes.

After fixes, 23 training steps completed without collapse. A canary metric (initial rollout accuracy) remained stable, confirming base capability preservation.
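Three of the fixes above (clamping advantages at zero, skipping zero-advantage steps, and clipping the gradient norm) are simple enough to sketch without a tensor library. The functions below are pure-Python stand-ins with hypothetical names; clip_scale in particular is a simplified version of norm clipping, returning only the factor by which gradients would be rescaled.

```python
from statistics import mean
from typing import List, Sequence


def clamped_advantages(rewards: Sequence[float]) -> List[float]:
    """Mean-centred advantages with negative values clamped to zero."""
    baseline = mean(rewards)
    return [max(r - baseline, 0.0) for r in rewards]


def should_skip_step(advantages: Sequence[float]) -> bool:
    """True when every advantage is zero, so the forward/backward pass can be skipped."""
    return all(a == 0.0 for a in advantages)


def clip_scale(grad_norm: float, max_norm: float = 1.0) -> float:
    """Factor that rescales gradients so their norm does not exceed max_norm."""
    if grad_norm <= 0.0:
        return 1.0
    return min(1.0, max_norm / grad_norm)


mixed = clamped_advantages([1.0, 0.0, 1.0, 0.0])
uniform = clamped_advantages([1.0, 1.0])
print(mixed, should_skip_step(mixed))
print(uniform, should_skip_step(uniform))
print(clip_scale(4.0), clip_scale(0.5))
```

With clamping in place, a group in which every rollout earns the same reward produces all-zero advantages, which is exactly the case the skip check is meant to catch.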

Results

Validation advantage (same question). Mostly positive: learning points (the extracted introspections) help on the question they were generated from. This is expected and represents the easy case.

Regularisation advantage (different question). Noisy around zero. Over 23 steps, learning points do not consistently transfer to different questions within the same cluster.

Meta-Gaming

Multi-turn correction prompting (“the answer is wrong, try again”) caused the model to produce exam-strategy learning points instead of domain reasoning:

“In a multiple-choice setting, if the standard options are repeatedly rejected, the intended answer is often the most extreme, definitive action listed.”

Of 87 learning points analysed, 13% showed meta-gaming. With 1–2 correction turns the rate was 7–14%; with 3 or more turns it was 100%. We therefore switched to single-shot rollouts, eliminating the correction loop entirely.

Cluster Analysis

Manual analysis of a 13-question pre-eclampsia cluster revealed four sub-types with different transferability profiles:

Type A (5 questions): BP + proteinuria $\rightarrow$ urgent escalation. One learning point covers all 5.
Type B (3 questions): HELLP syndrome recognition from lab values. Transferable within sub-type.
Type C (3 questions): Standalone knowledge facts. No transferable structure.
Type D (1 question): Pathophysiology diagnosis.

The ideal learning point for Type A, discovered through iterative analysis, encodes two decision thresholds ($\geq$ 140/90 $\rightarrow$ urgent referral; $\geq$ 160/110 $\rightarrow$ emergency admission). A learning point capturing only “escalate urgently” without the threshold distinction fails on questions requiring the finer discrimination. This suggests that natural introspection tends to over-generalise, compressing away discriminating detail.
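The difference between the two-threshold learning point and its over-generalised version can be made concrete with a small sketch. It illustrates only the structure of the lesson: the function names and the "no escalation" label are hypothetical, proteinuria is taken as already established, and treating a reading as meeting a threshold when either component reaches it is an assumption, since the text gives only the paired values.

```python
def two_threshold_rule(systolic: int, diastolic: int) -> str:
    """Learning point with both thresholds: 160/110 and 140/90.
    Assumption: either component reaching its value meets the threshold."""
    if systolic >= 160 or diastolic >= 110:
        return "emergency admission"
    if systolic >= 140 or diastolic >= 90:
        return "urgent referral"
    return "no escalation"


def coarse_rule(systolic: int, diastolic: int) -> str:
    """Over-generalised lesson that keeps only the lower threshold."""
    if systolic >= 140 or diastolic >= 90:
        return "urgent referral"
    return "no escalation"


# Both rules agree in the lower band ...
print(two_threshold_rule(150, 95), "|", coarse_rule(150, 95))
# ... but diverge once the upper threshold is crossed.
print(two_threshold_rule(165, 100), "|", coarse_rule(165, 100))
```

The two rules agree on questions in the lower band and disagree only above the upper threshold, which is where a learning point without the distinction would fail.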

Discussion

Energy landscape interpretation. Model behaviour can be understood through an energy landscape in which low-energy states are attractors: easy to fall into, hard to escape. The “start a new chat” phenomenon occurs because the model is trapped in a basin from which further prompting cannot move it. Introspection aims to stabilise high-energy states that are useful but fragile, making them attainable in a single step.

Cross-learning between rollouts. Within a GRPO batch, successful rollouts could teach unsuccessful ones—using successful rollout content as supervision rather than just scalar rewards. This self-distillation within the group provides richer gradient signal than “this rollout was good, that one was bad,” though it risks collapsing diversity.

Spontaneous introspection and recall. Analogous to how reasoning models learned to use reasoning tags through RL reward [deepseekr1], we expect models trained with introspection advantage to learn when an introspection is worth producing, without explicit prompting. The SFT phase extends this to recall: the model learns when recall would help (a novel problem matching a known error pattern) vs. when base capability suffices. The two mechanisms are complementary: introspection generates new knowledge after a task, and recall retrieves existing knowledge before a task.

Conclusion

We have presented introspection as a trainable capability for language models: the ability to generate explicit, transferable artefacts that reconstruct internal states otherwise requiring multi-turn exchanges. Our two-phase pipeline—RL for introspection generation, SFT for introspection selection via recall—provides both a clear reward signal (introspection advantage) and a mechanism for compositional generalisation (sampling novel recombinations from the learned recall distribution). Preliminary experiments identify key challenges: meta-gaming from multi-turn correction, the difficulty of cross-question transfer, and the need for fine-grained cluster design. The core research questions remain open: can RL produce introspections that genuinely transfer across inputs, and can the SFT recall distribution produce composed introspections that exceed the sum of their parts?

Acknowledgments

Experiments were conducted on a single RTX A6000 provided by RunPod. This work was not externally funded.