
Agent Self-Improvement

How the review agent evolves its reasoning over time. The system is designed so the agent doesn't start fresh each week — it sees its prior work, assesses its quality, and adjusts.

The Autoregressive Loop

Each weekly review cycle feeds the next. The agent's outputs become inputs for future runs:

Week N                              Week N+1
─────                              ────────
Generate hypothesis H1  ───────►   See H1 in thesis packet
Score NVDA: 7 (bullish) ───────►   See prior score trend: 7, 7, 6
Record CATALYST_STATE   ───────►   See prior observation via lookup
Write REFLECTION        ───────►   Read own reflection before scoring
Resolve H1: confirmed   ───────►   H1 appears in "Recently Resolved"
                                   with quality assessment in resolution

What persists across runs

Data          Storage                                                    How the agent sees it
Hypotheses    thesis_hypotheses (DB)                                     Active + recently resolved shown in thesis packets
Scores        rubric_scores (DB)                                         Prior score context in thesis/holding packets
Evidence      decision_evidence (DB)                                     Linked to hypotheses, shown in packets
Observations  review_annotations (DB) + observations.json (filesystem)   openfin review lookup <thesis-slug>
Decisions     decisions.json + DB                                        openfin review show <run-id>
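
To make the flow concrete, here is a minimal Python sketch of how persisted rows could be assembled into the cross-run context of a thesis packet. The table names come from the list above, but every column name and the packet wording are assumptions, not the actual schema.

# Sketch only: thesis_hypotheses / rubric_scores are real table names per the
# table above; the columns and the packet wording are illustrative guesses.
import sqlite3

def prior_context(db_path: str, thesis_slug: str) -> str:
    """Assemble the cross-run context a thesis packet would show for one thesis."""
    con = sqlite3.connect(db_path)
    hypotheses = con.execute(
        # in practice this would filter on status for active + recently resolved
        "SELECT text, status FROM thesis_hypotheses WHERE thesis_slug = ?",
        (thesis_slug,),
    ).fetchall()
    scores = con.execute(
        "SELECT composite FROM rubric_scores WHERE thesis_slug = ? "
        "ORDER BY created_at DESC LIMIT 3",
        (thesis_slug,),
    ).fetchall()
    con.close()
    lines = ["Prior Scores (last 3): " + ", ".join(str(c) for (c,) in scores)]
    lines += [f"[{status.upper()}] {text}" for text, status in hypotheses]
    return "\n".join(lines)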

What's computed fresh each run

Data           Source                                                     Purpose
Thesis health  Hypothesis confirmed/invalidated ratio, recency-weighted   Modulates action thresholds
Time pressure  Thesis time_horizon vs elapsed time                        Urgency signal
News age       Finnhub article timestamps                                 Catalyst priced-in assessment
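
For illustration, a recency-weighted confirmed/invalidated ratio could be computed roughly as below. This is a sketch: the half-life, the exact weighting, and the input shape are assumptions, not the actual formula.

from datetime import datetime, timezone

def thesis_health(resolved: list[dict], half_life_days: float = 28.0) -> float | None:
    """Recency-weighted share of confirmed hypotheses, in [0, 1].

    `resolved` is assumed to hold entries like {"status": "confirmed" | "invalidated",
    "resolved_at": <timezone-aware datetime>}. Returns None when nothing has resolved yet.
    """
    now = datetime.now(timezone.utc)
    num = den = 0.0
    for h in resolved:
        age_days = (now - h["resolved_at"]).total_seconds() / 86400
        weight = 0.5 ** (age_days / half_life_days)   # newer resolutions count more
        den += weight
        if h["status"] == "confirmed":
            num += weight
    return num / den if den else None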

Three Feedback Loops

1. Hypothesis lifecycle (structured, cross-run)

The primary mechanism. Hypotheses persist in the DB across runs and carry their full history.

Created (run A)
  → Active: shown in thesis packet each run
  → Agent assesses against new data each week
  → Confirmed/Invalidated (run B)
      resolution includes: what happened + was this useful + what would be better
  → Shown in "Recently Resolved" for 4 weeks
  → Feeds thesis health computation

Quality signal: The resolution field is where meta-learning happens. A good resolution says not just "ASML bookings beat by 15%" but also "this hypothesis was well-timed and directly informed the BUY_MORE decision on ASML; similar earnings-catalyst hypotheses work well for this thesis." A bad one notes: "this hypothesis was too vague to act on — next time specify the revenue threshold."

The revised status captures evolution: when a hypothesis isn't wrong but needs refinement, the agent marks it revised and creates an improved version, leaving a trail of how its thinking sharpened.

Bear-case hypotheses (BEAR: prefix) follow the same lifecycle but represent the strongest argument against a thesis. Their confirmation is a warning signal; their invalidation clears a risk.
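
A rough sketch of the lifecycle as data. The field names, the enum, and the supersedes link are illustrative, not the actual thesis_hypotheses schema.

from dataclasses import dataclass
from enum import Enum

class HypothesisStatus(str, Enum):
    ACTIVE = "active"
    CONFIRMED = "confirmed"
    INVALIDATED = "invalidated"
    REVISED = "revised"          # not wrong, but superseded by a sharper version

@dataclass
class Hypothesis:
    thesis_slug: str
    text: str                            # "BEAR: ..." marks the strongest counter-argument
    status: HypothesisStatus = HypothesisStatus.ACTIVE
    resolution: str | None = None        # what happened + was it useful + what would be better
    supersedes: str | None = None        # id of the hypothesis this one revises, if any

    @property
    def is_bear_case(self) -> bool:
        return self.text.startswith("BEAR:")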

2. Reflection annotations (narrative, cross-run)

Per-thesis observations stored as obs:<thesis-slug> annotations with structured prefixes:

  • REFLECTION: annotations capture meta-learning from the prior week, e.g. "Scored AMD too bullish last week (7→actual -4%). Hypothesis H3 was too vague to test. Need tighter invalidation criteria on earnings hypotheses."
  • CATALYST_STATE: annotations record a point-in-time catalyst assessment, e.g. "PRICED_IN | GTC news is 5d old, NVDA has already moved +8%"

These accumulate in observations.json during the active cycle and persist to review_annotations in DB. The agent retrieves its own history via openfin review lookup <thesis-slug> at the start of each review.

Cross-run evolution example:

Week 1: CATALYST_STATE: ABSORBING | NVDA GTC announcements driving price higher
Week 2: CATALYST_STATE: PRICED_IN | GTC news is 10d old, price has plateaued
Week 2: REFLECTION: Last week's ABSORBING call was correct but I scored
         news_sentiment too high (8) — the move was already 60% done.
         Calibrate: when ABSORBING shifts to PRICED_IN, news_sentiment
         should drop to 5-6, not stay elevated.
Week 3: [agent applies this calibration to current scoring]
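
Because the annotation body follows a "STATE | rationale" convention, transitions like the one above are easy to pull out of the lookup output if needed. A sketch; nothing in the pipeline depends on this parsing.

def parse_catalyst_state(annotation: str) -> tuple[str, str]:
    """Split 'CATALYST_STATE: PRICED_IN | rationale' into (state, rationale)."""
    body = annotation.removeprefix("CATALYST_STATE:").strip()
    state, _, rationale = body.partition("|")
    return state.strip(), rationale.strip()

# parse_catalyst_state("CATALYST_STATE: PRICED_IN | GTC news is 10d old, price has plateaued")
# -> ("PRICED_IN", "GTC news is 10d old, price has plateaued")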

3. Score calibration (implicit, data-driven)

Prior scores are shown in thesis and holding packets as trend context:

Prior Scores: Last review (2026-03-21): composite=0.8 -> BUY_MORE. Trend (last 3): 0.8, 0.7, 0.6

The agent sees whether its scores have been trending up or down and whether the trend matched price action. The SOP instructs:

  • When 3+ prior data points exist, note trend direction
  • Compare prior scores against subsequent price moves during reflection
  • Adjust calibration: if consistently too bullish, bias scores lower

This is implicit — no separate storage, just the agent reasoning over data already in the packet.
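
One mechanical reading of that instruction, as a sketch: in practice the agent does this reasoning over the packet text, and the thresholds and input shapes here are invented.

def calibration_note(prior_composites: list[float], price_returns: list[float]) -> str | None:
    """Compare the score trend against realized price moves over the same periods.

    Both lists are assumed newest-last and aligned one entry per review cycle.
    """
    if len(prior_composites) < 3:
        return None                      # SOP: note the trend only with 3+ data points
    scores_rising = prior_composites[-1] > prior_composites[0]
    price_up = sum(price_returns) > 0
    if scores_rising and not price_up:
        return "Scores trending up while price fell: bias composites lower."
    if not scores_rising and price_up:
        return "Scores trending down while price rose: check for excess caution."
    return "Score trend consistent with price action."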

Filesystem Artifacts During a Review Cycle

All agent outputs accumulate on disk during the active cycle before DB finalization:

{RUN_ID}/
  inputs.json          # Phase 1: raw gathered data (read-only after gather)
  context.json         # Phase 1: symbol scoring contexts
  summary.json         # Phases 1+5: narrative summaries, accumulates
  scoring.json         # Phases 3-4: scores + evidence, accumulates
  observations.json    # Phases 1.5+2+2.5: obs: annotations, accumulates
  decisions.json       # Phase 6: composites + actions (written at finalize)
  report.md            # Phase 6: rendered final report
  theses/{SLUG}.md     # Phase 1: thesis context packets
  holdings/{SYM}.md    # Phase 1: holding context packets

observations.json is a dict[str, str] — field → latest value. Same shape as what goes to DB via review_annotations, but inspectable on disk during the active cycle.
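
A minimal sketch of inspecting that file mid-cycle. The obs: key form is taken from the annotation convention above; everything else about the keys is an assumption.

import json
from pathlib import Path

def show_observations(run_dir: str) -> None:
    """Print the latest obs: annotations recorded so far in this run."""
    path = Path(run_dir) / "observations.json"
    observations: dict[str, str] = json.loads(path.read_text()) if path.exists() else {}
    for key, value in sorted(observations.items()):
        if key.startswith("obs:"):       # e.g. "obs:ai-compute-hardware"
            print(f"{key}\n  {value}\n")

# show_observations("weekly-youthful-ferocity")   # run ID as listed by `openfin review list`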

Querying the Learning Trail

# See all observations for a thesis across runs (newest first)
openfin review lookup ai-compute-hardware

# See a specific run's full state (scores, annotations, decisions)
openfin review show weekly-youthful-ferocity

# See hypothesis lifecycle for a thesis
openfin thesis status ai-compute-hardware

# List recent runs to find the prior one
openfin review list -n 5

What the Agent Should Get Better At Over Time

  1. Hypothesis specificity — vague claims get noted as low-quality in resolutions; the agent learns to write tighter invalidation criteria
  2. Score calibration — systematic over/under-scoring gets flagged in reflections; the agent adjusts its anchoring
  3. Catalyst timing — tracking how CATALYST_STATE transitions correlate with price moves teaches the agent when to call something priced in
  4. Bear case quality — bear hypotheses that get invalidated quickly were too conservative; ones that persist are genuinely informative risks
  5. Research targeting — Phase 2.5 research that consistently finds nothing useful for certain thesis types teaches the agent to focus research elsewhere

Soul & Policy Layer

The agent's reasoning is shaped by two user-authored files in ~/.openfin/:

  • Soul (soul.yaml) — prose narrative describing who the user is as an investor. Injected verbatim into all agent system prompts (weekly review, daily triage, Telegram /ask). Rarely changes. Gives the agent stable context about risk tolerance, investment style, and decision-making philosophy.
  • Policy (policy.yaml) — guidance prose + structured limits (max position %, max sector %, min cash %). Selected from a template during openfin init, then customized by the user. Shapes how the agent frames scoring rationale and action recommendations.

Both are prepended to system prompts so they sit at the highest priority in the agent's context hierarchy. When the agent writes a REFLECTION: or scores a rubric, it reasons within the bounds of the investor's identity and constraints rather than applying generic heuristics.
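
To make the structured side of policy.yaml concrete, here is a sketch of loading it and checking the three limits named above. The key names and file layout are assumptions, not the actual template.

import yaml  # pyyaml

def check_policy_limits(policy_path: str,
                        position_pcts: dict[str, float],
                        sector_pcts: dict[str, float],
                        cash_pct: float) -> list[str]:
    """Return human-readable breaches of the structured limits in policy.yaml."""
    with open(policy_path) as f:
        policy = yaml.safe_load(f)
    limits = policy.get("limits", {})            # assumed section name
    breaches: list[str] = []
    max_pos = limits.get("max_position_pct")     # assumed key names throughout
    if max_pos is not None:
        breaches += [f"{sym} at {pct:.1f}% exceeds max position {max_pos}%"
                     for sym, pct in position_pcts.items() if pct > max_pos]
    max_sector = limits.get("max_sector_pct")
    if max_sector is not None:
        breaches += [f"{sec} at {pct:.1f}% exceeds max sector {max_sector}%"
                     for sec, pct in sector_pcts.items() if pct > max_sector]
    min_cash = limits.get("min_cash_pct")
    if min_cash is not None and cash_pct < min_cash:
        breaches.append(f"Cash at {cash_pct:.1f}% is below the minimum {min_cash}%")
    return breaches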

Future: Policy tick. The weekly review will evaluate policy fitness — detecting limit breaches, style drift, and time horizon mismatches — and write POLICY: annotation suggestions. The human reads these and manually edits policy.yaml if they agree. The agent never auto-modifies policy.

Design Constraints

  • No LLM calls from CLI. The agent calls CLI commands; the CLI persists data. Intelligence lives in the agent, not the tool.
  • No new tables for meta-learning. Hypothesis resolution text, obs: annotations, and score history provide the audit trail. Structured extraction can be added later if needed.
  • Filesystem-first. JSON artifacts are authoritative during the active cycle. DB is best-effort write-through for queryable history.
  • Convention over schema. BEAR: prefix on hypotheses, REFLECTION: / CATALYST_STATE: prefixes on annotations — these are naming conventions the agent follows, not enforced types. This keeps the system flexible while the right patterns emerge.