Tuesday, 2 June 2026

Reflection Agent for Generating Article Descriptions

Reflection = generate, critique yourself, fix what failed. Here the loop is automated with LangGraph, three local LLMs, and a score-based stop condition, so descriptions improve without manual editing every time.

A short walkthrough of the generate → evaluate → improve loop in Practice_Reflection_QA.py, built with LangGraph and multiple local LLMs via Ollama.


The problem

You need a 1–2 line description for an article title. A single LLM call often works, but quality is inconsistent: vague wording, wrong tone, or missing the point.


The reflection pattern fixes this by splitting the work into roles:


  1. Generator — draft the description

  2. Evaluator — score it and give feedback

  3. Improver — rewrite using that feedback


The graph loops until the description is good enough or a safety cap is hit.
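Stripped of any framework, the loop described above can be sketched in plain Python. This is a minimal illustration, not the code from Practice_Reflection_QA.py: the function name reflect and the fake_* stand-ins are hypothetical, and the defaults (passing score 9, at most 2 improve rounds) are the values quoted later in this post.

```python
from typing import Callable, Tuple


def reflect(
    title: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], Tuple[int, str]],
    improve: Callable[[str, str, str], str],
    passing_score: int = 9,
    max_rounds: int = 2,
) -> Tuple[str, int, int]:
    """Run generate -> evaluate -> improve until the score passes or the cap is hit."""
    description = generate(title)
    rounds = 0
    while True:
        score, feedback = evaluate(title, description)
        # Stop when the draft is good enough or the safety cap is reached.
        if score >= passing_score or rounds >= max_rounds:
            return description, score, rounds
        description = improve(title, description, feedback)
        rounds += 1


# Hypothetical stand-ins so the sketch runs without a model.
# In the real script each role would call an LLM instead.
def fake_generate(title: str) -> str:
    return f"About {title}."


def fake_evaluate(title: str, description: str) -> Tuple[int, str]:
    score = 9 if "practical" in description else 6
    return score, "Say why the article is practical."


def fake_improve(title: str, description: str, feedback: str) -> str:
    return f"A practical guide to {title}."


final, score, rounds = reflect(
    "reflection agents", fake_generate, fake_evaluate, fake_improve
)
print(final, score, rounds)
```

Passing the three roles in as callables keeps the loop itself testable without Ollama running.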



Architecture at a glance


Node                   Model (Ollama)    Role
generate_description   mistral:latest    First draft from title
evaluate_description   gpt-oss:20b       Score (1–10) + feedback
improve_description    llama3.2:latest   Revise using feedback


Each node calls a separate LLM. That is a lightweight multi-agent setup: not three chat personas in one model, but three specialized steps with their own prompts and models.


Shared state

All nodes read and update one ArticleReflectionState:


from typing import TypedDict


class ArticleReflectionState(TypedDict):
    """State shared by every node in the graph."""

    title: str
    description: str
    feedback: str
    score: int
    iteration_took: int


  • title — user input

  • description — current draft (updated by generate and improve)

  • feedback — evaluator’s raw output (score + suggestions)

  • score — parsed integer from SCORE=7 style text

  • iteration_took — how many improve rounds have run


LangGraph passes this dict through the graph; each node returns the updated state.
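To make the "read state, return updated state" contract concrete, here is a minimal sketch of a generator node. The factory make_generate_node, the injected llm callable, and the prompt wording are assumptions for illustration; the real script calls an Ollama model at this point.

```python
from typing import Callable, TypedDict


class ArticleReflectionState(TypedDict):
    title: str
    description: str
    feedback: str
    score: int
    iteration_took: int


def make_generate_node(llm: Callable[[str], str]):
    """Build a node around any callable that maps a prompt to text."""

    def generate_description(state: ArticleReflectionState) -> ArticleReflectionState:
        # Hypothetical prompt; the script's actual wording may differ.
        prompt = f"Write a 1-2 line description for the article titled: {state['title']}"
        # Copy the incoming state and overwrite only the field this node owns.
        return {**state, "description": llm(prompt).strip()}

    return generate_description


initial: ArticleReflectionState = {
    "title": "Reflection agents",
    "description": "",
    "feedback": "",
    "score": 0,
    "iteration_took": 0,
}

# A lambda stands in for the model so the sketch runs offline.
node = make_generate_node(lambda prompt: "  A draft description.  ")
updated = node(initial)
print(updated["description"])
```

Because the node returns a new dict instead of mutating its input, the state before and after each step can be compared when debugging.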


The workflow (LangGraph)

START → generate → evaluate → [conditional]
                                  ├─ need_improvement → improve → evaluate (loop)
                                  └─ evaluation_pass → END


Generate invokes the generator prompt with the title.


Evaluate scores the description and stores feedback. A small parser pulls the score from the free-form LLM text, and two constants set the stop condition:

PASSING_EVAL_SCORE = 9

MAX_IMPROVE_ROUNDS_BEFORE_ACCEPT = 2
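A score parser along these lines would do the job. It is a minimal sketch built on the SCORE=(\d+) pattern mentioned in this post; the function name parse_score and the fallback of 0 when no score is found are assumptions, chosen so that an unparseable reply is treated as a failing draft.

```python
import re

# Pattern quoted in the post: the evaluator is asked to emit e.g. "SCORE=7".
SCORE_PATTERN = re.compile(r"SCORE=(\d+)")


def parse_score(feedback: str, default: int = 0) -> int:
    """Return the first SCORE=<n> value found in the text, else the default."""
    match = SCORE_PATTERN.search(feedback)
    return int(match.group(1)) if match else default


print(parse_score("SCORE=7\nTighten the wording."))
print(parse_score("Looks fine overall."))
```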


Improve rewrites the description using title, current text, and feedback, then increments iteration_took.


Routing (is_article_to_be_improved) stops when:

  • score >= 9 — good enough, keep result

  • or iteration_took >= 2 — cap reached, accept latest draft anyway


That second rule avoids infinite loops when the evaluator never gives a 9.
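The two stop rules can be sketched as a routing function. The constants and the route labels need_improvement and evaluation_pass come from this post; the function body is an assumed reconstruction, not a copy of the script.

```python
PASSING_EVAL_SCORE = 9
MAX_IMPROVE_ROUNDS_BEFORE_ACCEPT = 2


def is_article_to_be_improved(state: dict) -> str:
    """Pick the next edge based on the parsed score and the round counter."""
    if state["score"] >= PASSING_EVAL_SCORE:
        return "evaluation_pass"  # good enough, keep the result
    if state["iteration_took"] >= MAX_IMPROVE_ROUNDS_BEFORE_ACCEPT:
        return "evaluation_pass"  # cap reached, accept the latest draft
    return "need_improvement"


print(is_article_to_be_improved({"score": 6, "iteration_took": 0}))
print(is_article_to_be_improved({"score": 6, "iteration_took": 2}))
```

Since the evaluator runs once more after each improve step, the worst case is three evaluations and two rewrites per title.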


Why multiple LLMs?

Using different models per role is a practical multi-agent pattern:


Benefit                How it helps here
Role separation        Generator optimizes for creativity; evaluator for critique; improver for targeted edits
Model strengths        You can pick a fast model for drafts and a stronger one for judgment
Local / cost control   All three run locally through Ollama, with no lock-in to a single API vendor
Clear debugging        Logs show which step failed (bad draft vs harsh scorer vs weak rewrite)


You do not need three different models for reflection to work; one model with three prompts is enough. Using three is an experiment in specialization and ensemble-style quality.


Review: what works well

  1. Explicit reflection loop — Evaluate before ship; improve only when needed.

  2. Structured exit — Score threshold + max rounds is a solid production habit.

  3. Parsed score: the SCORE=(\d+) pattern keeps routing deterministic despite messy LLM prose.

  4. LangGraph — Conditional edges make the loop obvious and easy to visualize (graph/reflection_graph.png).

  5. Thin nodes — Each function does one thing: prompt → LLM → update state.


Run it

# Ollama: pull models used in the script

ollama pull mistral

ollama pull gpt-oss:20b

ollama pull llama3.2


python Practice_Reflection_QA.py


Enter an article title; type /bye to exit. Uncomment render_store_graph(wf_app) to export the graph PNG.
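The prompt-until-/bye behaviour can be sketched as a small session function. The names run_session and describe are hypothetical, and describe stands in for whatever invokes the compiled graph. Titles are passed in as an iterable so the loop can be exercised without a terminal.

```python
from typing import Callable, Iterable, List


def run_session(titles: Iterable[str], describe: Callable[[str], str]) -> List[str]:
    """Describe each title in turn until /bye is seen."""
    results = []
    for raw in titles:
        title = raw.strip()
        if title == "/bye":
            break
        if not title:
            continue  # ignore empty input lines
        results.append(describe(title))
    return results


# A formatter stands in for the graph call so the sketch runs offline.
outputs = run_session(
    ["Reflection agents", "", "Local LLMs", "/bye", "never reached"],
    lambda title: f"Description for: {title}",
)
print(outputs)
```

In an interactive run the titles could come from iter(input, "/bye"), which keeps reading lines until the user types /bye.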


Code: Practice_Reflection_QA.py
