Reflection Agent for Generating Article Descriptions

Reflection = generate, critique yourself, fix what failed—automated with LangGraph, three local LLMs, and a score-based stop condition so you get better descriptions without manual editing every time.

A short walkthrough of the generate → evaluate → improve loop in Practice_Reflection_QA.py, built with LangGraph and multiple local LLMs via Ollama.

The problem

You need a 1–2 line description for an article title. A single LLM call often works, but quality is inconsistent: vague wording, wrong tone, or missing the point.

The reflection pattern fixes this by splitting the work into roles:

Generator — draft the description
Evaluator — score it and give feedback
Improver — rewrite using that feedback

The graph loops until the description is good enough or a safety cap is hit.
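Before any framework is involved, the loop can be sketched in plain Python. This is a minimal illustration, not the code in Practice_Reflection_QA.py: the `reflect` function and the three `fake_*` roles are hypothetical stand-ins, and the default threshold and cap simply mirror the values described later in this post.

```python
from typing import Callable, Tuple


def reflect(
    title: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], Tuple[int, str]],
    improve: Callable[[str, str, str], str],
    passing_score: int = 9,
    max_rounds: int = 2,
) -> Tuple[str, int, int]:
    """Run generate -> evaluate -> improve until the score passes or the cap is hit."""
    description = generate(title)
    score, feedback = evaluate(title, description)
    rounds = 0
    while score < passing_score and rounds < max_rounds:
        description = improve(title, description, feedback)
        rounds += 1
        score, feedback = evaluate(title, description)
    return description, score, rounds


# Stand-in roles so the sketch runs without any model (hypothetical).
def fake_generate(title: str) -> str:
    return f"About {title}."


def fake_evaluate(title: str, description: str) -> Tuple[int, str]:
    # Pretend a draft passes once it mentions practical use.
    score = 9 if "in practice" in description else 6
    return score, "Say how it is used in practice."


def fake_improve(title: str, description: str, feedback: str) -> str:
    return description.rstrip(".") + " in practice."


result = reflect("Reflection agents", fake_generate, fake_evaluate, fake_improve)
print(result)
```

Passing the roles in as callables keeps the loop independent of any particular model, which is the same separation the graph version relies on.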

Architecture at a glance

Node	Model (Ollama)	Role
generate_description	mistral:latest	First draft from title
evaluate_description	gpt-oss:20b	Score (1–10) + feedback
improve_description	llama3.2:latest	Revise using feedback

Each node calls a different LLM. That makes it a lightweight multi-agent setup: not three chat personas in one model, but three specialized steps with their own prompts and models.

Shared state

All nodes read and update one ArticleReflectionState:

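from typing import TypedDict

class ArticleReflectionState(TypedDict):
    title: str           # user input
    description: str     # current draft
    feedback: str        # evaluator's raw output
    score: int           # parsed from the feedback text
    iteration_took: int  # improve rounds completed so far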

title — user input
description — current draft (updated by generate and improve)
feedback — evaluator’s raw output (score + suggestions)
score — parsed integer from SCORE=7 style text
iteration_took — how many improve rounds have run

LangGraph passes this dict through the graph; each node returns the updated state.
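As a rough sketch of what "each node returns the updated state" can look like, here is one possible shape for the improve node. The `make_improve_node` helper, the prompt wording, and the idea of passing the LLM as a plain prompt-to-text callable are assumptions made for illustration; the actual script may build its nodes differently.

```python
from typing import Callable, TypedDict


class ArticleReflectionState(TypedDict):
    title: str
    description: str
    feedback: str
    score: int
    iteration_took: int


def make_improve_node(llm: Callable[[str], str]):
    """Build an improve node around any prompt -> text callable (hypothetical helper)."""

    def improve_description(state: ArticleReflectionState) -> ArticleReflectionState:
        prompt = (
            f"Title: {state['title']}\n"
            f"Current description: {state['description']}\n"
            f"Feedback: {state['feedback']}\n"
            "Rewrite the description in 1-2 lines."
        )
        # Copy the incoming state, then overwrite only the fields this node owns.
        return {
            **state,
            "description": llm(prompt).strip(),
            "iteration_took": state["iteration_took"] + 1,
        }

    return improve_description


# A canned reply stands in for the real model call.
node = make_improve_node(lambda prompt: "  A sharper description.  ")
before: ArticleReflectionState = {
    "title": "Reflection agents",
    "description": "A vague description.",
    "feedback": "SCORE=6 Too vague.",
    "score": 6,
    "iteration_took": 0,
}
after = node(before)
print(after["description"], after["iteration_took"])
```

Returning a new dict rather than mutating the input keeps each step easy to log and compare.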

The workflow (LangGraph)

START → generate → evaluate → [conditional]
                               ├─ need_improvement → improve → evaluate (loop)
                               └─ evaluation_pass → END
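The diagram above could be wired up roughly as follows. This is a sketch under assumptions: the node bodies are stubs with no LLM behind them, the short node names (`generate`, `evaluate`, `improve`) are chosen here for readability, and the LangGraph import sits inside `build_graph` only so the stubs can be exercised without the library installed.

```python
from typing import TypedDict


class ArticleReflectionState(TypedDict):
    title: str
    description: str
    feedback: str
    score: int
    iteration_took: int


# Stub nodes: each returns the updated state, with no model call (illustrative only).
def generate_description(state):
    return {**state, "description": f"Draft for {state['title']}"}


def evaluate_description(state):
    # Pretend the first draft scores 6 and any revised draft scores 9.
    score = 9 if state["iteration_took"] > 0 else 6
    return {**state, "score": score, "feedback": f"SCORE={score}"}


def improve_description(state):
    return {
        **state,
        "description": state["description"] + " (revised)",
        "iteration_took": state["iteration_took"] + 1,
    }


def is_article_to_be_improved(state) -> str:
    if state["score"] >= 9 or state["iteration_took"] >= 2:
        return "evaluation_pass"
    return "need_improvement"


def build_graph():
    # Imported lazily so the stubs above run even without LangGraph installed.
    from langgraph.graph import StateGraph, START, END

    builder = StateGraph(ArticleReflectionState)
    builder.add_node("generate", generate_description)
    builder.add_node("evaluate", evaluate_description)
    builder.add_node("improve", improve_description)

    builder.add_edge(START, "generate")
    builder.add_edge("generate", "evaluate")
    # The router's return value picks the next node via this mapping.
    builder.add_conditional_edges(
        "evaluate",
        is_article_to_be_improved,
        {"need_improvement": "improve", "evaluation_pass": END},
    )
    builder.add_edge("improve", "evaluate")  # closes the loop
    return builder.compile()
```

With LangGraph installed, calling `build_graph().invoke(...)` with a full initial state (empty description and feedback, score 0, iteration_took 0) should take one pass through improve with these stubs before reaching END.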

Generate invokes the generator prompt with the title.

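Evaluate scores the description and stores feedback. A small parser pulls the score from free-form LLM text, and two constants set the exit conditions:

PASSING_EVAL_SCORE = 9                 # minimum score that ends the loop
MAX_IMPROVE_ROUNDS_BEFORE_ACCEPT = 2   # cap on improve rounds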
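The score parser itself might look something like this. The `SCORE=(\d+)` pattern comes from this post; the function name `parse_score` and the fallback to 0 when no score is found are assumptions, chosen so that an unparseable reply is treated as "not passing" rather than raising an error.

```python
import re

# Pattern from the post: the evaluator is asked to include text like SCORE=7.
SCORE_PATTERN = re.compile(r"SCORE=(\d+)")


def parse_score(feedback: str, default: int = 0) -> int:
    """Return the first SCORE=<n> found in the text, or `default` if none is present."""
    match = SCORE_PATTERN.search(feedback)
    if match is None:
        return default
    return int(match.group(1))


print(parse_score("Good hook. SCORE=7. Tighten the second sentence."))
print(parse_score("The evaluator forgot to include a score."))
```

Falling back to a low default means a malformed evaluation triggers another improve round (up to the cap) instead of being accepted by accident.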

Improve rewrites the description using title, current text, and feedback, then increments iteration_took.

Routing (is_article_to_be_improved) stops when:

score >= 9 — good enough, keep result
or iteration_took >= 2 — cap reached, accept latest draft anyway

That second rule avoids infinite loops when the evaluator never gives a 9.
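A minimal version of the routing function, assuming it receives the state as a dict and returns one of the two edge labels from the workflow diagram, could be written like this. The function and constant names follow the post; the body is a sketch, not a copy of the script.

```python
PASSING_EVAL_SCORE = 9
MAX_IMPROVE_ROUNDS_BEFORE_ACCEPT = 2


def is_article_to_be_improved(state: dict) -> str:
    """Pick the next edge: stop on a passing score or when the round cap is reached."""
    if state["score"] >= PASSING_EVAL_SCORE:
        return "evaluation_pass"  # good enough, keep the result
    if state["iteration_took"] >= MAX_IMPROVE_ROUNDS_BEFORE_ACCEPT:
        return "evaluation_pass"  # cap reached, accept the latest draft
    return "need_improvement"


# Walk through the four interesting combinations of score and round count.
for score, rounds in [(9, 0), (6, 0), (8, 1), (8, 2)]:
    decision = is_article_to_be_improved({"score": score, "iteration_took": rounds})
    print(f"score={score} rounds={rounds} -> {decision}")
```

With a cap of 2, the worst case for a single title is one generate call, three evaluate calls, and two improve calls, so the cost of a run is bounded at six LLM calls.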

Why multiple LLMs?

Using different models per role is a practical multi-agent pattern:

Benefit	How it helps here
Role separation	Generator optimizes for creativity; evaluator for critique; improver for targeted edits
Model strengths	You can pick a fast model for drafts and a stronger one for judgment
Local / cost control	All three run locally through Ollama, with no lock-in to a single API vendor
Clear debugging	Logs show which step failed (bad draft vs harsh scorer vs weak rewrite)

You do not need three different models for reflection to work; one model with three prompts is enough. Using three is an experiment in specialization and ensemble-style quality.
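To illustrate the single-model variant, the three roles can be reduced to three prompt templates sent to the same callable. Everything here is hypothetical: the prompt wording, the `ROLE_PROMPTS` table, and the `run_role` helper are invented for this sketch and are not taken from the script.

```python
from typing import Callable, Dict, List

# One template per role; the wording is illustrative only.
ROLE_PROMPTS: Dict[str, str] = {
    "generate": "Write a 1-2 line description for the article titled: {title}",
    "evaluate": (
        "Rate this description of '{title}' from 1 to 10. "
        "Reply as SCORE=<n> followed by feedback.\n{description}"
    ),
    "improve": (
        "Rewrite the description of '{title}' using the feedback.\n"
        "Description: {description}\nFeedback: {feedback}"
    ),
}


def run_role(llm: Callable[[str], str], role: str, **fields: str) -> str:
    """Fill the role's template and send it to the one shared model."""
    return llm(ROLE_PROMPTS[role].format(**fields))


# A recording stub stands in for the single model.
sent: List[str] = []


def stub_llm(prompt: str) -> str:
    sent.append(prompt)
    return "ok"


draft = run_role(stub_llm, "generate", title="Reflection agents")
review = run_role(stub_llm, "evaluate", title="Reflection agents", description=draft)
print(len(sent), sent[0])
```

Swapping in three different models then only means passing a different callable per role, so the two setups can be compared without changing the prompts.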

Review: what works well

Explicit reflection loop — Evaluate before ship; improve only when needed.
Structured exit — Score threshold + max rounds is a solid production habit.
Parsed score — SCORE=(\d+) keeps routing deterministic despite messy LLM prose.
LangGraph — Conditional edges make the loop obvious and easy to visualize (graph/reflection_graph.png).
Thin nodes — Each function does one thing: prompt → LLM → update state.

Run it

# Ollama: pull models used in the script
ollama pull mistral
ollama pull gpt-oss:20b
ollama pull llama3.2

python Practice_Reflection_QA.py

Enter an article title; type /bye to exit. Uncomment render_store_graph(wf_app) to export the graph PNG.

Code: Practice_Reflection_QA.py

Rajendhiran Easu - #TechFreak

Tuesday, 2 June 2026