Reflection means generate, critique your own output, and fix what failed. Here it is automated with LangGraph, three local LLMs, and a score-based stop condition, so descriptions improve without manual editing every time.
A short walkthrough of the generate → evaluate → improve loop in Practice_Reflection_QA.py, built with LangGraph and multiple local LLMs via Ollama.
The problem
You need a 1–2 line description for an article title. A single LLM call often works, but quality is inconsistent: vague wording, wrong tone, or missing the point.
The reflection pattern fixes this by splitting the work into roles:
Generator — draft the description
Evaluator — score it and give feedback
Improver — rewrite using that feedback
The graph loops until the description is good enough or a safety cap is hit.
Architecture at a glance
Each node calls its own LLM. That makes this a lightweight multi-agent setup: not three chat personas in one model, but three specialized steps, each with its own prompt and model.
Shared state
All nodes read and update one ArticleReflectionState:
from typing import TypedDict

class ArticleReflectionState(TypedDict):
    title: str
    description: str
    feedback: str
    score: int
    iteration_took: int
title — user input
description — current draft (updated by generate and improve)
feedback — evaluator’s raw output (score + suggestions)
score — parsed integer from SCORE=7 style text
iteration_took — how many improve rounds have run
LangGraph passes this dict through the graph; each node returns the updated state.
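The "each node returns the updated state" contract can be sketched without a real model. This is a minimal illustration, not the script's code: make_generate_node, fake_llm, and the prompt wording are hypothetical, and the real nodes presumably call an Ollama model where fake_llm stands in.

```python
from typing import Callable, TypedDict


class ArticleReflectionState(TypedDict):
    title: str
    description: str
    feedback: str
    score: int
    iteration_took: int


def make_generate_node(llm: Callable[[str], str]):
    """Build a thin node: prompt -> LLM -> updated state.

    `llm` is any text-in, text-out callable (hypothetical stand-in for a model).
    """

    def generate_node(state: ArticleReflectionState) -> ArticleReflectionState:
        # Illustrative prompt text; the script's actual prompt is not shown here.
        prompt = f"Write a 1-2 line description for the article titled: {state['title']}"
        # Copy the incoming state and overwrite only the field this node owns.
        return {**state, "description": llm(prompt).strip()}

    return generate_node


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call, so the sketch runs offline.
    return "  A short guide to reflection loops.  "


initial: ArticleReflectionState = {
    "title": "Reflection with LangGraph",
    "description": "",
    "feedback": "",
    "score": 0,
    "iteration_took": 0,
}
updated = make_generate_node(fake_llm)(initial)
print(updated["description"])
```

Injecting the model as a callable keeps the node easy to test: swapping fake_llm for a real client changes nothing else in the function.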
The workflow (LangGraph)
START → generate → evaluate → [conditional]
                              ├─ need_improvement → improve → evaluate (loop)
                              └─ evaluation_pass → END
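The diagram above could be wired in LangGraph roughly as follows. This is a sketch under assumptions: build_workflow is a hypothetical helper, the node names "generate", "evaluate", and "improve" are taken from the diagram, and the node and router functions are passed in rather than copied from the script. It needs the langgraph package installed to actually build the graph.

```python
from typing import Callable, TypedDict


class ArticleReflectionState(TypedDict):
    title: str
    description: str
    feedback: str
    score: int
    iteration_took: int


def build_workflow(
    generate: Callable,
    evaluate: Callable,
    improve: Callable,
    router: Callable,
):
    """Wire START -> generate -> evaluate -> (improve -> evaluate)* -> END."""
    # Imported lazily so this module can be loaded without LangGraph installed.
    from langgraph.graph import END, START, StateGraph

    graph = StateGraph(ArticleReflectionState)
    graph.add_node("generate", generate)
    graph.add_node("evaluate", evaluate)
    graph.add_node("improve", improve)

    graph.add_edge(START, "generate")
    graph.add_edge("generate", "evaluate")
    # The router returns one of two labels; the mapping turns each label
    # into the next node (or END).
    graph.add_conditional_edges(
        "evaluate",
        router,
        {"need_improvement": "improve", "evaluation_pass": END},
    )
    # After an improve round, the new draft is scored again.
    graph.add_edge("improve", "evaluate")
    return graph.compile()
```

A compiled graph is then run with something like app.invoke({"title": "My title", "description": "", "feedback": "", "score": 0, "iteration_took": 0}), which returns the final state dict.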
Generate invokes the generator prompt with the title.
Evaluate scores the description and stores feedback. A small parser pulls the score from free-form LLM text, and two constants define when the loop stops:
PASSING_EVAL_SCORE = 9
MAX_IMPROVE_ROUNDS_BEFORE_ACCEPT = 2
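The score parser itself might look like the sketch below. The SCORE=(\d+) pattern comes from the walkthrough; the function name parse_score and the fallback to 0 when no score is found are assumptions made for illustration.

```python
import re

# Pattern quoted in the walkthrough: SCORE= followed by one or more digits.
_SCORE_RE = re.compile(r"SCORE=(\d+)")


def parse_score(feedback: str, default: int = 0) -> int:
    """Return the first SCORE=<n> value in the text, or `default` if none is found.

    Falling back to a low default (an assumption) means an unparseable
    evaluation is treated as "needs improvement" rather than as a pass.
    """
    match = _SCORE_RE.search(feedback)
    return int(match.group(1)) if match else default


print(parse_score("SCORE=7\nTone is fine, but the wording is vague."))  # 7
print(parse_score("Looks great overall!"))  # 0
```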
Improve rewrites the description using title, current text, and feedback, then increments iteration_took.
Routing (is_article_to_be_improved) stops when:
score >= 9 — good enough, keep result
or iteration_took >= 2 — cap reached, accept latest draft anyway
That second rule avoids infinite loops when the evaluator never gives a 9.
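The two stop rules can be sketched as a routing function. The function name, constant names, and the two route labels are taken from the walkthrough; the body is a plausible reconstruction, not the script's actual code.

```python
from typing import Any, Literal, Mapping

PASSING_EVAL_SCORE = 9
MAX_IMPROVE_ROUNDS_BEFORE_ACCEPT = 2


def is_article_to_be_improved(
    state: Mapping[str, Any],
) -> Literal["need_improvement", "evaluation_pass"]:
    """Decide whether to loop back to improve or stop."""
    # Rule 1: the evaluator is satisfied.
    if state["score"] >= PASSING_EVAL_SCORE:
        return "evaluation_pass"
    # Rule 2: safety cap reached, so accept the latest draft anyway.
    if state["iteration_took"] >= MAX_IMPROVE_ROUNDS_BEFORE_ACCEPT:
        return "evaluation_pass"
    return "need_improvement"


print(is_article_to_be_improved({"score": 9, "iteration_took": 0}))  # evaluation_pass
print(is_article_to_be_improved({"score": 6, "iteration_took": 1}))  # need_improvement
print(is_article_to_be_improved({"score": 6, "iteration_took": 2}))  # evaluation_pass
```

With a cap of 2, the evaluator runs at most three times: once on the first draft and once after each improve round.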
Why multiple LLMs?
Using a different model per role is a practical multi-agent pattern. The script pulls three Ollama models (mistral, gpt-oss:20b, and llama3.2), one for each role.
You do not need three different models for reflection to work; one model with three prompts is enough. Using three is an experiment in specialization and ensemble-style quality.
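The choice between one model and three can be reduced to a small role-to-model mapping. In the sketch below, assign_models and the role names are hypothetical, and the pairing of models to roles is arbitrary: the walkthrough does not say which model plays which role.

```python
from typing import Dict, Sequence

ROLES = ("generator", "evaluator", "improver")


def assign_models(models: Sequence[str]) -> Dict[str, str]:
    """Map each role to a model name.

    One model is reused for all three roles; three models are paired
    with the roles in order.
    """
    if len(models) == 1:
        return {role: models[0] for role in ROLES}
    if len(models) != len(ROLES):
        raise ValueError(f"expected 1 or {len(ROLES)} models, got {len(models)}")
    return dict(zip(ROLES, models))


# Single-model reflection: same model, three different prompts.
print(assign_models(["llama3.2"]))

# Three-model variant; this particular pairing is only an example.
print(assign_models(["mistral", "gpt-oss:20b", "llama3.2"]))
```

Keeping the mapping in one place makes it cheap to compare the single-model and three-model setups on the same titles.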
Review: what works well
Explicit reflection loop — Evaluate before ship; improve only when needed.
Structured exit — Score threshold + max rounds is a solid production habit.
Parsed score — SCORE=(\d+) keeps routing deterministic despite messy LLM prose.
LangGraph — Conditional edges make the loop obvious and easy to visualize (graph/reflection_graph.png).
Thin nodes — Each function does one thing: prompt → LLM → update state.
Run it
# Ollama: pull models used in the script
ollama pull mistral
ollama pull gpt-oss:20b
ollama pull llama3.2
python Practice_Reflection_QA.py
Enter an article title; type /bye to exit. Uncomment render_store_graph(wf_app) to export the graph PNG.
Code: Practice_Reflection_QA.py
