AI prompt frameworks that are actually useful

A practical comparison of GSD, Grill Me, Grill with Docs, ReAct, and Chain of Draft.

Introduction

AI prompt frameworks are easier to name than to compare well.

Terms like GSD, Grill Me, Grill with Docs, ReAct, and Chain of Draft often end up in the same conversation, but they are not all trying to do the same job. Some are execution workflows. Some are clarification patterns. Some are ways to ground reasoning. Some are simply leaner ways to think through a task.

That distinction matters more than the label.

GSD is trying to solve context rot in long-running coding sessions. Grill Me is trying to expose hidden assumptions before code is written. Grill with Docs is trying to align the model to the project’s actual language and documented decisions. ReAct is trying to ground reasoning through actions and observations. Chain of Draft is trying to cut reasoning cost and latency without throwing reasoning away.

Seen that way, the useful question is not which framework is best in the abstract. It is which one matches the failure mode you are actually dealing with.

TL;DR: if you only want the short version, jump to the comparison table.

What follows is the comparison I find most useful in practice: where each framework fits, what it costs, where it fails, and which one seems underrated.

A Practical Way To Compare Them

For practical work, I think these frameworks are worth comparing on seven axes:

  • applicability area
  • context strategy
  • token efficiency
  • setup cost
  • durability across sessions
  • main strength
  • main failure mode

That lens makes the tradeoffs easier to see. There is no single winner, because the frameworks are solving different problems.

The mismatch usually shows up quickly in practice. If you reach for Grill Me when you really need GSD, you may get good clarification without enough execution structure. If you reach for GSD when the real issue is terse reasoning, you may introduce more process than the task needs. If you use ReAct for a design problem that is mostly internal, you may pay for a tool-using loop without getting much back from it.

GSD

GSD, short for Get Shit Done, is the most system-oriented framework in this list.

It is not just a good prompt. It is a structured development workflow built around meta-prompting, spec-first execution, context engineering, and verification. The sources around GSD are explicit about the problem it is trying to solve: long AI coding sessions decay. Context gets crowded, early architectural choices fall out of the window, and the model becomes progressively less reliable.

What makes GSD interesting is that it treats this as an engineering problem, not a personality problem. It does not just tell the model to “stay focused.” It introduces staged workflows, scoped prompt builders, context budgets, token profiles, tool policies, and summary artifacts. The DeepWiki material is especially useful here because it shows that GSD is managing prompt assembly deliberately: what gets inlined, what gets compressed, where truncation happens, and even how prompt sections are ordered for cache hits.
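To make the idea of deliberate prompt assembly concrete, here is a minimal sketch. It is not GSD's actual implementation: the `Section` type, the `assemble_prompt` function, and the four-characters-per-token estimate are all assumptions made for illustration. It shows three of the moves described above: ordering stable sections first so a cache can reuse the shared prefix, inlining what fits, and truncating what does not.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    text: str
    stable: bool  # True if the text rarely changes between calls


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token.
    return max(1, len(text) // 4)


def assemble_prompt(sections: list[Section], budget: int) -> str:
    # Stable sections go first so consecutive prompts share a prefix,
    # which is what prefix-based prompt caches can reuse.
    ordered = sorted(sections, key=lambda s: not s.stable)
    parts = []
    remaining = budget
    for section in ordered:
        if remaining <= 0:
            break  # budget spent: later sections are dropped
        text = section.text
        if estimate_tokens(text) > remaining:
            text = text[: remaining * 4]  # truncate to fit the budget
        parts.append(f"## {section.name}\n{text}")
        remaining -= estimate_tokens(text)  # section headers are not counted
    return "\n\n".join(parts)


# Hypothetical sections, chosen only to exercise the three behaviours.
sections = [
    Section("task", "Fix the login bug", stable=False),
    Section("rules", "Always write tests", stable=True),
    Section("log", "x" * 400, stable=False),
]
prompt = assemble_prompt(sections, budget=20)
print(prompt)
```

With a budget of 20 estimated tokens, the stable `rules` section is placed first, `task` is inlined whole, and the 400-character `log` is cut to the 48 characters that still fit. A real system would likely summarise rather than truncate blindly, but the budgeting logic has the same shape.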

For longer, messier coding efforts, that makes GSD a serious option. If the task is large enough that context management becomes the bottleneck, the extra setup and process can be justified.

The tradeoff is that GSD is heavier than the other entries. For small features, quick scripts, or changes where a normal back-and-forth would be enough, it can add structure that the task does not really need. It also widens the operational surface area: permissions, orchestration, multiple stages, and more moving parts to trust.

My take: GSD is strongest when session length and project size are the real problem. If context rot is not hurting you yet, it may be more framework than you need.

Grill Me

Grill Me is interesting because it is almost the opposite of GSD in terms of weight.

The core idea is simple: before implementation, ask the agent to challenge your plan instead of executing it. The agent interrogates the design one question at a time, proposes a recommended answer for each question, and resolves anything it can directly from the codebase instead of bothering you.
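The control flow of that interrogation can be sketched in a few lines. This is an illustration under assumptions, not the published skill: in real use the model generates the questions and searches the repository itself, whereas here `Question`, `grill`, the `known` lookup table, and `fake_user` are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Question:
    text: str
    recommended: str  # the agent's proposed answer


def grill(
    questions: list[Question],
    lookup_in_codebase: Callable[[str], Optional[str]],
    ask_user: Callable[[str], str],
) -> dict[str, str]:
    decisions = {}
    for question in questions:  # one question at a time
        found = lookup_in_codebase(question.text)
        if found is not None:
            decisions[question.text] = found  # resolved without the user
            continue
        reply = ask_user(f"{question.text} (recommended: {question.recommended})")
        # An empty reply accepts the recommendation.
        decisions[question.text] = reply.strip() or question.recommended
    return decisions


# Hypothetical stand-ins for a codebase search and a human reply.
known = {"Which database do we use?": "PostgreSQL (from settings.py)"}
questions = [
    Question("Which database do we use?", "PostgreSQL"),
    Question("Should deletes be soft or hard?", "Soft delete"),
]
asked = []


def fake_user(prompt: str) -> str:
    asked.append(prompt)
    return ""  # accept whatever was recommended


decisions = grill(questions, known.get, fake_user)
print(decisions)
```

Only the second question reaches the user, because the first is answered from the (simulated) codebase. The returned dictionary is the raw material for a written plan.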

One reason this works well is that it changes the timing of the conversation. In the usual flow, you describe a plan, the model starts coding, and only later do you discover the real ambiguity. Grill Me moves that ambiguity earlier. It spends time on the decision tree before code exists.

Its setup cost is close to zero, which is one reason it has spread so quickly. You do not need a whole system or a repository convention. You just need the right interrogation behavior.

That does not make it free. Grill Me often spends more tokens up front than a direct implementation prompt would. But that is not the most useful way to measure it. On an ambiguous task, the real comparison is not Grill Me versus one short prompt. It is Grill Me versus one short prompt plus the repair cycle that follows a half-baked implementation.

It is not always the right fit. If the problem is already constrained by external rules, existing APIs, or an obvious implementation path, interrogation can produce a lot of confirmation without much discovery. It also loses much of its value if you do not turn the result into something durable, such as a written plan, PRD, or issue breakdown.

My take: Grill Me is one of the highest-leverage low-setup patterns in this space. It is especially good just before you would otherwise hand a feature to an agent and hope for the best.

Grill With Docs

Grill with Docs is a useful extension of Grill Me because it addresses a different failure mode.

Not every bad AI answer is caused by lack of reasoning. Sometimes the problem is vocabulary. The model uses generic terms like item, handler, or data, while the project actually revolves around domain concepts with specific meanings. At that point the model is not merely imprecise. It is thinking in the wrong language.

That is what Grill with Docs is trying to fix.

The workflow adds a domain glossary in CONTEXT.md, optionally supported by ADRs for high-value decisions. During the conversation, the agent is supposed to challenge fuzzy wording, call out conflicts with the glossary, cross-check claims against the code, and update documentation inline when decisions solidify. The public skill definition makes that operating model clearer than the name alone.
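The "challenge fuzzy wording" step can be approximated mechanically, which helps show what the agent is being asked to do. The sketch below is an assumption-heavy illustration: the `- **Term**: definition` glossary format, the `GENERIC_TERMS` set, and both function names are hypothetical, and the real skill relies on the model's judgment rather than a word list.

```python
import re

# Generic words taken from the examples earlier in this post.
GENERIC_TERMS = {"item", "handler", "data"}


def parse_glossary(markdown: str) -> dict[str, str]:
    # Assumed CONTEXT.md entry format: "- **Term**: definition"
    pattern = re.compile(r"^- \*\*(.+?)\*\*: (.+)$", re.MULTILINE)
    return {term.lower(): definition for term, definition in pattern.findall(markdown)}


def review_wording(statement: str, glossary: dict[str, str]) -> list[str]:
    words = set(re.findall(r"[a-z]+", statement.lower()))
    notes = []
    for word in sorted(words & GENERIC_TERMS):
        if word not in glossary:  # a generic word the project never defined
            notes.append(f"'{word}' is generic; which glossary term do you mean?")
    return notes


# Hypothetical glossary content for a shipping domain.
context_md = """\
- **Shipment**: A group of parcels sent together.
- **Parcel**: One physical box with a tracking code.
"""
glossary = parse_glossary(context_md)
fuzzy = review_wording("Add a handler that updates the item data", glossary)
precise = review_wording("Mark the shipment as delivered", glossary)
print(fuzzy)
print(precise)
```

The first statement draws three challenges, one per generic word, while the second passes because it already uses project vocabulary.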

That changes the economic profile of the workflow. The upfront setup cost is higher than Grill Me, because some documentation discipline is required. But the recurring cost can be lower in a real project because each new session no longer needs to rebuild the same domain alignment from scratch. One of the most convincing claims in the source material is that this can make the model think and answer more tersely, not because the prompt is shorter, but because the model no longer wastes tokens inventing local terminology.

That makes Grill with Docs especially relevant in brownfield work. If the repo has business language, long-lived decisions, or multiple bounded contexts, it can be more compelling than plain Grill Me.

The risks are mostly cultural. If CONTEXT.md becomes a dumping ground for implementation detail, the system gets worse, not better. If every decision becomes an ADR, the docs become noise. And if the project is tiny or disposable, the overhead is hard to justify.

My take: Grill with Docs is what I would reach for when the repo is real, the language matters, and future sessions need to inherit the same mental model instead of relearning it every time.

ReAct

ReAct is older and more research-shaped than the other named workflows here, but it still belongs in the conversation because it solves a problem that the prompt-skill workflows do not. The original paper is still the cleanest reference for the pattern.

The pattern interleaves reasoning and action. Instead of asking the model to think everything through internally and then answer, it lets the model reason, take an action, observe the result, and continue. That matters when the model needs grounding from an external environment, a search result, a tool, or a knowledge source.
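The loop itself is small. The sketch below shows its shape with a scripted policy standing in for the model call. `react_loop`, `scripted_policy`, the `lookup` tool, and the `FACTS` table are all hypothetical names for illustration, not an API from the paper or any library.

```python
from typing import Callable, Optional

# Hypothetical environment the agent can observe.
FACTS = {"service_status": "degraded"}
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup": lambda key: FACTS.get(key, "not found"),
}


def scripted_policy(question: str, trace: list[str]) -> tuple[str, str, str]:
    # Stand-in for a model call: returns (thought, action, argument).
    observations = [line for line in trace if line.startswith("Observation: ")]
    if not observations:
        return ("I need to check the live status.", "lookup", "service_status")
    answer = observations[-1][len("Observation: "):]
    return ("The observation answers the question.", "finish", answer)


def react_loop(
    question: str,
    policy: Callable[[str, list[str]], tuple[str, str, str]],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> tuple[Optional[str], list[str]]:
    trace: list[str] = []
    for _ in range(max_steps):  # hard cap so the loop cannot run forever
        thought, action, arg = policy(question, trace)
        trace.append(f"Thought: {thought}")
        if action == "finish":
            return arg, trace
        observation = tools[action](arg)  # act, then observe
        trace.append(f"Action: {action}[{arg}]")
        trace.append(f"Observation: {observation}")
    return None, trace  # gave up without converging


answer, trace = react_loop("Is the service healthy?", scripted_policy, TOOLS)
print(answer)
print("\n".join(trace))
```

The policy sees the growing trace on every step, which is how an observation can change the next thought. The `max_steps` cap is a design choice worth keeping in any real version, since nothing else stops a model that keeps acting without converging.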

In that setting, ReAct is often more reliable than a pure reasoning prompt. The model does not have to act as if its internal knowledge is enough. It can look, update, and continue.

That comes with cost. ReAct-style traces are usually not cheap in token terms, and they can become noisy if the model keeps oscillating between thought and action without converging. But when correctness depends on live observation, that overhead can be justified. In those cases, fewer hallucinations can matter more than terser prompting.

For coding work, I would not treat ReAct as a replacement for design clarification or spec-first planning. It fits better when the agent needs to inspect tools, query systems, gather observations, or work in an environment where the next move depends on what the environment says back.

My take: ReAct is still one of the clearest answers to the question, “what should the model do when it does not yet know enough and can go find out?”

The Underrated One: Chain Of Draft

If I had to pick one underrated framework from the research angle, it would be Chain of Draft.

Its contribution is simple and practical: reasoning does not have to be verbose to be useful. Instead of long chain-of-thought traces, the model writes short, essential intermediate notes. The paper reports that this can match or beat chain-of-thought accuracy while using as little as 7.6% of the tokens.
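To get a feel for what that ratio means, here is a back-of-the-envelope calculation. The workload size and price are hypothetical, and 7.6% is the best case quoted above rather than a typical figure, so treat the result as an upper bound on the savings.

```python
COD_TOKEN_RATIO = 0.076  # "as little as 7.6%": a best case, not a typical one


def reasoning_cost(tokens: int, price_per_million: float) -> float:
    return tokens * price_per_million / 1_000_000


# Hypothetical workload and price, chosen only to make the arithmetic concrete.
cot_tokens_per_request = 500
requests = 100_000
price_per_million = 10.0

cot_total = cot_tokens_per_request * requests   # chain-of-thought reasoning tokens
cod_total = round(cot_total * COD_TOKEN_RATIO)  # Chain of Draft at the best-case ratio

cot_cost = reasoning_cost(cot_total, price_per_million)
cod_cost = reasoning_cost(cod_total, price_per_million)
print(f"CoT: {cot_total:,} tokens, cost {cot_cost:.2f}")
print(f"CoD: {cod_total:,} tokens, cost {cod_cost:.2f}")
```

Under these assumptions, 50 million reasoning tokens shrink to 3.8 million. The same ratio applies to the time spent generating those tokens, which is where the latency benefit comes from.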

That matters because token efficiency is not a cosmetic concern anymore. On paid models, verbose reasoning costs money. On interactive tools, it costs latency. On long workflows, it also increases the chance that useful context gets crowded out by the model’s own wordiness.

Chain of Draft is underrated partly because it is not a complete end-to-end workflow like GSD, and partly because it sounds less ambitious than a large orchestration system. But for bounded reasoning tasks it addresses a very real bottleneck directly.

I would not present it as a whole coding workflow. It is better understood as a reasoning style that pairs well with other systems. Use it when the task benefits from structure but not from essay-length explanation. Skip it when you need a durable rationale that humans will later read or audit.

My take: if the last year taught us to take context engineering seriously, the next useful lesson may be to take reasoning brevity seriously too.

So Which One Should You Actually Use?

My short version looks like this:

  • Use Grill Me when the plan is still fuzzy and mistakes are expensive to reverse.
  • Use Grill with Docs when the repo already has domain language and decisions that need to stay consistent across sessions.
  • Use GSD when the task is large enough that context rot, workflow structure, and verification become the main problem.
  • Use ReAct when the agent needs external observations, not just internal speculation.
  • Use Chain of Draft when token cost and latency matter and the reasoning step can stay terse.

I do not think these frameworks make much sense on a single leaderboard.

In practice, they can also stack.

You might use Grill Me to surface the decision tree, turn that into a durable spec, run a GSD-style execution workflow for implementation, rely on ReAct where tool observations are necessary, and keep parts of the reasoning terse in a Chain of Draft style where verbosity would only burn tokens.

That is closer to how these systems are likely to be used in real work than picking one framework and treating it as the answer to every problem.

Comparison Table

| Framework | Best for | Context strategy | Token profile | Setup cost | Main strength | Main weakness |
|---|---|---|---|---|---|---|
| GSD | Long-running coding work | Specs, staged artifacts, scoped prompt assembly, tool policy | Medium to high total cost, but strong budget control | High | Best at fighting context rot in larger tasks | Too much ceremony for small work |
| Grill Me | Pre-implementation plan review | Decision-tree interrogation | Medium upfront, often saves downstream waste | Very low | Excellent at surfacing hidden assumptions | Weak if findings are not captured |
| Grill with Docs | Brownfield or domain-heavy repos | Interrogation plus glossary and ADR alignment | Medium setup, lower recurring onboarding cost | Medium | Strong semantic alignment across sessions | Requires disciplined docs maintenance |
| ReAct | Tool-using and grounded tasks | Interleaved reasoning, actions, observations | Medium to high | Medium | Reduces hallucination by grounding | Can become verbose and loop-heavy |
| Chain of Draft | Token-sensitive reasoning | Minimal intermediate notes | Very low | Very low | Strong cost and latency efficiency | Not a full workflow and weak as durable documentation |

Conclusion

The most useful prompt frameworks are the ones that solve the failure mode you actually have.

If your problem is ambiguity, use a framework that interrogates the plan. If your problem is context rot, use one that structures execution and context budgets. If your problem is semantic drift, use one that aligns the model to project language. If your problem is hallucination under incomplete information, use one that acts and observes. If your problem is token burn, use one that writes less.

That framing may be less tidy than naming one universal winner, but it is more faithful to how these tools behave in real work.

I hope you found this useful. See you in the next one!

This post is licensed under CC BY 4.0 by the author.