A skill library showed up in my feed: gstack, 35 skills for Claude Code built by engineers for engineering workflows. Well-designed. The obvious move was to install it.

Before doing that, I looked at what it required: bun (not installed), Playwright Chromium (a 400MB browser automation binary), and a setup script that compiles browser tools, registers skills globally, and initializes a lessons database. Then I looked at what the skills actually did. /ship automates the full journey from feature branch to pull request. /qa runs browser-based testing across a live web application. /retro generates weekly retrospectives from git commit history.

This site runs on PHP flat files. No test suite. No feature branches. No CI/CD pipeline.

The mismatch was obvious. But when I read those skill files more carefully, something else became clear: the skills were wrong for this workflow, but the design behind them was exactly right.


What's worth stealing

Five patterns appear across gstack's skills, and none of them are specific to code review or browser automation.

Phase-numbered workflows. Every skill breaks its work into explicit, numbered phases. Not a vague instruction like "do X and Y." Explicit steps where Phase 1 must complete before Phase 2 begins. This forces the quality logic to run in sequence. A model that must finish one phase before starting the next cannot collapse steps to appear efficient.

Iron laws. Every skill embeds rules that cannot be argued away. gstack's /investigate skill opens with one: "NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST." Stated in caps. Positioned early. Not a suggestion. Without iron laws, a model that senses the user wants speed will shortcut the parts of the process that slow things down. Those are usually the parts that matter most.

Multi-perspective passes. The /review skill dispatches parallel specialist subagents: one for security, one for performance, one for test coverage. Each runs independently. Then the skill merges and de-duplicates findings. A single review pass misses things a specialist pass catches.

Defined output format. Every skill specifies exactly what the result looks like: the sections, the order, the field names. This is not cosmetic. An undefined output format means the model chooses how to present findings. It will choose in ways optimized for appearing thorough, not for being useful.

Lessons persistence. Skills read from and write to a lessons file that survives context window resets. Each session references what previous sessions discovered. Without this, every session starts from scratch.

None of these are bound to a specific domain. They are design principles. They apply to any workflow.
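To make those patterns concrete, here is a minimal sketch of a skill file skeleton built on them. Nothing below is copied from gstack; the skill name, phase names, law, and output sections are placeholders.

```markdown
# /example-skill (hypothetical skeleton, not a gstack file)

IRON LAW: NEVER SKIP OR REORDER A PHASE.

## Phase 1: Gather
Read the input and the lessons file. Do not propose changes yet.

## Phase 2: Analyze
Run each specialist pass independently and record its findings separately.

## Phase 3: Report
Merge and de-duplicate the findings, then output them in the format below.

## Output format
- Summary: one sentence.
- Findings: for each, quote the original, name the problem, show the fix.
- Lessons: anything worth appending to the lessons file.
```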


Start with quality gates, not skills

Before writing a single skill, map where quality decisions happen in your workflow.

Not what you do. Where quality fails. Every workflow has a handful of moments where the wrong call creates work downstream. For a blogging workflow, those moments are:

Before drafting: is the claim specific enough to write around? Posts that start without a sharp central claim produce drafts that meander and require structural rework. That is a quality gate.

During review: does this read like the person who wrote it, or like AI assistance that smoothed over the author's voice? That is a quality gate.

Before publishing: are the mechanical rules clean? The em dashes, the passive constructions, the banned words that signal AI-generated text to any reader paying attention. That is a quality gate.

After publishing: what did this session reveal about what works? That is a quality gate most people skip entirely, which is why they rediscover the same patterns session after session.

Each gate is a candidate for a skill. The skill encodes what "good" looks like at that stage, what failure looks like, and what to do about it.


Separating intent from context

The hardest design work is separating what you want the skill to accomplish from the specific context of the thing you are working on.

gstack's /review reviews code. But its intent is separable from code review entirely: run multiple specialist perspectives against a defined quality standard, surface specific findings.

The /blog-review skill we built runs three passes: a mechanical AI-tell scan, a voice assessment, and an argument quality check. Each pass has a defined quality standard. Each produces specific, line-level findings. The intent is identical to gstack's /review. The context is completely different.

This is the actual skill design work. It is also where most people fail. They describe their current task instead of encoding the quality logic that would apply to any version of that task. The result is a skill that works once and then confuses the model on every subsequent run because the context has shifted.

The context feeds in through arguments. The intent is baked in. Get that separation right and the skill works on any instance of the problem, not just the one you wrote it for.
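In Claude Code terms, one way that separation can land in a file is a slash-command-style argument placeholder for the draft, with the intent spelled out around it. The sketch below is hypothetical, not the actual /blog-review file, and the file path in the usage line is invented.

```markdown
# /blog-review (hypothetical sketch of the intent/context split)

Intent, baked into the file:
  Run three passes over the draft: a mechanical AI-tell scan, a voice
  assessment, and an argument quality check. Report line-level findings.

Context, supplied at run time:
  Draft to review: $ARGUMENTS
  Usage: /blog-review drafts/skill-design.md
```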


Sharpening a skill

Numbered phases are the guardrail. A model that must complete Phase 1 before Phase 2 cannot skip the slow parts to appear efficient.

Iron laws prevent softening. Not "try to avoid X" but "never do X." The /blog-review skill has one: "Do not skip a single banned item to be polite. If the draft is clean, say so explicitly. Silence is not a pass." Without that law, a model reviewing a draft it senses the user is proud of will pull its punches.

A defined output format matters more than it looks. The skill specifies sections, order, and the structure of each finding. A finding that quotes the original text and shows the rewrite is useful. "The voice could be stronger in places" is not a finding.

Explicit scope constraints prevent drift. The /blog-polish skill says: "Do not rewrite sentences that are clean. Fix what's wrong, leave what's right." Without that constraint, a polishing skill becomes a rewriting skill. Those are different jobs.
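As a sketch of what a defined output format can pin down, a per-finding template might look like this. The field names are illustrative, not the actual /blog-review spec.

```markdown
## Finding format (illustrative)

For every finding, output:
- Location: the section or paragraph it appears in
- Original: the exact sentence, quoted
- Problem: one sentence naming the issue (AI tell, voice drift, weak claim)
- Rewrite: the proposed replacement text

If a pass finds nothing, say so explicitly. Do not pad.
```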


The compounding layer

The /blog-learn skill is the one that earns its value slowly.

Every post that lands well, every review that catches a recurring mistake, every reader response that reveals how people actually read. These are data points. Without a place to put them, each session starts from scratch. With a lessons file that future skill runs read automatically, each session calibrates on what previous sessions discovered.

A blogging workflow without a lessons loop will keep rediscovering the same patterns across different posts. The loop is cheap to set up. The value accumulates per session.
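The file itself can stay plain. Here is a sketch of what entries might look like; the dates and lessons are invented for illustration.

```markdown
# lessons.md (hypothetical entries)

- 2025-06-02 /blog-review: openers that restate the title get cut; lead with the claim instead.
- 2025-06-16 /blog-learn: posts anchored in one concrete failure draw more replies than abstract ones.
- 2025-07-01 /blog-polish: passive constructions cluster in the final third of drafts; check there first.
```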


The temptation with any proven framework is to adopt it wholesale. gstack is well-built. Installing it would have taken ten minutes.

But 32 of those 35 skills would have been noise. The three relevant ones would have been designed for someone else's quality gates, not these.

The better move was to read the skill files as design examples, extract the patterns, map the workflow's actual checkpoints, and build four skills that encode those specific gates.

Writing the files took about an hour. The design thinking took longer. That ratio is right. The files are just where you put the thinking.