Testing runs your solution, evaluation, and cleanup code against a real AWS account. It creates and deletes real resources in your own account. Before testing, review_project must pass.
Testing granularity
You can test at different levels of granularity depending on how much confidence you have in the code.
| Request | What runs |
| --- | --- |
| test solution 01.02 | Solution code for step 01.02 only |
| test evaluation 01.02 | Evaluation code for step 01.02 only |
| test step 01.02 | Solution then evaluation for step 01.02 |
| test objective 02 | Solution then evaluation for all steps in objective 02 |
| cleanup objective 02 | Cleanup code for objective 02 |
| cycle test objective 02 | Test objective 02, clean it up, test it again |
| test project solutions | Solution code for all objectives in sequence |
| test project | Solution then evaluation for all objectives |
| test and cleanup project | Solution, evaluation, then cleanup for all objectives |
| test next | The next untested step based on session state |
Start with individual steps when testing for the first time. Once you’re confident the code is solid, run at objective or project scope.
Resource flow
Each test run follows this sequence:
Solution code           Test runner                     Evaluation code
─────────────           ───────────                     ───────────────
Creates AWS resources   Receives resources list         Receives generated handles
Returns {name,          Creates generated                 as events (type + id only)
  type, id}               handles in session            Receives resolved handles
                                                          (name + type + id)
                                                        Validates configuration
                                                        Returns {name, type, id, status}
                        Checks IDs:
                          match → promote to resolved
                          mismatch → hard fail
                        Next step's solution
                          gets all handles as
                          context["resolved"]
Generated handles are created from the solution’s return value. They are unvalidated — the test runner knows the resource exists but hasn’t confirmed it’s correctly configured.
Resolved handles are handles that have passed evaluation. They are available to subsequent steps via context["resolved"].
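As a rough illustration of how a later step might consume resolved handles, the sketch below looks up a handle by name in context["resolved"]. The list-of-dicts shape and the helper name find_resolved are assumptions made for this example; the text above only states that resolved handles carry a name, type, and id.

```python
def find_resolved(context: dict, name: str) -> dict:
    """Return the resolved handle with the given name, or raise KeyError.

    Assumes context["resolved"] is a list of {"name", "type", "id"} dicts.
    """
    for handle in context.get("resolved", []):
        if handle["name"] == name:
            return handle
    raise KeyError(f"no resolved handle named {name!r}")


# Example context as a later step's solution might receive it.
context = {
    "resolved": [
        {
            "name": "01_01_bucket",
            "type": "AWS::S3::Bucket",
            "id": "my-bucket-a3f9c2d1",
        },
    ]
}

bucket = find_resolved(context, "01_01_bucket")
print(bucket["id"])
```

Raising on a missing name, rather than returning None, makes a step fail early when it depends on a handle that was never promoted.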
Evaluation code receives two inputs:
events: the current step's generated handles as unqualified events (type + id, no name)
resolved: all previously validated handles from prior steps (name + type + id)
If evaluation returns a handle in its validated list with status: "found", the test runner promotes it from generated to resolved. If the ID in the evaluation result doesn’t match the generated handle’s ID, the runner hard-fails — this indicates the evaluation code found a different resource than the solution created.
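The promotion rule can be sketched as follows. This is not the test runner's actual implementation: the function name promote, the HandleMismatch exception, and matching evaluation results to generated handles by name are all assumptions made for illustration.

```python
class HandleMismatch(Exception):
    """Evaluation reported a different ID than the solution created."""


def promote(generated: list, validated: list) -> list:
    """Return the generated handles that should be promoted to resolved.

    generated: handles from the solution's return value (name, type, id)
    validated: evaluation results (name, type, id, status)
    """
    by_name = {h["name"]: h for h in generated}
    promoted = []
    for result in validated:
        if result.get("status") != "found":
            continue  # not validated; the handle stays generated
        handle = by_name.get(result["name"])
        if handle is None:
            continue  # assumption: names the solution never returned are ignored
        if handle["id"] != result["id"]:
            # Hard fail: evaluation found a different resource.
            raise HandleMismatch(
                f"{result['name']}: solution created {handle['id']!r}, "
                f"evaluation found {result['id']!r}"
            )
        promoted.append(handle)
    return promoted


generated = [
    {"name": "01_01_bucket", "type": "AWS::S3::Bucket", "id": "my-bucket-a3f9c2d1"},
]
validated = [
    {"name": "01_01_bucket", "type": "AWS::S3::Bucket",
     "id": "my-bucket-a3f9c2d1", "status": "found"},
]
print(promote(generated, validated))
```

If the evaluation result carried a different id, promote would raise HandleMismatch instead of returning, which mirrors the hard-fail behavior described above.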
Cycle tests
A cycle test runs an objective, cleans it up, then runs it again. This verifies two things: that the code works correctly, and that cleanup leaves the account in a state where the objective can be run again cleanly.
Cycle test sequence for objective 02:
test_solution scope=objective 02
test_evaluation scope=objective 02
test_cleanup scope=objective 02
test_solution scope=objective 02
test_evaluation scope=objective 02
If cleanup leaves orphaned resources, the second run will typically fail with a naming collision or dependency conflict — which is exactly what you want to catch before publishing.
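The sequence above can be sketched as a loop that stops at the first failing phase. The run_phase callback and the fake runner are hypothetical stand-ins for the real test tools; they only simulate how an incomplete cleanup surfaces as a failure on the second solution run.

```python
from typing import Callable, List, Tuple

# Phase order taken from the cycle test sequence above.
CYCLE_PHASES = ["solution", "evaluation", "cleanup", "solution", "evaluation"]


def run_cycle(run_phase: Callable[[str], bool]) -> List[Tuple[str, bool]]:
    """Run each phase in order, stopping at the first failure."""
    results = []
    for phase in CYCLE_PHASES:
        ok = run_phase(phase)
        results.append((phase, ok))
        if not ok:
            break
    return results


def make_fake_runner(cleanup_is_complete: bool) -> Callable[[str], bool]:
    """Simulate an account in which the solution creates one named resource."""
    existing = set()

    def run_phase(phase: str) -> bool:
        if phase == "solution":
            if "bucket" in existing:
                return False  # naming collision with an orphaned resource
            existing.add("bucket")
            return True
        if phase == "cleanup":
            if cleanup_is_complete:
                existing.clear()
            return True  # cleanup reports success even if it left orphans
        return True  # evaluation always passes in this simulation

    return run_phase


print(run_cycle(make_fake_runner(cleanup_is_complete=True)))
print(run_cycle(make_fake_runner(cleanup_is_complete=False)))
```

With a complete cleanup all five phases succeed; with an incomplete one the run ends at the second solution phase.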
Session state
The test runner maintains a session in .genlabz/test-session.json (project-local, gitignored). The session tracks:
Resource handles accumulated across steps (both generated and resolved)
Which steps have been tested (solution and evaluation)
Results for each step (success/failure, messages, timestamps)
Cleanup results per objective
The session enables resumability — if a test run is interrupted, you can pick up where you left off. Call get_test_session to inspect the current state.
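To make resumability concrete, here is a minimal sketch of picking the next untested step from session data. The real schema of .genlabz/test-session.json is not documented here, so the "steps" mapping, its "solution" and "evaluation" keys, and the "success" value are hypothetical.

```python
from typing import List, Optional


def next_untested(session: dict, step_order: List[str]) -> Optional[str]:
    """Return the first step whose solution or evaluation has not succeeded."""
    steps = session.get("steps", {})
    for step_id in step_order:
        state = steps.get(step_id, {})
        done = (
            state.get("solution") == "success"
            and state.get("evaluation") == "success"
        )
        if not done:
            return step_id
    return None  # every step has been tested successfully


# Hypothetical session contents after an interrupted run.
session = {
    "steps": {
        "01.01": {"solution": "success", "evaluation": "success"},
        "01.02": {"solution": "success"},  # evaluation never ran
    }
}

print(next_untested(session, ["01.01", "01.02", "02.01"]))
```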
When some objectives are already deployed from a prior session and a project-level test is requested, first confirm whether to start from the next untested objective or roll back the deployed objectives.
Failure handling
If any solution, evaluation, or cleanup script fails, stop immediately and do not run any further test tools. The test runner returns structured output including the error, logs, and handle state.
A typical failure summary looks like:
solution failed at step 01.02
Error: BucketAlreadyExists
Logs:
Failed to create bucket: An error occurred (BucketAlreadyExists)...
| Handle | Type | ID | State |
| --- | --- | --- | --- |
| 01_01_bucket | AWS::S3::Bucket | my-bucket-a3f9c2d1 | resolved |
| 01_02_policy | AWS::S3::BucketPolicy | — | not_found |
From here you have three options:
Investigate — read the failing code file, understand the error, fix it
Fix and re-run — edit the code, then re-run test solution 01.02 or test evaluation 01.02
Start objective over — run cleanup objective 01, then re-run from the beginning
Presenting results
After every test_* tool call, call get_test_summary scope=last and display the output verbatim. The test tools return a slim JSON response with success/failure status; get_test_summary returns the full formatted markdown with tables, logs, and handles.
To see the full project status at any point, call get_test_summary scope=session.
AWS credentials
The test runner resolves credentials in this order:
Explicit profile and region parameters on the tool call
Project config in config.local.toml under [test]
Default AWS credential chain
Account ID is verified via sts:GetCallerIdentity at session creation and logged so you know which account resources are being created in.
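The resolution order can be sketched as below. The key names profile and region under [test], and resolving each setting independently, are assumptions; the function names are hypothetical. The account check uses boto3's Session and the STS get_caller_identity call, and is imported lazily so the resolution logic runs without boto3 installed.

```python
def resolve_aws_settings(explicit: dict, project_config: dict) -> dict:
    """Pick profile and region: explicit parameters, then [test] config.

    A value of None means "fall through to the default AWS credential chain".
    """
    test_cfg = project_config.get("test", {})
    return {
        key: explicit.get(key) or test_cfg.get(key)
        for key in ("profile", "region")
    }


def verify_account(settings: dict) -> str:
    """Return the account ID that the resolved credentials belong to."""
    import boto3  # lazy import; only needed when actually calling AWS

    session = boto3.Session(
        profile_name=settings["profile"],  # None uses the default chain
        region_name=settings["region"],
    )
    return session.client("sts").get_caller_identity()["Account"]


# The explicit region overrides the config; the profile falls back to the config.
settings = resolve_aws_settings(
    explicit={"region": "eu-west-1"},
    project_config={"test": {"profile": "labs", "region": "us-east-1"}},
)
print(settings)
```

Calling verify_account(settings) requires boto3 and working credentials, so it is left out of the example run.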
What’s next
After all tests pass, the project is ready for IAM policy generation and publishing preparation.
For the full tool signatures and parameter reference, see MCP tools.