
Using AI to Build AI: Lessons from Building an AI Orchestration Layer with GitHub Copilot

By Rajan Patekar · Published April 25, 2026 · 11 min read · Source: Level Up Coding

Most Copilot articles are about using AI to write CRUD screens or unit tests. This one is different. We used Copilot to build the system that decides which AI answers your request, what it costs, and whether there’s a record of it.

The system we built: provider routing, billing tracking, and audit trail — all flowing through one microservice. And Copilot helped write every layer of it.

There’s a question that doesn’t get asked enough in the AI productivity conversation:

What happens when you use an AI coding assistant to build infrastructure that governs other AI systems?

Not a todo app. Not a dashboard. The actual routing layer that sits between your application and your AI providers — the code that says “this use case goes to Gemini, that one goes to OpenAI, here’s the cost per call, here’s the audit trail.”

That’s what my team has been building. And using GitHub Copilot to do it has taught us things about AI-assisted development that you simply don’t encounter on simpler codebases.

What We’re Building

Our team owns the AI integration microservice for an EHS (Environment, Health and Safety) platform. Every AI feature in the application flows through this service. It has three core responsibilities:

Provider routing — mapping each use case to the right AI provider. Not every request goes to the same model. Some use cases are better suited to OpenAI. Others to Gemini. The microservice makes that decision, abstracts it from the rest of the application, and makes it configurable without a code deployment.

Billing tracking — recording the cost of every AI call against the right account. Different features, different cost centers, different billing rules. The numbers need to be accurate. Approximations are not acceptable.

Audit trail — maintaining a complete, immutable record of every AI interaction. Who requested it, which provider handled it, what was sent, what came back, whether it succeeded or failed. This isn’t optional in a compliance-sensitive domain. It needs to work even when the provider call fails.

That’s the system. Now here’s what it’s like to build it with Copilot.

Where Copilot Was Genuinely Excellent

Let me start with the honest wins, because there were real ones.

Provider SDK integration code: Each AI provider has its own SDK, its own request/response structure, its own authentication pattern. Copilot is extremely good at this category of work. It has seen thousands of SDK integration examples in training data, and the pattern of “wrap external API in an internal abstraction” is one it handles fluently. The scaffolding for a new provider — the client setup, the request builder, the response mapper — came together faster than it would have manually.

DTO and mapping classes: Every provider has its own request and response schema. Mapping between your internal model and each provider’s expected format is tedious, pattern-heavy work. Exactly the kind of thing Copilot excels at. We accepted suggestions here more freely than anywhere else in the codebase.

Strategy pattern boilerplate: The provider routing logic is implemented as a strategy pattern — each provider implements a common interface, and the router selects the right one at runtime. The structural scaffolding for this is repetitive to write by hand and perfectly suited to Copilot’s strengths.
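As a rough illustration of that shape (every name here is hypothetical, not our actual interface), the scaffolding looks something like this:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Illustrative only: each provider implements one interface, and the
// router selects an implementation at runtime from a plain routing table.
public interface IAiProvider
{
    string Name { get; }
    Task<string> CompleteAsync(string prompt);
}

public sealed class OpenAiProvider : IAiProvider
{
    public string Name => "openai";
    public Task<string> CompleteAsync(string prompt) =>
        Task.FromResult($"[openai] {prompt}"); // real SDK call goes here
}

public sealed class GeminiProvider : IAiProvider
{
    public string Name => "gemini";
    public Task<string> CompleteAsync(string prompt) =>
        Task.FromResult($"[gemini] {prompt}"); // real SDK call goes here
}

public sealed class ProviderRouter
{
    private readonly Dictionary<string, IAiProvider> _byName;
    private readonly Dictionary<string, string> _routes; // use case -> provider name

    public ProviderRouter(IEnumerable<IAiProvider> providers,
                          Dictionary<string, string> routes)
    {
        _byName = providers.ToDictionary(p => p.Name);
        _routes = routes;
    }

    // The routing table is plain data, so it can be loaded from
    // configuration and changed without a code deployment.
    public IAiProvider Resolve(string useCase) => _byName[_routes[useCase]];
}
```

Because the table is data rather than code, adding a use case or swapping its provider is a configuration change, which is exactly the property described above.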

Unit tests for deterministic logic: Billing calculation tests, routing rule tests, configuration validation tests — anywhere the expected output was fully deterministic, Copilot’s test suggestions were solid starting points. It doesn’t know your business rules, but it knows how to structure a test.
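For instance, a deterministic billing calculation and its test might look like this (the rate model and values are invented for illustration):

```csharp
using System;

// Invented rate model for illustration: cost in dollars per 1K tokens.
public static class Billing
{
    public static decimal Cost(int tokensIn, int tokensOut,
                               decimal inRatePer1K, decimal outRatePer1K) =>
        tokensIn / 1000m * inRatePer1K + tokensOut / 1000m * outRatePer1K;
}

public static class BillingTests
{
    // Fully deterministic: exact expected value, no mocks, no I/O.
    // This is the kind of test Copilot structures well once the
    // business rule itself has been decided by a human.
    public static void Cost_IsExactForKnownInputs()
    {
        if (Billing.Cost(2000, 1000, 0.5m, 1.5m) != 2.5m)
            throw new Exception("billing calculation drifted");
    }
}
```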

Where Copilot Struggled — And Why It Matters More Here

This is the section that matters. Because the places where Copilot falls short are manageable on a typical codebase. On this one, they’re load-bearing.

Billing logic edge cases:

Copilot writes billing code that handles the happy path confidently. The request succeeds, the tokens are counted, the cost is calculated, the record is written. Clean, readable, looks correct.

What it doesn’t handle — unless you explicitly ask, and sometimes not even then:

- The call that times out after input tokens have already been consumed. Is that billable?
- The partial or streamed response that fails midway, leaving an ambiguous token count.
- The provider error that arrives after the billable work was done.
- The retry that quietly double-counts the same request.

These are not edge cases in the theoretical sense. In a system processing AI requests at scale, they are guaranteed to happen. Copilot’s training data is full of billing code that looks correct. It is not full of billing code that is correct in all failure modes, because that code is less common and less visible.

Every billing-related suggestion we received needed a deliberate review pass asking: what happens when this goes wrong? That question has to be asked by a human. Copilot won’t ask it for you.
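Concretely, that review question pushes the code toward a shape like this sketch, where every outcome produces a billing record, not only success. All names, and the token estimate in the failure branches, are hypothetical:

```csharp
using System;

public enum CallStatus { Succeeded, Timeout, ProviderError }

public sealed record BillingRecord(string Account, int TokensIn, int TokensOut, CallStatus Status);

public static class BilledCall
{
    // Wraps a provider call so that a billing record exists for every
    // outcome (success, timeout, or provider error), not just the happy path.
    public static (BillingRecord Bill, string Result) Execute(
        string account,
        string prompt,
        Func<string, (string Text, int TokensIn, int TokensOut)> call)
    {
        try
        {
            var (text, tokensIn, tokensOut) = call(prompt);
            return (new BillingRecord(account, tokensIn, tokensOut, CallStatus.Succeeded), text);
        }
        catch (TimeoutException)
        {
            // Input tokens were still consumed; record an estimate
            // (a rough 4-chars-per-token heuristic) instead of writing nothing.
            return (new BillingRecord(account, prompt.Length / 4, 0, CallStatus.Timeout), null);
        }
        catch (Exception)
        {
            return (new BillingRecord(account, prompt.Length / 4, 0, CallStatus.ProviderError), null);
        }
    }
}
```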

Audit trail completeness:

The audit trail requirement is non-negotiable: every AI interaction must be recorded, regardless of outcome. If the provider call fails, the failed attempt must still be recorded. If the audit write fails after a successful provider call, that needs to be handled — you can’t just swallow the error.

Copilot consistently suggested audit writes in the happy path only. The pattern it learned from is: call the provider, write the result. It doesn’t know that your requirement says “write the audit record even if the result is an exception.” That requirement lives in your head, your domain knowledge, your compliance context — not in the training data.

We caught several suggestions where the audit write was inside a try block that would silently skip it on failure. Each one required an explicit correction. After the third time, we added a code review rule: any try/catch block in audit-related code requires a reviewer to verify the catch branch also writes to the audit log.
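The rule is easier to see in code. This is an illustrative sketch, not our actual audit implementation: the catch branch also writes to the audit log, then rethrows, so a failed provider call still leaves a record.

```csharp
using System;
using System.Collections.Generic;

public static class AuditedCall
{
    public static string Invoke(Func<string> providerCall, IList<string> auditLog)
    {
        try
        {
            var result = providerCall();
            auditLog.Add($"SUCCESS len={result.Length}");
            return result;
        }
        catch (Exception ex)
        {
            // The failure itself is the auditable event. Record it, then
            // rethrow; never swallow the exception and skip the write.
            auditLog.Add($"FAILURE type={ex.GetType().Name}");
            throw;
        }
    }
}
```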

Provider-specific failure modes:

Each AI provider has its own rate limiting behavior, its own error codes, its own retry semantics. OpenAI’s rate limit response is different from Gemini’s. A 429 from one provider doesn’t mean the same thing as a 429 from another. Retry-after headers vary. Backoff strategies that work well for one provider may be suboptimal for another.

Copilot generalizes. It suggests retry logic that looks reasonable in the abstract. It doesn’t know that provider A has a sliding window rate limit and provider B has a fixed quota reset at the top of the minute. That distinction matters for how you implement backoff, and getting it wrong means either hammering a provider unnecessarily or backing off more than you need to.
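One way to keep that distinction explicit is to make backoff a per-provider policy rather than one generic retry loop. A minimal sketch, assuming the two rate-limit styles described above (the class names are invented):

```csharp
using System;

public interface IBackoffPolicy
{
    TimeSpan NextDelay(int attempt, TimeSpan? retryAfterHeader);
}

// Sliding-window style: honour a Retry-After header if the provider
// sent one, otherwise back off exponentially.
public sealed class SlidingWindowBackoff : IBackoffPolicy
{
    public TimeSpan NextDelay(int attempt, TimeSpan? retryAfterHeader) =>
        retryAfterHeader ?? TimeSpan.FromMilliseconds(250 * Math.Pow(2, attempt));
}

// Fixed-quota style: the attempt number is irrelevant; wait until the
// top of the next minute, when the quota resets.
public sealed class QuotaResetBackoff : IBackoffPolicy
{
    private readonly Func<DateTime> _now; // injected clock, for testability
    public QuotaResetBackoff(Func<DateTime> now) => _now = now;

    public TimeSpan NextDelay(int attempt, TimeSpan? retryAfterHeader)
    {
        var now = _now();
        var nextMinute = now.AddMinutes(1)
                            .AddSeconds(-now.Second)
                            .AddMilliseconds(-now.Millisecond);
        return nextMinute - now;
    }
}
```

Getting this wrong in either direction is costly: exponential backoff against a fixed quota hammers the provider just before the reset, while quota-style waiting against a sliding window sleeps longer than necessary.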

Concurrency in billing counters:

If two requests come in simultaneously for the same billing account, does the counter update atomically? Copilot doesn’t ask this question. It writes the increment logic. It doesn’t write the locking logic unless you explicitly ask for it — and sometimes even when you do, it misses the full scope of what needs protection.

This is the same failure mode we encountered with the AsNoTracking() race condition in our earlier article. Copilot optimizes the fragment it can see. It cannot see the concurrent request arriving 50 milliseconds later.
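A tiny in-memory illustration of the difference (in the real service this is an atomic database update, not a process-local counter):

```csharp
using System.Threading;

public sealed class BillingCounter
{
    private long _totalTokens;

    // Unsafe under concurrency: two threads can read the same value,
    // and one of the two increments is lost. This is the version
    // Copilot tends to suggest.
    public void AddUnsafe(long tokens) => _totalTokens = _totalTokens + tokens;

    // Atomic: Interlocked guarantees the read-modify-write is indivisible.
    public void Add(long tokens) => Interlocked.Add(ref _totalTokens, tokens);

    public long Total => Interlocked.Read(ref _totalTokens);
}
```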

The Governance Paradox

Here’s the thing that kept coming up during our code reviews and that I haven’t seen discussed anywhere else.

We are building a system that audits AI. GitHub Copilot is AI. Copilot helped write the auditing system.

That creates a question worth sitting with: who audits the auditor?

It’s not rhetorical. The audit trail code that Copilot helped write is the code that will record whether AI provider calls are being made correctly and completely. If Copilot introduced a subtle gap in that code — a failure case that doesn’t write the audit record — then the audit trail will look complete when it isn’t. The absence of an audit entry isn’t visible. You can’t audit what was never recorded.

This is why we applied a stricter review standard to the audit and billing components than to any other part of the codebase. Any Copilot suggestion touching those components was reviewed with the full function visible — not a fragment, not a snippet. And any reviewer using Copilot Chat during that review was required to share the entire function context, not just the lines they were curious about.

The principle we landed on: the more load-bearing the code, the less latitude you give the AI. Copilot’s authority in our codebase is inversely proportional to the consequence of it being wrong.

What We Changed After Learning This

Four concrete things came out of building this system with Copilot:

1. We created a “high consequence” code category: Billing logic, audit writes, and provider abstraction interfaces are in this category. Code in this category gets a mandatory second reviewer, and Copilot suggestions in these areas require explicit sign-off that the full function — not just the suggestion — was reviewed.

2. We wrote the provider abstraction interface by hand: The interface that all AI providers implement — the contract that the routing layer depends on — was written by a human, reviewed by the team, and explicitly marked as off-limits for Copilot suggestions. This is the architectural decision that everything else hangs off. It needs to reflect our requirements, not a generalised pattern from Copilot’s training data.

3. We made failure paths first-class in our test requirements: Copilot writes tests for the happy path naturally. We added an explicit checklist item: for every Copilot-suggested test, is there a corresponding test for the failure case? If a billing calculation is tested for a successful response, it must also be tested for a timeout, a partial response, and a provider error. Copilot doesn’t generate these automatically. A human has to ask for them.

4. We wrote inline intent comments on all concurrency-sensitive code: Any code that touches shared state, runs inside a transaction, or depends on atomic operations gets a comment explaining why. Not what the code does — why it does it that way. This protects against future Copilot suggestions (from a reviewer or a developer) that look like optimizations but would break the concurrent behavior.

```csharp
// TransactionScope used intentionally to acquire a row-level lock.
// Do not add AsNoTracking() — it affects locking behaviour under load.
using (var scope = new TransactionScope())
{
    var record = _context.Records
        .FirstOrDefault(r => r.Id == recordId);
    // ...
}
```

The Broader Lesson: Authority Scales With Consequence

Using Copilot to build an AI orchestration layer taught us something that applies well beyond this specific domain.

AI coding assistants are not equally trustworthy across all types of code. The same tool that reliably scaffolds a DTO class is unreliable on billing edge cases. The same tool that writes a solid unit test for a pure function is a liability on audit trail completeness. The difference isn’t the tool’s capability — it’s the consequence of being wrong and the visibility of the failure.

Incorrect DTO mapping fails loudly. A missing audit entry fails silently. Wrong unit test structure gets caught in review. Billing drift under concurrent load surfaces weeks later.

The teams that will use AI coding assistants well in complex, consequence-sensitive domains are the ones that have mapped their codebase by failure visibility — and adjusted Copilot’s authority accordingly.

AI-Assisted development means the human stays in control of the decisions where being wrong is invisible and expensive.

AI-Unsupervised development is what happens when you apply the same level of trust to billing logic as you do to a DTO class.

The irony of using one AI to build the system that governs all the others didn’t escape us. But the discipline it forced on our review process was worth it.

What’s Next

We’re still building. The provider abstraction layer is expanding. New use cases, new providers, new billing rules. Copilot is part of the workflow — it’s genuinely useful, and we’re faster with it than without it.

But we go into every session with a question that we didn’t ask in the early months: what is this code responsible for, and what does it look like if it’s wrong in a way I can’t see?

That question is the line between AI-Assisted and AI-Unsupervised.

It’s also, increasingly, the line between engineering teams that thrive with these tools and teams that get burned by them.

This is part of an ongoing series on AI-Assisted development.


