building relay infrastructure at scale: architecture and learnings
--
what does it actually take to abstract gas away from users at scale? here’s the architecture behind it and what i learned building it.
background
this is a personal writeup documenting the architecture of a relay infrastructure system i worked on at a web3 infrastructure company. the goal is twofold: to document how the system was designed, and to capture what i learned about building backend software at scale. some implementation details are fuzzy or missing from memory, and i’ve noted those explicitly.
what is relay?
a relay is meta-transaction infrastructure that allows developers to submit transactions on behalf of their users without users needing to hold native gas tokens. the system i worked on supported several transaction types:
- sponsoredCall and sponsoredCallERC2771: the sponsor (dapp developer) pays for gas via a prepaid gas tank. the ERC2771 variant adds support for msg.sender context via a trusted forwarder.
- callWithSyncFee and callWithSyncFeeERC2771: the target contract itself pays the relay fee synchronously during execution. the relayer gets paid back inline.
- bundler operations like eth_sendUserOperation: ERC-4337 account abstraction support, where the platform acted as a bundler.
the system was built as a microservices architecture in nodejs/express, with services communicating over rabbitmq (inter-service) and bullmq/redis (intra-service). each service had a clear, narrow responsibility.
services overview
1. api-gateway
the entry point for all incoming requests. every relay call from an external developer first hit the api-gateway before being routed anywhere.
responsibilities:
- api key validation: every request needed a valid api key, validated at this layer before anything else happened
- rate limiting: enforced per api key to prevent abuse
- auth: broader authentication checks
- routing: once a request was validated, the gateway routed it to the appropriate downstream service based on the method.
learning: putting cross-cutting concerns like auth, rate limiting, and routing in a single gateway layer is a clean pattern at scale. it means downstream services don’t each need to reimplement auth logic, and you have one place to enforce access control policy. this is the api gateway pattern, and it maps closely to what tools like aws api gateway or kong do in production systems.
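to make the pattern concrete, here is a minimal sketch of the gateway's three checks in order. the key store, rate limit values, and route table are illustrative stand-ins, not the real implementation (which presumably backed these with a database and a shared rate limiter):

```typescript
type RelayRequest = { apiKey: string; method: string };
type RouteResult = { ok: boolean; target?: string; error?: string };

const validApiKeys = new Set(["key-abc"]); // hypothetical key store
const RATE_LIMIT = 2;                      // requests per window, per key (illustrative)
const counters = new Map<string, number>();

// maps a relay method to the downstream service that handles it
const routes: Record<string, string> = {
  sponsoredCall: "relay-backend",
  callWithSyncFee: "relay-backend",
  eth_sendUserOperation: "bundler",
};

function handle(req: RelayRequest): RouteResult {
  // 1. api key validation happens before anything else
  if (!validApiKeys.has(req.apiKey)) return { ok: false, error: "invalid api key" };
  // 2. rate limiting is enforced per api key
  const used = (counters.get(req.apiKey) ?? 0) + 1;
  counters.set(req.apiKey, used);
  if (used > RATE_LIMIT) return { ok: false, error: "rate limited" };
  // 3. route to the downstream service based on the method
  const target = routes[req.method];
  return target ? { ok: true, target } : { ok: false, error: "unknown method" };
}
```

the ordering matters: rejecting bad keys before touching the rate limiter means unauthenticated traffic never consumes quota.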
2. relay-backend (api + worker)
this service was responsible for validating incoming relay requests and queuing them for execution. it had two components: an api and a worker.
checker api (synchronous validation layer):
when a relay request arrived from the gateway, the checker api ran a series of checks:
- chain support check: is this chain supported by the platform?
- simulation: simulate the transaction using eth_call to check if it would revert. catching reverts before submission saves gas and prevents wasted execution.
- balance check (for sponsoredCall): query the fee abstraction service to verify the sponsor has enough balance in their gas tank to cover the estimated fee.
- contract verification (for callWithSyncFee): validate that the target contract would actually pay back the relayer inline.
if all checks passed, the checker api queued a task to the worker via bullmq (backed by redis).
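the revert simulation is worth sketching. the real service presumably used an ethers/web3 provider; `RpcClient` here is an assumed interface with the client injected so the check is testable in isolation. eth_call either returns the call's return data or throws on revert, so the check reduces to try/catch:

```typescript
type Tx = { to: string; data: string; from?: string };

interface RpcClient {
  // assumed interface: wraps a JSON-RPC eth_call, throws if execution reverts
  call(tx: Tx): Promise<string>;
}

async function wouldRevert(client: RpcClient, tx: Tx): Promise<boolean> {
  try {
    await client.call(tx); // simulate without broadcasting
    return false;
  } catch {
    // a revert during simulation means submitting on-chain would waste gas
    return true;
  }
}
```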
checker worker (async processing layer):
the worker picked up tasks from the bullmq queue and was responsible for handing them off to the executor. the exact internal logic of the worker is fuzzy, but the handoff to the executor happened via a rabbitmq queue, with an http call as a fallback mechanism.
if a task failed to execute, it was retried using bullmq’s built-in retry logic until it hit maxRetry, at which point it was marked as failed.
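bullmq's exponential backoff strategy computes the delay between attempts as base * 2^(attempt - 1); a small sketch of that schedule plus the maxRetry cutoff, with the base delay and retry cap chosen for illustration:

```typescript
// delay = base * 2^(attempt - 1), matching bullmq's "exponential" strategy
function backoffDelayMs(attempt: number, baseMs: number): number {
  if (attempt < 1) throw new Error("attempts are 1-indexed");
  return baseMs * 2 ** (attempt - 1);
}

const MAX_RETRY = 5; // illustrative cap; after this the task is marked failed

function nextAction(attempt: number): { retry: boolean; delayMs?: number } {
  return attempt >= MAX_RETRY
    ? { retry: false } // exhausted: mark as failed
    : { retry: true, delayMs: backoffDelayMs(attempt, 1000) };
}
```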
learning: separating the synchronous validation (api) from the async processing (worker) is a really important pattern. the api can respond quickly to the caller (“your task is queued”) without blocking on execution, and the worker can take its time, handle retries, and deal with downstream failures gracefully. this is essentially the command pattern combined with a job queue. bullmq’s built-in retry with backoff means you don’t have to write retry logic yourself, which is a big deal at scale.
learning: the two-queue setup (bullmq internally, rabbitmq externally) is also notable. bullmq over redis is lightweight and fast for intra-service task passing. rabbitmq is more robust for inter-service communication where you need durable messaging, topic-based routing, and decoupled consumers. using the right tool for each layer matters.
3. executor
the executor was the service that actually submitted transactions on-chain. it was the most infrastructure-heavy service in the relay stack.
eoa wallet pool:
the executor managed a pool of funded EOA (externally owned account) wallets per chain. these wallets were the actual signers that broadcast transactions to the network.
the wallets were kept funded automatically using the platform’s own automation infrastructure, which is a nice example of dogfooding: using your own product to maintain your own infrastructure.
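a sketch of per-chain wallet selection; the end-to-end flow later in this writeup describes round robin, which is what this implements. the pool contents and chain ids are placeholders:

```typescript
const walletPools: Record<number, string[]> = {
  1: ["0xwalletA", "0xwalletB", "0xwalletC"], // chainId -> funded EOAs (placeholders)
};
const cursors = new Map<number, number>();

function pickWallet(chainId: number): string {
  const pool = walletPools[chainId];
  if (!pool?.length) throw new Error(`no wallets for chain ${chainId}`);
  const i = cursors.get(chainId) ?? 0;
  cursors.set(chainId, (i + 1) % pool.length); // advance the cursor, wrapping
  return pool[i];
}
```

round robin spreads load so no single wallet's nonce sequence becomes a throughput bottleneck.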
nonce management and gas pricing:
the executor handled nonce management internally to avoid nonce collisions across concurrent transactions on the same wallet. it also fetched gas prices from on-chain oracles rather than relying solely on the rpc’s eth_gasPrice. (specific implementation details of the nonce locking mechanism are unclear.)
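since the real locking details are unclear, here is just one plausible shape for the nonce logic: an in-memory allocator that hands out strictly increasing nonces per wallet, seeded once from the chain, so concurrent sends from the same wallet can't collide:

```typescript
// hypothetical reconstruction, not the actual implementation
class NonceManager {
  private next = new Map<string, number>();

  // seed from the chain (eth_getTransactionCount), then allocate locally;
  // a stale chain read must never rewind the local counter
  seed(wallet: string, chainNonce: number): void {
    const current = this.next.get(wallet);
    if (current === undefined || chainNonce > current) {
      this.next.set(wallet, chainNonce);
    }
  }

  allocate(wallet: string): number {
    const nonce = this.next.get(wallet);
    if (nonce === undefined) throw new Error(`wallet ${wallet} not seeded`);
    this.next.set(wallet, nonce + 1); // reserve before the tx is broadcast
    return nonce;
  }
}
```

a production version would also need to handle dropped transactions (returning a nonce to the pool) and multi-process coordination, likely via redis rather than process memory.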
per-chain fee logic:
one of the more complex parts of the executor was handling fee estimation correctly across different chain types. two chain architectures required special handling:
op stack chains (optimism, base, etc):
transactions on op stack have two fee components:
- l2 execution fee: gas used * l2 gas price
- l1 data fee: the cost of posting the transaction’s calldata to ethereum mainnet, proportional to the byte size of the serialized tx
op stack exposes a precompile contract called GasPriceOracle at address 0x420000000000000000000000000000000000000F. calling getL1Fee(bytes calldata _data) on it returns the l1 component for a given transaction. the executor had to call this precompile to get the full cost estimate before deciding if a sponsored transaction was within the sponsor's budget.
total fee = l2 execution fee + l1 data fee
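the op stack math above, as bigint arithmetic. the l1 data fee would come from the GasPriceOracle precompile's getL1Fee call; here it is passed in as a parameter:

```typescript
function opStackTotalFee(
  l2GasUsed: bigint,
  l2GasPrice: bigint, // wei
  l1DataFee: bigint,  // wei, as returned by getL1Fee(serializedTx)
): bigint {
  const l2ExecutionFee = l2GasUsed * l2GasPrice;
  return l2ExecutionFee + l1DataFee; // total fee = l2 execution fee + l1 data fee
}
```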
arbitrum:
arbitrum’s fee model is more subtle. fees are inherently two-dimensional (l2 computation + l1 calldata cost) but ethereum’s transaction format is one-dimensional (gas_limit * gas_price). arbitrum resolves this by inflating the gas limit field to absorb both dimensions:
gas_limit = l2_gas_used + (l1_calldata_price * l1_calldata_size) / l2_gas_price

the gas price returned by arbitrum’s rpc is just the l2 gas price. the l1 cost is encoded into the gas limit estimate. this creates a counterintuitive behavior: when the l2 gas price rises, the reported gas limit actually decreases (because the l1 cost term is divided by a larger l2 gas price), while the total fee stays roughly the same.
the practical implication for the executor was that on arbitrum, eth_estimateGas returns a composite gas limit, not a pure computation estimate. a stale estimate from a volatile l2 gas price moment could be wrong by the time the tx was submitted. this required the executor to be aware of which fee model a given chain used.
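the inflation formula in bigint arithmetic, which makes the counterintuitive behavior visible: doubling the l2 gas price shrinks the reported gas limit, while the l1 portion of the fee, recovered as (gas_limit - l2_gas_used) * l2_gas_price, stays exactly constant:

```typescript
function arbitrumGasLimit(
  l2GasUsed: bigint,
  l1CalldataPrice: bigint, // wei per calldata byte
  l1CalldataSize: bigint,  // bytes
  l2GasPrice: bigint,      // wei
): bigint {
  // the l1 cost is folded into the gas limit by dividing out the l2 gas price
  return l2GasUsed + (l1CalldataPrice * l1CalldataSize) / l2GasPrice;
}
```

numbers here are made up for illustration, but the relationship holds for any inputs where the division is exact.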
learning: building for multiple chains means building for multiple execution environments, not just multiple rpc endpoints. op stack and arbitrum have meaningfully different fee models, and getting fee estimation wrong on a relay service has direct financial consequences. the executor needed per-chain configuration that told it which fee model to apply. this is a good example of how abstraction breaks down at the edges: “just submit a transaction” is not a single operation when you’re operating across 50+ chains.
4. fee abstraction service (gas tank)
this was the accounting and fee management layer. it handled the financial side of sponsored transactions and was composed of four components:
- gastank-listener: listened for on-chain deposit events into gas tank contracts. when a sponsor deposited funds, this service picked it up and updated the internal balance record.
- executor-listener: listened for transaction confirmation events from the executor (via rabbitmq). when a sponsored tx was confirmed on-chain, this triggered the settlement process.
- settlement: did the actual fee deduction from the sponsor’s balance. calculated how much gas was used, what the fee was, and decremented the balance accordingly.
- api: exposed balance query endpoints, used by the checker-api to validate sponsor balance before accepting a task.
learning: this design is a good example of event-driven accounting. rather than doing synchronous balance updates inline with execution, each stage publishes events and the accounting layer reacts asynchronously. this keeps the execution path fast and lets the accounting layer handle its own consistency guarantees independently.
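the deposit and settlement steps can be sketched with an in-memory balance table standing in for the real store. the fee inputs mirror what a transaction receipt provides (gas used and effective gas price), though the actual settlement logic presumably handled more edge cases:

```typescript
const balances = new Map<string, bigint>(); // sponsor -> gas tank balance (wei)

// gastank-listener path: an on-chain deposit event credits the sponsor
function recordDeposit(sponsor: string, amount: bigint): void {
  balances.set(sponsor, (balances.get(sponsor) ?? 0n) + amount);
}

// executor-listener path: a confirmed sponsored tx triggers settlement
function settle(sponsor: string, gasUsed: bigint, effectiveGasPrice: bigint): bigint {
  const fee = gasUsed * effectiveGasPrice;
  const balance = balances.get(sponsor) ?? 0n;
  if (balance < fee) throw new Error("gas tank underflow"); // checker should prevent this
  const remaining = balance - fee;
  balances.set(sponsor, remaining);
  return remaining;
}
```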
5. status service (api + listener)
the status service tracked the full lifecycle of every relay transaction and exposed that to users.
status listener:
the listener consumed messages from rabbitmq (topic exchange) published by the executor, checker-backend, and other services. the message types it handled included:
- checkPending: task received, validation in progress
- execPending: task passed checks, queued for execution
- execSuccess: transaction confirmed on-chain
- waitingForConfirmation: transaction submitted, waiting for block confirmation
- cancelled: task was cancelled
- expired: task exceeded its validity window
- notFound: task id not recognized
on receiving a message, the listener wrote the status update to redis first (cache), then persisted it to the database asynchronously. the cache-first write meant users querying for status got a fast response without waiting for a db write.
the listener also managed websocket connections, pushing real-time status updates to connected clients as messages arrived. this avoided users needing to poll.
status api:
exposed two interfaces: a rest endpoint for polling-based status queries (reads from redis cache), and websocket support for push-based real-time updates.
learning: the pattern here is write-through caching with async persistence. redis is fast but volatile, the db is durable but slower. by writing to redis first, you optimize for the common case (user querying status right after submission) and let the db write happen in the background. this is a well-known pattern but seeing it in a production context makes it concrete.
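the write path above can be sketched with maps standing in for redis and the database, so the ordering is visible. persist() returns a promise the caller deliberately doesn't await, mirroring the async db write:

```typescript
const cache = new Map<string, string>(); // stands in for redis
const db = new Map<string, string>();    // stands in for the durable store

async function persist(taskId: string, status: string): Promise<void> {
  db.set(taskId, status); // in reality: an insert/update that can lag behind
}

function onStatusEvent(taskId: string, status: string): void {
  cache.set(taskId, status);    // fast path: readers see this immediately
  void persist(taskId, status); // durable path: fire-and-forget
}

function getStatus(taskId: string): string {
  // cache first, db as fallback (e.g. after a redis eviction)
  return cache.get(taskId) ?? db.get(taskId) ?? "notFound";
}
```

the tradeoff to be aware of: if the process dies between the cache write and the db write, the status update is lost, which is acceptable for ephemeral status but would not be for the accounting layer.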
learning: the dual interface (rest + websocket) on the same service is worth noting. for transaction status specifically, websockets are the right ux because users don’t want to poll every 500ms. but you still need the rest endpoint for one-off queries or clients that don’t support ws. supporting both from the same api layer is a practical production decision.
6. usage tracker
i worked on one specific feature here: post-execution transaction simulation via tenderly. after a transaction was executed on-chain (successful or failed), users could trigger a tenderly simulation of the same transaction to get a detailed execution trace. this was useful for debugging failed or reverted transactions.
the broader internals of this service are outside the scope of what i worked on directly, so i’m keeping this section minimal.
end-to-end flow: sponsoredCall
putting it all together, here is what happened when a developer called sponsoredCall:
- request hits the api-gateway, which validates the api key, checks rate limits, and routes to the checker backend
- checker api simulates the transaction, checks the sponsor’s gas tank balance, verifies chain support
- if checks pass, checker api enqueues a task to the checker worker via bullmq
- checker worker picks up the task and hands it off to the executor via rabbitmq (http fallback if queue unavailable)
- executor selects a wallet via round robin, estimates gas (applying the correct fee model for the target chain), and submits the transaction on-chain
- throughout this process, each stage publishes status events to rabbitmq
- status listener consumes these events, writes to redis cache and db, and pushes updates to any connected websocket clients
- fee abstraction executor-listener picks up the confirmation event, triggers settlement to deduct the fee from the sponsor’s gas tank balance
- user can query status via rest or receive it in real time via websocket
broader learnings
microservices are a tradeoff, not a default answer. yes, having separate services means you can scale the executor independently from the status service, and a bug in the usage tracker doesn’t take down the executor. those are real benefits at scale. but the operational cost is significant. to work on any single feature end-to-end, you often need 5+ services running locally. debugging a bug that spans the checker api, rabbitmq, the executor, and the status listener is genuinely painful. shipping a feature that touches multiple services means coordinating deployments, versioning message contracts, and making sure nothing breaks at the boundaries. a well-structured monorepo can give you a lot of the code organization benefits without that overhead, especially early on. microservices make sense when the scaling and team autonomy arguments are real, not as a default architectural choice.
event-driven architecture is powerful but adds observability requirements. rabbitmq topic exchanges decouple services cleanly, but debugging a flow that spans 5 services and 2 message queues without proper tooling is a nightmare. correlation ids across every log line, queue depth monitoring, and distributed tracing are not optional in this kind of system. they are the only way to answer “why did this transaction get stuck” without spending hours grepping through logs across multiple services.
the financial consequences of bugs are real. a fee estimation bug in the executor on a high-volume chain isn’t just a code error, it’s money lost. this kind of system requires more careful handling of edge cases (chain-specific fee models, gas price volatility, wallet nonce collisions) than a typical web backend. the stakes change how you think about testing and defensive coding.
abstractions break down at the edges. “submit a transaction” sounds simple until you’re doing it across 50+ chains with different fee models, finality assumptions, and rpc behaviors. the executor had to know about each chain’s quirks explicitly. this is a recurring theme in multi-chain infrastructure: the abstraction layer is always thinner than you think.