
building relay infrastructure at scale: architecture and learnings

By Anirudh Makhana · Published April 24, 2026 · 11 min read · Source: Blockchain Tag

what does it actually take to abstract gas away from users at scale? here’s the architecture behind it and what i learned building it.

background

this is a personal writeup documenting the architecture of a relay infrastructure system i worked on at a web3 infrastructure company. the goal is twofold: to document how the system was designed, and to capture what i learned about building backend software at scale. some implementation details are fuzzy or missing from memory, and i’ve noted those explicitly.

what is relay?

a relay is meta-transaction infrastructure that allows developers to submit transactions on behalf of their users without those users needing to hold native gas tokens. the system i worked on supported several transaction types, including sponsoredCall, which the end-to-end flow section walks through.

the system was built as a microservices architecture in nodejs/express, with services communicating over rabbitmq (inter-service) and bullmq/redis (intra-service). each service had a clear, narrow responsibility.

services overview

1. api-gateway

the entry point for all incoming requests. every relay call from an external developer first hit the api-gateway before being routed anywhere.

responsibilities: validating api keys, enforcing per-key rate limits, and routing each request to the correct downstream service.

learning: putting cross-cutting concerns like auth, rate limiting, and routing in a single gateway layer is a clean pattern at scale. it means downstream services don’t each need to reimplement auth logic, and you have one place to enforce access control policy. this is the api gateway pattern, and it maps closely to what tools like aws api gateway or kong do in production systems.
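a minimal sketch of those gateway checks, framework-free so the logic is visible. the route table, key format, and rate limit are hypothetical, not the real configuration:

```javascript
// cross-cutting gateway checks: auth, rate limiting, routing.
// service names and the limit value are illustrative only.
const ROUTES = { relay: 'checker-backend', status: 'status-service' };

function gateway(req, apiKeys, rateCounter, limit = 100) {
  if (!apiKeys.has(req.apiKey)) return { status: 401, error: 'invalid api key' };
  const used = (rateCounter.get(req.apiKey) || 0) + 1;
  rateCounter.set(req.apiKey, used);
  if (used > limit) return { status: 429, error: 'rate limited' };
  const target = ROUTES[req.path];
  if (!target) return { status: 404, error: 'unknown route' };
  return { status: 200, forwardTo: target };
}
```

in the real system this sat in express middleware, but the point is the shape: every downstream service trusts that these three checks already happened.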

2. relay-backend (api + worker)

this service was responsible for validating incoming relay requests and queuing them for execution. it had two components: an api and a worker.

checker api (synchronous validation layer):

when a relay request arrived from the gateway, the checker api ran a series of checks: simulating the transaction, checking the sponsor's gas tank balance, and verifying that the target chain was supported.

if all checks passed, the checker api queued a task to the worker via bullmq (backed by redis).
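the validation flow can be sketched as a short-circuiting pipeline: run checks in order, stop at the first failure, and only produce a queueable task if everything passes. the check names here are illustrative, not the real set:

```javascript
// run checks in order; the first failure wins.
function runChecks(request, checks) {
  for (const check of checks) {
    const result = check(request);
    if (!result.ok) return { queued: false, reason: result.reason };
  }
  return { queued: true, task: { type: 'execute', payload: request } };
}

// example checks mirroring the ones described above (shapes are hypothetical)
const hasBalance = (req) => req.gasTankBalance > 0
  ? { ok: true } : { ok: false, reason: 'insufficient gas tank balance' };
const chainSupported = (req) => req.chainSupported
  ? { ok: true } : { ok: false, reason: 'chain not supported' };
```

the caller gets a specific rejection reason synchronously, while the expensive work happens later in the worker.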

checker worker (async processing layer):

the worker picked up tasks from the bullmq queue and was responsible for handing them off to the executor. the exact internal logic of the worker is fuzzy, but the handoff to the executor happened via a rabbitmq queue, with an http call as a fallback mechanism.

if a task failed to execute, it was retried using bullmq’s built-in retry logic until it hit maxRetry, at which point it was marked as failed.

learning: separating the synchronous validation (api) from the async processing (worker) is a really important pattern. the api can respond quickly to the caller (“your task is queued”) without blocking on execution, and the worker can take its time, handle retries, and deal with downstream failures gracefully. this is essentially the command pattern combined with a job queue. bullmq’s built-in retry with backoff means you don’t have to write retry logic yourself, which is a big deal at scale.
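to make the retry behavior concrete, here is the arithmetic behind exponential backoff and a max-retry cutoff, sketched as plain functions. this mirrors how bullmq's exponential backoff grows the delay, but it is a sketch of the pattern, not bullmq's internals:

```javascript
// exponential backoff: delay doubles each attempt. attempt is 1-indexed,
// base is in milliseconds.
function backoffDelay(attempt, baseMs) {
  return baseMs * 2 ** (attempt - 1);
}

// once attempts exceed maxRetry, the job is marked failed instead of retried
function shouldFail(attempt, maxRetry) {
  return attempt > maxRetry;
}
```

with bullmq you get this by passing `attempts` and a `backoff` option when adding the job, rather than writing the loop yourself.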

learning: the two-queue setup (bullmq internally, rabbitmq externally) is also notable. bullmq over redis is lightweight and fast for intra-service task passing. rabbitmq is more robust for inter-service communication where you need durable messaging, topic-based routing, and decoupled consumers. using the right tool for each layer matters.

3. executor

the executor was the service that actually submitted transactions on-chain. it was the most infrastructure-heavy service in the relay stack.

eoa wallet pool:

the executor managed a pool of funded EOA (externally owned account) wallets per chain. these wallets were the actual signers that broadcast transactions to the network.

the wallets were kept funded automatically using the platform’s own automation infrastructure, which is a nice example of dogfooding: using your own product to maintain your own infrastructure.
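the end-to-end flow later in this post mentions that the executor selected wallets via round robin. a minimal per-chain pool sketch, with placeholder addresses and a shape i'm assuming rather than recalling exactly:

```javascript
// per-chain round-robin wallet selection: spread load evenly so no single
// wallet's nonce sequence becomes a bottleneck.
class WalletPool {
  constructor(walletsByChain) {
    this.wallets = walletsByChain;   // chainId -> [address, ...]
    this.cursor = new Map();         // chainId -> index of next wallet
  }
  next(chainId) {
    const pool = this.wallets[chainId];
    const i = this.cursor.get(chainId) || 0;
    this.cursor.set(chainId, (i + 1) % pool.length);
    return pool[i];
  }
}
```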

nonce management and gas pricing:

the executor handled nonce management internally to avoid nonce collisions across concurrent transactions on the same wallet. it also fetched gas prices from on-chain oracles rather than relying solely on the rpc’s eth_gasPrice. (specific implementation details of the nonce locking mechanism are unclear.)
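since the real locking mechanism is noted above as unclear, here is one common approach to in-process nonce allocation, sketched for illustration: hand out strictly increasing nonces per wallet so concurrent sends never collide. this is an assumption about the pattern, not the actual implementation:

```javascript
// per-wallet nonce allocator: seed once from the chain, then hand out
// sequential nonces without ever asking the rpc again mid-burst.
class NonceManager {
  constructor() { this.next = new Map(); }        // wallet -> next nonce
  seed(wallet, onChainNonce) {                     // e.g. from eth_getTransactionCount
    if (!this.next.has(wallet)) this.next.set(wallet, onChainNonce);
  }
  reserve(wallet) {
    const nonce = this.next.get(wallet);
    this.next.set(wallet, nonce + 1);
    return nonce;
  }
}
```

a production version also has to handle dropped transactions (a reserved nonce that never lands blocks every nonce after it), which is where most of the real complexity lives.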

per-chain fee logic:

one of the more complex parts of the executor was handling fee estimation correctly across different chain types. two chain architectures required special handling:

op stack chains (optimism, base, etc):

transactions on op stack have two fee components: an l2 execution fee (ordinary gas for executing the transaction on the l2) and an l1 data fee (the cost of posting the transaction's calldata to ethereum l1).

op stack exposes a precompile contract called GasPriceOracle at address 0x420000000000000000000000000000000000000F. calling getL1Fee(bytes calldata _data) on it returns the l1 component for a given transaction. the executor had to call this precompile to get the full cost estimate before deciding if a sponsored transaction was within the sponsor's budget.

total fee = l2 execution fee + l1 data fee
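the fee arithmetic above, as code. in production the l1 data fee comes from calling getL1Fee on the GasPriceOracle precompile; here it is passed in directly, and the budget check is a hypothetical shape:

```javascript
// op stack total fee = l2 execution fee + l1 data fee.
// bigint throughout, since wei amounts overflow Number.
function opStackTotalFee(l2GasUsed, l2GasPrice, l1DataFee) {
  const l2ExecutionFee = BigInt(l2GasUsed) * BigInt(l2GasPrice);
  return l2ExecutionFee + BigInt(l1DataFee);
}

// the sponsorship decision the executor had to make before submitting
function withinBudget(totalFee, sponsorBudget) {
  return totalFee <= BigInt(sponsorBudget);
}
```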

arbitrum:

arbitrum’s fee model is more subtle. fees are inherently two-dimensional (l2 computation + l1 calldata cost) but ethereum’s transaction format is one-dimensional (gas_limit * gas_price). arbitrum resolves this by inflating the gas limit field to absorb both dimensions:

gas_limit = l2_gas_used + (l1_calldata_price * l1_calldata_size) / l2_gas_price

the gas price returned by arbitrum’s rpc is just the l2 gas price. the l1 cost is encoded into the gas limit estimate. this creates a counterintuitive behavior: when l2 gas price rises, the reported gas limit actually decreases (because the l1 cost denominator gets larger), while the total fee stays roughly the same.

the practical implication for the executor was that on arbitrum, eth_estimateGas returns a composite gas limit, not a pure computation estimate. a stale estimate from a volatile l2 gas price moment could be wrong by the time the tx was submitted. this required the executor to be aware of which fee model a given chain used.
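the formula above, as code, to make the counterintuitive effect checkable: raising the l2 gas price shrinks the reported gas limit because the l1 cost is folded into it. the numbers in the test are illustrative, not real chain values:

```javascript
// arbitrum's composite gas limit:
// gas_limit = l2_gas_used + (l1_calldata_price * l1_calldata_size) / l2_gas_price
function arbitrumGasLimit(l2GasUsed, l1CalldataPrice, l1CalldataSize, l2GasPrice) {
  const l1Cost = l1CalldataPrice * l1CalldataSize;
  return l2GasUsed + Math.floor(l1Cost / l2GasPrice);
}
```

doubling the l2 gas price halves the l1 term, so the reported limit drops even though the underlying computation (`l2GasUsed`) is unchanged.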

learning: building for multiple chains means building for multiple execution environments, not just multiple rpc endpoints. op stack and arbitrum have meaningfully different fee models, and getting fee estimation wrong on a relay service has direct financial consequences. the executor needed per-chain configuration that told it which fee model to apply. this is a good example of how abstraction breaks down at the edges: “just submit a transaction” is not a single operation when you’re operating across 50+ chains.
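the per-chain configuration can be as simple as a lookup from chain id to fee model, with a sensible default. the chain ids are real; the config shape and model names are hypothetical:

```javascript
// which fee model to apply per chain. anything unlisted falls back to the
// standard single-dimensional model.
const CHAIN_FEE_MODELS = {
  1: 'standard',       // ethereum mainnet
  10: 'op-stack',      // optimism
  8453: 'op-stack',    // base
  42161: 'arbitrum',   // arbitrum one
};

function feeModelFor(chainId) {
  return CHAIN_FEE_MODELS[chainId] || 'standard';
}
```

the value here is that adding chain 51 of 50+ becomes a config change, not a code change, as long as it fits one of the known models.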

4. fee abstraction service (gas tank)

this was the accounting and fee management layer. it handled the financial side of sponsored transactions and was composed of four components. the one that appears in the end-to-end flow below is the executor-listener, which consumes confirmation events and triggers settlement against the sponsor's gas tank balance.

learning: this design is a good example of event-driven accounting. rather than doing synchronous balance updates inline with execution, each stage publishes events and the accounting layer reacts asynchronously. this keeps the execution path fast and lets the accounting layer handle its own consistency guarantees independently.

5. status service (api + listener)

the status service tracked the full lifecycle of every relay transaction and exposed that to users.

status listener:

the listener consumed messages from rabbitmq (topic exchange) published by the executor, checker-backend, and other services. these were status events covering each stage of a transaction's lifecycle, from queuing through execution to final on-chain confirmation.

on receiving a message, the listener wrote the status update to redis first (cache), then persisted it to the database asynchronously. the cache-first write meant users querying for status got a fast response without waiting for a db write.

the listener also managed websocket connections, pushing real-time status updates to connected clients as messages arrived. this avoided users needing to poll.

status api:

exposed two interfaces: a rest endpoint for polling-based status queries (reads from redis cache), and websocket support for push-based real-time updates.

learning: the pattern here is write-through caching with async persistence. redis is fast but volatile, the db is durable but slower. by writing to redis first, you optimize for the common case (user querying status right after submission) and let the db write happen in the background. this is a well-known pattern but seeing it in a production context makes it concrete.
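a sketch of the cache-first write, with in-memory maps standing in for the real redis and database clients. the class and method names are mine, not the service's:

```javascript
// cache-first status writes: the "redis" map is updated synchronously so
// reads see the new status immediately; the "db" write happens off the
// hot path as a background promise.
class StatusStore {
  constructor() { this.cache = new Map(); this.db = new Map(); }
  record(txId, status) {
    this.cache.set(txId, status);             // fast path
    return Promise.resolve().then(() => {     // durable path, async
      this.db.set(txId, status);
    });
  }
  getStatus(txId) { return this.cache.get(txId); } // reads hit the cache
}
```

the tradeoff is a window where a crash loses the cached status before it is persisted; for transaction status (which can be rebuilt from chain state and queue events) that is usually acceptable.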

learning: the dual interface (rest + websocket) on the same service is worth noting. for transaction status specifically, websockets are the right ux because users don’t want to poll every 500ms. but you still need the rest endpoint for one-off queries or clients that don’t support ws. supporting both from the same api layer is a practical production decision.

6. usage tracker

i worked on one specific feature here: post-execution transaction simulation via tenderly. after a transaction was executed on-chain (successful or failed), users could trigger a tenderly simulation of the same transaction to get a detailed execution trace. this was useful for debugging failed or reverted transactions.

the broader internals of this service are outside the scope of what i worked on directly, so i’m keeping this section minimal.

end-to-end flow: sponsoredCall

putting it all together, here is what happened when a developer called sponsoredCall:

  1. request hits the api-gateway, which validates the api key, checks rate limits, and routes to the checker backend
  2. checker api simulates the transaction, checks the sponsor’s gas tank balance, verifies chain support
  3. if checks pass, checker api enqueues a task to the checker worker via bullmq
  4. checker worker picks up the task and hands it off to the executor via rabbitmq (http fallback if queue unavailable)
  5. executor selects a wallet via round robin, estimates gas (applying the correct fee model for the target chain), and submits the transaction on-chain
  6. throughout this process, each stage publishes status events to rabbitmq
  7. status listener consumes these events, writes to redis cache and db, and pushes updates to any connected websocket clients
  8. fee abstraction executor-listener picks up the confirmation event, triggers settlement to deduct the fee from the sponsor’s gas tank balance
  9. user can query status via rest or receive it in real time via websocket

broader learnings

microservices are a tradeoff, not a default answer. yes, having separate services means you can scale the executor independently from the status service, and a bug in the usage tracker doesn’t take down the executor. those are real benefits at scale. but the operational cost is significant. to work on any single feature end-to-end, you often need 5+ services running locally. debugging a bug that spans the checker api, rabbitmq, the executor, and the status listener is genuinely painful. shipping a feature that touches multiple services means coordinating deployments, versioning message contracts, and making sure nothing breaks at the boundaries. a well-structured monorepo can give you a lot of the code organization benefits without that overhead, especially early on. microservices make sense when the scaling and team autonomy arguments are real, not as a default architectural choice.

event-driven architecture is powerful but adds observability requirements. rabbitmq topic exchanges decouple services cleanly, but debugging a flow that spans 5 services and 2 message queues without proper tooling is a nightmare. correlation ids across every log line, queue depth monitoring, and distributed tracing are not optional in this kind of system. they are the only way to answer “why did this transaction get stuck” without spending hours grepping through logs across multiple services.

the financial consequences of bugs are real. a fee estimation bug in the executor on a high-volume chain isn’t just a code error, it’s money lost. this kind of system requires more careful handling of edge cases (chain-specific fee models, gas price volatility, wallet nonce collisions) than a typical web backend. the stakes change how you think about testing and defensive coding.

abstractions break down at the edges. “submit a transaction” sounds simple until you’re doing it across 50+ chains with different fee models, finality assumptions, and rpc behaviors. the executor had to know about each chain’s quirks explicitly. this is a recurring theme in multi-chain infrastructure: the abstraction layer is always thinner than you think.

