Every engineer eventually encounters a system that becomes a living catalog of anti-patterns. Recently, I reviewed a microservice responsible for parsing large measurement files and storing the extracted results in a database. The problem itself is straightforward: read files, extract measurements, enrich the data, and persist it. Unfortunately, the implementation demonstrates how a relatively simple pipeline can become fragile, inefficient, and operationally dangerous when basic distributed-systems principles are ignored.
This article walks through several architectural mistakes found in this service and explains what better alternatives would look like.

The Problem the Service Was Supposed to Solve
The pipeline’s intended workflow is simple:
- Large measurement files are uploaded.
- A microservice reads the files.
- The service parses measurements.
- Data is enriched with reference tables.
- Results are stored in ClickHouse for analytics.
The service runs as multiple pods in Kubernetes to scale horizontally. In theory, this architecture should allow parallel processing of files and high ingestion throughput.
In practice, the system fights its own design.
1. Random Sleep Instead of Concurrency Control
The first red flag appears immediately when the service starts processing files. Each worker begins with a random delay measured in seconds. The intention was to reduce the chance that multiple pods would pick the same file simultaneously.
This approach reveals a misunderstanding of concurrency in distributed systems. Random delays are not synchronization mechanisms. They only reduce the probability of collisions; they never eliminate them.
In distributed systems, coordination must be explicit. Common solutions include:
- Distributed locks (Redis, etcd, ZooKeeper)
- Atomic file claiming via metadata storage
- Message queues with consumer groups
- Database row locking or job tables
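As an illustration of the first option, a minimal Redis-based claim might look like the sketch below (assuming the redis-py client; key names and TTL are hypothetical, and production systems should prefer a vetted lock implementation):

```python
import uuid

import redis

r = redis.Redis(host="localhost", port=6379)

def try_claim(filename: str, ttl_seconds: int = 300) -> str | None:
    """Atomically claim a file by creating an expiring lock key.

    Returns a token needed to release the lock, or None if another
    worker already holds it.
    """
    token = str(uuid.uuid4())
    # SET ... NX EX is atomic: only one worker can create the key.
    if r.set(f"lock:{filename}", token, nx=True, ex=ttl_seconds):
        return token
    return None

def release(filename: str, token: str) -> None:
    # Delete only if we still own the lock (compare-and-delete via Lua,
    # so a lock that expired and was re-acquired is never stolen).
    script = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    end
    return 0
    """
    r.eval(script, 1, f"lock:{filename}", token)
```

The TTL matters: if a worker dies mid-job, the lock expires on its own and another worker can take over, which a random sleep can never guarantee.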
Another simple and reliable pattern is to maintain a job table:
- Each file is inserted as a job.
- Workers atomically claim jobs using a state transition (pending → processing).
- A unique constraint or transactional update ensures only one worker processes each file.
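A sketch of the claim step, assuming a relational jobs table and psycopg2-style placeholders (table and column names are hypothetical):

```python
def claim_next_job(conn) -> int | None:
    """Atomically move one pending job to 'processing'.

    The WHERE state = 'pending' guard makes the update a compare-and-swap:
    if two workers race on the same row, only one UPDATE matches it.
    """
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id FROM jobs WHERE state = 'pending' "
            "ORDER BY created_at LIMIT 1"
        )
        row = cur.fetchone()
        if row is None:
            return None  # nothing to do
        job_id = row[0]
        cur.execute(
            "UPDATE jobs SET state = 'processing' "
            "WHERE id = %s AND state = 'pending'",
            (job_id,),
        )
        conn.commit()
        # rowcount == 0 means another worker claimed it first; caller retries.
        return job_id if cur.rowcount == 1 else None
```

On PostgreSQL, the two statements can be collapsed into a single `UPDATE ... WHERE id = (SELECT ... FOR UPDATE SKIP LOCKED)` for the same effect without the retry loop.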
Without deterministic coordination, race conditions are guaranteed eventually.
2. Using NFS as a Coordination Mechanism
Instead of using object storage, the system relies on a shared NFS mount for file management.
NFS can work for simple shared storage, but it is poorly suited for distributed event pipelines. It provides weak guarantees around concurrent file operations and becomes a bottleneck under heavy parallel workloads.
Modern systems typically use object storage, such as:
- Amazon S3
- MinIO
- Google Cloud Storage
- Azure Blob Storage
Object storage provides durability, versioning, event triggers, and scalability that NFS simply cannot match. It also integrates naturally with event-driven architectures.
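For example, with bucket notifications feeding a queue, each upload becomes a message delivered to one worker at a time, and the directory-listing race disappears entirely. A sketch assuming boto3 and the standard S3 event notification format (the queue URL and paths are hypothetical):

```python
import json

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.example.amazonaws.com/123/measurement-uploads"  # hypothetical

def poll_once() -> None:
    # Each message corresponds to one uploaded object; SQS delivers it to
    # one consumer at a time, and the visibility timeout handles retries.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        record = json.loads(msg["Body"])["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        s3.download_file(bucket, key, f"/tmp/{key.rsplit('/', 1)[-1]}")
        # ... parse and ingest ...
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```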
Using NFS for a distributed ingestion pipeline is a common source of race conditions, performance degradation, and operational complexity.
3. File Renaming as a Locking Strategy
The service tries to prevent multiple workers from processing the same file by renaming it. When a pod starts processing, it appends a suffix like .zumbolize to the filename. Once processing finishes, the file is deleted.
This creates several problems:
First, the rename operation itself is not a reliable distributed lock. Multiple pods can still read the file list simultaneously and race to rename it.
Second, deleting the file after processing removes any ability to audit or reprocess data.
Third, there is no traceability. If processing fails halfway through, there is no reliable record of what happened.
A proper ingestion pipeline maintains clear state transitions, such as:
- uploaded
- queued
- processing
- completed
- failed
These states are persisted in a durable store. Files remain immutable artifacts rather than temporary coordination tools.
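A sketch of enforcing those transitions against a durable jobs table (the allowed-transition policy and column names are assumptions):

```python
# Hypothetical policy: which states may follow which.
ALLOWED_TRANSITIONS = {
    "uploaded": {"queued"},
    "queued": {"processing"},
    "processing": {"completed", "failed"},
    "failed": {"queued"},  # allow re-queueing for retries
}

def transition(conn, job_id: int, from_state: str, to_state: str) -> bool:
    """Move a job between states; returns False if the job was not in from_state."""
    if to_state not in ALLOWED_TRANSITIONS.get(from_state, set()):
        raise ValueError(f"illegal transition {from_state} -> {to_state}")
    with conn.cursor() as cur:
        # The state guard makes the transition atomic, exactly like the claim step.
        cur.execute(
            "UPDATE jobs SET state = %s, updated_at = now() "
            "WHERE id = %s AND state = %s",
            (to_state, job_id, from_state),
        )
        conn.commit()
        return cur.rowcount == 1
```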
4. Data Modeling Contradictions
Another problematic area is the database schema.
The system stores measurements in a massive table that approaches 1 terabyte. However, enrichment is performed by joining with normalized reference tables designed using the third normal form (3NF).
Normalization is useful in transactional systems to eliminate redundancy. But analytical databases like ClickHouse are optimized for denormalized, columnar data.
Mixing OLTP normalization concepts with OLAP storage leads to two major issues:
- Expensive joins on very large tables
- Repeated enrichment operations during ingestion
In analytical pipelines, denormalization is often intentional. It improves query performance and simplifies downstream processing.
ClickHouse is especially optimized for wide tables with many columns. Trying to force a highly normalized schema into a columnar analytics database defeats its design advantages.
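As a sketch, enrichment can be resolved once at ingest time and persisted in a wide MergeTree table, so analytical queries never re-join the reference data (assuming the clickhouse-connect client; column names are illustrative):

```python
import clickhouse_connect  # assumed client library

client = clickhouse_connect.get_client(host="localhost")

# Denormalized: the enriched attributes live next to the measurement.
client.command("""
    CREATE TABLE IF NOT EXISTS measurements_wide
    (
        ts          DateTime,
        cell_id     UInt64,
        value       Float64,
        -- columns resolved from reference tables at ingest time:
        cell_name   LowCardinality(String),
        region      LowCardinality(String),
        device_type LowCardinality(String)
    )
    ENGINE = MergeTree
    ORDER BY (cell_id, ts)
""")
```

LowCardinality columns compress the repeated enrichment values heavily, so the storage cost of denormalization is far smaller than it first appears.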
5. Storing Structured Data as Strings
The most painful design decision appears in how measurement metadata is stored.
The system builds a dictionary of key-value pairs, serializes it as a string, and then stores it in ClickHouse. Every downstream query must:
- Parse the string into JSON
- Extract the relevant fields
- Serialize the result back to a string
- Store it again
This is computationally wasteful and unnecessary.
ClickHouse already supports multiple structured data types:
- JSON / Object types
- Map types
- Nested columns
- Explicit flattened columns
Serializing structured data into opaque strings eliminates:
- Query optimization
- Column compression
- Indexing capabilities
- Type safety
It also dramatically increases CPU overhead during query execution.
In a columnar database designed for analytical workloads, flattening structured data into proper columns usually produces the best performance.
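For example, a Map column keeps the key-value metadata queryable and compressible without any serialization round-trip (a sketch assuming the clickhouse-connect client; table and field names are hypothetical):

```python
import clickhouse_connect  # assumed client library

client = clickhouse_connect.get_client(host="localhost")

# Map column: keys stay addressable in SQL, no JSON parsing anywhere.
client.command("""
    CREATE TABLE IF NOT EXISTS measurement_meta
    (
        measurement_id UInt64,
        metadata       Map(String, String)
    )
    ENGINE = MergeTree
    ORDER BY measurement_id
""")

# Python dicts map directly onto the Map type on insert.
client.insert(
    "measurement_meta",
    [[1, {"unit": "dBm", "band": "n78"}]],
    column_names=["measurement_id", "metadata"],
)

# Fields are addressed directly in SQL instead of round-tripping strings.
rows = client.query(
    "SELECT measurement_id, metadata['band'] "
    "FROM measurement_meta WHERE metadata['unit'] = 'dBm'"
).result_rows
```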
6. Database Access Anti-Pattern: Per-Row Queries and Connection Exhaustion
One of the most critical failures in the system is not architectural at a high level but operational at the code level, and it directly impacts reliability.
A production incident exposed a severe flaw: the service opens a new ClickHouse connection and executes a full table scan for every single cell processed.
The failure manifests as:
- Code: 210. Connection reset by peer errors
- Parse jobs failing under moderate load
- ClickHouse actively terminating connections
What’s happening under the hood?
Inside the parsing pipeline:
- Each measurement object triggers a cell_id lookup
- Which opens a new database connection and runs a full SELECT query
This happens inside a loop over all cells.
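In condensed form, the pattern looks roughly like this (a reconstruction with hypothetical names, not the actual code):

```python
import clickhouse_connect  # assumed client library

def enrich_cells_antipattern(cells: list[dict]) -> None:
    """ANTI-PATTERN: one connection and one full reference-table scan per cell."""
    for cell in cells:
        # A brand-new TCP connection for every cell in the file...
        client = clickhouse_connect.get_client(host="clickhouse")
        # ...and a full scan of the reference table, searched linearly in Python.
        rows = client.query("SELECT cell_id, cell_name FROM cells").result_rows
        cell["cell_name"] = next(
            (name for cid, name in rows if cid == cell["cell_id"]), None
        )
```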
So for a file with 1,000 cells:
- 1,000 database connections are opened
- 1,000 full table scans are executed
- Each lookup performs an O(n) linear scan
This is not just inefficient — it is catastrophic under load.
Why does this fail in practice?
Databases are not designed for this access pattern. Specifically:
- Connection pools get exhausted
- TCP connections are reset under pressure
- Query latency compounds multiplicatively
- The system becomes I/O-bound instead of CPU-bound
This is a textbook violation of a core principle:
Never perform external I/O inside a tight processing loop.
7. The Missing Abstraction: In-Memory Lookup
The fix is trivial, which makes the mistake more costly.
Instead of querying the database per cell, the system should:
- Fetch the reference data once per job
- Transform it into an in-memory structure
- Perform constant-time lookups during parsing
Concretely:
- Load the cell map once
- Build a dictionary keyed by cell_id
- Pass that dictionary through the call chain
This transforms:
- Before: O(n × m) work (n cells, each linearly scanning an m-row reference table) plus n database calls
- After: O(n + m) work plus a single database call
It also eliminates connection exhaustion.
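A sketch of that shape, reusing the hypothetical names from the anti-pattern above:

```python
import clickhouse_connect  # assumed client library

def load_cell_map(client) -> dict[int, str]:
    """One query per job: fetch the reference table and index it in memory."""
    rows = client.query("SELECT cell_id, cell_name FROM cells").result_rows
    return {cell_id: name for cell_id, name in rows}

def enrich_cells(cells: list[dict], cell_map: dict[int, str]) -> None:
    """Constant-time lookups during parsing; no I/O inside the loop."""
    for cell in cells:
        cell["cell_name"] = cell_map.get(cell["cell_id"])

client = clickhouse_connect.get_client(host="clickhouse")  # one connection per job
cell_map = load_cell_map(client)
# cell_map is then passed through the call chain to every parse step.
```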

8. Secondary Issues Amplifying the Problem
Several additional flaws made the situation worse:
- No connection timeouts → workers hang indefinitely
- No caching layer → repeated identical queries
- Linear scans instead of indexed lookups
- Weak exception handling → loss of stack traces
These are not independent issues — they compound each other.
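The first two are often one-line fixes. A sketch assuming the clickhouse-connect client (the timeout parameters follow that client's settings; the cache size is an arbitrary choice):

```python
from functools import lru_cache

import clickhouse_connect  # assumed client library

# Bounded timeouts: a dead server fails the job instead of hanging it forever.
client = clickhouse_connect.get_client(
    host="clickhouse",
    connect_timeout=5,        # seconds to establish the connection
    send_receive_timeout=60,  # seconds to wait for a query round-trip
)

@lru_cache(maxsize=4096)
def lookup_region(region_id: int) -> str:
    # Repeated identical lookups within a process hit the cache, not the DB.
    rows = client.query(
        "SELECT name FROM regions WHERE id = {id:UInt64}",
        parameters={"id": region_id},
    ).result_rows
    return rows[0][0] if rows else ""
```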
9. Lack of Observability and Idempotency
The architecture also lacks two key properties required for reliable pipelines:
Idempotency:
If a file is processed twice, the system should produce the same result without duplicating data.
Observability:
Operators should be able to answer simple questions such as:
- Which files were processed?
- Which ones failed?
- How long did processing take?
- Can we replay a file?
Deleting files and avoiding job tracking make these questions impossible to answer reliably.
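A common way to get idempotency is to record a content-based fingerprint with each job and skip duplicates (a sketch; the checksum-as-key policy and schema are assumptions):

```python
import hashlib

def file_fingerprint(path: str) -> str:
    """Stable identity for a file's contents, independent of its name."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def should_process(conn, path: str) -> bool:
    """Skip files whose contents were already ingested successfully."""
    fp = file_fingerprint(path)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT 1 FROM jobs WHERE fingerprint = %s AND state = 'completed'",
            (fp,),
        )
        return cur.fetchone() is None
```

Because the fingerprint survives renames and re-uploads, replaying a file becomes a safe, answerable operation instead of a gamble.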
Final Thoughts
None of these mistakes individually would necessarily break a system. But together they create a pipeline that is fragile, inefficient, and difficult to operate.
The most striking part is that the system does not fail because the problem is complex. It fails because fundamental engineering principles are ignored:
- Deterministic coordination instead of randomness
- Appropriate storage for the workload
- Respect for database access patterns
- Separation of I/O from computation
When those fundamentals are violated, engineers compensate with random sleeps, filename tricks, and excessive database calls — until the system collapses under its own weight.
And that is often the most expensive mistake of all.