Every engineer eventually encounters a system that becomes a living catalog of anti-patterns. Recently, I reviewed a microservice responsible for parsing large measurement files and storing the extracted results in a database. The problem itself is straightforward: read files, extract measurements, enrich the data, and persist it. Unfortunately, the implementation demonstrates how a relatively simple pipeline can become fragile, inefficient, and operationally dangerous when basic distributed-systems principles are ignored.
This article walks through several architectural mistakes found in this service and explains what better alternatives would look like.

The Problem the Service Was Supposed to Solve
The pipeline’s intended workflow is simple:
- Large measurement files are uploaded.
- A microservice reads the files.
- The service parses measurements.
- Data is enriched with reference tables.
- Results are stored in ClickHouse for analytics.
The service runs as multiple pods in Kubernetes to scale horizontally. In theory, this architecture should allow parallel processing of files and high ingestion throughput.
In practice, the system fights its own design.
1. Random Sleep Instead of Concurrency Control
The first red flag appears immediately when the service starts processing files. Each worker begins with a random delay measured in seconds. The intention was to reduce the chance that multiple pods would pick the same file simultaneously.
This approach reveals a misunderstanding of concurrency in distributed systems. Random delays are not synchronization mechanisms. They only reduce the probability of collisions; they never eliminate them.
In distributed systems, coordination must be explicit. Common solutions include:
- Distributed locks (Redis, etcd, ZooKeeper)
- Atomic file claiming via metadata storage
- Message queues with consumer groups
- Database row locking or job tables
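As an illustration of the first option, a minimal Redis-based claim might look like the sketch below (assuming the redis-py client; key names and TTL are hypothetical, and production systems should prefer a vetted lock implementation):

```python
import uuid

import redis

r = redis.Redis(host="localhost", port=6379)

def try_claim(filename: str, ttl_seconds: int = 300) -> str | None:
    """Atomically claim a file by creating an expiring lock key.

    Returns a token needed to release the lock, or None if another
    worker already holds it.
    """
    token = str(uuid.uuid4())
    # SET ... NX EX is atomic: only one worker can create the key.
    if r.set(f"lock:{filename}", token, nx=True, ex=ttl_seconds):
        return token
    return None

def release(filename: str, token: str) -> None:
    # Delete only if we still own the lock (compare-and-delete via Lua,
    # so a lock that expired and was re-acquired is never stolen).
    script = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    end
    return 0
    """
    r.eval(script, 1, f"lock:{filename}", token)
```

The TTL matters: if a worker dies mid-job, the lock expires on its own and another worker can take over, which a random sleep can never guarantee.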
Another simple and reliable pattern is to maintain a job table:
- Each file is inserted as a job.
- Workers atomically claim jobs using a state transition (pending → processing).
- A unique constraint or transactional update ensures only one worker processes each file.
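A sketch of the claim step, assuming a relational jobs table and psycopg2-style placeholders (table and column names are hypothetical):

```python
def claim_next_job(conn) -> int | None:
    """Atomically move one pending job to 'processing'.

    The WHERE state = 'pending' guard makes the update a compare-and-swap:
    if two workers race on the same row, only one UPDATE matches it.
    """
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id FROM jobs WHERE state = 'pending' "
            "ORDER BY created_at LIMIT 1"
        )
        row = cur.fetchone()
        if row is None:
            return None  # nothing to do
        job_id = row[0]
        cur.execute(
            "UPDATE jobs SET state = 'processing' "
            "WHERE id = %s AND state = 'pending'",
            (job_id,),
        )
        conn.commit()
        # rowcount == 0 means another worker claimed it first; caller retries.
        return job_id if cur.rowcount == 1 else None
```

On PostgreSQL, the two statements can be collapsed into a single `UPDATE ... WHERE id = (SELECT ... FOR UPDATE SKIP LOCKED)` for the same effect without the retry loop.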
Without deterministic coordination, race conditions are guaranteed eventually.
2. Using NFS as a Coordination Mechanism
Instead of using object storage, the system relies on a shared NFS mount for file management.
NFS can work for simple shared storage, but it is poorly suited for distributed event pipelines. It provides weak guarantees around concurrent file operations and becomes a bottleneck under heavy parallel workloads.
Modern systems typically use object storage, such as:
- Amazon S3
- MinIO
- Google Cloud Storage
- Azure Blob Storage
Object storage provides durability, versioning, event triggers, and scalability that NFS simply cannot match. It also integrates naturally with event-driven architectures.
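For example, with bucket notifications feeding a queue, each upload becomes a message delivered to one worker at a time, and the directory-listing race disappears entirely. A sketch assuming boto3 and the standard S3 event notification format (the queue URL and paths are hypothetical):

```python
import json

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.example.amazonaws.com/123/measurement-uploads"  # hypothetical

def poll_once() -> None:
    # Each message corresponds to one uploaded object; SQS delivers it to
    # one consumer at a time, and the visibility timeout handles retries.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        record = json.loads(msg["Body"])["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        s3.download_file(bucket, key, f"/tmp/{key.rsplit('/', 1)[-1]}")
        # ... parse and ingest ...
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```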
Using NFS for a distributed ingestion pipeline is a common source of race conditions, performance degradation, and operational complexity.
3. File Renaming as a Locking Strategy
The service tries to prevent multiple workers from processing the same file by renaming it. When a pod starts processing, it appends a suffix like .zumbolize to the filename. Once processing finishes, the file is deleted.
This creates several problems:
First, the rename operation itself is not a reliable distributed lock. Multiple pods can still read the file list simultaneously and race to rename it.
Second, deleting the file after processing removes any ability to audit or reprocess data.
Third, there is no traceability. If processing fails halfway through, there is no reliable record of what happened.
A proper ingestion pipeline maintains clear state transitions, such as:
- uploaded
- queued
- processing
- completed
- failed
These states are persisted in a durable store. Files remain immutable artifacts rather than temporary coordination tools.
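A sketch of enforcing those transitions against a durable jobs table (the allowed-transition policy and column names are assumptions):

```python
# Hypothetical policy: which states may follow which.
ALLOWED_TRANSITIONS = {
    "uploaded": {"queued"},
    "queued": {"processing"},
    "processing": {"completed", "failed"},
    "failed": {"queued"},  # allow re-queueing for retries
}

def transition(conn, job_id: int, from_state: str, to_state: str) -> bool:
    """Move a job between states; returns False if the job was not in from_state."""
    if to_state not in ALLOWED_TRANSITIONS.get(from_state, set()):
        raise ValueError(f"illegal transition {from_state} -> {to_state}")
    with conn.cursor() as cur:
        # The state guard makes the transition atomic, exactly like the claim step.
        cur.execute(
            "UPDATE jobs SET state = %s, updated_at = now() "
            "WHERE id = %s AND state = %s",
            (to_state, job_id, from_state),
        )
        conn.commit()
        return cur.rowcount == 1
```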
4. Data Modeling Contradictions
Another problematic area is the database schema.
The system stores measurements in a massive table that approaches 1 terabyte. However, enrichment is performed by joining with normalized reference tables designed using the third normal form (3NF).
Normalization is useful in transactional systems to eliminate redundancy. But analytical databases like ClickHouse are optimized for denormalized, columnar data.
Mixing OLTP normalization concepts with OLAP storage leads to two major issues:
- Expensive joins on very large tables
- Repeated enrichment operations during ingestion
In analytical pipelines, denormalization is often intentional. It improves query performance and simplifies downstream processing.
ClickHouse is especially optimized for wide tables with many columns. Trying to force a highly normalized schema into a columnar analytics database defeats its design advantages.
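As a sketch, enrichment can be resolved once at ingest time and persisted in a wide MergeTree table, so analytical queries never re-join the reference data (assuming the clickhouse-connect client; column names are illustrative):

```python
import clickhouse_connect  # assumed client library

client = clickhouse_connect.get_client(host="localhost")

# Denormalized: the enriched attributes live next to the measurement.
client.command("""
    CREATE TABLE IF NOT EXISTS measurements_wide
    (
        ts          DateTime,
        cell_id     UInt64,
        value       Float64,
        -- columns resolved from reference tables at ingest time:
        cell_name   LowCardinality(String),
        region      LowCardinality(String),
        device_type LowCardinality(String)
    )
    ENGINE = MergeTree
    ORDER BY (cell_id, ts)
""")
```

LowCardinality columns compress the repeated enrichment values heavily, so the storage cost of denormalization is far smaller than it first appears.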
5. Storing Structured Data as Strings
The most painful design decision appears in how measurement metadata is stored.
The system builds a dictionary of key-value pairs, serializes it as a string, and then stores it in ClickHouse. Every downstream query must:
- Parse the string into JSON
- Extract the relevant fields
- Serialize the result back to a string
- Store it again
This is computationally wasteful and unnecessary.
ClickHouse already supports multiple structured data types:
- JSON / Object types
- Map types
- Nested columns
- Explicit flattened columns
Serializing structured data into opaque strings eliminates:
- Query optimization
- Column compression
- Indexing capabilities
- Type safety
It also dramatically increases CPU overhead during query execution.
In a columnar database designed for analytical workloads, flattening structured data into proper columns usually produces the best performance.
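For example, a Map column keeps the key-value metadata queryable and compressible without any serialization round-trip (a sketch assuming the clickhouse-connect client; table and field names are hypothetical):

```python
import clickhouse_connect  # assumed client library

client = clickhouse_connect.get_client(host="localhost")

# Map column: keys stay addressable in SQL, no JSON parsing anywhere.
client.command("""
    CREATE TABLE IF NOT EXISTS measurement_meta
    (
        measurement_id UInt64,
        metadata       Map(String, String)
    )
    ENGINE = MergeTree
    ORDER BY measurement_id
""")

# Python dicts map directly onto the Map type on insert.
client.insert(
    "measurement_meta",
    [[1, {"unit": "dBm", "band": "n78"}]],
    column_names=["measurement_id", "metadata"],
)

# Fields are addressed directly in SQL instead of round-tripping strings.
rows = client.query(
    "SELECT measurement_id, metadata['band'] "
    "FROM measurement_meta WHERE metadata['unit'] = 'dBm'"
).result_rows
```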
6. Database Access Anti-Pattern: Per-Row Queries and Connection Exhaustion
One of the most critical failures in the system is not architectural at a high level but operational at the code level, and it directly impacts reliability.
A production incident exposed a severe flaw: the service opens a new ClickHouse connection and executes a full table scan for every single cell processed.
The failure manifests as:
- Code: 210. Connection reset by peer errors
- Parse jobs failing under moderate load
- ClickHouse actively terminating connections
What’s happening under the hood?
Inside the parsing pipeline:
- Each measurement object triggers a cell_id lookup
- Which opens a new database connection and runs a full SELECT query
This happens inside a loop over all cells.
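In condensed form, the pattern looks roughly like this (a reconstruction with hypothetical names, not the actual code):

```python
import clickhouse_connect  # assumed client library

def enrich_cells_antipattern(cells: list[dict]) -> None:
    """ANTI-PATTERN: one connection and one full reference-table scan per cell."""
    for cell in cells:
        # A brand-new TCP connection for every cell in the file...
        client = clickhouse_connect.get_client(host="clickhouse")
        # ...and a full scan of the reference table, searched linearly in Python.
        rows = client.query("SELECT cell_id, cell_name FROM cells").result_rows
        cell["cell_name"] = next(
            (name for cid, name in rows if cid == cell["cell_id"]), None
        )
```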
So for a file with 1,000 cells:
- 1,000 database connections are opened
- 1,000 full table scans are executed
- Each lookup performs an O(n) linear scan
This is not just inefficient — it is catastrophic under load.
Why does this fail in practice?
Databases are not designed for this access pattern. Specifically:
- Connection pools get exhausted
- TCP connections are reset under pressure
- Query latency compounds multiplicatively
- The system becomes I/O-bound instead of CPU-bound
This is a textbook violation of a core principle:
Never perform external I/O inside a tight processing loop.
7. The Missing Abstraction: In-Memory Lookup
The fix is trivial, which makes the mistake more costly.
Instead of querying the database per cell, the system should:
- Fetch the reference data once per job
- Transform it into an in-memory structure
- Perform constant-time lookups during parsing
Concretely:
- Load the cell map once
- Build a dictionary keyed by cell_id
- Pass that dictionary through the call chain
This transforms:
- Before: O(n × m) work (n cells, each linearly scanning an m-row reference table) plus n database calls
- After: O(n + m) work plus a single database call
It also eliminates connection exhaustion.
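A sketch of that shape, reusing the hypothetical names from the anti-pattern above:

```python
import clickhouse_connect  # assumed client library

def load_cell_map(client) -> dict[int, str]:
    """One query per job: fetch the reference table and index it in memory."""
    rows = client.query("SELECT cell_id, cell_name FROM cells").result_rows
    return {cell_id: name for cell_id, name in rows}

def enrich_cells(cells: list[dict], cell_map: dict[int, str]) -> None:
    """Constant-time lookups during parsing; no I/O inside the loop."""
    for cell in cells:
        cell["cell_name"] = cell_map.get(cell["cell_id"])

client = clickhouse_connect.get_client(host="clickhouse")  # one connection per job
cell_map = load_cell_map(client)
# cell_map is then passed through the call chain to every parse step.
```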

8. Secondary Issues Amplifying the Problem
Several additional flaws made the situation worse:
- No connection timeouts → workers hang indefinitely
- No caching layer → repeated identical queries
- Linear scans instead of indexed lookups
- Weak exception handling → loss of stack traces
These are not independent issues — they compound each other.
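The first two are often one-line fixes. A sketch assuming the clickhouse-connect client (the timeout parameters follow that client's settings; the cache size is an arbitrary choice):

```python
from functools import lru_cache

import clickhouse_connect  # assumed client library

# Bounded timeouts: a dead server fails the job instead of hanging it forever.
client = clickhouse_connect.get_client(
    host="clickhouse",
    connect_timeout=5,        # seconds to establish the connection
    send_receive_timeout=60,  # seconds to wait for a query round-trip
)

@lru_cache(maxsize=4096)
def lookup_region(region_id: int) -> str:
    # Repeated identical lookups within a process hit the cache, not the DB.
    rows = client.query(
        "SELECT name FROM regions WHERE id = {id:UInt64}",
        parameters={"id": region_id},
    ).result_rows
    return rows[0][0] if rows else ""
```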
9. Lack of Observability and Idempotency
The architecture also lacks two key properties required for reliable pipelines:
Idempotency:
If a file is processed twice, the system should produce the same result without duplicating data.
Observability:
Operators should be able to answer simple questions such as:
- Which files were processed?
- Which ones failed?
- How long did processing take?
- Can we replay a file?
Deleting files and avoiding job tracking make these questions impossible to answer reliably.
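A common way to get idempotency is to record a content-based fingerprint with each job and skip duplicates (a sketch; the checksum-as-key policy and schema are assumptions):

```python
import hashlib

def file_fingerprint(path: str) -> str:
    """Stable identity for a file's contents, independent of its name."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def should_process(conn, path: str) -> bool:
    """Skip files whose contents were already ingested successfully."""
    fp = file_fingerprint(path)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT 1 FROM jobs WHERE fingerprint = %s AND state = 'completed'",
            (fp,),
        )
        return cur.fetchone() is None
```

Because the fingerprint survives renames and re-uploads, replaying a file becomes a safe, answerable operation instead of a gamble.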
Final Thoughts
None of these mistakes individually would necessarily break a system. But together they create a pipeline that is fragile, inefficient, and difficult to operate.
The most striking part is that the system does not fail because the problem is complex. It fails because fundamental engineering principles are ignored:
- Deterministic coordination instead of randomness
- Appropriate storage for the workload
- Respect for database access patterns
- Separation of I/O from computation
When those fundamentals are violated, engineers compensate with random sleeps, filename tricks, and excessive database calls — until the system collapses under its own weight.
And that is often the most expensive mistake of all.