
Ilia Gusev

@persikbl

Writing Podo Stack 🍇 - tools that survived production, weekly https://podostack.com

5
Followers
64
Following
199
Posts
22.01.2026
Joined

Latest posts by Ilia Gusev @persikbl

We ran mirrored queues for years thinking they were safe. They weren't.

The migration to quorum queues took a weekend. The peace of mind is permanent.

Full RabbitMQ production guide in this week's Podo Stack:
podostack.com/p/rabbitmq-...

09.03.2026 14:02 👍 0 🔁 0 💬 0 📌 0

The trade-offs are honest.

No per-message TTL. No priority queues. No exclusive queues. These features conflict with Raft's consistency model.

For 90% of workloads - where "don't lose my data" matters more than message priority - quorum wins.

09.03.2026 14:02 👍 3 🔁 0 💬 4 📌 0

Creating one in Kubernetes takes one YAML manifest (via the RabbitMQ messaging topology operator; the cluster name is yours):

apiVersion: rabbitmq.com/v1beta1
kind: Queue
metadata:
  name: orders
spec:
  name: orders
  type: quorum
  durable: true
  rabbitmqClusterReference:
    name: my-cluster

That's it. Raft handles replication automatically. No ha-mode policies needed.

09.03.2026 14:02 👍 0 🔁 0 💬 1 📌 0

The difference matters when a node dies.

Mirrored queues: manual intervention, potential data loss during sync.

Quorum queues: automatic leader election, zero data loss, sub-second failover. The surviving nodes just keep going.

09.03.2026 14:02 👍 0 🔁 0 💬 1 📌 0

Quorum queues use Raft consensus. Same algorithm as etcd and Consul.

Every message write is confirmed only when a majority of nodes acknowledge it. 3 nodes, quorum of 2. No message gets lost silently.

Classic mirrored queues? They blocked the entire queue during sync.
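A three-node cluster backing that quorum of 2 is a few lines with the RabbitMQ cluster operator — a minimal sketch, the name is a placeholder:

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: my-rabbit
spec:
  replicas: 3   # majority = 2; one node can die without losing acks
```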

09.03.2026 14:02 👍 0 🔁 0 💬 1 📌 0

RabbitMQ mirrored queues have been deprecated since 3.13.

If you're still using ha-mode policies, your data safety is an illusion.

Here's what replaced them - and why it's better.

09.03.2026 14:02 👍 0 🔁 0 💬 1 📌 0

Everything in cert-manager is a CRD - Issuers, Certificates, challenges.

GitOps-friendly by design. Staging uses LE staging API. Production uses the real one. Same manifests, different issuers. Argo handles the rest.
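A sketch of that split — two ClusterIssuers, only the ACME server URL differs (email and ingress class are placeholders):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    # Staging endpoint: relaxed rate limits, untrusted chain.
    # Production swaps in https://acme-v02.api.letsencrypt.org/directory
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
      - http01:
          ingress:
            class: nginx
```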

podostack.com/p/cert-mana... 🍇

06.03.2026 16:00 👍 0 🔁 0 💬 0 📌 0

The CSI driver takes it further.

Each pod gets its own certificate. Created on start, cleaned up on stop.

No shared secrets. No wildcards covering your whole cluster. Every pod proves its own identity.

That's zero-trust without running a full service mesh.
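At the pod level that's one CSI volume — a sketch, the issuer name is assumed:

```yaml
# Pod spec excerpt: per-pod cert mounted by the cert-manager csi-driver
volumes:
  - name: tls
    csi:
      driver: csi.cert-manager.io
      readOnly: true
      volumeAttributes:
        csi.cert-manager.io/issuer-name: my-ca-issuer
        csi.cert-manager.io/dns-names: ${POD_NAME}.${POD_NAMESPACE}.svc.cluster.local
```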

06.03.2026 16:00 👍 0 🔁 0 💬 1 📌 0

Two challenge types for Let's Encrypt:

HTTP-01 - temp endpoint on your ingress, LE hits it to verify ownership. Works if your cluster is public.

DNS-01 - TXT record in your DNS zone. Works for wildcards and private clusters.

80% of setups I've seen use HTTP-01.
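The solver choice lives in the Issuer. A DNS-01 sketch for Route53 (region and zone are placeholders; IAM setup omitted):

```yaml
# Issuer excerpt: DNS-01 challenge via Route53
solvers:
  - dns01:
      route53:
        region: us-east-1
    selector:
      dnsZones:
        - example.com
```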

06.03.2026 16:00 👍 0 🔁 0 💬 1 📌 0

It's a Kubernetes controller that watches Certificate resources.

About to expire? Renewed. New Certificate created? Provisioned. Pod deleted? Cleaned up.

You configure it once, it runs forever. No cron jobs. No scripts. No "I'll renew it next week."
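The watched resource itself is tiny — a sketch, names and hosts are placeholders:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls
spec:
  secretName: api-tls          # where the signed cert + key land
  dnsNames:
    - api.example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
```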

06.03.2026 16:00 👍 0 🔁 0 💬 1 📌 0

Expired certs are one of the most preventable causes of production outages.

They keep happening because manual renewal doesn't scale. You can't rely on calendar reminders when you're running hundreds of services across multiple clusters.

cert-manager makes it automatic.

06.03.2026 16:00 👍 0 🔁 0 💬 1 📌 0

A major cloud provider's control plane went down for 4 hours.

Not a hack. Not a cascading failure.

A certificate expired. The automation that was supposed to renew it had its own expired certificate. 🧵

06.03.2026 16:00 👍 1 🔁 0 💬 1 📌 0

Honestly, I overlooked these for years. "Just pick the cheapest c-type" was my default.

Turns out the right instance suffix matters more than the right instance size.

More Kubernetes cost optimization:
podostack.com/p/spot-cons... 🍇

06.03.2026 14:01 👍 0 🔁 0 💬 0 📌 0

The hidden benefit: stable baseline bandwidth.

Regular instances burst and throttle. The n-variants give you consistent network performance.

If your app has unpredictable network patterns, that stability alone is worth the premium.

06.03.2026 14:01 👍 0 🔁 0 💬 1 📌 0

When you actually need n-instances:

- Spark/Hadoop shuffling (network-bound, not CPU-bound)
- In-memory databases with replication
- NFV and packet processing
- HPC workloads using EFA (OS-bypass networking)

When you don't: standard web apps bottlenecked by RDS.

06.03.2026 14:01 👍 0 🔁 0 💬 1 📌 0

C6i: 50 Gbps network bandwidth.
C6in: 200 Gbps.

EBS bandwidth jumps too - from 40 Gbps to 100 Gbps.

Cost premium? Only 10-20%. For 4x network throughput that's a bargain most teams don't know about.

06.03.2026 14:01 👍 0 🔁 0 💬 1 📌 0

The "n" in c6in.xlarge stands for something most engineers ignore.

Network Optimized.

And the difference isn't 10%. It's 4x. 🧵

06.03.2026 14:01 👍 0 🔁 0 💬 1 📌 0

PDBs are non-negotiable for any disruption strategy.

Without them, Karpenter can drain a node and take down all your replicas at once.
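A minimal PDB sketch (the app label and the replica math are placeholders for your own):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2        # node drains must leave at least 2 replicas running
  selector:
    matchLabels:
      app: api
```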

Full lifecycle and disruption management guide:
podostack.com/p/spot-cons... 🍇

05.03.2026 14:02 👍 0 🔁 0 💬 0 📌 0

Stateful workloads need their own rules.

Dedicated NodePool. consolidationPolicy: WhenEmpty. expireAfter: Never. On-Demand only.

Databases don't like surprise evictions. Give them stable nodes and let Karpenter optimize everything else.
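Those rules as one manifest — a sketch assuming Karpenter's v1 NodePool API (in v1, expireAfter lives on the node template):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: stateful
spec:
  disruption:
    consolidationPolicy: WhenEmpty   # only reclaim nodes with zero pods
  template:
    spec:
      expireAfter: Never             # no surprise node rotation
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]      # no Spot evictions for databases
```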

05.03.2026 14:02 👍 0 🔁 0 💬 1 📌 0

The bin-packing trick nobody talks about:

Karpenter's consolidation IS bin-packing. It doesn't just remove empty nodes - it actively repacks pods onto fewer, better-sized nodes.

That's disruption in service of efficiency.

05.03.2026 14:02 👍 0 🔁 0 💬 1 📌 0

Two consolidation policies. Pick wisely.

WhenUnderutilized: aggressive. Karpenter replaces two half-empty nodes with one full node. Saves money, but causes pod disruption.

WhenEmpty: safe. Only removes nodes with zero pods. Less savings, zero risk.

05.03.2026 14:02 👍 0 🔁 0 💬 1 📌 0

Karpenter's expireAfter field fixes this quietly.

Set expireAfter: 720h and every node gets replaced after 30 days. Fresh AMI, latest patches, zero manual work.

It's the closest thing to "set and forget" node management.
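In Karpenter's v1 API that's a one-field sketch (v1 moved expireAfter onto the node template; older v1beta1 had it under disruption):

```yaml
# NodePool excerpt
spec:
  template:
    spec:
      expireAfter: 720h   # every node replaced after 30 days
```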

05.03.2026 14:02 👍 0 🔁 0 💬 1 📌 0

Your Kubernetes nodes haven't been updated in 6 months.

No AMI patches. No kernel fixes. No security updates.

And you're wondering why your compliance team is nervous. 🧵

05.03.2026 14:02 👍 0 🔁 0 💬 1 📌 0

Start with preferred, promote to required.

Roll out new affinity rules as soft preferences first. Watch the scheduler for a few days. Once you're sure the labels exist and are correct, switch to hard requirements.

podostack.com/p/kubernete... 🍇

04.03.2026 16:01 👍 0 🔁 0 💬 0 📌 0

One thing that trips people up: multiple nodeSelectorTerms use OR logic, but multiple matchExpressions inside one term use AND.

(zone=us-east-1a AND disktype=ssd) OR (zone=us-west-2a)

You can build complex placement rules without custom schedulers. Most teams don't know this.
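That exact rule as YAML — terms OR together, expressions inside one term AND together (the disktype label is a custom example):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        # Term 1: zone=us-east-1a AND disktype=ssd
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["us-east-1a"]
            - key: disktype
              operator: In
              values: ["ssd"]
        # OR Term 2: zone=us-west-2a
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["us-west-2a"]
```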

04.03.2026 16:01 👍 0 🔁 0 💬 1 📌 0

The pattern I keep coming back to:

required โ†’ pin to the right node pool (gitops/target: api)
preferred โ†’ optimize within that pool (prefer arm64)

Hard-pin by purpose. Soft-prefer by architecture. Placement guarantees AND cost savings without scheduling failures.
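The hard-pin half as a sketch (gitops/target is the example label above, not a built-in one):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gitops/target   # custom label on the api node pool
              operator: In
              values: ["api"]
```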

04.03.2026 16:01 👍 0 🔁 0 💬 1 📌 0

Preferred rules use weights (1-100).

Want ARM64 nodes but don't want pods stuck if they're full?

weight: 80 โ†’ arm64
weight: 20 โ†’ amd64

Scheduler scores nodes, picks the best match. Falls back gracefully. No Pending pods at 3 AM.
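That weighted fallback as a sketch, using the standard kubernetes.io/arch node label:

```yaml
preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 80                    # strongly prefer arm64
    preference:
      matchExpressions:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
  - weight: 20                    # fall back to amd64 if arm64 is full
    preference:
      matchExpressions:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
```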

04.03.2026 16:01 👍 0 🔁 0 💬 1 📌 0

nodeAffinity gives you two modes nodeSelector never had.

required: hard rule. Pod won't schedule unless the node matches. Same as nodeSelector, but with real operators: In, NotIn, Exists, Gt.

preferred: soft preference. Scheduler tries to honor it. If it can't, pod still runs.

04.03.2026 16:01 👍 0 🔁 0 💬 1 📌 0

nodeSelector is a binary switch.

Either the label matches and the pod schedules, or it doesn't and the pod hangs Pending forever.

No fallback. No preference. No OR logic. Just "match or die." 🧵

04.03.2026 16:01 👍 0 🔁 0 💬 1 📌 0

Pod scheduling is where most Kubernetes cost problems hide.

One wrong affinity rule can silently 3x your bill. One missing PDB turns a routine node drain into an outage.

Full pod packing and scheduling guide:
podostack.com/p/spot-cons... 🍇

04.03.2026 14:02 👍 0 🔁 0 💬 0 📌 0