We ran mirrored queues for years thinking they were safe. They weren't.
The migration to quorum queues took a weekend. The peace of mind is permanent.
Full RabbitMQ production guide in this week's Podo Stack:
podostack.com/p/rabbitmq-...
The trade-offs are honest.
No per-message TTL. No priority queues. No exclusive queues. These features conflict with Raft's consistency model.
For 90% of workloads - where "don't lose my data" matters more than message priority - quorum wins.
Creating one in Kubernetes is one YAML (via the RabbitMQ Messaging Topology Operator):
apiVersion: rabbitmq.com/v1beta1
kind: Queue
metadata:
  name: orders
spec:
  name: orders
  type: quorum
  durable: true
  rabbitmqClusterReference:
    name: my-cluster  # your RabbitmqCluster
That's it. Raft handles replication automatically. No ha-mode policies needed.
The difference matters when a node dies.
Mirrored queues: manual intervention, potential data loss during sync.
Quorum queues: automatic leader election, zero data loss, sub-second failover. The surviving nodes just keep going.
Quorum queues use Raft consensus. Same algorithm as etcd and Consul.
Every message write is confirmed only when a majority of nodes acknowledge it. 3 nodes, quorum of 2. No message gets lost silently.
Classic mirrored queues? They blocked the entire queue during sync.
RabbitMQ mirrored queues have been deprecated since 3.13.
If you're still using ha-mode policies, your data safety is an illusion.
Here's what replaced them - and why it's better.
Everything in cert-manager is a CRD - Issuers, Certificates, challenges.
GitOps-friendly by design. Staging uses LE staging API. Production uses the real one. Same manifests, different issuers. Argo handles the rest.
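A minimal sketch of that staging/production issuer pair. The ACME endpoints are Let's Encrypt's real ones; the email, secret names, and ingress class are placeholders:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: ops@example.com            # placeholder
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
      - http01:
          ingress:
            class: nginx              # placeholder ingress class
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com            # placeholder
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx
```

Same shape, one field different. Your Certificate manifests just point at a different issuerRef per environment.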
podostack.com/p/cert-mana...
The CSI driver takes it further.
Each pod gets its own certificate. Created on start, cleaned up on stop.
No shared secrets. No wildcards covering your whole cluster. Every pod proves its own identity.
That's zero-trust without running a full service mesh.
Two challenge types for Let's Encrypt:
HTTP-01 - temp endpoint on your ingress, LE hits it to verify ownership. Works if your cluster is public.
DNS-01 - TXT record in your DNS zone. Works for wildcards and private clusters.
80% of setups I've seen use HTTP-01.
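For the private-cluster/wildcard case, a DNS-01 solver is a fragment inside the Issuer spec. A sketch assuming Route 53; the region and zone ID are placeholders:

```yaml
# Replaces the http01 solver inside an Issuer/ClusterIssuer spec.
solvers:
  - dns01:
      route53:
        region: us-east-1
        hostedZoneID: Z0123456789    # placeholder zone ID
```

cert-manager writes the TXT record, waits for propagation, then asks LE to validate.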
It's a Kubernetes controller that watches Certificate resources.
About to expire? Renewed. New Certificate created? Provisioned. Pod deleted? Cleaned up.
You configure it once, it runs forever. No cron jobs. No scripts. No "I'll renew it next week."
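What that looks like in practice, a minimal Certificate sketch. Hostname, secret name, and issuer name are placeholders:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls
spec:
  secretName: api-tls            # where the signed cert + key land
  dnsNames:
    - api.example.com            # placeholder hostname
  issuerRef:
    name: letsencrypt-prod       # placeholder ClusterIssuer
    kind: ClusterIssuer
  renewBefore: 720h              # start renewal 30 days before expiry
```

Apply it once; the controller handles issuance and every renewal after.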
Expired certs are one of the most preventable causes of production outages.
They keep happening because manual renewal doesn't scale. You can't rely on calendar reminders when you're running hundreds of services across multiple clusters.
cert-manager makes it automatic.
A major cloud provider's control plane went down for 4 hours.
Not a hack. Not a cascading failure.
A certificate expired. The automation that was supposed to renew it had its own expired certificate. 🧵
Honestly, I overlooked these for years. "Just pick the cheapest c-type" was my default.
Turns out the right instance suffix matters more than the right instance size.
More Kubernetes cost optimization:
podostack.com/p/spot-cons...
The hidden benefit: stable baseline bandwidth.
Regular instances burst and throttle. The n-variants give you consistent network performance.
If your app has unpredictable network patterns, that stability alone is worth the premium.
When you actually need n-instances:
- Spark/Hadoop shuffling (network-bound, not CPU-bound)
- In-memory databases with replication
- NFV and packet processing
- HPC workloads using EFA (OS-bypass networking)
When you don't: standard web apps bottlenecked by RDS.
C6i: 50 Gbps network bandwidth.
C6in: 200 Gbps.
EBS bandwidth jumps too - from 40 Gbps to 100 Gbps.
Cost premium? Only 10-20%. For 4x network throughput that's a bargain most teams don't know about.
The "n" in c6in.xlarge stands for something most engineers ignore.
Network Optimized.
And the difference isn't 10%. It's 4x. 🧵
PDBs are non-negotiable for any disruption strategy.
Without them, Karpenter can drain a node and take down all your replicas at once.
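The fix is a few lines. A sketch, assuming an app labeled `app: api` with several replicas:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2          # keep at least 2 replicas up during drains
  selector:
    matchLabels:
      app: api             # placeholder label
```

Karpenter (and kubectl drain) will evict pods only as fast as this budget allows.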
Full lifecycle and disruption management guide:
podostack.com/p/spot-cons...
Stateful workloads need their own rules.
Dedicated NodePool. consolidationPolicy: WhenEmpty. expireAfter: Never. On-Demand only.
Databases don't like surprise evictions. Give them stable nodes and let Karpenter optimize everything else.
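Those rules as one NodePool, sketched against Karpenter's v1 API (where expireAfter lives under template.spec); the nodeClass name is a placeholder:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: stateful
spec:
  disruption:
    consolidationPolicy: WhenEmpty    # never repack live pods
  template:
    spec:
      expireAfter: Never              # no scheduled node recycling
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]       # no spot interruptions
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                 # placeholder
```

Pin databases here with nodeAffinity; leave every other NodePool aggressive.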
The bin-packing trick nobody talks about:
Karpenter's consolidation IS bin-packing. It doesn't just remove empty nodes - it actively repacks pods onto fewer, better-sized nodes.
That's disruption in service of efficiency.
Two consolidation policies. Pick wisely.
WhenUnderutilized: aggressive. Karpenter replaces two half-empty nodes with one full node. Saves money, but causes pod disruption.
WhenEmpty: safe. Only removes nodes with zero pods. Less savings, zero risk.
Karpenter's expireAfter field fixes this quietly.
Set expireAfter: 720h and every node gets replaced after 30 days. Fresh AMI, latest patches, zero manual work.
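The relevant fragment, assuming Karpenter's v1 API placement of the field:

```yaml
# NodePool spec fragment: replace every node after 30 days.
spec:
  template:
    spec:
      expireAfter: 720h
```

Karpenter drains and replaces expired nodes gradually, respecting your PDBs.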
It's the closest thing to "set and forget" node management.
Your Kubernetes nodes haven't been updated in 6 months.
No AMI patches. No kernel fixes. No security updates.
And you're wondering why your compliance team is nervous. 🧵
Start with preferred, promote to required.
Roll out new affinity rules as soft preferences first. Watch the scheduler for a few days. Once you're sure the labels exist and are correct, switch to hard requirements.
podostack.com/p/kubernete...
One thing that trips people up: multiple nodeSelectorTerms use OR logic, but multiple matchExpressions inside one term use AND.
(zone=us-east-1a AND disktype=ssd) OR (zone=us-west-2a)
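That rule, spelled out in manifest form. A sketch assuming the standard zone label and a custom `disktype` label:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:                 # terms are OR'd together
        - matchExpressions:              # expressions here are AND'd
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["us-east-1a"]
            - key: disktype              # placeholder custom label
              operator: In
              values: ["ssd"]
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["us-west-2a"]
```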
You can build complex placement rules without custom schedulers. Most teams don't know this.
The pattern I keep coming back to:
required → pin to the right node pool (gitops/target: api)
preferred → optimize within that pool (prefer arm64)
Hard-pin by purpose. Soft-prefer by architecture. Placement guarantees AND cost savings without scheduling failures.
Preferred rules use weights (1-100).
Want ARM64 nodes but don't want pods stuck if they're full?
weight: 80 → arm64
weight: 20 → amd64
Scheduler scores nodes, picks the best match. Falls back gracefully. No Pending pods at 3 AM.
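The weighted fallback above as a manifest fragment, using the built-in arch label:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80                       # strongly prefer arm64
        preference:
          matchExpressions:
            - key: kubernetes.io/arch
              operator: In
              values: ["arm64"]
      - weight: 20                       # fall back to amd64
        preference:
          matchExpressions:
            - key: kubernetes.io/arch
              operator: In
              values: ["amd64"]
```

If no arm64 node has room, the pod still schedules on amd64 instead of sitting Pending.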
nodeAffinity gives you two modes nodeSelector never had.
required: hard rule. Pod won't schedule unless the node matches. Same as nodeSelector, but with real operators: In, NotIn, Exists, Gt.
preferred: soft preference. Scheduler tries to honor it. If it can't, pod still runs.
nodeSelector is a binary switch.
Either the label matches and the pod schedules, or it doesn't and the pod hangs Pending forever.
No fallback. No preference. No OR logic. Just "match or die." 🧵
Pod scheduling is where most Kubernetes cost problems hide.
One wrong affinity rule can silently triple your bill. One missing PDB turns a routine node drain into an outage.
Full pod packing and scheduling guide:
podostack.com/p/spot-cons... ๐