Why software fails even when you do not change anything
When we join troubleshooting calls for a production outage, we often hear that no changes were made anywhere in the system. Yet the software still fails. Why?
Below is a summary of what I have experienced. I hope it helps you in the future.
1/ Expired SSL certificates: Every SSL certificate eventually expires. Once it does, SSL connections fail (e.g. from one microservice endpoint to another).
2/ Out of resources (memory, disk, CPU, network, etc.): It can be as simple as a disk that is 100% full on the root (/) partition, or the famous Java OutOfMemoryError (OOM).
3/ Expired license: It can be a 6-month trial license obtained for your PoC (Proof of Concept), which then expires in production shortly after a successful PoC.
4/ Unexpected load spikes: A sudden change in end-user behavior and/or faulty software or hardware can hit your API gateways hard. It can even be a stuck keyboard key that sends many requests to a website in a short time.
5/ External changes/outages: Your CDN or Internet backbone provider suffers a full or partial outage, external DNS has issues, or AWS S3 has an outage.
6/ Forgot to pay the bills: The person in charge of vendor bills is on PTO, or the credit card on file has expired.
7/ Miscellaneous: Human errors. Remember the old joke about the application that went down every Monday to Friday at 2am? The janitor was unplugging the server at night so he could plug in the vacuum cleaner!
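Of the items above, certificate expiry (item 1) is one of the easiest to catch ahead of time, because the expiry date is published in the certificate itself. Here is a minimal sketch using only the Python standard library. The function names `fetch_not_after` and `days_until_expiry` are my own, and the 30-day warning window mentioned afterwards is just an assumption you should tune.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and a certificate's notAfter time.

    `not_after` is in the format returned by SSLSocket.getpeercert(),
    for example 'Jan  5 09:34:43 2018 GMT'. `now` must be timezone-aware.
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    expires = datetime.fromtimestamp(expires_epoch, tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port over TLS and return the certificate's notAfter string.

    Note: with the default context, an already expired certificate makes the
    handshake raise ssl.SSLCertVerificationError, which is itself a useful alert.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_until_expiry(fetch_not_after("your-service.example"), datetime.now(timezone.utc))` for each endpoint (the host name here is a placeholder) and raise an alert when the result drops below, say, 30 days.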
Make sure you have good monitoring systems so that you get prompt alerts/notifications, along with easy-to-use dashboards to help with troubleshooting!
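As one small illustration of such an alert, here is a sketch of a disk-usage check for the "disk full" case in item 2. It uses `shutil.disk_usage` from the Python standard library. The function names and the 90% threshold are assumptions of mine, and in a real setup the message would go to your alerting system rather than being returned as a string.

```python
import shutil
from typing import Optional


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 if total is zero)."""
    return 100.0 * used / total if total else 0.0


def check_disk(path: str = "/", threshold_pct: float = 90.0) -> Optional[str]:
    """Return a warning message if `path` is fuller than `threshold_pct`, else None."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    pct = percent_used(usage.total, usage.used)
    if pct >= threshold_pct:
        return f"Disk usage on {path} is {pct:.1f}% (threshold {threshold_pct:.0f}%)"
    return None
```

Running `check_disk("/")` from a cron job every few minutes is crude, but it is often enough to get a warning before the partition reaches 100%.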