Why software fails even when you do not change anything

Walter Lee
Oct 9, 2019

When we join troubleshooting calls for a production outage, we often hear that no changes were made anywhere in the system. Yet the software still fails. Why?

Below is a summary of what I have experienced. I hope it helps you in the future.

1/ SSL certificates expired: Every SSL certificate will expire! SSL connections will then fail (e.g. from one microservice endpoint to another).

2/ Out of resources (memory, disk, CPU, network, etc.): It can be as simple as a disk that is 100% full on the root (/) partition, or the famous Java OutOfMemoryError (OOM).

3/ License expired: It can be a 6-month trial license for your PoC (Proof of Concept) that then fails in production shortly after a successful PoC.

4/ Unexpected load spikes: A sudden change in end-user behavior and/or faulty software or hardware can hit your API gateways hard. It can even be a stuck keyboard sending many requests to a web site in a short time.

5/ External changes/outages: A full or partial outage at your CDN or Internet backbone provider, external DNS issues, or an AWS S3 outage.

6/ Forgot to pay the bills: The person in charge of vendor bills is on PTO, or the credit card on file has expired.

7/ Miscellaneous: Human errors. Remember the old joke about the application that went down every Monday to Friday at 2 am? The janitor unplugged the server's power so he could plug in the vacuum cleaner at night!
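
Item 1 is one of the easier failures to catch ahead of time, because the expiry date is written in the certificate itself. Below is a minimal sketch in Python, using only the standard library, of how such a check might look. The function names, the 30-day threshold, and the host "example.com" are illustrative assumptions, not something from a specific monitoring tool.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days left before a certificate's notAfter timestamp.

    not_after uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT". A negative result means it has expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port over TLS and return the certificate's notAfter.

    Note: with the default context, the handshake itself raises an error
    if the certificate has already expired, so this is for early warning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_host(host: str, warn_days: float = 30.0) -> bool:
    """Return True if the certificate for host expires within warn_days."""
    remaining = days_until_expiry(
        fetch_not_after(host), datetime.now(timezone.utc)
    )
    return remaining < warn_days


# Hypothetical usage (placeholder host, needs network access):
#   if check_host("example.com"):
#       print("Certificate expires soon, renew it!")
```

Running a check like this on a schedule for every endpoint, internal ones included, turns a surprise outage into a routine renewal task. The same idea of comparing a known expiry date against today could also apply to the licenses in item 3 and the credit card in item 6.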

Make sure you have good monitoring in place so that you get prompt alerts and notifications, along with easy-to-use dashboards to help with troubleshooting!
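
As a small illustration of what such an alert can look like, here is a hedged sketch for the full-disk case from item 2. It uses Python's standard `shutil.disk_usage`; the function names and the 90% threshold are assumptions chosen for the example, and a real setup would more likely rely on an existing monitoring agent.

```python
import shutil


def percent_used(used: int, total: int) -> float:
    """Return used space as a percentage of total (0.0 if total is 0)."""
    if total == 0:
        return 0.0
    return 100.0 * used / total


def disk_alerts(paths, threshold_percent: float = 90.0):
    """Return (path, percent_used) pairs at or above the threshold."""
    alerts = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        pct = percent_used(usage.used, usage.total)
        if pct >= threshold_percent:
            alerts.append((path, pct))
    return alerts


# Hypothetical usage: check the root partition and a data mount.
#   for path, pct in disk_alerts(["/", "/var"]):
#       print(f"ALERT: {path} is {pct:.1f}% full")
```

Alerting at a threshold well below 100% is the point: it leaves time to clean up or add capacity before writes start failing.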
