From the Temples of Testers, a browser bestowed a 504 gateway timeout in your newly deployed internal facing Application Load Balancer (ALB). There was a gnashing of molars and gurning of visages.
Your ALB isn’t responding. Don’t panic and be “oh wow! heavy heavy heavy” like Neil from the Young Ones.
Be like Blackadder below with a cunning plan. Below are steps you too can follow to foil load balancer connectivity foibles.
Can the “computer says no” defect be replicated?
As it’s an internal not external load balancer
The problem appeared in an AWS direct-connect enabled testing account. But the development AWS account for debugging didn’t have direct connect. So how could the error be replicated?
The cloudformation stack had a bastion server. The solution was to add the bastion server CIDR range as an ingress rule to the ALB’s security group, so the load balancer would listen for requests from the bastion.
I ssh’ed to the bastion. I summoned the command of curl and lo! A 504 timeout was returned. Defect confirmed. First step, fair reader, is to get a bastion server with curl installed. Then add its CIDR range as ingress to the load balancer security group.
Are HTTP requests going to the right load balancer?
Like finding a single load balancer in a cloudformation stack
With multiple load balancers in the stack the next step was to check requests were going to right load balancer. The metrics tab in the suspect load balancer’s console confirmed recent requests and 5XX errors.
For IP details the load balancer logs were turned on. In the load balancer console, go to the load balancer properties. Logs are enabled via a tick box. It will ask for a s3 bucket to write logs to. Test log files will first be written. It takes the correct brewing of a nice cuppa (approximately 3 minutes according to earl grey boffins ) for real logs to be written to the s3 bucket.
Then curl again the calamitous http address and wait for logs to appear in s3. This will confirm where requests are being sent from. After an invigorating tea at approximately 3:33pm the logs confirmed the curl requests were being sent from the bastion and being received by our suspect load balancer.
Are the listening processes on the target server on?
Is anyone even listening at localhost?
Confirm the listening processes on the target server are running. Ssh to the target server from the bastion server. Curl the localhost to see what response is returned.
The server returned a HTTP 500 server error. The application server wasn’t starting automatically. A
ps -ef confirmed this. The server had not started. A butcher’s (hook i.e.look) at the server logs showed errors about the database connection were missing. Recent changes to the database loading sizes meant the application server was starting before loading finished. A fix was added for the application server to wait for the database connection.
With this fix the application server was a fine picture of Dorian Gray, and the next curl returned a HTTP 200. This confirmed the target server was listening.
Are the load balancer health checks even valid?
Because sometimes HTTP just ain’t enough
On the next attempt to curl from the bastion it returned a “HTTP 503 error service unavailable.” The load balancer health checks pointed to the correct port. However, the checks failed because it was set HTTP not HTTPS. The target endpoint had redirects from HTTP to HTTPS. All servers in the auto scaling group were being identified as unhealthy. So a quick flip to HTTPS and the instances returned as healthy. So requests from the load balancer could be passed onto healthy target servers.
Are the target groups, listeners & listener rules correct?
So many routes, so many rules, so many must just work
There was a sprawling tendril of target groups embedded with listeners spawned with listener rules. Never tested until its offering at the Temple of Testers. All but one target group, listener rule and listener were deleted from the cloudformation to isolate the connectivity problem. The condition on the listener rules needed revalidation. They were checked to have either correct paths e.g “orders/*, /login/*“ or relevant host values “*.myserver.io, *.*.myserver.io.”
Another curl attempt was made and a HTTP 502 was returned. The default listener rule was being triggered. However, it was directing traffic to a different target group which contained unhealthy instances. Problem for future diagnosis. The default group was changed to the target group being diagnosed. Curl still didn’t return a valid response. The listener ports were also updated to from HTTP to HTTPS. The change to HTTPS also required the additional configuration of certificates. As this was still non-production, development certificates were used. Alas curl was still returning an invalid response.
Are security groups from the load balancer to the target server correct?
What goes in, should with ingress/egress, go out too
Next test was to validate security groups. A python simple HTTP server was setup to listen for requests on the target server. Curl was re-run. HTTP server showed nothing. The security group from the load balancer to the target group didn’t have the port in the ingress rules. The port was added to ingress rules. The next curl returned a HTTP 200 response. Hail the stars! A Brian Blessed “Gordan is alive” hurrah all round!
Excuse me, did the web server pass requests to the application server?
Home base only counts if you land
It was a Trojan horse false victory. Python simple HTTP server was turned off and the correct web server was reinstated. Curl returned an invalid HTTP response again. The appropriate response at this point were words not for the faint of heart.
The configuration of the web server was inspected. Amazing an eyeball by the local application expert spotted the one parameter causing encryption kerfuffles in the redirects of HTTP to HTTPS requests. This parameter caused requests to not be processed. As the development certificates didn’t contain invalid addresses, the offending parameter was tweaked.
A curl again. The curl was harked! A HTTP 200 sprang forth from an application cradle of genesis, through a web primordial swamp, beyond an ALB gestalt and met its request with its magnificent HTML payload of “login and password.”
Lo! The internal ALB has been debugged. Computer can say “yes!”
Debug checklist for an Application Load Balancer:
Here is a summary checklist to debug an AWS ALB:
- Can the connectivity defect be reproduced? e.g. Sending cURL requests to the suspect URL via a bastion server.
- Are metrics and logs confirming requests are being sent to the right load balancer?
- Are the listening processes on the target host running?
- Are health checks on the load balancer valid? Does it need to be set to HTTPS?
- Do the target groups route to the correct auto-scaling group?
- Do the listener rule conditions have the correct host or path set? Does each condition resolve as expected?
- Is each listener and corresponding listener rule correct?
- Are the instances in the target group healthy?
- Do the ingress and egress rules on the loadbalancer security group allow traffic in/out to the target servers?
- Do the ingress and egress rules on the host server security group allow traffic in and out?
- Is the web server passing traffic onto the application server?
- Are certificates being used for encryption valid for the stack? Does it matter if they are not valid?