Kubernetes, Traefik, Apache PHP 7.3 – Static Assets giving HTTP 500’s

Problem Overview

A PHP-based website was being migrated from an older k8s cluster (1.19.x) to a newer version (1.21+) in order to address a variety of security vulnerabilities.

Moving to the new cluster exposed a new problem serving static assets from the frontend container (Apache with PHP-FPM). Any HTTP GET for an image, PDF, etc. resulted in an HTTP 500 response with the text “Internal Server Error”, as evidenced below:

curl -v "https://uat.foo.dev/public/custom/img/county_repr/_med-01-08-2021_1610149328.jpg"
*   Trying
* Connected to uat.foo.dev ( port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=uat.foo.dev
*  start date: Jan 13 01:00:32 2022 GMT
*  expire date: Apr 13 01:00:31 2022 GMT
*  subjectAltName: host "uat.foo.dev" matched cert's "uat.foo.dev"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fa4a580d600)
> GET /public/custom/img/county_repr/_med-01-08-2021_1610149328.jpg HTTP/2
> Host: uat.foo.dev
> User-Agent: curl/7.64.1
> authority: uat.foo.dev
> accept: image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8
> accept-language: en-US,en;q=0.9
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 500 
< content-type: text/plain; charset=utf-8
< content-length: 21
< date: Mon, 17 Jan 2022 13:26:13 GMT
* Connection #0 to host uat.foo.dev left intact
Internal Server Error* Closing connection 0

Well, this is exactly what’s not supposed to happen with Docker and containers! 🤔 Why doesn’t the exact same code work on a newer cluster?

Envisioning the environment

We deploy to an AKS cluster in Azure, and make use of namespacing in Kubernetes to have multiple, tenanted copies of our app code with independent data, storage volumes, etc. Here’s a conceptual idea:
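As a rough sketch of that layout (the namespace names below are illustrative, not our real ones), each tenant lives in its own namespace with an independent copy of the workloads and storage:

```shell
# Hypothetical tenant namespaces, each with its own copy of the app
kubectl get deployments -n uat
kubectl get deployments -n staging

# Each namespace also carries its own Services, Secrets, and volume claims
kubectl get svc,pvc -n uat
```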

We happen to be using Calico in our AKS cluster, so a walkthrough of Calico networking is helpful for understanding how containers “talk to” each other (DNS resolution, traffic routing, etc). I had to take some time and absorb this level of detail to formulate a hypothesis about what might be broken.

And lastly, it’s helpful to be aware that kube-proxy implements the Service abstraction, routing traffic to Pods even when they run on different physical nodes, while cluster DNS resolves Service names (scoped by namespace) so containers can find each other.
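A quick way to see both halves of that in action (the pod name here is one of ours; substitute any pod in your cluster) is to resolve the Service name from inside a pod:

```shell
# Cluster DNS resolves the namespaced Service name to a ClusterIP...
kubectl exec -n uat frontend-deployment-5646d7cd78-9mqn5 -- \
  nslookup frontend.uat.svc.cluster.local

# ...and kube-proxy routes traffic sent to that ClusterIP to a healthy
# backing Pod, regardless of which node the Pod is scheduled on.
```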

Most of the time, all of this magic “just works”…but occasionally there’s a breakdown and you have to understand it in order to begin to isolate the complexity and arrive at a solution.

Possible root causes

So we began to brainstorm on possible root causes, and develop techniques for finding the actual root cause.

Findings (getting more data)

Volume comparisons turned up nothing significant. The files had identical sizes, hashes, etc. As far as any tool (such as PHP, or cat in bash for that matter) was concerned, these files were just files.

We began inspecting network traffic on the container, so we could verify whether the container was properly responding with the JPG file content at each scope of the network.

We installed tcpdump and used it to inspect TCP traffic on the container:

apt-get update
apt-get install tcpdump
# capture HTTP traffic for any verb (GET, POST, etc.); the filter keeps only
# TCP packets on port 80 that carry payload (i.e., skips bare ACKs),
# and -X prints hex/ASCII views of each packet
tcpdump -X -s 0 "tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)" > capture.txt
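If you would rather analyze the traffic in Wireshark afterwards, the same capture can be written as a binary pcap instead of text (these are standard tcpdump flags):

```shell
# -s 0 captures full packets; -w writes raw packets to a pcap file
# that can be opened later in Wireshark or replayed with tcpdump -r
tcpdump -s 0 -w capture.pcap "tcp port 80"
```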

This proved helpful, because it showed that a JPG file was being streamed back in all cases (both “inside” of the ingress point, and from the outside public Internet).

We also wrote the capture of the .jpg being served to a file inside the container, and downloaded it with kubectl cp:

kubectl cp -n uat frontend-deployment-5646d7cd78-9mqn5:/var/www/html/capture.txt ./capture.txt

So it doesn’t seem to be Apache or the container… what about Traefik?

Traefik was being managed with Helm, so we redeployed it with the argument --log.level=DEBUG, based on the chart’s recommended values.yaml file on GitHub.

  - "--log.level=DEBUG"
  # ... other params omitted
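For reference, this is roughly how such a flag gets applied via Helm. The release name, chart name, and the additionalArguments key are assumptions here, since the exact values schema varies between chart versions; check them against your chart’s values.yaml:

```shell
# Redeploy Traefik with debug logging enabled; verify the values key
# (additionalArguments) matches the schema of your chart version
helm upgrade traefik traefik/traefik \
  --set "additionalArguments={--log.level=DEBUG}"
```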

Traefik logs

time="2022-01-17T22:07:36Z" level=debug msg="vulcand/oxy/roundrobin/rr: Forwarding this request to URL" Request="{\"Method\":\"GET\",\"URL\":{\"Scheme\":\"\",\"Opaque\":\"\",\"User\":null,\"Host\":\"\",\"Path\":\"/public/custom/img/county_repr/_med-01-08-2021_1610149328.jpg\",\"RawPath\":\"\",\"ForceQuery\":false,\"RawQuery\":\"\",\"Fragment\":\"\",\"RawFragment\":\"\"},\"Proto\":\"HTTP/2.0\",\"ProtoMajor\":2,\"ProtoMinor\":0,\"Header\":{\"Accept\":[\"/\"],\"User-Agent\":[\"curl/7.64.1\"],\"X-Forwarded-Host\":[\"uat.foo.dev\"],\"X-Forwarded-Port\":[\"443\"],\"X-Forwarded-Proto\":[\"https\"],\"X-Forwarded-Server\":[\"traefik-foo-8697ff5667-z8brp\"],\"X-Real-Ip\":[\"\"]},\"ContentLength\":0,\"TransferEncoding\":null,\"Host\":\"uat.foo.dev\",\"Form\":null,\"PostForm\":null,\"MultipartForm\":null,\"Trailer\":null,\"RemoteAddr\":\"\",\"RequestURI\":\"/public/custom/img/county_repr/_med-01-08-2021_1610149328.jpg\",\"TLS\":null}" ForwardURL="http://frontend.uat.svc.cluster.local:8085"

time="2022-01-17T13:53:15Z" level=debug msg="'500 Internal Server Error' caused by: net/http: HTTP/1.x transport connection broken: malformed HTTP status code \"bytes\""

So clearly the connection between Traefik and Apache is to blame, but what is causing this? Why is Apache responding this way?
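One way to inspect that hop in isolation (the Service URL comes from the ForwardURL in the log above; the throwaway pod is a convenience) is to bypass Traefik and hit the Service directly from inside the cluster, which shows the raw response Apache sends:

```shell
# Launch a temporary curl pod in the same namespace and request the asset
# directly from the frontend Service, with Traefik out of the path
kubectl run -n uat curl-test --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -sv -o /dev/null \
  "http://frontend.uat.svc.cluster.local:8085/public/custom/img/county_repr/_med-01-08-2021_1610149328.jpg"
```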

Online Research

With enough details to do intelligent searching, we stumbled upon a StackOverflow post that pointed the way.

It references a Microsoft article about Azure App Services that proved to be quite useful: Apache’s MMAP and SendFile features can sometimes cause exactly this. The article doesn’t state a reason as to WHY this happens, but it does indicate that there is faulty behavior in Apache behind it.

How does disabling these affect Apache?

We weren’t confident that this would fix the issue, but we gave it a try (along with upgrading to PHP 7.4, since this is an annual security upgrade), and sure enough: the problem stopped happening.
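For reference, the corresponding Apache directives are EnableMMAP and EnableSendfile (standard httpd core directives; where the conf file lives depends on your base image):

```apache
# Disable memory-mapping and kernel sendfile when delivering static files;
# both optimizations can misbehave on some filesystems and network stacks
EnableMMAP Off
EnableSendfile Off
```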

We decided to capture a successful download and compare it to the broken download. Here is the diff of the “broken” HTTP GET vs the “working” HTTP GET:

It seems that when MMAP and SendFile are enabled, at least under some circumstances, Apache fails to send the proper HTTP headers in the response. In fact, it’s a little crazy: you can see on the left (“broken”) side that the HTTP response never includes the expected status line “HTTP/1.1 200 OK”. Since that violates the protocol, of course net/http is choking on it.

It’s unclear why this didn’t affect previous versions of Kubernetes (1.19.x), and perhaps Kubernetes is not to blame at all. Most likely, the earlier version of Traefik was swallowing this protocol violation gracefully, as suggested by the log files above, so updating to the latest version of Traefik on the existing 1.19.x cluster would presumably surface the issue there too.

Disabling MMAP and SendFile in this case is obviously the right solution, because we want to return well-formed HTTP responses. I’ll have to file a bug report with Apache.
