♲ @denschub@pod.geraspora.de: Okay, here's a fun little story about a rabbit hole I just fell into. It's about running the #diaspora project web infrastructure, and it offers a bit of a glance behind the scenes.
Last week, I got an alert about unusually high load on the diaspora* project webserver. I was traveling for work, and the alert resolved itself after a couple of minutes -- so I ignored it. However, earlier today, I once again got alerted, this time about the wiki responding somewhat slowly.
This was unusual. Some of our web assets, like the project's website or Discourse, do get quite a bit of traffic, sometimes holding a steady load of 5+ requests per second. The wiki, however, is not that. It generally only sees 5 requests or so per minute, so it was weird to see the wiki causing trouble. After some initial stop-gap measures, I had a closer look at the node and noticed that the CPU load was at a significantly elevated, but stable, level. This, again, was unusual - because I designed the system in a way that allows it to handle quite large traffic spikes, and it shouldn't be under significant load on regular days.
Luckily, I have way more than enough metrics to investigate here. Looking at the traffic chart for the wiki only, it became very obvious that there was some sort of... deviation from the norm:
Again, traffic to the wiki is usually best measured in requests per minute - not requests per second. One could argue that diaspora* somehow just got very popular, but I have quite a bit of experience running services, and this chart didn't look natural to me.
The slow on-ramp of traffic over five days, and the occasional doubled traffic bursts - all of that looks like a bot (or a bot farm) trying to see how far it can go before running into rate limits or load limits. So that's... odd - especially on the wiki.
Now, I have a few signals I can use to correlate requests with each other. In this case, I used the source IP and the User-Agent string provided. And it became quite obvious what was happening: a couple of crawlers, with the Amazon Web Crawler https://developer.amazon.com/support/amazonbot leading the list, followed by some... shady and undocumented web crawlers, had a lot of fun crawling the entire diaspora* wiki in an endless loop - including images in various thumbnail sizes, all previous versions of all pages, etc. etc. Not nice. But also easy to work around.
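For the curious: there's nothing magical about that kind of correlation - it boils down to counting requests per (source IP, User-Agent) pair and looking at the top of the list. Here's a minimal sketch of the idea (the log file name and the regex are purely illustrative, not our actual setup):

```python
# Rough sketch: count requests per (source IP, User-Agent) pair,
# assuming a standard nginx/Apache "combined" access log.
import re
from collections import Counter

LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \S+ \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

counts = Counter()
with open("wiki-access.log") as logfile:
    for line in logfile:
        match = LOG_LINE.match(line)
        if match:
            counts[(match["ip"], match["agent"])] += 1

# The heavy hitters - crawlers, monitoring scripts, and the like - float
# straight to the top.
for (ip, agent), hits in counts.most_common(10):
    print(f"{hits:>8}  {ip}  {agent}")
```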
I noticed a couple of other interesting things, though. A lot of people run their own webservice monitoring applications, and for some reason, people find it fun to add other people's web properties to the list. I guess that's fine if it's only one or two cases, but if a lot of servers request a page once a minute... well, that eventually adds up to quite an accumulation of traffic. Anyway, that's annoying, but not on a level where it's causing pain. But some of you really, really should update your application versions.
Another, and much more interesting, pattern I saw was that... apparently, some diaspora* pods... wanted to federate stuff to the wiki?!
All of those traffic sources are actual diaspora* pods. On rank 1, there's @Fla's diaspora-fr.org, and on rank 2, there's my very own Geraspora. So that was... odd. There are a couple of very odd things in here from the point of view of a diaspora* developer. In theory, diaspora* should not try to federate to... a wiki. Also, all those requests were `GET` requests, not `POST` requests like you'd imagine. In fact, the only case where diaspora* should send a `GET` request is when it's trying to fetch a public entity from another pod. Every single one of those requests was a `GET /Choosing_a_pod`, and while I think that's a pretty important page on our wiki... y'know, it's not a diaspora* entity.
So what's going on here? After @Benjamin Neff and I spent a bit of time banging our heads against a wall, we eventually found out what happened.
A recently closed-down pod, probably in order to help their existing users, set up a redirect to the wiki's Choosing a pod https://wiki.diasporafoundation.org/Choosing_a_pod page. But they didn't only do that for specific URLs - they simply sent all requests to that wiki page with a `301` status code. And well, that includes the HTTP routes used for federation. So every time those diaspora* pods wanted to send something to the closed-down pod, they ended up requesting that wiki page. Fun times. But wait, didn't I say those requests were all `GET` requests? Well yes, they were! Even though diaspora* itself started out with a `POST`, that got lost somewhere.
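To make the setup on the closed-down pod's side a bit more concrete: they almost certainly did this with a simple web server rule rather than with any code, but in effect, it behaved like this little catch-all redirect (a sketch only - the port and names are made up):

```python
# Illustration of the closed-down pod's behavior: every path and every
# method gets answered with a 301 pointing at the wiki page.
from http.server import BaseHTTPRequestHandler, HTTPServer

WIKI_PAGE = "https://wiki.diasporafoundation.org/Choosing_a_pod"

class RedirectEverything(BaseHTTPRequestHandler):
    def redirect(self):
        # Swallow any request body, then redirect - including the POSTs
        # aimed at the federation routes.
        self.rfile.read(int(self.headers.get("Content-Length") or 0))
        self.send_response(301)
        self.send_header("Location", WIKI_PAGE)
        self.end_headers()

    do_GET = do_POST = do_HEAD = redirect

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectEverything).serve_forever()
```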
I'm a web browser nerd, so I knew what was happening here, but I appreciate that that's not universally the case. So, to explain: what we're seeing here is a bit of an HTTP edge case. The relevant specification says https://www.rfc-editor.org/rfc/rfc2616#section-10.3.2 (and yes, I know that that's an older version - but if you check the new versions, you will know why I quote RFC 2616 and not RFC 9110):
> If the 301 status code is received in response to a request other than GET or HEAD, the user agent MUST NOT automatically redirect the request unless it can be confirmed by the user, since this might change the conditions under which the request was issued.
>
> Note: When automatically redirecting a POST request after receiving a 301 status code, some existing HTTP/1.0 user agents will erroneously change it into a GET request.
And well, yes, that's exactly what's happening here. diaspora* sees the redirect (and actually marks the pod as "down", so it will stop receiving federation packets in two weeks), but the library we're using then switches to a `GET` request and just... queries the wiki.
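If you want to see that quirk in action without touching diaspora* at all: diaspora* is Ruby and uses its own HTTP client, but Python's requests library implements the same browser-compatibility behavior. With the little stand-in redirect server from the sketch above running on port 8080, a POST turns into a GET like this:

```python
# Demo only - this is not the HTTP client diaspora* uses, it just shows the
# same 301 quirk. Assumes the catch-all redirect sketch above is running
# locally on port 8080.
import requests

r = requests.post(
    "http://127.0.0.1:8080/receive/public",      # a federation-style receive path
    data="<xml>pretend this is a federation payload</xml>",
    allow_redirects=True,                        # the default, spelled out here
)

print(r.history[0].status_code)  # 301 - the catch-all redirect
print(r.request.method)          # "GET" - the POST silently became a GET
print(r.request.body)            # None - and the payload is gone
print(r.url)                     # the wiki's Choosing_a_pod page
```

The 301 shows up in the redirect history, but the request that actually reaches the wiki is a body-less GET - exactly the pattern from the access logs above.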
That's obviously a bug, and diaspora* should not follow redirects. Anyway. In the end, the requests caused by that bug were not responsible for any significant load on the server. But it was fun to dig into that. 😀