On Thu, 2023-08-10 at 16:00 +0000, Eliot Kimber wrote:
> This REST endpoint is called from server-side code that first checks for a cached preview and just returns it if there is one (avoiding the overhead of the REST call); otherwise it calls the endpoint.
I do something similar for fromoldbooks.org (using memcached for the front page, as the site sometimes gets.. a little busy :) )
A few things to watch for...
* write the new cache file to a temp file and then rename it; that way, another process can't start reading an incomplete cache file (a small sketch of this follows the list)
* I check the load average (by opening /proc/loadavg on a Linux server; it's a text file maintained by the kernel) and, if it's too high, sleep for a while to slow down crawlers, then return failure (see the second sketch after the list)
* I handle updating the cache in the front-end code, and I return the result before updating the cache, to shave a few ms off “time to first paint”. Response time affects your position in Google search results, if that matters to you.
* if your pages are public, crawler bots will pre-populate the cache, possibly with nonsensical parameters, so it can make sense to reject those early on. E.g. an incoming search at fromoldbooks.org with 30 keywords isn't from a human, as the UI doesn't support more than 3, so I don't need to store 2^30 cached pages when the bot tries every combination of them (the last sketch after the list shows this kind of early rejection)
* you can use the Google Search Console (I think that's the right place) to tell Googlebot about parameters that don't affect the result, so it shouldn't try every possible value.
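
For the temp-file-and-rename point, roughly, in Python (just a sketch, not the code I actually run; the names are illustrative). The details that matter are creating the temp file in the same directory as the final file and using an atomic rename:

import os
import tempfile

def write_cache_atomically(cache_path, content):
    # Create the temp file in the same directory as the final file,
    # because the rename is only atomic within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(cache_path) or ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(content)
        # Readers see either the old file or the new one, never a partial one.
        os.replace(tmp_path, cache_path)
    except Exception:
        os.unlink(tmp_path)  # don't leave stray temp files around on failure
        raise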
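And the load-average check, again as a rough Python sketch (the threshold and the sleep time here are placeholder values, not what I actually use):

import time

def load_too_high(threshold=8.0):
    # /proc/loadavg starts with the 1-, 5- and 15-minute load averages.
    with open("/proc/loadavg") as f:
        one_minute = float(f.read().split()[0])
    return one_minute > threshold

def throttle_if_busy():
    if load_too_high():
        time.sleep(5)   # slow the crawler down a little
        return False    # caller should report failure (e.g. a 503) instead of rendering
    return True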
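The early rejection of nonsense queries can be as simple as a bounds check before you go anywhere near the cache (again only a sketch; 3 is the fromoldbooks.org UI limit mentioned above):

MAX_KEYWORDS = 3  # the search UI never sends more than this

def plausible_query(keywords):
    # Anything outside what the UI can produce is a bot probing the
    # parameter space; reject it before creating a cache entry for it.
    return 0 < len(keywords) <= MAX_KEYWORDS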
liam