Thanks for the tips.
This is an internal server, so bots shouldn't be a concern, but the details of cache updating will certainly be important; that's a detail I have not yet attended to.
Today we implemented asynchronous fetching of the previews with in-browser JavaScript, which we got to work, but we still need to tune it and understand what the server-load implications are.
We generate a report of all the DITA tables in a given set of content (i.e., a given publication or set of publications). This can be several thousand tables, so even at 20ms per table it's still a long wait: 3,000 tables at 20ms each is a full minute.
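For concreteness, here's a simplified sketch of the shape of the browser-side code; the /preview endpoint, element IDs, and concurrency cap are illustrative stand-ins, not our actual names:

    // Simplified sketch: fetch previews asynchronously with a bounded
    // number of in-flight requests so thousands of tables don't hit
    // the server all at once. Endpoint and element IDs are illustrative.
    async function loadPreviews(tableIds: string[], maxInFlight = 4): Promise<void> {
      let next = 0;
      // Each worker pulls the next table ID until the queue is drained.
      const worker = async (): Promise<void> => {
        while (next < tableIds.length) {
          const id = tableIds[next++];
          try {
            const resp = await fetch(`/preview/${encodeURIComponent(id)}`);
            if (!resp.ok) continue; // skip failures; retry logic could go here
            const html = await resp.text();
            const cell = document.getElementById(`preview-${id}`);
            if (cell) cell.innerHTML = html;
          } catch {
            // network error: leave the placeholder as-is
          }
        }
      };
      // Run maxInFlight workers in parallel and wait for all to finish.
      await Promise.all(Array.from({ length: maxInFlight }, worker));
    }

With a cap of four in-flight requests, a few thousand previews arrive in waves rather than as one burst, and raising or lowering the cap gives us a direct handle on the server-load question.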
We’ll see how this approach works.
Cheers,
E.
_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368 M: 512 554 9368
https://www.servicenow.com
LinkedIn: https://www.linkedin.com/company/servicenow | Twitter: https://twitter.com/servicenow | YouTube: https://www.youtube.com/user/servicenowinc | Facebook: https://www.facebook.com/servicenow
From: Liam R. E. Quin <liam@fromoldbooks.org>
Date: Thursday, August 10, 2023 at 1:03 PM
To: Eliot Kimber <eliot.kimber@servicenow.com>, Christian Grün <christian.gruen@gmail.com>
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] How best to cache an intermediate result in the context of a larger query?
On Thu, 2023-08-10 at 16:00 +0000, Eliot Kimber wrote:
This REST endpoint is called from server-side code that also checks for a cached preview and just returns it (avoiding the overhead of the REST call); otherwise it calls the endpoint.
I do something similar for fromoldbooks.org (using memcached for the front page, as the site sometimes gets.. a little busy :) )
A couple of things to watch for...
* write the new cache file to a temp file and then rename it; that way, another process can't start reading an incomplete cache file (this and the next two tips are sketched in code after the list)
* I check the load average (by opening /proc/loadavg on a Linux server; it's a text file maintained by the kernel) and, if it's too high, sleep for a while to slow down crawlers, then return failure.
* updating the cache I handle in the front-end code, and I return the result before updating the cache, to shave a few ms off "time to first paint". Time to first paint affects your position in Google search results, if that matters to you.
* if your pages are public, crawler bots will pre-populate the cache, possibly with nonsensical parameters, so it can make sense to reject those early on. E.g., an incoming search at fromoldbooks.org with 30 keywords isn't from a human, as the UI doesn't support more than 3, so I don't need to store 2^30 cached pages when a bot tries every combination.
* you can use the Google Search Console (I think that's the right place) to tell the Google bot about parameters that don't affect the result, so it shouldn't try every possible value.
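For what it's worth, the first three points might look roughly like this in a Node/TypeScript handler; the cache directory, load threshold, and renderPage() are placeholders, not my actual code:

    import * as fs from "node:fs";
    import * as os from "node:os";
    import * as path from "node:path";

    const CACHE_DIR = "/var/cache/previews"; // placeholder location

    // Tip 1: write the cache entry to a temp file, then rename it into
    // place. rename() is atomic on the same filesystem, so a concurrent
    // reader never sees a half-written file.
    function writeCache(key: string, body: string): void {
      const finalPath = path.join(CACHE_DIR, key);
      const tmpPath = `${finalPath}.tmp-${process.pid}`;
      fs.writeFileSync(tmpPath, body);
      fs.renameSync(tmpPath, finalPath);
    }

    // Tip 2: check the load average (os.loadavg() reports the same
    // numbers as /proc/loadavg) and back off when it's too high.
    function overloaded(threshold = 8): boolean {
      return os.loadavg()[0] > threshold; // threshold is a placeholder
    }

    // Stand-in for whatever actually renders the page or preview.
    async function renderPage(key: string): Promise<string> {
      return `<p>rendered content for ${key}</p>`;
    }

    // Tip 3: on a cache miss, send the response first and update the
    // cache afterwards, keeping the write off the time-to-first-paint
    // critical path. `key` is assumed validated/sanitized upstream.
    async function handle(key: string, send: (body: string) => void): Promise<void> {
      const cached = path.join(CACHE_DIR, key);
      if (fs.existsSync(cached)) {
        send(fs.readFileSync(cached, "utf8")); // cache hit: cheap path
        return;
      }
      if (overloaded()) {
        // Sleep to slow aggressive crawlers down, then return failure.
        await new Promise((resolve) => setTimeout(resolve, 5000));
        send("Server busy; please try again shortly.");
        return;
      }
      const body = await renderPage(key);
      send(body);            // respond first...
      writeCache(key, body); // ...then populate the cache
    }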
liam
--
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations: http://www.fromoldbooks.org