Thanks for the tips.

This is an internal server, so bots shouldn’t be a concern, but the details of updating the cache will certainly be important; that’s a detail to which I have not yet attended.

Today we implemented in-browser JavaScript that fetches the previews asynchronously. We got it to work, but we still need to tune it and understand the server-load implications.

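In outline, the browser side looks something like this (a minimal sketch; the endpoint URL, query parameter, and element IDs are placeholders, not the actual names):

    // Fetch one table preview and drop it into its placeholder element.
    async function loadPreview(tableId) {
      const response = await fetch("/preview?table=" + encodeURIComponent(tableId));
      if (!response.ok) return; // keep the placeholder if the preview fails
      document.getElementById("preview-" + tableId).innerHTML =
        await response.text();
    }

    // Kick off all previews without blocking the initial page render.
    document.querySelectorAll("[data-table-id]").forEach((el) => {
      loadPreview(el.dataset.tableId);
    });

The main tuning knob is presumably how many of these requests are in flight at once.
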
We generate a report of all the DITA tables in a given set of content (i.e., a given publication or set of publications). This can be several thousand tables, so even at 20 ms per table it’s still a long wait: 3,000 tables works out to a full minute.

We’ll see how this approach works.

Cheers,

E.

_____________________________________________

Eliot Kimber

Sr Staff Content Engineer

O: 512 554 9368

M: 512 554 9368

servicenow.com

LinkedIn | Twitter | YouTube | Facebook

From: Liam R. E. Quin <liam@fromoldbooks.org>
Date: Thursday, August 10, 2023 at 1:03 PM
To: Eliot Kimber <eliot.kimber@servicenow.com>, Christian Grün <christian.gruen@gmail.com>
Cc: basex-talk@mailman.uni-konstanz.de <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] How best to cache an intermediate result in the context of a larger query?

On Thu, 2023-08-10 at 16:00 +0000, Eliot Kimber wrote:
>
>
> This REST endpoint is called from server-side code that also checks
> for a cached preview and just returns it (avoiding the overhead of
> the REST call), otherwise it calls the endpoint.

I do something similar for fromoldbooks.org (using memcached for the
front page, as the site sometimes gets... a little busy :) )

A few things to watch for...

* write the new cache file to a temp file and then rename it; that way,
another process can't start reading an incomplete cache file (see the
first sketch after this list)

* I check the load average (by opening /proc/loadavg on a Linux server;
it's a text file maintained by the kernel) and if it's too high, sleep
for a while to slow down crawlers, then return failure (second sketch
below).

* I handle updating the cache in the front-end code, and I return the
result before updating the cache, to shave a few ms off “time to
first paint” (third sketch below). Time to first paint affects your
position in Google search results, if that matters to you.

* if your pages are public, crawler bots will pre-populate the cache,
possibly with nonsensical parameters, so it can make sense to reject
those early on (fourth sketch below). E.g. an incoming search at
fromoldbooks.org with 30 keywords isn't from a human, as the UI doesn't
support more than 3, so I don't need to store 2^30 cached pages when a
bot tries every combination.

* you can use the Google Search Console (I think that's the right
place) to tell the Google bot about parameters that don't affect the
result, so it shouldn't try every possible value.
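
A minimal sketch of the write-then-rename trick (Node.js, to stay in
JavaScript; the helper name and temp-file naming are invented). rename()
is atomic within a single filesystem, so a reader sees either the old
cache file or the complete new one:

    const fs = require("fs");
    const path = require("path");

    function writeCacheAtomically(cachePath, contents) {
      // Write to a temp file in the same directory so the rename never
      // crosses filesystems (cross-device renames are not atomic).
      const tmpPath = path.join(
        path.dirname(cachePath),
        "." + path.basename(cachePath) + "." + process.pid + ".tmp"
      );
      fs.writeFileSync(tmpPath, contents);
      fs.renameSync(tmpPath, cachePath);
    }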
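
The load-average check, sketched under the same assumptions (Linux only;
the threshold and sleep time are made-up numbers). The first field of
/proc/loadavg is the one-minute average:

    const fs = require("fs");

    function oneMinuteLoad() {
      return parseFloat(
        fs.readFileSync("/proc/loadavg", "utf8").split(" ")[0]
      );
    }

    async function rejectIfBusy(maxLoad = 8) {
      if (oneMinuteLoad() > maxLoad) {
        // Sleep for a while to slow crawlers down, then report failure.
        await new Promise((resolve) => setTimeout(resolve, 5000));
        throw new Error("server too busy, try again later");
      }
    }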
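
"Respond first, cache later" might look like this (the handler shape and
the cachePathFor() helper are hypothetical; it reuses the atomic-write
helper from the first sketch):

    function handleRequest(req, res, renderPage) {
      const html = renderPage(req);
      res.end(html); // send the response immediately...
      setImmediate(() => {
        // ...then update the cache off the hot path, after the client
        // already has its bytes.
        writeCacheAtomically(cachePathFor(req), html);
      });
    }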
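
And rejecting implausible parameters early (the limit of 3 mirrors the
fromoldbooks.org example; the "keywords" parameter name is invented):

    function parseKeywords(params) {
      const keywords = (params.keywords || "")
        .trim()
        .split(/\s+/)
        .filter(Boolean);
      // The UI never submits more than 3 keywords, so a longer query is
      // a bot probing combinations: fail fast instead of caching it.
      return keywords.length <= 3 ? keywords : null;
    }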

liam


--
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org