In the context of a RESTXQ handler that processes a stored document to generate HTML from it, I need to do some expensive processing and then cache the result for the next time the same document is rendered. It doesn’t make sense to preprocess all the documents to cache the data as part of my overall ingestion process, because only a fraction of the documents will ever be rendered by this page.
I think the answer is to turn on MIXUPDATES so I can invoke an updating function in the course of the larger RESTXQ handling query, but I’m wondering if there’s a less drastic or better-architected approach. For example, should I be invoking a separate REST endpoint to do the calculation, blocking on the request, and then fetching the cached result?
I read through the Update module documentation and the various pages on XQuery Update and feel like I still don’t quite understand what my updating options are.
Thanks,
Eliot
_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368
M: 512 554 9368
servicenow.com
LinkedIn | Twitter | YouTube | Facebook
On Thu, 2023-07-13 at 19:44 +0000, Eliot Kimber wrote:
In the context of a RESTXQ handler that processes a stored document to generate HTML from it, I need to do some expensive processing and then cache the result for the next time the same document is rendered.
For fromoldbooks.org/Search I use a cache on the front end, outside BaseX. The one I use is custom code, but there are off-the-shelf ones available now.
If the data is sensitive, there's a danger of HTTP injection attacks with this approach, especially with POST queries, so check the cache's story on security (more details on request).
For the front page (www.fromoldbooks.org) I use memcached, because what's served doesn't depend on query parameters.
liam
Hi Eliot,
The following RESTXQ function demonstrates how you can cache RESTXQ results in the store:
declare %rest:path('cache/{$a}') function local:convert($a) {
  store:get-or-put($a, function() {
    <result ts='{ current-dateTime() }'>{ $a * $a }</result>
  })
};
If the endpoint is called for the first time, the function argument is executed and the result is returned & cached in the store. If the endpoint is called a second time, the cached value is returned (you’ll see this by reading the timestamp).
Here’s one solution for storing cached results in a secondary database:
declare %updating %rest:path('cache/{$name}') function local:convert($name) {
  if(db:exists('db-cache', $name)) then (
    update:output(db:get('db-cache', $name))
  ) else (
    let $updated := db:get('db', $name) update {
      insert node attribute ts { current-dateTime() } into *
    }
    return (
      db:put('db-cache', $updated, $name),
      update:output($updated)
    )
  )
};
Hope this helps, Christian
We have successfully implemented the technique described by Christian. We have a REST endpoint that takes a database name and the node ID of a DITA element. If an HTML preview of the element is present in our cache (a file-system directory organized by node ID), it is retrieved and returned; otherwise the preview is constructed, cached, and returned.
This REST endpoint is called from server-side code that also checks for a cached preview and, if one exists, just returns it (avoiding the overhead of the REST call); otherwise it calls the endpoint.
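In outline, the endpoint looks something like the sketch below. This is a simplified illustration rather than our production code: the cache directory, the local:make-preview helper, and the use of db:get-id to resolve the node ID are placeholders for whatever your setup actually uses.

(: placeholder for the real tree walk that turns a DITA element into an HTML preview :)
declare function local:make-preview($node as node()) as element(div) {
  <div class="preview">{ $node }</div>
};

declare
  %rest:path('preview/{$db}/{$id}')
  %output:method('html')
function local:preview($db as xs:string, $id as xs:integer) {
  let $cache-dir  := '/var/cache/previews/'  (: placeholder location :)
  let $cache-file := $cache-dir || $db || '-' || $id || '.html'
  return if (file:exists($cache-file)) then (
    (: cache hit: return the stored preview :)
    parse-xml(file:read-text($cache-file))
  ) else (
    (: cache miss: build the preview, store it, return it :)
    let $preview := local:make-preview(db:get-id($db, $id))
    return (
      file:create-dir($cache-dir),
      file:write($cache-file, $preview),
      $preview
    )
  )
};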
This reduced the time from about 50ms to construct a preview to about 17ms to fetch the cached one. We initially used a database for the cache, but that didn’t offer any significant time savings over the file system, so we used the file system instead; we had no other reason to need a database for the cached previews.
We are now exploring upping our JavaScript game to use the REST API from the browser, treating our table previews the way the browser treats images: fetching them asynchronously on page load.
The main complexity in our implementation is that we had to make it possible to bypass the cache in the tree-walk code that builds the previews (it’s a simple XQuery tree walk over the incoming XML that produces the HTML).
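For anyone curious what such a tree walk looks like, the core pattern is a recursive typeswitch, roughly like the much-simplified sketch below. It is illustrative only, not our actual code: real DITA tables also involve colspec, spans, titles, and so on, and the cache-bypass flag is omitted.

declare function local:to-html($n as node()) as node()* {
  typeswitch ($n)
    case element(table)  return <table>{ $n/node() ! local:to-html(.) }</table>
    case element(tgroup) return $n/node() ! local:to-html(.)  (: structural wrapper, no HTML counterpart :)
    case element(thead)  return <thead>{ $n/node() ! local:to-html(.) }</thead>
    case element(tbody)  return <tbody>{ $n/node() ! local:to-html(.) }</tbody>
    case element(row)    return <tr>{ $n/node() ! local:to-html(.) }</tr>
    case element(entry)  return <td>{ $n/node() ! local:to-html(.) }</td>
    case text()          return $n
    default              return $n/node() ! local:to-html(.)
};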
Cheers,
E.
On Thu, 2023-08-10 at 16:00 +0000, Eliot Kimber wrote:
This REST endpoint is called from server-side code that also checks for a cached preview and, if one exists, just returns it (avoiding the overhead of the REST call); otherwise it calls the endpoint.
I do something similar for fromoldbooks.org (using memcached for the front page, as the site sometimes gets.. a little busy :) )
A couple of things to watch for...
* Write the new cache file to a temp file and then rename it; that way, another process can't start reading an incomplete cache file (see the sketch after this list).
* I check the load average (by opening /proc/loadavg on a Linux server; it's a text file maintained by the kernel) and, if it's too high, sleep for a while to slow down crawlers, then return failure (also sketched below).
* I handle updating the cache in the front-end code, and I return the result before updating the cache, to shave a few ms off “time to first paint”. This affects your position in Google search results, if that matters to you.
* If your pages are public, crawler bots will pre-populate the cache, possibly with nonsensical parameters, so it can make sense to reject those early on. E.g. an incoming search at fromoldbooks.org with 30 keywords isn't from a human, as the UI doesn't support more than 3, so I don't need to store 2^30 cached pages when a bot tries every combination.
* You can use the Google Search Console (I think that's the right place) to tell the Google bot about parameters that don't affect the result, so it shouldn't try every possible value.
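If the cache lives on the BaseX side rather than in front-end code like mine, the first two points might look roughly like this. It's only a sketch using BaseX's File Module; the function names and threshold are made up, and you should check that file:move replaces an existing target in your version.

(: write cache content to a temp file in the target's directory, then rename it
   into place, so a concurrent reader never sees a half-written file :)
declare function local:write-cache($target as xs:string, $content as xs:string) as empty-sequence() {
  let $tmp := file:create-temp-file('cache-', '.tmp', file:parent($target))
  return (
    file:write-text($tmp, $content),
    file:move($tmp, $target)
  )
};

(: true if the 1-minute load average (Linux only) exceeds the given threshold;
   callers can then sleep (e.g. prof:sleep) and/or refuse the request :)
declare function local:load-too-high($threshold as xs:double) as xs:boolean {
  file:exists('/proc/loadavg')
  and xs:double(tokenize(file:read-text('/proc/loadavg'), ' ')[1]) > $threshold
};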
liam
Thanks for the tips.
This is an internal server, so bots shouldn’t be a concern, but the details of cache updating will certainly be important; that’s something I haven’t yet attended to.
Today we implemented using in-browser JavaScript to fetch the previews asynchronously, which we got working, but we still need to tune it and understand what the server-load implications are.
We generate a report of all the DITA tables in a given set of content (i.e., a given publication or set of publications). This can be several thousand tables, so even at 20ms per table, it’s still a long wait.
We’ll see how this approach works.
Cheers,
E.