I’m measuring the specific db:token() lookup in order to isolate it from the effects of other processing.
These are page view records per document, covering several different published versions of each document, so for a given path you would expect at most three or four results rather than thousands.
My implementation is quite naïve in that I’m just chunking the raw CSV data into a database and then hoping the token index will provide good lookup performance. That has been my experience with other queries (lookup times of 0.02 seconds or better), so the 0.3-second time looks anomalous and makes me suspect an error on my end.
This is in the context of a generic “enable processing of any CSV data” feature, rather than a dedicated “report on page views data” feature, where I would construct a more efficient index (e.g., node IDs to page view data or something similar).
Here are the settings for the analytics database, which holds the CSV XML data:
NAME: _analytics
SIZE: 257 MB
NODES: 9793157
DOCUMENTS: 11
BINARIES: 0
VALUES: 0
TIMESTAMP: 2024-07-14T20:49:34.624Z
UPTODATE: ✓

RESOURCEPROPERTIES
INPUTPATH:
INPUTSIZE: 0 b
INPUTDATE: 2024-04-17T21:37:04.516Z

INDEXES
TEXTINDEX: ✓
ATTRINDEX: ✓
TOKENINDEX: ✓
FTINDEX: –
TEXTINCLUDE:
ATTRINCLUDE:
TOKENINCLUDE:
FTINCLUDE:
LANGUAGE: English
STEMMING: –
CASESENS: –
DIACRITICS: –
STOPWORDS:
UPDINDEX: ✓
AUTOOPTIMIZE: –
MAXCATS: 100
MAXLEN: 255
SPLITSIZE: 0
Thanks,
Eliot
From: Christian Grün <christian.gruen@gmail.com>
Date: Tuesday, July 16, 2024 at 9:32 AM
To: Eliot Kimber <eliot.kimber@servicenow.com>
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Query optimization: What can I check or measure?

Hi Eliot,
It’s difficult to give a general response on that without having a complete look at the architecture, but I’ll try:
> I’m measuring a consistent 0.3 seconds for this query:
How much time is spent if you omit the parent step?
db:token($analyticsmgmt:analyticsDatabase, $docPath, 'topicpath')
Next, how many results do you get for a single request? Is it always a single result, or can it be a vast number? How are the values distributed (index:tokens may help to assess this)?
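A rough sketch of such a distribution check, assuming the database is named _analytics as in the properties listed in this thread. index:tokens returns one element per indexed token with a count attribute, so sorting by that count shows which tokens occur most often. As far as I know, with an empty TOKENINCLUDE the token index covers the tokens of all attributes, not only topicpath, so the list may contain unrelated values. The cutoff of 20 is arbitrary.

```xquery
(: Sketch: list the 20 most frequent entries in the token index.
   '_analytics' is the database name from the properties above;
   adjust the name and the cutoff as needed. :)
let $entries :=
  for $entry in index:tokens('_analytics')
  order by xs:integer($entry/@count) descending
  return $entry
return subsequence($entries, 1, 20)
```

If a handful of tokens show very large counts, lookups for those values will naturally return many nodes and take longer than lookups for rare values.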
You can attach "=> prof:time()" to an expression to do some isolated performance measurements.
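A minimal sketch of what such isolated measurements might look like, timing the index lookup and the parent step separately. The database name and the path value are placeholders (the path is hypothetical), and the label argument assumes a recent BaseX version in which prof:time accepts an optional label as its second argument.

```xquery
(: Sketch: time the token lookup and the parent step separately.
   $db and $docPath are placeholder values, not taken from the real data. :)
let $db := '_analytics'
let $docPath := 'path/to/topic.dita'
(: attribute nodes found via the token index :)
let $attrs := db:token($db, $docPath, 'topicpath') => prof:time('token lookup: ')
(: parent elements of those attributes :)
let $rows := $attrs/.. => prof:time('parent step: ')
return count($rows)
```

The timings are written to standard error or, in the GUI, to the Info view, while the query result itself is unchanged. Comparing the two numbers should indicate whether the time is spent in the index access or in the subsequent traversal.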
In principle, it makes no difference if the data is stored in one huge document or in millions of documents.
Best, Christian