I was starting to think it was something like that.
We’re exploring how to implement tf/idf using the full-text primitives. It doesn’t look like it should be too hard, but we’ve only taken baby steps so far (we have a summer intern who is studying data science and is eager to explore something he studied last semester, and it’s a good way for him to learn some XQuery). Also, another shout-out for XQuery for Humanists: it’s been a good learning tool for him.
We can get the list of terms in the full-text index, the list of documents, and of course the full text of each document, so calculating tf/idf should be a simple matter of iterating over terms and documents and capturing the results in a purpose-built index.
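The iteration described above can be sketched as follows. This is a minimal illustration of the tf/idf arithmetic in Python rather than the XQuery the author plans to write; the `tf_idf` function name and the "documents as token lists" input shape are assumptions for the sketch, and the resulting map stands in for the "purpose-built index".

```python
import math
from collections import Counter


def tf_idf(docs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Compute tf/idf per (document, term); docs maps doc id -> token list."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        # tf = term count / document length; idf = log(N / document frequency).
        scores[doc_id] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores
```

Note that a term occurring in every document gets an idf of log(1) = 0, which is why recomputing idf after every update was expensive enough for BaseX to drop it from the index, as Christian mentions below.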
I don’t know that it will tell us anything interesting about our corpus of ServiceNow platform documentation but you never know until you ask the question…
Cheers,
E.
_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368 | M: 512 554 9368
servicenow.com <https://www.servicenow.com>
LinkedIn <https://www.linkedin.com/company/servicenow> | Twitter <https://twitter.com/servicenow> | YouTube <https://www.youtube.com/user/servicenowinc> | Facebook <https://www.facebook.com/servicenow>
From: Christian Grün <christian.gruen@gmail.com>
Date: Monday, June 13, 2022 at 9:50 AM
To: Eliot Kimber <eliot.kimber@servicenow.com>
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Term Frequency/inverse Document Frequency Calculation? [External Email]
Hi Eliot,
An earlier version of BaseX stored TF/IDF data in the full-text index. We eventually got rid of the solution as it was too expensive to recompute the IDF values after updates.
Best, Christian
On Wed, Jun 8, 2022 at 12:06 AM Eliot Kimber eliot.kimber@servicenow.com wrote:
We’d like to report tf/idf for our DITA content set (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
Of course this is possible using BaseX and basic full-text processing.
My question: has anyone done this or is there somewhere I can look to at least get an idea of the level of effort?
Having thought about it only briefly, I’m thinking it’s an application of the basic “make an index over the words for each doc” technique that others have discussed recently.
Thanks,
E.