Hello Christian & the BaseX Community,
Happy New Year and Decade!
I will get to my specific BaseX/XQuery question below, but first some long but mostly necessary context...
I am a 69-year-old post-cancer #PayItForward indie #CitizenScientist working at the intersection of #DigitalHumanities and #AI/#MachineLearning. I am developing the #MAGAZINEgts ground-truth storage format for serial publications (primarily print magazines and newspapers). My format is based on an ontological stack of CIDOC-CRM, FRBRoo, and PRESSoo, and utilizes a metamodel subgraph design pattern to support integrated complex document structure and content depiction models.
In addition, my fellow cancer-surviving wife and I are developing a reference implementation of the #MAGAZINEgts format for the 48-issue run of Softalk magazine published between 1980 and 1984 during the dawn of the Microcomputer and Digital Age. Post-cancer and with determination to reinvent ourselves as Citizen Scientists in the Digital Humanities, we funded digitization of our magazine collection into the Internet Archive as a gift to each other for our 25th wedding anniversary.
Initially, I did all our exploratory #MAGAZINEgts prototyping in Python, importing and exporting data via JSON, with one-off scripts to format batches of information as XML to fold into the published #MAGAZINEgts file for the Softalk magazine reference implementation at the Archive. Then I found BaseX!!! :D
Upon discovering BaseX, I began using it for direct #MAGAZINEgts XML access and development. The FactMiners Toolkit is, at this stage, the old pre-BaseX code plus new features for generating balanced training datasets for #MachineLearning model training. I am now revisiting my early development and refactoring it to incorporate direct editing in BaseX.
I recently completed the generation of a "seed"/master dataset for all the advertisements in Softalk magazine (https://github.com/SoftalkAppleProject/datasets_ml_all_ads_1M). Used in conjunction with the PRESSoo Issuing Rules of the Advertising Model in the Metamodel partition of the #MAGAZINEgts file, this seed dataset can generate any "slice and dice" dataset for #ML model training that is aware of the interrelationships of advertisements' size, shape, and allowable positions on variously formatted multi-column pages. (If this domain of research interests anyone, please do not hesitate to contact me. I can point you to our #DATeCH papers/posters and Neo4j #GraphGists, etc. for more information. You will also find relevant links to our research on the GitHub repo cited above.)
Now on to my specific question/request-for-information...
To generate #MachineLearning training datasets, I need not only to slice and dice the "actual" data of the advertisement dataset, but also to incorporate page examples that DO NOT include the document structure the model is being trained to recognize. In other words, the "non-label" instances of a #ML training dataset are the members of the set-theoretic complement of the pages that DO have advertisements. Fortunately, and by design, the #MAGAZINEgts format incorporates all the information needed for this dataset generation... the problem is that I have only ever known enough XML-related technologies to "git'er done." And even whatever I may have known is long since gone following my cancer battle.
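To make that set-complement idea concrete, here is a toy sketch with made-up page ids (not the real #MAGAZINEgts schema):
======
(: Toy illustration: the "non-label" pages are all pages minus the
   pages that carry an advertisement. :)
let $all_pages := ('p1', 'p2', 'p3', 'p4', 'p5')
let $ad_pages  := ('p2', 'p4')
return $all_pages[not(. = $ad_pages)]
(: result: 'p1', 'p3', 'p5' :)
======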
Now, given the OCD-ness of my basic personality structure, I thought it would be best to develop a solution to present before asking for community help... and here I am.
The digital collection of Softalk magazine is in the Internet Archive here:
https://archive.org/details/softalkapple?sort=date
The #MAGAZINEgts file for Softalk magazine is here:
https://archive.org/download/softalkapple/softalkapple_publication.xml
For simplicity's sake during my exploratory/learning experience, I import the softalkapple_publication.xml file into BaseX with the "strip namespaces" option checked and rename the imported database to 'MAGgts'.
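For the record, I believe the same import could be scripted rather than done through the GUI dialog; here is a sketch using BaseX's db module (untested -- I am assuming the lowercase 'stripns' parsing option corresponds to the GUI's strip-namespaces checkbox):
======
(: Create the MAGgts database directly from the downloaded file,
   stripping namespaces while parsing (assumed option name). :)
db:create(
  'MAGgts',
  'softalkapple_publication.xml',  (: input file :)
  'softalkapple_publication.xml',  (: path inside the database :)
  map { 'stripns': true() }
)
======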
My problem is not that I cannot generate the non-label dataset, but rather that this query takes roughly 30 seconds to execute! Honestly, the fact that this query works is all I need. But in a world of near-instantaneous responses to virtually all data inquiries, I am wondering if there is a far better solution to my problem.
(Also note: the somewhat strange format of the returned results -- 'NoAdPg' elements wrapped within a 'Result' element -- is for the convenience of bringing this data into Python to be parsed by the 'xmltodict' module.)
If it turns out that this 30+ second run is, in fact, justified by the actual computation being done, no problem; an explanation of why that much processing is needed would be one welcome response to this inquiry. On the other hand, there is a vast domain of XQuery know-how to which I am blind, and I would value any explanation or optimization that could speed this operation up.
Here is my current 30+ second query:
======
(: Round up, by issue, the leaf images of Softalk magazine that
   DO NOT have an advertisement on them, excluding Cover 1, inserts
   (e.g. 'blow-in' cards), and missing pages... :)
(: The softalkapple_publication.xml file from the Internet Archive's
   Softalk collection is imported into BaseX as the 'MAGgts' database,
   with namespaces stripped for interactive development/learning
   convenience. :)
(: This NS declaration is not needed when namespaces are stripped:
   declare default element namespace
     "http://www.factminers.org/MAGAZINE/gts/2019-01-18"; :)
declare variable $ad_index := doc('MAGgts')//AdIndex/*;

(: Gather all issues in the Leaf2ppg map (page image to print page number map). :)
declare variable $all_issues := doc('MAGgts')//Metadata//Leaf2ppg_map[1]/*;

if (count($all_issues) > 0) then
  <Result>{
    for $issue in $all_issues
    for $leaf_spec in $issue/*
    let $leaf_id    := $leaf_spec/Issue_id
    let $leaf_pgnum := $leaf_spec/PageNum
    let $leaf_type  := $leaf_spec/PgType
    where $leaf_pgnum != 'Cover1'
      and $leaf_pgnum != 'Insert Stub'
      and $leaf_pgnum != 'Insert Content'
      and $leaf_type  != 'MissingPgs'
    (: We're not interested in pages w/ ads, only those WITHOUT ads. :)
    where not(exists($ad_index[PageNum = $leaf_pgnum and Issue_id = $leaf_id]))
    return
      <NoAdPg>
        <LeafNum>{ data($leaf_spec/@leafnum) }</LeafNum>
        { $leaf_spec/(Issue_id | PgType | PageNum) }
      </NoAdPg>
  }</Result>
else
  "Nothing found."
======
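One idea I ran across while reading about XQuery 3.1 maps, though I have not yet verified it against my data: the exists($ad_index[...]) check re-scans the entire ad index once per leaf, so with thousands of leaves and thousands of ads that nested scan may well account for most of the 30 seconds. Pre-building a lookup map keyed on issue and page should turn each check into a constant-time probe. A sketch (untested; the 'Issue_id|PageNum' key format is just an illustrative choice, and the count()/else wrapper is omitted for brevity):
======
(: Build one "Issue_id|PageNum" key per advertisement occurrence.
   Duplicate keys are harmless here because every value is true(). :)
declare variable $ad_keys := map:merge(
  for $ad in doc('MAGgts')//AdIndex/*
  return map:entry($ad/Issue_id || '|' || $ad/PageNum, true())
);
declare variable $all_issues := doc('MAGgts')//Metadata//Leaf2ppg_map[1]/*;

<Result>{
  for $issue in $all_issues
  for $leaf_spec in $issue/*
  let $leaf_pgnum := $leaf_spec/PageNum
  where $leaf_pgnum != 'Cover1'
    and $leaf_pgnum != 'Insert Stub'
    and $leaf_pgnum != 'Insert Content'
    and $leaf_spec/PgType != 'MissingPgs'
    (: a single hash probe replaces the scan over $ad_index :)
    and not(map:contains($ad_keys, $leaf_spec/Issue_id || '|' || $leaf_pgnum))
  return
    <NoAdPg>
      <LeafNum>{ data($leaf_spec/@leafnum) }</LeafNum>
      { $leaf_spec/(Issue_id | PgType | PageNum) }
    </NoAdPg>
}</Result>
======
If BaseX's optimizer already rewrites the original predicate into an index lookup, this may buy nothing -- which would itself be useful to know.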
Again, Happy New Year/Decade. Any explanation or optimization greatly appreciated.
Happy-Healthy Vibes from Colorado,
-: Jim :-