Christian,
Wow, thank you so much!!! :)
Your insights and explanation are so helpful. I will surely be able to learn from them.
As I mentioned in my inquiry, it would be fine if it turned out that this query was about as good as it could be. I just wanted to make sure that I was not doing something so obviously poorly that it would negatively impact the performance of my Python app, not to mention my credibility as a developer. Once I have refactored the existing code of the FactMiners Toolkit to incorporate real-time, direct #MAGAZINEgts XML editing via BaseX, I intend to publish this app via GitHub. As an untrained indie #CitizenScientist working with an awesome community of #DigitalHumanities and #MachineLearning folks, it is important that I do whatever I can to maintain credibility.
The impact of this long-running query on the overall performance of the FactMiners Toolkit will be minimal. UI/UX-wise, I only need to either show a modal dialog signaling the long-running operation or spawn it into a Python thread independent of the main app, the way I already do to silently queue up high-resolution page images from the Internet Archive during ground-truth metadata editing.
Beyond there being an easy UI/UX solution for this query's processing time, the operation will not need to be run very often. The #MAGAZINEgts metamodel subgraph's PRESSoo Issuing Rules for various complex document structures allow #MachineLearning researchers to "slice and dice" a given structure as many ways as they might imagine and want to test. So while any number of model-training datasets can be generated for a given max_pixel resolution of page/label images, the set-theoretic complement -- that is, the set of document pages that do NOT have that structure on them -- is the same for every dataset that might be sliced and diced based on features of that structure! :-) This operation therefore only needs to be run once for any given document structure at any specific max_pixel resolution for the test images.
Again, thank you, Christian, for your insightful reply.
Happy-Healthy Vibes from Colorado USA, -: Jim :-
-----Original Message-----
From: Christian Grün christian.gruen@gmail.com
Sent: Friday, January 3, 2020 4:44 AM
To: Jim Salmons jim.salmons@factminers.org
Cc: BaseX basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Happy New Year! -- A #DigitalHumanities/#MachineLearning #CitizenScientist needs your help... :-)
Dear Jim,
Thanks for your mail; I am glad to hear you are fine, and it’s good to know that BaseX is still your favorite framework for processing Softalk!
I looked at your query, and I managed to get it a lil’ faster. The rewritten versions are attached:
• MAGgts1: I simplified some expressions in order to find obvious weak spots… But there were none, so it didn’t lead to a considerable improvement: The resulting query has the same runtime as your original query.
• MAGgts2: I inlined $ad_index (I replaced the variable reference by its path expression). This way, the resulting path with the predicate is rewritten for database index access. You can see this by checking the output of the Info panel: It contains the information "apply text index for $leaf_pgnum_3". (A minimal sketch of this rewrite follows this list.) – I expected a major improvement from this rewriting, but the query still runs about 15 seconds. The reason is that the index access is not very "selective": most values for PageNum are identical, which means that each call of the index function db:text returns lots of hits that then need to be matched against the correct Issue_id value (see the optimized query in the info output for more hints). – Next, I swapped the Issue_id and PageNum comparisons in the predicate. Due to that, an index access is created for $issue_id. As the id values are not highly selective either, this made the runtime even worse.
• MAGgts3: As database index requests only work for single text values, and as we have two values to match, I generated a custom string sequence at the beginning of your query, which contains all PageNum/Issue_id combinations (separated by semicolons). Later on, in the main loop, each $leaf_spec element is matched against this sequence. (A sketch of this approach also follows this list.) This query does not take advantage of the database indexes. However, it will be evaluated faster than the original query, as a hash join will be generated at runtime for the comparison with the newly introduced sequence: instead of up to 9,542 × 7,159 ≈ 68 million single comparisons performed by the original query (or, to be strict, 136 million, due to the 2 comparisons per iteration), we'll only have 9,542 comparisons in the optimized query!
• MAGgts4: I moved the expression that joins the PageNum/Issue_id elements into a function. The rest of the query is pretty much the same, as is the runtime.
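Since the rewritten queries were sent as attachments, here is a minimal sketch of what the MAGgts2 inlining might look like, reconstructed from the description above; the element names and paths are assumptions taken from the original query at the bottom of this thread, and the actual attachment may differ:
======
(: MAGgts2 idea (sketch): inline the AdIndex path into the predicate so
   BaseX can rewrite the PageNum comparison into a text-index lookup
   (db:text) instead of scanning the AdIndex entries for every leaf. :)
for $leaf_spec in doc('MAGgts')//Metadata//Leaf2ppg_map[1]/*/*
where empty(doc('MAGgts')//AdIndex/*[PageNum = $leaf_spec/PageNum
                                     and Issue_id = $leaf_spec/Issue_id])
return $leaf_spec
======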
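And a sketch of the MAGgts3 hash-join approach under the same assumptions: all PageNum/Issue_id combinations in the AdIndex are precomputed as one string sequence, and each leaf is then matched against it with a single general comparison, which BaseX evaluates with a runtime hash join:
======
(: MAGgts3 idea (sketch): precompute one "PageNum;Issue_id" key per
   AdIndex entry. The general comparison "= $ad_keys" below is then
   evaluated with a hash join, i.e. one lookup per leaf instead of
   comparing every leaf against every AdIndex entry. :)
declare variable $ad_keys :=
  for $ad in doc('MAGgts')//AdIndex/*
  return string-join(($ad/PageNum, $ad/Issue_id), ';');

for $leaf_spec in doc('MAGgts')//Metadata//Leaf2ppg_map[1]/*/*
where not(string-join(($leaf_spec/PageNum, $leaf_spec/Issue_id), ';') = $ad_keys)
return $leaf_spec
======
MAGgts4, as described, presumably just factors the key-building expression into a local function, which leaves the runtime unchanged.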
Hope this helps; feel free to ask for more details regarding my technical remarks,
Christian
On Fri, Jan 3, 2020 at 2:22 AM Jim Salmons jim.salmons@factminers.org wrote:
Hello Christian & the BaseX Community,
Happy New Year and Decade!
I will get to my specific BaseX/XQuery question below, but first some long but mostly necessary context...
I am a 69-year-old post-cancer #PayItForward indie #CitizenScientist working at the intersection of #DigitalHumanities and #AI/#MachineLearning. I am developing the #MAGAZINEgts ground-truth storage format for serial publications (primarily print magazines and newspapers). My format is based on an ontological stack of CIDOC-CRM, FRBRoo, and PRESSoo, and utilizes a metamodel subgraph design pattern to support integrated models of complex document structure and content depiction.
In addition, my fellow cancer-surviving wife and I are developing a reference implementation of the #MAGAZINEgts format for the 48-issue run of Softalk magazine published between 1980 and 1984 during the dawn of the Microcomputer and Digital Age. Post-cancer and with determination to reinvent ourselves as Citizen Scientists in the Digital Humanities, we funded digitization of our magazine collection into the Internet Archive as a gift to each other for our 25th wedding anniversary.
Initially, I did all our exploratory #MAGAZINEgts prototyping development in Python, using import/export of data via JSON with one-off scripts to format batches of information into XML to fold into the published #MAGAZINEgts file for the Softalk magazine reference implementation at the Archive. Then I found BaseX!!! :D
Upon discovering BaseX, I began using it for direct #MAGAZINEgts XML access and development. The FactMiners Toolkit is, at this stage, the old pre-BaseX stuff plus the new features related to the generation of balanced training datasets for #MachineLearning model training. I am in the process now of revisiting my early development and refactoring it to incorporate BaseX direct editing.
I recently completed the generation of a "seed"/master dataset for all the advertisements in Softalk magazine (https://github.com/SoftalkAppleProject/datasets_ml_all_ads_1M). This seed dataset, used in conjunction with the PRESSoo Issuing Rules of the Advertising Model in the Metamodel partition of the #MAGAZINEgts file, can generate any "slice and dice" dataset for #ML model training that is aware of the interrelationships of advertisements' size, shape, and allowable positions on variously formatted multi-column pages. (If this domain of research interests anyone, please do not hesitate to contact me. I can point you to our #DATeCH papers/posters and Neo4j #GraphGists, etc. for more information. You will also find relevant links to our research on the GitHub repo cited above.)
Now on to my specific question/request-for-information...
To generate #MachineLearning training datasets, I need not only to slice-and-dice the "actual" data of the advertisement dataset, but also to incorporate page examples which DO NOT include the document structure being trained for recognition. In other words, the "non-label" instances of a #ML training dataset are the members of the set-theoretic complement of the pages that DO have advertisements. Fortunately, and by design, the #MAGAZINEgts format incorporates all the information needed to do this dataset generation... the problem is that I have only ever known enough XML-related technologies to "git'er done." And even whatever I may have known is long since gone following my cancer battle.
Now, given the OCD-ness of my basic personality structure, I thought it would be best to develop a solution to present before asking for community help... and here I am.
The digital collection of Softalk magazine is in the Internet Archive here:
https://archive.org/details/softalkapple?sort=date
The #MAGAZINEgts file for Softalk magazine is here:
https://archive.org/download/softalkapple/softalkapple_publication.xml
For simplicity's sake during my exploratory/learning experience, I import the softalkapple_publication.xml file into BaseX with the "strip namespaces" option checked and rename the imported database as 'MAGgts'.
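For reproducibility, the same import can likely be scripted in XQuery. Here is a sketch assuming BaseX's db:create function and its "stripns" parsing option, with the file path as a placeholder; the exact option name is worth double-checking in the BaseX documentation:
======
(: Sketch: create the MAGgts database from the publication file with
   namespaces stripped at parse time. Path and option name assumed. :)
db:create('MAGgts', 'softalkapple_publication.xml',
          'softalkapple_publication.xml', map { 'stripns': true() })
======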
My problem is not that I cannot generate the non-label dataset, but rather that this query takes roughly 30 seconds to execute!? Honestly, the fact that this query works is all I need. But given the world of modern near-instantaneous responses to virtually all data inquiries, I am wondering if there is a far better solution to my problem.
(Also note the somewhat strange format of the returned results -- 'NoAdPg' elements wrapped within a 'Result' element -- which is for the convenience of bringing this data into Python to be parsed by the 'xmltodict' module.)
If it turns out that this 30+-second run is, in fact, justified by the actual computations being done behind the scenes, no problem; an explanation of why that much processing is needed would be one welcome response to this inquiry. On the other hand, there is a vast domain of XQuery know-how that I am blind to, and I would value any explanation and optimization that can be suggested to speed this operation up.
Here is my current 30+-second query:
======
(: Round up a list by issue of the leaf images of Softalk magazine that
DO NOT have an advertisement on them and that excludes Cover 1, inserts
(e.g. 'blow-in' cards), and missing pages... :)
(: The softalkapple_publication.xml file from the Internet Archive's Softalk
collection is imported as 'MAGgts' into BaseX XML database, with stripped
namespaces for interactive development/learning convenience. :)
(: This NS declaration is not needed when namespaces are stripped.
declare default element namespace "http://www.factminers.org/MAGAZINE/gts/2019-01-18"; :)
declare variable $ad_index := doc('MAGgts')//AdIndex/*;
(: Gather all issues in the Leaf2ppg map (page image to print page number map) :)
declare variable $all_issues := doc('MAGgts')//Metadata//Leaf2ppg_map[1]/*;
if (count($all_issues) > 0)
then
  <Result>{
    for $issue in $all_issues
    return
      for $leaf_spec in $issue/*
      let $leaf_id := $leaf_spec/Issue_id
      let $leaf_pgnum := $leaf_spec/PageNum
      let $leaf_type := $leaf_spec/PgType
      where ($leaf_pgnum != 'Cover1'
             and $leaf_pgnum != 'Insert Stub'
             and $leaf_pgnum != 'Insert Content'
             and $leaf_type != 'MissingPgs')
      return
        (: We're not interested in pages w/ ads, only those WITHOUT ads :)
        if (exists($ad_index[PageNum = $leaf_pgnum and Issue_id = $leaf_id]))
        then ()
        else
          <NoAdPg>
            <LeafNum>{data($leaf_spec/@leafnum)}</LeafNum>
            {$leaf_spec/(Issue_id | PgType | PageNum)}
          </NoAdPg>
  }</Result>
else
  "Nothing found."
======
Again, Happy New Year/Decade. Any explanation or optimization greatly appreciated.
Happy-Healthy Vibes from Colorado,
-: Jim :-