Hello Christian & the BaseX Community,
Happy New Year and Decade!
I will get to my specific BaseX/XQuery question below, but first some long but mostly necessary context...
I am a 69-year-old post-cancer #PayItForward indie #CitizenScientist working at the intersection of #DigitalHumanities and #AI/#MachineLearning. I am developing the #MAGAZINEgts ground-truth storage format for serial publications (primarily print magazines and newspapers). My format is based on an ontological stack of CIDOC-CRM, FRBRoo, and PRESSoo, and utilizes a metamodel subgraph design pattern to support integrated complex document structure and content depiction models.
In addition, my fellow cancer-surviving wife and I are developing a reference implementation of the #MAGAZINEgts format for the 48-issue run of Softalk magazine published between 1980 and 1984 during the dawn of the Microcomputer and Digital Age. Post-cancer and with determination to reinvent ourselves as Citizen Scientists in the Digital Humanities, we funded digitization of our magazine collection into the Internet Archive as a gift to each other for our 25th wedding anniversary.
Initially, I did all our exploratory #MAGAZINEgts prototyping in Python, importing and exporting data via JSON, with one-off scripts to format batches of information as XML to fold into the published #MAGAZINEgts file for the Softalk magazine reference implementation at the Archive. Then I found BaseX!!! :D
Upon discovering BaseX, I began using it for direct #MAGAZINEgts XML access and development. The FactMiners Toolkit is, at this stage, the old pre-BaseX code plus new features for generating balanced training datasets for #MachineLearning model training. I am now revisiting my early development and refactoring it to incorporate direct editing in BaseX.
I recently completed the generation of a "seed"/master dataset for all the advertisements in Softalk magazine (https://github.com/SoftalkAppleProject/datasets_ml_all_ads_1M). Used in conjunction with the PRESSoo Issuing Rules of the Advertising Model in the Metamodel partition of the #MAGAZINEgts file, this seed dataset can generate any "slice and dice" dataset for #ML model training that is aware of the interrelationships of advertisements' size, shape, and allowable positions on variously formatted multi-column pages. (If this domain of research interests anyone, please do not hesitate to contact me. I can point you to our #DATeCH papers/posters and Neo4j #GraphGists, etc. for more information. You will also find relevant links to our research on the GitHub repo cited above.)
Now on to my specific question/request-for-information...
To generate #MachineLearning training datasets, I need not only to slice and dice the "actual" data of the advertisement dataset, but also to incorporate page examples that DO NOT include the document structure the model is being trained to recognize. In other words, the "non-label" instances of a #ML training dataset are the members of the set-theoretic complement of the pages that DO have advertisements. Fortunately, and by design, the #MAGAZINEgts format incorporates all the information needed for this dataset generation... the problem is that I have only ever known enough XML-related technologies to "git'er done." And even whatever I may have known is long since gone following my cancer battle.
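To make that set-complement idea concrete, here is a toy sketch with made-up page ids (not the real #MAGAZINEgts schema):
======
(: Toy illustration: the "non-label" pages are all pages minus the
   pages that carry an advertisement. :)
let $all_pages := ('p1', 'p2', 'p3', 'p4', 'p5')
let $ad_pages  := ('p2', 'p4')
return $all_pages[not(. = $ad_pages)]
(: result: 'p1', 'p3', 'p5' :)
======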
Now, given the OCD-ness of my basic personality structure, I thought it would be best to develop a solution to present before asking for community help... and here I am.
The digital collection of Softalk magazine is in the Internet Archive here:
https://archive.org/details/softalkapple?sort=date
The #MAGAZINEgts file for Softalk magazine is here:
https://archive.org/download/softalkapple/softalkapple_publication.xml
For simplicity's sake during my exploratory/learning experience, I import the softalkapple_publication.xml file into BaseX with the "strip namespaces" option checked and rename the imported database to 'MAGgts'.
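For the record, I believe the same import could be scripted rather than done through the GUI dialog; here is a sketch using BaseX's db module (untested -- I am assuming the lowercase 'stripns' parsing option corresponds to the GUI's strip-namespaces checkbox):
======
(: Create the MAGgts database directly from the downloaded file,
   stripping namespaces while parsing (assumed option name). :)
db:create(
  'MAGgts',
  'softalkapple_publication.xml',  (: input file :)
  'softalkapple_publication.xml',  (: path inside the database :)
  map { 'stripns': true() }
)
======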
My problem is not that I cannot generate the non-label dataset, but rather that this query takes roughly 30 seconds to execute! Honestly, the fact that this query works is all I need. But in a world of near-instantaneous responses to virtually all data inquiries, I am wondering if there is a far better solution to my problem.
(Also note: the somewhat strange format of the returned results -- 'NoAdPg' elements wrapped within a 'Result' element -- is for the convenience of bringing this data into Python to be parsed by the 'xmltodict' module.)
If it turns out that this 30+ second run is, in fact, justified by the actual computation being done, no problem; an explanation of why that much processing is needed would be one welcome response to this inquiry. On the other hand, there is a vast domain of XQuery know-how to which I am blind, and I would value any explanation or optimization that could speed this operation up.
Here is my current 30+ second query:
======
(: Round up, by issue, the leaf images of Softalk magazine that
   DO NOT have an advertisement on them, excluding Cover 1, inserts
   (e.g. 'blow-in' cards), and missing pages... :)
(: The softalkapple_publication.xml file from the Internet Archive's
   Softalk collection is imported into BaseX as the 'MAGgts' database,
   with namespaces stripped for interactive development/learning
   convenience. :)
(: This NS declaration is not needed when namespaces are stripped:
   declare default element namespace
     "http://www.factminers.org/MAGAZINE/gts/2019-01-18"; :)
declare variable $ad_index := doc('MAGgts')//AdIndex/*;

(: Gather all issues in the Leaf2ppg map (page image to print page number map). :)
declare variable $all_issues := doc('MAGgts')//Metadata//Leaf2ppg_map[1]/*;

if (count($all_issues) > 0) then
  <Result>{
    for $issue in $all_issues
    for $leaf_spec in $issue/*
    let $leaf_id    := $leaf_spec/Issue_id
    let $leaf_pgnum := $leaf_spec/PageNum
    let $leaf_type  := $leaf_spec/PgType
    where $leaf_pgnum != 'Cover1'
      and $leaf_pgnum != 'Insert Stub'
      and $leaf_pgnum != 'Insert Content'
      and $leaf_type  != 'MissingPgs'
    (: We're not interested in pages w/ ads, only those WITHOUT ads. :)
    where not(exists($ad_index[PageNum = $leaf_pgnum and Issue_id = $leaf_id]))
    return
      <NoAdPg>
        <LeafNum>{ data($leaf_spec/@leafnum) }</LeafNum>
        { $leaf_spec/(Issue_id | PgType | PageNum) }
      </NoAdPg>
  }</Result>
else
  "Nothing found."
======
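One idea I ran across while reading about XQuery 3.1 maps, though I have not yet verified it against my data: the exists($ad_index[...]) check re-scans the entire ad index once per leaf, so with thousands of leaves and thousands of ads that nested scan may well account for most of the 30 seconds. Pre-building a lookup map keyed on issue and page should turn each check into a constant-time probe. A sketch (untested; the 'Issue_id|PageNum' key format is just an illustrative choice, and the count()/else wrapper is omitted for brevity):
======
(: Build one "Issue_id|PageNum" key per advertisement occurrence.
   Duplicate keys are harmless here because every value is true(). :)
declare variable $ad_keys := map:merge(
  for $ad in doc('MAGgts')//AdIndex/*
  return map:entry($ad/Issue_id || '|' || $ad/PageNum, true())
);
declare variable $all_issues := doc('MAGgts')//Metadata//Leaf2ppg_map[1]/*;

<Result>{
  for $issue in $all_issues
  for $leaf_spec in $issue/*
  let $leaf_pgnum := $leaf_spec/PageNum
  where $leaf_pgnum != 'Cover1'
    and $leaf_pgnum != 'Insert Stub'
    and $leaf_pgnum != 'Insert Content'
    and $leaf_spec/PgType != 'MissingPgs'
    (: a single hash probe replaces the scan over $ad_index :)
    and not(map:contains($ad_keys, $leaf_spec/Issue_id || '|' || $leaf_pgnum))
  return
    <NoAdPg>
      <LeafNum>{ data($leaf_spec/@leafnum) }</LeafNum>
      { $leaf_spec/(Issue_id | PgType | PageNum) }
    </NoAdPg>
}</Result>
======
If BaseX's optimizer already rewrites the original predicate into an index lookup, this may buy nothing -- which would itself be useful to know.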
Again, Happy New Year/Decade. Any explanation or optimization greatly appreciated.
Happy-Healthy Vibes from Colorado,
-: Jim :-