Nooby/#CitizenScientist REQUEST for HELP: Python implementation of this XQuery - BaseX-Talk - mailman.uni-konstanz.de

10 Apr 2019


To the BaseX Community,
I am a 68-year-old #PayItForward cancer-survivor independent
#CitizenScientist doing applied research in #DigitalHumanities and
#MachineLearning, and I need your help, please. Please forgive me for
such a TL;DR post.
I am neither lazy nor a dilettante; I am simply under tight time pressure to get
some development done on my Python-based metadata discovery and curation
toolkit which my fellow cancer-surviving wife, Timlynn, and I will be
showcasing via our poster presentation accepted for next month's #DATeCH2019
conference in Brussels (http://datech.digitisation.eu/).
Our poster is entitled "#MAGAZINEgts and #dhSegment: Using a Metamodel
Subgraph to Generate Synthetic Data of Under-Sampled Complex Document
Structures for Machine-Learning" (ResearchGate preprint:
https://is.gd/factminers_datech2019_poster). #MAGAZINEgts is the XML-based
ground-truth storage format Timlynn and I are developing based on an
ontological "stack" of #cidocCRM/FRBRoo/PRESSoo using a metamodel subgraph
design pattern. The goal of our design is to support an integrated complex
document structure and content depiction model for digital collections of
print magazines and newspapers. (For more, see our #DATeCH2017 poster:
https://is.gd/factminers_datech2017_poster)
We are evolving a reference implementation of the #MAGAZINEgts format for
the collection of Softalk magazine at the Internet Archive. The collection
is here: https://archive.org/details/softalkapple?&sort=date, and the
MAGAZINEgts file (~10+ MB) is linked from the About page of the collection
but is provided here as a shortened link:
https://is.gd/softalk_magazinegts_xml_file.
MY IMMEDIATE GOAL: Rather than keep with the awkward workflow of generating
intermediary JSON metadata files and, in batches, converting to XML and
copy-pasting into appropriate positions in the master publication file, I
want to incorporate direct incremental updating of fine-grained #MAGAZINEgts
metamodels, metadata, and their associated source-document-specific datasets
via integrating BaseX into the FactMiners Toolkit (fmtk). We would _really_
like to be showing this significant enhancement of our toolkit at the DATeCH
conference (8-10 May).
MY CURRENT CONTEXT: I have the latest BaseX installed, working well, and I
have done as much "fast track" learning as I can to come up to "toddler"
speed on BaseX and its Python-based API. I have the Python API extension
installed and working within my PyCharm IDE, and I am on Windows 10.
MY CURRENT NEED: I have used the BaseX GUI to develop a sample XQuery for
updating/adding a machine-learning data spec for curating the bounding boxes
of advertisements on a page within a magazine. The query below is not
parameterized for programmatic, dynamic execution; it is simply a hard-coded
test of my evolving understanding of BaseX interactions. So the
dataset name ("all_ads"), the issue-page filename (softalk_v2n02pg002.png),
and the various dimension values are hard-coded literals rather than
variables in this sample query. When I run this query and do an export of
the MAGgts master file, the update is there and looks great. Even though my
knowledge and skills with BaseX are small but growing, I feel I have enough
grip on things to forge ahead to at least get BaseX integrated for the #ML
image training dataset feature that we will showcase at #DATeCH2019. (BTW, I
stripped the #MAGgts schema's XML namespace during BaseX database creation
to make things easier during learning. I expect to simply restore it in the
header after exporting and before uploading a new release of Softalk's
reference implementation of this ground-truth format. Either that, or I will
tweak the eventual Python implementation of the queries to include the
namespace and just leave it intact when importing into the local BaseX
database.)
HERE IS THE SAMPLE QUERY:
===
declare option db:WRITEBACK 'true';
declare variable $new_spec :=
  <ML_training_img_spec file_name="softalk_v2n02pg002.png">
    <ML_image_dim width="940" height="1280"/>
    <ML_label_bbox label="ad" status="predicted" left="500" top="680" width="444" height="580"/>
    <ML_label_bbox label="ad" status="actual" left="490" top="620" width="440" height="575"/>
  </ML_training_img_spec>;
update:output("Update successful."),
insert node $new_spec as last into
  doc("MAGgts")
    //Metadata//ML_maxpixel_datasets[@max_pixels = "1000000"]
    /ML_dataset[@name="all_ads"]//ML_training_img_specs
===
MY REQUEST FOR ASSISTANCE: It would be _extraordinarily_ helpful, and
Timlynn and I would be most grateful, if someone in the BaseX community
who is familiar with Python-based BaseX integration could provide
a brief implementation -- similar to the examples supplied in the Client
integration samples -- that would show me how to take a BaseX GUI-developed
query and convert it to a usable state in a Python program.
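To make the request concrete, here is a rough sketch of the shape such a conversion might take. It is not a tested implementation. It assumes the BaseXClient module from the BaseX client samples is importable as "from BaseXClient import BaseXClient", that a BaseX server is running on localhost:1984 with the default admin/admin credentials, and that MAGgts is a BaseX database (so the db:WRITEBACK option is left out). The helper names build_spec, insert_spec, and run_example are hypothetical. The hard-coded literals become external variables bound from Python, and the new spec is passed as a string and parsed inside the query with parse-xml().

```python
import xml.etree.ElementTree as ET

# The sample query with its literals turned into external variables.
# Assumption: "MAGgts" is a BaseX database, so no WRITEBACK option is declared.
UPDATE_QUERY = """
declare variable $dataset external;
declare variable $max_pixels external;
declare variable $spec external;
update:output("Update successful."),
insert node parse-xml($spec)/* as last into
  doc("MAGgts")
    //Metadata//ML_maxpixel_datasets[@max_pixels = $max_pixels]
    /ML_dataset[@name = $dataset]//ML_training_img_specs
"""


def build_spec(file_name, image_dim, bboxes):
    """Serialize one ML_training_img_spec element to an XML string.

    image_dim is a (width, height) pair; bboxes is a list of dicts whose
    keys become the attributes of each ML_label_bbox element.
    """
    spec = ET.Element("ML_training_img_spec", {"file_name": file_name})
    ET.SubElement(spec, "ML_image_dim",
                  {"width": str(image_dim[0]), "height": str(image_dim[1])})
    for box in bboxes:
        ET.SubElement(spec, "ML_label_bbox",
                      {key: str(value) for key, value in box.items()})
    return ET.tostring(spec, encoding="unicode")


def insert_spec(session, dataset, max_pixels, spec_xml):
    """Bind the three external variables and run the updating query."""
    query = session.query(UPDATE_QUERY)
    try:
        query.bind("$dataset", dataset)
        query.bind("$max_pixels", str(max_pixels))
        query.bind("$spec", spec_xml)
        return query.execute()
    finally:
        query.close()


def run_example():
    """Hypothetical driver; needs a running BaseX server and BaseXClient."""
    from BaseXClient import BaseXClient  # assumed import path
    session = BaseXClient.Session("localhost", 1984, "admin", "admin")
    try:
        spec_xml = build_spec(
            "softalk_v2n02pg002.png", (940, 1280),
            [{"label": "ad", "status": "predicted",
              "left": 500, "top": 680, "width": 444, "height": 580},
             {"label": "ad", "status": "actual",
              "left": 490, "top": 620, "width": 440, "height": 575}])
        print(insert_spec(session, "all_ads", 1000000, spec_xml))
    finally:
        session.close()
```

Calling run_example() would attempt the same insert as the sample query. Building the element with ElementTree rather than string formatting means attribute values are escaped automatically, and binding values instead of splicing them into the query text keeps quoting problems out of the XQuery.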
BY WAY OF THANK YOU: If anyone can help us short-circuit my path to
integrating BaseX into our #DATeCH2019 poster presentation, we will gladly
cite you and your assistance in the acknowledgements of the poster and its
2-page companion handout.
Again, I am sorry for hitting folks with such a long and detailed request
for nooby assistance.
In advance, thank you for any help you Good Folks may provide. I look
forward to significantly improving the functionality of the FactMiners
Toolkit by incorporating BaseX into its core platform.
Happy-Healthy Vibes from Colorado USA,
-: Jim Salmons :-
https://www.researchgate.net/profile/Jim_Salmons