USPTO XML format causes BaseX GUI errors - BaseX-Talk - mailman.uni-konstanz.de

27 Apr 2014


Hi,
I am new to BaseX and plan to use it to analyze XML datasets from R in the
near future. I am using the BaseX GUI on Windows 7 and got an error while
trying to create a database from an XML file through the GUI. The file comes
from the United States Patent and Trademark Office (USPTO), and the larger
XML datasets they provide have the same problem.
The error text is:
Command:
CREATE DB ipgb20110104Sample C:/Users/admin/Downloads/ipgb20110104Sample.xml
Error:
"C:/Users/admin/Downloads/ipgb20110104Sample.xml" (Line 306): The processing
instruction target matching "[xX][mM][lL]" is not allowed.
The XML document I am trying to open is contained in this zip file:
http://www.uspto.gov/products/ipgb110104-sample.zip
The link to the document is on this page:
http://www.uspto.gov/products/xml-resources.jsp
under the "Patent Grant Data / XML ST. 36 (ICE) v4.2 (a.k.a. Red Book)
(2007 - 2012)" section of the page, under the "Sample Documents
(Bibliographic)" bullet point.
...
From searching online, I found that this error is raised when an XML
declaration (<?xml ...?>) appears anywhere other than the very start of a
file. The file I mention begins with a carriage return before its first
line, and it contains several XML documents concatenated one after another,
each apparently with its own declaration. The larger XML files from the
USPTO (the USPTO bulk downloads hosted at Google) are built the same way and
have the same problem.
My question is:
Is there a workaround for this error? Can I somehow tell BaseX to ignore or
tolerate the extra declarations and load the file(s) anyway?
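The only fallback I can think of is to split the file outside BaseX before
loading it. Below is a minimal sketch in Python, assuming that every
concatenated document starts with its own <?xml ...?> declaration and that
the file is UTF-8 encoded (both are assumptions; I have not checked them
against the full USPTO files). The function names and the output file naming
are made up for illustration.

```python
import re
from pathlib import Path

# Matches an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>.
# The required whitespace after "xml" keeps it from matching other
# processing instructions like <?xml-stylesheet ...?>.
XML_DECL = re.compile(r"<\?xml\s[^>]*\?>")


def split_concatenated_xml(text):
    """Return one string per document, each starting at its XML declaration.

    Anything before the first declaration (for example a leading blank
    line) is dropped. If no declaration is found, the whole text is
    returned as a single document.
    """
    starts = [m.start() for m in XML_DECL.finditer(text)]
    if not starts:
        stripped = text.strip()
        return [stripped] if stripped else []
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def write_parts(src, out_dir, encoding="utf-8"):
    """Write each document in `src` to its own numbered file in `out_dir`."""
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    text = src.read_text(encoding=encoding)
    paths = []
    for i, doc in enumerate(split_concatenated_xml(text), start=1):
        path = out_dir / f"{src.stem}_{i:05d}.xml"
        path.write_text(doc + "\n", encoding=encoding)
        paths.append(path)
    return paths
```

Calling write_parts("ipgb20110104Sample.xml", "parts") would then give a
directory of single-document files, and I assume that directory could be used
as the input for CREATE DB instead of the original file. I would still prefer
a way to do this inside BaseX if one exists.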
Thank you so much,
Jose
-- 
Jose I. Rey
onlyrey@gmail.com