Hi,
I am new to BaseX and plan to use it to analyze XML datasets from R in the near future. I am using the BaseX GUI under Windows 7 and got an error while trying to create a database from an XML file. The file comes from the United States Patent and Trademark Office (USPTO), and the larger XML datasets they provide have the same problem.
The error text is:
Command: CREATE DB ipgb20110104Sample C:/Users/admin/Downloads/ipgb20110104Sample.xml
Error: "C:/Users/admin/Downloads/ipgb20110104Sample.xml" (Line 306): The processing instruction target matching "[xX][mM][lL]" is not allowed.
The XML document I am trying to open is contained in this zip file: http://www.uspto.gov/products/ipgb110104-sample.zip
The link to the document is on this page: http://www.uspto.gov/products/xml-resources.jsp, in the "Patent Grant Data / XML ST. 36 (ICE) v4.2 (a.k.a. Red Book) (2007 - 2012)" section, under the "Sample Documents (Bibliographic)" bullet point.
From searching online, I found that the error is due to poor formatting in the file. However, larger datasets of the same kind (USPTO bulk download @ Google) have the same problem. The file I mention has a carriage return on the first line, followed by several concatenated XML documents, which is also the case for the larger XML files from the USPTO.
My question is:
Is there a workaround for this error? Can I somehow tell BaseX to ignore or otherwise tolerate that mistake and load the file(s)?
Thank you so much,
Jose
Hi Jose,
your .xml file actually contains several XML files, which must first be split in order to be parsed. I've attached one solution in XQuery (there may be other, more elegant solutions):
(: open file :)
let $input := unparsed-text('ipgb20110104Sample.xml')
(: get document substrings; omit those without angle brackets.
   The question marks are escaped because ? is a regex quantifier. :)
let $docs := tokenize($input, '<\?xml version="1.0" encoding="UTF-8"\?>\s*<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" \[ \]>\s*')[matches(., '<')]
(: generate document names :)
let $names := for $n in 1 to count($docs) return $n || '.xml'
(: create database with all documents :)
return db:create('db', $docs, $names)
Hope this helps, Christian
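A note on the splitting technique, separate from Christian's message: the same idea can be tried on a small inline string before running it on a real file. The sketch below is only an illustration under stated assumptions. The input string and the element names a and b are made up, and the pattern is a looser variant that splits at any XML declaration rather than at the exact USPTO declaration plus DOCTYPE. It uses the standard functions tokenize, matches and parse-xml, so it needs no file and no database.

```xquery
(: Made-up input: two concatenated documents with a leading line break,
   mimicking the layout described in the question. :)
let $input := '
<?xml version="1.0" encoding="UTF-8"?>
<a/>
<?xml version="1.0" encoding="UTF-8"?>
<b/>'
(: split at every XML declaration; drop chunks that contain no markup :)
let $docs := tokenize($input, '<\?xml[^>]*\?>\s*')[matches(., '<')]
(: parse each remaining chunk and return the name of its root element :)
for $doc in $docs
return name(parse-xml($doc)/*)
```

This should return the two names a and b, one per document. With the USPTO files, each chunk produced by the looser pattern would still start with its DOCTYPE line, so whether the chunks parse depends on how the processor is configured to handle the DTD reference. That has not been checked against the real data.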
On Sun, Apr 27, 2014 at 6:52 PM, Jose Rey onlyrey@gmail.com wrote:
[...]
Christian, beautiful. It worked pretty well.
On Sun, Apr 27, 2014 at 1:19 PM, Christian Grün christian.gruen@gmail.com wrote:
[...]
basex-talk@mailman.uni-konstanz.de