Re: [basex-talk] BUG: Can't parse JSON from GZIP archive


Rick Graham

25 Nov 2018

11:01 a.m.

Would've filed an issue, but the request is to post here first. (?)

Using version 9.1 BaseX app, a GZIP archive https://nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-recent.json.gz of a JSON database can't be used to properly create a database. Interestingly, a ZIP archive https://nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-recent.json.zip works fine.

There is no error message; an empty database is silently created.

IDK if the GZIP problem is more widespread.


Christian Grün

25 Nov

2:53 p.m.

New subject: BUG: Can't parse JSON from GZIP archive

Hi Rick,

> Would've filed an issue, but the request is to post here first. (?)

Thanks. Many GitHub issues in the past were not bugs but misunderstandings, so we ask users to write to the list first.

> Using version 9.1 BaseX app, a GZIP archive https://nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-recent.json.gz of a JSON database can't be used to properly create a database. Interestingly, a ZIP archive https://nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-recent.json.zip works fine.

Do you really want to create a BaseX database from a "JSON database"? If so, what format does this database have?

Or does your archive contain a set of (tarred) JSON files that you would like to import into BaseX as XML? Did you try renaming your file suffix to .tgz?

Best, Christian

Rick Graham

7:44 p.m.

Hi Christian,

Thanks for the reply.

I just wanted to use https://nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-recent.json.gz with basexgui. basexgui doesn't seem to process the archive correctly.

The archive https://nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-recent.json.zip seems to be processed fine by basexgui.

This seems to be a basex/basexgui bug or at least a limitation, yes?

Regards, RG

P.S.: Regarding GitHub issues... I know how to search those. How do I search past mailman threads?


Christian Grün

26 Nov

4:52 a.m.

Hi Rick,

> I just wanted to use https://nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-recent.json.gz with basexgui. basexgui doesn't seem to process the archive correctly.

I got it. So you were choosing JSON as input format, and the archive input was not chosen for import.

The challenge seems to be that the filename is not stored inside this particular .gz archive, so the ".json" substring in the name of the original file is the only hint that the compressed file is a JSON file. This is different for ZIP archives, in which filenames must be stored inside the archive (in .gz archives this is optional).

By default, we thus assume that the input of .gz archives is XML. I’ll see if/how we can find a solution for this, and whether the input format choice can be used to correctly interpret the file contents.
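
For illustration, here is a minimal Python sketch (not BaseX code) of the header check described above. Per RFC 1952, a gzip header carries the original file name only when the FNAME flag bit is set, so a reader cannot rely on it. The helper names gzip_original_name and make_gzip are hypothetical and exist only for this example.

```python
import gzip
import io
import struct

FEXTRA = 0x04  # header flag bit: extra field present (RFC 1952)
FNAME = 0x08   # header flag bit: original file name present (RFC 1952)


def gzip_original_name(data: bytes):
    """Return the file name stored in a gzip header, or None if it is absent."""
    magic, method, flags = struct.unpack_from("<HBB", data, 0)
    if magic != 0x8B1F or method != 8:
        raise ValueError("not a gzip/deflate stream")
    pos = 10  # fixed header: ID1 ID2 CM FLG MTIME(4) XFL OS
    if flags & FEXTRA:
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen  # skip the optional extra field
    if not flags & FNAME:
        return None
    end = data.index(b"\x00", pos)  # the name is zero-terminated
    return data[pos:end].decode("latin-1")


def make_gzip(payload: bytes, name: str) -> bytes:
    """Build a gzip stream in memory; an empty name omits the FNAME field."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=name, mode="wb", fileobj=buf) as gz:
        gz.write(payload)
    return buf.getvalue()


print(gzip_original_name(make_gzip(b"{}", "data.json")))
print(gzip_original_name(make_gzip(b"{}", "")))
```

When the function returns None, the only remaining format hints are the archive's own file name or an explicit choice by the user, which is the situation discussed in this thread.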

> P.S.: Regarding GitHub issues... I know how to search those. How do I search past mailman threads?

You can search via the basex-talk mail archive (see the link on our web site [1]). Classical search engines will give you valuable results from StackOverflow and other sites.

Best, Christian

[1] http://basex.org/about/open-source/


Christian Grün

5:49 a.m.

A new stable snapshot is available [1]. In the updated version, all corner cases should be taken into consideration (such as a gzip archive with a missing file suffix in the file name).

Hope this helps, Christian

[1] http://files.basex.org/releases/latest/


Rick Graham

8:51 a.m.

Hi Christian,

Yes, that feature works fine in the latest snapshot. Thank you. I'm wondering if an email to nvd@nist.gov might encourage them to include filenames in all their archives.

And while you're poking around the BaseX archive stuff ... would you want to set the Database Resource Properties INPUTSIZE to something other than "0 b" when the INPUTPATH is an archive?

Thanks again, RG


Christian Grün

9:29 a.m.

> ... would you want to set the Database Resource Properties INPUTSIZE to something other than "0 b" when the INPUTPATH is an archive?

In contrast to ZIP archives, there seems to be no trivial way in Java to retrieve the uncompressed file size from gzipped input streams. We could make some extra effort (e.g. as proposed in [1]). As the input stream processed by BaseX may not be backed by a local file, I am not sure if there is a generic solution for that.

[1] https://stackoverflow.com/questions/7317243/gets-the-uncompressed-size-of-th...


Rick Graham

10:09 a.m.

Hi Christian,

Thanks for your quick replies.

"ISIZE (Input SIZE)" from https://tools.ietf.org/html/rfc1952 looks promising for most GZIP archives containing a single file.

N.B.: The Database Resource Properties INPUTSIZE for ZIP archives also shows "0 b".
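
A minimal Python sketch of the ISIZE idea, assuming the complete archive is available as bytes (or as a seekable file): RFC 1952 places ISIZE in the last four bytes of a gzip member, stored little-endian, and it holds the uncompressed size modulo 2^32. The name gzip_isize is a hypothetical helper, and this is not how BaseX computes INPUTSIZE.

```python
import gzip
import struct


def gzip_isize(data: bytes) -> int:
    """Read the ISIZE trailer: uncompressed size modulo 2**32 (RFC 1952)."""
    # 10 header bytes plus 8 trailer bytes is the smallest possible frame.
    if len(data) < 18 or data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    (isize,) = struct.unpack("<I", data[-4:])
    return isize


payload = b'{"CVE_Items": []}' * 1000
archive = gzip.compress(payload)
print(gzip_isize(archive), len(payload))
```

For an archive with several concatenated members, the trailer describes only the last member, and reading it requires access to the end of the data, which a pure input stream does not offer before it has been consumed.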

Thanks and regards, RG


Christian Grün

27 Nov

5:30 a.m.

Hi Rick,

> "ISIZE (Input SIZE)" from https://tools.ietf.org/html/rfc1952 looks promising for most GZIP archives containing a single file.

Yes, this field should be the one that is discussed in the StackOverflow entry. As the field only stores the size modulo 2^32, the value won’t be correct for files of 4 GiB or more, so a more generic solution might be to count the bytes while parsing them and report the total after parsing.
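
The byte-counting alternative can be sketched in Python as follows. This is only an illustration under stated assumptions, not the BaseX implementation: decompress_counting is a hypothetical helper, and the consume callback stands in for whatever parser receives the decompressed data.

```python
import gzip
import io
import json


def decompress_counting(stream, consume, chunk_size=64 * 1024):
    """Feed decompressed chunks to `consume` and return the total byte count."""
    total = 0
    with gzip.GzipFile(fileobj=stream) as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            consume(chunk)       # hand the data to the parser
            total += len(chunk)  # count what was actually processed
    return total


payload = json.dumps({"CVE_Items": list(range(1000))}).encode("utf-8")
chunks = []
size = decompress_counting(io.BytesIO(gzip.compress(payload)), chunks.append)
doc = json.loads(b"".join(chunks))
print(size, len(payload))
```

Because the count is taken from the processed bytes, it does not depend on the 32-bit ISIZE field. For comparison, a 5 GiB input would carry an ISIZE of only 1 GiB, since (5 * 2**30) % 2**32 equals 2**30.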

> N.B.: The Database Resource Properties INPUTSIZE for ZIP archives also shows "0 b".

I was surprised to read this. Once again, it’s due to the contents of the NIST ZIP archives, which don’t contain file lengths (just try some other ZIP archives to see the difference).
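
One way a ZIP entry can lack a length up front is worth sketching, with the caveat that it is an assumption whether this applies to the NIST archives. In the ZIP format, general purpose flag bit 3 signals that the sizes in the local file header are zero and follow the data in a data descriptor instead. The Python helper below (local_entry_size is a hypothetical name) reads one local header and reports whether the size is deferred.

```python
import io
import struct
import zipfile

LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")  # ZIP local file header, 30 bytes
DEFERRED_SIZES = 0x0008  # general purpose bit 3: sizes follow in a data descriptor


def local_entry_size(data: bytes, offset: int = 0):
    """Return (name, uncompressed size) from a local header; size is None if deferred."""
    (sig, _version, flags, _method, _time, _date,
     _crc, _csize, usize, name_len, _extra_len) = LOCAL_HEADER.unpack_from(data, offset)
    if sig != b"PK\x03\x04":
        raise ValueError("no local file header at this offset")
    start = offset + LOCAL_HEADER.size
    name = data[start:start + name_len].decode("utf-8", "replace")
    return name, (None if flags & DEFERRED_SIZES else usize)


# Inspect the first entry of a small archive built in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.json", b"{}")
print(local_entry_size(buf.getvalue()))
```

A reader that processes the archive as a forward-only stream sees only the local header, so a deferred size is unknown until the entry has been read completely.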

Best, Christian
