Whitespace handling on ingest

Wendell Piez

20 Feb 2013

11:35 a.m.

Hi,

I see the 'CHOP' option, turned on by default, for trimming leading and trailing whitespace and eliminating empty text nodes.

What about going further? Is there a good way to normalize whitespace entirely, collapsing any runs of tab-LF-space into single spaces in my data?

I think I mentioned earlier the idea of specifying an XSLT transformation to filter data on ingest (for a similar requirement, namely removing all comments and PIs). That might be going too far but any hints you can give me (or pointers to docs) about functionality to address this sort of thing in general would be welcome.

Thanks! Wendell

-- Wendell Piez | http://www.wendellpiez.com XML | XSLT | electronic publishing Eat Your Vegetables _____oo_________o_o___ooooo____ooooooo_^

Christian Grün

22 Feb 2013

5:34 a.m.

Hi Wendell,

The CHOP option was introduced at a very early stage of BaseX, and I’m not sure we would add it today. We could add one or more options to normalize whitespace or remove PIs/comments from the input, but the wish list, and the exception list, would probably continue to grow, so I believe it would be more convenient to have a general pre-processing step that takes care of all the normalization steps. I’m not sure, however, what the best approach is to do this within BaseX. If it’s possible to cache files on disk before adding them to the database, I would recommend using XQuery, BaseX command scripts, XProc or anything else to prepare the data, and deleting the cached files afterwards.

Comments are welcome,
Christian
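As an illustration of the general pre-processing step discussed above, here is a minimal sketch in plain XQuery. The function name `local:clean` and the sample input are hypothetical and not part of BaseX. It drops comments and processing instructions and normalizes whitespace only in text nodes whose parent has no element children, so mixed content is left untouched.

```xquery
(: Hypothetical helper, not a BaseX built-in: returns a cleaned copy of a node. :)
declare function local:clean($node as node()) as node()* {
  typeswitch ($node)
    case document-node() return
      document { for $c in $node/node() return local:clean($c) }
    case element() return
      element { node-name($node) } {
        $node/@*,
        for $c in $node/node() return local:clean($c)
      }
    case text() return
      (: leave text in mixed content alone; normalize everything else :)
      if (exists($node/../*)) then $node
      else
        let $t := normalize-space($node)
        return if ($t = '') then () else text { $t }
    case comment() return ()
    case processing-instruction() return ()
    default return $node
};

(: Sample input, made up for this sketch. :)
local:clean(
  <doc><!-- note --><?pi x?><p>  a
     b  </p></doc>
)
```

The cleaned copy could then be stored instead of the raw input, for example via `db:add`; the exact signature depends on the BaseX version, so treat that step as an assumption to check against the documentation. Namespace declarations that are not used in any element or attribute name may not survive the reconstruction.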

Wendell Piez

9:33 a.m.

Christian,

Indeed, I concur that the wish list would grow; a generalized approach is what we need. I'll let you think about that. :-)

In the meantime, as you suggest, if I'm willing to cache the data first, I have many options. Certainly it's possible in my testing framework but as we build out, it'll be another issue.

Alternatively, once I'm in BaseX -- I'm already deleting unwanted nodes including comments and PIs using a command script. Could I similarly do something like this?

replace value of node //text()[empty(../*)] with normalize-space(//text()[empty(../*)])

(I'm pretty new to XQuery update. I suppose I could always just try it. :-)

Thanks as always, Wendell

Christian Grün

10:05 a.m.

> (I'm pretty new to XQuery update. I suppose I could always just try it. :-)

Feel free… ;) This should work:

for $x in //text()[empty(../*)] return replace value of node $x with normalize-space($x)
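To try this kind of update without touching a database, the same loop can be run on an in-memory copy using the transform expression (`copy ... modify ... return`) from XQuery Update. This is only a sketch: the sample document is made up, and the path uses the `text()[empty(../*)]` predicate from Wendell's message.

```xquery
(: Sample fragment, invented for this sketch. :)
copy $doc :=
  <root>
    <title>  Whitespace
        handling  </title>
    <p>mixed <b> content </b> stays</p>
  </root>
modify (
  (: only text nodes whose parent has no element children are touched :)
  for $x in $doc//text()[empty(../*)]
  return replace value of node $x with normalize-space($x)
)
return $doc
```

Here the text of `title` and `b` should be normalized, while the text directly inside `p` is skipped because `p` has an element child. Note that the spaces inside `b` are trimmed as well, which may or may not be wanted for inline markup.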

basex-talk@mailman.uni-konstanz.de
