Hello,
I’m working with a database that has a full-text index. I have found that if I iteratively add XML documents, then optimize, add more documents, optimize again, and so on, eventually the “optimize” command will fail with “Out of Main Memory.” I edited the basex startup script to change the memory allocation from -Xmx2g to -Xmx12g. My computer has 16 GB of memory, but of course the OS uses up some of it. I have found that if I exit memory-hungry programs (web browser, Oxygen), start basex, and then run the “optimize” command, I still get “Out of Main Memory.” I’m wondering if there are any known workarounds or strategies for this situation. If I understand the documentation about indexes correctly, index data is periodically written to disk during optimization. Does this mean that running optimize again will pick up where the previous attempt left off, such that running optimize repeatedly will eventually succeed?
Thanks, Greg
Gregory Murray
Director of Digital Initiatives
Wright Library
Princeton Theological Seminary
Hi Greg,
A quick reply: If only parts of your documents are relevant for full-text queries, you can restrict the selection with the FTINDEX option (see [1] for more information).
What is the total size of your input documents?
Best, Christian
[1] https://docs.basex.org/wiki/Indexes#Selective_Indexing
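For illustration, a minimal sketch of how such a restriction might look when rebuilding the index from XQuery. The element name 'page' comes from the thread's later description of the data, while the database name 'books' and the use of the ftindex/ftinclude options with db:optimize are assumptions based on [1], not on this thread:

(: sketch: 'books' is a placeholder database name; ftinclude restricts the
   full-text index to selected elements, as described in [1] :)
db:optimize('books', true(), map {
  'ftindex': true(),
  'ftinclude': 'page'
})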
Thanks, Christian. I don’t think selective indexing is applicable in my use case, because I need to perform full-text searches on the entirety of each document. Each XML document represents a physical book that was digitized, and the structure of each document is essentially a header with metadata and a body with the OCR text of the book. The OCR text is split into pages, where one <page> element contains all the words from one corresponding printed page from the physical book. Obviously the number of words in each <page> varies widely based on the physical dimensions of the book and the typeface.
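(For concreteness, a hypothetical skeleton of one such document; the header/body/page structure follows the description above, while all other names and attributes are purely illustrative:)

<book>
  <header>
    <title>...</title>
    <identifier>...</identifier>
  </header>
  <body>
    <page n="1">OCR text of the first printed page ...</page>
    <page n="2">OCR text of the second printed page ...</page>
  </body>
</book>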
So far, I have loaded 12,331 documents, containing a total of 2,196,771 pages. The total size of those XML documents on disk is 4.7GB. But that is only a fraction of the total number of documents I want to load into BaseX. The total number is more like 160,000 documents. Assuming that the documents I’ve loaded so far are a representative sample, and I believe that’s true, then the total size of the XML documents on disk, prior to loading them into BaseX, would be about 4.7GB * 13 = 61.1GB.
Normally the OCR text, once loaded, almost never changes, but the metadata fields do change as corrections are made. We also routinely add more XML documents as we digitize more books over time. Updates and additions are therefore commonplace, so keeping the indexes up to date is important for full-text searches to remain performant. I’m wondering whether there are techniques for optimizing such quantities of text.
Thanks, Greg
Hi Greg,
Have you tried experimenting with the ADDCACHE [1] option when building your database? It's been a while, but I recall having good results with it, especially in a RAM-constrained environment.

Hope that's helpful!
Best, Bridger
[1] https://docs.basex.org/wiki/Options#ADDCACHE
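For illustration, a minimal sketch of how that might look from XQuery, assuming ADDCACHE can be set as a local option in the query prolog (see [1]); the database name and paths are placeholders:

declare option db:addcache 'true';  (: cache input documents before adding them to the database :)
db:add('books', '/data/ocr-xml/batch-042/', 'batch-042/')  (: placeholder database name and paths :)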
Hi Bridger,
Thank you for this tip. It looks like it might apply only to adding new documents, whereas my main problem at the moment is reindexing existing documents, but I will look into it further.
Thanks, Greg
PS. I could ask the IT department here to set up a virtual server for me that would have ample memory and disk space. Do you have any idea how much memory would be needed to optimize something like 60+ GB of text?
Hi Greg,
I would have guessed that 12 GB is enough for 4.7 GB of input, but it sometimes depends on the data. If you like, you can share a single typical document with us, and we can have a look at it. 61 GB will be too large for a complete full-text index, though. However, it’s always possible to distribute documents across multiple databases and access them with a single query [1].
The full-text index is not incremental (unlike the other index structures), which means it must be re-created after updates. However, it’s possible to re-index an updated database instance and query fully indexed databases at the same time.
Hope this helps, Christian
[1] https://docs.basex.org/wiki/Databases
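For illustration, a minimal sketch of the partitioning idea; the database name and partition scheme are placeholders, not from the thread. After updating one partition, only that partition’s full-text index needs to be rebuilt, while the other partitions stay available for querying:

(: rebuild the full-text index of the single partition that received updates :)
db:optimize('books03', true(), map { 'ftindex': true() })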
Thanks, Christian. Distributing documents across many databases sounds fine, as long as XPath expressions and full-text searching remain reasonably efficient. In the documentation, the example of addressing multiple databases uses a loop:
for $i in 1 to 100 return db:get('books' || $i)//book/title
Is that the preferred technique?
Also, is it possible to perform searches in the same manner without interfering with relevance scores?
Thanks, Greg
Hi, Greg,
Assuming you have multiple cores available, you can also execute a search in parallel using the (BaseX-specific) xquery:fork-join function[1]. That’s what I usually do when searching across databases.
All best, Tim
[1] https://docs.basex.org/wiki/XQuery_Module#xquery:fork-join
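For illustration, a minimal sketch of that pattern; the database names 'books1' to 'books13' and the search term are placeholders, and the query path assumes the <page> elements described earlier in the thread. xquery:fork-join evaluates the supplied zero-argument functions in parallel and concatenates their results:

(: sketch: run the same full-text search over several partitioned databases in parallel :)
let $dbs := for $i in 1 to 13 return 'books' || $i
return xquery:fork-join(
  for $db in $dbs
  return function() {
    db:get($db)//page[. contains text 'reformation']
  }
)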
--
Tim A. Thompson (he, him)
Librarian for Applied Metadata Research
Yale University Library
www.linkedin.com/in/timathompson