Hi,
I am trying to work with a huge CSV file (about 380 MB), but if I build a database from it, it seems that even simple operations cannot be evaluated. Is splitting the CSV file the only option, or am I missing something here? Thanks.
Giuseppe
As there are many different ways to process large CSV data with BaseX… What did you try so far?
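One of them, just as a rough sketch, would be to parse the file on the fly with the CSV Module and query the result directly (the path and the column name below are placeholders):

  (: parse the CSV on the fly and run a simple query on the result;
     '/data/huge.csv' and 'some-column' are placeholders :)
  let $csv := csv:parse(
    file:read-text('/data/huge.csv'),
    map { 'header': true() }
  )
  return count($csv//record[some-column = 'some value'])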
I uploaded the file, as it is, in the database, but this does not help. The idea was to first transform the file into XML and then query it, but this cannot be done on the fly. So the only thing I can think of is to split the original CSV file into multiple CSV files, transform each of them into XML, and then query those. Are there alternatives? Thanks.
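Roughly, the splitting I have in mind would look like this (just a sketch; the path, the database name, and the chunk size are placeholders):

  (: split the rows into chunks, convert each chunk to XML, and add it to the
     database as a separate document; all names are placeholders :)
  let $lines  := file:read-text-lines('/data/huge.csv')
  let $header := head($lines)
  let $rows   := tail($lines)
  let $size   := 100000
  for $i in 1 to (count($rows) idiv $size) + 1
  let $chunk  := subsequence($rows, ($i - 1) * $size + 1, $size)
  where exists($chunk)
  let $xml    := csv:parse(string-join(($header, $chunk), '&#10;'),
                   map { 'header': true() })
  return db:add('mydb', $xml, 'chunk-' || $i || '.xml')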
Giuseppe
Universität Leipzig
Institute of Computer Science, NLP
Augustusplatz 10
04109 Leipzig
Deutschland
E-mail: celano@informatik.uni-leipzig.de
E-mail: giuseppegacelano@gmail.com
Web site 1: http://asv.informatik.uni-leipzig.de/en/staff/Giuseppe_Celano
Web site 2: https://sites.google.com/site/giuseppegacelano/
I uploaded it as CSV (it is CSV) via the GUI, and it is then converted into XML (this conversion probably makes it too big).
On Aug 10, 2018, at 1:50 PM, Christian Grün christian.gruen@gmail.com wrote:
I uploaded the file, as it is, in the database
So you uploaded the file as binary? Did you try to import it as XML, too? Does »upload« mean that you used the simple REST API?
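If it ended up as a raw file, importing it as XML via the CSV parser could look like this (just a sketch; the database name and the path are placeholders):

  (: create the database with the CSV parser, so the file is stored as XML
     rather than as a raw file; 'mydb' and the path are placeholders :)
  db:create(
    'mydb',
    '/data/huge.csv',
    'huge.xml',
    map { 'parser': 'csv', 'csvparser': 'header=true' }
  )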
On Fri, 2018-08-10 at 13:43 +0200, Giuseppe Celano wrote:
I uploaded the file, as it is, in the database,
I'd probably look for an XSLT transformation to turn it into XML - or there are Python and Perl scripts or other programs that can do it - and then load the result into a database.
It's not all that large a file, so maybe it'd help if you described the exact problems you were having -- what did you try, what did you expect to happen, what actually happened, what steps did you take to investigate...
Liam
Hi Liam,
Thanks for answering. The problem is not only the XML transformation per se, but also querying the documents afterwards. I see that if I split the big CSV into smaller (XML) documents and query them sequentially, I have no performance problems. This also seems to be the case in the database, as far as I can see: accessing more, smaller documents sequentially works better than one big file.
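What I mean by querying them sequentially is roughly this (a sketch; the database and element names are made up):

  (: go through the chunk documents one after the other;
     'mydb', 'record' and 'token1' are placeholders :)
  for $doc in db:open('mydb')
  return $doc//record[token1 = 'some value']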
Ciao, Giuseppe
On Sun, 2018-08-12 at 23:58 +0200, Giuseppe Celano wrote:
accessing more, smaller documents sequentially works better than one big file.
Are you building indexes in the database? Do your queries make use of them?
You may find using the full text extensions useful.
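For example (just a sketch; the database and element names are made up):

  (: a value comparison that can be rewritten to use the text index, and a
     full-text query that can use the full-text index; names are placeholders :)
  db:open('mydb')//record[token1 = 'some value'],
  db:open('mydb')//record[token1 contains text 'some value']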
Liam
Yes, I build them, but I do not use them explicitly all the time.