Loading a larger CSV file into Neo4j

Problem description:

I want to load a CSV that contains relationships between Wikipedia categories, rels.csv (4 million relations between categories). I tried to modify the settings file by changing the following parameter values:

dbms.memory.heap.initial_size=8G 
dbms.memory.heap.max_size=8G
dbms.memory.pagecache.size=9G

My query is as follows:

USING PERIODIC COMMIT 10000
LOAD CSV FROM 
"https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/rels.csv?raw=true" AS row
    MATCH (from:Category { catId: row[0]})
    MATCH (to:Category { catId: row[1]})
    CREATE (from)-[:SUBCAT_OF]->(to)

Moreover, I created indexes on catId and catName. Despite all these optimizations, the query is still running (since yesterday).

Can you tell me if there are more optimizations that should be done to load this CSV file?

It's taking too much time. Four million relationships should take a few minutes, if not seconds.

I just loaded all the data from the link you shared in 321 seconds (Cats: 90, Rels: 231) with less than half of your memory settings on my personal laptop:

dbms.memory.heap.initial_size=1G  
dbms.memory.heap.max_size=4G 
dbms.memory.pagecache.size=1512m

This is not the limit; it can be improved further.

Slightly modified query: increased the USING PERIODIC COMMIT batch size 10 times (there is no LIMIT clause; the commit batch size is what changed):

USING PERIODIC COMMIT 100000
LOAD CSV FROM 
"https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/rels.csv?raw=true" AS row
    MATCH (from:Category { catId: row[0]})
    MATCH (to:Category { catId: row[1]})
    CREATE (from)-[:SUBCAT_OF]->(to)

Some suggestions:

  1. Create an index on the fields that are used to look up nodes. (There is no need to index other fields while loading data; that can be done later, and it consumes unnecessary memory.)
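For example, the lookup index on catId could be created like this. The first form matches Neo4j 3.x, which is consistent with the USING PERIODIC COMMIT clause used above; the second is the 4.x+ equivalent, and the index name `category_catId` is an arbitrary choice, not something from the original post:

```cypher
// Neo4j 3.x style: index :Category nodes on the lookup property
CREATE INDEX ON :Category(catId);

// Neo4j 4.x+ equivalent (index name is an assumption)
CREATE INDEX category_catId IF NOT EXISTS
FOR (c:Category) ON (c.catId);
```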

  2. Don't set the max heap size to the full system RAM. Set it to about 50% of RAM.

  3. Don't forget to delete the previous data before you run the LOAD CSV query again, as it will create duplicate relationships.
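A minimal way to clear the previously created relationships before re-running the load might look like the following. This deletes everything in one transaction; for millions of relationships, a batched delete (e.g. via APOC's apoc.periodic.iterate) would put less pressure on the heap:

```cypher
// Remove all existing SUBCAT_OF relationships so the re-import
// does not create duplicates
MATCH (:Category)-[r:SUBCAT_OF]->(:Category)
DELETE r;
```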

NOTE: I downloaded the files to my laptop and loaded them locally, so there is no download time.
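Loading from the local copy would look roughly like this, assuming rels.csv has been placed in the Neo4j import directory (file:/// URLs are resolved relative to the directory configured by dbms.directories.import):

```cypher
USING PERIODIC COMMIT 100000
LOAD CSV FROM "file:///rels.csv" AS row
MATCH (from:Category { catId: row[0] })
MATCH (to:Category { catId: row[1] })
CREATE (from)-[:SUBCAT_OF]->(to)
```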