Efficient way to ensure unique rows in SQLite3

Question:

I am using SQLite3 in one of my projects and I need to ensure that the rows that are inserted into a table are unique with regard to a combination of some of their columns. In most cases the rows inserted will differ in that respect, but in case of a match the new row must update/replace the existing one.

The obvious solution was to use a composite primary key, with a conflict clause to handle collisions. Therefore this:

CREATE TABLE Event (Id INTEGER, Fld0 TEXT, Fld1 INTEGER, Fld2 TEXT, Fld3 TEXT, Fld4 TEXT, Fld5 TEXT, Fld6 TEXT);

became this:

CREATE TABLE Event (Id INTEGER, Fld0 TEXT, Fld1 INTEGER, Fld2 TEXT, Fld3 TEXT, Fld4 TEXT, Fld5 TEXT, Fld6 TEXT, PRIMARY KEY (Fld0, Fld2, Fld3) ON CONFLICT REPLACE);
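The replace semantics of the conflict clause can be checked with a quick sketch. Python's built-in sqlite3 module is used here for brevity (the application itself is in C, but the SQL behaves the same through either API); the table definition is the one above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Event (Id INTEGER, Fld0 TEXT, Fld1 INTEGER, Fld2 TEXT, "
    "Fld3 TEXT, Fld4 TEXT, Fld5 TEXT, Fld6 TEXT, "
    "PRIMARY KEY (Fld0, Fld2, Fld3) ON CONFLICT REPLACE)"
)
# Two inserts sharing the same (Fld0, Fld2, Fld3); the second replaces the first.
conn.execute("INSERT INTO Event VALUES (1, 'a', 10, 'b', 'c', '', '', 'old')")
conn.execute("INSERT INTO Event VALUES (2, 'a', 20, 'b', 'c', '', '', 'new')")
rows = conn.execute("SELECT Id, Fld6 FROM Event").fetchall()
print(rows)  # [(2, 'new')] - only the replacing row survives
```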

This does indeed enforce the uniqueness constraint as I need it to. Unfortunately, this change also incurs a performance penalty far beyond what I expected. I ran a few tests using the sqlite3 command-line utility to make sure the fault was not elsewhere in my code. The tests involve inserting 100,000 rows: in a single transaction, in 10 transactions of 10,000 rows, or in 100 transactions of 1,000 rows each. I got the following results:

                                | 1 * 100,000   | 10 * 10,000   | 100 * 1,000   |
                                |---------------|---------------|---------------|
                                | Time  | CPU   | Time  | CPU   | Time  | CPU   |
                                | (sec) | (%)   | (sec) | (%)   | (sec) | (%)   |
--------------------------------|-------|-------|-------|-------|-------|-------|
No primary key                  | 2.33  | 80    | 3.73  | 50    | 15.1  | 15    |
--------------------------------|-------|-------|-------|-------|-------|-------|
Primary key: Fld3               | 5.19  | 84    | 23.6  | 21    | 226.2 | 3     |
--------------------------------|-------|-------|-------|-------|-------|-------|
Primary key: Fld2, Fld3         | 5.11  | 88    | 24.6  | 22    | 258.8 | 3     |
--------------------------------|-------|-------|-------|-------|-------|-------|
Primary key: Fld0, Fld2, Fld3   | 5.38  | 87    | 23.8  | 23    | 232.3 | 3     |

My application currently performs transactions of at most 1,000 rows and I was surprised by the 15-fold drop in performance. I expected at most a 3-fold drop in throughput and a rise in CPU usage, as seen in the 100k-transaction case. I guess the indexing involved in maintaining the primary key constraints requires a significantly larger number of synchronous DB operations, thus making my hard drives the bottleneck in this case.

Using WAL mode does have some effect - a performance increase of about 15%. Unfortunately that is not enough on its own. PRAGMA synchronous = NORMAL did not seem to have any effect.
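For reference, these two settings are plain PRAGMAs. A minimal sketch using Python's built-in sqlite3 module (the database file path is hypothetical; in the C API the same statements go through sqlite3_exec):

```python
import os
import sqlite3
import tempfile

# WAL requires a file-backed database; the path here is just for the demo.
db_path = os.path.join(tempfile.mkdtemp(), "events.db")
conn = sqlite3.connect(db_path)

# WAL appends changes to a write-ahead log instead of copying pages into a
# rollback journal, which reduces the per-commit fsync cost.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# In WAL mode, synchronous=NORMAL syncs only at checkpoints rather than on
# every transaction commit.
conn.execute("PRAGMA synchronous=NORMAL")
print(mode)  # wal
```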

I might be able to recover some performance by increasing the transaction size, but I'd rather not do that, due to the increased memory usage and concerns about responsiveness and reliability.

The text fields in each row have variable lengths, about 250 bytes on average. Query performance does not matter much, but insert performance is very important. My application code is in C and is (supposed to be) portable to at least Linux and Windows.

Is there a way to improve the insert performance without increasing the transaction size? Either some setting in SQLite (anything but permanently forcing the DB into asynchronous operation, that is) or programmatically in my application code? For example, is there a way to ensure row uniqueness without using an index?

Bounty:

By using the hashing/indexing method described in my own answer, I managed to somewhat moderate the performance drop to a point where it's probably acceptable for my application. It seems, however, that as the number of rows in the table increases, the presence of the index makes inserts slower and slower.
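The referenced answer is not reproduced here, but one plausible shape of such a hashing/indexing scheme is to store a cheap integer hash of the key columns, index only that single integer, and resolve hash collisions with a WHERE clause on the real columns. A minimal sketch (Python's sqlite3 for brevity; the hash function and upsert helper are illustrative, not the author's actual code):

```python
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Event (Id INTEGER, Fld0 TEXT, Fld1 INTEGER, Fld2 TEXT, "
    "Fld3 TEXT, Fld4 TEXT, Fld5 TEXT, Fld6 TEXT, KeyHash INTEGER)"
)
# A single-integer index is cheaper to maintain than a composite TEXT key.
conn.execute("CREATE INDEX EventKeyHash ON Event (KeyHash)")

def key_hash(fld0, fld2, fld3):
    # CRC32 of the concatenated key columns; any cheap hash would do.
    return zlib.crc32("|".join((fld0, fld2, fld3)).encode())

def upsert(row):
    h = key_hash(row[1], row[3], row[4])
    # The index narrows the search to hash collisions; the WHERE clause on the
    # actual columns resolves them, so correctness never rests on the hash alone.
    conn.execute(
        "DELETE FROM Event WHERE KeyHash=? AND Fld0=? AND Fld2=? AND Fld3=?",
        (h, row[1], row[3], row[4]),
    )
    conn.execute("INSERT INTO Event VALUES (?,?,?,?,?,?,?,?,?)", row + (h,))

upsert((1, 'a', 10, 'b', 'c', '', '', 'old'))
upsert((2, 'a', 20, 'b', 'c', '', '', 'new'))
result = conn.execute("SELECT Id, Fld6 FROM Event").fetchall()
print(result)  # [(2, 'new')]
```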

I am interested in any technique or fine-tuning setting that will increase performance in this particular use case, as long as it does not involve hacking the SQLite3 code or otherwise cause the project to become unmaintainable.

Answer:

I have used sqlite to insert millions of rows at runtime and this is what I have used to increase performance:

  • Use as few transactions as possible.
  • Use parametrized commands for inserting the data (prepare the command once and just change parameter values in the loop).
  • Set PRAGMA synchronous OFF (not sure how it works with WAL)
  • Increase page size of the database.
  • Increase cache size. This is an important setting as it will cause sqlite to actually write the data to the disk fewer times and will run more operations in memory making the whole process faster.
  • If you need an index, add it after inserting the rows by running the necessary sqlite command. In this case you will need to ensure uniqueness yourself, as you are doing now.
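Put together, the checklist above looks roughly like this. The sketch uses Python's sqlite3 module for brevity (the same PRAGMAs and a prepared statement loop apply in C), and the page/cache sizes are illustrative values, not tuned recommendations:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA page_size=4096")    # takes effect before the db is populated
conn.execute("PRAGMA cache_size=10000")  # pages kept in memory between writes
conn.execute("PRAGMA synchronous=OFF")   # trades durability for insert speed
conn.execute("CREATE TABLE Event (Fld0 TEXT, Fld2 TEXT, Fld3 TEXT)")

rows = [("a%d" % i, "b", "c") for i in range(1000)]
with conn:  # one transaction for the whole batch (first bullet)
    # executemany reuses a single prepared statement (second bullet)
    conn.executemany("INSERT INTO Event VALUES (?,?,?)", rows)

# Build the index only after the bulk insert is done (last bullet).
conn.execute("CREATE INDEX EventKey ON Event (Fld0, Fld2, Fld3)")
count = conn.execute("SELECT COUNT(*) FROM Event").fetchone()[0]
print(count)  # 1000
```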

If you try these please post your test results. I believe it will be interesting for everyone.