SQL Server lock timeout exceeded when deleting records in a loop
I am testing a process that deletes many, many records at once. It cannot TRUNCATE TABLE, because there are records in there that need to stay.
Because of the volume, I have broken the delete into a loop similar to this:
-- Do not block if records are locked.
SET LOCK_TIMEOUT 0
-- This process should be chosen as a deadlock victim in the case of a deadlock.
SET DEADLOCK_PRIORITY LOW
SET NOCOUNT ON
DECLARE @Count INT
SET @Count = 1
WHILE @Count > 0
BEGIN TRY
BEGIN TRANSACTION -- added per comment below
DELETE TOP (1000) FROM MyTable WITH (ROWLOCK, READPAST) WHERE MyField = SomeValue
SET @Count = @@ROWCOUNT
COMMIT
END TRY
BEGIN CATCH
exec sp_lock -- added to display the open locks after the timeout
exec sp_who2 -- shows the active processes
IF @@TRANCOUNT > 0
ROLLBACK
RETURN -- ignoring this error for brevity
END CATCH
MyTable is a clustered table. MyField is the first column of the clustered index. It indicates a logical grouping of records, so MyField = SomeValue often selects many records. I don't care in which order they are deleted, as long as one group is processed at a time. There are no other indexes on this table.
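For context, here is a minimal sketch of the kind of table being described. MyField as the leading clustered key comes from the question; the other column names and types are assumptions.
-- Hypothetical schema: MyField leads the clustered index, no other indexes exist.
CREATE TABLE dbo.MyTable
(
    MyField  INT          NOT NULL,  -- logical grouping key
    RecordId INT IDENTITY NOT NULL,  -- assumed uniquifying column
    Payload  VARCHAR(100) NULL       -- assumed data column
);

CREATE CLUSTERED INDEX IX_MyTable_MyField
    ON dbo.MyTable (MyField, RecordId);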
I added the ROWLOCK hint to try to avoid the lock escalations we have seen in production. I added the READPAST hint to avoid deleting records locked by other processes. That should never happen, but I am trying to be safe.
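To check whether the ROWLOCK hint is actually being honored, or whether escalation to page or table locks is happening, the locks held by the deleting session can be watched from a second connection. A sketch, assuming the delete loop's session id has been looked up with sp_who2:
-- Run from a second connection while the delete loop is executing.
-- Replace 55 with the session id of the delete loop (from sp_who2).
SELECT  resource_type,
        request_mode,
        request_status,
        COUNT(*) AS lock_count
FROM    sys.dm_tran_locks
WHERE   request_session_id = 55
GROUP BY resource_type, request_mode, request_status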
The problem: this loop sometimes hits lock timeout 1222, "Lock request time out period exceeded", even when it is the only thing running.
I am positive there is no other activity on this system while I am testing this process, because it is my own developer box, nobody else is connected, there are no other processes running on it, and the profiler shows no activity.
I can re-run the same script a second later and it picks up where it left off, happily deleting records, until the next lock timeout.
I have tried a BEGIN TRY / BEGIN CATCH to ignore the 1222 error and retry the delete, but it fails again immediately with the same lock timeout error. It also fails again if I add a short delay before retrying.
I assume the lock timeouts are because of something like a page split, but I am not sure why this would conflict with the current loop iteration. The prior delete statement should have already completed, and I thought that meant any page splits were also finished.
Why would a DELETE loop cause lock timeouts against itself?
Is there any way for this process to avoid the lock timeouts, or to detect that it is safe to resume?
This is on SQL Server 2005.
-- Edit --
I added the Lock:Timeout event to the profiler. It's timing out on a PAGELOCK during the delete:
Event Class: Lock:Timeout
TextData: 1:15634 (one example of several)
Mode: 7 - IU
Type: 6 - PAGE
DBCC PAGE reports these pages are outside of the range of the master database (ID 1).
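To reproduce that check: the TextData value 1:15634 is a file:page pair, and the page can be dumped with DBCC PAGE once trace flag 3604 routes the output to the client. The database name below is a guess based on the sp_lock output later in this question; use whichever database the delete actually runs in.
-- Send DBCC output to the current connection instead of the error log.
DBCC TRACEON (3604)

-- DBCC PAGE (database, file id, page id, print option)
-- 1:15634 from the profiler trace means file 1, page 15634.
DBCC PAGE ('tempdb', 1, 15634, 3)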
-- Edit 2 --
I added a BEGIN TRY / BEGIN CATCH and ran an exec sp_lock in the catch block. Here is what I saw:
spid dbid ObjId IndId Type Resource Mode Status
19 2 1401108082 1 PAG 1:52841 X GRANT (tempdb.dbo.MyTable)
19 2 1401108082 0 TAB IX GRANT (tempdb.dbo.MyTable)
Me 2 1401108082 0 TAB IX GRANT (tempdb.dbo.MyTable)
Me 1 1115151018 0 TAB IS GRANT (master..spt_values) (?)
SPID 19 is a SQL Server TASK MANAGER. Why would one of these task managers be acquiring locks on MyTable?
I've found the answer: my looped delete is conflicting with the ghost cleanup proc.
Using Nicholas's suggestion, I added a BEGIN TRANSACTION and a COMMIT. I wrapped the delete loop in a BEGIN TRY / BEGIN CATCH. In the BEGIN CATCH, right before the ROLLBACK, I ran sp_lock and sp_who2. (I added the code changes in the question above.)
When my process blocked, I saw the following output:
spid dbid ObjId IndId Type Resource Mode Status
------ ------ ----------- ------ ---- -------------------------------- -------- ------
20 2 1401108082 0 TAB IX GRANT
20 2 1401108082 1 PAG 1:102368 X GRANT
SPID Status Login HostName BlkBy DBName Command CPUTime DiskIO
---- ---------- ----- -------- ----- ------ ------------- ------- ------
20 BACKGROUND sa . . tempdb GHOST CLEANUP 31 0
For future reference, when SQL Server deletes records, it sets a bit on them to just mark them as "ghost records". Every few minutes, an internal process called ghost cleanup runs to reclaim pages of records that have been fully deleted (i.e. all records are ghost records).
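Ghost records can be observed directly (on SQL Server 2005 and later) through the ghost_record_count column of sys.dm_db_index_physical_stats; the database and table names below are placeholders.
-- Count ghost records remaining in the clustered index after a round of deletes.
SELECT  index_id,
        ghost_record_count,
        version_ghost_record_count
FROM    sys.dm_db_index_physical_stats
            (DB_ID('MyDatabase'), OBJECT_ID('dbo.MyTable'), NULL, NULL, 'DETAILED')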
The ghost cleanup process is discussed in this question on ServerFault.
It is possible to disable the ghost cleanup process with a trace flag. But I did not have to do so in this case.
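For reference, the trace flag usually cited for this is 661 (disable ghost record removal, per KB 920093); verify it against the documentation for your build before enabling it, since it affects the whole instance.
-- Disable the ghost cleanup task instance-wide (trace flag 661; verify for your build).
DBCC TRACEON (661, -1)

-- Re-enable ghost cleanup when finished.
DBCC TRACEOFF (661, -1)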
I ended up adding a lock wait timeout of 100 ms. This causes occasional lock wait timeouts in the ghost record cleanup process, but that is acceptable. I also added an outer loop that retries lock timeouts up to 5 times. With these two changes, my process now usually completes. It only times out when a very long-running process pushing lots of data around takes table or page locks on the data my process needs to clean up.
编辑2016-07-20
最终代码如下:
-- Do not block long if records are locked.
SET LOCK_TIMEOUT 100
-- This process volunteers to be a deadlock victim in the case of a deadlock.
SET DEADLOCK_PRIORITY LOW
DECLARE @Error BIT
SET @Error = 0
DECLARE @ErrMsg VARCHAR(1000)
DECLARE @SQL NVARCHAR(1000)
-- @ProcGUID holds the value being cleaned up; it is assigned dynamically in the actual script.
DECLARE @ProcGUID UNIQUEIDENTIFIER
DECLARE @DeletedCount INT
SELECT @DeletedCount = 0
DECLARE @LockTimeoutCount INT
SET @LockTimeoutCount = 0
DECLARE @ContinueDeleting BIT,
@LastDeleteSuccessful BIT
SET @ContinueDeleting = 1
SET @LastDeleteSuccessful = 1
WHILE @ContinueDeleting = 1
BEGIN
DECLARE @RowCount INT
SET @RowCount = 0
BEGIN TRY
BEGIN TRANSACTION
-- The READPAST below attempts to skip over locked records.
-- However, it might still cause a lock wait error (1222) if a page or index is locked, because the delete has to modify indexes.
-- The threshold for row lock escalation to table locks is around 5,000 records,
-- so keep the deleted number smaller than this limit in case we are deleting a large chunk of data.
-- Table name, field, and value are all set dynamically in the actual script.
SET @SQL = N'DELETE TOP (1000) FROM MyTable WITH (ROWLOCK, READPAST) WHERE MyField = @ProcGuid'
EXEC sp_executesql @SQL, N'@ProcGuid uniqueidentifier', @ProcGUID
SET @RowCount = @@ROWCOUNT
COMMIT
SET @LastDeleteSuccessful = 1
SET @DeletedCount = @DeletedCount + @RowCount
IF @RowCount = 0
BEGIN
SET @ContinueDeleting = 0
END
END TRY
BEGIN CATCH
IF @@TRANCOUNT > 0
ROLLBACK
IF Error_Number() = 1222 -- Lock timeout
BEGIN
IF @LastDeleteSuccessful = 1
BEGIN
-- If we hit a lock timeout, and we had already deleted something successfully, try again.
SET @LastDeleteSuccessful = 0
END
ELSE
BEGIN
-- The last delete failed, too. Give up for now. The job will run again shortly.
SET @ContinueDeleting = 0
END
END
ELSE -- On anything other than a lock timeout, report an error.
BEGIN
SET @ErrMsg = 'An error occurred cleaning up data. Table: MyTable Column: MyColumn Value: SomeValue. Message: ' + ERROR_MESSAGE() + ' Error Number: ' + CONVERT(VARCHAR(20), ERROR_NUMBER()) + ' Line: ' + CONVERT(VARCHAR(20), ERROR_LINE())
PRINT @ErrMsg -- this error message will be included in the SQL Server job history
SET @Error = 1
SET @ContinueDeleting = 0
END
END CATCH
END
IF @Error <> 0
RAISERROR('Not all data could be cleaned up. See previous messages.', 16, 1)