SQL Server - 重叠数据的累积总和 - 获取总和达到给定值的日期

SQL Server - 重叠数据的累积总和 - 获取总和达到给定值的日期

问题描述:

在我们公司,我们的客户会执行我们在不同表中登录的各种活动 - 面试出勤、课程出勤和其他一般活动.我有一个数据库视图,它将所有这些表中的数据联合起来,为我们提供了如下所示的 ActivityView.正如您所看到的,有些活动是重叠的 - 例如,在参加面试时,客户可能一直在执行简历更新活动.

In our company, our clients perform various activities that we log in different tables - Interview attendance, Course Attendance, and other general activities. I have a database view that unions data from all of these tables giving us the ActivityView that looks like this. As you can see some activities overlap - for example while attending an interview, a client may have been performing a CV update activity.

+----------------------+---------------+---------------------+-------------------+
| activity_client_id   | activity_type | activity_start_date | activity_end_date |
+----------------------+---------------+---------------------+-------------------+
|                  112 | Interview     | 2015-06-01 09:00    | 2015-06-01 11:00  |
|                  112 | CV updating   | 2015-06-01 09:30    | 2015-06-01 11:30  |
|                  112 | Course        | 2015-06-02 09:00    | 2015-06-02 16:00  |
|                  112 | Interview     | 2015-06-03 09:00    | 2015-06-03 10:00  |
+----------------------+---------------+---------------------+-------------------+

每个客户都有一个注册日期",记录在客户表上,这是他们加入我们计划的时间.这是我们的示例客户:

Each client has a "Sign Up Date", recorded on the client table, which is when they joined our programme. Here it is for our sample client:

+-----------+---------------------+
| client_id | client_sign_up_date |
+-----------+---------------------+
|       112 | 2015-05-20          |
+-----------+---------------------+

我需要创建一个将显示以下列的报告:

I need to create a report that will show the following columns:

+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+

我们需要这份报告来了解我们的计划有多有效.该计划的一个重要目标是我们让每位客户尽快完成至少 5 小时的活动.所以这份报告会告诉我们每个客户从注册到达到这个数字需要多长时间.

We need this report in order to see how effective our programme is. An important aim of the programme is that we get every client to complete at least 5 hours of activity as quickly as possible. So this report will tell us how long from sign up does it take each client to achieve this figure.

更棘手的是,当我们计算 5 小时的总活动时,我们必须扣除重叠的活动:

What makes this even trickier is that when we calculate 5 hours of total activity, we must discount overlapping activities:

在上面的示例数据中,客户在 09:00 到 11:00 之间参加了一次采访.
同一天,他们还从09:30到11:30进行了简历更新活动.对于我们的计算,这将使他们一天的总活动时间为 2.5 小时(150 分钟)——我们只计算 30 分钟的简历更新,因为面试将其重叠到 11:00.

In the sample data above the client attended an interview between 09:00 and 11:00.
On the same day they also performed CV updating activity from 09:30 to 11:30. For our calculation, this would give them total activity for the day of 2.5 hours (150 minutes) - we would only count 30 minutes of the CV updating as the Interview overlaps it up to 11:00.

因此,我们示例客户的报告将给出以下结果:

So the report for our sample client would give the following result:

+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+
|       112 | 2015-05-20          | 2015-06-02                                 |
+-----------+---------------------+--------------------------------------------+

所以我的问题是如何使用 select 语句创建报告?我可以通过编写一个存储过程来解决这个问题,该过程将遍历视图并将结果写入报告表.但我更愿意避免使用存储过程,并使用一个 select 语句来即时提供报告.

So my question is how can I create the report using a select statement ? I can work out how to do this by writing a stored procedure that will loop through the view and write the result to a report table. But I would much prefer to avoid a stored procedure and have a select statement that will give me the report on the fly.

我使用的是 SQL Server 2005.

I am using SQL Server 2005.

这是一种方法:

;WITH CTErn AS (
   SELECT activity_client_id, activity_type,
          activity_start_date, activity_end_date,
          ROW_NUMBER() OVER (PARTITION BY activity_client_id 
                             ORDER BY activity_start_date) AS rn
   FROM activities
),   
CTEdiff AS (
   SELECT c1.activity_client_id, c1.activity_type,
          x.activity_start_date, c1.activity_end_date,
          DATEDIFF(mi, x.activity_start_date, c1.activity_end_date) AS diff,
          ROW_NUMBER() OVER (PARTITION BY c1.activity_client_id 
                             ORDER BY x.activity_start_date) AS seq
   FROM CTErn AS c1
   LEFT JOIN CTErn AS c2 ON c1.rn = c2.rn + 1
   CROSS APPLY (SELECT CASE 
                          WHEN c1.activity_start_date < c2.activity_end_date
                             THEN c2.activity_end_date
                          ELSE c1.activity_start_date
                       END) x(activity_start_date)    
)
SELECT TOP 1 client_id, client_sign_up_date, activity_start_date, 
             hoursOfActivicty               
FROM CTEdiff AS c1
INNER JOIN clients AS c2 ON c1.activity_client_id = c2.client_id                     
CROSS APPLY (SELECT SUM(diff) / 60.0
             FROM CTEdiff AS c3
             WHERE c3.seq <= c1.seq) x(hoursOfActivicty)
WHERE hoursOfActivicty >= 5
ORDER BY seq

Common Table ExpressionsROW_NUMBER() 是在 SQL Server 2005 中引入的,因此上述查询应该适用于该版本.

Common Table Expressions and ROW_NUMBER() were introduced with SQL Server 2005, so the above query should work for that version.

此处演示

第一个 CTE,即 CTErn,产生以下输出:

The first CTE, i.e. CTErn, produces the following output:

client_id   activity_type   start_date          end_date          rn
112         Interview       2015-06-01 09:00    2015-06-01 11:00  1
112         CV updating     2015-06-01 09:30    2015-06-01 11:30  2
112         Course          2015-06-02 09:00    2015-06-02 16:00  3
112         Interview       2015-06-03 09:00    2015-06-03 10:00  4

第二个CTE,即CTEdiff,使用上面的表格表达式来计算每条记录的时间差,考虑到与前一条记录的重叠:

The second CTE, i.e. CTEdiff, uses the above table expression in order to calculate time difference for each record, taking into consideration any overlapps with the previous record:

client_id activity_type start_date       end_date         diff  seq
112       Interview     2015-06-01 09:00 2015-06-01 11:00 120   1
112       CV updating   2015-06-01 11:00 2015-06-01 11:30 30    2
112       Course        2015-06-02 09:00 2015-06-02 16:00 420   3
112       Interview     2015-06-03 09:00 2015-06-03 10:00 60    4

最终查询计算时间差的累计总和,并选择超过 5 小时活动的第一条记录.

The final query calculates the cumulative sum of time difference and selects the first record that exceeds 5 hours of activity.

上述查询适用于简单间隔重叠,即当活动的结束日期与下一个活动的开始日期重叠时.

The above query will work for simple interval overlaps, i.e. when just the end date of an activity overlaps the start date of the next activity.