根据带有条件的时间戳获取数据帧中的最新记录
我的问题标题可能不准确,但我希望我能够解释我的问题 所以我有一个像下面的数据框
My Question heading might not be accurate but i hope i will be able to explain my question So i have a data frame like below
DataPartition_1|^|PartitionYear_1|^|TimeStamp|^|OrganizationId|^|AnnualPeriodId|^|InterimPeriodId|^|InterimNumber_1|^|FFAction_1
SelfSourcedPublic|^|2001|^|1510044629598|^|4295858941|^|5|^|21|^|2|^|I|!|
SelfSourcedPublic|^|2002|^|1510044629599|^|4295858941|^|1|^|22|^|2|^|I|!|
SelfSourcedPublic|^|2002|^|1510044629600|^|4295858941|^|1|^|23|^|2|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629601|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629624|^|4295858941|^|null|^|35|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|
SelfSourcedPublic|^|2016|^|1510044629626|^|4295858941|^|null|^|37|^|null|^|D|!|
SelfSourcedPublic|^|2001|^|1510044629596|^|4295858941|^|19|^|5|^|1|^|I|!|
SelfSourcedPublic|^|2001|^|1510044629597|^|4295858941|^|20|^|5|^|2|^|I|!|
SelfSourcedPublic|^|2001|^|1510044629598|^|4295858941|^|21|^|5|^|2|^|I|!|
所以我数据框的主键是
OrganizationId", "AnnualPeriodId","InterimPeriodId"
下面是我的代码,该代码根据时间戳获取最新记录,并按主键seq排列.
below is my code to get the latest records based on time stamp and arranged by seq of primary key .
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
现在我的问题是有些时候我在某些主键列(例如带有时间戳记1510044629624
的记录)中为空.
Now my issue is some time i get null in some of the primary key column like record with time stamp 1510044629624
.
现在我的要求是,下面的记录具有相同的主键,但第一个记录为null.在这种情况下,我仍然只需要一个具有最新时间戳的记录
Now my requirement is that below records have same primary key except first one has null .In this case i still need only one records with latest time stamp
SelfSourcedPublic|^|2016|^|1510044629601|^|4295858941|^|35|^|36|^|1|^|I|!|
SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|
我应该得到SelfSourcedPublic|^|2016|^|1510044629625|^|4295858941|^|null|^|36|^|null|^|D|!|
由于空..
我希望我的问题很清楚.
I hope my question is clear.
据我所知,您正在使用一个额外的列作为主键.
From what I understood from your question and problem is that you are using an extra column as primary key.
AnnualPeriodId
列正在获取null
,并且由于您正在使用partitionBy
中的该字段,因此导致null成为单独的组,从而导致单独的row
AnnualPeriodId
column is getting null
and since you are using that field in partitionBy
, it is causing the null to be separate group and thus separate row
val windowSpec = Window.partitionBy("OrganizationId", "AnnualPeriodId","InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
因此解决方案是将其从partitionBy
中删除,以使上面的行变为
So the solution is to remove it from the partitionBy
so that the above line becomes
val windowSpec = Window.partitionBy("OrganizationId", "InterimPeriodId").orderBy($"TimeStamp".cast(LongType).desc)
我希望这能解决您遇到的问题.
I hope this should solve the issue you are having.