Python - Grouping and Assigning Exception Rules

Problem description:

I would like to group my list according to a few exception rules. If the location with the closest negative diff to 0 is Location 86, assign Group 1; if it is Location 90, assign Group 2; if Locations 86 and 90 are tied for closest, assign Group 3. After this set of rules has run, I would rerun the code so that anywhere a group has not been assigned, numbering continues from Group 4 onward, so as not to override the previous group assignments.

The groupby is based on ID, Location, and which row is closest to the anchor (the Anchor column).

Note that in the example below, Location 66 is skipped as an exception; for that I would use df['Diff'].where(df['Diff'].le(0) & df['Anchor'].ne('Y') & df['Location'].ne(66))

Input:

ID  Location Anchor Date       Diff
111 86       N      5/2/2020  -1
111 87       Y      5/3/2020   0
111 90       N      5/4/2020  -2
111 90       Y      5/6/2020   0
123 86       N      1/4/2020  -1
123 90       N      1/4/2020  -1
123 91       Y      1/5/2020   0
456 64       N      2/3/2020  -2
456 66       N      2/4/2020  -1
456 91       Y      2/5/2020   0

Output:

ID  Location Anchor Date       Diff  Group
111 86       N      5/2/2020  -1     1
111 87       Y      5/3/2020   0
111 90       N      5/4/2020  -2     2
111 90       Y      5/6/2020   0
123 86       N      1/4/2020  -1     3
123 90       N      1/4/2020  -1     3
123 91       Y      1/5/2020   0     
456 64       N      2/3/2020  -2     4
456 66       N      2/4/2020  -1     
456 91       Y      2/5/2020   0
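
For experimenting, the input table can be rebuilt as a DataFrame and the mask from the question applied to it. This is only a minimal sketch: the column names follow the table header (ID, Location, Anchor, Date, Diff), and the variable names `eligible` and `candidate_diff` are made up for illustration.

```python
import pandas as pd

# Rebuild the input table from the question.
df = pd.DataFrame({
    "ID": [111, 111, 111, 111, 123, 123, 123, 456, 456, 456],
    "Location": [86, 87, 90, 90, 86, 90, 91, 64, 66, 91],
    "Anchor": ["N", "Y", "N", "Y", "N", "N", "Y", "N", "N", "Y"],
    "Date": ["5/2/2020", "5/3/2020", "5/4/2020", "5/6/2020", "1/4/2020",
             "1/4/2020", "1/5/2020", "2/3/2020", "2/4/2020", "2/5/2020"],
    "Diff": [-1, 0, -2, 0, -1, -1, 0, -2, -1, 0],
})

# Rows that may receive a group: non-positive Diff, not an anchor row,
# and not Location 66 (the skipped exception).
eligible = df["Diff"].le(0) & df["Anchor"].ne("Y") & df["Location"].ne(66)

# Series.where keeps Diff on eligible rows and puts NaN everywhere else.
candidate_diff = df["Diff"].where(eligible)
print(candidate_diff.tolist())
```

The anchor rows and the Location 66 row come out as NaN, so later steps that look for the closest diff never pick them.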


Among your exception rules, the one involving both 86 and 90 adds some complexity to the code, because you need a single value that identifies a group made up of two locations. More generally, catching several locations that share the same diff is the hard part. Here is one way: create series holding the different group values and masks.

import numpy as np
import pandas as pd

# one block per ID, restarting after each anchor row (Diff of 0)
gr = (df['ID'].ne(df['ID'].shift()) | df['Anchor'].shift().eq('Y')).cumsum()
# rows whose Diff equals the last eligible Diff before the anchor in their block
mask_last = (df['Diff'].where(df['Diff'].le(0) & df['Anchor'].ne('Y') & df['Location'].ne(66))
                       .groupby(gr).transform('last')
                       .eq(df['Diff']))
# base used to build a unique fake Location value, needed when several locations tie
loc_max = df['Location'].max() + 1
# encode the selected locations of each block as a single number
gr2 = (df['Location'].where(mask_last).groupby(gr)
                     .transform(lambda x: (x.dropna().sort_values()
                                           * loc_max**np.arange(len(x.dropna()))).sum()))
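
The fake Location value built in gr2 treats the sorted locations of a block as digits of a number written in base loc_max. A plain-Python sketch of that idea follows; the helper name `encode_locations` is hypothetical and is not part of the code above, and the base of 92 assumes the sample data, where the largest Location is 91.

```python
def encode_locations(locations, base):
    """Combine locations into one integer by using them as base-`base` digits."""
    ordered = sorted(locations)
    # The smallest location is the lowest digit, the next one is multiplied by base, etc.
    return sum(loc * base**i for i, loc in enumerate(ordered))


base = 91 + 1  # largest Location in the sample data plus one (loc_max)

print(encode_locations([86], base))      # a single location keeps its own value
print(encode_locations([90], base))
print(encode_locations([90, 86], base))  # the tied pair gives 86 + 90*92
```

Because every location is positive and smaller than the base, a tied pair such as 86 and 90 should not produce the same number as any single location, which is why the exception dictionary can use 86 + 90*loc_max as the key for the combined group.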

Now you can create the groups:

# now create the group column
d_exception = {86: 1, 90: 2, 86 + 90*loc_max: 3}  # you can add more
df['group'] = ''
# exception rules
for key, val in d_exception.items():
    df.loc[mask_last & gr2.eq(key), 'group'] = val
# the rest of the groups
idx = df.index[mask_last & ~gr2.isin(list(d_exception))]
df.loc[idx, 'group'] = pd.factorize(df.loc[idx, 'Location'])[0] + len(d_exception) + 1
print(df)
    ID  Location Anchor      Date  Diff group
0  111        86      N  5/2/2020    -1     1
1  111        87      Y  5/3/2020     0      
2  111        90      N  5/4/2020    -2     2
3  111        90      Y  5/6/2020     0      
4  123        86      N  1/4/2020    -1     3
5  123        90      N  1/4/2020    -1     3
6  123        91      Y  1/5/2020     0      
7  456        64      N  2/3/2020    -2     4
8  456        66      N  2/4/2020    -1      
9  456        91      Y  2/5/2020     0
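
The numbering of the remaining groups relies on pd.factorize, which labels distinct values 0, 1, 2, ... in order of first appearance; adding the number of exception rules plus one makes the labels start at 4. A small sketch with made-up Location values (only 64 appears in the sample data; 70 and 75 are hypothetical):

```python
import pandas as pd

# Locations of the rows not covered by any exception rule (hypothetical values).
remaining = pd.Series([64, 70, 64, 75])

n_exceptions = 3  # groups 1 to 3 are reserved by the exception rules

# factorize returns integer codes plus the distinct values in order of first appearance.
codes, uniques = pd.factorize(remaining)
group_numbers = codes + n_exceptions + 1
print(group_numbers.tolist())  # [4, 5, 4, 6]
```

Note that the codes depend only on the Location value, so under this scheme rows from different IDs that share the same closest Location would receive the same group number.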