Cumulative operations on dtype objects

Question:

I am trying to figure out how I can apply cumulative functions to objects. For numbers there are several alternatives like cumsum and cumcount. There is also df.expanding which can be used with apply. But the functions I pass to apply do not work on objects.

import pandas as pd
df = pd.DataFrame({"C1": [1, 2, 3, 4], 
                   "C2": [{"A"}, {"B"}, {"C"}, {"D"}], 
                   "C3": ["A", "B", "C", "D"], 
                   "C4": [["A"], ["B"], ["C"], ["D"]]})

df
Out: 
   C1   C2 C3   C4
0   1  {A}  A  [A]
1   2  {B}  B  [B]
2   3  {C}  C  [C]
3   4  {D}  D  [D]

The dataframe contains integers, sets, strings and lists. Now, if I try expanding().apply(sum), I get a cumulative sum only for the integer column; the object columns come back unchanged:

df.expanding().apply(sum)
Out[69]: 
     C1   C2 C3   C4
0   1.0  {A}  A  [A]
1   3.0  {B}  B  [B]
2   6.0  {C}  C  [C]
3  10.0  {D}  D  [D]

My expectation was, since summation is defined on lists and strings, I would get something like this:

     C1   C2  C3     C4
0   1.0  {A}  A      [A]
1   3.0  {B}  AB     [A, B]
2   6.0  {C}  ABC    [A, B, C]
3  10.0  {D}  ABCD   [A, B, C, D]

I also tried something like this:

from functools import reduce  # reduce is not a builtin on Python 3

df.expanding().apply(lambda r: reduce(lambda x, y: x+y**2, r))
Out: 
     C1   C2 C3   C4
0   1.0  {A}  A  [A]
1   5.0  {B}  B  [B]
2  14.0  {C}  C  [C]
3  30.0  {D}  D  [D]

On the numeric column it works as I expect: the previous result is x and the current row value is y. But I cannot reduce the object columns the same way, for example with x.union(y).

So, my question is: Are there any alternatives to expanding that I can use on objects? The example is just to show that expanding().apply() is not working on object dtypes. I am looking for a general solution that supports applying functions to those two inputs: previous result and the current element.

Answer

It turns out this cannot be done with expanding().apply().

Continuing with the same example:

def burndowntheworld(ser):
    print('Are you sure?')
    return ser/0

df.select_dtypes(['object']).expanding().apply(burndowntheworld)
Out: 
    C2 C3   C4
0  {A}  A  [A]
1  {B}  B  [B]
2  {C}  C  [C]
3  {D}  D  [D]

If the column's type is object, the function is never called. And pandas doesn't have an alternative that works on objects. It's the same for rolling().apply().

In some sense, this is a good thing because expanding.apply with a custom function has O(n**2) complexity. With special cases like cumsum, ewma etc, the recursive nature of the operations can decrease the complexity to linear time but in the most general case it should calculate the function for the first n elements, and then for the first n+1 elements and so on. Therefore, especially for a function which is only dependent on the current value and function's previous value, expanding is quite inefficient. Not to mention storing lists or sets in a DataFrame is never a good idea to begin with.
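The cost difference described above can be made concrete with a small pure-Python model. This is only a sketch of the two evaluation strategies, not a reproduction of pandas internals; the names counted_add, expanding and running are made up for the illustration. It counts how many times the two-argument function is called when every prefix is recomputed from scratch versus when the previous result is carried forward.

```python
from functools import reduce
from itertools import accumulate

# Call counters for the two strategies (illustrative names, not a pandas API).
calls = {"expanding": 0, "running": 0}

def counted_add(kind):
    """Return an addition function that records each call under `kind`."""
    def add(x, y):
        calls[kind] += 1
        return x + y
    return add

values = list(range(1, 101))  # n = 100

# Expanding style: re-run the reduction over every growing prefix.
expanding = [reduce(counted_add("expanding"), values[:i + 1])
             for i in range(len(values))]

# Running style: reuse the previous result, one call per new element.
running = list(accumulate(values, counted_add("running")))

print(expanding == running)   # both strategies give the same cumulative sums
print(calls)                  # quadratic versus linear number of calls
```

For n elements the prefix-recomputing version makes n(n-1)/2 calls, while the running version makes n-1.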

So the answer is: if your data is not numeric and the function is dependent on the previous result and the current element, just use a for loop. It will be more efficient anyway.
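As a minimal sketch of that loop, the helper below (cumulative is a hypothetical name, not a pandas function) walks a column once and feeds each step the previous result and the current element. It uses itertools.accumulate from the standard library, which is the same for loop written for you. The columns are plain lists here so the example runs without pandas.

```python
from itertools import accumulate
import operator

# Plain-list stand-ins for the object columns C2, C3 and C4 of the example.
c2 = [{"A"}, {"B"}, {"C"}, {"D"}]
c3 = ["A", "B", "C", "D"]
c4 = [["A"], ["B"], ["C"], ["D"]]

def cumulative(values, func):
    """Return the running results of func(previous_result, current_element)."""
    return list(accumulate(values, func))

cum_sets = cumulative(c2, lambda x, y: x.union(y))  # union builds a new set each step
cum_strings = cumulative(c3, operator.add)          # string concatenation
cum_lists = cumulative(c4, operator.add)            # list concatenation, inputs untouched

print(cum_strings)
print(cum_lists)
print(cum_sets)
```

Because a pandas Series is iterable, the same helper should accept a column directly, and the returned list can be assigned back, for example df["C3"] = cumulative(df["C3"], operator.add). Functions that return a new object at each step, such as union or +, keep the original cells unmodified.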
