Length of values does not match length of index in a for loop

Question:

I have a dataset (df) like this

Name1 Name2 Score
John    NaN  NaN
Patty    NaN  NaN

where Name2 and Score are initialized to NaN. Some data, like the following

name2_list = [['Chris', 'Luke', 'Martin'], ['Martin']]
score_list = [[1, 2, 4], [3]]

is generated at each loop from a function. These two lists need to be added to columns Name2 and Score in my df, in order to have:

Name1 Name2         Score
John    [Chris, Luke, Martin]  [1,2,4]
Patty    [Martin]  [3]

Then, since I want to have values and not lists in Name2 and Score, I expand the dataset:

Name1 Name2  Score
John    Chris    1
John    Luke     2
John    Martin   4
Patty   Martin   3
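
One way to get from the list-valued frame to the long format above is a multi-column explode. This is only a minimal sketch, assuming pandas 1.3 or later (where `DataFrame.explode` accepts a list of columns) and assuming the two lists in each row always have the same length; the data is copied from the example.

```python
import pandas as pd

# Frame with list-valued cells, as in the example above
df = pd.DataFrame({
    "Name1": ["John", "Patty"],
    "Name2": [["Chris", "Luke", "Martin"], ["Martin"]],
    "Score": [[1, 2, 4], [3]],
})

# Explode both list columns together so that names and scores stay aligned;
# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 0, 0, 1
exploded = df.explode(["Name2", "Score"], ignore_index=True)
print(exploded)
```

Exploding both columns in a single call keeps each name paired with its score, which exploding the columns one after the other would not.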

My goal is for every value in Name2 to also appear in Name1. However, as I mentioned, I have a function that works as follows: for each element in Name2 that is not in Name1, it checks whether there are further values. The generated values are similar to those seen in name2_list and score_list. For example, suppose that at the second iteration the function returns [Patty] and 9 for Chris, [Martin] and 1 for Luke, and [Laura] and 3 for Martin. I then need to add these values to my original df in order to have (before exploding)

Name1 Name2  Score
John    Chris    1
John    Luke     2
John    Martin   4
Patty   Martin   3
Chris   Patty    9
Luke    Martin   1
Martin  Laura    3

Only one value, Laura, is not in Name1 yet, so I need to run the function again: if its output is already included in Name1, my loop stops and I have the final dataset; otherwise, I need to rerun the function and check whether more loops are required. To keep this example short, let's suppose that the function returns John, 3 for Laura. John is already in Name1, so I do not need to rerun the function.
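
The stopping condition described here can be illustrated with plain Python sets. This is a sketch using the names from the table above; the variable names `pending` and `symmetric` are only illustrative. Note that the set difference `-` returns the names that still need processing, whereas the symmetric difference `^` also returns names that appear in Name1 but never in Name2.

```python
# Columns of the table above, written out as plain lists
name1 = ["John", "John", "John", "Patty", "Chris", "Luke", "Martin"]
name2 = ["Chris", "Luke", "Martin", "Martin", "Patty", "Martin", "Laura"]

# Names that occur in Name2 but have not been used as a source (Name1) yet
pending = set(name2) - set(name1)

# Symmetric difference: names that occur in exactly one of the two columns
symmetric = set(name2) ^ set(name1)

print(pending)            # {'Laura'}
print(sorted(symmetric))  # ['John', 'Laura']
```

The loop can stop once `pending` is empty. With `^`, John would keep showing up even though he has already been processed.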

What I am doing is the following:

name2_list, score_list = [], []   # Initialize lists. These two lists need to store outputs from my function

name2 = df['Name2']              # Append new name2 to this list as I iterate
name1 = df['Name1']              # Append new name1 to this list as I iterate
distinct_name1 = set(name1)      # distinct name1. I need this to calculate the difference
diff = set(name2) ^ distinct_name1  # This calculates the difference. I need to iterate until this is empty, i.e., when len(diff) == 0


if df.Name2.isnull().all():  # this condition is to start the process. At the beginning I have only values in Name1. No values in Name2

    if len(diff) > 0:  # in the example the difference is 2 at the beginning, i.e., John and Patty; at the second round 3 (Chris, Luke, Martin); at the third round it is only Laura. There is no fourth round
        for x in diff:  # I run it first for John, then for Patty
            collected_data = fun(df, diff)  # I will explain below what this function does and what it looks like

        df = df.apply(pd.Series.explode)  # in this step I explode the dataset

        name2 = df['Name2']             # I am updating the list of values in Name2 to calculate the difference after each iteration.
        name1 = df['Name1']             # I am updating the list of values in Name1 to calculate the difference after each iteration.
        distinct_name1 = set(name1)     # recompute the distinct values of Name1
        diff = filter(None, (set(name2) ^ distinct_name1))  # calculate the new difference. Iterate until this is empty

An error occurs when I consider this step df['Name2'] = name2_list in the function:

---> 33 df['Name2'] = name2_list

which says:

ValueError: Length of values (6) does not match length of index (8).

(the values inside the round brackets may be different from those ones that you could get by using this example)

My function currently does not care how many rows are in the dataframe and it is creating new lists of some different length. I would need to find a way to reconcile this. I was debugging and I can confirm that the error comes from df['Name2'] = name2_list in the function. I am able to correctly print the list of new name2 values, but not the column. Maybe, a possible solution could be to build the df once outside of the for loop, but I need to explode df['Name2'] and build lists where to store results from the web.
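
For illustration, the error can be reproduced in a few lines. This is a minimal sketch with made-up row counts (3 rows, 2 values), not the asker's actual data, and the frames `base` and `new_rows` are hypothetical. It also shows one possible way to reconcile the lengths: collect the new rows in a separate frame and concatenate them, instead of assigning a list of a different length to an existing column.

```python
import pandas as pd

df = pd.DataFrame({"Name1": ["John", "Patty", "Chris"]})
name2_list = ["Chris", "Luke"]  # only 2 values for a frame with 3 rows

# Assigning a list as a column requires exactly one value per row
try:
    df["Name2"] = name2_list
    error = None
except ValueError as exc:
    error = exc
print(error)

# One way around it: keep the new results as rows of their own and append them
base = pd.DataFrame({"Name1": ["Chris"], "Name2": ["Patty"], "Score": [9]})
new_rows = pd.DataFrame({
    "Name1": ["Luke", "Martin"],
    "Name2": ["Martin", "Laura"],
    "Score": [1, 3],
})
combined = pd.concat([base, new_rows], ignore_index=True)
print(combined)
```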

Answer:

I think it is not a good idea to use pandas for this kind of problem. If you are fine with plain python for intermediate steps, you could do this:

import pandas as pd


def get_links(source_name):
    """Dummy function with data from OP.
    
    Note that it processes one name at a time instead of batch like in OP.
    """
    dummy_output = {
        'John': (
            ['Chris', 'Luke', 'Martin'],
            [1, 2, 4]
        ),
        'Patty': (
            ['Martin'],
            [9]
        ),
        'Chris': (
            ['Patty'],
            [9]
        ),
        'Luke': (
            ['Martin'],
            [1]
        ),
        'Martin': (
            ['Laura'],
            [3]
        ),
        'Laura': (
            ['John'],
            [3]
        )
    }
    target_names, scores = dummy_output.get(source_name, ([], []))

    return [
        {'name1': source_name, 'name2': target_name, 'score': score}
        for target_name, score in zip(target_names, scores)
    ]


todo = ['John', 'Patty']

seen = set(todo)
data = []

while todo:
    source_name = todo.pop(0)  # If order doesn't matter, use .pop() to take the last element instead (more efficient)
    # get new data
    new_data = get_links(source_name)
    data += new_data

    # add new names to the queue if we haven't seen them before;
    # dict.fromkeys removes duplicates while keeping first-seen order,
    # so the processing order is the same on every run
    new_names = [
        name
        for name in dict.fromkeys(row['name2'] for row in new_data)
        if name not in seen
    ]
    seen.update(new_names)
    todo += new_names

df = pd.DataFrame(data)
print(df)

Output:

    name1   name2  score
0    John   Chris      1
1    John    Luke      2
2    John  Martin      4
3   Patty  Martin      9
4   Chris   Patty      9
5    Luke  Martin      1
6  Martin   Laura      3
7   Laura    John      3
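
As a follow-up to the comment about `.pop(0)`, the same breadth-first traversal can be written with `collections.deque`, whose `popleft()` runs in constant time. This is only a sketch: `crawl` and the `links` mapping are hypothetical stand-ins for the real lookup function, filled with the dummy data used above.

```python
from collections import deque

# Hypothetical stand-in for the lookup: source name -> list of (target, score)
links = {
    "John": [("Chris", 1), ("Luke", 2), ("Martin", 4)],
    "Patty": [("Martin", 9)],
    "Chris": [("Patty", 9)],
    "Luke": [("Martin", 1)],
    "Martin": [("Laura", 3)],
    "Laura": [("John", 3)],
}


def crawl(start_names, links):
    """Follow links breadth-first and return one dict per (source, target) pair."""
    queue = deque(start_names)
    seen = set(start_names)
    rows = []
    while queue:
        source = queue.popleft()  # O(1), unlike list.pop(0)
        for target, score in links.get(source, []):
            rows.append({"name1": source, "name2": target, "score": score})
            if target not in seen:  # enqueue each name only once
                seen.add(target)
                queue.append(target)
    return rows


rows = crawl(["John", "Patty"], links)
print(len(rows))  # 8
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` in the same way as `data` above.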
