Length of values does not match length of index in a for loop
I have a dataset (df) like this:
Name1 Name2 Score
John NaN NaN
Patty NaN NaN
where Name2 and Score are initialized to NaN. Some data, like the following
name2_list=[[Chris, Luke, Martin], [Martin]]
score_list=[[1,2,4],[3],[]]
is generated at each loop from a function. These two lists need to be added to the columns Name2 and Score in my df, in order to have:
Name1 Name2 Score
John [Chris, Luke, Martin] [1,2,4]
Patty [Martin] [3]
Then, since I want to have values and not lists in Name2 and Score, I expand the dataset:
Name1 Name2 Score
John Chris 1
John Luke 2
John Martin 4
Patty Martin 3
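For reference, the expansion step can be sketched roughly as follows. This is only an illustration on the toy data above; it assumes pandas 1.3 or later, where DataFrame.explode accepts a list of columns, and it assumes that the lists in Name2 and Score have the same length in every row.

```python
import pandas as pd

# Toy frame matching the example above; list cells have equal lengths per row.
df = pd.DataFrame({
    "Name1": ["John", "Patty"],
    "Name2": [["Chris", "Luke", "Martin"], ["Martin"]],
    "Score": [[1, 2, 4], [3]],
})

# Explode both list columns together so each name stays paired with its score.
expanded = df.explode(["Name2", "Score"], ignore_index=True)
print(expanded)
```

Exploding both columns in one call keeps each Name2 value aligned with its Score; exploding them one after the other would multiply the rows instead.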
My goal is for every value in Name2 to also appear in Name1. However, as I mentioned, I have a function that works as follows: for each element in Name2 that is not in Name1, it checks whether there are further values. The generated values are similar to those seen in name2_list and score_list.
For example, suppose that, at the second iteration, the function returns [Patty] and 9 for Chris, [Martin] and 1 for Luke, and [Laura] and 3 for Martin. I then need to add these values to my original df in order to have (before exploding)
Name1 Name2 Score
John Chris 1
John Luke 2
John Martin 4
Patty Martin 3
Chris Patty 9
Luke Martin 1
Martin Laura 3
Only one value, Laura, is not in Name1 yet, so I need to run the function again. If the output is already included in Name1, my loop stops and I get the final dataset; otherwise, I need to rerun the function and see whether more loops are required.
To keep this example short, suppose that running the function for Laura returns John and 3. John is already in Name1, so I do not need to rerun the function.
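The stopping rule described here can be sketched with plain Python sets. This is a rough illustration using the example values, not the actual function; the variable names are made up for the sketch.

```python
# Values taken from the example table (state after the second iteration).
name1 = ["John", "John", "John", "Patty", "Chris", "Luke", "Martin"]
name2 = ["Chris", "Luke", "Martin", "Martin", "Patty", "Martin", "Laura"]

# Names that appear in Name2 but have not been processed as Name1 yet.
pending = set(name2) - set(name1)

# Symmetric difference (the ^ operator) also returns names that are only in Name1.
symmetric = set(name2) ^ set(name1)

# After adding the row (Laura, John, 3), nothing is left to process.
name1.append("Laura")
name2.append("John")
pending_after = set(name2) - set(name1)
```

With these values, pending contains only Laura, whereas the symmetric difference contains both Laura and John, so the plain difference matches the "not in Name1 yet" rule more closely.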
What I am doing is the following:
name2_list, score_list = [], []  # Initialize lists. These two lists need to store outputs from my function
name2 = df['Name2']  # Append new name2 to this list as I iterate
name1 = df['Name1']  # Append new name1 to this list as I iterate
distinct_name1 = set(name1)  # distinct name1. I need this to calculate the difference
diff = set(name2) ^ distinct_name1  # This calculates the difference. I need to iterate until this list is empty, i.e., when len(diff)=0
if df.Name2.isnull().all():  # this condition is to start the process. At the beginning I have only values in Name1. No values in Name2
    if len(diff) > 0:  # in the example the difference is 2 at the beginning, i.e., John and Patty; at the second round 3 (Chris, Luke, Martin); at the third round is only for Laura. There is no fourth round
        for x in diff:  # I run it first for John, then for Patty
            collected_data = fun(df, diff)  # I will explain below what this function does and how it looks like
        df = df.apply(pd.Series.explode)  # in this step I explode the dataset
        name2 = df['Name2']  # I am updating the list of values in Name2 to calculate the difference after each iteration.
        name1 = df['Name1']  # I am updating the list of values in Name1 to calculate the difference after each iteration.
        distinct_name1 = set(name1)  # calculate the new difference
        diff = filter(None, (set(name2) ^ distinct_name1))  # calculate the new difference. Iterate until this is empty
An error occurs at this step in the function, df['Name2'] = name2_list:
---> 33 df['Name2'] = name2_list
The message is:
ValueError: Length of values (6) does not match length of index (8).
(the values in parentheses may differ from the ones you would get with this example)
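A minimal way to reproduce this kind of error, as far as I can tell, is to assign a plain list to a column when the list is shorter than the dataframe. The sketch below uses made-up toy data and also shows one possible workaround, which is an assumption and not the original code: collect the new rows in their own frame and append them with pd.concat instead of assigning a column.

```python
import pandas as pd

df = pd.DataFrame({"Name1": ["John", "Patty", "Chris"]})
name2_list = ["Chris", "Martin"]  # two values for three rows

# Assigning a list whose length differs from the number of rows raises ValueError.
try:
    df["Name2"] = name2_list
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

# Possible workaround (assumption): treat the new results as rows, not as a column.
base = pd.DataFrame({"Name1": ["John"], "Name2": ["Chris"], "Score": [1]})
new_rows = pd.DataFrame({
    "Name1": ["Chris", "Luke"],
    "Name2": ["Patty", "Martin"],
    "Score": [9, 1],
})
combined = pd.concat([base, new_rows], ignore_index=True)
```

Appending rows avoids the length check entirely, because the new results never have to line up with the existing index.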
My function currently does not care how many rows are in the dataframe, and it creates new lists of a different length. I need to find a way to reconcile this. While debugging, I confirmed that the error comes from df['Name2'] = name2_list in the function. I am able to correctly print the list of new name2 values, but not the column.
A possible solution might be to build the df once, outside the for loop, but I need to explode df['Name2'] and build lists in which to store the results from the web.
I don't think pandas is a good fit for this kind of problem. If you are fine with plain Python for the intermediate steps, you could do this:
import pandas as pd


def get_links(source_name):
    """Dummy function with data from OP.

    Note that it processes one name at a time instead of batch like in OP.
    """
    dummy_output = {
        'John': (
            ['Chris', 'Luke', 'Martin'],
            [1, 2, 4]
        ),
        'Patty': (
            ['Martin'],
            [9]
        ),
        'Chris': (
            ['Patty'],
            [9]
        ),
        'Luke': (
            ['Martin'],
            [1]
        ),
        'Martin': (
            ['Laura'],
            [3]
        ),
        'Laura': (
            ['John'],
            [3]
        )
    }
    target_names, scores = dummy_output.get(source_name, ([], []))
    return [
        {'name1': source_name, 'name2': target_name, 'score': score}
        for target_name, score in zip(target_names, scores)
    ]


todo = ['John', 'Patty']
seen = set(todo)
data = []
while todo:
    source_name = todo.pop(0)  # If you don't care about order can .pop() to get last element (more efficient)
    # get new data
    new_data = get_links(source_name)
    data += new_data
    # add new names to queue if we haven't seen them before
    new_names = set([row['name2'] for row in new_data]).difference(seen)
    seen.update(new_names)
    todo += list(new_names)

pd.DataFrame(data)
Output:
name1 name2 score
0 John Chris 1
1 John Luke 2
2 John Martin 4
3 Patty Martin 9
4 Chris Patty 9
5 Luke Martin 1
6 Martin Laura 3
7 Laura John 3
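As a possible variation on the loop above, the queue could be a collections.deque, so that taking the next name from the front stays cheap as the queue grows. This is only a sketch: collect_rows and DUMMY_LINKS are hypothetical names, and the dictionary simply repeats the dummy data from the answer in place of the real function.

```python
from collections import deque

# Dummy data repeated from the answer above: name -> (linked names, scores).
DUMMY_LINKS = {
    "John": (["Chris", "Luke", "Martin"], [1, 2, 4]),
    "Patty": (["Martin"], [9]),
    "Chris": (["Patty"], [9]),
    "Luke": (["Martin"], [1]),
    "Martin": (["Laura"], [3]),
    "Laura": (["John"], [3]),
}


def collect_rows(start_names, links):
    """Visit every reachable name once and return one row per link."""
    queue = deque(start_names)
    seen = set(start_names)
    rows = []
    while queue:
        source = queue.popleft()  # take the oldest name in the queue
        targets, scores = links.get(source, ([], []))
        for target, score in zip(targets, scores):
            rows.append({"name1": source, "name2": target, "score": score})
            if target not in seen:  # queue each new name only once
                seen.add(target)
                queue.append(target)
    return rows


rows = collect_rows(["John", "Patty"], DUMMY_LINKS)
```

Passing rows to pd.DataFrame should give the same eight rows as the output above. Because new names are queued in the order they are found rather than through a set, the row order does not depend on set iteration order.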