How to extract values from a nested JSON array using pandas

Problem description:

I have a large JSON file (400k lines). I am trying to isolate the following:

policies - "description"

policy items - "users" and "database values"

JSON file: https://pastebin.com/hv8mLfgx

Expected output from pandas: https://imgur.com/a/FVcNGsZ

Everything after "Policy Items" is repeated in exactly the same form throughout the file. I tried the code below to isolate "users", but it does not seem to work. I am trying to dump all of this into a CSV.

Edit: here is a solution I attempted, but I could not get it to work correctly: "Deeply nested JSON response to pandas dataframe"

from pandas.io.json import json_normalize as Jnormal
import json
import pprint, csv
import re

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)
    for item in jsonDF['policies'][0]['policyItems'][0]:
        print ('{} - {} - {}'.format(jsonDF['users']))

EDIT 2: I have some working code that grabs some of the users, but not all of them: only 11 out of 25.

from pandas.io.json import json_normalize as Jnormal
import json
import pprint, csv
import re

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)
    pNode = Jnormal(jsonDF['policies'][0]['policyItems'], record_path='users')
    print(pNode.head(500))

EDIT 3: This is the final working copy, but it still does not copy over all of my table data. I set up a loop that filters nothing out, so I can capture everything and sort it in Excel. Does anyone have any idea why I cannot capture all the table values?

    json_data = json.load(file)
    with open("test.csv", 'w', newline='') as fd:
        wr = csv.writer(fd)
        wr.writerow(('Database name', 'Users', 'Description', 'Table'))
        for policy in json_data['policies']:
            desc = policy['description']
            db_values = policy['resources']['database']['values']
            db_tables = policy['resources']['table']['values']
            for item in policy['policyItems']:
                users = item['users']
                for dbT in db_tables:
                    for user in users:
                        for db in db_values:
                            _ = wr.writerow((db, user, desc, dbT))


Pandas is overkill here: the standard csv module is enough. You just have to iterate over policies to extract the description and database values, then over policyItems to extract the users:

import csv
import json

# Load the whole policy export into memory.
with open("Ranger_Policies_20190204_195010.json") as file:
    js = json.load(file)

# Open the CSV for writing; newline='' lets the csv module control line endings.
with open("outputfile.csv", 'w', newline='') as fd:
    wr = csv.writer(fd)
    wr.writerow(('Database name', 'Users', 'Description'))
    for policy in js['policies']:
        desc = policy['description']
        db_values = policy['resources']['database']['values']
        for item in policy['policyItems']:
            users = item['users']
            # One row per (database, user) pair, skipping the '*' wildcard.
            for user in users:
                for db in db_values:
                    if db != '*':
                        wr.writerow((db, user, desc))
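
To carry the table column from EDIT 3 through the same loop, one possible sketch is shown below. It is not a diagnosis of the missing table values: it only assumes, without having checked the real file, that some policies may lack a "table" entry under "resources", and it falls back to an empty table name in that case. The `sample` dictionary and the `policy_rows` helper are hypothetical; only the key names come from the question.

```python
import csv
import io
from itertools import product

# Hypothetical sample mirroring the key names used in the question;
# the real file's contents are not reproduced here.
sample = {
    "policies": [
        {
            "description": "Policy A",
            "resources": {
                "database": {"values": ["db1", "*"]},
                "table": {"values": ["t1", "t2"]},
            },
            "policyItems": [{"users": ["alice", "bob"]}],
        },
        {
            # Assumed case: a policy without a "table" resource.
            "description": "Policy B",
            "resources": {"database": {"values": ["db2"]}},
            "policyItems": [{"users": ["carol"]}],
        },
    ]
}


def policy_rows(data):
    """Yield (database, user, description, table) tuples for every policy."""
    for policy in data.get("policies", []):
        desc = policy.get("description", "")
        resources = policy.get("resources", {})
        dbs = resources.get("database", {}).get("values", [])
        # Fall back to one empty table name so the policy still yields rows.
        tables = resources.get("table", {}).get("values", []) or [""]
        for item in policy.get("policyItems", []):
            for db, user, table in product(dbs, item.get("users", []), tables):
                if db != "*":  # same wildcard filter as the loop above
                    yield db, user, desc, table


rows = list(policy_rows(sample))

# Written to an in-memory buffer for the demo; for a real file use
# open("outputfile.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(("Database name", "Users", "Description", "Table"))
writer.writerows(rows)
print(buffer.getvalue())
```

Because `policy_rows` yields plain tuples, the same rows could also be handed to `pandas.DataFrame(rows, columns=[...])` if a DataFrame is still wanted afterwards.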