Google Cloud Storage: joining multiple csv files


I exported a dataset from Google BigQuery to Google Cloud Storage. Because of its size, BigQuery exported it as 99 csv files.


Now I want to connect to my GCP bucket and perform some analysis with Spark, but I need to join all 99 files into a single large csv file to run my analysis.


BigQuery splits exported data into several files when it is larger than 1 GB. You can merge these files with the gsutil tool; see the official documentation for how to perform object composition with gsutil.


Because BigQuery exports the files with the same prefix, you can use a wildcard * to merge them into one composite object:

gsutil compose gs://example-bucket/component-obj-* gs://example-bucket/composite-object


Note that there is a limit (currently 32) to the number of components that can be composed in a single operation.
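
With 99 exported files, a single compose call over the wildcard would exceed that limit, so the merge has to happen in more than one pass. Below is a minimal sketch of one way to do that with the google-cloud-storage Python client. It assumes the exported objects share the prefix file- and that their names sort in export order; the bucket name, the tmp-compose- and merged.csv object names, and the helpers chunked and compose_all are hypothetical and not part of any library.

```python
def chunked(items, size=32):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def compose_all(bucket, blobs, final_name, limit=32):
    """Compose `blobs` (in order) into `final_name`, at most `limit` per call.

    Hypothetical helper: `bucket` is a google.cloud.storage Bucket and
    `blobs` is a list of Blob objects that live in that bucket.
    """
    round_no = 0
    # Keep merging batches into intermediate objects until one call suffices.
    while len(blobs) > limit:
        next_round = []
        for n, batch in enumerate(chunked(blobs, limit)):
            tmp = bucket.blob('tmp-compose-{}-{}'.format(round_no, n))
            tmp.compose(batch)  # Blob.compose concatenates the source objects
            next_round.append(tmp)
        blobs = next_round
        round_no += 1
    final = bucket.blob(final_name)
    final.compose(blobs)
    return final


def main():
    # Imported here so the helpers above can be read and tested on their own.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket('yourBucket')  # assumed bucket name
    # Assumption: lexicographic order of the names matches the export order.
    blobs = sorted(client.list_blobs(bucket, prefix='file-'),
                   key=lambda b: b.name)
    compose_all(bucket, blobs, 'merged.csv')
```

Calling main() would leave the intermediate tmp-compose-* objects in the bucket, so they should be deleted once the final object has been verified.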

The downside of this option is that the header row of each .csv file will be included in the composite object. You can avoid this by modifying the jobConfig to set the print_header parameter to False.


Here is a Python sample, but you can use any other BigQuery client library:

from google.cloud import bigquery

client = bigquery.Client()
bucket_name = 'yourBucket'

# Source table (a public dataset, used here as an example).
project = 'bigquery-public-data'
dataset_id = 'libraries_io'
table_id = 'dependencies'

# The wildcard lets BigQuery shard the export into several files.
destination_uri = 'gs://{}/{}'.format(bucket_name, 'file-*.csv')
dataset_ref = bigquery.DatasetReference(project, dataset_id)
table_ref = dataset_ref.table(table_id)

# Do not write a header row into each exported file.
job_config = bigquery.job.ExtractJobConfig(print_header=False)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location='US',
    job_config=job_config)  # API request

extract_job.result()  # Waits for job to complete.

print('Exported {}:{}.{} to {}'.format(
    project, dataset_id, table_id, destination_uri))


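Finally, since the exported files no longer carry a header, remember to create a .csv that contains only the header row and put it first when you compose, so that the composite object still starts with the column names.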
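
As a rough sketch of that last step, the header-only object could be created from Python as well. The helper names header_row and upload_header, the object name header.csv, and the column names are hypothetical; the columns must match the schema of the exported table, in the same order.

```python
import csv
import io


def header_row(columns):
    """Return one CSV header line (with trailing newline) for `columns`."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator='\n').writerow(columns)
    return buf.getvalue()


def upload_header(bucket, columns, name='header.csv'):
    """Upload the header line as its own object so it can be composed first.

    Hypothetical helper: `bucket` is a google.cloud.storage Bucket.
    """
    blob = bucket.blob(name)
    blob.upload_from_string(header_row(columns), content_type='text/csv')
    return blob
```

The returned blob would then be passed as the first source of the compose call, ahead of the exported data files.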