


I tried several methods to build an xarray (xr) dataset out of multiple .h5 files. The files contain data from the SMAP project on soil moisture content along with other useful variables. Each variable represents a 2-D array. The number of variables and their labels are the same in every file. The problem is that the sizes of dimensions x and y differ between files.

Example dataset loaded via xr.open_dataset():

Dimensions:                                     (x: 54, y: 129)
Coordinates:
    EASE_column_index_3km                       (x, y) float32 ...
    EASE_column_index_apm_3km                   (x, y) float32 ...
    EASE_row_index_3km                          (x, y) float32 ...
    EASE_row_index_apm_3km                      (x, y) float32 ...
    latitude_3km                                (x, y) float32 ...
    latitude_apm_3km                            (x, y) float32 ...
    longitude_3km                               (x, y) float32 ...
    longitude_apm_3km                           (x, y) float32 ...
Dimensions without coordinates: x, y
Data variables:
    SMAP_Sentinel_overpass_timediff_hr_3km      (x, y) timedelta64[ns] ...
    SMAP_Sentinel_overpass_timediff_hr_apm_3km  (x, y) timedelta64[ns] ...
    albedo_3km                                  (x, y) float32 ...
    albedo_apm_3km                              (x, y) float32 ...
    bare_soil_roughness_retrieved_3km           (x, y) float32 ...
    bare_soil_roughness_retrieved_apm_3km       (x, y) float32 ...
    beta_tbv_vv_3km                             (x, y) float32 ...
    beta_tbv_vv_apm_3km                         (x, y) float32 ...
    disagg_soil_moisture_3km                    (x, y) float32 ...
    disagg_soil_moisture_apm_3km                (x, y) float32 ...
    disaggregated_tb_v_qual_flag_3km            (x, y) float32 ...
    disaggregated_tb_v_qual_flag_apm_3km        (x, y) float32 ...
    gamma_vv_xpol_3km                           (x, y) float32 ...
    gamma_vv_xpol_apm_3km                       (x, y) float32 ...
    landcover_class_3km                         (x, y) float32 ...
    landcover_class_apm_3km                     (x, y) float32 ...
    retrieval_qual_flag_3km                     (x, y) float32 ...
    retrieval_qual_flag_apm_3km                 (x, y) float32 ...
    sigma0_incidence_angle_3km                  (x, y) float32 ...
    sigma0_incidence_angle_apm_3km              (x, y) float32 ...
    sigma0_vh_aggregated_3km                    (x, y) float32 ...
    sigma0_vh_aggregated_apm_3km                (x, y) float32 ...
    sigma0_vv_aggregated_3km                    (x, y) float32 ...
    sigma0_vv_aggregated_apm_3km                (x, y) float32 ...
    soil_moisture_3km                           (x, y) float32 ...
    soil_moisture_apm_3km                       (x, y) float32 ...
    soil_moisture_std_dev_3km                   (x, y) float32 ...
    soil_moisture_std_dev_apm_3km               (x, y) float32 ...
    spacecraft_overpass_time_seconds_3km        (x, y) timedelta64[ns] ...
    spacecraft_overpass_time_seconds_apm_3km    (x, y) timedelta64[ns] ...
    surface_flag_3km                            (x, y) float32 ...
    surface_flag_apm_3km                        (x, y) float32 ...
    surface_temperature_3km                     (x, y) float32 ...
    surface_temperature_apm_3km                 (x, y) float32 ...
    tb_v_disaggregated_3km                      (x, y) float32 ...
    tb_v_disaggregated_apm_3km                  (x, y) float32 ...
    tb_v_disaggregated_std_3km                  (x, y) float32 ...
    tb_v_disaggregated_std_apm_3km              (x, y) float32 ...
    vegetation_opacity_3km                      (x, y) float32 ...
    vegetation_opacity_apm_3km                  (x, y) float32 ...
    vegetation_water_content_3km                (x, y) float32 ...
    vegetation_water_content_apm_3km            (x, y) float32 ...
    water_body_fraction_3km                     (x, y) float32 ...
    water_body_fraction_apm_3km                 (x, y) float32 ...


Example variable, dataset.soil_moisture_3km:

<xarray.DataArray 'soil_moisture_3km' (x: 54, y: 129)>
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=float32)
Coordinates:
    EASE_column_index_3km      (x, y) float32 ...
    EASE_column_index_apm_3km  (x, y) float32 ...
    EASE_row_index_3km         (x, y) float32 ...
    EASE_row_index_apm_3km     (x, y) float32 ...
    latitude_3km               (x, y) float32 ...
    latitude_apm_3km           (x, y) float32 ...
    longitude_3km              (x, y) float32 ...
    longitude_apm_3km          (x, y) float32 ...
Dimensions without coordinates: x, y
Attributes:
    units:        cm**3/cm**3
    valid_min:    0.0
    long_name:    Representative soil moisture measurement for the 3 km Earth...
    coordinates:  /Soil_Moisture_Retrieval_Data_3km/latitude_3km /Soil_Moistu...
    valid_max:    0.75


First I tried to open the files with:

test = xr.open_mfdataset(list_of_paths)


ValueError: arguments without labels along dimension 'x' cannot be aligned because they have different dimension sizes: {129, 132}


Then I tried combine='by_coords', both without and with the coords argument:

test = xr.open_mfdataset(list_of_paths, combine='by_coords')


ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation


test = xr.open_mfdataset(list_of_paths, coords=['latitude_3km', 'longitude_3km'], combine='by_coords')



Then I opened every file with xr.open_dataset() and tried every method I could find on the documentation page for combining data, such as merge, combine, broadcast_like, and align followed by combine, but every time I ended up with the same problem: the dimensions are not equal. What is the common approach to reshape or align the dimensions, or whatever else is possible, to solve this problem?


I found a workaround for my problem, but first, I think I forgot to mention that the files I am trying to concatenate along a time dimension have different coordinates and dimensions. The images I am building my model from all have overlapping areas with the same longitude and latitude values, but also parts that do not overlap.
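The workaround itself isn't shown here. For what it's worth, one possible way to stack partially overlapping scenes along time is sketched below. It assumes each scene can first be given 1-D x and y coordinates on a shared grid, which the files as printed above do not have. The scenes, their integer coordinates and the make_scene helper are all made up for illustration.

```python
import numpy as np
import xarray as xr


def make_scene(x_values, y_values, fill):
    """Hypothetical stand-in for one file: a constant field on labeled x/y."""
    data = np.full((len(x_values), len(y_values)), fill, dtype="float32")
    return xr.Dataset(
        {"soil_moisture": (("x", "y"), data)},
        coords={"x": x_values, "y": y_values},
    )


# Two scenes of different size that only partly overlap.
scene_a = make_scene([0, 1, 2], [0, 1], fill=0.1)
scene_b = make_scene([1, 2, 3, 4], [0, 1, 2], fill=0.2)

# join="outer" aligns both scenes on the union of their x/y labels and
# pads the cells a scene does not cover with NaN before stacking.
stacked = xr.concat([scene_a, scene_b], dim="time", join="outer")
print(dict(stacked.sizes))
```

The result covers the union of both footprints, so the cells that only one scene covers are NaN in the other time step.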


The count of variables and their label are in every file equal. The problem is the dimensions size of dimension x and y are not equal.


Sorry, is len(x) the same in every file? And is len(y) the same? If not, open_mfdataset can't handle this directly.


If they are the same, you should in theory be able to do this in two different ways.


Then you have a 2D concatenation problem: you need to arrange the datasets such that when joined up along x and y, they make a larger dataset which also has dimensions x and y.


1) Using combine='nested'


You can manually specify the order that you need them joined up in. xarray allows you to do this by passing the datasets as a grid, specified as a nested list. In your case, if we had 4 files (named [upper_left, upper_right, lower_left, lower_right]), we would combine them like so:

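from xarray import open_mfdataset

# upper_left, upper_right, lower_left and lower_right stand for the
# paths of the four files; the nested list mirrors their layout.
grid = [[upper_left, upper_right],
        [lower_left, lower_right]]

# The outer list is joined along 'x', the inner lists along 'y'.
ds = open_mfdataset(grid, concat_dim=['x', 'y'], combine='nested')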


We had to tell open_mfdataset which dimensions of the data the rows and columns of the grid correspond to, so that it knows which dimensions to concatenate the data along. That's why we needed to pass concat_dim=['x', 'y'].
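To check the ordering without touching any files, here is a small in-memory sketch using xr.combine_nested, which takes a nested list of already opened datasets. The four tiles are cut from a made-up 4 x 6 array, and make_tile is a hypothetical helper, not part of xarray.

```python
import numpy as np
import xarray as xr


def make_tile(values):
    """Wrap a 2-D array as a Dataset with unlabeled dimensions x and y."""
    return xr.Dataset({"soil_moisture": (("x", "y"), values)})


# Made-up reference field that the four tiles are cut from.
full = np.arange(24, dtype="float32").reshape(4, 6)

upper_left = make_tile(full[:2, :3])
upper_right = make_tile(full[:2, 3:])
lower_left = make_tile(full[2:, :3])
lower_right = make_tile(full[2:, 3:])

grid = [[upper_left, upper_right],
        [lower_left, lower_right]]

# Outer list -> first entry of concat_dim ('x'),
# inner lists -> second entry ('y').
combined = xr.combine_nested(grid, concat_dim=["x", "y"])
print(dict(combined.sizes))
```

If the tiles end up in the wrong place, the usual suspects are the order of the nested list or the order of the names in concat_dim.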


2) Using combine='by_coords'


But your data has coordinates in it already, so can't xarray just use those to arrange the datasets in the right order? That is what the combine='by_coords' option is for, but unfortunately it requires 1-dimensional coordinates (also known as dimension coordinates) to arrange the data. Your files don't have any of those; their latitude and longitude coordinates are 2-D, which is why the printout says Dimensions without coordinates: x, y.


If you can add 1-dimensional coordinates to your files first, then you can use combine='by_coords' and simply pass a list of all the files in any order. Otherwise you'll have to use combine='nested' in this case.
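As a rough illustration of the by_coords route, the sketch below labels each tile with 1-D x and y coordinates and lets xr.combine_by_coords work out the arrangement. The integer coordinates, the make_tile helper and the data are invented for the example. Whether suitable 1-D coordinates can be derived from your files (for instance from the EASE row and column indices) is an assumption I have not checked.

```python
import numpy as np
import xarray as xr

# Made-up reference field that the tiles are cut from.
full = np.arange(24, dtype="float32").reshape(4, 6)


def make_tile(x_slice, y_slice):
    """Cut a tile out of `full` and label it with 1-D x and y coordinates."""
    return xr.Dataset(
        {"soil_moisture": (("x", "y"), full[x_slice, y_slice])},
        coords={
            "x": np.arange(x_slice.start, x_slice.stop),
            "y": np.arange(y_slice.start, y_slice.stop),
        },
    )


# Deliberately shuffled: the order of the list should not matter.
tiles = [
    make_tile(slice(2, 4), slice(3, 6)),
    make_tile(slice(0, 2), slice(0, 3)),
    make_tile(slice(2, 4), slice(0, 3)),
    make_tile(slice(0, 2), slice(3, 6)),
]

# The dimension coordinates decide where each tile goes.
combined = xr.combine_by_coords(tiles)
print(dict(combined.sizes))
```

With files on disk, the same labeling could be applied through the preprocess argument of open_mfdataset before combining.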


(You don't need the coords argument here; it controls how different coordinates are joined up, not how the datasets are arranged.)