---
file_format: mystnb
kernelspec:
  name: python3
---

```{code-cell}
:tags: [remove-cell]

import os
os.chdir("../../_data")
```

# DataCollection

Use {py:class}`xdas.DataCollection` when your experiment is composed of several acquisitions on one or several cables/fibers. This section explains how to use it.

{py:class}`xdas.DataCollection` is a dict-like container of {py:class}`xdas.DataArray`. It is mainly used to save several `DataArray`s in a single file. Unlike {py:class}`xarray.Dataset`, the different `DataArray`s do not need to share coordinates. A `DataCollection` can, for example, be used to save a list of Regions of Interest (ROIs).

A {py:class}`xdas.DataCollection` can be viewed as a flexible way of organizing data. It has hierarchical levels that correspond to your file tree. The example below shows the levels for one global experiment that included several cables, each with several acquisitions:

- 1st (mapping) level: can be the "Network" code, related to the experiment name.
- 2nd (mapping) level: contains the "Node" codes, related to the places where the fibers/cables are located.
- 3rd (mapping) level: contains the fiber/cable names.
- 4th (sequence) level: lists the acquisitions, one per change of acquisition parameters.

![](/_static/datacollection.svg)

## Case A: DataCollection as a set of DataArrays

In this example, our `DataCollection` maps event names to {py:class}`xdas.DataArray` objects.

```{code-cell}
import numpy as np
import xdas as xd

# Reopen the dataarray from the previous section as a virtual source
da = xd.open("dataarray.nc")

# Create a DataCollection from two time windows of the same dataarray
dc = xd.DataCollection(
    {
        "event_1": da.sel(time=slice("2023-01-01T00:00:10", "2023-01-01T00:00:20")),
        "event_2": da.sel(time=slice("2023-01-01T00:00:40", "2023-01-01T00:00:50")),
    }
)
dc
```

If the DataArrays are opened from files (that is, their data is a {py:class}`xdas.virtual.VirtualSource`), the data collection can be saved virtually to minimize redundant data writing.

```{code-cell}
# Write the DataCollection virtually, as described above
dc.to_netcdf("datacollection.nc", virtual=True)
```

```{code-cell}
# Read a DataCollection
dc = xd.open("datacollection.nc")
dc
```

## Case B: DataCollection comprising a set of acquisitions

You can also create a DataCollection that gathers different acquisitions on the same fiber: {py:func}`xdas.open` can open a directory tree structure as a data collection. The tree structure is described by a path descriptor, given as a string containing placeholders. Two flavors of placeholder can be used:

- `{field}`: this level of the tree will behave as a dict. It will use the directory/file names as keys.
- `[field]`: this level of the tree will behave as a list. The directory/file names are not considered (as if the placeholder was replaced by a `*`) and files are gathered and combined as usual.

Several dict placeholders with different names can be provided. They must be followed by one or more list placeholders, which must all share the same name. The resulting data collection is a nesting of dicts down to the lowest level, which is a list of dataarrays.

### Gather all your DataCollections

In this example, on the 19th of November 2023, our network REKA featured 2 cables (RK1 and RK2), with 3 different acquisitions on the RK1 cable and one acquisition on RK2.

If your data paths look like `/data/REKA/RK1/20231119/proc/*.hdf5` and `/data/REKA/RK2/20231119/proc/*.hdf5`, you can define your path descriptor as `/data/{network}/{cable}/20231119/proc/[acquisition].hdf5`. You are free to choose names other than "network" and "cable"; just replace them in the descriptor.

```python
path = "/data/{network}/{cable}/20231119/proc/[acquisition].hdf5"
dc = xd.open(path, engine='asn')
dc
```

```text
Network:
  REKA:
    Cable:
      RK1:
        Acquisition:
          0:
          1:
          2:
      RK2:
        Acquisition:
          0:
```

```python
# Write the global DataCollection to a NetCDF file, with virtual=True
dc.to_netcdf("datacollection.nc", virtual=True)
```

```python
# Read your global DataCollection with open_datacollection
dc = xd.open_datacollection("datacollection.nc")
dc
```

```text
Network:
  REKA:
    Cable:
      RK1:
        Acquisition:
          0:
          1:
          2:
      RK2:
        Acquisition:
          0:
```

If you have several DataCollections, you can gather them using {py:func}`xdas.open` and write the result to a single DataCollection file.

### Extend your DataCollection

You can extend your {py:class}`xdas.DataCollection` by inserting a new {py:class}`xdas.DataArray` into an acquisition list.

```python
# Read your global DataCollection with open_datacollection
dc = xd.open_datacollection("datacollection.nc")

# Read the dataarray you want to add
da = xd.open("dataarray.nc")
da
```

```text
VirtualSource: 72.5TB (float32)
Coordinates:
  * time (time): 2021-10-27T15:44:10.722 to 2021-12-03T15:45:18.419
  * distance (distance): 0.000 to 204255.953
```

```python
# Insert the dataarray at index 0 of the RK2 acquisition list
dc['REKA']['RK2'].insert(0, da)
dc
```

```text
Network:
  REKA:
    Cable:
      RK1:
        Acquisition:
          0:
          1:
          2:
      RK2:
        Acquisition:
          0:
          1:
```

The RK2 acquisition list now holds 2 acquisitions.

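
To make the placeholder rules from Case B more concrete, here is a minimal pure-Python sketch of how a path descriptor could map a set of file paths onto nested dicts ending in a list. It is not the xdas implementation: the helpers `descriptor_to_regex` and `group_paths` and the file names are hypothetical, and the sketch only groups path strings. It does not open files or combine them into dataarrays as the real reader does.

```python
import re


def descriptor_to_regex(descriptor):
    """Turn a path descriptor into a compiled regex plus the dict-level names.

    Hypothetical helper for illustration only.
    """
    pattern = ""
    keys = []
    # Split on both placeholder flavors, keeping the placeholders as tokens
    for token in re.split(r"(\{\w+\}|\[\w+\])", descriptor):
        if re.fullmatch(r"\{\w+\}", token):
            # Dict placeholder: capture the directory/file name as a key
            name = token[1:-1]
            keys.append(name)
            pattern += f"(?P<{name}>[^/]+)"
        elif re.fullmatch(r"\[\w+\]", token):
            # List placeholder: behaves like a "*", the name is ignored
            pattern += "[^/]+"
        else:
            # Literal part of the path
            pattern += re.escape(token)
    return re.compile(pattern), keys


def group_paths(descriptor, paths):
    """Group paths into nested dicts (one level per dict placeholder).

    Assumes the descriptor has at least one dict placeholder.
    The lowest level is a list of the matching paths.
    """
    regex, keys = descriptor_to_regex(descriptor)
    tree = {}
    for path in sorted(paths):
        match = regex.fullmatch(path)
        if match is None:
            continue  # path does not follow the descriptor
        node = tree
        for name in keys[:-1]:
            node = node.setdefault(match[name], {})
        node.setdefault(match[keys[-1]], []).append(path)
    return tree


# Hypothetical file names following the layout used in this section
paths = [
    "/data/REKA/RK1/20231119/proc/a.hdf5",
    "/data/REKA/RK1/20231119/proc/b.hdf5",
    "/data/REKA/RK1/20231119/proc/c.hdf5",
    "/data/REKA/RK2/20231119/proc/a.hdf5",
]
descriptor = "/data/{network}/{cable}/20231119/proc/[acquisition].hdf5"
tree = group_paths(descriptor, paths)

print(sorted(tree["REKA"]))  # ['RK1', 'RK2']
print(len(tree["REKA"]["RK1"]), len(tree["REKA"]["RK2"]))  # 3 1
```

The two dict placeholders produce the `REKA` and `RK1`/`RK2` keys, while the list placeholder only contributes entries to the lists at the lowest level, which mirrors the nesting shown in the outputs above.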