Download and convert to Zarr
This notebook downloads SWOT Pixel Cloud products from hydroweb.next (an API key is required) for a region and a period of interest. It then extracts the data that fall within the area of interest of your study and stores them in a Zarr database (based on the zcollection package) for future use. Zarr, together with the way the data are partitioned with zcollection, is very efficient for computation. However, unlike GeoPackage, it is not (yet) compatible with QGIS.
Setting the region and period of interest
The area of interest is read from a GeoPackage layer, created beforehand with, e.g., QGIS. It limits both the data download and the database.
[1]:
import pixcdust
from pixcdust.downloaders.hydroweb_next import PixCDownloader
import geopandas as gpd
from datetime import datetime
[2]:
# reading the area of interest polygon
gdf_geom = gpd.read_file("../data/aoi.gpkg")
dates = (
    datetime(2023, 4, 6),
    datetime(2023, 4, 8),
)
Download
Unfortunately, this step downloads many large files (they can be removed later). This is currently the only way, but the hydroweb.next team is working on improving it.
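Because the raw files are large, it can help to check the free disk space before starting the download. The sketch below is not part of pixcdust: the helper `has_free_space` and the 10 GB budget are assumptions made for illustration, and the real volume depends on your region and period.

```python
import shutil
from pathlib import Path


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    target = Path(path)
    # The download folder may not exist yet, so walk up to the nearest existing parent
    while not target.exists():
        target = target.parent
    free_gb = shutil.disk_usage(target).free / 1e9
    return free_gb >= required_gb


# 10 GB is a hypothetical budget, not a figure from hydroweb.next
if not has_free_space("/tmp/pixc", required_gb=10):
    print("Less than 10 GB free: consider another path_download")
```

If the check fails, pointing `path_download` to a larger disk is the simplest workaround.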
[3]:
pixcdownloader = PixCDownloader(
    gdf_geom,
    dates,
    verbose=1,
    path_download='/tmp/pixc',
)
pixcdownloader.search_download()
Extraction
Now that we have all the necessary files, let us extract the key variables within the area of interest into a Zarr (zcollection) database. This partitioned Zarr format is very efficient for time analysis, but it cannot currently be opened in GIS software such as QGIS. We use the same GeoDataFrame to limit the data to the area of interest.
[4]:
from pixcdust.converters.zarr import Nc2ZarrConverter
from glob import glob
[6]:
# You can specify conditions on variables to filter data
conditions = {
    "sig0": {'operator': "gt", 'threshold': 20},  # sig0 > 20
    "classification": {'operator': "ge", 'threshold': 3},  # classification >= 3
}
pixc = Nc2ZarrConverter(
    path_in=glob(pixcdownloader.path_download + '/*/*nc'),
    variables=['height', 'sig0', 'classification'],
    area_of_interest=gdf_geom,
    conditions=conditions,
)
pixc.database_from_nc(path_out='/tmp/pixc_zarr')
2025-02-13 11:35:44,674 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:37803' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('shuffle-split-d180ccccca7f1b0ad240c52ca6a9e922', 9), 'shuffle-taker-323ec52c54e9ef65b83a529dc1b42178', ('shuffle-split-d180ccccca7f1b0ad240c52ca6a9e922', 6), ('astype-concatenate-93c733841bd864520e81d845338ea5ab', 14), 'shuffle-taker-88878d00edde8017b6bab09fff810c14', 'original-open_dataset-latitude-251dad577f030b76afa777d4127c621a', ('shuffle-split-d180ccccca7f1b0ad240c52ca6a9e922', 7), ('shuffle-split-d180ccccca7f1b0ad240c52ca6a9e922', 0)} (stimulus_id='handle-worker-cleanup-1739442944.6742508')
2025-02-13 11:35:44,674 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:37803
Traceback (most recent call last):
File "/home/vschaffn/Documents/swot_pixc_study/pixc-env/lib/python3.10/site-packages/tornado/iostream.py", line 861, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/home/vschaffn/Documents/swot_pixc_study/pixc-env/lib/python3.10/site-packages/tornado/iostream.py", line 1116, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/vschaffn/Documents/swot_pixc_study/pixc-env/lib/python3.10/site-packages/distributed/worker.py", line 2056, in gather_dep
response = await get_data_from_worker(
File "/home/vschaffn/Documents/swot_pixc_study/pixc-env/lib/python3.10/site-packages/distributed/worker.py", line 2874, in get_data_from_worker
response = await send_recv(
File "/home/vschaffn/Documents/swot_pixc_study/pixc-env/lib/python3.10/site-packages/distributed/core.py", line 1015, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/vschaffn/Documents/swot_pixc_study/pixc-env/lib/python3.10/site-packages/distributed/comm/tcp.py", line 236, in read
convert_stream_closed_error(self, e)
File "/home/vschaffn/Documents/swot_pixc_study/pixc-env/lib/python3.10/site-packages/distributed/comm/tcp.py", line 140, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:52358 remote=tcp://127.0.0.1:37803>: ConnectionResetError: [Errno 104] Connection reset by peer
2025-02-13 11:35:44,682 - distributed.nanny - WARNING - Restarting worker
2025-02-13 11:35:45,843 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:40795' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('shuffle-split-d180ccccca7f1b0ad240c52ca6a9e922', 19), ('shuffle-split-d180ccccca7f1b0ad240c52ca6a9e922', 22), 'shuffle-taker-9f8e6acf428c3f847de7d4b31c83861e', ('astype-concatenate-93c733841bd864520e81d845338ea5ab', 14), ('shuffle-split-d180ccccca7f1b0ad240c52ca6a9e922', 15), ('shuffle-split-d180ccccca7f1b0ad240c52ca6a9e922', 5), 'shuffle-taker-b27c1aa82e93a6b5f8d3f13a5d45a5fe', ('shuffle-split-d180ccccca7f1b0ad240c52ca6a9e922', 11), 'shuffle-taker-88878d00edde8017b6bab09fff810c14', ('shuffle-split-d180ccccca7f1b0ad240c52ca6a9e922', 23)} (stimulus_id='handle-worker-cleanup-1739442945.843789')
2025-02-13 11:35:45,843 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:40795
Traceback (most recent call last):
File "/home/vschaffn/Documents/swot_pixc_study/pixc-env/lib/python3.10/site-packages/tornado/iostream.py", line 861, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/home/vschaffn/Documents/swot_pixc_study/pixc-env/lib/python3.10/site-packages/tornado/iostream.py", line 1116, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/vschaffn/Documents/swot_pixc_study/pixc-env/lib/python3.10/site-packages/distributed/worker.py", line 2056, in gather_dep
response = await get_data_from_worker(
File "/home/vschaffn/Documents/swot_pixc_study/pixc-env/lib/python3.10/site-packages/distributed/worker.py", line 2874, in get_data_from_worker
response = await send_recv(
File "/home/vschaffn/Documents/swot_pixc_study/pixc-env/lib/python3.10/site-packages/distributed/core.py", line 1015, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/vschaffn/Documents/swot_pixc_study/pixc-env/lib/python3.10/site-packages/distributed/comm/tcp.py", line 236, in read
convert_stream_closed_error(self, e)
File "/home/vschaffn/Documents/swot_pixc_study/pixc-env/lib/python3.10/site-packages/distributed/comm/tcp.py", line 140, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:50056 remote=tcp://127.0.0.1:40795>: ConnectionResetError: [Errno 104] Connection reset by peer
2025-02-13 11:35:45,851 - distributed.nanny - WARNING - Restarting worker
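The `conditions` dictionary used above pairs each variable with a comparison operator and a threshold. As a rough mental model, and not pixcdust's actual implementation, the sketch below applies such a dictionary to a few made-up pixels in plain Python. It assumes that the operator names follow Python's `operator` module and that all conditions must hold at once; both are assumptions, not documented behaviour.

```python
import operator

# Assumed mapping from operator names to comparison functions.
# Only "gt" and "ge" appear in the notebook; the others are guesses.
OPERATORS = {
    "gt": operator.gt,
    "ge": operator.ge,
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
}


def passes(record, conditions):
    """Return True if the record satisfies every condition (logical AND)."""
    return all(
        OPERATORS[rule["operator"]](record[name], rule["threshold"])
        for name, rule in conditions.items()
    )


conditions = {
    "sig0": {"operator": "gt", "threshold": 20},           # sig0 > 20
    "classification": {"operator": "ge", "threshold": 3},  # classification >= 3
}

# Hypothetical pixels, for illustration only
pixels = [
    {"sig0": 25.9, "classification": 3.0, "height": 210.4},  # passes both
    {"sig0": 12.0, "classification": 4.0, "height": 209.8},  # sig0 too low
    {"sig0": 50.8, "classification": 2.0, "height": 211.0},  # class too low
]
kept = [p for p in pixels if passes(p, conditions)]
```

Under these assumptions only the first pixel is kept, which matches the table further down, where every row has `sig0` above 20 and `classification` of 3 or 4.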
The database has been successfully created, so we can remove the raw files.
[7]:
# import shutil
# shutil.rmtree('/tmp/pixc')
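The removal is left commented out, probably because deleting the raw files is irreversible. One cautious option is to delete them only when the output folder exists and is not empty. The helper `remove_raw_files` below is hypothetical, and a non-empty folder is only a crude proxy for a valid Zarr database. The demonstration runs in a temporary directory, so it does not touch `/tmp/pixc`.

```python
import shutil
import tempfile
from pathlib import Path


def remove_raw_files(raw_dir, zarr_dir):
    """Delete raw_dir only if zarr_dir exists and is not empty.

    Returns True if raw_dir was deleted, False otherwise.
    """
    raw, zarr = Path(raw_dir), Path(zarr_dir)
    if not zarr.is_dir() or not any(zarr.iterdir()):
        return False  # no database yet: keep the raw files
    if raw.is_dir():
        shutil.rmtree(raw)
        return True
    return False


# Demonstration on throwaway folders
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "pixc"
    zarr = Path(tmp) / "pixc_zarr"
    raw.mkdir()
    zarr.mkdir()
    first_try = remove_raw_files(raw, zarr)   # database folder is empty
    (zarr / "marker").touch()                 # pretend the database was written
    second_try = remove_raw_files(raw, zarr)  # now the raw folder is removed
    raw_still_there = raw.exists()
```

With the paths used in this notebook, the call would be `remove_raw_files('/tmp/pixc', '/tmp/pixc_zarr')`.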
Read the database
The previous steps do not need to be rerun once the database exists.
Now we can open this database as an xarray Dataset, a DataFrame, or a GeoDataFrame.
[8]:
from pixcdust.readers.zarr import ZarrReader
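# import the class (not the module) so the name `datetime` keeps the same meaning as in the first cell
from datetime import datetime
pixc_read = ZarrReader(
    "/tmp/pixc_zarr"
)
pixc_read.read((datetime(2023, 4, 6), datetime(2023, 4, 8)))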
pixc_read.data
[8]:
<xarray.Dataset> Size: 2MB
Dimensions: (points: 49160)
Dimensions without coordinates: points
Data variables:
time (points) datetime64[ns] 393kB dask.array<chunksize=(23399,), meta=np.ndarray>
sig0 (points) float32 197kB dask.array<chunksize=(23399,), meta=np.ndarray>
height (points) float32 197kB dask.array<chunksize=(23399,), meta=np.ndarray>
pass_number (points) float32 197kB dask.array<chunksize=(23399,), meta=np.ndarray>
longitude (points) float32 197kB dask.array<chunksize=(23399,), meta=np.ndarray>
tile_number (points) float32 197kB dask.array<chunksize=(23399,), meta=np.ndarray>
classification (points) float32 197kB dask.array<chunksize=(23399,), meta=np.ndarray>
latitude (points) float32 197kB dask.array<chunksize=(23399,), meta=np.ndarray>
cycle_number (points) float32 197kB dask.array<chunksize=(23399,), meta=np.ndarray>
Attributes:
azimuth_offset: 3
description: cloud of geolocated interferogram pixels
interferogram_size_azimuth: 3245
interferogram_size_range: 4857
looks_to_efflooks: 1.5340684990936673
num_azimuth_looks: 7.0
[9]:
gdf_pixc = pixc_read.to_geodataframe()
gdf_pixc
/home/vschaffn/Documents/swot_pixc_study/pixcdust/readers/base_reader.py:142: UserWarning: No active geometry column to be set. The resulting object will be a pandas.DataFrame with geopandas.GeometryArray(s) containing geometry and CRS information. Use `.set_geometry()` to set an active geometry and upcast to the geopandas.GeoDataFrame manually.
gdf = self.data.xvec.to_geodataframe()
[9]:
| points | time | sig0 | height | pass_number | longitude | tile_number | classification | latitude | cycle_number |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-04-06 09:46:18 | 179.789597 | 210.770126 | 16.0 | 1.375725 | 78.0 | 3.0 | 43.519089 | 482.0 |
| 1 | 2023-04-06 09:46:18 | 165.948349 | 210.261169 | 16.0 | 1.375776 | 78.0 | 4.0 | 43.519096 | 482.0 |
| 2 | 2023-04-06 09:46:18 | 107.777306 | 210.359619 | 16.0 | 1.375956 | 78.0 | 4.0 | 43.519131 | 482.0 |
| 3 | 2023-04-06 09:46:18 | 50.774342 | 210.037994 | 16.0 | 1.376048 | 78.0 | 4.0 | 43.519146 | 482.0 |
| 4 | 2023-04-06 09:46:18 | 25.940098 | 210.448334 | 16.0 | 1.376294 | 78.0 | 3.0 | 43.519192 | 482.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49155 | 2023-04-07 09:36:56 | 20.804575 | 173.779526 | 16.0 | 1.432744 | 78.0 | 3.0 | 43.681229 | 483.0 |
| 49156 | 2023-04-07 09:36:56 | 30.022316 | 175.180923 | 16.0 | 1.432419 | 78.0 | 3.0 | 43.682388 | 483.0 |
| 49157 | 2023-04-07 09:36:56 | 70.519905 | 173.066010 | 16.0 | 1.430424 | 78.0 | 3.0 | 43.682846 | 483.0 |
| 49158 | 2023-04-07 09:36:56 | 89.743431 | 173.031128 | 16.0 | 1.430551 | 78.0 | 3.0 | 43.682869 | 483.0 |
| 49159 | 2023-04-07 09:36:56 | 32.024288 | 176.172485 | 16.0 | 1.428071 | 78.0 | 3.0 | 43.685673 | 483.0 |
49160 rows × 9 columns
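As a quick sanity check, the rows can be grouped by `cycle_number` to compare the two passes. The sketch below uses plain Python and only the ten `(cycle_number, height)` pairs displayed in the table above, so the resulting means describe those ten rows and not the whole dataset.

```python
from collections import defaultdict
from statistics import fmean

# (cycle_number, height) pairs copied from the ten rows displayed above
rows = [
    (482.0, 210.770126),
    (482.0, 210.261169),
    (482.0, 210.359619),
    (482.0, 210.037994),
    (482.0, 210.448334),
    (483.0, 173.779526),
    (483.0, 175.180923),
    (483.0, 173.066010),
    (483.0, 173.031128),
    (483.0, 176.172485),
]

# Collect the heights of each cycle, then average them
heights_by_cycle = defaultdict(list)
for cycle, height in rows:
    heights_by_cycle[cycle].append(height)

mean_height = {cycle: fmean(values) for cycle, values in heights_by_cycle.items()}

for cycle, value in sorted(mean_height.items()):
    print(f"cycle {cycle:.0f}: mean height {value:.2f} over {len(heights_by_cycle[cycle])} rows")
```

On the full GeoDataFrame, the same idea would normally be written as a pandas group-by on the `cycle_number` column.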
Enjoy!