Example Notebooks
medaprep.skim and medaprep.visualize modules to investigate Sentinel-2 data¶
In this notebook, we will show how using medaprep.skim.features,medaprep.visualize.query, and medaprep.visualize.distributions can help prepare and visualize large sets of data. We will load a set of Sentinel-2 imagery over Austin, TX and quickly retreive statistics about the features within to assist with downstream processing. We will use the following process:
- Initialize Dask cluster to load
xarray Datasetfrom a public S3 bucket. visualize.querybounding boxes to check geographic coverage.skim.featuresfrom the dataset for a quick set of summary statistics.visualize.distributionsof variables within the dataset.
from bokeh.io import show, output_notebook
from bokeh.layouts import layout
import dask.distributed
import folium
import geopandas as gpd
from IPython.display import HTML, display
from odc.stac import configure_rio, stac_load
import pandas as pd
from medaprep import skim, visualize
from pystac_client import Client
import shapely.geometry
import xarray
client = dask.distributed.Client()
configure_rio(cloud_defaults=True, aws={"aws_unsigned": True}, client=client)
/Users/dylan/opt/anaconda3/envs/odc-stac/lib/python3.10/site-packages/distributed/node.py:179: UserWarning: Port 8787 is already in use. Perhaps you already have a cluster running? Hosting the HTTP server on port 49607 instead warnings.warn(
Load Sentinel-2 data from public STAC catalog¶
The following cell loads an xarray Dataset containing Sentinel-2 imagery from a STAC catalog. We initialize the center point of our query to be the center of Austin, TX and search for imagery from July, 2022 within a 100 km radius.
config = {
"sentinel-s2-l2a-cogs": {
"assets": {
"*": {"data_type": "uint16", "nodata": 0},
"SCL": {"data_type": "uint8", "nodata": 0},
"visual": {"data_type": "uint8", "nodata": 0},
},
"aliases": {"red": "B04", "green": "B03", "blue": "B02"},
},
"*": {"warnings": "ignore"},
}
km2deg = 1.0 / 111
x, y = (-97.744, 30.266)
r = 100 * km2deg
bbox = (x - r, y - r, x + r, y + r)
catalog = Client.open("https://earth-search.aws.element84.com/v0")
query = catalog.search(
collections=["sentinel-s2-l2a-cogs"], datetime="2022-07-01/2022-07-31", limit=100, bbox=bbox
)
items = list(query.get_items())
stac_json = query.get_all_items_as_dict()
crs = "epsg:3857"
zoom = 2**5
data = stac_load(
items,
crs=crs,
resolution=10*zoom,
chunks={},
groupby="solar_day",
stac_cfg=config,
)
data
<xarray.Dataset>
Dimensions: (y: 1142, x: 1137, time: 12)
Coordinates:
* y (y) float64 3.716e+06 3.715e+06 ... 3.351e+06 3.351e+06
* x (x) float64 -1.102e+07 -1.102e+07 ... -1.066e+07 -1.066e+07
spatial_ref int32 3857
* time (time) datetime64[ns] 2022-07-03T17:25:22 ... 2022-07-30T17:...
Data variables: (12/16)
visual (time, y, x) uint8 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray>
B01 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray>
B02 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray>
B03 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray>
B04 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray>
B05 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray>
... ...
B09 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray>
B11 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray>
B12 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray>
AOT (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray>
WVP (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray>
SCL (time, y, x) uint8 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray>visualize the bounding boxes of the input query and the returned data using visualize.query¶
visualize.query returns a visualization of the input and returned bounding boxes.
gdf = gpd.GeoDataFrame.from_features(stac_json, 3857)
bbox2 = (gdf.bounds['minx'].min(),gdf.bounds['miny'].min(),gdf.bounds['maxx'].max(),gdf.bounds['maxy'].max())
bbox3 = tuple([b+2 for b in bbox2])
visualize.query(bbox=[bbox,bbox2],
name=["Query","Returned"],
folium_map=folium.Map(),
color=["red","green"])
skim through features of the data using skim.features¶
skim.features returns basic statistics and properties of an xarray Dataset.
small_data = data.sel(time='2022-07-03').compute()
df = skim.features(small_data)
df
| variables | data_types | NaNs | mean | std | maximums | minimums | |
|---|---|---|---|---|---|---|---|
| 0 | visual | uint8 | False | 83.206896 | 85.090139 | 255 | 0 |
| 1 | B01 | uint16 | False | 694.111545 | 1057.890177 | 12671 | 0 |
| 2 | B02 | uint16 | False | 715.116638 | 1019.690505 | 11615 | 0 |
| 3 | B03 | uint16 | False | 831.311709 | 1033.465046 | 10787 | 0 |
| 4 | B04 | uint16 | False | 897.211163 | 1064.686861 | 10284 | 0 |
| 5 | B05 | uint16 | False | 1144.481014 | 1258.273113 | 11399 | 0 |
| 6 | B06 | uint16 | False | 1453.429606 | 1447.058901 | 10691 | 0 |
| 7 | B07 | uint16 | False | 1585.919234 | 1543.215470 | 10361 | 0 |
| 8 | B08 | uint16 | False | 1592.048660 | 1549.807719 | 10519 | 0 |
| 9 | B8A | uint16 | False | 1737.786114 | 1662.727114 | 10352 | 0 |
| 10 | B09 | uint16 | False | 2445.116097 | 3098.090223 | 15918 | 0 |
| 11 | B11 | uint16 | False | 1869.718284 | 1790.813024 | 8431 | 0 |
| 12 | B12 | uint16 | False | 1358.351765 | 1357.018681 | 8239 | 0 |
| 13 | AOT | uint16 | False | 59.987095 | 57.874945 | 173 | 0 |
| 14 | WVP | uint16 | False | 1847.768434 | 1681.527584 | 6201 | 0 |
| 15 | SCL | uint8 | False | 3.550229 | 3.664158 | 10 | 0 |
visualize the distributions of the variables using visualize.distributions¶
visualize.distributions returns a list of bokeh figures containing estimated distributions of the variables.