Example Notebooks
medaprep.skim
and medaprep.visualize
modules to investigate Sentinel-2 data¶
In this notebook, we will show how using medaprep.skim.features
,medaprep.visualize.query
, and medaprep.visualize.distributions
can help prepare and visualize large sets of data. We will load a set of Sentinel-2 imagery over Austin, TX and quickly retreive statistics about the features within to assist with downstream processing. We will use the following process:
- Initialize Dask cluster to load
xarray Dataset
from a public S3 bucket. visualize.query
bounding boxes to check geographic coverage.skim.features
from the dataset for a quick set of summary statistics.visualize.distributions
of variables within the dataset.
from bokeh.io import show, output_notebook
from bokeh.layouts import layout
import dask.distributed
import folium
import geopandas as gpd
from IPython.display import HTML, display
from odc.stac import configure_rio, stac_load
import pandas as pd
from medaprep import skim, visualize
from pystac_client import Client
import shapely.geometry
import xarray
client = dask.distributed.Client()
configure_rio(cloud_defaults=True, aws={"aws_unsigned": True}, client=client)
/Users/dylan/opt/anaconda3/envs/odc-stac/lib/python3.10/site-packages/distributed/node.py:179: UserWarning: Port 8787 is already in use. Perhaps you already have a cluster running? Hosting the HTTP server on port 49607 instead warnings.warn(
Load Sentinel-2 data from public STAC catalog¶
The following cell loads an xarray Dataset
containing Sentinel-2 imagery from a STAC catalog. We initialize the center point of our query to be the center of Austin, TX and search for imagery from July, 2022 within a 100 km radius.
config = {
"sentinel-s2-l2a-cogs": {
"assets": {
"*": {"data_type": "uint16", "nodata": 0},
"SCL": {"data_type": "uint8", "nodata": 0},
"visual": {"data_type": "uint8", "nodata": 0},
},
"aliases": {"red": "B04", "green": "B03", "blue": "B02"},
},
"*": {"warnings": "ignore"},
}
km2deg = 1.0 / 111
x, y = (-97.744, 30.266)
r = 100 * km2deg
bbox = (x - r, y - r, x + r, y + r)
catalog = Client.open("https://earth-search.aws.element84.com/v0")
query = catalog.search(
collections=["sentinel-s2-l2a-cogs"], datetime="2022-07-01/2022-07-31", limit=100, bbox=bbox
)
items = list(query.get_items())
stac_json = query.get_all_items_as_dict()
crs = "epsg:3857"
zoom = 2**5
data = stac_load(
items,
crs=crs,
resolution=10*zoom,
chunks={},
groupby="solar_day",
stac_cfg=config,
)
data
<xarray.Dataset> Dimensions: (y: 1142, x: 1137, time: 12) Coordinates: * y (y) float64 3.716e+06 3.715e+06 ... 3.351e+06 3.351e+06 * x (x) float64 -1.102e+07 -1.102e+07 ... -1.066e+07 -1.066e+07 spatial_ref int32 3857 * time (time) datetime64[ns] 2022-07-03T17:25:22 ... 2022-07-30T17:... Data variables: (12/16) visual (time, y, x) uint8 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray> B01 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray> B02 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray> B03 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray> B04 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray> B05 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray> ... ... B09 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray> B11 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray> B12 (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray> AOT (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray> WVP (time, y, x) uint16 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray> SCL (time, y, x) uint8 dask.array<chunksize=(1, 1142, 1137), meta=np.ndarray>
visualize
the bounding boxes of the input query and the returned data using visualize.query
¶
visualize.query
returns a visualization of the input and returned bounding boxes.
gdf = gpd.GeoDataFrame.from_features(stac_json, 3857)
bbox2 = (gdf.bounds['minx'].min(),gdf.bounds['miny'].min(),gdf.bounds['maxx'].max(),gdf.bounds['maxy'].max())
bbox3 = tuple([b+2 for b in bbox2])
visualize.query(bbox=[bbox,bbox2],
name=["Query","Returned"],
folium_map=folium.Map(),
color=["red","green"])
skim
through features
of the data using skim.features
¶
skim.features
returns basic statistics and properties of an xarray Dataset
.
small_data = data.sel(time='2022-07-03').compute()
df = skim.features(small_data)
df
variables | data_types | NaNs | mean | std | maximums | minimums | |
---|---|---|---|---|---|---|---|
0 | visual | uint8 | False | 83.206896 | 85.090139 | 255 | 0 |
1 | B01 | uint16 | False | 694.111545 | 1057.890177 | 12671 | 0 |
2 | B02 | uint16 | False | 715.116638 | 1019.690505 | 11615 | 0 |
3 | B03 | uint16 | False | 831.311709 | 1033.465046 | 10787 | 0 |
4 | B04 | uint16 | False | 897.211163 | 1064.686861 | 10284 | 0 |
5 | B05 | uint16 | False | 1144.481014 | 1258.273113 | 11399 | 0 |
6 | B06 | uint16 | False | 1453.429606 | 1447.058901 | 10691 | 0 |
7 | B07 | uint16 | False | 1585.919234 | 1543.215470 | 10361 | 0 |
8 | B08 | uint16 | False | 1592.048660 | 1549.807719 | 10519 | 0 |
9 | B8A | uint16 | False | 1737.786114 | 1662.727114 | 10352 | 0 |
10 | B09 | uint16 | False | 2445.116097 | 3098.090223 | 15918 | 0 |
11 | B11 | uint16 | False | 1869.718284 | 1790.813024 | 8431 | 0 |
12 | B12 | uint16 | False | 1358.351765 | 1357.018681 | 8239 | 0 |
13 | AOT | uint16 | False | 59.987095 | 57.874945 | 173 | 0 |
14 | WVP | uint16 | False | 1847.768434 | 1681.527584 | 6201 | 0 |
15 | SCL | uint8 | False | 3.550229 | 3.664158 | 10 | 0 |
visualize
the distributions of the variables using visualize.distributions
¶
visualize.distributions
returns a list of bokeh figures containing estimated distributions of the variables.