The USGS topo maps are awesome… especially for outdoor use or historical research.

The maps contain a great deal of information on national parks, nature preserves, and most other United States public lands.

Highlighted features include trails, amenities, ranger stations, boundaries, elevation data, bathymetry, roads, contour lines, and hill shading.

Even better, the USGS has topo maps going all the way back to 1882. 🤩

Here is an example of what the latest (2023) USGS topo map of Mt. Olympus looks like when viewed in topoView.

Mt. Olympus USGS Topo Map

Overview

  1. Background
  2. Goals
  3. Prep Work
  4. Download USGS GeoPDFs
  5. Convert to GeoTiffs
  6. Clip Collars
  7. Creating Map Tiles
  8. Creating PMTiles
  9. Upload PMTiles to S3
  10. Results
  11. Lessons Learned
  12. Why?
  13. Next Steps

Background

If we wanted to use these maps in our own app or as map tiles we have some issues:

  1. topoView is an application from the USGS where you can pan around and download 7.5 minute quadrangles as JPEG, GeoTiff, KMZ, or GeoPDF. ⚠️ However, you have to download each quadrangle individually.
  2. topoBuilder is a marketplace of sorts where you can select a 7.5 minute quadrangle or 100K topo and they will email it to you. ⚠️ Email doesn’t work at scale and is definitely not a mosaic.
  3. National Map: this map is styled a tad differently than the normal USGS topos… not sure why. It is a neat map nonetheless, but it has some issues:
    1. They do have an ArcGIS endpoint, but I don’t want to depend on someone else’s API that is free and could change at any time.
    2. Their tile server seems pretty slow.
    3. I’m not sure of their usage policy for offline / bulk downloads from their ArcGIS services.
  4. The GeoTiffs/GeoPDFs have collars around them by default, which means that if you placed them directly together there would be whitespace between them, like this when viewed in QGIS: 🤮 USGS Topo Map Collar Issue

Goals

So what are we going to do then?

  1. Make a seamless mosaic USGS topo map with the latest map from each 7.5 minute quadrangle.
  2. Self host the map tiles with Protomaps.
  3. Add this base map to Topo Lock Maps.
  4. Have fun making maps, working with raster data, GDAL, Rasterio, and doing data engineering stuff.

Let’s get started.


“If you spend too much time thinking about a thing, you’ll never get it done.” -Bruce Lee


Prep Work

There are ~65,000 7.5 minute quadrangles covering the entire continental US, Alaska, Hawaii, the Virgin Islands, and Puerto Rico.

Downloading all of the USGS quadrangles, including temporary storage, takes up nearly 4 TB of space.

I spoke to my good buddy Seth, who is a hardware guru, and described the data I’m working with and what all is involved. Since this means a crap ton of reads and writes and we want the best performance, I went with a 4 TB M.2 NVMe drive and an external enclosure for SPEED! 💾 🚀

  • M.2 NVMe: WD Black SN850X NVMe
  • Enclosure: Acasis 40Gbps M.2 NVMe USB 4 Thunderbolt 3/4 SSD

USGS Topo NVMe Drive

I formatted the drive with APFS because I have had a lot of good success dealing with large files on APFS.

See this X conversation with me and @shademap discussing drive formats… basically, choose your drive format wisely.

Download USGS GeoPDFs

Let’s scrape all of the current US Topo GeoPDF quadrangles from AWS S3. There are ~65,241 of them!

The USGS topo maps can be found on this s3 bucket.

Example script for Alaska:

% aws s3 cp s3://prd-tnm/StagedProducts/Maps/USTopo/Current/PDF/AK/ . --recursive --exclude "*" --include "*.pdf"

I downloaded every single current USGS topo GeoPDF and put them in a folder on my NVMe drive.
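If you’d rather script the whole download than run the aws CLI once per state, here is a minimal sketch using boto3 with anonymous access (the bucket is public). The destination path is an assumption:

import os

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "prd-tnm"
PREFIX = "StagedProducts/Maps/USTopo/Current/PDF/"
DEST = "/Volumes/gstore/usgs-topos"  # hypothetical NVMe mount point

# The bucket is public, so use unsigned (anonymous) requests.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.lower().endswith(".pdf"):
            continue
        target = os.path.join(DEST, os.path.basename(key))
        if os.path.exists(target):  # cheap resume support
            continue
        s3.download_file(BUCKET, key, target)
        print(f"downloaded {key}")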

Convert to GeoTiffs

The great thing about downloading GeoPDFs is we can selectively enable and disable individual layers when converting to GeoTiffs.

💡 We can use this in the future for building different versions of the map; for example, maybe a map without hill shading or contour lines, or one with aerial imagery only.

So let’s figure out which layers we want on this version of the map.

Let’s take NC_Clingmans_Dome_20220916_TM_geo.pdf as an example.

We can view each GeoPDF’s layers using GDAL; the part we are specifically looking for right now is the Metadata (LAYERS): section:

% gdalinfo NC_Clingmans_Dome_20220916_TM_geo.pdf -mdd LAYERS
Driver: PDF/Geospatial PDF
Files: NC_Clingmans_Dome_20220916_TM_geo.pdf
Size is 3600, 4350
Coordinate System is:
PROJCRS["GCS North American 1983 UTM Zone 17N (Calculated)",
    BASEGEOGCRS["NAD83",
        DATUM["North American Datum 1983",
            ELLIPSOID["GRS 1980",6378137,298.257222101,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6269]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["UTM zone 17N",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",-81,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.9996,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",500000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]],
        ID["EPSG",16017]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]
Data axis to CRS axis mapping: 1,2
GeoTransform =
  272013.1482500283, 4.063210391105159, -0.1004904645502494
  3946737.528798919, -0.1008348647451016, -4.062893054227724
Metadata:
  CREATION_DATE=D:20220916141000Z
  CREATOR=Esri ArcSOC 10.8.1.14362
  NEATLINE=POLYGON ((271583.489587898 3928993.90646951,271583.144898591 3946737.30471295,286640.259145719 3946737.61317269,286640.483846401 3928993.07269854,271583.489587898 3928993.90646951))
  TITLE=Clingmans Dome
Metadata (LAYERS):
  LAYER_00_NAME=Labels
  LAYER_01_NAME=Map_Collar
  LAYER_02_NAME=Map_Collar.Map_Elements
  LAYER_03_NAME=Map_Frame
  LAYER_04_NAME=Map_Frame.Projection_and_Grids
  LAYER_05_NAME=Map_Frame.Geographic_Names
  LAYER_06_NAME=Map_Frame.Structures
  LAYER_07_NAME=Map_Frame.Transportation
  LAYER_08_NAME=Map_Frame.Transportation.Road_Names_and_Shields
  LAYER_09_NAME=Map_Frame.Transportation.Road_Features
  LAYER_10_NAME=Map_Frame.Transportation.Trails
  LAYER_11_NAME=Map_Frame.Wetlands
  LAYER_12_NAME=Map_Frame.Hydrography
  LAYER_13_NAME=Map_Frame.Terrain
  LAYER_14_NAME=Map_Frame.Terrain.Contours
  LAYER_15_NAME=Map_Frame.Terrain.Shaded_Relief
  LAYER_16_NAME=Map_Frame.Woodland
  LAYER_17_NAME=Map_Frame.Boundaries
  LAYER_18_NAME=Map_Frame.Boundaries.Jurisdictional_Boundaries
  LAYER_19_NAME=Map_Frame.Boundaries.Jurisdictional_Boundaries.State_or_Territory
  LAYER_20_NAME=Map_Frame.Boundaries.Jurisdictional_Boundaries.County_or_Equivalent
  LAYER_21_NAME=Map_Frame.Boundaries.Federal_Administrated_Lands
  LAYER_22_NAME=Map_Frame.Boundaries.Federal_Administrated_Lands.National_Park_Service
  LAYER_23_NAME=Images
  LAYER_24_NAME=Images.Orthoimage
  LAYER_25_NAME=Barcode
Corner Coordinates:
Upper Left  (  272013.148, 3946737.529) ( 83d31' 4.39"W, 35d38'17.42"N)
Lower Left  (  271576.015, 3929063.944) ( 83d31' 3.80"W, 35d28'43.93"N)
Upper Right (  286640.706, 3946374.523) ( 83d21'22.93"W, 35d38'17.41"N)
Lower Right (  286203.572, 3928700.938) ( 83d21'23.49"W, 35d28'43.88"N)
Center      (  279108.360, 3937719.234) ( 83d26'13.65"W, 35d33'30.76"N)
Band 1 Block=3600x1 Type=Byte, ColorInterp=Red
  Overviews: 1800x2175, 900x1087, 450x544, 225x272, 112x136
Band 2 Block=3600x1 Type=Byte, ColorInterp=Green
  Overviews: 1800x2175, 900x1087, 450x544, 225x272, 112x136
Band 3 Block=3600x1 Type=Byte, ColorInterp=Blue
  Overviews: 1800x2175, 900x1087, 450x544, 225x272, 112x136
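
If you’d rather pull the layer list programmatically than eyeball gdalinfo output, here is a minimal sketch using the GDAL Python bindings (assuming the osgeo package is installed):

from osgeo import gdal

# Read the LAYERS metadata domain from a GeoPDF and print each layer name.
ds = gdal.Open("NC_Clingmans_Dome_20220916_TM_geo.pdf")
for entry in ds.GetMetadata_List("LAYERS") or []:
    # Entries look like: LAYER_00_NAME=Labels
    key, _, name = entry.partition("=")
    print(key, name)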

We have 26 layers in total, LAYER_00 through LAYER_25, as listed under the Metadata (LAYERS) section above.

To quickly test how the map will look with different layers enabled/disabled, open the GeoPDF with a PDF viewer.

For example, here is what it looks like when opening NC_Clingmans_Dome_20220916_TM_geo.pdf in the standard Acrobat Reader on macOS:

USGS Topo PDF Layers

As you can see on the right side, most of the layers are enabled in this particular GeoPDF.

For example, if we disable everything except for Orthoimage, this is what we will see (only aerial imagery):

USGS Topo PDF Layer Aerial Only

So now it’s time to code because we need to automate all this!

gdalwarp \
-t_srs EPSG:3857 \
-co COMPRESS=JPEG \
-co PHOTOMETRIC=YCBCR \
-co TILED=YES \
-co QUALITY=50 \
-of GTiff {name} {name}.tif \
--config GDAL_CACHEMAX 1000 \
--config GDAL_NUM_THREADS "ALL_CPUS" \
--config GDAL_PDF_LAYERS "ALL" \
--config GDAL_PDF_LAYERS_OFF "Map_Frame.PLSS,Barcode,Images,Images.Orthoimage,Map_Collar,Map_Frame.Projection_and_Grids" \
--config GDAL_PDF_DPI 300

This gdalwarp command does the following:

  • -t_srs EPSG:3857 - Reproject/transform to EPSG:3857 (Web Mercator)
  • -co COMPRESS=JPEG - Set compression to JPEG; I found this to be the best balance of quality vs. size for the tiles
  • -co PHOTOMETRIC=YCBCR - I read Paul Ramsey’s post about GeoTiff Compression for Dummies.
  • -co TILED=YES - It’s a best practice to set this for performance reasons, documented on the GeoTiff driver
  • -co QUALITY=50 - I played around with different quality settings and this greatly reduces the storage size of the resulting tiffs with very minimal impact on quality. Paul Ramsey’s post about GeoTiff Compression for Dummies talks about this too.
  • -of GTiff - Output a GeoTiff
  • --config GDAL_CACHEMAX 1000 - Set the GDAL block cache size; read about it here.
  • --config GDAL_NUM_THREADS "ALL_CPUS" - speeeeeed 🚀
  • --config GDAL_PDF_LAYERS "ALL" - Turn on all PDF layers first; see why I needed to do this below.
  • --config GDAL_PDF_LAYERS_OFF - Turn off the PDF layers we don’t want
  • --config GDAL_PDF_DPI 300 - Set the rasterization DPI to 300

⚠️ The reason for using both GDAL_PDF_LAYERS and GDAL_PDF_LAYERS_OFF is that I needed to enable all the layers first, then selectively disable the ones I didn’t want. Some of the GeoPDFs have different layer settings “saved” internally, so if we only disable layers we get mixed results when creating a mosaic.

I put this entire command in a Python script and ran it in a parallel loop:

import glob
import os
import subprocess

topopath = '/Volumes/gstore/usgs-topos/'
files = glob.glob(f'{topopath}/*.pdf')

processes = set()
max_processes = 24

os.chdir(topopath)

for name in files:
    # Convert each GeoPDF to a tiled, JPEG-compressed GeoTiff in EPSG:3857,
    # with the unwanted layers turned off.
    processes.add(subprocess.Popen(
        f'gdalwarp -t_srs EPSG:3857 -co COMPRESS=JPEG -co PHOTOMETRIC=YCBCR '
        f'-co TILED=YES -co QUALITY=50 -of GTiff {name} {name}.tif '
        f'--config GDAL_CACHEMAX 1000 --config GDAL_NUM_THREADS "ALL_CPUS" '
        f'--config GDAL_PDF_LAYERS "ALL" '
        f'--config GDAL_PDF_LAYERS_OFF "Map_Frame.PLSS,Barcode,Images,Images.Orthoimage,Map_Collar,Map_Frame.Projection_and_Grids" '
        f'--config GDAL_PDF_DPI 300',
        shell=True, cwd=topopath))

    # Throttle: once max_processes are running, wait for one to finish
    # and prune the completed ones.
    if len(processes) >= max_processes:
        os.wait()
        processes.difference_update(
            [p for p in processes if p.poll() is not None])

# Wait for any remaining child processes to finish
for p in processes:
    if p.poll() is None:
        p.wait()

✅ Now we have all 65,000+ GeoTiffs reprojected, compressed, with only the layers we want.

🔥 This part of the process took 40 hours on my Mac (M2 Max, 95 GB RAM) and produced ~300 GB of GeoTiffs.

% python parallel-convert-pdf-to-gtiff.py
    1653678.94s user 36472.89s system 1147% cpu 40:54:15.92 total

Clip Collars

These pesky collars! There is so much discussion around the Internet on the best way to remove the collars from around individual 7.5 minute quadrangles.

Let’s use Rasterio to clip / mask the USGS topo map collars:

  1. The USGS maintains the 7.5 Minute Map Indices: https://prd-tnm.s3.amazonaws.com/index.html?prefix=StagedProducts/MapIndices/National/GDB/
  2. Read the map indices geopackage with GeoPandas:
    
    import geopandas

    # Load the national 7.5 minute cell grid and reproject to Web Mercator
    national_7_5_minute_indices = geopandas.read_file(
        DATA_DIRECTORY / "MapIndices_National_GDB.gdb/",
        layer="CellGrid_7_5Minute")
    national_7_5_minute_indices = national_7_5_minute_indices.to_crs('EPSG:3857')
    
  3. For each GeoTiff get the centroid:
    
    from shapely.geometry import Point

    def get_geo_tiff_centroid_point(geo_tiff):
        transform = geo_tiff.transform
        width = geo_tiff.width
        height = geo_tiff.height
        middle_x = width // 2
        middle_y = height // 2
        # Convert the middle pixel indices to map coordinates (EPSG:3857)
        x, y = transform * (middle_x, middle_y)
        return Point(x, y)
    
  4. For each centroid of the GeoTiff, do a spatial join to find the corresponding quadrangle geometry, and mask the GeoTiff to it using Rasterio:
    
    import rasterio
    import rasterio.mask

    def mask_geo_tiff(geo_tiff_file_name):
        with rasterio.open(geo_tiff_file_name) as geo_tiff:
            centroid = get_geo_tiff_centroid_point(geo_tiff)
        centroidSeries = geopandas.GeoSeries([centroid], crs='EPSG:3857')

        centroidDataFrame = geopandas.GeoDataFrame({
            'geometry': centroidSeries
        }, crs='EPSG:3857')

        assert national_7_5_minute_indices.crs == centroidDataFrame.crs

        # Spatial join: find the 7.5 minute cell containing this GeoTiff's centroid
        indices_join = national_7_5_minute_indices.sjoin(
            centroidDataFrame,
            how="inner",
            predicate="contains"
        )
        shape = indices_join.geometry.iloc[0]

        filext = ".mask.tif"
        # https://rasterio.readthedocs.io/en/stable/topics/masking-by-shapefile.html
        with rasterio.open(geo_tiff_file_name) as src:
            out_image, out_transform = rasterio.mask.mask(src, [shape], nodata=255, crop=True)
            out_meta = src.meta

        out_meta.update({"driver": "GTiff",
                        "height": out_image.shape[1],
                        "width": out_image.shape[2],
                        "transform": out_transform})

        with rasterio.open(f"{geo_tiff_file_name}{filext}", "w", **out_meta, compress="JPEG", tiled=True, blockxsize=256, blockysize=256, photometric="YCBCR", JPEG_QUALITY=50) as dest:
            dest.write(out_image)
    

The key part of this process is Rasterio’s masking-by-shapefile support.
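To batch this over every GeoTiff, a minimal driver sketch (the directory and pattern are assumptions; it could be parallelized the same way as the gdalwarp loop earlier):

# A sketch that masks every converted GeoTiff and collects failures to retry.
import glob

failures = []
for tif in sorted(glob.glob("/Volumes/gstore/usgs-topos/*.tif")):
    try:
        mask_geo_tiff(tif)
    except Exception as exc:  # keep going; revisit stragglers later
        failures.append((tif, exc))

print(f"done, {len(failures)} failures")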

Here is an example of what the GeoTiffs look like with and without collars:

With Collars: USGS Topo Map With Collar
Without Collars: USGS Topo Map Without Collar

Results so far: we now have all of the 65,000+ USGS GeoTiffs reprojected, collars clipped, and ready to make some map tiles!

Creating Map Tiles

I did a lot of testing with different tools to see what performed best and created the best looking tiles.

A few things we need to deal with:

  1. The USGS topos are not contiguous, since coverage is US-only, and around the coastline especially there are quadrangles that the USGS does not provide. Therefore, we don’t want to write opaque black tiles in those places, because that would take up unnecessary space and it looks bad. WEBP transparency will help here.
  2. We will need a GDAL virtual dataset. However, building a GDAL virtual dataset from 65,000+ files won’t work with the standard command of gdalbuildvrt -addalpha all.vrt *.tif, because the shell expands the wildcard past its argument-length limit.
    1. Instead we need a script to build the file list for the virtual dataset. Create a file named build-vrt.sh and add the following to it (a single-pass Python alternative is sketched after this list):
      
rm -f allfiles.txt
      ls A*.tif >> allfiles.txt
      ls B*.tif >> allfiles.txt
      ls C*.tif >> allfiles.txt
      ls D*.tif >> allfiles.txt
      ls E*.tif >> allfiles.txt
      ls F*.tif >> allfiles.txt
      ls G*.tif >> allfiles.txt
      ls H*.tif >> allfiles.txt
      ls I*.tif >> allfiles.txt
      ls J*.tif >> allfiles.txt
      ls K*.tif >> allfiles.txt
      ls L*.tif >> allfiles.txt
      ls M*.tif >> allfiles.txt
      ls N*.tif >> allfiles.txt
      ls O*.tif >> allfiles.txt
      ls P*.tif >> allfiles.txt
      ls Q*.tif >> allfiles.txt
      ls R*.tif >> allfiles.txt
      ls S*.tif >> allfiles.txt
      ls T*.tif >> allfiles.txt
      ls U*.tif >> allfiles.txt
      ls V*.tif >> allfiles.txt
      ls W*.tif >> allfiles.txt
      ls X*.tif >> allfiles.txt
      ls Y*.tif >> allfiles.txt
      ls Z*.tif >> allfiles.txt
      
    2. Run build-vrt.sh
    3. Now we can run gdalbuildvrt -addalpha all.vrt -input_file_list allfiles.txt, which works with 65,000+ files.
    4. We want the -addalpha option so that areas with no data get an alpha band, since we are making WEBP tiles. You can read more about it in the docs.
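As an aside, the same allfiles.txt can be generated in a single pass instead of 26 ls invocations; a minimal Python sketch, run from the GeoTiff directory:

# A one-pass alternative to build-vrt.sh: write every .tif name to
# allfiles.txt for gdalbuildvrt's -input_file_list option.
import glob

with open("allfiles.txt", "w") as out:
    for tif in sorted(glob.glob("*.tif")):
        out.write(tif + "\n")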

Tiling Options:

I did a bunch of testing to find the best-performing option that gives us WEBP tiles. The FOSS4G 2023 video “Tiling big piles of raster data using open source software and MapTiler Engine” has a pretty good overview of the situation.

We want to stay open source so I’m going to try:

  1. gdal2tiles.py
  2. gdal2tiles_parallel.py
  3. gdal_translate & gdaladdo
  • ✅ First Option: gdal2tiles.py:

    
    gdal2tiles.py \
    --exclude \
    --tilesize=512 \
    --processes=12 \
    --tiledriver=WEBP \
    --resampling=bilinear \
    --zoom="0-15" \
    --xyz \
    all.vrt tiles
    
    • ⭐ This is the option I decided to go with for now, and it works pretty well.
    • It creates a directory of XYZ tiles.
    • It outputs some weird webp.aux.xml files.
      • I’m not sure what the heck these things are, but I figured it might hurt performance if I don’t need them and gdal2tiles is having to write them.
      • It seems like these are for ArcGIS or something? I dunno. You can read about them here.
      • I turned them off by running:
        
        export GDAL_PAM_ENABLED=NO
        
    • Since this outputs XYZ tiles, we need to package them up as .mbtiles:
      1. Install mb-util
      2. Add a metadata.json, used by mb-util, to the root of the tiles folder:
        
        {
            "name": "USGS 2024 Current Topo Maps",
            "description": "USGS Topo Maps",
            "version": "1",
            "minzoom": 0,
            "maxzoom": 15
        }
        
      3. Package .mbtiles:
        
        mb-util \
        --silent \
        --scheme=xyz \
        --image_format=webp \
        input_tile_folder out.mbtiles
        
  • 🚫 Second Option: gdal2tiles_parallel.py:

    • ❓ This seems to be an older community script, and I’m not sure whether it was ever merged into GDAL.
    • ❓ There are several variations and versions floating around on GitHub.
    • 🚀 This was the fastest I tried.
    • ⛔ Does not support WEBP.
  • 🚫 Third Option: gdal_translate & gdaladdo:

    • 🦥 This option is VERY VERY slow. Like sloth slow. I let it run overnight trying to process all the GeoTiffs and it never even reached 1% done.
    • ⛔ Does not appear to be multithreaded
    • ⛔ Does not support WEBP
    • ⛔ Requires running gdaladdo separately after gdal_translate, which takes additional processing time.
    • Sample Commands I tried:
      • gdal_translate --config GDAL_NUM_THREADS ALL_CPUS --config GDAL_CACHEMAX 25% -of MBTILES -co TILE_FORMAT=JPEG -co QUALITY=50 -co BLOCKSIZE=512 all.vrt output.mbtiles
      • gdaladdo -r average output-gdaladdo-average.mbtiles --config COMPRESS_OVERVIEW JPEG --config JPEG_QUALITY_OVERVIEW 50 --config PHOTOMETRIC_OVERVIEW YCBCR --config GDAL_NUM_THREADS ALL_CPUS --config GDAL_CACHEMAX 75% --config GDAL_TIFF_OVR_BLOCKSIZE 512

💯 So at this point, we have used mb-util to package the tiles into .mbtiles, from the XYZ tiles generated by gdal2tiles.py, which came from the GeoTiffs that we extracted, clipped, and reprojected from the GeoPDFs. Whew. 😅

The hard part is done.

Creating PMTiles

Using the Protomaps go-pmtiles tool, we can convert our mbtiles to pmtiles!

Let’s see how fast we can convert 300 GB of mbtiles:

% time ./pmtiles convert usgs-topo-2024.mbtiles usgs-topo-2024.pmtiles 
2024/02/29 22:41:06 convert.go:260: Pass 1: Assembling TileID set
2024/02/29 22:41:10 convert.go:291: Pass 2: writing tiles
100% |████████████████████████████████████████| (20332169/20332169, 5866 it/s)            
2024/02/29 23:38:56 convert.go:345: # of addressed tiles:  20332169
2024/02/29 23:38:56 convert.go:346: # of tile entries (after RLE):  19768407
2024/02/29 23:38:56 convert.go:347: # of tile contents:  19594193
2024/02/29 23:38:59 convert.go:362: Root dir bytes:  9385
2024/02/29 23:38:59 convert.go:363: Leaves dir bytes:  42486391
2024/02/29 23:38:59 convert.go:364: Num leaf dirs:  3501
2024/02/29 23:38:59 convert.go:365: Total dir bytes:  42495776
2024/02/29 23:38:59 convert.go:366: Average leaf dir bytes:  12135
2024/02/29 23:38:59 convert.go:367: Average bytes per addressed tile: 2.09
2024/02/29 23:41:09 convert.go:340: Finished in  1h0m3.023499209s
./pmtiles convert usgs-topo-2024.mbtiles usgs-topo-2024.pmtiles  
    572.24s user 611.80s system 32% cpu 1:00:03.89 total

⭐ Only 1 hour! Protomaps and pmtiles are just so good. Protomaps is totally going to change the mapping landscape long term!

This results in a 300 GB .pmtiles file!

Upload PMTiles to S3

Using the same go-pmtiles tool, we can upload our pmtiles to our S3 bucket:

export AWS_ACCESS_KEY_ID=GET_YOUR_OWN_KEY
export AWS_SECRET_ACCESS_KEY=GET_YOUR_OWN_KEY
export AWS_REGION=us-east-2
./pmtiles upload usgstopo.pmtiles usgstopo.pmtiles --bucket=s3://some-awesome-bucket

My home Internet upload speed is terrible, so I had my friend Jason 👍 upload it for me. So yes, I shipped him an encrypted SSD with the pmtiles on it 🤣

Results

This is what the resulting mosaic USGS topo map looks like zoomed out to include the continental US, Alaska, Hawaii, Puerto Rico, and Virgin Islands:

USGS Topo Mosaic Overview

I have also included this map in Topo Lock Maps!

Lessons Learned

The entire process in the end is: GeoPDFs > GeoTiffs > Masked GeoTiffs > .mbtiles > .pmtiles > great success

  1. 💪 This was my first time processing this much raster data on my local machine and it was a huge challenge… glad I pushed through because the map looks amazing!
  2. 💾 Always process large geospatial data on an external drive, preferably NVMe: there is a lot of reading and writing of files/tiles, and you get huge performance boosts from NVMe compared to a regular SATA SSD.
  3. 🚀 Ensure your NVMe drive uses a disk format appropriate for very large files. I went with APFS. See the X conversation with me and @shademap (linked in Prep Work) discussing drive formats… basically, choose your drive format wisely.
  4. 👌 Along the way I ran data quality checks to ensure I had every single quadrangle and that there were no holes from missing quadrangles… I wanted a seamless mosaic! (A sketch of this check follows this list.)
    1. I iterated over each of the GeoPDFs using Python & GeoPandas, got the centroid of each, and wrote the centroids to a dataframe.
    2. Using the entire 7.5 minute USGS grid geopackage and all the centroids from the GeoPDFs, I found any missing GeoPDFs.
    3. This proved useful because during the initial download of the GeoPDFs some downloads did not finish, even though I never saw any errors in my terminal.
  5. 😱 One night while I was creating mbtiles of the continental U.S., I filled up my entire internal macOS SSD because I accidentally created the tiles there rather than on my external NVMe. 😠 My Mac was unresponsive and I had to boot into safe mode. I couldn’t even rm -rf files because of how APFS works. Disk Utility said I had 3.6 MB free. Crap. I couldn’t even delete Time Machine snapshots. I tried to go down the path of deleting the recovery volume. But in the end, one night I had to just sit there in the macOS system restore terminal and copy the temporary tile data I had created over the last 24 hours off to my external NVMe. The copy took hours because of all the hundreds of millions of WEBP tiles 🤣. What a crazy event. Good thing I had Time Machine backups of my entire laptop for my normal day-to-day business data.
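Here is a sketch of the missing-quadrangle check from lesson 4. It reuses the CellGrid_7_5Minute index and the get_geo_tiff_centroid_point() helper from the Clip Collars section; the paths are assumptions, and it runs against the converted GeoTiffs rather than the raw GeoPDFs:

import glob

import geopandas
import rasterio

# The national 7.5 minute cell grid, reprojected to match the centroids.
grid = geopandas.read_file(
    "MapIndices_National_GDB.gdb/", layer="CellGrid_7_5Minute"
).to_crs("EPSG:3857")

# One centroid per converted GeoTiff.
points = []
for tif in glob.glob("/Volumes/gstore/usgs-topos/*.tif"):
    with rasterio.open(tif) as geo_tiff:
        points.append(get_geo_tiff_centroid_point(geo_tiff))

centroids = geopandas.GeoDataFrame(geometry=points, crs="EPSG:3857")

# Grid cells that contain no centroid are candidate holes in the mosaic.
joined = grid.sjoin(centroids, how="left", predicate="contains")
missing = joined[joined["index_right"].isna()]
print(f"{len(missing)} quadrangles have no matching GeoTiff")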

Why?

Some people ask: why self-host maps instead of just pointing directly to the USGS data and using something like STAC?

Mostly because it is fun and I’m crazy, but there are other reasons too.

The issue for me is that if I point Topo Lock Maps directly at someone else’s API that I don’t control, then I have to deal with situations like:

  1. What if the underlying data changes?
  2. What if the underlying data is deleted?
  3. How often is the data updated?
  4. What if the performance degrades?
  5. Do they support bulk downloads?
  6. Or other unknown unknowns

Kyle Barron played around with doing this with COGs.

Now if I’m paying for a service / API / data and there is a contract and licensing terms, then that is a different story, but in this situation it is public, free data with no guarantees.

So I can’t risk that.

In order to make the best possible experience in Topo Lock Maps I’m trying to self host so I have full control over the API, data, and privacy considerations.

Next Steps

I actually have all of this automated and running in Dagster, so I can keep the map updated in Topo Lock Maps and materialize a USGS topo pmtiles file with a single button click!
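I won’t go into the details here, but the rough shape is a chain of Dagster assets. A heavily simplified, illustrative sketch (the asset names and bodies are hypothetical, not the actual pipeline):

from dagster import Definitions, asset

@asset
def usgs_geopdfs():
    """Scrape the current GeoPDF quadrangles from the prd-tnm S3 bucket."""
    ...

@asset(deps=[usgs_geopdfs])
def clipped_geotiffs():
    """gdalwarp each GeoPDF to EPSG:3857, then clip collars with Rasterio."""
    ...

@asset(deps=[clipped_geotiffs])
def usgs_topo_pmtiles():
    """gdal2tiles -> mb-util -> pmtiles convert."""
    ...

defs = Definitions(assets=[usgs_geopdfs, clipped_geotiffs, usgs_topo_pmtiles])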

So maybe I’ll write about data engineering and orchestration on down the road.

🤝 Huge thanks to the USGS for making such great maps!

🤝 Thanks to Seth for hardware knowledge.

🤝 Thanks to Jason for his Internet.

🤝 Thanks to protomaps for being freaking awesome.


“Do not pray for an easy life, pray for the strength to endure a difficult one.” -Bruce Lee