S3 "shovel"

What is this?

If you find yourself needing to get files from S3 onto your Greengrass device, feel free to use "the shovel".

Notes

This page contains a Greengrass Lambda function that grabs a tar.gz or ZIP file from S3 and copies its extracted contents onto the host.

It uses the Greengrass ML Inference feature to accomplish this, but it has nothing to do with machine learning. Greengrass ML Inference was designed to deliver large assets, usually related to machine learning, to Greengrass groups in a consistent way.

For reference, the function and resource definitions I used while testing are at the bottom of this document.

What do you need to do?

  • tar.gz compress or ZIP your asset
  • Upload the asset to S3
  • Give the Greengrass service role (not the Greengrass group role) permission to access the asset in S3
    • For development purposes you can temporarily grant it read-only access to all S3 assets in your account using the managed policy with ARN arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess (a Python sketch of these first three steps follows this list)
  • Create a Greengrass resource definition with two entries:
    • The destination entry - a local volume resource mapped into the container with read-write permissions; the Lambda function copies the downloaded assets into it
    • The source entry - the S3 URI of the asset plus the path it should be extracted to inside the Lambda function's container
  • Create a Greengrass function definition with these settings:
    • Long running/pinned, so it runs immediately
    • References both of the resource definition entries above
    • Sets the environment variable AWS_IOT_THING_NAME to the thing name of the Greengrass Core thing
    • Sets the environment variable INPUT_PATH to the S3 resource's destination path. For example, if the S3 resource's destination path is /tmp/extract then INPUT_PATH must be /tmp/extract
    • Sets the environment variable OUTPUT_PATH to the volume resource's destination path with an additional directory name appended to it. For example, if the volume resource's destination path is /roottmp/s3shovel the OUTPUT_PATH could be /roottmp/s3shovel/extracted. The volume resource's destination path and the OUTPUT_PATH cannot be the same.
  • Make sure that ggc_user has the correct permissions to create the output directory on the host. The output directory is the volume resource's source path plus the additional directory appended to OUTPUT_PATH. I used /tmp as my source path, so ggc_user needs the correct permissions to create /tmp/extracted on the host.
  • Use the shovel code below
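
If you want to script the first three steps, the sketch below shows one way to do it with boto3. The bucket name, asset paths, and service role name are hypothetical placeholders for this example; substitute your own values.

# Hypothetical packaging/upload/permissions sketch - the bucket, paths, and
# role name below are placeholders, not values from this project
import shutil

import boto3

BUCKET = 'my-asset-bucket'             # placeholder bucket name
ROLE_NAME = 'MyGreengrassServiceRole'  # placeholder Greengrass service role name

# Step 1: tar.gz compress the asset directory
archive = shutil.make_archive('my-asset', 'gztar', root_dir='path/to/asset')

# Step 2: upload the archive to S3
boto3.client('s3').upload_file(archive, BUCKET, 'my-asset.tar.gz')

# Step 3 (development only): attach the managed read-only S3 policy to the
# Greengrass service role (not the Greengrass group role)
boto3.client('iam').attach_role_policy(
    RoleName=ROLE_NAME,
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess'
)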

S3ShovelPython3.py

# ML Inference downloads a file from S3, unzips it to a known location, and this function copies those files to the
# host using a local volume resource

import json
import logging
import os
import shutil
from pathlib import Path

import greengrasssdk

# Creating a greengrass core sdk client
client = greengrasssdk.client('iot-data')

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
streamHandler = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
streamHandler.setFormatter(formatter)
logger.addHandler(streamHandler)

THING_NAME = os.environ.get('AWS_IOT_THING_NAME')
INPUT_PATH = os.environ.get('INPUT_PATH')
OUTPUT_PATH = os.environ.get('OUTPUT_PATH')

# os.environ.get returns None instead of raising KeyError, so these checks
# can actually run when a variable is missing
if THING_NAME is None:
    raise RuntimeError("You must fill in the AWS_IOT_THING_NAME environment variable")

if INPUT_PATH is None:
    raise RuntimeError("You must fill in the INPUT_PATH environment variable")

if OUTPUT_PATH is None:
    raise RuntimeError("You must fill in the OUTPUT_PATH environment variable")

# Status messages are published to this topic, and an empty completed.txt in
# the output directory signals host-side consumers that the copy is finished
TOPIC = THING_NAME + '/python3/s3/shovel'
COMPLETED_FILE = OUTPUT_PATH + '/completed.txt'

payload = {}


def log(message):
    payload["message"] = message
    client.publish(topic=TOPIC, payload=json.dumps(payload))


def copy_directory(src, dest):
    shutil.rmtree(dest, ignore_errors=True)

    try:
        # Remove the flag file first, so if the copy fails users know the
        # output is not ready yet
        os.remove(COMPLETED_FILE)
    except OSError:
        # The file may not exist yet, ignore this
        pass

    try:
        log("Src: " + src)
        log("Dest: " + dest)

        shutil.copytree(src, dest)
        return True
    except shutil.Error as e:
        # Raised when errors occur while copying individual files in the tree
        log("shutil error: Directory not copied. Error: %s" % e)

    except OSError as e:
        # Raised when the source or destination directory can't be accessed
        log("oserror: Directory not copied. Error: %s" % e)

    return False


def shovel():
    log("Copying")

    if copy_directory(INPUT_PATH, OUTPUT_PATH):
        # Create the flag file only if the copy actually succeeded
        Path(COMPLETED_FILE).touch()

        log("Copied")
        log("output listdir: " + json.dumps(os.listdir(OUTPUT_PATH)))


# Start shoveling!
shovel()


# This is a dummy handler and will not be invoked
# Instead, because the function is pinned, the code above runs once when the container starts
def function_handler(event, context):
    return
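
On the host side, consumers should wait for the completed.txt flag file before reading the extracted files. A minimal sketch, assuming the host output directory is /tmp/extracted as in the example configuration (volume resource source path /tmp plus the extracted directory):

# Host-side sketch: wait for the shovel's flag file before touching the data
import time
from pathlib import Path

output_dir = Path('/tmp/extracted')
flag_file = output_dir / 'completed.txt'

# Poll until the shovel signals that the copy finished
while not flag_file.exists():
    time.sleep(1)

print('Asset ready:', sorted(p.name for p in output_dir.iterdir()))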

Reference files

Test function definition

{
  "defaultConfig": {
    "execution": {
      "isolationMode": "GreengrassContainer"
    }
  },
  "functions": [
    {
      "functionArn": "arn:aws:lambda:::function:GGIPDetector:1",
      "functionConfiguration": {
        "memorySize": 32768,
        "pinned": true,
        "timeout": 3
      },
      "id": "009179b1-4e84-4cf7-8278-1f20f0518be6"
    },
    {
      "functionArn": "arn:aws:lambda:us-east-1:541589084637:function:pi-S3ShovelPython3:PROD",
      "functionConfiguration": {
        "encodingType": "json",
        "environment": {
          "accessSysfs": false,
          "execution": {
            "isolationMode": "GreengrassContainer"
          },
          "resourceAccessPolicies": [
            {
              "permission": "rw",
              "resourceId": "tmp--roottmp-s3shovel"
            },
            {
              "permission": "rw",
              "resourceId": "tmp-extract"
            }
          ],
          "variables": {
            "AWS_IOT_THING_NAME": "pi_Core",
            "OUTPUT_PATH": "/roottmp/s3shovel/extracted",
            "AWS_IOT_THING_ARN": "arn:aws:iot:us-east-1:541589084637:thing/pi_Core",
            "LOCAL_LAMBDA_S3ShovelPython3": "arn:aws:lambda:us-east-1:541589084637:function:pi-S3ShovelPython3:PROD",
            "AWS_GREENGRASS_GROUP_NAME": "pi",
            "INPUT_PATH": "/tmp/extract",
            "GROUP_ID": "ee4bffa1-37cc-41c2-a568-132bf6f4d77c"
          }
        },
        "memorySize": 131072,
        "pinned": true,
        "timeout": 60
      },
      "id": "23fc3f54-8ceb-4166-81ca-9dea043acc52"
    }
  ]
}

Test resource definition

{
  "value": {
    "resources": [
      {
        "id": "tmp--roottmp-s3shovel",
        "name": "tmp--roottmp-s3shovel",
        "resourceDataContainer": {
          "localVolumeResourceData": {
            "destinationPath": "/roottmp/s3shovel",
            "groupOwnerSetting": {
              "autoAddGroupOwner": true
            },
            "sourcePath": "/tmp"
          }
        }
      },
      {
        "id": "tmp-extract",
        "name": "tmp-extract",
        "resourceDataContainer": {
          "s3MachineLearningModelResourceData": {
            "destinationPath": "/tmp/extract",
            "s3Uri": "s3://timmatt-big-file-test/00016.MTS.zip"
          }
        }
      }
    ]
  }
}
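
If you prefer to create these definitions programmatically, here is a sketch of registering them with the Greengrass (V1) boto3 API. The structures mirror the reference JSON above, trimmed to the relevant fields; the definition names are made up for this example.

import boto3

gg = boto3.client('greengrass')

# Register the local volume and S3 (ML Inference) resources
gg.create_resource_definition(
    Name='s3-shovel-resources',  # hypothetical name
    InitialVersion={
        'Resources': [
            {
                'Id': 'tmp--roottmp-s3shovel',
                'Name': 'tmp--roottmp-s3shovel',
                'ResourceDataContainer': {
                    'LocalVolumeResourceData': {
                        'SourcePath': '/tmp',
                        'DestinationPath': '/roottmp/s3shovel',
                        'GroupOwnerSetting': {'AutoAddGroupOwner': True}
                    }
                }
            },
            {
                'Id': 'tmp-extract',
                'Name': 'tmp-extract',
                'ResourceDataContainer': {
                    'S3MachineLearningModelResourceData': {
                        'DestinationPath': '/tmp/extract',
                        'S3Uri': 's3://timmatt-big-file-test/00016.MTS.zip'
                    }
                }
            }
        ]
    }
)

# Register the shovel function with its resource access and environment
gg.create_function_definition(
    Name='s3-shovel-functions',  # hypothetical name
    InitialVersion={
        'Functions': [
            {
                'Id': 's3-shovel',
                'FunctionArn': 'arn:aws:lambda:us-east-1:541589084637:function:pi-S3ShovelPython3:PROD',
                'FunctionConfiguration': {
                    'Pinned': True,
                    'MemorySize': 131072,
                    'Timeout': 60,
                    'Environment': {
                        'ResourceAccessPolicies': [
                            {'ResourceId': 'tmp--roottmp-s3shovel', 'Permission': 'rw'},
                            {'ResourceId': 'tmp-extract', 'Permission': 'rw'}
                        ],
                        'Variables': {
                            'AWS_IOT_THING_NAME': 'pi_Core',
                            'INPUT_PATH': '/tmp/extract',
                            'OUTPUT_PATH': '/roottmp/s3shovel/extracted'
                        }
                    }
                }
            }
        ]
    }
)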