Advanced S3 Operations & FAQ
1. Pre-signed URL Sharing
Generate a pre-signed URL for an S3 object. This allows anyone who receives the pre-signed URL to retrieve the S3 object with an HTTP GET request.
- A pre-signed URL expires at a set date/time, or after a default of 1 hour if no expiry is specified.
- It is not possible to create a one-time-use pre-signed URL.
- It is not possible to invalidate a pre-signed URL directly.
- However, pre-signing uses the Access Key of the user profile performing the signing. If permissions are removed from the user linked to that Access Key, the pre-signed URL will no longer function.
Configuration
For our environment you must configure your AWS CLI profile to use S3 Signature Version 4 signing and ensure a region is specified.
You may use a named profile, or the 'default' profile.
$ aws configure set profile.YourProfileName.s3.signature_version s3v4
$ aws configure set profile.YourProfileName.region us-east-1
# Or replace "profile.YourProfileName." with "default." if you only use the default profile.
# Verify your aws profile is configured correctly:
$ aws configure list --profile YourProfileName
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile          YourProfileName           manual    --profile
access_key     ****************DNQ8  shared-credentials-file
secret_key     ****************m+p9  shared-credentials-file
    region                us-east-1      config-file    ~/.aws/config
2. Parallel S3 Operations
- You can try a parallel wrapper around the S3 CLI / boto, such as s3-parallel-put (https://github.com/mishudark/s3-parallel-put), though note that you may need to adjust max_concurrent_requests.
- Using --exclude/--include filters with aws s3 cp, and running several commands at once, can help parallelize transfers if directory or file names fit patterns well.
3. Upload Object With Custom User Metadata
Add an object with custom user metadata during cp, mv, and sync (AWS CLI version >= 1.9.10).
- After you upload the object, you cannot modify object metadata. The only way to modify object metadata is to make a copy of the object and set the metadata.
(awscli) pae@koolaid:~$ aws s3 cp NVIDIA-Linux-x86_64-340.98.run s3://cades-s3user-pae --metadata '{"x-amz-meta-cms-id":"juicyfruit"}' --profile rda_eby
Retrieve metadata for an object (key):
(awscli) pae@koolaid:~$ aws s3api head-object --bucket cades-s3user-pae --key NVIDIA-Linux-x86_64-340.98.run --profile rda_eby
{
    "LastModified": "Tue, 21 Aug 2018 19:31:14 GMT",
    "ContentLength": 69984344,
    "ETag": "\"cfbe7baeaeae7bea413754ace19891ce-9\"",
    "Metadata": {
        "x-amz-meta-cms-id": "juicyfruit"
    }
}
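Since metadata cannot be edited in place, the copy-based update mentioned above looks roughly like this in boto3 (names reuse the example above; MetadataDirective='REPLACE' is required, and the live call is commented out):

```python
# Replacing metadata means copying the object onto itself with new metadata.
copy_args = dict(
    Bucket='cades-s3user-pae',
    Key='NVIDIA-Linux-x86_64-340.98.run',
    CopySource={'Bucket': 'cades-s3user-pae',
                'Key': 'NVIDIA-Linux-x86_64-340.98.run'},
    Metadata={'x-amz-meta-cms-id': 'juicyfruit'},
    MetadataDirective='REPLACE',  # without this, S3 keeps the old metadata
)

# import boto3
# boto3.client('s3').copy_object(**copy_args)
```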
4. Encrypt Objects During Upload
To be documented. See https://sixfeetup.com/blog/hidden-features-via-aws-cli
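In the meantime, one hedged sketch: server-side encryption with S3-managed keys (SSE-S3) can be requested per object at upload time. Bucket and key below are hypothetical, and the live call is commented out:

```python
# Request server-side encryption with S3-managed keys (SSE-S3) at upload.
put_args = dict(
    Bucket='my-bucket',   # hypothetical bucket
    Key='secret.dat',
    Body=b'example payload',
    ServerSideEncryption='AES256',
)

# import boto3
# boto3.client('s3').put_object(**put_args)
```

The equivalent CLI flag is --sse AES256 on aws s3 cp.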
5. Writing Data Directly From an Application
If your application would be sped up by skipping writing data to disk and instead writing directly to S3, this is possible in many programming languages. Of particular relevance to scientific computing are the C++, Python, and Java SDKs; for a complete list, see the AWS Tools page (https://aws.amazon.com/tools/). We have tested the Python interface and found it to be highly performant. The example Python script below puts the contents of the data string into an object called 'test.txt'. This works for serializable objects.
#!/usr/bin/env python
import boto3

# Uses credentials and region from your configured AWS profile.
s3 = boto3.resource('s3')
data = 'This is some test data in a string for S3'
# Write the string directly to the object 'test.txt' -- no local file needed.
s3.Bucket('cades-8d73a078-94c6-4a73-a668-345fc6ee8618').put_object(Key='test.txt', Body=data)
6. Utilization Query
Requires network access to the S3 API endpoint and port, and IAM permissions to list the bucket.
$ aws s3api list-objects --bucket mydata --output json --query "[sum(Contents[].Size), length(Contents[])]"
[
98620405600492,
109899
]
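The underlying API returns at most 1000 keys per request; the CLI paginates automatically, and boto3 exposes the same behavior through paginators. A sketch of computing the same totals (bucket name hypothetical; a client can be injected for testing):

```python
def bucket_usage(bucket, s3=None):
    """Return (total_bytes, object_count) for a bucket, paginating over all keys."""
    if s3 is None:
        import boto3  # assumes boto3 is installed and credentials are configured
        s3 = boto3.client('s3')
    total_bytes, object_count = 0, 0
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket):
        for obj in page.get('Contents', []):
            total_bytes += obj['Size']
            object_count += 1
    return total_bytes, object_count

# print(bucket_usage('mydata'))
```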
See also: https://www.exratione.com/2016/11/analyzing-the-contents-of-very-large-s3-buckets/
FAQ
awscli_plugin_endpoint
If you see the error:
ModuleNotFoundError: No module named 'awscli_plugin_endpoint'
the Python environment that provides the plugin is not loaded. You need to run the following command:
module load python
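For context, the awscli_plugin_endpoint package is what the AWS CLI loads when your ~/.aws/config contains a plugin stanza like the one below (the profile name and endpoint URL here are placeholders); module load python provides the environment where the package is installed:

```ini
[plugins]
endpoint = awscli_plugin_endpoint

# Placeholder profile and endpoint URL for illustration.
[profile YourProfileName]
s3 =
    endpoint_url = https://s3.example.com
```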
Additional Developer Resources
Hortonworks Cloud Data Access book.