Advanced S3 Operations & FAQ
1. Pre-signed URL Sharing
Generate a pre-signed URL for an S3 object. This allows anyone who receives the pre-signed URL to retrieve the S3 object with an HTTP GET request.
- A pre-signed URL expires at a set date/time, or after a default of 1 hour if no expiry is specified.
- It is not possible to create a one-time-use pre-signed URL.
- It is not possible to invalidate a pre-signed URL directly.
- However, pre-signing uses the Access Key of the user profile performing the signing. If permissions are removed from the user linked to that Access Key, the pre-signed URL will no longer function.
Configuration
For our environment you must configure your AWS CLI profile to use S3 Signature Version 4 signing and ensure a region is specified.
You may use a named profile, or the 'default' profile.
$ aws configure set profile.YourProfileName.s3.signature_version s3v4
$ aws configure set profile.YourProfileName.region us-east-1
# Or replace "profile.YourProfileName." with "default." if you only use the default profile.
# Verify your aws profile is configured correctly:
$ aws configure list --profile YourProfileName
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile          YourProfileName           manual    --profile
access_key     ****************DNQ8  shared-credentials-file
secret_key     ****************m+p9  shared-credentials-file
    region                us-east-1      config-file    ~/.aws/config
2. Parallel S3 Operations
- You can try a parallel wrapper around the S3 CLI / boto, such as s3-parallel-put (https://github.com/mishudark/s3-parallel-put), though note that you may need to adjust max_concurrent_requests.
- Using --exclude/--include filters with aws s3 cp, and running several commands at once, can help parallelize transfers if directory or file names fit patterns well.
3. Upload Object With Custom User Metadata
Add an object with custom user metadata during cp, mv, and sync (AWS CLI version >= 1.9.10).
- After you upload the object, you cannot modify object metadata. The only way to modify object metadata is to make a copy of the object and set the metadata.
(awscli) pae@koolaid:~$ aws s3 cp NVIDIA-Linux-x86_64-340.98.run s3://cades-s3user-pae --metadata '{"x-amz-meta-cms-id":"juicyfruit"}' --profile rda_eby
Retrieve metadata for an object (key):
(awscli) pae@koolaid:~$ aws s3api head-object --bucket cades-s3user-pae --key NVIDIA-Linux-x86_64-340.98.run --profile rda_eby
{
    "LastModified": "Tue, 21 Aug 2018 19:31:14 GMT",
    "ContentLength": 69984344,
    "ETag": "\"cfbe7baeaeae7bea413754ace19891ce-9\"",
    "Metadata": {
        "x-amz-meta-cms-id": "juicyfruit"
    }
}
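Since metadata cannot be edited in place, the copy-based update mentioned above looks roughly like this in boto3 (names reuse the example above; MetadataDirective='REPLACE' is required, and the live call is commented out):

```python
# Replacing metadata means copying the object onto itself with new metadata.
copy_args = dict(
    Bucket='cades-s3user-pae',
    Key='NVIDIA-Linux-x86_64-340.98.run',
    CopySource={'Bucket': 'cades-s3user-pae',
                'Key': 'NVIDIA-Linux-x86_64-340.98.run'},
    Metadata={'x-amz-meta-cms-id': 'juicyfruit'},
    MetadataDirective='REPLACE',  # without this, S3 keeps the old metadata
)

# import boto3
# boto3.client('s3').copy_object(**copy_args)
```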
4. Encrypt Objects During Upload
To be documented. See https://sixfeetup.com/blog/hidden-features-via-aws-cli
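In the meantime, one hedged sketch: server-side encryption with S3-managed keys (SSE-S3) can be requested per object at upload time. Bucket and key below are hypothetical, and the live call is commented out:

```python
# Request server-side encryption with S3-managed keys (SSE-S3) at upload.
put_args = dict(
    Bucket='my-bucket',   # hypothetical bucket
    Key='secret.dat',
    Body=b'example payload',
    ServerSideEncryption='AES256',
)

# import boto3
# boto3.client('s3').put_object(**put_args)
```

The equivalent CLI flag is --sse AES256 on aws s3 cp.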
5. Writing Data Directly From an Application
If your application would be sped up by skipping writing data to disk and instead writing directly to S3, this is possible in many programming languages. Of particular relevance to scientific computing are the C++, Python, and Java SDKs; for a complete list, see the AWS Tools page (https://aws.amazon.com/tools/). We have tested the Python interface and found it to be highly performant. The example Python script below puts the contents of the data string into an object called 'test.txt'. This works for serializable objects.
#!/usr/bin/env python
import boto3

# Uses credentials and region from your configured AWS profile.
s3 = boto3.resource('s3')
data = 'This is some test data in a string for S3'
# Write the string directly to the object 'test.txt' -- no local file needed.
s3.Bucket('cades-8d73a078-94c6-4a73-a668-345fc6ee8618').put_object(Key='test.txt', Body=data)
6. Utilization Query
Requires network access to the S3 API endpoint and port, and IAM permissions to list the bucket.
$ aws s3api list-objects --bucket mydata --output json --query "[sum(Contents[].Size), length(Contents[])]"
[
98620405600492,
109899
]
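The underlying API returns at most 1000 keys per request; the CLI paginates automatically, and boto3 exposes the same behavior through paginators. A sketch of computing the same totals (bucket name hypothetical; a client can be injected for testing):

```python
def bucket_usage(bucket, s3=None):
    """Return (total_bytes, object_count) for a bucket, paginating over all keys."""
    if s3 is None:
        import boto3  # assumes boto3 is installed and credentials are configured
        s3 = boto3.client('s3')
    total_bytes, object_count = 0, 0
    for page in s3.get_paginator('list_objects_v2').paginate(Bucket=bucket):
        for obj in page.get('Contents', []):
            total_bytes += obj['Size']
            object_count += 1
    return total_bytes, object_count

# print(bucket_usage('mydata'))
```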
See also: https://www.exratione.com/2016/11/analyzing-the-contents-of-very-large-s3-buckets/
FAQ
awscli_plugin_endpoint
If you see the error:
ModuleNotFoundError: No module named 'awscli_plugin_endpoint'
the Python environment that provides the plugin is not loaded. You need to run the following command:
module load python
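For context, the awscli_plugin_endpoint package is what the AWS CLI loads when your ~/.aws/config contains a plugin stanza like the one below (the profile name and endpoint URL here are placeholders); module load python provides the environment where the package is installed:

```ini
[plugins]
endpoint = awscli_plugin_endpoint

# Placeholder profile and endpoint URL for illustration.
[profile YourProfileName]
s3 =
    endpoint_url = https://s3.example.com
```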
Additional Developer Resources
Hortonworks Cloud Data Access book.