CloudQuery is an open-source data integration platform that allows you to export data from any source to any destination.
The CloudQuery AWS plugin allows you to sync data from AWS to any destination, including S3. It's free, open source, requires no account, and takes only minutes to get started.
Ready? Let's dive right in!
Step 1. Install the CloudQuery CLI
The CloudQuery CLI is a command-line tool that runs the sync. It supports macOS, Linux, and Windows.
```shell
brew install cloudquery/tap/cloudquery
```
Step 2. Configure the AWS source plugin
Create a configuration file for the AWS plugin and set up authentication.
Configuration
Create a file called aws.yaml and add the following contents:
Fine-tune this configuration to match your needs. For more information, see the AWS Plugin page in the docs.
Authentication
Step 3. Configure the S3 destination plugin
Create a configuration file for the S3 plugin and set up authentication.
Configuration
Create a file called s3.yaml and add the following contents:
Fine-tune this configuration to match your needs. For more information, see the S3 Plugin page in the docs.
Authentication
Step 4. Start the Sync
Run the following command in your terminal to start the sync:
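Assuming the configuration files from Steps 2 and 3 are named aws.yaml and s3.yaml (pass your own filenames if they differ):

```shell
cloudquery sync aws.yaml s3.yaml
```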
And away we go! 🚀 The sync will run until completion, fetching all selected tables from AWS. Any errors will be logged to a file called cloudquery.log.
Further Reading
Now that you've seen the basics of syncing AWS to S3, you should know that there's a lot more you can do. Check out the CloudQuery Documentation, Source Code and How-to Guides for more details.
This example uses the parquet format to create parquet files in s3://bucket_name/path/to/files, with each table placed in its own directory.
```yaml
kind: destination
spec:
  name: "s3"
  path: "cloudquery/s3"
  version: "v4.8.0"
  spec:
    bucket: "bucket_name"
    region: "region-name" # Example: us-east-1
    path: "path/to/files/{{TABLE}}/{{UUID}}.parquet"
    format: "parquet" # options: parquet, json, csv
    format_spec:
      # CSV-specific parameters:
      # delimiter: ","
      # skip_header: false
    # Optional parameters
    # compression: "" # options: gzip
    # no_rotate: false
    # athena: false # <- set this to true for Athena compatibility
    # test_write: true # tests the ability to write to the bucket before processing the data
    # endpoint: "" # Endpoint to use for S3 API calls.
    # endpoint_skip_tls_verify # Disable TLS verification if using an untrusted certificate
    # use_path_style: false
    # batch_size: 10000 # 10K entries
    # batch_size_bytes: 52428800 # 50 MiB
    # batch_timeout: 30s # 30 seconds
```
It is also possible to use {{YEAR}}, {{MONTH}}, {{DAY}} and {{HOUR}} in the path to create a directory structure based on the current time. For example:
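An illustrative path value (a fragment of the destination spec above) that partitions files by date and hour:

```yaml
path: "path/to/files/{{TABLE}}/{{YEAR}}/{{MONTH}}/{{DAY}}/{{HOUR}}/{{UUID}}.parquet"
```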
Note that the S3 plugin only supports append write-mode. The (top level) spec section is described in the Destination Spec Reference.
The plugin needs to be authenticated with your account(s) in order to sync information from your cloud setup.
The plugin requires only PutObject permissions (we will never make any changes to your cloud setup), so, following the principle of least privilege, it's recommended to grant it only those permissions.
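A minimal sketch of such an IAM policy, assuming the bucket name and path from the example spec (substitute your own values):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::bucket_name/path/to/files/*"
    }
  ]
}
```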
There are multiple ways to authenticate with AWS, and the plugin respects the AWS credential provider chain. This means CloudQuery attempts to authenticate in the following order of precedence:
The AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN environment variables.
The credentials and config files in ~/.aws (the credentials file takes priority).
CloudQuery can use the credentials from the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables (AWS_SESSION_TOKEN is optional for some accounts). For information on obtaining credentials, see the AWS guide.
To export the environment variables (on Linux/macOS; similar for Windows):
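With placeholder values (substitute your own credentials):

```shell
export AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY>
export AWS_SESSION_TOKEN=<YOUR_SESSION_TOKEN>
```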
The plugin can use credentials from your credentials and config files in the .aws directory in your home folder.
The contents of these files are practically interchangeable, but CloudQuery will prioritize credentials in the credentials file.
Then, export the AWS_PROFILE environment variable (on Linux/macOS; similar for Windows):
```shell
export AWS_PROFILE=myprofile
```
IAM Roles for AWS Compute Resources
The plugin can use IAM roles for AWS compute resources (including EC2 instances, Fargate and ECS containers).
If you configured your AWS compute resources with IAM, the plugin will use these roles automatically.
For more information on configuring IAM, see the AWS documentation.
User Credentials with MFA
To use IAM User credentials with MFA, call the STS get-session-token API with the IAM User's long-term security credentials (access key and secret access key). For more information, see the AWS STS documentation.
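A sketch of that flow using the AWS CLI; the account ID, user name, and token code below are placeholders:

```shell
aws sts get-session-token \
  --serial-number arn:aws:iam::123456789012:mfa/your-user \
  --token-code 123456 \
  --duration-seconds 3600
```

The command returns temporary credentials, which you can then export via the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables described above.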
If you are using a custom S3 endpoint, you can specify it using the endpoint spec option. If you're using authentication, the region option in the spec determines the signing region used.
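For example, a hypothetical inner spec fragment pointing at a local MinIO server (endpoint and region values are illustrative; S3-compatible servers typically also need path-style addressing):

```yaml
spec:
  bucket: "bucket_name"
  region: "us-east-1" # used as the signing region
  endpoint: "http://localhost:9000"
  use_path_style: true
```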
```yaml
kind: source
spec:
  # Source spec section
  name: aws
  path: cloudquery/aws
  version: "v22.15.2"
  tables: ["aws_ec2_instances"]
  destinations: ["s3"]
  spec:
    # Optional parameters
    # regions: []
    # accounts: []
    # org: nil
    # concurrency: 50000
    # initialization_concurrency: 4
    # aws_debug: false
    # max_retries: 10
    # max_backoff: 30
    # custom_endpoint_url: ""
    # custom_endpoint_hostname_immutable: nil # required when custom_endpoint_url is set
    # custom_endpoint_partition_id: "" # required when custom_endpoint_url is set
    # custom_endpoint_signing_region: "" # required when custom_endpoint_url is set
    # use_paid_apis: false
    # table_options: nil
    # scheduler: dfs # options are: dfs, round-robin or shuffle
```
The plugin needs to be authenticated with your account(s) in order to sync information from your cloud setup.
The plugin requires only read permissions (we will never make any changes to your cloud setup), so, following the principle of least privilege, it's recommended to grant it read-only permissions.
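One way to do this is to attach the AWS-managed ReadOnlyAccess policy to the role the plugin authenticates as; the role name below is a placeholder:

```shell
aws iam attach-role-policy \
  --role-name CloudQueryReadOnly \
  --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess
```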
There are multiple ways to authenticate with AWS, and the plugin respects the AWS credential provider chain. This means CloudQuery attempts to authenticate in the following order of precedence:
The AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN environment variables.
The credentials and config files in ~/.aws (the credentials file takes priority).
CloudQuery can use the credentials from the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables (AWS_SESSION_TOKEN is optional for some accounts). For information on obtaining credentials, see the AWS guide.
To export the environment variables (on Linux/macOS; similar for Windows):
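With placeholder values (substitute your own credentials):

```shell
export AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY>
export AWS_SESSION_TOKEN=<YOUR_SESSION_TOKEN>
```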
The plugin can use credentials from your credentials and config files in the .aws directory in your home folder.
The contents of these files are practically interchangeable, but CloudQuery will prioritize credentials in the credentials file.
The plugin can use IAM roles for AWS compute resources (including EC2 instances, Fargate and ECS containers).
If you configured your AWS compute resources with IAM, the plugin will use these roles automatically.
For more information on configuring IAM, see the AWS documentation.
User Credentials with MFA
To use IAM User credentials with MFA, call the STS get-session-token API with the IAM User's long-term security credentials (access key and secret access key), as shown in the MFA section above. For more information, see the AWS STS documentation.