AWS introduces S3 Tables, a new bucket type for data analytics

One of the most significant API changes since S3 was launched, AWS VP tells us

Re:Invent There are two important changes to AWS's ubiquitous S3 storage service arriving soon: first, a new Table bucket type aimed at data analytics; and second, a previewed metadata feature that uses S3 Tables to enable fast querying of S3 data.

At the Re:Invent conference under way in Las Vegas, Andy Warfield, VP and distinguished engineer, told The Reg that "S3 is 18 years old this year ... we are launching the two most significant API level changes in the almost two decades that S3 has run."

S3 stores data in buckets, where each bucket can store an unlimited number of binary objects. Until now there have been just two bucket types: the original general purpose bucket; and the directory bucket associated with S3 Express, introduced at re:Invent in 2023, which has better performance and supports hierarchical storage.

Creating an S3 Table bucket

The new bucket type is S3 Table, for storing data in Apache Iceberg format. Iceberg is an open table format (OTF) for analytics data, with richer features than Parquet alone. Parquet is a columnar file format used by Hadoop and by many data processing frameworks.

Parquet and Iceberg are already widely used on S3, so why a new bucket type?

Warfield said the popularity of Parquet in S3 was the rationale for S3 Tables. "We actually serve about 15 million requests per second to Parquet tables," he told us, but there is a maintenance burden.

Internally, he said, "the structure of them is a lot like git, a ledger of changes, and the mutations get added as snapshots. Even with a relatively low rate of updates into your OTF you can quickly end up with hundreds of thousands of objects under your table."

The consequence is poor performance.

"In the OTF world it was anticipated that this would happen, but it was left to the customer to do the table maintenance tasks," Warfield said. The Iceberg project includes code to expire snapshots and clean up metadata, but it is still necessary "to go and schedule and run those Spark jobs."

Apache Spark is an engine for large-scale data processing that includes SQL support. Parquet on S3 was "a storage system on top of a storage system," said Warfield, making it sub-optimal.
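To make the maintenance burden concrete, here is a toy Python model of the pattern Warfield describes: every small commit leaves behind a data object and a snapshot object, and only explicit compaction and snapshot expiry bring the count back down. It is a sketch, not the Iceberg API; the `ToyTable` class and its method names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical stand-in for an open table format table (not Iceberg itself)."""
    data_files: list = field(default_factory=list)  # one entry per stored data object
    snapshots: list = field(default_factory=list)   # one entry per commit

    def commit(self, rows):
        # Each write adds a new data object and a new snapshot object.
        self.data_files.append(list(rows))
        self.snapshots.append(len(self.snapshots) + 1)

    def object_count(self):
        return len(self.data_files) + len(self.snapshots)

    def compact(self):
        # Merge the many small data objects into a single larger one.
        merged = [row for data_file in self.data_files for row in data_file]
        self.data_files = [merged]

    def expire_snapshots(self, keep_last=1):
        # Drop all but the most recent snapshots.
        if keep_last < 1:
            raise ValueError("keep_last must be at least 1")
        self.snapshots = self.snapshots[-keep_last:]


table = ToyTable()
for i in range(1000):
    table.commit([i])            # 1,000 single-row writes

before = table.object_count()    # objects accumulated by the writes
table.compact()
table.expire_snapshots(keep_last=1)
after = table.object_count()     # objects left after maintenance
print(before, after)
```

In this model the maintenance calls are the part a customer previously had to schedule themselves; with S3 Tables, AWS says the equivalent work runs automatically.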

The S3 Table bucket creates a REST endpoint on a per-table basis, Warfield told us. "Inside that bucket you get an Iceberg catalog, you can create namespaces and tables, each table is a first-class resource. You can set access control policy and security policy on the table itself. Because we know that it's an Iceberg table, we pre-partition the bucket and so those tables get a 10 times performance boost for access. And we automatically run all of the maintenance and optimization tasks under the covers."

The second new feature, now in preview, is Amazon S3 Metadata. "We have thousands of customers with petabytes of data in S3," said Warfield, creating a situation where "discovery and understanding of their data is an area of focus that they want help with."

Until now, Warfield told us, "each [customer] individually builds a metadata layer on top of S3" that has to be kept up to date and easy to query.

S3 Metadata "adds an indexer to mutations inside of an S3 bucket and it populates one of these new S3 tables," he said. "It populates the table with all of the metadata fields that are embedded in the object's metadata as well as any user metadata tags that customers add, and the table is structured as a journal of changes to the bucket ... if you want to find JPEGs of some file size that were created between this date and this date, you can now do that as a SQL query."
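The kind of query Warfield describes might look something like the sketch below. It runs against an in-memory SQLite table standing in for the metadata table, so it needs no AWS account. The table name `object_metadata` and the columns `key`, `size`, and `created` are assumptions made for illustration, not the actual S3 Metadata schema, and the real table is a journal of changes rather than a simple list of current objects.

```python
import sqlite3

# In-memory stand-in for the metadata table; all names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE object_metadata (key TEXT, size INTEGER, created TEXT)")
conn.executemany(
    "INSERT INTO object_metadata VALUES (?, ?, ?)",
    [
        ("photos/cat.jpg", 2_500_000, "2024-11-05"),
        ("photos/dog.jpg", 40_000, "2024-11-10"),
        ("photos/old.jpg", 3_000_000, "2023-01-15"),
        ("docs/report.pdf", 5_000_000, "2024-11-07"),
    ],
)

# JPEGs of at least a given size, created between two dates.
QUERY = """
SELECT key, size
FROM object_metadata
WHERE key LIKE '%.jpg'
  AND size >= ?
  AND created BETWEEN ? AND ?
ORDER BY key
"""

matches = conn.execute(QUERY, (1_000_000, "2024-11-01", "2024-11-30")).fetchall()
print(matches)
```

Only `photos/cat.jpg` satisfies all three conditions in this sample data. Against the real service, a similar statement would be submitted through Athena or another Iceberg-aware engine rather than SQLite.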

Integration and pricing

The S3 service itself is not getting built-in SQL capability. For the query, "you can do it with Amazon Athena or anything that speaks to Iceberg," Warfield said. Athena is an existing query service for S3. 

What is the pricing for the new services? "The storage is priced at a slight premium to S3 standard because of the performance work," said Warfield. "Then there's some additional pricing to reflect the compaction and maintenance. On the metadata stuff, it's structured as a management fee on the objects that are scanned, and then you pay for the metadata stored in the table bucket as per normal."

Integration of S3 Tables and Metadata with the AWS Glue Data Catalog (GDC), for searching a variety of data sources, is in preview. It will also include integration with AWS Lake Formation. "This is the bottom turtle in a stack of rich evolutions for how people work with their data," said Warfield.

Is S3 now a database? "The team is concerned that people have the right expectations," Warfield told us. "If we were coming out of the blue saying, now there's tables as well as objects, there would be a big risk that people jump to some of the wrong expectations on what it is. The reality, though, is that with the existence of Parquet and Iceberg now, people already have realistic expectations." 

Despite that, Warfield said that "we're starting to see a lot of software clients like PyIceberg [Python client] and DuckDB and other Iceberg connectors, to be able to use this simple on-disk format for tabular data ... I have a hypothesis that just as much as we built this thing to support really high scale databases and solve the overhead of compaction, the elasticity of S3 makes the small end of this interesting as well." ®
