MySQL HeatWave dives into object storage data lakes

Oracle joins the analytics anywhere bandwagon, promises future access to AWS S3

Oracle has launched MySQL HeatWave Lakehouse, an extension to its proprietary analytics platform which now supports object storage outside the database.

The analytics system, built on top of the open source MySQL database, can query data in the object store in a variety of file formats and combine it with data held in MySQL. HeatWave queries files in the object store directly, without copying the data into the MySQL database, Oracle told us.

The data lake technology supports file formats including CSV, Parquet, and export files from other databases. At the same time, MySQL Autopilot promises to improve performance and scalability without requiring database tuning expertise.

On a 500TB TPC-H benchmark, Oracle claims queries took nine times longer on AWS's data warehouse and 17 times longer on Snowflake and Databricks compared with the new HeatWave data lake. Google's BigQuery would be 36 times slower, Oracle reckons, though it did not publish comparisons with Teradata, the data warehouse vendor founded in 1979.

The system is only available on Oracle Cloud Infrastructure (OCI), but Nipun Agarwal, senior vice president of MySQL HeatWave, told The Register that Oracle planned to extend the system to query data held in object storage in other clouds including AWS, Azure and GCP.

"One of the important things to note over here is that data in the object store remains in the object store," he said. "We do not copy data from the object store into the MySQL database. Secondly, the processing of this data, whether it's loading or queried, is done by HeatWave not by the MySQL engine. That's what gives it extreme scalability because the HeatWave cluster can scale up to 500 nodes."

Using analytics engines to query data outside their home database is not new. The approach was used by Snowflake, Cloudera and Google's BigQuery with their support for the Apache Iceberg table format. Similarly, Databricks, Microsoft and SAP have endorsed the Delta Lake table format, an open source format created by Databricks and now under the Linux Foundation.

Commentators and vendors have suggested most vendors will come to support most formats, including Hudi.

Agarwal said Oracle intends HeatWave to support these formats in the future, starting with Iceberg and Delta Lake.

The Autopilot feature offers schema inference, which helps users determine the data types of files in object storage before the query engine analyzes them.

"We can come up with this mapping, even for files which don't have metadata," Agarwal said. "Autopilot can make these predictions in less than one minute. We invented this technique called adaptive data sampling, which very intelligently scans and samples the file without compromising on the accuracy."
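Oracle has not published how adaptive data sampling works, so the following is only a minimal sketch of the general idea behind sample-based schema inference, not Oracle's technique. It uses plain uniform random sampling, assumes a well-formed CSV with a header row, and the function names (`infer_type`, `infer_schema`) and type labels are hypothetical.

```python
import csv
import io
import random


def infer_type(values):
    """Return the narrowest illustrative type label that fits every non-empty value."""
    def fits(cast):
        for v in values:
            if v == "":
                continue  # treat empty strings as missing values
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if fits(int):
        return "BIGINT"
    if fits(float):
        return "DOUBLE"
    return "VARCHAR"


def infer_schema(text, sample_size=100, seed=0):
    """Guess column types from a uniform random sample of rows (assumption:
    a simple stand-in for whatever sampling strategy a real system uses)."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    rng = random.Random(seed)
    # Scan everything when the file is small, otherwise only a sample.
    sample = body if len(body) <= sample_size else rng.sample(body, sample_size)
    return {
        name: infer_type([row[i] for row in sample])
        for i, name in enumerate(header)
    }


if __name__ == "__main__":
    data = "id,price,name\n1,9.5,apple\n2,3,pear\n"
    print(infer_schema(data))
```

The trade-off such a scheme faces is the one Agarwal alludes to: a smaller sample is faster but risks missing a rare value, such as a single non-numeric entry, that would change the inferred type.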

Autopilot also predicts the in-memory representation for a specific data source, the optimal size of the cluster that is needed to compute the data and how long it's going to take to load the data, he said.

Holger Mueller, vice president and principal analyst at Constellation Research, said Oracle had introduced new features to HeatWave in the last three years at a rapid pace. "The HeatWave team has out-innovated all other cloud databases," he claimed.

The move into object storage was "huge," he added, because it "allows users to bring all the data of the enterprise together – into one single query. It is something enterprises have long waited for."

Meanwhile, the ability to query data in AWS, Azure and GCP object storage would appeal to users who want to work across all their enterprise data using HeatWave, he said.

Like any suite model, Oracle HeatWave had the downside of competing with specialist players in any one of its features. "But, at this point, Oracle is more than good enough," Mueller said. ®