One of the unique aspects of Treasure Data is that it builds its Customer Data Platform (CDP) on top of a full data platform at its core, which enables robust and flexible handling of enterprise data. Treasure Data's Audience Studio and Journey Orchestration are powered by these underlying mechanisms. In this article, we'll delve into this year's plans for two core systems of our CDP: Hive and Presto.
Moving from Presto to Trino
At Treasure Data, we've operated Presto since its early days as an open-source project at Meta (formerly Facebook), offering it to our customers as an interactive query engine. Today, Presto completes 95% of our customers' queries within a minute. However, Presto has since split into two systems: PrestoDB, maintained by Meta, and Trino (formerly PrestoSQL), developed by the original Presto core developers, now at Starburst. The two query engines have gradually diverged in functionality, and this year Treasure Data will transition from Presto to Trino through a version upgrade. With this transition, we'll also update naming in various places, such as the UI and product documentation, while keeping the name Presto usable so that customers' existing processes are unaffected. The upgrade to Trino not only expands support for numerous SQL functions but is also expected to improve performance across a wide range of SQL operations.
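As a concrete illustration, here is the kind of SQL that becomes available after the upgrade. Modern Trino releases support the listagg aggregate function, which older Presto versions do not; the table and column names below are hypothetical.

```sql
-- Hypothetical example: listagg() is available in modern Trino
-- but not in older Presto releases. It concatenates values within
-- each group, here building one comma-separated product list per customer.
SELECT
  customer_id,
  listagg(product_name, ', ') WITHIN GROUP (ORDER BY purchased_at)
    AS purchased_products
FROM purchases
GROUP BY customer_id;
```

Other additions in this direction include row pattern matching with MATCH_RECOGNIZE, which older Presto versions also lack.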
Additionally, we've identified that many of the remaining 5% of workloads are memory-intensive queries whose complexity makes them unsuitable for Trino (Presto). For these, we aim to deliver a range of benefits to customers through a different approach, described later in this article.
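To give a rough sense of what "memory-intensive" means here, consider a sketch like the following (hypothetical tables): a distributed hash join must hold the build side in memory, and high-cardinality distinct counts keep large hash tables alive across the cluster, which strains an engine tuned for interactive latency.

```sql
-- Hypothetical sketch of a memory-heavy workload: a join between two
-- large fact tables plus high-cardinality COUNT(DISTINCT ...) aggregations.
SELECT
  e.campaign_id,
  COUNT(DISTINCT e.user_id)    AS unique_users,
  COUNT(DISTINCT p.session_id) AS unique_sessions
FROM events e
JOIN pageviews p
  ON e.user_id = p.user_id
GROUP BY e.campaign_id;
```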
Moreover, beyond Trino's performance enhancements, we plan to address cases where customers experience long query wait times, mainly caused by hitting the upper limit on concurrent query executions in their environments. We're exploring options for more flexible concurrent query execution to mitigate this. The benefits of upgrading to Trino extend beyond individual query optimization: it lays the groundwork for significant improvements across our backend. Many new features will run on top of Trino, including the Data Clean Room solution launched in Japan last year and the ZeroCopy (Federated Query) feature currently under consideration.
For this year's Trino upgrade, we've outlined the changes that customers should be aware of beforehand.
Moving from Hive 2 to Hive 4
Apache Hive, around since 2010, is a staple query engine for many enterprise companies and has been part of our platform since our inception. Traditionally, Hive has been deemed unsuitable for ad-hoc queries and seen as better suited to long, complex queries over massive datasets. However, with the upgrade to Hive 4 planned for this year and subsequent enhancements to our internal architecture, we believe Hive can offer customers both the robustness of a primary ETL engine for CDP workloads and rapid interactive capabilities. Customers will then be able to choose between Trino and Hive based on workload suitability rather than development convenience. To achieve this, we are developing distributed algorithms that address typical ETL problems such as data skew (a generic sketch of the problem appears below). In addition, we plan to implement a mechanism that supports agile trial-and-error iteration for data engineers.
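This post doesn't detail those algorithms, so as a generic illustration only: when a few "hot" join keys dominate a dataset, most rows funnel to a single reducer. A common manual workaround is key salting, sketched below with hypothetical tables; the goal of our work is to make this kind of hand-tuning unnecessary.

```sql
-- Generic key-salting workaround for a skewed join key (hypothetical
-- tables; not Treasure Data's algorithm). Each event row gets a random
-- salt in [0, 8), and each dimension row is replicated once per salt
-- value, spreading a hot key's rows across 8 reducers.
SELECT s.user_id, s.payload, d.attributes
FROM (
  SELECT user_id, payload,
         CAST(rand() * 8 AS INT) AS salt
  FROM skewed_events
) s
JOIN (
  SELECT user_id, attributes, t.salt
  FROM dim_users
  LATERAL VIEW explode(array(0,1,2,3,4,5,6,7)) t AS salt
) d
ON s.user_id = d.user_id AND s.salt = d.salt;
```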
For this year's upgrade to Hive 4, we've outlined the changes that customers should be aware of beforehand.
Finally…
We run preliminary simulations to assess the potential impact of these upgrades, ensuring a smooth transition to the latest technology for everyone. Our support team will notify each customer eligible for migration a month in advance, so please stay tuned.
Moving forward, by evolving Trino and Hive 4, we aim to provide a framework that can swiftly process hundreds to thousands of query workloads, regardless of customer proficiency, and seamlessly unlock the full functionality of our CDP. Stay tuned for further advancements in Treasure Data's core functionalities.
For further reading, this companion blog post describes part of the groundwork behind these query engines. Check it out if you're interested in the technical backend work.