In this blog, we explore how you can seamlessly upgrade your Hive metastore* schemas and external tables to the Unity Catalog metastore using the new SYNC command. The SYNC command can also be used to push updates from the source schemas and external tables in Hive metastore to Unity Catalog metastore schemas and tables that have been previously upgraded. The SYNC command is currently in public preview on AWS and Azure.
*Note: Hive metastore could be your default or external metastore, or even an AWS Glue metastore. For simplicity, we use the term "Hive metastore" throughout this document.
Common use cases for upgrading and syncing your Hive metastore to Unity Catalog
Unity Catalog, now generally available on AWS and Azure, offers a multitude of out-of-the-box centralized governance features, such as unified access and audit controls for all data assets in your data Lakehouse, automated data lineage for all workloads, built-in data search and discovery, privilege inheritance for simplified access management, open cross-platform data sharing, and many more.
One of the common questions that comes to mind is how you can easily upgrade or migrate the tables and schemas registered in your existing Hive metastore to the Unity Catalog metastore and keep Unity Catalog in sync with the Hive metastore. While you would want to take advantage of all the rich features Unity Catalog has to offer, there can be various scenarios where you need the Hive metastore objects to co-exist even after migrating them to the Unity Catalog metastore. For example, you might have an ETL pipeline that writes data to tables stored in Hive metastore, and you need to perform a detailed impact analysis before gradually migrating the tables to the Unity Catalog metastore. Until then, you need to keep your Hive metastore and the Unity Catalog metastore in sync.
Here are the common questions we have heard from our customers:
- How do we migrate our data workloads from two-level namespaces (schema, tables/views) to Unity Catalog's three-level namespaces (catalog, schema, tables/views)?
- Do we need to copy data from the current location to a new location for the table in the Unity Catalog metastore, or do we just need to create a new schema and table in the Unity Catalog metastore and point to the current location?
- How can we maintain access to Hive metastore tables while beginning to leverage Unity Catalog, and keep changes to the schema in sync?
- Can we have an assessment of what steps would be required to move our HMS objects to the Unity Catalog metastore?
Introducing the SYNC command in Unity Catalog
To facilitate the seamless migration of your schemas and external tables from your existing Hive metastore to the Unity Catalog metastore, we have introduced a utility called SYNC. The SYNC command helps you migrate your existing Hive metastore to the Unity Catalog metastore and also helps keep both metastores in sync on an ongoing basis until you have completely migrated all your dependent applications from Hive metastore to the Unity Catalog metastore. Instead of allocating resources to build a custom solution, you can use SYNC as an easy, out-of-the-box way to keep your existing Hive metastore and the Unity Catalog metastore in sync.
Key features of SYNC
- Ability to upgrade an external table from Hive metastore to the Unity Catalog metastore and keep the metadata of the two tables in sync.
- Ability to upgrade all eligible tables in a Hive metastore schema to the Unity Catalog metastore and keep the metadata in sync. It uses multithreading to upgrade multiple tables in parallel.
- Dry run mode to show the result of the SYNC command without creating or updating the target tables.
- Ability to run SYNC multiple times on the same schema or tables to keep the source and target metastores in sync.
How does it work
The SYNC command abstracts away the complexity of migrating a schema and external tables from the Hive metastore to the Unity Catalog metastore and keeping them in sync. Once executed, it analyzes the source and target tables or schemas and performs the operations below:
- If the target table does not exist, the sync operation creates a target table with the same name as the source table in the provided target schema. The owner of the target table defaults to the user who is running the SYNC command.
- If the target table exists, and the table is determined to have been created by a previous SYNC command or upgraded via the web interface, the sync operation updates the table so that its schema matches the schema of the source table.
The command outputs one row per upgraded table, including a status_code column and a description column. The status_code column indicates the status of the upgrade for that table, and the description column provides a descriptive message for each table.
Getting started with the SYNC command
The user running the SYNC command should:
- Be the owner of the source table when using "SYNC TABLE"
- Be the owner of the source schema when using "SYNC SCHEMA"
Note: The current version of SYNC only supports upgrades of external tables. Please refer to the documentation for upgrading your Hive metastore managed tables and views to the Unity Catalog metastore. You can also use the table clone command to create a copy of an existing Hive metastore managed table, at a specific version, in the Unity Catalog metastore. Read this blog to learn more about table clones in Databricks.
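As a rough illustration of the clone route mentioned in the note, the sketch below deep-clones a managed Delta table from Hive metastore into Unity Catalog at a chosen version. The table name people_managed, the target schema main.ucdb_sync, and the version number 3 are assumptions made up for this example; they are not part of the original walkthrough.

```sql
-- Hypothetical names: people_managed is an assumed managed Delta table in Hive metastore,
-- and main.ucdb_sync is an assumed, already existing Unity Catalog schema.
-- DEEP CLONE copies the data files, so the new table is independent of the source.
-- VERSION AS OF 3 is an illustrative version; pick one that exists in the table history.
create table if not exists main.ucdb_sync.people_managed
deep clone hive_metastore.hmsdb_sync.people_managed version as of 3;
```

Unlike SYNC, a deep clone copies data rather than pointing at the existing location, so later changes to the source table are not reflected automatically.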
There are two options for the upgrade using SYNC:
- SYNC TABLE: upgrades a table from Hive metastore to the Unity Catalog metastore
- SYNC SCHEMA: upgrades all eligible tables in a schema from Hive metastore to the Unity Catalog metastore
The SYNC command can be used to create new tables in the Unity Catalog metastore from existing tables in Hive metastore. It can also be used to push updates from the source tables in Hive metastore to Unity Catalog metastore tables that were previously upgraded using the SYNC command or via the web UI.
An optional DRY RUN clause can be used to evaluate whether a table can be upgraded to Unity Catalog. In DRY RUN mode, the command checks whether the given source table can be upgraded to the Unity Catalog metastore and, if it cannot, returns a status_code and a descriptive error message. If the table can be upgraded, the status code is DRY_RUN_SUCCESS in DRY RUN mode, and SUCCESS when the table is actually synced.
SYNC TABLE target_tbl FROM source_table [DRY RUN]
Please visit our documentation for details on the parameters of the SYNC command.
Note: The user who runs the SYNC command will be the owner of the newly created tables.
Note: We are using sample data for this example. Databricks also provides a variety of data sets that are already mounted to DBFS in your Databricks workspace. You can find more details here.
Upgrade an external table to Unity Catalog
Create a Hive metastore schema
use catalog hive_metastore;
drop database if exists hmsdb_sync cascade;
create database hmsdb_sync;
Create a Unity Catalog schema
use catalog main;
drop database if exists main.ucdb_sync cascade;
create database main.ucdb_sync;
Create an external table in Hive metastore
-- create an external delta table in Hive metastore
drop table if exists hive_metastore.hmsdb_sync.people_delta;
create table hive_metastore.hmsdb_sync.people_delta
location "<<Your Object Storage Location>>"
as select * from delta.`dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta` limit 100000;
Select from the table to verify
select * from hive_metastore.hmsdb_sync.people_delta;
Execute Dry Run
sync table main.ucdb_sync.people_delta from hive_metastore.hmsdb_sync.people_delta DRY RUN;
Observe the results of the dry run
Upgrade the table and observe the result
sync table main.ucdb_sync.people_delta from hive_metastore.hmsdb_sync.people_delta;
Describe both source and target tables and compare
describe extended hive_metastore.hmsdb_sync.people_delta;
describe extended main.ucdb_sync.people_delta;
Describe the Hive metastore table and the Unity Catalog table
Select from the target table to verify the data
select * from main.ucdb_sync.people_delta;
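To see the "keep in sync" behavior on the table upgraded above, one possible follow-up is to change the schema of the Hive metastore table and run SYNC again. This is only a sketch: the column name loyalty_tier is a made-up example, and it assumes the source table is a Delta table that accepts ALTER TABLE ... ADD COLUMNS.

```sql
-- Assumed change: add an illustrative column to the Hive metastore source table.
alter table hive_metastore.hmsdb_sync.people_delta add columns (loyalty_tier string);

-- Re-run SYNC; because the target was created by an earlier SYNC,
-- its schema should be updated to match the source.
sync table main.ucdb_sync.people_delta from hive_metastore.hmsdb_sync.people_delta;

-- Check that the new column is now visible on the Unity Catalog table.
describe table main.ucdb_sync.people_delta;
```

Running the SYNC statement with DRY RUN first shows the expected status without touching the target table.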
Upgrade the schema and all eligible tables in one go
sync schema main.ucdb_schema_sync from hive_metastore.hmsdb_schema_sync DRY RUN;
sync schema main.ucdb_schema_sync from hive_metastore.hmsdb_schema_sync;
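The two statements above refer to the schemas hmsdb_schema_sync and ucdb_schema_sync, which the walkthrough does not create. A minimal setup sketch, to be run beforehand, might look like the following. It assumes that both schemas must already exist and that the source schema holds at least one external table; the table name and the storage placeholder are illustrative only.

```sql
-- Assumed setup, to be run before the SYNC SCHEMA statements.
create database if not exists hive_metastore.hmsdb_schema_sync;
create database if not exists main.ucdb_schema_sync;

-- Hypothetical external table in the source schema.
-- Replace the placeholder with your own storage path, distinct from any path used earlier.
create table if not exists hive_metastore.hmsdb_schema_sync.people_delta
location "<<Your Object Storage Location>>"
as select * from delta.`dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta` limit 1000;
```

With more than one external table in the source schema, the SYNC SCHEMA output contains one row per table, each with its own status_code and description.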
In this blog, we have shown how you can use the SYNC command to abstract away the complexity of upgrading your Hive metastore objects to the Unity Catalog metastore. To learn more about the SYNC command and get started, please visit the guides (AWS, Azure). Please refer to the Notebook to try different options with SYNC and keep your Hive metastore schemas and external tables and your Unity Catalog metastore in sync.
SYNC can be run multiple times to ensure that the Hive metastore objects and the Unity Catalog metastore objects stay in sync. SYNC makes it seamless and easy for customers to adopt Unity Catalog and benefit from unified governance features. If you no longer need your Hive metastore schemas and tables, you can drop them. Dropping an external table does not modify the data files in your cloud tenant.
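As a small sketch of that last step, using the table names from the walkthrough, the Hive metastore entry can be dropped once nothing depends on it anymore. Whether it is safe to do so in your environment is an assumption you should verify first.

```sql
-- Removes only the Hive metastore registration of the external table;
-- the data files at the storage location are left in place.
drop table if exists hive_metastore.hmsdb_sync.people_delta;

-- The Unity Catalog table points at the same location and should still be readable.
select count(*) from main.ucdb_sync.people_delta;
```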