Deploy DataHub using AWS managed services and ingest metadata from AWS Glue and Amazon Redshift – Part 2


In the first post of this series, we discussed the need for a metadata management solution for organizations. We used DataHub as an open-source metadata platform for metadata management and deployed it using AWS managed services with the AWS Cloud Development Kit (AWS CDK).

In this post, we focus on how to populate technical metadata from the AWS Glue Data Catalog and Amazon Redshift into DataHub, how to augment data with a business glossary, and how to visualize the data lineage of AWS Glue jobs.

Overview of solution

The following diagram illustrates the solution architecture and its key components:

  1. DataHub runs on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, using Amazon OpenSearch Service, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon RDS for MySQL as the storage layer for the underlying data model and indexes.
  2. The solution pulls technical metadata from AWS Glue and Amazon Redshift into DataHub.
  3. We enrich the technical metadata with a business glossary.
  4. Finally, we run an AWS Glue job to transform the data and observe the data lineage in DataHub.

In the following sections, we demonstrate how to ingest the metadata using various methods, enrich the dataset, and capture the data lineage.

Pull technical metadata from AWS Glue and Amazon Redshift

In this step, we look at three different approaches to ingesting metadata into DataHub for search and discovery.

DataHub supports both push-based and pull-based metadata ingestion. Push-based integrations (for example, Spark) allow you to emit metadata directly from your data systems when metadata changes, whereas pull-based integrations allow you to extract metadata from the data systems in a batch or incremental-batch manner. In this section, you pull technical metadata from the AWS Glue Data Catalog and Amazon Redshift using the DataHub web interface, Python, and the DataHub CLI.
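
As a minimal illustration of the push-based style (not part of this walkthrough), the acryl-datahub Python library ships a REST emitter that can push individual metadata aspects straight to DataHub's metadata service (GMS). This is a hedged sketch only; the endpoint, token, and dataset name are placeholders:

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

# Connect to DataHub's metadata service (GMS); endpoint and token are placeholders.
emitter = DatahubRestEmitter(
    gms_server="http://<your_gms_endpoint.region.elb.amazonaws.com:8080>",
    token="<your_gms_token_string>",
)

# Build the URN of the dataset to update (platform and name are placeholders).
dataset_urn = builder.make_dataset_urn(platform="glue", name="<database>.<table>", env="PROD")

# Push a single aspect (the dataset description) to DataHub.
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=dataset_urn,
        aspectName="datasetProperties",
        aspect=DatasetPropertiesClass(description="Pushed from the data system itself"),
    )
)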

Ingest data using the DataHub web interface

In this section, you use the DataHub web interface to ingest technical metadata. This method supports both the AWS Glue Data Catalog and Amazon Redshift, but we focus on Amazon Redshift here as an example.

As a prerequisite, you need an Amazon Redshift cluster with sample data, accessible from the EKS cluster hosting DataHub (default TCP port 5439).

Create an access token

Complete the following steps to create an access token:

  1. Go to the DataHub web interface and choose Settings.
  2. Choose Generate new token.
  3. Enter a name (GMS_TOKEN), an optional description, and an expiry date and time.
  4. Copy the value of the token to a safe place.
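
This token is what the ingestion recipes and the Python client pass to DataHub's metadata service (GMS) as a bearer token. As an optional sanity check, a small sketch like the following (assuming the GMS_ENDPOINT you note later in this post and the requests library) calls the GMS /config endpoint with the token:

import requests

# Placeholders: the GMS load balancer endpoint and the token created above.
GMS_ENDPOINT = "http://<your_gms_endpoint.region.elb.amazonaws.com:8080>"
GMS_TOKEN = "<your_gms_token_string>"

# Call the lightweight /config endpoint, passing the token as a bearer token.
response = requests.get(
    f"{GMS_ENDPOINT}/config",
    headers={"Authorization": f"Bearer {GMS_TOKEN}"},
    timeout=10,
)
print(response.status_code)
print(response.json())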

Create an ingestion source

Next, we configure Amazon Redshift as our ingestion source.

  1. On the DataHub web interface, choose Ingestion.
  2. Choose Create new source.
  3. Choose Amazon Redshift.
  4. In the Configure Recipe step, enter the values of host_port and database of your Amazon Redshift cluster and keep the rest unchanged:
# Coordinates
host_port: example.something.<region>.redshift.amazonaws.com:5439
database: dev

The values for ${REDSHIFT_USERNAME}, ${REDSHIFT_PASSWORD}, and ${GMS_TOKEN} reference secrets that you set up in the next step.

  5. Choose Next.
  6. For the run schedule, enter your desired cron syntax or choose Skip.
  7. Enter a name for the data source (for example, Amazon Redshift demo) and choose Done.

Create secrets for the data source recipe

To create your secrets, complete the following steps:

  1. On the DataHub Manage Ingestion page, choose Secrets.
  2. Choose Create new secret.
  3. For Name, enter REDSHIFT_USERNAME.
  4. For Value, enter awsuser (the default admin user).
  5. For Description, enter an optional description.
  6. Repeat these steps for REDSHIFT_PASSWORD and GMS_TOKEN.

Run metadata ingestion

To ingest the metadata, complete the following steps:

  1. On the DataHub Manage Ingestion page, choose Sources.
  2. Choose Execute next to the Amazon Redshift source you just created.
  3. Choose Execute again to confirm.
  4. Expand the source and wait for the ingestion to complete, or check the error details (if any).

Tables in the Amazon Redshift cluster are now populated in DataHub. You can view them by navigating to Datasets > prod > redshift > dev > public > users.

You will further enrich this table's metadata using the DataHub CLI in a later step.

Ingest data using Python code

In this section, you use Python code to ingest technical metadata into DataHub, using the AWS Glue Data Catalog as an example data source.

As a prerequisite, you need a sample database and table in the Data Catalog. You also need an AWS Identity and Access Management (IAM) user with the required IAM permissions:

{
    "Effect": "Allow",
    "Action": [
        "glue:GetDatabases",
        "glue:GetTables"
    ],
    "Resource": [
        "arn:aws:glue:$region-id:$account-id:catalog",
        "arn:aws:glue:$region-id:$account-id:database/*",
        "arn:aws:glue:$region-id:$account-id:table/*"
    ]
}
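
Before running the ingestion, you can optionally confirm that the IAM user can read the Data Catalog with a quick boto3 check. This is a sketch only; the Region and database name are placeholders:

import boto3

# Uses the IAM user's credentials (for example, from the default credential chain).
glue = boto3.client("glue", region_name="<aws_region>")

# List the databases visible to this principal.
databases = glue.get_databases()["DatabaseList"]
print([db["Name"] for db in databases])

# List the tables in one sample database (name is a placeholder).
tables = glue.get_tables(DatabaseName="<your_database>")["TableList"]
print([table["Name"] for table in tables])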

Note the GMS_ENDPOINT value for DataHub by running kubectl get svc, and locate the load balancer URL and port number (8080) for the service datahub-datahub-gms.

Install the DataHub client

To install the DataHub client with AWS Cloud9, complete the following steps:

  1. Open the AWS Cloud9 IDE and start the terminal.
  2. Create a new virtual environment and install the DataHub client:
# Create the virtualenv
python3 -m venv datahub
# Activate the virtualenv
source datahub/bin/activate
# Install/upgrade the DataHub client
pip3 install --upgrade acryl-datahub

  3. Check the installation by running datahub version.

If DataHub is successfully installed, you see the following output:

DataHub CLI version: 0.8.44.4
Python version: 3.X.XX (default, XXXXX)

  4. Install the DataHub plugin for AWS Glue:
pip3 install --upgrade 'acryl-datahub[glue]'

Prepare and run the ingestion Python script

Complete the following steps to ingest the data:

  1. Download glue_ingestion.py from the GitHub repository.
  2. Edit the values of both the source and sink objects:
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "glue",
            "config": {
                "aws_access_key_id": "<aws_access_key>",
                "aws_secret_access_key": "<aws_secret_key>",
                "aws_region": "<aws_region>",
                "emit_s3_lineage": False,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {
                "server": "http://<your_gms_endpoint.region.elb.amazonaws.com:8080>",
                "token": "<your_gms_token_string>",
            },
        },
    }
)

# Run the pipeline and report the results.
pipeline.run()
pipeline.pretty_print_summary()

For production purposes, use an IAM role and store other parameters and credentials in AWS Systems Manager Parameter Store or AWS Secrets Manager.
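
As an illustration only, a sketch of that pattern might look like the following. It assumes the AWS Glue source can rely on the attached IAM role (no access keys in the config) and that the GMS token is stored under a hypothetical Parameter Store name /datahub/gms_token:

import boto3
from datahub.ingestion.run.pipeline import Pipeline

# Read the GMS token from AWS Systems Manager Parameter Store
# (the parameter name is a placeholder).
ssm = boto3.client("ssm", region_name="<aws_region>")
gms_token = ssm.get_parameter(Name="/datahub/gms_token", WithDecryption=True)["Parameter"]["Value"]

pipeline = Pipeline.create(
    {
        "source": {
            "type": "glue",
            "config": {
                # No access keys here: boto3 falls back to the role credentials.
                "aws_region": "<aws_region>",
                "emit_s3_lineage": False,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {
                "server": "http://<your_gms_endpoint.region.elb.amazonaws.com:8080>",
                "token": gms_token,
            },
        },
    }
)

pipeline.run()
pipeline.pretty_print_summary()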

To view all configuration options, refer to Config Details.

  3. Run the script within the DataHub virtual environment:
python3 glue_ingestion.py

If you navigate back to the DataHub web interface, the databases and tables in your AWS Glue Data Catalog should appear under Datasets > prod > glue.
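
The same Pipeline API can also drive the Amazon Redshift ingestion you configured earlier through the web interface. The following is a minimal sketch, assuming the Redshift plugin is installed (pip3 install --upgrade 'acryl-datahub[redshift]') and reusing the same placeholder connection details:

from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "redshift",
            "config": {
                # Same coordinates and credentials as the web interface recipe (placeholders).
                "host_port": "example.something.<region>.redshift.amazonaws.com:5439",
                "database": "dev",
                "username": "awsuser",
                "password": "<your_redshift_password>",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {
                "server": "http://<your_gms_endpoint.region.elb.amazonaws.com:8080>",
                "token": "<your_gms_token_string>",
            },
        },
    }
)

pipeline.run()
pipeline.pretty_print_summary()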

Ingest data using the DataHub CLI

In this section, you use the DataHub CLI to ingest a sample business glossary covering data classification, personal information, and more.

As a prerequisite, you must have the DataHub CLI installed in the AWS Cloud9 IDE. If not, go through the steps in the previous section.

Prepare and ingest the business glossary

Complete the following steps:

  1. Open the AWS Cloud9 IDE.
  2. Download business_glossary.yml from the GitHub repository.
  3. Optionally, explore the file and add custom definitions (refer to Business Glossary for more information).
  4. Download business_glossary_to_datahub.yml from the GitHub repository.
  5. Edit the full path to the business glossary definition file, the GMS endpoint, and the GMS token:
source:
  type: datahub-business-glossary
  config:
    file: /home/ec2-user/environment/business_glossary.yml

sink:
  type: datahub-rest
  config:
    server: 'http://<your_gms_endpoint.region.elb.amazonaws.com:8080>'
    token: '<your_gms_token_string>'
  6. Run the following command:
datahub ingest -c business_glossary_to_datahub.yml

  7. Navigate back to the DataHub interface, and choose Govern, then Glossary.

You should now see the new business glossary to use in the next section.

Enrich the dataset with additional metadata

In this section, we enrich a dataset with additional context, including a description, tags, and business glossary terms, to aid data discovery.

As a prerequisite, follow the earlier steps to ingest the metadata of the sample database from Amazon Redshift, and to ingest the business glossary from the YAML file.

  1. In the DataHub web interface, browse to Datasets > prod > redshift > dev > public > users.
  2. Starting at the table level, we add related documentation and a link to the About section.

This allows analysts to understand the table relationships at a glance, as shown in the following screenshot.

  3. To further enhance the context, add the following (a programmatic sketch follows this list):
    • Column descriptions.
    • Tags for the table and columns to aid search and discovery.
    • Business glossary terms to organize data assets using a shared vocabulary. For example, we define userid in the USERS table as an account in business terms.
    • Owners.
    • A domain to group data assets into logical collections. This is useful when designing a data mesh on AWS.
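
These enrichments are typically applied in the DataHub UI as described above, but they can also be done programmatically. The following is a hedged sketch using the REST emitter to attach a tag and a glossary term to the users dataset; the endpoint, token, and term name are placeholders, and note that emitting an aspect this way replaces any existing tags or terms on the dataset rather than appending to them.

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    ChangeTypeClass,
    GlobalTagsClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
    TagAssociationClass,
)

emitter = DatahubRestEmitter(
    gms_server="http://<your_gms_endpoint.region.elb.amazonaws.com:8080>",
    token="<your_gms_token_string>",
)

dataset_urn = builder.make_dataset_urn(platform="redshift", name="dev.public.users", env="PROD")

# Attach a tag to the dataset (overwrites the existing globalTags aspect).
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=dataset_urn,
        aspectName="globalTags",
        aspect=GlobalTagsClass(tags=[TagAssociationClass(tag=builder.make_tag_urn("PII"))]),
    )
)

# Associate a glossary term from the ingested business glossary (term name is a placeholder).
emitter.emit_mcp(
    MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=dataset_urn,
        aspectName="glossaryTerms",
        aspect=GlossaryTermsClass(
            terms=[GlossaryTermAssociationClass(urn=builder.make_term_urn("Account"))],
            auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:ingestion"),
        ),
    )
)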

Now we can search using the additional context. For example, searching for the term email with the tag tickit correctly returns the USERS table.

We can also search using tags, such as tags:"PII" OR fieldTags:"PII" OR editedFieldTags:"PII".

In the following example, we search using the field description fieldDescriptions:"The user's home state, such as GA".

Feel free to explore the search features in DataHub to enhance the data discovery experience.

Capture data lineage

In this section, we create an AWS Glue job to capture the data lineage. This requires the datahub-spark-lineage JAR file as an additional dependency.

  1. Download the NYC yellow taxi trip data for January 2022 (in Parquet format) and save it under s3://<<Your S3 Bucket>>/tripdata/.
  2. Create an AWS Glue crawler pointing to s3://<<Your S3 Bucket>>/tripdata/ to create a landing table called landing_nyx_taxi inside the database nyx_taxi.
  3. Download the datahub-spark-lineage JAR file (v0.8.41-3-rc3) and store it in s3://<<Your S3 Bucket>>/externalJar/.
  4. Download the log4j.properties file and store it in s3://<<Your S3 Bucket>>/externalJar/.
  5. Create a target table using the following SQL script.

The AWS Glue job reads the data in Parquet format from the landing table, performs some basic data transformation, and writes it to the target table in Parquet format.

  6. Create an AWS Glue job using the following script, and modify your GMS_ENDPOINT, GMS_TOKEN, and source and target database and table names.
  7. On the Job details tab, provide the IAM role and disable job bookmarks.
  8. Add the path of the datahub-spark-lineage JAR (s3://<<Your S3 Bucket>>/externalJar/datahub-spark-lineage-0.8.41-3-rc3.jar) for the Dependent JARs path.
  9. Enter the path of log4j.properties for the Referenced files path.

The job reads the data from the landing table as a Spark DataFrame and then inserts the data into the target table. The JAR is a lightweight Java agent that listens for Spark application and job events and pushes metadata out to DataHub in real time. The lineage of datasets that are read and written is captured, along with events such as application start and end, and SQLExecution start and end. This information can be seen under pipelines (DataFlow) and tasks (DataJob) in DataHub.
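
The job script itself is the one referenced above; purely as an illustration of how the listener is wired up, a Glue (PySpark) job might configure it roughly as follows. The spark.datahub.* properties are the listener's REST settings, the endpoint and token are placeholders, and the target table name is hypothetical:

from pyspark.sql import SparkSession

# Configure the DataHub Spark listener; the JAR itself comes from the
# Dependent JARs path set on the Glue job. Endpoint and token are placeholders.
spark = (
    SparkSession.builder.appName("datahub-lineage-demo")
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.rest.server", "http://<your_gms_endpoint.region.elb.amazonaws.com:8080>")
    .config("spark.datahub.rest.token", "<your_gms_token_string>")
    .enableHiveSupport()
    .getOrCreate()
)

# Read from the landing table and insert into the (pre-created) target table;
# the listener observes these reads and writes and pushes lineage to DataHub.
df = spark.table("nyx_taxi.landing_nyx_taxi")
df.write.insertInto("nyx_taxi.<your_target_table>")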

  10. Run the AWS Glue job.

When the job is complete, you can see the lineage information populated in the DataHub UI.

The preceding lineage shows the data being read from a table backed by an Amazon Simple Storage Service (Amazon S3) location and written to an AWS Glue Data Catalog table. Spark run details such as the query run ID are captured, and can be mapped back to the Spark UI using the Spark application name and Spark application ID.

Clean up

To avoid incurring future costs, complete the following steps to delete the resources:

  1. Run helm uninstall datahub and helm uninstall prerequisites.
  2. Run cdk destroy --all.
  3. Delete the AWS Cloud9 environment.

Conclusion

In this post, we demonstrated how to search and discover data assets stored in your data lake (via the AWS Glue Data Catalog) and data warehouse in Amazon Redshift. You can augment data assets with a business glossary, and visualize the data lineage of AWS Glue jobs.


About the Authors

Debadatta Mohapatra is an AWS Data Lab Architect. He has extensive experience across big data, data science, and IoT, spanning consulting and industrials. He is an advocate of cloud-native data platforms and the value they can drive for customers across industries.

Corvus Lee is a Solutions Architect for AWS Data Lab. He enjoys all kinds of data-related discussions, and helps customers build MVPs using AWS databases, analytics, and machine learning services.

Suraj Bang is a Sr Solutions Architect at AWS. In this role, Suraj helps AWS customers with their analytics, database, and machine learning use cases, architects solutions to solve their business problems, and helps them build scalable prototypes.
