This article provides an in-depth look at Amundsen, an open-source data discovery and metadata engine, including how to set it up and examples of using it to enhance data exploration in your organization.

What is Amundsen?

Amundsen is an open-source data discovery and metadata platform originally developed by Lyft. Named after the Norwegian explorer Roald Amundsen, it's designed to improve the productivity of data analysts, scientists, and engineers when navigating their data ecosystem. Amundsen achieves this by indexing data resources (tables, dashboards, streams, etc.) and making them easily discoverable through a powerful search interface.

Key Features

Metadata Search: Provides a search interface to find data assets across your organization.
Data Lineage: Displays how data flows between different systems and transformations.
Column-Level Metadata: Offers detailed information about table columns, including data types and descriptions.
User Collaboration: Allows users to annotate datasets, add documentation, and see frequent users.
Integration: Connects with various databases, data warehouses, and BI tools.
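
To make the column-level metadata feature concrete, here is a minimal sketch of the kind of record a catalog like Amundsen holds for a table. The `Table` and `Column` classes are hypothetical stand-ins written for this article, not databuilder's own model classes, and the table and column names are invented. The key layout `database://cluster.schema/table` follows the convention Amundsen uses for table keys.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    """Hypothetical column record: name, type, and a human-written description."""
    name: str
    col_type: str
    description: str = ""


@dataclass
class Table:
    """Hypothetical table record holding column-level metadata."""
    database: str
    cluster: str
    schema: str
    name: str
    description: str = ""
    columns: List[Column] = field(default_factory=list)

    def key(self) -> str:
        # Unique identifier in the form database://cluster.schema/table
        return f"{self.database}://{self.cluster}.{self.schema}/{self.name}"

    def undocumented_columns(self) -> List[str]:
        # Columns that still need a description from a data owner
        return [c.name for c in self.columns if not c.description.strip()]


# Invented example data, for illustration only
rides = Table(
    database="hive",
    cluster="my_cluster",
    schema="my_database",
    name="rides",
    description="One row per completed ride.",
    columns=[
        Column("ride_id", "bigint", "Unique ride identifier"),
        Column("fare_usd", "double"),
    ],
)

print(rides.key())
print(rides.undocumented_columns())
```

A helper like `undocumented_columns` hints at why column descriptions matter: gaps in documentation become something you can list and assign.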

Architecture Overview

Amundsen's architecture consists of three microservices and the two data stores that back them:

Frontend Service: The web application that users interact with.
Metadata Service: Stores and serves metadata about data assets.
Search Service: Handles indexing and searching of metadata.
Neo4j or Apache Atlas: The persistence layer behind the metadata service. Neo4j, a graph database, is the default and stores the relationships between data assets.
Elasticsearch: Used for indexing metadata to enable fast search capabilities.
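
The components above can be summarized as a small lookup table. The sketch below assumes the default ports published by the quickstart Docker Compose file (frontend 5000, search 5001, metadata 5002, Neo4j Bolt 7687, Elasticsearch 9200). Treat these as assumptions and check your own compose file, since they may differ between versions and deployments. `Component` and `endpoint` are names made up for this illustration.

```python
from typing import Dict, NamedTuple


class Component(NamedTuple):
    role: str
    scheme: str
    port: int


# Assumed quickstart defaults; adjust to match your deployment.
COMPONENTS: Dict[str, Component] = {
    "frontend": Component("web application that users interact with", "http", 5000),
    "search": Component("handles indexing and searching of metadata", "http", 5001),
    "metadata": Component("stores and serves metadata about data assets", "http", 5002),
    "neo4j": Component("graph store for assets and their relationships", "bolt", 7687),
    "elasticsearch": Component("search index over the metadata", "http", 9200),
}


def endpoint(name: str, host: str = "localhost") -> str:
    """Build the base address of a component from the table above."""
    component = COMPONENTS[name]
    return f"{component.scheme}://{host}:{component.port}"


for name, component in COMPONENTS.items():
    print(f"{name:14} {endpoint(name):24} {component.role}")
```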

Setting Up Amundsen

Let's walk through setting up Amundsen locally using Docker. This will allow you to explore its features and understand how it can benefit your organization.

1. Prerequisites

Ensure you have the following installed on your machine:

Docker
Docker Compose

2. Clone the Amundsen Repository

git clone https://github.com/amundsen-io/amundsen.git
cd amundsen

3. Start the Services

The Docker Compose file sits at the root of the repository you just cloned, so start the services from there:

docker-compose -f docker-amundsen.yml up

This command starts all the necessary services, including the frontend, metadata, and search services, as well as Neo4j and Elasticsearch.

4. Access the Amundsen UI

Once the services are running, access the Amundsen UI at http://localhost:5000.
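
The containers can take a while to become ready, so a script that depends on the UI may want to poll until it responds. Below is a minimal sketch using only the standard library. `http_ok` and `wait_until_up` are hypothetical helpers, and the `probe` parameter exists so the retry logic can be exercised without a running stack.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeouts, and HTTP error statuses all count as "not up yet"
        return False


def wait_until_up(url: str, attempts: int = 30, delay: float = 2.0,
                  probe: Callable[[str], bool] = http_ok) -> bool:
    """Poll `url` until `probe` succeeds or `attempts` is exhausted."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return True
        if attempt < attempts:
            time.sleep(delay)  # back off before the next try
    return False
```

Calling `wait_until_up("http://localhost:5000")` returns `True` as soon as the frontend answers, or `False` after roughly a minute with the default settings.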

Ingesting Metadata

To populate Amundsen with metadata from your data sources, you can use databuilder, which is Amundsen's data ingestion library.
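
Conceptually, a databuilder job pulls records from an extractor one at a time and hands each to a loader, after which a publisher pushes the staged output to the backing stores. The sketch below mimics the extract-and-load part of that flow in plain Python so it runs without databuilder installed. `ListExtractor`, `MemoryLoader`, and `run_task` are hypothetical names for illustration and are not the library's classes; the publishing step is left out.

```python
from typing import Any, Dict, Iterable, List, Optional


class ListExtractor:
    """Returns records one at a time, then None once the source is exhausted."""

    def __init__(self, records: Iterable[Dict[str, Any]]) -> None:
        self._iter = iter(records)

    def extract(self) -> Optional[Dict[str, Any]]:
        return next(self._iter, None)


class MemoryLoader:
    """Stages records in memory; a real loader would write files for a publisher."""

    def __init__(self) -> None:
        self.staged: List[Dict[str, Any]] = []

    def load(self, record: Dict[str, Any]) -> None:
        self.staged.append(record)


def run_task(extractor: ListExtractor, loader: MemoryLoader) -> int:
    """Drain the extractor into the loader and return the record count."""
    count = 0
    record = extractor.extract()
    while record is not None:
        loader.load(record)
        count += 1
        record = extractor.extract()
    return count


# Invented sample records, for illustration only
loader = MemoryLoader()
n = run_task(
    ListExtractor([
        {"schema": "my_database", "table": "rides"},
        {"schema": "my_database", "table": "drivers"},
    ]),
    loader,
)
print(n, [r["table"] for r in loader.staged])
```

Keeping the extractor and loader behind small interfaces is what lets the same job structure work for many different sources and destinations.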

Example: Ingesting Sample Data

Let's ingest some sample data into Amundsen.

1. Install Dependencies

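In a second terminal, from the repository root, create a virtual environment inside the databuilder directory and install the library there:

cd databuilder
python3 -m venv venv
source venv/bin/activate
pip3 install --upgrade pip
pip3 install -r requirements.txt
python3 setup.py install

2. Run the Sample Data Loader

Still in the databuilder directory, and with the Docker services running, execute the sample data loader script:

python3 example/scripts/sample_data_loader.py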

This script loads sample metadata into Neo4j and Elasticsearch.

Exploring Data in Amundsen

Refresh the Amundsen UI, and you should see the sample data available for search and exploration.

Using Amundsen in Your Organization

Amundsen can connect to various data sources to ingest metadata. Here's an example of how to configure Amundsen to ingest metadata from a Hive data warehouse.

1. Configure the Hive Extractor

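Define the configuration for the Hive metadata extractor. The extractor queries the Hive metastore database through SQLAlchemy, so it needs a connection string in addition to the cluster name. The values below are placeholders for your environment:

from pyhocon import ConfigFactory
from databuilder.extractor.hive_table_metadata_extractor import HiveTableMetadataExtractor

# Keys are scoped by component; replace the placeholder values with your own.
hive_conf = {
    # Cluster name recorded in Amundsen table keys
    'extractor.hive_table_metadata.cluster': 'my_cluster',
    # SQL filter appended to the metastore query; "d" is assumed to be the
    # alias of the databases table, so check the extractor's SQL in your version
    'extractor.hive_table_metadata.where_clause_suffix': "WHERE d.NAME = 'my_database'",
    # SQLAlchemy connection string for the Hive metastore database (placeholder)
    'extractor.hive_table_metadata.extractor.sqlalchemy.conn_string':
        'mysql://user:password@metastore-host:3306/hive_metastore',
}

2. Run the Ingestion Job

Set up and run a job that pairs the extractor with a loader and a publisher. The loader stages CSV files on disk and the publisher writes them to Neo4j. The endpoint and credentials shown assume the local Docker setup from earlier:

from databuilder.job.job import DefaultJob
from databuilder.loader.file_system_neo4j_csv_loader import FsNeo4jCSVLoader
from databuilder.publisher.neo4j_csv_publisher import Neo4jCsvPublisher
from databuilder.task.task import DefaultTask

job_config = ConfigFactory.from_dict({
    **hive_conf,
    # Where the loader stages CSV files, and where the publisher reads them
    'loader.filesystem_csv_neo4j.node_dir_path': '/tmp/amundsen/nodes',
    'loader.filesystem_csv_neo4j.relationship_dir_path': '/tmp/amundsen/relationships',
    'publisher.neo4j.node_files_directory': '/tmp/amundsen/nodes',
    'publisher.neo4j.relation_files_directory': '/tmp/amundsen/relationships',
    # Neo4j connection details (local Docker defaults assumed)
    'publisher.neo4j.neo4j_endpoint': 'bolt://localhost:7687',
    'publisher.neo4j.neo4j_user': 'neo4j',
    'publisher.neo4j.neo4j_password': 'test',
    'publisher.neo4j.job_publish_tag': 'hive_ingest',  # label identifying this run
})

# The job initializes the extractor, loader, and publisher from job_config
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(extractor=HiveTableMetadataExtractor(), loader=FsNeo4jCSVLoader()),
    publisher=Neo4jCsvPublisher(),
)
job.launch()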

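After the job completes, the metadata from your Hive warehouse is stored in Neo4j. For the tables to show up in search results, the Elasticsearch index must also be updated with a second job, as the sample data loader does.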
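
The dotted prefixes in the configuration keys act as namespaces: each component reads only the keys under its own scope, such as `extractor.hive_table_metadata`. The following sketch illustrates that idea with a plain dictionary. `scoped` is a hypothetical helper written for this article, not how pyhocon or databuilder implement scoping internally, and the keys and values are examples only.

```python
from typing import Any, Dict


def scoped(conf: Dict[str, Any], scope: str) -> Dict[str, Any]:
    """Return the entries under `scope`, with the scope prefix removed."""
    prefix = scope + "."
    return {k[len(prefix):]: v for k, v in conf.items() if k.startswith(prefix)}


# Example keys and values, for illustration only
conf = {
    "extractor.hive_table_metadata.cluster": "my_cluster",
    "publisher.neo4j.neo4j_user": "neo4j",
}

print(scoped(conf, "extractor.hive_table_metadata"))
print(scoped(conf, "publisher.neo4j"))
```

Because each component sees only its own slice, one flat configuration can drive an extractor, a loader, and a publisher without their settings colliding.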

Benefits of Using Amundsen

Improved Data Discovery: Makes it easy for users to find and understand data assets.
Enhanced Collaboration: Users can share knowledge through annotations and documentation.
Data Lineage and Compliance: Understand data flow and dependencies for better governance.
Integration Friendly: Connects with various data sources and tools in your data ecosystem.

Conclusion

Amundsen serves as a central hub for data discovery and metadata management in an organization. By providing a user-friendly interface and robust integration capabilities, it helps data professionals navigate complex data landscapes efficiently.

If you're looking to enhance data discovery and promote a data-driven culture in your organization, Amundsen is a powerful tool to consider.
