

Amundsen Databuilder¶

Amundsen Databuilder is a data ingestion library inspired by Apache Gobblin. It can be used in an orchestration framework (e.g. Apache Airflow) to build the data for Amundsen. You can use the library either with an ad hoc Python script (example) or inside an Apache Airflow DAG (example).

For information about Amundsen and our other services, visit the main repository README.md. Please also see our instructions for a quick start setup of Amundsen with dummy data, and an overview of the architecture.

Requirements¶

Python >= 3.8.x
elasticsearch 7.x

Doc¶

https://www.amundsen.io/amundsen/

Concept¶

An ETL job consists of extracting records from the source, transforming them if necessary, and loading them into the sink. Amundsen Databuilder is an ETL framework for Amundsen, with corresponding components called Extractor, Transformer, and Loader that deal with record-level operations. A component called Task controls these three components. Job is the highest-level component in Databuilder; it controls the task and the publisher, and it is the one that the client uses to launch the ETL job.

In Databuilder, each component is highly modularized and uses namespace-based HOCON config, which makes components highly reusable and pluggable (e.g. a transformer can be reused within an extractor, or an extractor can be reused within another extractor). Note that the component concepts are largely motivated by Apache Gobblin.
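To illustrate the idea of namespace-based config, here is a minimal sketch that uses a plain dictionary instead of HOCON. The helper `scoped` is hypothetical and is not part of Databuilder; it only shows how a component could see just the keys under its own scope, which is what lets one component be nested inside another.

```python
def scoped(conf: dict, scope: str) -> dict:
    """Return the entries under `scope`, with the scope prefix removed."""
    prefix = scope + "."
    return {
        key[len(prefix):]: value
        for key, value in conf.items()
        if key.startswith(prefix)
    }


# A flat job config where every key is prefixed by the component's namespace.
job_conf = {
    "extractor.dbapi.sql": "SELECT 1",
    "extractor.dbapi.model_class": "package.module_name.class_name",
    "loader.filesystem.path": "/tmp/out",
}

# Each component only receives the part of the config that belongs to it.
extractor_conf = scoped(job_conf, "extractor.dbapi")
loader_conf = scoped(job_conf, "loader.filesystem")
print(extractor_conf)
print(loader_conf)
```

Scoping can be applied repeatedly, so `scoped(scoped(job_conf, "extractor"), "dbapi")` yields the same view as `scoped(job_conf, "extractor.dbapi")`.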

Databuilder components

Extractor ¶

An extractor extracts records from the source. This does not necessarily mean that it only supports the pull pattern in ETL. For example, extracting records from a messaging bus makes it a push pattern in ETL.

Transformer ¶

A transformer takes a record from either an extractor or from other transformers (via ChainedTransformer) to transform the record.

Loader ¶

A loader takes a record from a transformer, or directly from an extractor, and loads it into a sink or a staging area. Because loading operates at the record level, it cannot support atomicity.

Task ¶

A task orchestrates an extractor, a transformer, and a loader to perform a record-level operation.

Record ¶

A record is represented by one of the models.

Publisher ¶

A publisher is an optional component. Its common usage is to support atomicity at the job level and/or to easily support bulk loading into the sink.

Job ¶

A job is the highest-level component in Databuilder. It orchestrates a task and, if present, a publisher.

Model ¶

Models are abstractions representing the domain.
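The relationship between these components can be sketched with a few small classes. This is an illustration only: every class below is hypothetical and written with the standard library, and none of it is Databuilder's actual implementation. It assumes an extractor signals the end of the source by returning None.

```python
from typing import Iterable, List, Optional


class ListExtractor:
    """Hypothetical extractor: hands out records from an in-memory list."""

    def __init__(self, rows: Iterable[str]) -> None:
        self._rows = iter(rows)

    def extract(self) -> Optional[str]:
        # Return None once the source is exhausted.
        return next(self._rows, None)


class UpperCaseTransformer:
    """Hypothetical transformer: changes one record at a time."""

    def transform(self, record: str) -> str:
        return record.upper()


class StagingLoader:
    """Hypothetical loader: appends each record to a staging area."""

    def __init__(self, staging: List[str]) -> None:
        self._staging = staging

    def load(self, record: str) -> None:
        self._staging.append(record)


class BulkPublisher:
    """Hypothetical publisher: moves the staged records to the sink in one step."""

    def __init__(self, staging: List[str], sink: List[str]) -> None:
        self._staging = staging
        self._sink = sink

    def publish(self) -> None:
        self._sink.extend(self._staging)
        self._staging.clear()


class Task:
    """Runs extract -> transform -> load, one record at a time."""

    def __init__(self, extractor, loader, transformer=None) -> None:
        self._extractor = extractor
        self._loader = loader
        self._transformer = transformer

    def run(self) -> None:
        record = self._extractor.extract()
        while record is not None:
            if self._transformer is not None:
                record = self._transformer.transform(record)
            self._loader.load(record)
            record = self._extractor.extract()


class Job:
    """Runs the task, then the optional publisher."""

    def __init__(self, task: Task, publisher=None) -> None:
        self._task = task
        self._publisher = publisher

    def launch(self) -> None:
        self._task.run()
        if self._publisher is not None:
            self._publisher.publish()


staging: List[str] = []
sink: List[str] = []
job = Job(
    task=Task(ListExtractor(["users", "orders"]),
              StagingLoader(staging),
              UpperCaseTransformer()),
    publisher=BulkPublisher(staging, sink),
)
job.launch()
print(sink)
```

The loader only ever sees one record, so it writes to a staging list; the publisher is the single place where staged records reach the sink, which is where job-level atomicity or bulk loading would be handled.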

List of extractors¶

DBAPIExtractor ¶

An extractor that uses the Python Database API (DBAPI) interface. It requires three pieces of information: a connection object that conforms to the DBAPI spec, a SELECT SQL statement, and a model class that corresponds to each row returned by the SQL statement.

job_config = ConfigFactory.from_dict({
        'extractor.dbapi.{}'.format(DBAPIExtractor.CONNECTION_CONFIG_KEY): db_api_conn,
        'extractor.dbapi.{}'.format(DBAPIExtractor.SQL_CONFIG_KEY): select_sql_stmt,
        'extractor.dbapi.model_class': 'package.module_name.class_name'
        })

job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=DBAPIExtractor(),
        loader=AnyLoader()))
job.launch()
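The three inputs described above can be tried out without Databuilder. The sketch below uses the standard library's sqlite3 module, which conforms to the DBAPI spec, and a hypothetical model class `TableRow`; the table and column names are made up for the example, and the row-to-model mapping shown here is an assumption about the general idea rather than the extractor's actual code.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class TableRow:
    """Hypothetical model class: one instance per row of the SELECT."""
    schema: str
    name: str


# 1. A connection object that conforms to the DBAPI spec.
db_api_conn = sqlite3.connect(":memory:")
db_api_conn.execute("CREATE TABLE tables (schema_name TEXT, table_name TEXT)")
db_api_conn.executemany(
    "INSERT INTO tables VALUES (?, ?)",
    [("core", "users"), ("core", "orders")],
)

# 2. A SELECT statement whose columns match the model's fields, in order.
select_sql_stmt = "SELECT schema_name, table_name FROM tables ORDER BY table_name"

# 3. Each returned row is turned into one model instance.
cursor = db_api_conn.cursor()
cursor.execute(select_sql_stmt)
records = [TableRow(*row) for row in cursor.fetchall()]
db_api_conn.close()
print(records)
```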

GenericExtractor ¶

An extractor that takes a list of dicts from the user through config.

HiveTableLastUpdatedExtractor ¶

An extractor that extracts the last updated time from the Hive metastore and the underlying file system. Although the Hive metastore has a parameter called “last_modified_time”, it cannot be used because it provides a DDL timestamp, not a DML timestamp. For this reason, HiveTableLastUpdatedExtractor uses the underlying files of Hive to fetch the latest updated date. However, it is not efficient to poke all files in Hive, so it only pokes the underlying storage for non-partitioned tables. For partitioned tables, it fetches the partition created timestamp, which is close enough to the last updated timestamp.

As getting metadata from files could be time consuming, there are several features to improve performance:

1. Support for multithreading to parallelize metadata fetching. Although CPython's multithreading is not true multithreading, as it is bound to a single core, getting file metadata is mostly an I/O-bound operation. Note that the number of threads should be less than or equal to the number of connections.
2. The user can pass a where clause to only include certain schemas and to exclude certain tables. For example, adding something like TBL_NAME NOT REGEXP '(tmp|temp)' would eliminate unnecessary computation.

job_config = ConfigFactory.from_dict({
    'extractor.hive_table_last_updated.partitioned_table_where_clause_suffix': partitioned_table_where_clause,
    'extractor.hive_table_last_updated.non_partitioned_table_where_clause_suffix': non_partitioned_table_where_clause,
    'extractor.hive_table_last_updated.extractor.sqlalchemy.{}'.format(
            SQLAlchemyExtractor.CONN_STRING): connection_string,
    'extractor.hive_table_last_updated.extractor.fs_worker_pool_size': pool_size,
    'extractor.hive_table_last_updated.filesystem.{}'.format(FileSystem.DASK_FILE_SYSTEM): s3fs.S3FileSystem(
        anon=False,
        config_kwargs={'max_pool_connections': pool_size})})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=HiveTableLastUpdatedExtractor(),
        loader=AnyLoader()))
job.launch()
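The reason threads help with I/O-bound metadata lookups can be shown with a small standard-library sketch. The function `fetch_last_updated` is a hypothetical stand-in that sleeps instead of calling a real file system, and the paths are made up; this is not how the extractor itself is implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple


def fetch_last_updated(path: str) -> Tuple[str, int]:
    """Stand-in for an I/O-bound file metadata lookup (hypothetical)."""
    time.sleep(0.01)  # simulate waiting on the network
    return path, len(path)  # fake "timestamp" so the sketch is deterministic


paths = [
    "s3://bucket/db/table_a",
    "s3://bucket/db/table_bb",
    "s3://bucket/db/table_ccc",
]
pool_size = 2  # keep this less than or equal to the number of connections

with ThreadPoolExecutor(max_workers=pool_size) as pool:
    # map() returns results in input order even though calls run concurrently.
    results = list(pool.map(fetch_last_updated, paths))

print(results)
```

While one thread waits on I/O, another can issue its own request, so the single-core limitation of CPython threads matters little for this kind of workload.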

HiveTableMetadataExtractor ¶

An extractor that extracts table and column metadata including database, schema, table name, table description, column name and column description from Hive metastore database.

job_config = ConfigFactory.from_dict({
    'extractor.hive_table_metadata.{}'.format(HiveTableMetadataExtractor.WHERE_CLAUSE_SUFFIX_KEY): where_clause_suffix,
    'extractor.hive_table_metadata.extractor.sqlalchemy.{}'.format(SQLAlchemyExtractor.CONN_STRING): connection_string()})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=HiveTableMetadataExtractor(),
        loader=AnyLoader()))
job.launch()

CassandraExtractor ¶

An extractor that extracts table and column metadata including keyspace, table name, column name and column type from Apache Cassandra databases.

job_config = ConfigFactory.from_dict({
    'extractor.cassandra.{}'.format(CassandraExtractor.CLUSTER_KEY): cluster_identifier_string,
    'extractor.cassandra.{}'.format(CassandraExtractor.IPS_KEY): ['127.0.0.1'],
    'extractor.cassandra.{}'.format(CassandraExtractor.KWARGS_KEY): {},
    'extractor.cassandra.{}'.format(CassandraExtractor.FILTER_FUNCTION_KEY): my_filter_function,

})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=CassandraExtractor(),
        loader=AnyLoader()))
job.launch()

If you use the filter function option, the function should have the following form:

def my_filter_function(keyspace, table):
  # Return True to include the table, False to skip it
  return True
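As a slightly fuller illustration, a filter might skip system keyspaces and temporary tables. This sketch assumes the function receives the keyspace name and the table name as strings; the function name, the `tmp_` prefix, and the sample data are hypothetical.

```python
# Keyspaces that ship with Cassandra and rarely belong in a data catalog.
SYSTEM_KEYSPACES = {"system", "system_auth", "system_schema"}


def skip_system_and_tmp(keyspace: str, table: str) -> bool:
    """Return True to keep a table, False to skip it."""
    if keyspace in SYSTEM_KEYSPACES:
        return False
    return not table.startswith("tmp_")


# Applying the filter to a few (keyspace, table) pairs.
candidates = [("system", "peers"), ("shop", "orders"), ("shop", "tmp_orders")]
kept = [(ks, tbl) for ks, tbl in candidates if skip_system_and_tmp(ks, tbl)]
print(kept)
```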

If you need to pass more arguments to the Cassandra cluster, you can pass them through the kwargs config:

config = ConfigFactory.from_dict({
    'extractor.cassandra.{}'.format(CassandraExtractor.IPS_KEY): ['127.0.0.1'],
    'extractor.cassandra.{}'.format(CassandraExtractor.KWARGS_KEY): {'port': 9042}
})
# it will call the cluster constructor like this
Cluster(['127.0.0.1'], **kwargs)

GlueExtractor ¶

An extractor that extracts table and column metadata including database, schema, table name, table description, column name and column description from AWS Glue metastore.

Before running, make sure you have a working AWS profile configured and have access to search tables on Glue.

job_config = ConfigFactory.from_dict({
    'extractor.glue.{}'.format(GlueExtractor.CLUSTER_KEY): cluster_identifier_string,
    'extractor.glue.{}'.format(GlueExtractor.FILTER_KEY): [],
    'extractor.glue.{}'.format(GlueExtractor.PARTITION_BADGE_LABEL_KEY): label_string,
})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=GlueExtractor(),
        loader=AnyLoader()))
job.launch()

Optionally, you may add a partition badge label to the configuration. This will apply that label to all columns that are identified as partition keys in Glue.

If you use the filters option, here is the input format. For more information on filters, visit link.

[
  {
    "Key": "string",
    "Value": "string",
    "Comparator": "EQUALS"|"GREATER_THAN"|"LESS_THAN"|"GREATER_THAN_EQUALS"|"LESS_THAN_EQUALS"
  }
  ...
]

Example filtering on database and table. Note that Comparator can only apply to time fields.

[
  {
    "Key": "DatabaseName",
    "Value": "my_database"
  },
  {
    "Key": "Name",
    "Value": "my_table"
  }
]
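If the filters are built in Python rather than written by hand, a small helper can keep the shape consistent. The helper `glue_filter` is hypothetical; it only produces dictionaries in the format shown above and checks the comparator against the values listed there.

```python
from typing import Dict, Optional

# Comparator values taken from the input format shown above.
ALLOWED_COMPARATORS = {
    "EQUALS",
    "GREATER_THAN",
    "LESS_THAN",
    "GREATER_THAN_EQUALS",
    "LESS_THAN_EQUALS",
}


def glue_filter(key: str, value: str,
                comparator: Optional[str] = None) -> Dict[str, str]:
    """Build one filter entry; the comparator is optional."""
    entry = {"Key": key, "Value": value}
    if comparator is not None:
        if comparator not in ALLOWED_COMPARATORS:
            raise ValueError("unsupported comparator: {}".format(comparator))
        entry["Comparator"] = comparator
    return entry


# Same filter as the database and table example above.
filters = [
    glue_filter("DatabaseName", "my_database"),
    glue_filter("Name", "my_table"),
]
print(filters)
```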

Delta-Lake-MetadataExtractor ¶

An extractor that runs on a Spark cluster and obtains Delta Lake metadata using Spark SQL commands. This custom solution is currently necessary because the Hive metastore does not contain all metadata information for Delta Lake tables. For simplicity, this extractor can also be used for all Hive tables.

Because it must run on a spark cluster, it is required that you have an operator (for example a databricks submit run operator) that calls the configuration code on a spark cluster.

spark = SparkSession.builder.appName("Amundsen Delta Lake Metadata Extraction").getOrCreate()
job_config = create_delta_lake_job_config()
dExtractor = DeltaLakeMetadataExtractor()
dExtractor.set_spark(spark)
job = DefaultJob(conf=job_config,
                 task=DefaultTask(extractor=dExtractor, loader=FsNeo4jCSVLoader()),
                 publisher=Neo4jCsvPublisher())
job.launch()

The delta lake extractor supports extraction of complex data types to be indexed and searchable.

struct<a:int,b:string,c:array<struct<d:int,e:string>>,f:map<int,struct<g:int,h:string>>>

Will be extracted as:
a     int
b     string
c     array<struct<d:int,e:string>>
c.d   int
c.e   string
f     map<int,struct<g:int,h:string>>
f.g   int
f.h   string

This functionality is behind a configuration value. Simply set EXTRACT_NESTED_COLUMNS to True in the job config.
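The flattening of nested types can be illustrated with a small parser. This is a sketch under stated assumptions, not the extractor's implementation: the function names are hypothetical, the input is assumed to be a well-formed type string in which a map value is written as `map<int,struct<...>>`, and only struct fields reachable through array and map wrappers are expanded.

```python
from typing import List, Optional, Tuple


def split_top_level(text: str, sep: str = ",") -> List[str]:
    """Split on `sep`, ignoring separators nested inside <...>."""
    parts, depth, start = [], 0, 0
    for index, char in enumerate(text):
        if char == "<":
            depth += 1
        elif char == ">":
            depth -= 1
        elif char == sep and depth == 0:
            parts.append(text[start:index])
            start = index + 1
    parts.append(text[start:])
    return parts


def find_struct_fields(type_str: str) -> Optional[str]:
    """Return the field list of the struct inside array/map wrappers, if any."""
    if type_str.startswith("struct<") and type_str.endswith(">"):
        return type_str[len("struct<"):-1]
    if type_str.startswith("array<") and type_str.endswith(">"):
        return find_struct_fields(type_str[len("array<"):-1])
    if type_str.startswith("map<") and type_str.endswith(">"):
        # Only the value type of a map is searched for nested fields.
        _key_type, value_type = split_top_level(type_str[len("map<"):-1])
        return find_struct_fields(value_type)
    return None


def flatten(type_str: str, prefix: str = "") -> List[Tuple[str, str]]:
    """List (dotted column name, type) pairs for every nested field."""
    fields = find_struct_fields(type_str)
    if fields is None:
        return []
    rows = []
    for field in split_top_level(fields):
        name, field_type = field.split(":", 1)
        rows.append((prefix + name, field_type))
        rows.extend(flatten(field_type, prefix + name + "."))
    return rows


column_type = ("struct<a:int,b:string,c:array<struct<d:int,e:string>>,"
               "f:map<int,struct<g:int,h:string>>>")
for name, nested_type in flatten(column_type):
    print(name, nested_type)
```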

You can check out the sample deltalake metadata script for a full example.

DremioMetadataExtractor ¶

An extractor that extracts table and column metadata including database, schema, table name, table description, column name and column description from Dremio.

Before running make sure that you have the Dremio ODBC driver installed. Default config values assume the default driver name for the MacBook install.

job_config = ConfigFactory.from_dict({
    'extractor.dremio.{}'.format(DremioMetadataExtractor.DREMIO_USER_KEY): DREMIO_USER,
    'extractor.dremio.{}'.format(DremioMetadataExtractor.DREMIO_PASSWORD_KEY): DREMIO_PASSWORD,
    'extractor.dremio.{}'.format(DremioMetadataExtractor.DREMIO_HOST_KEY): DREMIO_HOST})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=DremioMetadataExtractor(),
        loader=AnyLoader()))
job.launch()

DruidMetadataExtractor ¶

An extractor that extracts table and column metadata including database, schema, table name, table description, column name and column description from a Druid DB.

The where_clause_suffix can be defined; normally you would want to filter out the INFORMATION_SCHEMA.

You could specify the following job config

conn_string = "druid+https://{host}:{port}/druid/v2/sql/".format(
        host=druid_broker_host,
        port=443
)
job_config = ConfigFactory.from_dict({
    'extractor.druid_metadata.{}'.format(DruidMetadataExtractor.WHERE_CLAUSE_SUFFIX_KEY): where_clause_suffix,
    'extractor.druid_metadata.extractor.sqlalchemy.{}'.format(SQLAlchemyExtractor.CONN_STRING): conn_string})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=DruidMetadataExtractor(),
        loader=AnyLoader()))
job.launch()

OracleMetadataExtractor ¶

An extractor that extracts table and column metadata including database, schema, table name, table description, column name and column description from the Oracle database.

By default, the Oracle database name is ‘oracle’. To override this, set CLUSTER_KEY to what you wish to use as the cluster name.

The where_clause_suffix below should define which schemas you’d like to query. The SQL query driving the extraction is defined here

job_config = ConfigFactory.from_dict({
    'extractor.oracle_metadata.{}'.format(OracleMetadataExtractor.WHERE_CLAUSE_SUFFIX_KEY): where_clause_suffix,
    'extractor.oracle_metadata.extractor.sqlalchemy.{}'.format(SQLAlchemyExtractor.CONN_STRING): connection_string()})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=OracleMetadataExtractor(),
        loader=AnyLoader()))
job.launch()

PostgresMetadataExtractor ¶

An extractor that extracts table and column metadata including database, schema, table name, table description, column name and column description from a Postgres or Redshift database.

By default, the Postgres/Redshift database name is used as the cluster name. To override this, set USE_CATALOG_AS_CLUSTER_NAME to False, and CLUSTER_KEY to what you wish to use as the cluster name.

The where_clause_suffix below should define which schemas you’d like to query (see the sample dag for an example).

The SQL query driving the extraction is defined here

job_config = ConfigFactory.from_dict({
    'extractor.postgres_metadata.{}'.format(PostgresMetadataExtractor.WHERE_CLAUSE_SUFFIX_KEY): where_clause_suffix,
    'extractor.postgres_metadata.{}'.format(PostgresMetadataExtractor.USE_CATALOG_AS_CLUSTER_NAME): True,
    'extractor.postgres_metadata.extractor.sqlalchemy.{}'.format(SQLAlchemyExtractor.CONN_STRING): connection_string()})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=PostgresMetadataExtractor(),
        loader=AnyLoader()))
job.launch()

MSSQLMetadataExtractor ¶

An extractor that extracts table and column metadata including database, schema, table name, table description, column name and column description from a Microsoft SQL database.

By default, the Microsoft SQL Server Database name is used as the cluster name. To override this, set USE_CATALOG_AS_CLUSTER_NAME to False, and CLUSTER_KEY to what you wish to use as the cluster name.

The where_clause_suffix below should define which schemas you’d like to query ("('dbo','sys')").

The SQL query driving the extraction is defined here

This extractor is largely derived from PostgresMetadataExtractor.

job_config = ConfigFactory.from_dict({
    'extractor.mssql_metadata.{}'.format(MSSQLMetadataExtractor.WHERE_CLAUSE_SUFFIX_KEY): where_clause_suffix,
    'extractor.mssql_metadata.{}'.format(MSSQLMetadataExtractor.USE_CATALOG_AS_CLUSTER_NAME): True,
    'extractor.mssql_metadata.extractor.sqlalchemy.{}'.format(SQLAlchemyExtractor.CONN_STRING): connection_string()})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=MSSQLMetadataExtractor(),
        loader=AnyLoader()))
job.launch()
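A schema list such as "('dbo','sys')" can be generated from a Python list instead of being typed by hand. The helper `sql_in_list` below is hypothetical, and how the resulting string is combined with the extractor's SQL depends on the query each extractor defines, so treat this as a sketch of the string building only.

```python
from typing import Iterable


def sql_in_list(schemas: Iterable[str]) -> str:
    """Render schema names as a SQL IN-list such as ('dbo','sys')."""
    # Double any single quote so a name cannot break out of its literal.
    quoted = ["'{}'".format(name.replace("'", "''")) for name in schemas]
    return "({})".format(",".join(quoted))


where_clause_suffix = sql_in_list(["dbo", "sys"])
print(where_clause_suffix)
```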

MysqlMetadataExtractor ¶

An extractor that extracts table and column metadata including database, schema, table name, table description, column name and column description from a MySQL database.

By default, the MySQL database name is used as the cluster name. To override this, set USE_CATALOG_AS_CLUSTER_NAME to False, and CLUSTER_KEY to what you wish to use as the cluster name.

The where_clause_suffix below should define which schemas you’d like to query.

The SQL query driving the extraction is defined here

job_config = ConfigFactory.from_dict({
    'extractor.mysql_metadata.{}'.format(MysqlMetadataExtractor.WHERE_CLAUSE_SUFFIX_KEY): where_clause_suffix,
    'extractor.mysql_metadata.{}'.format(MysqlMetadataExtractor.USE_CATALOG_AS_CLUSTER_NAME): True,
    'extractor.mysql_metadata.extractor.sqlalchemy.{}'.format(SQLAlchemyExtractor.CONN_STRING): connection_string()})
job = DefaultJob(conf=job_config,
                 task=DefaultTask(extractor=MysqlMetadataExtractor(), loader=FsNeo4jCSVLoader()),
                 publisher=Neo4jCsvPublisher())
job.launch()

Db2MetadataExtractor ¶

An extractor that extracts table and column metadata including database, schema, table name, table description, column name and column description from a Unix, Windows or Linux Db2 database or BigSQL.

The where_clause_suffix below should define which schemas you’d like to query or those that you would not (see the sample data loader for an example).

The SQL query driving the extraction is defined here

job_config = ConfigFactory.from_dict({
    'extractor.db2_metadata.{}'.format(Db2MetadataExtractor.WHERE_CLAUSE_SUFFIX_KEY): where_clause_suffix,
    'extractor.db2_metadata.extractor.sqlalchemy.{}'.format(SQLAlchemyExtractor.CONN_STRING): connection_string()})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=Db2MetadataExtractor(),
        loader=AnyLoader()))
job.launch()

SnowflakeMetadataExtractor ¶

An extractor that extracts table and column metadata including database, schema, table name, table description, column name and column description from a Snowflake database.

By default, the Snowflake database name is used as the cluster name. To override this, set USE_CATALOG_AS_CLUSTER_NAME to False, and CLUSTER_KEY to what you wish to use as the cluster name.

By default, the Snowflake database is set to PROD. To override this, set DATABASE_KEY to WhateverNameOfYourDb.

By default, the Snowflake schema is set to INFORMATION_SCHEMA. To override this, set SCHEMA_KEY to WhateverNameOfYourSchema.

Note that ACCOUNT_USAGE is a separate schema which allows users to query a wider set of data at the cost of latency. Differences are defined here

The where_clause_suffix should define which schemas you’d like to query (see the sample dag for an example).

The SQL query driving the extraction is defined here

job_config = ConfigFactory.from_dict({
    'extractor.snowflake.{}'.format(SnowflakeMetadataExtractor.SNOWFLAKE_DATABASE_KEY): 'YourDbName',
    'extractor.snowflake.{}'.format(SnowflakeMetadataExtractor.WHERE_CLAUSE_SUFFIX_KEY): where_clause_suffix,
    'extractor.snowflake.{}'.format(SnowflakeMetadataExtractor.USE_CATALOG_AS_CLUSTER_NAME): True,
    'extractor.snowflake.extractor.sqlalchemy.{}'.format(SQLAlchemyExtractor.CONN_STRING): connection_string()})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=SnowflakeMetadataExtractor(),
        loader=AnyLoader()))
job.launch()

GenericUsageExtractor ¶

An extractor that extracts table popularity metadata from a custom created Snowflake table (created by a script that may look like this scala script). You can create a DAG using the Databricks Operator and run this script within Databricks or wherever you are able to run Scala.

By default, snowflake is used as the database name. ColumnReader has the datasource as its database input, and database as its cluster input.

The following inputs are related to where you create your Snowflake popularity table.

By default, the Snowflake popularity database is set to PROD. To override this, set POPULARITY_TABLE_DATABASE to WhateverNameOfYourDb.

By default, the Snowflake popularity schema is set to SCHEMA. To override this, set POPULARTIY_TABLE_SCHEMA to WhateverNameOfYourSchema.

By default, the Snowflake popularity table is set to TABLE. To override this, set POPULARITY_TABLE_NAME to WhateverNameOfYourTable.

The where_clause_suffix should define any filtering you’d like to include in your query. For example, this may include user_emails that you don’t want to include in your popularity definition.

job_config = ConfigFactory.from_dict({
    f'extractor.generic_usage.extractor.sqlalchemy.{SQLAlchemyExtractor.CONN_STRING}': connection_string(),
    f'extractor.generic_usage.{GenericUsageExtractor.WHERE_CLAUSE_SUFFIX_KEY}': where_clause_suffix,
    f'extractor.generic_usage.{GenericUsageExtractor.POPULARITY_TABLE_DATABASE}': 'WhateverNameOfYourDb',
    f'extractor.generic_usage.{GenericUsageExtractor.POPULARTIY_TABLE_SCHEMA}': 'WhateverNameOfYourSchema',
    f'extractor.generic_usage.{GenericUsageExtractor.POPULARITY_TABLE_NAME}': 'WhateverNameOfYourTable',
})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=GenericUsageExtractor(),
        loader=AnyLoader()))
job.launch()

SnowflakeTableLastUpdatedExtractor ¶

An extractor that extracts table last updated timestamp from a Snowflake database.

It uses the same configs as the SnowflakeMetadataExtractor described above.

The SQL query driving the extraction is defined here

job_config = ConfigFactory.from_dict({
    'extractor.snowflake_table_last_updated.{}'.format(SnowflakeTableLastUpdatedExtractor.SNOWFLAKE_DATABASE_KEY): 'YourDbName',
    'extractor.snowflake_table_last_updated.{}'.format(SnowflakeTableLastUpdatedExtractor.WHERE_CLAUSE_SUFFIX_KEY): where_clause_suffix,
    'extractor.snowflake_table_last_updated.{}'.format(SnowflakeTableLastUpdatedExtractor.USE_CATALOG_AS_CLUSTER_NAME): True,
    'extractor.snowflake_table_last_updated.extractor.sqlalchemy.{}'.format(SQLAlchemyExtractor.CONN_STRING): connection_string()})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=SnowflakeTableLastUpdatedExtractor(),
        loader=AnyLoader()))
job.launch()

BigQueryMetadataExtractor ¶

An extractor that extracts table and column metadata including database, schema, table name, table description, column name and column description from a BigQuery database.

The API calls driving the extraction are defined here

You will need to create a service account for reading metadata and grant it “BigQuery Metadata Viewer” access to all of your datasets. This can all be done via the BigQuery UI.

Download the credentials file and store it securely. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the location of your credentials file, and your code should have access to everything it needs.
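Before launching a job, it can be useful to confirm that the environment variable is set and points at an existing file. The helper `credentials_path` below is a hypothetical convenience written with the standard library; it does not validate the contents of the credentials file.

```python
import os
from typing import Mapping


def credentials_path(environ: Mapping[str, str] = os.environ) -> str:
    """Return the configured credentials file path, or raise if it is unusable."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise RuntimeError("credentials file not found: {}".format(path))
    return path
```

Calling `credentials_path()` at the top of a script fails early with a clear message instead of surfacing as an authentication error in the middle of an extraction.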

You can configure BigQuery like this. You can optionally set a label filter if you only want to pull tables with a certain label.

job_config = {
    'extractor.bigquery_table_metadata.{}'.format(
        BigQueryMetadataExtractor.PROJECT_ID_KEY): gcloud_project
}
if label_filter:
    job_config[
        'extractor.bigquery_table_metadata.{}'.format(
            BigQueryMetadataExtractor.FILTER_KEY)] = label_filter
task = DefaultTask(extractor=BigQueryMetadataExtractor(),
                   loader=csv_loader,
                   transformer=NoopTransformer())
job = DefaultJob(conf=ConfigFactory.from_dict(job_config),
                 task=task,
                 publisher=Neo4jCsvPublisher())
job.launch()

Neo4jEsLastUpdatedExtractor ¶

An extractor that gets the current timestamp and passes it to GenericExtractor. It is used to create the timestamp for “Amundsen was last indexed on …” in the footer of the Amundsen web page.

Neo4jExtractor ¶

An extractor that extracts records from Neo4j based on a provided Cypher query. One example is extracting data from Neo4j so that it can be transformed and published to Elasticsearch.

job_config = ConfigFactory.from_dict({
    'extractor.neo4j.{}'.format(Neo4jExtractor.CYPHER_QUERY_CONFIG_KEY): cypher_query,
    'extractor.neo4j.{}'.format(Neo4jExtractor.GRAPH_URL_CONFIG_KEY): neo4j_endpoint,
    'extractor.neo4j.{}'.format(Neo4jExtractor.MODEL_CLASS_CONFIG_KEY): 'package.module.class_name',
    'extractor.neo4j.{}'.format(Neo4jExtractor.NEO4J_AUTH_USER): neo4j_user,
    'extractor.neo4j.{}'.format(Neo4jExtractor.NEO4J_AUTH_PW): neo4j_password,
    'extractor.neo4j.{}'.format(Neo4jExtractor.NEO4J_ENCRYPTED): True})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=Neo4jExtractor(),
        loader=AnyLoader()))
job.launch()

Neo4jSearchDataExtractor ¶

An extractor that extracts data from Neo4j using Neo4jExtractor, with the Cypher query already embedded in it.

job_config = ConfigFactory.from_dict({
    'extractor.search_data.extractor.neo4j.{}'.format(Neo4jExtractor.GRAPH_URL_CONFIG_KEY): neo4j_endpoint,
    'extractor.search_data.extractor.neo4j.{}'.format(Neo4jExtractor.MODEL_CLASS_CONFIG_KEY): 'databuilder.models.neo4j_data.Neo4jDataResult',
    'extractor.search_data.extractor.neo4j.{}'.format(Neo4jExtractor.NEO4J_AUTH_USER): neo4j_user,
    'extractor.search_data.extractor.neo4j.{}'.format(Neo4jExtractor.NEO4J_AUTH_PW): neo4j_password,
    'extractor.search_data.extractor.neo4j.{}'.format(Neo4jExtractor.NEO4J_ENCRYPTED): False})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=Neo4jSearchDataExtractor(),
        loader=AnyLoader()))
job.launch()

AtlasSearchDataExtractor ¶

An extractor that extracts Atlas data to be indexed, in a format compatible with the Elasticsearch Search Proxy.

entity_type = 'Table'
extracted_search_data_path = f'/tmp/{entity_type.lower()}_search_data.json'
process_pool_size = 5

# atlas config
atlas_url = 'localhost'
atlas_port = 21000
atlas_protocol = 'http'
atlas_verify_ssl = False
atlas_username = 'admin'
atlas_password = 'admin'
atlas_search_chunk_size = 200
atlas_details_chunk_size = 10

# elastic config
es = Elasticsearch([
    {'host': 'localhost'},
])

elasticsearch_client = es
elasticsearch_new_index_key = f'{entity_type.lower()}-' + str(uuid.uuid4())
elasticsearch_new_index_key_type = '_doc'
elasticsearch_index_alias = f'{entity_type.lower()}_search_index'

job_config = ConfigFactory.from_dict({
    'extractor.atlas_search_data.{}'.format(AtlasSearchDataExtractor.ATLAS_URL_CONFIG_KEY):
        atlas_url,
    'extractor.atlas_search_data.{}'.format(AtlasSearchDataExtractor.ATLAS_PORT_CONFIG_KEY):
        atlas_port,
    'extractor.atlas_search_data.{}'.format(AtlasSearchDataExtractor.ATLAS_PROTOCOL_CONFIG_KEY):
        atlas_protocol,
    'extractor.atlas_search_data.{}'.format(AtlasSearchDataExtractor.ATLAS_VALIDATE_SSL_CONFIG_KEY):
        atlas_verify_ssl,
    'extractor.atlas_search_data.{}'.format(AtlasSearchDataExtractor.ATLAS_USERNAME_CONFIG_KEY):
        atlas_username,
    'extractor.atlas_search_data.{}'.format(AtlasSearchDataExtractor.ATLAS_PASSWORD_CONFIG_KEY):
        atlas_password,
    'extractor.atlas_search_data.{}'.format(AtlasSearchDataExtractor.ATLAS_SEARCH_CHUNK_SIZE_KEY):
        atlas_search_chunk_size,
    'extractor.atlas_search_data.{}'.format(AtlasSearchDataExtractor.ATLAS_DETAILS_CHUNK_SIZE_KEY):
        atlas_details_chunk_size,
    'extractor.atlas_search_data.{}'.format(AtlasSearchDataExtractor.PROCESS_POOL_SIZE_KEY):
        process_pool_size,
    'extractor.atlas_search_data.{}'.format(AtlasSearchDataExtractor.ENTITY_TYPE_KEY):
        entity_type,
    'loader.filesystem.elasticsearch.{}'.format(FSElasticsearchJSONLoader.FILE_PATH_CONFIG_KEY):
        extracted_search_data_path,
    'loader.filesystem.elasticsearch.{}'.format(FSElasticsearchJSONLoader.FILE_MODE_CONFIG_KEY):
        'w',
    'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.FILE_PATH_CONFIG_KEY):
        extracted_search_data_path,
    'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.FILE_MODE_CONFIG_KEY):
        'r',
    'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.ELASTICSEARCH_CLIENT_CONFIG_KEY):
        elasticsearch_client,
    'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.ELASTICSEARCH_NEW_INDEX_CONFIG_KEY):
        elasticsearch_new_index_key,
    'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.ELASTICSEARCH_DOC_TYPE_CONFIG_KEY):
        elasticsearch_new_index_key_type,
    'publisher.elasticsearch.{}'.format(ElasticsearchPublisher.ELASTICSEARCH_ALIAS_CONFIG_KEY):
        elasticsearch_index_alias
})

if __name__ == "__main__":
    task = DefaultTask(extractor=AtlasSearchDataExtractor(),
                       transformer=NoopTransformer(),
                       loader=FSElasticsearchJSONLoader())

    job = DefaultJob(conf=job_config,
                     task=task,
                     publisher=ElasticsearchPublisher())

    job.launch()

VerticaMetadataExtractor ¶

An extractor that extracts table and column metadata including database, schema, table name, column name and column datatype from a Vertica database.

A sample loading script for Vertica is provided here

By default, the Vertica database name is used as the cluster name. The where_clause_suffix in the example can be used to define which schemas you would like to query.

SQLAlchemyExtractor ¶

An extractor that uses SQLAlchemy to extract records from any database that SQLAlchemy supports.

job_config = ConfigFactory.from_dict({
    'extractor.sqlalchemy.{}'.format(SQLAlchemyExtractor.CONN_STRING): connection_string(),
    'extractor.sqlalchemy.{}'.format(SQLAlchemyExtractor.EXTRACT_SQL): sql,
    'extractor.sqlalchemy.model_class': 'package.module.class_name'})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=SQLAlchemyExtractor(),
        loader=AnyLoader()))
job.launch()

DbtExtractor ¶

This extractor utilizes the dbt output files catalog.json and manifest.json to extract metadata and ingest it into Amundsen. The catalog.json and manifest.json can both be generated by running dbt docs generate in your dbt project. Visit the dbt artifacts page for more information.

The DbtExtractor can currently create the following:

Tables and their definitions
Columns and their definitions
Table level lineage
dbt tags (as Amundsen badges or tags)
Table Sources (e.g. link to GitHub where the dbt template resides)

job_config = ConfigFactory.from_dict({
    # Required args
    f'extractor.dbt.{DbtExtractor.DATABASE_NAME}': 'snowflake',
    f'extractor.dbt.{DbtExtractor.CATALOG_JSON}': catalog_file_loc,  # File location
    f'extractor.dbt.{DbtExtractor.MANIFEST_JSON}': json.dumps(manifest_data),  # JSON Dumped object
    # Optional args
    f'extractor.dbt.{DbtExtractor.SOURCE_URL}': 'https://github.com/your-company/your-repo/tree/main',
    f'extractor.dbt.{DbtExtractor.EXTRACT_TABLES}': True,
    f'extractor.dbt.{DbtExtractor.EXTRACT_DESCRIPTIONS}': True,
    f'extractor.dbt.{DbtExtractor.EXTRACT_TAGS}': True,
    f'extractor.dbt.{DbtExtractor.IMPORT_TAGS_AS}': 'badges',
    f'extractor.dbt.{DbtExtractor.EXTRACT_LINEAGE}': True,
})
job = DefaultJob(
    conf=job_config,
    task=DefaultTask(
        extractor=DbtExtractor(),
        loader=AnyLoader()))
job.launch()
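To give a feel for the kind of information these files carry, the sketch below reads table names and table-level lineage from a heavily trimmed, hand-written stand-in for manifest.json. The sample content and the unique ids are made up, the field names (`nodes`, `depends_on`, and so on) reflect my understanding of the dbt manifest and should be checked against the dbt artifacts documentation, and none of this is how DbtExtractor itself processes the files.

```python
import json

# Hand-written stand-in for a manifest; real files contain many more fields.
manifest_data = {
    "nodes": {
        "model.shop.orders": {
            "database": "analytics",
            "schema": "core",
            "name": "orders",
            "description": "One row per order.",
            "tags": ["finance"],
            "depends_on": {"nodes": ["model.shop.raw_orders"]},
        },
        "model.shop.raw_orders": {
            "database": "analytics",
            "schema": "core",
            "name": "raw_orders",
            "description": "Orders as loaded from the source system.",
            "tags": [],
            "depends_on": {"nodes": []},
        },
    }
}

# The manifest can be passed around as a JSON string and parsed back.
manifest = json.loads(json.dumps(manifest_data))

# Fully qualified table names.
tables = sorted(
    "{database}.{schema}.{name}".format(**node)
    for node in manifest["nodes"].values()
)

# Table-level lineage as (upstream id, downstream id) pairs.
lineage = [
    (upstream, node_id)
    for node_id, node in manifest["nodes"].items()
    for upstream in node["depends_on"]["nodes"]
]

print(tables)
print(lineage)
```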

RestAPIExtractor ¶

An extractor that uses RestAPIQuery to extract data. A RestAPIQuery needs to be constructed (example) and injected into RestAPIExtractor.

Mode Dashboard Extractor¶

These extractors extract metadata from Mode via Mode’s REST API.

Prerequisite:

You will need to create an API access token that has admin privileges.
You will need your organization code. You can easily get it by looking at the URL of one of your Mode reports: https://app.mode.com/<organization code>/reports/report_token
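Given the URL shape above, the organization code can be pulled out of a report URL with the standard library. The helper name and the sample URL (organization `acme`, report token `abc123`) are hypothetical.

```python
from urllib.parse import urlparse


def organization_code(report_url: str) -> str:
    """Extract the organization code from a Mode report URL."""
    # Path looks like /<organization code>/reports/<report token>.
    path_parts = [part for part in urlparse(report_url).path.split("/") if part]
    if len(path_parts) < 2 or path_parts[1] != "reports":
        raise ValueError("not a Mode report URL: {}".format(report_url))
    return path_parts[0]


print(organization_code("https://app.mode.com/acme/reports/abc123"))
```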

ModeDashboardExtractor ¶

An extractor that extracts core metadata on Mode dashboards. https://app.mode.com/

It extracts a list of reports, each consisting of:

Dashboard group name (Space name)
Dashboard group id (Space token)
Dashboard group description (Space description)
Dashboard name (Report name)
Dashboard id (Report token)
Dashboard description (Report description)

Other information, such as report runs, owners, chart names, and query names, is handled by separate extractors.

It calls two APIs (the spaces API and the reports API) and joins the results together.
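The join between the two API results can be pictured as a lookup by space token. The sketch below uses made-up sample data, and the input field names (`token`, `space_token`, `name`, `description`) are assumptions for illustration rather than Mode's actual response schema; the optional skip list mirrors the idea of excluding certain dashboard groups.

```python
from typing import Dict, Iterable, List


def join_reports_with_spaces(spaces: List[Dict[str, str]],
                             reports: List[Dict[str, str]],
                             skip_space_tokens: Iterable[str] = ()) -> List[Dict[str, str]]:
    """Attach space (dashboard group) details to each report (dashboard)."""
    skipped = set(skip_space_tokens)
    spaces_by_token = {space["token"]: space for space in spaces}
    rows = []
    for report in reports:
        space = spaces_by_token.get(report["space_token"])
        if space is None or space["token"] in skipped:
            continue  # unknown or explicitly skipped dashboard group
        rows.append({
            "dashboard_group": space["name"],
            "dashboard_group_id": space["token"],
            "dashboard_group_description": space["description"],
            "dashboard_name": report["name"],
            "dashboard_id": report["token"],
            "description": report["description"],
        })
    return rows


spaces = [
    {"token": "s1", "name": "Finance", "description": "Finance dashboards"},
    {"token": "s2", "name": "Sandbox", "description": "Scratch space"},
]
reports = [
    {"token": "r1", "space_token": "s1", "name": "Revenue", "description": "Monthly revenue"},
    {"token": "r2", "space_token": "s2", "name": "Test", "description": "Experiments"},
]

rows = join_reports_with_spaces(spaces, reports, skip_space_tokens=["s2"])
print(rows)
```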

You can create Databuilder job config like this.

task = DefaultTask(extractor=ModeDashboardExtractor(),
                   loader=FsNeo4jCSVLoader())

tmp_folder = '/var/tmp/amundsen/mode_dashboard_metadata'
node_files_folder = '{tmp_folder}/nodes'.format(tmp_folder=tmp_folder)
relationship_files_folder = '{tmp_folder}/relationships'.format(tmp_folder=tmp_folder)

job_config = ConfigFactory.from_dict({
    'extractor.mode_dashboard.{}'.format(ORGANIZATION): organization,
    'extractor.mode_dashboard.{}'.format(MODE_BEARER_TOKEN): mode_bearer_token,
    'extractor.mode_dashboard.{}'.format(DASHBOARD_GROUP_IDS_TO_SKIP): [space_token_1, space_token_2, ...],
    'loader.filesystem_csv_neo4j.{}'.format(FsNeo4jCSVLoader.NODE_DIR_PATH): node_files_folder,
    'loader.filesystem_csv_neo4j.{}'.format(FsNeo4jCSVLoader.RELATION_DIR_PATH): relationship_files_folder,
    'loader.filesystem_csv_neo4j.{}'.format(FsNeo4jCSVLoader.SHOULD_DELETE_CREATED_DIR): True,
    'task.progress_report_frequency': 100,
    'publisher.neo4j.{}'.format(neo4j_csv_publisher.NODE_FILES_DIR): node_files_folder,
    'publisher.neo4j.{}'.format(neo4j_csv_publisher.RELATION_FILES_DIR): relationship_files_folder,
    'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_END_POINT_KEY): neo4j_endpoint,
    'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_USER): neo4j_user,
    'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_PASSWORD): neo4j_password,
    'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_ENCRYPTED): True,
    'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_CREATE_ONLY_NODES): [DESCRIPTION_NODE_LABEL],
    'publisher.neo4j.{}'.format(neo4j_csv_publisher.JOB_PUBLISH_TAG): job_publish_tag
})

job = DefaultJob(conf=job_config,
                 task=task,
                 publisher=Neo4jCsvPublisher())
job.launch()

ModeDashboardOwnerExtractor ¶

An extractor that extracts the dashboard owner. Mode itself does not have a concept of owner, so the creator is used as the owner. Note that if the user has left the organization, the dashboard is skipped.