Common Module Documentation

The common module provides generic functions to support a Delta Lakehouse implementation

alter_delta_table_properties(spark, ...)

This function alters the properties of a Delta table in a database.

create_aggregate_dataframe(spark, dataframe, ...)

This function aggregates a dataframe based on the aggregation definition provided

create_anti_merge_predicate(spark, ...)

Creates the anti-merge key based on the 'merge_predicate' provided and logs the function's start and end time using 'post_la_data'.
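As an illustration, a predicate builder of this kind might negate the equality join across the key columns. This is a hypothetical sketch only: it assumes `merge_predicate` is a list of key column names, and it omits the `spark` parameter and the `post_la_data` logging the real function performs.

```python
def create_anti_merge_predicate(merge_predicate):
    # Negate the full equality join so the predicate matches rows
    # that do NOT satisfy the merge key (illustrative sketch only).
    equality = " AND ".join(f"target.{col} = source.{col}" for col in merge_predicate)
    return f"NOT ({equality})"
```

For example, `create_anti_merge_predicate(["id", "region"])` yields a condition that excludes rows matched on both keys.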

create_catalog_not_exists(spark, catalogName)

A function to create a Databricks Unity Catalog within the Databricks Workspace

create_database_not_exists(spark, databaseName)

A function to create a Hive database within the Spark workspace if it does not already exist

create_delta_table(spark, table_def, ...)

This function creates a Delta table in the Delta Lakehouse and returns the schema to be used when creating a Synapse serverless view

create_folder_structure(spark, lake_file_struct)

A function to create the folder depth structure for querying nested files, based on the lake file structure of the data being queried

create_merge_dict(spark, columns)

Create a dictionary of the form {"column1": "source.column1", "column2": "source.column2", ...}, where "column1", "column2", etc. are the column names supplied in 'columns'.

create_merge_predicate(spark, merge_predicate)

Creates the merge key based on the 'merge_predicate' provided and logs the function's start and end time using 'post_la_data'.
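A merge key of this kind is typically an equality condition joined with AND across the key columns. The sketch below is an assumption about the output format, not the module's actual implementation; the `spark` parameter and `post_la_data` logging are omitted.

```python
def create_merge_predicate(merge_predicate):
    # Build the ON condition for a Delta MERGE: one equality clause
    # per key column, combined with AND (illustrative sketch only).
    return " AND ".join(f"target.{col} = source.{col}" for col in merge_predicate)
```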

create_mount(spark, mounting_point, ...[, ...])

A function to create a mount point within the Databricks Cluster

create_path_not_exists(spark, path)

A function to create a folder path in the datalake

create_type_two_condition(spark, type_two_keys)

Creates the Type 2 condition based on the 'type_two_keys' provided and logs the function's start and end time using 'post_la_data'.
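For Slowly Changing Dimension Type 2 handling, the condition usually fires when any tracked attribute differs between source and target, signalling that the current record should be closed out. The following is a hypothetical sketch of that shape, again omitting `spark` and the logging call.

```python
def create_type_two_condition(type_two_keys):
    # True when any tracked attribute has changed between target and
    # source rows (illustrative sketch of an SCD Type 2 change check).
    return " OR ".join(f"target.{col} <> source.{col}" for col in type_two_keys)
```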

delete_files_in_folder(spark, mount_point, ...)

A function to recursively delete all files in a folder within a mount point

delta_merge(spark, dataframe, partition_by, ...)

This function performs a merge operation for Delta tables

do_vacuum_and_optimize(spark, delta_db, ...)

A function to check whether a table has accumulated enough versions, or is old enough, to warrant an optimize and vacuum
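The decision logic behind such a check might look like the sketch below. The threshold names and values are assumptions for illustration; the real function would read the version count and last-optimize time from the table's Delta history via Spark.

```python
from datetime import datetime, timedelta

def should_vacuum_and_optimize(version_count, last_optimized,
                               min_versions=10, max_age_days=7):
    # Hypothetical maintenance check: due either when the table has
    # accumulated many versions or when it has not been optimized
    # for longer than max_age_days.
    too_many_versions = version_count >= min_versions
    too_old = datetime.now() - last_optimized > timedelta(days=max_age_days)
    return too_many_versions or too_old
```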

drop_all_tables(spark, schema)

A function to drop all hive tables within a hive database/schema

generate_delta_load_rowsOutput(spark, ...)

A function to return the number of output rows from a Delta table operation

generate_delta_load_stats(spark, schema, ...)

A function to return the statistics from a Delta table insert/update/delete operation; always returns the most recent history entry, excluding optimize operations

generate_expectation_meta_behaviour(spark, ge_df)

This function generates expectation metadata to control how the calling pipeline behaves on expectation failure

generate_expectation_results(spark, ...[, ...])

A function to generate expectation results defined in a YAML definition

generate_sql_expectation_table_validation_result_queries(...)

Generate queries for an expectation set

get_all_tables(spark, schema)

A function to return all Hive tables in the hive database

get_delta_floor_pred_for_sql(spark, ...[, ...])

A function to get the floor value for a data element in the data lake in incremental data loading strategies

get_partition_key_vals_for_pruning(spark, ...)

A function to generate the partition-key string used when merging into a partitioned table
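A pruning string of this kind typically restricts the merge scan to the partitions actually touched by the incoming data. The helper below is an assumed shape for illustration, not the module's implementation, and handles a single string-typed partition column only.

```python
def partition_pruning_predicate(partition_col, values):
    # Build an IN-list predicate so the MERGE only scans the
    # partitions present in the source batch (illustrative sketch).
    quoted = ", ".join(f"'{v}'" for v in values)
    return f"target.{partition_col} IN ({quoted})"
```

Appending such a predicate to the merge condition lets Delta skip untouched partitions entirely.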

optimize_table(spark, delta_db, delta_table)

A function to optimize a delta table

read_data_lake_query(spark, query[, using])

A function to read data from the data lake using a Hive SQL query; the data must be available as a Hive table

read_datalake_file(spark, mounting_point, ...)

A function to read a file, or set of files, in the data lake

repartition_delta_table(spark, schema, ...)

A function to repartition a delta table

rollback_delta_table_by_version(spark, ...)

A function to roll back a Delta table to a previous version

sql_data_type_replace(type_string)

This function replaces unsupported SQL types when generating a schema for SQL serverless views
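Such a replacement is usually a simple lookup from Spark type names to serverless-compatible SQL types. The mapping below is purely illustrative; the actual type choices in the module may differ.

```python
def sql_data_type_replace(type_string):
    # Map Spark types with no direct serverless SQL equivalent to a
    # supported type; pass through anything already supported.
    # The target types here are assumptions for illustration.
    replacements = {
        "string": "varchar(8000)",
        "double": "float",
        "long": "bigint",
        "timestamp": "datetime2",
    }
    return replacements.get(type_string.lower(), type_string)
```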

truncate_all_tables(spark, schema)

A function to truncate all of the Hive tables within a hive database/schema

vacumm_and_optimize_all_tables(spark, delta_db)

A function to vacuum and optimize all delta tables in a specific hive database/schema

vacuum_table(spark, delta_db, delta_table[, ...])

A function to vacuum a Delta table

jinga_reference(config_mount, input)

This function evaluates the input string and applies Jinja macros found in the macro_functions.j2 file in the config/04-macros folder in the data lake

write_file_to_datalake(spark, data, ...)

A function to write files to the data lake