valdata package

valdata.auxiliary module

valdata.auxiliary.check_type(item: Series | ndarray | DataFrame | DataFrame | list | dict) → str

Checks the type of the given item and returns its name as a string.

Parameters:

item – The item whose type is to be checked.

Returns:

A string indicating the type of the item.
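
As a rough illustration, a type-name lookup of this kind can be written with the built-in type(). The helper name type_name below is hypothetical, and the exact strings that check_type returns are not specified in this reference, so treat the labels as an assumption.

```python
def type_name(item) -> str:
    """Return the class name of item, for example 'list' or 'dict'."""
    # Hypothetical helper: valdata's check_type may return different labels.
    return type(item).__name__


print(type_name([1, 2, 3]))
print(type_name({"a": 1}))
```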

valdata.auxiliary.now_timedelta() → datetime

Returns the current time for the ‘America/Bogota’ timezone. Unlike valdata.validation.now, which returns a string in the ‘dd-MonthAb-yy hh:mm:ss AM/PM’ format, this function returns a datetime object.

Returns:

Current time in the ‘America/Bogota’ timezone as a datetime object.

valdata.auxiliary.path_correction(path: str) → str

Corrects the path by replacing backslashes with forward slashes.

Parameters:

path – The file path to be corrected.

Returns:

The corrected path with forward slashes.
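
A minimal sketch of the replacement described above, assuming it is a plain character substitution with no further normalisation. The name to_forward_slashes is hypothetical and is not part of valdata.

```python
def to_forward_slashes(path: str) -> str:
    """Replace every backslash in path with a forward slash."""
    # Assumed behaviour: simple substitution, no path normalisation.
    return path.replace("\\", "/")


print(to_forward_slashes(r"C:\data\input\file.csv"))  # C:/data/input/file.csv
```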

valdata.utils module

valdata.utils.add_module_path(module_path: str) → None

Adds a new module path to the Python module search path.

Parameters:

module_path – The path to the directory to add to the module search path.

Returns:

None
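
Extending the module search path is usually done through sys.path. The sketch below shows one way to do it; the helper name append_search_path is hypothetical, and the duplicate check is an assumption rather than documented behaviour of add_module_path.

```python
import sys


def append_search_path(module_path: str) -> None:
    """Append module_path to sys.path unless it is already listed."""
    # Assumption: duplicates are skipped; the real function may not do this.
    if module_path not in sys.path:
        sys.path.append(module_path)


append_search_path("/path/to/modules")
```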

valdata.utils.check_directory(directory_path: str, create_if_not_exists: bool = False) → None

Checks whether a directory exists at the given path and, optionally, creates it if it does not.

Parameters:
  • directory_path – str - The directory path to check.

  • create_if_not_exists – bool - Flag to create the directory if it doesn’t exist. Default is False.

Returns:

None

Example:

check_directory('/path/to/directory', create_if_not_exists=True)  # Creates the directory if it does not exist
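
The check-then-create logic can be sketched with the standard library as follows. The helper ensure_directory is hypothetical. It returns a bool so the result is easy to inspect, whereas check_directory itself is documented as returning None.

```python
import os
import tempfile


def ensure_directory(directory_path: str, create_if_not_exists: bool = False) -> bool:
    """Return True if the directory exists once the call has finished."""
    if os.path.isdir(directory_path):
        return True
    if create_if_not_exists:
        # exist_ok avoids an error if another process creates it first.
        os.makedirs(directory_path, exist_ok=True)
        return True
    return False


base = tempfile.mkdtemp()  # scratch location used only for this demo
target = os.path.join(base, "reports")
print(ensure_directory(target))
print(ensure_directory(target, create_if_not_exists=True))
```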

valdata.utils.get_directories(directory_path='Current')

Displays the contents of the specified directory.

Parameters:

directory_path – The directory to list contents of. If ‘Current’, the current working directory is used.

Returns:

None

valdata.utils.get_hadoop_version()

Retrieves and prints the versions of Python, Hadoop, Spark, and PySpark.

This function attempts to get the version of Hadoop using the ‘hadoop version’ command, the version of Spark through the SparkSession, and the version of PySpark from the pyspark library.

If Hadoop is not installed or if there is an error during execution, it will handle the error gracefully.

Returns:

None
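
The following sketch shows one way to query an external tool such as ‘hadoop version’ without failing when the tool is missing. The helper command_output is hypothetical, returning None on failure is an assumption about what handling the error gracefully means, and the Spark and PySpark lookups are left out because they need a running SparkSession.

```python
import subprocess
import sys


def command_output(cmd):
    """Return the first output line of cmd, or None if it cannot be run."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        # Command not installed, not executable, or exited with an error.
        return None
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else None


print("Python:", sys.version.split()[0])
print("Hadoop:", command_output(["hadoop", "version"]) or "not available")
```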

valdata.utils.get_modules() → None

Prints the currently loaded modules.

Returns:

None

valdata.utils.read_file()
valdata.utils.save_file()

valdata.validation module

valdata.validation.check_variables(dfs_dict: dict, df_comparison: str | None = None) → None

Compares the variables (columns) of multiple DataFrames, with an option to compare to a reference DataFrame.

Parameters:
  • dfs_dict – dict - A dictionary where keys are DataFrame names (str) and values are the DataFrames (pd.DataFrame or Spark DataFrame) to compare.

  • df_comparison – str or None - The name of the reference DataFrame to compare against. If None, all DataFrames are compared against each other.

Returns:

None - Prints a table showing the columns of each DataFrame and highlights any discrepancies.

Functionality:

  1. Column Comparison:
     - If df_comparison is provided, compares the columns of the specified DataFrame (df_comparison) with the other DataFrames in dfs_dict.
     - If df_comparison is not provided, compares all DataFrames against each other.

  2. Missing Columns:
     - The table will show ‘Error’ for missing columns in the reference DataFrame (df_comparison), or ‘…’ for missing columns in other DataFrames when not in df_comparison.

  3. Output Format:
     - Displays a formatted table showing the column names of all DataFrames and their alignment. Missing or differing columns are highlighted.

Example:

dfs_dict = {'df_name_1': df1, 'df_name_2': df2, 'df_comparison': df_ref}
check_variables(dfs_dict, df_comparison='df_comparison')
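
The comparison against a reference can be illustrated with plain column-name lists. This is a sketch under stated assumptions, not the library's implementation: compare_columns is a hypothetical helper, it takes lists of names (with pandas one could pass list(df.columns)), and it returns a dict instead of printing a table.

```python
def compare_columns(columns_by_name: dict, reference: str) -> dict:
    """Map each non-reference name to its missing and extra columns."""
    ref_cols = set(columns_by_name[reference])
    report = {}
    for name, cols in columns_by_name.items():
        if name == reference:
            continue
        cols = set(cols)
        report[name] = {
            "missing": sorted(ref_cols - cols),  # in the reference, absent here
            "extra": sorted(cols - ref_cols),    # here, absent from the reference
        }
    return report


columns = {
    "df_name_1": ["id", "amount", "date"],
    "df_name_2": ["id", "amount"],
    "df_comparison": ["id", "amount", "date"],
}
print(compare_columns(columns, reference="df_comparison"))
```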

valdata.validation.equal_df(df1: DataFrame, df2: DataFrame, aggregated: bool = False) → None

Compare two Pandas DataFrames for equality, either element-wise or aggregated by columns.

Parameters:
  • df1 – pd.DataFrame - The first DataFrame to compare.

  • df2 – pd.DataFrame - The second DataFrame to compare.

  • aggregated – bool - Determines the level of comparison.
      - If False, performs a full element-wise comparison, printing True if all elements across both DataFrames match, and False if any mismatch is found.
      - If True, performs an aggregated comparison, outputting a DataFrame showing True or False values for each cell, indicating equality by column for every row.

Returns:

None

valdata.validation.equal_df_mult(dfs_dict: dict, df_comparison: str, row_count: bool = False) → None

Compare multiple DataFrames to a reference DataFrame for structural and row-wise equality.

Parameters:
  • dfs_dict – dict - Dictionary where keys are DataFrame names (str) and values are the DataFrames (pd.DataFrame or Spark DataFrame) to compare.

  • df_comparison – str - The name of the reference DataFrame within dfs_dict that other DataFrames will be compared against.

  • row_count – bool - If True, includes row count for each DataFrame in the comparison output; if False, omits row count details.

Returns:

None - Prints a summary of comparison results for each DataFrame in dfs_dict against the reference DataFrame.

Functionality:

  1. Column Comparison:
     - Checks if the reference DataFrame (df_comparison) and each other DataFrame in dfs_dict have the same number of columns.
     - Skips the comparison if the column counts differ.

  2. Row Comparison:
     - Uses .exceptAll() for Spark DataFrames to identify row-wise differences between the reference and each other DataFrame, handling duplicates and row order.

  3. Output Format:
     - For identical DataFrames, outputs a checkmark (✓) indicating that both DataFrames have the same shape and no different rows.
     - For differing DataFrames, outputs an error (X) with row mismatch details.
     - If row_count is True, includes row counts for each DataFrame; otherwise, displays “Omitted” in place of row counts.

Example:

dfs_dict = {'df_name_1': df1, 'df_name_2': df2, 'df_name_comparison': df_ref}
equal_df_mult(dfs_dict, 'df_name_comparison', row_count=True)
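
To show what a duplicate-preserving, order-insensitive row difference means, the sketch below mimics the semantics of .exceptAll() with collections.Counter on plain tuples. The helper except_all is hypothetical and does not use Spark. It only illustrates why one extra duplicate row counts as a difference while a different row order does not.

```python
from collections import Counter


def except_all(left_rows, right_rows):
    """Rows of left_rows with no match in right_rows, duplicates preserved."""
    # Counter subtraction keeps only positive counts (a multiset difference).
    remaining = Counter(left_rows) - Counter(right_rows)
    return sorted(remaining.elements())


reference = [(1, "a"), (2, "b"), (2, "b")]
candidate = [(2, "b"), (1, "a")]  # different order, one duplicate fewer

print(except_all(reference, candidate))  # [(2, 'b')]
print(except_all(candidate, reference))  # []
```

Running the difference in both directions, as above, is needed to conclude that two sets of rows are identical.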

valdata.validation.get_overview(df: DataFrame | DataFrame) → None

Prints an overview of the given Pandas or Spark DataFrame, including its shape, type, head, and composition (variables, unique counts, and unique values).

Parameters:

df – The DataFrame to be analyzed (Pandas or Spark).

Returns:

None

valdata.validation.get_pandas_schemaStr(df: DataFrame) → str

Returns a schema-like representation of a Pandas DataFrame, similar to the Spark DataFrame schema.

Parameters:

df – The Pandas DataFrame to extract schema from.

Returns:

A string representation of the DataFrame’s schema.

valdata.validation.now() → str

Returns the current time in the ‘dd-MonthAb-yy hh:mm:ss AM/PM’ format for the ‘America/Bogota’ timezone.

Returns:

Current time as a formatted string.
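
A formatted timestamp of this kind can be sketched with the standard library. The strftime pattern below is an assumed reading of ‘dd-MonthAb-yy hh:mm:ss AM/PM’, the helper format_bogota is hypothetical, and the fixed UTC-05:00 fallback assumes Colombia's current offset with no daylight saving time.

```python
from datetime import datetime, timedelta, timezone

try:
    from zoneinfo import ZoneInfo
    BOGOTA = ZoneInfo("America/Bogota")
except Exception:
    # Fallback when no time zone database is available (assumed fixed offset).
    BOGOTA = timezone(timedelta(hours=-5))


def format_bogota(moment: datetime) -> str:
    """Format an aware datetime as Bogota local time."""
    # Assumed pattern: day-abbreviated month-2-digit year, 12-hour clock.
    return moment.astimezone(BOGOTA).strftime("%d-%b-%y %I:%M:%S %p")


print(format_bogota(datetime.now(timezone.utc)))
```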

valdata.validation.show_pandas_as_table(df: DataFrame | Series, num_rows: int = 5) → None

Display a Pandas DataFrame or Series in a tabular format, similar to the way Spark shows DataFrames.

Parameters:
  • df – The Pandas DataFrame or Series to display.

  • num_rows – The number of rows to display. Default is 5.

Returns:

None

valdata.validation.tblt_concentrations(df, conditioner: str, thresholds: list) → None

Tabulates the concentration of data for a variable given a set of thresholds.

Parameters:
  • df – The DataFrame containing the data to be analyzed.

  • conditioner – The name of the column used as a condition for filtering.

  • thresholds – A list of threshold values to evaluate the concentration.

Returns:

None

valdata.validation.tblt_ocurrences(df_columns, conditionated: bool = False) → None

Tabulates the occurrence of values in a specific variable or variables.

Parameters:
  • df_columns – A list of Series or DataFrame columns to tabulate. The first item is the variable of interest, and the second item is the condition (if any).

  • conditionated – A boolean indicating whether to tabulate conditioned on a second variable.

Returns:

None

valdata.validation.tms_0() → None

Initializes the start time for the valdata process by setting the global variable ‘valdata_start_time’ to the current time.

Returns:

None

valdata.validation.tms_1() → None

Calculates and prints the elapsed time since ‘tms_0’ in the format ‘days - hours:minutes:seconds’.

Returns:

None

valdata.validation.unique_values(df) → None

Prints the unique values of each column in the given DataFrame.

Parameters:

df – The DataFrame for which to display unique values.

Returns:

None

valdata.visuals module

class valdata.visuals.TextStyle

Bases: object

BLACK = '\x1b[30m'
BLUE = '\x1b[34m'
BOLD = '\x1b[1m'
CYAN = '\x1b[36m'
GREEN = '\x1b[32m'
ITALIC = '\x1b[3m'
MAGENTA = '\x1b[35m'
RED = '\x1b[31m'
RESET = '\x1b[0m'
UNDERLINE = '\x1b[4m'
WHITE = '\x1b[37m'
YELLOW = '\x1b[33m'

valdata.visuals.b(text: str) → str

Returns the given text formatted as bold.

Parameters:

text – The text to be printed in bold.

Returns:

The formatted bold text as a string.

valdata.visuals.b_bl(text: str) → str

Returns the given text formatted as bold blue.

Parameters:

text – The text to be printed in bold blue.

Returns:

The formatted bold blue text as a string.

valdata.visuals.b_gr(text: str) → str

Returns the given text formatted as bold green.

Parameters:

text – The text to be printed in bold green.

Returns:

The formatted bold green text as a string.

valdata.visuals.b_re(text: str) → str

Returns the given text formatted as bold red.

Parameters:

text – The text to be printed in bold red.

Returns:

The formatted bold red text as a string.
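
The helpers above can be understood as wrapping text in the ANSI escape codes listed under TextStyle. The sketch below is an illustration, not the library's code: styled is a hypothetical function, and appending the RESET code after the text is an assumption.

```python
BOLD = "\x1b[1m"
RED = "\x1b[31m"
RESET = "\x1b[0m"


def styled(text: str, *codes: str) -> str:
    """Prefix text with the given ANSI codes and reset styling afterwards."""
    # Assumption: styling is reset right after the text.
    return "".join(codes) + text + RESET


print(styled("done", BOLD))
print(styled("failed", BOLD, RED))
```

In a terminal or notebook that honours ANSI codes, the second line prints in bold red.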

valdata.visuals.set_start(msg: str = '') → None

Initializes the environment by printing a message, the current time, and the Python version. It also sets custom styles for Jupyter Notebook cells.

Parameters:

msg – Optional message to be printed at the start. Defaults to an empty string.

Returns:

None