valdata package¶
valdata.auxiliary module¶
- valdata.auxiliary.check_type(item: Series | ndarray | DataFrame | DataFrame | list | dict) str¶
Checks the type of the given item and returns its name as a string.
- Parameters:
item – The item whose type is to be checked.
- Returns:
A string indicating the type of the item.
- valdata.auxiliary.now_timedelta() datetime¶
Returns the current time for the ‘America/Bogota’ timezone.
- Returns:
Current time in the ‘America/Bogota’ timezone as a datetime object.
- valdata.auxiliary.path_correction(path: str) str¶
Corrects the path by replacing backslashes with forward slashes.
- Parameters:
path – The file path to be corrected.
- Returns:
The corrected path with forward slashes.
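The behaviour described above can be sketched in one line. This is a minimal illustration, not the package's source: `path_correction_sketch` is a hypothetical name, and whether the real function does anything beyond the replacement is not documented here.

```python
def path_correction_sketch(path: str) -> str:
    """Hypothetical stand-in: swap every backslash for a forward slash."""
    return path.replace("\\", "/")


# A raw string keeps the backslashes literal in the example input.
print(path_correction_sketch(r"C:\data\raw\file.csv"))  # C:/data/raw/file.csv
```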
valdata.utils module¶
- valdata.utils.add_module_path(module_path: str) None¶
Adds a new module path to the Python module search path.
- Parameters:
module_path – The path to the directory to add to the module search path.
- Returns:
None
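One plausible way to get this effect is to append the directory to `sys.path`. The sketch below assumes that approach; the function name is hypothetical, and the duplicate check is an added assumption rather than documented behaviour.

```python
import sys


def add_module_path_sketch(module_path: str) -> None:
    """Hypothetical stand-in: register a directory on sys.path once."""
    # Assumption: skip paths that are already registered to avoid duplicates.
    if module_path not in sys.path:
        sys.path.append(module_path)


add_module_path_sketch("/path/to/my_modules")
```

After the call, `import` statements can resolve modules stored in that directory for the rest of the session.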
- valdata.utils.check_directory(directory_path: str, create_if_not_exists: bool = False) None¶
Checks whether a directory exists at the given path, optionally creating it if it does not.
- Parameters:
directory_path – str - The directory path to check.
create_if_not_exists – bool - Flag to create the directory if it doesn’t exist. Default is False.
- Returns:
None
- Example:
check_directory('/path/to/directory', create_if_not_exists=True)  # Creates the directory if it does not exist
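The check-then-create logic might look roughly like the following. It is a sketch under assumptions: the function name and the printed messages are hypothetical, and only the two parameters come from the signature above.

```python
import os


def check_directory_sketch(directory_path: str, create_if_not_exists: bool = False) -> None:
    """Hypothetical stand-in: report on a directory and optionally create it."""
    if os.path.isdir(directory_path):
        print(f"Directory exists: {directory_path}")
    elif create_if_not_exists:
        # makedirs also creates any missing parent directories.
        os.makedirs(directory_path, exist_ok=True)
        print(f"Directory created: {directory_path}")
    else:
        print(f"Directory not found: {directory_path}")
```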
- valdata.utils.get_directories(directory_path='Current')¶
Displays the contents of the specified directory.
- Parameters:
directory_path – The directory to list contents of. If ‘Current’, the current working directory is used.
- Returns:
None
- valdata.utils.get_hadoop_version()¶
Retrieves and prints the versions of Python, Hadoop, Spark, and PySpark.
This function attempts to get the version of Hadoop using the ‘hadoop version’ command, the version of Spark through the SparkSession, and the version of PySpark from the pyspark library.
If Hadoop is not installed or if there is an error during execution, it will handle the error gracefully.
- Returns:
None
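To illustrate what handling a missing Hadoop installation could involve, the sketch below runs an external command and returns `None` instead of raising when the command is absent or fails. The helper `tool_version` is hypothetical, and the Spark and PySpark lookups are left out.

```python
import subprocess
import sys
from typing import Optional, Sequence


def tool_version(command: Sequence[str]) -> Optional[str]:
    """Return the first line a command prints, or None if it cannot be run."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, check=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        # Covers a missing executable, a non-zero exit status, and a timeout.
        return None
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


print("Python:", sys.version.split()[0])
print("Hadoop:", tool_version(["hadoop", "version"]) or "not available")
```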
- valdata.utils.get_modules() None¶
Prints the currently loaded modules.
- Returns:
None
- valdata.utils.read_file()¶
- valdata.utils.save_file()¶
valdata.validation module¶
- valdata.validation.check_variables(dfs_dict: dict, df_comparison: str | None = None) None¶
Compares the variables (columns) of multiple DataFrames, with an option to compare to a reference DataFrame.
- Parameters:
dfs_dict – dict - A dictionary where keys are DataFrame names (str) and values are the DataFrames (pd.DataFrame or Spark DataFrame) to compare.
df_comparison – str or None - The name of the reference DataFrame to compare against. If None, all DataFrames are compared against each other.
- Returns:
None - Prints a table showing the columns of each DataFrame and highlights any discrepancies.
Functionality:
1. Column Comparison:
- If df_comparison is provided, compares the columns of the specified DataFrame (df_comparison) with the other DataFrames in dfs_dict.
- If df_comparison is not provided, compares all DataFrames against each other.
2. Missing Columns:
- The table will show ‘Error’ for missing columns in the reference DataFrame (df_comparison), or ‘…’ for missing columns in other DataFrames when not in df_comparison.
3. Output Format:
- Displays a formatted table showing the column names of all DataFrames and their alignment. Missing or differing columns are highlighted.
- Example:
dfs_dict = {'df_name_1': df1, 'df_name_2': df2, 'df_comparison': df_ref}
check_variables(dfs_dict, df_comparison='df_comparison')
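The column comparison can be illustrated without any DataFrame library by working on lists of column names (for a Pandas DataFrame, `list(df.columns)`). This is a sketch of the idea only: `compare_columns` is a hypothetical helper, and it returns the missing columns rather than printing the formatted table described above.

```python
from typing import Dict, List, Optional


def compare_columns(
    columns_by_name: Dict[str, List[str]], reference: Optional[str] = None
) -> Dict[str, List[str]]:
    """Map each name to the columns it lacks relative to the expected set."""
    if reference is not None:
        expected = list(columns_by_name[reference])
    else:
        # No reference: expect the ordered union of every column seen.
        expected = []
        for cols in columns_by_name.values():
            for col in cols:
                if col not in expected:
                    expected.append(col)
    return {
        name: [col for col in expected if col not in cols]
        for name, cols in columns_by_name.items()
    }


columns = {
    "df_name_1": ["id", "amount"],
    "df_name_2": ["id", "amount", "date"],
    "df_comparison": ["id", "amount", "date"],
}
missing = compare_columns(columns, reference="df_comparison")
for name, cols in missing.items():
    print(name, "missing:", cols or "none")
```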
- valdata.validation.equal_df(df1: DataFrame, df2: DataFrame, aggregated: bool = False) None¶
Compare two Pandas DataFrames for equality, either element-wise or aggregated by columns.
- Parameters:
df1 – pd.DataFrame - The first DataFrame to compare.
df2 – pd.DataFrame - The second DataFrame to compare.
aggregated – bool - Determines the level of comparison.
- If False, performs a full element-wise comparison, printing True if all elements across both DataFrames match, and False if any mismatch is found.
- If True, performs an aggregated comparison, outputting a DataFrame showing True or False values for each cell, indicating equality by column for every row.
- Returns:
None
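The two comparison levels correspond to two standard Pandas operations, shown below for illustration. Which calls valdata uses internally is not stated here, so treat this as an analogy rather than the implementation.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
df2 = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 25.0, 30.0]})

# Full comparison: a single boolean for the whole pair.
all_equal = df1.equals(df2)

# Cell-level comparison: a boolean DataFrame with the same shape as the inputs.
cell_equal = df1.eq(df2)

print(all_equal)
print(cell_equal)
```

Note that `equals` treats NaN values in the same position as equal, while `eq` reports them as unequal, so the two views can disagree on data with missing values.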
- valdata.validation.equal_df_mult(dfs_dict: dict, df_comparison: str, row_count: bool = False) None¶
Compare multiple DataFrames to a reference DataFrame for structural and row-wise equality.
- Parameters:
dfs_dict – dict - Dictionary where keys are DataFrame names (str) and values are the DataFrames (pd.DataFrame or Spark DataFrame) to compare.
df_comparison – str - The name of the reference DataFrame within dfs_dict that other DataFrames will be compared against.
row_count – bool - If True, includes row count for each DataFrame in the comparison output; if False, omits row count details.
- Returns:
None - Prints a summary of comparison results for each DataFrame in dfs_dict against the reference DataFrame.
Functionality:
1. Column Comparison:
- Checks if the reference DataFrame (df_comparison) and each other DataFrame in dfs_dict have the same number of columns.
- Skips the comparison if the column counts differ.
2. Row Comparison:
- Uses .exceptAll() for Spark DataFrames to identify row-wise differences between the reference and each other DataFrame, handling duplicates and row order.
3. Output Format:
- For identical DataFrames, outputs a checkmark (✓) indicating that both DataFrames have the same shape and no different rows.
- For differing DataFrames, outputs an error (X) with row mismatch details.
- If row_count is True, includes row counts for each DataFrame; otherwise, displays “Omitted” in place of row counts.
- Example:
dfs_dict = {'df_name_1': df1, 'df_name_2': df2, 'df_name_comparison': df_ref}
equal_df_mult(dfs_dict, 'df_name_comparison', row_count=True)
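The row comparison relies on Spark's `exceptAll`, which is a multiset difference: duplicates count, row order does not. The pure-Python analogue below illustrates those semantics with `collections.Counter`. It does not use Spark, and `except_all` is a hypothetical helper name.

```python
from collections import Counter
from typing import List, Tuple

Row = Tuple[object, ...]


def except_all(left: List[Row], right: List[Row]) -> List[Row]:
    """Rows of `left` that are not matched one-for-one by a row in `right`."""
    # Counter subtraction keeps only rows with a positive remaining count.
    remaining = Counter(left) - Counter(right)
    return list(remaining.elements())


reference = [(1, "a"), (1, "a"), (2, "b")]
candidate = [(2, "b"), (1, "a")]

print(except_all(reference, candidate))  # [(1, 'a')]
print(except_all(candidate, reference))  # []
```

Because the difference is not symmetric, a full equality check needs both directions to come back empty.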
- valdata.validation.get_overview(df: DataFrame | DataFrame) None¶
Prints an overview of the given Pandas or Spark DataFrame, including its shape, type, head, and composition (variables, unique counts, and unique values).
- Parameters:
df – The DataFrame to be analyzed (Pandas or Spark).
- Returns:
None
- valdata.validation.get_pandas_schemaStr(df: DataFrame) str¶
Returns a schema-like representation of a Pandas DataFrame, similar to the Spark DataFrame schema.
- Parameters:
df – The Pandas DataFrame to extract schema from.
- Returns:
A string representation of the DataFrame’s schema.
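For orientation, a Spark-style schema listing can be approximated from `DataFrame.dtypes`. The layout below imitates the tree that Spark prints; the exact string valdata returns is not documented here, so both the function name and the output format are assumptions.

```python
import pandas as pd


def pandas_schema_sketch(df: pd.DataFrame) -> str:
    """Hypothetical stand-in: one ' |-- column: dtype' line per column."""
    lines = ["root"]
    for column, dtype in df.dtypes.items():
        lines.append(f" |-- {column}: {dtype}")
    return "\n".join(lines)


df = pd.DataFrame(
    {
        "id": pd.Series([1, 2], dtype="int64"),
        "score": pd.Series([0.5, 1.5], dtype="float64"),
    }
)
print(pandas_schema_sketch(df))
```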
- valdata.validation.now() str¶
Returns the current time in the ‘dd-MonthAb-yy hh:mm:ss AM/PM’ format for the ‘America/Bogota’ timezone.
- Returns:
Current time as a formatted string.
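The ‘dd-MonthAb-yy hh:mm:ss AM/PM’ layout maps naturally onto the `strftime` pattern `%d-%b-%y %I:%M:%S %p`. The sketch below assumes that mapping and uses the standard-library `zoneinfo` module; the helper name and the fixed-offset fallback are illustrative additions, not part of valdata.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

try:
    BOGOTA = ZoneInfo("America/Bogota")
except ZoneInfoNotFoundError:
    # Assumption: on systems without a time zone database, fall back to a
    # fixed UTC-5 offset, which matches Colombia's current standard time.
    BOGOTA = timezone(timedelta(hours=-5))

# Assumed mapping of 'dd-MonthAb-yy hh:mm:ss AM/PM' to strftime directives.
TIME_FORMAT = "%d-%b-%y %I:%M:%S %p"


def format_bogota(moment: datetime) -> str:
    """Convert an aware datetime to Bogota time and format it."""
    return moment.astimezone(BOGOTA).strftime(TIME_FORMAT)


print(format_bogota(datetime.now(timezone.utc)))
```

The month abbreviation and the AM/PM marker follow the active locale, so the output shown by a given system may differ from the English form.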
- valdata.validation.show_pandas_as_table(df: DataFrame | Series, num_rows: int = 5) None¶
Display a Pandas DataFrame or Series in a tabular format, similar to the way Spark shows DataFrames.
- Parameters:
df – The Pandas DataFrame or Series to display.
num_rows – The number of rows to display. Default is 5.
- Returns:
None
- valdata.validation.tblt_concentrations(df, conditioner: str, thresholds: list) None¶
Tabulates the concentration of data for a variable given a set of thresholds.
- Parameters:
df – The DataFrame containing the data to be analyzed.
conditioner – The name of the column used as a condition for filtering.
thresholds – A list of threshold values to evaluate the concentration.
- Returns:
None
- valdata.validation.tblt_ocurrences(df_columns, conditionated: bool = False) None¶
Tabulates the occurrence of values in a specific variable or variables.
- Parameters:
df_columns – A list of Series or DataFrame columns to tabulate. The first item is the variable of interest, and the second item is the condition (if any).
conditionated – A boolean indicating whether to tabulate conditioned on a second variable.
- Returns:
None
- valdata.validation.tms_0() None¶
Initializes the start time for the valdata process by setting the global variable ‘valdata_start_time’ to the current time.
- Returns:
None
- valdata.validation.tms_1() None¶
Calculates and prints the elapsed time since ‘tms_0’ in the format ‘days - hours:minutes:seconds’.
- Returns:
None
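The ‘days - hours:minutes:seconds’ layout can be produced from a `timedelta` with repeated `divmod` calls. This is a sketch only: `format_elapsed` is a hypothetical helper, and the zero-padding of each field is an assumption.

```python
from datetime import timedelta


def format_elapsed(elapsed: timedelta) -> str:
    """Render a duration as 'days - hours:minutes:seconds'."""
    total_seconds = int(elapsed.total_seconds())
    days, remainder = divmod(total_seconds, 86_400)
    hours, remainder = divmod(remainder, 3_600)
    minutes, seconds = divmod(remainder, 60)
    return f"{days} - {hours:02d}:{minutes:02d}:{seconds:02d}"


print(format_elapsed(timedelta(days=1, hours=2, minutes=3, seconds=4)))  # 1 - 02:03:04
```

In a start/stop pair like tms_0 and tms_1, the `timedelta` would be the difference between the current time and the stored start time.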
- valdata.validation.unique_values(df) None¶
Prints the unique values of each column in the given DataFrame.
- Parameters:
df – The DataFrame for which to display unique values.
- Returns:
None
valdata.visuals module¶
- class valdata.visuals.TextStyle¶
Bases: object
- BLACK = '\x1b[30m'¶
- BLUE = '\x1b[34m'¶
- BOLD = '\x1b[1m'¶
- CYAN = '\x1b[36m'¶
- GREEN = '\x1b[32m'¶
- ITALIC = '\x1b[3m'¶
- MAGENTA = '\x1b[35m'¶
- RED = '\x1b[31m'¶
- RESET = '\x1b[0m'¶
- UNDERLINE = '\x1b[4m'¶
- WHITE = '\x1b[37m'¶
- YELLOW = '\x1b[33m'¶
- valdata.visuals.b(text: str) str¶
Formats the given text as bold.
- Parameters:
text – The text to format in bold.
- Returns:
The formatted bold text as a string.
- valdata.visuals.b_bl(text: str) str¶
Formats the given text as bold blue.
- Parameters:
text – The text to format in bold blue.
- Returns:
The formatted bold blue text as a string.
- valdata.visuals.b_gr(text: str) str¶
Formats the given text as bold green.
- Parameters:
text – The text to format in bold green.
- Returns:
The formatted bold green text as a string.
- valdata.visuals.b_re(text: str) str¶
Formats the given text as bold red.
- Parameters:
text – The text to format in bold red.
- Returns:
The formatted bold red text as a string.
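These helpers can be understood as wrapping text in the ANSI escape codes listed under TextStyle. The sketch below reuses those code values; the function name and the order in which the codes are concatenated are assumptions.

```python
# Escape codes copied from the TextStyle attributes listed above.
BOLD = "\x1b[1m"
RED = "\x1b[31m"
RESET = "\x1b[0m"


def bold_red_sketch(text: str) -> str:
    """Hypothetical stand-in: wrap text in bold and red, then reset styling."""
    return f"{BOLD}{RED}{text}{RESET}"


print(bold_red_sketch("validation failed"))
```

Ending with the reset code keeps the styling from leaking into whatever is printed next in the terminal.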
- valdata.visuals.set_start(msg: str = '') None¶
Initializes the environment by printing a message, the current time, and the Python version. It also sets custom styles for Jupyter Notebook cells.
- Parameters:
msg – Optional message to be printed at the start. Defaults to an empty string.
- Returns:
None