Generic Bootstrap and Permutation methods API

ccrvam.checkerboard.genstatsim

Module Contents

Classes

CustomBootstrapResult

Container for bootstrap simulation (including confidence intervals) results with statistical visualization capabilities.

CustomPermutationResult

Container for permutation simulation (including test) results with statistical visualization capabilities.

SubsetCCRAMResult

Container for subset (S)CCRAM analysis results.

BestSubsetCCRAMResult

Container for best subset (S)CCRAM analysis results.

Functions

_process_bootstrap_batch

Helper function for parallel bootstrap processing.

bootstrap_ccram

Perform bootstrap simulation and confidence intervals for (S)CCRAM.

_process_prediction_batch

Helper function for parallel prediction processing.

bootstrap_predict_ccr_summary

Compute bootstrap prediction matrix showing percentage predictions for each combination of predictor values in CCR analysis.

save_predictions

Save prediction results generated by bootstrap_predict_ccr_summary() to a file.

_process_permutation_batch

Helper function for parallel permutation test processing.

permutation_test_ccram

Perform permutation simulation and test for (S)CCRAM.

_format_tuple_display

Format a tuple as a string without trailing comma for single elements.

all_subsets_ccram

Calculate (S)CCRAM for all possible predictor subsets.

best_subset_ccram

Find the optimal predictor subset with the highest (S)CCRAM value.

API

class ccrvam.checkerboard.genstatsim.CustomBootstrapResult[source]

Container for bootstrap simulation (including confidence intervals) results with statistical visualization capabilities.

Input Arguments

  • metric_name : Name of the metric being bootstrapped

  • observed_value : Original observed value of the metric

  • confidence_interval : Lower and upper bounds for the bootstrap confidence interval

  • bootstrap_distribution : Array of bootstrapped values of the metric

  • standard_error : Standard error of the bootstrap distribution

  • bootstrap_tables : Array of bootstrapped contingency tables (optional)

  • histogram_fig : Matplotlib figure of distribution plot (optional)

metric_name: str[source]

None

observed_value: float[source]

None

confidence_interval: Tuple[float, float][source]

None

bootstrap_distribution: numpy.ndarray[source]

None

standard_error: float[source]

None

bootstrap_tables: Optional[numpy.ndarray][source]

None

histogram_fig: Optional[matplotlib.pyplot.Figure][source]

None

plot_distribution(title: Optional[str] = None, figsize: Optional[Tuple[int, int]] = None, title_fontsize: Optional[int] = None, xlabel_fontsize: Optional[int] = None, ylabel_fontsize: Optional[int] = None, tick_fontsize: Optional[int] = None, text_fontsize: Optional[int] = None, **kwargs) -> Optional[matplotlib.pyplot.Figure][source]

Plot bootstrap distribution with observed value.

Input Arguments

  • title : Title of the plot (optional)

  • figsize : Figure size as (width, height) tuple (optional)

  • title_fontsize : Font size for the plot title (optional)

  • xlabel_fontsize : Font size for x-axis label (optional)

  • ylabel_fontsize : Font size for y-axis label (optional)

  • tick_fontsize : Font size for axis tick labels (optional)

  • text_fontsize : Font size for text inside the plot (optional)

  • **kwargs : Additional matplotlib arguments passed to plotting functions

Outputs

Matplotlib figure of distribution plot

Warnings/Errors

  • Exception : If the plot cannot be created

ccrvam.checkerboard.genstatsim._process_bootstrap_batch(args)[source]

Helper function for parallel bootstrap processing.

ccrvam.checkerboard.genstatsim.bootstrap_ccram(contingency_table: numpy.ndarray, predictors: Union[List[int], int], response: int, scaled: bool = False, n_resamples: int = 9999, confidence_level: float = 0.95, method: str = 'percentile', random_state: Optional[int] = None, store_tables: bool = False, parallel: bool = False) -> ccrvam.checkerboard.genstatsim.CustomBootstrapResult[source]

Perform bootstrap simulation and confidence intervals for (S)CCRAM.

Input Arguments

  • contingency_table : Input contingency table of frequency counts

  • predictors : 1-indexed predictor axis, or list of 1-indexed predictor axes, for (S)CCRAM calculation

  • response : 1-indexed target response variable axis for (S)CCRAM calculation

  • scaled : Whether to use scaled (S)CCRAM (default=False)

  • n_resamples : Number of bootstrap resamples (default=9999)

  • confidence_level : Confidence level for bootstrap confidence intervals (default=0.95)

  • method : Bootstrap CI method ('percentile', 'basic', or 'BCa') (default='percentile')

  • random_state : Random state for reproducibility (optional)

  • store_tables : Whether to store the bootstrapped contingency tables (default=False)

  • parallel : Whether to use parallel processing (default=False)

Outputs

CustomBootstrapResult containing the bootstrap confidence interval, the bootstrap distribution of the (S)CCRAM with its standard error, and, when store_tables=True, the bootstrapped contingency tables

Warnings/Errors

  • ValueError : If predictor or response axis is out of bounds
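The percentile method named above can be illustrated with a minimal standard-library sketch. It is not the ccrvam implementation: resample_table, diagonal_share, and percentile_bootstrap are hypothetical names, the table is made up, and diagonal_share is only a placeholder statistic standing in for the (S)CCRAM.

```python
import random
import statistics


def resample_table(counts, rng):
    """Draw one bootstrap table by resampling n observations from the observed cells."""
    cells = [(i, j) for i, row in enumerate(counts) for j in range(len(row))]
    weights = [counts[i][j] for i, j in cells]
    boot = [[0] * len(row) for row in counts]
    for i, j in rng.choices(cells, weights=weights, k=sum(weights)):
        boot[i][j] += 1
    return boot


def diagonal_share(counts):
    """Placeholder statistic (NOT the CCRAM): share of observations on the diagonal."""
    total = sum(map(sum, counts))
    size = min(len(counts), len(counts[0]))
    return sum(counts[i][i] for i in range(size)) / total


def percentile_bootstrap(counts, statistic, n_resamples=999,
                         confidence_level=0.95, seed=0):
    """Return (lower, upper, standard_error, sorted bootstrap distribution)."""
    rng = random.Random(seed)
    dist = sorted(statistic(resample_table(counts, rng))
                  for _ in range(n_resamples))
    alpha = 1.0 - confidence_level
    # Nearest-rank percentiles of the sorted bootstrap distribution.
    lower = dist[int((alpha / 2) * (n_resamples - 1))]
    upper = dist[round((1 - alpha / 2) * (n_resamples - 1))]
    return lower, upper, statistics.stdev(dist), dist


table = [[30, 10], [10, 30]]  # made-up 2x2 table of frequency counts
lower, upper, se, dist = percentile_bootstrap(table, diagonal_share)
```

Fixing the seed plays the role that random_state plays in the documented signature: the same seed reproduces the same interval.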

ccrvam.checkerboard.genstatsim._process_prediction_batch(args)[source]

Helper function for parallel prediction processing.

ccrvam.checkerboard.genstatsim.bootstrap_predict_ccr_summary(table: numpy.ndarray, predictors: Union[List[int], int], predictors_names: Optional[List[str]] = None, response: Optional[int] = None, response_name: Optional[str] = None, n_resamples: int = 9999, random_state: Optional[int] = None, parallel: bool = True) -> pandas.DataFrame[source]

Compute bootstrap prediction matrix showing percentage predictions for each combination of predictor values in CCR analysis.

Input Arguments

  • table : Contingency table of frequency counts

  • predictors : Predictor dimension, or list of predictor dimensions (1-indexed)

  • predictors_names : Names of predictor variables (optional)

  • response : Response variable dimension (1-indexed). If None, the last dimension is used.

  • response_name : Name of response variable (optional)

  • n_resamples : Number of bootstrap resamples (default=9999)

  • random_state : Random state for reproducibility (optional)

  • parallel : Whether to use parallel processing (default=True)

Outputs

Post-bootstrap CCR prediction matrix giving, for each combination of predictor categories, the percentage with which each response category is predicted.

Warnings/Errors

  • ValueError : If predictor or response axis is out of bounds

Notes

  • The output is a pandas DataFrame with the percentage of the predicted category of the response variable for each combination of categories of the predictors.

  • The returned DataFrame also exposes a plot_predictions_summary method that plots the prediction matrix as a heatmap or bubble plot.
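One way such a percentage matrix could be tabulated is sketched below with the standard library only. It is an illustration under stated assumptions, not the library's code: the function names and the table are hypothetical, and modal_prediction (the most frequent response category) is a plainly swapped-in stand-in for the CCR prediction rule.

```python
import random


def resample_table(counts, rng):
    """Draw one bootstrap table by resampling n observations from the observed cells."""
    cells = [(i, j) for i, row in enumerate(counts) for j in range(len(row))]
    weights = [counts[i][j] for i, j in cells]
    boot = [[0] * len(row) for row in counts]
    for i, j in rng.choices(cells, weights=weights, k=sum(weights)):
        boot[i][j] += 1
    return boot


def modal_prediction(row):
    """Stand-in rule (NOT the CCR rule): index of the most frequent response category."""
    return max(range(len(row)), key=lambda j: row[j])


def prediction_percentages(counts, n_resamples=500, seed=0):
    """Rows: predictor categories. Columns: % of resamples predicting each response."""
    rng = random.Random(seed)
    hits = [[0] * len(counts[0]) for _ in counts]
    for _ in range(n_resamples):
        boot = resample_table(counts, rng)
        for i, row in enumerate(boot):
            hits[i][modal_prediction(row)] += 1
    return [[100.0 * h / n_resamples for h in row] for row in hits]


# Made-up table: 2 predictor categories (rows) by 3 response categories (columns).
table = [[40, 5, 5], [5, 40, 5]]
matrix = prediction_percentages(table)
```

Each row of the result sums to 100, because every resample contributes exactly one predicted category per predictor combination.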

ccrvam.checkerboard.genstatsim.save_predictions(prediction_matrix: pandas.DataFrame, save_path: Optional[str] = None, format: str = 'csv', decimal_places: int = 2) -> None[source]

Save prediction results generated by bootstrap_predict_ccr_summary() to a file.

Input Arguments

  • prediction_matrix : DataFrame containing prediction results generated by bootstrap_predict_ccr_summary()

  • save_path : Path of the output file (required despite the None default; see Warnings/Errors)

  • format : Output format ('csv' or 'txt') (default='csv')

  • decimal_places : Number of decimal places used for the prediction percentages (default=2)

Outputs

None (saves file to disk)

Warnings/Errors

  • ValueError : If save_path is not specified
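The behaviour described above (fixed decimal places, two output formats, an error when no path is given) can be sketched with the standard library. The helper save_matrix is hypothetical. Treating 'txt' as tab-delimited and rejecting other formats are assumptions made for this example, not documented behaviour of save_predictions.

```python
import csv
import os
import tempfile


def save_matrix(rows, header, save_path=None, fmt="csv", decimal_places=2):
    """Write labelled rows of percentages to disk with fixed decimal places."""
    if save_path is None:
        raise ValueError("save_path must be specified")
    if fmt not in ("csv", "txt"):
        raise ValueError("fmt must be 'csv' or 'txt'")
    delimiter = "," if fmt == "csv" else "\t"  # assumption: 'txt' means tab-delimited
    with open(save_path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter=delimiter)
        writer.writerow(header)
        for label, *values in rows:
            writer.writerow([label] + [f"{v:.{decimal_places}f}" for v in values])


# Made-up prediction percentages for two predictor categories.
header = ["X1", "Y=1", "Y=2"]
rows = [["1", 33.3333, 66.6667], ["2", 80.0, 20.0]]

path = os.path.join(tempfile.mkdtemp(), "predictions.csv")
save_matrix(rows, header, save_path=path)

with open(path, newline="") as fh:
    written = list(csv.reader(fh))
```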

class ccrvam.checkerboard.genstatsim.CustomPermutationResult[source]

Container for permutation simulation (including test) results with statistical visualization capabilities.

Input Arguments

  • metric_name : Name of the (S)CCRAM being tested

  • observed_value : Original observed value of the (S)CCRAM

  • p_value : Permutation test p-value

  • null_distribution : Array of the values of the (S)CCRAM computed for the permuted contingency tables

  • permutation_tables : (Optional) Array of permuted contingency tables generated under the null hypothesis (no regression association)

  • histogram_fig : (Optional) Matplotlib figure of distribution plot

metric_name: str[source]

None

observed_value: float[source]

None

p_value: float[source]

None

null_distribution: numpy.ndarray[source]

None

permutation_tables: numpy.ndarray[source]

None

histogram_fig: matplotlib.pyplot.Figure[source]

None

plot_distribution(title: Optional[str] = None, figsize: Optional[Tuple[int, int]] = None, title_fontsize: Optional[int] = None, xlabel_fontsize: Optional[int] = None, ylabel_fontsize: Optional[int] = None, tick_fontsize: Optional[int] = None, **kwargs) -> Optional[matplotlib.pyplot.Figure][source]

Plot null distribution with observed value.

Input Arguments

  • title : Title of the plot (optional)

  • figsize : Figure size as (width, height) tuple (optional)

  • title_fontsize : Font size for the plot title (optional)

  • xlabel_fontsize : Font size for x-axis label (optional)

  • ylabel_fontsize : Font size for y-axis label (optional)

  • tick_fontsize : Font size for axis tick labels (optional)

  • **kwargs : Additional matplotlib arguments passed to plotting functions

Outputs

Matplotlib figure of distribution plot

ccrvam.checkerboard.genstatsim._process_permutation_batch(args)[source]

Helper function for parallel permutation test processing.

ccrvam.checkerboard.genstatsim.permutation_test_ccram(contingency_table: numpy.ndarray, predictors: Union[List[int], int], response: int, scaled: bool = False, alternative: str = 'greater', n_resamples: int = 9999, random_state: Optional[int] = None, store_tables: bool = False, parallel: bool = False) -> ccrvam.checkerboard.genstatsim.CustomPermutationResult[source]

Perform permutation simulation and test for (S)CCRAM.

Input Arguments

  • contingency_table : Input contingency table of frequency counts

  • predictors : 1-indexed predictor axis, or list of 1-indexed predictor axes, for (S)CCRAM calculation

  • response : 1-indexed target response axis for (S)CCRAM calculation

  • scaled : Whether to use scaled (S)CCRAM (default=False)

  • alternative : Alternative hypothesis ('greater', 'less', or 'two-sided') (default='greater')

  • n_resamples : Number of permutations (default=9999)

  • random_state : Random state for reproducibility (optional)

  • store_tables : Whether to store the permuted contingency tables (default=False)

  • parallel : Whether to use parallel processing (default=False)

Outputs

Test results including Monte Carlo permutation p-value, (S)CCRAM values computed for the permuted contingency tables, and (optionally) permuted contingency tables generated under the null hypothesis (no regression association)

Warnings/Errors

  • ValueError : If predictor or response variable axis is out of bounds
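A Monte Carlo permutation p-value for the 'greater' alternative can be sketched as follows, using only the standard library. This is an illustration, not the library's procedure: all names are hypothetical, the Pearson chi-square statistic is a stand-in for the (S)CCRAM, and the (count + 1) / (n_resamples + 1) estimator is a common convention assumed here.

```python
import random


def expand(counts):
    """Turn a 2-way table of counts into one (row, column) pair per observation."""
    return [(i, j) for i, row in enumerate(counts)
            for j, c in enumerate(row) for _ in range(c)]


def tabulate(pairs, n_rows, n_cols):
    """Rebuild a table of counts from (row, column) pairs."""
    table = [[0] * n_cols for _ in range(n_rows)]
    for i, j in pairs:
        table[i][j] += 1
    return table


def chi_square(counts):
    """Stand-in association statistic (NOT the CCRAM): Pearson chi-square."""
    n = sum(map(sum, counts))
    row_tot = [sum(r) for r in counts]
    col_tot = [sum(c) for c in zip(*counts)]
    stat = 0.0
    for i, r in enumerate(row_tot):
        for j, c in enumerate(col_tot):
            if r and c:
                expected = r * c / n
                stat += (counts[i][j] - expected) ** 2 / expected
    return stat


def permutation_p_value(counts, statistic, n_resamples=999, seed=0):
    """Shuffle response labels to break the association, then compare statistics."""
    rng = random.Random(seed)
    observed = statistic(counts)
    pairs = expand(counts)
    xs = [i for i, _ in pairs]
    ys = [j for _, j in pairs]
    null = []
    for _ in range(n_resamples):
        rng.shuffle(ys)  # permuting one variable preserves both margins
        null.append(statistic(tabulate(zip(xs, ys), len(counts), len(counts[0]))))
    exceed = sum(s >= observed for s in null)
    return (exceed + 1) / (n_resamples + 1), null


p_strong, null_strong = permutation_p_value([[30, 5], [5, 30]], chi_square)
p_none, _ = permutation_p_value([[20, 20], [20, 20]], chi_square)
```

The first made-up table shows strong association and yields a small p-value. The second is exactly independent, so every permuted statistic is at least as large as the observed one.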

class ccrvam.checkerboard.genstatsim.SubsetCCRAMResult[source]

Container for subset (S)CCRAM analysis results.

Input Arguments

  • results_df : DataFrame containing all subset (S)CCRAM results with columns:

    • k: number of predictors in subset

    • predictors: tuple of predictor variable indices (1-indexed)

    • response: response variable index (1-indexed)

    • ccram/sccram: (S)CCRAM value for this subset (column name depends on scaled parameter)

  • response : The response variable index (1-indexed) used in the analysis

  • n_dimensions : Total number of dimensions in the contingency table

  • scaled : Whether scaled CCRAM (SCCRAM) was used

_results_df_full: pandas.DataFrame[source]

None

response: int[source]

None

n_dimensions: int[source]

None

scaled: bool[source]

None

property results_df: pandas.DataFrame[source]

Return the results DataFrame with internal columns hidden.

property metric_column: str[source]

Return the column name for the metric based on scaled parameter.

_filter_display_columns(df: pandas.DataFrame) -> pandas.DataFrame[source]

Filter out internal columns (starting with '_') from a DataFrame.

get_top_subsets(top: int = 5) -> pandas.DataFrame[source]

Get the top subsets with highest (S)CCRAM values across all predictor sizes.

This method returns the subsets with the highest (S)CCRAM values globally, regardless of the number of predictors (k) in each subset.

Input Arguments

  • top : Number of top subsets to return (default=5)

Outputs

DataFrame with top subsets sorted by (S)CCRAM value (highest first)

get_top_subsets_per_k(top: int = 3) -> pandas.DataFrame[source]

Get the top subsets with highest (S)CCRAM values for each predictor size k.

This method returns the highest-ranked subsets (as many as the top argument specifies) for EACH value of k (number of predictors), from k=1 up to k=D, where D is the total number of available predictors. This is useful for comparing the best predictor combinations within each subset size.

Input Arguments

  • top : Number of top subsets to return for each k value (default=3). If top exceeds the number of possible combinations for a given k (i.e., top > C(D,k) where D is total predictors), all combinations for that k are returned.

Outputs

DataFrame with top subsets for each k, sorted by k ascending and (S)CCRAM descending within each k. Includes all columns from results_df.
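The difference between the global ranking of get_top_subsets and the per-k ranking of get_top_subsets_per_k can be sketched on plain Python records. The helpers top_overall and top_per_k are hypothetical, and the metric values are made up for illustration.

```python
from itertools import groupby

# Made-up results for three predictors; values are illustrative only.
rows = [
    {"k": 1, "predictors": (1,), "ccram": 0.10},
    {"k": 1, "predictors": (2,), "ccram": 0.30},
    {"k": 1, "predictors": (3,), "ccram": 0.20},
    {"k": 2, "predictors": (1, 2), "ccram": 0.35},
    {"k": 2, "predictors": (1, 3), "ccram": 0.25},
    {"k": 2, "predictors": (2, 3), "ccram": 0.40},
    {"k": 3, "predictors": (1, 2, 3), "ccram": 0.45},
]


def top_overall(rows, top=5, metric="ccram"):
    """Highest metric values regardless of subset size."""
    return sorted(rows, key=lambda r: r[metric], reverse=True)[:top]


def top_per_k(rows, top=3, metric="ccram"):
    """Highest metric values within each k; k ascending, metric descending."""
    ordered = sorted(rows, key=lambda r: (r["k"], -r[metric]))
    result = []
    for _, group in groupby(ordered, key=lambda r: r["k"]):
        result.extend(list(group)[:top])  # fewer than `top` rows are all kept
    return result


overall = [r["predictors"] for r in top_overall(rows, top=2)]
per_k = [r["predictors"] for r in top_per_k(rows, top=2)]
```

With top=2, the k=3 group has only one subset, so that single row is returned, matching the documented behaviour when top exceeds the number of combinations.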

get_subsets_by_k(k: int) -> pandas.DataFrame[source]

Get all subsets with exactly k predictors.

Input Arguments

  • k : Number of predictors

Outputs

DataFrame with subsets having k predictors, sorted by (S)CCRAM

summary() -> pandas.DataFrame[source]

Get summary statistics for each k value.

Outputs

DataFrame with summary statistics (max, mean, min, count) for each k

get_results_with_penalties() -> pandas.DataFrame[source]

Get results DataFrame with penalty-related columns for predictor categories.

This method computes additional columns that can be useful for penalizing (S)CCRAM values based on the number of predictor categories, since (S)CCRAM is non-decreasing as the number of predictors increases.

Outputs

DataFrame with all public columns plus:

  • sum_cate: Sum of categories across all predictors in the subset

  • prod_cate: Product of categories across all predictors in the subset

Example

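    from ccrvam.checkerboard.genstatsim import all_subsets_ccram

    # table: a 4-dimensional contingency table (numpy array) defined elsewhere
    result = all_subsets_ccram(table, response=4, scaled=True)
    df = result.get_results_with_penalties()

    # Use sum_cate or prod_cate to compute penalized scores
    df['penalized_sccram'] = df['sccram'] / df['sum_cate']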
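The two penalty columns can be understood from the table shape alone. The following sketch uses a hypothetical helper, category_penalties, and a made-up shape. It shows the arithmetic only and is not taken from the library.

```python
import math


def category_penalties(shape, predictors):
    """Return (sum_cate, prod_cate) for 1-indexed predictor axes of a table shape."""
    cats = [shape[p - 1] for p in predictors]  # categories per predictor
    return sum(cats), math.prod(cats)


shape = (2, 3, 4, 5)  # made-up table: axes 1..4 have 2, 3, 4 and 5 categories
sum_cate, prod_cate = category_penalties(shape, predictors=(1, 3))

# Dividing a made-up metric value by either penalty favours smaller subsets.
penalized = 0.42 / sum_cate
```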

plot_subsets(figsize: Optional[Tuple[int, int]] = None, point_size: int = 80, point_color: str = 'steelblue', title: Optional[str] = None, title_fontsize: Optional[int] = None, xlabel_fontsize: Optional[int] = None, ylabel_fontsize: Optional[int] = None, tick_fontsize: Optional[int] = None, label_fontsize: Optional[int] = None, save_path: Optional[str] = None, dpi: int = 300, **kwargs) -> Tuple[matplotlib.pyplot.Figure, matplotlib.pyplot.Axes][source]

Plot all subset (S)CCRAM values against the number of predictors (k).

This visualization helps identify patterns across different subset sizes and aids in deciding which k value to focus on for detailed analysis. The best subset for each k is labeled with its predictor combination.

Input Arguments

  • figsize : Figure size as (width, height) tuple (optional, default=(10, 6))

  • point_size : Size of scatter points (default=80)

  • point_color : Color for scatter points (default='steelblue')

  • title : Custom title for the plot (optional)

  • title_fontsize : Font size for the plot title (optional)

  • xlabel_fontsize : Font size for x-axis label (optional)

  • ylabel_fontsize : Font size for y-axis label (optional)

  • tick_fontsize : Font size for axis tick labels (optional)

  • label_fontsize : Font size for labels on best subsets (optional)

  • save_path : Path to save the plot (optional)

  • dpi : Resolution for saved image (default=300)

  • **kwargs : Additional matplotlib arguments passed to plt.subplots()

Outputs

Tuple of (Figure, Axes) matplotlib objects

class ccrvam.checkerboard.genstatsim.BestSubsetCCRAMResult[source]

Container for best subset (S)CCRAM analysis results.

Input Arguments

  • predictors : Tuple of optimal predictor variable indices (1-indexed)

  • response : Response variable index (1-indexed)

  • ccram : (S)CCRAM value for the optimal subset

  • k : Number of predictors in the optimal subset

  • rank_within_k : Rank of this subset within all subsets of size k

  • total_subsets_in_k : Total number of subsets of size k

  • scaled : Whether scaled CCRAM (SCCRAM) was used

  • all_results : Full SubsetCCRAMResult object for further analysis

predictors: tuple[source]

None

response: int[source]

None

ccram: float[source]

None

k: int[source]

None

rank_within_k: int[source]

None

total_subsets_in_k: int[source]

None

scaled: bool[source]

None

all_results: ccrvam.checkerboard.genstatsim.SubsetCCRAMResult[source]

None

__repr__() -> str[source]

summary_df() -> pandas.DataFrame[source]

Return a DataFrame with the summary of the best subset.

Outputs

DataFrame with best subset information

ccrvam.checkerboard.genstatsim._format_tuple_display(t: tuple) -> str[source]

Format a tuple as a string without trailing comma for single elements.
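Python's default repr of a one-element tuple keeps a trailing comma, which is what this helper avoids. A minimal sketch of the idea, using a hypothetical function name and an assumed output format, might look like this:

```python
def format_tuple(t):
    """Render a tuple without the trailing comma Python adds for one element."""
    return "(" + ", ".join(str(x) for x in t) + ")"


single = format_tuple((2,))
pair = format_tuple((2, 3))
default_repr = str((2,))  # Python's own rendering, for comparison
```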

ccrvam.checkerboard.genstatsim.all_subsets_ccram(contingency_table: numpy.ndarray, response: int, scaled: bool = False, k: Optional[int] = None, variable_names: Optional[dict] = None) -> ccrvam.checkerboard.genstatsim.SubsetCCRAMResult[source]

Calculate (S)CCRAM for all possible predictor subsets.

This function computes the (Scaled) Checkerboard Copula Regression Association Measure ((S)CCRAM) for all combinations of predictor variables given a specified response variable. Results are organized by the number of predictors (k) and sorted by (S)CCRAM value within each k.

Input Arguments

  • contingency_table : Input contingency table of frequency counts (multi-dimensional numpy array)

  • response : 1-indexed target response variable axis for (S)CCRAM calculation

  • scaled : Whether to use scaled (S)CCRAM (default=False)

  • k : Optional number of predictors to consider. If None, all possible subset sizes are computed (from k=1 to k=ndim-1). If specified, only subsets of size k are computed.

  • variable_names : Optional dictionary mapping 1-indexed variable indices to names. If provided, predictor names will be included in the output.

Outputs

SubsetCCRAMResult object containing:

  • results_df: DataFrame with columns [k, predictors, pred_cate, response, ccram/sccram] and, when variable_names is provided, predictor_names. The metric column is named 'ccram' when scaled=False and 'sccram' when scaled=True. Rows are sorted by k ascending and by metric value descending within each k. The pred_cate column holds the number of categories of each predictor in the subset, formatted as a tuple string (e.g., "(2, 3)" for a 2-predictor subset whose first predictor has 2 categories and whose second has 3).

  • response: The response variable index

  • n_dimensions: Total number of dimensions

  • scaled: Whether scaled CCRAM was used

Warnings/Errors

  • ValueError : If response axis is out of bounds

  • ValueError : If k is specified but invalid (k < 1 or k >= ndim)
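The set of predictor subsets that gets evaluated, along with the two validity checks listed above, can be sketched with itertools.combinations. The function enumerate_subsets is a hypothetical illustration of the enumeration only. It does not compute any (S)CCRAM value and is not the library's code.

```python
from itertools import combinations


def enumerate_subsets(ndim, response, k=None):
    """List predictor subsets (1-indexed axes) for a table with ndim axes."""
    if not 1 <= response <= ndim:
        raise ValueError("response axis is out of bounds")
    if k is not None and not 1 <= k < ndim:
        raise ValueError("k must satisfy 1 <= k < ndim")
    candidates = [axis for axis in range(1, ndim + 1) if axis != response]
    sizes = [k] if k is not None else range(1, ndim)  # k=1 .. ndim-1
    return [subset for size in sizes for subset in combinations(candidates, size)]


all_subsets = enumerate_subsets(ndim=4, response=4)
pairs_only = enumerate_subsets(ndim=4, response=4, k=2)
```

For a 4-dimensional table there are 3 + 3 + 1 = 7 subsets, and the count grows as 2^(ndim-1) - 1, which is worth keeping in mind before running all subsets on a high-dimensional table.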

ccrvam.checkerboard.genstatsim.best_subset_ccram(contingency_table: numpy.ndarray, response: int, scaled: bool = False, k: Optional[int] = None, variable_names: Optional[dict] = None) -> ccrvam.checkerboard.genstatsim.BestSubsetCCRAMResult[source]

Find the optimal predictor subset with the highest (S)CCRAM value.

This function identifies the predictor combination that yields the maximum (Scaled) Checkerboard Copula Regression Association Measure ((S)CCRAM) for predicting the specified response variable.

Input Arguments

  • contingency_table : Input contingency table of frequency counts (multi-dimensional numpy array)

  • response : 1-indexed target response variable axis for (S)CCRAM calculation

  • scaled : Whether to use scaled (S)CCRAM (default=False)

  • k : Optional number of predictors to consider. If None, searches across all possible subset sizes (k=1 to k=ndim-1). If specified, finds the best subset of exactly k predictors.

  • variable_names : Optional dictionary mapping 1-indexed variable indices to names.

Outputs

BestSubsetCCRAMResult object containing:

  • predictors: Tuple of optimal predictor variable indices (1-indexed)

  • response: Response variable index

  • ccram: (S)CCRAM value for the optimal subset

  • k: Number of predictors in the optimal subset

  • rank_within_k: Rank of this subset among all subsets of the same size k

  • total_subsets_in_k: Total number of subsets of size k

  • scaled: Whether scaled CCRAM was used

  • all_results: Complete SubsetCCRAMResult for further analysis

Warnings/Errors

  • ValueError : If response axis is out of bounds

  • ValueError : If k is specified but invalid (k < 1 or k >= ndim)
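How the reported fields relate to the full set of subset results can be sketched on plain records. The helper best_subset and the metric values are hypothetical. The sketch assumes the best subset is the one with the maximum metric value, as the description above states.

```python
# Made-up results for three predictors; values are illustrative only.
rows = [
    {"k": 1, "predictors": (1,), "ccram": 0.10},
    {"k": 1, "predictors": (2,), "ccram": 0.30},
    {"k": 2, "predictors": (1, 2), "ccram": 0.35},
    {"k": 2, "predictors": (1, 3), "ccram": 0.25},
    {"k": 2, "predictors": (2, 3), "ccram": 0.40},
    {"k": 3, "predictors": (1, 2, 3), "ccram": 0.45},
]


def best_subset(rows, k=None, metric="ccram"):
    """Pick the maximum-metric subset, optionally restricted to size k."""
    pool = [r for r in rows if k is None or r["k"] == k]
    if not pool:
        raise ValueError("no subsets of the requested size")
    best = max(pool, key=lambda r: r[metric])
    same_k = sorted((r for r in rows if r["k"] == best["k"]),
                    key=lambda r: r[metric], reverse=True)
    return {
        "predictors": best["predictors"],
        "k": best["k"],
        metric: best[metric],
        "rank_within_k": same_k.index(best) + 1,
        "total_subsets_in_k": len(same_k),
    }


best_any = best_subset(rows)
best_pairs = best_subset(rows, k=2)
```

Because (S)CCRAM is non-decreasing as predictors are added, an unrestricted search tends to select the largest subset, as it does in this made-up data. Passing k, or using the penalty columns from get_results_with_penalties(), gives a more informative comparison.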