Generic Bootstrap and Permutation methods API

`ccrvam.checkerboard.genstatsim`

Module Contents

Classes

`CustomBootstrapResult`	Container for bootstrap simulation (including confidence intervals) results with statistical visualization capabilities.
`CustomPermutationResult`	Container for permutation simulation (including test) results with statistical visualization capabilities.
`SubsetCCRAMResult`	Container for subset (S)CCRAM analysis results.
`BestSubsetCCRAMResult`	Container for best subset (S)CCRAM analysis results.

Functions

`_process_bootstrap_batch`	Helper function for parallel bootstrap processing.
`bootstrap_ccram`	Perform bootstrap simulation and confidence intervals for (S)CCRAM.
`_process_prediction_batch`	Helper function for parallel prediction processing.
`bootstrap_predict_ccr_summary`	Compute bootstrap prediction matrix showing percentage predictions for each combination of predictor values in CCR analysis.
`save_predictions`	Save prediction results generated by `bootstrap_predict_ccr_summary()` to a file.
`_process_permutation_batch`	Helper function for parallel permutation test processing.
`permutation_test_ccram`	Perform permutation simulation and test for (S)CCRAM.
`_format_tuple_display`	Format a tuple as a string without trailing comma for single elements.
`all_subsets_ccram`	Calculate (S)CCRAM for all possible predictor subsets.
`best_subset_ccram`	Find the optimal predictor subset with the highest (S)CCRAM value.

API

class ccrvam.checkerboard.genstatsim.CustomBootstrapResult

Container for bootstrap simulation (including confidence intervals) results with statistical visualization capabilities.

Input Arguments

metric_name : Name of the metric being bootstrapped
observed_value : Original observed value of the metric
confidence_interval : Lower and upper bounds for the bootstrap confidence interval
bootstrap_distribution : Array of bootstrapped values of the metric
standard_error : Standard error of the bootstrap distribution
bootstrap_tables : Array of bootstrapped contingency tables (optional)
histogram_fig : Matplotlib figure of distribution plot (optional)

metric_name: str

observed_value: float

confidence_interval: Tuple[float, float]

bootstrap_distribution: numpy.ndarray

standard_error: float

bootstrap_tables: Optional[numpy.ndarray]

histogram_fig: Optional[matplotlib.pyplot.Figure]

plot_distribution(title: Optional[str] = None, figsize: Optional[Tuple[int, int]] = None, title_fontsize: Optional[int] = None, xlabel_fontsize: Optional[int] = None, ylabel_fontsize: Optional[int] = None, tick_fontsize: Optional[int] = None, text_fontsize: Optional[int] = None, **kwargs) → Optional[matplotlib.pyplot.Figure]

Plot bootstrap distribution with observed value.

Input Arguments

title : Title of the plot (optional)
figsize : Figure size as (width, height) tuple (optional)
title_fontsize : Font size for the plot title (optional)
xlabel_fontsize : Font size for x-axis label (optional)
ylabel_fontsize : Font size for y-axis label (optional)
tick_fontsize : Font size for axis tick labels (optional)
text_fontsize : Font size for text inside the plot (optional)
**kwargs : Additional matplotlib arguments passed to plotting functions

Outputs

Matplotlib figure of distribution plot

Warnings/Errors

Exception : If the plot cannot be created

ccrvam.checkerboard.genstatsim._process_bootstrap_batch(args)

Helper function for parallel bootstrap processing.

ccrvam.checkerboard.genstatsim.bootstrap_ccram(contingency_table: numpy.ndarray, predictors: Union[List[int], int], response: int, scaled: bool = False, n_resamples: int = 9999, confidence_level: float = 0.95, method: str = 'percentile', random_state: Optional[int] = None, store_tables: bool = False, parallel: bool = False) → ccrvam.checkerboard.genstatsim.CustomBootstrapResult

Perform bootstrap simulation and confidence intervals for (S)CCRAM.

Input Arguments

contingency_table : Input contingency table of frequency counts
predictors : A single 1-indexed predictor axis, or a list of 1-indexed predictor axes, for (S)CCRAM calculation
response : 1-indexed target response variable axis for (S)CCRAM calculation
scaled : Whether to use scaled (S)CCRAM (default=False)
n_resamples : Number of bootstrap resamples (default=9999)
confidence_level : Confidence level for bootstrap confidence intervals (default=0.95)
method : Bootstrap CI method ('percentile', 'basic', or 'BCa') (default='percentile')
random_state : Random state for reproducibility (optional)
store_tables : Whether to store the bootstrapped contingency tables (default=False)
parallel : Whether to use parallel processing (default=False)

Outputs

CustomBootstrapResult containing the observed (S)CCRAM, its bootstrap confidence interval, the bootstrap distribution and its standard error, and (when store_tables=True) the bootstrapped contingency tables

Warnings/Errors

ValueError : If predictor or response axis is out of bounds
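
The 'percentile' and 'basic' interval methods named above can be illustrated with a minimal, standard-library sketch. It is not the package's implementation: `cramers_v`, `resample_table`, and `bootstrap_ci` are hypothetical helpers, Cramér's V is used only as a stand-in statistic for the (S)CCRAM, and the quantile indexing rule is an assumption. The 'BCa' method is not covered.

```python
import math
import random
import statistics


def cramers_v(table):
    """Stand-in association statistic for a two-way table (not the (S)CCRAM)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(table), len(table[0])) - 1)))


def resample_table(table, rng):
    """Draw n cases with replacement from the observed cells and re-tabulate."""
    cells = [(i, j) for i, row in enumerate(table) for j in range(len(row))]
    weights = [table[i][j] for i, j in cells]
    resampled = [[0] * len(row) for row in table]
    for i, j in rng.choices(cells, weights=weights, k=sum(weights)):
        resampled[i][j] += 1
    return resampled


def bootstrap_ci(table, statistic, n_resamples=999, confidence_level=0.95,
                 method="percentile", seed=None):
    """Return the observed value, a bootstrap CI, the SE, and the distribution."""
    if method not in ("percentile", "basic"):
        raise ValueError("this sketch supports only 'percentile' and 'basic'")
    rng = random.Random(seed)
    observed = statistic(table)
    dist = sorted(statistic(resample_table(table, rng)) for _ in range(n_resamples))
    alpha = 1.0 - confidence_level
    lower = dist[math.floor(alpha / 2 * (n_resamples - 1))]
    upper = dist[math.ceil((1 - alpha / 2) * (n_resamples - 1))]
    if method == "basic":
        # The basic interval reflects the percentile bounds around the observed value.
        lower, upper = 2 * observed - upper, 2 * observed - lower
    return {"observed": observed, "ci": (lower, upper),
            "se": statistics.stdev(dist), "dist": dist}


table = [[30, 10], [10, 30]]
result = bootstrap_ci(table, cramers_v, n_resamples=200, seed=0)
```

Passing a fixed `seed` plays the role that `random_state` plays in the documented signature: repeated calls resample the same tables.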

ccrvam.checkerboard.genstatsim._process_prediction_batch(args)

Helper function for parallel prediction processing.

ccrvam.checkerboard.genstatsim.bootstrap_predict_ccr_summary(table: numpy.ndarray, predictors: Union[List[int], int], predictors_names: Optional[List[str]] = None, response: Optional[int] = None, response_name: Optional[str] = None, n_resamples: int = 9999, random_state: Optional[int] = None, parallel: bool = True) → pandas.DataFrame

Compute bootstrap prediction matrix showing percentage predictions for each combination of predictor values in CCR analysis.

Input Arguments

table : Contingency table of frequency counts
predictors : A single predictor dimension, or a list of predictor dimensions (1-indexed)
predictors_names : Names of predictor variables (optional)
response : Response variable dimension (1-indexed). If None, the last dimension is used.
response_name : Name of response variable (optional)
n_resamples : Number of bootstrap resamples (default=9999)
random_state : Random state for reproducibility (optional)
parallel : Whether to use parallel processing (default=True)

Outputs

CCR Prediction matrix post-bootstrap showing the percentage of the predicted category of the response variable for each combination of categories of the predictors.

Warnings/Errors

ValueError : If predictor or response axis is out of bounds

Notes

The returned pandas DataFrame also provides a `plot_predictions_summary` method, which plots the prediction matrix as a heatmap or bubble plot.
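
The idea of a post-bootstrap prediction matrix can be sketched as follows for a two-way table. This is an illustration only: `modal_category` is a stand-in prediction rule (the CCR prediction rule itself is not reproduced here), and `prediction_percentages` and its list-of-lists output are hypothetical, not the layout of the returned DataFrame.

```python
import random


def modal_category(row):
    """Stand-in prediction rule: index of the most frequent response category."""
    return max(range(len(row)), key=lambda j: row[j])


def prediction_percentages(table, predict=modal_category, n_resamples=500, seed=None):
    """Percentage of resamples in which each response category is predicted.

    Rows of `table` are predictor categories, columns are response categories.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    weights = [table[i][j] for i, j in cells]
    n_cases = sum(weights)
    tallies = [[0] * n_cols for _ in range(n_rows)]
    for _ in range(n_resamples):
        # Resample cases with replacement and rebuild the table.
        boot = [[0] * n_cols for _ in range(n_rows)]
        for i, j in rng.choices(cells, weights=weights, k=n_cases):
            boot[i][j] += 1
        # Record which response category is predicted for each predictor category.
        for i, row in enumerate(boot):
            tallies[i][predict(row)] += 1
    return [[100.0 * count / n_resamples for count in row] for row in tallies]


table = [[40, 5, 5], [5, 5, 40]]
percentages = prediction_percentages(table, n_resamples=300, seed=1)
```

Each row of `percentages` sums to 100, because exactly one category is predicted per predictor category in every resample.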

ccrvam.checkerboard.genstatsim.save_predictions(prediction_matrix: pandas.DataFrame, save_path: Optional[str] = None, format: str = 'csv', decimal_places: int = 2) → None

Save prediction results generated by bootstrap_predict_ccr_summary() to a file.

Input Arguments

prediction_matrix : DataFrame containing prediction results generated by bootstrap_predict_ccr_summary()
save_path : Path to save the output file (required; a ValueError is raised if it is not specified)
format : Output format ('csv' or 'txt') (default='csv')
decimal_places : Number of decimal places used for the prediction percentages (default=2)

Outputs

None (saves file to disk)

Warnings/Errors

ValueError : If save_path is not specified
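
The contract described above (a required path, a format switch, and rounding to a fixed number of decimal places) might look like the following sketch. `save_matrix` is a hypothetical helper operating on plain rows rather than a DataFrame, and writing 'txt' as tab-delimited text is an assumption, not documented behavior.

```python
import csv
import os
import tempfile


def save_matrix(rows, header, save_path=None, fmt="csv", decimal_places=2):
    """Write labeled rows of percentages to disk, rounded to `decimal_places`."""
    if save_path is None:
        raise ValueError("save_path must be specified")
    if fmt not in ("csv", "txt"):
        raise ValueError("fmt must be 'csv' or 'txt'")
    delimiter = "," if fmt == "csv" else "\t"  # tab for 'txt' is an assumption
    with open(save_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        writer.writerow(header)
        for label, *values in rows:
            writer.writerow([label] + [f"{value:.{decimal_places}f}" for value in values])


# Write a one-row matrix to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "predictions.csv")
    save_matrix([("Y=1", 93.456, 6.544)], ["response", "X=1", "X=2"], save_path=path)
    with open(path, newline="") as handle:
        written = handle.read()
```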

class ccrvam.checkerboard.genstatsim.CustomPermutationResult

Container for permutation simulation (including test) results with statistical visualization capabilities.

Input Arguments

metric_name : Name of the (S)CCRAM being tested
observed_value : Original observed value of the (S)CCRAM
p_value : Permutation test p-value
null_distribution : Array of the values of the (S)CCRAM computed for the permuted contingency tables
permutation_tables : (Optional) Array of permuted contingency tables generated under the null hypothesis (no regression association)
histogram_fig : (Optional) Matplotlib figure of distribution plot

metric_name: str

observed_value: float

p_value: float

null_distribution: numpy.ndarray

permutation_tables: numpy.ndarray

histogram_fig: matplotlib.pyplot.Figure

plot_distribution(title: Optional[str] = None, figsize: Optional[Tuple[int, int]] = None, title_fontsize: Optional[int] = None, xlabel_fontsize: Optional[int] = None, ylabel_fontsize: Optional[int] = None, tick_fontsize: Optional[int] = None, **kwargs) → Optional[matplotlib.pyplot.Figure]

Plot null distribution with observed value.

Input Arguments

title : Title of the plot (optional)
figsize : Figure size as (width, height) tuple (optional)
title_fontsize : Font size for the plot title (optional)
xlabel_fontsize : Font size for x-axis label (optional)
ylabel_fontsize : Font size for y-axis label (optional)
tick_fontsize : Font size for axis tick labels (optional)
**kwargs : Additional matplotlib arguments passed to plotting functions

Outputs

Matplotlib figure of distribution plot

ccrvam.checkerboard.genstatsim._process_permutation_batch(args)

Helper function for parallel permutation test processing.

ccrvam.checkerboard.genstatsim.permutation_test_ccram(contingency_table: numpy.ndarray, predictors: Union[List[int], int], response: int, scaled: bool = False, alternative: str = 'greater', n_resamples: int = 9999, random_state: Optional[int] = None, store_tables: bool = False, parallel: bool = False) → ccrvam.checkerboard.genstatsim.CustomPermutationResult

Perform permutation simulation and test for (S)CCRAM.

Input Arguments

contingency_table : Input contingency table of frequency counts
predictors : A single 1-indexed predictor axis, or a list of 1-indexed predictor axes, for (S)CCRAM calculation
response : 1-indexed target response axis for (S)CCRAM calculation
scaled : Whether to use scaled (S)CCRAM (default=False)
alternative : Alternative hypothesis ('greater', 'less', or 'two-sided') (default='greater')
n_resamples : Number of permutations (default=9999)
random_state : Random state for reproducibility (optional)
store_tables : Whether to store the permuted contingency tables (default=False)
parallel : Whether to use parallel processing (default=False)

Outputs

Test results including Monte Carlo permutation p-value, (S)CCRAM values computed for the permuted contingency tables, and (optionally) permuted contingency tables generated under the null hypothesis (no regression association)

Warnings/Errors

ValueError : If predictor or response variable axis is out of bounds
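
A Monte Carlo permutation p-value of the kind described above can be sketched with the standard library. The helpers `expand_cases`, `tabulate`, and `permutation_p_value` are hypothetical, `diagonal_share` is a toy stand-in for the (S)CCRAM, and both the (count + 1) / (n_resamples + 1) formula and the doubling rule for 'two-sided' are common conventions assumed here, not confirmed details of the package.

```python
import random


def diagonal_share(table):
    """Toy stand-in statistic: share of cases on the main diagonal."""
    total = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(min(len(table), len(table[0])))) / total


def expand_cases(table):
    """Turn a two-way table of counts into per-case row and column labels."""
    rows, cols = [], []
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            rows.extend([i] * count)
            cols.extend([j] * count)
    return rows, cols


def tabulate(rows, cols, shape):
    """Rebuild a table of counts from per-case labels."""
    table = [[0] * shape[1] for _ in range(shape[0])]
    for i, j in zip(rows, cols):
        table[i][j] += 1
    return table


def permutation_p_value(table, statistic, n_resamples=999,
                        alternative="greater", seed=None):
    if alternative not in ("greater", "less", "two-sided"):
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    rng = random.Random(seed)
    shape = (len(table), len(table[0]))
    rows, cols = expand_cases(table)
    observed = statistic(table)
    null = []
    for _ in range(n_resamples):
        # Shuffling the response labels breaks any association with the predictor.
        rng.shuffle(cols)
        null.append(statistic(tabulate(rows, cols, shape)))
    greater = (sum(v >= observed for v in null) + 1) / (n_resamples + 1)
    less = (sum(v <= observed for v in null) + 1) / (n_resamples + 1)
    p_values = {"greater": greater, "less": less,
                "two-sided": min(1.0, 2 * min(greater, less))}
    return observed, p_values[alternative], null


table = [[30, 10], [10, 30]]
observed, p_value, null = permutation_p_value(table, diagonal_share,
                                              n_resamples=199, seed=0)
```

Because the margins of the table are preserved by shuffling, every permuted table has the same row and column totals as the observed one.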

class ccrvam.checkerboard.genstatsim.SubsetCCRAMResult

Container for subset (S)CCRAM analysis results.

Input Arguments

results_df : DataFrame containing all subset (S)CCRAM results with columns:
- k: number of predictors in subset
- predictors: tuple of predictor variable indices (1-indexed)
- response: response variable index (1-indexed)
- ccram/sccram: (S)CCRAM value for this subset (column name depends on scaled parameter)
response : The response variable index (1-indexed) used in the analysis
n_dimensions : Total number of dimensions in the contingency table
scaled : Whether scaled CCRAM (SCCRAM) was used

_results_df_full: pandas.DataFrame

response: int

n_dimensions: int

scaled: bool

property results_df: pandas.DataFrame

Return the results DataFrame with internal columns hidden.

property metric_column: str

Return the column name for the metric based on scaled parameter.

_filter_display_columns(df: pandas.DataFrame) → pandas.DataFrame

Filter out internal columns (starting with '_') from a DataFrame.

get_top_subsets(top: int = 5) → pandas.DataFrame

Get the top subsets with highest (S)CCRAM values across all predictor sizes.

This method returns the subsets with the highest (S)CCRAM values globally, regardless of the number of predictors (k) in each subset.

Input Arguments

top : Number of top subsets to return (default=5)

Outputs

DataFrame with top subsets sorted by (S)CCRAM value (highest first)

get_top_subsets_per_k(top: int = 3) → pandas.DataFrame

Get the top subsets with highest (S)CCRAM values for each predictor size k.

This method returns the `top` highest-ranked subsets for each value of k (number of predictors), from k=1 up to k=D, where D is the total number of available predictors. It is useful for comparing the best predictor combinations within each subset size.

Input Arguments

top : Number of top subsets to return for each k value (default=3). If top exceeds the number of possible combinations for a given k (i.e., top > C(D,k) where D is total predictors), all combinations for that k are returned.

Outputs

DataFrame with top subsets for each k, sorted by k ascending and (S)CCRAM descending within each k. Includes all columns from results_df.
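
The per-k selection and ordering described above can be sketched on plain records. The values below are made up for illustration, and `top_subsets_per_k` is a hypothetical helper working on dictionaries rather than the DataFrame the method returns.

```python
from collections import defaultdict

# Made-up (S)CCRAM values for three predictors; not real results.
records = [
    {"k": 1, "predictors": (1,), "ccram": 0.12},
    {"k": 1, "predictors": (2,), "ccram": 0.05},
    {"k": 1, "predictors": (3,), "ccram": 0.08},
    {"k": 2, "predictors": (1, 2), "ccram": 0.15},
    {"k": 2, "predictors": (1, 3), "ccram": 0.19},
    {"k": 2, "predictors": (2, 3), "ccram": 0.10},
    {"k": 3, "predictors": (1, 2, 3), "ccram": 0.21},
]


def top_subsets_per_k(records, top=3, metric="ccram"):
    """Keep the `top` best records for each k: k ascending, metric descending."""
    by_k = defaultdict(list)
    for record in records:
        by_k[record["k"]].append(record)
    selected = []
    for k in sorted(by_k):
        ranked = sorted(by_k[k], key=lambda r: r[metric], reverse=True)
        # Slicing returns every record when `top` exceeds the number available.
        selected.extend(ranked[:top])
    return selected


top_two = top_subsets_per_k(records, top=2)
```

With `top=2`, the single k=3 subset is still returned, which mirrors the documented behavior when `top` exceeds C(D,k).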

get_subsets_by_k(k: int) → pandas.DataFrame

Get all subsets with exactly k predictors.

Input Arguments

k : Number of predictors

Outputs

DataFrame with subsets having k predictors, sorted by (S)CCRAM

summary() → pandas.DataFrame

Get summary statistics for each k value.

Outputs

DataFrame with summary statistics (max, mean, min, count) for each k
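
As a rough illustration of the per-k statistics listed above, the sketch below aggregates made-up values with the standard library. `summarize_by_k` is a hypothetical helper, and its output keys simply follow the statistic names given in the description.

```python
from collections import defaultdict
from statistics import mean

# Made-up (S)CCRAM values; not real results.
records = [
    {"k": 1, "ccram": 0.12},
    {"k": 1, "ccram": 0.05},
    {"k": 1, "ccram": 0.08},
    {"k": 2, "ccram": 0.15},
    {"k": 2, "ccram": 0.19},
    {"k": 2, "ccram": 0.10},
    {"k": 3, "ccram": 0.21},
]


def summarize_by_k(records, metric="ccram"):
    """Compute max, mean, min, and count of the metric for each subset size k."""
    values_by_k = defaultdict(list)
    for record in records:
        values_by_k[record["k"]].append(record[metric])
    return [
        {"k": k, "max": max(values), "mean": mean(values),
         "min": min(values), "count": len(values)}
        for k, values in sorted(values_by_k.items())
    ]


summary = summarize_by_k(records)
```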

get_results_with_penalties() → pandas.DataFrame

Get results DataFrame with penalty-related columns for predictor categories.

This method computes additional columns that can be useful for penalizing (S)CCRAM values based on the number of predictor categories, since (S)CCRAM is non-decreasing as the number of predictors increases.

Outputs

DataFrame with all public columns plus:

sum_cate: Sum of categories across all predictors in the subset
prod_cate: Product of categories across all predictors in the subset

Example

    result = all_subsets_ccram(table, response=4, scaled=True)
    df = result.get_results_with_penalties()

    # Use sum_cate or prod_cate to compute penalized scores
    df['penalized_sccram'] = df['sccram'] / df['sum_cate']
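
To make the two penalty columns concrete, the sketch below derives `sum_cate` and `prod_cate` from per-predictor category counts and applies the same division as the example. The rows and their values are made up, and the category counts are kept as integer tuples here (an assumption for simplicity; the documented `pred_cate` column is a formatted string).

```python
import math

# Made-up rows: category counts per predictor and an SCCRAM value for each subset.
rows = [
    {"predictors": (1,), "pred_cate": (2,), "sccram": 0.10},
    {"predictors": (1, 2), "pred_cate": (2, 3), "sccram": 0.18},
    {"predictors": (1, 2, 3), "pred_cate": (2, 3, 4), "sccram": 0.20},
]

for row in rows:
    row["sum_cate"] = sum(row["pred_cate"])         # e.g. 2 + 3 = 5
    row["prod_cate"] = math.prod(row["pred_cate"])  # e.g. 2 * 3 = 6
    row["penalized_sccram"] = row["sccram"] / row["sum_cate"]

# The unpenalized maximum is the largest subset; the penalized one need not be.
best_raw = max(rows, key=lambda r: r["sccram"])
best_penalized = max(rows, key=lambda r: r["penalized_sccram"])
```

In this toy data the penalty changes the ranking: the three-predictor subset has the highest raw value, while the single-predictor subset has the highest penalized value.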

plot_subsets(figsize: Optional[Tuple[int, int]] = None, point_size: int = 80, point_color: str = 'steelblue', title: Optional[str] = None, title_fontsize: Optional[int] = None, xlabel_fontsize: Optional[int] = None, ylabel_fontsize: Optional[int] = None, tick_fontsize: Optional[int] = None, label_fontsize: Optional[int] = None, save_path: Optional[str] = None, dpi: int = 300, **kwargs) → Tuple[matplotlib.pyplot.Figure, matplotlib.pyplot.Axes]

Plot all subset (S)CCRAM values against the number of predictors (k).

This visualization helps identify patterns across different subset sizes and aids in deciding which k value to focus on for detailed analysis. The best subset for each k is labeled with its predictor combination.

Input Arguments

figsize : Figure size as (width, height) tuple (optional, default=(10, 6))
point_size : Size of scatter points (default=80)
point_color : Color for scatter points (default='steelblue')
title : Custom title for the plot (optional)
title_fontsize : Font size for the plot title (optional)
xlabel_fontsize : Font size for x-axis label (optional)
ylabel_fontsize : Font size for y-axis label (optional)
tick_fontsize : Font size for axis tick labels (optional)
label_fontsize : Font size for labels on best subsets (optional)
save_path : Path to save the plot (optional)
dpi : Resolution for saved image (default=300)
**kwargs : Additional matplotlib arguments passed to plt.subplots()

Outputs

Tuple of (Figure, Axes) matplotlib objects

class ccrvam.checkerboard.genstatsim.BestSubsetCCRAMResult

Container for best subset (S)CCRAM analysis results.

Input Arguments

predictors : Tuple of optimal predictor variable indices (1-indexed)
response : Response variable index (1-indexed)
ccram : (S)CCRAM value for the optimal subset
k : Number of predictors in the optimal subset
rank_within_k : Rank of this subset within all subsets of size k
total_subsets_in_k : Total number of subsets of size k
scaled : Whether scaled CCRAM (SCCRAM) was used
all_results : Full SubsetCCRAMResult object for further analysis

predictors: tuple

response: int

ccram: float

k: int

rank_within_k: int

total_subsets_in_k: int

scaled: bool

all_results: ccrvam.checkerboard.genstatsim.SubsetCCRAMResult

__repr__() → str

summary_df() → pandas.DataFrame

Return a DataFrame with the summary of the best subset.

Outputs

DataFrame with best subset information

ccrvam.checkerboard.genstatsim._format_tuple_display(t: tuple) → str

Format a tuple as a string without trailing comma for single elements.
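
The formatting behavior described here can be sketched in one line. `format_tuple_display` below is a hypothetical re-creation based only on the one-sentence description, not the package's private helper.

```python
def format_tuple_display(t):
    """Render a tuple like str(t), but without the trailing comma for length 1."""
    return "(" + ", ".join(str(item) for item in t) + ")"


single = format_tuple_display((2,))
pair = format_tuple_display((2, 3))
```

For comparison, Python's built-in `str((2,))` gives `'(2,)'`, which is the trailing comma the helper is meant to avoid.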

ccrvam.checkerboard.genstatsim.all_subsets_ccram(contingency_table: numpy.ndarray, response: int, scaled: bool = False, k: Optional[int] = None, variable_names: Optional[dict] = None) → ccrvam.checkerboard.genstatsim.SubsetCCRAMResult

Calculate (S)CCRAM for all possible predictor subsets.

This function computes the (Scaled) Checkerboard Copula Regression Association Measure ((S)CCRAM) for all combinations of predictor variables given a specified response variable. Results are organized by the number of predictors (k) and sorted by (S)CCRAM value within each k.

Input Arguments

contingency_table : Input contingency table of frequency counts (multi-dimensional numpy array)
response : 1-indexed target response variable axis for (S)CCRAM calculation
scaled : Whether to use scaled (S)CCRAM (default=False)
k : Optional number of predictors to consider. If None, all possible subset sizes are computed (from k=1 to k=ndim-1). If specified, only subsets of size k are computed.
variable_names : Optional dictionary mapping 1-indexed variable indices to names. If provided, predictor names will be included in the output.

Outputs

SubsetCCRAMResult object containing:

results_df: DataFrame with columns [k, predictors, pred_cate, response, ccram/sccram] (column name is ‘ccram’ when scaled=False, ‘sccram’ when scaled=True) (and optionally predictor_names), sorted by k ascending and metric value descending within each k. The pred_cate column contains the number of categories for each predictor in the subset, formatted as a tuple string (e.g., “(2, 3)” for a 2-predictor subset where the first predictor has 2 categories and the second has 3).
response: The response variable index
n_dimensions: Total number of dimensions
scaled: Whether scaled CCRAM was used

Warnings/Errors

ValueError : If response axis is out of bounds
ValueError : If k is specified but invalid (k < 1 or k >= ndim)
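
The subset enumeration and the two validation rules above can be sketched with `itertools.combinations`. `predictor_subsets` is a hypothetical helper that only lists the candidate subsets; it does not compute any (S)CCRAM values.

```python
from itertools import combinations


def predictor_subsets(ndim, response, k=None):
    """List 1-indexed predictor subsets for a table with `ndim` axes."""
    if not 1 <= response <= ndim:
        raise ValueError("response axis is out of bounds")
    if k is not None and not 1 <= k < ndim:
        raise ValueError("k must satisfy 1 <= k < ndim")
    # Every axis except the response is a candidate predictor.
    candidates = [axis for axis in range(1, ndim + 1) if axis != response]
    sizes = range(1, ndim) if k is None else [k]
    return [subset for size in sizes for subset in combinations(candidates, size)]


all_subsets = predictor_subsets(ndim=4, response=4)
pairs_only = predictor_subsets(ndim=4, response=4, k=2)
```

With D = ndim - 1 candidate predictors there are 2^D - 1 non-empty subsets, so the number of (S)CCRAM evaluations grows exponentially with the number of dimensions.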

ccrvam.checkerboard.genstatsim.best_subset_ccram(contingency_table: numpy.ndarray, response: int, scaled: bool = False, k: Optional[int] = None, variable_names: Optional[dict] = None) → ccrvam.checkerboard.genstatsim.BestSubsetCCRAMResult

Find the optimal predictor subset with the highest (S)CCRAM value.

This function identifies the predictor combination that yields the maximum (Scaled) Checkerboard Copula Regression Association Measure ((S)CCRAM) for predicting the specified response variable.

Input Arguments

contingency_table : Input contingency table of frequency counts (multi-dimensional numpy array)
response : 1-indexed target response variable axis for (S)CCRAM calculation
scaled : Whether to use scaled (S)CCRAM (default=False)
k : Optional number of predictors to consider. If None, searches across all possible subset sizes (k=1 to k=ndim-1). If specified, finds the best subset of exactly k predictors.
variable_names : Optional dictionary mapping 1-indexed variable indices to names.

Outputs

BestSubsetCCRAMResult object containing:

predictors: Tuple of optimal predictor variable indices (1-indexed)
response: Response variable index
ccram: (S)CCRAM value for the optimal subset
k: Number of predictors in the optimal subset
rank_within_k: Rank of this subset among all subsets of the same size k
total_subsets_in_k: Total number of subsets of size k
scaled: Whether scaled CCRAM was used
all_results: Complete SubsetCCRAMResult for further analysis

Warnings/Errors

ValueError : If response axis is out of bounds
ValueError : If k is specified but invalid (k < 1 or k >= ndim)
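
Selecting the best subset, with and without a fixed k, can be sketched on made-up records. `best_subset` is a hypothetical helper, and the returned dictionary only loosely mirrors a few of the fields listed above.

```python
from math import comb

# Made-up (S)CCRAM values for three candidate predictors; not real results.
records = [
    {"k": 1, "predictors": (1,), "ccram": 0.12},
    {"k": 1, "predictors": (2,), "ccram": 0.05},
    {"k": 1, "predictors": (3,), "ccram": 0.08},
    {"k": 2, "predictors": (1, 2), "ccram": 0.15},
    {"k": 2, "predictors": (1, 3), "ccram": 0.19},
    {"k": 2, "predictors": (2, 3), "ccram": 0.10},
    {"k": 3, "predictors": (1, 2, 3), "ccram": 0.21},
]


def best_subset(records, n_predictors, k=None, metric="ccram"):
    """Return the record with the highest metric, optionally restricted to size k."""
    pool = [r for r in records if k is None or r["k"] == k]
    if not pool:
        raise ValueError("no subsets of the requested size")
    best = max(pool, key=lambda r: r[metric])
    return {
        "predictors": best["predictors"],
        "k": best["k"],
        metric: best[metric],
        # Number of possible subsets of the winning size.
        "total_subsets_in_k": comb(n_predictors, best["k"]),
    }


overall = best_subset(records, n_predictors=3)
best_pair = best_subset(records, n_predictors=3, k=2)
```

In this toy data the unrestricted search picks the full predictor set, which is what one would expect whenever the measure does not decrease as predictors are added; fixing `k` is one way to compare subsets of equal size.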
