Generic Bootstrap and Permutation methods API
ccrvam.checkerboard.genstatsim
Module Contents
Classes
| CustomBootstrapResult | Container for bootstrap simulation (including confidence intervals) results with statistical visualization capabilities. |
| CustomPermutationResult | Container for permutation simulation (including test) results with statistical visualization capabilities. |
| SubsetCCRAMResult | Container for subset (S)CCRAM analysis results. |
| BestSubsetCCRAMResult | Container for best subset (S)CCRAM analysis results. |
Functions
| _process_bootstrap_batch | Helper function for parallel bootstrap processing. |
| bootstrap_ccram | Perform bootstrap simulation and confidence intervals for (S)CCRAM. |
| _process_prediction_batch | Helper function for parallel prediction processing. |
| bootstrap_predict_ccr_summary | Compute bootstrap prediction matrix showing percentage predictions for each combination of predictor values in CCR analysis. |
| save_predictions | Save prediction results generated by bootstrap_predict_ccr_summary() to a file. |
| _process_permutation_batch | Helper function for parallel permutation test processing. |
| permutation_test_ccram | Perform permutation simulation and test for (S)CCRAM. |
| _format_tuple_display | Format a tuple as a string without trailing comma for single elements. |
| all_subsets_ccram | Calculate (S)CCRAM for all possible predictor subsets. |
| best_subset_ccram | Find the optimal predictor subset with the highest (S)CCRAM value. |
API
- class ccrvam.checkerboard.genstatsim.CustomBootstrapResult
Container for bootstrap simulation (including confidence intervals) results with statistical visualization capabilities.
Input Arguments
metric_name: Name of the metric being bootstrapped
observed_value: Original observed value of the metric
confidence_interval: Lower and upper bounds for the bootstrap confidence interval
bootstrap_distribution: Array of bootstrapped values of the metric
standard_error: Standard error of the bootstrap distribution
bootstrap_tables: Array of bootstrapped contingency tables (optional)
histogram_fig: Matplotlib figure of distribution plot (optional)
- bootstrap_distribution: numpy.ndarray
None
- bootstrap_tables: Optional[numpy.ndarray]
None
- plot_distribution(title: Optional[str] = None, figsize: Optional[Tuple[int, int]] = None, title_fontsize: Optional[int] = None, xlabel_fontsize: Optional[int] = None, ylabel_fontsize: Optional[int] = None, tick_fontsize: Optional[int] = None, text_fontsize: Optional[int] = None, **kwargs) -> Optional[matplotlib.pyplot.Figure]
Plot bootstrap distribution with observed value.
Input Arguments
title: Title of the plot (optional)
figsize: Figure size as (width, height) tuple (optional)
title_fontsize: Font size for the plot title (optional)
xlabel_fontsize: Font size for x-axis label (optional)
ylabel_fontsize: Font size for y-axis label (optional)
tick_fontsize: Font size for axis tick labels (optional)
text_fontsize: Font size for text inside the plot (optional)
**kwargs: Additional matplotlib arguments passed to plotting functions
Outputs
Matplotlib figure of distribution plot
Warnings/Errors
Exception: If the plot cannot be created
- ccrvam.checkerboard.genstatsim._process_bootstrap_batch(args)
Helper function for parallel bootstrap processing.
- ccrvam.checkerboard.genstatsim.bootstrap_ccram(contingency_table: numpy.ndarray, predictors: Union[List[int], int], response: int, scaled: bool = False, n_resamples: int = 9999, confidence_level: float = 0.95, method: str = 'percentile', random_state: Optional[int] = None, store_tables: bool = False, parallel: bool = False) -> ccrvam.checkerboard.genstatsim.CustomBootstrapResult
Perform bootstrap simulation and confidence intervals for (S)CCRAM.
Input Arguments
contingency_table: Input contingency table of frequency counts
predictors: List of 1-indexed predictor axes for (S)CCRAM calculation
response: 1-indexed target response variable axis for (S)CCRAM calculation
scaled: Whether to use scaled (S)CCRAM (default=False)
n_resamples: Number of bootstrap resamples (default=9999)
confidence_level: Confidence level for bootstrap confidence intervals (default=0.95)
method: Bootstrap CI method ('percentile', 'basic', 'BCa') (default='percentile')
random_state: Random state for reproducibility (optional)
store_tables: Whether to store the bootstrapped contingency tables (default=False)
parallel: Whether to use parallel processing (default=False)
Outputs
CustomBootstrapResult containing the bootstrap confidence interval, the bootstrap estimates of the (S)CCRAM, and, when store_tables=True, the bootstrapped contingency tables
Warnings/Errors
ValueError: If predictor or response axis is out of bounds
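Example (illustrative sketch)
The idea behind a percentile bootstrap interval on a contingency table can be sketched without the library. The snippet below is a minimal, dependency-free illustration under stated assumptions: it resamples the table cells multinomially, uses a placeholder statistic (diagonal_share) in place of the (S)CCRAM, and picks percentiles with a simple nearest-rank rule. All function names here are hypothetical, and none of this is claimed to match how bootstrap_ccram is implemented.

```python
import random


def diagonal_share(table):
    """Placeholder statistic (NOT the (S)CCRAM): share of counts on the main diagonal."""
    total = sum(sum(row) for row in table)
    diag = sum(table[i][i] for i in range(min(len(table), len(table[0]))))
    return diag / total


def percentile_bootstrap_ci(table, statistic, n_resamples=999,
                            confidence_level=0.95, seed=None):
    """Return (observed, (lower, upper), sorted bootstrap estimates) for a 2-D table."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    weights = [table[i][j] for i, j in cells]
    n = sum(weights)

    estimates = []
    for _ in range(n_resamples):
        # Assumed scheme: draw n observations with cell probabilities count / n.
        resampled = [[0] * n_cols for _ in range(n_rows)]
        for i, j in rng.choices(cells, weights=weights, k=n):
            resampled[i][j] += 1
        estimates.append(statistic(resampled))

    # Nearest-rank percentiles of the sorted bootstrap distribution.
    estimates.sort()
    alpha = 1.0 - confidence_level
    lower = estimates[int((alpha / 2) * (n_resamples - 1))]
    upper = estimates[int(round((1 - alpha / 2) * (n_resamples - 1)))]
    return statistic(table), (lower, upper), estimates


observed, interval, distribution = percentile_bootstrap_ci(
    [[20, 5], [4, 21]], diagonal_share, n_resamples=999, seed=0
)
print(observed, interval)
```

Fixing the seed plays the role that random_state plays in the documented signature: repeated runs with the same seed give the same interval.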
- ccrvam.checkerboard.genstatsim._process_prediction_batch(args)
Helper function for parallel prediction processing.
- ccrvam.checkerboard.genstatsim.bootstrap_predict_ccr_summary(table: numpy.ndarray, predictors: Union[List[int], int], predictors_names: Optional[List[str]] = None, response: Optional[int] = None, response_name: Optional[str] = None, n_resamples: int = 9999, random_state: Optional[int] = None, parallel: bool = True) -> pandas.DataFrame
Compute bootstrap prediction matrix showing percentage predictions for each combination of predictor values in CCR analysis.
Input Arguments
table: Contingency table of frequency counts
predictors: List of predictor dimensions (1-indexed)
predictors_names: Names of predictor variables (optional)
response: Response variable dimension (1-indexed). If None, the last dimension is used.
response_name: Name of response variable (optional)
n_resamples: Number of bootstrap resamples (default=9999)
random_state: Random state for reproducibility (optional)
parallel: Whether to use parallel processing (default=True)
Outputs
Bootstrap CCR prediction matrix showing the percentage of the predicted category of the response variable for each combination of categories of the predictors.
Warnings/Errors
ValueError: If predictor or response axis is out of bounds
Notes
The output is a pandas DataFrame with the percentage of the predicted category of the response variable for each combination of categories of the predictors.
The output also includes a method, plot_predictions_summary, to plot the prediction matrix as a heatmap or bubble plot.
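Example (illustrative sketch)
One way to read a bootstrap prediction matrix is as a tally: for each resampled table, record which response category is predicted for each predictor category, then convert the tallies to percentages. The sketch below illustrates only that tallying idea. It assumes a 2-D table with predictor categories as rows, multinomial resampling, and a placeholder prediction rule (the most frequent response category per row) that stands in for the CCR predictor; the function names are hypothetical.

```python
import random


def modal_prediction(table):
    """Placeholder rule (NOT the CCR predictor): most frequent response per predictor row.

    Ties go to the lowest response index.
    """
    return [max(range(len(row)), key=lambda j: row[j]) for row in table]


def bootstrap_prediction_percentages(table, n_resamples=499, seed=None):
    """Percentage of resamples in which each response category is predicted, per row."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    weights = [table[i][j] for i, j in cells]
    n = sum(weights)

    tallies = [[0] * n_cols for _ in range(n_rows)]
    for _ in range(n_resamples):
        # Assumed scheme: multinomial resample of the whole table.
        resampled = [[0] * n_cols for _ in range(n_rows)]
        for i, j in rng.choices(cells, weights=weights, k=n):
            resampled[i][j] += 1
        # Record the predicted response category for every predictor category.
        for i, predicted in enumerate(modal_prediction(resampled)):
            tallies[i][predicted] += 1

    return [[100.0 * count / n_resamples for count in row] for row in tallies]


for row in bootstrap_prediction_percentages([[30, 5, 2], [3, 28, 6]], seed=0):
    print(row)
```

Each row of the result sums to 100, which matches the documented description of the output as percentages per combination of predictor categories.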
- ccrvam.checkerboard.genstatsim.save_predictions(prediction_matrix: pandas.DataFrame, save_path: Optional[str] = None, format: str = 'csv', decimal_places: int = 2) -> None
Save prediction results generated by bootstrap_predict_ccr_summary() to a file.
Input Arguments
prediction_matrix: DataFrame containing prediction results generated by bootstrap_predict_ccr_summary()
save_path: Path to save the output file
format: Output format ('csv' or 'txt') (default='csv')
decimal_places: Number of decimal places for the prediction percentages from bootstrap_predict_ccr_summary() (default=2)
Outputs
None (saves file to disk)
Warnings/Errors
ValueError: If save_path is not specified
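Example (illustrative sketch)
The documented contract (a required save_path, a choice of 'csv' or 'txt', and a fixed number of decimal places) can be mimicked with the standard library. The sketch below is a stand-in, not the library's code: save_matrix and its arguments are hypothetical, and writing 'txt' output as tab-delimited text is an assumption, since the source does not describe the txt layout.

```python
import csv


def save_matrix(rows, header, save_path=None, fmt="csv", decimal_places=2):
    """Write (label, values) rows to disk with percentages rounded to decimal_places."""
    if save_path is None:
        raise ValueError("save_path must be specified")
    if fmt not in ("csv", "txt"):
        raise ValueError("fmt must be 'csv' or 'txt'")

    # Assumption: 'txt' means tab-delimited text.
    delimiter = "," if fmt == "csv" else "\t"
    with open(save_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        writer.writerow(header)
        for label, values in rows:
            writer.writerow([label] + [f"{value:.{decimal_places}f}" for value in values])
```

A call such as save_matrix([("A", [12.5, 87.5])], ["predictor", "y=1", "y=2"], save_path="pred.csv") would write a header line followed by one formatted row.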
- class ccrvam.checkerboard.genstatsim.CustomPermutationResult
Container for permutation simulation (including test) results with statistical visualization capabilities.
Input Arguments
metric_name: Name of the (S)CCRAM being tested
observed_value: Original observed value of the (S)CCRAM
p_value: Permutation test p-value
null_distribution: Array of the values of the (S)CCRAM computed for the permuted contingency tables
permutation_tables: (Optional) Array of permuted contingency tables generated under the null hypothesis (no regression association)
histogram_fig: (Optional) Matplotlib figure of distribution plot
- null_distribution: numpy.ndarray
None
- permutation_tables: numpy.ndarray
None
- histogram_fig: matplotlib.pyplot.Figure
None
- plot_distribution(title: Optional[str] = None, figsize: Optional[Tuple[int, int]] = None, title_fontsize: Optional[int] = None, xlabel_fontsize: Optional[int] = None, ylabel_fontsize: Optional[int] = None, tick_fontsize: Optional[int] = None, **kwargs) -> Optional[matplotlib.pyplot.Figure]
Plot null distribution with observed value.
Input Arguments
title: Title of the plot (optional)
figsize: Figure size as (width, height) tuple (optional)
title_fontsize: Font size for the plot title (optional)
xlabel_fontsize: Font size for x-axis label (optional)
ylabel_fontsize: Font size for y-axis label (optional)
tick_fontsize: Font size for axis tick labels (optional)
**kwargs: Additional matplotlib arguments passed to plotting functions
Outputs
Matplotlib figure of distribution plot
- ccrvam.checkerboard.genstatsim._process_permutation_batch(args)
Helper function for parallel permutation test processing.
- ccrvam.checkerboard.genstatsim.permutation_test_ccram(contingency_table: numpy.ndarray, predictors: Union[List[int], int], response: int, scaled: bool = False, alternative: str = 'greater', n_resamples: int = 9999, random_state: Optional[int] = None, store_tables: bool = False, parallel: bool = False) -> ccrvam.checkerboard.genstatsim.CustomPermutationResult
Perform permutation simulation and test for (S)CCRAM.
Input Arguments
contingency_table: Input contingency table of frequency counts
predictors: List of 1-indexed predictor axes for (S)CCRAM calculation
response: 1-indexed target response axis for (S)CCRAM calculation
scaled: Whether to use scaled (S)CCRAM (default=False)
alternative: Alternative hypothesis ('greater', 'less', 'two-sided') (default='greater')
n_resamples: Number of permutations (default=9999)
random_state: Random state for reproducibility (optional)
store_tables: Whether to store the permuted contingency tables (default=False)
parallel: Whether to use parallel processing (default=False)
Outputs
Test results including Monte Carlo permutation p-value, (S)CCRAM values computed for the permuted contingency tables, and (optionally) permuted contingency tables generated under the null hypothesis (no regression association)
Warnings/Errors
ValueError: If predictor or response variable axis is out of bounds
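Example (illustrative sketch)
A Monte Carlo permutation test of the "no regression association" null can be sketched as follows: expand the table into one (predictor, response) pair per observation, shuffle the response labels, rebuild the table, and compare the recomputed statistic with the observed one. The sketch is a minimal illustration under assumptions: it uses the Pearson chi-square statistic as a placeholder for the (S)CCRAM, handles only the 'greater' alternative, and uses the common (exceedances + 1) / (n_resamples + 1) p-value; whether ccrvam applies that exact correction is not stated in the source. Function names are hypothetical.

```python
import random


def chi_square(table):
    """Placeholder association statistic (NOT the (S)CCRAM): Pearson chi-square."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def permutation_p_value(table, statistic, n_resamples=999, seed=None):
    """Return (observed statistic, one-sided 'greater' p-value, null distribution)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])

    # One (predictor, response) pair per observation; rows are predictor categories.
    xs, ys = [], []
    for i in range(n_rows):
        for j in range(n_cols):
            xs.extend([i] * table[i][j])
            ys.extend([j] * table[i][j])

    observed = statistic(table)
    null = []
    for _ in range(n_resamples):
        rng.shuffle(ys)  # break the predictor-response link, keep both margins
        permuted = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(xs, ys):
            permuted[i][j] += 1
        null.append(statistic(permuted))

    exceedances = sum(1 for value in null if value >= observed)
    return observed, (exceedances + 1) / (n_resamples + 1), null


observed, p_value, null = permutation_p_value([[20, 2], [3, 25]], chi_square, seed=1)
print(observed, p_value)
```

With n_resamples permutations the smallest attainable p-value under this formula is 1 / (n_resamples + 1), which is one reason a default such as 9999 resamples is convenient.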
- class ccrvam.checkerboard.genstatsim.SubsetCCRAMResult
Container for subset (S)CCRAM analysis results.
Input Arguments
results_df: DataFrame containing all subset (S)CCRAM results with columns:
    k: number of predictors in subset
    predictors: tuple of predictor variable indices (1-indexed)
    response: response variable index (1-indexed)
    ccram/sccram: (S)CCRAM value for this subset (column name depends on scaled parameter)
response: The response variable index (1-indexed) used in the analysis
n_dimensions: Total number of dimensions in the contingency table
scaled: Whether scaled CCRAM (SCCRAM) was used
- _results_df_full: pandas.DataFrame
None
- property results_df: pandas.DataFrame
Return the results DataFrame with internal columns hidden.
- property metric_column: str
Return the column name for the metric based on scaled parameter.
- _filter_display_columns(df: pandas.DataFrame) -> pandas.DataFrame
Filter out internal columns (starting with '_') from a DataFrame.
- get_top_subsets(top: int = 5) -> pandas.DataFrame
Get the top subsets with highest (S)CCRAM values across all predictor sizes.
This method returns the subsets with the highest (S)CCRAM values globally, regardless of the number of predictors (k) in each subset.
Input Arguments
top: Number of top subsets to return (default=5)
Outputs
DataFrame with top subsets sorted by (S)CCRAM value (highest first)
- get_top_subsets_per_k(top: int = 3) -> pandas.DataFrame
Get the top subsets with highest (S)CCRAM values for each predictor size k.
This method returns the top subsets (as many as the top argument specifies) for each value of k (number of predictors), from k=1 up to k=D, where D is the total number of available predictors. This is useful for comparing the best predictor combinations within each subset size.
Input Arguments
top: Number of top subsets to return for each k value (default=3). If top exceeds the number of possible combinations for a given k (i.e., top > C(D,k), where D is the total number of predictors), all combinations for that k are returned.
Outputs
DataFrame with top subsets for each k, sorted by k ascending and (S)CCRAM descending within each k. Includes all columns from results_df.
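Example (illustrative sketch)
The difference between a global top list and a per-k top list can be shown on plain records. The sketch below is a minimal illustration, assuming each result is a dict with k, predictors, and a metric value; the helper names are hypothetical and the metric values are made-up numbers, not library output.

```python
from itertools import groupby


def top_subsets(rows, metric="ccram", top=5):
    """Highest metric values overall, regardless of subset size k."""
    return sorted(rows, key=lambda row: -row[metric])[:top]


def top_subsets_per_k(rows, metric="ccram", top=3):
    """Highest metric values within each k, ordered by k ascending, metric descending."""
    ordered = sorted(rows, key=lambda row: (row["k"], -row[metric]))
    result = []
    for _, group in groupby(ordered, key=lambda row: row["k"]):
        # Slicing returns the whole group when it has fewer than `top` entries.
        result.extend(list(group)[:top])
    return result


# Made-up values, for illustration only.
rows = [
    {"k": 1, "predictors": (1,), "ccram": 0.10},
    {"k": 1, "predictors": (2,), "ccram": 0.30},
    {"k": 1, "predictors": (3,), "ccram": 0.20},
    {"k": 2, "predictors": (1, 2), "ccram": 0.35},
    {"k": 2, "predictors": (1, 3), "ccram": 0.25},
    {"k": 2, "predictors": (2, 3), "ccram": 0.40},
    {"k": 3, "predictors": (1, 2, 3), "ccram": 0.45},
]

for row in top_subsets_per_k(rows, top=2):
    print(row["k"], row["predictors"], row["ccram"])
```

In this toy data the full predictor set ranks first globally, consistent with the documented note that (S)CCRAM does not decrease as predictors are added; the per-k view is what makes smaller subsets comparable.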
- get_subsets_by_k(k: int) -> pandas.DataFrame
Get all subsets with exactly k predictors.
Input Arguments
k: Number of predictors
Outputs
DataFrame with subsets having k predictors, sorted by (S)CCRAM
- summary() -> pandas.DataFrame
Get summary statistics for each k value.
Outputs
DataFrame with summary statistics (max, mean, min, count) for each k
- get_results_with_penalties() -> pandas.DataFrame
Get results DataFrame with penalty-related columns for predictor categories.
This method computes additional columns that can be useful for penalizing (S)CCRAM values based on the number of predictor categories, since (S)CCRAM is non-decreasing as the number of predictors increases.
Outputs
DataFrame with all public columns plus:
sum_cate: Sum of categories across all predictors in the subset
prod_cate: Product of categories across all predictors in the subset
Example
    result = all_subsets_ccram(table, response=4, scaled=True)
    df = result.get_results_with_penalties()
    # Use sum_cate or prod_cate to compute penalized scores
    df['penalized_sccram'] = df['sccram'] / df['sum_cate']
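Example (illustrative sketch)
The two penalty columns follow directly from the per-predictor category counts. The sketch below shows that arithmetic on plain records, assuming each record carries a pred_cate tuple of category counts; add_penalty_columns is a hypothetical helper and the sccram values are made-up numbers.

```python
from math import prod


def add_penalty_columns(rows):
    """Attach sum_cate and prod_cate, computed from each row's pred_cate tuple."""
    enriched = []
    for row in rows:
        categories = row["pred_cate"]
        enriched.append({**row, "sum_cate": sum(categories), "prod_cate": prod(categories)})
    return enriched


# Made-up values, for illustration only.
rows = [
    {"predictors": (1,), "pred_cate": (2,), "sccram": 0.30},
    {"predictors": (1, 2), "pred_cate": (2, 3), "sccram": 0.45},
]

for row in add_penalty_columns(rows):
    # One possible penalized score: divide by the total number of categories.
    print(row["predictors"], row["sum_cate"], row["prod_cate"], row["sccram"] / row["sum_cate"])
```

Dividing by sum_cate penalizes additively, while prod_cate grows with the number of cells in the predictor cross-classification and therefore penalizes large subsets more heavily.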
- plot_subsets(figsize: Optional[Tuple[int, int]] = None, point_size: int = 80, point_color: str = 'steelblue', title: Optional[str] = None, title_fontsize: Optional[int] = None, xlabel_fontsize: Optional[int] = None, ylabel_fontsize: Optional[int] = None, tick_fontsize: Optional[int] = None, label_fontsize: Optional[int] = None, save_path: Optional[str] = None, dpi: int = 300, **kwargs) -> Tuple[matplotlib.pyplot.Figure, matplotlib.pyplot.Axes]
Plot all subset (S)CCRAM values against the number of predictors (k).
This visualization helps identify patterns across different subset sizes and aids in deciding which k value to focus on for detailed analysis. The best subset for each k is labeled with its predictor combination.
Input Arguments
figsize: Figure size as (width, height) tuple (optional, default=(10, 6))
point_size: Size of scatter points (default=80)
point_color: Color for scatter points (default='steelblue')
title: Custom title for the plot (optional)
title_fontsize: Font size for the plot title (optional)
xlabel_fontsize: Font size for x-axis label (optional)
ylabel_fontsize: Font size for y-axis label (optional)
tick_fontsize: Font size for axis tick labels (optional)
label_fontsize: Font size for labels on best subsets (optional)
save_path: Path to save the plot (optional)
dpi: Resolution for saved image (default=300)
**kwargs: Additional matplotlib arguments passed to plt.subplots()
Outputs
Tuple of (Figure, Axes) matplotlib objects
- class ccrvam.checkerboard.genstatsim.BestSubsetCCRAMResult
Container for best subset (S)CCRAM analysis results.
Input Arguments
predictors: Tuple of optimal predictor variable indices (1-indexed)
response: Response variable index (1-indexed)
ccram: (S)CCRAM value for the optimal subset
k: Number of predictors in the optimal subset
rank_within_k: Rank of this subset within all subsets of size k
total_subsets_in_k: Total number of subsets of size k
scaled: Whether scaled CCRAM (SCCRAM) was used
all_results: Full SubsetCCRAMResult object for further analysis
- all_results: ccrvam.checkerboard.genstatsim.SubsetCCRAMResult
None
- summary_df() -> pandas.DataFrame
Return a DataFrame with the summary of the best subset.
Outputs
DataFrame with best subset information
- ccrvam.checkerboard.genstatsim._format_tuple_display(t: tuple) -> str
Format a tuple as a string without trailing comma for single elements.
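Example (illustrative sketch)
Python's built-in str() renders a one-element tuple with a trailing comma, which is what a display helper of this kind avoids. The sketch below is one plausible way to do it; format_tuple_display is a hypothetical stand-in, and the exact output of the library's private helper for single elements is an assumption based on the description above.

```python
def format_tuple_display(t):
    """Render a tuple as text, dropping the trailing comma that str() adds to 1-tuples."""
    return "(" + ", ".join(str(item) for item in t) + ")"


print(str((2,)))                      # built-in rendering keeps the trailing comma
print(format_tuple_display((2,)))     # rendering without the trailing comma
print(format_tuple_display((2, 3)))   # multi-element tuples look the same as str()
```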
- ccrvam.checkerboard.genstatsim.all_subsets_ccram(contingency_table: numpy.ndarray, response: int, scaled: bool = False, k: Optional[int] = None, variable_names: Optional[dict] = None) -> ccrvam.checkerboard.genstatsim.SubsetCCRAMResult
Calculate (S)CCRAM for all possible predictor subsets.
This function computes the (Scaled) Checkerboard Copula Regression Association Measure ((S)CCRAM) for all combinations of predictor variables given a specified response variable. Results are organized by the number of predictors (k) and sorted by (S)CCRAM value within each k.
Input Arguments
contingency_table: Input contingency table of frequency counts (multi-dimensional numpy array)
response: 1-indexed target response variable axis for (S)CCRAM calculation
scaled: Whether to use scaled (S)CCRAM (default=False)
k: Optional number of predictors to consider. If None, all possible subset sizes are computed (from k=1 to k=ndim-1). If specified, only subsets of size k are computed.
variable_names: Optional dictionary mapping 1-indexed variable indices to names. If provided, predictor names will be included in the output.
Outputs
SubsetCCRAMResult object containing:
results_df: DataFrame with columns [k, predictors, pred_cate, response, ccram/sccram] (and optionally predictor_names), sorted by k ascending and metric value descending within each k. The metric column is named 'ccram' when scaled=False and 'sccram' when scaled=True. The pred_cate column contains the number of categories for each predictor in the subset, formatted as a tuple string (e.g., “(2, 3)” for a 2-predictor subset where the first predictor has 2 categories and the second has 3).
response: The response variable index
n_dimensions: Total number of dimensions
scaled: Whether scaled CCRAM was used
Warnings/Errors
ValueError: If response axis is out of bounds
ValueError: If k is specified but invalid (k < 1 or k >= ndim)
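Example (illustrative sketch)
The enumeration that an all-subsets search implies can be sketched with itertools.combinations. The snippet below only lists the candidate subsets and their category counts from a table shape; it does not compute any (S)CCRAM value. It assumes 1-indexed axes and mirrors the documented validity checks on response and k; enumerate_predictor_subsets is a hypothetical helper, not part of ccrvam.

```python
from itertools import combinations


def enumerate_predictor_subsets(shape, response, k=None):
    """List every predictor subset (1-indexed axes) for a table of the given shape."""
    ndim = len(shape)
    if not 1 <= response <= ndim:
        raise ValueError("response axis is out of bounds")
    if k is not None and (k < 1 or k >= ndim):
        raise ValueError("k must satisfy 1 <= k < ndim")

    # Every axis except the response is a candidate predictor.
    candidates = [axis for axis in range(1, ndim + 1) if axis != response]
    sizes = [k] if k is not None else range(1, ndim)

    rows = []
    for size in sizes:
        for subset in combinations(candidates, size):
            # Number of categories of each predictor in the subset.
            pred_cate = tuple(shape[axis - 1] for axis in subset)
            rows.append({"k": size, "predictors": subset, "pred_cate": pred_cate})
    return rows


for row in enumerate_predictor_subsets((2, 3, 2, 4), response=4):
    print(row)
```

With D = ndim - 1 candidate predictors there are 2^D - 1 non-empty subsets, so the number of (S)CCRAM evaluations grows exponentially with the number of dimensions; restricting k is the documented way to limit the search.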
- ccrvam.checkerboard.genstatsim.best_subset_ccram(contingency_table: numpy.ndarray, response: int, scaled: bool = False, k: Optional[int] = None, variable_names: Optional[dict] = None) -> ccrvam.checkerboard.genstatsim.BestSubsetCCRAMResult
Find the optimal predictor subset with the highest (S)CCRAM value.
This function identifies the predictor combination that yields the maximum (Scaled) Checkerboard Copula Regression Association Measure ((S)CCRAM) for predicting the specified response variable.
Input Arguments
contingency_table: Input contingency table of frequency counts (multi-dimensional numpy array)
response: 1-indexed target response variable axis for (S)CCRAM calculation
scaled: Whether to use scaled (S)CCRAM (default=False)
k: Optional number of predictors to consider. If None, searches across all possible subset sizes (k=1 to k=ndim-1). If specified, finds the best subset of exactly k predictors.
variable_names: Optional dictionary mapping 1-indexed variable indices to names.
Outputs
BestSubsetCCRAMResult object containing:
predictors: Tuple of optimal predictor variable indices (1-indexed)
response: Response variable index
ccram: (S)CCRAM value for the optimal subset
k: Number of predictors in the optimal subset
rank_within_k: Rank of this subset among all subsets of the same size k
total_subsets_in_k: Total number of subsets of size k
scaled: Whether scaled CCRAM was used
all_results: Complete SubsetCCRAMResult for further analysis
Warnings/Errors
ValueError: If response axis is out of bounds
ValueError: If k is specified but invalid (k < 1 or k >= ndim)