Gene Set Analysis Example ========================= This example demonstrates how to use UpSet plots to visualize gene set intersections, a common use case in bioinformatics. We'll use simulated data representing genes involved in different biological pathways. Setup ----- First, let's import our libraries and create some sample data: .. altair-plot:: :output: none import altair_upset as au import pandas as pd import numpy as np # Simulate gene set data np.random.seed(42) n_genes = 2000 # Define pathways and their approximate sizes pathways = { 'Cell_Cycle': 0.15, # 15% of genes 'DNA_Repair': 0.10, 'Apoptosis': 0.12, 'Immune_Response': 0.20, 'Metabolism': 0.25, 'Signal_Transduction': 0.30 } # Create data with realistic overlaps data = pd.DataFrame() for pathway, prob in pathways.items(): # Add some correlation between related pathways if pathway == 'Cell_Cycle': data[pathway] = np.random.choice([0, 1], size=n_genes, p=[1-prob, prob]) elif pathway == 'DNA_Repair': # DNA repair genes are more likely to be involved in cell cycle p_repair = np.where(data['Cell_Cycle'] == 1, 0.3, 0.05) p_repair = np.clip(p_repair, 0, 1) # Ensure probabilities are valid data[pathway] = np.random.binomial(1, p_repair) elif pathway == 'Apoptosis': # Apoptosis genes might be involved in cell cycle and DNA repair p_apoptosis = 0.05 + 0.15 * data['Cell_Cycle'] + 0.1 * data['DNA_Repair'] p_apoptosis = np.clip(p_apoptosis, 0, 1) # Ensure probabilities are valid data[pathway] = np.random.binomial(1, p_apoptosis) else: data[pathway] = np.random.choice([0, 1], size=n_genes, p=[1-prob, prob]) Basic UpSet Plot ---------------- Create a basic UpSet plot showing all pathway intersections: .. altair-plot:: au.UpSetAltair( data=data, sets=data.columns.tolist(), sort_by="frequency", sort_order="descending", title="Gene Set Intersections", subtitle="Distribution of genes across pathways", glyph_size=100, # Ensure positive size set_label_bg_size=500, # Ensure positive size line_connection_size=2, ).chart Focused DNA Repair Analysis --------------------------- Create a focused view of DNA repair pathways: .. altair-plot:: dna_repair_pathways = ['DNA_Repair', 'Cell_Cycle', 'Apoptosis'] au.UpSetAltair( data=data[dna_repair_pathways], sets=dna_repair_pathways, sort_by="frequency", sort_order="descending", title="DNA Repair Pathway Intersections", subtitle="Focused analysis of DNA repair mechanisms", glyph_size=100, # Ensure positive size set_label_bg_size=500, # Ensure positive size line_connection_size=2, ).chart Analysis Results ---------------- Let's analyze the pathway intersections in detail: Single Pathway Analysis ~~~~~~~~~~~~~~~~~~~~~~~ .. altair-plot:: :output: repr print("\nGenes unique to each pathway:") for pathway in pathways: unique_genes = data[data[pathway] == 1][ data.drop(columns=[pathway]).sum(axis=1) == 0 ] print( f"{pathway}: {len(unique_genes)} genes ({len(unique_genes)/n_genes*100:.1f}%)" ) Multi-Pathway Analysis ~~~~~~~~~~~~~~~~~~~~~~ .. altair-plot:: :output: repr # Multi-pathway genes multi_pathway = data[data.sum(axis=1) > 1] print( f"\nGenes involved in multiple pathways: {len(multi_pathway)} ({len(multi_pathway)/n_genes*100:.1f}%)" ) # Most common pathway combination def get_pathway_combination(row): return " & ".join(data.columns[row == 1]) most_common = ( data.groupby(data.columns.tolist()).size().sort_values(ascending=False).head(1) ) combination = get_pathway_combination( pd.Series(most_common.index[0], index=data.columns) ) print(f"\nMost common pathway combination: {combination}") print( f"Number of genes: {most_common.values[0]} ({most_common.values[0]/n_genes*100:.1f}%)" ) DNA Repair Pathway Analysis ~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. altair-plot:: :output: repr dna_repair_genes = data[data["DNA_Repair"] == 1] print(f"\nDNA Repair Pathway Analysis:") print( f"Total DNA repair genes: {len(dna_repair_genes)} ({len(dna_repair_genes)/n_genes*100:.1f}%)" ) print("Co-occurrence with other pathways:") for pathway in pathways: if pathway != "DNA_Repair": co_occurrence = data[(data["DNA_Repair"] == 1) & (data[pathway] == 1)] print( f"{pathway}: {len(co_occurrence)} genes ({len(co_occurrence)/len(dna_repair_genes)*100:.1f}%)" )