# Preprocessing

The `preprocess()` function applies the preprocessing pipeline defined in a `FitOptions` to a `GrowthData` and returns a **new** `GrowthData`. The input is never modified.

```python
from pykinbiont import preprocess, FitOptions, GrowthData
```

## Smoothing

```python
opts = FitOptions(smooth=True, smooth_method="rolling_avg", smooth_pt_avg=5)
smoothed = preprocess(data, opts)

print(f"Original min: {data.curves.min():.4f}")
print(f"Smoothed min: {smoothed.curves.min():.4f}")
```

Supported smoothing methods:

| `smooth_method` | Notes |
|---|---|
| `"lowess"` | Locally weighted scatterplot smoothing |
| `"rolling_avg"` | Rolling mean with window `smooth_pt_avg` |
| `"gaussian"` | Gaussian kernel, bandwidth = `gaussian_h_mult × median_spacing` |
| `"boxcar"` | Uniform boxcar filter with window `boxcar_window` |
| `"none"` | Pass-through (same as `smooth=False`) |

## Blank subtraction

```python
blank_od = 0.015  # measured from blank wells

opts = FitOptions(
    blank_subtraction=True,
    blank_value=blank_od,
    correct_negatives=True,
    negative_method="thr_correction",
    negative_threshold=0.001,
)
subtracted = preprocess(data, opts)
```

After subtraction, some values may go below zero (noise in the blank wells). Set `correct_negatives=True` to handle them, choosing a `negative_method`:

- `"remove"` — removes time points where OD ≤ 0
- `"thr_correction"` — replaces values below `negative_threshold` with `negative_threshold`
- `"blank_correction"` — adds the blank mean back to floor-clamp values

## Clustering

Clustering groups curves by shape (z-normalised k-means) and attaches cluster assignments to the returned `GrowthData`.
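The effect of z-normalisation before k-means can be sketched in plain NumPy (this is an illustration of the idea only, not the library's internals): each curve is shifted and scaled to zero mean and unit variance, so clustering compares shapes rather than absolute OD levels.

```python
import numpy as np

# Three toy "curves" (rows = wells, columns = time points).
curves = np.array([
    [0.1, 0.2, 0.4, 0.8],   # fast grower at low OD
    [1.0, 2.0, 4.0, 8.0],   # same shape, 10x the level
    [0.8, 0.6, 0.4, 0.2],   # declining curve
])

# Z-normalise each row: subtract its mean, divide by its std.
means = curves.mean(axis=1, keepdims=True)
stds = curves.std(axis=1, keepdims=True)
z = (curves - means) / stds

# The first two rows become identical, so k-means would place them in the
# same cluster despite the 10x difference in scale; the declining curve
# stays distinct.
print(np.allclose(z[0], z[1]))  # True
```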
```python
opts = FitOptions(cluster=True, n_clusters=3, kmeans_seed=42)
clustered = preprocess(data, opts)

for label, cid in zip(data.labels, clustered.clusters):
    print(f" {label:12s} → cluster {cid}")

print(f"WCSS: {clustered.wcss:.4f}")
print(f"Centroid matrix shape: {clustered.centroids.shape}")
```

### Clustering then fitting

Because `cluster=True` in `fit()` skips model fitting entirely, the recommended pattern is to call `preprocess()` first for clustering and then `fit()` separately:

```python
from pykinbiont import preprocess, fit, FitOptions, ModelSpec, LogLinModel

# Step 1: cluster
opts_cluster = FitOptions(cluster=True, n_clusters=3)
clustered = preprocess(data, opts_cluster)
cluster_assignments = dict(zip(clustered.labels, map(int, clustered.clusters)))

# Step 2: fit (no cluster flag here)
spec = ModelSpec(models=[LogLinModel()], params=[[]])
opts_fit = FitOptions(smooth=True, smooth_method="rolling_avg")
results = fit(data, spec, opts_fit)

# Attach cluster info manually
df = results.to_dataframe()
df["cluster"] = df["label"].map(cluster_assignments)
```

## Full pipeline

```python
opts = FitOptions(
    smooth=True,
    smooth_method="rolling_avg",
    smooth_pt_avg=5,
    blank_subtraction=True,
    blank_value=0.015,
    correct_negatives=True,
    negative_method="thr_correction",
    negative_threshold=0.001,
    cut_stationary_phase=True,
)
preprocessed = preprocess(data, opts)
```

The pipeline runs in the order: blank subtraction → negative correction → scattering correction → smoothing → stationary-phase trimming.
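That fixed ordering can be sketched as a chain of per-curve functions applied left to right. This is an illustrative NumPy sketch of the ordering idea only, not the library's implementation; the helper names are made up, and the scattering-correction and stationary-phase stages are omitted for brevity.

```python
import numpy as np

def blank_subtract(od, blank=0.015):
    # Stage 1: remove the background OD measured in blank wells.
    return od - blank

def thr_correct(od, thr=0.001):
    # Stage 2: clamp any value below the threshold up to the threshold,
    # mirroring the "thr_correction" behaviour described above.
    return np.maximum(od, thr)

def rolling_avg(od, window=3):
    # Stage 4: centred rolling mean via convolution ("same" keeps length).
    kernel = np.ones(window) / window
    return np.convolve(od, kernel, mode="same")

od = np.array([0.010, 0.020, 0.050, 0.120, 0.300, 0.310])

# Order matters: subtracting the blank first is what creates the small
# negative values that the threshold correction then cleans up.
for stage in (blank_subtract, thr_correct, rolling_avg):
    od = stage(od)

print(od.min() > 0)  # True: no negatives survive the pipeline
```

Running the stages in a different order would give different results; for example, smoothing before blank subtraction would smear the blank offset into neighbouring points before it is removed.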