# Preprocessing

The `preprocess()` function applies the preprocessing pipeline defined in a `FitOptions` to a `GrowthData` and returns a **new** `GrowthData`. The input is never modified.

```python
from pykinbiont import preprocess, FitOptions, GrowthData
```

## Smoothing

```python
opts = FitOptions(smooth=True, smooth_method="rolling_avg", smooth_pt_avg=5)
smoothed = preprocess(data, opts)

print(f"Original min: {data.curves.min():.4f}")
print(f"Smoothed min: {smoothed.curves.min():.4f}")
```

Supported smoothing methods:

| `smooth_method` | Notes |
|---|---|
| `"lowess"` | Locally weighted scatterplot smoothing |
| `"rolling_avg"` | Rolling mean with window `smooth_pt_avg` |
| `"gaussian"` | Gaussian kernel, bandwidth = `gaussian_h_mult × median_spacing` |
| `"boxcar"` | Uniform boxcar filter with window `boxcar_window` |
| `"none"` | Pass-through (same as `smooth=False`) |

## Blank subtraction

```python
blank_od = 0.015  # measured from blank wells

opts = FitOptions(
    blank_subtraction=True,
    blank_value=blank_od,
    correct_negatives=True,
    negative_method="thr_correction",
    negative_threshold=0.001,
)
subtracted = preprocess(data, opts)
```

After subtraction, some values may go below zero (noise in the blank wells). Set `correct_negatives=True` to handle them, choosing a `negative_method`:

- `"remove"` — removes time points where OD ≤ 0
- `"thr_correction"` — replaces values below `negative_threshold` with `negative_threshold`
- `"blank_correction"` — adds the blank mean back to floor-clamp values

## Clustering

Clustering groups curves by shape (z-normalised k-means) and attaches cluster assignments to the returned `GrowthData`.
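The effect of z-normalisation before k-means can be sketched in plain NumPy (this is an illustration of the idea only, not the library's internals): each curve is shifted and scaled to zero mean and unit variance, so clustering compares shapes rather than absolute OD levels.

```python
import numpy as np

# Three toy "curves" (rows = wells, columns = time points).
curves = np.array([
    [0.1, 0.2, 0.4, 0.8],   # fast grower at low OD
    [1.0, 2.0, 4.0, 8.0],   # same shape, 10x the level
    [0.8, 0.6, 0.4, 0.2],   # declining curve
])

# Z-normalise each row: subtract its mean, divide by its std.
means = curves.mean(axis=1, keepdims=True)
stds = curves.std(axis=1, keepdims=True)
z = (curves - means) / stds

# The first two rows become identical, so k-means would place them in the
# same cluster despite the 10x difference in scale; the declining curve
# stays distinct.
print(np.allclose(z[0], z[1]))  # True
```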
```python
opts = FitOptions(cluster=True, n_clusters=3, kmeans_seed=42)
clustered = preprocess(data, opts)

for label, cid in zip(data.labels, clustered.clusters):
    print(f" {label:12s} → cluster {cid}")

print(f"WCSS: {clustered.wcss:.4f}")
print(f"Centroid matrix shape: {clustered.centroids.shape}")
```

### Clustering then fitting

Because `cluster=True` in `fit()` skips model fitting entirely, the recommended pattern is to call `preprocess()` first for clustering and then `fit()` separately:

```python
from pykinbiont import preprocess, fit, FitOptions, ModelSpec, LogLinModel

# Step 1: cluster
opts_cluster = FitOptions(cluster=True, n_clusters=3)
clustered = preprocess(data, opts_cluster)
cluster_assignments = dict(zip(clustered.labels, map(int, clustered.clusters)))

# Step 2: fit (no cluster flag here)
spec = ModelSpec(models=[LogLinModel()], params=[[]])
opts_fit = FitOptions(smooth=True, smooth_method="rolling_avg")
results = fit(data, spec, opts_fit)

# Attach cluster info manually
df = results.to_dataframe()
df["cluster"] = df["label"].map(cluster_assignments)
```

## Full pipeline

```python
opts = FitOptions(
    smooth=True,
    smooth_method="rolling_avg",
    smooth_pt_avg=5,
    blank_subtraction=True,
    blank_value=0.015,
    correct_negatives=True,
    negative_method="thr_correction",
    negative_threshold=0.001,
    cut_stationary_phase=True,
)
preprocessed = preprocess(data, opts)
```

The pipeline runs in the order: blank subtraction → negative correction → scattering correction → smoothing → stationary-phase trimming.
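That fixed ordering can be sketched as a chain of per-curve functions applied left to right. This is an illustrative NumPy sketch of the ordering idea only, not the library's implementation; the helper names are made up, and the scattering-correction and stationary-phase stages are omitted for brevity.

```python
import numpy as np

def blank_subtract(od, blank=0.015):
    # Stage 1: remove the background OD measured in blank wells.
    return od - blank

def thr_correct(od, thr=0.001):
    # Stage 2: clamp any value below the threshold up to the threshold,
    # mirroring the "thr_correction" behaviour described above.
    return np.maximum(od, thr)

def rolling_avg(od, window=3):
    # Stage 4: centred rolling mean via convolution ("same" keeps length).
    kernel = np.ones(window) / window
    return np.convolve(od, kernel, mode="same")

od = np.array([0.010, 0.020, 0.050, 0.120, 0.300, 0.310])

# Order matters: subtracting the blank first is what creates the small
# negative values that the threshold correction then cleans up.
for stage in (blank_subtract, thr_correct, rolling_avg):
    od = stage(od)

print(od.min() > 0)  # True: no negatives survive the pipeline
```

Running the stages in a different order would give different results; for example, smoothing before blank subtraction would smear the blank offset into neighbouring points before it is removed.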