AI Clustering
The AI Clustering activity groups rows in a staging table into clusters using the K-Means algorithm and writes cluster assignments and distances to an output staging table.
Purpose
Use the AI Clustering activity to:
- Segment customers, products, or cost centres into groups based on numeric attributes
- Identify natural groupings in financial or operational data
- Produce cluster labels for downstream analysis or reporting
Algorithm
Uses ML.NET K-Means Clustering:
- Rows are assigned to the nearest of k cluster centroids.
- Centroids are iteratively updated until convergence.
- Each row is assigned a cluster ID (1 to k) and a distance from its centroid.
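The steps above can be sketched in a few lines of plain Python. This is only an illustration of the general K-Means loop, not the ML.NET implementation the activity uses. The function names, the `seed` and `max_iter` parameters, the sample rows, and the use of plain Euclidean distance are all assumptions made for the example.

```python
import math
import random


def nearest(row, centroids):
    """Return the index of the centroid closest to row (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(row, centroids[c]))


def kmeans(rows, k, seed=0, max_iter=100):
    """Cluster rows (lists of floats) into k groups.

    Returns (cluster_ids, distances, centroids); cluster IDs run from 1 to k.
    """
    rng = random.Random(seed)
    # Random initialisation: start from k distinct rows.
    centroids = [list(r) for r in rng.sample(rows, k)]

    for _ in range(max_iter):
        # Assignment step: nearest centroid for every row.
        labels = [nearest(row, centroids) for row in rows]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # empty cluster keeps its centroid
        if updated == centroids:  # converged: centroids stopped moving
            break
        centroids = updated

    # Final assignment against the final centroids.
    labels = [nearest(row, centroids) for row in rows]
    cluster_ids = [lab + 1 for lab in labels]  # 1-based IDs
    distances = [math.dist(row, centroids[lab]) for row, lab in zip(rows, labels)]
    return cluster_ids, distances, centroids


# Hypothetical sample data: two loose groups of points.
rows = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]
cluster_ids, distances, centroids = kmeans(rows, k=2)
```

Each row ends up with a cluster ID between 1 and k and a non-negative distance to its own centroid, which mirrors the two columns the activity writes.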
Configuration
Input Staging Table
The staging table containing the data to cluster. All feature columns must be numeric.
Output Staging Table
The staging table where cluster assignments are written.
Transforms
Pre-processing pipeline applied before clustering. A typical pipeline:
- Concatenate — combine the numeric feature columns into a single feature vector
- Normalize — scale the features before clustering (recommended)
- Train Model — the K-Means step, where you configure the feature, output, and distance columns, and the number of clusters
Train Model step settings:
| Setting | Description |
|---|---|
| Input Column | The (vector) column to cluster on — typically the output of a Concatenate transform |
| Output Column Name | Column name for cluster assignments (default: ClusterId) |
| Distance Column Name | Column name for distance from centroid (default: Distance) |
| Number of Clusters | The number of clusters k (default: 3, min: 2) |
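To make the Concatenate and Normalize steps concrete, here is a rough Python sketch of what they do conceptually. It assumes min-max scaling to the range 0 to 1; the activity's Normalize transform may use a different scaling method. The column names `Revenue` and `Margin` and both function names are hypothetical.

```python
def concatenate(row, columns):
    """Collect the named numeric columns of a row (a dict) into one feature vector."""
    return [float(row[c]) for c in columns]


def normalize_min_max(vectors):
    """Rescale each feature to the range [0, 1] across all rows."""
    lows = [min(col) for col in zip(*vectors)]
    highs = [max(col) for col in zip(*vectors)]
    return [
        # A constant column (hi == lo) is mapped to 0.0 to avoid dividing by zero.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(vec, lows, highs)]
        for vec in vectors
    ]


# Hypothetical input rows with two numeric feature columns on very different scales.
rows = [
    {"Revenue": 120000, "Margin": 0.10},
    {"Revenue": 450000, "Margin": 0.25},
    {"Revenue": 300000, "Margin": 0.40},
]
features = [concatenate(r, ["Revenue", "Margin"]) for r in rows]
scaled = normalize_min_max(features)
```

Without the scaling step, differences in `Revenue` would dominate every distance calculation and `Margin` would have almost no influence on the clusters.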
Row Filter (optional)
Filter the input data before clustering.
Output Schema
The output staging table contains all non-vector columns from the input, plus:
| Column | Description |
|---|---|
| ClusterId (or configured name) | Cluster assignment (integer, 1 to k) |
| Distance (or configured name) | Distance from the row to its cluster centroid |
Clustering quality metrics (average distance, Davies-Bouldin Index) are logged to the workflow run log.
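For readers who want to see what these two metrics measure, the sketch below computes them from a finished clustering using the textbook definitions. The logged values come from ML.NET and may be computed differently in detail (for example, with squared distances), so treat this as an aid to interpretation only. The function name and sample data are hypothetical.

```python
import math


def cluster_metrics(rows, cluster_ids, centroids):
    """Return (average_distance, davies_bouldin_index) for a finished clustering.

    cluster_ids are 1-based. Assumes at least two clusters with distinct centroids.
    """
    k = len(centroids)
    dists = [math.dist(row, centroids[cid - 1]) for row, cid in zip(rows, cluster_ids)]
    average_distance = sum(dists) / len(dists)

    # Scatter: mean distance of each cluster's members to its centroid.
    scatter = []
    for c in range(1, k + 1):
        member = [d for d, cid in zip(dists, cluster_ids) if cid == c]
        scatter.append(sum(member) / len(member) if member else 0.0)

    # For each cluster, keep the worst (largest) ratio of combined scatter
    # to centroid separation against any other cluster.
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return average_distance, sum(worst) / k


# Hypothetical example: two tight clusters that are far apart.
rows = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
cluster_ids = [1, 1, 2, 2]
centroids = [[0.0, 1.0], [10.0, 1.0]]
average_distance, dbi = cluster_metrics(rows, cluster_ids, centroids)
```

A lower Davies-Bouldin Index means clusters are compact relative to how far apart they sit, which is why lower values indicate a better clustering.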
Usage Notes
- The number of clusters must be chosen before running the algorithm. There is no automatic determination of the optimal k. Run with different values and compare the average distance and Davies-Bouldin Index metrics to choose.
- K-Means is sensitive to feature scale. Always Normalize before clustering.
- K-Means uses random initialisation, so results may vary slightly between runs. To reduce the effect of this variation, consider running multiple times and choosing the best result.
Best Practices
- Start with 3–5 clusters and adjust based on the Davies-Bouldin Index (lower is better) logged after each run.
- Always use Concatenate + Normalize in the transforms before the Train Model step.
- Inspect the cluster contents after the first run to validate that the groupings are meaningful. Adjust feature selection or k if the clusters are not interpretable.
JSON Reference
{
  "discriminator": "KMeansClusteringWorkflowActivity",
  "activityId": "<uuid>",
  "name": "AI Clustering",
  "positionX": 0,
  "positionY": 0,
  "advanceRule": 2,
  "inputStagingTable": "StagingInput",
  "outputStagingTable": "StagingClusters",
  "fsoPath": "",
  "mode": 0,
  "transforms": [],
  "filter": null,
  "inputColumnName": "Features",
  "outputColumnName": "ClusterId",
  "distanceColumnName": "Distance",
  "clusterCount": 3
}
| Property | Type | Description |
|---|---|---|
| inputStagingTable | string | Corresponds to the Input Staging Table field. The staging table containing the data to cluster. |
| outputStagingTable | string | Corresponds to the Output Staging Table field. The staging table where cluster assignments are written. |
| fsoPath | string | File system path. Not used by this activity but present as an inherited field. |
| mode | integer | 0 = TrainModel, 1 = RunModel. Not applicable: this activity trains and applies in a single pass. |
| transforms | array | Corresponds to the Transforms editor. Array of transform objects for pre-processing (typically Concatenate + Normalize). |
| filter | object or null | Corresponds to the Row Filter field. An optional filter applied to input rows before clustering. null means no filter. |
| inputColumnName | string | Corresponds to the Input Column setting. The vector column to cluster on, typically the output of a Concatenate transform. |
| outputColumnName | string | Corresponds to the Output Column Name setting. Column name for cluster assignment IDs. Default: "ClusterId". |
| distanceColumnName | string | Corresponds to the Distance Column Name setting. Column name for distance from centroid values. Default: "Distance". |
| clusterCount | integer | Corresponds to the Number of Clusters setting. The number of clusters k. Default: 3. |
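If activity definitions are generated by script, a small helper along the following lines could assemble the JSON with the documented defaults and the minimum cluster count enforced. This is a sketch under several assumptions: the helper name is hypothetical, generating `activityId` as a random UUID is a guess, `advanceRule` is copied from the example above without interpretation, and `transforms` is left empty because the transform object schema is not described here. How such JSON is loaded into a workflow is also outside what this page documents.

```python
import json
import uuid


def build_clustering_activity(input_table, output_table, cluster_count=3,
                              input_column="Features"):
    """Return an AI Clustering activity definition as a dict."""
    if cluster_count < 2:
        # The Number of Clusters setting has a documented minimum of 2.
        raise ValueError("clusterCount must be at least 2")
    return {
        "discriminator": "KMeansClusteringWorkflowActivity",
        "activityId": str(uuid.uuid4()),  # assumption: any fresh UUID is acceptable
        "name": "AI Clustering",
        "positionX": 0,
        "positionY": 0,
        "advanceRule": 2,  # copied from the reference example
        "inputStagingTable": input_table,
        "outputStagingTable": output_table,
        "fsoPath": "",
        "mode": 0,
        "transforms": [],  # Concatenate / Normalize objects would go here
        "filter": None,  # serialised as JSON null, meaning no row filter
        "inputColumnName": input_column,
        "outputColumnName": "ClusterId",
        "distanceColumnName": "Distance",
        "clusterCount": cluster_count,
    }


activity = build_clustering_activity("StagingInput", "StagingClusters", cluster_count=4)
activity_json = json.dumps(activity, indent=2)
```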