AI Clustering

The AI Clustering activity groups rows in a staging table into clusters using the K-Means algorithm and writes cluster assignments and distances to an output staging table.

Purpose

Use the AI Clustering activity to:

  • Segment customers, products, or cost centres into groups based on numeric attributes
  • Identify natural groupings in financial or operational data
  • Produce cluster labels for downstream analysis or reporting

Algorithm

Uses ML.NET K-Means Clustering:

  1. Each row is assigned to the nearest of k cluster centroids.
  2. Each centroid is then recomputed as the mean of its assigned rows; steps 1 and 2 repeat until convergence.
  3. Each row is assigned a cluster ID (1 to k) and a distance from its centroid.
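
The steps above can be sketched in plain Python. This is a minimal illustration of the standard K-Means loop (Lloyd's algorithm), not the ML.NET implementation the activity uses. The function name `kmeans`, the `seed` and `max_iter` parameters, the sample rows, and the choice of random rows as starting centroids are all assumptions made for the example.

```python
import math
import random


def kmeans(rows, k, max_iter=100, seed=0):
    """Illustrative K-Means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)        # seed is an assumption of this sketch
    centroids = rng.sample(rows, k)  # start from k distinct input rows
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each row to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Step 2: move each centroid to the mean of its assigned rows.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    # Step 3: report a 1-based cluster ID and the distance to its centroid.
    return [
        (lab + 1, math.dist(row, centroids[lab]))
        for row, lab in zip(rows, labels)
    ]


# Hypothetical sample data: two numeric features per row.
rows = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
assignments = kmeans(rows, k=2)
```

Each element of `assignments` pairs a cluster ID in the range 1 to k with a non-negative distance, which mirrors the two values the activity writes per row.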

Configuration

Input Staging Table

The staging table containing the data to cluster. All feature columns must be numeric.

Output Staging Table

The staging table where cluster assignments are written.

Transforms

Pre-processing pipeline applied before clustering. A typical pipeline:

  1. Concatenate — combine the numeric feature columns into a single feature vector
  2. Normalize — scale the features before clustering (recommended)
  3. Train Model — the K-Means step, where you configure the feature, output, and distance columns, and the number of clusters
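
To show what the first two steps of the pipeline do to the data, here is a minimal Python sketch. It assumes min-max scaling for the Normalize step, which may not be the scaling method the activity actually applies. The helper names and the `Revenue` and `Headcount` columns are hypothetical.

```python
def concatenate(row, feature_columns):
    """Combine the named numeric columns of one row into a feature vector."""
    return tuple(float(row[col]) for col in feature_columns)


def min_max_normalize(vectors):
    """Rescale every feature to the range [0, 1] (assumed scaling method)."""
    lows = [min(col) for col in zip(*vectors)]
    highs = [max(col) for col in zip(*vectors)]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0.0
            for v, lo, hi in zip(vec, lows, highs)
        )
        for vec in vectors
    ]


# Hypothetical input rows with features on very different scales.
rows = [
    {"Revenue": 120000.0, "Headcount": 12},
    {"Revenue": 450000.0, "Headcount": 40},
    {"Revenue": 90000.0, "Headcount": 26},
]
vectors = [concatenate(r, ["Revenue", "Headcount"]) for r in rows]
features = min_max_normalize(vectors)
```

Without the scaling step, distances between rows would be dominated by `Revenue`, because its values are thousands of times larger than `Headcount`.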

Train Model step settings:

| Setting | Description |
| --- | --- |
| Input Column | The (vector) column to cluster on, typically the output of a Concatenate transform |
| Output Column Name | Column name for cluster assignments (default: ClusterId) |
| Distance Column Name | Column name for distance from centroid (default: Distance) |
| Number of Clusters | The number of clusters k (default: 3, min: 2) |

Row Filter (optional)

Filter the input data before clustering.

Output Schema

The output staging table contains all non-vector columns from the input, plus:

| Column | Description |
| --- | --- |
| ClusterId (or configured name) | Cluster assignment (integer, 1 to k) |
| Distance (or configured name) | Distance from the row to its cluster centroid |

Clustering quality metrics (average distance, Davies-Bouldin Index) are logged to the workflow run log.
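
For readers who want to see what these two metrics measure, the sketch below computes them from their textbook definitions. The exact formulas behind the logged values may differ in detail, and the function names and sample points are assumptions for illustration.

```python
import math


def centroid(points):
    """Mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def average_distance(clusters):
    """Mean distance from every point to the centroid of its own cluster."""
    total = sum(math.dist(p, centroid(pts)) for pts in clusters for p in pts)
    return total / sum(len(pts) for pts in clusters)


def davies_bouldin(clusters):
    """Textbook Davies-Bouldin Index for a list of at least two clusters."""
    centers = [centroid(pts) for pts in clusters]
    # Scatter: average distance of a cluster's points to its centroid.
    scatter = [
        sum(math.dist(p, c) for p in pts) / len(pts)
        for pts, c in zip(clusters, centers)
    ]
    k = len(clusters)
    # For each cluster, take its worst (largest) similarity to any other cluster.
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k


# Hypothetical clusterings: same cluster shapes, different separation.
well_separated = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
close_together = [[(0.0, 0.0), (0.0, 2.0)], [(3.0, 0.0), (3.0, 2.0)]]
```

The index divides within-cluster scatter by the distance between centroids, so the well-separated clustering scores lower (better) than the close one even though both have the same average distance.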

Usage Notes

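  • The number of clusters k must be chosen before the run; the activity does not determine an optimal k automatically. Run the activity with several values of k and compare the logged average distance and Davies-Bouldin Index. Average distance generally decreases as k grows, so look for the point where the improvement levels off rather than for the minimum.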
  • K-Means is sensitive to feature scale. Always Normalize before clustering.
  • K-Means uses random initialisation, so results may vary slightly between runs. Repeating a run does not make results reproducible, but running several times and keeping the result with the best metrics reduces the effect of an unlucky initialisation.
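
The idea of running several times and keeping the best result can be sketched as follows. This is a plain-Python illustration, not the activity's behaviour. The `seed` argument exists only in this sketch, and whether the activity exposes a comparable setting is not confirmed here. All function names and the sample rows are assumptions.

```python
import math
import random


def kmeans(rows, k, seed, max_iter=100):
    """Compact illustrative K-Means; returns (labels, centroids), labels 0-based."""
    rng = random.Random(seed)
    centroids = rng.sample(rows, k)  # random initialisation, fixed by the seed
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(r, centroids[c])) for r in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


def average_distance(rows, labels, centroids):
    """Mean distance from each row to its assigned centroid."""
    return sum(math.dist(r, centroids[lab]) for r, lab in zip(rows, labels)) / len(rows)


def best_of(rows, k, seeds):
    """Run once per seed and keep the run with the lowest average distance."""
    runs = [(seed, *kmeans(rows, k, seed)) for seed in seeds]
    return min(runs, key=lambda run: average_distance(rows, run[1], run[2]))


# Hypothetical sample data.
rows = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
best_seed, best_labels, best_centroids = best_of(rows, k=2, seeds=range(5))
```

In this sketch, calling `kmeans` twice with the same seed returns identical results, which is the property a seed setting would provide. Picking the best of several runs improves quality but is not itself a source of reproducibility.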

Best Practices

  • Start with 3–5 clusters and adjust based on the Davies-Bouldin Index (lower is better) logged after each run.
  • Always use Concatenate + Normalize in the transforms before the Train Model step.
  • Inspect the cluster contents after the first run to validate that the groupings are meaningful. Adjust feature selection or k if the clusters are not interpretable.
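
One way to inspect cluster contents is to summarise the output rows per cluster. The sketch below assumes the output staging table has been exported to a list of dictionaries; the helper name, the `Customer`, `Revenue`, and `Orders` columns, and the sample values are hypothetical.

```python
from collections import defaultdict


def summarize_clusters(rows, feature_columns, cluster_column="ClusterId"):
    """Return the row count and per-feature mean for each cluster ID."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cluster_column]].append(row)
    summary = {}
    for cluster_id, members in sorted(groups.items()):
        summary[cluster_id] = {"count": len(members)}
        for col in feature_columns:
            summary[cluster_id][col] = sum(m[col] for m in members) / len(members)
    return summary


# Hypothetical rows as they might look after export from the output table.
output_rows = [
    {"Customer": "A", "Revenue": 100.0, "Orders": 4, "ClusterId": 1},
    {"Customer": "B", "Revenue": 140.0, "Orders": 6, "ClusterId": 1},
    {"Customer": "C", "Revenue": 900.0, "Orders": 30, "ClusterId": 2},
]
summary = summarize_clusters(output_rows, ["Revenue", "Orders"])
```

If the per-cluster means barely differ, or one cluster holds nearly all rows, that is a signal to revisit the feature selection or the value of k.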

Needs Review

K-Means uses random initialisation. Confirm whether there is a seed parameter to make results deterministic, and document it if available.

JSON Reference

{
  "discriminator": "KMeansClusteringWorkflowActivity",
  "activityId": "<uuid>",
  "name": "AI Clustering",
  "positionX": 0,
  "positionY": 0,
  "advanceRule": 2,
  "inputStagingTable": "StagingInput",
  "outputStagingTable": "StagingClusters",
  "fsoPath": "",
  "mode": 0,
  "transforms": [],
  "filter": null,
  "inputColumnName": "Features",
  "outputColumnName": "ClusterId",
  "distanceColumnName": "Distance",
  "clusterCount": 3
}

| Property | Type | Description |
| --- | --- | --- |
| inputStagingTable | string | Corresponds to the Input Staging Table field. The staging table containing the data to cluster. |
| outputStagingTable | string | Corresponds to the Output Staging Table field. The staging table where cluster assignments are written. |
| fsoPath | string | File system path. Not used by this activity but present as an inherited field. |
| mode | integer | 0 = TrainModel, 1 = RunModel. Not applicable: this activity trains and applies in a single pass. |
| transforms | array | Corresponds to the Transforms editor. Array of transform objects for pre-processing (typically Concatenate + Normalize). |
| filter | object \| null | Corresponds to the Row Filter field. An optional filter applied to input rows before clustering. null means no filter. |
| inputColumnName | string | Corresponds to the Input Column setting. The vector column to cluster on, typically the output of a Concatenate transform. |
| outputColumnName | string | Corresponds to the Output Column Name setting. Column name for cluster assignment IDs. Default: "ClusterId". |
| distanceColumnName | string | Corresponds to the Distance Column Name setting. Column name for distance from centroid values. Default: "Distance". |
| clusterCount | integer | Corresponds to the Number of Clusters setting. The number of clusters k. Default: 3. |