AI Clustering

The AI Clustering activity groups rows in a staging table into clusters using the K-Means algorithm and writes cluster assignments and distances to an output staging table.

Purpose

Use the AI Clustering activity to:

Segment customers, products, or cost centres into groups based on numeric attributes
Identify natural groupings in financial or operational data
Produce cluster labels for downstream analysis or reporting

Algorithm

The activity uses ML.NET K-Means clustering:

Rows are assigned to the nearest of k cluster centroids.
Centroids are iteratively updated until convergence.
Each row is assigned a cluster ID (1 to k) and a distance from its centroid.
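
The assign-and-update loop described above can be sketched in plain Python. This is an illustration of the general K-Means technique (Lloyd's algorithm), not the ML.NET implementation the activity uses. The function name `kmeans`, the `seed` and `max_iter` parameters, and the choice of Euclidean distance are assumptions made for the example.

```python
import math
import random


def kmeans(rows, k, max_iter=100, seed=0):
    """Illustrative Lloyd's K-Means over lists of floats.

    Returns (cluster_ids, distances, centroids) with cluster IDs from 1 to k.
    `seed` belongs to this sketch only; it is not a documented activity setting.
    """
    if k < 2 or k > len(rows):
        raise ValueError("k must be between 2 and the number of rows")
    rng = random.Random(seed)
    # Random initialisation: pick k distinct rows as the starting centroids.
    centroids = [list(row) for row in rng.sample(rows, k)]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(row, centroids[c]))
            for row in rows
        ]
        if new_assignments == assignments:
            break  # converged: no row changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, a in zip(rows, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    cluster_ids = [a + 1 for a in assignments]  # 1-based, as in the output table
    distances = [math.dist(row, centroids[a]) for row, a in zip(rows, assignments)]
    return cluster_ids, distances, centroids
```

Calling `kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)` places the first two rows in one cluster and the last two in the other; which group receives ID 1 depends on the random initialisation.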

Configuration

Input Staging Table

The staging table containing the data to cluster. All feature columns must be numeric.

Output Staging Table

The staging table where cluster assignments are written.

Transforms

The pre-processing pipeline applied to the input rows before clustering. A typical pipeline has three steps:

Concatenate — combine the numeric feature columns into a single feature vector
Normalize — scale the features before clustering (recommended)
Train Model — the K-Means step, where you configure the feature, output, and distance columns, and the number of clusters

Train Model step settings:

Setting	Description
Input Column	The (vector) column to cluster on — typically the output of a Concatenate transform
Output Column Name	Column name for cluster assignments (default: `ClusterId`)
Distance Column Name	Column name for distance from centroid (default: `Distance`)
Number of Clusters	The number of clusters `k` (default: 3, min: 2)
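
To show what the Concatenate and Normalize steps do to the data before the Train Model step, here is a minimal sketch in plain Python. The helper names `concatenate` and `min_max_normalize` and the column names `Revenue` and `Orders` are hypothetical. The source does not say which scaling method the Normalize transform applies, so min-max scaling is assumed here.

```python
def concatenate(rows, feature_columns):
    """Combine the named numeric columns of each row into one feature vector."""
    return [[float(row[name]) for name in feature_columns] for row in rows]


def min_max_normalize(vectors):
    """Scale every feature to the range 0..1 (assumed scaling method)."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for value, lo, hi in zip(vec, lows, highs)
        ]
        for vec in vectors
    ]


# Hypothetical staging rows with two numeric feature columns.
rows = [
    {"Revenue": 100, "Orders": 1},
    {"Revenue": 300, "Orders": 3},
    {"Revenue": 200, "Orders": 2},
]
features = min_max_normalize(concatenate(rows, ["Revenue", "Orders"]))
# features == [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
```

Without scaling, `Revenue` would dominate the distance calculation simply because its values are larger, which is why normalising before clustering is recommended.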

Row Filter (optional)

Filter the input data before clustering.

Output Schema

The output staging table contains all non-vector columns from the input, plus:

Column	Description
`ClusterId` (or configured name)	Cluster assignment (integer, 1 to k)
`Distance` (or configured name)	Distance from the row to its cluster centroid

Clustering quality metrics (average distance, Davies-Bouldin Index) are logged to the workflow run log.
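
For readers who want to see how the Davies-Bouldin Index is defined, the sketch below computes the standard form of the metric from feature vectors and cluster IDs. It assumes Euclidean distance and is not taken from the activity's implementation, so logged values may differ in detail. The function name `davies_bouldin_index` is hypothetical.

```python
import math


def davies_bouldin_index(vectors, cluster_ids):
    """Standard Davies-Bouldin Index; lower values mean better-separated clusters."""
    clusters = {}
    for vec, cid in zip(vectors, cluster_ids):
        clusters.setdefault(cid, []).append(vec)
    if len(clusters) < 2:
        raise ValueError("need at least two non-empty clusters")

    # Centroid of each cluster: the mean of its member vectors.
    centroids = {
        cid: [sum(col) / len(members) for col in zip(*members)]
        for cid, members in clusters.items()
    }
    # Scatter: average distance of a cluster's members to its centroid.
    scatter = {
        cid: sum(math.dist(vec, centroids[cid]) for vec in members) / len(members)
        for cid, members in clusters.items()
    }
    # For each cluster, take the worst (largest) similarity ratio to any other
    # cluster, then average over clusters. Coinciding centroids would divide
    # by zero; this sketch does not guard against that case.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)
```

On four points forming two tight pairs, a grouping that keeps each pair together scores far lower than one that splits the pairs, which matches the "lower is better" reading of the metric.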

Usage Notes

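The number of clusters must be chosen before running the algorithm; the activity does not determine the optimal k automatically. Run it with several values of k and compare the logged average distance and Davies-Bouldin Index. A lower Davies-Bouldin Index indicates better-separated clusters. Average distance tends to fall as k increases, so look for the point where the improvement levels off rather than for the lowest value.
K-Means is sensitive to feature scale. Always Normalize before clustering.
K-Means uses random initialisation, so results may vary slightly between runs. Running the activity several times and keeping the run with the best metrics reduces the effect of an unlucky initialisation, but it does not make the results reproducible.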

Best Practices

Start with 3–5 clusters and adjust based on the Davies-Bouldin Index (lower is better) logged after each run.
Always use Concatenate + Normalize in the transforms before the Train Model step.
Inspect the cluster contents after the first run to validate that the groupings are meaningful. Adjust feature selection or k if the clusters are not interpretable.

Needs Review

K-Means uses random initialisation. Confirm whether there is a seed parameter to make results deterministic, and document it if available.

JSON Reference

{
  "discriminator": "KMeansClusteringWorkflowActivity",
  "activityId": "<uuid>",
  "name": "AI Clustering",
  "positionX": 0,
  "positionY": 0,
  "advanceRule": 2,
  "inputStagingTable": "StagingInput",
  "outputStagingTable": "StagingClusters",
  "fsoPath": "",
  "mode": 0,
  "transforms": [],
  "filter": null,
  "inputColumnName": "Features",
  "outputColumnName": "ClusterId",
  "distanceColumnName": "Distance",
  "clusterCount": 3
}

Property	Type	Description
`inputStagingTable`	string	Corresponds to the Input Staging Table field. The staging table containing the data to cluster.
`outputStagingTable`	string	Corresponds to the Output Staging Table field. The staging table where cluster assignments are written.
`fsoPath`	string	File system path. Not used by this activity but present as an inherited field.
`mode`	integer	Inherited field (`0` = TrainModel, `1` = RunModel). Not applicable to this activity, which trains and applies the model in a single pass.
`transforms`	array	Corresponds to the Transforms editor. Array of transform objects for pre-processing (typically Concatenate + Normalize).
`filter`	object \| null	Corresponds to the Row Filter field. An optional filter applied to input rows before clustering. `null` means no filter.
`inputColumnName`	string	Corresponds to the Input Column setting. The vector column to cluster on — typically the output of a Concatenate transform.
`outputColumnName`	string	Corresponds to the Output Column Name setting. Column name for cluster assignment IDs. Default: `"ClusterId"`.
`distanceColumnName`	string	Corresponds to the Distance Column Name setting. Column name for distance from centroid values. Default: `"Distance"`.
`clusterCount`	integer	Corresponds to the Number of Clusters setting. The number of clusters `k`. Default: 3. Minimum: 2.
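
If activity definitions are generated by a script, a small builder can apply the documented defaults and the minimum cluster count before the JSON is written. The sketch below is one way to do that; the function `build_clustering_activity` is hypothetical and not part of the product. Values for `positionX`, `positionY`, and `advanceRule` are copied unchanged from the sample above, because this page does not document their meaning.

```python
import json
import uuid


def build_clustering_activity(
    input_table,
    output_table,
    cluster_count=3,
    input_column="Features",
    output_column="ClusterId",
    distance_column="Distance",
    transforms=None,
    row_filter=None,
):
    """Return a dict shaped like the JSON sample for this activity (hypothetical helper)."""
    if cluster_count < 2:
        raise ValueError("clusterCount must be at least 2")
    return {
        "discriminator": "KMeansClusteringWorkflowActivity",
        "activityId": str(uuid.uuid4()),
        "name": "AI Clustering",
        "positionX": 0,    # copied from the sample
        "positionY": 0,    # copied from the sample
        "advanceRule": 2,  # copied from the sample; meaning not documented here
        "inputStagingTable": input_table,
        "outputStagingTable": output_table,
        "fsoPath": "",     # inherited field, not used by this activity
        "mode": 0,         # inherited field, not applicable to this activity
        "transforms": transforms if transforms is not None else [],
        "filter": row_filter,  # None serialises to JSON null (no filter)
        "inputColumnName": input_column,
        "outputColumnName": output_column,
        "distanceColumnName": distance_column,
        "clusterCount": cluster_count,
    }


activity = build_clustering_activity("StagingInput", "StagingClusters", cluster_count=4)
activity_json = json.dumps(activity, indent=2)
```

The shape of the objects inside `transforms` is not shown in the sample, so the sketch leaves that array for the caller to supply.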
