| Title: | Geometric Analysis of Configurations in High-Dimensional Spaces |
|---|---|
| Description: | Tools for analysing the geometry of configurations in high-dimensional spaces using the Average Membership Degree (AMD) framework and synthetic configuration generation. The package supports a domain-agnostic approach to studying the shape, dispersion, and internal structure of point clouds, with applications across biological and ecological datasets, including those derived from deep-time records. The AMD framework builds on the idea that strongly coupled systems may occupy a limited set of recurrent regimes in state space, producing high-occupancy regions separated by sparsely populated transitional configurations. The package focuses on detecting these concentration patterns and quantifying their geometric definition without assuming any underlying dynamical model. It provides AMD curve computation, cluster assignment, and sigma-equivalent estimation, together with S3 methods for plotting, printing, and summarising AMD and sigma-equivalent objects. Mendoza (2025) <https://mmendoza1967.github.io/AMDconfigurations/>. |
| Authors: | Manuel Mendoza [aut, cre] (ORCID: <https://orcid.org/0000-0002-2143-8138>) |
| Maintainer: | Manuel Mendoza <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-18 05:53:06 UTC |
| Source: | https://github.com/mmendoza1967/amdconfigurations |
This function runs fuzzy c-means clustering (e1071::cmeans) repeatedly
with different random seeds and selects the partition that maximises an
AMD-like objective.
    assign_clusters_best(
      data,
      opt_cluster,
      nreps = 10,
      m = 2,
      iter.max = 20,
      scale_data = FALSE,
      seeds = NULL,
      preselect_top_sd = NULL
    )
| Argument | Description |
|---|---|
| data | A numeric matrix or data frame of samples × features. |
| opt_cluster | Integer; number of clusters to fit. |
| nreps | Number of repeated initialisations. |
| m | Fuzziness parameter for fuzzy c-means (default 2). |
| iter.max | Maximum number of iterations for fuzzy c-means. |
| scale_data | Logical; if |
| seeds | Optional numeric vector of seeds for deterministic behaviour. Must have length |
| preselect_top_sd | Optional integer; if provided, only the top-SD features are retained before clustering (useful for very high-dimensional data). |
The objective is computed from the membership vector of each sample.
The best partition is returned, with cluster labels aligned to the original
row order of the input data (rows with missing values receive NA).
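The row alignment and the top-SD preselection described above can be sketched in base R. This is only an illustration under stated assumptions: `cluster_with_alignment()` is a hypothetical helper, not a package function, and `stats::kmeans` stands in for fuzzy c-means so that the sketch needs no extra packages.

```r
# Hypothetical helper: illustrates row alignment and top-SD preselection only.
# stats::kmeans is a stand-in for fuzzy c-means (assumption, not the package code).
cluster_with_alignment <- function(data, k, preselect_top_sd = NULL) {
  x <- as.matrix(data)
  complete <- stats::complete.cases(x)        # rows without missing values
  x_ok <- x[complete, , drop = FALSE]

  if (!is.null(preselect_top_sd)) {
    # keep the features with the largest standard deviation
    sds <- apply(x_ok, 2, stats::sd)
    n_keep <- min(preselect_top_sd, ncol(x_ok))
    keep <- order(sds, decreasing = TRUE)[seq_len(n_keep)]
    x_ok <- x_ok[, keep, drop = FALSE]
  }

  fit <- stats::kmeans(x_ok, centers = k, nstart = 5)

  labels <- rep(NA_integer_, nrow(x))         # incomplete rows stay NA
  labels[complete] <- fit$cluster
  labels
}

set.seed(1)
X <- matrix(rnorm(200), nrow = 40, ncol = 5)
X[c(3, 17), 2] <- NA                          # two rows with missing values
labels <- cluster_with_alignment(X, k = 3, preselect_top_sd = 3)
```

The returned vector has one entry per input row, so it can be bound back onto the original data without re-indexing.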
A list with components:
Integer vector of cluster labels aligned to the original data.
Rows with missing values receive NA.
Membership matrix from the best fuzzy c-means run.
Cluster centroids from the best run.
Best AMD-like objective value.
    ## Not run:
    set.seed(1)
    X <- matrix(rnorm(1000), nrow = 100, ncol = 10)
    out <- assign_clusters_best(X, opt_cluster = 3, nreps = 20)
    table(out$cluster)
    ## End(Not run)
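The repeated-initialisation idea behind this function can be sketched without the package. Everything below is an assumption made for illustration: `fuzzy_membership()`, `score_partition()` and `best_of_restarts()` are hypothetical names, `stats::kmeans` stands in for `e1071::cmeans`, and the score (mean of each sample's largest membership) is a placeholder rather than the package's AMD-like objective.

```r
# Fuzzy memberships for fixed centres, using the standard fuzzy c-means
# membership update with fuzziness parameter m.
fuzzy_membership <- function(x, centers, m = 2) {
  # squared Euclidean distance from every sample to every centre
  d2 <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums(sweep(x, 2, centers[j, ])^2)
  })
  d2 <- pmax(d2, .Machine$double.eps)         # guard against division by zero
  w <- d2^(-1 / (m - 1))
  w / rowSums(w)                              # each row sums to 1
}

# Placeholder score (assumption): mean of each sample's largest membership.
score_partition <- function(u) mean(apply(u, 1, max))

# Run one clustering per seed and keep the highest-scoring partition.
best_of_restarts <- function(x, k, seeds, m = 2) {
  best <- NULL
  for (s in seeds) {
    set.seed(s)
    km <- stats::kmeans(x, centers = k)       # stand-in for e1071::cmeans
    u <- fuzzy_membership(x, km$centers, m)
    score <- score_partition(u)
    if (is.null(best) || score > best$score) {
      best <- list(seed = s, score = score, membership = u,
                   centers = km$centers,
                   cluster = max.col(u, ties.method = "first"))
    }
  }
  best
}

set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))
best <- best_of_restarts(X, k = 2, seeds = 1:5)
```

Passing an explicit seed vector makes the selected partition reproducible, which is the role the `seeds` argument plays in the usage above.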
This function computes the Average Membership Degree (AMD) curve for
fuzzy c-means clustering across a sequence of cluster numbers k.
For each k, multiple random initialisations are performed and an
AMD value is computed for each run.
    compute_amd_curve(
      data,
      its,
      nin,
      nsp,
      seeds = NULL,
      verbose = TRUE,
      plot_curve = FALSE,
      open_device = TRUE,
      scale_data = FALSE,
      iter_max = 100,
      m = 2,
      preselect_top_sd = NULL
    )
| Argument | Description |
|---|---|
| data | A numeric matrix or data frame of samples (rows) × features (columns). |
| its | Number of random initialisations per value of |
| nin | Minimum number of clusters to evaluate. |
| nsp | Maximum number of clusters to evaluate. |
| seeds | Optional numeric vector of seeds for deterministic behaviour. Must have length |
| verbose | Logical; print progress messages. |
| plot_curve | Logical; if |
| open_device | Logical; if |
| scale_data | Logical; if |
| iter_max | Maximum number of iterations for fuzzy c-means. |
| m | Fuzziness parameter for fuzzy c-means (default 2). |
| preselect_top_sd | Optional integer; if provided, only the top-SD features are retained before clustering (useful for very high-dimensional data). |
AMD values are computed from the membership vector of each sample.
The optimal number of clusters is selected as the k that maximises
the AMD peak across repetitions.
A list with components:
The optimal number of clusters (maximising AMD peak).
Vector of AMD peak values for each k.
Vector of mean AMD values across repetitions.
Matrix of AMD values (rows = repetitions, columns = k).
    ## Not run:
    set.seed(1)
    X <- matrix(rnorm(2000), nrow = 100, ncol = 20)
    res <- compute_amd_curve(X, its = 10, nin = 2, nsp = 6)
    res$k_opt
    ## End(Not run)
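How the optimal k follows from a repetitions × k matrix of AMD values can be sketched in base R. The numbers are invented for illustration, and the sketch assumes that the "AMD peak" for a given k is the maximum over repetitions; the package may define it differently.

```r
# Illustrative values only: 3 repetitions (rows) for k = 2..5 (columns).
k_values <- 2:5
amd <- matrix(c(0.61, 0.72, 0.66, 0.58,
                0.63, 0.70, 0.69, 0.55,
                0.60, 0.74, 0.64, 0.57),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, paste0("k", k_values)))

amd_peak <- apply(amd, 2, max)   # assumed: peak = best repetition for each k
amd_mean <- colMeans(amd)        # mean AMD across repetitions for each k
k_opt <- k_values[which.max(amd_peak)]
k_opt                            # 3
```

Keeping the full matrix alongside the peak and mean vectors makes it possible to judge how sensitive the chosen k is to the random initialisation.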
This function generates synthetic datasets composed of n_clusters
Gaussian clusters in n_dim-dimensional space. Cluster centroids are
placed uniformly inside a hypercube of side cube_size, and samples
are drawn with isotropic Gaussian noise of standard deviation std_dev.
    create_synthetic_samples(
      n_samples,
      n_clusters,
      std_dev,
      n_dim,
      cube_size = 100,
      standardize = FALSE,
      center = TRUE,
      scale. = TRUE
    )
| Argument | Description |
|---|---|
| n_samples | Total number of samples to generate. |
| n_clusters | Number of clusters to simulate. |
| std_dev | Standard deviation of the Gaussian noise around each centroid. |
| n_dim | Number of dimensions (features). |
| cube_size | Side length of the hypercube where centroids are placed. |
| standardize | Logical; if |
| center, scale. | Logical arguments passed to |
The function is used internally to calibrate the compactness of real data by matching its AMD peak against synthetic datasets with varying noise levels.
A data frame of size n_samples × n_dim containing the
synthetic samples.
    ## Not run:
    set.seed(1)
    syn <- create_synthetic_samples(
      n_samples = 200,
      n_clusters = 4,
      std_dev = 5,
      n_dim = 10
    )
    head(syn)
    ## End(Not run)
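The generative scheme described for this function (centroids uniform in a hypercube, isotropic Gaussian noise around them) can be sketched in base R. `simulate_gaussian_clusters()` is a hypothetical re-implementation, not the package source. It assumes the hypercube spans [0, cube_size] on every axis and that each sample is assigned to a cluster uniformly at random; the package may allocate samples differently.

```r
# Hypothetical re-implementation for illustration; not the package source.
simulate_gaussian_clusters <- function(n_samples, n_clusters, std_dev, n_dim,
                                       cube_size = 100) {
  # centroids drawn uniformly in [0, cube_size]^n_dim (assumed cube position)
  centroids <- matrix(stats::runif(n_clusters * n_dim, 0, cube_size),
                      nrow = n_clusters, ncol = n_dim)
  # assumed: each sample picks its cluster uniformly at random
  cluster_id <- sample.int(n_clusters, n_samples, replace = TRUE)
  # isotropic Gaussian noise with the same sd in every dimension
  noise <- matrix(stats::rnorm(n_samples * n_dim, sd = std_dev),
                  nrow = n_samples, ncol = n_dim)
  as.data.frame(centroids[cluster_id, , drop = FALSE] + noise)
}

set.seed(1)
syn <- simulate_gaussian_clusters(200, n_clusters = 4, std_dev = 5, n_dim = 10)
dim(syn)   # 200 rows, 10 columns
```

With `cube_size` fixed, raising `std_dev` makes the clusters overlap more, which is the knob the sigma-equivalent calibration turns.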
This function calibrates the observed AMD peak of real data against synthetic
datasets generated with varying levels of isotropic Gaussian noise (σ).
For each candidate σ, synthetic data are generated with the same
number of samples, dimensionality, and number of clusters as the real data.
The AMD peak of each synthetic dataset is computed, and the sigma-equivalent
value is defined as the σ whose synthetic AMD peak best matches the
real AMD peak (either by interpolation or nearest match).
    estimate_sigma_equivalent(
      real_data,
      its,
      nin,
      nsp,
      k_opt = NULL,
      sigmas,
      iter_max = 20,
      make_plot = FALSE,
      return_plot = TRUE,
      quiet = TRUE,
      open_device_each = FALSE,
      device_width = 7,
      device_height = 5,
      cube_size = 100,
      method = c("interpolate", "nearest"),
      standardize = FALSE,
      seed_base = 7,
      plot_sigma_curves = FALSE
    )
| Argument | Description |
|---|---|
| real_data | A numeric matrix or data frame of samples × features. |
| its | Number of random initialisations per AMD computation. |
| nin | Minimum number of clusters to evaluate. |
| nsp | Maximum number of clusters to evaluate. |
| k_opt | Optional; the optimal number of clusters for the real data. If |
| sigmas | Numeric vector of candidate |
| iter_max | Maximum number of iterations for fuzzy c-means. |
| make_plot | Logical; if |
| return_plot | Logical; if |
| quiet | Logical; suppress console output from synthetic data generation. |
| open_device_each | Logical; if |
| device_width, device_height | Size of graphics device for sigma-curve plots. |
| cube_size | Side length of the hypercube used to place synthetic centroids. |
| method | Method for estimating sigma-equivalent: |
| standardize | Logical; if |
| seed_base | Base seed for reproducibility. |
| plot_sigma_curves | Logical; if |
A list containing:
AMD peak of the real dataset.
Optimal number of clusters for the real data.
Data frame of σ vs synthetic AMD peaks.
Interpolated sigma-equivalent value.
Nearest-match sigma on the explored grid.
Logical; whether interpolation required extrapolation.
Comparative plot object (if requested).
Index of best-matching sigma.
Best-matching sigma value.
Full AMD results for the best synthetic dataset.
Data frame of the AMD curve for the best sigma.
    ## Not run:
    set.seed(1)
    X <- matrix(rnorm(1000), nrow = 100, ncol = 10)
    out <- estimate_sigma_equivalent(
      real_data = X,
      its = 5,
      nin = 2,
      nsp = 6,
      sigmas = seq(1, 10, by = 2)
    )
    out$sigma_equivalent
    ## End(Not run)
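The two matching strategies named in `method` can be sketched in base R once a grid of synthetic AMD peaks is available. The grid and the real peak below are invented for illustration, and linear interpolation via `stats::approx` is an assumption; the package may interpolate differently.

```r
# Illustrative grid only: synthetic AMD peaks assumed to fall as sigma grows.
sigmas <- c(1, 3, 5, 7, 9)
synthetic_peak <- c(0.95, 0.85, 0.70, 0.58, 0.50)
real_peak <- 0.80

# Nearest match: the grid sigma whose synthetic peak is closest to the real one.
sigma_nearest <- sigmas[which.min(abs(synthetic_peak - real_peak))]

# Linear interpolation of sigma as a function of the AMD peak.
# approx() returns NA when real_peak lies outside the synthetic range,
# which is used here to flag that extrapolation would be needed.
sigma_interp <- stats::approx(x = synthetic_peak, y = sigmas,
                              xout = real_peak)$y
needs_extrapolation <- is.na(sigma_interp)
```

Interpolating on the peak axis only makes sense when the synthetic peaks change monotonically with σ; on a noisy grid the nearest match is the safer fallback.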