SUPERVISED NONPARAMETRIC CLASSIFICATION IN THE CONTEXT OF REPLICATED POINT PATTERNS

A spatial point pattern is a collection of points in space, representing, e.g., observed locations of trees, bird nests, centers of cells in a histological sample, etc. When several independent realizations of the underlying stochastic process are observed, these realizations are referred to as replicated point patterns. The main objective of this paper is to classify a newly observed pattern into one of the existing classes using a supervised nonparametric classification method, namely the Bayes classifier in combination with the k-nearest neighbors algorithm and the kernel regression method. The dissimilarity between a pair of patterns is defined using functional summaries extracted from the point patterns via a Cramér-von Mises or Kolmogorov-Smirnov type formula. A set of simulation experiments is presented to investigate the performance of the proposed classifier with a dissimilarity measure based on functional summaries, such as the pair correlation function. The application of such a classifier to a real point pattern dataset is also illustrated.


INTRODUCTION
Spatial point processes are mathematical models that describe the arrangement of objects randomly placed in space. Such models are of particular interest in many scientific disciplines, including biology, ecology, statistical physics, or materials science (Illian et al., 2004, Sect. 1). We distinguish between the theoretical model, called a point process, and its realization, a deterministic configuration of points, called a point pattern. In practice, point patterns are observed in a bounded observation window. Three different point patterns can be seen in Fig. 1. Individual points represent locations of the centres of intramembranous particles of the mitochondrial membranes of the human HeLa cell line. These patterns were observed during the analysis of the HeLa cell line via the freeze-fracture technique (Schladitz et al., 2003). It has become a standard approach to use functional summary statistics instead of univariate ones in all steps of statistical analysis of point patterns, from exploratory analysis through model fitting to hypothesis testing. Supervised classification is one of the fundamental problems in statistics and machine learning. Early work on classification and statistical learning in general dates back to Fisher and the linear discriminant rule (Fisher, 1936; 1938). A collection of labelled observations, called a training set or training data, is available in supervised classification. The label indicates the affiliation of the given observation to one of the G possible classes. Based on the training data, the task is to assign a label to a new observation.
In the point pattern literature, the term classification usually refers to the procedure of labelling individual points within a single pattern generated by a superposition of several point processes (Dasgupta and Raftery, 1998; Redenbach et al., 2015; Walsh and Raftery, 2005). This corresponds to the typical setting of spatial statistics, where a single point pattern, obtained by some physical measurement, is analyzed. This paper focuses on a different context: replicated point patterns. This means that the observed dataset consists of a collection of point patterns that need to be analyzed simultaneously rather than individually.
For replicated point patterns, supervised classification has been studied to a limited extent. In (Cholaquidis et al., 2017), the patterns generated by inhomogeneous Poisson point processes with different intensity functions were classified. The task of classifying replicated point patterns is transformed in (Mateu et al., 2015), with the help of multidimensional scaling, to a classification task in R^2 and then solved with the help of Fisher's linear discriminant analysis. Parametric supervised classification is reviewed in (Vo et al., 2018); this approach is also called model-based learning. Unsupervised classification is explored in (Ayala et al., 2006). Note that we focus here only on spatial point patterns, and thus point patterns on the real line are disregarded. For a discussion about the one-dimensional setting, see e.g. (Tranbarger Freier and Schoenberg, 2010; Victor and Purpura, 1997).
In this paper, we propose a general classification method that can be used for both stationary and nonstationary point processes in R^p and can be further generalized to more complicated settings (point patterns in non-Euclidean spaces, random sets, random tessellations, etc.). For ease of exposition, we present the methods only for stationary point processes in R^2. We use the Bayes classifier in combination with the k-nearest neighbors algorithm and the kernel regression method. We need a mapping measuring dissimilarities between two point patterns to construct such classifiers. A summary of such mappings is given in (Mateu et al., 2015; Alba-Fernández et al., 2016).
We pay special attention to dissimilarity measures based on functional summaries, e.g. the pair correlation function, extracted from the point patterns. So instead of comparing the patterns directly, we compare the extracted features in the form of functional data, and we can use well-established methods from functional data analysis, described e.g. in (Ferraty and Vieu, 2006).
The complexity of the distribution of the considered point process models and their corresponding functional summary characteristics makes it highly challenging, if not impossible, to study the properties of the classifiers analytically. Thus, the behavior of the kernel regression classifier is explored through a simulation study. It is oriented towards the situation where the groups represent different parametric families of point process models (and hence a different nature of spatial interactions) or the same parametric family with varying values of the model parameter. This paper is organized as follows. We start with a brief description of the three point process models that will be used in the simulation experiments. Next, we present several choices for the dissimilarity measure. The main contribution of the paper lies in the analysis of the performance of the proposed classifier on simulated point pattern data. Before describing the experiments themselves, we give the background on the Bayes classifier and the kernel regression method.
In addition, we discuss some of the computational aspects of our simulations.The simulation experiments are complemented with an illustrative application to the HeLa cell line data.Finally, we close the paper with some concluding remarks.

SPATIAL POINT PROCESSES
This section briefly describes the point process models used in the sequel. Related necessary definitions are given. For the foundations of point process theory, see, e.g., (Daley and Vere-Jones, 2008). A comprehensive discussion about summary characteristics and feature extraction for point processes can be found in (Møller and Waagepetersen, 2004).
Throughout this paper, we restrict ourselves to point processes (random locally finite sets) in the plane. However, all the definitions and statements below can be easily reformulated for a general dimension p. A point process X is said to be stationary if its distribution is invariant under translations. Moreover, X is said to be isotropic if its distribution is invariant under rotations around the origin. We suppose that the intensity function of X exists, that is, the expected number of points of X in a Borel set B ⊆ R^2 can be written as ∫_B λ(y) dy, where λ (the intensity function) is nonnegative and measurable. If X is stationary, then λ(y) = λ > 0 for all y, and the constant λ is called the intensity. The pair correlation function g is defined by

g(x, y) = λ^(2)(x, y) / (λ(x) λ(y)), x, y ∈ R^2.

In case λ(x) = 0 or λ(y) = 0, we set g(x, y) = 0. The function λ^(2)(•, •) is the Radon-Nikodym derivative (with respect to the four-dimensional Lebesgue measure) of the second-order factorial moment measure (Møller and Waagepetersen, 2004). Loosely speaking, λ^(2)(x, y) indicates how likely it is to observe two points of the process in infinitesimally small neighbourhoods of x and y, respectively. With a slight abuse of notation, we write g(x, y) = g(x − y) whenever g is translation invariant. If g is also invariant under rotations around the origin, we write g(x, y) = g(∥x − y∥), x, y ∈ R^2. From now on, we suppose that g is defined for X and is, moreover, motion invariant. These assumptions imply that g is a function of one argument r that represents the Euclidean distance between two points in the process.
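To make the role of g concrete, the following sketch estimates g(r) from a pattern by counting pairs of points at distance roughly r. This is a naive ring-count estimator, without the edge correction and kernel smoothing that production estimators (e.g. in spatstat) use; the function names and the numpy implementation are our own illustration, not the paper's code.

```python
import numpy as np

def pcf_naive(points, window_area, r, dr):
    """Naive ring-count estimate of the pair correlation function g(r):
    count pairs at distance in [r - dr, r + dr), no edge correction."""
    n = len(points)
    lam = n / window_area                         # intensity estimate
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    dists = dists[np.triu_indices(n, k=1)]        # each unordered pair once
    pairs = np.sum((dists >= r - dr) & (dists < r + dr))
    ring_area = np.pi * ((r + dr) ** 2 - (r - dr) ** 2)
    # factor 2: ordered pairs, to match the second-order intensity
    return 2 * pairs / (lam ** 2 * window_area * ring_area)

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, size=(1000, 2))   # Poisson-like pattern in the unit square
g_hat = pcf_naive(pts, window_area=1.0, r=0.1, dr=0.02)
# for a pattern without interactions, g_hat should be close to 1
# (slightly below, since edge effects are ignored here)
```
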
The Poisson point process is a benchmark point process model. It is used to model situations with no spatial interactions among the points. Since λ^(2)(x, y) = λ(x)λ(y), x, y ∈ R^2 (Møller and Waagepetersen, 2004, Sect. 4.1), the pair correlation function g ≡ 1 is a constant function. The value of g for the Poisson point process can be used as a benchmark in the following way. Suppose we have the formula for g derived for another point process model. Then, values above 1 indicate aggregation of points, and values below 1 indicate repulsive interactions. In the sequel, the term Poisson point process always stands for a stationary Poisson point process with constant intensity λ > 0. The process will be denoted by Π(λ).
The Thomas process is one of the basic models for aggregation of points. It was introduced in (Thomas, 1949) in the context of ecological surveys. A triplet of parameters (if we impose stationarity) characterizes the model. First, we need to specify the intensity κ of the underlying stationary Poisson process that models the (unobserved) parent points. Then, we need to set the mean number µ of offspring points per parent. These points are the observed ones. Finally, the scale parameter σ of the bivariate Gaussian density that controls the spatial distribution of the offspring points around a parent must be specified. The resulting process is stationary, with intensity equal to the product κµ. For details, see (Møller and Waagepetersen, 2004, Sect. 5.3). The analytical formula for g is known:

g(r) = 1 + exp(−r² / (4σ²)) / (4πκσ²), r > 0. (1)

From equation (1), we see that g(r) > 1 for all r. In the sequel, the process will be denoted by Φ(κ, µ, σ).
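The generative description above translates directly into a simulation: parents from a Poisson process, a Poisson number of offspring per parent, Gaussian displacements. The following is a minimal numpy sketch (the paper's experiments use spatstat's simulator); the padding of the parent window is a simple approximate edge handling, and all names are our own.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_thomas(kappa, mu, sigma, window=1.0):
    """Simulate a Thomas process on [0, window]^2: Poisson parents with
    intensity kappa, Poisson(mu) offspring per parent, Gaussian(sigma)
    displacements. Parents are generated in a padded region so that
    offspring of parents outside the window can still land inside."""
    pad = 4 * sigma
    lo, hi = -pad, window + pad
    n_parents = rng.poisson(kappa * (hi - lo) ** 2)
    parents = rng.uniform(lo, hi, size=(n_parents, 2))
    offspring = [p + rng.normal(0.0, sigma, size=(rng.poisson(mu), 2))
                 for p in parents]
    pts = np.concatenate(offspring) if offspring else np.empty((0, 2))
    inside = np.all((pts >= 0.0) & (pts <= window), axis=1)
    return pts[inside]

pattern = simulate_thomas(kappa=25, mu=10, sigma=0.03)
# the expected number of points in the unit window is roughly kappa * mu = 250
```
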
The Gaussian determinantal point process is a member of the family of determinantal point processes (DPPs), which have been studied in mathematical physics, combinatorics, and random matrix theory for several decades. The general notion was introduced in 1975 in (Macchi, 1975). A detailed overview of the theory of DPPs is given in (Lavancier et al., 2015). A DPP models repulsive interactions. Roughly speaking, the process is defined by specifying the Radon-Nikodym derivatives of the factorial moment measures of all orders with the help of a Gaussian covariance function

C(x, y) = θ exp(−∥x − y∥² / α²), x, y ∈ R^2.

Here, θ > 0 and 0 < α ≤ α_max, where α_max is a known constant given by α_max = 1/√(πθ). Again, the formula for g is known:

g(r) = 1 − exp(−2r² / α²), r > 0. (2)
Equation (2) shows that the pair correlation function is always below the benchmark value for the Poisson point process. In the sequel, the process will be denoted by Ψ(θ, α). Sample realizations of the three models can be found in Fig. 2.

DISSIMILARITY MEASURES
In this section, we are looking for a map d that takes two point patterns and returns a number quantifying how dissimilar the two point patterns are. It has to meet the following conditions:

d(X, Z) ≥ 0, d(X, X) = 0, d(X, Z) = d(Z, X), d(X, Z) ≤ d(X, U) + d(U, Z),

where U, X and Z represent different point patterns. An overview of the dissimilarity measures for point patterns is given in (Mateu et al., 2015; Alba-Fernández et al., 2016).
Starting with the dissimilarity measures that are based on pattern matching, the Hausdorff distance is

d_H(X, Z) = max { max_{x ∈ X} min_{z ∈ Z} ∥x − z∥, max_{z ∈ Z} min_{x ∈ X} ∥x − z∥ }.

In particular, it takes the maximum of the maximal Euclidean distance from a point in X to its nearest neighbour in Z and vice versa. In this case, d_H is a metric, so d_H(X, Z) = 0 if and only if the two point patterns coincide. However, its use is reasonable only if the two investigated patterns are observed in the same observation window. To achieve a low value, d_H forces the two configurations to have points at very similar locations, and hence high values can be seen even for two realisations coming from the same model. Modifications of d_H have been proposed in (Hoffman and Mahler, 2004; Schuhmacher et al., 2008; Cholaquidis et al., 2017), but none of them is an actual remedy for this problem. In special cases, and assuming stationarity and isotropy, it may be relevant to consider d_H(X, Z⋆) instead of d_H(X, Z), where Z⋆ is the element of the set of all translations and rotations of Z such that d_H(X, Z⋆) is minimal.
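The two directed maxima are easy to read off a pairwise-distance matrix; a minimal numpy sketch with made-up point coordinates:

```python
import numpy as np

def hausdorff(X, Z):
    """Hausdorff distance between two finite planar point patterns,
    given as arrays of shape (n, 2) and (m, 2)."""
    D = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=-1)
    # directed distances: how far the worst-placed point of one pattern
    # is from its nearest neighbour in the other pattern
    return max(D.min(axis=1).max(), D.min(axis=0).max())

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Z = np.array([[0.0, 0.1], [1.0, 0.0], [0.5, 0.5]])
d = hausdorff(X, Z)
# the isolated point (0.5, 0.5) of Z drives the value: d = 0.5 * sqrt(2)
```

The example illustrates the sensitivity discussed above: a single point without a close counterpart dominates the distance.
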
Another group of dissimilarity measures is based on feature matching. Important information (called a feature) about the distribution of the stochastic process that generated the pattern at hand is extracted from the pattern using a point process summary characteristic. In what follows, we focus on functional summaries such as the pair correlation function.
Let us now fix the functional characteristic f. For r > 0, let f(X, r) be the value of f(r) estimated from the point pattern X. We define the integral dissimilarity measure for two point patterns X, Z based on the functional characteristic f as

d_int(X, Z) = ∫_0^R ( f(X, r) − f(Z, r) )² dr, (3)

where R is a given constant depending on the size and shape of the observation window W. If W is the unit square, a popular rule of thumb leads to the choice R = 0.25. In real-life applications, the observation window W can be rather complicated. Then, any general recommendation for the choice of R would be counterproductive. Expert knowledge of the problem at hand should play the primary role in deciding which ranges of r are relevant for distinguishing the groups of patterns.
The expression (3) resembles the Cramér-von Mises statistic from the goodness-of-fit tests in classical statistics with i.i.d. observations. In the point process literature, similar expressions appear in the theory of parameter estimation (Møller and Waagepetersen, 2004, Sect. 10.1). Moreover, (3) is used to quantify the dissimilarities between point patterns in the stochastic reconstruction procedure for point patterns (Koňasová and Dvořák, 2021; Tscheschel and Stoyan, 2006). In (Mateu et al., 2015), a dissimilarity matrix with entries computed as in (3) is plugged into a multidimensional scaling procedure, resulting in a representation of the collection of observed point patterns by a collection of points in R^2.
The maximum absolute deviation counterpart of d_int, resembling the Kolmogorov-Smirnov statistic, is then defined as

d_sup(X, Z) = sup_{r ∈ [0, R]} | f(X, r) − f(Z, r) |.

Both d_int and d_sup are derived from semi-metrics commonly used in functional data analysis. They can be used even when the two investigated patterns are observed in different observation windows. The use of edge correction factors in the estimators of the functional summary characteristics reduces the impact of the size and shape of the observation window on the value of the estimate. However, the constant R has to be chosen with care.
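With the summary functions evaluated on a common grid of r values, both dissimilarities reduce to a few lines. A sketch using a Riemann-sum approximation of the integral; the grid, the choice R = 0.25, and the toy summary functions are illustrative:

```python
import numpy as np

def d_int(fX, fZ, r):
    """Cramér-von Mises type dissimilarity: Riemann-sum approximation
    of the integrated squared difference over the equidistant grid r."""
    return np.sum((fX - fZ) ** 2) * (r[1] - r[0])

def d_sup(fX, fZ):
    """Kolmogorov-Smirnov type dissimilarity: maximum absolute deviation."""
    return np.max(np.abs(fX - fZ))

r = np.linspace(0.0, 0.25, 251)          # grid on [0, R] with R = 0.25
f_pois = np.ones_like(r)                 # pcf of a Poisson process
f_clus = 1.0 + 0.5 * np.exp(-r / 0.05)   # a clustered-looking pcf (toy example)
di = d_int(f_pois, f_clus, r)
ds = d_sup(f_pois, f_clus)               # equals 0.5, attained at r = 0
```
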
An ideal dissimilarity measure d(X, Z) would have small values whenever the probability distributions of the stochastic processes that generated X and Z, respectively, are very similar. Our dissimilarity measures d_int and d_sup are based on matching the second-order properties of the point processes that generated X and Z. One should keep in mind that two point processes with different probability distributions can have the same form of the second-order characteristics; see e.g. Baddeley and Silverman (1984). In other words, two (visibly different) point patterns generated by models with different probability distributions can have zero dissimilarity d_int or d_sup. This is a deeper issue than the fact that d_int and d_sup are not metrics. However, if such a situation were encountered in a practical application, the user would choose a different summary characteristic instead, e.g. one based on interpoint distances, which would be able to distinguish the different groups in the training dataset.
For some applications, the need to use more than one characteristic to extract the essential features of the probability distribution may arise. Combining multiple integral or maximum absolute deviation terms into a weighted sum is possible. Choosing appropriate weights is a complex problem that usually deserves some preliminary exploratory analysis and expert knowledge. Another possibility is to use the dissimilarity measure described in (Dai et al., 2021), which allows combining multiple characteristics without the need to weight the individual terms.

SUPERVISED CLASSIFICATION
This section gives an overview of the two classification methods used in the sequel. A list of related theoretical results available in the literature is included in Sect. S10 of the Supplementary material accompanying this paper.
Suppose that a point pattern X is observed in a bounded observation window W, |W| > 0. We assume that this pattern was generated by a stationary point process X, for which we can define the pair correlation function. Take G ∈ N. Let Y be the label, i.e., a random variable with values in G = {1, 2, ..., G} representing the affiliation to one of the G possible groups. We consider X and Y as a random pair (X, Y) and aim at predicting the value of the label variable Y, given the realization X of the point process X. Since we are talking about supervised classification, our decision about the value of Y is based on the knowledge of training data, i.e., a collection of point patterns with known labels. In other words, let T_N = {(X_i, Y_i), i = 1, 2, ..., N} be a set of N ≥ 1 independent realizations of (X, Y). We call T_N the training set or the training data. From now on, we suppose that the dissimilarity measure d at hand is chosen so that ties do not occur. If the expert knowledge about the analysed data indicates that the choice of d can lead to ties, one should consider changing the dissimilarity measure, e.g. using a different functional summary.
The classification task can also be viewed as the search for a classification rule ϕ which assigns a label ϕ(X) to a point pattern X. If we knew the conditional probabilities p_g(X) = P(Y = g | X = X), g ∈ G, the optimal (Bayes) rule would assign to X the label g maximizing p_g(X). However, p_g, g ∈ G, are usually not known in practical applications. The crucial step when building up a classification rule is thus the estimation of these conditional probabilities, based on our knowledge of the training data T_N. Let X be a new pattern whose label is to be predicted. In the following, we will restrict our attention to the estimators of p_g, g ∈ G, of the form

p̂_g(X) = Σ_{i=1}^N ω_i(X) 1[Y_i = g], (4)

where {ω_i(X), i = 1, 2, ..., N} are nonnegative weights, depending on the new observation X and the training set T_N. The weights should be chosen in such a way that

p̂_g(X) ≥ 0 for all g ∈ G and Σ_{g ∈ G} p̂_g(X) = 1. (5)

In this paper, we focus on the weights derived from the kernel regression method. The method is presented in the context of functional data analysis in (Ferraty and Vieu, 2006, Sect. 8.2). Let K : R → R_+ be a symmetric function such that ∫_R K(u) du = 1. We will call such a function a kernel. Moreover, let the support of K be [−1, 1]. Note that our assumptions imply that ∫_R uK(u) du = 0. One of the classical examples of this function is the Epanechnikov kernel

K(u) = (3/4)(1 − u²) 1[|u| ≤ 1].

The corresponding weights are of the Nadaraya-Watson type,

ω_{i,h}(X) = K(d(X, X_i)/h) / Σ_{j=1}^N K(d(X, X_j)/h), (6)

where h > 0 is a smoothing parameter, also called bandwidth. Note that if d(X, X_i) > h, then the weight corresponding to the pattern X_i is 0. Equation (6) in fact corresponds to the Nadaraya-Watson estimate for a categorical response variable in the context of local polynomial regression. The classification rule based on the Nadaraya-Watson type of weights will be denoted by ϕ_NW(• | T_N, d, K, h). The corresponding weights ω_{i,h}(X), i = 1, 2, ..., N, sum up to 1 and the conditions in (5) are met.
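As a sketch of the estimator (4) with Nadaraya-Watson weights (6), assuming the dissimilarities d(X, X_i) to the training patterns have already been computed (the numbers below are made up for illustration):

```python
import numpy as np

def epanechnikov(u):
    # kernel with support [-1, 1], integrating to 1
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def nw_probabilities(d_new, labels, h, n_groups):
    """Estimate p_g(X) for each group g from the dissimilarities
    d_new[i] = d(X, X_i) and the training labels, as in (4) and (6)."""
    w = epanechnikov(d_new / h)
    w = w / w.sum()                                   # weights sum to 1
    return np.array([w[labels == g].sum() for g in range(n_groups)])

d_new = np.array([0.02, 0.05, 0.30, 0.01])            # toy dissimilarities
labels = np.array([0, 1, 1, 0])
p = nw_probabilities(d_new, labels, h=0.1, n_groups=2)
predicted = int(np.argmax(p))                          # Bayes rule on the estimates
# the training pattern with d = 0.30 > h receives zero weight
```
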
Three different features have to be chosen by the user: the kernel K, the dissimilarity measure d, and the value of the smoothing parameter h. As is typical for kernel methods, the choice of the kernel function K is not crucial. Regarding the dissimilarity measure, we will restrict our attention to the three examples listed in the previous section. Note that the integral dissimilarity measure d_int corresponds to the framework of discrepancy measures for functional data presented in (Ferraty and Vieu, 2006, Sect. 3.4).
For the choice of bandwidth, one needs to deal with the problem of selecting h from an infinite set of positive values. To overcome this issue, we replace h with h_k such that only k terms in (4) contribute to the final sum with nonzero weights. In other words, we want to choose h_k so that there are exactly k elements X_i of the training set with d(X, X_i) ≤ h_k. For the choice of k, three different approaches can be considered.
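One natural choice of such an h_k is the midpoint between the k-th and the (k+1)-th smallest dissimilarity; a sketch with made-up numbers (the function name is ours):

```python
import numpy as np

def bandwidth_for_k(d_new, k):
    """Choose h_k so that exactly k training patterns satisfy d(X, X_i) <= h_k:
    the midpoint between the k-th and (k+1)-th smallest dissimilarity."""
    ds = np.sort(d_new)
    return 0.5 * (ds[k - 1] + ds[k])

d_new = np.array([0.31, 0.07, 0.19, 0.45, 0.12])   # toy dissimilarities
h3 = bandwidth_for_k(d_new, k=3)                   # midpoint of 0.19 and 0.31
assert np.sum(d_new <= h3) == 3                    # exactly k nonzero weights
```
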

Fixed value
The parameter k is considered fixed, and the decision about its value has to be based on our prior knowledge of the problem at hand. For each new observation X, we order the dissimilarities d(X, X_i), i = 1, 2, ..., N, and set h_k to the midpoint between the k-th and the (k+1)-th smallest dissimilarity.

Global cross-validation

This approach is based on the leave-one-out cross-validation procedure. For i = 1, 2, ..., N, denote by T_N^(−i) the modified set of the training data T_N, where we omit the i-th observation (X_i, Y_i). The point pattern X_i is considered as a new observation, and we predict its label. We build the estimates of the conditional probabilities from the weights (6) and the new training data set T_N^(−i). Then, we estimate the label of X_i as Ŷ_i = argmax_{g ∈ G} p̂_{g,h_k}(X_i). We define the global loss function GCV : {1, 2, ..., N − 1} → [0, 1] by

GCV(k) = (1/N) Σ_{i=1}^N 1[Ŷ_i ≠ Y_i].

The parameter k is chosen as a solution of an optimization problem

k_GCV = argmin_k GCV(k).

This choice of k is called global since k_GCV does not depend on the new observation X. Finally, the new observation X is classified using the estimated conditional probabilities p̂_{g,h_{k_GCV}}(X), g ∈ G, where h_{k_GCV} is the midpoint between the k_GCV-th and the (k_GCV+1)-th smallest dissimilarity d(X, X_i). This procedure is described in (Ferraty and Vieu, 2006, Sect. 7.1.1).
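The leave-one-out search for k can be sketched with a k-nearest-neighbour vote standing in for the kernel weights; the toy dissimilarity matrix below encodes two well-separated groups, and all names are illustrative:

```python
import numpy as np

def knn_loo_predict(D, labels, i, k):
    """Predict the label of pattern i by majority vote among its k nearest
    neighbours (by dissimilarity), with pattern i itself left out."""
    others = np.delete(np.arange(len(labels)), i)
    order = np.argsort(np.delete(D[i], i))
    votes = labels[others[order[:k]]]
    return np.bincount(votes).argmax()

def k_gcv(D, labels, k_max):
    """Global cross-validation: the k minimizing the leave-one-out error."""
    N = len(labels)
    gcv = [np.mean([knn_loo_predict(D, labels, i, k) != labels[i]
                    for i in range(N)])
           for k in range(1, k_max + 1)]
    return int(np.argmin(gcv)) + 1

# toy symmetric dissimilarity matrix: two tight groups of three patterns
D = np.full((6, 6), 1.0)
D[:3, :3] = 0.1
D[3:, 3:] = 0.1
np.fill_diagonal(D, 0.0)
labels = np.array([0, 0, 0, 1, 1, 1])
best_k = k_gcv(D, labels, k_max=5)   # small k gives zero leave-one-out error here
```
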

Local cross-validation
We now include the new observation X in the procedure of finding the optimal value of k. Let us define the local loss function LCV(k, X_i), which measures the leave-one-out misclassification in the neighbourhood of the training pattern X_i when the parameter value k is used. For each observation X_i in the training set T_N, the local optimal value k_LCV(X_i) is then found as a solution to the optimization problem k_LCV(X_i) = argmin_k LCV(k, X_i). When classifying the new observation X, the value k_LCV(X_{i0}) corresponding to the training pattern X_{i0} closest to X is used. The local cross-validation procedure is described in (Ferraty and Vieu, 2006, Sect. 8.3).
Note that if we set K as the uniform kernel K(u) = 1/2, u ∈ [−1, 1], instead of the Epanechnikov kernel, then the weights ω_{i,h_k}(X), i = 1, 2, ..., N, are of the form

ω_{i,h_k}(X) = 1/k if X_i is among the k nearest neighbours of X with respect to d, and ω_{i,h_k}(X) = 0 otherwise.

The classification rule based on this choice of weights is called the k-nearest neighbors classifier. It can be easily seen that the estimated probabilities p̂_{g,k}(X), obtained by plugging these weights into (4), meet the conditions (5). The parameter k can be fixed, or it can be learned from the training set T_N using global or local cross-validation. Together with the dissimilarity measure d_H, the k-nearest neighbours classifier is used in (Cholaquidis et al., 2017) to classify realizations of inhomogeneous Poisson point processes with different intensity functions.

SIMULATION EXPERIMENTS
In the next sections, the performance of the Bayes classifier in combination with the k-nearest neighbors algorithm and the kernel regression method will be examined in the context of replicated point patterns. Three simulation experiments will be presented. First, we compare the performance of the proposed classifier with respect to the different choices of the underlying dissimilarity measure. Second, the proposed method is compared with the approach suggested in (Mateu et al., 2015), which uses multidimensional scaling to transform the problem into a classification task in 2D. Finally, we explore the situation where the groups correspond to realizations from models of the same parametric family but with different values of the model parameter. For ease of exposition, our attention is restricted solely to binary classification. This section serves as a detailed description of the practical aspects of our simulation experiments, such as the methodology for assessing the quality of performance of individual classification rules or the exact setting of the computational environment.

Classification rules
In the following experiments, we consider the classification rule ϕ_NW. We fix K as the Epanechnikov kernel and use the automatic choice of the bandwidth; h = h_{k_LCV} is found by the k-nearest neighbors algorithm with the local choice of k. We write ϕ_H, ϕ_g,int and ϕ_g,sup for the rules based on the dissimilarity measures d_H, d_int[g] and d_sup[g], respectively.

Misclassification rate
The performance of a given classification rule ϕ is determined using the misclassification rate. Take T_N, N ∈ N, the training set. To calculate the misclassification rate, another set of labelled patterns is needed. Denote the testing set, that is, a collection of patterns with known labels, by Γ_M = {(X̃_j, Ỹ_j), j = 1, 2, ..., M}, M ∈ N. Note that T_N and Γ_M should include different observations. Given the training and the testing set, the misclassification rate γ(ϕ | T_N, Γ_M) is computed as

γ(ϕ | T_N, Γ_M) = (1/M) Σ_{j=1}^M 1[ϕ(X̃_j | T_N) ≠ Ỹ_j],

where ϕ(X̃_j | T_N) is the estimated value of the label based on the classification rule ϕ and the training data T_N. The point pattern X̃_j is misclassified if its estimated label ϕ(X̃_j | T_N) does not correspond to its true label Ỹ_j. The misclassification rate thus gives the ratio of the number of misclassified patterns from the testing set to the size of the testing set itself. For binary classification, γ(ϕ | T_N, Γ_M) = 0.5 corresponds to the situation where the labels are assigned randomly, regardless of the value of the new observation.
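The rate itself is just the proportion of wrong labels on the testing set; a one-line sketch with made-up labels:

```python
import numpy as np

def misclassification_rate(predicted, true_labels):
    """Proportion of testing patterns whose predicted label is wrong."""
    return float(np.mean(np.asarray(predicted) != np.asarray(true_labels)))

rate = misclassification_rate([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
# one of the five testing patterns is misclassified, so the rate is 0.2
```
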
Quality of performance Different classification scenarios are compared using the average misclassification rate computed from a set of I ∈ N replications of the specific simulation experiment. The scenario with a lower average misclassification rate is then referred to as preferable. For all the simulation experiments, we set I to 100.

Computational aspects All simulation experiments are performed in the statistical software R (R Core Team, 2017), with packages doParallel (Wallig et al., 2019), pracma (Borchers, 2019), and spatstat (Baddeley et al., 2015). The code for the Hausdorff metric d_H is taken from the package pracma. The pair correlation function is computed with the default estimator in spatstat. Details about the estimation of functional summary characteristics can be found in (Møller and Waagepetersen, 2004). The unit square [0, 1]^2 is taken as the observation window W. The constant R, appearing in the formulas for d_int and d_sup, is set to 0.25. When implementing the kernel regression classifiers, we use the code accompanying the book (Ferraty and Vieu, 2006). We have made some minor modifications to adapt the original code to the context of replicated point patterns. The automatic procedure for the choice of the smoothing parameter h is based on leave-one-out cross-validation. In the simulation experiments, we use a small number of patterns in the training set (which is often the case in practical applications). In this setting, the computation of the dissimilarities (namely, the estimation of the pair correlation function) is the computational bottleneck. In contrast, the classification, including cross-validation, runs rather quickly. However, in the case of a large training set, it can be beneficial to consider j-fold cross-validation to save some computational time. This approach requires the specification of the number of folds j.
Further simulation experiments (Sections S4, S5, S6) and extensions of the ones presented here (Sections S2, S3, S7, S8) can be found in the Supplementary material accompanying the paper. An illustration of the code for our experiments can be found in a repository on GitHub at https://github.com/kpawlasova/Sup_nonparam_clas_pp.git

EXPERIMENT 1
This simulation experiment provides a basic overview of the performance of the proposed classifiers. The pair correlation function g is considered here, given its simple interpretation and widespread use in practical applications.

Models The groups correspond to realizations of the Poisson point process Π and the Thomas process Φ(σ) for σ ∈ Σ, and, in the second setting, to realizations of Π and the Gaussian determinantal point process Ψ(α) for α ∈ A.

Classification scenarios For each σ ∈ Σ, three different classification scenarios are considered: ϕ_H[σ], based on the Hausdorff metric d_H, and ϕ_g,int[σ] and ϕ_g,sup[σ], based on d_int[g] and d_sup[g], respectively, where the choice of K and h_{k_LCV} was described in the previous section of this paper. Values of average misclassification rates are reported. For α ∈ A, the corresponding scenarios ϕ_H[α], ϕ_g,int[α] and ϕ_g,sup[α] are considered.
Results for Thomas process In the following comments, values of σ in [0.02, 0.1) are considered small and correspond to strong clustering, values in [0.1, 0.15) are considered moderate, and values in [0.15, 0.2] are considered large and correspond to weak clustering. The Poisson point process Π can then be considered a limiting case of Φ(σ) as σ goes to infinity. The categorization into strong, moderate, and weak clustering is related to the size and shape of the observation window. Recall that the theoretical formula for g is given in (1). For an illustration of how the values of g depend on the model parameter σ, see Fig. S2.1 in the Supplementary material.
For small values of σ, which indicate strong clustering, the average misclassification rate is expected to be close to 0. On the other hand, for large values of σ, the realizations from Φ(σ) are hardly distinguishable from those from Π (given our observation window), and the average misclassification rate is expected to be close to 0.5.
The observed average misclassification rates are, in fact, increasing functions of σ, regardless of the dissimilarity measure; see Fig. 3 (top left). In terms of the average misclassification rate, both ϕ_g,int and ϕ_g,sup outperform ϕ_H, especially for σ < 0.1. The small difference between the performance of ϕ_g,int and ϕ_g,sup is caused by the high variability of the estimator of g(r) for very small values of r, which influences the maximum absolute deviation counterpart of the dissimilarity measure more than the integral one.
The realizations of the Poisson point process Π have a very similar structure, which leads to small variability in the values of d(X_i, X_j), where X_i and X_j are realizations of Π, regardless of the dissimilarity measure in use. On the other hand, for small values of σ, the realizations of the Thomas process Φ(σ) show higher variability in terms of point configurations (clusters of points are placed in arbitrary locations, leaving gaps between clusters) and hence higher variability in the dissimilarities measured between two elements of this group. The dissimilarities between elements of different groups nevertheless remain large enough for ϕ_g,int and ϕ_g,sup to give satisfactory results, but the average misclassification rate of ϕ_H[σ] is much higher for small values of σ.
For higher values of σ , the realizations of Φ(σ ) resemble those of Π, and the dissimilarities between the realizations from different groups become smaller.This implies a higher number of misclassified patterns from both groups and higher average misclassification rates for all classifiers considered in this experiment.
Note that the pair correlation function itself is estimated with the help of a kernel-based estimator. Therefore, it requires the user to choose a smoothing parameter. The quality of the estimates is highly dependent on this choice. Fig. 5 illustrates the impact of this choice on the performance of the classifiers based on d_int[g] and d_sup[g], respectively. We study three choices of the smoothing parameter: the default value from the spatstat package, 0.5 times the default, and 1.5 times the default. Our simulations show that an appropriate choice of the smoothing parameter (in this case, higher than the default) can slightly improve the results. On the contrary, a wrong choice (too small a value) can severely disrupt the classification. Hence, we recommend keeping the default value to reduce the risk of choosing too small a value.
Results for Gaussian determinantal process In this case, values of α below 0.002 are considered small and correspond to weak repulsion, values in [0.002, 0.004) are considered moderate, and values in [0.004, α_max] are considered large and correspond to strong repulsion. With the value of α approaching 0, the repulsive interactions become weaker, and Π can be considered a limiting case of Ψ(α) for α → 0. Recall that the theoretical formula for g is given in (2); an illustration of the dependence of g on α can be found in Fig. S2.4 in the Supplementary material. For small values of α, the average misclassification rate is expected to be close to 0.5. Then, it is expected to decrease as α increases. Fig. 6 shows that the average misclassification rate of ϕ_H[α] is between 0.4 and 0.5 for all α ∈ A, meaning that the realizations from Ψ(α) and Π are practically indistinguishable using the Hausdorff metric, regardless of the value of α. Scenarios ϕ_g,int[α] and ϕ_g,sup[α] produce similar average misclassification rates, which are very satisfactory for high values of α.
The repulsive interactions in Ψ(α) imply that the realizations have a very similar structure, with smaller dissimilarities between two realizations from the same model than in the case of two realizations of Π. Similar considerations about the number of misclassified patterns in each group that we made in the previous paragraphs also apply to the classification Ψ(α) vs Π. However, note that now Π has the more variable configurations and the larger dissimilarities between two realizations from the same model, see Fig. 7. More details can be found in Fig. S2.5 and Fig. S2.6 in the Supplementary material.
Summary In both situations, Φ(σ) vs Π and Ψ(α) vs Π, the classifiers ϕ g,int and ϕ g,sup outperform ϕ H for all values of the model parameter. If the model parameters are set such that the observation window W does not provide enough information to distinguish between the realizations of the two models (a high value of σ or a small value of α), the average misclassification rate is close to 0.5 for all classifiers. However, even in these cases, it is beneficial to choose ϕ g,int or ϕ g,sup over ϕ H. We conclude that the choice of the dissimilarity measure greatly affects the performance of the classifiers. Further computations (Sect. S3 in the Supplementary material) favour the use of second-order summary characteristics such as g or L when constructing the dissimilarity measure.
Given this basic framework (binary classification, simple models), ϕ NW together with d int [g] or d sup [g] provides satisfactory results. However, the appropriate choice of the summary characteristic is not obvious. Prior expert knowledge of the problem at hand should always be taken into account.
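Both classifiers used throughout operate only on a vector of dissimilarities between a new pattern and the training patterns. A minimal sketch is given below; the Gaussian kernel and the function names are our own illustrative choices, and the paper's ϕ NW may use a different kernel and a data-driven bandwidth:

```python
import math

def knn_classify(d_new, labels, k):
    """k-nearest-neighbors majority vote in the space of dissimilarities.

    d_new  : dissimilarities between the new pattern and the training patterns
    labels : class labels of the training patterns
    """
    nearest = sorted(range(len(d_new)), key=lambda i: d_new[i])[:k]
    votes = {}
    for i in nearest:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)

def nw_classify(d_new, labels, h):
    """Kernel-regression (Nadaraya-Watson) estimate of the class posteriors;
    returns the label with the largest estimated posterior probability.
    A Gaussian kernel with bandwidth h is assumed here for illustration."""
    weights = [math.exp(-0.5 * (d / h) ** 2) for d in d_new]
    total = sum(weights)
    score = {}
    for w, lab in zip(weights, labels):
        score[lab] = score.get(lab, 0.0) + w / total
    return max(score, key=score.get)
```

Both rules approximate the Bayes classifier: they estimate the posterior class probabilities from the training data and assign the label with the largest estimate.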

EXPERIMENT 2
This simulation experiment extends Experiment 1 from the previous section. It compares the performance of the classifiers studied in Experiment 1 with the classifier introduced in (Mateu et al., 2015). The models, training data and testing data are precisely the same as in Experiment 1.
Classification scenarios For each σ ∈ Σ, we consider the three classification scenarios ϕ H [σ], ϕ g,int [σ] and ϕ g,sup [σ] introduced in Experiment 1. The performance of these classifiers will be compared with that of ϕ ⋆ H [σ], which uses the dissimilarities based on the Hausdorff metric and multidimensional scaling (MDS) to represent the realizations of Π and Φ(σ) as points in R 2. Then, Fisher's linear discriminant analysis (LDA) is used to solve the substitutive classification task in R 2. Analogously, ϕ ⋆ g,int [σ] and ϕ ⋆ g,sup [σ] use d int [g] and d sup [g], respectively, to calculate the dissimilarities that enter the multidimensional scaling. Multidimensional scaling is performed using the function smacofSym from the R package smacof. The function lda from the R package MASS is then used to perform the classification. We report the values of the average misclassification rates, as in Experiment 1.
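The embedding step of the "MDS + LDA" pipeline can be sketched with classical (Torgerson) multidimensional scaling. Note that the paper instead uses smacofSym, which minimizes a stress criterion, and then applies lda from MASS to the embedded points; the dependency-free power-iteration sketch below is only an illustration of the idea of turning a dissimilarity matrix into points in R 2:

```python
def classical_mds(D, dim=2, iters=200):
    """Embed an n x n dissimilarity matrix D into R^dim via classical MDS:
    double-centre the squared dissimilarities and take the top eigenvectors
    (computed here by power iteration with deflation)."""
    n = len(D)
    # B = -0.5 * J D^2 J with J = I - (1/n) 11^T
    D2 = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(D2[i]) / n for i in range(n)]
    grand = sum(row) / n
    B = [[-0.5 * (D2[i][j] - row[i] - row[j] + grand) for j in range(n)]
         for i in range(n)]
    coords = [[0.0] * dim for _ in range(n)]
    for d in range(dim):
        v = [1.0 / (i + 1) for i in range(n)]      # arbitrary starting vector
        for _ in range(iters):                     # power iteration
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sum(x * x for x in w) ** 0.5 or 1.0
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n))
                  for i in range(n))
        scale = max(lam, 0.0) ** 0.5
        for i in range(n):
            coords[i][d] = scale * v[i]
        for i in range(n):                         # deflate B
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return coords
```

The embedded points can then be fed to any LDA implementation to solve the substitutive classification task in R 2.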
Results for Thomas process Fig. 8 shows that ϕ ⋆ H [σ] is clearly outperformed by the two classification rules ϕ ⋆ g,int [σ] and ϕ ⋆ g,sup [σ]. For small values of σ, ϕ ⋆ g,sup [σ] gives the lowest values (among the three classifiers based on multidimensional scaling and linear discriminant analysis) of the average misclassification rate. On the other hand, for σ > 0.1, ϕ ⋆ g,int [σ] has the best performance. When comparing the "MDS + LDA" classifiers with those based on kernel regression, we see that ϕ ⋆ H [σ] produces a lower average misclassification rate than ϕ H [σ], with the greatest difference between the average misclassification rates observed for σ between 0.05 and 0.1. The classification rule ϕ g,int [σ] outperforms ϕ ⋆ g,int [σ] for small values of σ; for σ > 0.1, the "MDS + LDA" classifier leads to a slightly lower average misclassification rate. The differences in the performance of ϕ g,sup [σ] and ϕ ⋆ g,sup [σ] are almost negligible, with a slight preference for ϕ g,sup [σ].
Results for Gaussian determinantal process For the classification Π vs Ψ(α), the situation with the three classifiers based on MDS and LDA is very similar to the results reported in Experiment 1. For α > 0.02, ϕ ⋆ g,sup [α] gives slightly lower average misclassification rates than ϕ ⋆ g,int [α], but both clearly outperform ϕ ⋆ H [α].
The difference between the "MDS + LDA" classifiers and those based on kernel regression is negligible, except for the dissimilarity measure d int [g], where the kernel regression classifier gives lower average misclassification rates for α > 0.03.

Results for Thomas process
For each value of σ 1, we expect the average misclassification rate to be 0.5 for σ 2 = σ 1 and to decrease with increasing distance |σ 1 − σ 2 |. Fig. 10 shows that the average misclassification rate corresponding to ϕ g,int [σ 1 , σ 2 ], σ 1 = 0.05, is below 0.1 except for the few values of σ 2 that are the closest neighbors of σ 1 = 0.05. This is a consequence of the behaviour of the (theoretical) pair correlation function in the model with such strong clustering: the values of the pair correlation function change significantly with even small changes of σ. In addition, the 90% pointwise envelope is very narrow, and the changes in its width with respect to σ 2 are negligible. For σ 1 = 0.1 and σ 1 = 0.15, the average misclassification rate is close to 0 for very small values of σ 2 but increases significantly towards 0.5 for σ 2 = σ 1 and decreases afterwards. The value of the average misclassification rate is below 0.1 for σ 2 ∈ [0.02, 0.05] in the first case and for σ 2 ∈ [0.02, 0.07] in the second case. The 90% pointwise envelopes are narrow for the smallest values of σ 2, and their width increases as σ 2 grows towards 0.2. Loosely speaking, the realizations of Φ(σ) with a value of σ corresponding to mild or weak clustering can be distinguished successfully from the realizations of Φ(σ) with σ sufficiently small (representing strong clustering). Similar observations can be made for the maximum absolute deviation counterpart d sup [g] of the dissimilarity measure (see Fig. S7.14 and Fig. S7.15 in the Supplementary material).
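The 90% pointwise envelopes reported in these figures are empirical quantile bands computed across repeated runs. A rough sketch of such a band (using a crude empirical quantile rather than the interpolating quantiles of, e.g., R's quantile function; the toy data are illustrative):

```python
def pointwise_envelope(runs, lower=0.05, upper=0.95):
    """For each parameter value (one column per value of, e.g., sigma_2),
    take the empirical lower/upper quantiles of the misclassification
    rates across the repeated runs."""
    lo, hi = [], []
    for col in zip(*runs):
        s = sorted(col)
        n = len(s)
        lo.append(s[min(n - 1, int(lower * n))])
        hi.append(s[min(n - 1, int(upper * n))])
    return lo, hi

# runs[i][j] = misclassification rate of run i at the j-th parameter value
runs = [[i / 10.0, i / 10.0, i / 10.0] for i in range(10)]
lo, hi = pointwise_envelope(runs)
```

The width of such a band reflects the run-to-run variability of the individual misclassification rates discussed above.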
Results for Gaussian determinantal process We again expect the average misclassification rate to be 0.5 for α 2 = α 1 and to decrease with |α 1 − α 2 | increasing. For α 1 = 0.02, Fig. 11 shows that the average misclassification rate corresponding to ϕ g,int starts at 0.2, then increases to 0.5, and then decreases as α 2 grows towards its maximal value. For α 2 > 0.04, the average misclassification rate is below 0.1. Similarly, for α 1 = 0.03 the average misclassification rate is below 0.1 for the two smallest values of α 2 as well as for the two largest values. For α 1 = α max, the situation is different. In this case, we start with a nearly perfect classification; the average misclassification rate stays below 0.1 for α 2 ≤ 0.03 and then increases sharply towards 0.5. This is consistent with the fact that α max represents the most repulsive model, whereas α < 0.02 represents weak repulsion. The difference in performance for the classifier based on the maximum absolute deviation counterpart d sup [g] of the dissimilarity measure is negligible.
Summary For the binary classification Φ(σ 1 ) vs Φ(σ 2 ) or Ψ(α 1 ) vs Ψ(α 2 ), the average misclassification rates corresponding to ϕ g,int decrease with increasing distance |σ 1 − σ 2 | or |α 1 − α 2 |. Classification between models with strong interactions and models with weak interactions is very successful, but for models with similar properties (similar values of the model parameters), the average misclassification rates are high. Recall that all of the realizations in this experiment are observed on the unit square. Sect. S8 in the Supplementary material presents a repetition of this experiment with a larger observation window, studying the impact of an increasing number of points per realization on the performance of the kernel regression classifier.

REAL-DATA EXAMPLE
To illustrate the proposed methodology, we apply our classification procedure to a collection of 68 point patterns representing the centers of intramembranous particles located in the mitochondrial membranes of the HeLa cell line. These data were collected using the freeze-fracture technique (Schladitz et al., 2003).
During data collection, the cell line was observed in three different environments to study mitochondrial metabolism: under normal conditions, after exposure to sodium acid, and after exposure to rotenone. Therefore, we distinguish three groups of patterns: a control group corresponding to standard conditions (33 patterns), the first group corresponding to the sodium acid environment (14 patterns), and the second group corresponding to the rotenone environment (21 patterns). One (randomly chosen) pattern from each group is plotted in Fig. 1. For all the observed patterns, we fix a square observation window with an edge length of 336 nm. According to the analysis in (Schladitz et al., 2003), observations in all three groups exhibit a hard-core property at very small distances. This means that pairs of points that are very close to each other do not occur in the patterns: the intramembranous particles, whose centers are recorded, do not overlap. Weak repulsion between points of the process occurs on the scale from 10 nm to 20 nm. Weak aggregation can be observed for interpoint distances greater than 20 nm.
We perform ternary classification using the Bayes classifier in combination with the k-nearest neighbors algorithm and the kernel regression method (including the local choice of the optimal k). After some preliminary observations, we have decided to use g, L and G to build the dissimilarity measures. While L is derived from the pair correlation function, the characteristic G is based on interpoint distances; see (Møller and Waagepetersen, 2004) for further details. We set R(g) = R(L) = 84 nm, that is, 1/4 of the edge length of the observation window, and R(G) = 66 nm. The classification is performed as follows: we fix one of the 68 patterns, consider the remaining 67 patterns as training data, and predict the label of the fixed pattern. We then compare the predicted labels to the true ones to compute the misclassification rate, see Table 1. Furthermore, we report the number of misclassified patterns in each group.
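The leave-one-out procedure just described can be sketched as follows, here with a simple k-nearest-neighbors vote on a precomputed dissimilarity matrix. The toy 6 × 6 matrix and group names are illustrative; the actual analysis uses all 68 patterns, the kernel regression classifier as well, and a local choice of k:

```python
def leave_one_out(D, labels, k):
    """For each pattern i, treat the remaining patterns as training data,
    predict the label of i from its k nearest neighbors (in terms of the
    dissimilarity matrix D), and return the overall misclassification rate
    together with the number of misclassified patterns per group."""
    n = len(labels)
    errors, per_group = 0, {}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: D[i][j])[:k]
        votes = {}
        for j in nearest:
            votes[labels[j]] = votes.get(labels[j], 0) + 1
        pred = max(votes, key=votes.get)
        if pred != labels[i]:
            errors += 1
            per_group[labels[i]] = per_group.get(labels[i], 0) + 1
    return errors / n, per_group

# Toy example: two well-separated groups of three patterns each.
labs = ['control'] * 3 + ['rotenone'] * 3
D = [[0.0 if i == j else (0.1 if (i < 3) == (j < 3) else 5.0)
      for j in range(6)] for i in range(6)]
rate, per_group = leave_one_out(D, labs, k=2)
```

This is exactly the evaluation reported in Table 1: one overall misclassification rate plus the per-group error counts.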
Table 1 shows that the use of G leads to the lowest misclassification rate (for both versions of the dissimilarity measure). However, more than one-quarter of the patterns are misclassified even in the best scenario. Note that from the sodium acid group, 9 and 8 patterns, respectively, are not classified correctly. This group contains the smallest number of patterns (14). For g and L, the misclassification rates based on d sup are the same. With d sup [L], 60% of the patterns in the rotenone group are misclassified. With d sup [g], the sodium acid group is the most problematic. The highest misclassification rate is observed for d int [L], with almost all patterns in the noncontrol groups labelled wrongly. For d int [g], one-third of the patterns in the control group (the largest group, containing 33 patterns) are misclassified. A visualisation of the dissimilarities between the elements of this dataset can be seen in Fig. S9.20 and Fig. S9.21 in the Supplementary material.
In conclusion, none of the three summary characteristics considered in this section provides a satisfactory ternary classification. Suppose that we select the control and sodium acid groups and consider binary classification. In that case, we expect good performance from the classifier based on d sup [L]; see the number of misclassified patterns from the individual groups in Table 1. Similarly, we expect that the classifier based on d int [G] will provide satisfactory results for binary classification between the control and rotenone groups. The same applies to the rotenone and sodium acid groups. To improve the ternary classification, we would need to tune the classifiers, e.g. by identifying another summary characteristic that better captures the differences between the three groups.

DISCUSSION
This paper proposes a methodology for the supervised classification of point patterns based on their representation by a selected functional summary characteristic. The presented simulation experiments confirm that the Bayes classifier, in combination with the k-nearest neighbors algorithm and the kernel regression method, is successful in solving this problem.
The simulation experiments cover the three main classes of models: aggregation, complete spatial randomness, and repulsion. The particular models considered in this paper represent the typical behavior in their respective classes and are often used in practice, thus providing a good picture of the problem. Of course, other models could also be considered.
The simulation study focuses mainly on the pair correlation function, selected for its simple interpretation and popularity in the applied literature. However, many other functional summary characteristics are available, and to make an appropriate choice, one should use expert knowledge of the problem at hand. In a specific application, choosing an appropriate version of the classifier, with all its tuning constants, is a difficult task. For that reason, seeking generally applicable recommendations is of little use; our decisions should be guided by expert knowledge about the particular dataset. When several candidate (versions of) classifiers are assumed to be relevant, we suggest investigating their performance on the training dataset using an appropriate cross-validation scheme.
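Such a comparison of candidate classifiers can be sketched as a K-fold cross-validation over a precomputed dissimilarity matrix, repeated once per candidate dissimilarity measure or classifier variant. The fold scheme and the 1-nearest-neighbor rule below are illustrative placeholders for whatever classifier is under consideration:

```python
import random

def cv_rate(D, labels, classify, folds=5, seed=0):
    """K-fold cross-validated misclassification rate of a dissimilarity-based
    classifier. classify(d_new, train_labels) returns a predicted label given
    the dissimilarities between one held-out pattern and the training patterns."""
    n = len(labels)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    errors = 0
    for f in range(folds):
        held_out = set(idx[f::folds])
        train = [i for i in idx if i not in held_out]
        for i in held_out:
            d_new = [D[i][j] for j in train]
            if classify(d_new, [labels[j] for j in train]) != labels[i]:
                errors += 1
    return errors / n

# 1-nearest-neighbor rule as an illustrative candidate classifier.
def one_nn(d_new, train_labels):
    return train_labels[min(range(len(d_new)), key=lambda t: d_new[t])]

labs = ['a'] * 3 + ['b'] * 3
D = [[0.0 if i == j else (0.1 if (i < 3) == (j < 3) else 5.0)
      for j in range(6)] for i in range(6)]
rate = cv_rate(D, labs, one_nn)
```

Running cv_rate once per candidate (e.g. one D per summary characteristic) and keeping the candidate with the lowest cross-validated rate implements the selection scheme suggested above.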
Finally, we remark that the proposed method can be directly extended to more complicated settings, such as random sets, provided that relevant summary characteristics are available.

Table 1. Ternary classification problem: the Bayes classifier in combination with the k-nearest neighbors algorithm and the kernel regression method is applied to the point pattern data from (Schladitz et al., 2003). The misclassification rate is reported, as well as the number of misclassified patterns in each group. The left-hand side of every column corresponds to the integral version of the dissimilarity measure, and the right-hand side corresponds to its maximum absolute deviation counterpart.

Fig. 1. Point patterns representing centres of intramembranous particles of mitochondrial membranes from HeLa cells under three different conditions: exposure to sodium acid (left), normal conditions (middle), and exposure to rotenone (right). The observation window is the square with a side length of 336 nm.
As an example of real data, we analyze a collection of point patterns representing the intramembranous particles of the mitochondrial membranes of the human HeLa cell line. Three different classes are considered: the cell line exposed to sodium acid, the cell line under normal conditions, and the cell line exposed to rotenone. We aim to predict the class membership of a new observation based on the training set of labelled patterns. Examples of one pattern from each class can be seen in Fig. 1.

Fig. 5. Average misclassification rates for different choices of the smoothing parameter used when estimating the pair correlation function. The solid line corresponds to the default setting in the spatstat package (these values are the same as in Fig. 3), the dotted line corresponds to 0.5× the default, and the dashed line corresponds to 1.5× the default.

Fig. 6.

Fig. 9. Average misclassification rates corresponding to ϕ ⋆ H [α], ϕ ⋆ g,int [α] and ϕ ⋆ g,sup [α], plotted as functions of the model parameter α. For each d ∈ {d H , d int [g], d sup [g]}, the average misclassification rates corresponding to ϕ ⋆ d (including the 90% pointwise envelope based on the individual misclassification rates) are compared to the average misclassification rates corresponding to ϕ d (from Experiment 1).