CLASSIFICATION OF RED BLOOD CELLS FROM A GEOMETRIC MORPHOMETRIC STUDY

Sickle cell disease causes the deformation of erythrocytes into sickle cells. The study of this process using digital images of peripheral blood smears can help specialists to quantify the number of deformed cells in order to gauge the severity of the illness. A new method for classifying red blood cells into three categories: healthy, sickle cell disease, and other deformations is proposed. This method does not require obtaining the contour of each cell but instead utilizes information obtained from a small number of points, obtained through appropriate geometric sampling and the use of stereological formulas. The parameters utilized for classification are the bending energy times length ( E ) and the circular shape factor ( F ). In normal cells, which exhibit an almost circular shape, these parameters typically have values close to ( 1 , 1 ) . To assess the effectiveness of classification using the parameters ( E , F ) , a synthetic curve dataset and a dataset of red blood cells are employed, applying various supervised and unsupervised classification methods.


INTRODUCTION
Normally, red blood cells are flexible and round, moving easily through even the smallest of blood vessels.When a person has sickle cell anemia, many red blood cells assume a rigid sickle-like shape or crescent moon, that can hinder their passage through diminutive capillaries, resulting in oxygen deficiency to certain tissues as blockages form.Moreover, sickled red blood cells are unusually fragile and prone to breakage, so they only survive in the bloodstream for about a tenth of the time that normal erythrocytes remain in circulation, increasing the effects of anemia.Among the most common symptoms of sickle cell anemia are fatigue, breathlessness, joint pain, delayed growth, jaundice, rapid heart rate, increased susceptibility to infections, and sporadic attacks of pain (often termed crises) in the abdomen or other areas of the body.
One method to assess the clinical status of patients is the classification of cells based on their morphology.Cells are generally classified into three categories: normal, sickle, or other abnormalities.Although even today this classification is carried out mostly by specialists, looking directly into the microscope or at the computer screen to decide each cell of what type it is, nowadays there are more and more studies that use automatic cell classification methods, based on image processing techniques and machine learning.Initially, the images are obtained from a peripheral blood smear that gives rise to the microscopic images that are observed by the specialist.However, in order to apply automatic classification techniques, these images are processed and segmented to distinguish the outline of the cells and, in most methods, the outline (flat curve) is used to obtain the classification Sadafi et al. (2023).This classification can be made from characteristics that are extracted from the contour (length, area, eccentricity...) (Bischin el al. (2012)), or considering the contour (curve) as an element of a more complicated geometric manifold (Epifanio et al. (2020)) in which the distances used in the classification are defined or using, for instance, neural nets.The state of the art of segmentation methods and cell classification methods, based on boundary features and geodesic distances in the shape space of curves, can be found at Delgado-Font et al. (2020), and a literature review of image processing methods and machine learning methods can be found in Alzubaidi el al. (2022).In any case, as we will also see in this work, the classification depends on the segmentation of the cell contours, and fluctuations in the segmentation can strongly influence the classification results.
In this paper we propose a semiautomatic method for classifying red blood cells into three groups: normal, elongated (sickle-shaped) and with other deformations; based on stereological estimators of contour features.The specialists must select certain points of the microscope image of the peripheral blood smear, and from these points the characteristics used in the classification will be estimated.These characteristics are the circular shape factor F (based on the isoperimetric inequality and already used in other works) and the bending energy times length E, which is a new proposal based on the bending energy of the flat curve (contour).The objective of the paper is not to compare our method with other existing methods for red blood cell classification based on cell boundary segmentation.Instead, the aim is to provide a new classification method without the need to segment each cell, but by observing what happens at the intersection of lines with the cell boundary through appropriate geometric sampling.The computational burden of our method depends on the density of lines considered in the geometric sampling (see Fig. 1).
Although stereology is defined as a discipline that allows obtaining efficient and unbiased estimates of 3D quantitative characteristics from an appropriate geometric sampling based on 2D measurements of the structure of interest (Baddeley and Jensen (2005); Howard and Reed (2005)), estimates of 2-dimensional parameters on the plane (area, curve length...) based on sampling with measurements of dimension less than two (lines and points) have also been considered in the literature (see, for instance, Eq. (7.5) and Eq.(7.9) of Baddeley and Jensen (2005)).It is in this sense that we will define stereological estimates of the characteristics F and E.
The paper can be divided into two parts.In the first part (Materials and Methods), the circular shape factor F and the bending energy times length E of a planar closed curve are defined, and two approximations ( F, Ẽ) and two stereological estimations ( F, Ê) are proposed along with their corresponding square coefficients of error.In the second part, two databases are considered.One consists of thirty synthetic curves for which we know their parametrizations, allowing us to obtain (F, E) exactly.The other is composed of segmented images of peripheral blood smear samples, therefore, we can only obtain approximations and estimations of (F, E).Based on these curve databases, we propose an unsupervised classification of the curves to observe, firstly, if (F, E) provide an adequate classification, and secondly, if this classification remains suitable and is not significantly altered when considering the approximations ( F, Ẽ) or estimations ( F, Ê).

MATERIALS AND METHODS
Let α : [a, b] −→ R 2 , with α(t) = {x(t), y(t)}, be a natural regular curve.Let C denote the graph of the curve.We suppose that C bounds a domain D in the plane.Let A and B denote the area of D and the length of C, respectively.The curvature of α at the point α(t) is given by (Gual-Arnau et al. (2017)) . (1) The bending energy of the curve α is defined as (Canham (1970), Young et al. (1974)) (2) The parameters A, B and E(C) do not depend on the parameterization α, they only depend on the geometrical curve C. The real interval [a, b] defining the curve will be [0, B] when the curve is parameterized by arc length and [0, 2π] for the closed curves considered in the experimental study.
The bending energy of a closed object in 2D has been frequently used as shape discriminator; in fact, two of the pioneering papers in relating the shape of red blood cells with the bending energy of their boundaries are Canham (1970) and Young et al. (1974).
The two features that we will consider in this paper as shape descriptors are the following.
Definition 1.The circular shape factor of D is defined as The bending energy times length of C is defined as In this paper, the shape of a plane domain will be the geometric information that remains invariant when rotations, translations and/or changes of scale act on the domain.Since the area of a domain and the length and curvature of a curve are invariant under translations and rotations, we have that the circular shape factor F and bending energy times length E will be also invariant.Now we will see that F and E are also invariant under changes of scale.
Let D λ = {λ (x, y) / (x, y) ∈ D} with λ > 0, and C λ be the curve that bounds D λ .Then We conclude from Eq. (3) and Eq. ( 4) that E and F are rescaling invariants and therefore shape descriptors.
In fact, from the well-known isoperimetric inequality in the plane, we have that F ≥ 1 and equality is given if and only if D is a circle.
Proposition 2. Given a domain D bounded by a differentiable simple closed curve C, then E ≥ 1 and equality holds if and only if D is a circle.
Proof of Proposition 2. In the proof, we will use the Fourier series technique employed in Young et al. (1974), adapted to our definition of E.
Let C be a differentiable simple closed curve which bounds a planar domain x(s) and y(s) are periodic functions; that is, x(s + B) = x(s) and y(s + B) = y(s); then, we consider the Fourier series of both functions 2 )ds = B; then, from the derivatives of (5), we have Let ⃗ T be the unit tangent vector to α(s); then, s) 2 ; then, applying the Parseval's identity to the Fourier series of x ′ , y ′ , x ′′ and y ′′ we obtain Having in mind the above equation; to determine the set of Fourier coefficients that yield a minimum E subject to the equality constraint given in Eq. ( 6) for B 2 , we form the function and we apply the method of Lagrange multipliers where the variables will be z n and the Lagrange multiplier λ ; that is Then, from (7), B 2 , which implies that at most one z k is nonzero.
From Eq. ( 6) and Eq. ( 7), it leads to So, the minimum value of E is given when k = 1; that is, E min = 1.Moreover, in this case, from Eq. ( 5) we have Then, we can write Differentiating the functions x and y in Eq. ( 12) and substituting them into x ′ (s) 2 + y ′ (s) 2 = 1, we can write Eq. ( 13) as: Now, from Eq. ( 6), Therefore, to each plane domain D we will associate the pair of points, both greater than or equal to one, (F, E) and whose value will be (1,1) if and only if the domain is a circle.

MATERIALS
The objective of this paper is to characterize a plane shape through the pair (F, E).First we will consider a synthetic base formed by 30 curves of which we know their parameterization α i (see Fig. 2); therefore, we will be able to obtain the pair (F, E) for each shape exactly.Next we will work with a discretization of the curves, so that we will have a big number of points for each curve and, from them, we will obtain the approximate values ( F, Ẽ) of (F, E).Finally, we will use stereological estimators to obtain an estimate ( F, Ê) of the values (F, E) of each curve.
Next, we will consider images of peripheral blood smear samples from patients with sickle cell disease in the Special Department of Hematology of the General Hospital 'Dr.Juan Bruno Zayas Alfonso' from Santiago de Cuba.A specialist prepared the blood samples and manually segmented and classified each cell as normal, sickle (elongated) or with other deformations.The dataset used in this study is available at http://erythrocytesidb.uib.es/.From these images we will obtain the approximate values ( F, Ẽ) and the stereological estimators ( F, Ê) of each red blood cell.We will see that the values of Ẽ and especially F depend heavily on the segmentation performed, as variations in the contour strongly affect the curvature and length of the curve.However, the value of the estimations ( F, Ê) does not depend as much on these variations in the segmentation, and the values are more stable and similar to those obtained in the synthetic dataset.

If we know the parameterizations α
and κ(t) from Eq. ( 1).Then, we associate to each synthetic shape a point in R 2 given by the exact values If we know α(t) in a discrete number of points α(t i ) = (x(t i ), y(t i ))) = (x i , y i ), i = 0, 1, . . ., N, where N , the approximate values of A, B and E(C) are obtained from numerical methods.Then we obtain the approximations ( F, Ẽ) of (F, E).
To approximate A there exist some numerical methods based, for instance, on the Green's theorem or the trapezoidal rule.Here we use the trapezoidal method, then we have two approximations: The length B will be approximated as the sum of the lengths of the segments joining consecutive points α(t i ) and α(t i+1 ); that is, The curvature κ i at each point α(t i ) can be approximated following different methods (Gual-Arnau et al. (2017)), we consider the following, where T i is the triangle formed by the points α(t i−1 ), α(t i ) and α(t i+1 ) and a i , b i , c i the lengths of the triangle sides.Now, E(C) is approximated as Finally, to use stereological estimators, we need a suitable geometric sampling.We will consider a square grid of test lines which is isotropic uniform random (IUR) relative to the cells to be estimated.A square grid of test lines is the union of two mutually perpendicular IUR series of parallel test lines a constant distance T > 0 apart (see Fig. 1).In practice, isotropic uniform randomness is attempted by superimposing the grid "at random", without looking at the image.Unbiased estimators of A, B and E(C), in the line, for example, of equations (4.4), (4.5), (4.38), (7.5), and (7.9) in Baddeley and Jensen (2005), are: where P is the number of test points hitting D, I the number of intersection points of lines with C = ∂ D and κ i denotes the curvature of C at an intersection point of C with a line.In Fig. 1, P = 3 and I = 8.In practice, κ i is substituted by κi .
Then, the estimates of F and E are, respectively, The error variance predictors of the above estimators are detailed in Appendix A.

RESULTS
Classification is a type of machine learning task that involves training an algorithm to identify which category or class an observation belongs to.Classification methods are divided into supervised and unsupervised methods.The cluster analysis, also known as unsupervised classification, is a technique used to identify natural groupings or clusters in a dataset based on the values of one or more variables; in our case in the variables F and E. There exist several unsupervised classification methods.We consider a model-based classification method provided by the Expectation-Maximization (EM) algorithm using the MClust library in R, to see if the unsupervised classification we obtain with the factors (F, E), and their approximations and estimations aligns with the known classification.On the other hand, the calculation of ( F, Ẽ) and ( F, Ê) has been carried out using custom codes developed in MATLAB.

EXPERIMENTAL STUDY OF SYNTHETIC FIGURES
In this section we consider the set of synthetic curves given in Fig. 2 and we apply the classification method from the values of (F, E).
In Fig. 3, we have an example of each class into which we have divided the synthetic figures.
The classification algorithm divides the cells using the factors (F, E) into three distinct groups.In the first group we have the normal ('almost circular') cells, whose values of (F, E) are close to (1,1), the sickle cells form the second group and, finally, we have the group of other cells (see Fig. 4).The cluster vector obtained using the MClust library is provided in Eq. ( 24).This vector indicates the group to which each curve in Fig. 2 has been classified.Therefore, considering the shapes of the synthetic figures, the result aligns with expectations, with ten cells in each group.To provide a rough characterization of the defining features of each group, we have included the average F and E values for each group in Table 1 Using the classification method provided by the MClust library in R and utilizing the approximate values ( F, E), we also derive the cluster vector presented in Eq. ( 24).Consequently, a classification with ten cells in each expected group is achieved (refer to Fig. 5).In Table 2 we have the average F and Ẽ values of each group.Now, we consider the unbiased estimators of A, B and E(C) derived from Eq. ( 22).To classify the estimates of F and E, a code draws the parameterizations and the grid of UR lines.The estimation of length and area has been manually calculated by counting the interior points and intersection points; and then, the value of F has been obtained.To calculate Ê, we must find the sum of squared curvatures at each intersection point of the grid with the curve.To do this, the corresponding points must be clicked, and when all points are clicked, a code calculates Ê.To calculate the curvature at a point, we need the point itself and two neighboring points.For this reason, we have only calculated the estimated ( F, Ê) values for twelve cells.
As our database is segmented, we have utilized the segmentation points.However, since the curvature depends on the segmentation, our proposal for the future is to not segment the image and have the specialist mark only three points at each intersection between curve and straight line.
We have already seen that the classification of the synthetic figures in Table 2 using the real values of F and E, as well as the approximate values ( F, Ẽ), yields the expected results.However, with stereological estimations, the classification will depend on the errors made in these estimations, specifically with the number of lines used and therefore the value of T .We will only use a fixed value of T , and we will give a predicted approximations of the square coefficients of error of F and Ê.It is possible to adjust the value of T as we please or depending on how precise we want the estimation to be.
In our case, using the line density shown in Fig. 1, and based on the classification method provided by the MClust library in R, employing the estimated values ( F, E), we also obtain the cluster vector presented in Eq. ( 24).(see Fig. 6).Fig. 6.MClust method for data classification from ( F, Ê).
In Table 3 we have the average F and Ê values of each group.Therefore, given the shapes of the synthetic figures, the classification outcome is as expected, with ten cells in each group, whether we use E and F or their approximations or estimates.Furthermore, the probability that each curve belongs to the group to which it has been classified consistently exceeds 0.9999987; thus, the classification is accurate, wellseparated, and correct for all three groups in all cases.
Since we know the exact values of F and E for synthetic figures, we can obtain the sample variance and squared error coefficient of F and Ê for each curve, and we have obtained the following results: For nearly circular curves: ce 2 ( F) < 3% and ce 2 ( Ê) < 3%.

SEGMENTED CELLS
In this section we first consider a database from the Department of Hematology of the General Hospital 'Dr.Juan Bruno Zayas Alfonso' from Santiago de Cuba, formed by 513 cells of which 202 are normal, 100 have the sickle cell disease and 211 present other deformations.The initial idea is to use the approximations and estimations of F and E for performing an unsupervised classification of these cells.In Wheeless et al. (1994) the circular shape factor is used for the same purpose.However, when performing a classification process, it is advisable to remove from the database those elements considered atypical.In this regard, if we observe Fig. 7, we can see that the length of a curve and, especially, the integral of the squared curvature are highly sensitive to curve perturbations; therefore, especially the value of E can vary abruptly with curve perturbations.Therefore, before initiating the classification process with the complete database, we have proceeded to eliminate elements with F > 92.211 or E > 92.229.As a result, we are left with a database consisting of 202 normal cells and 100 sickle cells, as well as 192 cells with other deformations.
In Fig. 8, we have one erythrocyte from each class of the database and their segmentation (in blue).To obtain the values of Ẽ and F, complete segmentation is necessary.However, the values of Ê and F do not require the entire segmentation but only the values at the intersection points of the boundary with a square grid of test lines.In this section, we are going to calculate the approximate descriptors of E and F for the cells corresponding to the refined and outlier-free database.Let's remember that this is a real database, and therefore we do not have the curves given as parameterizations but as a set of points.The code used to calculate the approximate F and Ẽ values is the same as we used in the previous section when calculating a series of points for each parameterization and working with them.
Using Mclust, we have three groups, each composed of 202, 106, and 186 cells, representing normal, sickle, and other cells, respectively.In Fig. 9, we can observe how the groups have been classified.As we can see, three groups emerge.The 202 cells considered normal appear in a single group; therefore, all normal cells are classified within a group where cells from the other two groups do not appear.Then we have a group where the majority cells are sickle cells but some 'other' cells are also classified in this group.The last group is formed by other cells.Fig. 9. MClust method for database classification from approximations.
Finally, we need to calculate the Ê and F values for each cell using stereology.The code to be used is the same as in the preceding section, with the only difference that now we cannot draw the cell in order to click on the intersection points and calculate the two descriptors Ê and F. The solution to this problem has been to use the scatter function provided by MATLAB, which draws the contour formed by connecting the given points.Therefore, we can use the same code, but using also the code that draw the cell.
We have considered the cell dataset after the removal of atypical elements, meaning we start with the database in which 202 cells are normal, 100 cells are sickle cells, and 192 exhibit other deformations.As a result, we have calculated the estimated Ê and F for all the cells in this dataset and subsequently classified them.Applying MClust again, we obtained three groups, one composed of 202 normal cells, a second group consisting of 127 cells classified as sickle cells, and a third group with 165 cells classified with other deformations.Five sickle cells were classified in the group of other deformations, and 32 cells that belonged to the group of other deformations were classified as sickle cells.
The classification results of the MClust method based on the estimates depend on the grid of test lines used and their density.In our case, the grid of lines used and one of the cells can be seen in Fig. 1, and the classification results are shown in Fig. 10.In this case, since we do not have the exact values of F and E, we will approximate the squared error coefficient of F and Ê of the cells of Fig. 8 using the approximations from the Appendix, and we obtain the following results: For the normal cell: ce 2 ( F) = 2.69% and ce 2 ( Ê) = 1.97%.

DISCUSSION
In this paper, we have proposed two parameters, one known as the circular shape factor F and a new one based on the curvature of the curve, the bending energy times length E, which, when associated with red blood cells, have allowed for the classification of these cells into three groups: healthy, with sickle cell disease, and with other deformations.Furthermore, it has been demonstrated that both the values of F and E are greater than one, and when they take the value of 1, they characterize a circle.
The classification has been carried out using both supervised and unsupervised methods, and it has been found that when the values of F and E are replaced with approximations F and Ẽ or estimations F and Ê, the classification does not vary substantially.
The idea would be to incorporate the calculation of F and Ê into interactive stereology programs adapted to microscopes so that, by directly observing the microscope images without segmenting the cells, these estimations could be obtained.In this study, since we had a database of segmented images, we used them to compare the results with those obtained using the estimations F and Ẽ.The two factors F and E vary with fluctuations in the cell segmentation, especially the factor E heavily relies on variations in the contour, as it is based on the curvature of the curve.Therefore, approximating the curvature at a few points directly on the microscope, without the need to segment all the cells, could be better for both the microscope observer and the cell classification.This way of classifying cells based on information from a small number of points on the boundary can also be useful when there is overlap between cells and it is difficult to obtain the segmentation of each one separately.

Fig. 1 .
Fig. 1.A square grid of test lines and a cell in R 2 .

Fig. 8 .
Fig. 8.Some types of erythrocytes from the database.

Table 1 .
. Average F and E values of each group.

Table 2 .
Average F and Ẽ values of each group.

Table 3 .
Average F and Ê values of each group.