Phenotype clustering of breast epithelial cells in confocal images based on nuclear protein distribution analysis
© Long et al; licensee BioMed Central Ltd. 2007
Published: 10 July 2007
Volume 8 Supplement 1
The distribution of chromatin-associated proteins plays a key role in directing nuclear function. Previously, we developed an image-based method to quantify the nuclear distributions of proteins and showed that these distributions depended on the phenotype of human mammary epithelial cells. Here we describe a method that creates a hierarchical tree of the given cell phenotypes and calculates the statistical significance between them, based on the clustering analysis of nuclear protein distributions.
Nuclear distributions of nuclear mitotic apparatus protein were previously obtained for non-neoplastic S1 and malignant T4-2 human mammary epithelial cells cultured for up to 12 days. Cell phenotype was defined as S1 or T4-2 and the number of days in culture. A probabilistic ensemble approach was used to define a set of consensus clusters from the results of multiple traditional cluster analysis techniques applied to the nuclear distribution data. Cluster histograms were constructed to show how cells in any one phenotype were distributed across the consensus clusters. Grouping various phenotypes allowed us to build phenotype trees and calculate the statistical difference between each group. The results showed that non-neoplastic S1 cells could be distinguished from malignant T4-2 cells with 94.19% accuracy; that proliferating S1 cells could be distinguished from differentiated S1 cells with 92.86% accuracy; and that there was no significant difference between the various phenotypes of T4-2 cells corresponding to increasing tumor sizes.
This work presents a cluster analysis method that can identify significant cell phenotypes, based on the nuclear distribution of specific proteins, with high accuracy.
Histological classification of biopsied breast tissue plays a key role in mammary cancer detection and in determining patient treatment. Current methods rely on gross signatures of cellular and tissue organization including tubular formation, nuclear pleomorphism and mitotic activity. To aid the early detection and diagnosis of mammary tumors, quantitative techniques are highly needed that could not only help automate the classification process but also provide subcellular information that could be used to reveal new subclasses of tumor within each pathological grade.
Increasing evidence has shown that chromatin-associated proteins are important in directing nuclear functions involved in the control of cell proliferation and differentiation [1–3]. Using tissue models, formed by culturing human mammary epithelial cells (HMECs) from the HMT-3522 cancer progression series in Matrigel™ (3D culture), earlier studies showed that the distribution of Nuclear Mitotic Apparatus (NuMA) protein was remarkably different in non-neoplastic cells that were proliferating compared to those that had completed acinar morphogenesis by forming polarized glandular tissue structures . For instance, during the 10-day in vitro morphogenesis process, NuMA staining was reported as diffusely distributed within the nuclei of proliferating cells, and had aggregated into foci of increasing size as cells arrested proliferation and completed acinar morphogenesis .
Based on these findings, Knowles et al then developed an image-based technique, called local bright feature (LBF) analysis . The technique uses fluorescence images of total DNA and specifically stained nuclear proteins and calculates the radial distribution of the density of bright immunostained features as a function of the distance from the perimeter of the nucleus to its center. The LBF analysis was used to quantify the distribution of fluorescently stained NuMA from confocal images of non-neoplastic (S1) and malignant (T4-2) HMT-3522 HMECs, cultured in 3D for up to 12 days . By averaging the LBF distributions over populations of cells with the same phenotype, the study showed that the LBF analysis reproducibly captured changes in NuMA distribution along the morphogenic process in non-neoplastic S1 cells. It also revealed that the NuMA distribution in malignant T4-2 cells was diffuse and independent of the number of days the cells were in culture .
Here we report a cluster analysis approach, based on the distribution of nuclear proteins, that robustly calculates the statistical significance between cell phenotypes, which are defined by the behavior of the cells in 3D culture. The method first groups LBF distributions into clusters using multiple traditional clustering methods. The results are then combined by a probabilistic ensemble approach into a set of consensus clusters that can be used to reliably define all possible LBF distributions that exist within a data set. This then allows cluster histograms to be computed which show how the LBF distributions in individual cells from a group are distributed over the consensus clusters. These cluster histograms represent a new way of linking the phenotype of groups of phenotypically similar cells, defined by their behavior in 3D culture, with their LBF distributions, quantified microscopically. Further, by grouping the LBF cluster histograms in multiple ways, the method is then able to build a phenotype tree and to calculate the statistical significance between each grouping. Each level of the tree corresponds to a different phenotype division of the cells and provides a way to predict which of the cell phenotypes, or grouping of cell phenotypes are significantly different from each other. These methods were then applied to the LBF distributions of NuMA in S1 and T4-2 cells, previously reported in Knowles et al . The resulting cluster histograms clearly showed that the distribution of NuMA changes during the morphogenic process as non-neoplastic S1 cells growth arrest and differentiate. The resulting phenotype tree showed that non-neoplastic S1 cells could be distinguished from malignant T4-2 cells with 94.19% accuracy; that proliferating S1 cells could be distinguished from differentiated S1 cells with 92.86% accuracy; and clearly indicated that NuMA distribution was unchanged in the various phenotypes of malignant T4-2 cells.
As described in , non-neoplastic HMT-3522 S1 cells were cultured in 3D in the presence of Matrigel™ for up to 12 days to induce acinar morphogenesis. Malignant HMT-3522 T4-2 cells were cultured under similar conditions for a maximum of 11 days to avoid the overgrowth of tumor nodules. DNA was stained with DAPI to visualize the limits of the nuclear volume and NuMA proteins were labeled with Texas red. Three-dimensional images were acquired using a Zeiss 410 confocal laser-scanning microscope with planapochromatic 63×, 1.4 numerical aperture lens. The resulting voxel dimensions of the 3D images were 0.08 × 0.08 μm in the plane of the slide and 0.5 μm along the optical direction.
We used three image datasets to test our phenotype clustering approach. The first dataset contains 2673 non-neoplastic S1 cells taken from 77 confocal images. Images 1–25, 26–45, 46–61, and 62–77 are S1 cells cultured for 12 days, 10 days, 5 days, and 3 days respectively. The second dataset contains 3535 malignant T4-2 cells taken from 44 images. Images 1–14, 15–26, 27–36, and 37–44 are T4-2 cells cultured for 5 days, 10 days, 11 days, and 4 days respectively. The third, dependent dataset contains both malignant T4-2 and non-neoplastic S1 cells and is the direct combination of all 121 images. The time points were selected to span the growth progression of the non-neoplastic cultured cells. Optical sections from 3D images of individual nuclei, showing representative NuMA staining for each of the phenotypes, are displayed in the Methods section.
Using an automated image analysis method developed earlier, we extracted the local bright staining features of NuMA protein and quantified their radial distribution in each nucleus in all 121 S1 and T4-2 images. In this way, we obtained 2673 and 3535 LBF distributions for S1 and T4-2 cells respectively. Each distribution is represented by the normalized density of bright NuMA protein features as a function of the normalized distance from the perimeter of the nucleus to its center (see Methods for further details).
Pair-wise F-measures for the clustering results generated by the five traditional clustering approaches, as shown in Figure 1.
As shown in Table 1, different clustering methods may generate different results for the same dataset and the agreement between them can be low. This is because each clustering method assumes certain data distributions and cluster characteristics. For instance, the Gaussian mixture model assumes that clusters follow a Gaussian distribution, and K-means works well for clusters of convex shape. Thus, some algorithms might perform well for specific datasets and not for others. In general, no single clustering method can successfully handle all types of cluster structure. In addition, even different initializations and parameter settings of the same method, for instance K-means or the Gaussian mixture model, may generate different clustering results. As a result, selecting an optimal clustering method is non-trivial or even impossible in many cases. A reasonable way to get a reliable partition of a dataset is to derive a consensus from multiple clustering results, the assumption being that the judgment made by a committee is more robust and unbiased than those made by individuals. This idea, called ensemble clustering, has been investigated in the literature and several major benefits have been identified [15–21]. First, ensemble clustering can improve the robustness of clustering. The clusters generated tend to be less sensitive to noise, outliers, initialization, or sampling variations compared to individual clustering methods. Second, ensemble clustering does not need a priori information about the number of clusters, but can effectively determine the most probable number of clusters. Third, ensemble clustering can detect outliers. This ability is closely associated with the ability to determine the number of clusters.
Several different ensemble-clustering methods have become available. In , a voting algorithm based on hierarchical clustering of the co-association matrix (which represents how often each pair of data appears in the same cluster) is used to derive the consensus clusters. In , Strehl and Ghosh developed an evidence accumulation and a hypergraph representation ensemble clustering method. In , Topchy et al proposed a mutual-information-based method. In , Fischer and Buhmann developed a bootstrap algorithm by first relabeling the data in each clustering result to find the correspondence and then using a voting scheme to find consensus.
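As an illustration of the co-association matrix mentioned above (used by the voting approach, not by the probabilistic method adopted in this work), the following is a minimal sketch. The function name `co_association` and the list-of-label-lists input format are assumptions made for this example.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of data points shares a cluster.

    `partitions` is a list of label lists, one per clustering result; all
    lists are assumed to label the same data points in the same order.
    """
    n = len(partitions[0])
    m = len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for a in range(n):
            for b in range(n):
                if labels[a] == labels[b]:
                    counts[a][b] += 1
    # Normalize counts to frequencies in [0, 1].
    return [[c / m for c in row] for row in counts]


# Two toy clustering results over three data points.
matrix = co_association([[0, 0, 1], [0, 1, 1]])
```

A consensus could then be derived by hierarchical clustering of this matrix, treating one minus each entry as a distance.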
In this work, we used a probabilistic ensemble approach based on Bayesian latent variable induction [21–23] (see Methods). Assuming that the clustering results generated by individual methods, i.e., Gaussian mixture model, fuzzy C-means, K-Means, hierarchical clustering, and spectral clustering, are independent of each other, the Bayesian latent variable induction method is able to obtain the statistically optimal combination of individual clustering results as shown by Chickering and Heckerman in . A similar probabilistic ensemble approach has also been adopted by Topchy in  where accurate consensus was obtained from unreliable individual clustering results.
Number of clusters (the second row) predefined in the individual clustering methods (i.e., Gaussian mixture model, fuzzy C-means, hierarchical clustering, K-means and spectral clustering) and those automatically determined by the probabilistic ensemble clustering method for both S1 and T4-2 cells (the third row).
Using the same approach, we constructed the phenotype trees for malignant T4-2 cells and for the combination of S1 and T4-2 cells, as shown in Figure 4b and Figure 4c respectively. Figure 4b shows that we can distinguish T4-2 cells cultured for 4, 5, and 10 days from those cultured for 11 days with relatively high confidence (0.8591; the first level of Figure 4b). However, if we want to distinguish T4-2 cells cultured for each of the different numbers of days, the confidence drops to 0.5748. Figure 4c shows that we can distinguish S1 and T4-2 cells with very high confidence (0.9419; see the first level of Figure 4c). However, the confidence drops as the level increases. The certainty in distinguishing all 8 phenotypes drops to 0.5508 at the highest level of the tree. In general, the phenotype trees provide a way to evaluate how the phenotypes, defined by the behavior of the cells in 3D culture, can be hierarchically grouped and to calculate the statistical significance between each grouping.
We have developed a cluster analysis approach that can robustly link any given set of multivariate features measured on a per cell basis to the phenotype of the cells as defined by their macroscopic biology. The technique uses a probabilistic ensemble approach to group the measured multivariate features into a set of consensus clusters. This method provides a novel way of linking the phenotypes of groups of cells to cluster histograms that describe the distribution of the measured features across the consensus clusters. Then, by forming various groupings of the cluster histograms, the technique permits the formation of a phenotype tree and calculations of the statistical significance between each of the groups. If two groups of cells are found to be significantly different, one can conclude that the features measured in the cells can distinguish the groups and that the groups are indeed different. If the two groups are not significantly different, one can only conclude that the measured feature does not change between these groups. It does not imply that the groups are necessarily identical.
We have shown how the cluster analysis technique can be applied to the radial LBF distributions of a chromatin-associated protein, NuMA, measured on a per cell basis from non-neoplastic S1 and malignant T4-2 HMECs, cultured in a 3D environment for up to 12 days. The results showed that, for this measured feature, the method can distinguish non-neoplastic S1 cells from malignant T4-2 cells with 94.19% accuracy, and proliferating S1 cells from S1 cells differentiated into acinar structures with 92.86% accuracy. The phenotype tree also shows that the method only distinguishes the four phenotypes of S1 cells with 68.22% accuracy. However, when the two phenotypes S1-day 10 and S1-day 12 are considered as one group, the ability to distinguish that group from S1-day 5 and S1-day 3 jumps to 85.11%. This result demonstrates the power of the phenotype tree, which in this case shows that the distribution of NuMA changes moderately between the phenotypes S1-day 3 and S1-day 5, markedly between the phenotypes S1-day 5 and S1-day 10, but then does not change significantly in S1 cells at 10 days compared to 12 days in culture. These results correlate with the behavior of cultured S1 cells and clearly show that the reorganization of NuMA that occurs during the morphogenic process of these cells is almost complete at 10 days of culture. In other words, S1-day 10 and S1-day 12 are not significantly different phenotypes, based on NuMA distribution. These results are echoed by the cluster histograms for the S1 cells. Clearly marked differences are seen between cluster histograms of the phenotypes S1-day 5 and S1-day 10 and not between the phenotypes S1-day 10 and S1-day 12. Further, the method only distinguishes the four phenotypes of T4-2 cells with 57.48% accuracy. This result also correlates with the behavior of these malignant cells that continue to proliferate throughout the 12 day culture period.
This result simply demonstrates that based on NuMA distribution, the phenotypes T4-2-day 4, T4-2-day 5, T4-2-day 10 and T4-2-day 11 are not significantly different. It does not rule out the possibility that introducing other measured features could reveal differences between such phenotypes.
Collectively our data demonstrate the quantitative ability of clustering-based analysis to link microscopically measurable features with the behavior of the cells. The methods described demonstrate that it is possible to distinguish populations of cells based on the nuclear organization of a chromatin-associated protein, NuMA. This work paves the way for our longer term goal of producing a method capable of turning high resolution fluorescence images of human mammary epithelial tissue into tissue-maps that report the probable non-neoplastic, premalignant and malignant phenotype at cellular resolution.
Our phenotype clustering algorithm is based on the radial distribution of LBFs. To group the LBF distribution of thousands of nuclei into clusters of similar patterns, we first tested traditional clustering approaches, including the most widely used K-means, fuzzy C-means clustering, Gaussian mixture model (with a spherical kernel), hierarchical clustering (with the complete link scheme), and the spectral clustering methods [6–14].
Since different clustering methods generate different clusters, we computed the pair-wise F-measure score to evaluate the consistency between different clustering results. The F-measure is defined as follows. For any two data partitions $U$ and $V$, denote the $i$th cluster in partition $U$ as $u_i$ and the $j$th cluster in partition $V$ as $v_j$. The proportion of data in $u_i$ that is also in $v_j$ is $R = |u_i \cap v_j| / |u_i|$, and the proportion of data in $v_j$ that is also in $u_i$ is $P = |u_i \cap v_j| / |v_j|$. Define $F(i, j) = 2PR/(P + R)$. The score that measures the consistency of partition $V$ with partition $U$ is $F_0 = \left[\sum_i |u_i| \max_j F(i, j)\right] / \left[\sum_i |u_i|\right]$, where $|u_i|$ is the number of data points in $u_i$. To make the measure symmetric, the final F-measure is defined as $F = (F_0 + F_0')/2$, where $F_0'$ denotes the same score computed with the roles of $U$ and $V$ exchanged.
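The definition above can be sketched in a few lines of code. This is a minimal illustration, assuming each partition is given as a list of cluster labels aligned by data point; the function names `directed_f` and `f_measure` are chosen for this example only.

```python
from collections import Counter


def directed_f(u_labels, v_labels):
    """F0: consistency of partition V with partition U (size-weighted best match)."""
    u_sizes = Counter(u_labels)
    v_sizes = Counter(v_labels)
    overlap = Counter(zip(u_labels, v_labels))  # |u_i ∩ v_j| for each pair (i, j)
    total = 0.0
    for i, size_u in u_sizes.items():
        best = 0.0
        for j, size_v in v_sizes.items():
            common = overlap.get((i, j), 0)
            if common == 0:
                continue  # F(i, j) is taken as 0 when the clusters do not overlap
            recall = common / size_u      # R in the text
            precision = common / size_v   # P in the text
            best = max(best, 2 * precision * recall / (precision + recall))
        total += size_u * best
    return total / len(u_labels)


def f_measure(u_labels, v_labels):
    """Symmetric F-measure: average of the two directed scores."""
    return 0.5 * (directed_f(u_labels, v_labels) + directed_f(v_labels, u_labels))


# Identical partitions with permuted label names agree perfectly.
score = f_measure([0, 0, 1, 1], [1, 1, 0, 0])
```

Because the score depends only on cluster membership, it is unaffected by how each method happens to number its clusters.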
The probabilistic ensemble clustering approach we used to derive the consensus clusters from multiple clustering results is based on general Bayesian latent variable induction [21–23]. Suppose we have $M$ different clustering approaches, generating $M$ data partitions $C_i$ ($i = 1, \ldots, M$) of the same dataset $D$ containing $N$ data points. Our purpose is to infer the optimal consensus data partition $L$ from the multiple partitions $C_i$. One simple yet reasonable assumption is that all $M$ clustering results $C_1, \ldots, C_M$ can be treated as independent samples drawn from the same underlying distribution $L$. In other words, we can assume that the distributions of $C_1, \ldots, C_M$ are conditionally independent of each other given the latent variable $L$. This assumption allows us to consider the following Bayesian latent variable induction model.
Suppose the $i$th clustering approach divides the dataset into $r_i$ clusters; then each $C_i$ has $r_i$ states (categorical labels), i.e., $1, \ldots, r_i$. Initially the consensus $L$ may divide the dataset into $k$ clusters (the final value $k^*$ is automatically determined; see below), so $L$ has $k$ states, i.e., $1, \ldots, k$. Since each LBF distribution vector in the dataset is assigned a cluster label by $C_i$, it takes a specific state value on $C_i$. Denote $s = (C_1 = c_1, C_2 = c_2, \ldots, C_M = c_M)$, where $c_i$ ($i \in [1, M]$) takes one state in $1, \ldots, r_i$.
where $j$ denotes the $j$th data point in the dataset $D$. $P(C_i = c_i \mid L = l)$ ($i \in [1, M]$) can be easily obtained by counting and normalizing the occurrence frequency of data that are assigned the state label $c_i$ by the clustering method $C_i$, given that the data are assigned the state label $l$ in $L$. Once $P(L = l \mid s)$ is available, we use it to resample and update the state label of each data point in $L$. The above process repeats until no data point changes state. This leads to the estimation of an optimal consensus function $L$ for a specified number of clusters, $k$.
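Under the conditional-independence assumption stated above, the posterior used in the update step would take the usual naive-Bayes form. The expression below is a sketch written in the notation of this section, with $c_i^{(j)}$ (a symbol introduced here for illustration) denoting the label that method $C_i$ assigns to the $j$th data point; it may differ in presentation from the article's Eq. (1).

```latex
P\bigl(L = l \mid s^{(j)}\bigr)
  = \frac{P(L = l)\,\prod_{i=1}^{M} P\bigl(C_i = c_i^{(j)} \mid L = l\bigr)}
         {\sum_{l'=1}^{k} P(L = l')\,\prod_{i=1}^{M} P\bigl(C_i = c_i^{(j)} \mid L = l'\bigr)}
```

The denominator only normalizes over the $k$ states of $L$, so in practice it is enough to compute the numerator for each state and rescale.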
It is apparent that we can maximize the likelihood in Eq. (2) to find the best $k$ over a specified range. In practice, we can often avoid the iteration in Eq. (2) by directly assigning a large $k$. After convergence in solving Eq. (1), there are $k^*$ ($k \geq k^*$) states in $L$ that have a non-zero number of data points. This $k^*$ value is the statistically optimal $k$ value, determined automatically.
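The procedure described in this section can be sketched as follows: start from a deliberately large $k$, repeatedly re-estimate the conditional probabilities by counting, resample each label from its posterior, and finally count the non-empty states to obtain $k^*$. This is a simplified illustration rather than the authors' implementation. The function name `consensus_clusters`, the smoothing constant `alpha`, the iteration cap `n_iter`, and the random initialization are all assumptions of this sketch, and input labels are assumed to be integers starting at 0.

```python
import random
from collections import Counter


def consensus_clusters(partitions, k, n_iter=200, alpha=1e-3, seed=0):
    """Resampling-based consensus labels from several partitions (sketch)."""
    rng = random.Random(seed)
    n = len(partitions[0])
    n_states = [max(p) + 1 for p in partitions]  # r_i for each method
    labels = [rng.randrange(k) for _ in range(n)]  # random initial state of L

    for _ in range(n_iter):
        size = Counter(labels)  # number of points currently in each state of L
        joint = [Counter(zip(labels, p)) for p in partitions]
        new_labels = []
        for j in range(n):
            weights = []
            for l in range(k):
                if size[l] == 0:
                    weights.append(0.0)  # an empty state stays empty
                    continue
                w = size[l] / n  # prior P(L = l), estimated by counting
                for i, p in enumerate(partitions):
                    # Smoothed estimate of P(C_i = c_i | L = l).
                    w *= (joint[i][(l, p[j])] + alpha) / (size[l] + alpha * n_states[i])
                weights.append(w)
            # Resample the state of point j from its unnormalized posterior.
            new_labels.append(rng.choices(range(k), weights=weights)[0])
        if new_labels == labels:
            break  # no data point changed state
        labels = new_labels
    return labels


# Three toy clustering results over six points; k is set larger than needed.
toy = [[0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0], [0, 0, 0, 2, 2, 2]]
consensus = consensus_clusters(toy, k=5)
k_star = len(set(consensus))  # number of non-empty states of L
```

Because the labels are resampled, different seeds can give different consensus partitions; in this sketch states can only empty out, never reappear, which is what lets $k^*$ fall below the initial $k$.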
Once we obtained reliable clusters of LBF distributions of individual nuclei, we analyzed how the cells belonging to different phenotypes, defined by the behavior of the cells (i.e., S1 and T4-2 cells cultured for different numbers of days), were distributed across the various LBF clusters. For this purpose, we counted the number of nuclei whose LBF distribution fell into each cluster for each phenotype, i.e., S1 cells cultured for 3, 5, 10, and 12 days, and T4-2 cells cultured for 4, 5, 10, and 11 days. By doing so, we obtained the cluster histogram of each phenotype, represented by the percentage of nuclei as a function of cluster. The cluster histograms not only link directly to the predefined phenotypes (as shown in Figure 3) but also provide more detailed information than cell malignancy and days in culture alone.
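The counting step above amounts to a per-phenotype normalized histogram over cluster labels. A minimal sketch follows, assuming one phenotype label and one consensus cluster index per nucleus; the function name `cluster_histograms` and the toy labels are illustrative only.

```python
from collections import Counter


def cluster_histograms(phenotypes, clusters, n_clusters):
    """Percentage of nuclei of each phenotype that fall into each cluster."""
    counts = {}
    for phenotype, cluster in zip(phenotypes, clusters):
        counts.setdefault(phenotype, Counter())[cluster] += 1
    histograms = {}
    for phenotype, counter in counts.items():
        total = sum(counter.values())
        histograms[phenotype] = [100.0 * counter[c] / total for c in range(n_clusters)]
    return histograms


# Toy input: six nuclei, two phenotypes, three clusters (not real data).
hist = cluster_histograms(
    ["S1-day 3", "S1-day 3", "S1-day 10", "S1-day 10", "S1-day 10", "S1-day 10"],
    [0, 1, 2, 2, 2, 0],
    n_clusters=3,
)
```

Normalizing to percentages makes histograms comparable between phenotypes that contain different numbers of nuclei.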
Our next step is to determine the likelihood of these potential groupings. Assume we want to divide the predefined phenotypes into p groups (where p = 2, 3, 4 in the above example). We then grouped the cluster histograms of the 77 S1 cell images into the same number of clusters. To improve reliability we again used multiple clustering algorithms, including K-means, fuzzy C-means clustering, hierarchical clustering, Gaussian mixture model, and spectral clustering, as used in generating the LBF clusters (see Figure 9b). We then paired each clustering result with the phenotype grouping under consideration, and calculated the degree of agreement between them using the F-measure. We then selected the maximum F-score as the confidence of the corresponding cell phenotype grouping (see Figure 9c). By repeating the process for each potential phenotype grouping, we finally obtained the value of the confidence as a function of the different cases of phenotype grouping.
To further test the sensitivity of this method to the number of clusters predefined when generating the clusters of LBF distributions using the five traditional clustering approaches, we repeated the process for different numbers of clusters predefined for the traditional methods and obtained a set of confidence values for each phenotype grouping case, as indicated by the colored dots in each bin of Figure 9d. The result exhibits a central tendency, indicating that the method is insensitive to the number of clusters predefined in clustering the LBF distributions. We then took the median of the confidence values obtained under the different numbers of clusters in each bin as the overall confidence value of the corresponding phenotype grouping.
Given p, the number of groups that the predefined phenotype should be grouped into, we selected from all the phenotype grouping cases that have the same number of groups the one that has the maximum confidence value, as the most likely phenotype grouping case under the given p. For instance, if we want to group the predefined phenotypes into 2 groups, i.e., p = 2, there are three phenotype grouping cases, corresponding to the first three bins in Figure 9d and the first three rows in Figure 9a. The second case has the maximum confidence value (indicated by the left-most dashed ellipse in Figure 9d, which corresponds to the second row of Figure 9a) and is thus taken as the right way of grouping the predefined phenotypes into 2 groups. This means that S1 cells cultured for 10 and 12 days (i.e., images 1–45) belong to one group, and those cultured for 3 and 5 days belong to another (i.e., images 46–77). Using this approach, we determined the most likely phenotype grouping for p = 3 and p = 4, which correspond to the 6th and 7th bin in Figure 9d and the 6th and 7th row in Figure 9a respectively. These three phenotype groupings constitute the first to the third level of the phenotype tree as shown in Figure 4a.
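The selection step above can be sketched as an enumeration followed by a maximum over median confidences. The sketch assumes that candidate groupings merge only phenotypes adjacent in culture time, which is inferred from the seven bins described for four S1 phenotypes (three cases for p = 2, three for p = 3, one for p = 4) and is not stated explicitly; the ordering of the bins is likewise an assumption. The function names `contiguous_groupings` and `best_grouping` are hypothetical.

```python
from itertools import combinations
from statistics import median


def contiguous_groupings(phenotypes):
    """All ways to split an ordered list of phenotypes into 2..n adjacent groups."""
    n = len(phenotypes)
    result = []
    for p in range(2, n + 1):
        # Choose p - 1 cut positions between neighbouring phenotypes.
        for cuts in combinations(range(1, n), p - 1):
            bounds = (0,) + cuts + (n,)
            result.append([phenotypes[a:b] for a, b in zip(bounds, bounds[1:])])
    return result


def best_grouping(groupings, confidences, p):
    """Grouping with p groups whose median confidence is largest.

    `confidences[idx]` holds the confidence values obtained for grouping
    `idx` under the different predefined numbers of clusters.
    """
    candidates = [
        (median(confidences[idx]), idx)
        for idx, grouping in enumerate(groupings)
        if len(grouping) == p
    ]
    score, idx = max(candidates)
    return groupings[idx], score


# S1 phenotypes in the order of the image dataset (days 12, 10, 5, 3).
groupings = contiguous_groupings(["S1-day 12", "S1-day 10", "S1-day 5", "S1-day 3"])
```

With this ordering, the second grouping separates days 10 and 12 from days 3 and 5, and the sixth merges only days 10 and 12, matching the cases selected in the text for p = 2 and p = 3.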
This work was supported by the Department of Defense-Breast Cancer Research Program/DOD-BCRP (DAMD-170210440 to D.W.K.), the National Institutes of Health, National Cancer Institute (1 R33 CA118479-01 to D.W.K.), and a grant from the "Friends For An Earlier Breast Cancer Test" Foundation to S.A.L.
This article has been published as part of BMC Cell Biology Volume 8 Supplement 1, 2007: 2006 International Workshop on Multiscale Biological Imaging, Data Mining and Informatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2121/8?issue=S1
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.