Introduction
“The Bethesda System for Reporting Thyroid Cytopathology” (TBSRTC) classifies patients with thyroid nodules into six Bethesda categories (I-VI) according to risk of malignancy (ROM) to guide therapeutic decision making (e.g. active surveillance, surgery, radioactive iodine, targeted therapy). TBSRTC provides not only definitions, cytomorphologic criteria, and explanatory notes, but also a clinical management plan for each of the six diagnostic categories, including the role of molecular testing.1, 2 The 2023 edition of TBSRTC clarifies the diagnostic criteria and uses a single name for each of the six categories: non-diagnostic (ND), benign (B), atypia of undetermined significance (AUS), follicular neoplasm (FN), suspicious for malignancy (SM), and malignant (M).3 The statistical validity of fine-needle aspiration cytology (FNAC) in patients with thyroid nodules is very high, with diagnostic accuracy approaching 100% for thyroid malignancy.4, 5, 6 However, the indeterminate categories (III-V) may require molecular testing to decide between surgical and conservative treatment. Some of the new borderline ‘low-risk’ entities are also ‘molecular indeterminate’, and molecular testing is not available at many institutions.7 Morphologic evaluation of FNAC smears therefore remains the gold standard screening test for the assessment of patients with thyroid nodules. Nevertheless, significant interobserver variability exists in the indeterminate categories because morphological interpretation is subjective. This study aims to investigate the interobserver reproducibility and usefulness of the 2023 TBSRTC classification scheme in the routine diagnosis and management of thyroid lesions.
Materials and Methods
FNAC smears from patients with thyroid nodules, collected over a 2-year period from October 2022 to September 2024, were analysed retrospectively in this observational study. Wright-Giemsa- and Leishman-stained smears and the relevant clinical and radiological details of 100 patients were retrieved from cytopathology records. Two pathologists then reviewed and reclassified the cases according to the 2023 edition of TBSRTC. Smears with well-defined cellular morphology were included, whereas smears with extensive cellular overlapping, unclear morphology, or degenerative changes were excluded.
The FNAC smears were reclassified according to the diagnostic criteria described in the 2023 TBSRTC and assigned a single category name: non-diagnostic (ND), benign (B), atypia of undetermined significance (AUS), follicular neoplasm (FN), suspicious for malignancy (SM), or malignant (M) (Figure 1 A-F).3 Each reviewer independently assigned a category to each case, and interobserver agreement was calculated as the percentage of cases on which the two pathologists agreed. Cross-tabulation and Cohen's kappa were used to estimate the degree of reproducibility and to confirm statistically significant agreement.8 Cohen's kappa, which measures agreement beyond that expected by chance, was interpreted as follows: 0–0.2, poor agreement; 0.3–0.4, fair agreement; 0.5–0.6, moderate agreement; 0.7–0.8, strong agreement; and >0.8, excellent agreement.
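To make the statistic concrete, the following minimal Python sketch computes Cohen's kappa from two raters' category labels using the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name and the toy ratings are hypothetical illustrations and are not the study data or the software used by the authors.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of cases given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over categories of the product of the two raters'
    # marginal proportions (a missing category counts as zero).
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy ratings (NOT the study data).
a = ["B", "B", "B", "FN", "M", "AUS"]
b = ["B", "B", "FN", "FN", "M", "SM"]
kappa = cohen_kappa(a, b)
print(round(kappa, 3))
```

Unlike simple percent agreement, kappa falls when most cases sit in one dominant category, because much of the raw agreement could then arise by chance.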
As this was a retrospective observational study, patient identities were not disclosed and institutional ethical clearance was not obtained.
Results
Demographic evaluation of the 100 patients with thyroid nodules revealed 82 females (82%) and 18 males (18%), a female-to-male ratio of approximately 4.6:1. Age ranged from 5 to 70 years, with a mean of 38.5 years. The most commonly affected age group was 21–50 years.
In the present study, the two observers agreed in 95 of 100 cases (95%) and disagreed in 5 cases (5%). The concordance rate was 100% in the ND and M categories, whereas in the benign category (Figure 1 A, B) interobserver agreement was 95% (76 cases). The agreement rate was lower in the AUS (2 cases; 66%), FN (8 cases; 66%), and SM (2 cases; 66%) categories. Interobserver variation was therefore seen in Bethesda categories II, III, IV and V (Table 1). Four cytologically discordant cases were diagnosed by the first observer as hyperplastic nodule in nodular goiter (B category) and were placed in the FN category by the second observer. Disagreement was also seen in one case that the first observer placed in the AUS category and the second observer placed in the SM category. In this case, the suspicion of malignancy was higher than for AUS but lower than for a malignant diagnosis. In Bethesda category VI (M), the cytomorphologic features in all 6 cases were consistent with papillary thyroid carcinoma (PTC) (Figure 1 E, F). Cohen's kappa was calculated to confirm agreement beyond chance and indicated “excellent agreement” (kappa = 0.867).
In summary, the two observers agreed in 95 of 100 cases (95%). By number of concordant cases, agreement was greatest in the benign (B) category (76 of 80 cases; 95%), followed by 8 cases in follicular neoplasm (FN), 6 cases in malignant (M), and 2 cases each in atypia of undetermined significance (AUS) and suspicious for malignancy (SM).
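The reported figures can be cross-checked with a short Python sketch. The cross-tabulation below is a reconstruction from the counts given in the text, not a copy of Table 1; in particular, the single ND case is inferred by subtraction (95 concordant cases minus the 94 itemised in the other categories). The category-specific rate used here, concordant cases divided by cases assigned to that category by either observer, is an assumed definition chosen because it reproduces the reported 95% and 66% values.

```python
# Categories in Bethesda order (I-VI).
CATS = ["ND", "B", "AUS", "FN", "SM", "M"]

# Rows: first observer; columns: second observer.
# Reconstructed from the text (assumption), not taken from Table 1.
table = [
    [1,  0, 0, 0, 0, 0],  # ND (count inferred by subtraction)
    [0, 76, 0, 4, 0, 0],  # B: 4 cases placed in FN by the second observer
    [0,  0, 2, 0, 1, 0],  # AUS: 1 case placed in SM by the second observer
    [0,  0, 0, 8, 0, 0],  # FN
    [0,  0, 0, 0, 2, 0],  # SM
    [0,  0, 0, 0, 0, 6],  # M
]

n = sum(map(sum, table))
row_tot = [sum(r) for r in table]        # first observer's totals
col_tot = [sum(c) for c in zip(*table)]  # second observer's totals
diag = [table[i][i] for i in range(len(CATS))]

# Overall observed agreement and chance-expected agreement.
p_o = sum(diag) / n
p_e = sum(r * c for r, c in zip(row_tot, col_tot)) / n**2
kappa = (p_o - p_e) / (1 - p_e)

# Category-specific agreement: concordant cases over cases assigned to the
# category by either observer (assumed definition).
specific = [d / (r + c - d) for d, r, c in zip(diag, row_tot, col_tot)]

print(f"observed agreement = {p_o:.2f}, kappa = {kappa:.4f}")
for name, rate in zip(CATS, specific):
    print(f"{name}: {rate:.1%}")
```

Under these assumptions the sketch gives an observed agreement of 0.95 and a kappa of about 0.8675, in line with the reported 0.867.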
Discussion
The Bethesda System for Reporting Thyroid Cytopathology (TBSRTC) is an international reporting system that not only standardizes the reporting of thyroid FNA but also improves communication between pathologists and surgeons. To align with the 2022 WHO Thyroid Tumor Classification, the 2023 TBSRTC simplified the diagnostic criteria and added terminology. Instead of multiple, potentially confusing names, each category is assigned a single name: Bethesda I, non-diagnostic (ND); Bethesda II, benign (B); Bethesda III, atypia of undetermined significance (AUS); Bethesda IV, follicular neoplasm (FN); Bethesda V, suspicious for malignancy (SM); and Bethesda VI, malignant (M). The implied risk of malignancy (ROM) for each category has been updated on the basis of data reported after the 2017 edition of TBSRTC.9 However, indeterminate categories continue to exist in the 2023 edition of the reporting system.3 In this context, the excellent interobserver reproducibility observed in this study (Cohen’s kappa 0.867) is particularly valuable, because the clinical relevance of any categorization scheme requires both diagnostic accuracy and interobserver reproducibility. This result is consistent with the findings of other studies in the literature.8, 10, 11, 12, 13 Cohen’s kappa values in these studies ranged from 0.61 to 0.99, which is comparable to our kappa value (Table 2).
Table 2
Study | Interobserver reproducibility
---|---
Awasthi et al.8 | Good agreement (Cohen’s kappa score 0.613)
Ahmed et al.11 | Strong agreement (Cohen’s kappa score 0.735)
Anand et al.13 | Excellent agreement (Cohen’s kappa score 0.99)
Our study | Excellent agreement (Cohen’s kappa score 0.867)
However, diagnostic disparity was found mostly in the Bethesda II category, with four discrepant cases that the first observer placed in the B category and the second observer placed in the FN category because of high cellularity and a microfollicular pattern (Figure 1 D). Similar disagreement has been observed in earlier studies.8 A clear distinction between hyperplastic nodule in multinodular goiter and follicular adenoma is not possible by FNA and does not have much clinical significance.14, 15
Interobserver disagreement was also seen in one Bethesda III case, which the first observer diagnosed as AUS with nuclear atypia and the second observer diagnosed as SM. This was due to low cellularity and focal papillary-like nuclear features (Figure 1 C). The cytologic features were strongly suspicious of malignancy but were not sufficient for a conclusive diagnosis. The purpose of separating Bethesda V from Bethesda VI is to preserve the very high positive predictive value of the malignant category without compromising the overall sensitivity of FNAC. However, the mode of treatment for both Bethesda V and Bethesda VI is near-total thyroidectomy, so the distinction between the two does not have much clinical significance.4
TBSRTC has proven to be a highly efficient diagnostic tool for reporting thyroid lesions. Its six-tier structure not only contributes to sound risk stratification and management but also helps pathologists faced with challenging cases, allowing them to approach a case with the clinical outcome in mind rather than focusing solely on labeling a diagnosis.7
Compared with previous editions, the 2023 edition of TBSRTC reduces the number of cases assigned to indeterminate categories, probably because of its stricter diagnostic criteria. In addition, TBSRTC provides a common language for clear communication, reduces the number of inappropriate surgeries in benign cases, and enables timely surgery in patients with malignant lesions. It also allows a simple and reliable exchange of data between institutions throughout the world.16, 17
Although the interobserver reliability observed in this study was excellent, a few pitfalls were noted. First, the overlapping cytological features of hyperplastic nodule and follicular adenoma led to interobserver variability. Second, in the AUS and SM categories, focal nuclear atypia may easily be missed by an inexperienced pathologist. However, the assessment of cases with nuclear atypia in the AUS and SM categories could be improved by molecular testing.
Conclusion
To conclude, the use of TBSRTC should be encouraged in our country because it reduces interobserver variability. Although interobserver agreement was excellent in this study, disagreements were seen in Bethesda categories II, III, IV and V, which is consistent with the findings of studies done elsewhere. Incorporating novel molecular data into the cytology classification scheme could help improve morphologic diagnosis in the indeterminate categories and prevent unnecessary surgery.