Performance Evaluation and Analysis of Machine Learning Algorithms for Predicting Periodontal Disease in Korean Adults
Abstract
Objectives
This study aimed to enhance the accuracy of predicting periodontal disease using machine learning algorithms and to identify key risk factors essential for developing personalized prevention and management strategies.
Methods
Data from 11,781 adults aged 19 years or older were obtained from the 7th Korea National Health and Nutrition Examination Survey (2016–2018). Five machine learning algorithms (logistic regression, decision tree, random forest, extreme gradient boosting, and CatBoost) were applied. Models were trained and evaluated using a complex sampling design and 10-fold cross-validation.
Results
The prevalence of periodontal disease was 27.8%. The CatBoost model demonstrated the highest predictive performance (AUC: 0.760). Age, sex, and education level were identified as key predictors, significantly influencing model accuracy.
Conclusions
This study highlights the potential of machine learning-based prediction models in the early detection of periodontal disease and the development of personalized prevention strategies.
INTRODUCTION
Periodontal disease is a globally prevalent inflammatory condition affecting the tissues that support the teeth [1]. It is characterized by gum inflammation and bleeding and, in severe cases, tooth loss, which can lead to long-term health complications. Therefore, early detection and timely treatment are critical [2,3]. While traditional diagnostic methods for periodontal disease are effective, they often rely heavily on clinical expertise and involve invasive procedures such as probing depth measurement, which requires the use of a periodontal probe to assess pocket depth and inflammation levels. These methods can be uncomfortable for patients and may yield inconsistent results due to variations in examiner proficiency. Consequently, interest has grown in recent years in non-invasive, data-driven predictive approaches, which offer a more standardized and patient-friendly alternative [4].
With recent advancements in artificial intelligence, particularly machine learning, novel solutions to complex medical challenges have emerged [5]. Unlike traditional statistical methods that rely on predefined assumptions and linear relationships, machine learning algorithms can independently process large datasets, uncover intricate nonlinear patterns, and significantly improve predictive accuracy [6,7]. This capability makes them a valuable tool for predicting periodontal disease.
Early studies on machine learning applications in periodontal disease prediction primarily focused on conventional algorithms with limited datasets. For instance, Farhadian et al. [8] utilized a support vector machine (SVM) model to classify periodontal disease, but the study was constrained by a small sample size, limiting its generalizability. Similarly, Patel et al. [9] employed an extreme gradient boosting (XGBoost) algorithm to categorize periodontal disease severity; however, the model lacked external validation, raising concerns about its robustness across diverse populations. More recent studies have incorporated multiple machine learning algorithms to improve predictive accuracy. Elani et al. [10] compared various models, including logistic regression (LR), random forest (RF), light gradient boosting machine, XGBoost, and artificial neural networks, for tooth loss prediction, leveraging a larger dataset. Additionally, Kim et al. [5] applied machine learning to analyze bacterial composition in saliva, offering a novel approach to predicting chronic periodontitis severity. Despite these advancements, many existing studies still face limitations such as small sample sizes, lack of external validation, and reliance on single-source datasets. These challenges underscore the need for further research utilizing large-scale, nationally representative datasets to enhance model robustness and clinical applicability.
Despite the promise of artificial intelligence technology in dentistry, various significant challenges hinder its widespread application. One of the key obstacles in developing and applying machine learning-based models for predicting periodontal disease is the need for reliable, large-scale datasets. Insufficient data can undermine the accuracy and robustness of these models, thus limiting their effective implementation in clinical practice [5]. To improve the predictive capabilities of machine learning models and manage periodontal disease effectively, it is crucial to collect diverse and comprehensive patient data. These challenges can be mitigated by harnessing big data, which encompasses thousands to tens of thousands of individuals, reducing biases and overfitting risks during model development, and thereby improving model reliability [7].
This study aims to develop and evaluate machine learning models for predicting periodontal disease using big data from the Korea National Health and Nutrition Examination Survey (KNHANES). Furthermore, this study aims to identify key determinants of periodontal disease and compare the performance of different machine learning algorithms to determine the most effective model.
Through this approach, the study seeks to enhance the diagnosis of periodontal disease by offering a non-invasive, efficient, and precise means for early detection. Such advancements are expected to improve clinical decision-making and patient outcomes by integrating machine learning models into routine practice, thereby reducing the overall burden of periodontal disease. Ultimately, this research aims to expand the body of knowledge in medical artificial intelligence and demonstrate the practical application of machine learning in predicting periodontal disease.
METHODS
Survey and participants
This study utilized raw data from the 7th KNHANES, conducted between 2016 and 2018 [11]. KNHANES is a nationally representative survey designed to assess the health and nutritional status of the Korean population using a stratified, multistage probability sampling method. Given its robust sampling design, the findings of this study can be reliably generalized to the broader Korean adult population. The study population was limited to adults aged ≥19 years from a total of 24,214 participants. Among those who participated in the health survey, oral health survey, and oral examination, individuals with missing data for key variables were excluded from the analysis. Consequently, the final analytical sample consisted of 11,781 participants with complete data. This study utilized publicly available data from the KNHANES, and the use of these data was exempt from additional ethical approval. The survey protocol of KNHANES was approved by the Institutional Review Board of the Korea Centers for Disease Control and Prevention (Approval Number: 2018-01-03-P-A).
Variable definitions
Based on existing research and literature, variables closely associated with periodontal disease were selected to enhance the accuracy of the predictive model and to assess the impact of each variable on the condition. The demographic variables included sex, age, education level, and household income. Age was categorized into three groups: 19-39, 40-64, and ≥65 years. Education level was classified into four categories: elementary school graduation or less, middle school graduation, high school graduation, and college graduation or higher. Household income was divided into quartiles and categorized as low, lower-middle, upper-middle, and high.
Health and oral health behavior factors considered in the study included smoking status, alcohol consumption, daily toothbrushing frequency, and whether a dental check-up was performed in the past year. Smoking status was divided into three categories: never smoker, former smoker, and current smoker. Alcohol consumption was classified as less than once per month or once or more per month. Daily toothbrushing frequency was categorized as less than three times per day or three or more times per day. The variable indicating whether a dental check-up had been conducted within the past year was classified as yes or no.
Systemic diseases included hypertension and diabetes, and the presence of these conditions was categorized as yes or no. Mental health factors included levels of stress and depression, with stress categorized as high or low, and the presence of depression categorized as yes or no.
Periodontal disease was assessed using the Community Periodontal Index (CPI) based on the oral health survey methods outlined in the KNHANES. The examination was performed on six designated sextants: the upper right posterior, upper anterior, upper left posterior, lower right posterior, lower anterior, and lower left posterior teeth. Periodontal status was defined according to the CPI scoring system as follows: 0 indicated healthy periodontal tissue, 1 indicated bleeding tissue, 2 indicated tissue with calculus, 3 indicated shallow periodontal pockets, and 4 indicated deep periodontal pockets. The highest score recorded among the six sextants determined the periodontal status of each participant. Participants were classified as having periodontal disease if their CPI score was 3 or 4.
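The classification rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name and the `threshold` parameter are hypothetical, and it assumes all six sextant scores are present as integers from 0 to 4 (how excluded or unscorable sextants were coded in the survey data is not described here).

```python
from typing import Sequence


def has_periodontal_disease(sextant_scores: Sequence[int], threshold: int = 3) -> bool:
    """Apply the CPI rule: the highest sextant score decides the status.

    Assumption: scores are integers 0-4 with no missing-value codes.
    """
    if not sextant_scores:
        raise ValueError("at least one sextant score is required")
    if any(score not in range(5) for score in sextant_scores):
        raise ValueError("CPI scores must be integers from 0 to 4")
    # A participant is a case when the worst sextant shows a periodontal
    # pocket (CPI 3 = shallow, CPI 4 = deep).
    return max(sextant_scores) >= threshold
```

For example, a participant with scores `[0, 1, 2, 3, 0, 1]` would be classified as having periodontal disease because one sextant scored 3, whereas `[0, 1, 2, 2, 2, 1]` would not.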
Statistical analysis
To compare the prevalence of periodontal disease across participant characteristics, a chi-square test was performed. A complex sampling design was employed in the data analysis, incorporating individual weights based on cluster sampling variables and variance estimation strata. Statistical analysis was conducted using SAS version 9.4 software (SAS Institute Inc., Cary, NC, USA).
For the machine learning analysis, the entire dataset was randomly divided into a training set (80%) and a test set (20%) using a fixed random seed to ensure reproducibility. To mitigate overfitting in predictive models, a 10-fold cross-validation technique was employed, and hyperparameter tuning was performed via grid search.
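The split and cross-validation scheme can be illustrated with the standard library alone. The sketch below is an assumption-laden illustration rather than the study's implementation: the seed value 42 and both function names are hypothetical, the split is not stratified by outcome, and the 10 folds are drawn from the training set only.

```python
import random
from typing import Iterator, List, Tuple


def train_test_indices(n: int, test_fraction: float = 0.2,
                       seed: int = 42) -> Tuple[List[int], List[int]]:
    """Shuffle row indices with a fixed seed and split them into train/test."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_idx: List[int],
                   k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    # Every k-th index goes to the same fold, so fold sizes differ by at most 1.
    folds = [train_idx[i::k] for i in range(k)]
    for i in range(k):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, folds[i]
```

With 11,781 participants, `train_test_indices(11781)` gives 9,425 training rows and 2,356 test rows; each of the 10 validation folds then holds 942 or 943 training rows.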
Logistic regression (LR), decision tree (DT), random forest (RF), XGBoost, and categorical boosting (CatBoost) were utilized for model development. These machine learning algorithms were selected based on their validated performance in medical prediction research, interpretability, and ability to process structured health data. Traditional models such as LR and DT were included for their simplicity and interpretability, while ensemble models like RF, XGBoost, and CatBoost were chosen due to their strong predictive performance and robustness in structured tabular datasets, as demonstrated in previous studies. The entire machine learning pipeline, including data preprocessing, model training, and evaluation, is summarized in Figure 1.
Figure 1. The predictive model building pipeline. Split the dataset into 80% training set and 20% testing set. Train the model using the training set and find the optimal hyperparameters through 10-fold cross-validation. Retrain the model using the entire training set with the optimal hyperparameters. Evaluate the final model performance on the testing set.
The optimal hyperparameter combinations for each machine learning model were selected based on the ROC-AUC score as the optimization criterion. Logistic regression applied an L2 penalty to prevent overfitting while maintaining interpretability. Decision tree was configured with a maximum depth of 8 and a minimum leaf sample size of 13 to control model complexity. Random forest was set with a maximum depth of 5, while XGBoost was optimized with a maximum depth of 10 and a learning rate of 0.01. CatBoost was configured with a learning rate of 0.01, depth of 5, and L2 regularization of 5 to enhance model stability and predictive performance.
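Grid search of this kind amounts to scoring every combination of candidate values and keeping the best one. The sketch below shows the mechanics only. The `grid_search` helper, the candidate values in `example_grid`, and `toy_score` are all hypothetical: in the study the score would be the mean cross-validated ROC AUC of a fitted model, whereas the toy function here merely peaks at one combination so the example runs without any data.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Tuple


def grid_search(param_grid: Dict[str, List[Any]],
                score_fn: Callable[[Dict[str, Any]], float]
                ) -> Tuple[Dict[str, Any], float]:
    """Evaluate every parameter combination and return the best one."""
    names = list(param_grid)
    best_params: Dict[str, Any] = {}
    best_score = float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # stand-in for mean cross-validated ROC AUC
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values; the grids actually searched are not reported.
example_grid = {"depth": [3, 5, 8], "learning_rate": [0.01, 0.1], "l2_leaf_reg": [1, 5]}


def toy_score(params: Dict[str, Any]) -> float:
    """Toy objective (not a real metric): highest at depth 5, rate 0.01, L2 of 5."""
    return -(abs(params["depth"] - 5)
             + abs(params["learning_rate"] - 0.01)
             + abs(params["l2_leaf_reg"] - 5))


best_params, best_score = grid_search(example_grid, toy_score)
```

The cost grows multiplicatively with the number of candidate values per parameter (here 3 × 2 × 2 = 12 evaluations), and each evaluation in a real run would itself involve 10 model fits.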
These hyperparameter values were selected to optimize the predictive performance of each model while maintaining generalizability. To address class imbalance in the dataset, the Synthetic Minority Over-sampling Technique (SMOTE) was applied during the training phase to augment the minority class (periodontal disease group) and improve model learning. This approach reduces the risk that the model becomes biased towards the majority class and is intended to improve predictive performance.
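The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The following is a simplified sketch of that idea using only the standard library, not the implementation used in the study: `smote_sketch` and its parameters are hypothetical names, features are treated as numeric, and neighbours are found by brute force.

```python
import math
import random
from typing import List, Sequence


def smote_sketch(minority: Sequence[Sequence[float]], n_new: int,
                 k: int = 5, seed: int = 0) -> List[List[float]]:
    """Generate n_new synthetic points by interpolating minority samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (nb - b) for b, nb in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies on a segment between two real minority samples, oversampling should be applied to training data only, so that the held-out test set contains no interpolated records.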
The performance of these predictive models was assessed using several metrics. Accuracy represented the proportion of correct predictions out of all predictions. Precision measured the proportion of predicted positives that were actually positive, while recall (sensitivity) measured the proportion of actual positives correctly identified as positive. Specificity measured the proportion of actual negatives correctly identified, and the F1 score represented the harmonic mean of precision and recall. The area under the receiver operating characteristic curve (ROC AUC) was also used to assess overall model performance, with values closer to 1 indicating superior predictive ability. After evaluating model performance, the importance of predictive variables was analyzed using the machine learning model with the highest AUC score.
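For concreteness, these metrics can be computed from the four confusion-matrix counts, and ROC AUC can be computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below uses only the standard library; the function names are illustrative, and the counts and scores in the usage note are made-up toy values, not results from this study.

```python
from typing import Dict, Sequence


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute threshold-based metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With toy counts `tp=8, fp=2, tn=6, fn=4`, accuracy is 0.70, precision 0.80, recall about 0.667, and specificity 0.75. The pairwise AUC is quadratic in sample size, so it is suited to illustration rather than large datasets.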
The periodontal disease prediction models were developed and evaluated using Python (version 3.8.8), scikit-learn (version 1.3.0), NumPy (version 1.22.0), and related library packages.
RESULTS
Prevalence of periodontal disease in study participants
The overall prevalence of periodontal disease among the study participants was 27.8%, with men (34.1%) showing a higher prevalence than women (22.2%). The prevalence increased with age, from 7.9% in the 19-39 age group to 42.8% in those aged 65 years or older. Higher levels of education and income were associated with lower prevalence, with 44.6% among those with an elementary school education or less compared to 17.6% among college graduates, and 38.5% among low-income individuals compared to 20.6% among high-income individuals.
Current smokers had the highest prevalence at 38.3%, followed by former smokers (33.2%) and never smokers (22.1%). The difference by alcohol consumption was minimal, with a prevalence of 27.7% among those who drank at least once per month compared to 26.7% among those who drank less frequently. Individuals who brushed their teeth fewer than three times per day had a higher prevalence (32.0%) compared to those who brushed three or more times per day (22.8%). Participants who had received a dental check-up in the past year had a lower prevalence (22.4%) than those who had not (30.0%).
Hypertension and diabetes were strongly associated with higher prevalence, with 43.0% among individuals with hypertension compared to 22.8% among those without, and 47.5% among individuals with diabetes compared to 25.2% among those without. Prevalence was also higher among those reporting high stress (28.1%) compared to low stress (24.7%), and among those with depression (31.0%) compared to those without (27.0%) (Table 1).
Comparison of machine learning algorithm performance for predicting periodontal disease
A comparison of the predictive performance of various machine learning algorithms for periodontal disease revealed that CatBoost exhibited the highest ROC AUC score (0.760±0.013), indicating superior discriminative ability between positive and negative cases. This was followed by XGBoost (0.752±0.014), decision tree (0.750±0.014), random forest (0.750±0.016), and logistic regression (0.731±0.012).
In terms of accuracy, CatBoost (0.691±0.013) and decision tree (0.690±0.014) achieved the highest scores, followed closely by random forest (0.684±0.013) and XGBoost (0.684±0.013). Logistic regression demonstrated slightly lower accuracy (0.672±0.012).
Regarding precision, which measures the proportion of correctly identified positive cases among all predicted positives, logistic regression (0.677±0.012) performed best, followed by CatBoost (0.669±0.011) and decision tree (0.667±0.012).
For recall (sensitivity), which reflects the model's ability to correctly identify true positive cases, XGBoost (0.778±0.031) achieved the highest score, followed by CatBoost (0.763±0.019), decision tree (0.762±0.035), and random forest (0.760±0.025). Logistic regression exhibited the lowest recall (0.656±0.014), suggesting that it may have missed a higher proportion of true positive cases compared to other models.
Specificity, which assesses the ability to correctly classify true negative cases, was highest for logistic regression (0.688±0.016), indicating strong performance in correctly identifying negative cases. Decision tree (0.620±0.025) and CatBoost (0.615±0.015) had lower specificity, meaning they were more prone to false positive classifications.
The F1 score, which balances precision and recall, was highest for both CatBoost and XGBoost (0.712±0.013 and 0.712±0.016, respectively), followed by decision tree (0.711±0.017) and random forest (0.710±0.014). Logistic regression had the lowest F1 score (0.666±0.012), reflecting its trade-off between high precision and lower recall.
In terms of actual classification performance, XGBoost recorded the highest number of true positives (5,220), followed by CatBoost (5,187), random forest (5,164), and decision tree (5,046). However, logistic regression identified the highest number of true negatives (4,679), which aligns with its superior specificity.
Overall, CatBoost and XGBoost demonstrated strong predictive power with high ROC AUC and recall, while logistic regression excelled in precision and specificity. Random forest and decision tree showed balanced performance across different metrics, with decision tree demonstrating the second-highest accuracy after CatBoost (Table 2).
The ROC curves illustrating the performance of each algorithm are displayed in Figure 2.
Ranking of importance for periodontal disease risk factors in the CatBoost model
The analysis of variable importance in the CatBoost model, which demonstrated the highest ROC AUC performance, revealed that age was the most important variable for predicting periodontal disease. Sex was identified as the second most important variable, followed by education level as the third (Figure 3).
Figure 3. Importance ranking of 12 periodontal disease risk factors in the CatBoost model. CatBoost, categorical boosting.
SHapley Additive exPlanations (SHAP) analysis was performed using the CatBoost model to further interpret feature importance. The results indicated that age, sex, and education level were the most influential predictors of periodontal disease, with older age being the strongest risk factor, followed by male sex and lower education levels.
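SHAP attributes a prediction to individual features using Shapley values: each feature's average marginal contribution over all orders in which features could be "switched on" from a baseline. The brute-force sketch below illustrates that definition on a toy model. It is not the algorithm the SHAP library applies to tree ensembles and not the authors' configuration, which is not described; `exact_shapley`, `toy_model`, and all numeric values are hypothetical.

```python
from itertools import permutations
from math import factorial
from typing import Callable, List, Sequence


def exact_shapley(model: Callable[[Sequence[float]], float],
                  instance: Sequence[float],
                  baseline: Sequence[float]) -> List[float]:
    """Exact Shapley values by enumerating every feature ordering."""
    n = len(instance)
    contributions = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous = model(current)
        for feature in order:
            # Switch this feature from its baseline value to the instance value
            # and record how much the model output changes.
            current[feature] = instance[feature]
            value = model(current)
            contributions[feature] += value - previous
            previous = value
    n_orders = factorial(n)
    return [c / n_orders for c in contributions]


def toy_model(x: Sequence[float]) -> float:
    """Toy score with an interaction between features 0 and 1; feature 2 is unused."""
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[1]


phi = exact_shapley(toy_model, instance=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
```

Here the interaction term is split evenly between the two interacting features, the unused feature receives zero, and the attributions sum to the difference between the model output at the instance and at the baseline. Enumeration scales factorially, which is why practical tools rely on model-specific shortcuts.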
DISCUSSION
This study aimed to develop and evaluate predictive models for periodontal disease using multiple machine learning algorithms. A total of five machine learning algorithms were evaluated, all of which demonstrated relatively high ROC AUC values, confirming the utility of big data-driven models in predicting periodontal disease. Previous studies have reported ROC AUC values ranging from 0.60 to 0.80 [12], 0.69 to 0.72 [9], 0.702 to 0.712 [13], and 0.770 to 0.878 [14]. The ROC AUC values obtained in this study were comparable to those reported in prior studies, but not exceptionally high. This indicates that while the models exhibit solid predictive performance, there is room for further improvement [15,16].
Among the five algorithms evaluated, CatBoost achieved the highest ROC AUC score. CatBoost offers unique advantages, such as automatically handling categorical variables and enabling efficient training and prediction for large datasets [17]. However, its performance may significantly depend on the characteristics of the dataset, which warrants caution when interpreting the results. CatBoost's gradient boosting techniques and regularization methods effectively capture complex relationships specific to the training data, but this does not guarantee the same level of performance with external datasets [17,18]. Therefore, further validation with external datasets is necessary to confirm the generalizability of CatBoost's performance.
The analysis of variable importance revealed that age was the most significant factor across all models, highlighting its strong association with the occurrence and progression of periodontal disease [1,19]. As age increases, periodontal tissues degenerate, and immune responses weaken, leading to increased susceptibility to periodontal disease [3]. Additionally, periodontal disease has been identified as a major cause of tooth loss after the age of 40 [20]. These findings underscore the importance of developing age-specific prevention and management strategies to reduce disease progression in older adults. Early detection is crucial for preventing disease progression and minimizing long-term complications. Machine learning models for periodontal disease prediction can aid early diagnosis by identifying high-risk individuals before clinical symptoms become severe. This is especially valuable in resource-limited settings where access to specialized dental care is scarce. By integrating these predictive models with existing diagnostic tools, such as clinical examinations and radiographic assessments, healthcare providers can enhance risk stratification and implement timely interventions, ultimately improving periodontal disease management and reducing its burden in vulnerable populations.
Sex was identified as the second most significant factor, consistent with studies highlighting differences in periodontal disease prevalence and progression between men and women [21]. Men typically exhibit higher prevalence rates due to greater exposure to risk factors such as smoking and alcohol consumption [22]. In contrast, hormonal changes in women during pregnancy, menstrual cycles, and menopause can exacerbate periodontal disease by altering inflammatory responses and the oral microbial environment [23]. These findings highlight the need for sex-specific management strategies: men require targeted interventions addressing behavioral risk factors, while women may benefit from tailored programs that consider hormonal influences on periodontal health.
Education level emerged as the third most significant variable, reflecting its impact on oral health awareness and preventive behaviors. Higher education levels are associated with better oral hygiene practices and more frequent dental check-ups, reducing the risk of periodontal disease [24]. In contrast, individuals with lower education levels are more likely to face barriers to accessing preventive care, resulting in poorer periodontal health outcomes. Expanding oral health education programs targeting populations with lower education levels is essential for reducing the incidence and severity of periodontal disease [25].
Based on these findings, several policy recommendations can be proposed. First, age-specific programs should target middle-aged and older adults, emphasizing regular dental check-ups and preventive care. Second, sex-specific strategies are needed. For men, educational interventions addressing behavioral risk factors such as smoking and alcohol consumption are essential, while tailored programs for women that consider hormonal influences should be prioritized. Third, oral health education programs should be expanded for individuals with lower education levels to improve preventive behaviors and access to care, ultimately contributing to better oral health outcomes.
This study has certain limitations. First, while the use of nationally representative public data enhances the reliability of our findings, it may limit their generalizability to specific populations or regions. Additionally, data heterogeneity remains a challenge, as variations in demographic characteristics, clinical assessment methods, and data collection protocols can influence model performance. Ethical considerations, such as appropriate data usage and privacy protection, are also crucial when utilizing large-scale health datasets. Second, the lack of external validation restricts the ability to assess the generalizability of the predictive models across diverse datasets. Although the selected machine learning algorithms are well suited for handling categorical data and training on large datasets, their performance may vary depending on dataset characteristics. To address these challenges, this study leveraged the KNHANES dataset, which follows standardized data collection protocols and provides a nationally representative sample, thereby helping to reduce variability in data collection. However, external validation using independent datasets remains necessary to further assess model robustness and generalizability. Third, potential biases in variable selection and model optimization may have influenced the results. Nonetheless, this study highlights the utility of machine learning algorithms for predicting periodontal disease. The use of cross-validation and grid search enhanced the reliability of the models, providing valuable insights into their practical applications.
Future studies should investigate additional variables influencing periodontal disease risk, such as genetic predisposition, dietary habits, and systemic inflammatory markers. External validation using independent datasets is also necessary to confirm the robustness and generalizability of the predictive models.
In addition, long-term follow-up studies are needed to track disease progression over time and enhance predictive models by identifying key early indicators. Such efforts will contribute to the development of more effective prevention and management strategies for periodontal disease.
