Conflict of Interest: In relation to this article, we declare that there is no conflict of interest.

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Accuracy of machine-learning prediction of deep eutectic solvent density: focusing on the effects of input conditions and data types

Machine learning for deep eutectic solvent density: Impact of feature representations and dataset complexity on predictive reliability

YoonKook Park¹†

¹Department of Biological and Chemical Engineering, Hongik University

In Press, Journal Pre-proof, Available online 1 May 2026

Abstract

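Accurate prediction of the density of deep eutectic solvents (DESs) is very important for optimizing green separation processes. This study examines how feature representations (ChemBERTa, hybrid, and critical property models) and data partitioning (binary, ternary, and full datasets) affect machine learning predictions. In an evaluation of random forest (RF), XGBoost, CatBoost, and artificial neural network (ANN) models, the tree-based ensembles consistently achieved good results (R² > 0.93) even when predicting from limited data. The ANN, by contrast, required explicit physical descriptors or a large dataset of 12,000 or more data points to avoid overfitting. In cross-domain validation, extrapolation from simple to complex systems was strongly constrained by limited thermodynamic diversity, whereas models specialized from the full dataset showed excellent transferability. These results show that combining large, diverse data with ensemble algorithms or physics-informed features is a prerequisite for the reliable computation of multicomponent DES properties.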

Accurately predicting the density of deep eutectic solvents (DESs) is crucial for optimizing green separation processes. This study investigates how feature representations (ChemBERTa, hybrid, and critical property models) and data partitioning (binary, ternary, and comprehensive datasets) affect machine learning predictions. An evaluation of random forest (RF), XGBoost, CatBoost, and artificial neural network (ANN) models showed that the tree-based ensembles are highly robust, consistently achieving R² > 0.93 even on limited datasets. Conversely, ANNs required explicit physical descriptors or large datasets (>12,000 points) to prevent overfitting. Cross-domain validation showed that extrapolating from simple to complex systems fails because the simpler data cover a restricted range of thermodynamic diversity, whereas models specialized from a comprehensive dataset transfer well. These findings indicate that combining large, diverse datasets with ensemble algorithms or physics-informed features is essential for the reliable computational design of multicomponent DES properties.
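The cross-domain validation mentioned in the abstract can be illustrated with a minimal sketch. It is not the study's pipeline: the one-feature least-squares line is a stand-in for the RF, XGBoost, CatBoost, and ANN models, the temperature and density values are invented for illustration, and the names `fit_line` and `cross_domain_r2` are hypothetical. The sketch only shows how R² is computed when a model fitted on one domain (here labeled "binary") is scored on another ("ternary").

```python
# Toy illustration of cross-domain validation scored with R^2.
# All data and the one-feature linear model are hypothetical stand-ins,
# not the paper's datasets, features, or algorithms.

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (stand-in regressor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def cross_domain_r2(train, test):
    """Fit on the training domain, then score on the test domain."""
    a, b = fit_line(*train)
    xs, ys = test
    return r2_score(ys, [a * x + b for x in xs])

# Invented (temperature in K, density in g/cm^3) pairs for two domains.
temps = [293.15, 303.15, 313.15, 323.15, 333.15]
binary = (temps, [1.200, 1.194, 1.188, 1.182, 1.176])
ternary = (temps, [1.150, 1.143, 1.136, 1.129, 1.122])

r2_in_domain = cross_domain_r2(binary, binary)   # same-domain score
r2_cross = cross_domain_r2(binary, ternary)      # cross-domain score
print(f"in-domain R2 = {r2_in_domain:.3f}, cross-domain R2 = {r2_cross:.3f}")
```

In this toy case the in-domain score is essentially 1 while the cross-domain score is negative, because the model's predictions are offset from every value in the unseen domain. A negative R² means the model does worse than predicting the test-set mean, which is one way a failed extrapolation can show up in the metric.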

Keywords

deep eutectic solvent; density; machine learning; ensemble; feature