Integrated Multivariate Segmentation Tree for Heterogeneous Credit Data Analysis in Small- and Medium-Sized Enterprises
Traditional decision tree models, which rely exclusively on numerical variables, often struggle with high-dimensional data and cannot readily incorporate textual information. To address these limitations, we propose the integrated multivariate segmentation tree (IMST), a framework that improves credit evaluation for small- and medium-sized enterprises (SMEs) by integrating financial data with textual sources. The method comprises three stages: (1) transforming textual data into numerical matrices through matrix factorization, (2) selecting salient financial features using Lasso regression, and (3) constructing a multivariate segmentation tree based on either the Gini index or entropy, with weakest-link pruning applied to control model complexity. In experiments on a dataset of 1,428 Chinese SMEs, IMST achieved an accuracy of 88.9%, surpassing both baseline decision trees (87.4%) and conventional models such as support vector machines and neural networks. The proposed model was also more interpretable and computationally efficient, with a more streamlined architecture and improved risk detection.
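To make the third stage concrete, the following minimal Python sketch illustrates the two splitting criteria named above (Gini index and entropy), how a multivariate split on a linear combination of features could be scored, and the standard weakest-link quantity used in cost-complexity pruning. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy data (one hypothetical financial ratio and one hypothetical text-derived factor per firm), the weights, and the threshold are all invented for this example.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index of a label list: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_impurity(rows, labels, weights, threshold, criterion=gini):
    """Size-weighted child impurity for the multivariate split w . x <= threshold."""
    left, right = [], []
    for x, y in zip(rows, labels):
        score = sum(w * xi for w, xi in zip(weights, x))
        (left if score <= threshold else right).append(y)
    n = len(labels)
    return (len(left) / n) * criterion(left) + (len(right) / n) * criterion(right)


def weakest_link_alpha(node_error, subtree_error, n_leaves):
    """Cost-complexity g(t) = (R(t) - R(T_t)) / (|leaves(T_t)| - 1).

    The internal node with the smallest g(t) is the "weakest link"
    and is collapsed first during pruning.
    """
    return (node_error - subtree_error) / (n_leaves - 1)


# Hypothetical toy data: (financial ratio, text-derived factor) per firm,
# with label 1 = default and 0 = non-default.
rows = [(0.2, 0.1), (0.4, 0.3), (0.9, 0.8), (0.7, 0.9)]
labels = [0, 0, 1, 1]

parent_gini = gini(labels)
parent_entropy = entropy(labels)

# Hypothetical split using both features at once: x1 + x2 <= 1.0.
child_gini = split_impurity(rows, labels, weights=(1.0, 1.0), threshold=1.0)
gini_gain = parent_gini - child_gini

# Hypothetical error rates for a node and its 3-leaf subtree.
alpha = weakest_link_alpha(node_error=0.3, subtree_error=0.1, n_leaves=3)
```

A univariate tree would test one feature per node; the split scored here uses a weighted sum of several features, which is the sense of "multivariate" assumed in this sketch. How IMST chooses the weights and thresholds is not specified in the abstract, so they are fixed by hand here.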