Predicting the distribution of organic carbon content in surface sediments of the eastern China seas using random forest algorithm

Abstract: Clarifying the distribution characteristics and controlling factors of sedimentary organic carbon in China's marginal seas helps to establish an organic carbon cycle model for the East Asian marginal seas and its "source-to-sink" pattern. Currently, organic carbon distribution maps for the eastern China marginal seas are drawn mainly by mathematical interpolation between sampling stations. This approach is strongly limited by the location and number of sampling stations. It also ignores how the samples differ in environmental factors such as seawater physicochemical properties, seabed topography, and ocean currents, and so oversimplifies a complex geological problem. Machine learning methods can extract key information from high-dimensional, complex data and build a mapping between environmental attribute features and the target variable. In this study, the Random Forest (RF) algorithm, a commonly used machine learning method, was trained on the mapping between 405 marine sediment organic carbon measurements and 50 environmental attribute features, and was then used to predict the organic carbon content of surface sediments in the eastern China marginal seas. Compared with the organic carbon distribution map produced by Kriging interpolation from the same number of samples, the RF predictions had a smaller mean absolute error, root mean square error, and maximum residual. The ten-fold cross-validation R² reached 0.60, indicating relatively high fitting accuracy. In particular, for sea areas with low sampling density or with gaps caused by sampling difficulties, the RF algorithm predicted surface sediment organic carbon content more accurately, showing more realistic predictions and an advantage in extrapolation. The RF model established in this study can also serve as a reference for predicting other geochemical indicators of marine sediments, and it has practical significance for resource investigation and environmental protection in the eastern China marginal seas.
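The evaluation metrics named in the abstract (mean absolute error, root mean square error, maximum residual, and R²) and the ten-fold split can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function names are hypothetical, the observed and predicted values are made-up numbers, and the contiguous fold assignment is an assumption, since the abstract does not say how the folds were drawn (in practice the samples would usually be shuffled first).

```python
import math


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def root_mean_square_error(observed, predicted):
    """Square root of the mean squared difference."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def max_residual(observed, predicted):
    """Largest absolute difference over all samples."""
    return max(abs(o - p) for o, p in zip(observed, predicted))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Made-up organic carbon contents, for illustration only.
observed = [0.2, 0.5, 0.8, 1.1]
predicted = [0.3, 0.4, 0.9, 0.8]

metrics = {
    "MAE": mean_absolute_error(observed, predicted),
    "RMSE": root_mean_square_error(observed, predicted),
    "max residual": max_residual(observed, predicted),
    "R2": r_squared(observed, predicted),
}

# 405 samples, as in the abstract, split into ten folds.
folds = list(k_fold_indices(405, k=10))
```

In a ten-fold check, each fold is held out once, the model is fitted on the remaining nine folds, and the metrics are computed on the held-out predictions. Applying the same metric functions to both the RF and the Kriging predictions gives the kind of side-by-side comparison the abstract describes.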

     
