Received: September 23, 2023; Revised: October 20, 2023
Abstract: Data plays a critical role in research and development in fields such as machine learning and artificial intelligence. In practice, however, factors such as privacy concerns, data scarcity, and poor data quality often prevent data consumers from obtaining real datasets that meet their requirements. To address this, this study proposes KMSI, a non-normal data synthesis algorithm that improves on the sampling-iteration (SI) technique. KMSI uses a mixed-type correlation coefficient matrix to reduce the measurement error that the SI technique incurs in steps such as target setting and loop control, and it replaces Bootstrap sampling with kernel density estimation sampling to avoid using real data. Experimental results show that, compared with the SI technique, KMSI can handle datasets with complex distributions and mixed data types, and its synthetic output contains no real data. Compared with other improved methods, KMSI gives users more freedom in choosing the sample size of the synthetic dataset.
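The difference between Bootstrap sampling and kernel density estimation (KDE) sampling mentioned in the abstract can be illustrated with a minimal sketch. This is not the KMSI algorithm itself: it covers a single continuous column only, omits the mixed-type correlation matrix and the iteration loop, and assumes a Gaussian kernel with a normal-reference rule-of-thumb bandwidth. The function names and the example data are hypothetical.

```python
import random
import statistics


def rule_of_thumb_bandwidth(data):
    """Normal-reference rule-of-thumb bandwidth for a 1-D Gaussian kernel.

    Assumption for this sketch; the paper may choose the bandwidth differently.
    """
    n = len(data)
    return 1.06 * statistics.stdev(data) * n ** (-1 / 5)


def bootstrap_sample(data, size, rng):
    """Bootstrap: resample observed values with replacement.

    Every returned value is a copy of a real observation.
    """
    return [rng.choice(data) for _ in range(size)]


def kde_sample(data, size, rng):
    """Draw from a Gaussian KDE fitted to `data`.

    Sampling from a Gaussian KDE is equivalent to picking one observation
    uniformly at random (the kernel centre) and adding Gaussian noise whose
    standard deviation is the bandwidth.
    """
    h = rule_of_thumb_bandwidth(data)
    return [rng.gauss(rng.choice(data), h) for _ in range(size)]


rng = random.Random(0)

# Hypothetical skewed (non-normal) "real" column, for illustration only.
real = [rng.expovariate(1.0) for _ in range(200)]

# The requested sample size is independent of the size of the real data.
boot = bootstrap_sample(real, 500, rng)
synth = kde_sample(real, 500, rng)
```

In this sketch every Bootstrap draw is a real observation, whereas the KDE draws are new values scattered around the observations. Note that a Gaussian kernel can place mass outside the support of the real data (for example, negative values for a non-negative column), so a practical implementation would likely need to handle boundaries.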
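Funding: Joint Funds of the National Natural Science Foundation of China (U1536122); Major Special Project of the Tianjin Municipal Science and Technology Commission (15ZXDSGX00030)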
Citation:
WANG Chun-Dong, ZHANG Shi-Peng. Non-normal Data Synthesis Based on Mixed Data Type Correlation Measurement. COMPUTER SYSTEMS APPLICATIONS, 2024, 33(3): 195-205