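System Test Framework of Stream Data for Stock Trading Analysis Scenario
Authors: Shi Lingyun, Zheng Yingying, Tan Li, Xu Lijie, Wang Wei, Wei Jun
Funding:

Beijing Natural Science Foundation (4172013); Beijing Natural Science Foundation-Haidian Original Innovation Joint Fund (L182007); National Natural Science Foundation of China (61802377, 61702020) and its supporting project (PXM2018_014213_000033); National Key Research and Development Program of China (2016YFD0401104)
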
    Abstract:

    Distributed cluster environments make real-time data computation more complex, and the correctness of streaming big data processing systems is difficult to guarantee. Existing big data benchmarking frameworks can test the performance of such systems, but they commonly suffer from overly simple application scenarios and insufficient evaluation metrics. To address this challenge, this study constructs a streaming big data benchmarking framework for stock trading scenarios. The framework generates high-frequency stock trading data through a streaming data generator and tests system performance under high input rates in terms of latency, throughput, GC time, CPU resource usage, and other metrics. It also verifies the scalability of streaming big data systems through horizontal scaling tests. Apache Spark Streaming is used as the system under test. The experimental results show that performance degrades under high input rates, with increased latency and longer GC time, and that the causes are the higher system input rate and the increased parallelism.
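
    The measurement idea in the abstract can be illustrated with a minimal sketch. It is not the authors' framework: the `Trade` fields, the ticker symbols, the evenly spaced event times, and the fixed 50 ms processing delay are all assumptions made for illustration. The sketch generates synthetic trades at a target rate and derives throughput and latency from per-record event and completion times.

```python
import random
from dataclasses import dataclass


@dataclass
class Trade:
    symbol: str          # hypothetical ticker symbol
    price: float
    volume: int
    event_time_ms: int   # time at which the record was generated


def generate_trades(rate_per_sec, duration_sec, symbols=("AAA", "BBB", "CCC"), seed=0):
    """Yield synthetic trades with evenly spaced event times at the given rate."""
    rng = random.Random(seed)
    total = rate_per_sec * duration_sec
    for i in range(total):
        yield Trade(
            symbol=rng.choice(symbols),
            price=round(rng.uniform(10.0, 100.0), 2),
            volume=rng.randint(1, 10) * 100,
            event_time_ms=i * 1000 // rate_per_sec,
        )


def summarize(event_times_ms, done_times_ms):
    """Return throughput (records/s) plus mean and max latency (ms)."""
    latencies = [d - e for e, d in zip(event_times_ms, done_times_ms)]
    span_ms = max(done_times_ms) - min(event_times_ms)
    return {
        "throughput": len(latencies) * 1000 / span_ms,
        "mean_latency_ms": sum(latencies) / len(latencies),
        "max_latency_ms": max(latencies),
    }


trades = list(generate_trades(rate_per_sec=1000, duration_sec=2))
events = [t.event_time_ms for t in trades]
# Assumption: every record leaves the system under test 50 ms after generation.
# In a real run these completion times would come from the system's output.
done = [e + 50 for e in events]
stats = summarize(events, done)
```

    Raising `rate_per_sec` while recording real completion times is one way to observe how latency changes with input rate; GC time and CPU usage would have to be collected from the system under test and are outside this sketch.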

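Cite this article

Shi Lingyun, Zheng Yingying, Tan Li, Xu Lijie, Wang Wei, Wei Jun. System Test Framework of Stream Data for Stock Trading Analysis Scenario. Computer Systems & Applications, 2020, 29(4): 76-83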

History
  • Received: 2019-09-04
  • Last revised: 2019-09-23
  • Published online: 2020-04-09
  • Publication date: 2020-04-15