Visual-inertial Odometry Based on Attention and Local Interaction
Fund: National Natural Science Foundation of China (62172061); Sichuan Provincial "Open Competition" Project (2023YFG0374)




Abstract:

Visual-inertial odometry (VIO) achieves pose estimation by fusing visual and inertial data. In complex environments, inertial data are prone to noise interference, and long-term motion leads to cumulative errors. Additionally, most VIO models overlook local information interaction between modalities and fail to fully exploit their complementary nature, thereby compromising pose estimation accuracy. To address these issues, this study proposes an attention and local interaction-based visual-inertial odometry (ALVIO) model. First, the model extracts visual and inertial features separately. Then, the historical time-series information of the inertial features is preserved, and a channel attention mechanism based on the discrete cosine transform (DCT) is applied to enhance informative low-frequency features and suppress high-frequency noise. Next, a multi-modal local interaction and global fusion module is designed, which progressively achieves local interaction and global fusion between modalities through an improved split-attention mechanism and an MLP-Mixer. This module adjusts local feature weights according to the contributions of the different modalities to realize inter-modal complementarity, and then integrates the features globally to obtain a unified representation. Finally, the fused features undergo temporal modeling and pose regression to produce relative poses. To verify the effectiveness of the model in complex environments, experiments are conducted on degraded, low-quality versions of the public KITTI and EuRoC datasets. The results show that, compared with the direct feature concatenation model, the multi-head attention fusion model, and the soft mask fusion model, ALVIO reduces translation error by 49.92%, 32.82%, and 37.74%, respectively, and rotation error by 51.34%, 25.96%, and 29.54%, respectively, while also demonstrating higher efficiency and robustness.
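The DCT-based channel attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name `dct_channel_attention`, the use of a single DCT-II basis, and the one-step sigmoid gate are illustrative assumptions (the actual model presumably learns an excitation network over several frequency components).

```python
import numpy as np

def dct_basis(length, k):
    # k-th 1-D DCT-II basis vector of the given length
    n = np.arange(length)
    return np.cos(np.pi * k * (n + 0.5) / length)

def dct_channel_attention(x, freq_idx=0):
    """x: (C, T) inertial feature map (C channels, T time steps).

    Projects each channel onto a low-frequency DCT basis (freq_idx),
    passes the coefficient through a sigmoid gate, and rescales the
    channels -- emphasising slowly varying low-frequency components
    and attenuating channels dominated by high-frequency noise.
    """
    C, T = x.shape
    basis = dct_basis(T, freq_idx)        # (T,)
    z = x @ basis / T                     # per-channel DCT coefficient, (C,)
    w = 1.0 / (1.0 + np.exp(-z))          # sigmoid gate, (C,)
    return x * w[:, None]                 # channel-wise reweighting
```

A constant (purely low-frequency) channel receives a gate above 0.5 and is amplified relative to a zero channel, which is exactly the low-frequency-enhancing behaviour the abstract describes.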
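The two-stage fusion (local interaction via split attention, then global fusion via an MLP-Mixer) can likewise be sketched in NumPy. Everything here is a simplified assumption: the real module presumably uses learned projections, LayerNorm, and non-linearities, none of which appear in this sketch.

```python
import numpy as np

def softmax(a, axis=0):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def split_attention_fuse(v, m):
    """v, m: (T, D) visual / inertial features over T time steps.

    Treats the two modalities as splits: a softmax over each split's
    pooled descriptor yields per-dimension weights, so the modality
    that responds more strongly at a given feature dimension
    contributes more to the fused representation (the "local
    interaction" weighted by modality contribution).
    """
    splits = np.stack([v, m])                  # (2, T, D)
    pooled = splits.mean(axis=1)               # (2, D) descriptor per split
    attn = softmax(pooled, axis=0)             # (2, D) split weights sum to 1
    return (splits * attn[:, None, :]).sum(0)  # (T, D) convex combination

def mlp_mixer_block(x, w_tok, w_ch):
    """One simplified Mixer block: token mixing across time steps,
    then channel mixing across feature dimensions, each with a
    residual connection (LayerNorm and GELU omitted for brevity)."""
    x = x + (w_tok @ x)   # mix information across time (tokens)
    x = x + (x @ w_ch)    # mix information across channels
    return x
```

Because the split weights sum to 1 per dimension, the fused feature is a per-element convex combination of the two modalities; the Mixer block then mixes this fused sequence globally along both the time and channel axes.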

Cite this article:

Wang SL, Shen Y. Visual-inertial odometry based on attention and local interaction. Computer Systems & Applications,,():1-14 (in Chinese)

History
  • Received: 2024-12-10
  • Revised: 2025-02-12
  • Accepted:
  • Published online: 2025-06-20
  • Published: