To address the limitations of traditional evaluation methods for underactuated robot design schemes in terms of efficiency, accuracy, and automation, this work proposes an evaluation method based on vision-text multimodal large models. First, a tri-modal input pipeline comprising structural parameters, design documentation, and structural images was constructed, and the inputs were cleaned, standardized, and preprocessed into a unified representation. Modality-specific features were then extracted with a multilayer perceptron for the structured data, a pretrained BERT model for the text, and a vision transformer for the images, and these features were fused through a cross-attention mechanism to capture deep cross-modal correlations. A nonlinear mapping network was subsequently designed to model the relationship between the fused features and core evaluation metrics such as functionality, safety, and control performance. Finally, a structured evaluation report was generated automatically with the locally deployed DeepSeek-VL R1 7B large language model, completing the transformation from feature understanding to semantic output. Experiments were conducted on a self-constructed robot dataset of 300 complete design cases, and a bridge crane control system design was selected as a test case to validate the model. The results showed that the proposed method achieved an average deviation of 1.8% from expert scores and a correlation coefficient of 0.94 with them; the automatically generated reports demonstrated strong professionalism and engineering applicability. By integrating multimodal semantic modeling with language generation, the proposed method significantly enhances the intelligence, standardization, and interpretability of underactuated robot design evaluation, providing robust technical support for design quality control and intelligent decision-making in complex engineering systems.
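To make the described pipeline concrete, the following is a minimal PyTorch sketch of the fusion-and-scoring stage. All layer sizes, token counts, and the three-metric output head are illustrative assumptions, not the paper's actual configuration; the BERT and ViT branches are represented here by pre-extracted token features.

```python
# Minimal sketch of the tri-modal fusion and scoring stage. Dimensions,
# layer sizes, and the three evaluation heads (functionality, safety,
# control performance) are illustrative assumptions.
import torch
import torch.nn as nn


class TriModalEvaluator(nn.Module):
    def __init__(self, param_dim=32, text_dim=768, img_dim=768,
                 d_model=256, n_heads=4, n_metrics=3):
        super().__init__()
        # MLP branch for structured design parameters.
        self.param_mlp = nn.Sequential(
            nn.Linear(param_dim, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        # Linear projections for pre-extracted BERT / ViT token features.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.img_proj = nn.Linear(img_dim, d_model)
        # Cross-attention: parameter features query text and image tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        # Nonlinear mapping from fused features to evaluation metrics.
        self.score_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, n_metrics),
        )

    def forward(self, params, text_feats, img_feats):
        # params: (B, param_dim); text_feats: (B, T, text_dim);
        # img_feats: (B, P, img_dim) -- token sequences from BERT / ViT.
        q = self.param_mlp(params).unsqueeze(1)             # (B, 1, d_model)
        kv = torch.cat([self.text_proj(text_feats),
                        self.img_proj(img_feats)], dim=1)   # (B, T+P, d_model)
        fused, _ = self.cross_attn(q, kv, kv)               # (B, 1, d_model)
        return self.score_head(fused.squeeze(1))            # (B, n_metrics)


# Example: one design case with 32 structured parameters, 64 text tokens,
# and 197 image tokens (CLS + 14x14 patches from a ViT-Base).
model = TriModalEvaluator()
scores = model(torch.randn(1, 32),
               torch.randn(1, 64, 768),
               torch.randn(1, 197, 768))
print(scores.shape)  # torch.Size([1, 3])
```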
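The report-generation step can likewise be sketched as a call to a locally served language model. The endpoint URL, served model name, and prompt template below are hypothetical placeholders, assuming the local deployment exposes an OpenAI-compatible chat API; the paper's actual DeepSeek-VL integration may differ.

```python
# Minimal sketch of turning predicted metric scores into a structured
# evaluation report via a locally deployed language model. Endpoint,
# model name, and prompt are hypothetical placeholders.
import json
import requests

LOCAL_LLM_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical

def generate_report(scores: dict) -> str:
    prompt = (
        "You are an engineering reviewer. Write a structured evaluation "
        "report (overview, strengths, risks, recommendation) for an "
        "underactuated robot design with these predicted scores:\n"
        + json.dumps(scores, indent=2)
    )
    resp = requests.post(LOCAL_LLM_URL, json={
        "model": "deepseek-vl-7b",            # hypothetical served name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(generate_report(
    {"functionality": 8.6, "safety": 9.1, "control_performance": 8.3}))
```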
Key words: multi-modal vision and text / underactuated robots / deep evaluation model / cross-attention mechanism / large language model
References
[1] MINOR M, DULIMARTA H, DANGHI G, et al. Design, Implementation, and Evaluation of an Under-actuated Miniature Biped Climbing Robot[C]//Proceedings of the 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2000). Paris: IEEE, 2000, 3: 1999-2005.
[2] LI Y, QIN P, YANG H F. Design and Simulation of Suspension Cooperative Robot[J]. Journal of Chongqing University of Technology (Natural Science), 2020, 34(4): 130-135.
[3] CHAPELLE F, BIDAUD P. Evaluation Functions Synthesis for Optimal Design of Hyper-redundant Robotic Systems[J]. Mechanism and Machine Theory, 2006, 41(10): 1196-1212.
[4] ZHU J J, CAI X J. Research on Reliability Evaluation Method for Industrial Robot Based on Multi-criteria Decision-making and GO Method[J]. Machine Tool & Hydraulics, 2024, 52(17): 39-45.
[5] CHEN P. Research on Motion Reliability of Robot Based on Improved Fourth-order Moment Estimation Method[D]. Handan: Hebei University of Engineering, 2019: 13-18.
[6] URREA C, PASCAL J. Design, Simulation, Comparison and Evaluation of Parameter Identification Methods for an Industrial Robot[J]. Computers & Electrical Engineering, 2018, 67: 791-806.
[7] REYES F, KELLY R. Experimental Evaluation of Identification Schemes on a Direct Drive Robot[J]. Robotica, 1997, 15(5): 563-571.
[8] ALT B, ZAHN J, KIENLE C, et al. Human-AI Interaction in Industrial Robotics: Design and Empirical Evaluation of a User Interface for Explainable AI-Based Robot Program Optimization[J]. Procedia CIRP, 2024, 130: 591-596.
[9] WANG J W, LUO J J, WANG H B, et al. Design and Performance Evaluation of Visual Endotracheal Intubation Robot Platform[J]. China Sciencepaper, 2023, 18(8): 921-926.
[10] CARRARA G, KALAY Y E, NOVEMBRI G. Multi-modal Representation of Design Knowledge[J]. Automation in Construction, 1992, 1(2): 111-121.
[11] FENG F, WANG X, LI R. Cross-modal Retrieval with Correspondence Autoencoder[C]//Proceedings of the 22nd ACM International Conference on Multimedia. Orlando: ACM, 2014: 7-16.
[12] MA J, FAN M H, MA L S, et al. Innovative Product Design Schemes Based on Image-text Multi-modal Fusion Reasoning[J]. Packaging Engineering, 2024, 45(8): 21-28.
[13] SONG B, MILLER S, AHMED F. Attention-enhanced Multimodal Learning for Conceptual Design Evaluations[J]. Journal of Mechanical Design, 2023, 145(4): 041410.
[14] SU H, SONG B, AHMED F. Multi-modal Machine Learning for Vehicle Rating Predictions Using Image, Text, and Parametric Data[C]//Proceedings of the International Design Engineering Technical Conferences and Computers and Information in Engineering Conference. Boston: ASME, 2023.
[15] FAN Y, ZHOU Y, YUAN Z. Interior Design Evaluation Based on Deep Learning: A Multi-Modal Fusion Evaluation Mechanism[J]. Mathematics, 2024, 12(10): 1560.
[16] HE B, WANG S, LIU Y. Underactuated Robotics: A Review[J]. International Journal of Advanced Robotic Systems, 2019, 16(4): 1729881419862164.
[17] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis: ACL, 2019: 4171-4186.
[18] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale[EB/OL]. (2022-02-12) [2024-03-23]. arXiv preprint arXiv:2010.11929, 2020.
[19] LU H, LIU W, ZHANG B, et al. DeepSeek-VL: Towards Real-world Vision-language Understanding[EB/OL]. (2024-03-08) [2024-09-23]. arXiv preprint arXiv:2403.05525, 2024.