OptiM-RoViT: A Robust Multimodal Sentiment Analysis Framework with Dynamic Fusion and Noise-Aware Vision Transformers
DOI:
https://doi.org/10.47852/bonviewJCCE62026113
Keywords:
multimodal sentiment analysis, RoBERTa, Vision Transformer, hyperparameter optimization, deep learning
Abstract
Multimodal sentiment analysis has attracted wide research interest because it can infer human emotions from combined textual and visual data. However, current approaches suffer from feature misalignment, modality imbalance, and a lack of robustness to noisy inputs. This paper proposes OptiM-RoViT, a framework that uses RoBERTa to process text and a Vision Transformer (ViT) to analyze images, and that enhances the model with dynamic modality weighting and Gaussian noise injection. With Optuna-based hyperparameter optimization, the model achieves 98.89% accuracy, an F1-score of 0.96, and a false negative rate of 0.04 on a dataset of 10,000 product reviews. Complementary ablation studies quantify the contribution of each component and show significant improvements over baseline architectures. Evaluations against stronger backbones (DeBERTa, Swin-V2) confirm that the proposed fusion mechanism is the primary driver of robustness, outperforming generic state-of-the-art baselines on noisy inputs. The proposed approach addresses critical challenges in multimodal fusion, including computational scalability and general efficacy, and is well suited to real-world applications such as e-commerce and social media analytics. Future work may adapt the framework to other domains and incorporate additional modalities to achieve similarly broad applicability.
Received: 8 May 2025 | Revised: 2 February 2026 | Accepted: 13 March 2026
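The abstract names dynamic modality weighting and Gaussian noise injection but does not give their formulas. The following is a minimal sketch of the general idea only, under stated assumptions: the function names, the softmax-based weighting, the fixed relevance scores, and the toy feature vectors are all hypothetical and are not taken from the paper. In a trained model the scores would presumably come from a learned gating layer over the RoBERTa and ViT embeddings.

```python
import math
import random


def add_gaussian_noise(features, sigma, rng):
    """Return a copy of `features` with zero-mean Gaussian noise added.

    Assumption: noise is injected on feature vectors; the paper may
    instead perturb pixels or token embeddings.
    """
    return [x + rng.gauss(0.0, sigma) for x in features]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_fusion(text_feat, image_feat, text_score, image_score):
    """Fuse two equal-length feature vectors with per-sample weights.

    The weights are the softmax of two relevance scores, so they are
    positive and sum to 1. The scores are hypothetical inputs here.
    """
    w_text, w_image = softmax([text_score, image_score])
    fused = [w_text * t + w_image * v for t, v in zip(text_feat, image_feat)]
    return fused, (w_text, w_image)


# Toy stand-ins for text and image embeddings (illustrative values only).
rng = random.Random(0)
text_feat = [0.2, -0.5, 0.9]
image_feat = [0.4, 0.1, -0.3]

# Perturb the image features, then down-weight the noisier modality.
noisy_image = add_gaussian_noise(image_feat, sigma=0.1, rng=rng)
fused, weights = dynamic_fusion(
    text_feat, noisy_image, text_score=1.0, image_score=0.0
)
print("weights (text, image):", weights)
print("fused features:", fused)
```

Giving the text branch the higher score shifts the fused vector toward the text features, which is the behaviour one would want when the image input is known to be noisy.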
Conflicts of Interest
The authors declare that they have no conflicts of interest to this work.
Data Availability Statement
The data that support the findings of this study are openly available in the GitHub Repository at https://github.com/1987Naveenv/sentiment.
Author Contribution Statement
Naveen Vasudevan: Conceptualization, Methodology, Software, Investigation, Data curation, Writing – original draft, Visualization. Sountharrajan Sehar: Validation, Formal analysis, Resources, Writing – review & editing, Supervision, Project administration.
Published
2026-04-27
Section
Research Articles
License
Copyright (c) 2026 Authors

This work is licensed under a Creative Commons Attribution 4.0 International License.
How to Cite
Vasudevan, N., & Sehar, S. (2026). OptiM-RoViT: A Robust Multimodal Sentiment Analysis Framework with Dynamic Fusion and Noise-Aware Vision Transformers. Journal of Computational and Cognitive Engineering. https://doi.org/10.47852/bonviewJCCE62026113