OptiM-RoViT: A Robust Multimodal Sentiment Analysis Framework with Dynamic Fusion and Noise-Aware Vision Transformers

Naveen Vasudevan; Sountharrajan Sehar

doi:10.47852/bonviewJCCE62026113

Authors

Naveen Vasudevan, Department of Computer Science and Engineering, Amrita Vishwa Vidyapeetham-Chennai, India, https://orcid.org/0000-0002-0829-618X
Sountharrajan Sehar, Department of Computer Science and Engineering, Amrita Vishwa Vidyapeetham-Chennai, India, https://orcid.org/0000-0003-4248-3875

DOI:

https://doi.org/10.47852/bonviewJCCE62026113

Keywords:

multimodal sentiment analysis, RoBERTa, Vision Transformer, hyperparameter optimization, deep learning

Abstract

Multimodal sentiment analysis has attracted wide research attention because it can infer human emotions from combined textual and visual data. However, current approaches suffer from feature misalignment, modality imbalance, and a lack of robustness to noisy inputs. This paper proposes OptiM-RoViT, a framework that uses RoBERTa to process text and a Vision Transformer (ViT) to analyze images, and augments them with dynamic modality weighting and Gaussian noise injection. With Optuna-based hyperparameter optimization, the model achieves 98.89% accuracy, an F1-score of 0.96, and a false negative rate of 0.04 on a dataset of 10,000 product reviews. Ablation studies quantify the contribution of each component and show significant improvements over baseline architectures. Evaluations against stronger backbones (DeBERTa, Swin-V2) confirm that the proposed fusion mechanism is the primary driver of robustness, outperforming generic state-of-the-art baselines on noisy inputs. The proposed approach addresses critical challenges in multimodal fusion, including computational scalability and general efficacy, and is well suited to real-world applications such as e-commerce and social media analytics. Future work may explore adapting the framework to other domains and incorporating additional modalities to achieve similarly broad applicability.
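The two robustness ideas named in the abstract, Gaussian noise injection and dynamic modality weighting, can be illustrated with a minimal sketch. This is not the authors' implementation: the softmax gating scheme, the gate vectors, the feature values, and the function names `add_gaussian_noise` and `dynamic_fusion` are all assumptions chosen for illustration, and the short lists stand in for RoBERTa and ViT embeddings.

```python
import math
import random


def add_gaussian_noise(features, sigma, rng):
    """Return a copy of `features` with zero-mean Gaussian noise added."""
    return [x + rng.gauss(0.0, sigma) for x in features]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_fusion(text_feat, image_feat, gate_text, gate_image):
    """Fuse two equal-length feature vectors with input-dependent weights.

    Assumption: each modality receives a scalar score (dot product with a
    gate vector), and a softmax over the two scores gives weights that sum
    to 1. The paper's actual weighting scheme may differ.
    """
    score_t = sum(w * x for w, x in zip(gate_text, text_feat))
    score_i = sum(w * x for w, x in zip(gate_image, image_feat))
    w_t, w_i = softmax([score_t, score_i])
    fused = [w_t * t + w_i * v for t, v in zip(text_feat, image_feat)]
    return fused, (w_t, w_i)


# Hypothetical toy embeddings standing in for text and image features.
rng = random.Random(0)
text_feat = [0.5, -0.2, 0.1, 0.9]
image_feat = [0.3, 0.4, -0.6, 0.2]

# Perturb the image features, as one might during noise-aware training.
noisy_image = add_gaussian_noise(image_feat, sigma=0.1, rng=rng)

# Hypothetical gate vectors; in a trained model these would be learned.
gate_text = [1.0, 0.0, 0.0, 1.0]
gate_image = [0.5, 0.5, 0.0, 0.0]

fused, weights = dynamic_fusion(text_feat, noisy_image, gate_text, gate_image)
print("modality weights (text, image):", weights)
print("fused features:", fused)
```

In this sketch the weights depend on the inputs themselves, so a modality whose features score lower under its gate contributes less to the fused vector, which is one plausible reading of "dynamic modality weighting".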

Received: 8 May 2025 | Revised: 2 February 2026 | Accepted: 13 March 2026

Conflicts of Interest

The authors declare that they have no conflicts of interest to this work.

Data Availability Statement

The data that support the findings of this study are openly available in the GitHub Repository at https://github.com/1987Naveenv/sentiment.

Author Contribution Statement

Naveen Vasudevan: Conceptualization, Methodology, Software, Investigation, Data curation, Writing – original draft, Visualization. Sountharrajan Sehar: Validation, Formal analysis, Resources, Writing – review & editing, Supervision, Project administration.
