Machine Learning-Based Theme Classification for Video Content Analysis: A Bilingual Approach on the StoryBox
DOI:
https://doi.org/10.47852/
Keywords:
video content analysis, bilingual video classification, multilingual transformer models, theme classification, natural language processing, LLMs, clustering algorithms
Abstract
This research introduces a hybrid machine learning framework for the automatic thematic classification of video content in a bilingual (Turkish–English) setting, with a particular focus on the YouTube StoryBox dataset (172 videos). The proposed pipeline integrates sentence-level embeddings from Sentence-BERT, multilingual zero-shot classification with XLM-RoBERTa, classic clustering algorithms (HDBSCAN, k-means, and spectral clustering), dimensionality reduction via UMAP, and large language model (LLM) based theme labeling with Flan-T5. The StoryBox collection is thematically rich and highly heterogeneous, covering entrepreneurship, education, technology, and industry. As a result, the video embeddings occupy a continuous semantic manifold rather than forming compact, well-separated clusters. This leads to weak hard-clustering scores in the original embedding space (negative silhouette values and 100% noise assignments for HDBSCAN), which we interpret as evidence of intrinsically fuzzy, overlapping themes rather than a failure of the algorithms. Nevertheless, the bilingual use of multilingual transformer models yields a 23% improvement in F1-score over a monolingual baseline for theme consistency. LLM-assisted inspection of UMAP-projected clusters further reveals that “education and development” emerges as the dominant macro-theme, accounting for 54.1% of the corpus. We position the framework as a proof-of-concept for real-world video platforms and discuss its implications for scalable content organization, recommendation systems, and decision support in global, bilingual media environments.
Received: 25 July 2025 | Revised: 5 December 2025 | Accepted: 5 January 2026
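The abstract's reading of negative silhouette values as a sign of overlapping themes can be illustrated with a small, self-contained sketch. It is not the authors' pipeline: the points are hypothetical 2-D stand-ins for sentence embeddings, the labels are hand-assigned rather than produced by HDBSCAN or k-means, and the helper names `euclidean` and `silhouette_score` are assumptions made for this example. The sketch only shows how the mean silhouette coefficient behaves when hard labels are imposed on well-separated versus interleaved points.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient for points with hard cluster labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy "embeddings", not StoryBox data.
separated = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
interleaved = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

well = silhouette_score(separated, [0, 0, 1, 1])     # close to 1
mixed = silhouette_score(interleaved, [0, 1, 0, 1])  # -0.25

print(f"separated: {well:.3f}, interleaved: {mixed:.3f}")
```

In the interleaved case each point sits at least as close to the other cluster as to its own, so the score drops below zero even though the labeling procedure itself did nothing wrong. That is the sense in which a negative score can reflect the geometry of the data rather than a fault in the algorithm.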
Conflicts of Interest
The authors declare that they have no conflicts of interest to this work.
Data Availability Statement
The data that support the findings of this study are openly available in GitHub at https://github.com/onder-ozturk/youtube-video-analyzer.
Author Contribution Statement
Hüseyin Parmaksız: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Supervision, Project administration. Önder Öztürk: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Supervision, Project administration. Osman Akarsu: Conceptualization, Validation, Investigation, Resources, Writing – review & editing, Supervision, Project administration.
Published
2026-01-20
Section
Research Article
License
Copyright (c) 2026 Authors

This work is licensed under a Creative Commons Attribution 4.0 International License.
How to Cite
Parmaksız, H., Öztürk, Ö., & Akarsu, O. (2026). Machine Learning-Based Theme Classification for Video Content Analysis: A Bilingual Approach on the StoryBox. Artificial Intelligence and Applications. https://doi.org/10.47852/