Machine Learning-Based Theme Classification for Video Content Analysis: A Bilingual Approach on the StoryBox

Hüseyin Parmaksız; Önder Öztürk; Osman Akarsu

doi:10.47852/

Authors

Hüseyin Parmaksız Department of Management Information Systems, Bilecik Şeyh Edebali University, Türkiye https://orcid.org/0000-0001-8455-5625
Önder Öztürk Information Technology Department, Rectorate, Kütahya Health Sciences University, Türkiye https://orcid.org/0000-0001-6460-9497
Osman Akarsu Department of Management Information Systems, Bilecik Şeyh Edebali University, Türkiye https://orcid.org/0000-0002-0595-6795

DOI:

https://doi.org/10.47852/

Keywords:

video content analysis, bilingual video classification, multilingual transformer models, theme classification, natural language processing, LLMs, clustering algorithms

Abstract

This research introduces a hybrid machine learning framework for the automatic thematic classification of video content in a bilingual (Turkish–English) setting, with a particular focus on the YouTube StoryBox dataset (172 videos). The proposed pipeline integrates sentence-level embeddings from Sentence-BERT, multilingual zero-shot classification with XLM-RoBERTa, classic clustering algorithms (HDBSCAN, k-means, and spectral clustering), dimensionality reduction via UMAP, and large language model (LLM)-based theme labeling with Flan-T5. The StoryBox collection is thematically rich and highly heterogeneous, covering entrepreneurship, education, technology, and industry. As a result, the video embeddings occupy a continuous semantic manifold rather than forming compact, well-separated clusters. This leads to weak hard-clustering scores in the original embedding space (negative silhouette values and 100% noise assignments for HDBSCAN), which we interpret as evidence of intrinsically fuzzy, overlapping themes rather than a failure of the algorithms. Nevertheless, the bilingual use of multilingual transformer models yields a 23% improvement in F1-score over a monolingual baseline for theme consistency. LLM-assisted inspection of UMAP-projected clusters further reveals that “education and development” emerges as the dominant macro-theme, accounting for 54.1% of the corpus. We position the framework as a proof-of-concept for real-world video platforms and discuss its implications for scalable content organization, recommendation systems, and decision support in global, bilingual media environments.
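To make the reading of negative silhouette values concrete, the following is a minimal sketch of the standard silhouette coefficient, s = (b - a) / max(a, b), where a is a point's mean distance to its own cluster and b is its mean distance to the nearest other cluster. It is not the authors' implementation and does not use StoryBox data: the function names `silhouette_scores` and `mean_silhouette` and the toy points are illustrative assumptions, chosen only to show that interleaved groups produce a negative mean score while well-separated groups score close to 1.

```python
import math


def silhouette_scores(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b) with Euclidean distance."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Group the distances from point i to every other point by cluster label.
        dists = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other, []).append(math.dist(p, q))
        own = dists.pop(lab, [])
        if not own or not dists:
            # Singleton cluster, or only one cluster overall: s = 0 by convention.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                            # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in dists.values())   # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def mean_silhouette(points, labels):
    scores = silhouette_scores(points, labels)
    return sum(scores) / len(scores)


# Hypothetical toy data (not from the StoryBox corpus).
separated = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
separated_labels = [0, 0, 1, 1]

# Points along a line whose labels alternate, mimicking overlapping themes.
interleaved = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
interleaved_labels = [0, 1, 0, 1]

if __name__ == "__main__":
    print("separated:  ", round(mean_silhouette(separated, separated_labels), 3))
    print("interleaved:", round(mean_silhouette(interleaved, interleaved_labels), 3))
```

In the interleaved case each point sits at least as close to the other group as to its own, so b never exceeds a and the mean score falls below zero. This is the pattern the abstract attributes to fuzzy, overlapping themes rather than to a failure of the clustering algorithm.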

Received: 25 July 2025 | Revised: 5 December 2025 | Accepted: 5 January 2026

Conflicts of Interest

The authors declare that they have no conflicts of interest related to this work.

Data Availability Statement

The data that support the findings of this study are openly available on GitHub at https://github.com/onder-ozturk/youtube-video-analyzer.

Author Contribution Statement

Hüseyin Parmaksız: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Supervision, Project administration. Önder Öztürk: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Supervision, Project administration. Osman Akarsu: Conceptualization, Validation, Investigation, Resources, Writing – review & editing, Supervision, Project administration.
