Out-Heroding Herod? — Author-trained GPTs and Original Works in the Perspective of Quantitative Linguistics
DOI: https://doi.org/10.14712/23362189.2025.4945
Keywords: quantitative linguistics, stylometry, AI, chatbot, ChatGPT, Czech literature
Abstract
Goals: The paper compares texts created by GPT models trained on the works of prominent Czech authors with the literature those authors actually wrote. The goal is to find out (1) whether the two differ; and, if so, (2) in which areas of language the differences are most prominent.
Methods: GPTs were built for four authors: Karel Čapek, Jaroslav Hašek, Franz Kafka, and Vladislav Vančura. The corpus contains 40 text samples of 1,000 words for each author, 20 produced by the respective GPT and 20 taken from the original works. Two investigations are carried out: the first calculates 30 morphological, syntactic, and lexical markers for each text; the second is based on most-frequent-element analyses. The results of the first investigation are tested for statistical significance with the Mann–Whitney U test.
Results: The chatbots reproduce colloquial style and conversational interaction poorly and tend to make the texts more narrative. The best results are obtained for Karel Čapek, the worst for Franz Kafka. The stylometric analyses almost always distinguish the AI-generated texts from the human-written ones.
Conclusions: Texts produced by the author-trained GPTs remain clearly distinguishable from those produced by the real writers.
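Illustration: The significance testing mentioned under Methods can be sketched as follows. This is a minimal, standard-library illustration of a two-sided Mann–Whitney U test using the normal approximation with a tie correction; it is not the authors' code, and the function names (`rank_with_ties`, `mann_whitney_u`) and the sample values are hypothetical. With 20 GPT samples and 20 original samples per author, one such test would be run per marker.

```python
from math import sqrt
from statistics import NormalDist


def rank_with_ties(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected,
    no continuity correction). Returns (U_x, U_y, p_value)."""
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    ranks = rank_with_ties(pooled)

    # U statistic for x from its rank sum; U for y follows by symmetry.
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1

    # Variance of U under the null hypothesis, corrected for ties.
    n = n1 + n2
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    sigma = sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:  # all observations identical: no evidence of a difference
        return u1, u2, 1.0

    z = (u1 - n1 * n2 / 2) / sigma
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, u2, p_value


# Hypothetical values of one marker (not data from the paper).
gpt_samples = [0.61, 0.58, 0.64, 0.60, 0.63]
original_samples = [0.52, 0.55, 0.50, 0.57, 0.54]
u_gpt, u_orig, p = mann_whitney_u(gpt_samples, original_samples)
print(f"U_gpt={u_gpt}, U_orig={u_orig}, p={p:.4f}")
```

For very small samples an exact test is preferable to the normal approximation; the approximation is used here only to keep the sketch short.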
References
Burrows, J. F. (2002). "Delta": A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267-287.
https://doi.org/10.1093/llc/17.3.267
Covington, M. A., & McFall, J. D. (2010). Cutting the Gordian knot: The moving-average type-token ratio (MATTR). Journal of Quantitative Linguistics, 17(2), 94-100.
https://doi.org/10.1080/09296171003643098
Cvrček, V., Laubeová, Z., Lukeš, D., Poukarová, P., Řehořková, A., & Zasina, A. J. (2020a). Registry v češtině. Lidové noviny.
Cvrček, V., Čech, R., & Kubát, M. (2020b). QuitaUp - nástroj pro kvantitativní stylometrickou analýzu. Czech National Corpus and University of Ostrava. https://korpus.cz/quitaup/
Dahl, Ö. (Ed.). (2000). Tense and aspect in the languages of Europe. Mouton de Gruyter.
https://doi.org/10.1515/9783110197099
Daneš, F. (1954). Příspěvek k poznání jazyka a slohu Haškových "Osudů dobrého vojáka Švejka". Naše řeč, 37(3-6), 124-139.
Davidson, D. (2001). Subjective, intersubjective, objective. Oxford University Press.
https://doi.org/10.1093/0198237537.001.0001
DeepL Translator Team. (2025). DeepL Translator. https://www.deepl.com
Eder, M., Rybicki, J., & Kestemont, M. (2016). Stylometry with R: A package for computational text analysis. The R Journal, 8(1), 107-121.
https://doi.org/10.32614/RJ-2016-007
Janda, L. A., Fidler, M., Cvrček, V., & Obukhova, A. (2022). The case for case in Putin's speeches. Russian Linguistics, 47(1), 15-40.
https://doi.org/10.1007/s11185-022-09269-2
Kalantzis, M., & Cope, B. (2025). Literacy in the time of artificial intelligence. Reading Research Quarterly, 60(1), 1-34.
https://doi.org/10.1002/rrq.591
Kosmas, P., Nisiforou, E. A., Kounnapi, E., Sophocleous, S., & Theophanous, G. (2025). Integrating artificial intelligence in literacy lessons for elementary classrooms: A co-design approach. Educational Technology Research and Development, 73(3), 2589-2615.
https://doi.org/10.1007/s11423-025-10492-z
Kubát, M. (2016). Kvantitativní analýza žánrů. Filozofická fakulta Ostravské univerzity.
Kundera, M. (1960). Umění románu: cesta Vladislava Vančury za velkou epikou. Československý spisovatel.
Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18(1), 50-60.
https://doi.org/10.1214/aoms/1177730491
Mikros, G. K. (2025). Beyond the surface: Stylometric analysis of GPT-4o's capacity for literary style imitation. Digital Scholarship in the Humanities, 40(2), 587-600.
https://doi.org/10.1093/llc/fqaf035
Milička, J., Marklová, A., & Cvrček, V. (2025). Benchmark of stylistic variation in LLM-generated texts. arXiv, 2509.10179v1.
Místecký, M., & Melka, T. S. (2021). Literary "higher dimensions" quantified: A stylometric study of nine stories. Glottotheory, 12(2), 129-157.
https://doi.org/10.1515/glot-2021-2021
Místecký, M., & Radková, L. (2020). School and gender in numbers: A stylometric insight into the lexis of teenagers' description essays. Glottometrics, 49, 52-65.
Mukařovský, J. (1939). Próza K. Čapka jako lyrická melodie a dialog. Slovo a slovesnost, 5(1), 1-12.
O'Sullivan, J. (2025). Stylometric comparisons of human versus AI-generated creative writing. Humanities and Social Sciences Communications, 12(1), 1708.
https://doi.org/10.1057/s41599-025-05986-3
OpenAI. (2025). ChatGPT (May 13 version) [Large language model]. https://chat.openai.com/
Piorecký, K., & Husárová, Z. (2018). Tvořivost literatury v éře umělé inteligence. Česká literatura, 67(2), 145-169.
Rebora, S. (2023). GPT-3 vs. Delta: Applying stylometry to large language models. In E. Carbé, G. Lo Piccolo, A. Valenti, & F. Stella (Eds.), La memoria digitale: Forme del testo e organizzazione della conoscenza. Atti del XII Convegno Annuale AIUCD (pp. 292-297). Associazione per l'Informatica Umanistica e la Cultura Digitale (AIUCD).
Mishra, S. (2023). Dissolution of language leads to dissolution of self: Pragmatic analysis of Franz Kafka's Trial. Indian Journal of Language and Literary Studies, 4(4), 1-8.
https://doi.org/10.54392/ijll2341
Straková, J., Straka, M., & Hajič, J. (2014). Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. In K. Bontcheva & J. Zhu (Eds.), Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 13-18). Association for Computational Linguistics.
https://doi.org/10.3115/v1/P14-5003
Tao, M., Tao, J., & Xu, Q. (2025). A quantitative study on the improvement of students' reading literacy by AI-assisted English reading comprehension training platform. International Journal of Environmental Sciences, 11(20), 1298-1306.
Vondráček, M. (2013). Vlastnosti slova a slovní druhy. In O. Uličný & O. Bláha (Eds.), Úvahy o české morfologii. Studie k moderní mluvnici češtiny 6 (pp. 17-32). Univerzita Palackého v Olomouci.
Zaitsu, W., & Jin, M. (2023). Distinguishing ChatGPT(-3.5, -4)-generated and human-written papers through Japanese stylometric analysis. PLOS ONE, 18(8), e0288453.
https://doi.org/10.1371/journal.pone.0288453
PMID: 37556434; PMCID: PMC10411719
