8 An Introduction to Large Language Models in Education
1 Introduction
Large Language Models (LLMs) are advanced AI systems primarily designed to process and generate human-like text, but their capabilities extend far beyond natural language tasks, enabling transformative applications across diverse domains, including education, research, software development, and more [1, 2]. Taking advantage of massive datasets and neural network architectures, such as the transformer [2], LLMs can analyse context, predict sequential elements (e.g., the next word or token), and produce coherent, contextually appropriate outputs.
While text generation and understanding are at the core of LLMs, their versatility stems from their ability to generalize patterns in data. This enables them to tackle tasks like code generation [3], and even multimodal applications that integrate text with other data types, such as images and videos [4]. For instance, specialized models like Claude Sonnet excel in programming tasks, while multimodal extensions of GPT-4 demonstrate the ability to describe images and interpret visual data alongside textual inputs.
LLMs are opening new possibilities in education by transforming how students learn and how educators teach and research [5]. These models can dynamically generate test questions tailored to diverse learning levels [6], translate educational materials to improve accessibility [7], summarize complex concepts for clearer understanding [8], and simulate conversational practice to enhance language skills [9]. The personalization of learning experiences through LLMs can support students in mastering content at their own pace. In research, they enable the creation of synthetic datasets [10], simulate experimental conditions for pedagogical studies [11], and analyse vast corpora of text and structured data to uncover actionable insights.
The scalability of LLMs, exemplified by models like GPT-3 with 175 billion parameters (i.e., the weights and biases of the different layers), has accelerated their adoption across fields. However, they are not without limitations. Issues such as biased, inaccurate, or fabricated outputs highlight the need for critical human oversight [12]. A human-in-the-loop approach is crucial to ensuring that LLMs align with educational goals and ethical standards, serving as tools to enhance learning rather than replace human expertise [13, 14]. Despite these challenges, LLMs are transforming how information is delivered, processed, and utilized. Their ability to integrate language understanding with broader data-processing capabilities positions them as invaluable tools for solving complex problems and enhancing accessibility across disciplines. The following sections explore the transformative applications of LLMs in education research and practice.
2 Behind the scenes of LLMs
To better understand how large language models function, it is helpful to break down their core components and processes into three interrelated aspects based on the acronym “GPT”: the Transformer architecture (T), which underpins their computational ability; the Pre-training phase (P), where the model learns patterns and knowledge from vast datasets; and the Generative abilities (G), which showcase their practical applications. The sequence T → P → G provides a logical progression from the foundational mechanics of LLMs to their training processes and, finally, to their real-world outputs. This structure offers a holistic view of these transformative technologies: first understanding the architecture that powers LLMs, then exploring how they are trained to predict and fine-tune their outputs, and ultimately seeing what they can generate. In the following sections, we explore each of these phases in detail, shedding light on their individual contributions to the overall capabilities of LLMs.
2.1 How LLMs Work: The “Transformer” Architecture
The “Transformer” in GPT refers to the architectural framework that powers modern LLMs, enabling their remarkable fluency and adaptability. This architecture revolutionized natural language processing by introducing the attention mechanism [2], which allows the model to understand relationships within a sequence and process context effectively. Without the “T,” LLMs could not achieve the coherence and precision that define their generative abilities.
At its core, an LLM is a sophisticated function designed to predict the next element in a sequence. For example, given the input “The sun is shining”, the model might predict “brightly” or “today”. While this prediction task may sound straightforward, an LLM is not a simple function like y = 2x + 1, whose only parameters are the coefficient 2 and the intercept 1, and for which an input of x = 1 gives an output of y = 3. The underlying computations in LLMs are vastly more intricate. Instead of a handful of parameters, LLMs employ billions of them within multi-layered neural networks. Each layer refines the model’s understanding of the input, performing complex transformations based on the outputs of previous layers.
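The contrast can be made concrete with a toy example in R: unlike y = 2x + 1, a language model maps an input context to a probability distribution over a vocabulary. The words and probabilities below are invented purely for illustration; a real LLM computes such distributions with billions of parameters.

```r
# A toy "language model": it maps an input context to a probability
# distribution over a small vocabulary and we pick the most likely
# next word. The probabilities are made up for illustration; a real
# LLM computes them with billions of learned parameters.
predict_next_word <- function(context) {
  vocab <- c("brightly", "today", "cloudy", "the")
  probs <- c(0.45, 0.30, 0.15, 0.10)  # hypothetical learned probabilities
  names(probs) <- vocab
  probs
}

probs <- predict_next_word("The sun is shining")
names(which.max(probs))  # "brightly"
```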
Prior to the attention mechanism, models like recurrent neural networks (RNNs) [15] and convolutional neural networks (CNNs) [16] struggled to effectively process relationships in long or complex sequences, often treating input elements in isolation or with limited context [17]. The attention mechanism transformed this by allowing the model to weigh the importance of different parts of the input sequence based on their relevance to the task [2]. For example, consider the sentences “There’s a bat in the tree” and “He swings the bat”. In the first, “bat” refers to an animal, while in the second, it refers to a sports tool. The attention mechanism enables the model to focus on surrounding words like “tree” or “swings” to deduce the correct meaning. This ability to dynamically assign importance to specific words based on context is the key to LLMs’ accuracy and coherence.
The attention mechanism calculates attention weights, assigning varying levels of importance to each word based on its relevance to the task at hand. For instance, in the sentence “The sun is shining”, the word “sun” is given more weight than less significant words like “the”. These weighted representations of words are passed through multiple layers of the neural network, with each layer building on the contextual understanding established by the previous one. Early layers might identify simple word relationships, like “sun” being associated with “shining”, while later layers incorporate broader knowledge, such as recognizing weather-related descriptions that often lead to words like “today” or “brightly”.
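To make this concrete, here is a minimal sketch of computing attention weights for the toy sentence, following the scaled dot-product formulation of [2]. The two-dimensional word vectors are invented for illustration; real models learn high-dimensional query and key vectors.

```r
# Minimal sketch of scaled dot-product attention weights for the toy
# sentence "The sun is shining". The 2-dimensional vectors below are
# invented for illustration; real models learn query/key vectors with
# hundreds or thousands of dimensions.
softmax <- function(x) exp(x) / sum(exp(x))

words <- c("The", "sun", "is", "shining")
# Each row is a (made-up) key vector for one word
K <- matrix(c(0.1, 0.0,    # The
              0.9, 0.8,    # sun
              0.2, 0.1,    # is
              0.8, 0.9),   # shining
            nrow = 4, byrow = TRUE)
q <- c(0.7, 0.9)  # query vector for the position being predicted

# Attention weights: softmax of scaled dot products between query and keys
scores  <- as.vector(K %*% q) / sqrt(length(q))
weights <- softmax(scores)
names(weights) <- words
round(weights, 2)  # "sun" and "shining" receive the most weight
```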
In summary, the Transformer architecture is highly effective because it dynamically focuses attention on the most relevant parts of an input sequence at any given moment. This mechanism mirrors human behavior, where we concentrate on the information most pertinent to the task at hand while filtering out distractions. This synergy between attention and neural network computation forms the foundation of the architecture, making it the core of modern generative AI.
2.2 Why LLMs Work: The Power of Being “Pre-trained”
Previously, we discussed that LLMs are predictive models with billions of parameters, enabling them to perform complex calculations. But how are these parameters determined? That’s where the “P” in GPT, Pre-training, comes in.
In any prediction task, the goal is to learn from existing data to make accurate predictions for new inputs. For example, imagine you’re predicting the next number in a sequence like 2, 4, 6, 8. Based on the pattern, you might predict 10 as the next number. Similarly, LLMs predict the next word in a sentence based on patterns they’ve learned from vast amounts of text. For instance, given the sentence “The sun is shining”, the model might predict “brightly” as the next word. Before training begins, the structure of the model must be defined. In our example, this would mean deciding the type of rule used to predict the sequence, like identifying it as “add 2 to the previous number”. For LLMs, this involves defining the number of layers in the model, the capacity of the attention mechanism to focus on different parts of the input, and the dimensions of its internal representations and computations. These decisions shape how well the model can understand and generate language.
Once the model structure is determined, the training process begins. In the numerical example, if we assume an arithmetic sequence of the form x_t = a * x_(t-1) + b, the process involves identifying the parameters a = 1 and b = 2, which define the rule for predicting the next number in the sequence. For LLMs, determining parameter values is far more complex, involving billions of parameters and advanced optimisation techniques. The training process is generally divided into two main stages: pre-training and fine-tuning, with reinforcement learning with human feedback (RLHF) often added for further refinement. Let us explore how these processes work in more detail in the following sections.
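Before moving on, the following sketch recovers a and b from the toy sequence using R’s built-in linear regression. This is the same estimation idea that underlies LLM training, reduced from billions of parameters to two.

```r
# Recovering the parameters of the arithmetic-sequence rule
# x_t = a * x_(t-1) + b from the observed sequence 2, 4, 6, 8.
x <- c(2, 4, 6, 8)
previous <- x[-length(x)]  # x_(t-1): 2, 4, 6
current  <- x[-1]          # x_t:     4, 6, 8

fit <- lm(current ~ previous)
coef(fit)  # intercept b = 2, slope a = 1
```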
2.2.1 Pre-training and Fine-Tuning
Pre-training is the initial phase [18] where the model is exposed to a vast corpus of text data, such as books, articles, and websites. The size and diversity of this dataset are crucial for the model to learn general patterns of language, including grammar, semantics, and common word associations. Importantly, this process is self-supervised [19], meaning the model learns from the structure of the data itself without requiring manually labelled examples. For instance, given the input “The sun is shining”, the model predicts the next word based on patterns it has seen in similar contexts during pre-training. In this case, it might predict words like “brightly” or “today” depending on the associations it has learned from the training data.
On the technical side, the accuracy of these predictions is measured using a loss function, which quantifies how far the predicted word is from the actual next word in the sequence. For example, if the model predicts “cloudy” instead of “brightly,” the loss function assigns a higher value, indicating a larger error. These errors are minimized through a process called backpropagation, which calculates how each parameter in the model contributes to the error. Optimisation algorithms then adjust the parameters to reduce the loss, gradually improving the model’s ability to make accurate predictions.
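A minimal sketch of this idea, using the toy sequence again: we define a loss (here mean squared error, rather than the cross-entropy loss LLMs use over words), compute its gradients by hand (the role backpropagation plays in deep networks), and repeatedly nudge the parameters to reduce the loss.

```r
# Sketch of loss minimization by gradient descent on the toy sequence:
# the two-parameter analogue of how backpropagation and an optimiser
# adjust an LLM's billions of parameters to reduce prediction error.
previous <- c(2, 4, 6)
current  <- c(4, 6, 8)

a <- 0; b <- 0        # arbitrary starting values
lr <- 0.01            # learning rate
for (step in 1:5000) {
  pred  <- a * previous + b
  error <- pred - current               # how far off each prediction is
  # Gradients of the mean squared error with respect to a and b
  grad_a <- mean(2 * error * previous)
  grad_b <- mean(2 * error)
  a <- a - lr * grad_a                  # update parameters to reduce loss
  b <- b - lr * grad_b
}
c(a = a, b = b)  # converges towards a = 1, b = 2
```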
After pre-training, the model undergoes fine-tuning on smaller, domain-specific datasets. For example, if the model is to generate weather reports, it might be fine-tuned on a specialized corpus of meteorological data. This step builds on the general language patterns learned during pre-training, refining the model to perform specialized tasks with high accuracy. Returning to the earlier example, fine-tuning might teach the model to complete “The sun is shining” with specific terms like “in the afternoon” or “through the clouds”, based on its specialized training data.
2.2.2 Reinforcement Learning with Human Feedback (RLHF)
To further enhance the model’s alignment with human expectations, a process called reinforcement learning with human feedback (RLHF) [20] is often applied. This technique fine-tunes the model beyond its technical accuracy, ensuring it generates outputs that are clear, relevant, and aligned with human preferences. The approach was popularized by the work of Christiano et al. (2017), which demonstrated how human preferences could be used to guide models in learning complex tasks where clear evaluation criteria are difficult to define.
In RLHF, human evaluators review the model’s outputs and rank them based on criteria like clarity, appropriateness, and usefulness. For instance, if the model is tasked with completing the phrase “The sun is shining”, it might produce options like “brightly”, “on the horizon”, or “through the clouds”. Evaluators would rank these outputs, perhaps preferring “brightly” for its clarity and generality over “on the horizon”, which might seem overly specific. These rankings are then used to train a reward model, which predicts scores for outputs based on their alignment with human preferences.
During the reinforcement learning phase, the LLM generates new predictions, and the reward model assigns scores to these predictions. Using reinforcement learning algorithms, such as Proximal Policy Optimisation (PPO) [21], the LLM updates its parameters to maximize the reward from the reward model. During this process, desirable outputs are assigned higher scores, while less preferred options are penalized. Through this iterative process, the model improves its ability to align with human-provided feedback, producing outputs that better meet expectations for clarity, relevance, and user-friendliness.
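As an illustration of the reward-modelling step, the sketch below computes the pairwise preference loss commonly used to train RLHF reward models (a Bradley–Terry formulation): the loss is small when the completion humans preferred receives a higher score than the rejected one. The reward values are hypothetical.

```r
# Sketch of the pairwise preference loss used to train RLHF reward
# models: the model is trained so completions humans preferred score
# higher than rejected ones. The scores below are hypothetical.
sigmoid <- function(x) 1 / (1 + exp(-x))

reward_preferred <- 1.8   # e.g., score for "brightly"
reward_rejected  <- 0.6   # e.g., score for "on the horizon"

# The loss shrinks as the gap between the two scores grows
loss <- -log(sigmoid(reward_preferred - reward_rejected))
loss  # ~0.26; a larger gap in the preferred direction gives a smaller loss
```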
2.3 What LLMs Do: The “Generative” Core of GPT
At the heart of large language models lies their generative ability, the “G” in GPT, which enables them to produce coherent, contextually relevant, and often human-like text. This generative capability transforms LLMs from passive tools into active collaborators across a wide range of applications, from creative writing and summarizing complex documents to coding and conversational AI.
The generative process begins with a prompt, which serves as the input to the model. For instance, given the phrase “The sun is shining”, the model predicts likely continuations one word after another based on patterns it has learned during training. It might generate “brightly through the clouds” or “today after a long storm,” depending on its understanding of the context and relationships within the data it was trained on. This prediction is not random but calculated from probabilities assigned to thousands of potential next words, allowing the model to choose the most appropriate continuation.
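A small sketch of this final step: the model’s raw scores (logits) over candidate next words are turned into probabilities with a softmax, and one word is sampled, with a temperature parameter controlling how adventurous the choice is. The vocabulary and scores here are invented for illustration.

```r
# Sketch of sampling the next token from a probability distribution,
# with a temperature parameter controlling randomness. The raw model
# scores (logits) are invented for illustration.
softmax <- function(x) exp(x) / sum(exp(x))

vocab  <- c("brightly", "today", "warmly", "again")
logits <- c(2.0, 1.5, 0.4, 0.1)  # hypothetical model scores

temperature <- 0.8                       # < 1 sharpens, > 1 flattens
probs <- softmax(logits / temperature)
set.seed(42)
sample(vocab, size = 1, prob = probs)    # likely "brightly" or "today"
```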
LLMs excel in generating text that reflects the tone, style, and intent of the input. For example, when prompted with “Write a formal letter about sunny weather”, the model might begin: “Dear Sir/Madam, I am writing to express my appreciation for the delightful sunny weather we have been experiencing”. Conversely, a casual prompt like “Tell me something cool about sunny days” could result in: “Did you know sunshine boosts your serotonin levels, making you feel happier?”. These examples demonstrate the model’s ability to adapt to the user’s intent and produce contextually appropriate outputs.
Beyond generating text, LLMs can perform tasks such as aiding in language learning [9], creating summaries [8], and writing code [3]. This flexibility stems from their pre-training on diverse datasets that include examples across various domains, enabling them to handle a broad range of tasks. Despite its versatility, the generative process is not without challenges. LLMs may occasionally produce outputs that are factually inaccurate, contextually inappropriate, or overly verbose—a phenomenon known as “hallucination”. This highlights the importance of human oversight to ensure the quality and reliability of their outputs. Table 1 summarizes the main uses of LLMs and how they can be applied to different tasks in education and learning analytics.
Task | Description | Example Models | How It Works Technically | Application in Education |
---|---|---|---|---|
Text Completion | Predicting and generating text to complete a prompt. | GPT-4, GPT-o1, Claude, LLaMA | Utilizes autoregressive transformers to predict the next token based on preceding tokens, employing learned probability distributions. | Auto-generating personalized feedback for students [22]. |
Text Generation (Creative Writing) | Generating creative and coherent long-form text. | GPT-4, GPT-o1, Claude, LLaMA | Uses autoregressive token prediction to produce semantically coherent, fluent, and diverse text. | Supporting creative writing assignments, generating essay ideas, or storytelling exercises for students [23]. |
Question Answering | Providing answers to factual or contextual questions. | GPT-4, Claude, Gemini, RoBERTa | Employs attention mechanisms to focus on relevant parts of the input context, generating precise answers. In extractive QA, identifies and extracts specific text spans. | Automated Q&A systems for course material, providing quick answers to student queries [24]. |
Question Generation | Generating questions based on input content. | GPT-4, T5, BERT-QG | Sequence-to-sequence models generate question tokens conditioned on input context. | Auto-generating quiz questions or comprehension tests from textbook material or lecture notes [25]. |
Text Summarization | Generating concise summaries from longer texts. | GPT-4, Gemini, Claude, BART, T5 | Applies sequence-to-sequence transformer architectures to condense input text into coherent summaries, preserving essential information. | Summarizing lecture notes, research papers, or learning materials for quicker understanding [26]. |
Translation | Translating text between languages. | GPT-4, Gemini, mT5, LLaMA | Uses encoder-decoder transformer models to map input tokens from the source language to the target language, aligning syntactic and semantic structures. | Supporting multilingual learners by translating course materials, assignments, or instructions [27]. |
Text Classification | Assigning labels or categories to text. | BERT, RoBERTa, DistilBERT | Tokenizes input text and processes it through transformer layers to produce embeddings, which are then classified into categories using a task-specific head. | Analyzing student responses for sentiment (e.g., identifying frustration) [28], or automatic discourse coding [29]. |
Chat and Dialogue | Generating conversational responses for chatbots. | Claude, GPT-4, Gemini, LLaMA | Maintains dialogue context by incorporating previous conversation turns, generating contextually relevant responses through autoregressive token prediction. | Powering tutoring chatbots that assist students with real-time explanations and guidance [30]. |
Code Generation | Writing or completing code snippets. | GPT-o1, Codex, Gemini | Predicts subsequent code tokens based on prior training on large corpora of code, ensuring syntactic and semantic correctness. | Assisting students in programming courses by auto-generating or debugging code snippets [31]. |
Paraphrasing and Rewriting | Generating rephrased or simplified versions of input text. | GPT-4, Claude, T5 | Fine-tuned models rephrase input text while preserving semantic meaning, often using reinforcement learning. | Helping students rephrase ideas to avoid plagiarism, simplifying content, or generating alternative explanations of concepts [32, 33]. |
Style Transfer | Rewriting text with a different style or tone. | GPT-4, CTRL, Claude, LLaMA | Generates text conditioned on desired stylistic attributes, often fine-tuned on datasets exemplifying target styles to learn appropriate transformations. | Simplifying advanced academic content for younger students or adapting tone for academic writing [34]. |
Knowledge Extraction | Extracting entities or relationships from text. | GPT-4, BERT, Gemini | Utilizes attention layers to identify and classify entities and their relationships within text, enabling structured information extraction. | Extracting key concepts, relationships, or named entities from lecture transcripts and research articles. |
Task Automation | Performing tasks with minimal or no labeled examples. | Claude, GPT-4, GPT-o1, BART | Takes advantage of in-context learning by interpreting task instructions and examples provided within the input, adapting to new tasks without explicit retraining. | Automating grading of open-ended responses with minimal training or applying rubrics [35], performing tedious technical tasks such as converting a screenshot of a table into LaTeX code. |
Data-to-Text Generation | Generating natural language summaries or explanations from structured data. | GPT-4, T5 | Encoder-decoder architectures map structured inputs (e.g., tables) into coherent text. | Generating textual explanations of student performance data, or dashboards [36]. |
Multimodal Integration | Combining text with images, audio, or other modalities. | Gemini, GPT-4V, CLIP | Aligns representations from different modalities using combined encoder architectures, facilitating tasks that require understanding across multiple data types (image, audio, video, etc.). | Analyzing student engagement via multimodal data (e.g., combining video, text, and audio) [37]. |
Text-to-Speech | Converting written text into spoken language. | Tacotron, Gemini, WaveNet, VALL-E | Deep learning architectures generate natural-sounding speech from tokenized text inputs. | Assisting visually impaired students, enabling spoken lecture notes, teaching young children who cannot read yet, or improving pronunciation for language learners [38, 39]. |
Text-to-Video | Converting written text into video. | Sora, Runway Gen 2 | Multimodal transformer architectures process text as input and generate video sequences using latent diffusion or frame interpolation. | Creating engaging educational videos from course materials [40]. |
Reasoning & Problem Solving | Solving logical, mathematical, or structured reasoning tasks. | Claude, GPT-4, GPT-o1, Gemini | Employs chain-of-thought prompting and step-by-step token generation to tackle complex reasoning tasks, enhancing problem-solving capabilities. | Assisting students with step-by-step solutions in math or logical reasoning exercises, enhancing comprehension [41]. |
2.4 Using LLMs: From Web Interfaces to Advanced Frameworks
After understanding how LLMs work and what they can be used for, the next step is exploring how to use them in practical scenarios. Whether you are a beginner looking to interact with LLMs through simple web interfaces or an advanced user building custom applications using APIs and frameworks, there are many ways to harness their capabilities.
For beginners, web interfaces like ChatGPT, Claude (by Anthropic), and Perplexity AI offer user-friendly ways to explore LLM capabilities without requiring technical expertise. These platforms can assist with summarizing academic papers, generating insights from student feedback, writing or assessing code, brainstorming interventions for struggling learners, and more. Many of these tools also support uploading files, such as PDFs, or accessing content from open-access URLs, making them versatile for analyzing publicly available academic content.
Such tools demonstrate the transformative power of LLMs across diverse domains, enabling students to engage more deeply with complex concepts while allowing educators to focus on higher-level instruction and personalized support. Whether applied to programming, language learning, or data analysis, these tools foster iterative learning: if the initial output does not fully meet the user’s needs, prompts and inputs can be refined to generate more tailored and meaningful results. This interaction not only enhances understanding but also encourages critical thinking and adaptability in both learners and educators.
The second way to interact with large language models is through an Application Programming Interface (API). An API acts as a bridge, allowing software applications to communicate with the LLM programmatically. This makes APIs essential for developers seeking to integrate LLMs into custom applications, websites, or tools. Using APIs offers several significant advantages, particularly in education and learning analytics. One of the key benefits is integration, as APIs can seamlessly connect LLM functionalities to various platforms, enabling educators and developers to embed advanced capabilities into their own applications or systems. This flexibility allows for the customization of LLM features to suit specific educational needs, such as automating feedback, analyzing student performance data, or generating personalized learning materials tailored to individual progress. APIs also enhance efficiency by automating repetitive tasks, reducing the need for manual interaction with web interfaces, and enabling large-scale operations. However, while APIs provide powerful tools for enhancing educational workflows, they require basic programming knowledge for implementation and careful management of API keys to ensure security and prevent unauthorized access. Additionally, API usage may incur costs, which could pose a limitation for projects with tight budgets.
API implementations are available in many programming languages, with Python being the most common. In R, the main language used in this book, the elmer package [42] provides wrappers to interact with the APIs of the most common LLMs. In Chapter [43] of this book, we provide a tutorial on how to use LLMs via API to obtain personalized feedback.
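As a brief illustration, the sketch below shows what such an API call can look like with elmer. The function names (chat_openai(), the $chat() method) and arguments reflect recent versions of the package and may differ in yours; a valid OpenAI API key must also be available, e.g., in the OPENAI_API_KEY environment variable.

```r
# Minimal sketch of calling an LLM from R via the elmer package.
# Function names and arguments reflect recent package versions and
# may differ; an OpenAI API key is assumed to be set in the
# OPENAI_API_KEY environment variable.
library(elmer)

chat <- chat_openai(
  system_prompt = "You are a helpful teaching assistant.",
  model = "gpt-4o"
)
feedback <- chat$chat(
  "Give brief, encouraging feedback on this student answer:
   'Photosynthesis turns sunlight into food for the plant.'"
)
feedback
```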
While APIs like OpenAI’s are excellent for quick, scalable interactions with powerful pre-trained language models, they operate as black boxes, in which users rely on the provider to host and manage the models. This simplicity is a significant strength, but it may not meet the needs of users seeking greater control for performing context-specific tasks. This is where frameworks like Hugging Face’s Transformers come in. While Transformers is not the only option within the Hugging Face ecosystem, it is among the most powerful and widely used tools for leveraging state-of-the-art language models. Hugging Face provides an open-source library, with access to over one million models (as of December 2024), that allows users to download and run these models locally or on cloud infrastructure they control. A comprehensive list of pre-trained models is available on the Hugging Face Models Hub.
In the context of education and learning analytics, transformers are capable of making sense of complex data, such as analyzing student feedback, extracting themes from surveys, or identifying trends in course engagement. Moreover, users can further customize these models and fine-tune them to address their specific educational needs. By offering flexibility, greater control, and adaptability, transformers expand the potential of LLMs beyond the simplicity of many API-based interactions.
Consider a scenario where an educator needs to analyse student feedback to understand sentiments and identify areas for improvement. A pre-trained sentiment analysis model from Hugging Face can quickly classify feedback as positive, neutral, or negative, offering actionable insights for educators. For example, one can use a model such as bert-base-multilingual-uncased, which has been fine-tuned specifically for sentiment analysis of product reviews in six languages. It is designed for immediate use in analyzing sentiment across multilingual product reviews and can also serve as a robust starting point for further fine-tuning on related sentiment analysis tasks, such as analyzing students’ course feedback or collaborative discourse. A few LLMs that have been specifically trained for educational purposes exist, such as EduBERT [44] or K-12BERT [45]. In Chapter 10 [46] of this book, we provide a tutorial on how to use language models locally to automatically classify students’ discourse in collaborative problem-solving.
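As a minimal sketch of this workflow, the code below runs a pre-trained sentiment pipeline from R via the reticulate package. It assumes a Python installation with the transformers library, and it uses the model identifier nlptown/bert-base-multilingual-uncased-sentiment, which appears to match the multilingual review-sentiment model described above; treat the identifier as an assumption and substitute your own model as needed.

```r
# Sketch of running a Hugging Face sentiment model locally from R via
# reticulate. Assumes Python with the transformers library installed.
# The model id below is an assumption matching the multilingual
# review-sentiment model described in the text.
library(reticulate)

transformers <- import("transformers")
classifier <- transformers$pipeline(
  "sentiment-analysis",
  model = "nlptown/bert-base-multilingual-uncased-sentiment"
)

feedback <- c(
  "The lectures were engaging and well paced.",
  "I struggled to follow the assignments."
)
classifier(feedback)  # returns a star-rating label and score per comment
```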
3 Conclusion
This chapter provided an introduction to LLMs, exploring their foundational components (transformer architecture, pre-training, and generative abilities) while demonstrating their transformative potential across applications in education. Building on this, the chapter also explored tools for interacting with LLMs, from beginner-friendly web interfaces to more advanced options such as the OpenAI API and Hugging Face’s Transformers framework. These approaches enable users to operationalize LLMs for tasks such as summarizing academic papers, automatically classifying text, and generating learning materials.
However, as the integration of LLMs in education becomes increasingly widespread, it is crucial to critically examine their limitations. The potential for biases in their outputs, issues of factual inaccuracies, and the challenges of ensuring transparency and interpretability must be addressed to harness their capabilities effectively. A careful balance between taking advantage of LLMs’ strengths and mitigating their shortcomings is necessary to ensure they serve as tools that improve, rather than hinder, educational and research practices [47].
An important avenue for addressing these challenges is the integration of explainable AI (XAI) techniques with LLMs. XAI methods aim to make the predictions and operations of LLMs more transparent [48, 49], helping users understand the factors driving their outputs. In educational contexts, for example, XAI can clarify why specific feedback was generated for a student or reveal how particular patterns in data influence predictions, such as grading or performance analytics. Transparency is needed to build trust by enabling users to identify potential biases or inaccuracies in model outputs. As the adoption of LLMs continues, embedding XAI into their workflows will be critical for ensuring ethical and equitable outcomes.