8 An Introduction to Large Language Models in Education
1 Introduction
Large Language Models (LLMs) are advanced AI systems primarily designed to process and generate human-like text, but their capabilities extend far beyond natural language tasks, enabling transformative applications across diverse domains, including education, research, software development, and more [1, 2]. Taking advantage of massive datasets and neural network architectures, such as the transformer [2], LLMs can analyse context, predict sequential elements (e.g., the next word or token), and produce coherent, contextually appropriate outputs.
While text generation and understanding are at the core of LLMs, their versatility stems from their ability to generalize patterns in data. This enables them to tackle tasks like code generation [3], and even multimodal applications that integrate text with other data types, such as images and videos [4]. For instance, specialized models like Claude Sonnet excel in programming tasks, while multimodal extensions of GPT-4 demonstrate the ability to describe images and interpret visual data alongside textual inputs.
LLMs are opening new possibilities in education by transforming how students learn and how educators teach and research [5]. These models can dynamically generate test questions tailored to diverse learning levels [6], translate educational materials to improve accessibility [7], summarize complex concepts for clearer understanding [8], and simulate conversational practice to enhance language skills [9]. The personalization of learning experiences through LLMs can support students in mastering content at their own pace. In research, they enable the creation of synthetic datasets [10], simulate experimental conditions for pedagogical studies [11], and analyse vast corpora of text and structured data to uncover actionable insights.
The scalability of LLMs, exemplified by models like GPT-3 with 175 billion parameters (i.e., the weights and biases of the different layers), has accelerated their adoption across fields. However, they are not without limitations. Issues such as biased, inaccurate, or fabricated outputs highlight the need for critical human oversight [12]. A human-in-the-loop approach is crucial to ensuring that LLMs align with educational goals and ethical standards, serving as tools to enhance learning rather than replace human expertise [13, 14]. Despite these challenges, LLMs are transforming how information is delivered, processed, and utilized. Their ability to integrate language understanding with broader data-processing capabilities positions them as invaluable tools for solving complex problems and enhancing accessibility across disciplines. The following sections explore the transformative applications of LLMs in education research and practice.
2 Behind the scenes of LLMs
To better understand how large language models function, it is helpful to break down their core components and processes into three interrelated aspects based on the acronym “GPT”: the Transformer architecture (T), which underpins their computational ability; the Pre-training phase (P), where the model learns patterns and knowledge from vast datasets; and the Generative abilities (G), which showcase their practical applications. The sequence T → P → G provides a logical progression from the foundational mechanics of LLMs to their training processes and, finally, to their real-world outputs. This structure offers a holistic view of these transformative technologies: first understanding the architecture that powers LLMs, then exploring how they are trained to predict and fine-tune their outputs, and ultimately seeing what they can generate. In the following sections, we explore each of these phases in detail, shedding light on their individual contributions to the overall capabilities of LLMs.
2.1 How LLMs Work: The “Transformer” Architecture
The “Transformer” in GPT refers to the architectural framework that powers modern LLMs, enabling their remarkable fluency and adaptability. This architecture revolutionized natural language processing by introducing the attention mechanism [2], which allows the model to understand relationships within a sequence and process context effectively. Without the “T,” LLMs could not achieve the coherence and precision that define their generative abilities.
At its core, an LLM is a sophisticated function designed to predict the next element in a sequence. For example, given the input “The sun is shining”, the model might predict “brightly” or “today”. While this prediction task may sound straightforward, an LLM is not a simple function like y = 2x + 1, whose only parameters are the coefficient 2 and the intercept 1, and for which an input of x = 1 gives an output of y = 3. The underlying computations in LLMs are vastly more intricate. Instead of a handful of parameters, LLMs employ billions of them within multi-layered neural networks. Each layer refines the model’s understanding of the input, performing complex transformations based on the outputs of previous layers.
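The contrast can be made concrete with a toy example in R: unlike y = 2x + 1, a language model maps an input context to a probability distribution over a vocabulary. The words and probabilities below are invented purely for illustration; a real LLM computes such distributions with billions of parameters.

```r
# A toy "language model": it maps an input context to a probability
# distribution over a small vocabulary and we pick the most likely
# next word. The probabilities are made up for illustration; a real
# LLM computes them with billions of learned parameters.
predict_next_word <- function(context) {
  vocab <- c("brightly", "today", "cloudy", "the")
  probs <- c(0.45, 0.30, 0.15, 0.10)  # hypothetical learned probabilities
  names(probs) <- vocab
  probs
}

probs <- predict_next_word("The sun is shining")
names(which.max(probs))  # "brightly"
```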
Prior to the attention mechanism, models like recurrent neural networks (RNNs) [15] and convolutional neural networks (CNNs) [16] struggled to effectively process relationships in long or complex sequences, often treating input elements in isolation or with limited context [17]. The attention mechanism transformed this by allowing the model to weigh the importance of different parts of the input sequence based on their relevance to the task [2]. For example, consider the sentences “There’s a bat in the tree” and “He swings the bat”. In the first, “bat” refers to an animal, while in the second, it refers to a sports tool. The attention mechanism enables the model to focus on surrounding words like “tree” or “swings” to deduce the correct meaning. This ability to dynamically assign importance to specific words based on context is the key to LLMs’ accuracy and coherence.
The attention mechanism calculates attention weights, assigning varying levels of importance to each word based on its relevance to the task at hand. For instance, in the sentence “The sun is shining”, the word “sun” is given more weight than less significant words like “the”. These weighted representations of words are passed through multiple layers of the neural network, with each layer building on the contextual understanding established by the previous one. Early layers might identify simple word relationships, like “sun” being associated with “shining”, while later layers incorporate broader knowledge, such as recognizing weather-related descriptions that often lead to words like “today” or “brightly”.
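To make this concrete, here is a minimal sketch of computing attention weights for the toy sentence, following the scaled dot-product formulation of [2]. The two-dimensional word vectors are invented for illustration; real models learn high-dimensional query and key vectors.

```r
# Minimal sketch of scaled dot-product attention weights for the toy
# sentence "The sun is shining". The 2-dimensional vectors below are
# invented for illustration; real models learn query/key vectors with
# hundreds or thousands of dimensions.
softmax <- function(x) exp(x) / sum(exp(x))

words <- c("The", "sun", "is", "shining")
# Each row is a (made-up) key vector for one word
K <- matrix(c(0.1, 0.0,    # The
              0.9, 0.8,    # sun
              0.2, 0.1,    # is
              0.8, 0.9),   # shining
            nrow = 4, byrow = TRUE)
q <- c(0.7, 0.9)  # query vector for the position being predicted

# Attention weights: softmax of scaled dot products between query and keys
scores  <- as.vector(K %*% q) / sqrt(length(q))
weights <- softmax(scores)
names(weights) <- words
round(weights, 2)  # "sun" and "shining" receive the most weight
```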
In summary, the Transformer architecture is highly effective because it dynamically focuses attention on the most relevant parts of an input sequence at any given moment. This mechanism mirrors human behavior, where we concentrate on the information most pertinent to the task at hand while filtering out distractions. This synergy between attention and neural network computation forms the foundation of the architecture, making it the core of modern generative AI.
2.2 Why LLMs Work: The Power of Being “Pre-trained”
Previously, we discussed that LLMs are predictive models with billions of parameters, enabling them to perform complex calculations. But how are these parameters determined? That’s where the “P” in GPT, Pre-training, comes in.
In any prediction task, the goal is to learn from existing data to make accurate predictions for new inputs. For example, imagine you’re predicting the next number in a sequence like 2, 4, 6, 8. Based on the pattern, you might predict 10 as the next number. Similarly, LLMs predict the next word in a sentence based on patterns they’ve learned from vast amounts of text. For instance, given the sentence “The sun is shining”, the model might predict “brightly” as the next word. Before training begins, the structure of the model must be defined. In our example, this would mean deciding the type of rule used to predict the sequence, like identifying it as “add 2 to the previous number”. For LLMs, this involves defining the number of layers in the model, the capacity of the attention mechanism to focus on different parts of the input, and the dimensions of its internal representations and computations. These decisions shape how well the model can understand and generate language.
Once the model structure is determined, the training process begins. In the numerical example, if we assume an arithmetic sequence of the form x_t = a * x_(t-1) + b, the process involves identifying the parameters a = 1 and b = 2, which define the rule for predicting the next number in the sequence. For LLMs, determining parameter values is far more complex, involving billions of parameters and advanced optimisation techniques. The training process is generally divided into two main stages: pre-training and fine-tuning, with reinforcement learning with human feedback (RLHF) often added for further refinement. Let us explore how these processes work in more detail in the following sections.
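Before moving on, the following sketch recovers a and b from the toy sequence using R’s built-in linear regression. This is the same estimation idea that underlies LLM training, reduced from billions of parameters to two.

```r
# Recovering the parameters of the arithmetic-sequence rule
# x_t = a * x_(t-1) + b from the observed sequence 2, 4, 6, 8.
x <- c(2, 4, 6, 8)
previous <- x[-length(x)]  # x_(t-1): 2, 4, 6
current  <- x[-1]          # x_t:     4, 6, 8

fit <- lm(current ~ previous)
coef(fit)  # intercept b = 2, slope a = 1
```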
2.2.1 Pre-training and Fine-Tuning
Pre-training is the initial phase [18] where the model is exposed to a vast corpus of text data, such as books, articles, and websites. The size and diversity of this dataset are crucial for the model to learn general patterns of language, including grammar, semantics, and common word associations. Importantly, this process is self-supervised [19], meaning the model learns from the structure of the data itself without requiring manually labelled examples. For instance, given the input “The sun is shining”, the model predicts the next word based on patterns it has seen in similar contexts during pre-training. In this case, it might predict words like “brightly” or “today” depending on the associations it has learned from the training data.
On the technical side, the accuracy of these predictions is measured using a loss function, which quantifies how far the predicted word is from the actual next word in the sequence. For example, if the model predicts “cloudy” instead of “brightly,” the loss function assigns a higher value, indicating a larger error. These errors are minimized through a process called backpropagation, which calculates how each parameter in the model contributes to the error. Optimisation algorithms then adjust the parameters to reduce the loss, gradually improving the model’s ability to make accurate predictions.
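A minimal sketch of this idea, using the toy sequence again: we define a loss (here mean squared error, rather than the cross-entropy loss LLMs use over words), compute its gradients by hand (the role backpropagation plays in deep networks), and repeatedly nudge the parameters to reduce the loss.

```r
# Sketch of loss minimization by gradient descent on the toy sequence:
# the two-parameter analogue of how backpropagation and an optimiser
# adjust an LLM's billions of parameters to reduce prediction error.
previous <- c(2, 4, 6)
current  <- c(4, 6, 8)

a <- 0; b <- 0        # arbitrary starting values
lr <- 0.01            # learning rate
for (step in 1:5000) {
  pred  <- a * previous + b
  error <- pred - current               # how far off each prediction is
  # Gradients of the mean squared error with respect to a and b
  grad_a <- mean(2 * error * previous)
  grad_b <- mean(2 * error)
  a <- a - lr * grad_a                  # update parameters to reduce loss
  b <- b - lr * grad_b
}
c(a = a, b = b)  # converges towards a = 1, b = 2
```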
After pre-training, the model undergoes fine-tuning on smaller, domain-specific datasets. For example, if the model is to generate weather reports, it might be fine-tuned on a specialized corpus of meteorological data. This step builds on the general language patterns learned during pre-training, refining the model to perform specialized tasks with high accuracy. Returning to the earlier example, fine-tuning might teach the model to complete “The sun is shining” with specific terms like “in the afternoon” or “through the clouds”, based on its specialized training data.
2.2.2 Reinforcement Learning with Human Feedback (RLHF)
To further enhance the model’s alignment with human expectations, a process called reinforcement learning with human feedback (RLHF) [20] is often applied. This technique fine-tunes the model beyond its technical accuracy, ensuring it generates outputs that are clear, relevant, and aligned with human preferences. The approach was popularized by the work of Christiano et al. (2017), which demonstrated how human preferences could be used to guide models in learning complex tasks where clear evaluation criteria are difficult to define.
In RLHF, human evaluators review the model’s outputs and rank them based on criteria like clarity, appropriateness, and usefulness. For instance, if the model is tasked with completing the phrase “The sun is shining”, it might produce options like “brightly”, “on the horizon”, or “through the clouds”. Evaluators would rank these outputs, perhaps preferring “brightly” for its clarity and generality over “on the horizon”, which might seem overly specific. These rankings are then used to train a reward model, which predicts scores for outputs based on their alignment with human preferences.
During the reinforcement learning phase, the LLM generates new predictions, and the reward model assigns scores to these predictions. Using reinforcement learning algorithms, such as Proximal Policy Optimisation (PPO) [21], the LLM updates its parameters to maximize the reward from the reward model. During this process, desirable outputs are assigned higher scores, while less preferred options are penalized. Through this iterative process, the model improves its ability to align with human-provided feedback, producing outputs that better meet expectations for clarity, relevance, and user-friendliness.
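As an illustration of the reward-modelling step, the sketch below computes the pairwise preference loss commonly used to train RLHF reward models (a Bradley–Terry formulation): the loss is small when the completion humans preferred receives a higher score than the rejected one. The reward values are hypothetical.

```r
# Sketch of the pairwise preference loss used to train RLHF reward
# models: the model is trained so completions humans preferred score
# higher than rejected ones. The scores below are hypothetical.
sigmoid <- function(x) 1 / (1 + exp(-x))

reward_preferred <- 1.8   # e.g., score for "brightly"
reward_rejected  <- 0.6   # e.g., score for "on the horizon"

# The loss shrinks as the gap between the two scores grows
loss <- -log(sigmoid(reward_preferred - reward_rejected))
loss  # ~0.26; a larger gap in the preferred direction gives a smaller loss
```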
2.3 What LLMs Do: The “Generative” Core of GPT
At the heart of large language models lies their generative ability, the “G” in GPT, which enables them to produce coherent, contextually relevant, and often human-like text. This generative capability transforms LLMs from passive tools into active collaborators across a wide range of applications, from creative writing and summarizing complex documents to coding and conversational AI.
The generative process begins with a prompt, which serves as the input to the model. For instance, given the phrase “The sun is shining”, the model predicts likely continuations one word after another based on patterns it has learned during training. It might generate “brightly through the clouds” or “today after a long storm,” depending on its understanding of the context and relationships within the data it was trained on. This prediction is not random but calculated from probabilities assigned to thousands of potential next words, allowing the model to choose the most appropriate continuation.
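A small sketch of this final step: the model’s raw scores (logits) over candidate next words are turned into probabilities with a softmax, and one word is sampled, with a temperature parameter controlling how adventurous the choice is. The vocabulary and scores here are invented for illustration.

```r
# Sketch of sampling the next token from a probability distribution,
# with a temperature parameter controlling randomness. The raw model
# scores (logits) are invented for illustration.
softmax <- function(x) exp(x) / sum(exp(x))

vocab  <- c("brightly", "today", "warmly", "again")
logits <- c(2.0, 1.5, 0.4, 0.1)  # hypothetical model scores

temperature <- 0.8                       # < 1 sharpens, > 1 flattens
probs <- softmax(logits / temperature)
set.seed(42)
sample(vocab, size = 1, prob = probs)    # likely "brightly" or "today"
```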
LLMs excel in generating text that reflects the tone, style, and intent of the input. For example, when prompted with “Write a formal letter about sunny weather”, the model might begin: “Dear Sir/Madam, I am writing to express my appreciation for the delightful sunny weather we have been experiencing”. Conversely, a casual prompt like “Tell me something cool about sunny days” could result in: “Did you know sunshine boosts your serotonin levels, making you feel happier?”. These examples demonstrate the model’s ability to adapt to the user’s intent and produce contextually appropriate outputs.
Beyond generating text, LLMs can perform tasks such as aiding in language learning [9], creating summaries [8], and writing code [3]. This flexibility stems from their pre-training on diverse datasets that include examples across various domains, enabling them to handle a broad range of tasks. Despite its versatility, the generative process is not without challenges. LLMs may occasionally produce outputs that are factually inaccurate, contextually inappropriate, or overly verbose—a phenomenon known as “hallucination”. This highlights the importance of human oversight to ensure the quality and reliability of their outputs. Table 1 summarizes the main uses of LLMs and how they can be applied to different tasks in education and learning analytics.
Task | Description | Example Models | How It Works Technically | Application in Education |
---|---|---|---|---|
Text Completion | Predicting and generating text to complete a prompt. | GPT-4, GPT-o1, Claude, LLaMA | Utilizes autoregressive transformers to predict the next token based on preceding tokens, employing learned probability distributions. | Auto-generating personalized feedback for students [22]. |
Text Generation (Creative Writing) | Generating creative and coherent long-form text. | GPT-4, GPT-o1, Claude, LLaMA | Uses autoregressive token prediction to produce semantically coherent, fluent, and diverse text. | Supporting creative writing assignments, generating essay ideas, or storytelling exercises for students [23]. |
Question Answering | Providing answers to factual or contextual questions. | GPT-4, Claude, Gemini, RoBERTa | Employs attention mechanisms to focus on relevant parts of the input context, generating precise answers. In extractive QA, identifies and extracts specific text spans. | Automated Q&A systems for course material, providing quick answers to student queries [24]. |
Question Generation | Generating questions based on input content. | GPT-4, T5, BERT-QG | Sequence-to-sequence models generate question tokens conditioned on input context. | Auto-generating quiz questions or comprehension tests from textbook material or lecture notes [25]. |
Text Summarization | Generating concise summaries from longer texts. | GPT-4, Gemini, Claude, BART, T5 | Applies sequence-to-sequence transformer architectures to condense input text into coherent summaries, preserving essential information. | Summarizing lecture notes, research papers, or learning materials for quicker understanding [26]. |
Translation | Translating text between languages. | GPT-4, Gemini, mT5, LLaMA | Uses encoder-decoder transformer models to map input tokens from the source language to the target language, aligning syntactic and semantic structures. | Supporting multilingual learners by translating course materials, assignments, or instructions [27]. |
Text Classification | Assigning labels or categories to text. | BERT, RoBERTa, DistilBERT | Tokenizes input text and processes it through transformer layers to produce embeddings, which are then classified into categories using a task-specific head. | Analyzing student responses for sentiment (e.g., identifying frustration) [28], or automatic discourse coding [29]. |
Chat and Dialogue | Generating conversational responses for chatbots. | Claude, GPT-4, Gemini, LLaMA | Maintains dialogue context by incorporating previous conversation turns, generating contextually relevant responses through autoregressive token prediction. | Powering tutoring chatbots that assist students with real-time explanations and guidance [30]. |
Code Generation | Writing or completing code snippets. | GPT-o1, Codex, Gemini | Predicts subsequent code tokens based on prior training on large corpora of code, ensuring syntactic and semantic correctness. | Assisting students in programming courses by auto-generating or debugging code snippets [31]. |
Paraphrasing and Rewriting | Generating rephrased or simplified versions of input text. | GPT-4, Claude, T5 | Fine-tuned models rephrase input text while preserving semantic meaning, often using reinforcement learning. | Helping students rephrase ideas to avoid plagiarism, simplifying content, or generating alternative explanations of concepts [32, 33]. |
Style Transfer | Rewriting text with a different style or tone. | GPT-4, CTRL, Claude, LLaMA | Generates text conditioned on desired stylistic attributes, often fine-tuned on datasets exemplifying target styles to learn appropriate transformations. | Simplifying advanced academic content for younger students or adapting tone for academic writing [34]. |
Knowledge Extraction | Extracting entities or relationships from text. | GPT-4, BERT, Gemini | Utilizes attention layers to identify and classify entities and their relationships within text, enabling structured information extraction. | Extracting key concepts, relationships, or named entities from lecture transcripts and research articles. |
Task Automation | Performing tasks with minimal or no labeled examples. | Claude, GPT-4, GPT-o1, BART | Takes advantage of in-context learning by interpreting task instructions and examples provided within the input, adapting to new tasks without explicit retraining. | Automating grading of open-ended responses with minimal training or applying rubrics [35], performing tedious technical tasks such as converting a screenshot of a table into LaTeX code. |
Data-to-Text Generation | Generating natural language summaries or explanations from structured data. | GPT-4, T5 | Encoder-decoder architectures map structured inputs (e.g., tables) into coherent text. | Generating textual explanations of student performance data, or dashboards [36]. |
Multimodal Integration | Combining text with images, audio, or other modalities. | Gemini, GPT-4V, CLIP | Aligns representations from different modalities using combined encoder architectures, facilitating tasks that require understanding across multiple data types (image, audio, video, etc.). | Analyzing student engagement via multimodal data (e.g., combining video, text, and audio) [37]. |
Text-to-Speech | Converting written text into spoken language. | Tacotron, Gemini, WaveNet, VALL-E | Deep learning architectures generate natural-sounding speech from tokenized text inputs. | Assisting visually impaired students, enabling spoken lecture notes, teaching young children who cannot read yet, or improving pronunciation for language learners [38, 39]. |
Text-to-Video | Converting written text into video. | Sora, Runway Gen 2 | Multimodal transformer architectures process text as input and generate video sequences using latent diffusion or frame interpolation. | Creating engaging educational videos from course materials [40]. |
Reasoning & Problem Solving | Solving logical, mathematical, or structured reasoning tasks. | Claude, GPT-4, GPT-o1, Gemini | Employs chain-of-thought prompting and step-by-step token generation to tackle complex reasoning tasks, enhancing problem-solving capabilities. | Assisting students with step-by-step solutions in math or logical reasoning exercises, enhancing comprehension [41]. |
2.4 Using LLMs: From Web Interfaces to Advanced Frameworks
After understanding how LLMs work and what they can be used for, the next step is exploring how to use them in practical scenarios. Whether you are a beginner looking to interact with LLMs through simple web interfaces or an advanced user building custom applications using APIs and frameworks, there are many ways to harness their capabilities.
For beginners, web interfaces like ChatGPT, Claude (by Anthropic), and Perplexity AI offer user-friendly ways to explore LLM capabilities without requiring technical expertise. These platforms can assist with summarizing academic papers, generating insights from student feedback, writing or assessing code, brainstorming interventions for struggling learners, and more. Many of these tools also support uploading files, such as PDFs, or accessing content from open-access URLs, making them versatile for analyzing publicly available academic content.
Such tools demonstrate the transformative power of LLMs across diverse domains, enabling students to engage more deeply with complex concepts while allowing educators to focus on higher-level instruction and personalized support. Whether applied to programming, language learning, or data analysis, these tools foster iterative learning: if the initial output does not fully meet the user’s needs, prompts and inputs can be refined to generate more tailored and meaningful results. This interaction not only enhances understanding but also encourages critical thinking and adaptability in both learners and educators.
The second way to interact with large language models is through an Application Programming Interface (API). An API acts as a bridge, allowing software applications to communicate with the LLM programmatically. This makes APIs essential for developers seeking to integrate LLMs into custom applications, websites, or tools. Using APIs offers several significant advantages, particularly in education and learning analytics. One of the key benefits is integration, as APIs can seamlessly connect LLM functionalities to various platforms, enabling educators and developers to embed advanced capabilities into their own applications or systems. This flexibility allows for the customization of LLM features to suit specific educational needs, such as automating feedback, analyzing student performance data, or generating personalized learning materials tailored to individual progress. APIs also enhance efficiency by automating repetitive tasks, reducing the need for manual interaction with web interfaces, and enabling large-scale operations. However, while APIs provide powerful tools for enhancing educational workflows, they require basic programming knowledge for implementation and careful management of API keys to ensure security and prevent unauthorized access. Additionally, API usage may incur costs, which could pose a limitation for projects with tight budgets.
API implementations are available in many programming languages, with Python being the most common. In R, the main language used in this book, the elmer package [42] provides wrappers to interact with the APIs of the most common LLMs. In Chapter [43] of this book, we provide a tutorial on how to use LLMs via API to obtain personalized feedback.
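As a brief illustration, the sketch below shows what such an API call can look like with elmer. The function names (chat_openai(), the $chat() method) and arguments reflect recent versions of the package and may differ in yours; a valid OpenAI API key must also be available, e.g., in the OPENAI_API_KEY environment variable.

```r
# Minimal sketch of calling an LLM from R via the elmer package.
# Function names and arguments reflect recent package versions and
# may differ; an OpenAI API key is assumed to be set in the
# OPENAI_API_KEY environment variable.
library(elmer)

chat <- chat_openai(
  system_prompt = "You are a helpful teaching assistant.",
  model = "gpt-4o"
)
feedback <- chat$chat(
  "Give brief, encouraging feedback on this student answer:
   'Photosynthesis turns sunlight into food for the plant.'"
)
feedback
```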
While APIs like OpenAI’s are excellent for quick, scalable interactions with powerful pre-trained language models, they operate as black boxes, in which users rely on the provider to host and manage the models. This simplicity is a significant strength, but it may not meet the needs of users seeking greater control for performing context-specific tasks. This is where frameworks like Hugging Face’s Transformers come in. While Transformers is not the only option within the Hugging Face ecosystem, it is among the most powerful and widely used tools for leveraging state-of-the-art language models. Hugging Face provides an open-source library, with access to over one million models (as of December 2024), that allows users to download and run these models locally or on cloud infrastructure they control. A comprehensive list of pre-trained models is available on the Hugging Face Models Hub.
In the context of education and learning analytics, transformers are capable of making sense of complex data, such as analyzing student feedback, extracting themes from surveys, or identifying trends in course engagement. Moreover, users can further customize these models and fine-tune them to address their specific educational needs. By offering flexibility, greater control, and adaptability, transformers expand the potential of LLMs beyond the simplicity of many API-based interactions.
Consider a scenario where an educator needs to analyse student feedback to understand sentiments and identify areas for improvement. A pre-trained sentiment analysis model from Hugging Face can quickly classify feedback as positive, neutral, or negative, offering actionable insights for educators. For example, one can use a model such as bert-base-multilingual-uncased, which has been fine-tuned specifically for sentiment analysis of product reviews in six languages. It is designed for immediate use in analyzing sentiment across multilingual product reviews and can also serve as a robust starting point for further fine-tuning on related sentiment analysis tasks, such as analyzing students’ course feedback or collaborative discourse. A few LLMs that have been specifically trained for educational purposes exist, such as EduBERT [44] or K-12BERT [45]. In Chapter 10 [46] of this book, we provide a tutorial on how to use language models locally to automatically classify students’ discourse in collaborative problem-solving.
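As a minimal sketch of this workflow, the code below runs a pre-trained sentiment pipeline from R via the reticulate package. It assumes a Python installation with the transformers library, and it uses the model identifier nlptown/bert-base-multilingual-uncased-sentiment, which appears to match the multilingual review-sentiment model described above; treat the identifier as an assumption and substitute your own model as needed.

```r
# Sketch of running a Hugging Face sentiment model locally from R via
# reticulate. Assumes Python with the transformers library installed.
# The model id below is an assumption matching the multilingual
# review-sentiment model described in the text.
library(reticulate)

transformers <- import("transformers")
classifier <- transformers$pipeline(
  "sentiment-analysis",
  model = "nlptown/bert-base-multilingual-uncased-sentiment"
)

feedback <- c(
  "The lectures were engaging and well paced.",
  "I struggled to follow the assignments."
)
classifier(feedback)  # returns a star-rating label and score per comment
```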
3 Conclusion
This chapter provided an introduction to LLMs, exploring their foundational components (transformer architecture, pre-training, and generative abilities) while demonstrating their transformative potential across applications in education. Building on this, the chapter also explored tools for interacting with LLMs, from beginner-friendly web interfaces to more advanced options such as the OpenAI API and Hugging Face’s Transformers framework. These approaches enable users to operationalize LLMs for tasks such as summarizing academic papers, automatically classifying text, and generating learning materials.
However, as the integration of LLMs in education becomes increasingly widespread, it is crucial to critically examine their limitations. The potential for biases in their outputs, issues of factual inaccuracies, and the challenges of ensuring transparency and interpretability must be addressed to harness their capabilities effectively. A careful balance between taking advantage of LLMs’ strengths and mitigating their shortcomings is necessary to ensure they serve as tools that improve, rather than hinder, educational and research practices [47].
An important avenue for addressing these challenges is the integration of explainable AI (XAI) techniques with LLMs. XAI methods aim to make the predictions and operations of LLMs more transparent [48, 49], helping users understand the factors driving their outputs. In educational contexts, for example, XAI can clarify why specific feedback was generated for a student or reveal how particular patterns in data influence predictions, such as grading or performance analytics. Transparency is needed to build trust by enabling users to identify potential biases or inaccuracies in model outputs. As the adoption of LLMs continues, embedding XAI into their workflows will be critical for ensuring ethical and equitable outcomes.